Archive for the 'NLP' Category
3 hour hack at Search result summarization…

One of my friends Anand Kishore along with some of his friends at Yahoo built a nice Text summarising app “Dygest” using the Search Monkey and some other SDKs. Their nice achievement is they say they were able to write a statistical text summarising algorithm. Now I don’t have the details of the algorithm with me, but what I could see from the summary was that it was HELPFUL. So first up Kudos to you guys, Good Job. Next I was kind of intrigued by the thought how they must have done it, and that set my brain rolling and I started reading what they must have done.

Few Observations I made about their results,

1. Very well presented. :-) , I am not good at web design and stuff so I really admire that.

2. Their results (sentences) were just too well formed to be machine generated. So, I was like where did they come from?

3. First hunch was, may be they just put out most interesting sentences.Which turned out to be partly true, their sentences are infact “picked” as is from the source text. But they put entire sentence so that it reads well, and also they probably put more than one sentence in order to make it sound coherent. They might also be doing some grammar analysis before putting the sentences together.

So these things were going in my head and I was like what would it take to pick up meaningful sentence form the text, the simplest thing. Then I set out to write my own code to do the same. I pulled from net a python script that could extract text from a url (it is not so powerful, but works). Then I took two web pages returned by Dygest for search term “Stimulus” and converted it to text using this python script. I wrote a small perl script to clean the text and build a matrix of the form “paragraph X words” and then scored them based on the number of meaningful words contributed by the paragraph. Here meaningful words are the words that are left after striping out the stop words (put together by searching some stop list online). The paragraph with maximum score is selected as the representative for the article.

I know this is really naive method of doing things, but I wanted to validate the thought I had in mind about the ability of this idea. And it turns out it stands validated. I don’t have results with me on this machine right now to put up here but will do that once I go home. Also, I was surprised as it did select some of the really good paragraphs as a answer.

There are a few variations I would have liked to try but did not, here are those.

1. Use a better importance measure.

2. Add more granularity to the text selected. I could easily go to sentence level and then show the top 3 sentences as answers.

3. Use second order context similarity for the search term and the paragraph selected. This would be really interesting but is a lot more involved and I did not have enough time.

All in all, it was fun app that it turned out. I will be uploading it soon here so keep watching this post or email me if you are in a real hurry. But again I don’t claim that the code is a top class code and is the best way to go. It is one of the way to go though. I also want to thank Andy (Anand Kishore) who’s post (Dygest) I mentioned before, inspired this act of mine.

Composing a mail or simply put, chatting in Devnagari. Can it be a reality sometime?

I have been away from home for almost two years now. I talk and chat frequently with my mom and dad back home. My mom is more comfortable using my mother tounge Marathi to communicate with me and has to fight with the Roman Script keyboard that comes fitted to the computer back home. What is more annoying is despite of the fact that I am a computer science Masters Stuudent she still has to struggle, can’t I just write a small program that will encode the Roman key sequence and change it to the resective Devnagari Character? Well yes I can, but does she use the same key sequence every time to mean the same letter? Is she consistant with her spelling of certain phonemes? No she is not. Now should I make her mug up (Indan term for memorize) all the key sequences just coz she has to write to me in Devanagari script? Is it really worth the pain? (Well for those emotional types it might be, but my mom would say look kid, you want me to read this and mug it up? forget it! I will call you).  To all those of you who are by now eager to point me to Google’s Indic Transliteration web interface, my point is my mom is not great with computer and although she would be keen on doing it, it would be a bigger pain for me to explain her how to really do it. So forget it.

What do I do? Leave her to call me and not pick the call coz I am either in school or playing Cricket or cooking or out with friends? No I chose to write the code which will adapt to her key sequences probabilistically and then suggest some output of the word to select. I know this is really a long term project and needs a lot of dedication. Below is my plan of action.

1. Identify the phonemes that are formed with the script consonants + vowels and other combinations.

2. Build a probabilistic model for phoneme preceedence using the two Marathi corpus available form CFILT-IIT-B.

3. draw association between Roman key strokes and Devenagari phonemes.

4. Write up a predction algorithm that that based on the previous key strokes and the probabilistic models predict the nextt phoneme in the word.

I am through with step one. It has a small python file as code. I would love to write a C++ code but I am trying to write like a small app that can be used in a web browser hence I am keeping it in something like Python. But again I don’t know if Python will be useful. So for now I am designing and developing the algo and a sample aap as my weekend project. Lets see where we go. Feedback welcome!

POS count and Poison Distribution…

Today in the Probability models class we learned about Poison Process and its use for counting instances in a particular period / time. It says that if a process follows a poison distribution then for any two time periods of the same length, the number of instances or occurrences of the event are same. This kind of triggered a small question in my mind, do the Parts of Speech used in sentences also follow this distribution? If we count number of verbs that occur in say a fixed word window, then *if* they follow Poison distribution then will they be the same? Can this be modelled over Poison distribution?

Now, I am yet to learn the properties of a Poison process and that may help clear this thought. But this is something nice to think about on the weekend. (As if I have nothing to think about ;-) …)

Well I talked with my statistics professor and he said we will discuss it on Monday. So I am looking forward to that.

Where the heck I have been???

For those of you who blame me for not staying in touch for past few weeks, sorry! I have been doing the below mentioned activities and have been out of touch with the humanity.
1. Working on my research on “Name Discrimination” for the Web People Search (WePS)  WePS task corpus.

2. Designing and developing a small POC for clustered processing of above data.

3. For the task 2 designing a small 3 machine cluster of Ubuntu virtual machines on my desktop using VMWare.

Now, that is something that has taken most of my spring break and will probably take some more time to come to reality. Below is my progress on each of the item.

1. WePS task: – I have nearly completed the program to covert the data from the corpus file structure (XML and HTMLs) to the  plain text and then finally in to SenseEval-2 format xml files to be clustered by SenseClusters software.

2.  Now what I have been thinking about is that each of these files are read sequentially by my converter program and then converted in to a SensEval-2 xml. All the file conversions are independent and hence can be converted independently in parallel. Also for the task of clustering, as it exists now we have 79 instances of names to be discriminated. Each name is represented by a xml file. Hence, all these tasks for one name are atomic and independent of each other and in such situation I wish to make this execution parallel too. I am not exploiting functionality level parallelism but for now I do wish to exploit the data parallelism that it exists now.

3 . For doing the parallel processing I am trying to setup my own cluster of Virtual Machines, this will give me a hands on creating Cluster, be a cheap test bed for my parallel programs, and something to play around with for a few days. I have created 3 identical virtual machines with SenseCluster Installation. The Machines are Named Alang, Madan, Kulang. Named after 3 forts in Maharashta, India. I am in a process of creating the cluster of these virtual machines. Each of them having RAM of 512 MB and HDD of 8 Gigs along with a dual core AMD Athalon CPU. The RAM limitation is due to the limit of my physical memory available, I have only 3 Gigs with me :( … I will update all on this front soon (Once I set up the cluster, which might not be before weekend and also I have an exam comming up so this is kind of on back burner…) till then thats it from my side…

Wish me Luck! Keep your messages flowing in… and I will try to be in touch in future… ;-)

Tag Cloud: My interpretation…

Tags, a term that most of us bloggers have become familiar with and probably have had a chance to use as well. It mostly and in ideal cases depicts what words or phrases you would associate the given text with. Now for a human mind we have a large choice between terms and a wide variety of words and phrases that we can use to represent one single meaning or tag. This very fact provides us with the challenges of synonym tags, relativity or relationship in tags and retrieving similar meaning tags and phrases that are not necessarily phrased in similar words.

Let us analyze the problem at hand, tags are assigned by the user and are based on the contents of blog according to the perception of the individuals assigning them. According to the distributional hypothesis of linguistics (Harris, 1954) we can say that words expressing similar meanings tend to occur near each-other in similar contexts. If we are to believe the users for their tags then we can assume that the different blogs with similar or same meaning tags will contain similar words. Now if they contain similar words, then question it poses to us is, can we infer using our knowledge of Natural Language Processing that these blogs will exibit similar tags or will be classified under same tag set or tag cloud.

Why are we doing this? There is a lot of text tagged these days and also there is a lot text and other material available on the web that is not tagged. So the aim of the method is to be able to classify that text under some tags may be at a lower priority level or something that will enable the discovery of that text as well based on the tags search. I prefer tag based search as it is more directed than key word based one.

This challenge relates to the classical problem of Natural Language Processing, the “Text classification problem” and we will try and survey if some of the proposed solutions for this and similar problems like Word Sense Disambiguation and similarity measure problems.

For our discussion here let us concentrate on the word vector based methods, WSD or similarity measure. For further discussion here I have a few methods in mind, which are as follows:-

1. Lesks method – one of the very old papers in WSD.

2. Using Co-occurrence vectors to find similarity of word vectors. – Niwa & Nitta 1996, and Wilk’s algorithm.

3. H. Sch¨utze, Automatic word sense discrimination. Computational Linguistics. 1998.

4. Using Wordnet based context vectors to estimate semantic relatedness of concepts. (Patwardhan, Pedersen 2006).