3 hour hack at Search result summarization…

One of my friends, Anand Kishore, along with some of his friends at Yahoo, built a nice text summarising app, “Dygest”, using Search Monkey and some other SDKs. Their big achievement, they say, is that they were able to write a statistical text summarising algorithm. I don’t have the details of the algorithm with me, but what I could see from the summaries was that they were HELPFUL. So first up, kudos to you guys, good job. Next, I was intrigued by how they must have done it; that set my brain rolling, and I started reading up on what they might have done.

A few observations I made about their results:

1. Very well presented. :-) I am not good at web design and stuff, so I really admire that.

2. Their results (sentences) were just too well formed to be machine generated. So I was like, where did they come from?

3. My first hunch was that maybe they just put out the most interesting sentences, which turned out to be partly true: their sentences are in fact “picked” as is from the source text. But they put in the entire sentence so that it reads well, and they probably also put in more than one sentence to make it sound coherent. They might also be doing some grammar analysis before putting the sentences together.

So these things were going around in my head, and I was like, what would it take to pick meaningful sentences from the text in the simplest way? Then I set out to write my own code to do the same. I pulled from the net a Python script that could extract text from a URL (it is not so powerful, but it works). Then I took two web pages returned by Dygest for the search term “Stimulus” and converted them to text using this Python script. I wrote a small Perl script to clean the text and build a matrix of the form “paragraphs × words”, and then scored each paragraph based on the number of meaningful words it contributes. Here, meaningful words are the words left after stripping out the stop words (using a stop list put together by searching online). The paragraph with the maximum score is selected as the representative for the article.
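To make the scoring idea concrete, here is a minimal sketch in Python rather than the Perl I actually used. The tiny stop list, the function names, and the choice to count distinct meaningful words (rather than every occurrence) are all assumptions for illustration, not the original script.

```python
import re

# Tiny illustrative stop list (an assumption; the real one was collected online).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "that", "this", "for", "on", "with", "as", "was", "are", "be"}

def meaningful_words(paragraph):
    """Lowercase the paragraph, tokenize it, and drop stop words."""
    tokens = re.findall(r"[a-z']+", paragraph.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def score_paragraphs(text):
    """Return (score, paragraph) pairs, one per blank-line-separated paragraph.

    The score is the number of distinct meaningful words in the paragraph.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(len(set(meaningful_words(p))), p) for p in paragraphs]

def best_paragraph(text):
    """Pick the highest-scoring paragraph (first one wins on ties)."""
    scored = score_paragraphs(text)
    return max(scored, key=lambda pair: pair[0])[1] if scored else ""
```

Calling `best_paragraph(page_text)` on the extracted text of a page returns the single paragraph that would be shown as the summary.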

I know this is a really naive method of doing things, but I wanted to validate the idea I had in mind, and it turns out it stands validated. I don’t have the results with me on this machine right now to put up here, but I will do that once I get home. Also, I was surprised that it did select some really good paragraphs as answers.

There are a few variations I would have liked to try but did not. Here they are:

1. Use a better importance measure.

2. Add more granularity to the text selected. I could easily go to sentence level and then show the top 3 sentences as answers.

3. Use second-order context similarity between the search term and the selected paragraph. This would be really interesting but is a lot more involved, and I did not have enough time.
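The second variation (going down to sentence level and showing the top 3) could look roughly like the sketch below. I did not actually build this; the naive sentence splitter, the small stop list, and the function name `top_sentences` are assumptions for illustration.

```python
import re

# Small illustrative stop list (an assumption, not the list used in the hack).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "that", "this", "for", "on", "with", "as", "was", "are", "be"}

def top_sentences(text, k=3):
    """Return the k sentences with the most distinct non-stop words."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return len({w for w in words if w not in STOP_WORDS})

    # Rank sentence indices by score, keep the best k ...
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)[:k]
    # ... then restore source order so the summary reads naturally.
    return [sentences[i] for i in sorted(ranked)]
```

Putting the picked sentences back in their original order is a design choice on my part; it keeps the output closer to how the source text flows.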

All in all, it turned out to be a fun app. I will be uploading it here soon, so keep watching this post, or email me if you are in a real hurry. But again, I don’t claim that the code is top class or the best way to go. It is one of the ways to go, though. I also want to thank Andy (Anand Kishore), whose post (Dygest), which I mentioned before, inspired this act of mine.

Where the heck have I been???

For those of you who blame me for not staying in touch for the past few weeks, sorry! I have been busy with the activities below and have been out of touch with humanity.

1. Working on my research on “Name Discrimination” for the Web People Search (WePS) task corpus.

2. Designing and developing a small POC for clustered processing of the above data.

3. For task 2, designing a small 3-machine cluster of Ubuntu virtual machines on my desktop using VMware.

Now, that is something that has taken most of my spring break and will probably take some more time to become reality. Below is my progress on each of the items.

1. WePS task: I have nearly completed the program to convert the data from the corpus file structure (XML and HTML files) to plain text and then finally into Senseval-2 format XML files to be clustered by the SenseClusters software.

2. Now, what I have been thinking about is that each of these files is read sequentially by my converter program and then converted into a Senseval-2 XML file. All the file conversions are independent and hence can be done in parallel. Also, for the task of clustering, as it exists now we have 79 instances of names to be discriminated. Each name is represented by an XML file. Hence, the tasks for each name are atomic and independent of one another, and in such a situation I wish to make this execution parallel too. I am not exploiting functionality-level parallelism, but for now I do wish to exploit the data parallelism that exists.

3. For the parallel processing, I am trying to set up my own cluster of virtual machines. This will give me hands-on experience creating a cluster, be a cheap test bed for my parallel programs, and be something to play around with for a few days. I have created 3 identical virtual machines with SenseClusters installed. The machines are named Alang, Madan and Kulang, after 3 forts in Maharashtra, India. I am in the process of creating the cluster from these virtual machines. Each of them has 512 MB of RAM and an 8 GB HDD, along with a dual-core AMD Athlon CPU. The RAM limitation is due to the limit of my available physical memory; I have only 3 GB with me :( … I will update you all on this front soon (once I set up the cluster, which might not be before the weekend, and I also have an exam coming up, so this is kind of on the back burner…). Till then, that’s it from my side…
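The data parallelism from item 2 can be sketched on a single machine before any cluster exists. This is only an illustration in Python: `convert_one` is a hypothetical stand-in for my converter, the `<doc>` wrapper is a placeholder and not the real Senseval-2 schema, and the three workers merely mirror the three VMs.

```python
from concurrent.futures import ThreadPoolExecutor
from xml.sax.saxutils import escape, quoteattr

def convert_one(name, raw_text):
    """Hypothetical converter: wrap one name's text in a placeholder XML shell.

    The real program would emit Senseval-2 XML; this only shows that each
    name can be handled without looking at any other name.
    """
    body = escape(raw_text.strip())
    return name, f"<doc name={quoteattr(name)}>{body}</doc>"

def convert_all(inputs, max_workers=3, executor_cls=ThreadPoolExecutor):
    """Convert every name -> text entry independently and collect the results."""
    with executor_cls(max_workers=max_workers) as pool:
        # One independent task per name: this is the data parallelism.
        futures = [pool.submit(convert_one, n, t) for n, t in inputs.items()]
        return dict(f.result() for f in futures)
```

For CPU-heavy conversion, `concurrent.futures.ProcessPoolExecutor` could be passed as `executor_cls` instead; spreading the work over Alang, Madan and Kulang would need a real job distribution mechanism, which this sketch does not cover.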

Wish me Luck! Keep your messages flowing in… and I will try to be in touch in future… ;-)