Archive for the 'Text Classification' Category
3 hour hack at Search result summarization…

One of my friends, Anand Kishore, along with some of his friends at Yahoo, built a nice text summarization app, “Dygest”, using the Search Monkey and some other SDKs. Their big achievement, they say, is that they were able to write a statistical text summarization algorithm. I don’t have the details of the algorithm with me, but what I could see from the summaries was that they were HELPFUL. So first up, kudos to you guys, good job. Next, I was intrigued by how they might have done it. That set my brain rolling, and I started reading up on what they might have done.

A few observations I made about their results:

1. Very well presented. :-) I am not good at web design and such, so I really admire that.

2. Their results (sentences) were too well formed to be machine generated. So I wondered, where did they come from?

3. My first hunch was that maybe they just put out the most interesting sentences. This turned out to be partly true: their sentences are in fact “picked” as is from the source text. But they include the entire sentence so that it reads well, and they probably include more than one sentence to make the summary coherent. They might also be doing some grammar analysis before putting the sentences together.

With these things going around in my head, I wondered what it would take to pick a meaningful sentence from the text in the simplest possible way. So I set out to write my own code to do the same. I pulled from the net a Python script that extracts text from a URL (it is not very powerful, but it works). Then I took two web pages returned by Dygest for the search term “Stimulus” and converted them to text using this Python script. I wrote a small Perl script to clean the text and build a matrix of the form “paragraph × words”, and then scored each paragraph by the number of meaningful words it contributes. Here, meaningful words are the words left after stripping out the stop words (using a stop list put together by searching online). The paragraph with the maximum score is selected as the representative for the article.
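The scoring step described above can be sketched roughly as follows. This is a minimal Python illustration, not the original Perl script: the tiny `STOP_WORDS` set, the function names, and the choice to count each distinct meaningful word once per paragraph are all assumptions made for the example.

```python
import re

# Tiny illustrative stop list (an assumption; the post used a list found online).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it",
              "that", "for", "on", "was", "with", "as", "by"}


def meaningful_words(paragraph):
    """Return the distinct lowercase words of a paragraph, minus stop words."""
    tokens = re.findall(r"[a-z']+", paragraph.lower())
    return {w for w in tokens if w not in STOP_WORDS}


def build_matrix(paragraphs):
    """Build a binary paragraph x words matrix (one row per paragraph)."""
    word_sets = [meaningful_words(p) for p in paragraphs]
    vocab = sorted(set().union(*word_sets))
    matrix = [[1 if w in words else 0 for w in vocab] for words in word_sets]
    return vocab, matrix


def best_paragraph(paragraphs):
    """Score each paragraph by its row sum and return (best index, all scores)."""
    _, matrix = build_matrix(paragraphs)
    scores = [sum(row) for row in matrix]
    best = max(range(len(paragraphs)), key=scores.__getitem__)
    return best, scores
```

Calling `best_paragraph(paragraphs)` on a list of paragraph strings returns the index of the highest-scoring paragraph along with every score. Note that a raw count like this favours long paragraphs, which is one reason a better importance measure would be worth trying.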

I know this is a really naive method, but I wanted to validate the idea I had in mind. And it turns out it stands validated. I don’t have the results with me on this machine right now to put up here, but I will do that once I get home. Also, I was surprised that it did select some really good paragraphs as answers.

There are a few variations I would have liked to try but did not. Here they are:

1. Use a better importance measure.

2. Add more granularity to the text selected. I could easily go to sentence level and then show the top 3 sentences as answers.

3. Use second order context similarity for the search term and the paragraph selected. This would be really interesting but is a lot more involved and I did not have enough time.
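As a rough idea of what the first two variations might look like together, here is a Python sketch that works at sentence level and ranks sentences by the average document frequency of their meaningful words. None of this was tried in the original hack: the importance measure, the naive sentence splitter, the small stop list, and all names are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Small illustrative stop list (an assumption, not the list used in the post).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it",
              "that", "for", "on", "was", "with", "as", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def top_sentences(text, k=3):
    """Return the k highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    tokens = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    # Word frequency over the whole text serves as a crude importance weight.
    freq = Counter(w for toks in tokens for w in toks)

    def score(i):
        toks = tokens[i]
        # Average rather than sum, so long sentences are not favoured.
        return sum(freq[w] for w in toks) / len(toks) if toks else 0.0

    ranked = sorted(range(len(sentences)), key=score, reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]
```

Returning the chosen sentences in their original order keeps the summary readable, much like the “picked as is” sentences observed in Dygest.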

All in all, it turned out to be a fun app. I will be uploading it here soon, so keep watching this post, or email me if you are in a real hurry. But again, I don’t claim that the code is top class or the best way to go. It is one way to go, though. I also want to thank Andy (Anand Kishore), whose post (Dygest), which I mentioned before, inspired this act of mine.

Tag Cloud: My interpretation…

Tags: a term that most of us bloggers have become familiar with and have probably had a chance to use as well. Ideally, tags are the words or phrases you would associate with a given text. The human mind has a large choice of terms, and a wide variety of words and phrases can represent one single meaning or tag. This very fact presents us with the challenges of synonymous tags, relationships between tags, and retrieving tags and phrases with similar meanings that are not necessarily phrased in similar words.

Let us analyze the problem at hand. Tags are assigned by users, based on the contents of the blog as perceived by the individuals assigning them. According to the distributional hypothesis of linguistics (Harris, 1954), words expressing similar meanings tend to occur in similar contexts. If we trust the users’ tags, then we can assume that different blogs with the same or similar-meaning tags will contain similar words. The question this poses is: if blogs contain similar words, can we infer, using our knowledge of Natural Language Processing, that they will exhibit similar tags or be classified under the same tag set or tag cloud?

Why are we doing this? A lot of text is tagged these days, but there is also a lot of text and other material on the web that is not tagged. So the aim of the method is to classify that untagged text under some tags, perhaps at a lower priority level, so that it can also be discovered through tag search. I prefer tag-based search, as it is more directed than keyword-based search.
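To make the idea concrete, here is one simple way it could be sketched in Python: build a bag-of-words profile for each tag from already-tagged posts, then suggest for an untagged text the tags whose profiles are most similar by cosine similarity. This is only an illustration under assumptions; the function names, the raw word counts, and the cosine measure are choices made for the example, not a method settled on in this post.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphabetic tokens in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tag_profiles(tagged_posts):
    """Merge the word counts of all posts carrying each tag.

    tagged_posts is a list of (text, list_of_tags) pairs.
    """
    profiles = {}
    for text, tags in tagged_posts:
        words = bag_of_words(text)
        for tag in tags:
            profiles.setdefault(tag, Counter()).update(words)
    return profiles


def suggest_tags(text, profiles, k=2):
    """Return the k tags whose profiles are most similar to the text."""
    words = bag_of_words(text)
    ranked = sorted(profiles, key=lambda t: cosine(words, profiles[t]),
                    reverse=True)
    return ranked[:k]
```

For example, given a handful of posts tagged “programming” and “cooking”, `suggest_tags("a python script with a loop", profiles, k=1)` would rank the tags by word overlap with the new text. Tags suggested this way could be stored at the lower priority level mentioned above, since they are inferred rather than user-assigned.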

This challenge relates to a classical problem of Natural Language Processing, the “text classification problem”, and we will try to survey some of the solutions proposed for this and for similar problems such as Word Sense Disambiguation (WSD) and similarity measures.

For our discussion here, let us concentrate on word-vector-based methods for WSD and similarity measures. I have a few methods in mind, which are as follows:

1. Lesk’s method – one of the oldest papers in WSD.

2. Using co-occurrence vectors to find the similarity of word vectors (Niwa & Nitta, 1996), and Wilks’ algorithm.

3. H. Schütze, Automatic word sense discrimination. Computational Linguistics, 1998.

4. Using WordNet-based context vectors to estimate the semantic relatedness of concepts (Patwardhan & Pedersen, 2006).
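To give a flavour of the first method in the list, here is a small Python sketch of a simplified Lesk-style variant: pick the sense whose dictionary gloss shares the most words with the context of the target word. The glosses below are hypothetical toy definitions written only for this example (they do not come from a real dictionary), and the stop list and function names are likewise assumptions.

```python
import re

# Small illustrative stop list (an assumption).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or",
              "that", "for", "at", "by", "is", "i", "my", "we"}

# Hypothetical toy glosses, written for this example only.
GLOSSES = {
    "bank": {
        "financial": "an institution that accepts deposits of money and makes loans",
        "river": "sloping land beside a body of water such as a river",
    }
}


def content_words(text):
    """Return the distinct lowercase words of a text, minus stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS}


def simplified_lesk(word, context, glosses=GLOSSES):
    """Pick the sense of `word` whose gloss overlaps most with the context."""
    context_words = content_words(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses[word].items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

With these toy glosses, a context mentioning “money” leans towards the financial sense, while one mentioning “river” and “water” leans towards the river sense. Exact word matching is brittle (“deposit” does not match “deposits”), which is part of the motivation for the vector-based methods later in the list.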