Today in the Probability Models class we learned about the Poisson process and its use for counting occurrences of an event in a given period of time. If events follow a (homogeneous) Poisson process, then for any two time periods of the same length the number of occurrences follows the same Poisson distribution; the counts themselves can differ, but their expected value is the same. This triggered a small question in my mind: do the parts of speech used in sentences also follow this distribution? If we count the number of verbs that occur in, say, a fixed word window, and *if* they follow a Poisson distribution, will the counts behave the same way from window to window? Can this be modelled with a Poisson distribution?
Now, I am yet to learn the properties of a Poisson process, and that may help clear up this thought. But this is something nice to think about over the weekend. (As if I have nothing else to think about…)
Well, I talked with my statistics professor, and he said we will discuss it on Monday. So I am looking forward to that.
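In the meantime, here is a rough sketch of how the verb-counting idea could be checked. It assumes the text has already been run through some part-of-speech tagger and reduced to a plain list of tags; the function names, the "VERB" label, and the toy tag list are all my own illustration, not real data. A Poisson distribution has its variance equal to its mean, so a variance-to-mean ratio far from 1 would be a first hint that the model does not fit.

```python
from statistics import mean, pvariance


def window_counts(tags, window, target="VERB"):
    """Count occurrences of `target` in consecutive, non-overlapping windows."""
    n_full = len(tags) // window  # an incomplete window at the end is dropped
    return [
        sum(1 for t in tags[i * window:(i + 1) * window] if t == target)
        for i in range(n_full)
    ]


def dispersion_index(counts):
    """Variance-to-mean ratio of the counts; a Poisson model predicts about 1."""
    m = mean(counts)
    if m == 0:
        raise ValueError("the target tag never occurs in any window")
    return pvariance(counts) / m


# Toy, hand-made tag sequence (an assumption for illustration only).
tags = ["DET", "NOUN", "VERB", "DET",
        "NOUN", "VERB", "VERB", "ADP",
        "NOUN", "PRON", "ADV", "ADJ"]

counts = window_counts(tags, window=4)
ratio = dispersion_index(counts)
```

With only three windows this says nothing meaningful, of course; on a real corpus one would want many windows and a proper goodness-of-fit test rather than a single ratio.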
Tags are a term that most of us bloggers have become familiar with and have probably had a chance to use as well. Ideally, tags are the words or phrases you would associate with a given text. The human mind, however, has a large choice of terms: a wide variety of words and phrases can represent one single meaning or tag. This very fact gives us the challenges of synonymous tags, relationships between tags, and retrieving tags and phrases that are similar in meaning but not necessarily phrased in similar words.
Let us analyze the problem at hand. Tags are assigned by users and are based on the contents of the blog as perceived by the individuals assigning them. According to the distributional hypothesis of linguistics (Harris, 1954), words expressing similar meanings tend to occur in similar contexts. If we trust the users' tags, we can assume that different blogs with the same or similar tags will contain similar words. The question this poses is the converse: if blogs contain similar words, can we infer, using our knowledge of Natural Language Processing, that they will exhibit similar tags or be classified under the same tag set or tag cloud?
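To make "contain similar words" a bit more concrete, one simple possibility is to turn each blog post into a vector of word counts and compare the vectors with cosine similarity. This is only a minimal sketch of that idea, not a method taken from any of the papers; the function names and the three toy posts are made up for illustration.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Bag-of-words vector: lowercase word -> number of occurrences."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy posts (invented); the first two would plausibly share a tag.
post_a = word_vector("the camera lens and the camera flash")
post_b = word_vector("a new lens for my camera")
post_c = word_vector("baking bread with fresh yeast")

sim_ab = cosine_similarity(post_a, post_b)
sim_ac = cosine_similarity(post_a, post_c)
```

Here the two camera posts get a positive score while the camera and baking posts share no words at all. Raw counts are dominated by common words like "the", so a real experiment would probably need stop-word removal or some weighting scheme on top of this.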
Why are we doing this? A lot of text is tagged these days, but there is also a lot of text and other material on the web that is not tagged. The aim of the method is to classify that untagged text under some tags, maybe at a lower priority level, so that it too can be discovered through tag search. I prefer tag-based search as it is more directed than keyword-based search.
This challenge relates to a classical problem of Natural Language Processing, the “text classification problem”, and we will try to survey some of the proposed solutions for this and for similar problems such as Word Sense Disambiguation (WSD) and similarity measures.
For our discussion here let us concentrate on the word-vector-based methods for WSD and similarity measures. I have a few methods in mind, which are as follows:
1. Lesk's method, one of the very old papers in WSD.
2. Using co-occurrence vectors to find the similarity of word vectors (Niwa & Nitta, 1996), and Wilks' algorithm.
3. H. Schütze, Automatic word sense discrimination. Computational Linguistics, 1998.
4. Using WordNet-based context vectors to estimate the semantic relatedness of concepts (Patwardhan & Pedersen, 2006).
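To give a flavour of the first method in the list, here is a small sketch of a simplified, Lesk-style disambiguator: it picks the sense whose dictionary gloss shares the most words with the context. This is a stripped-down variant written for illustration, not the algorithm exactly as published, and the glosses, the sense labels, and the stop-word list are my own toy assumptions rather than entries from a real dictionary.

```python
# Small hand-picked stop-word list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or",
             "that", "is", "for", "by", "i", "my", "at", "we"}


def tokens(text):
    """Set of lowercase, whitespace-separated words with stop words removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simplified_lesk(context, glosses):
    """Return the sense whose gloss overlaps most with the context words.

    `glosses` maps a sense label to its definition text.
    On a tie, the sense listed first wins.
    """
    context_words = tokens(context)
    return max(glosses,
               key=lambda sense: len(tokens(glosses[sense]) & context_words))


# Hypothetical glosses for the word "bank", written for this example.
bank_glosses = {
    "bank/finance": "an institution that accepts deposits of money and makes loans",
    "bank/river": "sloping land beside a river or lake",
}

sense_1 = simplified_lesk("I went to deposit money at the bank", bank_glosses)
sense_2 = simplified_lesk("we sat by the river on the grassy bank", bank_glosses)
```

The overlap here is exact word matching, so "deposit" and "deposits" do not count as the same word; stemming would be an obvious next refinement.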
It was on 1st Aug 2005 that I joined the erstwhile VERITAS Software India Limited as a contractor. It was a dream come true. As a first-year BCS student I had dreamed of joining the then high-paying corporation; well, I admit that at that time it was for the money they used to pay. I joined them as a C++ developer for trying out new compilers. Since then it has been an eventful journey. I got the hang of working at a big corporation and its operations, and learned the pros and cons of a big company. I saw one of the biggest infrastructures of my life at my disposal. It was really magnificent for me: until then the biggest server I had seen was a small-business Sun machine, and here they had two huge labs of Sun, IBM, HP, SG, etc. This was awesome; I worked at a company that used to deliver clustering and backup software to the world.
It was here that I saw two software giants merge their operations, cultures, and brands, an experience that I will cherish all my life. It is here that I published the first of my research papers, and it is here that I first got an invitation to present at an international conference. It is here as well that I made some friends for life, some very close, some not so close, some acquaintances, some professional relations, some sore relations and many good ones. Life here was a good journey.
Today, 13th of August 2007, a little over 2 years down the line, I stand at a juncture of my life where I will be parting ways with this phase of my professional life. It is a move I planned for years, a move towards further education. A journey towards my dreams, a quest to realize some things I have dreamed of. Another new era of professional challenges and fun.
It is with mixed feelings that I leave VERITAS, a company that was my dream as a young boy in my student years. A dream that came true, and I lived every moment of it. It is with a heavy heart at parting with so many good friends, and at parting with my research paper, which I won't be able to present. It is with the joy and satisfaction of having lived my dream, and on the path of chasing another one, that I say goodbye to VERITAS.