Not every term is equally important to its document. Terms therefore need to be weighted, whether with binary weights, raw term frequency, or TF-IDF. Under TF-IDF, a term's weight is determined by its frequency of occurrence in a given document and by the inverse of the frequency of documents containing that term. This is intuitively plausible. Mathematically:
Weight(Wi) = TF(Wi, Doc) * IDF(Wi) = TF(Wi, Doc) * log(N/n + 1)
where TF(Wi, Doc) is the frequency of word Wi in document Doc and IDF(Wi) is the inverse document frequency of Wi, with N the total number of documents and n the number of documents containing word Wi.
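This weighting scheme can be sketched in a few lines of Python. The function, corpus, and variable names below are our own illustrative assumptions; a document is taken to be a simple list of tokens:

```python
import math

def tfidf(term, doc, docs):
    """Weight(Wi) = TF(Wi, Doc) * log(N/n + 1) -- illustrative sketch."""
    tf = doc.count(term)                   # TF(Wi, Doc): raw frequency in doc
    n = sum(1 for d in docs if term in d)  # n: documents containing the term
    N = len(docs)                          # N: total number of documents
    return tf * math.log(N / n + 1) if n else 0.0

# Hypothetical toy corpus of three tokenized documents
docs = [["cat", "sat", "cat"], ["dog", "sat"], ["cat", "dog"]]
w = tfidf("cat", docs[0], docs)            # 2 * log(3/2 + 1)
```

A frequent term concentrated in few documents gets a high weight, while a term absent from the document contributes nothing.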
However, since term-by-document matrices are usually sparse and high-dimensional, conventional vector space models are susceptible to noise (differences in word usage, terms that do not help distinguish documents, etc.) and have difficulty capturing the underlying semantic structure. In addition, the computing resources needed to store and process such data are enormous. Dimensionality reduction is a way to overcome these problems: it maps vectors from a high-dimensional space to a much lower-dimensional space. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are popular dimensionality reduction techniques based on matrix decomposition. Latent semantic indexing (LSI) uses SVD to reduce the vector space.
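A minimal sketch of such a reduction, using NumPy's SVD on a toy term-by-document matrix (the matrix itself and the choice of k = 2 latent dimensions are illustrative assumptions, not data from the paper):

```python
import numpy as np

# Toy 5-term x 4-document weight matrix (rows = terms, columns = documents)
A = np.array([[1., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.],
              [0., 0., 0., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                        # latent dimensions to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of A
docs_k = np.diag(s[:k]) @ Vt[:k, :]          # each column: a document in k dims
```

Discarding the smaller singular values removes much of the noise while preserving the dominant term-document associations, which is the core idea behind LSI.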
Here we can make an analogy to help us understand the principles of VSM and LSI. Suppose you want to know what Cognitive Science really is and, more concretely, which subjects contribute to it. You send out questionnaires listing 10 subjects and ask people to rank them from 10 (most important to Cognitive Science) down to 1 (least important to Cognitive Science). You note down the rankings of 3 of these subjects: linguistics, psychology, and neuroscience. You can graph the results of your survey by setting up a chart with three orthogonal axes, one for each keyword. To plot a particular questionnaire, you take its rank for each of the 3 keywords and move the corresponding number of steps (1-10) along that keyword's axis. Each questionnaire forms a vector in that space, with its direction and magnitude determined by the ranks of the three keywords. When you are finished, you get a cluster of points in this three-dimensional space.
If you draw a line from the origin of the graph to each of these points, you obtain a set of vectors in the “Cognitive Science” space. The size and direction of each vector tell you how highly the three keywords were ranked in a particular questionnaire, and the set of all the vectors taken together tells you something about the kind of Cognitive Science people favour.
We chose 3 keywords, so we get a 3-D visualization. However, this space could have any number of dimensions, depending on how many keywords we chose to use. If we were to add Artificial Intelligence and Anthropology as well, we would have a 5-dimensional term space. In the real world this term space usually has many thousands of dimensions: each document in our collection is a vector with as many components as there are content words. Although we cannot visualize such spaces, the basic idea is the same as in the Cognitive Science example. If we use binary term weights, that is, whether or not the term occurs in the document, then intuitively documents that have many words in common will have vectors that are near to each other, while documents with few shared words will have vectors that are far apart.
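This intuition can be checked with a small sketch (the documents and vocabulary below are hypothetical; binary weights are used, with cosine similarity as the measure of closeness):

```python
import math

def binary_vector(doc, vocab):
    # 1 if the term occurs in the document, 0 otherwise
    return [1 if term in doc else 0 for term in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["linguistics", "psychology", "neuroscience"]
d1 = ["linguistics", "psychology"]   # shares "psychology" with d2
d2 = ["psychology", "neuroscience"]
d3 = ["neuroscience"]                # shares nothing with d1

sim_12 = cosine(binary_vector(d1, vocab), binary_vector(d2, vocab))
sim_13 = cosine(binary_vector(d1, vocab), binary_vector(d3, vocab))
```

As expected, the pair of documents with a shared word ends up with a higher cosine similarity than the pair with none.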
[5.part of termpaper submitted 2004.04 Corpus-based Semantics]