Introduction to LSI
Here we can make an analogy to help us understand the principles of VSM and LSI. Suppose you want to know what on earth Cognitive Science is, and more concretely, which subjects contribute to it. You send out questionnaires listing 10 subjects and ask people to rank them from 10 (most important to Cognitive Science) down to 1 (least important). You note down the rankings of three of these subjects --- linguistics, psychology and neuroscience. You can graph the results of your survey by setting up a chart with three orthogonal axes, one for each keyword. To plot a particular questionnaire, you look up the rank of each of the three keywords in it and take the appropriate number of steps (1-10) along the axis for that keyword. Each questionnaire thus corresponds to a point in this space, whose position is determined by how highly the three keywords are ranked in it. When you are finished, you get a cluster of points in this three-dimensional space.
If you draw a line from the origin of the graph to each of these points, you obtain a set of vectors in the "Cognitive Science" space. The magnitude and direction of each vector tell you how highly the three keywords were ranked in a particular questionnaire, and the set of all the vectors taken together tells you something about the kind of Cognitive Science people favour.
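The picture above can be made concrete with a small sketch. The rankings below are hypothetical, invented purely for illustration; the cosine of the angle between two questionnaire vectors measures how similar their directions are (1.0 means they point the same way).

```python
import math

# Hypothetical rankings (1-10) of the three keywords in two questionnaires,
# in the order [linguistics, psychology, neuroscience].
q1 = [8, 9, 7]
q2 = [9, 8, 6]

def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two respondents with similar preferences give vectors pointing in
# nearly the same direction, so the cosine is close to 1.
print(cosine(q1, q2))
```

Here the two questionnaires rank the keywords similarly, so their vectors are nearly parallel and the cosine comes out close to 1.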
We chose three keywords, so we get a 3-D visualization. However, this space could have any number of dimensions, depending on how many keywords we chose to use. If we were to include Artificial Intelligence and Anthropology as well, we would have a 5-dimensional term space. In the real world this term space usually has many thousands of dimensions: each document in our collection is a vector with as many components as there are content words. Although we cannot visualize such a space, the basic idea is the same as in the Cognitive Science example. If we use binary term weights, that is, 1 if the term occurs in the document and 0 otherwise, then intuitively documents that have many words in common will have vectors that are near to each other, while documents with few shared words will have vectors that are far apart.
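The same idea scales from questionnaires to documents. The sketch below, with an invented five-word vocabulary and made-up toy "documents", builds binary term vectors and shows that documents sharing words get vectors pointing in similar directions, while documents with no shared words are orthogonal.

```python
import math

# Hypothetical vocabulary of content words; each position in a document
# vector corresponds to one of these words.
vocab = ["brain", "language", "memory", "neuron", "syntax"]

def to_binary_vector(text):
    """Binary term weights: 1 if the word occurs in the document, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

# Toy documents (invented for illustration).
doc_a = to_binary_vector("the brain stores memory in each neuron")
doc_b = to_binary_vector("neuron firing in the brain")
doc_c = to_binary_vector("syntax is part of language")

def cosine(u, v):
    """Cosine similarity between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# doc_a and doc_b share "brain" and "neuron", so their vectors are near;
# doc_a and doc_c share no vocabulary words, so their vectors are orthogonal.
print(cosine(doc_a, doc_b))
print(cosine(doc_a, doc_c))
```

In a real collection the vocabulary would have thousands of entries, but the geometry is unchanged: nearness of vectors stands in for similarity of content.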
[6. part of term paper submitted 2004.04 Corpus-based Semantics]