Singular Value Decomposition (SVD)
Latent semantic indexing works by projecting this large, multidimensional space down into a smaller number of dimensions. Keywords that are similar in meaning get squeezed together and lose their distinctiveness. This method of data compression allows LSI to go beyond conventional VSMs. It is done with SVD. With SVD, every matrix A can be written as A = UΣV^T, where U and V have orthonormal columns and Σ is a diagonal matrix of singular values.
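The factorization can be verified on a toy matrix. The following is a minimal sketch, assuming NumPy; the matrix values are invented for illustration:

```python
import numpy as np

# A toy 4x3 "term-document" matrix: rows are terms, columns are documents.
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

# Thin SVD: U and Vt have orthonormal columns/rows, s holds the
# singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back together recovers the original matrix.
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))  # True
```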
The formal explanation of SVD is beyond the scope of this paper. Readers are encouraged to consult a comprehensive online tutorial of SVD, which proceeds from matrix action all the way to noise filtering (http://www.acm.org/sigmm/MM98/electronic_proceedings/huang/node4.html).
The working principle of SVD is to reduce dimensions while preserving the relative distance between document vectors as much as possible. In this dimension reduction, information is lost and content words are superimposed on one another. Information loss sounds harmful, but here it is a good thing. In losing noise from our original term-document matrix, we also reduce the dimension to a manageable extent. After SVD, similar things become more similar, while dissimilar things remain distinct. Synonyms get superimposed, while the senses of polysemous words get separated. This reductive mapping is what gives LSI its seemingly intelligent behaviour of being able to correlate semantically related terms. We are really exploiting a property of natural language, namely that words with similar meaning tend to occur together.
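The superimposition of synonyms can be seen on a small invented example (a sketch assuming NumPy; the vocabulary and documents are made up): "car" and "automobile" never co-occur in any document, so their raw cosine similarity is 0, yet both co-occur with "engine". After truncating the SVD, the two rows collapse onto the same latent dimension, while an unrelated term stays distinct.

```python
import numpy as np

# Rows: terms (car, automobile, engine, fruit, banana); columns: four docs.
A = np.array([[1., 0., 0., 0.],   # car        (doc 1)
              [0., 1., 0., 0.],   # automobile (doc 2)
              [1., 1., 0., 0.],   # engine     (docs 1, 2)
              [0., 0., 1., 0.],   # fruit      (doc 3)
              [0., 0., 1., 1.]])  # banana     (docs 3, 4)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # keep only the two largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(cosine(A[0], A[1]))      # 0.0  (car vs automobile, raw counts)
print(cosine(A_k[0], A_k[1]))  # ~1.0 (car vs automobile, rank-2 space)
print(cosine(A_k[0], A_k[4]))  # ~0.0 (car vs banana stays dissimilar)
```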
To make things easier to understand, we take a simple example, reducing a 2-D space to a 1-D one. This happens to be called "linear regression" in statistics. Given a set of n points (Xi, Yi), the optimal characterization is a line f(x) = mx + b that is closest to all points in the set. This line also preserves the relative distance between the points to an optimal extent. When facing a 3-D space, you are looking for an optimal mapping of points in that 3-space onto a plane, preserving the relative distance of the points as much as possible.
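The 2-D case can be sketched in a few lines (assuming NumPy; the data points are invented, scattered around y = 2x + 1):

```python
import numpy as np

x = np.array([0., 1., 2., 3., 4.])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])  # roughly y = 2x + 1

# Least-squares fit of f(x) = m*x + b: the design matrix has
# columns [x, 1], and lstsq minimises the sum of squared residuals.
design = np.column_stack([x, np.ones_like(x)])
(m, b), *_ = np.linalg.lstsq(design, y, rcond=None)
print(m, b)  # slope close to 2, intercept close to 1
```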
A typical term space might have tens of thousands of dimensions and be projected down into fewer than 150. Nevertheless, in an n-dimensional space the principle is exactly the same. When the term-document matrix is decomposed into three matrices, the stretcher (the diagonal matrix) describes the dimensions of variation. The dimensions are thus ranked: the first singular value describes the dimension with the largest variation, the second the dimension with the second largest variation, and so on. The low-ranked dimensions are deleted because they contribute little variation, mostly noise. After the reduction, the original matrix is reconstructed from the truncated factors. I will not reproduce the standard examples here, but worked examples instantiate the whole procedure and aid understanding.
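The truncation step described above can be sketched as follows (assuming NumPy; a random matrix stands in for a real term-document matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 5))  # stand-in for a small term-document matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s)  # singular values, ranked in descending order of variation

k = 2
s_trunc = s.copy()
s_trunc[k:] = 0.0  # delete the low-ranked dimensions
A_k = U @ np.diag(s_trunc) @ Vt  # reconstruct the rank-k matrix

# By the Eckart-Young theorem, A_k is the best rank-k approximation
# of A; the approximation error equals the norm of the discarded
# singular values.
err = np.linalg.norm(A - A_k)
print(np.isclose(err, np.linalg.norm(s[k:])))  # True
```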
The SVD algorithm has a complexity of O(N^2 * k^3), where N is the number of terms plus documents, and k is the number of dimensions in the concept space. Typically, k will be small, ranging anywhere from 50 to 350. However, N grows rapidly as the number of terms and the number of documents increase. This makes the SVD algorithm infeasible for a large, dynamic collection.
[7. part of term paper submitted 2004.4 Corpus-based Semantics]