Search engine vs. classic IR
Attractive and useful as they are, search engines are often confused with free-text retrieval systems. In fact, retrieval from web pages differs considerably from classic IR. The main differences are the following:
First, they differ in data volume. A classic IR system typically handles on the order of several gigabytes, while search engines must deal with billions of web pages. Search engines therefore usually employ many servers for web-page indexing, which is unnecessary for a typical enterprise.
Second, they differ in ranking. Search engines such as Google use special ranking algorithms that score documents not only by relevance of content but also by the importance of the web pages. A real enterprise application is concerned only with content relevance: given a query, the most relevant information should be placed first. Link analysis is therefore of little use there.
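The contrast above can be sketched in a few lines. This is a toy illustration with hypothetical scores, not any engine's actual algorithm: a web engine blends a content-relevance score with a query-independent importance score (in the spirit of link analysis), while an enterprise system ranks by relevance alone, so the two can produce different orderings of the same documents.

```python
# Toy sketch (hypothetical data): web-style ranking blends content
# relevance with page importance; enterprise-style ranking uses
# relevance only.

def combined_score(relevance, importance, alpha=0.7):
    """Weighted mix of content relevance and page importance.
    alpha is a hypothetical tuning weight, not a published value."""
    return alpha * relevance + (1 - alpha) * importance

# (doc_id, content relevance, link-based importance) -- made-up numbers
docs = [("a", 0.9, 0.1), ("b", 0.6, 0.9), ("c", 0.8, 0.5)]

# Enterprise-style ranking: relevance only.
enterprise = sorted(docs, key=lambda d: d[1], reverse=True)

# Web-style ranking: relevance blended with importance.
web = sorted(docs, key=lambda d: combined_score(d[1], d[2]), reverse=True)

print([d[0] for d in enterprise])  # -> ['a', 'c', 'b']
print([d[0] for d in web])         # -> ['c', 'b', 'a']
```

Document "a" wins on pure relevance, but once importance is mixed in, the well-linked documents "c" and "b" overtake it, which is exactly why link analysis matters on the web but not inside an enterprise.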
Third, enterprise IR systems are usually real-time applications: once the data changes, retrieval results should reflect those changes. Search engines, by contrast, keep their indexing module and retrieval module separate, and indexes are updated only periodically; a large search engine like Google reportedly needs 28 days.
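A minimal sketch of the real-time side of this contrast, using a hypothetical in-memory inverted index: the enterprise system re-indexes a document the moment it changes, so the very next query sees the new content, whereas a batch-oriented web engine would not notice until its next periodic rebuild.

```python
# Minimal sketch (hypothetical in-memory index): real-time updates mean
# a changed document is visible to queries immediately.

from collections import defaultdict

index = defaultdict(set)   # term -> set of doc ids (inverted index)
docs = {}                  # doc id -> current text

def upsert(doc_id, text):
    """Index a new or changed document in place (real-time update)."""
    for term in docs.get(doc_id, "").split():   # drop stale postings
        index[term].discard(doc_id)
    docs[doc_id] = text
    for term in text.split():
        index[term].add(doc_id)

def search(term):
    """Return the ids of all documents containing the term."""
    return sorted(index[term])

upsert("d1", "annual sales report")
upsert("d2", "sales forecast")
print(search("sales"))   # both documents found immediately
upsert("d1", "annual audit report")
print(search("sales"))   # d1 disappears right after its change
```

A periodic-rebuild engine would instead collect such changes and rebuild the whole index offline, which scales to billions of pages but cannot offer this immediacy.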
Fourth, since both the data and the user population are huge and heterogeneous, search engines have great difficulty applying computation-intensive techniques from data mining and classic IR. At present, most search engines simply match keywords. It is much easier to apply intelligent, individualized techniques in user-specific and data-specific enterprise IR systems.
[Part 2 of a term paper, Corpus-based Semantics, submitted April 2004]