Search Engine - Document Reprensentation and Index

When we talk about search engine, we actually talk about one form of information retrieval systems. Web pages as documents, and web search engine gives us results based on our query. The idea seems simple, but how does search engine get this job done in millisecond? The number of web pages could be as high as 180 quadrillion! Before going deep into this, first we would like to know the representation of web pages in search engine, and we start with a simple document retrieval.

The simplest form of document retrieval is for a computer to do a linear scan through documents. But for large collection of documents it could be time consuming and low efficient, and for queries need more ﬂexible matching operations (multi keywords), this way does not seem feasible.

Boolean retrieval

The way to avoid linearly scanning the texts for each query is to index the documents in advance. This sounds familiar in database systems. We can use binary term-document incidence matrix to represent a collection of documents, it’s the same form as bag of words representation of document vector. But imagin we have millions of documents, even our total vocabulary is a thousand words, we would have a matrix of 1k x 1M. Since document would not have all the words in the vocabulary, the matrix is extremely sparse! A better representation is recording only 1’s position. Further more, we assign each document a unique id, and for each term t, we store a list of all documents that contain t.

The Boolean retrieval model is being able to ask a query that is a Boolean expression. That is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words. When process AND queries, the model retrives document index lists with every term in the query, and take intersection of those indices.

Inverted index construction

Tokenization -> cut character sequence into word tokens
- word segmentation
Normalization -> map text and query term to same form (U.S.A. and USA)
- accents and diacritics (color and colour)
- capitalization/case-folding (proper nouns)
Stemming -> make different forms of a root to match
Stop words removal -> omit very common words (or not)
- Web search engines generally do not use stop lists

Here is a more detailed post about index construction.

Phrase queries

Biword indexes

consider every pair of consecutive terms in a document as a phrase

Positional indexes

for each term in the vocabulary, we store postings of the form docID: 〈 position1, position2, … 〉where each position is a token index in the document

Combination schemes

a combination strategy uses a phrase index, or just a biword index, for certain queries and uses a positional index for other phrase queries.

Index compression

Benefits of compression:

less disk space
fit a lot more information into main memory -> decrease the response time of the system
faster transfer of data from disk to memory

Heaps’ law

$$M = kT^b$$

the simplest possible relationship between collection size and vocabulary size is linear in log-log space

Zipf’s law

$$cf_i \varpropto \frac{1}{i}$$ collection frequency $cf_i$ frequency decreases very rapidly with rank

Lossless compression

Dictionary compression
- srote the dictionary terms as one long string of characters
- increase the block size (group several terms into one block, only need one pointer)
Posting file compression (document ID)
- variable byte codes

Lossy compressoin

case folding
stemming
stop word elimination

Reference

CS276
Introduction to Information Retrieval