We cannot guarantee that every word in our queries is typo-free. However, a search engine can still try to correct our spelling and give us what we actually want. There is no mystery behind it; the truth is that search engines do a lot of work to make it happen.
To search an index, a search engine has to keep a vocabulary.
Vocabulary lookup usually relies on a classical data structure called the dictionary, and there are two broad classes of solutions: hashing and search trees.
In hashing, there is no easy way to find minor variants of a query term since these could be hashed to very different integers.
The best-known search tree is the binary search tree; we may also consider the B-tree, a search tree in which every internal node has a number of children within some interval. The search for a term begins at the root of the tree.
Before talking about spelling correction, let's take a look at a special kind of query called the wildcard query.
Wildcard query
A search like “lico*” will find all documents containing any word beginning with “lico”; this is called a wildcard query.
A search tree on the dictionary is a convenient way of handling trailing wildcard queries: we walk down the tree along the path l-i-c-o, enumerate the set of terms in the dictionary with the prefix “lico”, and finally look up every term in the set on the standard inverted index to retrieve all documents containing any of them.
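As a minimal sketch, a sorted list plus binary search can emulate the ordered range scan a B-tree over the dictionary would perform (the vocabulary below is a made-up sample):

```python
import bisect

def prefix_lookup(vocabulary, prefix):
    # Binary search for the range of terms starting with `prefix`;
    # this emulates the ordered range scan a B-tree would perform.
    lo = bisect.bisect_left(vocabulary, prefix)
    hi = bisect.bisect_left(vocabulary, prefix + "\uffff")  # just past all matches
    return vocabulary[lo:hi]

vocab = sorted(["license", "lichen", "licorice", "light", "lion"])
print(prefix_lookup(vocab, "lico"))  # ['licorice']
```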
For a more general wildcard query, where * can appear anywhere in the term, we use a special index called the permuterm index, a form of inverted index. The idea behind it is simple: a * occurring at the end of a term is computationally friendly (it is just a prefix search), so we index every rotation of each term and rotate the query until its * lands at the end.
To create a permuterm index:
- Add a $ to the end of each term
- Generate all rotations of the resulting term and index each rotation in a B-tree
For the term hello, we have:
hello$, ello$h, llo$he, lo$hel, o$hell, $hello
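A minimal sketch of generating those rotations:

```python
def permuterm_rotations(term):
    # Append '$' and produce every rotation; each rotation becomes
    # a B-tree key pointing back to the original term.
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
```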
Permuterm query processing
To process a query: add the $, rotate the query so that * moves to the end, and look it up in the permuterm index:

| Query | Lookup on | Example |
| --- | --- | --- |
| X | X$ | hello$ for hello |
| X* | $X* | $hel* for hel* |
| *X | X$* | llo$* for *llo |
| *X* | X* | ell* for *ell* |
| X*Y | Y$X* | lo$h* for h*lo |
| X*Y*Z | treat as a search for X*Z and post-filter | for h*a*o, search for h*o by looking up o$h*, then filter out hello and retain halo |
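Here is a rough sketch of the single-* cases from the table, assuming a plain dict with a sorted key list standing in for the B-tree:

```python
import bisect

def build_permuterm_index(vocabulary):
    # Map every rotation of term + '$' back to its term.
    index = {}
    for term in vocabulary:
        augmented = term + "$"
        for i in range(len(augmented)):
            index[augmented[i:] + augmented[:i]] = term
    return index, sorted(index)  # sorted keys stand in for the B-tree

def permuterm_lookup(index, keys, query):
    # Handle a query containing exactly one '*': rotate X*Y$ into Y$X*
    # so the '*' lands at the end, then prefix-match the rotated key.
    x, y = query.split("*")
    rotated = y + "$" + x
    lo = bisect.bisect_left(keys, rotated)
    hi = bisect.bisect_left(keys, rotated + "\uffff")
    return sorted({index[k] for k in keys[lo:hi]})

index, keys = build_permuterm_index(["hello", "halo", "help"])
print(permuterm_lookup(index, keys, "h*lo"))  # ['halo', 'hello']
print(permuterm_lookup(index, keys, "hel*"))  # ['hello', 'help']
```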
The permuterm index is simple, but it can lead to a considerable blowup from the number of rotations per term. So here we introduce another index, the k-gram index, which enumerates all k-grams (sequences of k characters) occurring in any term and maps each k-gram to the terms containing it.
When using k-grams, we can get false positives from the conjunction, e.g., a search for red* with the 3-grams $re AND red may return retired. So we add one more step called post-filtering, in which the terms enumerated by the Boolean query on the 3-gram index are checked individually against the original query red*. Terms that survive are then searched in the standard inverted index as usual.
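A minimal sketch of a 3-gram index with the post-filtering step, using a toy vocabulary and Python's fnmatch as a stand-in wildcard matcher:

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def build_kgram_index(vocabulary, k=3):
    # Map each k-gram (with '$' marking word boundaries) to the terms
    # that contain it.
    index = defaultdict(set)
    for term in vocabulary:
        augmented = "$" + term + "$"
        for i in range(len(augmented) - k + 1):
            index[augmented[i:i + k]].add(term)
    return index

def wildcard_lookup(index, query, k=3):
    # Collect the k-grams of each fixed piece of the query, intersect
    # their posting sets, then post-filter against the original pattern.
    grams = set()
    for piece in ("$" + query + "$").split("*"):
        grams |= {piece[i:i + k] for i in range(len(piece) - k + 1)}
    candidates = set.intersection(*(index[g] for g in grams)) if grams else set()
    # Post-filter: 'retired' matches $re AND red but not the pattern red*.
    return sorted(t for t in candidates if fnmatchcase(t, query))

index = build_kgram_index(["red", "redo", "retired", "render"])
print(wildcard_lookup(index, "red*"))  # ['red', 'redo']
```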
Spelling correction
Now it's time for spelling correction; the sections below walk through a simple spelling corrector.
Basically, there are two types of spelling errors:
- Non-word Errors
  - graffe -> giraffe
- Real-word Errors
  - Typographical errors
    - three -> there
  - Cognitive Errors (homophones)
    - piece -> peace
For non-word spelling error detection, we can just use a dictionary lookup: if the term does not appear in the dictionary, it's an error. For non-word spelling error correction, the steps are:
- Generate candidates: real words that are similar to the error
- Choose the best one, using one of two selection approaches:
- Shortest weighted edit distance
- Highest noisy channel probability
Candidate generation
Since spelling errors are often caused by typing, we can use edit distance to find candidates. The correction of a spelling error is usually within edit distance two. Common approaches include (see the sketch after this list):
- Run through the dictionary and check the edit distance to each word
- Generate all words within edit distance ≤ k (e.g., k = 1 or 2) and then intersect them with the dictionary
- Use a character k-gram index and find dictionary words that share “most” k-grams with the word (e.g., by Jaccard coefficient)
- Compute them fast with a Levenshtein finite state transducer
- Have a precomputed map of words to possible corrections
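Here is a minimal sketch of the second approach (generate all words within edit distance ≤ k, then intersect with the dictionary), along the lines of Peter Norvig's well-known corrector; the dictionary below is a toy sample:

```python
import string

def edits1(word):
    # All strings one edit away: deletes, transposes, replaces, inserts.
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right
                for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, dictionary, k=2):
    # Expand k levels of edits, then intersect with the dictionary.
    level, seen = {word}, {word}
    for _ in range(k):
        level = {e for w in level for e in edits1(w)} - seen
        seen |= level
    return seen & dictionary

dictionary = {"giraffe", "gaffe", "grade"}
print(sorted(candidates("graffe", dictionary)))  # ['gaffe', 'giraffe', 'grade']
```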
Candidate selection
The most straightforward way is to select the candidate with minimal edit distance. When two correctly spelled candidates are tied (or nearly tied), select the one that is more common.
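A minimal sketch, using the classic Levenshtein dynamic program and made-up frequency counts:

```python
def edit_distance(a, b):
    # Classic Levenshtein dynamic program, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def select(error, candidates, freq):
    # Minimal edit distance first; break (near-)ties by word frequency.
    return min(candidates, key=lambda w: (edit_distance(error, w), -freq.get(w, 0)))

freq = {"there": 2_000_000, "three": 800_000}  # made-up counts
print(select("thre", ["three", "there"], freq))  # there (tied at distance 1)
```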
Another effective way is to use the noisy channel model, a kind of Bayesian inference. Given an observation x of a misspelled word, we find its correction by choosing the word w that maximizes P(x|w) P(w). Here P(w) is the prior probability of the word w, and P(x|w) is the channel model: the probability of observing x when w was intended.
A language model conditions the probability of a word on the previous or surrounding words, and we can use it to estimate the probability P(w) of each candidate.
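Here is a sketch of noisy channel scoring, reusing the edit_distance helper from the previous sketch; the per-edit probability p_edit, the counts, and the add-one smoothing are all assumptions of this toy model:

```python
from math import log

def noisy_channel(x, candidates, counts, total, p_edit=0.01):
    # Score each candidate w by log P(x|w) + log P(w):
    #   P(w)   -- unigram language model from corpus counts (add-one smoothed)
    #   P(x|w) -- toy channel model: one factor of p_edit per edit operation
    def score(w):
        channel = edit_distance(x, w) * log(p_edit)
        prior = log((counts.get(w, 0) + 1) / (total + len(counts)))
        return channel + prior
    return max(candidates, key=score)

counts = {"there": 2_000_000, "three": 800_000, "threw": 300_000}  # made-up
total = sum(counts.values())
print(noisy_channel("thre", ["there", "three", "threw"], counts, total))  # there
```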