It’s essential for a search engine to rank-order the documents matching a query. To do so, the search engine computes a score for each matching document with respect to the query at hand.
Weighted Zone Scoring
Rather than being just a sequence of terms, most documents have additional structure: metadata. Metadata is data about a document, such as its author(s), title, and date of publication, and would generally include fields such as the date of creation and the format of the document.
The possible values of a field should be thought of as finite. Zones are similar to fields, except that the contents of a zone can be arbitrary free text. Document titles and abstracts are generally treated as zones. Different zones may contribute different weights to the final score.
Weighted zone scoring is sometimes also referred to as ranked Boolean retrieval. The zone weights can either be specified by an expert or learned from a training set that has been judged editorially.
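As a sketch, weighted zone scoring with a Boolean match function per zone might look like the following; the zone names and weights are made up for illustration:

```python
# A minimal sketch of weighted zone scoring: score = sum over zones of
# g_i * s_i, where s_i is 1 if the query matches in zone i, else 0.
# The Boolean match function here (all query terms present) is one choice
# among several.

def weighted_zone_score(query_terms, document, zone_weights):
    score = 0.0
    for zone, weight in zone_weights.items():
        text = document.get(zone, "").lower()
        if all(term.lower() in text for term in query_terms):
            score += weight
    return score

doc = {"title": "Introduction to Information Retrieval",
       "abstract": "Covers scoring, term weighting and the vector space model."}
weights = {"title": 0.5, "abstract": 0.3, "body": 0.2}  # weights sum to 1
print(weighted_zone_score(["information", "retrieval"], doc, weights))  # 0.5
```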
Term frequency and weighting
The idea behind term frequency is that a document or zone mentioning a query term more often has more to do with that query, and should therefore receive a higher score.
TF-IDF weighting
TF, or term frequency, is the number of occurrences of a term in a document. But not all words in a document are equally important, which is where IDF, or inverse document frequency, comes in: it is the logarithm of the total number of documents divided by the number of documents containing the term, \(idf_t = \log \frac{N}{df_t}\). The TF-IDF score is the product of these two values. Check this post for more detail.
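A minimal sketch over a toy corpus; note that the log base and the exact idf formula vary between systems, this one uses \(\log(N/df)\):

```python
import math

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)             # raw term frequency
    df = sum(1 for d in docs if term in d)  # document frequency
    idf = math.log(N / df) if df else 0.0   # inverse document frequency
    return tf * idf

print(tf_idf("cat", docs[0]))  # rare term: df=1, high idf
print(tf_idf("the", docs[0]))  # common term: df=2, low idf despite tf=2
```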
Vector space model
Denote by \(V(d)\) the vector derived from document \(d\), with one component in the vector for each dictionary term. The components can be assumed to be computed using TF-IDF or another weighting scheme. To quantify the similarity between two documents in the vector space, the standard way is to compute the cosine similarity: $$sim(d_1, d_2) = \frac{V(d_1) \cdot V(d_2)}{|V(d_1)||V(d_2)|}$$
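In code, with two (already weighted) document vectors over a shared dictionary; the vectors below are illustrative:

```python
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    # Divide the dot product by both Euclidean lengths so that only the
    # direction of the vectors matters, not their magnitude.
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

v_d1 = np.array([0.0, 1.1, 0.8, 0.0])
v_d2 = np.array([0.4, 0.9, 0.0, 0.0])
print(cosine_similarity(v_d1, v_d2))
```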
Query as vector
Besides representing documents as vectors, we can also view a query as a vector. In this way, we are able to measure the similarity between a query and a document, and rank the matching documents by that similarity.
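A small sketch of ranking documents by cosine similarity to a query vector; again, the vectors are made up for illustration:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query_vec = np.array([0.0, 1.0, 1.0, 0.0])  # query mentions terms 2 and 3
doc_vecs = {
    "d1": np.array([0.0, 1.1, 0.8, 0.0]),
    "d2": np.array([0.4, 0.9, 0.0, 0.0]),
    "d3": np.array([0.9, 0.0, 0.0, 0.7]),
}
ranking = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]),
                 reverse=True)
print(ranking)  # ['d1', 'd2', 'd3'] -- d1 shares both query terms
```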
Variant TF-IDF functions
There are some alternatives to the standard tf-idf scheme discussed above.
Sublinear tf scaling
It seems unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence. A common modification is to use the logarithm of the term frequency instead:
$$wf = \begin{cases} 1 + \log tf & \text{if } tf > 0 \\ 0 & \text{otherwise} \end{cases}$$
The tf-idf weight is then replaced by wf-idf.
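In code, the dampening effect is easy to see:

```python
import math

def wf(tf: int) -> float:
    # Sublinear tf scaling: dampen raw counts with a logarithm.
    return 1.0 + math.log(tf) if tf > 0 else 0.0

print(wf(1), wf(20))  # 1.0 vs ~4.0: 20 occurrences != 20x the weight
```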
Maximum tf normalization
One well-studied technique is to normalize the tf weights of all terms occurring in a document by the maximum tf in that document: $$ntf = a + (1-a) \frac{tf}{tf_{max}}$$ Here \(tf_{max}\) is the largest term frequency in the document, and \(a\) is a smoothing term, a value between 0 and 1, generally set to 0.4.
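A direct translation of the formula:

```python
def ntf(tf: int, tf_max: int, a: float = 0.4) -> float:
    # Maximum tf normalization with smoothing term a (commonly 0.4).
    return a + (1.0 - a) * tf / tf_max

# A term occurring 3 times in a document whose most frequent term occurs 10 times:
print(ntf(3, 10))  # 0.4 + 0.6 * 0.3 = 0.58
```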
Pivoted normalized document length
Longer documents tend to have higher tf values, simply because they contain more (distinct) terms. Longer documents can broadly be lumped into two categories: (1) verbose documents that essentially repeat the same content, and (2) documents covering multiple different topics. Pivoted document length normalization compensates for this phenomenon; it is a form of document length normalization that is independent of term and document frequencies. When we compute the dot product score between a unit query vector and such a normalized document, the score is skewed to account for the effect of document length on relevance.
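A minimal sketch of one common formulation, which divides the raw score by a normalizer that is linear in document length; the slope value here is an assumption (it is a tunable parameter in practice):

```python
def pivoted_norm(doc_len: int, avg_len: float, slope: float = 0.25) -> float:
    # Documents at the "pivot" (the average length) get a normalizer of 1;
    # longer documents are penalized less than plain length normalization
    # would penalize them, shorter documents slightly more.
    return (1.0 - slope) + slope * (doc_len / avg_len)

raw_score = 2.4  # e.g. a dot product of tf-idf weights
print(raw_score / pivoted_norm(doc_len=1200, avg_len=800))
```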
Real world examples
Elasticsearch
In Elasticsearch, the relevance of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document. How that score is calculated depends on the type of query clause. Different query clauses are used for different purposes: a fuzzy query might determine the _score by calculating how similar the spelling of the found word is to the original search term; a terms query would incorporate the percentage of terms that were found.
For measuring the relevance between a full-text field and a full-text query string, the standard similarity algorithm used in Elasticsearch is TF-IDF.
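A minimal sketch using the official elasticsearch-py client; the host, the index name "articles", and the field "body" are assumptions for illustration:

```python
from elasticsearch import Elasticsearch

# Assumes a locally running cluster and a hypothetical "articles" index.
es = Elasticsearch("http://localhost:9200")
resp = es.search(index="articles",
                 query={"match": {"body": "vector space model"}})
for hit in resp["hits"]["hits"]:
    # _score is the positive float discussed above; higher = more relevant.
    print(hit["_score"], hit["_source"].get("title"))
```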
Jina
Jina searches more than just text: for every object, text or otherwise, it creates a vector representation, and relevance is usually computed by cosine similarity (sometimes Euclidean distance is used). The representation of a piece of text is created by an encoder, which encodes a chunk of text into a two-dimensional array (np.ndarray). Encoders in Jina are usually neural-network based; popular ones include the FARM text encoder and the Flair text encoder.