What is it?
TF-IDF is short for Term Frequency–Inverse Document Frequency.
Term frequency:
$$tf_{t,d} = \frac{ n_{t,d} }{\sum_{k}{n_{k,d}}}$$
$n_{t,d}$ is the number of times that term $t$ occurs in document $d$, and the denominator is the total number of terms in $d$. In other words, it is the count of one specific word divided by the total word count of the document.
Inverse document frequency: $$idf_{t} = \log \frac{\text{number of documents}}{\text{number of documents where the term } t \text{ appears}}$$ The base of the logarithm only changes the scale, not the ranking; the worked example below uses base 10.
And TF-IDF is the product of tf and idf: $$tfidf_{t,d} = tf_{t,d} \cdot idf_{t}$$
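To make the formulas concrete, here is a minimal from-scratch sketch in Python. The `tokenize` helper and the function names are just illustrative, the log is base 10 to match the worked example below, and the `idf` function assumes the term occurs in at least one document:

import math

def tokenize(text):
    # crude tokenizer: lowercase and strip trailing punctuation
    return [w.strip('.?') for w in text.lower().split()]

def tf(term, doc_tokens):
    # term frequency: occurrences of the term / total number of tokens in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # inverse document frequency with a base-10 log
    # (assumes the term appears in at least one document)
    n_docs = len(corpus_tokens)
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log10(n_docs / n_containing)

def tfidf(term, doc_tokens, corpus_tokens):
    # tf-idf is simply the product of the two
    return tf(term, doc_tokens) * idf(term, corpus_tokens)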
Why is it useful?
Briefly, the weight indicates how important a word is to a document in the corpus (the collection of all documents). The higher the weight, the more important the word.
The formulas above are easy to compute and easy to understand, which makes TF-IDF an efficient way to explore a new dataset and do some basic analysis of it.
A simple example
Just take the example from Sklearn's documentation:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
We can make a table of the tf values of each word in each document of the corpus (one row per document):
doc | this | is | the | first | document | second | and | third | one |
---|---|---|---|---|---|---|---|---|---|
1 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 0 | 0 | 0 | 0 |
2 | 1/6=0.167 | 1/6=0.167 | 1/6=0.167 | 0 | 2/6=0.333 | 1/6=0.167 | 0 | 0 | 0 |
3 | 1/6=0.167 | 1/6=0.167 | 1/6=0.167 | 0 | 0 | 0 | 1/6=0.167 | 1/6=0.167 | 1/6=0.167 |
4 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 0 | 0 | 0 | 0 |
Then compute the idf values:
this | is | the | first | document | second | and | third | one |
---|---|---|---|---|---|---|---|---|
log(4/4)=0 | 0 | 0 | log(4/2)=0.301 | log(4/3)=0.1249 | log(4/1)=0.602 | 0.602 | 0.602 | 0.602 |
Multiply them to get the tf-idf weights:
doc | this | is | the | first | document | second | and | third | one |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0.0602 | 0.02498 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0.04159 | 0.1005 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1005 | 0.1005 | 0.1005 |
4 | 0 | 0 | 0 | 0.0602 | 0.02498 | 0 | 0 | 0 | 0 |
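As a sanity check, the from-scratch sketch above reproduces these numbers (tiny differences in the last digit come from the table's rounded intermediate values):

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
docs = [tokenize(doc) for doc in corpus]

# 'first' in document 1: 0.2 * log10(4/2)
print(round(tfidf('first', docs[0], docs), 4))   # 0.0602
# 'second' in document 2: (1/6) * log10(4/1)
print(round(tfidf('second', docs[1], docs), 4))  # 0.1003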
Use Sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
tfidf = TfidfVectorizer()  # default settings: smoothed idf, L2 normalization
tfidf_weight = tfidf.fit_transform(corpus)
Get the feature names (the learned vocabulary):
list(tfidf.get_feature_names_out())
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
Get the weights in matrix format:
tfidf_weight.todense()
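For a more readable view, the dense matrix can be wrapped in a labeled table; this is a small sketch that assumes pandas is available:

import pandas as pd

# one row per document, one column per vocabulary term
df = pd.DataFrame(tfidf_weight.toarray(),
                  columns=tfidf.get_feature_names_out())
print(df.round(4))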
In the example we'll get:
doc | and | document | first | is | one | second | the | third | this |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0.4697 | 0.5802 | 0.3840 | 0 | 0 | 0.3840 | 0 | 0.3840 |
2 | 0 | 0.6876 | 0 | 0.2810 | 0 | 0.5386 | 0.2810 | 0 | 0.2810 |
3 | 0.5118 | 0 | 0 | 0.2671 | 0.5118 | 0 | 0.2671 | 0.5118 | 0.2671 |
4 | 0 | 0.4697 | 0.5802 | 0.3840 | 0 | 0 | 0.3840 | 0 | 0.3840 |
Note that this weight matrix is a bit different from the one we calculated with the standard formula. That is because sklearn adjusts the basic scheme by default: it uses raw term counts instead of normalized frequencies, takes the natural logarithm, adds 1 to the idf so that terms appearing in every document are not completely ignored, smooths the idf by adding one to the document counts, and finally L2-normalizes each row. These adjustments add more stability to the algorithm.
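To see exactly where these numbers come from, we can reproduce sklearn's default output with numpy, following the formula documented for TfidfTransformer: idf(t) = ln((1 + n) / (1 + df(t))) + 1 applied to raw counts, then L2 row normalization. A sketch:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# raw term counts, same alphabetical vocabulary order as TfidfVectorizer
counts = CountVectorizer().fit_transform(corpus).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # document frequency of each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf with natural log
weights = counts * idf                           # tf-idf before normalization

# L2-normalize each row, as sklearn does by default
weights /= np.linalg.norm(weights, axis=1, keepdims=True)
print(weights.round(4))  # matches tfidf_weight.todense()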