TF-IDF Explained

What is it?

TF-IDF is short for Term Frequency–Inverse Document Frequency.

Term frequency: $$tf_{t,d} = \frac{n_{t,d}}{\sum_{k} n_{k,d}}$$ Here n_{t,d} is the number of times that term t occurs in document d.
In other words, it is the number of occurrences of one specific word divided by the total number of word occurrences in the document.

Inverse document frequency: $$idf_{t} = \log \frac{\text{number of documents}}{\text{number of documents where the term } t \text{ appears}}$$

And TF-IDF is the product of tf and idf: $$tfidf_{t,d} = tf_{t,d} \cdot idf_{t}$$
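The three formulas translate almost directly into code. The following is a minimal sketch rather than a reference implementation: the function names and the toy corpus are hypothetical, documents are assumed to be already split into tokens, and the logarithm is taken in base 10 (the formula itself does not fix the base).

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Occurrences of `term` divided by the total number of tokens in the document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, all_docs):
    """log10(number of documents / number of documents containing `term`).

    Assumes `term` appears in at least one document; otherwise this divides by zero.
    """
    containing = sum(1 for tokens in all_docs if term in tokens)
    return math.log10(len(all_docs) / containing)


def tfidf(term, doc_tokens, all_docs):
    """Product of the term frequency and the inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, all_docs)


# Hypothetical toy corpus, already tokenized.
docs = [["red", "apple", "red"], ["green", "apple"]]

red_weight = tfidf("red", docs[0], docs)      # (2/3) * log10(2/1)
apple_weight = tfidf("apple", docs[0], docs)  # (1/3) * log10(2/2) = 0
print(red_weight, apple_weight)
```

"apple" appears in every document, so its idf is log(1) = 0 and its weight vanishes, while "red" is frequent in one document and absent from the other, so it keeps a positive weight.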

Why is it useful?

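Briefly, the weight indicates how important a word is to a document within the corpus (all the documents). The higher the weight, the more important the word: a term scores high when it occurs often in one document but appears in few documents overall.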

As the formulas above show, TF-IDF is easy to calculate and easy to understand. It is an efficient way to explore a new dataset and do some basic analysis of it.

A simple example

Let's take the example from the Sklearn documentation:

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

We can make a table to store the tf values of the words in the corpus:

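| doc | this | is | the | first | document | second | and | third | one |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 0 | 0 | 0 | 0 |
| 2 | 1/6=0.167 | 1/6=0.167 | 1/6=0.167 | 0 | 2/6=0.333 | 1/6=0.167 | 0 | 0 | 0 |
| 3 | 1/6=0.167 | 1/6=0.167 | 1/6=0.167 | 0 | 0 | 0 | 1/6=0.167 | 1/6=0.167 | 1/6=0.167 |
| 4 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 0 | 0 | 0 | 0 |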

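Then compute the idf values (using base-10 logarithms):

| this | is | the | first | document | second | and | third | one |
|---|---|---|---|---|---|---|---|---|
| log(4/4)=0 | 0 | 0 | log(4/2)=0.301 | log(4/3)=0.1249 | log(4/1)=0.602 | 0.602 | 0.602 | 0.602 |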

Multiply them to get the tf-idf weight:

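| doc | this | is | the | first | document | second | and | third | one |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0.0602 | 0.02498 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0.04159 | 0.1005 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1005 | 0.1005 | 0.1005 |
| 4 | 0 | 0 | 0 | 0.0602 | 0.02498 | 0 | 0 | 0 | 0 |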
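The hand calculation can be checked with a few lines of plain Python. This is a sketch under stated assumptions: the tokenization (lowercasing and splitting on word characters) and the variable names are illustrative choices, and the logarithm is base 10 to match the idf values in the worked example.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Assumed tokenization: lowercase, then keep runs of word characters.
docs = [re.findall(r"\w+", text.lower()) for text in corpus]

# Column order used in the tables of the worked example.
vocab = ['this', 'is', 'the', 'first', 'document', 'second', 'and', 'third', 'one']

# Document frequency: in how many documents each term appears.
df = {t: sum(t in doc for doc in docs) for t in vocab}
idf = {t: math.log10(len(docs) / df[t]) for t in vocab}

# One row per document: tf (count / document length) times idf.
weights = []
for doc in docs:
    counts = Counter(doc)
    weights.append([counts[t] / len(doc) * idf[t] for t in vocab])

for row in weights:
    print([round(w, 4) for w in row])
```

For the first document, "first" comes out at about 0.0602 (0.2 × 0.301). Because this sketch does not round intermediate values, a few entries differ in the last digits from a calculation that uses rounded tf and idf values, for example about 0.0416 and 0.1003 instead of 0.04159 and 0.1005.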

Use Sklearn

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

tfidf = TfidfVectorizer()  # default settings: smooth_idf=True, norm='l2'
tfidf_weight = tfidf.fit_transform(corpus)

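Get the feature names (the vocabulary):

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tfidf.get_feature_names_out()
# array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this'], dtype=object)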

Get the weight in matrix format:

tfidf_weight.todense()

In the example we’ll get:

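| doc | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0.4697 | 0.5802 | 0.3840 | 0 | 0 | 0.3840 | 0 | 0.3840 |
| 2 | 0 | 0.6876 | 0 | 0.2810 | 0 | 0.5386 | 0.2810 | 0 | 0.2810 |
| 3 | 0.5118 | 0 | 0 | 0.2671 | 0.5118 | 0 | 0.2671 | 0.5118 | 0.2671 |
| 4 | 0 | 0.4697 | 0.5802 | 0.3840 | 0 | 0 | 0.3840 | 0 | 0.3840 |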

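Note that this weight matrix differs from the one we calculated with the standard formula, because sklearn's defaults make some adjustments: tf is the raw term count, idf uses the natural logarithm with smoothing, idf_t = ln((1 + n) / (1 + df_t)) + 1, and each row is normalized to unit L2 length. The smoothing avoids division by zero, and the added 1 keeps terms that appear in every document from being zeroed out, which makes the weights more stable.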
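To see where the sklearn numbers come from, the adjustments can be reproduced by hand. The sketch below assumes sklearn's documented defaults (raw counts for tf, smoothed natural-log idf plus 1, L2 normalization, and tokens of two or more word characters); it deliberately avoids importing sklearn, and the variable names are illustrative.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Lowercase and keep words of two or more characters,
# mirroring the default token pattern of TfidfVectorizer.
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
df = {t: sum(t in doc for doc in docs) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

rows = []
for doc in docs:
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]   # raw count times idf
    norm = math.sqrt(sum(w * w for w in raw))   # L2 norm of the row
    rows.append([w / norm for w in raw])

print(vocab)
for row in rows:
    print([round(w, 4) for w in row])
```

The first row should agree with the sklearn matrix above: roughly 0.4698 for "document", 0.5803 for "first", and 0.3841 for "is", "the", and "this".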

Reference

Wikipedia
Sklearn TfidfVectorizer

Chuanrong Li
