What is it?
TF-IDF is short for Term Frequency–Inverse Document Frequency.
Term frequency:
$$tf_{t,d} = \frac{ n_{t,d} }{\sum_{k}{n_{k,d}}}$$
$n_{t,d}$ is the number of times that term $t$ occurs in document $d$, and the denominator is the total number of terms in $d$. In other words, it is the count of one specific word divided by the total word count of the document.
Inverse document frequency: $$idf_{t} = \log \frac{\text{number of documents}}{\text{number of documents where the term } t \text{ appears}}$$ The base of the logarithm only changes the scale, not the ranking; the worked example below uses base 10.
And TF-IDF is the product of tf and idf: $$tfidf_{t,d} = tf_{t,d} \cdot idf_{t}$$
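To make the formulas concrete, here is a minimal from-scratch sketch in Python. The `tokenize` helper and the function names are just illustrative, the log is base 10 to match the worked example below, and the `idf` function assumes the term occurs in at least one document:

import math

def tokenize(text):
    # crude tokenizer: lowercase and strip trailing punctuation
    return [w.strip('.?') for w in text.lower().split()]

def tf(term, doc_tokens):
    # term frequency: occurrences of the term / total number of tokens in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # inverse document frequency with a base-10 log
    # (assumes the term appears in at least one document)
    n_docs = len(corpus_tokens)
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log10(n_docs / n_containing)

def tfidf(term, doc_tokens, corpus_tokens):
    # tf-idf is simply the product of the two
    return tf(term, doc_tokens) * idf(term, corpus_tokens)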
Why is it useful?
Briefly, the weight indicates how important a word is to a document in the corpus (the collection of all documents). The higher the weight, the more important the word.
The formulas above are easy to compute and easy to understand, which makes TF-IDF an efficient way to explore a new dataset and do some basic analysis of it.
A simple example
Just take the example from Sklearn's documentation:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
We can make a table of the tf values of each word in each document of the corpus (one row per document):
doc | this | is | the | first | document | second | and | third | one |
---|---|---|---|---|---|---|---|---|---|
1 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 0 | 0 | 0 | 0 |
2 | 1/6=0.167 | 1/6=0.167 | 1/6=0.167 | 0 | 2/6=0.333 | 1/6=0.167 | 0 | 0 | 0 |
3 | 1/6=0.167 | 1/6=0.167 | 1/6=0.167 | 0 | 0 | 0 | 1/6=0.167 | 1/6=0.167 | 1/6=0.167 |
4 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 1/5=0.2 | 0 | 0 | 0 | 0 |
Then compute the idf values:
this | is | the | first | document | second | and | third | one |
---|---|---|---|---|---|---|---|---|
log(4/4)=0 | 0 | 0 | log(4/2)=0.301 | log(4/3)=0.1249 | log(4/1)=0.602 | 0.602 | 0.602 | 0.602 |
Multiply them to get the tf-idf weights:
doc | this | is | the | first | document | second | and | third | one |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0.0602 | 0.02498 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0.04159 | 0.1005 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1005 | 0.1005 | 0.1005 |
4 | 0 | 0 | 0 | 0.0602 | 0.02498 | 0 | 0 | 0 | 0 |
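As a sanity check, the from-scratch sketch above reproduces these numbers (tiny differences in the last digit come from the table's rounded intermediate values):

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
docs = [tokenize(doc) for doc in corpus]

# 'first' in document 1: 0.2 * log10(4/2)
print(round(tfidf('first', docs[0], docs), 4))   # 0.0602
# 'second' in document 2: (1/6) * log10(4/1)
print(round(tfidf('second', docs[1], docs), 4))  # 0.1003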
Use Sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
tfidf = TfidfVectorizer()  # default settings: smoothed idf, L2 normalization
tfidf_weight = tfidf.fit_transform(corpus)
Get the feature names (the learned vocabulary):
list(tfidf.get_feature_names_out())
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
Get the weights in matrix format:
tfidf_weight.todense()
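For a more readable view, the dense matrix can be wrapped in a labeled table; this is a small sketch that assumes pandas is available:

import pandas as pd

# one row per document, one column per vocabulary term
df = pd.DataFrame(tfidf_weight.toarray(),
                  columns=tfidf.get_feature_names_out())
print(df.round(4))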
In the example we'll get:
doc | and | document | first | is | one | second | the | third | this |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0.4697 | 0.5802 | 0.3840 | 0 | 0 | 0.3840 | 0 | 0.3840 |
2 | 0 | 0.6876 | 0 | 0.2810 | 0 | 0.5386 | 0.2810 | 0 | 0.2810 |
3 | 0.5118 | 0 | 0 | 0.2671 | 0.5118 | 0 | 0.2671 | 0.5118 | 0.2671 |
4 | 0 | 0.4697 | 0.5802 | 0.3840 | 0 | 0 | 0.3840 | 0 | 0.3840 |
Note that this weight matrix is a bit different from the one we calculated with the standard formula. That is because sklearn adjusts the basic scheme by default: it uses raw term counts instead of normalized frequencies, takes the natural logarithm, adds 1 to the idf so that terms appearing in every document are not completely ignored, smooths the idf by adding one to the document counts, and finally L2-normalizes each row. These adjustments add more stability to the algorithm.
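To see exactly where these numbers come from, we can reproduce sklearn's default output with numpy, following the formula documented for TfidfTransformer: idf(t) = ln((1 + n) / (1 + df(t))) + 1 applied to raw counts, then L2 row normalization. A sketch:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# raw term counts, same alphabetical vocabulary order as TfidfVectorizer
counts = CountVectorizer().fit_transform(corpus).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # document frequency of each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf with natural log
weights = counts * idf                           # tf-idf before normalization

# L2-normalize each row, as sklearn does by default
weights /= np.linalg.norm(weights, axis=1, keepdims=True)
print(weights.round(4))  # matches tfidf_weight.todense()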