- Normalization
- Tokenization
- Stop words
- Part-of-Speech Tagging
- Named Entity Recognition
- Stemming and Lemmatization
Step 1, Normalize the text
Once we have the text, the very first step is to normalize it into a unified form, for example by converting everything to lower case. Punctuation and special characters can also be removed with regular expressions in this step.
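Here is a minimal sketch of this step using Python’s built-in `re` module; the sample sentence is only for illustration, and the character class you keep will depend on your task:

```python
import re

text = "Hello, World! NLP is fun... isn't it?"

# Lower-case the text for a unified representation.
normalized = text.lower()

# Remove punctuation and special characters, keeping only letters,
# digits, and whitespace. Note this also strips apostrophes, which
# may not always be desirable.
normalized = re.sub(r"[^a-z0-9\s]", "", normalized)

print(normalized)  # hello world nlp is fun isnt it
```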
Step 2, Tokenization
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
Python’s `split()` function can be a simple solution; `word_tokenize()` and `sent_tokenize()` from NLTK accomplish this task in a more robust way.
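A quick sketch with NLTK (the `punkt` tokenizer models must be downloaded once; the sample text is just for illustration):

```python
from nltk.tokenize import sent_tokenize, word_tokenize
# One-time setup: nltk.download("punkt")
# (newer NLTK versions may also need "punkt_tab")

text = "NLTK is a leading platform. It works with human language data."

print(sent_tokenize(text))
# ['NLTK is a leading platform.', 'It works with human language data.']

print(word_tokenize(text))
# ['NLTK', 'is', 'a', 'leading', 'platform', '.', 'It', 'works', ...]
```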
Step 3, Remove stop words
Stop words usually refer to the most common words in a language, although in some circumstances the list can be defined according to the task. These words can be filtered out to reduce the size of the text while avoiding side effects such as changing its meaning. A Python list comprehension (or the built-in `filter()`) can be a simple solution, as sketched below.
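Here is a sketch using NLTK’s English stop-word list; the sentence is illustrative:

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# One-time setup: nltk.download("stopwords"), nltk.download("punkt")

stop_words = set(stopwords.words("english"))

tokens = word_tokenize("this is a simple example of stop word removal")
filtered = [t for t in tokens if t not in stop_words]

print(filtered)  # ['simple', 'example', 'stop', 'word', 'removal']
```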
Step 4, Part-of-Speech Tagging
PoS tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech. After this process the text can be better understood, and the relationships between words and phrases become clearer.
You’ll find NLTK’s `pos_tag()` useful in this step.
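A minimal sketch (the tagger model must be downloaded once; the exact tags may vary slightly with the tagger version):

```python
from nltk import pos_tag, word_tokenize
# One-time setup: nltk.download("averaged_perceptron_tagger"),
# nltk.download("punkt")

tagged = pos_tag(word_tokenize("The quick brown fox jumps over the lazy dog"))
print(tagged)
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
#       ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
#       ('dog', 'NN')]
```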
Step 5, Named Entity Recognition
Named Entity Recognition locates and classifies named entities in text, such as people, organizations, and locations. Once recognized, named entities can serve as an index or help avoid misinterpretation in later text analysis.
NLTK’s `ne_chunk()` would be a great tool.
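`ne_chunk()` operates on PoS-tagged tokens and returns a tree whose subtrees mark the recognized entities. A sketch, with an illustrative sample sentence:

```python
from nltk import ne_chunk, pos_tag, word_tokenize
# One-time setup: nltk.download("maxent_ne_chunker"), nltk.download("words"),
# plus the tokenizer and tagger models from the earlier steps.

tree = ne_chunk(pos_tag(word_tokenize("Mark works at Google in New York")))
print(tree)
# e.g. (S (PERSON Mark/NNP) works/VBZ at/IN (ORGANIZATION Google/NNP)
#       in/IN (GPE New/NNP York/NNP))
```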
Step 6, Stemming and Lemmatization
In English, the form of a word varies with the situation in which it is used.
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.
Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form.
In most cases, applying lemmatization to the text first and then stemming is a general solution. PorterStemmer and LancasterStemmer are the most common stemmers; Porter has 100+ cascading “rewrite” rules. A sketch comparing the two stemmers with NLTK’s WordNet lemmatizer follows.
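The words chosen here and the outputs shown in the comments are indicative; the WordNet corpus must be downloaded once:

```python
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer
# One-time setup: nltk.download("wordnet")

porter = PorterStemmer()
lancaster = LancasterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "wolves"]:
    print(word,
          porter.stem(word),           # e.g. studies -> studi
          lancaster.stem(word),        # e.g. studies -> study
          lemmatizer.lemmatize(word))  # e.g. wolves -> wolf
```

Note that stemming can produce stems that are not real words (`studi`), while lemmatization always maps to a dictionary form.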