General NLP Task Overview

  • Normalization
  • Tokenization
  • Stop words
  • Part-of-Speech Tagging
  • Named Entity Recognition
  • Stemming and Lemmatization

Step 1, Normalize the text

Once we have the text, the first step is to normalize it into a uniform form, for example by converting everything to lower case. Punctuation and special characters can also be removed with regular expressions in this step.
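As a rough illustration of this step, here is a minimal sketch that uses only Python's standard re module. The function name normalize and the choice to keep only ASCII letters, digits, and whitespace are assumptions made for this example; a real pipeline may need to preserve accented characters, hyphens, or apostrophes.

```python
import re

def normalize(text):
    """Lower-case the text and strip punctuation and special characters."""
    text = text.lower()
    # Replace every character that is not a-z, 0-9, or whitespace with a space,
    # so that neighbouring words do not fuse together.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Hello, World! It's 2024..."))  # hello world it s 2024
```

Note that the apostrophe in "It's" is treated as punctuation here, which splits the contraction. Whether that is acceptable depends on the task.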

Step 2, Tokenization

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. Python’s split() method is a simple solution. word_tokenize() and sent_tokenize() from NLTK accomplish this task more robustly.
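To see why plain whitespace splitting is only a simple solution, the sketch below compares str.split() with a small regular-expression tokenizer. The helper regex_tokenize and its pattern are assumptions made for illustration; this is not how NLTK's word_tokenize() works internally, and NLTK's tokenizers handle many more cases.

```python
import re

text = "Don't panic, it works!"

# Whitespace splitting leaves punctuation glued to the neighbouring word.
print(text.split())  # ["Don't", 'panic,', 'it', 'works!']

def regex_tokenize(s):
    """Return words (optionally with an internal apostrophe) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", s)

# The regex version separates punctuation into its own tokens.
print(regex_tokenize(text))  # ["Don't", 'panic', ',', 'it', 'works', '!']
```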

Step 3, Remove stop words

Stop words usually refer to the most common words in a language, although in some circumstances the list is defined according to the task. Filtering these words out reduces the size of the text, but care is needed to avoid side effects such as changing the meaning. A Python list filter is a simple solution.
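A list filter of this kind might look like the sketch below. The STOP_WORDS set is a tiny hand-picked example rather than a standard list, and remove_stop_words is a hypothetical helper name. The set deliberately leaves out "not", since dropping negations is one way filtering can change the meaning of a sentence.

```python
# Illustrative stop-word set; real lists are much longer and task-dependent.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lower-cased form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["The", "cat", "is", "on", "the", "mat"]))  # ['cat', 'mat']
print(remove_stop_words(["The", "film", "is", "not", "good"]))      # ['film', 'not', 'good']
```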

Step 4, Part-of-Speech Tagging

PoS tagging is the process of marking up each word in a text (corpus) as corresponding to a particular part of speech. After this step, the text can be better understood because the relationships between words and phrases become clearer. You’ll find NLTK’s pos_tag() useful in this step.
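NLTK's pos_tag() relies on a trained model that has to be downloaded separately, so the sketch below only shows the shape of the result: a list of (word, tag) pairs. It uses a deliberately naive dictionary lookup with a default tag, not the statistical tagger that NLTK provides. LEXICON and lookup_tag are hypothetical names, and the tags follow the Penn Treebank naming (DT determiner, NN noun, VBZ third-person verb, RB adverb).

```python
# Hypothetical hand-made lexicon; a real tagger learns this from annotated text.
LEXICON = {
    "the": "DT",
    "dog": "NN",
    "barks": "VBZ",
    "loudly": "RB",
}

def lookup_tag(tokens, lexicon=LEXICON, default="NN"):
    """Tag each token from the lexicon, falling back to a default tag."""
    return [(tok, lexicon.get(tok.lower(), default)) for tok in tokens]

print(lookup_tag(["The", "dog", "barks", "loudly"]))
# [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB')]
```

A lookup like this cannot resolve words whose tag depends on context (for example "barks" as a plural noun), which is the problem that trained taggers address.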

Step 5, Named Entity Recognition

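Named entity recognition locates mentions of named entities, such as people, organizations, and places, in the text. Once recognized, named entities can serve as an index or help avoid misinterpretation in later text analysis. NLTK’s ne_chunk() is a useful tool for this.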

Step 6, Stemming and Lemmatization

In English, the form of a word varies with its grammatical context.

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.

Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form.

A common general approach is to apply lemmatization to the text first and then stemming. PorterStemmer and LancasterStemmer are the most common stemmers. The Porter stemmer has 100+ cascading “rewrite” rules.
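To make the difference concrete, here is a toy sketch of both ideas in plain Python. It is not the Porter algorithm and not NLTK's lemmatizer: LEMMAS is a tiny hand-made lookup table, and SUFFIX_RULES is a short ordered list of suffix rewrites in the spirit of the first Porter rules (the minimum stem length of three characters is an assumption chosen for this example).

```python
# Toy lemma table; a real lemmatizer consults a full dictionary.
LEMMAS = {"went": "go", "better": "good", "mice": "mouse", "studies": "study"}

# Ordered (suffix, replacement) rewrite rules; the first matching rule wins.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"),
                ("s", ""), ("ing", ""), ("ed", "")]

def toy_lemmatize(word):
    """Return the dictionary form if the word is in the table, else the word itself."""
    return LEMMAS.get(word, word)

def toy_stem(word):
    """Apply the first suffix rule that matches and leaves at least 3 characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["studies", "went", "walking", "caresses"]:
    print(w, "->", toy_lemmatize(w), "/", toy_stem(w))
# studies -> study / studi
# went -> go / went
# walking -> walking / walk
# caresses -> caresses / caress
```

The stemmer produces "studi", which is not a dictionary word, while the lemmatizer returns "study" but only for words it knows. Lemmatizing first and stemming second, as in toy_stem(toy_lemmatize("studies")), gives "study".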

Chuanrong Li
