Tokenization is the process of splitting a string of text into a list of tokens. It is usually the very first step in NLP tasks.
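For example, a tokenizer might turn the string "Hello, world!" into the tokens ["Hello", ",", "world", "!"].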
How does a tokenizer work?
Use space and punctuation to split text
It’s the most straightforward way to separate words, because English already puts spaces between them. The problem is that punctuation attached to or inside a word won’t be split off, such as the contraction won’t.
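A minimal sketch of plain whitespace splitting (the sample sentence is just an illustrative example):

sentence = "Bob won't pay $5 for this, I think."
print(sentence.split())
# ['Bob', "won't", 'pay', '$5', 'for', 'this,', 'I', 'think.']
# punctuation stays attached ("this,", "think.") and "won't" is not split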
Use regular expressions
A more general way is to use regular expressions, so we can design the pattern ourselves and handle exceptions more easily.
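A minimal regex sketch (the pattern below is just one possible choice, not the one used later in this post):

import re

sentence = "Bob won't pay $5 for this, I think."
# runs of word characters (optionally with an internal apostrophe),
# or single punctuation characters
print(re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence))
# ['Bob', "won't", 'pay', '$', '5', 'for', 'this', ',', 'I', 'think', '.']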
Handle special cases
These include abbreviations, hyphenated words, phone numbers, dates, and so on.
A basic regex-based word tokenizer
import re

abbr_list = ['U.S', 'T']

def preprocess(s):
    # handle some special cases
    s = s.strip()
    s = re.sub(r"n't", " n't", s)    # e.g. won't -> wo n't
    s = re.sub(r"'s", " 's", s)      # e.g. Bob's -> Bob 's
    s = re.sub(r"'re", " 're", s)    # e.g. we're -> we 're
    s = re.sub(r"cannot", "can not", s)
    s = re.sub(r'\s"(\S+)\s', r' `` \1 ', s)   # e.g. he said, "xxx -> he said, `` xxx
    s = re.sub(r'^"(\S+)\s', r'`` \1 ', s)     # e.g. "Xxx -> `` Xxx
    s = re.sub(r'\s(\S+)"', r" \1 '' ", s)     # e.g. xxx" -> xxx ''
    return s

def check_abbr(s):
    # word ends in punctuation other than "."
    result = re.findall(r'''^([\w+\.]+)([,?!:;'])$''', s)
    if result:
        return list(result[0])
    # word ends in ".", check whether it is an abbreviation
    result = re.findall(r'^(\w+).?[.]$', s)
    if result:
        if result[0] in abbr_list:  # abbr_list is a list of predefined abbreviations
            return [s]
        else:
            return [result[0], '.']
    return [s]

def tokenizer(s):
    s = preprocess(s)
    words = s.split()
    token_list = []
    for word in words:
        token_list += check_abbr(word)
    return [w for w in token_list if w]
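As a quick sanity check, here is the tokenizer above run on a made-up sentence; the expected output is shown as a comment:

print(tokenizer("Bob's dog won't bite you, I think."))
# ['Bob', "'s", 'dog', 'wo', "n't", 'bite', 'you', ',', 'I', 'think', '.']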
Usage of tokenizers in NLTK
nltk.tokenize.TreebankWordTokenizer
Uses regular expressions to tokenize text as in the Penn Treebank.

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(text)
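For example, on a made-up sentence it should produce something like the following (the exact tokens may vary slightly between NLTK versions):

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("Bob won't pay, I think."))
# contractions are split following Penn Treebank conventions:
# ['Bob', 'wo', "n't", 'pay', ',', 'I', 'think', '.']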
nltk.tokenize.WordPunctTokenizer
Tokenizes a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.

from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokenizer.tokenize(text)
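Because the pattern treats the apostrophe as ordinary punctuation, contractions are split differently than with the Treebank tokenizer; on the same made-up sentence the result should look roughly like this:

from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
print(tokenizer.tokenize("Bob won't pay, I think."))
# ['Bob', 'won', "'", 't', 'pay', ',', 'I', 'think', '.']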
nltk.tokenize.word_tokenize
Currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language.

from nltk.tokenize import word_tokenize

word_tokenize(text)
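word_tokenize relies on the Punkt sentence-tokenizer models, so they have to be downloaded once; on a made-up sentence the output should look roughly like this:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # one-time download of the Punkt models
print(word_tokenize("Bob won't pay $3.88, I think."))
# ['Bob', 'wo', "n't", 'pay', '$', '3.88', ',', 'I', 'think', '.']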
spaCy’s tokenizer
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp = English()
# Create a blank Tokenizer with just the English vocab
tokenizer = Tokenizer(nlp.vocab)
[w.text for w in tokenizer(text)]
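Note that a blank Tokenizer built from just the vocab only splits on whitespace. To get spaCy’s full English rules (punctuation splitting and tokenizer exceptions for contractions), you can rely on the pipeline’s default tokenizer instead; the output below is what I’d expect, shown as an illustration:

from spacy.lang.en import English

nlp = English()
# the pipeline's default tokenizer includes English punctuation rules and exceptions
doc = nlp("Bob won't pay, I think.")
print([t.text for t in doc])
# ['Bob', 'wo', "n't", 'pay', ',', 'I', 'think', '.']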
Corresponding code is here