Word Tokenizer

Tokenization is the process of splitting a string of text into a list of tokens. It’s usually the very first step in NLP tasks.

How does a tokenizer work?

  • Use space and punctuation to split text
    It’s the most straightforward way to separate words, because English already puts spaces between them. The problem is that words with punctuation inside, like won’t, are not split any further.

  • Use regular expressions
    A more general way is to use regular expressions, so we can design the pattern and handle exceptions easily.

  • Handle special cases
    Including abbreviations, hyphenated words, phone numbers, dates…
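
To make the three approaches above concrete, here is a minimal sketch that uses only Python's re module. The sample sentence, the variable names, and the PATTERN expression are assumptions chosen for illustration; a real tokenizer would need many more special cases.

```python
import re

text = "Don't split well-known dates like 01/15/2020, please."

# Approach 1: whitespace only. Punctuation stays attached to words.
whitespace_tokens = text.split()
# ["Don't", 'split', 'well-known', 'dates', 'like', '01/15/2020,', 'please.']

# Approach 2: runs of word characters, or single punctuation marks.
# Contractions, hyphenated words, and dates all get broken apart.
naive_tokens = re.findall(r"\w+|[^\w\s]", text)
# ['Don', "'", 't', 'split', 'well', '-', 'known', 'dates', 'like',
#  '01', '/', '15', '/', '2020', ',', 'please', '.']

# Approach 3: list the special cases first. Alternatives are tried
# left to right, so the more specific patterns must come before \w+.
PATTERN = re.compile(r"""
    \d{1,2}/\d{1,2}/\d{4}   # dates such as 01/15/2020
  | \w+(?:[-']\w+)+         # hyphenated words and contractions
  | \w+                     # plain words
  | [^\w\s]                 # any single punctuation mark
""", re.VERBOSE)
special_tokens = PATTERN.findall(text)
# ["Don't", 'split', 'well-known', 'dates', 'like', '01/15/2020', ',', 'please', '.']
```

Note that this sketch keeps Don't as one token, whereas Treebank-style tokenizers split it into Do and n't; which behavior is preferable depends on the downstream task.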

A basic Regex-based Word Tokenizer

import re

abbr_list = ['U.S', 'T'] # predefined abbreviations, stored without the trailing "."

def preprocess(s):
    # handle some special cases
    s = s.strip()
    s = re.sub(r"n't", " n't", s) # e.g. won't -> wo n't
    s = re.sub(r"'s", " 's", s) # e.g. Bob's -> Bob 's
    s = re.sub(r"'re", " 're", s) # e.g. we're -> we 're
    s = re.sub(r"cannot", "can not", s)
    s = re.sub(r'\s"(\S+)\s', r' `` \1 ' , s) # e.g. he said, "xxx -> he said, `` xxx
    s = re.sub(r'^"(\S+)\s', r'`` \1 ' , s) # e.g. "Xxx -> `` Xxx
    s = re.sub(r'\s(\S+)"', r" \1 '' ", s) # e.g. xxx" -> xxx ''
    return s

def check_abbr(s):
    # word ends in punctuation other than ".": split the punctuation off
    result = re.findall(r'''^([\w+\.]+)([,?!:;'])$''', s)
    if result:
        return list(result[0])
    # word ends in ".": check if it's an abbreviation
    # (internal periods are allowed so that "U.S." can match "U.S")
    result = re.findall(r'^(\w+(?:\.\w+)*)\.$', s)
    if result:
        if result[0] in abbr_list: # abbr_list is a list containing predefined abbreviations
            return [s]
        # not an abbreviation: split off the sentence-final period
        return [result[0], '.']
    return [s]

def tokenizer(s):
    s = preprocess(s)
    words = s.split()
    token_list = []
    for word in words:
        token_list += check_abbr(word)
    return [w for w in token_list if w]

Usage of tokenizers in NLTK

  • nltk.tokenize.TreebankWordTokenizer
    Uses regular expressions to tokenize text as in Penn Treebank.

    from nltk.tokenize import TreebankWordTokenizer
    tokenizer = TreebankWordTokenizer()
    tokens = tokenizer.tokenize("They won't stop.")

  • nltk.tokenize.WordPunctTokenizer
    Tokenizes a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+

    from nltk.tokenize import WordPunctTokenizer
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize("They won't stop.")

  • nltk.tokenize.word_tokenize
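    Currently uses an improved TreebankWordTokenizer together with PunktSentenceTokenizer for the specified language.

    from nltk.tokenize import word_tokenize
    # needs the Punkt sentence models, which NLTK downloads separately
    tokens = word_tokenize("They won't stop.")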
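
The regexp quoted for WordPunctTokenizer is simple enough to try without NLTK installed. The sketch below applies that same pattern with Python's re module; it is a stand-in for illustration, not NLTK's own code, and the helper name word_punct_tokenize and the sample sentence are hypothetical.

```python
import re

# The pattern quoted above: a run of word characters, or a run of
# characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Hypothetical helper: split text into word runs and punctuation runs."""
    return WORD_PUNCT.findall(text)

tokens = word_punct_tokenize("Good muffins cost $3.88. Don't wait?!")
print(tokens)
# ['Good', 'muffins', 'cost', '$', '3', '.', '88', '.', 'Don', "'", 't', 'wait', '?!']
```

The output shows the trade-off of this pattern: it never needs an abbreviation list, but it breaks prices and contractions into pieces and keeps adjacent punctuation such as ?! together as one token.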

spaCy’s tokenizer

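from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
# Create a blank Tokenizer with just the English vocab.
# Without prefix, suffix, or infix rules it splits on whitespace only;
# use nlp.tokenizer instead to get the default English rules.
tokenizer = Tokenizer(nlp.vocab)

text = "Let's tokenize this sentence."
[w.text for w in tokenizer(text)]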

