Word Tokenizer

Tokenization is the process of splitting a string of text into a list of tokens. It's usually the very first step in NLP tasks. How does a tokenizer work? Use spaces and punctuation to split the text: this is the most straightforward way to separate words, since English already puts spaces between words. The problem is that words with internal punctuation, like won't, are not split properly. Use regular expressions »
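A minimal sketch of both approaches in Python's built-in `re` module; the regex pattern below is an illustrative choice, not necessarily the one from the full post:

```python
import re

text = "I won't buy it, he said."

# Naive approach: split on whitespace only.
# Punctuation stays attached to words ("it," "said.").
print(text.split())
# ['I', "won't", 'buy', 'it,', 'he', 'said.']

# Regex approach: treat runs of word characters (plus an internal
# apostrophe group) as one token, so "won't" survives intact,
# while standalone punctuation becomes its own token.
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
print(tokens)
# ['I', "won't", 'buy', 'it', ',', 'he', 'said', '.']
```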

Chinese Text Clustering Hands-On Practice

Task Specification In contrast to classification, clustering is generally considered an unsupervised approach and is usually applied to unlabeled data. After all, most of the data in the world is unlabeled. Text clustering is a way to explore and group text data for further analysis, and it can be applied to many tasks such as document classification, organization, browsing, etc. General Steps Overview: text cleaning, text representation (feature engineering), clustering algorithms, analysis of »
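A compressed sketch of those steps on a toy English corpus (Chinese text would additionally need word segmentation, e.g. with jieba, before vectorizing); TF-IDF for the representation step and k-means for the clustering step are assumed choices here, not necessarily the ones in the full post:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the stock market fell sharply today",
    "investors worry about the stock price",
    "the team won the football match",
    "a great football game last night",
]

# Representation step: each document becomes a TF-IDF vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Clustering step: k=2 is assumed for this toy corpus.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Analysis step: inspect which documents landed together.
for doc, label in zip(docs, kmeans.labels_):
    print(label, doc)
```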

TF-IDF Explained

What is it? TF-IDF is short for Term Frequency–Inverse Document Frequency. Term frequency: $$tf_{t,d} = \frac{n_{t,d}}{\sum_{k}{n_{k,d}}}$$ where $n_{t,d}$ is the number of times that term $t$ occurs in document $d$. In other words, it's the count of one specific word divided by the total word count of the document. Inverse document frequency: $$idf_{t} = \log \frac{\text{number of documents}}{\text{number of documents where term } t \text{ appears}}$$ »
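A quick worked example under these definitions (the numbers are made up, and the base of the logarithm is a convention; base 10 is used here): suppose the term cat occurs 3 times in a 100-word document and appears in 10 out of 1,000 documents:

$$tf_{cat,d} = \frac{3}{100} = 0.03, \qquad idf_{cat} = \log_{10}\frac{1000}{10} = 2, \qquad tfidf_{cat,d} = 0.03 \times 2 = 0.06$$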

Chinese Text Classification Hands-On Practice

Task Specification Text classification is one of the fundamental tasks in Natural Language Processing. It aims to assign text documents to predefined categories based on their content. It has many real-world applications, from spam email detection to intent recognition in chatbots. In this practice I'll use various approaches to tackle this task. This is not a comprehensive study but a general illustration »
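As a toy sketch of the task itself (not the various approaches the full post compares), here is a TF-IDF plus Multinomial Naive Bayes baseline in scikit-learn, with made-up spam/ham data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled data; real spam detection needs far more examples.
texts = [
    "win a free prize now",
    "limited offer, claim your reward",
    "meeting rescheduled to monday",
    "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# Pipeline: represent documents as TF-IDF vectors,
# then fit a Naive Bayes classifier on top.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["claim your free reward"]))  # likely 'spam'
```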

Transformer: Successor to the RNN

The Transformer was introduced in Attention Is All You Need in 2017. It's a neural network architecture based on a self-attention mechanism and has demonstrated excellent performance in language understanding. Background: typical recurrent models require a large amount of sequential computation, critical information in long sequences is difficult to represent in RNNs, and attention mechanisms have proved effective. A quick recap of attention: given a sequence-to-sequence model, we can calculate the attention score in this way. General definition of attention: given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query. »
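A minimal NumPy sketch of that general definition, using dot-product scores and a softmax; the paper's scaled dot-product attention additionally divides the scores by the square root of the key dimension, which is omitted here:

```python
import numpy as np

def attention(query, keys, values):
    # Score each key against the query with a dot product.
    scores = keys @ query
    # Softmax over the scores (shifted for numerical stability).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum of the values, dependent on the query.
    return weights @ values

rng = np.random.default_rng(0)
keys = rng.normal(size=(3, 4))    # 3 key vectors of dimension 4
values = rng.normal(size=(3, 4))  # 3 value vectors
query = rng.normal(size=4)

print(attention(query, keys, values))
```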