Chinese Text Classification: Hands-On Practice

Task Specification

Text classification is one of the fundamental tasks in natural language processing. It aims to assign text documents to predefined categories based on their content. It has many real-world uses, from spam email detection to intent recognition in chatbots. In this practice I’ll try several ways to tackle the task. This is not a comprehensive study but a general illustration of the basic process.

Dataset Overview

There are 10 categories in this shopping review dataset. The dataset has three columns: category, sentiment label (positive or negative) and review text. The sample rows below show the row id, the category and the review text.

id	cat	review
13479	平板	太厚，分辨率太差。看什么都花，像打了马赛克。
55068	酒店	虽离机场和市中心远了一点，但服务生笑脸相迎，房间内相关旅游者的资料准备周到…
32106	洗发水	没送小瓶，问了没回复，差评

Basic Process

First things first: clean the data. Numbers and English letters are not very useful in this task, so I removed them, along with common symbols, using regular expressions.

import re

def clean_review(text):
    # Strip whitespace and common symbols
    remove_symbol = re.sub(r"[\s\/\\_$^*(+\"\'+~\-@#&^*\[\]{}【】]+", "", str(text))
    # Strip integers, decimals and percentages
    remove_num = re.sub(r"\d*\.\d*%?|\d+%?", "", remove_symbol)
    # Strip English letters
    remove_char = re.sub(r"[a-zA-Z]+", "", remove_num)
    return remove_char
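To see what the three substitutions actually do, here is a small self-contained check. The function is repeated so the snippet runs on its own, and the sample strings are made up for illustration, not taken from the dataset.

```python
import re

def clean_review(text):
    # Same three steps as above: symbols, then numbers, then English letters
    remove_symbol = re.sub(r"[\s\/\\_$^*(+\"\'+~\-@#&^*\[\]{}【】]+", "", str(text))
    remove_num = re.sub(r"\d*\.\d*%?|\d+%?", "", remove_symbol)
    remove_char = re.sub(r"[a-zA-Z]+", "", remove_num)
    return remove_char

# Hypothetical review text, chosen to hit every branch of the cleaner
print(clean_review("屏幕5.5寸 iPhone 好评100%"))  # 屏幕寸好评

# The character class lists "(" but not ")", so a closing parenthesis survives
print(clean_review("(好)"))  # 好)
```

Two things worth noticing: Chinese punctuation such as "，" and "。" is left in place, and the number pattern also matches a lone ASCII ".", so English full stops disappear as a side effect.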

Text that is too long or too short tends not to get a good representation. To reduce the impact of this factor, I keep only reviews whose length is strictly between 3 and 123 characters (90% of the texts are shorter than 131 characters).

data = data[data.review.apply(lambda x: 3 < len(str(x)) < 123)]
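The post does not show how the length bounds were picked. One plausible way, sketched here with plain Python, is to look at a percentile of the review lengths and then filter. The helper names `length_percentile` and `filter_by_length` and the sample reviews are my own, purely for illustration.

```python
import math

def length_percentile(texts, q):
    """Nearest-rank q-th percentile (0 < q <= 100) of the text lengths."""
    lengths = sorted(len(t) for t in texts)
    rank = math.ceil(q * len(lengths) / 100)
    return lengths[rank - 1]

def filter_by_length(texts, low, high):
    """Keep texts whose length is strictly between low and high."""
    return [t for t in texts if low < len(t) < high]

# Hypothetical reviews with lengths 1, 2, 4, 8 and 12
reviews = ["好", "差评", "物流很快", "房间干净服务也好", "屏幕分辨率太差看什么都花"]

print(length_percentile(reviews, 90))    # 12
print(filter_by_length(reviews, 3, 10))  # ['物流很快', '房间干净服务也好']
```

With a pandas DataFrame, `data.review.str.len().quantile(0.9)` should give a comparable number, though pandas interpolates between ranks by default.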

Chinese needs one extra step. A word can be a single character or a combination of characters, and different combinations usually carry different meanings. Since written Chinese has no delimiter between words, the first step in processing Chinese text is word segmentation. The quality of the segmentation strongly affects downstream performance. A quick way to do the segmentation is to use a tool like Jieba. After this step, each sentence is segmented into a list of words.

import jieba
data['words'] = data.review.apply(jieba.lcut)
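To get a feel for what a segmenter has to do, here is a toy forward maximum matching sketch. This is not how Jieba works internally; it is only the simplest dictionary-based approach I could write in a few lines, and the tiny vocabulary is invented for the example.

```python
def forward_max_match(sentence, vocab, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until something matches
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # A single character is always accepted as a fallback
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

# Hypothetical mini dictionary
toy_vocab = {"太厚", "分辨率", "太差"}
print(forward_max_match("太厚分辨率太差", toy_vocab))  # ['太厚', '分辨率', '太差']
```

The greedy choice is exactly where such a simple method breaks down on ambiguous sentences, which is one reason to rely on a mature tool instead.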

Split the data into training and test sets. As in other machine learning tasks, we need a held-out test set to evaluate the model.

from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2, random_state=42)

Features

Models are fed with what we call features: numerical representations of the raw text data.

Bag of Words

This is the most intuitive way. Think of it as having a dictionary: the only thing we care about is which dictionary words appear in the text, not their order. In practice we count how many times each word appears in the given text.

We can use the following code to create the dictionary:

from collections import Counter

# First create a dictionary of words from training data
word_counter = Counter([x for y in train.words.tolist() for x in y])

# Assign unique id to every word, start from the most frequent one
id_word_dict = {idx+1: item[0] for idx, item in enumerate(word_counter.most_common())}
id_word_dict[0] = 'UNK'
word_id_dict = {word: idx for idx, word in id_word_dict.items()}

Here I used a primitive way to represent the text:

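import numpy as np

# Assumed step: map each word to its id (unknown words -> 0, 'UNK')
train = train.assign(word_id=train.words.apply(lambda ws: [word_id_dict.get(w, 0) for w in ws]))
X_train_bow = np.zeros([len(train), len(id_word_dict)])
for idx, words in enumerate(train.word_id.tolist()):
    for w in words:
        X_train_bow[idx][w] += 1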

The bag-of-words representation is a matrix: each column stands for a word in our dictionary and each row is a review. The values in the matrix are the number of occurrences of that word in the review.
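The whole pipeline is easier to follow on a toy example. The sketch below rebuilds the dictionary from three made-up tokenized reviews and turns a new review into a count vector. The helpers `encode` and `bow_vector` are names I introduce here, and plain lists stand in for the NumPy matrix.

```python
from collections import Counter

# Hypothetical tokenized training reviews
train_words = [["屏幕", "很", "好"], ["服务", "很", "好"], ["屏幕", "太", "差"]]

# Most frequent words get the smallest ids; id 0 is reserved for unknown words
word_counter = Counter(w for words in train_words for w in words)
id_word_dict = {idx + 1: word for idx, (word, _) in enumerate(word_counter.most_common())}
id_word_dict[0] = "UNK"
word_id_dict = {word: idx for idx, word in id_word_dict.items()}

def encode(words):
    """Map words to ids, sending out-of-dictionary words to 0 ('UNK')."""
    return [word_id_dict.get(w, 0) for w in words]

def bow_vector(words):
    """Count how often each dictionary id occurs in one review."""
    vec = [0] * len(id_word_dict)
    for word_id in encode(words):
        vec[word_id] += 1
    return vec

print(encode(["屏幕", "很", "棒"]))            # [1, 2, 0]
print(bow_vector(["屏幕", "很", "很", "棒"]))  # [1, 1, 2, 0, 0, 0, 0]
```

Because the dictionary comes from the training data only, any test word that was never seen ("棒" here) lands in the UNK column.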

After feeding the features to a Naive Bayes model, I got an accuracy of 0.6351.
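The classifier code is not shown in this post. In practice a library implementation such as scikit-learn's `MultinomialNB` is the obvious choice. To show what the model computes on word counts, here is a from-scratch sketch with Laplace smoothing. The function names and the four toy reviews are mine, and it is not meant to reproduce the accuracy above.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        # Laplace smoothing: add alpha to every word count of the class
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict(model, words):
    """Return the class with the highest log posterior for one review."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Words never seen in training are simply skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in words if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical tokenized reviews with their categories
docs = [["屏幕", "分辨率", "差"], ["屏幕", "很", "好"],
        ["房间", "干净"], ["服务", "好", "房间", "大"]]
labels = ["平板", "平板", "酒店", "酒店"]

model = train_multinomial_nb(docs, labels)
print(predict(model, ["屏幕", "分辨率"]))  # 平板
print(predict(model, ["房间", "干净"]))    # 酒店
```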

TF-IDF

TF-IDF (short for term frequency–inverse document frequency) reflects how important a word is to a document in the given corpus. Instead of the raw number of occurrences, the value is a relative importance weight for the word.
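A small worked calculation makes the weighting concrete. This sketch uses the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is what scikit-learn's `TfidfVectorizer` does with its default settings. The function name `tfidf_vectors` and the three toy documents are assumptions for illustration, and every document is assumed to be non-empty.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (idf, vectors) for a list of non-empty tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(w for words in docs for w in set(words))
    idf = {w: math.log((1 + n_docs) / (1 + d)) + 1 for w, d in df.items()}

    vectors = []
    for words in docs:
        tf = Counter(words)
        weights = {w: count * idf[w] for w, count in tf.items()}
        # Scale each document vector to unit length
        norm = math.sqrt(sum(v * v for v in weights.values()))
        vectors.append({w: v / norm for w, v in weights.items()})
    return idf, vectors

# Hypothetical tokenized reviews
docs = [["房间", "很", "干净"], ["房间", "很", "小"], ["屏幕", "很", "好"]]
idf, vectors = tfidf_vectors(docs)

# "很" occurs in every document, so it gets the lowest possible idf
print(idf["很"])   # 1.0
print(vectors[0])  # "干净" outweighs "房间", which outweighs "很"
```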

Sklearn’s API is a quick way to transform the text. I joined the segmented words with spaces because Chinese words are not naturally separated the way English words are.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [' '.join(item) for item in train.words.tolist()]

tfidf = TfidfVectorizer()
tfidf.fit(corpus)

X_train_tfidf = tfidf.transform(corpus)
X_test_tfidf = tfidf.transform([' '.join(item) for item in test.words.tolist()])
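One detail matters for Chinese here: the default `token_pattern` of `TfidfVectorizer` is `(?u)\b\w\w+\b`, which only keeps tokens of two or more characters. Single-character words such as "很" or "差" are therefore dropped silently. The effect can be checked with the `re` module alone. The second pattern below is one possible alternative, not something the code above uses.

```python
import re

# scikit-learn's default token pattern: words of at least two characters
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# A possible alternative that also keeps single-character words
KEEP_SINGLE_CHARS = r"(?u)\b\w+\b"

# Hypothetical review, already segmented and joined with spaces
segmented = "房间 很 干净"

print(re.findall(DEFAULT_TOKEN_PATTERN, segmented))  # ['房间', '干净']
print(re.findall(KEEP_SINGLE_CHARS, segmented))      # ['房间', '很', '干净']
```

Passing `token_pattern=r"(?u)\b\w+\b"` to the vectorizer would keep those words. Whether that helps accuracy on this dataset is something I have not measured.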

I reached an accuracy of 0.7954 with a Naive Bayes model.

Word Embedding

Word embeddings have a good reputation for representing semantic relationships. I wrote a detailed post about word representation earlier. This time I used bert-as-service to map sentences into fixed-length representations.

from bert_serving.client import BertClient
bc = BertClient()

X_train_bert = bc.encode(train.review.tolist())
X_test_bert = bc.encode(test.review.tolist())

After the embedding, the values in the matrix are real numbers, some of them negative. The multinomial Naive Bayes model expects non-negative feature values such as counts (or fractional counts like TF-IDF weights), so this time I cannot use it for a quick evaluation. Word embeddings nowadays are usually paired with neural networks anyway, so I built a shallow three-layer network with Keras to get a baseline.

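from keras.layers import Input, Dense
from keras.models import Model

input_size = X_train_bert.shape[1]  # assumed: width of the BERT sentence vectors
input_layer = Input(shape=(input_size,))
hidden_layer = Dense(100, activation="relu")(input_layer)
output_layer = Dense(10, activation="softmax")(hidden_layer)

model = Model(input_layer, output_layer)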

After 5 epochs of training, I got an accuracy of 0.8294.

Next Step

There are two broad ways to improve performance: better features and better models.

Intuitively, the more informative features we have, the better the same model can perform. Besides the features above, other popular ones include part-of-speech tags, topic models (usually LDA) and combinations of different features.

Other machine learning models such as support vector machines (SVM), logistic regression and ensemble methods often achieve higher accuracy. Among deep learning models, different network structures have their own advantages, so CNN, LSTM, Bi-LSTM, RNNs with attention and others are all worth a try.

Here is my code.