Chinese Topic Modeling: Hands-On Practice

Task Specification

Topic models are a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a body of text.

About LDA

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. It represents each document as a mixture of latent topics, which explains why some parts of the data are similar. It is widely used in tasks like topic modeling, text classification, and collaborative filtering.

Dataset and Preprocess

The setup is almost the same as in previous posts: I’ll use the product review dataset and apply basic text cleaning.

Build Dictionary

Gensim provides everything we need to do LDA topic modeling. All we need is a corpus.

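import gensim

# select_data.words holds one list of tokens per review (from the preprocessing step).
# The dictionary maps every distinct token to an integer id.
dictionary = gensim.corpora.Dictionary(select_data.words)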

Transform the Corpus

In this step, convert each document into a bag-of-words vector, that is, a list of (token id, count) pairs, using the dictionary we created before.

bow_corpus = [dictionary.doc2bow(doc) for doc in select_data.words]
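To make the two steps above concrete, here is a minimal pure-Python sketch of what a dictionary and a bag-of-words conversion compute. The helper names `build_token2id` and `doc_to_bow` and the toy documents are made up for illustration; Gensim's own `Dictionary` may assign ids in a different order, so treat this as a picture of the idea rather than a copy of its internals.

```python
from collections import Counter


def build_token2id(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the mapping."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Toy, already tokenized documents (hypothetical data).
docs = [["屏幕", "清晰", "屏幕"], ["电池", "耐用", "屏幕"]]
token2id = build_token2id(docs)
bow = [doc_to_bow(doc, token2id) for doc in docs]

print(token2id)
print(bow)
```

Each document ends up as a sparse vector: word order is discarded and only the per-token counts remain, which is all LDA needs.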

Run LDA Model

Set the number of topics you want to discover.

lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary)

result1

Ways to Improve

Word Frequency
Stop words and other very common words have much higher frequencies than the rest of the vocabulary, and they can have a huge impact on the result. Setting a lower cap on word frequency, so that overly frequent words are dropped, helps to improve the result. (Another way is to filter those words out beforehand.)
Number of Topics
The number of topics you set before running the model is also crucial. To find a good value, train the model over a range of values and measure the quality of each result with similarity or other metrics, then keep the value that scores best.
Corpus Quality
The quality of the corpus itself matters too. If the documents have nothing in common, trying to find topics is futile. And if the text contains too much unnecessary information, the topics are also harder to find.

After trying different parameters and using POS tagging to simplify the text, I got a better result: result2
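As a rough idea of what simplifying the text with POS tags can look like, the sketch below keeps only words whose tag marks them as a noun, verb, or adjective. The helper `keep_content_words`, the tag prefixes, and the tagged sample are all hypothetical; the input is assumed to be (word, tag) pairs of the kind a Chinese tagger such as `jieba.posseg` produces, and the tags actually kept for this post may differ.

```python
def keep_content_words(tagged_tokens, keep_prefixes=("n", "v", "a")):
    """Keep words whose POS tag starts with one of keep_prefixes.

    With the defaults this keeps nouns, verbs, and adjectives, including
    subtypes such as "nr" or "vn", and drops particles, pronouns, and adverbs.
    """
    return [word for word, tag in tagged_tokens if tag.startswith(keep_prefixes)]


# Hypothetical tagger output as (word, tag) pairs.
tagged = [
    ("这个", "r"),   # pronoun
    ("手机", "n"),   # noun
    ("的", "uj"),    # particle
    ("屏幕", "n"),   # noun
    ("很", "d"),     # adverb
    ("清晰", "a"),   # adjective
]
content_words = keep_content_words(tagged)
print(content_words)
```

Filtering this way removes most function words before the dictionary is built, so the topics are formed from words that carry meaning.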

And here is my code.