Chinese Named Entity Recognition Hands On Practice

Task Specification In this post I’m going to implement a bidirectional LSTM model with a CRF layer on top to tackle the NER task. Unlike English, where every word is naturally separated by a space, entities in Chinese text are not so straightforward: the boundaries between characters and words are unclear, and word segmentation is itself a critical task in Chinese NLP. To avoid the extra segmentation, »
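Because the model works on characters rather than segmented words, the training data for such a tagger is typically encoded in a character-level BIO scheme. A minimal sketch of that encoding step (the sentence, spans, and tag names here are made-up examples, not the post's actual data):

```python
# Convert a Chinese sentence with annotated entity spans into
# character-level BIO tags, a common input format for BiLSTM-CRF taggers.
def to_char_bio(sentence, spans):
    """spans: list of (start, end, entity_type), end index exclusive."""
    tags = ["O"] * len(sentence)
    for start, end, etype in spans:
        tags[start] = "B-" + etype           # first character of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype           # remaining characters
    return tags

# "Xiao Ming lives in Beijing": chars 0-1 are a person, chars 3-4 a location.
tags = to_char_bio("小明住北京", [(0, 2, "PER"), (3, 5, "LOC")])
# tags == ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
```

Tagging characters directly like this is exactly what lets the model sidestep a separate word-segmentation pass.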

Brief Introduction to Named Entity Recognition

What is NER? A named entity is a real-world object (e.g. a person, location, or organization), and Named Entity Recognition is the task of identifying and locating named entities in text. It’s a key part of information retrieval, machine translation, Q&A systems, etc. Some Use Cases Classify Content Named Entity Recognition can automatically scan entire articles and reveal which are the major people, organizations, and places discussed in them. Knowing the relevant tags for »

Introduction to Automatic Text Summarization

Why we need text summarization Information is everywhere, and the volume of data keeps growing. With so much irrelevant information piling up, finding the useful text we actually need becomes harder, so summarization is greatly needed. The objective of automatic text summarization is to condense the original text into a concise version that preserves its key content and overall meaning. There are 2 main approaches: 1. Extractive methods. 2. Abstractive methods. Extractive Methods The model extracts keywords and key phrases from the text, which could be salient parts of the source document. »
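The extractive idea can be illustrated with a toy frequency-based sentence scorer; this is only a sketch of the general approach, not the method from any particular system:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Score each sentence by the frequency of its words; keep the top ones."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        toks = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences aren't unfairly favored.
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    chosen = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Emit the selected sentences in their original document order.
    return " ".join(s for s in sentences if s in chosen)

text = "Dogs are great. Dogs dogs dogs. Cats exist."
summary = extractive_summary(text)  # picks the most "dog-heavy" sentence
```

Real extractive systems use far stronger features (position, TF-IDF, graph centrality), but the pipeline — score sentences, select the salient ones verbatim — is the same.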

General NLP Task Overview

Normalization Tokenization Stop words Part-of-Speech Tagging Named Entity Recognition Stemming and Lemmatization Step 1, Normalize the text Once we have the text, the very first step is to normalize it into a unified form, such as converting everything to lower-case letters. Punctuation and special characters can also be removed with regular expressions in this step. Step 2, Tokenization Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. »
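Steps 1 and 2 can be sketched with the standard library’s `re` module; this is a minimal English-only illustration (real pipelines usually rely on proper tokenizers such as NLTK’s or spaCy’s):

```python
import re

def normalize(text):
    """Step 1: lower-case, then replace punctuation/special chars with spaces."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

def tokenize(text):
    """Step 2: chop the normalized text into whitespace-separated tokens."""
    return normalize(text).split()

tokens = tokenize("Hello, World! It's 2024.")
# tokens == ["hello", "world", "it", "s", "2024"]
```

Note the trade-off already visible here: stripping the apostrophe splits "It's" into two tokens, which is why tokenization is usually smarter than a plain regex.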

Generation methods in LMs

Given a language model simply represented by P(Y|X), there are several methods we can use to generate a sentence. Sampling & Argmax Sampling: generate a random sentence according to the probability distribution Argmax: generate the sentence with the highest probability Greedy Search Greedy search selects the most likely word at each step in the output sequence, that is, it picks the single highest-probability word one by one. $$y_i = \underset{y_i}{\operatorname{argmax}}\, P(y_i \mid X, y_1, y_2, \dots, y_{i-1})$$ The benefit is its speed: it can generate sentences very fast. But the drawback is also obvious: it tends to choose easy, common words first because they are more frequent. »
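The difference between the greedy/argmax step and the sampling step can be shown with a toy next-word distribution; the probabilities below are invented purely for illustration:

```python
import random

# Hypothetical conditional distribution P(y_i | X, y_1, ..., y_{i-1})
# for a single decoding step.
next_word_probs = {"the": 0.5, "a": 0.3, "cat": 0.2}

def greedy_step(probs):
    """Greedy search: always take the single highest-probability word."""
    return max(probs, key=probs.get)

def sample_step(probs, rng):
    """Sampling: draw a word at random according to the distribution."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

print(greedy_step(next_word_probs))               # always "the"
print(sample_step(next_word_probs, random.Random(0)))  # varies with the seed
```

Greedy decoding here returns "the" every single time — exactly the "easy, common word" bias described above — while sampling occasionally surfaces the rarer continuations.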