Licor's Space

Chinese Named Entity Recognition Hands On Practice

Task Specification: In this post I’m going to implement a bidirectional LSTM model with a CRF layer on top to tackle the NER task. Unlike English, where words are naturally separated by spaces, entities in Chinese text are not so straightforward: the boundaries between characters and words are not very clear, and word segmentation is itself a critical task in Chinese NLP. To avoid the extra segmentation, »

Brief Introduction to Named Entity Recognition

What is NER? A named entity is a real-world object (e.g. a person, location, or organization), and Named Entity Recognition is the task of identifying and locating named entities in text. It’s a key part of information retrieval, machine translation, Q&A systems, etc. Some Use Cases: Classify Content. Named Entity Recognition can automatically scan entire articles and reveal which are the major people, organizations, and places discussed in them. Knowing the relevant tags for »

Introduction to Automatic Text Summarization

Why we need text summarization: Information is everywhere, and data keeps piling up. As unrelated information explodes, finding the useful text we need becomes harder, so summarization is greatly needed. The objective of automatic text summarization is to condense the original text into a concise version that preserves its key content and overall meaning. There are two main approaches: 1. Extractive methods. 2. Abstractive methods. Extractive Methods: The model extracts keywords and keyphrases from the text, which could be salient parts of the source document. »
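
To make the extractive idea concrete, here is a minimal sketch that picks keywords by plain word frequency. The frequency heuristic, the tiny `STOP_WORDS` set, and the `top_keywords` helper are all assumptions made for illustration; they are not the method the post itself describes.

```python
import re
from collections import Counter
from typing import List

# Illustrative stop-word list (an assumption, far from complete).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "it"}

def top_keywords(text: str, k: int = 3) -> List[str]:
    """Return the k most frequent non-stop words in the text."""
    # Lower-case the text and keep only runs of ASCII letters.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(k)]

print(top_keywords("The cat and the cat and the cat saw a dog and a dog and a bird", k=2))
```

Real extractive systems usually score whole sentences or phrases rather than single words, but the selection step has the same shape: score candidates from the source text, then keep the highest-scoring ones.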

General NLP Task Overview

Normalization, Tokenization, Stop words, Part-of-Speech Tagging, Named Entity Recognition, Stemming and Lemmatization. Step 1: Normalize the text. Once we have the text, the very first step is to normalize it into a unified form, for example by converting it to lower-case letters. Punctuation and special characters can also be removed with regular expressions in this step. Step 2: Tokenization. Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. »
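
The first two steps can be sketched in a few lines of Python. This is only a minimal illustration under some assumptions: the function names `normalize` and `tokenize` are made up here, the regular expression treats anything that is not a word character or whitespace as removable, and tokens are split on whitespace, which suits English but not unsegmented Chinese text.

```python
import re
from typing import List

def normalize(text: str) -> str:
    """Lower-case the text and replace punctuation/special characters with spaces."""
    lowered = text.lower()
    # Keep word characters and whitespace; everything else becomes a space.
    return re.sub(r"[^\w\s]", " ", lowered)

def tokenize(text: str) -> List[str]:
    """Split normalized text into tokens on runs of whitespace."""
    return text.split()

tokens = tokenize(normalize("Hello, World! NLP is fun."))
print(tokens)  # ['hello', 'world', 'nlp', 'is', 'fun']
```

Replacing punctuation with a space instead of deleting it keeps words such as "end.Start" from being glued together before the split.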

Brief introduction to Virtualization

You may have heard of Docker, VMware, or cloud computing, and one thing is essential to all of them: virtualization. Actually, this is not some state-of-the-art technology; it has existed for almost half a century. What is virtualization? Wikipedia: Virtualization refers to the act of creating a virtual (rather than actual) version of something, including virtual computer hardware platforms, storage devices, and computer network resources. Well, a computer system can be partitioned into several layers in this way: »
