Information Extraction Overview

Information extraction turns the unstructured information embedded in text into structured data. Applications are all around us. For example, when you apply for a job, instead of filling out every field manually you can upload your resume and the system fills in the information for you, though the result is not always accurate (this happens a lot). Maybe that's why the world still needs us :) The goals of information extraction include: »

Search Engine - Document Representation and Index

When we talk about search engines, we are really talking about one form of information retrieval system. Web pages are the documents, and a web search engine returns results based on our query. The idea seems simple, but how does a search engine get this job done in milliseconds? The number of web pages could be as high as 180 quadrillion! Before going deep into this, we first need to know how web pages are represented in a search engine, and we start with simple document retrieval. »
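As a rough illustration of simple document retrieval, here is a minimal sketch of an inverted index, a common way to represent documents for fast lookup. It is an assumption that this is the kind of index the post title refers to. The toy documents, the whitespace tokenizer, and the function names `build_inverted_index` and `search` are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical toy collection standing in for web pages.
docs = {
    1: "search engines index web pages",
    2: "web pages are documents",
    3: "a query retrieves documents",
}
index = build_inverted_index(docs)
print(sorted(search(index, "web pages")))
```

The point of the structure is that a query only touches the posting sets of its own terms, so lookup cost does not grow with the total number of documents in the way a full scan would.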

Chatbot Overview

Generally, there are two kinds of dialog systems: task-oriented dialogue agents (which help complete tasks) and chatbots (for conversation). Chatbot: some basic concepts. A dialogue is a sequence of turns, each a single contribution to the dialogue. Endpointing, or endpoint detection: spoken dialogue systems must detect whether a user is done speaking, so they can process the utterance and respond. Grounding: acknowledging that the hearer has understood the speaker. Initiative: sometimes a conversation is completely controlled by one participant. »

Natural Language Generation Overview

Natural Language Generation (NLG) is all about generating text. It is part of Natural Language Processing (NLP), but in a different branch from Natural Language Understanding (NLU). To be more specific, here are some tasks related to NLG: machine translation; (abstractive) summarization; dialogue (chit-chat and task-based); creative writing (storytelling, poetry generation); freeform question answering (i.e. the answer is generated, not extracted from text or a knowledge base); and image captioning. Key component: the language model. To generate text, we rely on a language model, which predicts the next word given the words so far. »
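To make "predicts the next word given the words so far" concrete, here is a minimal sketch of a count-based bigram language model. It is a deliberate simplification: it conditions only on the single previous word rather than on the whole history. The toy corpus and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Estimate P(next | prev) by relative frequency."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}


def generate(counts, start, max_len=5):
    """Greedy generation: repeatedly append the most frequent next word."""
    words = [start]
    while len(words) < max_len and counts.get(words[-1]):
        words.append(counts[words[-1]].most_common(1)[0][0])
    return words


# Hypothetical toy corpus; "." is treated as an ordinary token.
tokens = "i like tea . i like coffee . i drink tea .".split()
counts = train_bigram_model(tokens)
print(next_word_probs(counts, "i"))
print(generate(counts, "i", max_len=3))
```

Greedy decoding is used here only because it is the shortest thing to write. Sampling from the distribution returned by `next_word_probs` would be an equally valid way to pick the next word.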

Word Representation

Motivation of word representation: Unlike image data, which has a long tradition of being represented as vectors of pixels, natural language text had no unified representation for a long time. Each word was regarded as a discrete atomic symbol and assigned a unique ID. In recent years, a popular idea in machine learning has been to represent words as vectors. Brief history of word representation: dictionary lookup; one-hot encoding; word embeddings (distributional semantic models); distributed word representations (word2vec, GloVe); contextual word representations (CoVe, ELMo, BERT). Dictionary lookup: The most straightforward way to represent a word is to create a dictionary and assign every word a unique ID. »
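The dictionary lookup idea, and the one-hot encoding that follows it in the history above, can be sketched in a few lines. This is a minimal illustration, not a reference implementation. The example sentence and the helper names `build_vocab` and `one_hot` are hypothetical.

```python
def build_vocab(tokens):
    """Assign each distinct word a unique integer ID, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's ID."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


# Hypothetical example sentence.
tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
print(vocab)                  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(one_hot("sat", vocab))  # [0, 0, 1, 0, 0]
```

Both representations treat words as unrelated symbols: the IDs carry no notion of similarity, and any two distinct one-hot vectors are orthogonal.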