Information Extraction Overview

Information extraction turns the unstructured information embedded in text into structured data. There are many related applications around us now. Say you apply for a job: instead of filling out all the slots manually, you just upload your resume and the system automatically fills in the information for you, though it isn’t always accurate (this happens a lot). Maybe that’s why the world still needs us : )

The goals of information extraction include:

  • Organize information so that it is useful to people
  • Put information in a semantically precise form that allows further inferences to be made by computer algorithms

General approaches for information extraction tasks:

  1. Hand-written regular expressions
    • works well when extracting from automatically generated web pages
  2. Using classifiers
    • Generative: Naïve Bayes
    • Discriminative: Maxent models
  3. Sequence models
    • HMMs
    • CMMs/MEMMs
    • CRFs
    • RNNs and Transformers
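As a tiny sketch of the first approach, a couple of hand-written regular expressions can fill slots from free text. The email/phone slots and the sample resume line below are purely illustrative:

```python
import re

# Hand-written patterns for two illustrative slots: email and phone.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")

def extract_contacts(text):
    """Return every email address and phone number matched in text."""
    return {"emails": EMAIL.findall(text), "phones": PHONE.findall(text)}

resume = "Contact Jane Doe at jane.doe@example.com or (650) 555-0199."
print(extract_contacts(resume))
```

This is exactly the regime where regexes shine: text with a predictable layout, like automatically generated web pages.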

There are several common information extraction tasks, including named entity recognition, relation extraction, and event extraction.

Named Entity Recognition (NER)

A very important subtask of information extraction is finding and classifying names in text. NER finds each mention of a named entity in the text and labels its type. Named entity types are task-specific: common ones are people, places, and organizations, but for some tasks they can be gene or protein names or financial asset classes.

I have written a post about applying NER to Chinese text before.

Why NER?

  • Named entities can be indexed, linked off, etc
  • Sentiment can be attributed to companies or products
  • A lot of IE relations are associations between named entities
  • For question answering, answers are often named entities

The standard approach to named entity recognition treats it as a word-by-word sequence labeling task, with tags usually in IO or IOB form, indicating both the boundary and the type.
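Concretely, under IOB tagging each token gets a tag like B-PER, I-PER, or O, and entity spans can be read back off the tag sequence. A minimal decoder (the sentence and entity types are made up for illustration):

```python
def iob_to_spans(tokens, tags):
    """Collect (entity_text, type) spans from an IOB-tagged token sequence."""
    spans, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # B- opens a new entity
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # I- continues the open entity
            current.append(tok)
        else:                               # O (or stray I-) closes any open span
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        spans.append((" ".join(current), ctype))
    return spans

tokens = ["Tim", "Cook", "visited", "Paris", "."]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(iob_to_spans(tokens, tags))  # [('Tim Cook', 'PER'), ('Paris', 'LOC')]
```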

Algorithms for NER:

  • rule-based
    • hand-written rules are usually high precision but low recall
  • feature based (MEMM/CRF)
    • prefix, suffix, POS tags
  • neural based
    • bi-LSTM + CRF + Viterbi decoding
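A sketch of the per-token features a feature-based tagger (MEMM/CRF) might use; the exact feature set below is just an example:

```python
def word_features(tokens, i):
    """Feature dict for token i, of the kind fed to an MEMM/CRF tagger."""
    w = tokens[i]
    return {
        "word.lower": w.lower(),
        "prefix3": w[:3],                 # prefix feature
        "suffix3": w[-3:],                # suffix feature
        "is_title": w.istitle(),          # shape features
        "is_upper": w.isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

print(word_features(["Stanford", "University"], 0))
```

POS tags would be added the same way, as another key in the dict, once a tagger has run over the sentence.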

Coreference resolution

Coreference resolution is the task of determining whether two mentions corefer, that is, refer to the same entity in the discourse model.

Reference in a text to an entity that has been previously introduced into the discourse is called anaphora, and the referring expression used is said to be an anaphor.

Modern systems for coreference are based on supervised neural machine learning.

The mention-pair architecture is based around a classifier that, given a pair of mentions (a candidate anaphor and a candidate antecedent), makes a binary classification decision: coreferring or not.

The mention ranking model directly compares candidate antecedents to each other, choosing the highest-scoring antecedent for each anaphor.
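A toy sketch of mention ranking: each candidate antecedent, plus a dummy "no antecedent" option, is scored, and the argmax wins. The string-match heuristic below is a stand-in for a learned neural scorer:

```python
def score(anaphor, antecedent):
    """Toy stand-in for a learned scorer; None is the 'no antecedent' option."""
    if antecedent is None:
        return 0.0
    if anaphor.lower() == antecedent.lower():      # exact string match
        return 2.0
    if anaphor.split()[-1].lower() == antecedent.split()[-1].lower():
        return 1.0                                 # head-word match
    return -1.0

def best_antecedent(anaphor, candidates):
    """Mention ranking: compare all candidates (plus 'none') and take the argmax."""
    options = candidates + [None]
    return max(options, key=lambda c: score(anaphor, c))

print(best_antecedent("the company", ["Apple", "the company", "Tim Cook"]))
```

Returning `None` here corresponds to deciding the mention starts a new entity rather than referring back.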

Entity linking

Entity linking is also called entity resolution: the task of mapping a discourse entity to some real-world individual. We usually operationalize entity linking or resolution by mapping to an ontology: a list of entities in the world.

Since the earliest systems, entity linking has been done in two stages: mention detection and mention disambiguation.

Mention detection steps often include various kinds of query expansion, for example by doing coreference resolution on the current document.

Mention disambiguation is often done by supervised learning.
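A minimal sketch of disambiguation by priors plus context, assuming a made-up anchor-text dictionary (real systems estimate such statistics from resources like Wikipedia link anchors):

```python
# Toy anchor-text dictionary: mention string -> {candidate entity: prior count}.
# The names and counts here are invented for illustration.
PRIORS = {
    "jaguar": {"Jaguar_Cars": 120, "Jaguar_(animal)": 80},
    "paris": {"Paris": 500, "Paris,_Texas": 10},
}

def link(mention, context_words=()):
    """Pick the candidate entity with the best prior + crude context boost."""
    candidates = PRIORS.get(mention.lower())
    if not candidates:
        return None
    def score(entity):
        boost = sum(50 for w in context_words if w.lower() in entity.lower())
        return candidates[entity] + boost
    return max(candidates, key=score)

print(link("Jaguar", context_words=["animal", "jungle"]))  # Jaguar_(animal)
```

Without context the prior dominates (`Jaguar_Cars` wins); a supervised disambiguator learns much richer context features than this substring check.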

Relation Extraction

Relations are usually in the form of an RDF triple, a tuple of (subject, predicate, object). A relation consists of a set of ordered tuples over elements of a domain. In most standard information extraction applications, the domain elements correspond to the named entities that occur in the text, to the underlying entities that result from coreference resolution, or to entities selected from a domain ontology. There’s an obvious connection between relation extraction and NER: NER corresponds to the identification of a class of unary relations.
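For instance, RDF triples can be represented directly as (subject, predicate, object) tuples and queried by predicate (the entities and relations below are invented):

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# A tiny toy knowledge base of extracted relations.
kb = [
    Triple("Tim_Cook", "ceo_of", "Apple"),
    Triple("Apple", "headquartered_in", "Cupertino"),
]

def query(kb, predicate):
    """Return all (subject, object) pairs holding a given relation."""
    return [(t.subject, t.object) for t in kb if t.predicate == predicate]

print(query(kb, "ceo_of"))  # [('Tim_Cook', 'Apple')]
```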

Why relation extraction?

  • Create new structured knowledge bases
  • Augment current knowledge bases
  • Support question answering

Algorithms for relation extraction

  • handwritten patterns
  • supervised machine learning
  • semi-supervised
    • bootstrapping
    • distant supervision
  • unsupervised

Using patterns

Example: IS-A relation
  • “Y such as X ((, X)* (, and|or) X)”
  • “such Y as X”
  • “X or other Y”
  • “X and other Y”
  • “Y including X”
  • “Y, especially X”
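The first pattern above ("Y such as X") can be approximated with a regular expression. Here noun phrases are crudely matched as single words (lowercase hypernym, capitalized hyponyms), purely for illustration:

```python
import re

# "Y such as X ((, X)* (, and|or) X)" with single-word NPs.
PATTERN = re.compile(
    r"(\w+) such as ((?:[A-Z]\w+)(?:(?:, and |, or |, | and | or )[A-Z]\w+)*)")

def isa_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the pattern."""
    pairs = []
    for y, xs in PATTERN.findall(sentence):
        for x in re.split(r", and |, or |, | and | or ", xs):
            pairs.append((x, y))
    return pairs

print(isa_pairs("He likes composers such as Bach, Mozart and Beethoven."))
# [('Bach', 'composers'), ('Mozart', 'composers'), ('Beethoven', 'composers')]
```

A real implementation would match chunked noun phrases rather than single capitalized words.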

Pros:

  • human patterns tend to be high-precision
  • can be tailored to specific domains

Cons:

  • human patterns are often low-recall
  • a lot of work to think of all possible patterns
  • repeat similar work for every relation

Supervised machine learning for relations

This is a classification task: predict the relation between two entities in a sentence.

  1. Choose a set of relations we’d like to extract
  2. Choose a set of relevant named entities
  3. Find and label data
    • Choose a representative corpus
    • Label the named entities in the corpus
    • Hand-label the relations between these entities
    • Break into training, development, and test
  4. Train a classifier on the training set
    • Naive Bayes
    • MaxEnt
    • SVM
    • NNs
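Before the classifier in step 4 can be trained, each candidate entity pair in a sentence must be turned into a feature representation. A sketch of such a feature function (the feature set and example sentence are illustrative):

```python
def relation_features(tokens, e1_span, e2_span, e1_type, e2_type):
    """Features for classifying the relation between two entity mentions.
    Spans are (start, end) token indices, end exclusive."""
    between = tokens[e1_span[1]:e2_span[0]]
    return {
        "entity_types": f"{e1_type}-{e2_type}",   # NER type signature
        "words_between": " ".join(between),        # lexical context
        "num_between": len(between),               # distance feature
        "e1_head": tokens[e1_span[1] - 1].lower(),
        "e2_head": tokens[e2_span[1] - 1].lower(),
    }

tokens = "Tim Cook is the CEO of Apple".split()
print(relation_features(tokens, (0, 2), (6, 7), "PER", "ORG"))
```

Feature dicts like this can be fed to any of the classifiers listed above.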

Semi-supervised and unsupervised relation extraction

Relation Bootstrapping

  • Gather a set of seed pairs that have relation
  • Iterate:
    1. Find sentences with these pairs
    2. Look at the context between or around the pair and generalize the context to create patterns
    3. Use the patterns to grep for more pairs
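The loop above can be sketched end to end on a toy corpus. The sentences, the seed pair, and the born-in relation are invented, and "generalizing the context" is reduced to reusing the literal string between the pair:

```python
import re

corpus = [
    "Mozart was born in Salzburg .",
    "Einstein was born in Ulm .",
    "Einstein , born in Ulm , later moved .",
]

def bootstrap(seeds, corpus, rounds=2):
    """Toy bootstrapping for a born-in relation from seed (person, city) pairs."""
    pairs, patterns = set(seeds), set()
    for _ in range(rounds):
        # Steps 1-2: find sentences containing known pairs, harvest the context.
        for x, y in list(pairs):
            for sent in corpus:
                if x in sent and y in sent:
                    ctx = sent[sent.index(x) + len(x):sent.index(y)]
                    patterns.add(re.escape(ctx))
        # Step 3: use the patterns to grep for more pairs.
        for pat in patterns:
            for sent in corpus:
                m = re.search(r"(\w+)" + pat + r"(\w+)", sent)
                if m:
                    pairs.add((m.group(1), m.group(2)))
    return pairs

print(sorted(bootstrap({("Mozart", "Salzburg")}, corpus)))
# [('Einstein', 'Ulm'), ('Mozart', 'Salzburg')]
```

One seed pair yields the pattern "was born in", which finds the Einstein pair, which in turn yields the ", born in" pattern; semantic drift is the usual failure mode when patterns get too general.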

Distant Supervision

  • Combine bootstrapping with supervised learning
    • Use a large database to get a huge # of seed examples
  • Create lots of features from all these examples
  • Combine them in a supervised classifier

The procedure:

  1. For each relation
  2. For each tuple in the big database
  3. Find sentences in a large corpus with both entities
  4. Extract frequent features (parse, words, etc.)
  5. Train a supervised classifier using thousands of patterns
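A toy sketch of the labeling step (steps 2-3): every KB tuple is paired with every sentence mentioning both entities, and the sentence inherits the KB relation as its label. The KB and corpus below are made up, and the third sentence shows why these labels are noisy:

```python
# Toy knowledge base: (entity1, entity2) -> relation.
kb = {("Steve_Jobs", "Apple"): "founder_of",
      ("Bill_Gates", "Microsoft"): "founder_of"}

corpus = [
    "Steve_Jobs started Apple in a garage .",
    "Bill_Gates started Microsoft with Paul_Allen .",
    "Steve_Jobs spoke at Apple headquarters .",   # matches, but is not founding
]

def distant_examples(kb, corpus):
    """Label every sentence mentioning both entities with the KB relation."""
    examples = []
    for (e1, e2), rel in kb.items():
        for sent in corpus:
            toks = sent.split()
            if e1 in toks and e2 in toks:
                between = toks[toks.index(e1) + 1:toks.index(e2)]
                examples.append({"between": " ".join(between), "label": rel})
    return examples

for ex in distant_examples(kb, corpus):
    print(ex)
```

The "spoke at" example gets the `founder_of` label even though the sentence says nothing about founding; the bet of distant supervision is that, over thousands of examples, the signal outweighs this noise.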

Evaluation

Since these methods extract totally new relations from the web, there is no gold set of correct relation instances. We can only approximate precision by drawing a random sample of relations from the output and checking them manually.
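That sampling procedure is easy to sketch; `judge` below stands in for the human annotator, implemented here with a toy gold set:

```python
import random

def estimated_precision(extractions, judge, sample_size=100, seed=0):
    """Estimate precision by judging a random sample of the system's output."""
    random.seed(seed)  # fixed seed so the sample is reproducible
    sample = random.sample(extractions, min(sample_size, len(extractions)))
    correct = sum(1 for x in sample if judge(x))
    return correct / len(sample)

# judge() would be a human; here a toy gold set plays that role.
gold = {("Mozart", "born_in", "Salzburg")}
output = [("Mozart", "born_in", "Salzburg"), ("Mozart", "born_in", "Vienna")]
print(estimated_precision(output, lambda t: t in gold))  # 0.5
```

Recall cannot be estimated this way, since we would need the full set of true relations on the web.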

Event Extraction

An event is any expression denoting an event or state that can be assigned to a particular point or interval in time.

The approaches to event extraction are very similar to those for NER: generally modeled via supervised learning, detecting events with sequence models using IOB tagging, and assigning event classes and attributes with multi-class classifiers.

Extracting times

Times and dates are a particularly important kind of named entity, playing a role in question answering and in calendar and personal assistant applications. In order to reason about times and dates, the temporal expressions we extract must be normalized, that is, converted to a standard format.

Types of temporal expressions

  • absolute temporal expressions: can be mapped directly to calendar dates, times of day, or both
  • relative temporal expressions: map to particular times through some other reference point
  • durations: denote spans of time at varying levels of granularity

Most current approaches to temporal normalization are rule-based: designing patterns for the different encodings of temporal expressions and converting them to a unified format.
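A minimal rule-based normalizer of this kind, mapping a couple of absolute date formats to ISO-8601; the two supported formats are just examples:

```python
import re

MONTHS = {"january": 1, "february": 2, "march": 3, "april": 4, "may": 5,
          "june": 6, "july": 7, "august": 8, "september": 9, "october": 10,
          "november": 11, "december": 12}

def normalize_date(expr):
    """Map a few absolute date encodings to the ISO-8601 form YYYY-MM-DD."""
    m = re.match(r"(\w+) (\d{1,2}), (\d{4})", expr)      # e.g. "July 2, 2007"
    if m and m.group(1).lower() in MONTHS:
        return f"{m.group(3)}-{MONTHS[m.group(1).lower()]:02d}-{int(m.group(2)):02d}"
    m = re.match(r"(\d{1,2})/(\d{1,2})/(\d{4})", expr)   # e.g. "7/2/2007" (US order)
    if m:
        return f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}"
    return None                                          # relative expressions etc.

print(normalize_date("July 2, 2007"))  # 2007-07-02
```

Relative expressions like "yesterday" need a reference point (usually the document date) before rules like these can resolve them, which is why they fall through to `None` here.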

Reference

Speech and Language Processing
CS124
DeepDive

Chuanrong Li
