Introduction to Automatic Text Summarization

Why we need text summarization

Information is everywhere, and the amount of data keeps growing. As irrelevant information explodes, finding the text data we actually need becomes harder, so summarization is greatly needed. The goal of automatic text summarization is to condense the source text into a shorter version that preserves its information content and overall meaning.

There are two main approaches:
1. Extractive Methods.
2. Abstractive Methods.

Extractive Methods

The model extracts keywords, keyphrases, and salient sentences from the source document. The significance of a sentence is typically estimated from its statistical and linguistic features.

Unsupervised Methods

  • Graph based approach
A graph can effectively represent the document structure, and external knowledge (e.g. Wikipedia) can be incorporated.

  • Fuzzy logic based approach
Summarization based on fuzzy rules over various sets of features.

  • Concept-based approach
The importance of sentences is calculated from concepts retrieved from an external knowledge base.

Text Features

  • Content Key Word (e.g. based on TF-IDF)
  • Title Word
  • Cue Phrase (words indicating structure)
  • Biased Word (e.g. domain specific words)
  • Sentence Location (e.g. beginning and conclusion part)
  • Sentence Length (very long sentences have less chance to be important)
  • Paragraph Location (e.g. in peripheral sections)
  • Cohesion between Sentences (e.g. similarity)
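Several of the features above can be combined into a simple sentence scorer. The sketch below is a minimal, hypothetical example (the helpers `split_sentences` and `tokenize` are naive illustrations, not a standard library): it scores each sentence by a TF-ISF (term frequency–inverse sentence frequency) content-word weight plus a small bonus for sentences at the beginning or end of the document.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    """Naive sentence splitter on ., !, ? (illustrative only)."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def tokenize(sentence):
    """Lowercase word tokens (illustrative only)."""
    return re.findall(r'[a-z]+', sentence.lower())

def score_sentences(text):
    """Score each sentence by mean TF-ISF of its words plus a position bonus."""
    sentences = split_sentences(text)
    token_lists = [tokenize(s) for s in sentences]
    n = len(sentences)
    # inverse sentence frequency: words appearing in fewer sentences weigh more
    sent_freq = Counter(w for toks in token_lists for w in set(toks))
    scores = []
    for i, toks in enumerate(token_lists):
        if not toks:
            scores.append(0.0)
            continue
        tfisf = sum(math.log(n / sent_freq[w]) for w in toks) / len(toks)
        position_bonus = 1.0 if i in (0, n - 1) else 0.0  # beginning / conclusion
        scores.append(tfisf + 0.5 * position_bonus)
    return list(zip(sentences, scores))

def summarize(text, k=1):
    """Return the k highest-scoring sentences."""
    ranked = sorted(score_sentences(text), key=lambda p: p[1], reverse=True)
    return [s for s, _ in ranked[:k]]
```

The feature weights (here, 0.5 for position) are arbitrary; in practice they would be tuned, or learned as in the supervised methods below.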

*Text summarization using the TextRank algorithm [2]*
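TextRank is the canonical graph-based method: sentences are nodes, edge weights are word-overlap similarity, and a PageRank-style iteration ranks the nodes. The following is a minimal pure-Python sketch of that idea (the similarity measure follows TextRank's log-normalized word overlap; the damping factor and iteration count are conventional PageRank defaults):

```python
import math
import re

def textrank_summarize(sentences, k=2, damping=0.85, iterations=50):
    """Rank sentences by PageRank over a word-overlap similarity graph."""
    tokens = [set(re.findall(r'[a-z]+', s.lower())) for s in sentences]
    n = len(sentences)
    # edge weight: word overlap normalized by log sentence lengths
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and tokens[i] and tokens[j]:
                overlap = len(tokens[i] & tokens[j])
                denom = math.log(len(tokens[i]) + 1) + math.log(len(tokens[j]) + 1)
                sim[i][j] = overlap / denom
    out = [sum(row) for row in sim]  # total outgoing weight per node
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                scores[j] * sim[j][i] / out[j]
                for j in range(n) if sim[j][i] and out[j]
            )
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep original order
```

A sentence with no overlap with the rest of the document ends up with the minimum score `1 - damping`, so it is unlikely to be selected.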

Supervised Methods

  • Machine learning approach
  • Neural network based approaches
    • Attentional encoder-decoder

Supervised extractive summarization is usually framed as a binary classification task: decide, for each sentence, whether it belongs in the summary.
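Framed as classification, each sentence becomes a feature vector and a learned model decides whether to include it. The sketch below is a hypothetical illustration of that framing (the feature set and the linear decision rule are assumptions for the example, not a specific published model):

```python
def sentence_features(sentence, position, doc_len, title_words):
    """Hypothetical feature vector: [normalized length, position, title overlap]."""
    words = sentence.lower().split()
    return [
        min(len(words) / 25.0, 1.0),                # sentence length, capped at 1
        1.0 - position / max(doc_len - 1, 1),       # earlier sentences score higher
        len(set(words) & title_words) / max(len(title_words), 1),  # title overlap
    ]

def predict(weights, features, bias=0.0):
    """Linear classifier: include the sentence if the weighted score is positive."""
    return sum(w * f for w, f in zip(weights, features)) + bias > 0
```

In a real system the weights would be learned from documents with labeled summary sentences (e.g. by logistic regression or a neural network) rather than set by hand.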

Abstractive Methods

The model not only extracts but also concisely paraphrases the important parts of the document via generation. This can overcome the grammatical inconsistencies of extractive methods.

  • Recursive Autoencoder
  • Neural network based approaches
    • Attentional feed-forward network
    • RNN-based encoder-decoder models
  • Reinforcement learning for sequence generation

Metrics

ROUGE scores (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE simply counts how many n-grams in the generated summary match the n-grams in the reference summary.
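The recall-oriented variant, ROUGE-N, divides the clipped n-gram overlap by the number of n-grams in the reference. A minimal sketch (whitespace tokenization is a simplifying assumption; real implementations also apply stemming and report precision and F1):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram overlap / reference n-gram count."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    if not ref:
        return 0.0
    overlap = sum(min(cand[g], ref[g]) for g in ref)  # clip repeated matches
    return overlap / sum(ref.values())
```

For example, `rouge_n("the cat sat", "the cat sat on the mat", 1)` matches 3 of the 6 reference unigrams, giving a recall of 0.5.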

Reference

[1] Moratanch N., Chitrakala S. A survey on extractive text summarization. 2017 International Conference on Computer, Communication and Signal Processing (ICCCSP), IEEE, 2017: 1-6.
[2] An Introduction to Text Summarization using the TextRank Algorithm

Chuanrong Li
