Natural Language Generation Overview

Overview

Natural Language Generation (NLG) is about generating text. It is part of Natural Language Processing (NLP), but sits in a different branch from Natural Language Understanding (NLU). More specifically, here are some tasks related to NLG:

  • Machine Translation
  • (Abstractive) Summarization
  • Dialogue (chit-chat and task-based)
  • Creative writing: storytelling, poetry-generation
  • Freeform Question Answering (i.e. answer is generated, not extracted from text or knowledge base)
  • Image captioning

Key Component: Language Model

To generate text, we rely on a language model, which predicts the next word given the words so far.

$$P(y_t \mid y_1, \dots, y_{t-1})$$

The simplest model that assigns probabilities to sentences and sequences of words is the n-gram language model. Nowadays, popular language models are RNN-based or Transformer-based.
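
As a minimal illustration of the n-gram idea, the sketch below estimates bigram probabilities by counting over a tiny made-up corpus (the corpus and numbers are purely illustrative, not from the original notes):

```python
from collections import Counter, defaultdict

# Toy corpus; in practice this would be a large training set.
corpus = [
    "<s> the cat sat on the mat </s>".split(),
    "<s> the dog sat on the rug </s>".split(),
]

# Count how often each word follows each previous word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    """P(word | prev) estimated by maximum likelihood (no smoothing)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(bigram_prob("the", "cat"))  # 0.25: "the" is followed by cat, mat, dog, rug
```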

To train a task-specific language model, we can condition it on additional input features x:

$$P(y_t \mid y_1, \dots, y_{t-1}, x)$$

  • Machine Translation (x=source sentence, y=target sentence)
  • Summarization (x=input text, y=summarized text)
  • Dialogue (x=dialogue history, y=next utterance)

Perplexity captures how well a language model predicts a test set; it can be seen as the weighted average branching factor of the language model. The perplexity of a language model on a test set is the inverse probability of the test set, normalized by the number of words: $$PP(W) = P(w_1 w_2 \dots w_N)^{-1/N}$$
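
A minimal sketch of how this formula would be applied, given the per-token probabilities some LM assigns to a test sequence (the probabilities below are made up for illustration):

```python
import math

# Hypothetical per-token probabilities P(w_t | w_1..w_{t-1}) assigned by an LM
# to the N tokens of a test sequence.
token_probs = [0.2, 0.1, 0.4, 0.05, 0.3]

# PP(W) = P(w_1 ... w_N)^(-1/N); computed in log space for numerical stability.
log_prob = sum(math.log(p) for p in token_probs)
perplexity = math.exp(-log_prob / len(token_probs))
print(perplexity)
```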

Decoding the language model

A decoding algorithm is an algorithm to generate text from your language model.

  • Greedy decoding
    • On each step, take the most probable word (i.e. argmax)
    • Use that as the next word, and feed it as input on the next step
    • Keep going until it produces <END> (or reaches some max length)
  • Beam search
    • A search algorithm which aims to find a high-probability sequence by tracking multiple possible sequences at once
    • On each step of decoder, keep track of the k most probable partial sequences (which we call hypotheses)
    • After reaching some stopping criterion, choose the sequence with the highest probability (factoring in some adjustment for length)
  • Sampling-based decoding
    • On each step t, randomly sample from the probability distribution \(P_t\) to obtain next word
    • In top-n sampling, restrict sampling to just the n most probable words
  • Softmax temperature
    • Apply a temperature hyperparameter \(\tau\) to the softmax
    • \(P_t(w) = \frac{\exp(s_w / \tau)}{\sum_{w'} \exp(s_{w'} / \tau)}\)
    • Raise \(\tau\) -> more diverse output (probability is spread around vocab)
    • Lower \(\tau\) -> less diverse output (probability is concentrated on top words)
    • Not a decoding algorithm by itself; it is applied in conjunction with a decoding algorithm (such as beam search or sampling), as in the sketch after this list
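
Here is a minimal sketch of greedy decoding and temperature-scaled top-n sampling for a single step, assuming we already have raw next-word scores from some language model (the scores dictionary and word list are hypothetical):

```python
import math
import random

def softmax_with_temperature(scores, tau=1.0):
    """Turn raw scores s_w into P_t(w) = exp(s_w / tau) / sum_w' exp(s_w' / tau)."""
    exps = {w: math.exp(s / tau) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

def greedy_step(scores):
    """Greedy decoding: take the most probable (argmax) word at this step."""
    return max(scores, key=scores.get)

def top_n_sample_step(scores, n=3, tau=1.0):
    """Top-n sampling: restrict to the n most probable words, then sample."""
    top = dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n])
    probs = softmax_with_temperature(top, tau)
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights, k=1)[0]

# Hypothetical scores for the next word at one decoding step.
scores = {"the": 2.0, "a": 1.5, "cat": 0.7, "<END>": 0.1}
print(greedy_step(scores))                    # "the"
print(top_n_sample_step(scores, n=3, tau=0.8))
```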

Summarization

Summarization is the task of taking an input text x and producing a summary y that is shorter and contains the main information of x. Summarization can be single-document or multi-document: single-document means we generate a summary of one document, while multi-document means we write a summary of multiple content-overlapping documents.

Summarization approaches:

  • Extractive summarization
    • Content selection -> Information ordering -> Sentence realization
    • Use a topic-keyword-based score function or a graph-based algorithm to select content (see the sketch after this list)
  • Abstractive summarization
    • seq2seq + attention
    • hierarchical / multi-level attention
    • Copy mechanisms use attention to enable a seq2seq system to easily copy words and phrases from the input to the output [1]
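
As an illustration of keyword-based content selection for extractive summarization, the sketch below scores sentences by the frequency of their content words and keeps the top-k; the stopword list and scoring are simplistic assumptions, not the specific method from the notes:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "on"}

def select_sentences(document, k=2):
    """Score each sentence by how many frequent topic keywords it contains,
    then keep the top-k sentences as the extractive summary."""
    sentences = [s.strip() for s in re.split(r"[.!?]", document) if s.strip()]
    words = [w for w in re.findall(r"\w+", document.lower()) if w not in STOPWORDS]
    keyword_weight = Counter(words)

    def score(sentence):
        return sum(keyword_weight[w] for w in re.findall(r"\w+", sentence.lower()))

    return sorted(sentences, key=score, reverse=True)[:k]
```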

Dialogue System

Dialogue systems can be divided into many categories:

  • Task-oriented dialogue
    • Assistive (e.g. customer service, giving recommendations, question answering, helping user accomplish a task like buying or booking something)
    • Co-operative (two agents solve a task together through dialogue)
    • Adversarial (two agents compete in a task through dialogue)
  • Social dialogue
    • Chit-chat (for fun or company)
    • Therapy / mental wellbeing

Previously, dialogue systems relied on predefined templates or a corpus of responses; rather than truly generating text, they performed slot filling or information retrieval.

Nowadays, seq2seq models with attention have gained popularity.

Difficulties in dialogue system:

  • Genericness / boring response problem
    • improve the decoding algorithm or the model architecture (e.g. retrieve-and-refine models)
  • Irrelevant responses
    • optimize for Maximum Mutual Information (MMI) between input S and response T
  • Repetition
    • directly block repeating n-grams during decoding (see the sketch after this list) or define a training objective to discourage repetition
  • Lack of context
  • Lack of consistent persona
    • Represent personas as embeddings and condition generated utterances on those embeddings
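
A minimal sketch of the n-gram blocking idea at decoding time: any word that would complete an n-gram already present in the generated prefix is banned at the current step (the helper below is a toy illustration, not tied to any particular toolkit):

```python
def banned_next_words(generated, n=3):
    """Return the words that would repeat an n-gram already in `generated`.

    If the last (n-1) generated words have appeared before, the word that
    followed them previously is banned, so the same n-gram cannot recur.
    """
    if len(generated) < n:
        return set()
    prefix = tuple(generated[-(n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

# At each decoding step, set the probability of banned words to zero
# (or their score to -inf) before picking the next word.
print(banned_next_words(["i", "am", "fine", "i", "am"], n=3))  # {'fine'}
```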

Story & Poetry Generation

Most neural storytelling work uses some kind of prompt:

  • Generate a story-like paragraph given an image (an extension of image captioning)
  • Generate a story given a brief writing prompt
  • Generate the next sentence of a story, given the story so far (story continuation)

From an image

  • Use a common sentence-encoding space to get around the lack of parallel data
    • Skip-thought vectors are a type of general-purpose sentence embedding method [2]
  • Use an image captioning dataset (COCO), learn a mapping from images to the skip-thought encodings of their captions
  • Use the target-style corpus to train an RNN-LM that decodes a skip-thought vector back to the original text
  • Put the two together

From a writing prompt

The seq2seq prompt-to-story model [3]:

  • Convolution-based, faster than RNN-based seq2seq
  • Gated multi-head multi-scale self-attention
    • the self-attention is important for capturing long-range context
    • the gates allow the attention mechanism to be more selective
    • the different attention heads attend at different scales, so there are separate attention mechanisms dedicated to retrieving fine-grained and coarse-grained information
  • Model fusion:
    • pretrain one seq2seq model, then train a second seq2seq model that has access to the hidden states of the first
    • the idea is that the first seq2seq model learns general language modeling and the second learns to condition on the prompt

Storytelling is very challenging, since a story is not just a sequence of words but a sequence of events. To tell a story, we need to define characters, with their personalities, motivations, histories, and relationships to other characters, and we need a narrative structure to compose the whole story.

Generating poetry

Unlike stories, poetry has rhythm constraints. Hafez [4] is a poetry generation system. The idea is to use a Finite State Acceptor (FSA) to define all possible sequences that obey the desired rhythm constraints, then use the FSA to constrain the output of an RNN-LM.

  • User provides topic word
  • Get a set of words related to topic
  • Identify rhyming topical words. These will be the ends of each line
  • Generate the poem using a backwards RNN-LM constrained by the FSA, because the last word of each line is fixed (see the sketch below)
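
A minimal sketch of constrained decoding in this spirit: at each step, words the constraint automaton would not accept are filtered out before picking the next word. The allowed_words interface and the scores are hypothetical; the real Hafez system encodes full rhythm patterns in its FSA:

```python
def constrained_greedy_step(scores, allowed_words):
    """Pick the most probable word among those the constraint automaton accepts.

    `scores` maps candidate words to LM scores; `allowed_words` is the set of
    words the FSA can accept in its current state (hypothetical interface).
    """
    feasible = {w: s for w, s in scores.items() if w in allowed_words}
    if not feasible:
        raise ValueError("No word satisfies the rhythm constraint at this step")
    return max(feasible, key=feasible.get)

# Hypothetical LM scores and a constraint that only allows one-syllable words here.
scores = {"beautiful": 2.1, "bright": 1.4, "night": 1.2}
print(constrained_greedy_step(scores, allowed_words={"bright", "night"}))  # "bright"
```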

Evaluation

Up to 60% of NLG research published between 2012 and 2015 relies on automatic metrics [5]. Automatic metrics stand in contrast to human judgements and include:

  • Word-based Metrics
    • Word-overlap Metrics (BLEU, ROUGE, METEOR, F1, etc.; see the sketch after this list)
    • Semantic Similarity
  • Grammar-based metrics
    • Readability
    • Grammaticality
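
As a concrete example of the word-overlap family, here is a minimal unigram-overlap F1 between a generated text and a reference (far simpler than BLEU or ROUGE, but the same flavor; the example sentences are made up):

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """F1 over unigram overlap between a generated text and a reference."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped counts of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the cat sat on the mat", "a cat was sitting on the mat"))
```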

Human judgement is typically based on criteria such as:

  • Informativeness: Does the utterance provide all the useful information from the meaning representation?
  • Naturalness: Could the utterance have been produced by a native speaker?
  • Quality: How do you judge the overall quality of the utterance in terms of its grammatical correctness and fluency?

Compared to human judgement, automatic metrics are cheap and easy to run.

Word-overlap-based metrics are not a good fit for NLG tasks, since they are inconsistent with human evaluation criteria.

Word-embedding-based metrics improve on word overlap: they compare the similarity of word embeddings (or averages of word embeddings) rather than the overlap of the words themselves, so they can capture semantics in a more flexible way. Unfortunately, they still do not correlate well with human judgments for open-ended tasks like dialogue.

We have no automatic metrics to adequately capture overall quality, but we can define more focused automatic metrics to capture particular aspects of generated text. Though these don’t measure overall quality, they can help us track some important qualities that we care about:

  • Fluency (compute probability w.r.t. well-trained LM)
  • Correct style (prob w.r.t. LM trained on target corpus)
  • Diversity (rare word usage, uniqueness of n-grams; see the sketch after this list)
  • Relevance to input (semantic similarity measures)
  • Simple things like length and repetition
  • Task-specific metrics e.g. compression rate for summarization
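
For instance, uniqueness of n-grams can be tracked with a distinct-n style ratio, a common diversity proxy (a minimal sketch; the exact metric is an assumption, not prescribed in the notes):

```python
def distinct_n(tokens, n=2):
    """Ratio of unique n-grams to total n-grams; higher means more diverse output."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

print(distinct_n("i am fine i am fine thanks".split(), n=2))  # ~0.67
```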

Reference

CS224n
[1] See, A., Liu, P. J., & Manning, C. D. (2017). Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
[2] Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Skip-thought vectors. In Advances in neural information processing systems (pp. 3294-3302).
[3] Fan, A., Lewis, M., & Dauphin, Y. (2018). Hierarchical neural story generation. arXiv preprint arXiv:1805.04833.
[4] Ghazvininejad, M., Shi, X., Priyadarshi, J., & Knight, K. (2017, July). Hafez: an interactive poetry generation system. In Proceedings of ACL 2017, System Demonstrations (pp. 43-48).
[5] Novikova, J., Dušek, O., Curry, A. C., & Rieser, V. (2017). Why we need new evaluation metrics for NLG. arXiv preprint arXiv:1707.06875.

Chuanrong Li
