"A Hierarchical Approach for Generating Descriptive Image Paragraphs" - Paper Reading


original paper

Motivation

Previous image-captioning work usually generates a single high-level sentence to describe the whole image, which limits both the quality and the quantity of the information conveyed. Dense captioning, on the other hand, produces many region-level captions but suffers from a lack of coherence between them.

Data

Images annotated with paragraph descriptions: 19,551 image-paragraph pairs.

Model

Region Detector + Region Pooling + Hierarchical Recurrent Network

The Region Detector consists of VGG-16 for feature extraction and a region proposal network (RPN) that proposes regions of interest and projects them onto the convolutional feature map. The RPN is a small fully convolutional network, first proposed in Faster R-CNN; it slides over the feature map to score candidate boxes, and the features for each proposed region are then pooled from the corresponding window of the feature map.
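
A hypothetical sketch of this region-feature step: VGG-16 conv features plus RoI pooling over externally supplied proposal boxes (standing in for the RPN, which the paper takes from Faster R-CNN). The image size, boxes, and pooled output size here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torchvision
from torchvision.ops import roi_pool

# Conv feature extractor: VGG-16 conv layers, dropping the final max pool so the
# feature map has stride 16 (the usual Faster R-CNN setup).
vgg = torchvision.models.vgg16(weights=None)
conv_features = vgg.features[:-1]

image = torch.randn(1, 3, 512, 512)          # one dummy RGB image
feature_map = conv_features(image)           # (1, 512, 32, 32)

# Proposal boxes in image coordinates (x1, y1, x2, y2); in the real model these
# come from the region proposal network.
boxes = [torch.tensor([[ 30.,  40., 220., 260.],
                       [100., 150., 400., 480.]])]

# Project boxes onto the feature map (stride 16) and pool each region to 7x7.
region_features = roi_pool(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
print(region_features.shape)                 # (2, 512, 7, 7): one tensor per region
```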

Region Pooling applies a learned projection matrix and bias to each region's feature vector, then aggregates the projected vectors into a single image-level vector with an elementwise maximum.
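
A minimal sketch of that pooling step: a learned linear projection (weight matrix plus bias) followed by an elementwise max over regions. The dimensions (4096 region features pooled to 1024) and the random inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RegionPooling(nn.Module):
    def __init__(self, region_dim=4096, pooled_dim=1024):
        super().__init__()
        self.project = nn.Linear(region_dim, pooled_dim)  # learned W and b

    def forward(self, region_vectors):
        # region_vectors: (num_regions, region_dim)
        projected = self.project(region_vectors)          # (num_regions, pooled_dim)
        pooled, _ = projected.max(dim=0)                  # elementwise max over regions
        return pooled                                     # (pooled_dim,)

pooler = RegionPooling()
regions = torch.randn(50, 4096)    # e.g. 50 detected regions for one image
image_vector = pooler(regions)
print(image_vector.shape)          # torch.Size([1024])
```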

The Hierarchical Recurrent Network is composed of a sentence RNN and a word RNN, each responsible for a different level of the paragraph. The sentence RNN decides how many sentences to generate and produces a topic vector for each; the word RNN generates the words of each sentence given its topic vector, as sketched below.
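
A simplified, hedged sketch of the hierarchy: a sentence RNN unrolls over the pooled image vector, emitting a continue/stop decision and a topic vector per step; a word RNN then greedily decodes each sentence from its topic vector. Layer sizes, the LSTM cells, and the decoding loop are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HierarchicalRNN(nn.Module):
    def __init__(self, pooled_dim=1024, hidden_dim=512, topic_dim=1024,
                 vocab_size=10000, embed_dim=512, max_sentences=6, max_words=30):
        super().__init__()
        self.max_sentences, self.max_words = max_sentences, max_words
        self.sentence_rnn = nn.LSTMCell(pooled_dim, hidden_dim)
        self.stop = nn.Linear(hidden_dim, 2)              # CONTINUE / STOP decision
        self.topic = nn.Linear(hidden_dim, topic_dim)     # topic vector per sentence
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.word_rnn = nn.LSTMCell(embed_dim + topic_dim, hidden_dim)
        self.vocab_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, pooled_vector):
        # pooled_vector: (1, pooled_dim) image representation from region pooling
        h = c = torch.zeros(1, self.sentence_rnn.hidden_size)
        paragraph = []
        for _ in range(self.max_sentences):
            h, c = self.sentence_rnn(pooled_vector, (h, c))
            if self.stop(h).argmax(dim=1).item() == 1:    # STOP chosen: end paragraph
                break
            paragraph.append(self.generate_sentence(self.topic(h)))
        return paragraph                                  # list of token-id lists

    def generate_sentence(self, topic, start_token=1):
        hw = cw = torch.zeros(1, self.word_rnn.hidden_size)
        word = torch.tensor([start_token])
        sentence = []
        for _ in range(self.max_words):
            x = torch.cat([self.embed(word), topic], dim=1)
            hw, cw = self.word_rnn(x, (hw, cw))
            word = self.vocab_out(hw).argmax(dim=1)       # greedy decoding
            sentence.append(word.item())
        return sentence
```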

Training

Transfer learning is used to train the sub-models, which are then kept fixed during end-to-end training. The transferred components include the region detection network, the word embedding vectors, the RNN weights, and the output projection of the word RNN. The main training effort therefore goes into the sentence RNN and the word RNN; a sketch of freezing the transferred parts follows.
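
A hedged sketch of that setup: transferred sub-modules are frozen by disabling their gradients so the optimizer only updates the remaining parameters. The module names refer to the sketches above and are assumptions; in practice the frozen set would also include the region detector and projection weights.

```python
import torch

model = HierarchicalRNN()            # from the sketch above
frozen_modules = [model.embed]       # stand-in for the transferred components

for module in frozen_modules:
    for param in module.parameters():
        param.requires_grad = False  # transferred weights stay fixed

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)  # updates sentence/word RNN parts only
```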

Chuanrong Li
