Motivation
Previous image captioning work typically generates a single high-level sentence to describe the whole image, which limits both the quality and the quantity of the conveyed information. Dense captioning, on the other hand, produces many region-level captions but suffers from a lack of coherence between them.
Data
Images annotated with paragraph descriptions: 19,551 image-paragraph pairs.
Model
Region Detector + Region Pooling + Hierarchical Recurrent Network
Region Detector: VGG-16 extracts convolutional features, and a region proposal network (RPN) proposes regions of interest, which are projected onto the convolutional feature map. The RPN, first proposed in Faster R-CNN, is a small fully convolutional network that slides over the feature map to score candidate boxes; the feature vector for each proposed region is then extracted from the corresponding portion of the feature map.
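A minimal sketch of projecting a region of interest onto the conv feature map. The stride of 16 comes from VGG-16's cumulative downsampling; the floor-division rounding and the `project_roi` helper are simplifications for illustration, not the exact scheme used in the paper.

```python
def project_roi(box, stride=16):
    """Map an RoI from image coordinates onto the conv feature map.

    VGG-16's conv layers downsample the image by a cumulative stride
    of 16, so box coordinates are divided by that stride. Rounding
    with floor division is a simplification; implementations differ.
    """
    x1, y1, x2, y2 = box
    return (x1 // stride, y1 // stride, x2 // stride, y2 // stride)

# A 256x128 box at image position (64, 32) lands on feature-map cells (4, 2)..(20, 10).
print(project_roi((64, 32, 320, 160)))  # (4, 2, 20, 10)
```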
Region Pooling uses a learned projection matrix and bias to compute an intermediate vector for each region, then aggregates them into a single pooled vector with an elementwise maximum.
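The pooling step can be sketched in a few lines of numpy. The shapes here (M regions, feature dimension D, pooled dimension P) and the random weights are toy assumptions standing in for the learned parameters.

```python
import numpy as np

def region_pooling(region_feats, W, b):
    """Project each region feature with a learned matrix W and bias b,
    then aggregate across regions with an elementwise maximum."""
    projected = region_feats @ W + b   # (M, P): one projected vector per region
    return projected.max(axis=0)       # (P,): elementwise max over the M regions

# Toy example: M=3 regions, feature dim D=4, pooled dim P=2.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
b = np.zeros(2)
pooled = region_pooling(feats, W, b)
print(pooled.shape)  # (2,)
```

The elementwise max makes the pooled vector invariant to the order and number of detected regions, which is why it is a natural aggregation choice here.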
Hierarchical Recurrent Network: composed of a sentence RNN and a word RNN, each responsible for a different level of the paragraph. The sentence RNN determines the topic of each sentence; the word RNN generates the words of that sentence given its topic vector.
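The two-level decoding loop can be sketched as follows. Everything here is a toy: plain tanh RNN cells instead of the paper's actual units, random weights in place of learned ones, fixed sentence/word counts instead of learned stopping, and greedy argmax decoding.

```python
import numpy as np

rng = np.random.default_rng(0)
P, H, V = 8, 16, 10          # pooled dim, hidden dim, toy vocabulary size
MAX_SENTS, MAX_WORDS = 3, 4  # fixed lengths stand in for learned stopping

# Hypothetical random weights stand in for learned parameters.
Wsh = rng.normal(scale=0.1, size=(P + H, H))  # sentence RNN recurrence
Wt  = rng.normal(scale=0.1, size=(H, H))      # hidden state -> topic vector
Wwh = rng.normal(scale=0.1, size=(H + H, H))  # word RNN recurrence (topic + hidden)
Wo  = rng.normal(scale=0.1, size=(H, V))      # output projection to vocabulary

def rnn_step(x, h, W):
    """One step of a plain tanh RNN cell: concatenate input and state."""
    return np.tanh(np.concatenate([x, h]) @ W)

def generate_paragraph(pooled):
    """Sentence RNN proposes one topic vector per sentence; the word RNN
    decodes each topic into a word-id sequence (greedy argmax here)."""
    sents, h_s = [], np.zeros(H)
    for _ in range(MAX_SENTS):
        h_s = rnn_step(pooled, h_s, Wsh)            # advance sentence RNN
        topic = np.tanh(h_s @ Wt)                   # topic vector for this sentence
        words, h_w = [], np.zeros(H)
        for _ in range(MAX_WORDS):
            h_w = rnn_step(topic, h_w, Wwh)         # advance word RNN
            words.append(int(np.argmax(h_w @ Wo)))  # greedy word id
        sents.append(words)
    return sents

paragraph = generate_paragraph(rng.normal(size=P))
print(len(paragraph), len(paragraph[0]))  # 3 4
```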
Training
Transfer learning is used to train the sub-models, which are then kept fixed throughout end-to-end training. The transferred components include the region detection network, the word embedding vectors, the RNN weights, and the output projection of the word RNN. The main training effort is on the sentence RNN and the word RNN.
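The train/freeze split described above can be illustrated with a toy parameter registry; the component names and the `trainable` flag are hypothetical bookkeeping, not the paper's actual configuration.

```python
# Hypothetical registry: transferred sub-models stay frozen,
# only the hierarchical RNN components receive gradient updates.
params = {
    "region_detector":      {"trainable": False},  # transferred, frozen
    "word_embeddings":      {"trainable": False},  # transferred, frozen
    "word_rnn_output_proj": {"trainable": False},  # transferred, frozen
    "sentence_rnn":         {"trainable": True},   # trained end to end
    "word_rnn":             {"trainable": True},   # trained end to end
}

trainable = [name for name, p in params.items() if p["trainable"]]
print(trainable)  # ['sentence_rnn', 'word_rnn']
```

In a real framework this split would typically be expressed by disabling gradients on the frozen modules rather than with a flag dictionary.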