Transformer: successor of RNN

The Transformer was introduced in Attention Is All You Need in 2017. It is a neural network architecture based on a self-attention mechanism, and it has demonstrated excellent performance on language understanding tasks.

Background

  • Typical recurrent models require a large amount of sequential computation, which limits parallelization
  • Critical information in long sequences is difficult for RNNs to retain and represent
  • Attention mechanisms have proved effective at modeling dependencies regardless of their distance in the sequence

A quick recap of attention

Given a sequence-to-sequence model: [figure: Seq2seq]

We can calculate the attention scores in this way: [figure: Attention]
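
One common way to write this recap, using dot-product scoring (following the CS224N notes listed in the references; a sketch, not the original figure), is: the decoder state \(s_t \) attends over the encoder hidden states \(h_1, \dots, h_N \) as

$$e^{t} = [\,s_t^{\top} h_1,\; \dots,\; s_t^{\top} h_N\,] \in \mathbb{R}^{N}$$ $$\alpha^{t} = \mathrm{softmax}(e^{t})$$ $$a_t = \sum_{i=1}^{N} \alpha^{t}_{i}\, h_i$$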

General definition of attention

Given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.
[figure: General Attention]

In the previous example, the query is the decoder hidden state \(s_t \); there is no separation of keys and values, which are both the encoder hidden states \(h_1, h_2, \dots, h_N \).
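
Spelled out, this general definition amounts to the following sketch, where \(\mathrm{score}\) stands for whatever similarity function is used (a dot product in the example above):

$$\alpha = \mathrm{softmax}\big([\,\mathrm{score}(q, v_1),\; \dots,\; \mathrm{score}(q, v_N)\,]\big), \qquad a = \sum_{i=1}^{N} \alpha_i\, v_i$$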

Scaled Dot-Product Attention

Every word in the input sequence has three corresponding vectors: a query, a key, and a value. When we have multiple vectors, we stack them into the matrices \(Q\), \(K\), and \(V\).
[figure: Query, Key and Value]

And we can calculate attention with the three matrices:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

As \(d_k \) gets large, the variance of the dot products \(q^{\top}k\) increases, so some values inside the softmax become large, the softmax becomes very peaked, and hence its gradient becomes small. As a solution, the scores are divided by \(\sqrt{d_k}\), the square root of the dimension of the query/key vectors.
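
As a concrete illustration (not from the original post; function and variable names are just illustrative), a minimal NumPy sketch of scaled dot-product attention:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output: (n_q, d_v)."""
    d_k = Q.shape[-1]
    # Raw scores, divided by sqrt(d_k) so the softmax does not saturate for large d_k.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # one attention distribution per query
    return weights @ V                  # weighted sum of the values

# Tiny usage example: a 4-word sequence with d_k = 8 and d_v = 16.
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```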

Multi-Head Attention

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.

To make it more explicit, scaled dot-product attention can be considered as a way for words to make connections with each other ([figure: Word Connection]), but the problem is that it gives words only one way to interact with one another.

So, to let words interact in more ways, we can project \(Q\), \(K\), and \(V\) into several lower-dimensional spaces, apply attention in each, and concatenate all the outputs.

Make a linear projection for each head:

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})$$

Concatenate all the outputs:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}$$
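
A minimal NumPy sketch of this scheme, reusing the scaled_dot_product_attention function from the sketch above (the names and the single stacked projection matrices W_q, W_k, W_v, W_o are illustrative; slicing their columns per head is equivalent to keeping separate per-head matrices):

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, n_heads):
    """Q, K, V: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = Q.shape
    d_head = d_model // n_heads
    heads = []
    for i in range(n_heads):
        # Column slice i of the stacked projection matrices plays the role of
        # W_i^Q, W_i^K, W_i^V: it maps Q, K, V into a lower-dimensional subspace.
        sl = slice(i * d_head, (i + 1) * d_head)
        heads.append(scaled_dot_product_attention(Q @ W_q[:, sl],
                                                  K @ W_k[:, sl],
                                                  V @ W_v[:, sl]))
    # Concatenate the per-head outputs and apply the final projection W^O.
    return np.concatenate(heads, axis=-1) @ W_o
```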

Rest of the block

  • Feed-Forward Networks
    In addition to the attention sub-layer, each layer of the encoder and decoder contains a fully connected feed-forward network, which consists of two linear transformations with a ReLU activation in between

  • Residual Connection
    It’s clear from the model architecture that there is a residual connection around each of the two sub-layers in a block

  • Layer Normalization
    The output of each sub-layer (together with its residual input) is passed through a layer normalization function, which transforms it to have mean 0 and variance 1; a small sketch combining these pieces follows this list
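
Putting these three pieces together, a minimal NumPy sketch of one encoder block (dropout and the learned gain and bias of layer normalization are omitted; attn_fn and ffn_fn are illustrative stand-ins for the two sub-layers):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position over its feature dimension to mean 0, variance 1.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU activation in between.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_block(x, attn_fn, ffn_fn):
    # Residual connection around each sub-layer, followed by layer normalization.
    x = layer_norm(x + attn_fn(x))   # self-attention sub-layer
    x = layer_norm(x + ffn_fn(x))    # feed-forward sub-layer
    return x
```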

Positional Encoding

Until now, words in a sequence have all been treated the same way: the representation of a word is indifferent to its position. To take this critical feature into consideration, positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks.

$$PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$$ $$PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$$ where \(pos\) is the position and \(i\) is the dimension.
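
A small NumPy sketch of these sinusoidal encodings (assuming an even \(d_{model}\); the function name is illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings of shape (max_len, d_model); d_model is assumed even."""
    pos = np.arange(max_len)[:, None]         # positions 0 .. max_len - 1
    i = np.arange(d_model // 2)[None, :]      # dimension index i
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)               # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)               # odd dimensions use cosine
    return pe

# The encodings are simply added to the input embeddings:
# x = token_embeddings + positional_encoding(seq_len, d_model)
```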

Reference

CS224N-nmt
The Illustrated Transformer-Jay Alammar
