This post was originally published by Kovendhan Venugopal at Medium [AI]
Paper: NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE, 2014
Importance of the Paper:
The paper in discussion is “Neural Machine Translation by Jointly Learning to Align and Translate” by Dzmitry Bahdanau, KyungHyun Cho & Yoshua Bengio.
This is the paper that introduced the now-famous “Attention Mechanism” in 2014. Though the concept of Attention has seen several advancements since, the mechanism introduced by this paper is still known as “Bahdanau Attention” or “Additive Attention”.
Paper Link: https://arxiv.org/pdf/1409.0473.pdf
Gist from the Paper:
- Neural Machine Translation (NMT) is the concept of using Neural Networks to translate sentences from a source language to a target language.
- Until this paper, such NMT models were typically built from multiple networks, each trained individually.
- This paper proposed to build and train a single, large neural network that reads a sentence and outputs a correct translation. This is the basis for all Sequence to Sequence models using Encoder-Decoder architecture in use today.
- Machine Translation, from a probabilistic perspective, amounts to finding a target sentence y that maximizes the conditional probability p(y | x), where x is the source sentence.
- The objective of an NMT task: Maximize the Conditional Probability of Sentence Pairs using a parallel training corpus. A parameterized model is used to model this relationship, and it learns the parameter weights via Backpropagation.
- NMT tasks make use of an Encoder-Decoder model (introduced by Sutskever et al., 2014; Cho et al., 2014a). Both Encoder and Decoder components are Neural Networks.
- An Encoder takes in a source sentence and encodes it into a fixed-length vector.
- A Decoder outputs the translation (target sentence) from the Encoded Vector.
- The Encoder-Decoder system is jointly trained to maximize the conditional probability of a correct translation for a given source-target sentence pair.
Limitations with the Encoder-Decoder Architecture:
- The Decoder depends only on the last Encoded fixed-length vector for information about the source sentence.
- Especially when the source sentence is long, it becomes very difficult for the Encoder to compress all the information into a single vector.
- It has been empirically shown that the performance of a basic encoder-decoder deteriorates rapidly as the length of the source sentence increases (Cho et al., 2014b).
So, what does the paper propose to overcome these limitations?
Learn to Align and Translate jointly:
- The paper proposes an extension to the Encoder-Decoder model that learns to ‘align’ and ‘translate’ jointly.
- Whenever the NMT model generates a translated word, it soft-searches a set of positions in the source sentence for the most relevant information. This is similar to picking out the source words that matter most for the word being translated.
- This is against the concept of encoding the entire source sentence into a single fixed-length context vector.
- Then, the NMT model predicts a target translation based on the context vectors associated with these source positions and from the previously generated translation outputs.
- How is it different from the earlier method? This method encodes the source sentence into a sequence of vectors and then the decoder picks up a subset of these vectors while outputting the translation.
- So what is the benefit? It enables the NMT model to refrain from squashing all the information into a single vector, rather it allows the model to understand long sentences and do a selective search based on the context importance.
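To make the contrast concrete, here is a tiny toy sketch (not the paper's model; all values are random placeholders) of a single fixed context vector versus a per-step soft-search that builds a different context vector for every output word:

```python
import numpy as np

# Toy illustration: contrast one fixed context vector with per-step
# context vectors built by a soft-search over source positions.

rng = np.random.default_rng(0)
Tx, dim = 5, 4                      # 5 source positions, 4-dim states
H = rng.standard_normal((Tx, dim))  # one encoder state per source word

# Basic encoder-decoder: every target word sees the same vector.
fixed_context = H[-1]               # e.g. the last hidden state

def soft_search(scores):
    """Turn relevance scores into a weighted sum of source states."""
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()        # softmax: weights sum to 1
    return weights @ H              # context vector for this step

# Attention-style: each target step gets its own weights, hence its
# own context vector.
ctx_step1 = soft_search(rng.standard_normal(Tx))
ctx_step2 = soft_search(rng.standard_normal(Tx))
print(np.allclose(ctx_step1, ctx_step2))  # almost surely False
```

The point is only the shape of the idea: the fixed vector is the same at every decoding step, while the soft-searched contexts can differ from step to step.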
The Math behind the Encoder-Decoder Framework:
The Encoder processes the source sentence, a sequence of vectors x = (x1, · · · , xTx), into a context vector c:

ht = f(xt, ht−1), c = q({h1, · · · , hTx})

where ht is the hidden state at time t, c is the context vector generated from the sequence of hidden states, and f and q are nonlinear functions.
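A minimal sketch of these encoder equations, assuming f is a plain tanh RNN cell and q simply takes the last hidden state (the choice used by the basic encoder-decoder); the weight matrices are random placeholders, not trained parameters:

```python
import numpy as np

# Encoder sketch: h_t = f(x_t, h_{t-1}), c = q({h_1..h_Tx}).
rng = np.random.default_rng(0)
in_dim, hid_dim, Tx = 3, 4, 6
W_x = rng.standard_normal((hid_dim, in_dim)) * 0.1
W_h = rng.standard_normal((hid_dim, hid_dim)) * 0.1

def f(x_t, h_prev):
    """One recurrence step: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    return np.tanh(W_x @ x_t + W_h @ h_prev)

def q(hidden_states):
    """Summarize the whole sequence into one context vector."""
    return hidden_states[-1]        # here: just the last state

x = rng.standard_normal((Tx, in_dim))   # source sentence x1 .. xTx
h = np.zeros(hid_dim)
states = []
for x_t in x:
    h = f(x_t, h)
    states.append(h)

c = q(states)       # fixed-length context vector
print(c.shape)      # (4,)
```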
The Decoder is then trained to predict the next word yt given the context vector c and all the previously predicted words {y1, · · · , yt−1}. This is nothing but the maximum likelihood estimation of the prediction given the output vector y and the context vector c. Then the probability of the translation y is given as

p(y) = ∏t=1..T p(yt | {y1, · · · , yt−1}, c)
Using a Recurrent Neural Network, each of the conditional probabilities is modeled as

p(yt | {y1, · · · , yt−1}, c) = g(yt−1, st, c)

where g is a nonlinear, potentially multi-layered function that outputs the probability of yt, and st is the hidden state of the RNN.
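A rough sketch of one decoder step, assuming g is a linear map over the previous word's embedding, the hidden state, and the context vector, followed by a softmax over the target vocabulary; shapes and weights are illustrative placeholders:

```python
import numpy as np

# Decoder step sketch: g(y_{t-1}, s_t, c) -> distribution over vocab.
rng = np.random.default_rng(1)
vocab, hid_dim, ctx_dim = 8, 4, 4
W_y = rng.standard_normal((vocab, hid_dim)) * 0.1
W_c = rng.standard_normal((vocab, ctx_dim)) * 0.1
E = rng.standard_normal((hid_dim, vocab)) * 0.1   # toy word embeddings

def g(y_prev_id, s_t, c):
    """p(y_t | y_<t, c): softmax over the target vocabulary."""
    logits = W_y @ (s_t + E[:, y_prev_id]) + W_c @ c
    p = np.exp(logits - logits.max())
    return p / p.sum()

s_t = rng.standard_normal(hid_dim)   # current RNN hidden state
c = rng.standard_normal(ctx_dim)     # context vector
probs = g(y_prev_id=2, s_t=s_t, c=c)
print(round(probs.sum(), 6))         # 1.0: a valid distribution
```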
Encoder-Decoder Architecture proposed in the paper:
- The Encoder block is a Bi-directional RNN.
- The Decoder is a combination of the ‘search and align’ model that considers all the hidden states from the source sequence and then picks up the most relevant vectors.
- x1, x2, · · · , xT are the source sentence tokens.
- Arrows pointing in both directions over the hidden states h1, h2, h3, · · · indicate the Bi-Directional nature of the RNN blocks.
- Each hidden state is then passed on to an “additive” function along with its corresponding weight.
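The bidirectional encoder can be sketched as two simple RNN passes whose states are concatenated, so that each source position j gets an annotation hj = [forward state ; backward state] summarizing the words both before and after it. The tanh cells and random weights below are simplifications, not the paper's exact GRU-based encoder:

```python
import numpy as np

# Bidirectional annotations: concat forward and backward RNN states.
rng = np.random.default_rng(2)
in_dim, hid_dim, Tx = 3, 4, 5
Wf = rng.standard_normal((hid_dim, in_dim + hid_dim)) * 0.1  # forward
Wb = rng.standard_normal((hid_dim, in_dim + hid_dim)) * 0.1  # backward
x = rng.standard_normal((Tx, in_dim))   # source sentence

def run_rnn(W, inputs):
    """Run a tanh RNN over the inputs, return all hidden states."""
    h, out = np.zeros(hid_dim), []
    for x_t in inputs:
        h = np.tanh(W @ np.concatenate([x_t, h]))
        out.append(h)
    return out

fwd = run_rnn(Wf, x)                  # reads x1 .. xTx
bwd = run_rnn(Wb, x[::-1])[::-1]      # reads xTx .. x1, then re-align
annotations = [np.concatenate([f_j, b_j]) for f_j, b_j in zip(fwd, bwd)]
print(len(annotations), annotations[0].shape)   # 5 positions, 8-dim
```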
Why Bahdanau Attention is known as “Additive Attention”?
Since the attention context vector is derived by a weighted addition (sum) of all the hidden states of the source sentence, Bahdanau attention is also known as “Additive Attention”.
The Math behind the Bahdanau Attention mechanism:
Now, let us look at it from the Mathematical point of view —
The Decoder block is responsible for predicting the next target word in a sequence, which is given by the conditional probability of the next prediction yi given the previous predictions y1, · · · , yi−1 and the input sentence x. With the context vector ci calculated by the additive attention mechanism, p(yi | y1, · · · , yi−1, x) is a function of the following —
- previous prediction → yi−1
- current RNN hidden state (or simply the current state) → si
- context vector derived by the additive attention mechanism → ci
Expressing the same mathematically,

p(yi | y1, · · · , yi−1, x) = g(yi−1, si, ci)

where si is an RNN hidden state and is computed by

si = f(si−1, yi−1, ci)
i.e., the probability is conditioned on a distinct context vector ci for each target word yi.
- The context vector ci depends on a sequence of annotations (h1, · · · , hTx) to which an encoder maps the input sentence.
- The context vector ci is calculated as a weighted sum of these annotations:

ci = Σj=1..Tx αij hj

- The weight αij of each annotation hj is computed by

αij = exp(eij) / Σk=1..Tx exp(eik), where eij = a(si−1, hj) is the score from the alignment model.
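These equations can be sketched end to end: a small feed-forward alignment model scores each annotation hj against the previous decoder state si−1 (a common additive form is eij = vᵀ tanh(W si−1 + U hj)), a softmax turns the scores into weights αij, and the context ci is the weighted sum of annotations. W, U, and v below are random placeholders for the learned parameters:

```python
import numpy as np

# Additive attention sketch: e_ij = v^T tanh(W s_{i-1} + U h_j).
rng = np.random.default_rng(3)
Tx, ann_dim, dec_dim, att_dim = 5, 8, 4, 6
H = rng.standard_normal((Tx, ann_dim))    # annotations h_1 .. h_Tx
s_prev = rng.standard_normal(dec_dim)     # previous decoder state
W = rng.standard_normal((att_dim, dec_dim)) * 0.1
U = rng.standard_normal((att_dim, ann_dim)) * 0.1
v = rng.standard_normal(att_dim)

# Alignment scores e_ij, one per source position j
e = np.array([v @ np.tanh(W @ s_prev + U @ h_j) for h_j in H])

# Softmax -> attention weights alpha_ij (non-negative, sum to 1)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Context vector c_i: weighted sum of the annotations
c_i = alpha @ H
print(alpha.sum(), c_i.shape)   # weights sum to 1; context is 8-dim
```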
Does this equation ring a bell? Yes, this is the same as the ‘Softmax’ activation widely used in multi-class classification problems.
Using a softmax for additive attention ensures the following –
- It helps to consider ‘all’ the hidden states for computing the context vector
- It acts as an alignment model that scores how well the input words around position j and the output at position i match. This score is based on the hidden states of the RNNs.
Training the Bahdanau Attention Model:
- The alignment model (this is how the authors initially referred to the attention model) is a parameterized Feed Forward Neural Network.
- It is jointly trained with all the other components of the system using the Backpropagation mechanism, i.e., the gradients of the cost function are used to update the weight vectors repeatedly until convergence.
- Hence, in addition to the RNN weights, even the Attention weights associated with each of the hidden states will also be ‘learned’.
- Weights will be updated in such a way that the words with maximum importance in translation will have maximum weight values, meaning more representation in the context vector.
We can understand the approach of taking a weighted sum of all the annotations as computing an expected annotation, where the expectation is over possible alignments.
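A tiny numeric check of this "expected annotation" view (made-up annotations and weights): with the weights α treated as probabilities over source positions, the weighted sum is exactly the expectation of the annotations.

```python
import numpy as np

# Context vector as an expectation over annotations.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])            # three 2-dim annotations
alpha = np.array([0.5, 0.25, 0.25])   # a probability distribution

context = alpha @ H                               # weighted sum
expectation = sum(p * h for p, h in zip(alpha, H))  # E[h] by hand
print(context.tolist())                           # [1.0, 0.75]
```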
How does the Attention Mechanism benefit the quality of Translation?
As explained before, the context vector ci is given as the weighted sum of all the hidden states (annotations) hj,

ci = Σj=1..Tx αij hj

where the weights αij are given by the softmax over the alignment scores. Let αij be the probability that the target word yi is aligned to, or translated from, a source word xj. Then the i-th context vector ci is the expected annotation over all the annotations with probabilities αij. The probability αij, or its associated energy eij, reflects the importance of the annotation hj with respect to the previous hidden state si−1 in deciding the next state si and generating yi.
Intuitively, this implements a mechanism of attention in the decoder.
The decoder learns to decide which parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, the encoder is freed from the burden of having to encode all the information in the source sentence into a fixed-length vector.
Hence, the attention mechanism enables the decoder to do a ‘Selective Retrieval’ from the source sentence, irrespective of the sequence length. So far, we have seen the math behind the additive attention mechanism and also the inner workings of both Encoder and Decoder blocks. Let us move ahead to understand and explain the model’s predictions (or translations) to the stakeholders. This is done using a visualization concept called ‘Attention Map’.
Visualizing the Attention Map:
The Attention map is the way to visualize which words in the source sentence are ‘attended to’ by the decoder. The more the attention between two words, the higher the cell value and vice versa. It is visually similar to a heatmap matrix, where related word pairs are highlighted in higher color gradients.
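A toy rendering of the idea (the source/target words and the weight matrix below are made up, not real model output): each row holds the softmax weights αij the decoder used at one target step, and shading the cells by weight reproduces the heatmap effect in plain text.

```python
import numpy as np

# Toy attention map: rows = target words, columns = source words.
src = ["the", "cat", "sat"]
tgt = ["le", "chat", "assis"]
alpha = np.array([[0.8, 0.1, 0.1],    # "le"    attends to "the"
                  [0.1, 0.8, 0.1],    # "chat"  attends to "cat"
                  [0.1, 0.1, 0.8]])   # "assis" attends to "sat"

shades = " .:*#"                      # darker character = more attention
for word, row in zip(tgt, alpha):
    cells = "".join(shades[int(w * (len(shades) - 1))] for w in row)
    print(f"{word:>6} |{cells}|  src: {' '.join(src)}")
```

In a real setting the α matrix comes from the trained model, and a library heatmap (one cell per word pair) replaces the character shading.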
Before we wind up…
Bahdanau Attention, also known as ‘Additive Attention’, is one of several types of Attention mechanisms, and there have been many advancements since the introduction of this paper.
Few useful links:
- Types of Attention mechanisms (Bahdanau & Luong Attention) → https://blog.floydhub.com/attention-mechanism/
- Transformers Original Paper — Attention is All you Need → https://arxiv.org/pdf/1706.03762.pdf
Thanks for your attention 🙂