[Paper Xplained] Neural Machine Translation using Bahdanau Attention

This post was originally published by Kovendhan Venugopal at Medium [AI]

Paper: NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE, 2014

Importance of the Paper:

Source: Original Paper: https://arxiv.org/pdf/1409.0473.pdf

This is the paper that introduced the now-famous “Attention Mechanism” back in 2014. Although the concept of attention has advanced considerably since then, the mechanism introduced in this paper is still known as “Bahdanau Attention” or “Additive Attention”.

Paper Link: https://arxiv.org/pdf/1409.0473.pdf

Gist from the Paper:

  • Before this paper, machine translation systems were typically built from multiple components or networks, each trained individually.
  • This paper proposed to build and train a single, large neural network that reads a sentence and outputs a correct translation. This is the basis for all Sequence to Sequence models using the Encoder-Decoder architecture in use today.
  • Machine Translation, from a probabilistic perspective, is equivalent to finding a target sentence y that maximizes the conditional probability p(y|x), where x is the source sentence.
  • The objective of an NMT task: maximize the conditional probability of sentence pairs using a parallel training corpus. A parameterized model is fitted to this relationship, and Backpropagation is used to learn the parameter weights (see the formulation just after this list).
  • NMT tasks make use of an Encoder-Decoder model (introduced by Sutskever et al., 2014; Cho et al., 2014a). Both Encoder and Decoder components are Neural Networks.
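
Putting the bullets above together: at inference time the model searches for the translation ŷ = arg maxy p(y | x), and at training time it fits the parameters θ to a parallel corpus of sentence pairs {(xn, yn)}. The paper states this objective in words; written in the standard maximum-likelihood form, it is

maxθ (1/N) Σn log p(yn | xn; θ)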

Encoder-Decoder Architecture:

  • An Encoder reads the source sentence and compresses it into a single encoded vector; the Decoder then outputs the translation (target sentence) from this encoded vector.
  • The Encoder-Decoder system is jointly trained to maximize the conditional probability of a correct translation for a given source-target sentence pair (a minimal sketch of the encoding step follows the figure reference below).
Source: Seq2Seq paper — https://arxiv.org/abs/1409.3215
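
To make the “single fixed-length vector” idea concrete, here is a minimal NumPy sketch of the encoding step. This is not the paper’s exact model (the paper uses gated recurrent units and learned embeddings); the vanilla tanh RNN, the weight names and the sizes below are illustrative assumptions.

```python
import numpy as np

def encode(source_embeddings, W_x, W_h):
    """Run a vanilla RNN over the source tokens and return the last
    hidden state as the single fixed-length context vector."""
    h = np.zeros(W_h.shape[0])
    for x_t in source_embeddings:            # one embedding per source token
        h = np.tanh(W_x @ x_t + W_h @ h)     # h_t = f(x_t, h_{t-1})
    return h                                 # c = q(h_1..h_T), taken as h_T here

# Toy example: a 5-token sentence, 16-dim embeddings, 32-dim hidden state.
rng = np.random.default_rng(0)
source = rng.normal(size=(5, 16))
W_x = rng.normal(size=(32, 16)) * 0.1
W_h = rng.normal(size=(32, 32)) * 0.1
c = encode(source, W_x, W_h)
print(c.shape)    # (32,) -- the whole sentence squeezed into one vector
```

However long the source sentence is, everything the decoder will ever see of it has to fit into that one vector c, which is exactly the limitation discussed next.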

Limitations with the Encoder-Decoder Architecture:

  • When the source sentence is long, it becomes very difficult for the Encoder to compress all of its information into a single fixed-length vector.
  • It has been shown empirically that the performance of a basic encoder-decoder deteriorates rapidly as the length of the source sentence increases (Cho et al., 2014b).

So, what does the paper propose to overcome these limitations?

Learn to Align and Translate jointly:

  • Whenever the NMT model generates a translated word, it performs a soft search over positions in the source sentence, looking for the positions where the most relevant information is concentrated. This is similar to picking out the source words that matter most for the word being translated.
  • This is in contrast to encoding the entire source sentence into a single fixed-length context vector.
  • The NMT model then predicts the target word based on the context vector associated with these source positions and on the previously generated target words.
  • How is it different from the earlier method? This method encodes the source sentence into a sequence of vectors, and the decoder adaptively picks a subset of these vectors while producing each word of the translation.
  • So what is the benefit? The NMT model no longer has to squash all the information into a single vector; instead it can cope with long sentences and search selectively based on the importance of the context.

The Math behind the Encoder-Decoder Framework:

The Encoder reads the input sentence x = (x1, · · · , xTx) with a recurrent network and summarises it into a context vector:

ht = f(xt, ht−1)  and  c = q({h1, · · · , hTx}),

where,

  • h is the hidden state,
  • c is the context vector generated from the sequence of hidden states
  • f, q are nonlinear functions

The Decoder is then trained to predict the next word yt given the context vector c and all the previously predicted words {y1, y2, · · · , yt−1}.

This is nothing but the maximum likelihood estimation of the prediction yt given the previous outputs and the context vector c. The probability of the whole translation p(y) is then the product of these conditionals,

p(y) = ∏t p(yt | {y1, · · · , yt−1}, c)

Using a Recurrent Neural Network, each of these conditional probabilities is modeled as

p(yt | {y1, · · · , yt−1}, c) = g(yt−1, st, c),

where,

  • g is a nonlinear, potentially multi-layered, function that outputs the probability of yt,
  • st is the hidden state of the RNN (a small sketch of one decoder step follows these definitions).
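
As a small illustration of these definitions, here is a NumPy sketch of one decoder step. The particular choices for f and g (a tanh recurrence and a softmax over a concatenation) and all the sizes are illustrative assumptions, not the paper’s exact parameterization.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(y_prev_emb, s_prev, c, W_s, U_s, C_s, W_out):
    """One step of the basic decoder:
    s_t = f(s_{t-1}, y_{t-1}, c)  and  p(y_t | y_<t, c) = g(y_{t-1}, s_t, c)."""
    s_t = np.tanh(W_s @ s_prev + U_s @ y_prev_emb + C_s @ c)        # new decoder state
    p_y = softmax(W_out @ np.concatenate([y_prev_emb, s_t, c]))     # distribution over vocab
    return s_t, p_y

# Toy shapes: 16-dim embeddings, 32-dim states, 100-word vocabulary.
rng = np.random.default_rng(0)
y_prev, s_prev, c = rng.normal(size=16), rng.normal(size=32), rng.normal(size=32)
W_s, U_s, C_s = rng.normal(size=(32, 32)), rng.normal(size=(32, 16)), rng.normal(size=(32, 32))
W_out = rng.normal(size=(100, 16 + 32 + 32)) * 0.1
s_t, p_y = decoder_step(y_prev, s_prev, c, W_s, U_s, C_s, W_out)
print(round(p_y.sum(), 6))    # 1.0 -- a probability distribution over the next word
```

Note that the same context vector c is reused at every decoding step; the attention mechanism described next replaces it with a step-specific ci.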

Encoder-Decoder Architecture proposed in the paper:

  • The Decoder is combined with a ‘search and align’ model that considers all the hidden states of the source sequence and then picks out the most relevant vectors.
Source: Original Paper: https://arxiv.org/pdf/1409.0473.pdf

Here,

  • x1, x2, · · · , xT are the source sentence tokens.
  • The arrows pointing in both directions across the hidden states h1, h2, h3 indicate the Bi-Directional nature of the RNN blocks.
  • Each hidden state is then scored by the “additive” alignment function and combined, with the resulting weights αt1, αt2, · · · , αtT, into the context vector fed to the decoder (a short sketch of the bidirectional annotations follows this list).
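
Here is a small NumPy sketch of how the bidirectional annotations hj can be formed by concatenating a forward and a backward RNN pass over the source embeddings. The vanilla RNN cell and the sizes are illustrative assumptions; the paper uses gated units.

```python
import numpy as np

def rnn_states(embeddings, W_x, W_h):
    """Return the hidden state at every position of a vanilla RNN."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in embeddings:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)                               # (T, hidden)

def bidirectional_annotations(embeddings, fwd, bwd):
    """Annotation h_j = [forward state ; backward state], so each h_j
    summarises the sentence with emphasis on the words around x_j."""
    h_fwd = rnn_states(embeddings, *fwd)
    h_bwd = rnn_states(embeddings[::-1], *bwd)[::-1]      # read right-to-left
    return np.concatenate([h_fwd, h_bwd], axis=1)         # (T, 2 * hidden)

# Toy usage with made-up sizes: 5 tokens, 16-dim embeddings, 8-dim states.
rng = np.random.default_rng(0)
fwd = (rng.normal(size=(8, 16)) * 0.1, rng.normal(size=(8, 8)) * 0.1)
bwd = (rng.normal(size=(8, 16)) * 0.1, rng.normal(size=(8, 8)) * 0.1)
H = bidirectional_annotations(rng.normal(size=(5, 16)), fwd, bwd)
print(H.shape)    # (5, 16): one annotation h_j per source token
```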

Why is Bahdanau Attention known as “Additive Attention”?

Because its alignment model scores the previous decoder state si−1 against an encoder annotation hj by adding their linear transformations and passing the sum through a tanh, eij = vaᵀ tanh(Wa si−1 + Ua hj), rather than by taking a dot product between them, as later “multiplicative” attention variants do.

The Math behind the Bahdanau Attention mechanism:

The Decoder block is responsible for predicting the next target word in the sequence, which is given by the conditional probability of the next prediction yi given the previous predictions y1, · · · , yi−1 and the input sentence x.

With the context vector ci calculated by the additive attention mechanism, p(yi | y1, · · · , yi−1, x) is modeled as a function of the following:

  • previous prediction → yi-1
  • current RNN hidden state (or simply the current state) → si
  • context vector derived by additive attention mechanism → ci

Expressing the same mathematically,

p(yi | y1, · · · , yi−1, x) = g(yi−1, si, ci),

where si is the RNN hidden state for step i, computed by

si = f(si−1, yi−1, ci)

That is, the probability is conditioned on a distinct context vector ci for each target word yi.

  • The context vector ci depends on a sequence of annotations (h1, · · · , hTx ) to which an encoder maps the input sentence.
  • Then, the context vector ci is calculated as a weighted sum of these annotations hj: ci = Σj αij hj

The weight αij of each annotation hj is computed as a softmax over the alignment scores eij = a(si−1, hj),

αij = exp(eij) / Σk exp(eik)

Does this equation ring a bell? Yes, it is the same as the ‘Softmax’ activation widely used in multi-class classification problems.
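
As a quick numeric illustration with made-up scores: suppose the alignment energies for three source positions are ei1 = 2.0, ei2 = 0.5 and ei3 = −1.0. The softmax turns them into weights that sum to one,

αi1 = e^2.0 / (e^2.0 + e^0.5 + e^−1.0) ≈ 0.786,  αi2 ≈ 0.175,  αi3 ≈ 0.039,

so the first source position would dominate the context vector for this target word.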

Using a softmax for additive attention ensures the following –

  • It ensures that ‘all’ the hidden states are considered when computing the context vector ci, with weights that sum to one.
  • It normalizes the scores of an alignment model that measures how well the input words around position j and the output at position i match; these scores are based on the hidden states of the RNNs (a code sketch of the attention computation follows this list).
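
Putting the pieces together, here is a compact NumPy sketch of additive attention for a single decoder position: score each annotation with vaᵀ tanh(Wa si−1 + Ua hj), normalize the scores with a softmax, and take the weighted sum as the context vector ci. All names and sizes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def additive_attention(s_prev, annotations, W_a, U_a, v_a):
    """Bahdanau / additive attention for one decoder step.
    s_prev      : previous decoder state s_{i-1}, shape (n,)
    annotations : encoder annotations h_1..h_Tx, shape (Tx, 2n)
    Returns the alignment weights alpha_i and the context vector c_i."""
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)  -- the "additive" score
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j)
                       for h_j in annotations])
    alphas = softmax(scores)             # weights over source positions, sum to 1
    context = alphas @ annotations       # c_i = sum_j alpha_ij * h_j
    return alphas, context

# Toy example with made-up sizes: 8-dim decoder state, 5 source annotations.
rng = np.random.default_rng(0)
n, Tx = 8, 5
s_prev = rng.normal(size=n)
H = rng.normal(size=(Tx, 2 * n))
W_a = rng.normal(size=(n, n)) * 0.1
U_a = rng.normal(size=(n, 2 * n)) * 0.1
v_a = rng.normal(size=n) * 0.1
alphas, c_i = additive_attention(s_prev, H, W_a, U_a, v_a)
print(alphas.round(3), c_i.shape)    # weights over 5 source positions, (16,)
```

At each decoding step the same annotations are re-scored against the new decoder state, which is what lets the model shift its focus across the source sentence as the translation progresses.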

Training the Bahdanau Attention Model:

  • It is jointly trained with all the other components of the system using Backpropagation, i.e., the gradients of the cost function are used to update the weight vectors repeatedly until convergence.
  • Hence, in addition to the RNN weights, the parameters of the alignment model (and through them the attention weights associated with each hidden state) are also ‘learned’.
  • Weights are updated in such a way that the words with the most importance for the current translation receive the largest weight values, i.e., more representation in the context vector.

We can understand the approach of taking a weighted sum of all the annotations as computing an expected annotation, where the expectation is over possible alignments.

How does the Attention Mechanism benefit the quality of the Translation?

Recall that ci is the weighted sum of the annotations, where the weights αij are given by the softmax over alignment scores defined above.

Let αij be a probability that the target word yi is aligned to, or translated from, a source word xj.

Then, the i-th context vector ci is the expected annotation over all the annotations with probabilities αij.

The probability αij, or its associated energy eij, reflects the importance of the annotation hj with respect to the previous hidden state si−1 in deciding the next state si and generating yi.

Intuitively, this implements a mechanism of attention in the decoder.

The decoder learns to decide which parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, the encoder is freed from the burden of having to encode all the information in the source sentence into a fixed-length vector.

Hence, the attention mechanism enables the decoder to do a ‘Selective Retrieval’ from the source sentence, irrespective of the sequence length. So far, we have seen the math behind the additive attention mechanism and the inner workings of both the Encoder and Decoder blocks. Let us now move on to understanding and explaining the model’s predictions (or translations) to stakeholders. This is done using a visualization called an ‘Attention Map’.

Visualizing the Attention Map:

Source: Original Paper: https://arxiv.org/pdf/1409.0473.pdf. In the paper’s alignment plots, the source (English) words run along one axis and the generated (French) words along the other, and the brightness of each cell shows the attention weight αij.
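
As a rough idea of how such a map can be drawn, here is a small matplotlib sketch. The sentence pair mirrors the example in the paper’s figure, but the weight matrix below is random, made-up data standing in for a trained model’s αij values.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up attention weights: rows = generated (target) words, columns = source
# words; each row sums to 1, like the real alpha_ij would.
source = ["the", "agreement", "on", "the", "European", "Economic", "Area"]
target = ["l'", "accord", "sur", "la", "zone", "économique", "européenne"]
rng = np.random.default_rng(0)
weights = rng.random((len(target), len(source)))
weights /= weights.sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
ax.imshow(weights, cmap="gray")               # brighter cell = larger alpha_ij
ax.set_xticks(range(len(source)))
ax.set_xticklabels(source, rotation=90)
ax.set_yticks(range(len(target)))
ax.set_yticklabels(target)
ax.set_xlabel("source words (j)")
ax.set_ylabel("generated words (i)")
plt.tight_layout()
plt.show()
```

With a trained model, bright cells trace out the (soft) alignment between source and target words, which is exactly what the figures in the paper show.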

Before we wind up…


Thanks for your attention 🙂
