*This post was originally published by Kovendhan Venugopal at Medium [AI]*

## Paper: NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE, 2014

### Importance of the Paper:

The paper in discussion is **“Neural Machine Translation by Jointly Learning to Align and Translate”** by *Dzmitry Bahdanau, KyungHyun Cho & Yoshua Bengio*.

This is the paper that introduced the now-famous **“Attention Mechanism”** in the year 2014. Though there have been several advancements to the concept of Attention since, the mechanism introduced by this paper is still known as *“Bahdanau Attention”* or *“Additive Attention”*.

## Paper Link: https://arxiv.org/pdf/1409.0473.pdf

### Gist from the Paper:

- Neural Machine Translation (NMT) is the concept of using Neural Networks to translate sentences from a source language to a target language.
- Until this paper, such NMT models were typically built from multiple components, each trained individually.
- This paper proposed to build and train a single, large neural network that reads a sentence and outputs a correct translation. This is the basis for all Sequence to Sequence models using Encoder-Decoder architecture in use today.
- Machine Translation, from a probabilistic perspective, amounts to finding a target sentence `y` that maximizes the conditional probability `p(y|x)`, where `x` is the source sentence.
- *The objective of an NMT task:* maximize the conditional probability of sentence pairs using a parallel training corpus. A parameterized model is used to capture this relationship, and the model learns the parameter weights via backpropagation.
- NMT tasks make use of an Encoder-Decoder model (introduced by Sutskever et al., 2014; Cho et al., 2014a). Both the Encoder and Decoder components are neural networks.

### Encoder-Decoder Architecture:

- An Encoder takes in a source sentence and encodes it into a fixed-length vector.
- A Decoder outputs the translation (target sentence) from the Encoded Vector.
- The Encoder-Decoder system is jointly trained to maximize the conditional probability of a correct translation for a given source-target sentence pair.
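
The pipeline above can be sketched roughly as follows. This toy uses scalar states and hand-picked weights purely for illustration; the real model uses learned RNN weight matrices and vector-valued states.

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.3):
    """One recurrent step: new hidden state from the input and previous state."""
    return math.tanh(w_x * x + w_h * h)

def encode(source):
    """Encoder: compress the whole source sequence into one fixed-length state."""
    h = 0.0
    for x in source:
        h = rnn_step(x, h)
    return h  # the single fixed-length context "vector" (a scalar in this toy)

def decode(context, steps, w_c=0.7):
    """Decoder: generate outputs conditioned only on the final encoder state."""
    outputs, s = [], context
    for _ in range(steps):
        s = math.tanh(w_c * s)
        outputs.append(s)
    return outputs

c = encode([1.0, -0.5, 0.25])
translation = decode(c, steps=3)
```

Notice that `decode` sees only `c`: every detail of the source sentence must survive that single bottleneck.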

### Limitations with the Encoder-Decoder Architecture:

- The Decoder depends only on the final encoded fixed-length vector for information about the source sentence.
- When the source sentence is long, it becomes very difficult for the Encoder to compress all of its information into a single vector.
- It has been empirically shown that the performance of a basic encoder-decoder deteriorates rapidly as the length of the source sentence increases (Cho et al., 2014b).

So, what does the paper propose to overcome these limitations?

### Learn to Align and Translate jointly:

- The paper proposes an extension to the Encoder-Decoder model that learns to ‘align’ and ‘translate’ jointly.
- Whenever the NMT model generates a translated word, it soft-searches the source sentence for the set of positions where the most relevant information is concentrated. It is similar to picking out the words that matter most for the resulting translation.
- This is against the concept of encoding the entire source sentence into a single fixed-length context vector.
- Then, the NMT model predicts a target translation based on the context vectors associated with these source positions and from the previously generated translation outputs.
- How is it different from the earlier method? This method encodes the source sentence into a sequence of vectors and then the decoder picks up a subset of these vectors while outputting the translation.
- So what is the benefit? It spares the NMT model from squashing all the information into a single vector; instead, the model can understand long sentences and perform a selective search based on contextual importance.

### The Math behind the Encoder-Decoder Framework:

The Encoder processes the source sentence, a sequence of vectors `x = (x1, · · · , xTx)`, into a context vector `c`:

`ht = f(xt, ht−1)` and `c = q({h1, · · · , hTx})`

where `h` is the hidden state, `c` is the context vector generated from the sequence of hidden states, and `f` and `q` are nonlinear functions.

The Decoder is then trained to predict the next word `yt` given the context vector `c` and all the previously predicted words `{y1, y2, ..., yt−1}`. This is nothing but the maximum likelihood estimation of the prediction `yt` given the previous outputs and the context vector `c`. Then `p(y)` is given as

`p(y) = ∏t p(yt | {y1, · · · , yt−1}, c)`

Using a Recurrent Neural Network, each of the conditional probabilities is modeled as

`p(yt | {y1, · · · , yt−1}, c) = g(yt−1, st, c)`

where `g` is a nonlinear, potentially multi-layered function that outputs the probability of `yt`, and `st` is the hidden state of the RNN.
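
Numerically, this factorization of `p(y)` into per-step conditionals can be illustrated as follows (the per-step probabilities below are made-up values, not actual model outputs):

```python
import math

# Hypothetical per-step conditional probabilities p(yt | y<t, c)
step_probs = [0.9, 0.8, 0.95]

# The joint probability of the sequence is the product of the conditionals
p_y = math.prod(step_probs)

# Training maximizes log p(y), the sum of log conditionals (numerically stabler)
log_p_y = sum(math.log(p) for p in step_probs)

assert abs(math.exp(log_p_y) - p_y) < 1e-12
```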

## Encoder-Decoder Architecture proposed in the paper:

- The Encoder block is a Bi-directional RNN.
- The Decoder is a combination of the ‘search and align’ model that considers all the hidden states from the source sequence and then picks up the most relevant vectors.

Here,

- `X1, X2, …, XT` are the source sentence tokens.
- Arrows pointing in both directions of the hidden states `h1, h2, h3` indicate the bi-directional nature of the RNN blocks.
- Each hidden state is then passed on to an *“additive”* function along with the corresponding weights `αt1, αt2, …, αtT`.

*Why Bahdanau Attention is known as “Additive Attention”?*

Since the alignment score combines the decoder state and each encoder hidden state by *addition* inside a small feed-forward network (in contrast to later dot-product variants), and the context vector itself is a weighted sum over all the hidden states of the source sentence, Bahdanau attention is also known as “Additive Attention”.

## The Math behind the Bahdanau Attention mechanism:

Now, let us look at it from the Mathematical point of view —

The Decoder block is responsible for predicting the next target word in the sequence, which is given by the conditional probability of the next prediction `yi` given the previous predictions `y1, …, yi−1` and the input sentence `x`.

With the context vector `ci` calculated by the additive attention mechanism, `p(yi | y1, ..., yi−1, x)` is a function of the following:

- previous prediction → `yi−1`
- current RNN hidden state (or simply the current state) → `si`
- context vector derived by the additive attention mechanism → `ci`

Expressing the same mathematically,

`p(yi | y1, · · · , yi−1, x) = g(yi−1, si, ci)`

where `si` is an RNN hidden state and is computed by

`si = f(si−1, yi−1, ci)`

i.e., the probability is conditioned on a distinct context vector `ci` for each target word `yi`.

- The context vector `ci` depends on a sequence of annotations `(h1, · · · , hTx)` to which the encoder maps the input sentence.
- The context vector `ci` is then calculated as a weighted sum of these annotations `hj`:

`ci = Σj αij hj`

The weight `αij` of each annotation `hj` is computed by

`αij = exp(eij) / Σk exp(eik)`, where `eij = a(si−1, hj)` is the alignment score.

Does this equation ring a bell? Yes, this is the same as the ‘Softmax’ activation widely used in multi-class classification problems.

Using a softmax for additive attention ensures the following –

- It helps to consider ‘all’ the hidden states when computing the context vector `ci`.
- It acts as an alignment model that scores how well the input words around position `j` and the output at position `i` match. This score is based on the hidden states of the RNNs.
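
A minimal sketch of one additive-attention step follows. Scalar annotations and hand-picked weights stand in for the learned matrices of the paper, so this is purely illustrative, not the paper's implementation.

```python
import math

def energies(s_prev, hs, w=0.4, u=0.6, v=1.0):
    """Alignment energies eij = v * tanh(w * s_prev + u * hj) (scalar toy)."""
    return [v * math.tanh(w * s_prev + u * h) for h in hs]

def softmax(xs):
    """Normalize energies into attention weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention(s_prev, hs):
    """Context ci = sum_j alpha_ij * hj, with alpha = softmax over energies."""
    alphas = softmax(energies(s_prev, hs))
    ci = sum(a * h for a, h in zip(alphas, hs))
    return ci, alphas

hs = [0.9, -0.2, 0.5]            # annotations hj from the (bi-directional) encoder
ci, alphas = attention(0.1, hs)  # 0.1 plays the role of the previous decoder state
```

Because the weights come out of a softmax, `ci` is always a convex combination of the annotations: no source position is ever discarded outright, only down-weighted.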

### Training the Bahdanau Attention Model:

- The alignment model (this is how the authors initially referred to the attention model) is a parameterized Feed Forward Neural Network.
- It is jointly trained with all the other components of the system using the Backpropagation mechanism, i.e, the gradients of the cost function will be used to update the weight vectors repeatedly until convergence.
- Hence, in addition to the RNN weights, even the Attention weights associated with each of the hidden states will also be ‘learned’.
- Weights will be updated in such a way that the words with maximum importance in translation will have maximum weight values, meaning more representation in the context vector.

We can understand the approach of taking a weighted sum of all the annotations as computing an expected annotation, where the expectation is over possible alignments.
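
Concretely, with made-up alignment probabilities and scalar annotations, the context vector is just such an expectation:

```python
# Hypothetical alignment probabilities alpha_ij over three source positions
alphas = [0.2, 0.5, 0.3]
# Hypothetical annotations hj (scalars for clarity; vectors in the real model)
hs = [1.0, -1.0, 0.5]

# Context vector = expected annotation E[h] under the alignment distribution
ci = sum(a * h for a, h in zip(alphas, hs))
# 0.2*1.0 + 0.5*(-1.0) + 0.3*0.5 = -0.15
```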

## How does the Attention Mechanism benefit the quality of Translation?

As explained before, the context vector `ci` is given as the weighted sum of all hidden states `hj`:

`ci = Σj αij hj`

where the weights `αij` are given as

`αij = exp(eij) / Σk exp(eik)`

Let `αij` be the probability that the target word `yi` is aligned to, or translated from, a source word `xj`. Then the `i`-th context vector `ci` is the expected annotation over all the annotations with probabilities `αij`.

The probability `αij`, or its associated energy `eij`, reflects the importance of the annotation `hj` with respect to the previous hidden state `si−1` in deciding the next state `si` and generating `yi`.

*Intuitively, this implements a mechanism of attention in the decoder.*

The decoder learns to decide which parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, the encoder is freed from the burden of having to encode all the information in the source sentence into a fixed-length vector.

Hence, the attention mechanism enables the decoder to do a ‘Selective Retrieval’ from the source sentence, irrespective of the sequence length. So far, we have seen the math behind the additive attention mechanism and the inner workings of both the Encoder and Decoder blocks. Let us move on to understanding and explaining the model’s predictions (or translations) to stakeholders. This is done using a visualization concept called the ‘**Attention Map**’.

## Visualizing the Attention Map:

The Attention map is the way to visualize which words in the source sentence are ‘attended to’ by the decoder. The more the attention between two words, the higher the cell value and vice versa. It is visually similar to a heatmap matrix, where related word pairs are highlighted in higher color gradients.
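
With a matrix of attention weights in hand (the weights and word pairs below are invented for illustration), the map can be rendered with any heatmap tool; here is a dependency-free text version of the same idea:

```python
# Invented attention weights: rows = target words, columns = source words
src = ["the", "cat", "sat"]
tgt = ["le", "chat", "assis"]
attn = [
    [0.85, 0.10, 0.05],   # "le"    attends mostly to "the"
    [0.05, 0.80, 0.15],   # "chat"  attends mostly to "cat"
    [0.10, 0.15, 0.75],   # "assis" attends mostly to "sat"
]

shades = " .:*#"          # heavier attention -> darker character
print("      " + " ".join(f"{w:>5}" for w in src))
for word, row in zip(tgt, attn):
    # Map each weight onto a shade character, clamped to the darkest shade
    cells = " ".join(f"{shades[min(int(a * len(shades)), 4)]:>5}" for a in row)
    print(f"{word:>5} " + cells)
```

Strongly aligned word pairs light up along the diagonal here, exactly the pattern the paper's attention-map figures show for mostly monotonic translations.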

## Before we wind up…

Bahdanau Attention, also known as ‘Additive Attention’, is one of several types of Attention mechanisms, and many advancements have happened since the introduction of this paper.

### Few useful links:

- Types of Attention mechanisms (Bahdanau & Luong Attention) → https://blog.floydhub.com/attention-mechanism/
- Transformers Original Paper — Attention is All you Need → https://arxiv.org/pdf/1706.03762.pdf

*Thanks for your attention 🙂*
