This post was originally published by Jiahui Wang at Towards Data Science
In this post, I will summarize what I learned from Natural Language Processing with Deep Learning, offered by Stanford University, including the Winter 2017 video lectures and the Winter 2019 lecture series, both taught by Prof. Christopher Manning. The topics on deep learning NLP theory covered in the two lecture series include:
- Word Representation
- NLP Neural Networks
- Attention and Transformer
- Recent NLP Deep Learning Models
1. Word Representation
Deep learning models rely on numerical vectors to ‘understand’ the input words. We can think of these numerical vectors as high-dimensional features representing the input words. In this high-dimensional space, semantically similar words sit close together, while unrelated words lie far apart.
Word representation is built by finding proper numerical vector representations for all the words in a given corpus. The quality of a word representation depends on the corpus. This is easy to understand: two people can have different understandings of the same word, depending on whether they spend their time reading modern newspapers or Shakespeare’s literature. The quality of a word representation also depends heavily on the method used to find the numerical vectors. There are several methods that generate word representations by learning from the words’ context.
You shall know a word by the company it keeps.
-John Rupert Firth
1.1 Count-based Method: Co-occurrence Matrix
Co-occurrence describes the appearance of two words within a certain distance, which is defined by the fixed window size. On a corpus level, the co-occurrence matrix captures the co-occurrence frequency of all pairs of unique words in the corpus. In this way, it provides a statistical description of the corpus. Each row in the co-occurrence matrix is a numerical vector representing a word.
However, to capture the co-occurrence frequency of all pairs of unique words in the corpus, this co-occurrence matrix can take up a huge amount of memory. Imagine you have a corpus with 10k unique words: the co-occurrence matrix is then 10k by 10k, and the dimension of each word vector is 10k! To reduce the dimension of the word vectors, Singular Value Decomposition (SVD) is usually applied.
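As a toy illustration (the corpus and window size below are my own choices, not from the lectures), here is how a co-occurrence matrix can be built and reduced with SVD in a few lines of NumPy:

```python
import numpy as np

# Toy corpus; window size 1 means only adjacent words co-occur.
corpus = [["i", "like", "nlp"], ["i", "like", "deep", "learning"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Build the symmetric co-occurrence matrix.
X = np.zeros((len(vocab), len(vocab)))
window = 1
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                X[idx[w], idx[sent[j]]] += 1

# Truncated SVD: keep the top-k singular directions as dense word vectors.
U, s, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * s[:k]   # each row is a k-dimensional word vector
```

Each row of `word_vectors` is now a dense 2-dimensional representation of a word, far smaller than the original co-occurrence rows (which in a real 10k-word corpus would be 10k-dimensional).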
1.2 Prediction-based Method: Word2vec
Apart from the count-based co-occurrence matrix method, which relies purely on statistics, word vectors can also be generated by prediction-based methods that use shallow feed-forward neural network models. Such prediction-based methods include word2vec, which can be implemented with the skip-gram or continuous bag-of-words (CBOW) algorithm. Both aim at predicting words within a fixed window size, but in opposite directions: skip-gram predicts the context words from the center word, while CBOW predicts the center word from the context words.
Skip-gram is trained using a shallow neural network, with a corpus of T unique words as input. Each word is initialized as a random d-dimensional vector. There are thus 2*T*d parameters to be optimized in this shallow neural network: the center-word matrix and the context-word matrix, each with T*d parameters. The network is trained by iteratively taking each word as the center word and maximizing the probability of the context words given the current center word.
Word2vec is available in Gensim and can be easily imported to generate prediction-based word vectors.
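To make the training loop concrete, here is a minimal full-softmax skip-gram sketch in NumPy, a toy version of what Gensim implements far more efficiently; the corpus and hyperparameters are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "i like deep learning i like nlp".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
T, d, window, lr = len(vocab), 8, 2, 0.05

# Two parameter matrices: center-word vectors V and context-word vectors U,
# i.e. the 2*T*d parameters mentioned above.
V = rng.normal(0, 0.1, (T, d))
U = rng.normal(0, 0.1, (T, d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Full-softmax skip-gram: for each (center, context) pair, increase
# p(context | center) by gradient descent on the negative log-likelihood.
for epoch in range(200):
    for i, w in enumerate(corpus):
        c = idx[w]
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i:
                continue
            o = idx[corpus[j]]
            p = softmax(U @ V[c])        # p(word | center), shape (T,)
            grad = p.copy()
            grad[o] -= 1.0               # dL/d(scores)
            dU = np.outer(grad, V[c])    # gradient for the context matrix
            dV = U.T @ grad              # gradient for the center vector
            U -= lr * dU
            V[c] -= lr * dV
```

After training, `V[idx[w]]` is the center-word vector for word `w`; words that share contexts, such as ‘i’ and its frequent neighbor ‘like’, should end up with a high predicted co-occurrence probability.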
1.3 Combination of Count-based and Prediction-based Methods: GloVe
On one hand, the co-occurrence matrix provides a statistical description of the co-occurrence of words at the corpus level. On the other hand, word2vec captures the ability to predict context words within a window size using a shallow neural network. GloVe is an architecture proposed to combine the two methods, so as to have both statistical power and local context prediction capability. Similar to word2vec, GloVe is also trained using a shallow neural network; the cost function is what differentiates GloVe from word2vec. The cost function of GloVe¹ includes Xᵢⱼ, the number of times that word j appears in the context of word i, and this term incorporates the statistical description of the corpus.
GloVe is also available as a Python library. The Stanford GloVe vectors, pre-trained on massive web datasets with corpora like Wikipedia and Twitter, can serve as ready-made word vectors; the model can also be trained on a new corpus.
2. NLP Neural Networks
Up to here, we have learned word representation, which turns words into numerical vectors that can be understood by the machine. Then, the next step is to figure out how to use these word representations to do an NLP task.
Many NLP tasks can be considered classification problems. A straightforward example is sentiment analysis, which can be as simple as classifying movie reviews as positive or negative. A less straightforward example is next word prediction. When I want to predict the next word following ‘I like eating…’, the next word can be any word in my corpus. Next word prediction works by calculating a probability for every unique word in the corpus and then picking the word with the maximum probability. Softmax is usually used to normalize the probabilities of multiple output classes.
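As a sketch, softmax simply exponentiates the score of each candidate word and normalizes, turning the scores into a probability distribution (the scores below are made up for illustration):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then exponentiate and normalize.
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical unnormalized scores over a 4-word vocabulary for the word
# following 'I like eating ...'.
scores = np.array([2.0, 1.0, 0.5, 3.0])
probs = softmax(scores)               # non-negative, sums to 1
next_word_id = int(np.argmax(probs))  # the highest-scoring word wins
```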
The problem with a plain softmax classifier is that it only applies a linear transformation to the input word vector before normalization. As a result, it can only produce a linear decision boundary for classification problems with multiple classes. For complex NLP tasks, a linear transformation of the word vectors is not enough. We need to introduce non-linearity before feeding the word vectors into the final softmax layer. A neural network is a solution that can be placed before the final softmax layer to introduce non-linearity into the model.
In this section, we will start from the fundamental neural network element, the neuron, and then discuss three basic neural network architectures for NLP tasks: the recurrent neural network (RNN), the convolutional neural network (CNN), and the tree recursive neural network (Tree-RNN).
2.1 Neurons
Neurons are the elementary computation units in a neural network. In a neuron, the input first undergoes a linear transformation, and the result is then passed through a logistic (sigmoid) function, which introduces non-linearity into the calculation. A neuron is thus essentially a binary logistic regression unit. A neural network consists of multiple layers, each of which consists of a bunch of neurons. We can think of each layer as several logistic regressions running at the same time, with their results fed into yet another logistic regression.
2.2 Recurrent Neural Network (RNN)
Oftentimes in NLP tasks, we are dealing with a sequence of words, which can be movie reviews with incomplete sentences or well-written research articles. Predicting the next word requires an overall understanding of the previous sequence of words. RNN is an architecture suitable for processing a sequence of data. Basically, at each time step, a word vector is fed into an RNN neuron. In addition to the input word vector at this time step, the neuron also receives a hidden state that carries information about the previous time steps. In this way, information about the word sequence is passed on through the network.
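A vanilla RNN step is small enough to write out directly; this NumPy sketch (dimensions chosen arbitrarily for illustration) shows how each word vector is combined with the previous hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3   # word-vector size and hidden-state size (illustrative)

# Vanilla RNN parameters, shared across all time steps.
W_xh = rng.normal(0, 0.1, (d_h, d_in))
W_hh = rng.normal(0, 0.1, (d_h, d_h))
b = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    # The new hidden state mixes the current word vector with the
    # previous hidden state, then squashes the result with tanh.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# Feed a sequence of 5 word vectors through the cell, one step at a time.
h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):
    h = rnn_step(h, x_t)   # h now summarizes the sequence so far
```

The same weight matrices are reused at every time step, which is what lets the network process sequences of arbitrary length.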
The RNN architecture described above is also called vanilla RNN. Vanilla RNN suffers from the vanishing gradient problem, which is caused by the long chain of parameter updates in the backpropagation process. Backpropagation aims at updating the neural network parameters to minimize the output error. However, due to the sequential nature of vanilla RNN, the gradients flowing back to distant time steps shrink toward zero, so the parameters associated with those earlier steps can hardly be updated. To overcome the vanishing gradient issue in vanilla RNN, two advanced RNN architectures have been proposed: gated recurrent units (GRU)² and long short-term memory (LSTM)³. Both GRU and LSTM solve the vanishing gradient issue by introducing some form of gating mechanism that controls how much information from the previous cell is passed on to the current cell.
2.3 Convolutional Neural Network (CNN)
RNN allows information to flow across time, enabling the comprehension of global semantics. By nature, this makes RNN computation slow, as each neuron must wait for the information flowing from the earlier neurons. As a faster alternative, CNN has been explored for NLP tasks that have a lower requirement for the comprehension of global semantics.
CNN is widely used in computer vision models. The basic idea of CNN is to parallelly apply multiple filters on the input data, where each filter extracts a certain feature. In computer vision tasks, these filters recognize local patterns across space. Similarly, when applied to NLP tasks, these filters can be trained to recognize local patterns across time.
As the window slides across the input, the filters work on the input word vectors within a given window size, where the window size is a hyperparameter to be tuned during CNN training. When a filter finds its specific pattern, it fires strongly and can be selected by the subsequent max-pooling layer. For example, in sentiment analysis, when ‘do not like’ is found, a filter with a window size of three will fire strongly. In this way, we can think of a filter with a window size of n as an n-gram pattern searcher.
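The window-sliding-plus-max-pooling idea can be sketched in NumPy as follows (sentence length, vector size, and filter count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d, n, n_filters = 7, 5, 3, 4   # 7 words, 3-word windows, 4 filters

X = rng.normal(size=(seq_len, d))             # the sentence as word vectors
filters = rng.normal(size=(n_filters, n, d))  # each filter spans n words

# Slide each filter over every length-n window of words, then max-pool
# over time so the strongest n-gram match per filter survives.
feature_maps = np.array([
    [np.sum(f * X[i:i + n]) for i in range(seq_len - n + 1)]
    for f in filters
])                                   # shape (n_filters, seq_len - n + 1)
pooled = feature_maps.max(axis=1)    # one value per filter
```

Each entry of `pooled` answers, for one filter, ‘did my n-gram pattern appear anywhere in the sentence, and how strongly?’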
3. Attention and Transformer
RNN and CNN are the elementary units widely used in NLP deep learning models. RNN is good at sequentially learning long-range semantics, while CNN is good at learning local semantics and can be implemented with parallel computing. Since both long-range semantics and parallel computing are important, is it possible to achieve both at the same time? In this section, I will discuss some recent advanced neural network architectures.
3.1 Attention: a similarity scoring mechanism to align a pool of encoder information based on the decoder query
Many people have heard of the words ‘attention’ and ‘transformer’ because of the BERT model. However, unlike the transformer architecture, which was only recently proposed in Google’s 2017 publication Attention Is All You Need⁴, the attention mechanism has been used in neural machine translation for a long time. According to Christopher Manning, neural machine translation is the approach of modeling the entire machine translation process via one big artificial neural network.
Conventionally, neural machine translation uses a sequence-to-sequence model with encoder and decoder, which are built using RNN units. In encoder-decoder architecture without an attention mechanism, the last neuron in the encoder sequence is the only unit connected to the decoder. However, this last neuron in the encoder sequence may not be able to carry over all the useful information in the input sequence. The attention mechanism can solve this issue.
The attention mechanism allows decoder neurons to pick up hidden states from a pool of information in the encoder. An attention layer takes two inputs, the encoder hidden states and the decoder hidden state, and returns normalized similarity scores. In this way, attention layers help align the encoder information with the given decoder query.
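A minimal NumPy sketch of dot-product attention, one common scoring choice (the dimensions are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_h, src_len = 4, 6

enc_states = rng.normal(size=(src_len, d_h))  # one hidden state per source word
dec_state = rng.normal(size=d_h)              # the decoder query

# Score each encoder state against the decoder state, normalize the
# scores with softmax, then take the weighted sum as the context vector.
scores = enc_states @ dec_state
weights = softmax(scores)        # normalized similarity scores
context = weights @ enc_states   # what the decoder "attends" to
```

The resulting `context` vector is what the decoder consumes, instead of relying solely on the last encoder hidden state.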
3.2 Self-attention: sequence processing with parallel computing capability
RNN can carry long-range semantic information, but the computing is slow due to the sequential processing nature. In contrast, CNN works as local filters and can be easily implemented in parallel computing, but at the same time lacks long-range semantic information. Self-attention is a solution to get global comprehension of the input in a parallel computing manner.
In self-attention, every input word is first encoded into three vectors: a query q, a key k, and a value v. An output vector is calculated at each position. To get an output vector, attention scores are computed between the q at the output position and the k vectors at all positions. The output vector is then the weighted sum of all the v vectors, with the attention scores as weights.
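The q/k/v computation above can be sketched for a whole sequence at once, which is exactly where the parallelism comes from (scaled dot-product attention, with illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 8

X = rng.normal(size=(seq_len, d_model))   # one row per input word

# Learned projections turn each input vector into its q, k, and v vectors.
W_q, W_k, W_v = (rng.normal(0, 0.1, (d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention over all positions at once: every output
# row is a weighted sum of the v vectors, weighted by q-k similarity.
scores = Q @ K.T / np.sqrt(d_k)                       # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)         # row-wise softmax
output = weights @ V                                  # (seq_len, d_k)
```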
I think of self-attention as a modified CNN structure. In CNN, a filter only applies to nearby input vectors within a certain window size, so CNN is good at capturing local features. Since this filtering process is purely a matrix operation, it can easily be implemented in parallel, with multiple filters working at the same time to find different features. In self-attention, this filtering process is modified so that each output position attends over the encoded vectors at all input positions, not just a local window. This modification lets self-attention inherit CNN’s advantages of matrix operations and parallel implementation, while adding the new advantage of comprehending global semantics.
3.3 Transformer: elementary units built on the self-attention mechanism
The transformer is an elementary deep learning unit proposed in 2017 and built on the self-attention mechanism. The transformer can replace RNN, taking sequence input and generating sequence output, while at the same time being amenable to parallel computation.
4. Recent NLP Deep Learning Models
Up to here, I have discussed word representation (co-occurrence matrix, word2vec, GloVe) and elementary deep learning unit architectures (RNN, CNN, transformer). In this section, we will go through recent advances in the field of NLP.
Transfer learning has been widely used in the field of computer vision (CV), where CV tasks, ranging from image classification to object detection, mostly use models pre-trained on large image datasets. However, transfer learning in NLP has only been practical since 2018. The past two years have witnessed the emergence of several pre-trained universal language models. We will go through some of the breakthroughs. Jay Alammar has also written a good article on recent NLP developments.
4.1 ELMo: the first contextual word embedding model
Embeddings from Language Models (ELMo) is the first contextual word embedding model, published in early 2018⁵. We discussed word representations in the first section of this post. However, all the word representation models before ELMo, including word2vec and GloVe, can only generate static word embeddings. By static word embeddings, we mean there is only one embedding per word type. For example, in ‘river bank’ and ‘bank account’, the word ‘bank’ has the same vector representation. Such a static word embedding cannot represent the meaning of the word well in a specific linguistic context.
Instead, ELMo generates contextual word embeddings using the internal states of two two-layer LSTMs, one forward and one backward. The weights that combine these internal states into a weighted sum, which becomes the contextual word embedding, are learned in each specific NLP task.
4.2 ULMFiT: exploration of NLP transfer learning
Following ELMo, Universal Language Model Fine-tuning (ULMFiT) was also published in early 2018⁷. Unlike ELMo’s task-specific combination of embeddings, ULMFiT keeps a regular multi-layer LSTM language model as its backbone; its main contribution is not a new architecture but a recipe for transfer learning in NLP.
ULMFiT aims at achieving a generalized language model that can be used in transfer learning. There are two phases in implementing ULMFiT: language model pre-training on a general-domain corpus, and fine-tuning on a specific NLP task. In this way, the pre-trained language model learns the general features of the language, and transfer learning helps to quickly adapt the model to different NLP tasks.
4.3 GPT: large-scale pre-trained NLP language model using forward transformer
In ULMFiT, the first phase, pre-training the language model on a general-domain corpus, is an unsupervised learning task. Unlabeled text corpora are abundant, while well-labeled text corpora are harder to find. The unsupervised nature of language model pre-training makes it possible for the model to learn from a large amount of available text. ULMFiT was pre-trained on a relatively small corpus to demonstrate the feasibility of a transfer learning language model. Not long after ULMFiT, Generative Pre-Training (GPT) was published by OpenAI in 2018⁸. GPT uses a forward (left-to-right) transformer and is trained on larger corpora with more parameters in the model.
In early 2019, OpenAI published GPT-2⁹, which is a much larger and more powerful pre-trained language model than GPT. It has 1.5B parameters and is pre-trained on 40GB of internet text.
4.4 BERT: large-scale pre-trained NLP language model using a bi-directional transformer
At the end of 2018, Google released Bidirectional Encoder Representations from Transformers (BERT)⁶. Similar to GPT, BERT is also a language model pre-trained on larger corpora with more parameters as compared to ULMFiT.
Since I am using BERT in this Kaggle TensorFlow 2.0 Question Answering competition, more discussion on BERT will be provided in the following posts of this series.
In this post, I summarized what I learned from the Natural Language Processing with Deep Learning courses. The post covered word representation and NLP neural network structures. I hope you find the summary useful!
1. Pennington, Jeffrey, Richard Socher, and Christopher Manning. “Glove: Global vectors for word representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. 2014.
2. Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. “Learning phrase representations using RNN encoder-decoder for statistical machine translation.” arXiv preprint arXiv:1406.1078 (2014).
3. Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.” Neural Computation 9, no. 8 (1997): 1735–1780.
4. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” In Advances in Neural Information Processing Systems, pp. 5998–6008. 2017.
5. Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. “Deep contextualized word representations.” arXiv preprint arXiv:1802.05365 (2018).
6. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
7. Howard, Jeremy, and Sebastian Ruder. “Universal language model fine-tuning for text classification.” arXiv preprint arXiv:1801.06146 (2018).
8. Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. “Improving language understanding by generative pre-training.” OpenAI (2018).
9. Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. “Language models are unsupervised multitask learners.” OpenAI Blog 1, no. 8 (2019).