Published by FirstAlign
In this blog we explore: What is tokenization? Why is it important? And how is tokenization achieved? We will go through various tokenization techniques, the issues identified with each technique, and how these problems are resolved.
So let’s begin.
What is Tokenization?
Tokenization is one of the basic Natural Language Processing (NLP) operations, and it is performed before applying any other NLP operation. Tokenization is the technique of splitting a sequence of text into meaningful units; each of these units is called a token. The problem with tokenization is choosing the right split, so that every token in the sequence carries semantic meaning and no part of the sequence is left out. This works differently for alphabet-based and symbol-based languages.
Let’s look at language constraints:
Raw Sequence of Text: “I love eating pizza and what about you.”
Tokenization: [‘I’, ‘love’, ‘eating’, ‘pizza’, ‘and’, ‘what’, ‘about’, ‘you’, ‘.’]
From the above example, it is clear that if we tokenize this way, each token has a clear meaning. For example, we know “pizza” is a noun and a type of food, “eating” is a verb, and “and” is a conjunction; joining all the tokens together, we can derive the clear meaning of the sentence.
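A minimal Python sketch of such a split (using a simple regex, purely for illustration, not a production tokenizer) might look like this:

```python
import re

def simple_tokenize(text):
    # Split into runs of word characters, keeping punctuation as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I love eating pizza and what about you.")
print(tokens)
# ['I', 'love', 'eating', 'pizza', 'and', 'what', 'about', 'you', '.']
```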
In the real world, data is not as clean as shown here. A sequence may contain ambiguous words and punctuation, which can create problems for tokenization. So let’s discuss the various tokenization techniques and see how the problems with each technique are solved.
The three different levels of tokenization we are going to cover are as follows:
- Word Level Tokenization
- Character Level Tokenization
- Subword Level Tokenization
Word Level Tokenization
In this type of tokenization, sentences are split into words based on white space and punctuation. One problem with this approach is the splitting of two words, separated by a space, that were supposed to be a single unit. For example, the name “John Doe” is a single unit, but word level tokenization splits it into two units, “John” and “Doe”.
Another problem arises when splitting on punctuation: a word like “Don’t” splits into “Don” and “t”, which of course doesn’t make sense. Because words are tokenized this way, the size of the vocabulary increases drastically. To illustrate, if a document contains 100 sentences, each containing 10 distinct words, the vocabulary grows to 1000 entries. So, when we are working with huge NLP datasets, word level tokenization creates a huge vocabulary.
Whenever we deal with word level tokenization, we face certain problems:
- For alphabet-based languages such as English, a huge vocabulary is required because the model only recognizes whole words. For example, if the word “Test” is added to the vocabulary, the model will still not recognize the word “Tested” as related, so for a model to perform well, we need to provide a huge amount of additional text data.
- Not all languages separate words with spaces. For symbol-based languages such as Chinese, word level tokenization is not a good option.
- Compound words, such as a name containing a first and last name, must be considered a single unit, but they are split into pieces, which is not correct and causes problems when feeding such data to models.
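The problems listed above can be reproduced with a naive word level tokenizer in Python (a simple regex sketch, again just for illustration):

```python
import re

def word_tokenize(text):
    # Naive word level tokenization: split on whitespace and punctuation.
    return re.findall(r"\w+|[^\w\s]", text)

# The contraction is broken apart, and the name becomes two tokens.
print(word_tokenize("Don't"))     # ['Don', "'", 't']
print(word_tokenize("John Doe"))  # ['John', 'Doe']
```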
Character Level Tokenization
In this type of tokenization, tokens are single characters rather than words, so the whole sequence of text is represented by character tokens. The vocabulary here is very small: for English, 26 letters plus special characters. The problem with this type of tokenization is that it doesn’t strictly fall under the definition above, because individual characters carry no semantic meaning. Even so, this type of tokenization is widely used because of the results it produces; refer to Leet et al. (2017), “A character-based convolutional neural network for language-agnostic Twitter sentiment analysis”.
Character level tokenization solves the problem of vocabulary size: with only 26 characters plus some additional special characters, the vocabulary stays small. This type of tokenization, however, has problems of its own:
- In this type of tokenization, a sequence is split into characters, which cannot pass the test of semantic meaning, i.e. they don’t have any meaning of their own. This type of tokenization lacks context and understanding.
- Because all words are split into characters, the tokenized array or list becomes huge, which increases the computational cost for the model.
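The trade-off above is easy to see in a short Python sketch: the vocabulary shrinks to single characters, but the token sequence gets much longer than the word-level one.

```python
def char_tokenize(text):
    # Character level tokenization: every character, including spaces,
    # becomes its own token.
    return list(text)

print(char_tokenize("pizza"))
# ['p', 'i', 'z', 'z', 'a']

# A short four-word sentence already produces 19 tokens.
print(len(char_tokenize("I love eating pizza")))  # 19
```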
Subword Level Tokenization
In this type of tokenization, only the rarest words are split into subwords; common words are kept whole. If a word in a sentence is considered rare, it is tokenized as shown in the example: “Unfriendly” is considered a rare word, so it splits into three tokens, “Un”, “friend”, and “ly”, all having semantic meaning. This level of tokenization is used in the Transformer architecture, currently considered the state of the art in Deep Learning.
In a nutshell, what we want from a tokenization technique is the ability to represent an effectively unlimited set of words from a finite set of units, without splitting in a way that loses meaning. Otherwise, tokenization becomes computationally hard to perform.
In sub-word level tokenization, word pieces are reused to form new words, increasing the number of words that can be represented without increasing the size of the vocabulary. Rather than storing every word separately, we save a word in the vocabulary, find its possible splits, and save those to the vocabulary as well.
For example, take the word ‘anywhere’: it is stored and reused as “any” as well as “where”, making this technique efficient.
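As a rough illustration of this idea (not how Transformer tokenizers are actually trained), here is a minimal greedy longest-match segmenter in Python; the toy vocabulary and the function name are assumptions made for the example.

```python
def subword_tokenize(word, vocab):
    # Greedy longest-match-first segmentation into subword units.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No known piece starts here: emit the character as its own token.
            tokens.append(word[i])
            i += 1
    return tokens

vocab = {"un", "friend", "ly", "any", "where"}
print(subword_tokenize("unfriendly", vocab))  # ['un', 'friend', 'ly']
print(subword_tokenize("anywhere", vocab))    # ['any', 'where']
```

Real subword tokenizers such as Byte-Pair Encoding learn which pieces to keep from corpus statistics; the greedy lookup above only demonstrates how stored pieces get reused across words.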
In this blog post, we discussed tokenization and its three levels: word, character, and sub-word level. We discussed how each of them works, the problems associated with each level, and how the next level resolves those problems.
It is clear that subword level tokenization solves the issues that arise in word and character level tokenization, and for better results subword level tokenization should be used.
Hope you enjoyed the blog post; stay tuned for the next one.
Happy Coding ❤