Published by FirstAlign
Clickbait refers to misleading links whose only purpose is to get clicked at any cost. The creators of clickbait choose popular thumbnails or phrases from news or current events to give the impression that the link leads to content of interest, when it actually leads to something else.
The problem we discuss here is clickbait in the form of news headlines: we see a headline or a news-related thumbnail and click on it expecting to read that story, only to find we have been baited into following a misleading link. Clickbait spoils the experience of social media and raises doubts about the authenticity of content. We are looking to solve this problem using Natural Language Processing (NLP).
What is NLP?
Natural Language Processing (NLP) is a branch of Artificial Intelligence concerned with the interaction between computers and humans, allowing the computer to make sense of human language. NLP has many use cases, one of which is text classification.
Now that we understand what NLP is, let’s look at why and how it helps us tell clickbait apart from a genuine headline.
Why use NLP?
As already mentioned, NLP allows a computer to understand human-readable text. We use it here so that the computer can read a headline and categorize it as either news or clickbait. NLP does this by extracting features from the text and using them to capture its context.
NLP offers a vast range of techniques that could be applied in our case. To pick one, we need to ask: how is NLP used?
How is NLP used?
To detect what is clickbait and what is real, NLP performs text classification with the help of a labeled dataset. This is one of the fundamental tasks in NLP. We outline it in more detail below.
What is text classification?
Text classification is one of the basic tasks in NLP. It starts from a labeled dataset of texts, where each text has a label giving its class. We use these texts and labels to train a text classification model. For example, email systems use text classification to decide whether a message is genuine or spam. To fulfill our purpose we therefore need a labeled dataset containing texts and labels, so let’s discuss which dataset we used.
Which dataset is used?
The dataset we are using is from Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media. It contains 2 columns: headline, which holds the text, and clickbait, which holds either 0 or 1. A 1 signifies clickbait and a 0 signifies a real headline. The dataset has 32,000 rows, and the label column is well balanced, with 50% of the rows in each class.
With the dataset in hand, it’s time to perform some pre-processing and NLP operations to make our data efficient to use. The following operations were performed on the dataset. We perform all NLP operations using the nltk Python package.
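As a rough illustration of the two-column layout and the balance check, here is a minimal sketch using only the Python standard library. The four rows are invented for the example and are not taken from the real dataset; only the column names headline and clickbait come from the description above.

```python
import csv
import io
from collections import Counter

# Invented sample rows, only to illustrate the headline/clickbait layout.
sample_csv = """headline,clickbait
"You Won't Believe What Happened Next",1
"Parliament passes annual budget",0
"21 Photos That Will Make You Smile",1
"Central bank holds interest rates steady",0
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Count how many rows carry each label to check the class balance.
label_counts = Counter(int(row["clickbait"]) for row in rows)
for label, count in sorted(label_counts.items()):
    print(f"label {label}: {count} rows ({count / len(rows):.0%})")
```

On the real file, the same count would be expected to show the two labels split evenly.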
Pre-processing and NLP operations
- Remove everything from the text except letters: In this step we take the text and remove all symbols, numbers, special characters, etc. until only alphabetic characters remain. We do this with the help of Python’s re module.
- Stopword Removal: This is the process of removing stop words from the text. Stopwords are high-frequency words that contribute almost nothing to the model. We remove them to save computation time and to protect the model from the adverse effects they may cause. We do this with the help of the stopwords corpus from nltk.
- Stemming: This is the process of reducing a word to its ‘root’ form, so that different forms of the same word are treated as one and the ambiguity between them is removed. We use PorterStemmer from nltk to perform stemming.
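The three steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code behind this post: the function name clean_headline is hypothetical, and the short STOPWORDS set is a stand-in so the sketch runs without downloading anything. The full nltk list is available through nltk.corpus.stopwords.words("english") once nltk.download("stopwords") has been run.

```python
import re

from nltk.stem import PorterStemmer

# Assumption: a tiny stand-in stopword list so the sketch needs no downloads.
# Swap in nltk.corpus.stopwords.words("english") for the full corpus.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "you", "this", "that"}

stemmer = PorterStemmer()


def clean_headline(text):
    """Keep letters only, drop stopwords, and stem what is left."""
    # Step 1: replace anything that is not a letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    # Steps 2 and 3: filter out stopwords, then stem the remaining words.
    kept = [stemmer.stem(token) for token in tokens if token not in STOPWORDS]
    return " ".join(kept)


print(clean_headline("The 7 cats are running!"))
```

Applying clean_headline to every row of the headline column would give the cleaned text used in the next step.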
The data has now been processed so that it can be used efficiently. To create a model, we need to convert this data into a vocabulary, or matrix of features, so that the human-readable text is turned into a format the computer can use for decision making.
We know Machine Learning cannot be applied directly to raw text. We need to create a vector or matrix of numbers. In this case we create a matrix with one column per word and assign a score or weight to each word. There are different ways to score the words; in this case we used TF-IDF, which is the product of two components.
- TF (Term Frequency): The number of times a word appears in a document divided by the total number of words in that document.
- IDF (Inverse Document Frequency): The log of the total number of documents divided by the number of documents in which a particular word occurs.
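To make the two definitions concrete, here is a small pure-Python sketch that computes the score exactly as described above, as TF multiplied by IDF. The function name tf_idf and the two toy documents are made up for illustration. Library implementations such as sklearn's TfidfVectorizer use a smoothed variant of IDF by default, so their numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per document, using plain TF * IDF."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Toy, already-cleaned documents (invented for this example).
for doc_scores in tf_idf(["cat run", "cat sleep"]):
    print(doc_scores)
```

A word that occurs in every document, like "cat" here, gets an IDF of log(1) = 0 and so carries no weight, while rarer words score higher.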
After creating both features and labels, it’s time to split the data into training and test sets using train_test_split from sklearn. We are then all set to perform the classification, which we did using boosting ensemble classification techniques.
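A minimal sketch of this step might look like the following. The use of sklearn's TfidfVectorizer to build the features is an assumption, since the post only names train_test_split. The headlines, the 25% test size and the random_state value are likewise invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented, already-cleaned headlines (1 = clickbait, 0 = real headline).
headlines = [
    "reason cat secretli run hous",
    "parliament pass annual budget",
    "photo make smile today",
    "central bank hold interest rate",
    "thing onli nineti kid rememb",
    "flood close road northern region",
    "guess celebr childhood pictur",
    "court rule data privaci case",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Assumption: TF-IDF features built with sklearn's TfidfVectorizer.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(headlines)

# Hold out 25% of the rows for testing, keeping the label balance.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels
)
print(X_train.shape, X_test.shape)
```

Passing stratify=labels keeps the share of clickbait and real headlines the same in both sets, which suits a balanced dataset like this one.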
Apply Boosting Algorithms
Normally a Machine Learning workflow settles on one model, the one with the most accurate predictions. Sometimes boosting is applied instead. In boosting, many weak learners are created in sequence, and each one improves on the weaknesses of the previous learner, creating an ongoing learning effect. What counts as the weakness of a model varies, so boosting is a general term. Below are two boosting algorithms that define the weakness of a model in different ways:
- Apply and Evaluate AdaBoost: AdaBoost trains a chain of weak learners, targeting iterative improvement. In each iteration it evaluates the error rate of the current weak learner and looks at the errors, i.e. the points that were classified incorrectly. These points are given higher weights so that the next iteration focuses more on them. Each learner also receives an alpha value based on its error rate: the lower the error, the higher the alpha. The final prediction is a weighted vote of all learners, in which learners with a higher alpha have more say. Here we have used AdaBoostClassifier from sklearn. For our use case AdaBoost gave an accuracy of 83.5%.
- Apply and Evaluate Gradient Boosting: Here we also train a sequence of weak learners, but the weakness is defined by the residual errors: each new learner is fitted to the errors (the gradient of the loss) left by the learners before it. Here we have used GradientBoostingClassifier from sklearn. For our dataset we got an accuracy of 84.25% using Gradient Boosting.
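For readers who want to try both classifiers, here is a self-contained sketch on a handful of invented headlines. Only AdaBoostClassifier, GradientBoostingClassifier and train_test_split are named in this post. The TfidfVectorizer, the hyperparameters and the toy data are assumptions, so the accuracies it prints say nothing about the figures reported above.

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Invented, already-cleaned headlines (1 = clickbait, 0 = real headline).
headlines = [
    "reason cat secretli run hous", "parliament pass annual budget",
    "photo make smile today", "central bank hold interest rate",
    "thing onli nineti kid rememb", "flood close road northern region",
    "guess celebr childhood pictur", "court rule data privaci case",
    "trick doctor hate", "citi council approv transport plan",
    "quiz reveal true person", "export rise third quarter",
]
labels = [1, 0] * 6

features = TfidfVectorizer().fit_transform(headlines)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)

# Hyperparameters are illustrative, not the ones used for the reported results.
models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {accuracies[name]:.2%}")
```

With only twelve toy rows the printed numbers are essentially noise. The point is the shape of the workflow: fit each booster on the training split, then score it on the held-out split.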
Our aim with this blog was to classify clickbait and identify real information. To solve this problem we sought help from Natural Language Processing. We used text classification for this purpose and performed some NLP operations to clean the dataset. We then applied TF-IDF to create the vocabulary, which enabled us to train on our data with Machine Learning. Finally, we applied an ensemble classification technique known as boosting.
We used two boosting algorithms, AdaBoost and Gradient Boosting, to create classifiers and test these models. AdaBoost achieved an accuracy of 83.5%, while Gradient Boosting performed better with an accuracy of 84.25%.
I hope you have enjoyed this blog. The code is available on GitHub. Until next time, happy coding ❤.