Published by FirstAlign
Imagine you are a Human Resources department with multiple job openings. Your inbox is flooded with resumes that need review. One way to sort them is to search your email account by keyword; another is to download the emails and categorize them one by one. Both take a lot of effort, so to help HR we will demonstrate how to automate the process using Machine Learning.
In this blog we work through an example of this use case: we perform text classification on the content of each resume to determine which group that resume belongs to. If you are a Natural Language Processing (NLP) novice, or know little about text classification, check out one of my previous blogs, “Classification of hate speech and offensive language using machine learning”. It will give you the basics needed to better understand this approach.
In the meantime, let’s begin. The figure below shows all the steps needed to automate the process of resume classification.
Step 1: Resume Dataset
The dataset used is taken from Kaggle and can be found here. It contains two columns: 1) Category; and 2) Resume. The Category column contains the title of the job the resume was sent for, and the Resume column contains the content of the resume.
In the first step, we need a dataset. This enables us to train the model so that, in the future, it can make predictions based on that training. A snapshot of data from the dataset is shown below:
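As a rough sketch of how such a two-column dataset can be loaded and inspected with pandas: the rows below are made-up placeholders rather than records from the Kaggle file, and only the column names Category and Resume come from the description above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV: same two columns, invented rows.
csv_text = """Category,Resume
Data Science,"Skills: Python, pandas, machine learning"
HR,"Experience in recruitment and payroll"
Data Science,"Built regression models in Python"
"""

# With the real file, a path would be passed to read_csv instead of StringIO.
df = pd.read_csv(io.StringIO(csv_text))

# How many resumes fall under each job title.
print(df["Category"].value_counts())
```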
Step 2: Apply Regular Expression
The content of a resume contains a lot of information that is unnecessary for classification, for example, a contact number and email. We need a way to strip away all these unwanted data points to prevent bias. To achieve this we use a Regular Expression to strip out all numbers and special characters, so that we are left with alphabet characters only.
The figure below shows the Regular Expression used here.
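A minimal sketch of this kind of cleaning is given here. The exact pattern used in the original code is not reproduced in the text, so the pattern, the function name clean_resume_text and the lowercasing step are assumptions made for illustration.

```python
import re


def clean_resume_text(text: str) -> str:
    """Keep alphabetic characters only (illustrative helper, not the original code)."""
    # Replace every run of non-letter characters (digits, punctuation, symbols)
    # with a single space.
    letters_only = re.sub(r"[^a-zA-Z]+", " ", text)
    # Lowercasing is an extra assumption so that "Python" and "python" match later.
    return letters_only.lower().strip()


sample = "Contact: +1 (555) 010-9999, jane.doe@example.com. Skills: Python, SQL!"
print(clean_resume_text(sample))
```

Note that a letters-only pattern removes the digits of a phone number but leaves the alphabetic parts of an email address behind as separate words.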
Step 3: Apply Stemming
Once we are left with only words, we address the problem of ambiguity. For example, words such as “breaking”, “broke” and “break” all mean the same thing; the only difference is the tense used. We need a way to reduce all words to their root, in this case the word “break”. To achieve that we apply text stemming using nltk.
The figure below shows text stemming.
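Stemming with nltk can be sketched as follows. The text does not say which stemmer was used, so PorterStemmer is an assumption here, and stem_text is a hypothetical helper name.

```python
from nltk.stem import PorterStemmer

# Assumption: the Porter stemmer; nltk also ships other stemmers.
stemmer = PorterStemmer()


def stem_text(text: str) -> str:
    """Stem each whitespace-separated word and join the results back together."""
    return " ".join(stemmer.stem(word) for word in text.split())


print(stem_text("breaking breaks break"))
```

A suffix-stripping stemmer handles regular forms such as “breaking”. Irregular forms such as “broke” may be left unchanged, and a lemmatizer is the usual tool when those need to be merged as well.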
Step 4: Remove Stop words
In this step, we remove the stop words from the text. Stop words are high-frequency words that carry very little statistical weight in prediction, and their presence can bias the classification. ‘I’, ‘the’ and ‘a’ are examples of stop words.
The figure below shows the stop word removal.
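Stop word removal can be sketched as a simple filter. The small STOP_WORDS set below is hand-picked for illustration only; in practice a fuller list, such as the stopwords corpus in nltk, would normally be used.

```python
# Illustrative stop word list (an assumption, far shorter than a real one).
STOP_WORDS = {"i", "the", "a", "an", "and", "of", "in", "to", "is", "am"}


def remove_stop_words(text: str) -> str:
    """Drop every word that appears in STOP_WORDS, ignoring case."""
    kept = [word for word in text.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)


print(remove_stop_words("I am the lead developer of a data team"))
```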
Step 5: Apply Count Vectorizer
At this stage, we are left with raw text data. Machine Learning algorithms don’t work well on raw text, so we need to convert it into vectors of numbers. These vectors contain the relative importance (weighting) of each word, which is calculated from the number of times the word appears in the document.
For example, as shown in the image below, each row represents a document and each column a word. Each cell holds the number of times that word appears in that document. Put simply, take the following five raw documents:
the house had a tiny little mouse
the cat saw the mouse
the mouse ran away from the house
the cat finally ate the mouse
the end of the mouse story
In the second document, “ate” is not present, “cat”, “mouse” and “saw” each appear once, and “the” appears twice. So in this step, we have converted the data into a form that the Machine Learning algorithm can use effectively.
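The counts described above can be reproduced with CountVectorizer from sklearn. This is a sketch that assumes the raw text splits into five short documents and uses the default settings; count_in is a hypothetical helper added for readability.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

vectorizer = CountVectorizer()
# Rows are documents, columns are vocabulary words, cells are word counts.
counts = vectorizer.fit_transform(documents).toarray()


def count_in(doc_index: int, word: str) -> int:
    """Look up how often `word` occurs in the document at `doc_index`."""
    column = vectorizer.vocabulary_[word]
    return int(counts[doc_index, column])


# Counts for the second document, "the cat saw the mouse".
for word in ["ate", "cat", "mouse", "saw", "the"]:
    print(word, count_in(1, word))
```

With the default tokenizer, single-character tokens such as “a” are dropped from the vocabulary, so they never become columns.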
Step 6: Apply Classifier
It is now time to apply some Machine Learning. We are dealing with a text classification problem, i.e. we need to classify each resume based on its content and determine which class it belongs to. To achieve this we apply a Machine Learning algorithm to build a classification model that we can use for future predictions.
Here we have applied the Logistic Regression classifier using sklearn, a Python library for applying Machine Learning and statistical operations to data.
The figure below shows how the training data is passed through the algorithm to create a classifier.
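To make the training step concrete, here is a small sketch that chains a count vectorizer and Logistic Regression in sklearn. The six resumes and two categories are invented for illustration, and max_iter=1000 is an assumed setting, not one taken from the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (hypothetical, already cleaned as in the earlier steps).
resumes = [
    "python machine learning pandas statistics",
    "deep learning python tensorflow model",
    "statistics regression python data analysis",
    "recruitment payroll employee onboarding",
    "employee relations recruitment hiring policy",
    "payroll benefits hiring employee training",
]
categories = ["Data Science"] * 3 + ["HR"] * 3

# The pipeline vectorizes the text and then fits the classifier on the counts.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(resumes, categories)

print(model.predict(["python statistics model"]))
```

Wrapping both steps in one pipeline means new resumes are vectorized with the same vocabulary that was learned during training.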
Step 7: Classification Model
Once the classifier, or model, has been created, we can use it to make predictions. Here we have held back a 20% sample of the dataset that the classifier has not seen during training, and we run the model on that sample to make predictions. Based on these predictions we will see how well our model performs.
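An 80/20 split of this kind is commonly done with train_test_split from sklearn. The sketch below uses placeholder data; the fixed random_state and the stratify argument are assumptions added for reproducibility and balanced classes, not details taken from the original code.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: ten fake resumes spread evenly over two categories.
resumes = [f"resume text {i}" for i in range(10)]
categories = ["Data Science", "HR"] * 5

# Hold out 20% of the rows as a test set the classifier never trains on.
X_train, X_test, y_train, y_test = train_test_split(
    resumes,
    categories,
    test_size=0.2,
    random_state=42,      # assumed seed, only for repeatable splits
    stratify=categories,  # assumed: keep class proportions in both parts
)

print(len(X_train), len(X_test))
```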
Step 8: Evaluate Model
Now we have the model and 20% of the dataset for testing it. To evaluate the model we use four metrics: accuracy, precision, recall and the F1 score. All of these are calculated on the 20% of the dataset that the classifier never saw. Using Logistic Regression, our model achieved an accuracy of 67.6%, precision of 68.25%, recall of 67.5% and an F1 score of 64.6%.
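The four metrics can be computed with sklearn.metrics as sketched below. The labels are made up, so the numbers printed are not the results reported above, and macro averaging across classes is an assumption because the text does not say how the per-class scores were combined.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true and predicted categories for six held-out resumes.
y_true = ["HR", "HR", "Data Science", "Data Science", "Testing", "Testing"]
y_pred = ["HR", "Data Science", "Data Science", "Data Science", "Testing", "HR"]

accuracy = accuracy_score(y_true, y_pred)
# average="macro" scores each class separately, then takes the unweighted mean.
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```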
In this blog we have created a Machine Learning model that can classify resumes into different categories. The purpose of this model is to help HR in classifying resumes automatically.
To evaluate the model we used four evaluation metrics: accuracy, precision, recall and F1 score. Using Machine Learning and Natural Language Processing we were able to achieve an accuracy of 67.6%, precision of 68.25%, recall of 67.5% and an F1 score of 64.6%.
The complete code for this blog is available at GitHub. Hope you enjoyed the article. Stay tuned, and until then, happy coding ❤