How to develop an end-to-end Machine Learning project

This post was originally published by Astha Sharma at Medium [AI]

A step-by-step guide to building a prediction web app, from problem definition to model deployment.

Things we will cover in this Article

  • Data Collection and Problem Statement

  • Loading the Dataset

  • Exploratory Data Analysis

  • Feature Engineering

  • Data Pre-processing

  • Vectorization

  • Building and Training Models

  • Model Evaluation

  • Deploying the Model with Flask

  • Deployment on Heroku

Note : As a Data Science and ML enthusiast I have tried many different projects on the subject, but what I have realised is that deploying your machine learning model is a key aspect of every ML and Data Science project. Everything I had studied or been taught so far in my Data Science and ML journey had mostly focused on defining the problem statement, followed by data collection and preparation, model building, and model evaluation, which is of course important for every ML/DS project. But what if I want other people to interact with my models? How can I make my model available to end-users? I can’t send them Jupyter notebooks, right! That’s why I wanted to try my hand at a complete end-to-end machine learning project.

Overview of the Project

A simple Machine Learning and NLP based web application that classifies a given message as Spam or Ham (not spam), built using Flask and deployed on Heroku.

To view the Deployed Application, click on the link given below : https://sms-spam-predictor-nlp.herokuapp.com/

Demo of the Deployed Web Application

Let’s Get Started…

Step 1: Data Collection and Problem Statement

The very first step of every machine learning project is to focus on the data. It is a very important task to decide what, how, and from where to get the data required to solve the given problem. The required data could be gathered from a client's internal database, third-party APIs, online data sites, or by web scraping. For this project we are going to use the UCI-ML SMS Spam Collection data from Kaggle.

Here is the link to the dataset:
https://www.kaggle.com/uciml/sms-spam-collection-dataset

Problem Statement : To Classify the given text messages as Spam or Ham.

Step 2: Loading the Dataset

First, import the libraries required to load the data and perform the EDA.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Then read and load the file into a DataFrame using the read_csv method, and let's have a look at basic information about the data with the help of the info() method.

# Reading and checking the basic structure of the data
sms_data = pd.read_csv("spam.csv", encoding='latin-1')
sms_data.info()

Note : The character encoding of this dataset is latin-1 (ISO/IEC 8859-1).


The dataset contains 5 columns. Column v1 is the dataset label (“ham” or “spam”) and column v2 contains the text of the SMS message. Columns “Unnamed: 2”, “Unnamed: 3”, and “Unnamed: 4” contain “NaN” (not a number) signifying missing values. They are not needed, so they can be dropped as they are not going to be useful in building the model.

The following code snippet will drop and rename the columns to improve understandability of the dataset:

# Creating a copy of the dataset so that changes will not affect the original dataset
sms_df = sms_data.copy()

# Dropping the redundant-looking columns (for this project)
to_drop = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"]
sms_df = sms_df.drop(to_drop, axis=1)

# Renaming the columns for better understanding
sms_df.rename(columns={"v1": "Target", "v2": "Message"}, inplace=True)
sms_df.head()

Now the dataset looks cleaner and more understandable.

Step 3: Exploratory Data Analysis

We will explore the data in the series of steps given below:

  • To check basic information about the data, such as the number of rows and columns, data types, and null values, we can use a direct pandas function called info().
# using info method
sms_df.info()

The dataset consists of 5,572 messages in English, each labelled as ham or spam. The data frame has two columns: the first column, "Target", indicates the class of the message as ham or spam, and the second column, "Message", contains the text string. There are no null values present in the dataset.

  • We can also get statistical information about the data using the describe() method.
# statistical info of dataset
sms_df.describe()

Before exploring the distribution of the dataset and doing further analysis, we will first map each category of the target variable to 0 and 1 (0: Ham, 1: Spam).

# Mapping values for labels
sms_df['Target'] = sms_df['Target'].map({'ham': 0, 'spam': 1})
sms_df.head()

  • Now let's look at the distribution of labels:
#Palette
cols= ["#E1F16A", "salmon"] 
plt.figure(figsize=(8,8))
fg = sns.countplot(x= sms_df["Target"], palette= cols)
fg.set_title("Countplot for spam Vs ham")
fg.set_xlabel("Classes(0:ham,1:spam)", color="#58508d")
fg.set_ylabel("Number of Data points")

From the above count plot, it is evident that the dataset is imbalanced, with most of the messages being Ham (not spam). So next we will have to dive into feature engineering.

Step 4: Feature Engineering

Handling the imbalanced dataset using oversampling

To balance the dataset we are using the oversampling technique.

Oversampling is a technique used in data analysis to adjust the class distribution of a dataset (i.e. the ratio between the different classes/categories represented).

# Handling the imbalanced dataset using oversampling
only_spam = sms_df[sms_df['Target']==1]

print('Number of Spam records: {}'.format(only_spam.shape[0]))
print('Number of Ham records: {}'.format(sms_df.shape[0]-only_spam.shape[0]))

# Appending copies of the spam records until the classes are roughly balanced
count = int((sms_df.shape[0]-only_spam.shape[0])/only_spam.shape[0])
for i in range(0, count-1):
    sms_df = pd.concat([sms_df, only_spam])

sms_df.shape

Now we will check the distribution again.
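The plot itself appears as an image in the original post; a minimal sketch that reproduces it by reusing the earlier countplot code (same cols palette as above) would be:

# Re-checking the class distribution after oversampling
plt.figure(figsize=(8,8))
fg = sns.countplot(x=sms_df["Target"], palette=cols)
fg.set_title("Countplot for spam vs ham after oversampling")
fg.set_xlabel("Classes(0:ham,1:spam)")
fg.set_ylabel("Number of Data points")
plt.show()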

From this plot we can clearly say that our data is now almost balanced.

To understand more about the data, we will create some new features, such as:

  • word_count : Number of words in the text message

Distribution of word count for ham and spam messages

# Creating new feature word_count
sms_df['word_count'] = sms_df['Message'].apply(lambda x: len(x.split()))

plt.figure(figsize=(12, 6))

# 1-row, 2-column, go to the first subplot
plt.subplot(1, 2, 1)
g = sns.distplot(a=sms_df[sms_df['Target']==0].word_count, color='#E1F16A')
p = plt.title('Distribution of word_count for Ham messages')

# 1-row, 2-column, go to the second subplot
plt.subplot(1, 2, 2)
g = sns.distplot(a=sms_df[sms_df['Target']==1].word_count, color='salmon')
p = plt.title('Distribution of word_count for Spam messages')

plt.tight_layout()
plt.show()

It can be seen that ham messages are mostly shorter than spam messages: the word counts of spam messages mostly fall in the range of 15-30 words, whereas the majority of ham messages fall below 25 words.
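To back this up with numbers, a small sketch (not part of the original code) that prints a per-class summary of the new feature:

# Summary statistics of word_count for ham (0) and spam (1) messages
print(sms_df.groupby('Target')['word_count'].describe())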

Let's have a look at how messages are related to currency symbols.

# Creating feature contains_currency_symbol
def currency(x):
    currency_symbols = ['€', '$', '¥', '£', '₹']
    for i in currency_symbols:
        if i in x:
            return 1
    return 0

sms_df['contains_currency_symbol'] = sms_df['Message'].apply(currency)

It can be seen that almost 1/3 of spam messages contain currency symbols, while currency symbols are rarely used in ham messages, which makes sense because most spam messages are after money.
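The figure behind this observation is shown as an image in the original post; a minimal sketch that produces a similar countplot, assuming the columns and palette defined above, would be:

# Countplot of contains_currency_symbol split by class (0: ham, 1: spam)
plt.figure(figsize=(8,8))
fg = sns.countplot(x="contains_currency_symbol", hue="Target", data=sms_df, palette=cols)
fg.set_title("Currency symbols in ham vs spam messages")
fg.set_xlabel("Contains currency symbol (0: no, 1: yes)")
fg.set_ylabel("Number of Data points")
plt.show()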

Now let's check for text messages containing numbers.

# Creating feature contains_number
def numbers(x):
    for i in x:
        if ord(i) >= 48 and ord(i) <= 57:
            return 1
    return 0

sms_df['contains_number'] = sms_df['Message'].apply(numbers)

From the above plot we can say that (at least for this dataset) most of the spam messages contain numbers, while the majority of ham messages do not.
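Again, the plot is an image in the original post; a small sketch that checks the same thing numerically, assuming the contains_number column created above:

# Share of messages containing a digit, per class (0: ham, 1: spam)
print(sms_df.groupby('Target')['contains_number'].mean())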

Step 5: Data Pre-processing

The data cleaning process in NLP is crucial. The computer doesn't understand text; to it, a message is just a cluster of symbols. The process of converting data into something a computer can understand is referred to as pre-processing. In the context of this article, this involves the processes and techniques used to prepare our text data for the machine learning algorithm.

Let's have a look at a sample of the texts before pre-processing.

print("The First 5 Texts:",*sms_df["Message"][:5], sep = "n")
Texts before Pre-processing

The following steps are used to pre-process the data for NLP:

Cleaning the text :

  • In the first step we keep only the alphabetic characters; this removes punctuation and numbers.

Tokenizing the messages : Tokenization is breaking complex data into smaller units called tokens. It can be done by splitting paragraphs into sentences and sentences into words.

Removing the stop words : Stop words are frequently occurring words (such as few, is, an, etc.). These words hold meaning in sentence structure but do not contribute much to language processing in NLP, so to remove redundancy in our processing we drop them. The NLTK library has a set of default stop words that we will be removing.

Lemmatizing the words : Stemming and Lemmatization are Text Normalization (or sometimes called Word Normalization) techniques in the field of NLP.

Lemmatization is the process of arriving at the lemma of a word. What is a lemma, then? A lemma is the root from which a word is formed. For example, given the word went, the lemma would be 'go', since went is the past form of go.
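A tiny illustration of what the lemmatizer does (a sketch; note that NLTK's WordNetLemmatizer needs a part-of-speech hint to reduce a verb like went to go, otherwise it treats words as nouns):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # the WordNet data is needed once for the lemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("went", pos="v"))   # -> go
print(wnl.lemmatize("messages"))        # -> message (default part of speech is noun)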

Let's have a look at how we will perform the pre-processing steps mentioned above.

First we need to import the following libraries :

# libraries for performing NLP 
import nltk
import re
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

The following code snippet will perform the text pre-processing.

# Cleaning the messages
corpus = []
wnl = WordNetLemmatizer()

for sms_string in list(sms_df.Message):

    # Cleaning special characters from the sms
    message = re.sub(pattern='[^a-zA-Z]', repl=' ', string=sms_string)

    # Converting the entire sms into lower case
    message = message.lower()

    # Tokenizing the sms by words
    words = message.split()

    # Removing the stop words
    filtered_words = [word for word in words if word not in set(stopwords.words('english'))]

    # Lemmatizing the words
    lemmatized_words = [wnl.lemmatize(word) for word in filtered_words]

    # Joining the lemmatized words
    message = ' '.join(lemmatized_words)

    # Building a corpus of messages
    corpus.append(message)

Let’s have a look at texts after cleaning

Texts after pre-processing

Step 6: Vectorization

In NLP, the cleaned data needs to be converted into a numerical format in which each message is represented by a vector of numbers. This is also known as word embedding or word vectorization. We will be using TfidfVectorizer() to vectorize the pre-processed data.

# Changing text data into numbers and creating the Bag of Words model
tfidf = TfidfVectorizer(max_features=500)
vectors = tfidf.fit_transform(corpus).toarray()
feature_names = tfidf.get_feature_names()  # newer scikit-learn versions use get_feature_names_out()

# Let's have a look at our features
vectors.dtype

Now our next step is to start training our machine learning models.

Libraries required for model building and evaluation :

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline    
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score, recall_score, plot_confusion_matrix, classification_report, accuracy_score, f1_score  # plot_confusion_matrix was removed in scikit-learn 1.2; newer versions use ConfusionMatrixDisplay
from sklearn import metrics

Step 7: Building and Training Models

We are going to use the following steps in model building:

  • Setting up features and target as X and y
# Extracting independent and dependent variables from the dataset
X = pd.DataFrame(vectors, columns=feature_names)
y = sms_df['Target']
  • Splitting the testing and training sets
# Splitting the testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  • Build and fit four different classifiers.

We are using Naïve Bayes, RandomForestClassifier, DecisionTreeClassifier, and a Support Vector Machine (SVC).

#Testing on the following classifiers
classifiers = [MultinomialNB(), 
               RandomForestClassifier(),
               DecisionTreeClassifier(), 
               SVC()]
for cls in classifiers:
    cls.fit(X_train, y_train)

# Dictionary of pipelines and model types for ease of reference
pipe_dict = {0: "NaiveBayes", 1: "RandomForest", 2: "DecisionTree", 3: "SVC"}

  • Get the cross-validation accuracy on the training set for all the models.
# Cross-validation 
for i, model in enumerate(classifiers):
    cv_score = cross_val_score(model, X_train,y_train,scoring="accuracy", cv=10)
    print("%s: %f " % (pipe_dict[i], cv_score.mean()))

So we have trained the models; now we will evaluate their performance on the held-out test set.

Step 8: Model Evaluation

In this step we will see how all the models are performing and which model works best with the given data. We are using an accuracy report and confusion matrices for model evaluation.
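The accuracy report and confusion matrices appear as images in the original post. A minimal sketch of how they could be produced from the fitted classifiers above (not the author's exact code) would be:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

for i, model in enumerate(classifiers):
    y_pred = model.predict(X_test)
    print(pipe_dict[i], "- test accuracy:", accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

While cross_val_score above estimates performance on the training folds, this checks the held-out test set.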

Let's first have a look at the accuracy report.

Accuracy Report

The accuracy report shows that the Random Forest and SVC classifiers got almost the same score.

Confusion Matrix

With the help of the confusion matrices we can say that SVC performed slightly better than the Random Forest classifier on the given data.

Step 9: Deploying the Model Using Flask

Deploying a machine learning model means making the model available for end-users to make use of. In order to deploy any trained model, we need the following things:

  • A Trained model : save the model into a file to be further loaded and used by the web service.

Since the model will be deployed, it is saved into a pickle file (here spam-sms-mnb-model.pkl) created by the pickle module, and this file will appear in your project folder.

Pickle is a Python module that enables Python objects to be written to files on disk and read back into the Python program at runtime.

import pickle

# Creating a pickle file for the CountVectorizer
pickle.dump(cv, open('cv-transform.pkl', 'wb'))

# Creating a pickle file for the trained classifier
filename = 'spam-sms-mnb-model.pkl'
pickle.dump(classifier, open(filename, 'wb'))

# Loading the files back
classifier = pickle.load(open(filename, 'rb'))
cv = pickle.load(open('cv-transform.pkl', 'rb'))

Create the Webpage

We will first create a web page using HTML and CSS which takes input from the user (in this case a text message) and shows the output (whether the message is predicted as spam or ham).

You can find code here.

Flask makes it easy to write applications, and also gives a variety of choices for developing web applications.

First we need to install the Flask framework:

pip install flask 

We will create two files. One is named "app.py"; this file is used to run the application and is the engine of the app: it contains the API that gets input from the user and computes a predicted value based on the model.

The other file, named "sms_classifier_model.py", contains the code to build and train the machine learning model.

# Importing essential libraries
from flask import Flask, render_template, request
import pickle

Next we will create the Flask app:

app = Flask(__name__)

Load the pickled model and vectorizer:

filename = 'spam-sms-mnb-model.pkl'  # the model file created earlier
classifier = pickle.load(open(filename, 'rb'))
cv = pickle.load(open('cv-transform.pkl', 'rb'))

Create an app route to render the HTML template as the home page

@app.route('/')
def home(): 
    return render_template('main.html')

Create an API that gets input from the user and computes a predicted value based on the model.

@app.route('/predict',methods=['POST'])
def predict():    
    if request.method == 'POST':     
        message = request.form['message']    
        data = [message]     
        vect = cv.transform(data).toarray()     
        my_prediction = classifier.predict(vect)     
    return render_template('result.html', prediction=my_prediction)

Now, call the run function to start the Flask server.

if __name__ == '__main__':
    app.run(debug=True)

This should return an output that shows that your app is running. Simply copy the URL and paste it into your browser to test the app.
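As a quick sanity check, the /predict endpoint can also be exercised programmatically once the server is running locally. A small sketch using the requests library (the URL is Flask's default development address; the sample message is made up):

import requests

# Posting a sample message to the locally running Flask app
resp = requests.post("http://127.0.0.1:5000/predict",
                     data={"message": "WINNER!! Claim your free prize now"})
print(resp.status_code)  # 200 means the result page was rendered

The predicted label itself is embedded in the returned HTML of result.html.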

Step 10: Deployment on Heroku

Heroku is a multi-language application platform that allows developers to deploy and manage their applications. It is flexible and easy to use, offering developers a simple path to getting their apps to market.

First, install the Heroku CLI, as this makes it easy to create and manage your Heroku apps directly from the terminal. You can download it from here.

Sign up and log in to Heroku. Create a "Procfile" and a "requirements.txt" file, which handle the configuration needed to deploy the model to the Heroku server.

The Procfile contains a single line that tells Heroku how to start the web process:

web: gunicorn app:app

The requirements file lists the project dependencies (it is typically generated with pip freeze). To install them, use the command given below:

pip install -r requirements.txt

Next commit your code to GitHub and connect GitHub to Heroku.

There are two ways to deploy your app: automatic deploy or manual deploy. Automatic deployment takes place whenever you push a commit to your GitHub repository. By selecting the branch and clicking on deploy, the build starts.

Once the model is successfully deployed on the server, your app will be created and you will get a URL.

This is the link of my Web Application : https://sms-spam-predictor-nlp.herokuapp.com/

Conclusion

So we have successfully created a Spam SMS classifier Web Application.

This article covered all the steps for building and, most importantly, deploying a Spam SMS classifier machine learning model that could significantly help reduce the chances of getting trapped by spam messages.

You can refer to my GitHub repository for this project.

Note : This was my first end-to-end machine learning project and also my very first article on Medium, so please do share your feedback or suggestions so that I can improve.

I hope someone finds this useful.

Thank you!
