This post was originally published by Allen Kong at Towards Data Science
Developing classical models for predicting restaurant revenue
Restaurants are an essential part of a country’s economy and society. Whether for a social gathering or a quick bite, most of us have visited one. With the recent rise of pop-up restaurants and food trucks, it is imperative for business owners to figure out when and where to open new restaurants, since doing so takes a lot of time, effort, and capital. This raises the problem of finding the best time and place to open a new restaurant. TFI, which owns many giant restaurant chains, has provided demographic, real estate, and commercial data in its restaurant revenue prediction competition on Kaggle. The challenge is to build a robust model that is capable of predicting the revenue of a restaurant.
A look at the data shows 137 samples in the training set and 100,000 samples in the test set. This is intriguing, since the split is usually the other way around. The goal is to model revenue based on the 137 training samples and see how well the model performs on the 100,000 test samples. The data fields for each sample consist of the restaurant ID, which is unique for each restaurant, the opening date of the restaurant, the city, city group, restaurant type, several non-arbitrary P-variables, and revenue, which is the target variable. Using a complex model on such a small, noisy training set is likely to cause overfitting. To guard against that, regularization techniques for linear regression will need to be used.
After a brief look at the training data, it appears that there are no null values, which is a good thing. However, that may not be the case for the P-variables, as we will see later in the data exploration.
Data Pre-Processing & Exploring Features
The two figures above show the count of types of restaurants in the training set and test set. Looking carefully, there doesn’t seem to be a single occurrence of the ‘MB’ type in the training set. Type ‘MB’ stands for mobile restaurants and type ‘DT’ stands for drive-thru restaurants. Since mobile restaurants are more related to drive-thru than inline and food courts, the ‘MB’ samples in the test set were replaced with the ‘DT’ type.
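As a minimal sketch of that replacement, assuming the test set is loaded into a pandas DataFrame called test_df with a 'Type' column (the tiny frame below is made up for illustration and is not the Kaggle data):

```python
import pandas as pd

# Hypothetical miniature test set; the real one comes from the Kaggle files.
test_df = pd.DataFrame({"Type": ["FC", "IL", "MB", "DT", "MB"]})

# Map the 'MB' (mobile) category, unseen in training, onto 'DT' (drive-thru).
test_df["Type"] = test_df["Type"].replace("MB", "DT")

print(test_df["Type"].value_counts())
```

After this step the test set only contains restaurant types that also occur in the training set, so both can be encoded into the same columns.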
No changes seem to be required for the city group feature. The training set has slightly more ‘Big Cities’ samples than ‘Other’ samples, but that shouldn’t be a problem when we create our model. It should also be intuitive that restaurant revenue tends to be higher in big cities than in other areas.
(df['City'].nunique(), test_df['City'].nunique())
Out: (34, 57)
For the ‘City’ feature, it appears that there are cities in the test set that aren’t in the training set. It is also worth noting that some of the non-arbitrary P-variables already contain geolocation information so the entire ‘City’ feature was dropped for both datasets.
import datetime
import pandas as pd
df.drop('Id', axis=1, inplace=True)
# Parse the opening dates
df['Open Date'] = pd.to_datetime(df['Open Date'])
test_df['Open Date'] = pd.to_datetime(test_df['Open Date'])
# Reference date used to compute how long each restaurant has been open
launch_date = datetime.datetime(2015, 3, 23)
# Days open, scaled down by a factor of 1000
df['Days Open'] = (launch_date - df['Open Date']).dt.days / 1000
test_df['Days Open'] = (launch_date - test_df['Open Date']).dt.days / 1000
# The raw date is no longer needed
df.drop('Open Date', axis=1, inplace=True)
test_df.drop('Open Date', axis=1, inplace=True)
The opening date is the date the restaurant first opened. On its own it won’t be of much use in predicting revenue, but it is useful to know how long the restaurant has been open. For that reason, I decided to use March 23, 2015 as the date of comparison to calculate the number of days the restaurant has been open. Then, I chose to downscale the number of days open by a factor of 1000 to slightly improve model performance.
The data has 37 p-variables which are all obfuscated data. These features contain demographic data, real estate data, and commercial data based on the data field description on the Kaggle competition page.
Initially, I had thought that the p-variables were numerical features, but after reading some of the discussions in the competition, it turns out that some of these features were actually categorical data encoded using integers. What’s even more interesting is that a majority of the values for some of these features are zero. Once again, after digging through the discussions, people concluded that these zero values were actually null values, as shown in the plots above. Multivariate imputation by chained equations (also known as MICE) was used to replace the missing values in some of these features. It works by using the entire set of available data to estimate the missing values.
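import numpy as np
# IterativeImputer is experimental, so it must be enabled explicitly before import
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Treat zeros as missing values and keep imputed values at 1 or above
imp_train = IterativeImputer(max_iter=30, missing_values=0, sample_posterior=True, min_value=1, random_state=37)
imp_test = IterativeImputer(max_iter=30, missing_values=0, sample_posterior=True, min_value=1, random_state=23)
# Column names P1 through P37
p_data = ['P'+str(i) for i in range(1,38)]
# Impute each dataset separately, then round to the nearest integer
df[p_data] = np.round(imp_train.fit_transform(df[p_data]))
test_df[p_data] = np.round(imp_test.fit_transform(test_df[p_data]))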
The imputer was fit on all p-variables, separately for the training set and the test set. It re-estimates the missing values over up to 30 rounds and returns the estimates from the final round. These estimates are continuous, so before feeding them to the model, they need to be rounded to the nearest integer.
One Hot Encoding
To deal with object types in the data, one hot encoding will be used to transform these features into numerical form which can be provided to the machine learning models. Dummy encoding can also be used to avoid redundancy. The features that will be encoded are ‘Type’ and ‘City Group’ since they are the only object types in the datasets.
# One hot encode every object-typed column ('Type' and 'City Group')
columnsToEncode = df.select_dtypes(include=[object]).columns
df = pd.get_dummies(df, columns=columnsToEncode, drop_first=False)
test_df = pd.get_dummies(test_df, columns=columnsToEncode, drop_first=False)
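To make the difference between one hot encoding and dummy encoding concrete, here is a small sketch on a made-up frame (the toy values are assumptions for illustration only). With drop_first=True, pandas drops the first level of each encoded feature, which removes the redundant column:

```python
import pandas as pd

# Toy frame with the two object-typed features from the dataset.
toy = pd.DataFrame({
    "Type": ["FC", "IL", "DT"],
    "City Group": ["Big Cities", "Other", "Other"],
})

# One hot encoding: one column per category level.
one_hot = pd.get_dummies(toy, columns=["Type", "City Group"], drop_first=False)

# Dummy encoding: the first level of each feature is dropped.
dummy = pd.get_dummies(toy, columns=["Type", "City Group"], drop_first=True)

print(list(one_hot.columns))
print(list(dummy.columns))
```

For regularized linear models and tree-based models the redundant column is usually harmless, which is presumably why drop_first=False is workable here.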
Target Variable Distribution
Based on the distribution, it looks like revenue is right skewed. There also appear to be outliers, which will cause issues in model training. Since we will be experimenting with linear models, the target variable will be transformed to bring it closer to a normal distribution. The target variable was log transformed, so the final predictions will need to be exponentiated to bring them back to the original revenue scale.
The models that I decided to experiment with were several different linear models, KNN, random forest, and gradient boosted models. The goal here is to find the best tuned models to ensemble for the final model. Before we train any model, we will split the training set into a training and validation set.
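import numpy as np
from sklearn.model_selection import train_test_split
# Log-transform the target; log1p computes log(1 + x)
df['revenue'] = np.log1p(df['revenue'])
X, y = df.drop('revenue', axis=1), df['revenue']
# Hold out 20% of the training data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=118)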
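Because the target is transformed with np.log1p, predictions made on the log scale can be mapped back to revenue with its inverse, np.expm1. A quick round trip on made-up revenue values (these numbers are illustrative, not taken from the dataset):

```python
import numpy as np

# Made-up revenue figures, for illustration only.
revenue = np.array([1_500_000.0, 4_200_000.0, 19_000_000.0])

# Forward transform applied to the target before training.
log_revenue = np.log1p(revenue)

# Inverse transform applied to predictions before submission.
recovered = np.expm1(log_revenue)

print(log_revenue)
print(recovered)
```

Note how the large gap between the smallest and largest revenue shrinks on the log scale, which is what tames the right skew.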
Ridge Linear Model
Ridge regression is a regularized linear model. As stated earlier, regularization techniques need to be used to prevent overfitting, especially since our training set is very small. Before we train a ridge model on the training set, we need to find the optimal parameters for the model. To do this, grid search along with k-fold cross validation was used to find the parameters that led to the best score.
The optimal parameters are then used for model evaluation using both the training and test sets. The RMSE here is actually RMSLE since we’ve taken the log of the target variable.
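The tuning and evaluation steps could look roughly like the sketch below. The data is synthetic (137 rows to mirror the training set size), and the alpha grid and 5 folds are my assumptions rather than the exact values used for the results in this post:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared features and log-transformed revenue.
rng = np.random.default_rng(0)
X = rng.normal(size=(137, 10))
y = 15.0 + 0.5 * X[:, 0] + rng.normal(scale=0.3, size=137)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=118
)

# Hypothetical grid of regularization strengths.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(
    Ridge(), param_grid, scoring="neg_root_mean_squared_error", cv=5
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on both splits.
best_ridge = search.best_estimator_
train_rmse = np.sqrt(mean_squared_error(y_train, best_ridge.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, best_ridge.predict(X_test)))

print(search.best_params_, train_rmse, test_rmse)
```

Since y is already on the log scale, the RMSE values printed here play the role of RMSLE on the original revenue.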
Lasso Linear Model
Now we repeat the same procedure for a lasso model. The lasso model works differently from the ridge model because its L1 penalty can shrink the coefficients of less important features all the way to zero, effectively removing them from the model. This can be visualized later in the feature importance plot.
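The difference between the two penalties is easy to see on synthetic data where only one feature actually matters. This is a sketch with arbitrary alpha values, not the tuned models from this post:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first of 20 features drives the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(137, 20))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=137)

# Arbitrary regularization strengths chosen for illustration.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)

# Count coefficients that are exactly zero in each model.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("ridge zero coefficients:", n_zero_ridge)
print("lasso zero coefficients:", n_zero_lasso)
```

Ridge keeps every coefficient small but nonzero, while lasso drops the irrelevant features and keeps the informative one.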
We can see that the lasso model is generalizing a lot better than the ridge model using just the ‘Days Open’ feature. It’s able to achieve just about the same test error as the ridge model using all features which shows the true potential of these regularization techniques.
ElasticNet is a linear model that combines the regularization techniques of ridge and lasso. We will use ElasticNetCV to select the best hybrid model using cross validation.
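A minimal sketch of that selection, again on synthetic data; the candidate l1_ratio values and the 5 folds are assumptions on my part:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in for the prepared features and log-transformed revenue.
rng = np.random.default_rng(2)
X = rng.normal(size=(137, 20))
y = 15.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.3, size=137)

# l1_ratio=1.0 is pure lasso; values closer to 0 lean towards ridge.
l1_ratios = [0.1, 0.5, 0.9, 1.0]
enet = ElasticNetCV(l1_ratio=l1_ratios, cv=5, random_state=0)
enet.fit(X, y)

# Share of features whose coefficient was shrunk to exactly zero.
fraction_dropped = float(np.mean(enet.coef_ == 0))

print(enet.alpha_, enet.l1_ratio_, fraction_dropped)
```

The fraction of zeroed coefficients is one way to quantify how much the model reduces the feature set.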
There is little to no improvement using the elastic net model. We can see that the training scores and test scores between the linear models are about the same.
In terms of feature importance, the elastic net model reduced features by 72%. Even with this reduction, the model does not seem to give an improved score against the ridge or lasso model. This is probably due to the small dataset and the linear models’ tendency to overfit.
For KNN, we will use the KNeighborsRegressor from sklearn. We apply the same process to find the optimal neighbor parameter.
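As a sketch, the neighbor search can reuse the same grid search setup. The range of 1 to 20 neighbors is an assumption, and the data is synthetic:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the prepared features and log-transformed revenue.
rng = np.random.default_rng(3)
X = rng.normal(size=(137, 5))
y = 15.0 + np.sin(X[:, 0]) + rng.normal(scale=0.3, size=137)

# Hypothetical search range for the number of neighbors.
param_grid = {"n_neighbors": list(range(1, 21))}
knn_search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
)
knn_search.fit(X, y)

best_k = knn_search.best_params_["n_neighbors"]
print(best_k, -knn_search.best_score_)
```

KNN is distance based, so in practice the features would typically need to be on comparable scales for the neighbor search to be meaningful.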
Surprisingly, the KNN model seems to perform a bit better than the linear models on the test set.
Random forests are very powerful models which are a bit different from bagged decision trees. A random forest selects a random subset of features at each node and finds the best of those to split on, whereas bagged trees consider all features for splitting at each node. Random forests also provide hyperparameters that help reduce overfitting. We will tune our model based on several of these hyperparameters.
We can see the training score has improved significantly compared to KNN and the linear models we’ve used above. The model is also able to achieve this using nearly all of the features, as shown below in the feature importance plot.
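Here is a sketch of tuning a few of those hyperparameters with scikit-learn. The grid values are assumptions chosen to keep the example fast, not the grid behind the results in this post. In scikit-learn, max_features=1.0 makes every feature available at each split, which corresponds to plain bagged trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared features and log-transformed revenue.
rng = np.random.default_rng(4)
X = rng.normal(size=(137, 10))
y = 15.0 + 0.5 * X[:, 0] * X[:, 1] + rng.normal(scale=0.3, size=137)

# Hypothetical grid; smaller max_features means more randomness per split.
param_grid = {
    "max_features": [0.3, 0.6, 1.0],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}
rf_search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
rf_search.fit(X, y)

# Impurity-based importances of the refitted best forest.
importances = rf_search.best_estimator_.feature_importances_
print(rf_search.best_params_)
print(importances.round(3))
```

Limiting max_depth or raising min_samples_leaf restricts how closely each tree can fit the 137 training samples, which is the main lever against overfitting here.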
LightGBM provides boosting capabilities for decision trees. It is a good alternative to XGBoost which we will test as well after.
Based on the training score, it seems that the model has overfitted to the training set since there is no improvement in the test score. It’s probably not optimal to include this model at all in our ensemble later.
XGBoost is yet another boosting algorithm for decision trees. Let’s see how it compares to LightGBM after tuning hyperparameters.
The model doesn’t seem to be overfitting as much as LightGBM. The training and test scores also seem to be lower than those of the random forest model. This helps explain why the model is heavily used in just about any problem setting.
It is very easy to overfit with boosted models, so we will add early stopping parameters to reduce overfitting. This gives us a better model that is still able to generalize to the test set quite well.
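To illustrate the idea of early stopping without depending on a particular XGBoost version, the sketch below swaps in scikit-learn’s GradientBoostingRegressor, which is a different implementation from the one used in this post. It holds out part of the training data and stops adding trees once the validation score stops improving; all parameter values are assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the prepared features and log-transformed revenue.
rng = np.random.default_rng(5)
X = rng.normal(size=(137, 10))
y = 15.0 + 0.5 * X[:, 0] + rng.normal(scale=0.3, size=137)

gbr = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.05,
    max_depth=2,
    validation_fraction=0.2,  # share of training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gbr.fit(X, y)

# Number of trees actually fitted before early stopping kicked in.
print(gbr.n_estimators_)
```

The number of fitted trees is usually well below the upper bound on a dataset this small, which is exactly the overfitting guard that early stopping provides.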
Based on the experiments above, it’s clear that the linear models and KNN are not the best models for this dataset, so they won’t be used as part of the ensemble. The best models to ensemble would be the random forest and XGBoost models, as we have seen from the training and test errors above. I decided to use a random forest ensemble, since boosting models have a tendency to overfit in this scenario, as shown by the LightGBM model. For the ensemble, I decided to use stacking. The benefit of this is a single model that combines the strengths of several well-performing base models. The base models are differently tuned random forest models, and the meta model is a simple model such as a linear regressor.
I fitted the stacked model on the entire dataset and tested it against the Kaggle private leaderboard. The model did surprisingly well, placing 4th on the private leaderboard with an RMSE score of 1741680.77896. For reference, the 1st place solution on the private leaderboard had an RMSE of 1727811.48553.
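One way to build such a stacked ensemble is scikit-learn’s StackingRegressor. The sketch below is an assumption about how it could be wired up, with two hypothetical random forest configurations and synthetic data, rather than the exact final model:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the prepared features and log-transformed revenue.
rng = np.random.default_rng(6)
X = rng.normal(size=(137, 10))
y = 15.0 + 0.5 * X[:, 0] + rng.normal(scale=0.3, size=137)

# Hypothetical base models: random forests with different settings.
base_models = [
    ("rf_shallow", RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0)),
    ("rf_subset", RandomForestRegressor(n_estimators=50, max_features=0.5, random_state=1)),
]

# The linear meta model is trained on out-of-fold base model predictions.
stack = StackingRegressor(
    estimators=base_models, final_estimator=LinearRegression(), cv=5
)
stack.fit(X, y)

# Map log-scale predictions back to the revenue scale.
preds = np.expm1(stack.predict(X[:5]))
print(preds)
```

Training the meta model on out-of-fold predictions keeps it from simply rewarding whichever base model memorized the training set best.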
Challenges & Lessons Learned
Initially, I had not planned for a lot of things for this project. Before I read some of the discussions in the Kaggle competition, I had assumed the p-variables were actually numeric values. With this in mind, I log transformed these variables, which actually ended up making the linear models better at predicting the target variable. This makes sense, since every feature is normalized, making it easier for a linear model to predict. After finding out that the p-variables were categorical with many missing values, imputation was a much better approach. The lesson here is that it’s very important to understand the data you’re working with, especially if it’s a small dataset. I had also played around with many models, manually tweaking hyperparameters, until I discovered grid search, which did wonders for saving time and effort in finding the best model. These were just a few of the things I learned while working on this dataset. A few things to perhaps try in the future would be to fit models on only the relevant features and to include a more diverse ensemble of models.