This post was originally published by Josh Weiner at Towards Data Science
From our analysis of relativized team statistics versus aggregated player statistics, it is clear that our Elo Rating and its determinants are the better features to train our models on when predicting the outcome of NBA games.
Our first step here was to split our data into features and labels. After reading in our dataset and splitting it this way, we used sklearn to randomly divide the data into train and test sets with an 80:20 ratio.
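A minimal sketch of this split, using a synthetic stand-in for the real feature matrix (the column count and the win/loss labels here are illustrative assumptions, not the actual dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: each row is a game, the columns
# are relative team stats plus Elo ratings, and y is 1 for a win, 0 for a loss
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))
y = rng.integers(0, 2, size=1000)

# 80:20 train/test split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```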
The first model we aimed to use to predict the outcome of an NBA game was a Logistic Regression model. Unlike a Linear Regression model, which predicts a continuous value, a Logistic Regression model maps its output to a probability between 0 and 1 and classifies each observation into one of two groups. Since we are predicting wins and losses, this type of binary classification suits us perfectly.
To begin, we used a simple non-parameterized LR model with our team stats and Elo Ratings as parameters using sklearn:
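A sketch of that baseline, again on synthetic stand-in data (the feature matrix and dimensions are assumptions for illustration; the post's actual features were the relativized team stats and Elo ratings):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the team-stats + Elo feature matrix
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Logistic regression with default hyperparameters, as a first baseline
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```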
After playing around with some hyperparameter tuning, we found that setting max_iter=131 (with verbose=2 for logging) slightly improved our initial testing accuracy to 66.95%. Definitely not bad for a largely untuned model, and very close to our desired prediction accuracy. However, we sought to see if we could better tune our hyperparameters to improve our overall accuracy. Essentially, we would try out many combinations of possible hyperparameters on our data to find the best weights for our LR model.
We accomplished this using cross-validation: because we only have a vague idea of the parameters we might want to use, our best approach is to narrow the search by evaluating a wide range of values for each hyperparameter.
Using RandomizedSearchCV, we searched among 2 * 4 * 5 * 11 * 3 * 3 * 5 * 3 = 59,400 possible settings, so the most efficient approach was to evaluate a random sample of the values rather than every combination.
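A sketch of how such a randomized search looks in sklearn. The parameter lists below are illustrative assumptions, not a reproduction of the post's actual 59,400-combination search space:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Illustrative search space; the real one spanned many more lists of values
param_distributions = {
    "C": loguniform(1e-3, 1e2),       # continuous distribution to sample from
    "penalty": ["l2"],
    "solver": ["lbfgs", "liblinear"],
    "max_iter": [100, 200, 500],
}

# Sample 20 settings at random and score each with 5-fold cross-validation
search = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions,
    n_iter=20,
    cv=5,
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```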
Running our model with the best parameter values from the random samples actually decreased the accuracy of our model to 66.27%, which showed us that while random sampling helped us narrow down our hyperparameter tuning within a distribution, we would have to explicitly check all combinations with GridSearchCV.
In this case, implementing GridSearch only marginally increased our accuracy with our LR model.
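An exhaustive grid search follows the same pattern; the small grid below is an assumption for illustration (centred on values a randomized search might favour), not the post's actual grid:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Illustrative grid: GridSearchCV evaluates every combination exhaustively
param_grid = {
    "C": [0.1, 1.0, 10.0],
    "solver": ["lbfgs", "liblinear"],
    "max_iter": [200, 500],
}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X, y)  # 3 * 2 * 2 = 12 settings, each cross-validated 5 ways
best_score = grid.best_score_
```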
The second model we looked to implement was a RandomForestClassifier; random forests can be used efficiently for both regression and classification. In this case, we will see whether the classifier can build an effective ensemble of decision trees to determine wins from the given team stats.
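The baseline fit mirrors the logistic regression setup; as before, the data here is a synthetic stand-in for the team-stats and Elo features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the team-stats + Elo feature matrix
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Random forest with default hyperparameters as a first baseline
forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)
rf_accuracy = forest.score(X_test, y_test)
```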
Out of the box, the RandomForestClassifier reaches an initial accuracy of 66.95%, which again is pretty good. As with the LR model, we attempted to tune the hyperparameters to give us more accurate results — first using RandomizedSearchCV.
Unlike with the LR model, we find that RandomizedSearch improves our hyperparameter tuning, giving us a better accuracy of 67.15%.
Running GridSearchCV in a similar manner to what we did above, we also sought to explicitly test 2 * 1 * 6 * 2 * 3 * 3 * 5 = 1080 combinations of settings instead of randomly sampling a distribution of settings. GridSearch also gave us an improvement from the base RandomForestClassifier, with an accuracy of 67.11%.
Overall, when running both a LogisticRegression and a RandomForestClassifier on the team stats and Elo Ratings, we achieved a win-prediction accuracy of 66.95%–67.15%. For basketball games, which as we established earlier are quite variable in their actual versus predicted results, this is a significant result.
We then took a different approach to predicting the outcome of a game to see if we could achieve any better performance. Using the larger dataset of individual player statistics that we've collected, we will train a model to predict how many points a player will score in a given game. We will predict this based on a player's average season stats up until the game we are trying to predict, as well as their average performance over the past 10 games. We already created this data in the feature engineering section above. We will also make use of Elo ratings in our prediction, as presumably the higher the opposing team's rating, the fewer points a player will score. Once we have this model, we can predict how many points a team will score in a game by summing the predicted points of each individual player. With this information we will be able to predict which team will score more points and thus win the game.
Before we run our models, we need to clean our data slightly. For some games in this dataset, we have the statistics for one team's players but not for the other team — generally only for the first game that other team plays in the season. Thus, we will remove all these games from the dataset.
Unlike with the above games, we can’t randomly split our data into train and test sets. We are looking to use individual player statistics to predict the final score of a team, thus we must keep all players playing in the same game together. To do this, we will split up our train and test sets by game so players playing in the same game stay together. About 80% of the games will be in the train set and 20% will be in the test set:
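A sketch of this game-level split. The table below is a hypothetical stand-in (column names like `game_id` and `pts` are assumptions, not the post's actual schema); the key point is shuffling game IDs rather than rows:

```python
import numpy as np
import pandas as pd

# Hypothetical player-game table: one row per player per game
rng = np.random.default_rng(0)
player_df = pd.DataFrame({
    "game_id": np.repeat(np.arange(100), 10),  # 100 games, 10 players each
    "season_avg_pts": rng.normal(12, 5, 1000),
    "pts": rng.normal(12, 6, 1000),
})

# Shuffle game IDs, not rows, so players from one game stay together
game_ids = player_df["game_id"].unique()
rng.shuffle(game_ids)
n_train = int(0.8 * len(game_ids))
train_games, test_games = game_ids[:n_train], game_ids[n_train:]

train_df = player_df[player_df["game_id"].isin(train_games)]
test_df = player_df[player_df["game_id"].isin(test_games)]
```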
Instead of using a Logistic Regression model, for player scoring we will use a Linear Regression model, as we are looking to predict a range of possible values (points scored) rather than simply a win or a loss. Our RMSE (Root Mean Squared Error) for all players was 5.56, or the equivalent of each player making or missing around 2–3 baskets per game relative to their averages.
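A sketch of the regression-and-RMSE step on synthetic data (the features here — a handful of standardized averages with a noisy linear relationship to points scored — are assumptions for illustration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical features: season averages, last-10-game averages, and
# opposing-team Elo; target is points scored in the game
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=5.5, size=2000) + 12

# Fit on the first 1600 rows, evaluate RMSE on the held-out 400
reg = LinearRegression()
reg.fit(X[:1600], y[:1600])
pred = reg.predict(X[1600:])
rmse = np.sqrt(mean_squared_error(y[1600:], pred))
```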
On the test set, we grouped each team's predicted scoring for each game and compared it with their actual scoring numbers. Comparing the winner implied by predicted scoring against the actual winner gave us 1483 correct out of 2528 games, or an accuracy of 58.66%. Clearly, and as we realized earlier when looking at PER distributions of teams versus their opponents, aggregated player performance is too variable a determinant to accurately predict the outcome of games — especially compared to team performance, which tends to be more consistent across games.
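The aggregation step can be sketched as follows, again on hypothetical data (team labels, noise levels, and column names are all assumptions): sum each player's predicted points to a team total, then check whether the predicted winner matches the actual one.

```python
import numpy as np
import pandas as pd

# Hypothetical per-player results: actual and predicted points,
# two teams of 8 players per game
rng = np.random.default_rng(0)
rows = []
for g in range(200):
    for team in ("home", "away"):
        pts = rng.normal(13, 6, 8).clip(0)
        rows.append(pd.DataFrame({
            "game_id": g,
            "team": team,
            "pts": pts,
            "pred_pts": pts + rng.normal(0, 5.5, 8),  # prediction error
        }))
df = pd.concat(rows, ignore_index=True)

# Sum player points into team totals per game
totals = df.groupby(["game_id", "team"])[["pts", "pred_pts"]].sum().unstack("team")

# Compare the winner implied by predictions with the actual winner
actual_home_win = totals[("pts", "home")] > totals[("pts", "away")]
pred_home_win = totals[("pred_pts", "home")] > totals[("pred_pts", "away")]
win_accuracy = (actual_home_win == pred_home_win).mean()
```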
As avid NBA fans, we felt that creating a model to predict the outcome of NBA games would be an interesting project and taught us a lot about building classifiers for professional sports game outcomes. We were able to utilize many of the concepts learned in our Big Data Analytics class for this project — including scraping, data cleaning, feature analysis, building models and hyperparameter tuning — and want to thank Professor Ives for his fantastic work in teaching throughout the semester.
Our Random Forest Classifier, with parameters optimized through RandomizedSearchCV, gave us the highest testing accuracy of 67.15%. It is slightly higher than the Logistic Regression model, and much higher than the Linear Regression model based on individual player statistics. Optimizing parameters using GridSearchCV and RandomizedSearchCV was time consuming and computationally costly, and it resulted in only marginal changes in testing accuracy. If we had more time, we'd likely spend less time optimizing parameters and more time selecting a model.
The best NBA game prediction models only accurately predict the winner about 70% of the time, so our logistic regression model and random forest classifier are both very close to the upper bound of predictions that currently exist. If we had more time, we would explore other models and see just how high a test accuracy we could get. Some candidates include an SGD classifier, linear discriminant analysis, a convolutional neural network, or a naïve Bayes classifier.
Hopefully, you enjoyed reading about our work as much as we enjoyed making it — and learned something from it too.