The Machine Learning soccer pundit


This post was originally published by Gustavo Sant'Ana Ferreira at Towards Data Science

Clustering MLS players with PCA and K-Means


Major League Soccer's 2020 season ended in December, after a tough year. Although I have a distant relationship with the sport nowadays, the subject comes naturally to me as a Brazilian. Since returning to live in the USA, I have become interested in the growth of soccer in the country, which I see as having great potential both as a sport and as a business.

Reading about the finals, I realized that I knew little about the players, so I decided to explore the data and learn more about the league, its teams, fans, and players. I did not want a superficial analysis that stopped at the basic statistics, though. To cover the subject broadly, I decided to build a model that would group the players into clusters according to their playing characteristics, based on many criteria at once.

I started the project by getting data from a site that has numerous statistics available and allows scraping for non-commercial purposes. I selected the necessary URLs manually and iterated over them, fetching each page with requests and parsing it with BeautifulSoup. The tables were not in the “body” of the page, but they were easy to find inside the HTML comments, which is where the cleaning process began. Using pandas, I created a dataframe for each table and merged them all into one, covering all 680 players who took part in the championship.
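
As a rough illustration of the "tables hidden in comments" step, the sketch below pulls table rows out of HTML comments. It is not the original scraping code: to stay dependency-free it uses the standard library's `html.parser` instead of BeautifulSoup, and the page content, class names, and the `tables_in_comments` helper are all hypothetical.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the raw text of every HTML comment in a page."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def tables_in_comments(html):
    """Return the rows of every table found inside an HTML comment."""
    collector = CommentCollector()
    collector.feed(html)
    tables = []
    for comment in collector.comments:
        if "<table" not in comment:
            continue  # ordinary comment, not a hidden table
        parser = TableRows()
        parser.feed(comment)
        tables.append(parser.rows)
    return tables


# Hypothetical page: the table is wrapped in a comment, not in the live markup.
sample_page = """
<html><body>
<div id="all_stats">
<!--
<table id="stats">
<tr><th>Player</th><th>Goals</th></tr>
<tr><td>Player A</td><td>12</td></tr>
</table>
-->
</div>
</body></html>
"""

tables = tables_in_comments(sample_page)
print(tables[0])
```

With a real page, the HTML string would come from an HTTP response, and each list of rows could then be turned into a pandas dataframe.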

I captured data from the Standard Stats, Shooting, Passing, Passing Type, Goals and Shot Creation, Defensive Actions, Possession, and Goalkeeping tables.

At first, I left out the goalkeeping data, since those tables contain no data from other players, unlike the “Shooting” table, for example, which also includes goalkeepers. The cleaning involved deleting the header rows that are repeated in the middle of the tables, deleting duplicate columns (name, nation, position, and team, for example, were present in all the original tables), and fixing the position column (players listed with two positions, such as “MF, FW”, kept only the first one, “MF” in this case). Only then did I add the goalkeeper data. It is essential for the clustering stage, because it provides the material needed to distinguish goalkeepers from the other players.
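
The cleaning steps described above could look roughly like the following pandas sketch. The two small tables, the column names (`Player`, `Pos`, `Squad`, `Shots`, `Passes`), and the `drop_repeated_headers` helper are assumptions made for illustration; the real tables have many more columns.

```python
import pandas as pd

# Hypothetical raw tables: the header row repeats mid-table, as scraped.
shooting = pd.DataFrame(
    {
        "Player": ["Player A", "Player", "Player B"],
        "Pos": ["MF,FW", "Pos", "DF"],
        "Squad": ["Team X", "Squad", "Team Y"],
        "Shots": ["30", "Shots", "4"],
    }
)
passing = pd.DataFrame(
    {
        "Player": ["Player A", "Player", "Player B"],
        "Pos": ["MF,FW", "Pos", "DF"],
        "Squad": ["Team X", "Squad", "Team Y"],
        "Passes": ["410", "Passes", "950"],
    }
)


def drop_repeated_headers(table):
    """Remove rows that merely repeat the column names."""
    return table[table["Player"] != "Player"].reset_index(drop=True)


shooting = drop_repeated_headers(shooting)
passing = drop_repeated_headers(passing)

# Merging on the shared identifier columns avoids duplicating them.
players = shooting.merge(passing, on=["Player", "Pos", "Squad"], how="outer")

# Keep only the first listed position, e.g. "MF,FW" -> "MF".
players["Pos"] = players["Pos"].str.split(",").str[0].str.strip()

# Scraped values arrive as text; convert the statistics to numbers.
stat_columns = ["Shots", "Passes"]
players[stat_columns] = players[stat_columns].apply(pd.to_numeric)

print(players)
```

An outer merge is used here so that a table covering only some players (goalkeeping, for instance) would not drop everyone else; the missing statistics would simply show up as NaN.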

After that, I selected the columns to use. I kept only statistics related to skills, discarding projected values, averages, and numbers based exclusively on opponents’ actions. I also left out age and place of birth. My interest was in the last season, not in the players’ future, so I chose to analyze what they did on the field.

In total, I selected 97 features. Some are obvious, such as goals, assists, and passes completed; others are less commonly analyzed, such as passes in the central third of the field and the total distance of passes toward the attack.


Once the dataframe was finalized and, above all, cleaned, I started the clustering stage. I chose two approaches, both using PCA (Principal Component Analysis) and K-Means. I wanted to perform a visual analysis and take advantage of Plotly’s hover function, which shows a point’s data when you place the mouse cursor over its marker on the chart (I didn’t upload the files here because they are huge). My idea was to move from point to point and understand the main characteristics of each player. With 93 features, that is impossible on a chart. That’s where PCA can help.

Because each feature has its own range of values, the features need to be scaled. Since the data were normally distributed, I used StandardScaler. I also followed the usual scikit-learn workflow and created a pipeline to make the process easier.
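
A minimal sketch of such a pipeline, assuming scikit-learn's `Pipeline`, `StandardScaler`, and `PCA`, is shown below. The data is synthetic (random numbers on very different scales) and only stands in for the player table; the step names `scale` and `pca` are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the player table: 200 players, 10 statistics
# on very different scales (the real data is not reproduced here).
rng = np.random.default_rng(0)
scales = np.array([1, 5, 10, 50, 100, 200, 500, 1000, 2000, 5000])
X = rng.normal(size=(200, 10)) * scales

pipeline = Pipeline(
    [
        ("scale", StandardScaler()),   # zero mean, unit variance per feature
        ("pca", PCA(n_components=2)),  # project onto two components
    ]
)
components = pipeline.fit_transform(X)

print(components.shape)
```

Without the scaling step, the features with the largest raw values would dominate the components simply because of their units.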


I checked the cumulative explained variance. Compressing the 93 features into just two components retained 69% of the variance; with three components, 73%. For what I was proposing, that was enough. First, I plotted the components, coloring the points by the standard field positions given in the player data. It is interesting to see how the three components weigh the features differently, giving different values to features related to attack, defense, and decisive moves.
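
One way to check the cumulative explained variance is to fit PCA with all components and take the running sum of `explained_variance_ratio_`. The sketch below does that on synthetic data built from three hidden factors, so its percentages are unrelated to the 69% and 73% reported for the real player data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data: three hidden factors drive twelve features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 12))
X = latent @ mixing + 0.3 * rng.normal(size=(300, 12))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)  # no n_components: keep every component

# Running total of the share of variance captured by the first n components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
for n, share in enumerate(cumulative[:3], start=1):
    print(f"{n} component(s): {share:.1%} of the variance")
```

The same PCA fit also exposes `pca.components_`, whose rows hold the weight each original feature receives in each component.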


I also plotted the PCA with three components.


Moving on to K-Means, the first step was to define the ideal number of clusters. Both the “Elbow” and the “Silhouette Score” methods pointed to very small values, such as 2 clusters (as expected, the suggestion was to separate goalkeepers from the other players). Simply recovering the four positions (goalkeepers, backs, midfielders, and forwards) did not interest me either. So, I went a little deeper and chose six groups.
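
Both checks can be sketched with scikit-learn as follows. The data comes from `make_blobs` rather than from the player table, and the range of candidate values (2 to 8) is an arbitrary choice for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data standing in for the scaled player features.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

inertias = {}
silhouettes = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squares (elbow plot)
    silhouettes[k] = silhouette_score(X, model.labels_)

# The silhouette method picks the k with the highest average score.
best_k = max(silhouettes, key=silhouettes.get)

for k in inertias:
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")
print("highest silhouette at k =", best_k)
```

For the elbow method, the inertia values are usually plotted against k and the bend in the curve is read by eye. As in the article, these scores are a guide and not a rule: a larger k can be chosen when finer groups are more useful.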


The clusters formed were:

1. Strikers: They are responsible for most of the goals, steal few balls, and shoot a lot. They make relatively few forward passes because they often receive the ball close to the goal and have little space to advance. They are decisive, getting good offensive results with few touches on the ball.

Ex: Gyasi Zardes, Raúl Ruidíaz, Robert Berić

2. Great defenders: This is the group that plays the most, accumulating the highest minute totals of all groups. They put pressure on opponents in the first third, win many tackles, rely on the aerial game, and head the ball a lot. These are players of great importance.

Ex: Francisco Calvo, Judson, Eddie Segura

3. Defenders with some attacking power: A group that includes many full-backs and very few forwards. They use the sides of the field, defend, and also create attacking moves. They are average in most statistics, standing out only in the number of assists. Overall, the group is very homogeneous, with a few standout players.

Ex: Alexander Büttner, Diego Palacios, Ruan

4. Not great defenders: Another group with few forwards. They touch the ball often and have the composure to play at a slower pace. They accumulate good progressive distance, since they receive the ball in the defensive half. They score a few goals. However, they are not very prominent and have modest statistics.

Ex: Renzo Zambrano, Gedion Zelalem, Ifunanyachi Achara

5. Bench players: A group with many reserves, mainly goalkeepers. They accumulate just a few minutes on the field and have poor statistics due to limited participation.

6. Ordinary group: A group that brings together players from various positions, with many midfielders of little prominence. In this cluster, the rule is not to stand out. Players even score a few goals, but they lose the ball a lot and produce little.
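
Cluster descriptions like the ones above are typically read off per-cluster averages. The sketch below shows one possible way to build such a profile with K-Means and a pandas `groupby`; the data is synthetic and the four column names are placeholders, not the features used in the article.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the player statistics (placeholder column names).
features, _ = make_blobs(n_samples=300, centers=6, n_features=4, random_state=0)
players = pd.DataFrame(features, columns=["goals", "tackles", "passes", "minutes"])

# Scale first, then assign each player to one of six clusters.
scaled = StandardScaler().fit_transform(players)
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0)
players["cluster"] = kmeans.fit_predict(scaled)

# Average of every statistic per cluster, plus the size of each cluster.
profile = players.groupby("cluster").mean()
sizes = players["cluster"].value_counts().sort_index()

print(profile.round(2))
print(sizes)
```

Comparing the rows of `profile` shows which statistics are unusually high or low in each group, which is the kind of evidence a label such as "Strikers" or "Bench players" would rest on.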

The work was interesting because it took several qualities of the players into account. Looking only at the leaders in the most celebrated statistics can distort the interpretation. Players who do well in several areas are crucial to building a competitive team. This type of work can be expanded, for example, by taking into account the salaries of more experienced players and the potential of younger players.

Visit my GitHub repo to get access to the code used. You can also see how I ran the same clustering process using Hierarchical Clustering.
