November 9th 2020
Welcome to the third edition of xFO!
Always bringing the latest in football and data analytics. This week's edition is about predicting the golden boot race in the EPL. It also shows how, with the right football statistics and data science concepts, great analyses can be made of relatively complex situations.
Introduction:
The current EPL season has been filled with surprises: shocking results, unexpected teams positioning themselves in the top half of the table, and a lot of goals. The Liverpool vs Leeds United match during week one definitely set the tone for the season, with teams on the attack, underdogs competing with and outperforming the established top teams during parts of the match, and again, plenty of goals. The golden boot, awarded to the player who scores the most goals during the season, is always a highly contested prize between players. With the number of goals currently being scored, not only are the usual players such as Salah, Vardy and Kane at the top, but also surprise players such as Calvert-Lewin, Son and Bamford. With so many goals being scored this season, I wonder who will win this year's golden boot.
Once again, thanks to understat.com I was able to collate an Excel spreadsheet with the goals, xG and other statistics for all players in the Premier League from the 2018/2019 and 2019/2020 seasons. I created this spreadsheet to conduct a regression analysis and find out whether there is a relationship between a player's goals and/or xG in one season and how many goals they score the following season. In other words, the regression analysis tests whether goals and xG from previous seasons can be indicators of how a player will perform the next season. The results from the regression analysis of the goals scored by players in the 2018/2019 and 2019/2020 seasons can then be used to predict the goals that will be scored by players during the 2020/2021 season.
Before diving into the data analysis process, certain edits were made to the Excel spreadsheet to obtain more precise results. For example, teams that were relegated during the 2018/2019 season were not included in the primary regression analysis, as data from the EFL Championship (England’s second division) is not readily available. Similarly, any players who were transferred away from Premier League teams between the two seasons were also removed, to limit the eventual predictions to current EPL players. Players who transferred to the EPL this season or last were added to the spreadsheet with the statistics from their previous leagues. This was done so that the final predictions take into account the new players who could possibly enter the race for the golden boot.
Regression Analysis:
In case you are unfamiliar with regression analysis, it is a statistical method used to estimate the relationship between variables. It is composed of a dependent variable and one independent variable, or a dependent variable and multiple independent variables. These analyses are most common when the relationship between the variables is linear, meaning that a change in one variable is matched by a proportional change in the other. However, regressions can also be conducted when the relationship between the variables is nonlinear, which tends to occur when data sets and variables are more complex. In this particular case, a linear regression analysis was used to estimate whether there is a relationship between the number of goals a player scores in two consecutive seasons. If the relationship is strong enough, the results of the regression analysis can be a useful indicator to predict and estimate the number of goals players will score the following season.
Data Analysis:
As mentioned, to start this regression analysis, an Excel spreadsheet was created with all the players who took part in both the 2018/2019 and 2019/2020 seasons, as well as all new signings EPL teams have made between the beginning of last season and the start of this season, subject to the criteria mentioned in the introduction section.
With 20 teams in the Premier League and about 500 eligible players, one would think all players have an equal chance of winning the golden boot. However, this is not the case. In the original spreadsheet, I added an extra column called ‘team values’ categorizing the teams of the EPL into different groups. Using last season’s league table standings, I gave the top four teams (Liverpool, Manchester City, Manchester United and Chelsea) a team value of five (5). The next four teams (Leicester, Tottenham, Arsenal and Wolves) were given a value of four (4). The next four teams (Sheffield United, Burnley, Southampton and Everton) were given a team value of three (3). Lastly, the remaining five teams (Newcastle United, Crystal Palace, Brighton, West Ham and Aston Villa) were given a team value of two (2). After categorizing each team, I aggregated the xG of every team in each category for the past two seasons. The graph below contains the aggregated xG by team value for the 2018/2019 season. The teams who finished higher in the table, those with the team value of 5, had the highest combined xG between them.
What this graph conveys is that, logically, the teams who finish higher in the table are expected to score the most goals. This makes sense because the teams performing at their best are creating the highest quality scoring opportunities and capitalizing on the majority of them. The more chances a team creates, the more scoring opportunities players in that team will have to add to their season goal tally. The graph above shows that there is a small difference between categories 4 and 5, and a much bigger difference between categories 3 and 4. This bigger difference suggests that the golden boot winner will come from a team with a value of 4 or 5. In fact, during the 2018/2019 season, 3 players shared the golden boot: Salah, Mané and Aubameyang. All 3 players scored 22 goals. Liverpool came in second that year and had a team value of 5, while Arsenal finished in fifth and had a team value of 4. Therefore, the golden boot winners during that season support the conclusion that golden boot winners will come from teams with a value of 4 or 5.
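The team-value grouping and xG aggregation described above could be sketched in a few lines of Python. The team values are the ones listed in the article; the xG totals in the example are placeholder numbers for illustration only, not real understat.com figures, and the function name is hypothetical.

```python
from collections import defaultdict

# Team values as described in the article (based on the 2019/2020 table).
TEAM_VALUES = {
    "Liverpool": 5, "Manchester City": 5, "Manchester United": 5, "Chelsea": 5,
    "Leicester": 4, "Tottenham": 4, "Arsenal": 4, "Wolves": 4,
    "Sheffield United": 3, "Burnley": 3, "Southampton": 3, "Everton": 3,
    "Newcastle United": 2, "Crystal Palace": 2, "Brighton": 2,
    "West Ham": 2, "Aston Villa": 2,
}


def aggregate_xg_by_team_value(team_xg, team_values=TEAM_VALUES):
    """Sum season xG over all teams that share the same team value."""
    totals = defaultdict(float)
    for team, xg in team_xg.items():
        totals[team_values[team]] += xg
    return dict(totals)


# Placeholder xG totals, NOT real data: they only show the shape of the input.
example_team_xg = {
    "Liverpool": 70.0,
    "Chelsea": 60.0,
    "Arsenal": 50.0,
    "Burnley": 40.0,
    "Brighton": 35.0,
}

totals = aggregate_xg_by_team_value(example_team_xg)
for value in sorted(totals, reverse=True):
    print(f"team value {value}: aggregated xG {totals[value]:.1f}")
```

With real per-team xG for a season as input, the resulting totals are what a bar chart of aggregated xG by team value would plot.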
The same aggregated xG graph was created for the 2019/2020 season, and it leads to very similar conclusions to the previous graph. Teams with a value of four or five scored at a higher rate than teams with a value of two or three. Jamie Vardy won the golden boot last year by scoring 23 goals. His team, Leicester City, finished in fifth place and had a team value of four, which further supports the claim that golden boot winners will come from teams with a value of four or five. The graph of the 2019/2020 season can be seen below.
In this context we can conclude that the golden boot winner for the 2020/2021 season will likely come from a team that finished in the top eight. For the past two seasons the top eight have included the same seven teams (Liverpool, Manchester United, Manchester City, Arsenal, Tottenham, Wolves and Chelsea), joined by Everton in 2018/2019 and Leicester in 2019/2020. From this information it seems safe to state that the golden boot winner will come from one of the seven teams that have consistently been in the top positions over the last two seasons. The next step is to conduct the regression analysis to determine if this conclusion holds.
Regression Analysis Procedure:
The first step in making sure this is a viable analysis is to obtain the correlation coefficient of the two variables. Python makes it easy to load Excel and CSV files and compute the correlation coefficient between goals scored in the 2018/2019 and 2019/2020 seasons, which was about 0.734. On a scale from -1 to 1, this indicates a fairly strong positive relationship. To help visualize this positive relationship, a scatter plot with a line of best fit was created.
The graph illustrates that there is a positive relationship between the number of goals a player scores in one season and the number of goals he scores in the following one. The high correlation coefficient supports the need for a deeper analysis.
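For readers who want to see what sits behind a correlation coefficient, here is a minimal sketch of the Pearson formula using only the Python standard library. The article does not say which library was used, so this is an illustration rather than the original code, and the goal tallies are placeholder values, not real player data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder goal tallies for five hypothetical players (NOT real data).
goals_season_1 = [22, 15, 10, 6, 3]
goals_season_2 = [19, 17, 8, 9, 2]

print(f"correlation: {pearson_r(goals_season_1, goals_season_2):.3f}")
```

A value close to 1 means players who scored a lot in the first season tended to score a lot in the second; a value near 0 would mean one season says little about the next.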
Using the original spreadsheet, I split the data into two sections: a training section and a testing section. The regression is fitted on the training section, and its predictions are then compared to the actual values in the testing section to measure its accuracy. Splitting the data this way lets the model, built in Python, learn the relationships between the stats/variables from one part of the data and then be judged on data it has not seen, which is the situation it faces every time it is given new data. If the accuracy is strong enough, the regression results can be used to predict the golden boot winner for the current season.
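The split-then-fit procedure could look roughly like the sketch below. It uses only the standard library and a one-variable least squares fit; the article does not state the split ratio or the library used, so the 25% test fraction, the function names and the data pairs are all assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into (training, testing) lists."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(pairs):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Placeholder (previous season goals, next season goals) pairs, NOT real data.
players = [(22, 19), (15, 17), (10, 8), (6, 9), (3, 2), (12, 11), (8, 5), (18, 14)]

training, testing = train_test_split(players, test_fraction=0.25, seed=42)
slope, intercept = fit_line(training)
print(f"fitted on {len(training)} players, held out {len(testing)}")
print(f"predicted goals = {intercept:.2f} + {slope:.2f} * previous goals")
```

Holding some players back means the accuracy figures are measured on data the model never saw while fitting, which is a fairer test than checking it against its own training data.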
An important statistic when conducting a regression analysis is the r squared. R squared quantifies the proportion of the variation in the dependent variable that can be explained by a model’s inputs. After fitting the regression on the training section and evaluating it on the testing section, the r squared value was 0.61, or 61%. This means that 61% of the variation in goals scored can be explained by the regression, in this case by the goals scored the previous season. Ideally, this value would be higher, but football is a very unpredictable game with many factors that can affect a player and the number of goals he scores. Injuries can limit players’ time on the pitch, and new managers switching formations and style of play may have an effect on a player’s ability to score. These are factors that are extremely hard to quantify in a model like this. However, over half of the variation can be explained by the model, so it can still be used as a possible predictor for this season's golden boot winner. It is important to note that, as with any statistical analysis, a measure of error is calculated as well. In this case the root mean squared error (RMSE) was 3.08. RMSE is the standard deviation of the prediction errors and shows how concentrated the data is around the line of best fit. The RMSE is easier to interpret because it is expressed in the same units as the dependent variable, which in this case is goals scored. In other words, the predictions are typically off by about three goals, which reflects the unpredictability of football and the small number of variables considered in these models.
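Both statistics are simple to compute by hand. The sketch below shows the standard formulas for r squared and RMSE in plain Python; the actual and predicted values are toy numbers chosen to make the arithmetic easy to follow, not results from the article's model.

```python
import math


def r_squared(actual, predicted):
    """Share of the variation in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the same units as `actual` (goals)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example (NOT real data): two players scored 0 and 10 goals,
# while the model predicted 3 and 7.
actual_goals = [0, 10]
predicted_goals = [3, 7]

print(f"r squared: {r_squared(actual_goals, predicted_goals):.2f}")
print(f"RMSE: {rmse(actual_goals, predicted_goals):.2f} goals")
```

In this toy case each prediction misses by three goals, so the RMSE is 3, and the squared errors (18) are 36% of the total variation (50), leaving an r squared of 0.64.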
Predicting the Golden Boot Winner:
Now that the model has fitted values from the regression analysis to work with, it can be given a new dataset to predict the number of goals a player will score in the current season. The new dataset is the player statistics of the 2019/2020 season. Given this new data, the model predicts the number of goals each player will score, based on the variables and results of the regression analysis discussed earlier. The top ten predicted scorers can be found in the table below.
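As a rough illustration of the prediction step (not the model behind the table), a fitted line can be applied to last season's goals and the results sorted. The four goal tallies are the ones quoted in this article; the slope and intercept are placeholder values chosen for the example, not the coefficients actually fitted, so the printed numbers are not the article's predictions.

```python
def predict_top_scorers(previous_goals, slope, intercept, top_n=10):
    """Apply y = intercept + slope * x to each player and rank the results."""
    predictions = {
        player: intercept + slope * goals
        for player, goals in previous_goals.items()
    }
    ranked = sorted(predictions.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# 2019/2020 goal tallies as quoted in the article
# (Werner's 28 were scored in the Bundesliga).
goals_2019_20 = {
    "Timo Werner": 28,
    "Jamie Vardy": 23,
    "Dominic Calvert-Lewin": 13,
    "Son Heung-min": 11,
}

# Placeholder coefficients, NOT the values fitted in the article.
EXAMPLE_SLOPE = 0.75
EXAMPLE_INTERCEPT = 2.0

ranking = predict_top_scorers(goals_2019_20, EXAMPLE_SLOPE, EXAMPLE_INTERCEPT)
for player, predicted in ranking:
    print(f"{player}: {predicted:.1f} predicted goals")
```

Because a one-variable linear model with a positive slope preserves the order of its input, the predicted ranking simply mirrors last season's ranking, which is one reason breakthrough players are hard for it to spot.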
Player Description:
The first thing that stands out is that Timo Werner is predicted to win the golden boot this year. However, it can be argued that the 28 goals he scored in the Bundesliga are inflating his predicted goals and do not reflect the higher level of the EPL. Although this might be true, Werner is at the top of his game right now and the focal point of a very talented attack that will only improve. He may just get the necessary opportunities to score approximately 24 goals. This number of goals would be enough to secure the golden boot given that the top scorer last year scored 23 goals, and the year before that, 22. Also, what makes him an excellent forward and a great candidate for the golden boot, apart from his finishing, is his ability to create chances for himself. His quick pace and dribbling allow him to get into better positions than most players. He would definitely be a worthy winner of the golden boot. With Werner predicted to finish at the top, this further supports the claim that the golden boot winner will come from a team with a value of four or five. Chelsea had a team value of five.
Given the start to the season they have had, two surprising absentees are Son and Calvert-Lewin. Their absence comes down to their performances in past seasons. Son has usually played as a winger for Tottenham, with Harry Kane being the main striker. This resulted in Son not having the same number of opportunities as other players on this list. Son scored 12 and 11 goals during the 2018/2019 and 2019/2020 seasons respectively. That tally is higher than those of only three players on this list for the 2018/2019 season, and lower than that of every single player on it for the 2019/2020 season. These low numbers explain his absence from the top ten predicted scorers for this season.
In contrast to the past two seasons, a factor that has drastically improved Son’s goal scoring performance this season is Mourinho’s decision to adapt Son and Kane’s playing style. There have been many instances when Kane drops in between the opponents’ defensive and midfield lines, opening up space behind the defensive line for Son to run into, as seen below.
It is definitely something Tottenham have worked on this season. It will be important to see if Son can continue to be clinical in front of goal, and if opposing teams will adapt their style to defend against this strategy.
Calvert-Lewin is having a breakthrough season and has already scored more than half of the goals he scored last year. He scored 6 and 13 goals in the past two seasons respectively, nowhere near the players in the predicted table above. This low scoring record explains his absence, because the model takes into account statistics from the past two seasons. Having played two full seasons of EPL football, he has learned, gained experience and truly made a statement this season. With young players, it is hard for coaches, managers and even analysts to quantify and predict when they will have their breakthrough season. The signings Everton made this summer could have been a sign of his improvement, but there was no way to confirm this would be a factor in Calvert-Lewin reaching his current potential. Personally, I would be glad if he proved this golden boot table prediction wrong and finished among the top ten scorers. It is fantastic to see new and young talents reach their potential, and reach the objectives they have been training for throughout their lives.
Would using xG predict a more accurate top scorer table?
The same regression analysis was conducted, replacing the goal statistic with xG. As with the previous regression analysis, the correlation coefficient was calculated first to check that further investigation was justified. The correlation coefficient for the xG stat was 0.805, which suggests a stronger positive relationship between a player's xG in one season and their xG the following season than between actual goals scored in the two seasons. The graph below helps visualize the relationship and correlation coefficient of the xG statistic.
After using the same training and testing split, the r squared was slightly higher at 0.64, or 64%, while the RMSE was lower at 2.08. This is quite strong evidence that xG is a more accurate predictor of future player performance, as more of the variation is explained by the model’s inputs, and the typical prediction error is only about two goals instead of three. Again, with the model having estimates from the regression analysis to work with, and given the new data, it can make xG predictions based on player stats from the 2019/2020 season. The results are found in the table below.
Timo Werner has also come out on top of the xG-based prediction table, which suggests that he is a favorite to win the golden boot this year.
Player Description:
Similar to the prediction table based on a player’s actual goals, there are a few surprises here as well. The biggest surprise is Gabriel Jesus, who is predicted to come in second place. However, given that xG was the variable used for this regression, it makes sense: Jesus does have plenty of xG, especially last season. Because he underperformed his xG by the biggest margin of any player, he did not appear in the previous table based on actual goals scored, but he does show up in the xG table. The question here is whether Jesus simply had a bad season and will perform to a higher standard, or whether, on the contrary, his poor finishing will continue.
Kane is quite low on this list at tenth. This can be explained by his relatively low xG numbers over the past two seasons in comparison to his competitors. One of the reasons could be his limited time on the pitch due to injuries. According to Transfermarkt, he missed 17 games in the 2018/2019 season due to two different ligament injuries, and missed 14 games last season due to a torn hamstring. Despite scoring a decent number of goals on returning from these injuries, his reduced time on the pitch has negatively affected his xG statistics.
Lastly, an unexpected absentee is Sadio Mané, who shared the golden boot two seasons ago and came in seventh last season. His absence comes down to him overperforming his xG, similar to Kane. Mané outperformed his xG by 4 goals last season and by almost 6 goals two seasons ago. With a relatively low xG but a high number of actual goals scored, he made it into the top ten of the actual goals prediction but not the xG one. Mané is the opposite case to Jesus. The question with Mané is, will he continue to score the difficult chances he has been scoring for Liverpool the past two seasons? Or will his finishing return to a more standard level, where most professional players in the EPL range from very slightly underperforming their xG to just about exceeding it?
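The over- and underperformance discussed for Jesus and Mané is just the gap between goals and xG. A small sketch of that comparison follows; the one-goal tolerance, the function names and the player figures are assumptions made up for the example, not values from the article's spreadsheet.

```python
def xg_overperformance(goals, xg):
    """Positive when a player scored more than his xG, negative when fewer."""
    return goals - xg


def label_finishing(goals, xg, tolerance=1.0):
    """Classify a season as over, under or in line with xG."""
    diff = xg_overperformance(goals, xg)
    if diff > tolerance:
        return "overperforming"
    if diff < -tolerance:
        return "underperforming"
    return "in line with xG"


# Hypothetical players with placeholder figures (NOT real data).
examples = {
    "Player A": (18, 14.0),  # scores chances the model rates as difficult
    "Player B": (14, 20.0),  # misses chances the model rates as good
    "Player C": (10, 10.5),  # finishing roughly as expected
}

for player, (goals, xg) in examples.items():
    diff = xg_overperformance(goals, xg)
    print(f"{player}: {diff:+.1f} goals vs xG, {label_finishing(goals, xg)}")
```

A player like the hypothetical Player A ranks well in a goals-based prediction but lower in an xG-based one, and the reverse holds for Player B.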
Conclusion:
Both golden boot prediction tables have their surprise inclusions and surprise absentees. For the most part, from intuition and experience watching the EPL, I am confident that the final top scorer table will look similar to one of the predicted tables in this article. In theory, the xG table should be more accurate as time goes on, due to the higher r squared value, lower RMSE and the nature of the xG statistic. However, factors such as injuries, missed clear opportunities, and overall team performances can negatively affect xG. In theory, over the course of the season these factors should even out, and players should perform to their overall xG.
There will always be surprises when it comes to goal scorers. Players have breakthrough seasons, changes of manager can inspire or frustrate players, and players can develop and improve their play. These factors are unpredictable, and this is the beauty of football. However, while football is complex, there are ways to show that it is quite simple, as I have done here. A simple regression analysis and basic machine learning can create a (hopefully) accurate prediction of who the top ten goal scorers for the current season may be. There may be minor faults and there are definitely ways to improve this analysis, but it shows that, using the right football statistics and the right data science concepts, great analyses can be made. This balance between the complexity and simplicity of football, on and off the pitch, is what makes it the most beautiful game.
Stay tuned for the next article!