In this blog post, I'm going to use the AI & Analytics Engine to build machine learning models that predict the results of the 2022 World Cup. But before I get into the data, machine learning, and predictions, I’ll quickly explain how the World Cup is organized for those who don’t know.
The tournament begins with the group stage, in which the 32 qualifying national teams are split into eight groups of four. Each team plays the other three teams in its group once, earning 3 points for a win, 1 for a draw, and 0 for a loss. The top two teams from each group advance to the knockout stage.
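To make the points system concrete, here is a minimal sketch of how a group table could be tallied. The teams A to D and their results are made up for illustration, and ties on points are ignored (the real tournament settles them with tie-breakers such as goal difference).

```python
from collections import defaultdict

# Points per result, as described above: 3 for a win, 1 for a draw, 0 for a loss.
POINTS = {"win": 3, "draw": 1, "lose": 0}
# The away team's result is the mirror image of the home team's result.
AWAY_RESULT = {"win": "lose", "draw": "draw", "lose": "win"}


def group_standings(matches):
    """Return (team, points) pairs sorted from most to fewest points.

    `matches` is a list of (home, away, home_result) tuples, where
    home_result is "win", "draw" or "lose" from the home team's view.
    """
    points = defaultdict(int)
    for home, away, home_result in matches:
        points[home] += POINTS[home_result]
        points[away] += POINTS[AWAY_RESULT[home_result]]
    return sorted(points.items(), key=lambda item: item[1], reverse=True)


# Hypothetical group: four placeholder teams, six games in total.
matches = [
    ("A", "B", "win"),
    ("C", "D", "draw"),
    ("A", "C", "win"),
    ("B", "D", "win"),
    ("A", "D", "draw"),
    ("B", "C", "lose"),
]

standings = group_standings(matches)
qualifiers = [team for team, _ in standings[:2]]  # top two advance
print(standings)
print(qualifiers)
```

In this made-up group, A and C finish on top and move on to the knockout stage.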
The knockout stage consists of elimination games. There are no draws: if the scores are level, games go to extra time and then, if needed, a penalty shootout. The rounds of the knockout stage are the Round of 16, the Quarter-Finals, the Semi-Finals, and the Final. Once out of the group, all a team has to do to win the World Cup is win 4 games in a row. Easier said than done.
I’m training the machine learning model using a dataset containing 23,000 international football games going back to 1993. Each game has information about who’s playing and what their FIFA ranking was, where it’s being played, what tournament it’s in, and most importantly what the result was.
A classification machine learning model predicts the class label (dependent variable) of a given datapoint. It trains on a dataset and learns how the features (independent variables) affect the class label. You give the model features and it tells you what it thinks the class will be. Simple, right?
In our case, the features we are giving the model are the details of the upcoming group stage games, such as who’s playing and what their FIFA rankings are. The class we are predicting is the home team's result: win, lose, or draw.
Excerpt of two datapoints from the training dataset
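As a rough illustration of the split between features and class label, the sketch below turns one match record into a (features, label) pair. The field names, the two placeholder teams, and the idea of deriving the label from a scoreline are all my assumptions for this example, not the actual columns of the dataset.

```python
def result_label(home_score, away_score):
    """Class label from the home team's point of view."""
    if home_score > away_score:
        return "win"
    if home_score < away_score:
        return "lose"
    return "draw"


def to_example(match):
    """Split a raw match record into model features and a class label.

    All keys used here are hypothetical column names.
    """
    features = {
        "home_team": match["home_team"],
        "away_team": match["away_team"],
        # Negative means the home team is ranked higher (closer to 1st).
        "rank_difference": match["home_rank"] - match["away_rank"],
        "neutral_location": match["neutral_location"],
        "tournament": match["tournament"],
    }
    label = result_label(match["home_score"], match["away_score"])
    return features, label


# A made-up match between two placeholder teams.
match = {
    "home_team": "Team A",
    "away_team": "Team B",
    "home_rank": 12,
    "away_rank": 30,
    "neutral_location": False,
    "tournament": "Friendly",
    "home_score": 2,
    "away_score": 1,
}

features, label = to_example(match)
print(features)
print(label)
```

For the upcoming World Cup games the features are known ahead of time, while the label is exactly what the model has to predict.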
Now full disclosure, I’ll reiterate that I do work for PI.EXCHANGE, but I do think the AI & Analytics Engine is pretty cool. The first step is to upload the training data, which I did by uploading in CSV format because it’s easy, although files can also be imported from a database if you’re a computer whiz. Next is creating the app, which is mostly specifying that it’s a classification problem and that we’re trying to predict the result column.
Then it’s time to create the models. I trained multiple models that use different classification algorithms. Each algorithm predicts the class label in a different way and therefore reaches a different level of accuracy. Some of the algorithms I tried included K-nearest neighbors, Random Forest, and Logistic Regression. However, the best-performing model used the LightGBM algorithm, which is based on gradient-boosted decision trees, so I proceeded with that model.
It’s important to understand how each model arrives at its predictions. Under the feature importance tab, the Engine helps you understand why a trained model performs as it does by displaying a summary of which features in the training data affect the predicted class the most. For the home team to win, the difference between the FIFA rankings is by far the most important variable. There does seem to be a home team advantage, because the neutral location variable is the second most important.
The model uses 80% of the training data to learn and holds back the remaining 20% to evaluate itself. The results of that evaluation are shown in the confusion matrix. Put simply, the confusion matrix visualizes the model's performance by comparing the predicted and actual classes. The model is most accurate when predicting the label “win” and tends to predict that class most often. The model also rarely picks a draw, which can be seen in the group stage predictions.
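If you want to see what is going on under the hood, here is a minimal sketch of an 80/20 split and a confusion matrix in plain Python. The Engine does all of this for you; the function names, the random shuffle, and the toy labels below are my own assumptions, not the Engine's implementation or its actual results.

```python
import random
from collections import Counter

LABELS = ["win", "draw", "lose"]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold back a fraction of them for evaluation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def confusion_matrix(actual, predicted, labels=LABELS):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Share of predictions on the diagonal (predicted == actual)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# 10 placeholder rows: 8 end up in training, 2 are held back.
train, test = train_test_split(range(10))

# Toy labels, invented to show a model that never predicts a draw.
actual = ["win", "win", "win", "draw", "draw", "lose", "lose", "lose"]
predicted = ["win", "win", "win", "win", "lose", "lose", "lose", "win"]

matrix = confusion_matrix(actual, predicted)
for label, row in zip(LABELS, matrix):
    print(label, row)
print(accuracy(matrix))
```

Each row of the printed matrix shows how the games with that actual result were classified, so an all-zero "draw" column means the model never picked a draw.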
Now for the good stuff. With the model ready, it was time to get predictions for each group stage game by uploading a CSV of the upcoming games in the same data schema as the training data (the same format of columns, for the unacquainted). The model gives a probability for each class and chooses the most likely outcome. This means we can calculate each team's expected points (XPTS) per game with the formula XPTS = P(Win) * 3 + P(Draw).
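Here is the XPTS formula as a small sketch. The probabilities are invented for illustration, and treating the away team's win probability as the home team's "lose" probability is my own reading of the setup, since the model predicts from the home team's point of view.

```python
def expected_points(p_win, p_draw):
    """XPTS = P(Win) * 3 + P(Draw)."""
    return 3 * p_win + p_draw


def match_xpts(probs):
    """Return (home_xpts, away_xpts) for one game.

    `probs` maps "win", "draw" and "lose" (home team's view) to probabilities.
    """
    home = expected_points(probs["win"], probs["draw"])
    away = expected_points(probs["lose"], probs["draw"])
    return home, away


# Made-up class probabilities for a single game.
probs = {"win": 0.5, "draw": 0.25, "lose": 0.25}

predicted_class = max(probs, key=probs.get)  # most likely outcome
home_xpts, away_xpts = match_xpts(probs)
print(predicted_class, home_xpts, away_xpts)
```

Adding up a team's XPTS over its three group games gives a single number for how strong it is relative to its group.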
These are the results:
Group Stage Predictions
Big surprise: many of the highest-ranked teams are projected to win all of their group-stage games, as the FIFA ranking difference is the most predictive feature. Still, there were a few exceptions. Some notable upsets include USA (ranked 16th) defeating England (5th), Canada (41st) defeating Morocco (22nd), and Germany (11th) defeating Spain (7th). Finally, I have to mention that my own country, Australia (38th), was predicted to defeat Tunisia (30th), although it wouldn't be enough to make it past the group stage 😔
France has the highest XPTS, meaning that it is the strongest team relative to the rest of its group, closely followed by Brazil and Belgium.
Now the model has predicted which teams move on to the first round of the knockout stage. I created the test data for the Round of 16 from the group stage predictions, got the model's predictions, and then repeated the process for the quarter-finals, semi-finals, and *drumroll* the final.
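The round-by-round process can be sketched as a simple loop: predict a round, pair up the winners, and predict again. In the sketch below, the `predict_winner` function is a stand-in that just picks the better-ranked team; in my actual workflow that step was the trained model in the Engine, and the teams and rankings here are placeholders.

```python
def play_round(teams, predict_winner):
    """Pair adjacent teams in the bracket and return the predicted winners."""
    return [
        predict_winner(teams[i], teams[i + 1])
        for i in range(0, len(teams), 2)
    ]


def simulate_knockout(teams, predict_winner):
    """Return the list of rounds, from the starting bracket to the champion."""
    rounds = [list(teams)]
    while len(rounds[-1]) > 1:
        rounds.append(play_round(rounds[-1], predict_winner))
    return rounds


# Placeholder rankings (1 is best); not real FIFA rankings.
ranks = {"A": 1, "B": 8, "C": 4, "D": 5, "E": 2, "F": 7, "G": 3, "H": 6}


def better_ranked(team_1, team_2):
    """Stand-in predictor: the team with the better ranking always wins."""
    return team_1 if ranks[team_1] < ranks[team_2] else team_2


bracket = ["A", "B", "C", "D", "E", "F", "G", "H"]
rounds = simulate_knockout(bracket, better_ranked)
for teams in rounds:
    print(teams)
```

A real model will not always agree with the rankings, which is where the upsets come from.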
In the Round of 16, the USA upset the higher-ranked Netherlands, who topped Group A without losing a game. Serbia, the lowest-ranked team remaining, was also able to upset Uruguay. Argentina, Brazil, England, France, and Belgium were all favorites and progressed, while Croatia was able to defeat similarly ranked Germany.
In the quarter-finals, the USA’s streak of luck ended as it fell short against 3rd-ranked Argentina. The top two ranked teams, Brazil and Belgium, were able to defeat Croatia and Serbia respectively. Finally, cross-channel rivals France and England were ranked 4th and 5th respectively, and the French progressed to the semi-finals.
The four semi-finalists were all ranked in the top five by FIFA ranking, underlining how strongly the model weighs that feature. Predictably, first-ranked Brazil and second-ranked Belgium progressed to the final, where Brazil is predicted to win.
Knockout stage predictions
So that’s it. With the given data, the Engine predicts Brazil will win the 2022 FIFA World Cup. However, sports are notoriously hard to predict, with a million different variables at play. I’d love to improve the model by adding more variables to the training data and making a more sophisticated model.
Not sure where to start with machine learning? Reach out to us with your business problem, and we’ll get in touch to show how the Engine can help you specifically.