The 2022-23 NBA regular season has wrapped up, players are done recording their regular-season stats, and we’ve got all the information needed to predict the NBA Most Valuable Player award. Most years, there is a stand-out candidate whose performances were head and shoulders above the rest, leading to an anticlimactic announcement of a result that everybody already knows.
But 2022-23 is not most seasons.
This year, the finalists are Nikola Jokić, Joel Embiid, and Giannis Antetokounmpo, all putting up some of the greatest individual seasons of all time, in the same year - it's unprecedented. It’s one of the tightest races for the award ever, but don’t take my word for it - Charles Barkley guaranteeeed on Inside the NBA “This is going to be the closest vote ever, in my opinion” - and we all know that Chuck is usually right about things.
Sidenote: I’m a diehard Celtics fan, and while Jayson Tatum has had an almost-MVP-worthy season, if there’s any way I can twist the data to argue that he should win, I’m going to do that (I couldn’t).
Keep reading, because we will use the AI & Analytics Engine to predict who will win the 2022-23 NBA MVP, using the power of machine learning, without needing a single line of code.
The MVP trophy is the most prestigious individual award given to a player. But what are the criteria to be the MVP? It should be an easy question, and there should be clear, consistent parameters defining what constitutes the MVP, so the award can be given fairly every year (there aren't).
The issue is trying to define “Valuable” because there are many different opinions.
The best player in the league?
The most impactful player in the league?
The best player on the best team?
The most valuable player to his team?
The player that contributes to winning the most?
There’s not really a clear answer. But here’s what we do know - narratives play a big part. Voter fatigue is a real thing; otherwise, Michael Jordan and LeBron James would have just kept winning year after year. But people get bored, and it’s a shame to see an all-time great finish their career without one on their resume. Often, the question isn’t really “Who’s the MVP?” but rather “Whose turn is it to win MVP?”.
This makes using machine learning a bit tricky because narrative is impossible to measure. So the side quest we’ll also be going on is answering the question “Which metrics are most important in winning the MVP?”.
The problem with this year is that each candidate has their own claim: Jokić is the most efficient, Embiid is the most dominant, and Giannis won the most. But even so, there’s not much separating them in each category.
There’s one thing that Embiid has that the other two don’t though - narrative. Jokić has won the award the previous two years, and Giannis won the two years before that. But Embiid has never won it, managing only to come second in the last two years.
We’ve seen ex-players turned media personalities like Rajon Rondo and Jalen Rose say Embiid is their choice, while JJ Redick has chosen Giannis, citing “The best player on the best team”.
If you’re unaware or need a refresher, these are some summary statistics (games played, points/rebounds/assists per game and team wins) for each player, to give you an idea of their respective seasons.
The contenders' summary stats
The first thing we have to do is understand how the MVP is decided. Each year, 100-130 members of the NBA media vote, each ranking five players, who receive 10, 7, 5, 3, and 1 point(s) respectively. The points are tallied, and the player with the most points wins.
There are two approaches we can take toward predicting the winner with machine learning:
Classification: The first way is by a classification method. We define a target column called "Is_MVP". It will contain 1 if the player is an MVP, and 0 otherwise. The problem with this method of building the training data is that the data is heavily imbalanced. Each season, there are hundreds of players, but only a single MVP. Overall, over 40 seasons in our data, we will have only 40 positive labels. This creates technical difficulties in training and evaluation.
Regression: The second possible way is using a regression method, and predicting a number. Because the number of voters changes each season, we can use the metric “MVP_award_share”, which is the number of votes divided by the number of possible votes. This works far better because each year there are about 10-20 players who receive at least one vote.
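To make the award share concrete, here is a minimal sketch of how it could be computed from ballots. It assumes the 10, 7, 5, 3, and 1 values described earlier are awarded by ballot position, and that the maximum possible total is a first-place vote on every ballot. The ballots and player names are made up for illustration, and `award_share` is a hypothetical helper, not something from the dataset or the Engine.

```python
# Points awarded for 1st through 5th place on a single ballot.
POINTS_BY_RANK = (10, 7, 5, 3, 1)


def award_share(ballots, player):
    """Return the share of the maximum possible points that `player` received."""
    earned = 0
    for ballot in ballots:
        for rank, name in enumerate(ballot):
            if name == player:
                earned += POINTS_BY_RANK[rank]
    # The maximum is a first-place vote on every ballot.
    max_points = POINTS_BY_RANK[0] * len(ballots)
    return earned / max_points


# Three made-up ballots with placeholder names, ordered 1st to 5th.
ballots = [
    ["A", "B", "C", "D", "E"],
    ["B", "A", "C", "E", "D"],
    ["A", "C", "B", "D", "E"],
]

share_a = award_share(ballots, "A")  # (10 + 7 + 10) / 30
share_b = award_share(ballots, "B")  # (7 + 10 + 5) / 30
```

Because the result is a fraction of the maximum, it stays comparable between seasons with different numbers of voters, which is the reason for using it as the regression target.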
It’s worth noting that we’re not making any decisions about what valuable means, or who had the objectively best season. We’re looking at which statistics seem to correlate with being voted MVP in the past (side quest), and according to those criteria, predicting which player this year had the most MVP-ish season.
The data is taken from this dataset, which was scraped from Basketball Reference and contains every single player's stats for every single season from 1982-2022. There are a few groups of statistics that we’ll be using in order to predict the MVP share variable:
The Dataset
Some humble but important stats are the number of games that a player plays, and how many games their team won. There is a slight problem with both, because there have been 4 seasons since 1982 in which the total number of games played was less than 82 (two were due to lockouts, and two were due to COVID-19). Therefore, I adjusted both to be a percentage of the possible games available.
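As a rough sketch of that adjustment, the snippet below divides a raw count by the number of games available that season. The function name and the example numbers are assumptions for illustration only; the original preprocessing was not published in this form.

```python
def pct_of_possible(count, season_games):
    """Express a games-played or team-wins count as a fraction of the season length."""
    if season_games <= 0:
        raise ValueError("season_games must be positive")
    return count / season_games


# Made-up example: the same 60 games played counts for more in a
# hypothetical 66-game shortened season than in a full 82-game season.
shortened = pct_of_possible(60, 66)
full = pct_of_possible(60, 82)
```

The same helper works for team wins, so both columns end up on a 0 to 1 scale regardless of season length.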
These are the stock-standard basketball statistics: how many points, rebounds, assists, steals, blocks, and turnovers they averaged per game. You get the gist.
There’s a bit of an issue when discussing per-game stats. The pace of the league changes over eras: in the 1980s and 2020s, teams played quickly, whereas the 1990s were slow. Players in higher-paced eras have more possessions in which to record per-game stats, so including percentage versions of the per-game stats helps adjust for this.
True Shooting% (TS%) is a statistic that takes into account that 3-pointers are worth more than 2-pointers, and incorporates free throw accuracy. In an ideal world, we’d adjust different eras by using True Shooting% relative to league average (TS%+), but that data wasn’t in the dataset I used. Ahh well, not a big deal.
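For reference, here is a minimal sketch of the commonly published True Shooting formula, PTS / (2 × (FGA + 0.44 × FTA)). The dataset already ships TS% as a column, so this is only to show what goes into it; the function name and the sample stat lines are made up.

```python
def true_shooting_pct(points, fga, fta):
    """True Shooting %: points per scoring attempt, halved so a made two is 100%."""
    # 0.44 is the usual estimate of how many free throw attempts use up a possession.
    scoring_attempts = fga + 0.44 * fta
    if scoring_attempts == 0:
        return 0.0
    return points / (2 * scoring_attempts)


# Made-up stat lines for illustration.
only_twos = true_shooting_pct(points=10, fga=10, fta=0)    # 5-for-10 on twos
with_extras = true_shooting_pct(points=30, fga=20, fta=10)  # threes and free throws included
```

A player who makes half of their two-pointers lands on exactly 50%, and made threes or free throws push the number above plain field goal percentage.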
Ahh yes, advanced metrics, the phrase that makes NBA old-heads shudder. These are various metrics created by data scientists with a love for sports, with the goal of quantifying how good a sports player is. The ones we’ll be using are:
Player Efficiency Rating (PER): A measure of per-minute production standardized such that the league average is 15.
Win Shares (WS, OWS, DWS): An estimate of the number of wins contributed by a player. This also has offense and defense variations.
Box Plus/Minus (BPM, OBPM, DBPM): A box score estimate of the points per 100 possessions a player contributed above a league-average player, translated to an average team. This also has offense and defense variations.
Value Over Replacement Player (VORP): A box score estimate of the points per 100 TEAM possessions that a player contributed above a replacement-level (-2.0) player, translated to an average team, and prorated to an 82-game season.
You don’t need to know exactly how they’re all calculated and the differences between them, because it’s not really important. Just remember that generally, the higher the number = the better they played.
The first step is processing the training data. Of the more than 17,000 entries, player-seasons with a non-zero MVP share make up only around 3%, so it’s worth filtering. The obvious candidates to drop are player-seasons with few games played or few minutes per game, so I limited the training data to those with:
Greater than 30 minutes per game
Greater than 60% of games played (equivalent to 49 in a regular 82-game season)
This reduced the training data set to 3500 entries, where non-zero MVP share represented 15% of the total dataset. For each season, there were roughly 80-90 qualifying players. One interesting note is that the number of players that got a vote has trended downwards, meaning the decision has become more unanimous in the last decade.
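The filter itself is simple enough to sketch in a few lines. The column names (`mp_per_game`, `games_pct`, `award_share`) and the rows below are hypothetical stand-ins, not the dataset's real schema.

```python
MIN_MINUTES_PER_GAME = 30
MIN_GAMES_PCT = 0.60


def qualifies(row):
    """Keep only player-seasons with heavy minutes and enough games played."""
    return (
        row["mp_per_game"] > MIN_MINUTES_PER_GAME
        and row["games_pct"] > MIN_GAMES_PCT
    )


# Made-up rows for illustration.
rows = [
    {"player": "A", "mp_per_game": 35.1, "games_pct": 0.90, "award_share": 0.40},
    {"player": "B", "mp_per_game": 33.0, "games_pct": 0.55, "award_share": 0.00},
    {"player": "C", "mp_per_game": 18.4, "games_pct": 0.95, "award_share": 0.00},
    {"player": "D", "mp_per_game": 31.2, "games_pct": 0.80, "award_share": 0.00},
]

kept = [row for row in rows if qualifies(row)]

# Fraction of the remaining rows that received any MVP votes.
nonzero_fraction = sum(1 for row in kept if row["award_share"] > 0) / len(kept)
```

Dropping low-minute and low-availability seasons removes rows that were never going to get votes, which is why the share of non-zero targets goes up after filtering.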
With the data ready, it was time to go to the AI & Analytics Engine.
The first step is to upload the training data, which the machine learning models learn from in order to predict the unknown 2023 data. As mentioned, this is a regression problem, because we are predicting a numerical value for the MVP award share column.
The next step is to define the feature set, or the predictors that are important. It was important to deselect the player information columns, like name and team, because we don’t want the models to mistake a correlation with a particular name or team for a cause of winning.
The next stage is building the models. I built three different models, all using different tree-based machine learning algorithms that each train on the data in a different way. Despite all having fairly similar prediction quality (R² scores), they work differently and produce different results, so the average of the three will be taken.
Model Summary in the AI&A Engine
After each model has been trained, the last step is to upload the test 2022-23 season data, where the MVP share is obviously unknown. The Engine spits out a CSV file. We just repeat that process for each of the three models, and then we can put the results into a spreadsheet and compare them.
Now for that little side quest from earlier - let's investigate the stats that MVP players have historically been really strong in. The Feature Importance tab in the Engine allows us to see exactly how much each feature impacts each model. Here are the results.
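I did the averaging in a spreadsheet, but the same step can be sketched in code. The model names, player labels, and predicted shares below are made up for illustration, and `average_predictions` is a hypothetical helper rather than anything the Engine exports.

```python
from statistics import mean


def average_predictions(per_model):
    """Average each player's predicted award share across models, highest first.

    `per_model` maps a model name to a {player: predicted_share} dict, and
    every model is assumed to cover the same set of players.
    """
    players = next(iter(per_model.values())).keys()
    averaged = {
        player: mean(preds[player] for preds in per_model.values())
        for player in players
    }
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Made-up predictions standing in for the three exported CSV files.
per_model = {
    "model_1": {"X": 0.6, "Y": 0.5},
    "model_2": {"X": 0.4, "Y": 0.6},
    "model_3": {"X": 0.5, "Y": 0.7},
}

ranking = average_predictions(per_model)
```

A plain mean gives each model equal say, which is a reasonable default when their R² scores are close.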
Feature importance of the ML models
Win shares, player efficiency rating, and win-loss % are all leading indicators, ranking in the top four for all three models. The XGBoost and LightGBM regression models both have PPG as the 3rd most impactful feature, whereas in extremely randomized trees it ranks 6th (that plays a big part in each model's predictions).
Alright, finally the results. Here they are.
ML-powered MVP predictions
We see that the extremely randomized trees model heavily favors Jokić, whereas the XGBoost and LightGBM regressions both moderately favor Embiid. All three consider Giannis to have had a strong, but not quite MVP-ish season.
After all of that work, there’s still only a marginal difference splitting Embiid and Jokić in the average of all three models. In my opinion, it’s going to Embiid, due to a factor that machine learning can’t possibly quantify - the narrative. He’s been so close the past two years, it's hard to see it not going to him.
NOTE: The MVP has just been announced, and Joel Embiid has indeed won with an MVP award share of 0.915, with Jokić coming second with 0.674, and Giannis third with 0.606. I'm also happy to say Jayson Tatum was the consensus fourth place, with 0.280.
If you’re interested in reading about using the AI & Analytics Engine to predict other sports results, check out my blog on predicting the 2022 World Cup.
Not sure where to start with predictive analytics? Reach out to us with your business problem, and we’ll get in touch with how the Engine can help you specifically.