It’s a question as old as time, or at least as old as public football analytics. How good are our prediction models?
To try and answer this question, I’ll be using the Ranked Probability Score (RPS). The RPS is a statistical measure of the quality of probabilistic predictions: it calculates how far off the predictions were from the actual results, which means that a low RPS is better than a high one.
If you want to know more about how the RPS works and how to calculate it, I suggest you skip to the end of this post here.
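For reference, the RPS for a three-way football forecast (home/draw/away) compares the cumulative predicted probabilities to the cumulative observed outcome. A minimal sketch in Python (the function name and signature are mine):

```python
def rps(probs, outcome):
    """Ranked Probability Score for one match.

    probs   -- predicted probabilities [home, draw, away]
    outcome -- index of the actual result (0 = home, 1 = draw, 2 = away)
    Lower is better; 0 means a fully confident, correct forecast.
    """
    observed = [1.0 if i == outcome else 0.0 for i in range(len(probs))]
    cum_pred = cum_obs = total = 0.0
    # Sum the squared differences of the cumulative distributions
    # over the first r - 1 categories.
    for p, o in zip(probs[:-1], observed[:-1]):
        cum_pred += p
        cum_obs += o
        total += (cum_pred - cum_obs) ** 2
    return total / (len(probs) - 1)
```

For example, a fully confident correct forecast scores 0, while confidently predicting an away win that turns out to be a home win scores the maximum of 1.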
I’ve started to collect the predictions of various public models for the Premier League since fivethirtyeight.com started publishing theirs in January.
This means that for our first episode of “Statistical Models – what do they know? Do they know things?? Let’s find out!” (in short: SMWDTKDTKTLFO) we will be looking at the track record for game weeks 22-29 (73 games) of the following public models:
We will also be looking at how the highest available betting odds performed. The odds I used for this were available on Betbrain Friday prior to the matches.
So, what is the RPS for each of these models?
None of the public models outperformed the betting market when it comes to accuracy so far. Of course, Betbrain had an advantage over the statistical models, since betting markets incorporate a lot of information from various sources. But what if we build our own wisdom of crowds out of the models we have?
If we combine all six models, we get a new model with the following RPS:
Combining all prediction models and using those probabilities to forecast all 73 matches resulted in an RPS that outperformed every single model on its own, including the betting markets. I think this underlines the quality of each one of those models, since they all seemed to add valuable information to some degree.
There are different combinations of models that resulted in an even lower (better) RPS, but to avoid building an overfitted ‘super model’, I simply took the mean of every model I had the data for.
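Combining the models is nothing more than a per-outcome mean of their probability vectors. A sketch, with made-up numbers standing in for two hypothetical forecasts of the same match:

```python
def combine(models):
    """Average several probability forecasts for one match.

    models -- list of [home, draw, away] probability vectors.
    Returns the element-wise mean, which remains a valid distribution
    as long as each input sums to 1.
    """
    n = len(models)
    return [sum(m[i] for m in models) / n for i in range(3)]

# Two hypothetical forecasts for the same match:
print(combine([[0.50, 0.30, 0.20], [0.40, 0.30, 0.30]]))
```

With more models the idea is identical: each model contributes equally to the averaged probabilities.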
Money, money, money
An even older question than ‘how good are my predictions?’ is probably ‘how can I make money with this?’
To simulate a sensible staking strategy, I used this formula:
stake = (expected profit) / (odds - 1)

where the expected profit per unit staked is (probability * odds) - 1.
This means that if a model put an outcome at 70% while being offered odds implying 60% (decimal odds of 1.66), the stake would have been 0.25 units.
((0.7*1.66)-1) / (1.66-1) ≈ 0.25
A predicted outcome at 30%, with offered odds implying 20% (decimal odds of 5), would have resulted in a stake of 0.125 units.
((0.3*5)-1) / (5-1) = 0.125
This should make intuitive sense: a likely event results in a high stake, while an unlikely event results in a low one.
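The staking rule above can be sketched in a few lines of Python (the function name is mine; the rule places no bet when there is no positive expected profit):

```python
def stake(prob, odds):
    """Stake (in units) proportional to expected profit per unit of risk.

    prob -- the model's probability for the outcome
    odds -- decimal odds on offer
    Only positive-expectation bets (prob * odds > 1) receive a stake.
    """
    expected_profit = prob * odds - 1
    if expected_profit <= 0:
        return 0.0
    return expected_profit / (odds - 1)

print(round(stake(0.7, 1.66), 2))   # the 70% / 1.66 example
print(stake(0.3, 5.0))              # the 30% / 5.0 example
```

Note that bets without an edge simply get a stake of zero rather than a negative one.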
Interestingly, every model returned a profit over the observed period. We will get to why this might be in a minute. For now, let’s look at the added Return on Investment (ROI) column.
Before going any further, it should be noted that every model bet against the odds used for the Betbrain model, which were the highest available odds. This is the first reason why the models yielded such high, positive results. But why did they yield a positive result at all, when they weren’t as accurate as the odds they were betting against? And why does betbot have the lowest return on investment, despite outperforming three other models when it comes to the RPS?
One reason that comes to mind would of course be the staking strategy used. But there is a deeper meaning to this:
Betting results do not measure the quality of prediction models as accurately as the RPS does. In fact, betting results alone mostly don’t show anything significant beyond pure chance, as Joseph Buchdahl has shown in his article here.
Take these three predictions for example:
The expected yield for the first bet ((0.7*1.66)-1) would have been 16%, if the prediction model had perfectly captured reality.
The only observable outcomes for this game, though, are a -100% yield or a +66% yield. It takes a long time and a lot of bets for this effect to even out.
Using the staking method I described earlier would have resulted in a return on investment of 28.1% for these three games; using a level-stakes strategy (the same amount of units per bet, no matter how high or low the odds) would have resulted in a 77% return on investment.
Those predictions would have yielded a positive result, even though the implied bookmaker probabilities were more accurate. However, sustaining betting profits without a method that outperforms the accuracy of the betting markets seems to be an impossible task over the long run.
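How slowly realized yields converge to the expected yield is easy to demonstrate with the 70% / 1.66 example, assuming the model’s probability is exactly right. A quick level-stakes simulation (the seed is arbitrary):

```python
import random

random.seed(42)

TRUE_PROB, ODDS = 0.7, 1.66          # the example bet; model assumed perfect
EXPECTED_YIELD = TRUE_PROB * ODDS - 1  # 0.162, i.e. roughly 16%

def average_yield(n_bets):
    """Mean realized yield over n_bets level-stake bets."""
    total = 0.0
    for _ in range(n_bets):
        won = random.random() < TRUE_PROB
        # Each bet can only ever return +66% or -100%.
        total += (ODDS - 1) if won else -1.0
    return total / n_bets

for n in (10, 100, 10000):
    print(n, round(average_yield(n), 3))
```

Small samples swing wildly around the 16% expectation; only with thousands of bets does the average settle down, which is exactly why short-term betting results say little about model quality.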
A grain of salt
It is important to keep track of our predictions and to hold them accountable, but there is a lot of uncertainty in football, and a sample of 73 games is not that much.
Looking at how a very simple model would have performed in these matches can shed some light into how we can interpret the results so far.
To build this model, I will try to simulate how someone who doesn’t know anything about football might predict Premier League matches: my grandma, for example, or yours; let’s say our grandma.
Our grandma doesn’t know much about football, but she does know that there is something called home field advantage. To incorporate this knowledge, 50% of every prediction she makes will consist of the following fixed probabilities: a 45% chance for the home team to win, a 25% chance for a draw, and a 30% chance for the away team to win.
Now, grandma isn’t stupid, so she knows that some teams are better than others. She knows Manchester United must be good, because she has heard of them before. What she hasn’t heard of though is Crystal Palace, or whatever the kids call it these days.
To simulate this knowledge of good teams and underdogs, the other 50% of our grandma model will consist of the Betbrain model.
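The grandma model is then a straight 50/50 blend of that fixed home-advantage prior with the Betbrain probabilities. A sketch (the market numbers in the example are made up):

```python
def grandma(betbrain_probs):
    """50/50 blend of a fixed home-advantage prior with market probabilities.

    betbrain_probs -- [home, draw, away] implied by the highest available odds.
    The prior (45/25/30) is the post's simple home field advantage guess.
    """
    prior = [0.45, 0.25, 0.30]
    return [0.5 * p + 0.5 * b for p, b in zip(prior, betbrain_probs)]

# A hypothetical market forecast for one match:
print(grandma([0.60, 0.22, 0.18]))
```

The blend pulls every market forecast halfway back towards the fixed prior, making grandma systematically less confident than the market.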
This is a very simple concept that basically anyone could build without knowing much about anything. Now let’s see how grandma performed against the other more sophisticated models:
This is a really good RPS, outperforming my own model and the Euro Club Index. The return on investment of our grandma model was 38% over 73 games, leaving all of you nerds in the dust.
To make a little more sense of why such a simple model could have outperformed something as sophisticated as the Euro Club Index, we can look at how probable the observed RPS for each model was, assuming the probabilities of each model were perfect.
If we simulate those 73 games a couple of thousand times based on the probabilities of our simple model, we would get an average RPS of 0.2199 with a standard deviation of 0.0138.
Note that a high RPS in this case only means that the model predicted there would be a lot of uncertainty in those matches, while a low RPS in this case means that the model was very confident in its predictions.
We can calculate from this how likely our observed RPS was, given the probabilities of the model were perfect.
The result of this analysis is that the observed RPS lies +1.58 standard deviations above the mean. This means that such a high RPS will probably regress towards the average RPS, and even that assumes those probabilities were perfect. Since they probably weren’t, the accuracy of this prediction model will likely take an even greater hit over the coming weeks.
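This kind of check can be reproduced by repeatedly simulating the 73 games from a model’s own probabilities and scoring each simulated season with the RPS. A sketch with identical made-up forecasts standing in for the real ones (the observed RPS plugged in at the end is hypothetical):

```python
import random

random.seed(1)

def rps(probs, outcome):
    """Ranked Probability Score for one three-way forecast (lower is better)."""
    cum_p = cum_o = total = 0.0
    for i in range(len(probs) - 1):
        cum_p += probs[i]
        cum_o += 1.0 if i == outcome else 0.0
        total += (cum_p - cum_o) ** 2
    return total / (len(probs) - 1)

def simulate_rps(forecasts, n_sims=2000):
    """Mean and standard deviation of the average RPS, assuming the
    forecasts themselves are the true match probabilities."""
    means = []
    for _ in range(n_sims):
        total = 0.0
        for probs in forecasts:
            # Draw the match outcome from the model's own probabilities.
            outcome = random.choices([0, 1, 2], weights=probs)[0]
            total += rps(probs, outcome)
        means.append(total / len(forecasts))
    mu = sum(means) / n_sims
    sd = (sum((m - mu) ** 2 for m in means) / n_sims) ** 0.5
    return mu, sd

# 73 identical made-up forecasts standing in for the real ones:
forecasts = [[0.45, 0.25, 0.30]] * 73
mu, sd = simulate_rps(forecasts)
z = (0.245 - mu) / sd   # z-score of a hypothetical observed average RPS
print(round(mu, 4), round(sd, 4), round(z, 2))
```

With the real per-match forecasts in place of the placeholder list, this yields the simulated mean, standard deviation, and z-score reported above.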
Now let’s look at the final table to get a better picture of what to expect from these models in the future:
The z-score shows how many standard deviations the observed RPS was away from what could have been expected if the probabilities of each model were perfect. Besides suggesting that we shouldn’t expect too much of our grandma model, the z-score can also show that Bing, for example, has been a bit overconfident in its predictions, while the Euro Club Index appears to be really well calibrated, even though its predictions could be more decisive.
All in all though, the results of all of these models seem to be really close to reality, and it will be exciting to track their performance in the future!