Predicting the Indian Premier League

Overview

The Indian Premier League (IPL) is an annual Twenty20 cricket competition that takes place during April and May. In 2018, 8 teams competed over 60 matches, with the tournament won by the Chennai Super Kings. This article looks at the data to see if we can predict the outcome of a given match. The specific data used covers all matches played from the 2008 (inaugural) season to the 2016 season inclusive. The two files used are:

  • Match data including details of the teams playing in each match, who won the toss, match result and venue. (577 rows)
  • Ball by ball data containing details of each ball bowled in each match including the batsman, bowler, number of runs scored, how they were scored and the dismissal information if the ball resulted in a wicket. (136,598 rows)

Initial Analysis

The first task undertaken was to reshape the ball-by-ball data into a more meaningful format. This involved recreating the batting and bowling statistics for each match and checking that these figures matched those on the official match scorecards.

We can then do some interesting analysis, such as calculating the average runs scored (total runs / total times out) and strike rate (100 x runs scored / balls faced) and plotting these by batting position.
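As a rough illustration, the aggregation involved looks like the following R sketch. The file name and column names (batting_position, batsman_runs, player_dismissed) are assumptions and would need to match the actual ball-by-ball file.

    # Sketch only: batting average and strike rate by batting position.
    library(dplyr)

    deliveries <- read.csv("deliveries.csv")   # assumed: one row per ball bowled

    by_position <- deliveries %>%
      group_by(batting_position) %>%
      summarise(
        runs        = sum(batsman_runs),
        balls       = n(),
        outs        = sum(!is.na(player_dismissed) & player_dismissed != ""),
        average     = runs / outs,           # total runs / total times out
        strike_rate = 100 * runs / balls     # 100 x runs scored / balls faced
      )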

This chart shows that average runs decrease by batting position. This makes sense, as you typically top-load the innings with your best batsmen. Interestingly, the strike rate is very similar for batsmen 1-7, decreasing from batsman no. 8 onwards.

We can also look at how runs, wickets, 4s and 6s vary by over.

We can clearly see the effect of the powerplay during the first 6 overs, where only 2 fielders are allowed outside the inner circle: 4s are more likely to be scored here than during any other period of the innings. The average runs scored per over increase until over 6 before dropping and then steadily increasing again until the end of the innings. By the end of the innings, you are nearly as likely to see a 6 scored as a 4, with a better than evens chance of seeing a wicket or a 6 during the final over.
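A similar sketch gives the over-by-over breakdown behind this chart, again with assumed column names (over, total_runs, match_id, inning):

    # Sketch only: average runs, wickets, 4s and 6s per over across all innings.
    by_over <- deliveries %>%
      group_by(over) %>%
      summarise(
        avg_runs = sum(total_runs) / n_distinct(match_id, inning),
        wickets  = sum(!is.na(player_dismissed) & player_dismissed != ""),
        fours    = sum(batsman_runs == 4),
        sixes    = sum(batsman_runs == 6)
      )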

Player Statistics

For each player, I decided to look at 10 different statistics that I had found suggested elsewhere: 5 for batting performance and 5 for bowling performance. These were:

Batting

  • Hard Hitter - [No. of runs scored off 4s and 6s] / [No. of balls faced]
  • Finisher - [No. of times not out] / [No. of times batted]
  • Fast Scorer - [Total Runs Scored] / [No. of balls faced] (equivalent to Strike Rate)
  • Consistency - [Total Runs Scored] / [No. of Innings Out] (equivalent to Batting Average)
  • Running Between Wickets - [No. of runs not scored from boundaries] / [No. of balls faced with no boundaries hit]

Bowling

  • Economy - [Runs conceded] / [No. of Overs Bowled]
  • Wicket Taker - [No. of Balls Bowled] / [No. of Wickets Taken]
  • Consistency - [No. of Runs Conceded] / [No. of Wickets Taken]
  • Big Wicket Taker - [No. of times 4 or more wickets taken] / [No. of Innings Played]
  • Short Performance - [No. of wickets taken in innings when taking less than 4 wickets] / [No. of Innings played where taking less than 4 wickets]

In addition, for the bowling I looked at the economy of bowlers during the death overs; these are overs 16-20 of the innings, where batsmen are usually looking to hit the ball out of the ground. I also looked at a measure called EconomyRX, which is the economy for a specific bowler less the overall economy of all bowlers across the matches he has been involved in. This should strip out the effect of a good bowler mostly bowling on flat pitches.
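To give a flavour of how the batting measures might be computed per player, here is a rough R sketch. The statistic names are my own, each mirrors the definition above, and I've treated every 4 or 6 off the bat as a boundary; the death-overs economy and EconomyRX measures would follow the same pattern.

    # Sketch only: the five batting statistics per batsman.
    batting_stats <- deliveries %>%
      group_by(batsman) %>%
      summarise(
        balls         = n(),
        runs          = sum(batsman_runs),
        boundary_runs = sum(batsman_runs[batsman_runs %in% c(4, 6)]),
        outs          = sum(!is.na(player_dismissed) & player_dismissed == batsman),
        innings       = n_distinct(match_id),
        hard_hitter   = boundary_runs / balls,
        finisher      = (innings - outs) / innings,  # proportion of innings not out
        fast_scorer   = runs / balls,                # strike rate
        consistency   = runs / outs,                 # batting average
        running       = (runs - boundary_runs) / sum(!batsman_runs %in% c(4, 6))
      )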

Player Database

For each match, I formed the statistics for each player in the match up to that point in time, based on all matches played previously. We can use these to compare players. The following chart shows Virat Kohli and Chris Gayle compared on their batting statistics as at the end of the 2016 season, where each measure has been taken as a % of the highest value in each category. For example, Shaun Marsh had the highest consistency with a value of 53.5; Chris Gayle's consistency was 43.9, hence 82% has been assigned.

Whilst Chris Gayle has a higher strike rate and average and hits a higher proportion of 4s and 6s, Virat Kohli shows a greater tendency to score runs by running between the wickets.

If a player hasn't played many matches then naturally their statistics may be quite volatile. To overcome this issue, I looked at all players who had played at least 80 innings, to understand how many innings a player needs before their statistics become credible. This analysis suggested 30 innings, and so the statistics for each player were formed using a credibility formula, blending each player's own figures with the average statistic across all players who had played at least 30 innings.
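An illustrative form of such a credibility blend (the exact formula used may differ slightly) is:

    \[
    \text{credible statistic} = Z \times \text{player statistic} + (1 - Z) \times \text{all-player average},
    \qquad Z = \min\!\left(\sqrt{\tfrac{N}{30}},\, 1\right)
    \]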

where N is the no. of innings played.

Final Database

The next step was to take the player statistics and aggregate them into statistics for each team going into each match. For the batting statistics, this was done by weighting each batsman's statistics by the expected number of balls he would face, based on his past record. For the bowling statistics, the same methodology was used, but weighting by the expected number of balls each bowler would bowl.
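A minimal sketch of the batting side of this aggregation, assuming a per-player table (player_stats) that already holds the five batting statistics plus an expected_balls column derived from each batsman's past record:

    # Sketch only: team batting statistics as a weighted average of the
    # players' statistics, weighted by expected balls faced.
    team_batting <- player_stats %>%
      group_by(match_id, team) %>%
      summarise(across(
        c(hard_hitter, finisher, fast_scorer, consistency, running),
        ~ weighted.mean(.x, w = expected_balls)
      ))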

For each team in each match, we can now form a table of inputs and a response variable to feed into a GLM. The 13 inputs were the 5 batting statistics, 7 bowling statistics (including the 3 different measures of economy) and whether the team played at home, away or at a neutral venue. The 12 statistic inputs were defined as the statistics for the team being analysed less the statistics for the opposition team. The response variable is simply whether the team won or lost.

Forming our GLM

To form and test the parameters for the GLM, I first discarded the data from the 2008 to 2010 seasons, as the statistics formed for those seasons don't have much credibility. I also ignored matches that ended with no result, e.g. those rained off. The remaining 6 seasons were split, with 4 seasons of data used to train the algorithm and form the parameters of the GLM and the remaining 2 seasons used to test the accuracy of the model.

The GLM was formed using logistic regression, as the response is a binary win/lose (1/0) variable. After trying a few different combinations, the statistics that weren't good predictors of whether a team won or lost were excluded, arriving at the following GLM and output from R:
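The fitted call was broadly of the form below; the variable names and the exact set of retained predictors shown here are illustrative rather than the precise final specification.

    # Sketch only: logistic regression on the team-less-opposition differences.
    # Each row is one team in one match; won is 1 for a win and 0 for a loss.
    fit <- glm(
      won ~ finisher_diff + wicket_taker_diff + fast_scorer_diff + home_away,
      family = binomial(link = "logit"),
      data   = train                         # the 2011-2014 training seasons
    )
    summary(fit)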

From the above, the key statistics appear to be:

  • Finisher - batsmen who don't get out; the more likely a batsman is to still be in at the end of the innings the better.
  • Wicket Taker - the number of balls needed on average to take a wicket; the fewer balls required the better.
  • Home Advantage - there appears to be a big advantage to playing a match in front of a home crowd.

Other observations:

  • Fast Scorer (Strike Rate) appears to have a negative effect with teams with the lower expected strike rate more likely to win. This appears counter intuitive and needs further investigation.
  • The economy of a bowler appears to have no predictive element in determining if a team is likely to win or not. In fact, the only significant bowling statistic appears to be 'Wicket Taker' indicating that batting power is more important than bowling power or perhaps we need to determine more predictive bowling statistics from the data. It's likely to be a combination of both of these factors.

We can check that this reduced GLM with 7 predictors fits as well as our full GLM with all 13 predictors by running an ANOVA test to compare.
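In R this is a straightforward analysis-of-deviance between the two nested fits, along the lines of:

    # Sketch only: compare the reduced model with the full 13-predictor model.
    anova(fit_reduced, fit_full, test = "Chisq")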

The chi-squared test value of 0.84 is non-significant, indicating that the additional predictors don't add materially to the predictive power of the model.

The residual deviance was also checked for over-dispersion. This test checks that the variance of the response variable is similar to what would be expected from a binomial distribution, by looking at the residual deviance divided by the residual degrees of freedom. This ratio should be close to 1, and for this model we obtain a value of 1.33.
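The check itself is a one-liner on the fitted model:

    # Sketch only: dispersion ratio, which should be close to 1 for a binomial GLM.
    deviance(fit) / df.residual(fit)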

Testing the GLM

Running the 2015 and 2016 data that we held back when forming our GLM through the model, we can derive a probability of each team winning. If the output is larger than 0.5 then we take the prediction to be 'Win', and if it is less than 0.5 we predict 'Lose'. The GLM obtains the correct result 59% of the time. This at least tells us that our model is better than a chimpanzee tossing a coin!
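A sketch of scoring the hold-out seasons, assuming a test data frame built in the same way as the training data:

    # Sketch only: predicted win probabilities and accuracy on the 2015-2016 hold-out.
    test$p_win      <- predict(fit, newdata = test, type = "response")
    test$prediction <- ifelse(test$p_win > 0.5, "Win", "Lose")
    mean((test$prediction == "Win") == (test$won == 1))   # proportion of correct calls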

We can further test the predictive power of our model by looking at the odds quoted for each match before it started, taken from oddsportal.com. Taking the bookmakers' favourite as the predicted winner of each match, the odds correctly predict only 55% of the matches.

We can then develop a betting system for the IPL: if the GLM predicts a higher probability of a win than the odds suggest, we place a £10 bet on that team winning. Note that in a given match the probabilities from the GLM will add up to 100%, whereas the probabilities implied by the odds will add up to slightly more, to allow for the bookmaker's profit margin. For example, in the first match of the 2015 season between the Mumbai Indians and the Kolkata Knight Riders (KKR), the GLM suggests a 43% chance of a win for Mumbai and a 57% chance for KKR, whereas the odds imply a 46% chance for Mumbai and a 58% chance for KKR. In this instance, we wouldn't place a bet.
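With decimal odds, the betting rule can be sketched as follows, where odds_decimal is the bookmaker's pre-match price for the team in question (an assumed column):

    # Sketch only: back a team with £10 whenever the model's probability of a win
    # exceeds the probability implied by the bookmaker's decimal odds.
    test$implied_prob <- 1 / test$odds_decimal
    test$bet          <- test$p_win > test$implied_prob
    test$return       <- ifelse(test$bet,
                                ifelse(test$won == 1, 10 * (test$odds_decimal - 1), -10),
                                0)
    sum(test$return)   # overall profit or loss across the two seasons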

Using our betting system on the 2015 and 2016 seasons yields the following results.

Summary

Our analysis appears to suggest that we can build a GLM to predict the IPL and make some money on the betting markets. However, we would need to do further analysis to ensure that the results seen for the 2015 and 2016 seasons weren't a fluke, by looking at the results from the 2017 and 2018 seasons. In addition, the bookmakers could have changed their algorithms and methodology since 2016, so there may no longer be an edge to be gained.

There are also added complications: the data fed into the model for each match relies on player statistics up to that point in time, so we would need to update our statistics throughout the tournament. Further, the team line-ups aren't confirmed until the toss, and the toss itself changes the odds, as it decides which team will bat first. One way around this could be to guess the teams; they tend not to vary much from match to match, and in particular the key players, who have the largest effect on the overall statistics, will be expected to start every game unless injured.

Some further work is required (see below) but it looks like we may have a working, predictive model!

Further Developments

  • We can look into measures of performance other than the 10 statistics mentioned above. Perhaps these have greater predictive power.
  • The analysis above only uses batting and bowling statistics. We also have data for fielding, including run outs and catches. Fielding in Twenty20 is increasingly being shown to be important, so we could use this data to further enhance our model.
  • The statistics for each player are gathered using every match they have played up until the match we are trying to predict, with the same weight placed on each match. We could put more weight on recent matches, e.g. to allow for players who have gone past their peak and whose performances are deteriorating.
  • We now have the data from the 2017 and 2018 tournaments so can use this to further refine and test our model.
  • We have data about which team won the toss. We could use this as a predictive element, but we would need to gather post-toss odds in order to analyse it properly.
  • We can use the ball-by-ball data to predict, among other things, the likelihood of a team winning from a given situation. This would enable us to try to find an edge for in-play betting. I envision this analysis being somewhat similar to that used to produce the Duckworth-Lewis-Stern tables for rain-affected matches.
  • We could approach the problem in a completely different way and use an alternative technique such as Support Vector Machines or Clustering.

Data Source

