Data Wrangling and Predictive Analytics for ICC Cricket World Cup 2019 using Pyspark & Python ( Part 2 of 2 )
In the first part of this blog series, we acquired the historical batting/bowling records of players participating in the Cricket World Cup 2019 by scraping and crawling the data-rich espncricinfo portal. The acquired data was then used to analyze player performance through an interactive web form with charts and graphs developed in a Qubole Notebook. In this final part of the series, we will dive into predicting/forecasting the future performance of a player. As a disclaimer, I would like to reference a famous adage in Data Science: "All models are wrong, but some are useful." So I will start out by saying that the predictive time series models I demonstrate here may be completely wrong, and if some or all of the predictions come true, I will consider that a mere coincidence.
Several variables impact player performance, so in order to avoid boiling the ocean, we will explore a simple forecasting technique called Time Series Analysis. One of its early applications was stock ticker price forecasting: given historical observations of a measure (in this example, the ticker price over time), time series analysis fits the price over time while accounting for trends, cycles, and seasonality. This then helps extrapolate and forecast the stock price into the near future.
Time series analysis generally does not require distributed-scale processing, as the series data is summarized and not of big data class. While there are distributed time series libraries like spark-ts, they lack the breadth of what the Python statsmodels package offers, so I decided to stick with statsmodels to complete this solution.
To start from where we left off in part 1, I will attempt to do time series analysis for my favorite Cricketer Virat Kohli to predict the runs he will score in the group stage of the tournament (round-robin stage). To cut to the chase, the time series model I created produced the following predictions of runs scored by Virat Kohli in the group stages of the ICC Cricket world cup.
########Virat Kohli world cup match by match runs predictions#######
#South Africa vs India on June 5th: 47
#India vs Australia on June 9th: 42
#India vs NewZealand on June 13th: 46
#India vs Pakistan on June 16th: 46
#India vs Afghanistan on June 22: 45
#West Indies vs India on June 27: 45
#England vs India on June 30th: 45
#Bangladesh vs India on July 2nd: 44
#Srilanka vs India on July 6th: 44
##################################################################
Introduction to Basic Time Series Models:
Mean Constant Model: This very simple forecasting model assumes the mean is constant over time, a property known as the time series being stationary. For such a simple stationary time series, the best prediction is just the mean of the historical data.
Linear Trend Model: Most naturally occurring time series in business and economics are not at all stationary; instead they exhibit various kinds of trends, cycles, and seasonal patterns. The idea of the linear trend model is to fit a sloping line (y = mx + c) and use it to forecast the near future.
Random Walk Model: When faced with a time series that shows irregular growth, the best strategy may not be to directly predict the level of the series at each period. Instead, it may be better to predict the change that occurs from one period to the next. Calculating this change in order to predict it is referred to as differencing, and this is the guiding principle behind the random walk model.
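To make these three baseline models concrete, here is a minimal sketch on a small hypothetical series of runs (the numbers are illustrative, not Virat's actual scores):

```python
import numpy as np
import pandas as pd

# Hypothetical runs scored over ten matches (illustrative values)
runs = pd.Series([35, 50, 12, 80, 44, 67, 23, 90, 55, 71])

# Mean constant model: the forecast is simply the historical mean
mean_forecast = runs.mean()

# Linear trend model: fit y = m*x + c by least squares, then
# extrapolate one step past the end of the series
x = np.arange(len(runs))
m, c = np.polyfit(x, runs, 1)
trend_forecast = m * len(runs) + c

# Random walk model (with drift): work on the period-to-period
# changes; the forecast is the last value plus the average change
diffs = runs.diff().dropna()
rw_forecast = runs.iloc[-1] + diffs.mean()
```

Each model produces a one-step-ahead forecast from the same history; which one is best depends on whether the series is flat, trending, or irregular.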
The code below filters out the selected player's data, which in this case happens to be Virat Kohli, and produces a line plot of runs scored along with the mean, to give some basic understanding of the trends and patterns in Virat's historical batting performance.
from pylab import rcParams
import pandas as pd

rcParams['figure.figsize'] = 8, 6

# Filter the cleaned batting records down to the selected player
# and bring them into pandas for time series analysis
virat_batsman = clean_batting_spark_df.filter(
    clean_batting_spark_df.PlayerID == selectedPlayerProfile
).toPandas()

runs = virat_batsman['Runs'].astype('float')
date = pd.to_datetime(virat_batsman['Start Date'])
virat_df = pd.DataFrame({'date': date, 'runs': runs})
virat_df = virat_df.set_index('date')

# Overlay the historical mean (the mean constant model's prediction)
model_mean_pred = virat_df.runs.mean()
virat_df['runsMean'] = model_mean_pred
virat_df.plot()
z.showplot(plt)
plt.gcf().clear()
Advanced Time Series Models:
Most time series models work on the assumption that the series is stationary. Intuitively, if a time series has exhibited a particular behavior over time, there is a high probability it will follow the same behavior in the future. Also, the theory for stationary series is more mature and easier to implement than for non-stationary series. The underlying principle behind these advanced models is to estimate the trend and seasonality in the series and remove them to obtain a stationary series. Statistical forecasting techniques can then be applied to the stationary series, and the forecasted values converted back to the original scale by reapplying the trend and seasonality components.
Before we move to the next step, we need to fill in the missing values. By missing values, I mean the calendar days on which Virat Kohli did not play ODI cricket. The question we are trying to answer is: how many runs would Virat have scored on each calendar day since his career began, had he played international ODI cricket every single day? There are several effective methods to fill missing values in a time series, such as mean/median imputation, rolling mean imputation, and imputation using different interpolation methods. We will go with time-based interpolation (reference: https://medium.com/@drnesr/filling-gaps-of-a-time-series-using-python-d4bfddd8c460). Below is the code that completes this step.
# Reindex onto a continuous daily calendar, then fill the gaps
# using time-based interpolation
idx = pd.date_range(start=virat_df.index.min(),
                    end=virat_df.index.max(), freq='D')
virat_df = pd.DataFrame(data=virat_df, index=idx, columns=['runs'])
virat_df = virat_df.assign(
    RunsInterpolated=virat_df['runs'].interpolate(method='time'))
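As a quick toy illustration (separate from Virat's data) of why method='time' matters: on an irregularly spaced index, time-based interpolation weights the fill by elapsed days, while plain linear interpolation only looks at row position.

```python
import numpy as np
import pandas as pd

# One day elapses between the first two points, three days between
# the last two; the middle value is missing
s = pd.Series(
    [10.0, np.nan, 50.0],
    index=pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-05']))

by_time = s.interpolate(method='time')        # Jan 2: 1 of 4 days elapsed
by_position = s.interpolate(method='linear')  # Jan 2: midpoint by row
```

Time-based interpolation fills Jan 2 with 20.0 (one quarter of the way from 10 to 50), whereas position-based linear interpolation fills it with 30.0.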
Now that we have a clean, continuous time series, let's look at a decomposition plot, which will help us understand the trend, seasonality, and residuals in the series. The code below completes this step and provides a chart that aids in analyzing the error, trend, and seasonality components of the series. To help avoid noise, we will only look at the last few years of data in the series.
# Decompose the recent portion of the series into trend,
# seasonal and residual components
decomposition = seasonal_decompose(
    virat_df.loc['2017-01-01':virat_df.index.max()].RunsInterpolated)
decomposition.plot()
z.showplot(plt)
From the decomposition plot above, seasonality appears consistent over time, there is no apparent trend, and the error/residual terms appear to vary linearly over time. So when choosing among the various advanced ETS time series models, these observations point to ETS(A,N,A): additive error terms, no trend, and additive seasonality. This class of ETS models falls under the umbrella of Holt-Winters exponential smoothing methods. For a primer on various time series algorithms, including the more advanced ARIMA and Seasonal ARIMA, please refer to https://www.analyticsvidhya.com/blog/2018/02/time-series-forecasting-methods/
Note that to use Holt-Winters exponential smoothing methods in Python, a specific version of statsmodels, 0.9.0rc1 (pip install statsmodels==0.9.0rc1), should be used. The code block below splits the series into train and test sets and validates the model. The resulting graph visually shows how close the model's predictions are to the actual test values.
import matplotlib
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rcParams['figure.figsize'] = 8, 6

series = virat_df.RunsInterpolated
series[series == 0] = 1

# 80/20 chronological train/test split
train_size = int(len(series) * 0.8)
train = series[0:train_size]
test = series[train_size:]

# Holt-Winters exponential smoothing: no trend, additive seasonality,
# with a seasonal period of 730 days (roughly two years) on the daily series
fit = ExponentialSmoothing(train, seasonal_periods=730,
                           trend=None, seasonal='add').fit()
y_hat = test.copy().to_frame(name='Test')
y_hat['Holt_Winter_Add'] = fit.forecast(len(test))

plt.plot(test, label='Test')
plt.plot(y_hat['Holt_Winter_Add'], label='Holt_Winter_Add')
plt.legend(loc='best')
z.showplot(plt)
plt.gcf().clear()
The block of code below computes the RMSE (Root Mean Squared Error) for the fitted model. RMSE indicates how well the model performs on the test set.
from sklearn.metrics import mean_squared_error
from math import sqrt
rms_add = sqrt(mean_squared_error(test, y_hat.Holt_Winter_Add))
print('RMSE for Holts Winter Additive Model:%f'%rms_add)
#####Output#####
# RMSE for Holts Winter Additive Model:40.490023
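For intuition on what that number means, RMSE is the square root of the average squared error, so large misses are penalized disproportionately. A tiny hand-computable example with made-up values (equivalent to sklearn's mean_squared_error followed by a square root):

```python
import numpy as np

# Made-up actual vs predicted runs for four matches
actual = np.array([40.0, 55.0, 20.0, 70.0])
predicted = np.array([45.0, 50.0, 30.0, 60.0])

# Errors are -5, 5, -10, 10 -> squared: 25, 25, 100, 100 -> mean 62.5
rmse = np.sqrt(np.mean((actual - predicted) ** 2))  # ~7.9
```

An RMSE of about 40 for the fitted model means the predictions are, on this quadratic scale, typically about 40 runs away from the actual scores in the test period.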
Now let's finally forecast 180 days beyond the last ODI game that Virat played and print the predicted runs that Virat Kohli will score in each group stage match.
# Forecast past the test window, 180 days beyond the last game, so the
# forecast's date index covers the World Cup group stage fixtures
forecast = fit.forecast(len(test) + 180)

print('########Virat Kohli world cup match by match runs predictions#######')
print('South Africa vs India on June 5th: %d' % int(forecast.loc['2019-06-05']))
print('India vs Australia on June 9th: %d' % int(forecast.loc['2019-06-09']))
print('India vs NewZealand on June 13th: %d' % int(forecast.loc['2019-06-13']))
print('India vs Pakistan on June 16th: %d' % int(forecast.loc['2019-06-16']))
print('India vs Afghanistan on June 22: %d' % int(forecast.loc['2019-06-22']))
print('West Indies vs India on June 27: %d' % int(forecast.loc['2019-06-27']))
print('England vs India on June 30th: %d' % int(forecast.loc['2019-06-30']))
print('Bangladesh vs India on July 2nd: %d' % int(forecast.loc['2019-07-02']))
print('Srilanka vs India on July 6th: %d' % int(forecast.loc['2019-07-06']))
print('###############################################################')
########Virat Kohli world cup match by match runs predictions#######
#South Africa vs India on June 5th: 47
#India vs Australia on June 9th: 42
#India vs NewZealand on June 13th: 46
#India vs Pakistan on June 16th: 46
#India vs Afghanistan on June 22: 45
#West Indies vs India on June 27: 45
#England vs India on June 30th: 45
#Bangladesh vs India on July 2nd: 44
#Srilanka vs India on July 6th: 44
##################################################################
Conclusion:
This series demonstrated how to acquire data from a seemingly semi-structured source, transform it, analyze player performance, and finally apply a time series predictive analytics technique to predict the near future using Python and Pyspark. For all the cricket fans who have read this article thus far, enjoy the World Cup, and may the best team win.
**Note: The above content was curated using Qubole's Big Data Activation Platform, which offers a choice of cloud, big data engines, tools, and technologies to activate Big Data in the cloud. You may test drive Qubole free for 14 days at https://www.qubole.com/lp/testdrive/
Appendix:
To briefly introduce the more advanced ARIMA (Autoregressive Integrated Moving Average) models which were not covered in this article: ARIMA models can reduce a non-stationary series to a stationary one through a sequence of repeated differencing steps. The notation is often ARIMA(p,d,q), where p is the number of autoregressive terms, d is the number of nonseasonal differences needed for stationarity, and q is the number of lagged forecast errors. Seasonal effects can be tackled with the Seasonal ARIMA (SARIMA) model, which introduces additional terms of the form ARIMA(p,d,q)(P,D,Q), where P, D, and Q represent the number of seasonal autoregressive terms, seasonal differences, and seasonal lagged forecast errors respectively. The ARIMA terms are interpreted from the autocorrelation and partial autocorrelation plots of the series. Beyond seasonal effects, conditional heteroscedastic effects (e.g., volatility clustering in equity indexes) can be tackled with the more advanced ARCH/GARCH models.