Data Wrangling and Predictive Analytics for ICC Cricket World Cup 2019 using Pyspark & Python ( Part 2 of 2 )
In the first part of this blog series, we acquired the historical batting/bowling records of players participating in the Cricket World Cup 2019 by scraping and crawling the data-rich espncricinfo portal. The acquired data was then used to analyze player performance through an interactive web form with charts and graphs developed in a Qubole Notebook. In this final part of the series, we will dive into predicting/forecasting the future performance of a player. As a disclaimer, I would like to reference a famous adage in Data Science: "All models are wrong, but some are useful." So I will start out by saying that the predictive time series models I demonstrate here may be completely wrong, and if some or all of the predictions come true, I will consider that a mere coincidence.
Several variables impact player performance, so in order to avoid boiling the ocean, we will explore a simple forecasting technique called Time Series Analysis. One of its early applications was stock ticker price forecasting: given historical observations of a measure (in this example, the ticker price over time), time series analysis fits the price over time while accounting for trends, cycles, and seasonality. This then helps extrapolate and forecast the stock price into the near future.
Time series analysis generally does not require distributed-scale processing, as the series data is summarized and not of big data class. While there are distributed time series libraries like spark-ts, they lack the breadth of what the Python statsmodels package offers, so I decided to stick with statsmodels to complete this solution.
To start from where we left off in part 1, I will attempt to do time series analysis for my favorite Cricketer Virat Kohli to predict the runs he will score in the group stage of the tournament (round-robin stage). To cut to the chase, the time series model I created produced the following predictions of runs scored by Virat Kohli in the group stages of the ICC Cricket world cup.
########Virat Kohli world cup match by match runs predictions#######
#South Africa vs India on June 5th: 47
#India vs Australia on June 9th: 42
#India vs NewZealand on June 13th: 46
#India vs Pakistan on June 16th: 46
#India vs Afghanistan on June 22: 45
#West Indies vs India on June 27: 45
#England vs India on June 30th: 45
#Bangladesh vs India on July 2nd: 44
#Srilanka vs India on July 6th: 44
##################################################################
Introduction to Basic Time Series Models:
Mean Constant Model: This very simple forecasting model assumes the mean is constant over time, a property known as the time series being stationary. For such a simple stationary time series, the best prediction is just the mean of the historical data.
Linear Trend Model: Most naturally occurring time series in business and economics are not at all stationary; instead they exhibit various kinds of trends, cycles, and seasonal patterns. The idea of the linear trend model is to fit a sloping line (y = mx + c) and use it to forecast the near future.
Random Walk Model: When faced with a time series that shows irregular growth, the best strategy may not be to directly predict the level of the series at each period. Instead, it may be better to predict the change that occurs from one period to the next. Calculating this change in order to predict it is referred to as differencing, and this is the guiding principle behind the random walk model.
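To make these three baseline models concrete, here is a minimal sketch on a small hypothetical series of runs (the numbers are illustrative, not Virat's actual scores):

```python
import numpy as np
import pandas as pd

# Hypothetical runs scored over ten matches (illustrative values)
runs = pd.Series([35, 50, 12, 80, 44, 67, 23, 90, 55, 71])

# Mean constant model: the forecast is simply the historical mean
mean_forecast = runs.mean()

# Linear trend model: fit y = m*x + c by least squares, then
# extrapolate one step past the end of the series
x = np.arange(len(runs))
m, c = np.polyfit(x, runs, 1)
trend_forecast = m * len(runs) + c

# Random walk model (with drift): work on the period-to-period
# changes; the forecast is the last value plus the average change
diffs = runs.diff().dropna()
rw_forecast = runs.iloc[-1] + diffs.mean()
```

Each model produces a one-step-ahead forecast from the same history; which one is best depends on whether the series is flat, trending, or irregular.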
The code below filters out the selected player's data, which in this case happens to be Virat Kohli, and produces a line plot of runs scored along with the mean, to give some basic understanding of the trends and patterns in Virat's historical batting performance.
from pylab import rcParams
import pandas as pd

rcParams['figure.figsize'] = 8, 6

# Filter the cleaned batting records down to the selected player
# and bring them into pandas for time series analysis
virat_batsman = clean_batting_spark_df.filter(
    clean_batting_spark_df.PlayerID == selectedPlayerProfile
).toPandas()

runs = virat_batsman['Runs'].astype('float')
date = pd.to_datetime(virat_batsman['Start Date'])
virat_df = pd.DataFrame({'date': date, 'runs': runs})
virat_df = virat_df.set_index('date')

# Overlay the historical mean (the mean constant model's prediction)
model_mean_pred = virat_df.runs.mean()
virat_df['runsMean'] = model_mean_pred
virat_df.plot()
z.showplot(plt)
plt.gcf().clear()
Advanced Time Series Models:
Most time series models work on the assumption that the series is stationary. Intuitively, if a time series has exhibited a particular behavior over time, there is a high probability it will follow the same behavior in the future. Also, the theory for stationary series is more mature and easier to implement than for non-stationary series. The underlying principle behind these advanced models is to estimate the trend and seasonality in the series and remove them to obtain a stationary series. Statistical forecasting techniques can then be applied to the stationary series, and the forecasted values converted back to the original scale by reapplying the trend and seasonality components.
Before we move to the next step, we need to fill in the missing values. By missing values, I mean the calendar days on which Virat Kohli did not play ODI cricket. The question we are trying to answer is: how many runs would Virat have scored on each calendar day since his career began, had he played international ODI cricket every single day? There are several effective methods to fill missing values in a time series, such as mean/median imputation, rolling mean imputation, and imputation using different interpolation methods. We will go with time-based interpolation (reference: https://medium.com/@drnesr/filling-gaps-of-a-time-series-using-python-d4bfddd8c460). Below is the code that completes this step.
# Reindex onto a continuous daily calendar, then fill the gaps
# using time-based interpolation
idx = pd.date_range(start=virat_df.index.min(),
                    end=virat_df.index.max(), freq='D')
virat_df = pd.DataFrame(data=virat_df, index=idx, columns=['runs'])
virat_df = virat_df.assign(
    RunsInterpolated=virat_df['runs'].interpolate(method='time'))
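As a quick toy illustration (separate from Virat's data) of why method='time' matters: on an irregularly spaced index, time-based interpolation weights the fill by elapsed days, while plain linear interpolation only looks at row position.

```python
import numpy as np
import pandas as pd

# One day elapses between the first two points, three days between
# the last two; the middle value is missing
s = pd.Series(
    [10.0, np.nan, 50.0],
    index=pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-05']))

by_time = s.interpolate(method='time')        # Jan 2: 1 of 4 days elapsed
by_position = s.interpolate(method='linear')  # Jan 2: midpoint by row
```

Time-based interpolation fills Jan 2 with 20.0 (one quarter of the way from 10 to 50), whereas position-based linear interpolation fills it with 30.0.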
Now that we have a clean, continuous time series, let's look at a decomposition plot, which will help us understand the trend, seasonality, and residuals in the series. The code below completes this step and provides a chart that aids in analyzing the error, trend, and seasonality components of the series. To help avoid noise, we will only look at the last few years of data in the series.
# Decompose the recent portion of the series into trend,
# seasonal and residual components
decomposition = seasonal_decompose(
    virat_df.loc['2017-01-01':virat_df.index.max()].RunsInterpolated)
decomposition.plot()
z.showplot(plt)
From the decomposition plot above, seasonality appears consistent over time, there is no apparent trend, and the error/residual terms appear to vary linearly over time. So when choosing among the various advanced ETS time series models, these observations point to ETS(A,N,A): additive error terms, no trend, and additive seasonality. This class of ETS models falls under the umbrella of Holt-Winters exponential smoothing methods. For a primer on various time series algorithms, including the more advanced ARIMA and Seasonal ARIMA, please refer to https://www.analyticsvidhya.com/blog/2018/02/time-series-forecasting-methods/
Note that to use Holt-Winters exponential smoothing methods in Python, a specific version of statsmodels, 0.9.0rc1 (pip install statsmodels==0.9.0rc1), should be used. The code block below splits the series into train and test sets and validates the model. The resulting graph visually shows how close the model's predictions are to the actual test values.
import matplotlib
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rcParams['figure.figsize'] = 8, 6

series = virat_df.RunsInterpolated
series[series == 0] = 1

# 80/20 chronological train/test split
train_size = int(len(series) * 0.8)
train = series[0:train_size]
test = series[train_size:]

# Holt-Winters exponential smoothing: no trend, additive seasonality,
# with a seasonal period of 730 days (roughly two years) on the daily series
fit = ExponentialSmoothing(train, seasonal_periods=730,
                           trend=None, seasonal='add').fit()
y_hat = test.copy().to_frame(name='Test')
y_hat['Holt_Winter_Add'] = fit.forecast(len(test))

plt.plot(test, label='Test')
plt.plot(y_hat['Holt_Winter_Add'], label='Holt_Winter_Add')
plt.legend(loc='best')
z.showplot(plt)
plt.gcf().clear()
The block of code below computes the RMSE (Root Mean Squared Error) for the fitted model. RMSE indicates how well the model performs on the test set.
from sklearn.metrics import mean_squared_error
from math import sqrt
rms_add = sqrt(mean_squared_error(test, y_hat.Holt_Winter_Add))
print('RMSE for Holts Winter Additive Model:%f'%rms_add)
#####Output#####
# RMSE for Holts Winter Additive Model:40.490023
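For intuition on what that number means, RMSE is the square root of the average squared error, so large misses are penalized disproportionately. A tiny hand-computable example with made-up values (equivalent to sklearn's mean_squared_error followed by a square root):

```python
import numpy as np

# Made-up actual vs predicted runs for four matches
actual = np.array([40.0, 55.0, 20.0, 70.0])
predicted = np.array([45.0, 50.0, 30.0, 60.0])

# Errors are -5, 5, -10, 10 -> squared: 25, 25, 100, 100 -> mean 62.5
rmse = np.sqrt(np.mean((actual - predicted) ** 2))  # ~7.9
```

An RMSE of about 40 for the fitted model means the predictions are, on this quadratic scale, typically about 40 runs away from the actual scores in the test period.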
Now let's finally forecast 180 days beyond the last ODI game that Virat played and print the predicted runs that Virat Kohli will score in each group stage match.
# Forecast past the test window, 180 days beyond the last game, so the
# forecast's date index covers the World Cup group stage fixtures
forecast = fit.forecast(len(test) + 180)

print('########Virat Kohli world cup match by match runs predictions#######')
print('South Africa vs India on June 5th: %d' % int(forecast.loc['2019-06-05']))
print('India vs Australia on June 9th: %d' % int(forecast.loc['2019-06-09']))
print('India vs NewZealand on June 13th: %d' % int(forecast.loc['2019-06-13']))
print('India vs Pakistan on June 16th: %d' % int(forecast.loc['2019-06-16']))
print('India vs Afghanistan on June 22: %d' % int(forecast.loc['2019-06-22']))
print('West Indies vs India on June 27: %d' % int(forecast.loc['2019-06-27']))
print('England vs India on June 30th: %d' % int(forecast.loc['2019-06-30']))
print('Bangladesh vs India on July 2nd: %d' % int(forecast.loc['2019-07-02']))
print('Srilanka vs India on July 6th: %d' % int(forecast.loc['2019-07-06']))
print('###############################################################')
########Virat Kohli world cup match by match runs predictions#######
#South Africa vs India on June 5th: 47
#India vs Australia on June 9th: 42
#India vs NewZealand on June 13th: 46
#India vs Pakistan on June 16th: 46
#India vs Afghanistan on June 22: 45
#West Indies vs India on June 27: 45
#England vs India on June 30th: 45
#Bangladesh vs India on July 2nd: 44
#Srilanka vs India on July 6th: 44
##################################################################
Conclusion:
This series demonstrated how to acquire data from a seemingly semi-structured source, transform it, analyze player performance, and finally apply a time series predictive analytics technique to predict the near future using Python and Pyspark. For all the cricket fans who have read this article thus far, enjoy the World Cup, and may the best team win.
**Note: The above content was curated using Qubole's Big Data Activation Platform, which offers a choice of cloud, big data engines, tools, and technologies to activate Big Data in the cloud. You may test drive Qubole free for 14 days at https://www.qubole.com/lp/testdrive/
Appendix:
To briefly introduce the more advanced ARIMA (Autoregressive Integrated Moving Average) models which were not covered in this article: ARIMA models can reduce a non-stationary series to a stationary one through a sequence of repeated differencing steps. The notation is often ARIMA(p,d,q), where p is the number of autoregressive terms, d is the number of nonseasonal differences needed for stationarity, and q is the number of lagged forecast errors. Seasonal effects can be tackled with the Seasonal ARIMA (SARIMA) model, which introduces additional terms of the form ARIMA(p,d,q)(P,D,Q), where P, D, and Q represent the number of seasonal autoregressive terms, seasonal differences, and seasonal lagged forecast errors respectively. The ARIMA terms are interpreted from the autocorrelation and partial autocorrelation plots of the series. Beyond seasonal effects, conditional heteroscedastic effects (e.g., volatility clustering in equity indexes) can be tackled with the more advanced ARCH/GARCH models.