Data Science Series: 1. Cricket Data Analysis
Background
I have a deep interest in the field of data science so I decided to do some personal projects to improve my skills and to gain experience in this field. From now on, I will be posting articles related to projects I am working on as well as things I am learning in this field. To start this series off, I will write about a data science project I did where I analyzed cricket data.
Introduction
What is cricket?
Cricket is a bat-and-ball sport in which 2 teams of 11 players each take turns batting and bowling. The overall objective is for a team to score more runs than its opposition. Cricket is played in formats of varying length, from matches lasting up to 5 days to shorter formats of 20 overs per side. An over contains 6 legal balls, so a 20-over innings has a maximum of 120 legal balls.
Which data am I analyzing and what is the purpose?
The data I chose to analyze in this project can be viewed here. The dataset is named IPL Player Stats and covers the years 2016–2022. I chose this dataset because it contains data from one of the most popular cricket leagues in the world, which makes it quite intriguing to analyze.
What is the IPL?
IPL stands for the Indian Premier League, a professional men’s Twenty20 cricket league hosted in India. As of the 2022 season, the competition consists of 10 teams, and the ultimate goal is to win the championship at the end of each season. The league is franchise-based and hosts players from all around the world, with the main pool of players coming from India.
The data consists of batting statistics for each season from 2016 to 2022 along with the bowling statistics for the same seasons. In my analysis, I analyzed both the batting and bowling statistics but for this article, I will only focus on the batting data. I will write a separate article for the bowling data soon. To view the full notebook, please click here.
Purpose of my analysis
The ultimate purpose of my analysis is to identify some of the best batsmen in this competition. There is a huge amount of contention about how to decide the best batsmen and bowlers. However, I decided to use a non-biased statistical approach in this notebook.
What are the key statistics being analyzed?
For a bit of context, T20 cricket is one of the shortest formats of cricket played around the world. Historically, Test cricket was considered the original format, and identifying the best batsman and bowler was fairly straightforward.
You simply looked at who scored the most runs to determine the best batsman and who took the most wickets to determine the best bowler. These are still the most important metrics; however, relying on them alone fails to acknowledge a core difference between the 2 formats being discussed.
Test cricket is played over the course of 5 days, with 90 overs scheduled per day. This means that batsmen are not pressured to score runs at a quick pace, so strike rate and boundaries scored are largely a non-factor in measuring a great batsman. Long story short, more overs means less urgency to score runs quickly.
Now compare that to modern T20 cricket, which has a limit of 20 overs per side. Here we cannot look at runs alone to determine a great T20 batsman. If a player has a lot of runs but a poor strike rate, they may do their team more harm than good. For example, a par score in modern T20 cricket is 150 runs. If a batsman scores 80 runs off 90 balls in a T20 game, the team has only 30 balls to score the remaining 70 runs to get to par. This hurts the team much more than a batsman who scores 60 runs off 40 balls, since that leaves the team 80 balls to score the remaining 90 runs.
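To make the arithmetic in this example concrete, here is a minimal sketch in plain Python. The function names are my own and purely illustrative; the par score of 150 and the 120-ball innings come from the paragraph above.

```python
BALLS_PER_INNINGS = 120   # 20 overs x 6 legal balls
PAR_SCORE = 150           # par score used in the example above


def strike_rate(runs, balls):
    """Runs scored per 100 balls faced."""
    return 100 * runs / balls


def remaining_task(runs, balls, par=PAR_SCORE, total_balls=BALLS_PER_INNINGS):
    """Return (runs still needed, balls left, runs needed per ball) for the rest of the team."""
    runs_needed = par - runs
    balls_left = total_balls - balls
    return runs_needed, balls_left, runs_needed / balls_left


slow = remaining_task(80, 90)   # 80 runs off 90 balls
fast = remaining_task(60, 40)   # 60 runs off 40 balls

print(f"80 off 90: strike rate {strike_rate(80, 90):.2f}, team needs {slow[0]} off {slow[1]} balls")
print(f"60 off 40: strike rate {strike_rate(60, 40):.2f}, team needs {fast[0]} off {fast[1]} balls")
```

The slower innings leaves the rest of the team needing more than 2 runs per ball, while the faster one leaves them needing just over 1 run per ball.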
So to identify the statistically best T20 batsmen, I will factor in strike rate, boundaries scored (since these directly impact strike rate and a batsman’s efficiency), landmarks scored (a lesser factor, but one that significantly helps a team when scored at a good strike rate), average, and finally total runs scored.
1. Load and View initial batting data
The first step is to load in the batting data and then view it. In the screenshot below, I created a utility function to load a file from a custom file path.
After writing this function, I called it once per file to load each season individually. I then inspected each dataframe using the head and info functions. The info function gives a quick overview of the dataset, while head prints the first 5 rows.
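Since the screenshot is not reproduced here, the following is a minimal sketch of what such a loader might look like, not the notebook's actual code. The function name, the column names (POS, Player, Inns, Runs, HS, Avg, BF, SR) and the two sample rows are assumptions made for illustration. An in-memory CSV stands in for a real file so the example runs on its own.

```python
import io

import pandas as pd


def load_batting_data(source):
    """Read one season of batting statistics into a DataFrame.

    `source` may be a file path or any file-like object that pandas.read_csv accepts.
    """
    return pd.read_csv(source)


# Made-up two-row sample standing in for one season's CSV file (not real IPL data)
sample_csv = io.StringIO(
    "POS,Player,Inns,Runs,HS,Avg,BF,SR\n"
    "1,Player A,14,450,100*,45.00,300,150.00\n"
    "2,Player B,10,90,35,-,80,112.50\n"
)

batting_2016 = load_batting_data(sample_csv)
print(batting_2016.head())   # first 5 rows (only 2 exist in this sample)
batting_2016.info()          # column names, non-null counts, and dtypes
```

With a real file, the call would simply take the path to that season's CSV instead of the in-memory sample.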
The info results showed three key issues: the High Score column was of string type, there was a redundant POS column holding the initial index assigned to each player in each file, and the Average column was of string type with extra special characters present. These issues need to be addressed before I can perform any analysis.
2. Fix issues in the datasets
I converted the Average column from a string to a float64 type. I also removed all occurrences of * from the High Score column, since the asterisk indicates the batsman was not out in his highest-scoring innings. Finally, I dropped the unneeded POS column. I wrapped all these fixes into a function and called it for each dataset separately.
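A cleaning function along these lines could look like the sketch below. It is an illustration under assumptions rather than the notebook's code: the column names HS, Avg and POS are assumed, as is the idea that the special character in the average column is a "-" placeholder. Converting the high score to a number after stripping the asterisk is also my own addition.

```python
import pandas as pd


def clean_batting_data(df):
    """Return a cleaned copy of one season's batting DataFrame."""
    df = df.copy()
    # Strip the not-out marker, then convert the high score to a number
    hs_text = df["HS"].astype(str).str.replace("*", "", regex=False)
    df["HS"] = pd.to_numeric(hs_text, errors="coerce")
    # Non-numeric placeholders (assumed here to be "-") become NaN
    df["Avg"] = pd.to_numeric(df["Avg"], errors="coerce")
    # Drop the redundant position column
    return df.drop(columns=["POS"])


# Made-up sample rows (not real IPL data)
raw = pd.DataFrame({
    "POS": [1, 2],
    "Player": ["Player A", "Player B"],
    "HS": ["100*", "35"],
    "Avg": ["45.00", "-"],
})

cleaned = clean_batting_data(raw)
print(cleaned.dtypes)
```

Working on a copy keeps the raw dataframe intact, which makes it easy to rerun the cleaning step while experimenting.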
3. Combine dataframes for all years into a single dataframe
For this step, I wanted to combine all the IPL seasons together and analyze the performances of players across the entirety of the 2016–2022 IPL seasons. I used the function below to perform the initial concatenation along with a season count for each player.
The first line of the function produces a dataframe with 2 columns, ‘index’ and ‘Player’, where the index column contains the player names and the Player column contains a count of the number of seasons each player has played. It does this by concatenating all the dataframes together and then counting the occurrences of each player’s name. (These default column names are what pandas produced before version 2.0.)
After that, I rename the ‘index’ column to ‘Player’, since that is what it really holds, and the ‘Player’ column to ‘Seasons’. The next line performs the real concatenation and sums up each player’s statistics across all the years they have played. Finally, I merge the first dataframe into the second one to add the Seasons column for each player.
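The steps above can be sketched as follows. This is not the notebook's function: the name combine_seasons and the sample data are made up, and it assumes each player has one row per season. Because the default column names produced by value_counts followed by reset_index differ between pandas versions, the sketch names the columns explicitly instead of renaming them afterwards.

```python
import pandas as pd


def combine_seasons(season_frames):
    """Sum each player's statistics across seasons and add a Seasons count."""
    all_rows = pd.concat(season_frames, ignore_index=True)
    # Number of season files in which each player appears
    seasons = (
        all_rows["Player"]
        .value_counts()
        .rename_axis("Player")
        .reset_index(name="Seasons")
    )
    # Sum every numeric statistic per player across all seasons
    totals = all_rows.groupby("Player", as_index=False).sum(numeric_only=True)
    return totals.merge(seasons, on="Player")


# Made-up sample seasons (not real IPL data)
season_2016 = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Runs": [450, 90],
    "Avg": [45.0, 15.0],
    "SR": [150.0, 112.5],
})
season_2017 = pd.DataFrame({
    "Player": ["Player A"],
    "Runs": [150],
    "Avg": [25.0],
    "SR": [100.0],
})

combined = combine_seasons([season_2016, season_2017])
print(combined)
```

Note that the Avg and SR columns in the result are plain sums at this point, which is why they need to be corrected in the next step.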
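The next step is to correct the Average and Strike Rate columns, since summing them across all the years leaves them inflated. To do this I divide the Average column by the number of Seasons, and likewise for the Strike Rate column. Note that this gives the mean of a player’s per-season values, which is an approximation of the career figure rather than the career figure itself.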
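Here is a small sketch of that correction, using made-up numbers and assumed column names (Avg, SR, BF, Runs, Seasons). The last line shows an alternative that is not part of the article: recomputing the strike rate from the summed runs and balls faced, which weights each season by how many balls were actually faced.

```python
import pandas as pd

# Made-up totals for two players after summing across seasons (not real IPL data).
# Player A: 450 off 300 balls (Avg 45.0, SR 150.0), then 150 off 150 balls (Avg 25.0, SR 100.0).
combined = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Runs": [600, 90],
    "BF": [450, 80],
    "Avg": [70.0, 15.0],      # summed season averages
    "SR": [250.0, 112.5],     # summed season strike rates
    "Seasons": [2, 1],
})

# The article's approach: divide the summed values by the number of seasons
combined["Avg"] = (combined["Avg"] / combined["Seasons"]).round(2)
combined["SR"] = (combined["SR"] / combined["Seasons"]).round(2)

# Alternative (not from the article): career strike rate from the summed totals
career_sr = (100 * combined["Runs"] / combined["BF"]).round(2)

print(combined[["Player", "Avg", "SR"]])
print(career_sr)
```

For Player A the two approaches disagree (125.0 versus roughly 133.3) because the higher-scoring season also involved more balls faced.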
4. Extracting more detailed information about each dataset
To get a better feel for the concatenated dataset, I called the describe function and the info function again. The describe function returns a summary of descriptive statistics for a dataset, including mean, count, standard deviation, etc. To read more see this link.
In addition to this, I wanted to identify which columns are strongly correlated with the most important statistic for a batsman, which is Runs Scored. To do this I used the correlation function to compute the pairwise correlation of each column with the Runs Scored column. The results can be seen below.
The values closer to the top are highly correlated with the Runs Scored column. This means that when Runs Scored increases, these column values tend to go up as well. Logically, this correlation makes sense. If you score more runs, then chances are you have faced more balls (BF), hit more 4s, scored more 50s, and played more innings.
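As a rough illustration of this step, the sketch below computes each numeric column's correlation with runs and sorts the result. The column names and values are invented for the example and are not taken from the dataset.

```python
import pandas as pd

# Made-up combined batting data (not real IPL data)
batting = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "Runs": [2500, 1800, 900, 400, 150],
    "BF": [1800, 1300, 700, 350, 140],
    "4s": [250, 160, 80, 30, 12],
    "SR": [138.9, 138.5, 128.6, 114.3, 107.1],
})

# Keep only numeric columns, compute pairwise correlations, then read off the Runs column
runs_corr = (
    batting.select_dtypes("number")
    .corr()["Runs"]
    .sort_values(ascending=False)
)
print(runs_corr)
```

Selecting the numeric columns first avoids errors from text columns such as the player name.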
5. Generating charts for further analysis of the batting data
The first chart I generated was for the top 25 batsmen based on Runs Scored.
The next chart is the 25 batsmen with the highest batting averages.
After this, I generated a chart for the 25 batsmen with the highest strike rate. I also introduced a minimum of 200 balls faced to filter out players whose strike rate is inflated by a very small sample size.
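The filtering behind the strike rate chart might look like the sketch below. The function name, column names (BF, SR) and sample values are assumptions, and I have treated the 200-ball quota as "200 or more", which the article does not specify.

```python
import pandas as pd


def top_by_strike_rate(df, n=25, min_balls=200):
    """Return the n highest strike rates among players who faced at least min_balls balls."""
    qualified = df[df["BF"] >= min_balls]
    return qualified.nlargest(n, "SR")


# Made-up sample (not real IPL data); Player C has a huge strike rate off only 10 balls
batting = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "BF": [1200, 450, 10],
    "SR": [138.5, 152.0, 300.0],
})

top = top_by_strike_rate(batting)
print(top)
```

The resulting dataframe can then be passed to whichever plotting library is used for the bar chart.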
From a quick scan of the charts above, an acceptable batting average for T20 cricket is between 30 and 40. So I would first like to view the number of batsmen that have an average over 35. In addition to this, strike rate matters more in T20 than in any other format, so strike rate will also be factored in. An ideal strike rate for a T20 batsman is 130 or higher, so that will be our filter. So my criterion for a “Great” batsman is someone who averages over 35 and has a strike rate of 130 or above.
Finally, I added an “Exceptional” category, which I define as a player with an average surpassing 40 and a strike rate exceeding 140.
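These two categories translate directly into boolean filters. The sketch below is illustrative only, with assumed column names (Avg, SR) and made-up players chosen to sit on either side of each threshold.

```python
import pandas as pd

# Made-up sample (not real IPL data)
batting = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "Avg": [42.0, 36.0, 36.0, 41.0],
    "SR": [145.0, 130.0, 129.0, 140.0],
})

# "Great": average over 35 and strike rate of 130 or above
great = (batting["Avg"] > 35) & (batting["SR"] >= 130)
# "Exceptional": average over 40 and strike rate over 140
exceptional = (batting["Avg"] > 40) & (batting["SR"] > 140)

batting["Tier"] = "Other"
batting.loc[great, "Tier"] = "Great"
batting.loc[exceptional, "Tier"] = "Exceptional"   # applied last so it overrides "Great"

print(batting)
```

Player D shows why the order matters: a strike rate of exactly 140 does not exceed 140, so that player stays in the "Great" tier.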
The last set of charts for this section was a pair of pie charts showing how much of the total 50s and 100s a single player has. I chose pie charts to give an idea of how rare it truly is to score one of these landmarks, and to showcase how special some of the players are to achieve them so frequently.
I did the same for 100s scored below.
6. Trying to understand what variables influence Runs Scored
Before establishing criteria to determine the best batsmen from a statistical point of view, I wanted to understand which factors can impact the total runs scored. The purpose of this section is to identify what specifically a batter can improve on to influence their runs.
Impact of Balls Faced on Runs Scored
As we saw in the correlation matrix above, balls faced is strongly correlated with runs scored. The question I want to answer is whether facing more balls than everyone else results in more runs scored. I will answer this question by plotting the top 25 players by balls faced along with their corresponding runs. The expected outcome is that the players with the most balls faced will have the most runs.
This plot tells us that the players who face more balls do indeed tend to score more runs. There are a few minor exceptions where a player scored slightly fewer runs than others who faced fewer balls, but overall it is safe to say that facing more balls contributes to more runs scored.
Impact of Strike Rate on Runs Scored
The next question is whether strike rate has an impact on the number of runs scored. For this we will plot the 25 batsmen with the highest strike rates along with the runs scored by each.
This graph does not steadily decline, which indicates that there isn’t a strong correlation between having a higher strike rate and scoring more runs. That being said, strike rate still matters to the runs you score, since a higher strike rate means more efficient scoring.
Impact of boundaries on runs scored
Another question I want to pose is whether scoring more boundaries has an impact on runs scored. To answer this I will count the total number of boundaries hit by each player and compare it to their runs scored. I will only plot the 25 players with the most boundaries hit. My expectation is that the players with the most runs will also have hit the most boundaries; in other words, boundaries have a direct impact on runs scored.
The bar chart above shows that more boundaries are hit by players who score the most runs. So scoring more boundaries does have a direct impact on runs scored.
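The boundary count used in this subsection can be sketched as follows. I am assuming the dataset has separate columns for fours and sixes (named 4s and 6s here) and that a player's boundary total is their sum. The sample values are invented.

```python
import pandas as pd

# Made-up sample (not real IPL data)
batting = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Runs": [2500, 1800, 900],
    "4s": [250, 160, 80],
    "6s": [100, 90, 20],
})

# Total boundaries hit by each player
batting["Boundaries"] = batting["4s"] + batting["6s"]

# The 25 players with the most boundaries, alongside their runs
top_boundaries = batting.nlargest(25, "Boundaries")[["Player", "Boundaries", "Runs"]]

# A single number summarising how closely the two columns move together
boundary_runs_corr = top_boundaries["Boundaries"].corr(top_boundaries["Runs"])

print(top_boundaries)
print(round(boundary_runs_corr, 3))
```

The correlation value is a convenient complement to the bar chart, since it does not depend on reading bar heights by eye.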
Impact of landmarks on Runs Scored
Along the same lines as boundaries, another interesting question is whether scoring more landmarks, such as half centuries and centuries, affects runs scored. My expectation is that the more landmarks you have, the higher your runs scored. I am expecting a bar chart similar to the ones above for boundaries hit vs runs scored and balls faced vs runs scored.
This one is interesting since it doesn’t show the same pattern as the boundaries hit and balls faced charts. As you can see, there is quite some variation in the bar heights and they do not drop uniformly, indicating that landmarks achieved do not have a very strong impact on runs scored.
Impact of Innings on Runs Scored
The final question I want to answer is a simple one. Do more innings result in more runs? My expectation is that innings directly impact runs scored.
So having more innings does not automatically mean having the most runs, in contrast to the balls faced vs runs scored chart. My understanding is that having many innings does not automatically mean facing more balls, especially if you bat lower in the order and only get a few balls to face per innings.
7. Create Rating System for T20 Batsmen
For this section, I will establish a rating system to judge some of the best T20 batsmen in IPL. Just to clarify, this rating system is very simplified and by no means the ideal system to judge T20 batsmen. It is simply an experiment to identify players that are really good across all the batting metrics that matter in T20 cricket.
For the rating system, I will award a score from 1 to 5 for each metric using pre-determined targets. For example, scoring more than 2000 runs would be considered a 5 in that category. See below for the full breakdown of the rating system.
Metrics and the targets for each rating
- Runs Scored. 0–500 is a 1, 500–1000 is a 2, 1000–1500 is a 3, 1500–2000 is a 4, 2000+ is a 5.
- Batting Average. 0–10 is a 1, 10–20 is a 2, 20–30 is a 3, 30–35 is a 4, 35+ is a 5.
- Strike Rate. 0–50 is a 1, 50–100 is a 2, 100–120 is a 3, 120–140 is a 4, 140+ is a 5.
- Landmarks Reached. 0–5 is a 1, 5–10 is a 2, 10–15 is a 3, 15–20 is a 4, 20+ is a 5.
- Boundaries Hit. 0–50 is a 1, 50–100 is a 2, 100–150 is a 3, 150–200 is a 4, 200+ is a 5.
- Total Rating. A summation of all the values above for each player. The higher the total rating, the higher the batsman is ranked.
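The rating scheme in this list can be expressed compactly with a threshold lookup. The sketch below is my own illustration, not the notebook's implementation. Because the listed ranges share their endpoints, it assumes that a value exactly on a boundary (for example, 500 runs) falls into the higher band. The metric keys and the sample player are hypothetical.

```python
from bisect import bisect_right

# Boundaries between ratings 1|2|3|4|5 for each metric, taken from the list above
THRESHOLDS = {
    "runs": [500, 1000, 1500, 2000],
    "average": [10, 20, 30, 35],
    "strike_rate": [50, 100, 120, 140],
    "landmarks": [5, 10, 15, 20],
    "boundaries": [50, 100, 150, 200],
}


def rate(metric, value):
    """Return a 1-5 rating; a value equal to a threshold gets the higher rating (an assumption)."""
    # bisect_right counts how many thresholds are less than or equal to the value
    return 1 + bisect_right(THRESHOLDS[metric], value)


def total_rating(stats):
    """Sum the individual ratings for a dict of metric name -> value."""
    return sum(rate(metric, value) for metric, value in stats.items())


# Hypothetical player used only to exercise the functions
player = {
    "runs": 2100,
    "average": 36.5,
    "strike_rate": 135.0,
    "landmarks": 12,
    "boundaries": 260,
}

print({metric: rate(metric, value) for metric, value in player.items()})
print("Total rating:", total_rating(player))
```

Keeping the thresholds in one dictionary makes it easy to experiment with different targets without touching the rating logic.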
See below for a bubble chart displaying these attributes. The x-axis indicates the Runs Scored, the y-axis indicates the Strike Rate, the size of the bubble indicates the Batting Average, and the color indicates the Total Rating.
8. Targets for an aspiring T20 Batsman
After analyzing all of this batting data, I thought it would be interesting to find some measurable goals that an aspiring T20 batsman can aim for to replicate the performances of the top 25 T20 batsmen in the IPL. Obviously, trying to score as many runs as the top 25 T20 batsmen is not a realistic goal for a season, so I came up with 4 possible goals that a batsman can aim for.
- Percent of total runs scored in boundaries. The average boundary percentage of the top 25 T20 batsmen gives a player an idea of how much of their runs should come from boundaries to have a good strike rate.
- Average of the top 25 T20 batsmen’s strike rates. This is simply an ideal strike rate a T20 batsman should aim for. We can calculate the average of the strike rates to provide this metric.
- Average of the top 25 T20 batsmen’s averages. This indicates roughly how many runs per innings the batsman should strive for. We can calculate the average of the existing averages for this goal.
- Average balls faced per innings played. This gives a good idea of how many balls a player should look to face in each innings to give themselves the best possible chance of improving their average.
See below for the targets.
The Ideal Percent of total runs scored in boundaries for an aspiring player to aim for is: 59.43%
The Ideal Strike Rate a T20 Batsman should aim for is: 132.86
The Ideal Average a T20 Batsman should aim for is: 34.37
The Average Balls Faced per Innings Played for a T20 Batsman is: 21.4
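The four targets above could be computed along these lines. This is a sketch under assumptions rather than the notebook's code: I have taken "top 25" to mean the top 25 by runs, assumed column names (Runs, 4s, 6s, SR, Avg, BF, Inns), and averaged the per-player values for each target. The sample data is made up, so the printed numbers will not match the results quoted above.

```python
import pandas as pd


def batting_targets(df, n=25):
    """Average four benchmark metrics over the top n run scorers."""
    top = df.nlargest(n, "Runs")
    # Runs that came from boundaries: 4 per four, 6 per six
    boundary_runs = 4 * top["4s"] + 6 * top["6s"]
    return {
        "boundary_pct": round(float((100 * boundary_runs / top["Runs"]).mean()), 2),
        "strike_rate": round(float(top["SR"].mean()), 2),
        "average": round(float(top["Avg"].mean()), 2),
        "balls_per_innings": round(float((top["BF"] / top["Inns"]).mean()), 2),
    }


# Made-up sample (not real IPL data)
batting = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Runs": [1000, 500, 50],
    "4s": [100, 40, 3],
    "6s": [30, 20, 1],
    "SR": [140.0, 130.0, 90.0],
    "Avg": [40.0, 30.0, 10.0],
    "BF": [700, 400, 55],
    "Inns": [30, 20, 5],
})

targets = batting_targets(batting, n=2)
for name, value in targets.items():
    print(f"{name}: {value}")
```

Averaging per-player percentages treats every top batsman equally; pooling all their runs first would instead weight the result toward the heaviest scorers.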
Conclusion
After exploring the dataset above, I have a much deeper appreciation for the skill it takes to be one of the best players in this game. It is certainly not easy, and the statistics shown above make that very apparent. My rating system is definitely not the best out there, but it does a good job of highlighting the players who are dominant across all metrics. I also hope the goals/targets section benefits an aspiring cricketer out there looking to emulate their favorite players.
Thank you for reading this far, and I hope you enjoyed this analysis as much as I enjoyed making it! Stay tuned for more content like this. I will be writing a second part to this article specifically covering the bowling analysis.