IPL Powerplay Score Prediction using AWS Lambda
We start with a basic understanding of the powerplay in IPL, then walk through powerplay score prediction and hyperparameter optimization, and finally derive a predicted score for the full 20 overs.
Want to access the code? Click here
Want to predict a score? Click here
Disclaimer: The analysis and prediction done here are for learning purposes only and should not be used for any illegal activities such as betting.
It was COVID lockdown time. Everyone was worried about the rising COVID cases and the Work From Home (WFH) environment. No way to go out, no parties, no outings, no vacation plans, no gatherings...
The Government announced Lockdown 1.0, 2.0, 3.0, 4.0...
At that time every cricket fan had one question: "Do the new Govt. rules in Lockdown 4.0 pave the way for IPL 2020?" Finally, IPL 2020 happened. Even a pandemic couldn't derail the IPL juggernaut.
IPL 2020 earned revenue of 4,000 crores, with
- a 35% reduction in cost and
- a 25% increase in viewership.
As a cricket fan I watched all the matches, and during that time I observed that the powerplay plays a major role in predicting a team's score.
"Powerplay" in IPL has fielding restrictions in 1st 6-overs, ie..
- only 2-fielders can stay outside the Inner-circle
Powerplay makes the batting comparatively easy. Also it's a trap for the batsmen, as this will get them to take into a risk and loose their wickets in 1st 6-overs.
So, these overs are considered as pillars of any teams victory. 75% - of winning chance depends on the Powerplay score. So, every team's expectation from the top 3-batsmen is "START THE INNINGS BIG"
This step is the data-connection layer. For this very simple prototype, we keep it as simple as loading a dataset from cricsheet.
import io
import zipfile
import requests
import pandas as pd
from pathlib import Path

dataset_path = Path('dataset')   # assumed local folder for the extracted CSV
filename = 'all_matches.csv'     # ball-by-ball file inside the cricsheet ZIP

def getCsvFile(url="https://cricsheet.org/downloads/ipl_male_csv2.zip"):
    res = requests.get(url, stream=True)
    if res.status_code == 200:
        print('### Downloading the CSV file')
        z = zipfile.ZipFile(io.BytesIO(res.content))
        if filename in z.namelist():
            z.extract(filename, dataset_path)
            print('### Extracted %s file' % filename)
        else:
            print('### %s : File not found in ZIP Artifact' % filename)

def downloadDataset():
    if not dataset_path.exists():
        Path.mkdir(dataset_path)
        print('### Created Dataset folder')
        getCsvFile()
    else:
        files = [file for file in dataset_path.iterdir() if file.name == 'all_matches.csv']
        if len(files) == 0:
            getCsvFile()
        else:
            print('### File already extracted in given path')

downloadDataset()
Here we use pandas to load the dataset into a DataFrame, which can be done with the code below:
csv_file=Path.joinpath(dataset_path, filename)
df = pd.read_csv(csv_file,parse_dates=['start_date'],low_memory=False)
df_parsed = df.copy(deep=True)
df.head()
It contains 200,664 rows and 22 columns:
match_id | season | start_date | venue | innings | ball | batting_team | bowling_team | striker | non_striker | bowler | runs_off_bat | extras | wides | noballs | byes | legbyes | penalty | wicket_type | player_dismissed | other_wicket_type | other_player_dismissed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
335982 | 2007/08 | 4/18/2008 | M Chinnaswamy Stadium | 1 | 0.1 | Kolkata Knight Riders | Royal Challengers Bangalore | SC Ganguly | BB McCullum | P Kumar | 0 | 1 | 1 | ||||||||
335982 | 2007/08 | 4/18/2008 | M Chinnaswamy Stadium | 1 | 0.2 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | P Kumar | 0 | 0 | |||||||||
335982 | 2007/08 | 4/18/2008 | M Chinnaswamy Stadium | 1 | 0.3 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | P Kumar | 0 | 1 | 1 | ||||||||
335982 | 2007/08 | 4/18/2008 | M Chinnaswamy Stadium | 1 | 0.4 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | P Kumar | 0 | 0 |
Before analyzing the data, "data preprocessing" is considered the most important step in building a learning model.
We can easily get tons of data in the form of various datasets, but making that data fit for deriving insights requires a lot of observation, modification, manipulation and numerous other steps.
When we freshly download a dataset for a project, the data it contains is (most of the time) not arranged or filled in the way we need it to be.
Sometimes it might have null values, unnecessary features, datatypes not in a proper format, etc.
To treat all these shortcomings, we go through a process popularly known as "data preprocessing".
Coming back to our IPL dataset, we have to process the data as below before training the model.
The headers of this dataset are self-explanatory.
- First, identify the "null" values in the dataset. Fortunately, there are no null values in the required fields of this dataset.
df.isnull().sum()
match_id 0
season 0
start_date 0
venue 0
innings 0
ball 0
batting_team 0
bowling_team 0
striker 0
non_striker 0
bowler 0
runs_off_bat 0
extras 0
wides 194583
noballs 199854
byes 200135
legbyes 197460
penalty 200662
wicket_type 190799
player_dismissed 190799
other_wicket_type 200664
other_player_dismissed 200664
dtype: int64
- Select the first 6 overs of each match and drop the rest:
  - Since we are predicting the "powerplay" score, we can drop all data except the first 6 overs, and
  - drop all innings > 2, as there is no scope for a 3rd or 4th innings.
df_parsed = df_parsed[(df_parsed.ball < 6.0) & (df_parsed.innings < 3)]
- Identify and drop innings played for fewer than 6 overs:
  - In our data a few matches were abandoned and declared "no result", perhaps due to rain or other technical problems.
  - In other matches the result was decided by the DLS method, so we can't predict the score.
  These rows would introduce outliers into our model: if a team played only 3 overs, the minimum of the score distribution is distorted, which in turn skews the "Normalization" step that follows. So it is better to drop this data from our prediction.
# 6 overs = 36 legal deliveries; drop innings with fewer balls recorded
obj = df_parsed.query('ball < 6.0 & innings < 3').groupby(
    ['match_id', 'venue', 'innings', 'batting_team', 'bowling_team'])
for key, val in obj:
    if val['ball'].count() < 36:
        df_parsed.drop(labels=val.index, axis=0, inplace=True)
- The columns 'season' and 'start_date' are not needed for our prediction, so we can drop them from the dataset.
df_parsed.drop(columns=['season', 'start_date'], inplace=True)
- Delete teams that no longer exist:
non_exist_teams = ['Kochi Tuskers Kerala',
'Pune Warriors',
'Rising Pune Supergiants',
'Rising Pune Supergiant',
'Gujarat Lions']
mask_bat_team = df_parsed['batting_team'].isin(non_exist_teams)
mask_bow_team = df_parsed['bowling_team'].isin(non_exist_teams)
df_parsed = df_parsed[~mask_bat_team]
df_parsed = df_parsed[~mask_bow_team]
- Unify old and new team names so each franchise appears under a single label:
df_parsed.loc[df_parsed.batting_team ==
'Delhi Daredevils', 'batting_team'] = 'Delhi Capitals'
df_parsed.loc[df_parsed.batting_team == 'Deccan Chargers',
'batting_team'] = 'Sunrisers Hyderabad'
df_parsed.loc[df_parsed.batting_team ==
'Punjab Kings', 'batting_team'] = 'Kings XI Punjab'
df_parsed.loc[df_parsed.bowling_team ==
'Delhi Daredevils', 'bowling_team'] = 'Delhi Capitals'
df_parsed.loc[df_parsed.bowling_team == 'Deccan Chargers',
'bowling_team'] = 'Sunrisers Hyderabad'
df_parsed.loc[df_parsed.bowling_team ==
'Punjab Kings', 'bowling_team'] = 'Kings XI Punjab'
- Correct the venue column so each stadium has a unique name. In this dataset the same stadium is represented in multiple ways, so identify those variants and rename them.
df_parsed.loc[df_parsed.venue == 'M.Chinnaswamy Stadium',
'venue'] = 'M Chinnaswamy Stadium'
df_parsed.loc[df_parsed.venue == 'Brabourne Stadium, Mumbai',
'venue'] = 'Brabourne Stadium'
df_parsed.loc[df_parsed.venue == 'Punjab Cricket Association IS Bindra Stadium, Mohali',
'venue'] = 'Punjab Cricket Association Stadium'
df_parsed.loc[df_parsed.venue == 'Punjab Cricket Association IS Bindra Stadium',
'venue'] = 'Punjab Cricket Association Stadium'
df_parsed.loc[df_parsed.venue == 'Wankhede Stadium, Mumbai',
'venue'] = 'Wankhede Stadium'
df_parsed.loc[df_parsed.venue == 'Rajiv Gandhi International Stadium, Uppal',
'venue'] = 'Rajiv Gandhi International Stadium'
df_parsed.loc[df_parsed.venue == 'MA Chidambaram Stadium, Chepauk',
'venue'] = 'MA Chidambaram Stadium'
df_parsed.loc[df_parsed.venue == 'MA Chidambaram Stadium, Chepauk, Chennai',
'venue'] = 'MA Chidambaram Stadium'
- Create a column "Total_score" that reflects the runs scored off the bat plus the extra runs from wides, byes, no-balls, leg-byes, etc.
df_parsed['Total_score'] = df_parsed.runs_off_bat + df_parsed.extras
df_parsed.drop(columns=['wides', 'noballs', 'byes', 'legbyes', 'penalty', 'wicket_type',
'other_wicket_type', 'other_player_dismissed'], axis=1, inplace=True)
- The following columns, dropped above, can now be ignored while processing the data frame:
['wides', 'noballs', 'byes', 'legbyes', 'penalty', 'wicket_type', 'other_wicket_type', 'other_player_dismissed']
Our data frame then looks like below, with 54,525 rows and 13 columns (initially it was 200,664 rows and 22 columns).
# Total 33 venues present
# Total 8 batting teams
# Total 8 bowling teams
# Batting teams are : ['Kolkata Knight Riders' 'Royal Challengers Bangalore'
'Chennai Super Kings' 'Kings XI Punjab' 'Rajasthan Royals'
'Delhi Capitals' 'Sunrisers Hyderabad' 'Mumbai Indians']
# Bowling teams are : ['Royal Challengers Bangalore' 'Kolkata Knight Riders' 'Kings XI Punjab'
'Chennai Super Kings' 'Delhi Capitals' 'Rajasthan Royals'
'Sunrisers Hyderabad' 'Mumbai Indians']
# Shape of data frame after initial cleanup :(54525, 13)
match_id | venue | innings | ball | batting_team | bowling_team | batsmen | batsmen_non_striker | bowlers | runs_off_bat | extras | player_dismissed | Total_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
335982 | M Chinnaswamy Stadium | 1 | 0.1 | Kolkata Knight Riders | Royal Challengers Bangalore | SC Ganguly | BB McCullum | P Kumar | 0 | 1 | NaN | 1 |
335982 | M Chinnaswamy Stadium | 1 | 0.2 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | P Kumar | 0 | 0 | NaN | 0 |
335982 | M Chinnaswamy Stadium | 1 | 0.3 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | P Kumar | 0 | 1 | NaN | 1 |
335982 | M Chinnaswamy Stadium | 1 | 0.4 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | P Kumar | 0 | 0 | NaN | 0 |
335982 | M Chinnaswamy Stadium | 1 | 0.5 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | P Kumar | 0 | 0 | NaN | 0 |
As an initial step so far, we have:
- cleaned the dataset of "null" values,
- filtered out the rows/columns that are not used for prediction, and
- added the required derived column (Total_score) to make the dataset clean.
As you can see above, our cleaned dataset has 13 columns of multiple dtypes: int64, float64 and object.
So next, I am going to convert all these dtypes into a single numeric dtype in order to train the model.
1 venue 54525 non-null object
4 batting_team 54525 non-null object
5 bowling_team 54525 non-null object
6 batsmen 54525 non-null object
7 bowlers 54525 non-null object
- Here we can see a few players who can both bat and bowl, meaning the same player will be listed as a batsman as well as a bowler.
- So, to make the prediction consistent, I create one data frame with all the player names, called "players_df", which I use to encode each player with a single identifying value.
- For inference later on, I save a dictionary with all these encoded values.
import json
import numpy as np

players_df = pd.DataFrame(np.append(
    df_parsed.batsmen.unique(), df_parsed.bowlers.unique()),
    columns=['Players']
)
label_encode_dict = {}
dct = dict(enumerate(players_df.Players.astype('category').cat.categories))
label_encode_dict['Players'] = dict(zip(dct.values(), dct.keys()))
for col in ['venue', 'batting_team', 'bowling_team']:
    dct = dict(enumerate(df_parsed[col].astype('category').cat.categories))
    label_encode_dict[col] = dict(zip(dct.values(), dct.keys()))
# Total_score_min_max (computed elsewhere) holds the (min, max) of the
# per-innings Total_score; it is stored so predictions can be de-normalized
# at inference time
label_encode_dict['Total_score_min'] = float(Total_score_min_max[0])
label_encode_dict['Total_score_max'] = float(Total_score_min_max[1])
json_path = Path.joinpath(dataset_path, 'label_encode.json')
with open(json_path, 'w') as f:
    json.dump(label_encode_dict, f)
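As a quick sanity check on the encoding, we can peek at a few entries of the saved dictionary (a small illustrative sketch; the exact codes depend on the data):
# First few player encodings and one venue encoding
print(list(label_encode_dict['Players'].items())[:3])
print(label_encode_dict['venue'].get('M Chinnaswamy Stadium'))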
In this step I club all the rows by match_id and innings (match_id uniquely identifies a particular match, and innings identifies who batted first).
Based on these two keys,
- grab all the batsmen and bowlers who batted and bowled in the first 6 overs,
- calculate the total score (runs off the bat + extras), and
- count how many players were dismissed in the first 6 overs.
print('### Shape of Dataframe before format_data : {}'.format(df_parsed.shape))
group_keys = ['match_id', 'venue', 'innings', 'batting_team', 'bowling_team']
grouped = df_parsed.groupby(group_keys)
Runs_off_Bat_6_overs = grouped['runs_off_bat'].sum()
Extras_6_overs = grouped['extras'].sum()
TotalScore_6_overs = grouped['Total_score'].sum()
Total_WktsDown = grouped['player_dismissed'].count()
bat_df = grouped['batsmen'].apply(list)
bow_df = grouped['bowlers'].apply(list)
df_parsed = pd.concat([bat_df, bow_df, Runs_off_Bat_6_overs, Extras_6_overs,
                       TotalScore_6_overs, Total_WktsDown], axis=1).reset_index()
The formatted dataset above gives us, per innings, the list of batsmen and bowlers who batted and bowled in the first 6 overs.
Now we have to arrange these into separate columns:
* batsmen: bat1, bat2, bat3, bat4, ... bat10
* bowlers: bow1, bow2, bow3, ... bow6
I selected only 10 batsmen (as there are only 10 wickets) and 6 bowlers (at most 6 can bowl in 6 overs), because that is all that is possible in 6 overs.
For a proper prediction, the order of batsmen and bowlers in the dataset matters. So we need to preserve the order:
* batsmen: who batted 1st, 2nd, 3rd, 4th, ... stays the same
* bowlers: who bowled the 1st, 2nd, 3rd, 4th, 5th and 6th overs stays the same
To keep this order, I first create a dummy data frame
- with 10 batsman columns [bat1, bat2, ..., bat9, bat10] and
- with 6 bowler columns [bow1, bow2, bow3, bow4, bow5, bow6].
bat = pd.DataFrame(np.zeros((df_parsed.shape[0],10),dtype=float),columns=['bat1', 'bat2', 'bat3', 'bat4', 'bat5', 'bat6', 'bat7', 'bat8', 'bat9', 'bat10'])
bowl = pd.DataFrame(np.zeros((df_parsed.shape[0],6),dtype=float),columns=['bow1', 'bow2', 'bow3', 'bow4', 'bow5', 'bow6'])
columns = ['bat1', 'bat2', 'bat3', 'bat4', 'bat5', 'bat6', 'bat7', 'bat8', 'bat9', 'bat10','bow1', 'bow2', 'bow3', 'bow4', 'bow5', 'bow6']
df_bat_bow = pd.concat([bat,bowl],axis=1)
df_bat_bow
bat1 | bat2 | bat3 | bat4 | bat5 | bat6 | bat7 | bat8 | bat9 | bat10 | bow1 | bow2 | bow3 | bow4 | bow5 | bow6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1453 rows × 16 columns |
# Concatenate this dummy data frame with the parsed data frame
df_parsed = pd.concat([df_parsed, df_bat_bow], axis=1)
This means mapping the list of entries for each match_id and innings into the corresponding individual batsman columns, in the same order.
For example, in the list below the 1st batsman becomes bat1 -> SC Ganguly, the 2nd bat2 -> BB McCullum, the 3rd bat3 -> RT Ponting, and so on:
"['SC Ganguly', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'SC Ganguly', 'SC Ganguly', 'SC Ganguly', 'BB McCullum', 'BB McCullum', 'SC Ganguly', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'SC Ganguly', 'SC Ganguly', 'SC Ganguly', 'BB McCullum', 'SC Ganguly', 'SC Ganguly', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'SC Ganguly', 'BB McCullum', 'SC Ganguly', 'RT Ponting', 'RT Ponting', 'RT Ponting', 'RT Ponting', 'BB McCullum', 'RT Ponting', 'BB McCullum', 'RT Ponting', 'RT Ponting', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'RT Ponting', 'BB McCullum', 'RT Ponting', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'RT Ponting', 'BB McCullum', 'RT Ponting', 'BB McCullum', 'RT Ponting', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'RT Ponting', 'RT Ponting', 'RT Ponting', 'RT Ponting', 'RT Ponting', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'RT Ponting', 'RT Ponting', 'RT Ponting', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'DJ Hussey', 'DJ Hussey', 'BB McCullum', 'DJ Hussey', 'BB McCullum', 'DJ Hussey', 'DJ Hussey', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'DJ Hussey', 'BB McCullum', 'DJ Hussey', 'DJ Hussey', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'DJ Hussey', 'BB McCullum', 'DJ Hussey', 'DJ Hussey', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'DJ Hussey', 'BB McCullum', 'Mohammad Hafeez', 'Mohammad Hafeez', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'Mohammad Hafeez', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum', 'BB McCullum']"
from collections import OrderedDict

# Deduplicate each list while preserving the order of first appearance
df_parsed['batsmen'] = df_parsed.batsmen.map(OrderedDict.fromkeys).apply(list)
df_parsed['bowlers'] = df_parsed.bowlers.map(OrderedDict.fromkeys).apply(list)
# Spread the ordered lists into the bat1..bat10 / bow1..bow6 columns
for row, val in enumerate(df_parsed.batsmen):
    for i in range(len(val)):
        df_parsed.loc[row, 'bat%s' % (i + 1)] = val[i]
for row, val in enumerate(df_parsed.bowlers):
    for i in range(len(val)):
        df_parsed.loc[row, 'bow%s' % (i + 1)] = val[i]
df_parsed.loc[:, ['bat1', 'bat2', 'bat3', 'bat4', 'bat5', 'bat6', 'bat7', 'bat8',
                  'bat9', 'bat10', 'bow1', 'bow2', 'bow3', 'bow4', 'bow5', 'bow6']]
So finally our dataframe is ready with the batsman and bowler details, and we can drop a few columns which are not important.
df_model = df_parsed[['venue', 'innings', 'batting_team', 'bowling_team', 'bat1', 'bat2', 'bat3', 'bat4', 'bat5', 'bat6', 'bat7', 'bat8','bat9', 'bat10', 'bow1', 'bow2', 'bow3', 'bow4', 'bow5', 'bow6', 'runs_off_bat', 'extras', 'Total_score', 'player_dismissed']]
df_model.columns
Index(['venue', 'innings', 'batting_team', 'bowling_team', 'bat1', 'bat2',
'bat3', 'bat4', 'bat5', 'bat6', 'bat7', 'bat8', 'bat9', 'bat10', 'bow1',
'bow2', 'bow3', 'bow4', 'bow5', 'bow6', 'runs_off_bat', 'extras',
'Total_score', 'player_dismissed'],
dtype='object')
Now it's time to use the label-encoded values (prepared in the previous steps) to encode the data frame.
json_path = Path.joinpath(dataset_path, 'label_encode.json')
with open(json_path) as f:
    data = json.load(f)

player_cols = ['bat%d' % i for i in range(1, 11)] + ['bow%d' % i for i in range(1, 7)]
for col in df_model.columns:
    if col in ('venue', 'batting_team', 'bowling_team'):
        mapping = data[col]
    elif col in player_cols:
        mapping = data['Players']   # batsmen and bowlers share one encoding
    else:
        continue                    # numeric columns stay as they are
    df_model = df_model.replace({col: mapping})
venue | innings | batting_team | bowling_team | bat1 | bat2 | bat3 | bat4 | bat5 | bat6 | bat7 | bat8 | bat9 | bat10 | bow1 | bow2 | bow3 | bow4 | bow5 | bow6 | runs_off_bat | extras | Total_score | player_dismissed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14 | 1 | 3 | 6 | 419 | 70 | 385 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 326 | 512 | 19 | 0 | 0 | 0 | 51 | 10 | 61 | 1 |
14 | 2 | 6 | 3 | 351 | 496 | 483 | 187 | 97 | 297 | 0 | 0 | 0 | 0 | 22 | 163 | 20 | 0 | 0 | 0 | 19 | 7 | 26 | 4 |
21 | 1 | 0 | 2 | 331 | 286 | 274 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 65 | 408 | 201 | 0 | 0 | 0 | 50 | 3 | 53 | 1 |
After the encoding step all the data has been converted into numerical values, but they lie in different ranges.
In this normalization step we rescale the numerical columns to a common scale, without losing information or distorting the differences within each range, using min-max scaling.
df_model = df_model.astype(np.float64)
df_norm = (df_model - df_model.min()) / (df_model.max() - df_model.min())
df_norm.fillna(0.0, inplace=True)   # constant columns divide by zero -> NaN -> 0
df_norm.head()
At the end of this step the data frame looks like this:
venue | innings | batting_team | bowling_team | bat1 | bat2 | bat3 | bat4 | bat5 | bat6 | ... | bow1 | bow2 | bow3 | bow4 | bow5 | bow6 | runs_off_bat | extras | Total_score | player_dismissed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.43750 | 0.0 | 0.428571 | 0.857143 | 0.819253 | 0.134387 | 0.751953 | 0.000000 | 0.000000 | 0.000000 | ... | 0.636008 | 1.000000 | 0.037109 | 0.0 | 0.0 | 0.0 | 0.452632 | 0.666667 | 0.516484 | 0.2 |
0.43750 | 1.0 | 0.857143 | 0.428571 | 0.685658 | 0.976285 | 0.943359 | 0.365949 | 0.189824 | 0.586957 | ... | 0.041096 | 0.314342 | 0.039062 | 0.0 | 0.0 | 0.0 | 0.115789 | 0.466667 | 0.131868 | 0.8 |
0.65625 | 0.0 | 0.000000 | 0.285714 | 0.646365 | 0.561265 | 0.535156 | 0.000000 | 0.000000 | 0.000000 | ... | 0.125245 | 0.795678 | 0.392578 | 0.0 | 0.0 | 0.0 | 0.442105 | 0.200000 | 0.428571 | 0.2 |
0.65625 | 1.0 | 0.285714 | 0.000000 | 0.402750 | 0.393281 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.358121 | 0.575639 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.557895 | 0.133333 | 0.538462 | 0.2 |
0.28125 | 0.0 | 0.714286 | 0.142857 | 0.915521 | 0.996047 | 0.865234 | 0.499022 | 0.000000 | 0.000000 | ... | 0.283757 | 0.115914 | 0.537109 | 0.0 | 0.0 | 0.0 | 0.315789 | 0.133333 | 0.285714 | 0.4 |
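For reference, the same rescaling can be done with scikit-learn's MinMaxScaler; this is an equivalent sketch, assuming df_model is the fully numeric frame from the previous step (it also handles constant columns without producing NaNs):
from sklearn.preprocessing import MinMaxScaler

# fit_transform rescales every column to [0, 1]; constant columns map to 0
scaler = MinMaxScaler()
df_norm = pd.DataFrame(scaler.fit_transform(df_model), columns=df_model.columns)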
- Identify the inputs vs. targets in the pandas dataframe and convert them to torch Tensors
- Create the torch dataset
- Split torch_ds into train_ds and test_ds (70% vs. 30%)
- Create a dataloader for train_ds and for test_ds
import torch

split_ratio = (70, 30)   # train / test percentages
train_bs = 32            # train batch size (test batches are 2x)

inputs = torch.Tensor(df_norm.iloc[:, :-4].values.astype(np.float32))
targets = torch.Tensor(df_norm.loc[:, 'Total_score'].values.reshape(-1, 1).astype(np.float32))
torch_ds = torch.utils.data.TensorDataset(inputs, targets)
len(torch_ds)  # 1453
train_ds_sz = int(round((len(torch_ds) * split_ratio[0]) / 100.0))
test_ds_sz = len(torch_ds) - train_ds_sz
train_ds, test_ds = torch.utils.data.random_split(torch_ds, [train_ds_sz, test_ds_sz])
len(train_ds), len(test_ds)  # (1017, 436)
train_dl = torch.utils.data.DataLoader(dataset=train_ds, batch_size=train_bs, shuffle=True)
test_dl = torch.utils.data.DataLoader(dataset=test_ds, batch_size=2 * train_bs)
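As a quick sanity check on the loaders, we can pull one batch and confirm the shapes (the 20 input features are the columns left after dropping the last four):
xb, yb = next(iter(train_dl))
print(xb.shape, yb.shape)  # torch.Size([32, 20]) torch.Size([32, 1])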
Here I am using a linear regression model.
class linearRegression(torch.nn.Module):
    def __init__(self, inputSize, outputSize):
        super(linearRegression, self).__init__()
        self.linear = torch.nn.Linear(inputSize, outputSize)

    def forward(self, x):
        out = self.linear(x)
        return out

model = linearRegression(inputs.shape[1], targets.shape[1])  # (20, 1)
I tried many combinations of learning rate, momentum, NAG, batch size, optimizers and schedulers to find the best minimum.
All the results are available at record metrics.
Out of all the tests, the hyperparameters below gave the best results:
1. Train vs. test batch size : 32 & 64
2. Loss : MSELoss()
3. Optimizer : "SGD (Parameter Group 0,
dampening: 0,
lr: 0.1,
momentum: 0.9,
nesterov: True,
weight_decay: 0
)"
4. Minimum train loss of "0.0061" observed at epoch 36 (see the graph below)
lr = 0.1  # learning rate from the hyperparameter search above

# loss = torch.nn.L1Loss()
loss = torch.nn.MSELoss()
loss_Description = str(loss)  # declared in the Hyper Params section at the top

# opt = torch.optim.SGD(model.parameters(), lr=lr)
# opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, nesterov=True)
optimizer_Description = str(opt)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer=opt, mode='min', factor=0.1, patience=10, verbose=True)
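The training loop itself is not shown in this post; here is a minimal sketch consistent with the pieces defined above (model, loss, opt, scheduler, train_dl, test_dl). The epoch count and logging are my assumptions, not the original code:
epochs = 100  # assumed; the loss curve referenced below came from a similar run

for epoch in range(epochs):
    # Training pass
    model.train()
    train_loss = 0.0
    for xb, yb in train_dl:
        opt.zero_grad()
        batch_loss = loss(model(xb), yb)
        batch_loss.backward()
        opt.step()
        train_loss += batch_loss.item() * xb.size(0)
    train_loss /= len(train_ds)

    # Evaluation pass
    model.eval()
    test_loss = 0.0
    with torch.no_grad():
        for xb, yb in test_dl:
            test_loss += loss(model(xb), yb).item() * xb.size(0)
    test_loss /= len(test_ds)

    # ReduceLROnPlateau watches the validation loss
    scheduler.step(test_loss)
    print('epoch {} : train loss {:.4f}, test loss {:.4f}'.format(
        epoch + 1, train_loss, test_loss))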
From the above graph, we reached the best minimum at epoch 36, with a train loss of 0.0061 and a test loss of 0.0117.
Let's see the model's predicted values on the test data:
# encode_data is the label_encode.json dictionary saved earlier;
# loaded_model is the trained model loaded back from disk
predicted_values = []
actual_values = []
for input, output in test_dl:
    for row in input:
        # De-normalize: pred * (max - min) + min
        pred = (loaded_model(row) * (encode_data['Total_score_max'] -
                encode_data['Total_score_min'])) + encode_data['Total_score_min']
        predicted_values.append(pred)
    for op in output:
        actual = (op * (encode_data['Total_score_max'] -
                  encode_data['Total_score_min'])) + encode_data['Total_score_min']
        actual_values.append(actual)
    break  # inspect only the first test batch
predicted_values[:5]
[tensor([46.3414], grad_fn=<AddBackward0>),
tensor([30.7063], grad_fn=<AddBackward0>),
tensor([47.4587], grad_fn=<AddBackward0>),
tensor([51.6460], grad_fn=<AddBackward0>),
tensor([47.1634], grad_fn=<AddBackward0>)]
actual_values[:5]
[tensor([49.]),
tensor([27.]),
tensor([64.]),
tensor([66.]),
tensor([64.])]
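The accuracy figure below is not computed in the snippet above; one plausible way to arrive at such a number is 100% minus the mean absolute percentage error over the collected predictions (a sketch, not necessarily the author's exact metric; it assumes non-zero actual scores):
# Accuracy as 100% minus the mean absolute percentage error (MAPE)
errors = [abs(p.item() - a.item()) / a.item()
          for p, a in zip(predicted_values, actual_values)]
accuracy = 100.0 * (1.0 - sum(errors) / len(errors))
print('Approximate accuracy : %.2f%%' % accuracy)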
Based on these results, we can say this model is about 83.33% accurate. There is still scope for improvement, and anyone can start from this point: access the code from here.
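Finally, since the title promises AWS Lambda, below is a minimal sketch of how the trained model could be served behind a Lambda handler. The file names ('model.pth', 'label_encode.json'), the event shape, and the handler wiring are my assumptions for illustration, not the deployed code:
import json
import torch

# Load artifacts once per container, outside the handler (standard Lambda practice)
with open('label_encode.json') as f:
    encode_data = json.load(f)

model = linearRegression(20, 1)                 # the class defined earlier in this post
model.load_state_dict(torch.load('model.pth'))  # assumed saved weights
model.eval()

def lambda_handler(event, context):
    # 'features' is assumed to hold the 20 already-encoded, normalized inputs
    x = torch.tensor(event['features'], dtype=torch.float32)
    with torch.no_grad():
        norm_pred = model(x).item()
    # De-normalize back to runs using the saved min/max
    score = norm_pred * (encode_data['Total_score_max'] -
                         encode_data['Total_score_min']) + encode_data['Total_score_min']
    return {'statusCode': 200,
            'body': json.dumps({'predicted_powerplay_score': round(score)})}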