3 Different Approaches for Train/Test Splitting of a Pandas Dataframe
Usually, the train/test split is one of the Machine Learning tasks taken for granted: data scientists tend to focus on Data Preprocessing or Feature Engineering, delegating the division of the dataset to a single line of code.
In this short article, I describe three train/test splitting techniques, each exploiting a different Python library:
scikit-learn
pandas
NumPy
In this tutorial, I assume that the whole dataset is available as a CSV file, which is loaded as a Pandas Dataframe. I consider the heart.csv dataset, which has 303 rows and 14 columns:
import pandas as pd
# Read the CSV file into a Dataframe
df = pd.read_csv('source/heart.csv')
The output column corresponds to the target column and all the remaining ones correspond to the input features:
Y_col = 'output'
X_cols = df.loc[:, df.columns != Y_col].columns
Scikit-learn provides a function, named train_test_split(), which automatically splits a dataset into a training and a test set. The function accepts either lists or Pandas Dataframes as input.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[X_cols], df[Y_col], test_size=0.2, random_state=42)
Other input parameters include:
- test_size: the proportion of the dataset to be included in the test set.
- random_state: the seed passed to the shuffle operation, which makes the experiment reproducible.
The original dataset contains 303 records; with test_size=0.2, the train_test_split() function assigns 242 records to the training set and 61 to the test set.
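As a quick sanity check, we can print the resulting shapes; since the dataset has 303 rows and 14 columns (13 input features plus the target), the expected output is:
# Verify the split sizes: 303 rows -> 242 train / 61 test
print(X_train.shape, X_test.shape)  # (242, 13) (61, 13)
print(y_train.shape, y_test.shape)  # (242,) (61,)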
Pandas provides a Dataframe function, named sample(), which can be used to split a Dataframe into train and test sets. The function receives as input the frac parameter, which corresponds to the proportion of the dataset to be included in the result. Similarly to the scikit-learn train_test_split(), the sample() function also provides the random_state input parameter, as shown in the sketch below.
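A minimal sketch of this approach, assuming the same 80/20 proportion as before: sample() draws the training rows at random, and drop() removes them (by index) to obtain the test rows.
# Draw 80% of the records at random for the training set
train = df.sample(frac=0.8, random_state=42)
# The test set contains the remaining records, identified by index
test = df.drop(train.index)
# Separate input features and target, as before
X_train, y_train = train[X_cols], train[Y_col]
X_test, y_test = test[X_cols], test[Y_col]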
Continue Reading on Towards AI