Cross-Validation and Hyperparameter Search in scikit-learn - A Complete Guide

This post is inspired by Kevin Markham's course Introduction to Machine Learning with scikit-learn on Data School. If you're a beginner looking to get started with Machine Learning using scikit-learn, I would highly recommend this course to gain all the required foundational skills.

In just a few hours, you'd be able to understand and build basic regression, and classification models with the optimal hyperparameters.

📌If this sounds interesting, please be sure to check the course at this link.

That said, we shall cover the following in this post.

How to Use Cross-Validation to Evaluate a Machine Learning Model

How do we validate a machine learning model?

When evaluating a machine learning model, training and testing on the same dataset is not a great idea. Why? Let us draw a relatable analogy.

Ever since school days, we’ve been giving exams, and how are our exams designed?

  • Well, they’ve been designed so as to test our understanding of the concepts rather than our ability to memorize!
  • The same analogy can be transposed to our machine learning model as well!

Here’s the answer to the question ‘Why can we not evaluate a model on the same data that it was trained on?’

It is because this process inherently encourages the model to memorize the training data. So, the model performs extremely well on the training data.

However, it generalizes rather poorly to data that it has never seen before.

What does train_test_split do?

As a model's performance on data that it has never seen before is a more reliable estimate of its performance, we usually validate the model by checking how it performs on out-of-sample data, that is, on data that it has never seen before.

If you remember, it is for this reason, we use the train_test_split method, in our very friendly and nifty library, scikit-learn.

This train_test_split splits the available data into two sets: the train and test sets in certain proportions.

For example, train on 70% of the available data and test on the remaining 30% of the data. This way, we can ensure that every record in the dataset can either be in the training set or the test set but not both!

And in doing so, we are making sure that we test the model’s performance on unseen data.

But, is this good enough? Or do we take this with a pinch of salt?

Let us train a simple KNeighborsClassifier in scikit-learn on the iris dataset.

Necessary imports
As shown below, we import the necessary modules.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

Load the data
Let us load the iris data and separate out the features(Sepal Length, Sepal Width, Petal Length and Petal Width) and the target variables which are the class labels indicating the iris type. (Setosa, Versicolour, and Virginica)

# read in the iris data
iris = load_iris()

# create X (features) and y (response)
X = iris.data
y = iris.target

Create train_test_split
Let’s create the train and test sets with random_state= 4. Setting the random_state ensures reproducibility.

In this case, it ensures that the records that go into the train and test sets stay the same every time our code is run.

# use train/test split with different random_state values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 4)

Check Classification Accuracy
We now instantiate the KNeighborsClassifier with n_neighbors=9 and fit the classifier on the training set and predict on the test set.

knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

# Output
0.9736842105263158

Now, let us change the random_state to a different value. What do you think the accuracy score would be?

Well, please feel free to insert your favorite number in the random_state and check for yourselves. It'd be a different accuracy score this time.

Setting the random_state to another value, we would get another value for the accuracy score.

The evaluation metric thus obtained is therefore susceptible to high variance.

  • The accuracy score may depend heavily on which data points end up in the training set and which end up in the test set.
  • Thus the accuracy score may be significantly different depending on how the division is made. Clearly, this doesn’t seem like the best way to validate our model’s performance!

How do we reach a consensus on how to calculate the accuracy score?

One very natural thing to do would be to create multiple train/test splits, calculate the accuracy for each such split, and compute the average of all the accuracy scores thus obtained. This definitely seems like a better estimate of the accuracy, doesn’t it?

This is precisely the essence of cross-validation, which we shall see in the subsequent section.

Understanding K-fold cross-validation

Steps in K-fold cross-validation

  1. Split the dataset into K equal partitions (or “folds”).
  2. Use fold 1 for testing and the union of the other folds as the training set.
  3. Calculate accuracy on the test set.
  4. Repeat steps 2 and 3 K times, using a different fold for testing each time.
  5. Use the average accuracy on different test sets as the estimate of out-of-sample accuracy.

Let us try to visualize this by splitting a dataset of 25 observations into 5 equal folds as shown below.

The dataset contains 20 observations (numbered 0 through 24).
5-fold cross-validation runs for 5 iterations.

Let's see how the folds are created

# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle = False).split(range(25))

# print the contents of each training and testing set
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, data in enumerate(kf, start=1):
   print('{:^9} {} {:^25}'.format(iteration, data[0], str(data[1])))

# Output

Iteration                   Training set observations                   Testing set observations
    1     [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]        [0 1 2 3 4]       
    2     [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]        [5 6 7 8 9]       
    3     [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 23 24]     [10 11 12 13 14]     
    4     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23 24]     [15 16 17 18 19]     
    5     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]     [20 21 22 23 24]

We observe the following:

  • For each iteration, every observation is either in the training set or the testing set, but not both.
  • Every observation is in the test set exactly once.
  • Each fold is used as the test set exactly once and in the training set (K-1) times.
  • The average accuracy thus obtained is a more accurate estimate of out-of-sample accuracy.

This process uses data more efficiently as every observation is used for both training and testing.

It is recommended to use stratified sampling for creating the folds, as this ensures that all class labels are represented in equal proportions in each fold. And, scikit-learn’s cross_val_score does this by default.

In practice, we can even do the following:

  • “Hold out” a portion of the data before beginning the model building process.
  • Find the best model using cross-validation on the remaining data, and test it using the hold-out set.
  • This gives a more reliable estimate of out-of-sample performance since the hold-out set is truly out-of-sample.

How to Use Cross-Validation to Search for the Best Model Hyperparameters

Cross-validation for hyperparameter tuning

For the KNN classifier on the iris dataset, can we possibly use cross-validation to find the optimal value for K? That is, to search for the optimal value of n_neighbors?

Remember, K in KNN classifier is the number of neighbors (n_neighbors) that we take into account for predicting the class label of the test sample. Not to be confused with the K in K-fold cross-validation.

from sklearn.model_selection import cross_val_score
# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)

# Output

[1.         0.93333333 1.         1.         0.86666667 0.93333333
 0.93333333 1.         1.         1.        ]

# use average accuracy as an estimate of out-of-sample accuracy
print(scores.mean())
# Output

0.9666666666666668

Now, we shall run the K fold cross-validation for the models with different values of n_neighbors, as shown below.

# search for an optimal value of K for KNN
k_range = list(range(1, 31))
k_scores = []
for k in k_range:    
   knn = KNeighborsClassifier(n_neighbors=k)    
   scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
   k_scores.append(scores.mean())
print(k_scores)
# Output k_scores

[0.96, 0.9533333333333334, 0.9666666666666666, 0.9666666666666666, 0.9666666666666668, 0.9666666666666668, 0.9666666666666668, 0.9666666666666668, 0.9733333333333334, 0.9666666666666668, 0.9666666666666668, 0.9733333333333334, 0.9800000000000001, 0.9733333333333334, 0.9733333333333334, 0.9733333333333334, 0.9733333333333334, 0.9800000000000001, 0.9733333333333334, 0.9800000000000001, 0.9666666666666666, 0.9666666666666666, 0.9733333333333334, 0.96, 0.9666666666666666, 0.96, 0.9666666666666666, 0.9533333333333334, 0.9533333333333334, 0.9533333333333334]

Let us plot the values to get a better idea.

import matplotlib.pyplot as plt
%matplotlib inline

# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')

We see that n_neighbors (K) values from 13 to 20 yield higher accuracy, especially K=13,18 and 20.As a larger value of K yields a less complex model, we choose K=20.

This process of searching for the optimal values of hyperparameters is called hyperparameter tuning.

In this example, we chose the values of K that resulted in higher mean accuracy score under 10-fold cross validation.

This is how cross-validation can be used to search for the best hyperparameters and this can be done much more efficiently in scikit-learn.

In KNN classifiers, setting a very small value for K will make the model needlessly complex, and a very large value of K would result in a model with high bias that yields suboptimal performance.

As K=13,18 and 20 gave the highest accuracy score, close to 0.98, we decided to choose K=20 as a larger value of K would yield a less complex model.

While it is not particularly difficult to write the for loop, we do realize that we may have to do it often.

Therefore, it would be good to have a more convenient way of doing the hyperparameter search, without having to write a loop every time and identify the best parameter through inspection.

Grid Search class in scikit-learn helps us do this all the more easily. Let's learn about it in the next section.

How to Use Grid Search for Hyperparameter Search

Import GridSearchCV Class

from sklearn.model_selection import GridSearchCV

Define the Parameter Grid

We now define the parameter grid (param_grid), a Python dictionary, whose key is the name of the hyperparameter whose best value we’re trying to find and the value is the list of possible values that we would like to search over for the hyperparameter.

# define the parameter values that should be searched
k_range = list(range(1, 31))
# create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(n_neighbors=k_range)
print(param_grid)
# param_grid
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]}

We now instantiate GridSearchCV. Note that we specify the param_grid instead of the n_neighbors argument that we had specified for cross_val_score earlier.

Why is this valid?

Remember, the parameter grid, param_grid is a dictionary whose key is n_neighbors and the value is a list of possible values of n_neighbors. Therefore, specifying the param_grid ensures that the value at index i is fetched as the value of n_neighbors in the i_th run.

Instantiate, fit the grid and view the results

# instantiate the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False)

We now go ahead and fit the grid with data, and access the cv_results_ attribute to get the mean accuracy score after 10-fold cross-validation, standard deviation and the parameter values.

For convenience, we may store the results in a pandas DataFrame. The mean and standard deviation of the accuracy scores for n_neighbors=1 to 10 are shown below.

# fit the grid with data
grid.fit(X, y)
# view the results as a pandas DataFrame
import pandas as pd
pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
# Output

mean_test_score std_test_score  params
0   0.960000    0.053333    {'n_neighbors': 1}
1   0.953333    0.052068    {'n_neighbors': 2}
2   0.966667    0.044721    {'n_neighbors': 3}
3   0.966667    0.044721    {'n_neighbors': 4}
4   0.966667    0.044721    {'n_neighbors': 5}
5   0.966667    0.044721    {'n_neighbors': 6}
6   0.966667    0.044721    {'n_neighbors': 7}
7   0.966667    0.044721    {'n_neighbors': 8}
8   0.973333    0.032660    {'n_neighbors': 9}
9   0.966667    0.044721    {'n_neighbors': 10}

When using cross_val_score, we tried eyeballing the accuracy scores to identify the best hyperparameters and to make it easier, we plotted the value of hyperparameters vs the respective cross-validated accuracy scores!

Sounds good but doesn’t seem to be a great option though!

Once we’ve completed the grid search, the following attributes can be very useful! We can choose to examine:
☑ the best_score_ , the highest cross-validated accuracy score
☑ the best_params_, the optimal value for the hyperparameters, and
☑ the best_estimator_, which is the best model that has the best hyperparameter.

Let's now examine these for our example.
Examine the best score and best hyperparameters

# examine the best model
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

# Output
0.9800000000000001

{'n_neighbors': 13}

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=13, p=2,
                     weights='uniform')

K=13 has been chosen, remember, K=13 was one of the values of K that gave highest cross-validated accuracy score.✔
So far so good!

Searching for Multiple Hyperparameters

In this example, the only hyperparameter that we searched for was n_neighbors.

What if there were many such hyperparameters?
We may think, “Why not tune each hyperparameter independently?”

Well, we may independently search for the optimal values for each of the hyperparameters; but the model may perform best at some values of the parameters that are very different from the individual best values.

So, we have to search for the combination of the parameters that optimizes performance rather than the individual best parameters.

Let's build on the same example of KNNClassifier.

In addition to n_neighbors, let's search for the optimal weighting strategy as well.

  • The default weighting option is ‘uniform’ where all points are weighted equally and ‘distance’ option weights points by the inverse of their distance.

In this case, closer neighbors of a query point will have a greater influence than neighbors which are far away.

Define the Parameter Grid

# define the parameter values that should be searched
k_range = list(range(1, 31))
weight_options = ['uniform', 'distance']
# create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(n_neighbors=k_range, weights=weight_options)
print(param_grid)
# param_grid
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],

'weights': ['uniform', 'distance']}

Let us instantiate and fit the grid and view results, as before.

# instantiate and fit the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False) 
grid.fit(X, y)
# view the results
pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
# Results
mean_test_score std_test_score  params
0   0.960000    0.053333    {'n_neighbors': 1, 'weights': 'uniform'}
1   0.960000    0.053333    {'n_neighbors': 1, 'weights': 'distance'}
2   0.953333    0.052068    {'n_neighbors': 2, 'weights': 'uniform'}
3   0.960000    0.053333    {'n_neighbors': 2, 'weights': 'distance'}
4   0.966667    0.044721    {'n_neighbors': 3, 'weights': 'uniform'}
5   0.966667    0.044721    {'n_neighbors': 3, 'weights': 'distance'}
6   0.966667    0.044721    {'n_neighbors': 4, 'weights': 'uniform'}
7   0.966667    0.044721    {'n_neighbors': 4, 'weights': 'distance'}
8   0.966667    0.044721    {'n_neighbors': 5, 'weights': 'uniform'}
9   0.966667    0.044721    {'n_neighbors': 5, 'weights': 'distance'}
10  0.966667    0.044721    {'n_neighbors': 6, 'weights': 'uniform'}
11  0.966667    0.044721    {'n_neighbors': 6, 'weights': 'distance'}
12  0.966667    0.044721    {'n_neighbors': 7, 'weights': 'uniform'}
13  0.966667    0.044721    {'n_neighbors': 7, 'weights': 'distance'}
14  0.966667    0.044721    {'n_neighbors': 8, 'weights': 'uniform'}
15  0.966667    0.044721    {'n_neighbors': 8, 'weights': 'distance'}
16  0.973333    0.032660    {'n_neighbors': 9, 'weights': 'uniform'}
17  0.973333    0.032660    {'n_neighbors': 9, 'weights': 'distance'}
18  0.966667    0.044721    {'n_neighbors': 10, 'weights': 'uniform'}

We see that we have 30*2=60 models.(As we had 30 possible values for n_neighbors and 2 possible values for weights)
As we chose 10-fold cross-validation, there will be 60*10=600 predictions made!

Time to look at our model’s best score and parameters that yielded the best score.
Examine the best score and best hyperparameters

# examine the best model
print(grid.best_score_)
print(grid.best_params_)
# best score and best parameters
0.9800000000000001
{'n_neighbors': 13, 'weights': 'uniform'}

We obtain the same best cross-validated accuracy score of 0.98, with n_neighbors=13 and weights= ‘uniform’.

  • Now, let's suppose we have to tune 4 hyperparameters and we have a list of 10 possible values for each of the hyperparameters.

  • This process creates 10*10*10*10 =10,000 models and when we run 10 fold cross-validation, there are 100,000 predictions made.

  • Clearly, things do scale up very quickly and can soon become computationally infeasible.

How to Search for Hyperparameters More Efficiently Using Randomized Search

We shall now try to rephrase the limitations of Grid Search better, in a more formal way.

  • Say we have to search for M parameters; Let p_1, p_2,p_3, …, p_M be the M parameters.
  • Let the number of values that we would like to search over for p_1 be n1, for p_2 be n2, and so on, with nM values for p_M.

Grid Search considers all possible hyperparameter settings (combinations) into account and creates a model for each possible setting to choose the best model with optimal hyperparameters.

  • To understand it better, assume that out of M parameters, we decide to freeze the values of all hyperparameters except one, say the M_th parameter p_M. So, Grid Search involves searching through the list of nM values for the M_th hyperparameter; And nM models are created.
  • Suppose we now freeze the values of all hyperparameters except two, say the last two (p_M and p_(M-1)). We now have to search through all possible combinations of p_M and p_(M-1), each having nM and n_(M-1) possible values that we could search over.
  • We now take a step back and freeze the value of p_M-1 and search through all values for p_M; To account for all possible combinations, we should repeat the procedure for all n_M-1 values for p_M-1. So, this process would leave us with n_(M-1) * nM models.

Hope it’s clear how the complexity scales with increase in the number of values each hyperparameter could take.

  • For the above example with M hyperparameters, we would have n1*n2*n3*…*n_M models. This is why we said that things could scale up quickly and become computationally intractable with Grid Search.

With this motivation to make hyperparameter search computationally more efficient, let us proceed to understand Randomized Search.

Understanding RandomizedSearchCV

In contrast to GridSearchCV, not all parameter values are tried out in RandomizedSearchCV, but rather a fixed number of parameter settings is sampled from the specified distributions/ list of parameters.

If some of the hyperparameters that we’re searching for are continuous, then we should specify the distribution rather than the list of values, while defining the parameter grid. How do we define the fixed number of parameter settings?

The number of parameter settings that are tried is given by n_iter. There's a quality vs computational cost trade-off in picking n_iter.

  • A very small value of n_iter would imply that we’re more likely to find a suboptimal solution, because we are actually considering too few combinations.

  • A very high value of n_iter would mean we can ideally get closer to finding the best hyperparameters that yield the best model, but this again comes with a high computation cost as before.

  • In fact, if we set n_iter= n1*n2*n3*…*n_M from the previous example, then, we’re essentially considering all possible hyperparameter combinations and now Randomized Search and Grid Search are equivalent.

Let us build on the same example of KNNClassifier from the previous section.

Let us search for the optimal weighting strategy and n_neighbors. And now, let us implement Randomized Search in scikit-learn and do the following steps, as we did for Grid Search.
Import RandomizedSearchCV class

from sklearn.model_selection import RandomizedSearchCV

Define the parameter grid

# specify "parameter distributions" rather than a "parameter grid"
param_dist = dict(n_neighbors=k_range, weights=weight_options)

Instantiate the grid; Set n_iter=10, Fit the grid & View the results

# n_iter controls the number of searches
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5, return_train_score=False)
rand.fit(X, y)
pd.DataFrame(rand.cv_results_)[['mean_test_score', 'std_test_score', 'params']]

#DataFrame

mean_test_score std_test_score  params
0   0.973333    0.032660    {'weights': 'distance', 'n_neighbors': 16}
1   0.966667    0.033333    {'weights': 'uniform', 'n_neighbors': 22}
2   0.980000    0.030551    {'weights': 'uniform', 'n_neighbors': 18}
3   0.966667    0.044721    {'weights': 'uniform', 'n_neighbors': 27}
4   0.953333    0.042687    {'weights': 'uniform', 'n_neighbors': 29}
5   0.973333    0.032660    {'weights': 'distance', 'n_neighbors': 10}
6   0.966667    0.044721    {'weights': 'distance', 'n_neighbors': 22}
7   0.973333    0.044222    {'weights': 'uniform', 'n_neighbors': 14}
8   0.973333    0.044222    {'weights': 'distance', 'n_neighbors': 12}
9   0.973333    0.032660    {'weights': 'uniform', 'n_neighbors': 15}

Examine the best score and best hyperparameters

# examine the best model
print(rand.best_score_)
print(rand.best_params_)

# Output
0.9800000000000001
{'weights': 'uniform', 'n_neighbors': 18}

Parameters of the best model

  • Surprisingly, we see that the highest accuracy score obtained in this case, where we only looked at 10 different parameter settings instead of 60 in Grid Search, is the same as before: 0.98 ✔
  • And the value for n_neighbors= 18, which is also one of the optimal values that we got when we initially searched for the optimal value of n_neighbors.

Maybe we just got lucky?
What is the guarantee that we will always get the best results?
Ah, this question makes perfect sense, doesn’t it?

Let us do the following now: Let us run RandomizedSearchCV for multiple times and see how many times we really end up getting lucky!

➡️Run RandomizedSearchCV 20 times and see what happens; We log the best_score_ for every run.

# run RandomizedSearchCV 20 times (with n_iter=10) and record the best score
best_scores = []
for _ in range(20):
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, return_train_score=False)
rand.fit(X, y)
best_scores.append(round(rand.best_score_, 3))
Let us examine all the 20 best scores now.
print(best_scores)
# Output: Best Scores
[0.973, 0.98, 0.98, 0.98, 0.973, 0.98, 0.98, 0.973, 0.98, 0.973, 0.973, 0.98, 0.98, 0.98, 0.98, 0.973, 0.98, 0.98, 0.98, 0.973]

Upon examining the best scores above for all the 20 runs, we see that we get the best accuracy score of 0.98 about 13 times.

Looks like we’re lucky indeed! What about the other 7 times when we didn't quite get the best accuracy score? These accuracy scores are around 0.973 which is pretty close to 0.98.

This observation convinces us that even though Randomized Search may not always give the hyperparameters of the best performing model, the models obtained by using these hyperparameters do not perform much worse compared to the best model obtained from Grid Search.

This means, the best models thus obtained, with the hyperparameters from randomized search are clearly very close to the optimal model.

In essence, these may not be the best hyperparameters, but certainly close to the best hyperparameters, except that these are found under resource-constrained settings.

Hope you all understood how we could use Randomized Search for hyperparameter tuning.

Hope you all enjoyed reading this post.😀

Happy Learning! Until next time!✨

References

Cover Image: Photo by Ross Sneddon on Unsplash

Note: This post is a comprehensive post for understanding model evaluation, and hyperparameter search. This is a compiled, and improved version of a few of my earlier shorter posts.

25