30
Cross-Validation and Hyperparameter Search in scikit-learn - A Complete Guide
This post is inspired by Kevin Markham's course Introduction to Machine Learning with scikit-learn on Data School. If you're a beginner looking to get started with Machine Learning using scikit-learn, I would highly recommend this course to gain all the required foundational skills.
In just a few hours, you'd be able to understand and build basic regression, and classification models with the optimal hyperparameters.
šIf this sounds interesting, please be sure to check the course at this link.
That said, we shall cover the following in this post.
When evaluating a machine learning model, training and testing on the same dataset is not a great idea. Why? Let us draw a relatable analogy.
Ever since school days, weāve been giving exams, and how are our exams designed?
Hereās the answer to the question āWhy can we not evaluate a model on the same data that it was trained on?ā
It is because this process inherently encourages the model to memorize the training data. So, the model performs extremely well on the training data.
However, it generalizes rather poorly to data that it has never seen before.
As a model's performance on data that it has never seen before is a more reliable estimate of its performance, we usually validate the model by checking how it performs on out-of-sample data, that is, on data that it has never seen before.
If you remember, it is for this reason, we use the
train_test_split
method, in our very friendly and nifty library, scikit-learn.This
train_test_split
splits the available data into two sets: the train and test sets in certain proportions.For example, train on 70% of the available data and test on the remaining 30% of the data. This way, we can ensure that every record in the dataset can either be in the training set or the test set but not both!
And in doing so, we are making sure that we test the modelās performance on unseen data.
But, is this good enough? Or do we take this with a pinch of salt?
Let us train a simple
KNeighborsClassifier
in scikit-learn on the iris dataset.ā¶ Necessary imports
As shown below, we import the necessary modules.
As shown below, we import the necessary modules.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
ā¶ Load the data
Let us load the iris data and separate out the features(Sepal Length, Sepal Width, Petal Length and Petal Width) and the target variables which are the class labels indicating the iris type. (Setosa, Versicolour, and Virginica)
Let us load the iris data and separate out the features(Sepal Length, Sepal Width, Petal Length and Petal Width) and the target variables which are the class labels indicating the iris type. (Setosa, Versicolour, and Virginica)
# read in the iris data
iris = load_iris()
# create X (features) and y (response)
X = iris.data
y = iris.target
ā¶ Create train_test_split
Letās create the train and test sets with
Letās create the train and test sets with
random_state= 4
. Setting the random_state
ensures reproducibility.In this case, it ensures that the records that go into the train and test sets stay the same every time our code is run.
# use train/test split with different random_state values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 4)
ā¶ Check Classification Accuracy
We now instantiate the
We now instantiate the
KNeighborsClassifier
with n_neighbors=9
and fit the classifier on the training set and predict on the test set.knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
# Output
0.9736842105263158
Now, let us change the
random_state
to a different value. What do you think the accuracy score would be?Well, please feel free to insert your favorite number in the
random_state
and check for yourselves. It'd be a different accuracy score this time.Setting the
random_state
to another value, we would get another value for the accuracy score. The evaluation metric thus obtained is therefore susceptible to high variance.
One very natural thing to do would be to create multiple train/test splits, calculate the accuracy for each such split, and compute the average of all the accuracy scores thus obtained. This definitely seems like a better estimate of the accuracy, doesnāt it?
This is precisely the essence of cross-validation, which we shall see in the subsequent section.
Steps in K-fold cross-validation
Let us try to visualize this by splitting a dataset of 25 observations into 5 equal folds as shown below.
The dataset contains 20 observations (numbered 0 through 24).
5-fold cross-validation runs for 5 iterations.
5-fold cross-validation runs for 5 iterations.
Let's see how the folds are created
# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle = False).split(range(25))
# print the contents of each training and testing set
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, data in enumerate(kf, start=1):
print('{:^9} {} {:^25}'.format(iteration, data[0], str(data[1])))
# Output
Iteration Training set observations Testing set observations
1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
We observe the following:
This process uses data more efficiently as every observation is used for both training and testing.
It is recommended to use stratified sampling for creating the folds, as this ensures that all class labels are represented in equal proportions in each fold. And, scikit-learnās
cross_val_score
does this by default.In practice, we can even do the following:
For the KNN classifier on the iris dataset, can we possibly use cross-validation to find the optimal value for
K
? That is, to search for the optimal value of n_neighbors
?Remember,
K
in KNN classifier is the number of neighbors (n_neighbors
) that we take into account for predicting the class label of the test sample. Not to be confused with the K
in K-fold cross-validation.from sklearn.model_selection import cross_val_score
# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)
# Output
[1. 0.93333333 1. 1. 0.86666667 0.93333333
0.93333333 1. 1. 1. ]
# use average accuracy as an estimate of out-of-sample accuracy
print(scores.mean())
# Output
0.9666666666666668
Now, we shall run the K fold cross-validation for the models with different values of
n_neighbors
, as shown below.# search for an optimal value of K for KNN
k_range = list(range(1, 31))
k_scores = []
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
k_scores.append(scores.mean())
print(k_scores)
# Output k_scores
[0.96, 0.9533333333333334, 0.9666666666666666, 0.9666666666666666, 0.9666666666666668, 0.9666666666666668, 0.9666666666666668, 0.9666666666666668, 0.9733333333333334, 0.9666666666666668, 0.9666666666666668, 0.9733333333333334, 0.9800000000000001, 0.9733333333333334, 0.9733333333333334, 0.9733333333333334, 0.9733333333333334, 0.9800000000000001, 0.9733333333333334, 0.9800000000000001, 0.9666666666666666, 0.9666666666666666, 0.9733333333333334, 0.96, 0.9666666666666666, 0.96, 0.9666666666666666, 0.9533333333333334, 0.9533333333333334, 0.9533333333333334]
Let us plot the values to get a better idea.
import matplotlib.pyplot as plt
%matplotlib inline
# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')

We see that
n_neighbors (K)
values from 13 to 20 yield higher accuracy, especially K
=13,18 and 20.As a larger value of K
yields a less complex model, we choose K
=20.This process of searching for the optimal values of hyperparameters is called hyperparameter tuning.
In this example, we chose the values of
K
that resulted in higher mean accuracy score under 10-fold cross validation. This is how cross-validation can be used to search for the best hyperparameters and this can be done much more efficiently in scikit-learn.
In KNN classifiers, setting a very small value for K
will make the model needlessly complex, and a very large value of K
would result in a model with high bias that yields suboptimal performance.
As
K
=13,18 and 20 gave the highest accuracy score, close to 0.98, we decided to choose K
=20 as a larger value of K
would yield a less complex model.While it is not particularly difficult to write the for loop, we do realize that we may have to do it often.
Therefore, it would be good to have a more convenient way of doing the hyperparameter search, without having to write a loop every time and identify the best parameter through inspection.
Grid Search class in scikit-learn helps us do this all the more easily. Let's learn about it in the next section.
ā¶ Import GridSearchCV Class
from sklearn.model_selection import GridSearchCV
ā¶ Define the Parameter Grid
We now define the parameter grid (
param_grid
), a Python dictionary, whose key is the name of the hyperparameter whose best value weāre trying to find and the value is the list of possible values that we would like to search over for the hyperparameter.# define the parameter values that should be searched
k_range = list(range(1, 31))
# create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(n_neighbors=k_range)
print(param_grid)
# param_grid
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]}
We now instantiate
GridSearchCV
. Note that we specify the param_grid
instead of the n_neighbors
argument that we had specified for cross_val_score
earlier. Why is this valid?
Remember, the parameter grid,
param_grid
is a dictionary whose key is n_neighbors
and the value is a list of possible values of n_neighbors
. Therefore, specifying the param_grid
ensures that the value at index i
is fetched as the value of n_neighbors
in the i_th run.ā¶ Instantiate, fit the grid and view the results
# instantiate the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False)
We now go ahead and fit the grid with data, and access the
cv_results_
attribute to get the mean accuracy score after 10-fold cross-validation, standard deviation and the parameter values. For convenience, we may store the results in a pandas DataFrame. The mean and standard deviation of the accuracy scores for
n_neighbors
=1 to 10 are shown below.# fit the grid with data
grid.fit(X, y)
# view the results as a pandas DataFrame
import pandas as pd
pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
# Output
mean_test_score std_test_score params
0 0.960000 0.053333 {'n_neighbors': 1}
1 0.953333 0.052068 {'n_neighbors': 2}
2 0.966667 0.044721 {'n_neighbors': 3}
3 0.966667 0.044721 {'n_neighbors': 4}
4 0.966667 0.044721 {'n_neighbors': 5}
5 0.966667 0.044721 {'n_neighbors': 6}
6 0.966667 0.044721 {'n_neighbors': 7}
7 0.966667 0.044721 {'n_neighbors': 8}
8 0.973333 0.032660 {'n_neighbors': 9}
9 0.966667 0.044721 {'n_neighbors': 10}
When using
cross_val_score
, we tried eyeballing the accuracy scores to identify the best hyperparameters and to make it easier, we plotted the value of hyperparameters vs the respective cross-validated accuracy scores! Sounds good but doesnāt seem to be a great option though!
Once weāve completed the grid search, the following attributes can be very useful! We can choose to examine:
ā the
ā the
ā the
ā the
best_score_
, the highest cross-validated accuracy scoreā the
best_params_
, the optimal value for the hyperparameters, andā the
best_estimator_
, which is the best model that has the best hyperparameter.Let's now examine these for our example.
ā¶ Examine the best score and best hyperparameters
ā¶ Examine the best score and best hyperparameters
# examine the best model
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
# Output
0.9800000000000001
{'n_neighbors': 13}
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=13, p=2,
weights='uniform')
K
=13 has been chosen, remember, K
=13 was one of the values of K
that gave highest cross-validated accuracy score.āSo far so good!
In this example, the only hyperparameter that we searched for was
n_neighbors
.What if there were many such hyperparameters?
We may think, āWhy not tune each hyperparameter independently?ā
Well, we may independently search for the optimal values for each of the hyperparameters; but the model may perform best at some values of the parameters that are very different from the individual best values.
So, we have to search for the combination of the parameters that optimizes performance rather than the individual best parameters.
Let's build on the same example of KNNClassifier.
In addition to
n_neighbors
, let's search for the optimal weighting strategy as well.āuniformā
where all points are weighted equally and ādistanceā
option weights points by the inverse of their distance.In this case, closer neighbors of a query point will have a greater influence than neighbors which are far away.
ā¶ Define the Parameter Grid
# define the parameter values that should be searched
k_range = list(range(1, 31))
weight_options = ['uniform', 'distance']
# create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(n_neighbors=k_range, weights=weight_options)
print(param_grid)
# param_grid
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
'weights': ['uniform', 'distance']}
Let us instantiate and fit the grid and view results, as before.
# instantiate and fit the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False)
grid.fit(X, y)
# view the results
pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
# Results
mean_test_score std_test_score params
0 0.960000 0.053333 {'n_neighbors': 1, 'weights': 'uniform'}
1 0.960000 0.053333 {'n_neighbors': 1, 'weights': 'distance'}
2 0.953333 0.052068 {'n_neighbors': 2, 'weights': 'uniform'}
3 0.960000 0.053333 {'n_neighbors': 2, 'weights': 'distance'}
4 0.966667 0.044721 {'n_neighbors': 3, 'weights': 'uniform'}
5 0.966667 0.044721 {'n_neighbors': 3, 'weights': 'distance'}
6 0.966667 0.044721 {'n_neighbors': 4, 'weights': 'uniform'}
7 0.966667 0.044721 {'n_neighbors': 4, 'weights': 'distance'}
8 0.966667 0.044721 {'n_neighbors': 5, 'weights': 'uniform'}
9 0.966667 0.044721 {'n_neighbors': 5, 'weights': 'distance'}
10 0.966667 0.044721 {'n_neighbors': 6, 'weights': 'uniform'}
11 0.966667 0.044721 {'n_neighbors': 6, 'weights': 'distance'}
12 0.966667 0.044721 {'n_neighbors': 7, 'weights': 'uniform'}
13 0.966667 0.044721 {'n_neighbors': 7, 'weights': 'distance'}
14 0.966667 0.044721 {'n_neighbors': 8, 'weights': 'uniform'}
15 0.966667 0.044721 {'n_neighbors': 8, 'weights': 'distance'}
16 0.973333 0.032660 {'n_neighbors': 9, 'weights': 'uniform'}
17 0.973333 0.032660 {'n_neighbors': 9, 'weights': 'distance'}
18 0.966667 0.044721 {'n_neighbors': 10, 'weights': 'uniform'}
We see that we have 30*2=60 models.(As we had 30 possible values for
As we chose 10-fold cross-validation, there will be 60*10=600 predictions made!
n_neighbors
and 2 possible values for weights)As we chose 10-fold cross-validation, there will be 60*10=600 predictions made!
Time to look at our modelās best score and parameters that yielded the best score.
ā¶ Examine the best score and best hyperparameters
ā¶ Examine the best score and best hyperparameters
# examine the best model
print(grid.best_score_)
print(grid.best_params_)
# best score and best parameters
0.9800000000000001
{'n_neighbors': 13, 'weights': 'uniform'}
We obtain the same best cross-validated accuracy score of 0.98, with
n_neighbors
=13 and weights
= āuniformā.Now, let's suppose we have to tune 4 hyperparameters and we have a list of 10 possible values for each of the hyperparameters.
This process creates 10*10*10*10 =10,000 models and when we run 10 fold cross-validation, there are 100,000 predictions made.
Clearly, things do scale up very quickly and can soon become computationally infeasible.
We shall now try to rephrase the limitations of Grid Search better, in a more formal way.
M
parameters; Let p_1, p_2,p_3, ā¦, p_M
be the M
parameters. p_1
be n1
, for p_2
be n2
, and so on, with nM
values for p_M
.Grid Search considers all possible hyperparameter settings (combinations) into account and creates a model for each possible setting to choose the best model with optimal hyperparameters.
M
parameters, we decide to freeze the values of all hyperparameters except one, say the M_th parameter p_M
. So, Grid Search involves searching through the list of nM
values for the M_th hyperparameter; And nM
models are created.p_M
and p_(M-1)
). We now have to search through all possible combinations of p_M
and p_(M-1)
, each having nM
and n_(M-1)
possible values that we could search over.p_M-1
and search through all values for p_M
; To account for all possible combinations, we should repeat the procedure for all n_M-1
values for p_M-1
. So, this process would leave us with n_(M-1) * nM
models.Hope itās clear how the complexity scales with increase in the number of values each hyperparameter could take.
M
hyperparameters, we would have n1*n2*n3*ā¦*n_M
models. This is why we said that things could scale up quickly and become computationally intractable with Grid Search.With this motivation to make hyperparameter search computationally more efficient, let us proceed to understand Randomized Search.
In contrast to
GridSearchCV
, not all parameter values are tried out in RandomizedSearchCV
, but rather a fixed number of parameter settings is sampled from the specified distributions/ list of parameters.If some of the hyperparameters that weāre searching for are continuous, then we should specify the distribution rather than the list of values, while defining the parameter grid. How do we define the fixed number of parameter settings?
The number of parameter settings that are tried is given by
n_iter
. There's a quality vs computational cost trade-off in picking n_iter
.A very small value of n_iter
would imply that weāre more likely to find a suboptimal solution, because we are actually considering too few combinations.
A very high value of n_iter would mean we can ideally get closer to finding the best hyperparameters that yield the best model, but this again comes with a high computation cost as before.
In fact, if we set n_iter= n1*n2*n3*ā¦*n_M
from the previous example, then, weāre essentially considering all possible hyperparameter combinations and now Randomized Search and Grid Search are equivalent.
Let us build on the same example of KNNClassifier from the previous section.
Let us search for the optimal weighting strategy and
ā¶ Import RandomizedSearchCV class
n_neighbors
. And now, let us implement Randomized Search in scikit-learn and do the following steps, as we did for Grid Search.ā¶ Import RandomizedSearchCV class
from sklearn.model_selection import RandomizedSearchCV
ā¶ Define the parameter grid
# specify "parameter distributions" rather than a "parameter grid"
param_dist = dict(n_neighbors=k_range, weights=weight_options)
ā¶ Instantiate the grid; Set n_iter=10, Fit the grid & View the results
# n_iter controls the number of searches
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5, return_train_score=False)
rand.fit(X, y)
pd.DataFrame(rand.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
#DataFrame
mean_test_score std_test_score params
0 0.973333 0.032660 {'weights': 'distance', 'n_neighbors': 16}
1 0.966667 0.033333 {'weights': 'uniform', 'n_neighbors': 22}
2 0.980000 0.030551 {'weights': 'uniform', 'n_neighbors': 18}
3 0.966667 0.044721 {'weights': 'uniform', 'n_neighbors': 27}
4 0.953333 0.042687 {'weights': 'uniform', 'n_neighbors': 29}
5 0.973333 0.032660 {'weights': 'distance', 'n_neighbors': 10}
6 0.966667 0.044721 {'weights': 'distance', 'n_neighbors': 22}
7 0.973333 0.044222 {'weights': 'uniform', 'n_neighbors': 14}
8 0.973333 0.044222 {'weights': 'distance', 'n_neighbors': 12}
9 0.973333 0.032660 {'weights': 'uniform', 'n_neighbors': 15}
ā¶ Examine the best score and best hyperparameters
# examine the best model
print(rand.best_score_)
print(rand.best_params_)
# Output
0.9800000000000001
{'weights': 'uniform', 'n_neighbors': 18}
ā¶ Parameters of the best model
n_neighbors= 18
, which is also one of the optimal values that we got when we initially searched for the optimal value of n_neighbors
. Maybe we just got lucky?
What is the guarantee that we will always get the best results?
Ah, this question makes perfect sense, doesnāt it?
Let us do the following now: Let us run
RandomizedSearchCV
for multiple times and see how many times we really end up getting lucky!ā”ļøRun RandomizedSearchCV 20 times and see what happens; We log the best_score_ for every run.
# run RandomizedSearchCV 20 times (with n_iter=10) and record the best score
best_scores = []
for _ in range(20):
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, return_train_score=False)
rand.fit(X, y)
best_scores.append(round(rand.best_score_, 3))
Let us examine all the 20 best scores now.
print(best_scores)
# Output: Best Scores
[0.973, 0.98, 0.98, 0.98, 0.973, 0.98, 0.98, 0.973, 0.98, 0.973, 0.973, 0.98, 0.98, 0.98, 0.98, 0.973, 0.98, 0.98, 0.98, 0.973]
Upon examining the best scores above for all the 20 runs, we see that we get the best accuracy score of 0.98 about 13 times.
Looks like weāre lucky indeed! What about the other 7 times when we didn't quite get the best accuracy score? These accuracy scores are around 0.973 which is pretty close to 0.98.
This observation convinces us that even though Randomized Search may not always give the hyperparameters of the best performing model, the models obtained by using these hyperparameters do not perform much worse compared to the best model obtained from Grid Search.
This means, the best models thus obtained, with the hyperparameters from randomized search are clearly very close to the optimal model.
In essence, these may not be the best hyperparameters, but certainly close to the best hyperparameters, except that these are found under resource-constrained settings.
Hope you all understood how we could use Randomized Search for hyperparameter tuning.
Hope you all enjoyed reading this post.š
Happy Learning! Until next time!āØ
[1]https://courses.dataschool.io/courses/introduction-to-machine-learning-with-scikit-learn
[2]http://scikitlearn.org/stable/modules/cross_validation.html
[3]http://scikitlearn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
[4]http://scikitlearn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
[2]http://scikitlearn.org/stable/modules/cross_validation.html
[3]http://scikitlearn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
[4]http://scikitlearn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
Cover Image: Photo by Ross Sneddon on Unsplash
Note: This post is a comprehensive post for understanding model evaluation, and hyperparameter search. This is a compiled, and improved version of a few of my earlier shorter posts.
30