25
Cross-Validation and Hyperparameter Search in scikit-learn - A Complete Guide
This post is inspired by Kevin Markham's course Introduction to Machine Learning with scikit-learn on Data School. If you're a beginner looking to get started with Machine Learning using scikit-learn, I would highly recommend this course to gain all the required foundational skills.
In just a few hours, you'd be able to understand and build basic regression, and classification models with the optimal hyperparameters.
đIf this sounds interesting, please be sure to check the course at this link.
That said, we shall cover the following in this post.
When evaluating a machine learning model, training and testing on the same dataset is not a great idea. Why? Let us draw a relatable analogy.
Ever since school days, weâve been giving exams, and how are our exams designed?
- Well, theyâve been designed so as to test our understanding of the concepts rather than our ability to memorize!
- The same analogy can be transposed to our machine learning model as well!
Hereâs the answer to the question âWhy can we not evaluate a model on the same data that it was trained on?â
It is because this process inherently encourages the model to memorize the training data. So, the model performs extremely well on the training data.
However, it generalizes rather poorly to data that it has never seen before.
As a model's performance on data that it has never seen before is a more reliable estimate of its performance, we usually validate the model by checking how it performs on out-of-sample data, that is, on data that it has never seen before.
If you remember, it is for this reason, we use the train_test_split
method, in our very friendly and nifty library, scikit-learn.
This train_test_split
splits the available data into two sets: the train and test sets in certain proportions.
For example, train on 70% of the available data and test on the remaining 30% of the data. This way, we can ensure that every record in the dataset can either be in the training set or the test set but not both!
And in doing so, we are making sure that we test the modelâs performance on unseen data.
But, is this good enough? Or do we take this with a pinch of salt?
Let us train a simple KNeighborsClassifier
in scikit-learn on the iris dataset.
ⶠNecessary imports
As shown below, we import the necessary modules.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
ⶠLoad the data
Let us load the iris data and separate out the features(Sepal Length, Sepal Width, Petal Length and Petal Width) and the target variables which are the class labels indicating the iris type. (Setosa, Versicolour, and Virginica)
# read in the iris data
iris = load_iris()
# create X (features) and y (response)
X = iris.data
y = iris.target
ⶠCreate train_test_split
Letâs create the train and test sets with random_state= 4
. Setting the random_state
ensures reproducibility.
In this case, it ensures that the records that go into the train and test sets stay the same every time our code is run.
# use train/test split with different random_state values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 4)
ⶠCheck Classification Accuracy
We now instantiate the KNeighborsClassifier
with n_neighbors=9
and fit the classifier on the training set and predict on the test set.
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))
# Output
0.9736842105263158
Now, let us change the random_state
to a different value. What do you think the accuracy score would be?
Well, please feel free to insert your favorite number in the random_state
and check for yourselves. It'd be a different accuracy score this time.
Setting the random_state
to another value, we would get another value for the accuracy score.
The evaluation metric thus obtained is therefore susceptible to high variance.
- The accuracy score may depend heavily on which data points end up in the training set and which end up in the test set.
- Thus the accuracy score may be significantly different depending on how the division is made. Clearly, this doesnât seem like the best way to validate our modelâs performance!
One very natural thing to do would be to create multiple train/test splits, calculate the accuracy for each such split, and compute the average of all the accuracy scores thus obtained. This definitely seems like a better estimate of the accuracy, doesnât it?
This is precisely the essence of cross-validation, which we shall see in the subsequent section.
Steps in K-fold cross-validation
- Split the dataset into K equal partitions (or âfoldsâ).
- Use fold 1 for testing and the union of the other folds as the training set.
- Calculate accuracy on the test set.
- Repeat steps 2 and 3 K times, using a different fold for testing each time.
- Use the average accuracy on different test sets as the estimate of out-of-sample accuracy.
Let us try to visualize this by splitting a dataset of 25 observations into 5 equal folds as shown below.
The dataset contains 20 observations (numbered 0 through 24).
5-fold cross-validation runs for 5 iterations.
Let's see how the folds are created
# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle = False).split(range(25))
# print the contents of each training and testing set
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, data in enumerate(kf, start=1):
print('{:^9} {} {:^25}'.format(iteration, data[0], str(data[1])))
# Output
Iteration Training set observations Testing set observations
1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
We observe the following:
- For each iteration, every observation is either in the training set or the testing set, but not both.
- Every observation is in the test set exactly once.
- Each fold is used as the test set exactly once and in the training set (K-1) times.
- The average accuracy thus obtained is a more accurate estimate of out-of-sample accuracy.
This process uses data more efficiently as every observation is used for both training and testing.
It is recommended to use stratified sampling for creating the folds, as this ensures that all class labels are represented in equal proportions in each fold. And, scikit-learnâs cross_val_score
does this by default.
In practice, we can even do the following:
- âHold outâ a portion of the data before beginning the model building process.
- Find the best model using cross-validation on the remaining data, and test it using the hold-out set.
- This gives a more reliable estimate of out-of-sample performance since the hold-out set is truly out-of-sample.
For the KNN classifier on the iris dataset, can we possibly use cross-validation to find the optimal value for K
? That is, to search for the optimal value of n_neighbors
?
Remember, K
in KNN classifier is the number of neighbors (n_neighbors
) that we take into account for predicting the class label of the test sample. Not to be confused with the K
in K-fold cross-validation.
from sklearn.model_selection import cross_val_score
# 10-fold cross-validation with K=5 for KNN (the n_neighbors parameter)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)
# Output
[1. 0.93333333 1. 1. 0.86666667 0.93333333
0.93333333 1. 1. 1. ]
# use average accuracy as an estimate of out-of-sample accuracy
print(scores.mean())
# Output
0.9666666666666668
Now, we shall run the K fold cross-validation for the models with different values of n_neighbors
, as shown below.
# search for an optimal value of K for KNN
k_range = list(range(1, 31))
k_scores = []
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
k_scores.append(scores.mean())
print(k_scores)
# Output k_scores
[0.96, 0.9533333333333334, 0.9666666666666666, 0.9666666666666666, 0.9666666666666668, 0.9666666666666668, 0.9666666666666668, 0.9666666666666668, 0.9733333333333334, 0.9666666666666668, 0.9666666666666668, 0.9733333333333334, 0.9800000000000001, 0.9733333333333334, 0.9733333333333334, 0.9733333333333334, 0.9733333333333334, 0.9800000000000001, 0.9733333333333334, 0.9800000000000001, 0.9666666666666666, 0.9666666666666666, 0.9733333333333334, 0.96, 0.9666666666666666, 0.96, 0.9666666666666666, 0.9533333333333334, 0.9533333333333334, 0.9533333333333334]
Let us plot the values to get a better idea.
import matplotlib.pyplot as plt
%matplotlib inline
# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
We see that n_neighbors (K)
values from 13 to 20 yield higher accuracy, especially K
=13,18 and 20.As a larger value of K
yields a less complex model, we choose K
=20.
This process of searching for the optimal values of hyperparameters is called hyperparameter tuning.
In this example, we chose the values of K
that resulted in higher mean accuracy score under 10-fold cross validation.
This is how cross-validation can be used to search for the best hyperparameters and this can be done much more efficiently in scikit-learn.
In KNN classifiers, setting a very small value for
K
will make the model needlessly complex, and a very large value ofK
would result in a model with high bias that yields suboptimal performance.
As K
=13,18 and 20 gave the highest accuracy score, close to 0.98, we decided to choose K
=20 as a larger value of K
would yield a less complex model.
While it is not particularly difficult to write the for loop, we do realize that we may have to do it often.
Therefore, it would be good to have a more convenient way of doing the hyperparameter search, without having to write a loop every time and identify the best parameter through inspection.
Grid Search class in scikit-learn helps us do this all the more easily. Let's learn about it in the next section.
ⶠImport GridSearchCV Class
from sklearn.model_selection import GridSearchCV
ⶠDefine the Parameter Grid
We now define the parameter grid (param_grid
), a Python dictionary, whose key is the name of the hyperparameter whose best value weâre trying to find and the value is the list of possible values that we would like to search over for the hyperparameter.
# define the parameter values that should be searched
k_range = list(range(1, 31))
# create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(n_neighbors=k_range)
print(param_grid)
# param_grid
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]}
We now instantiate GridSearchCV
. Note that we specify the param_grid
instead of the n_neighbors
argument that we had specified for cross_val_score
earlier.
Why is this valid?
Remember, the parameter grid, param_grid
is a dictionary whose key is n_neighbors
and the value is a list of possible values of n_neighbors
. Therefore, specifying the param_grid
ensures that the value at index i
is fetched as the value of n_neighbors
in the i_th run.
ⶠInstantiate, fit the grid and view the results
# instantiate the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False)
We now go ahead and fit the grid with data, and access the cv_results_
attribute to get the mean accuracy score after 10-fold cross-validation, standard deviation and the parameter values.
For convenience, we may store the results in a pandas DataFrame. The mean and standard deviation of the accuracy scores for n_neighbors
=1 to 10 are shown below.
# fit the grid with data
grid.fit(X, y)
# view the results as a pandas DataFrame
import pandas as pd
pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
# Output
mean_test_score std_test_score params
0 0.960000 0.053333 {'n_neighbors': 1}
1 0.953333 0.052068 {'n_neighbors': 2}
2 0.966667 0.044721 {'n_neighbors': 3}
3 0.966667 0.044721 {'n_neighbors': 4}
4 0.966667 0.044721 {'n_neighbors': 5}
5 0.966667 0.044721 {'n_neighbors': 6}
6 0.966667 0.044721 {'n_neighbors': 7}
7 0.966667 0.044721 {'n_neighbors': 8}
8 0.973333 0.032660 {'n_neighbors': 9}
9 0.966667 0.044721 {'n_neighbors': 10}
When using cross_val_score
, we tried eyeballing the accuracy scores to identify the best hyperparameters and to make it easier, we plotted the value of hyperparameters vs the respective cross-validated accuracy scores!
Sounds good but doesnât seem to be a great option though!
Once weâve completed the grid search, the following attributes can be very useful! We can choose to examine:
â the best_score_
, the highest cross-validated accuracy score
â the best_params_
, the optimal value for the hyperparameters, and
â the best_estimator_
, which is the best model that has the best hyperparameter.
Let's now examine these for our example.
ⶠExamine the best score and best hyperparameters
# examine the best model
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
# Output
0.9800000000000001
{'n_neighbors': 13}
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=13, p=2,
weights='uniform')
K
=13 has been chosen, remember, K
=13 was one of the values of K
that gave highest cross-validated accuracy score.â
So far so good!
In this example, the only hyperparameter that we searched for was n_neighbors
.
What if there were many such hyperparameters?
We may think, âWhy not tune each hyperparameter independently?â
Well, we may independently search for the optimal values for each of the hyperparameters; but the model may perform best at some values of the parameters that are very different from the individual best values.
So, we have to search for the combination of the parameters that optimizes performance rather than the individual best parameters.
Let's build on the same example of KNNClassifier.
In addition to n_neighbors
, let's search for the optimal weighting strategy as well.
- The default weighting option is
âuniformâ
where all points are weighted equally andâdistanceâ
option weights points by the inverse of their distance.
In this case, closer neighbors of a query point will have a greater influence than neighbors which are far away.
ⶠDefine the Parameter Grid
# define the parameter values that should be searched
k_range = list(range(1, 31))
weight_options = ['uniform', 'distance']
# create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(n_neighbors=k_range, weights=weight_options)
print(param_grid)
# param_grid
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
'weights': ['uniform', 'distance']}
Let us instantiate and fit the grid and view results, as before.
# instantiate and fit the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False)
grid.fit(X, y)
# view the results
pd.DataFrame(grid.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
# Results
mean_test_score std_test_score params
0 0.960000 0.053333 {'n_neighbors': 1, 'weights': 'uniform'}
1 0.960000 0.053333 {'n_neighbors': 1, 'weights': 'distance'}
2 0.953333 0.052068 {'n_neighbors': 2, 'weights': 'uniform'}
3 0.960000 0.053333 {'n_neighbors': 2, 'weights': 'distance'}
4 0.966667 0.044721 {'n_neighbors': 3, 'weights': 'uniform'}
5 0.966667 0.044721 {'n_neighbors': 3, 'weights': 'distance'}
6 0.966667 0.044721 {'n_neighbors': 4, 'weights': 'uniform'}
7 0.966667 0.044721 {'n_neighbors': 4, 'weights': 'distance'}
8 0.966667 0.044721 {'n_neighbors': 5, 'weights': 'uniform'}
9 0.966667 0.044721 {'n_neighbors': 5, 'weights': 'distance'}
10 0.966667 0.044721 {'n_neighbors': 6, 'weights': 'uniform'}
11 0.966667 0.044721 {'n_neighbors': 6, 'weights': 'distance'}
12 0.966667 0.044721 {'n_neighbors': 7, 'weights': 'uniform'}
13 0.966667 0.044721 {'n_neighbors': 7, 'weights': 'distance'}
14 0.966667 0.044721 {'n_neighbors': 8, 'weights': 'uniform'}
15 0.966667 0.044721 {'n_neighbors': 8, 'weights': 'distance'}
16 0.973333 0.032660 {'n_neighbors': 9, 'weights': 'uniform'}
17 0.973333 0.032660 {'n_neighbors': 9, 'weights': 'distance'}
18 0.966667 0.044721 {'n_neighbors': 10, 'weights': 'uniform'}
We see that we have 30*2=60 models.(As we had 30 possible values for n_neighbors
and 2 possible values for weights)
As we chose 10-fold cross-validation, there will be 60*10=600 predictions made!
Time to look at our modelâs best score and parameters that yielded the best score.
ⶠExamine the best score and best hyperparameters
# examine the best model
print(grid.best_score_)
print(grid.best_params_)
# best score and best parameters
0.9800000000000001
{'n_neighbors': 13, 'weights': 'uniform'}
We obtain the same best cross-validated accuracy score of 0.98, with n_neighbors
=13 and weights
= âuniformâ.
Now, let's suppose we have to tune 4 hyperparameters and we have a list of 10 possible values for each of the hyperparameters.
This process creates 10*10*10*10 =10,000 models and when we run 10 fold cross-validation, there are 100,000 predictions made.
Clearly, things do scale up very quickly and can soon become computationally infeasible.
We shall now try to rephrase the limitations of Grid Search better, in a more formal way.
- Say we have to search for
M
parameters; Letp_1, p_2,p_3, âŠ, p_M
be theM
parameters. - Let the number of values that we would like to search over for
p_1
ben1
, forp_2
ben2
, and so on, withnM
values forp_M
.
Grid Search considers all possible hyperparameter settings (combinations) into account and creates a model for each possible setting to choose the best model with optimal hyperparameters.
- To understand it better, assume that out of
M
parameters, we decide to freeze the values of all hyperparameters except one, say the M_th parameterp_M
. So, Grid Search involves searching through the list ofnM
values for the M_th hyperparameter; AndnM
models are created. - Suppose we now freeze the values of all hyperparameters except two, say the last two (
p_M
andp_(M-1)
). We now have to search through all possible combinations ofp_M
andp_(M-1)
, each havingnM
andn_(M-1)
possible values that we could search over. - We now take a step back and freeze the value of
p_M-1
and search through all values forp_M
; To account for all possible combinations, we should repeat the procedure for alln_M-1
values forp_M-1
. So, this process would leave us withn_(M-1) * nM
models.
Hope itâs clear how the complexity scales with increase in the number of values each hyperparameter could take.
- For the above example with
M
hyperparameters, we would haven1*n2*n3*âŠ*n_M
models. This is why we said that things could scale up quickly and become computationally intractable with Grid Search.
With this motivation to make hyperparameter search computationally more efficient, let us proceed to understand Randomized Search.
In contrast to GridSearchCV
, not all parameter values are tried out in RandomizedSearchCV
, but rather a fixed number of parameter settings is sampled from the specified distributions/ list of parameters.
If some of the hyperparameters that weâre searching for are continuous, then we should specify the distribution rather than the list of values, while defining the parameter grid. How do we define the fixed number of parameter settings?
The number of parameter settings that are tried is given by n_iter
. There's a quality vs computational cost trade-off in picking n_iter
.
A very small value of
n_iter
would imply that weâre more likely to find a suboptimal solution, because we are actually considering too few combinations.A very high value of n_iter would mean we can ideally get closer to finding the best hyperparameters that yield the best model, but this again comes with a high computation cost as before.
In fact, if we set
n_iter= n1*n2*n3*âŠ*n_M
from the previous example, then, weâre essentially considering all possible hyperparameter combinations and now Randomized Search and Grid Search are equivalent.
Let us build on the same example of KNNClassifier from the previous section.
Let us search for the optimal weighting strategy and n_neighbors
. And now, let us implement Randomized Search in scikit-learn and do the following steps, as we did for Grid Search.
ⶠImport RandomizedSearchCV class
from sklearn.model_selection import RandomizedSearchCV
ⶠDefine the parameter grid
# specify "parameter distributions" rather than a "parameter grid"
param_dist = dict(n_neighbors=k_range, weights=weight_options)
ⶠInstantiate the grid; Set n_iter=10, Fit the grid & View the results
# n_iter controls the number of searches
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5, return_train_score=False)
rand.fit(X, y)
pd.DataFrame(rand.cv_results_)[['mean_test_score', 'std_test_score', 'params']]
#DataFrame
mean_test_score std_test_score params
0 0.973333 0.032660 {'weights': 'distance', 'n_neighbors': 16}
1 0.966667 0.033333 {'weights': 'uniform', 'n_neighbors': 22}
2 0.980000 0.030551 {'weights': 'uniform', 'n_neighbors': 18}
3 0.966667 0.044721 {'weights': 'uniform', 'n_neighbors': 27}
4 0.953333 0.042687 {'weights': 'uniform', 'n_neighbors': 29}
5 0.973333 0.032660 {'weights': 'distance', 'n_neighbors': 10}
6 0.966667 0.044721 {'weights': 'distance', 'n_neighbors': 22}
7 0.973333 0.044222 {'weights': 'uniform', 'n_neighbors': 14}
8 0.973333 0.044222 {'weights': 'distance', 'n_neighbors': 12}
9 0.973333 0.032660 {'weights': 'uniform', 'n_neighbors': 15}
ⶠExamine the best score and best hyperparameters
# examine the best model
print(rand.best_score_)
print(rand.best_params_)
# Output
0.9800000000000001
{'weights': 'uniform', 'n_neighbors': 18}
ⶠParameters of the best model
- Surprisingly, we see that the highest accuracy score obtained in this case, where we only looked at 10 different parameter settings instead of 60 in Grid Search, is the same as before: 0.98 â
- And the value for
n_neighbors= 18
, which is also one of the optimal values that we got when we initially searched for the optimal value ofn_neighbors
.
Maybe we just got lucky?
What is the guarantee that we will always get the best results?
Ah, this question makes perfect sense, doesnât it?
Let us do the following now: Let us run RandomizedSearchCV
for multiple times and see how many times we really end up getting lucky!
âĄïžRun RandomizedSearchCV 20 times and see what happens; We log the best_score_ for every run.
# run RandomizedSearchCV 20 times (with n_iter=10) and record the best score
best_scores = []
for _ in range(20):
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, return_train_score=False)
rand.fit(X, y)
best_scores.append(round(rand.best_score_, 3))
Let us examine all the 20 best scores now.
print(best_scores)
# Output: Best Scores
[0.973, 0.98, 0.98, 0.98, 0.973, 0.98, 0.98, 0.973, 0.98, 0.973, 0.973, 0.98, 0.98, 0.98, 0.98, 0.973, 0.98, 0.98, 0.98, 0.973]
Upon examining the best scores above for all the 20 runs, we see that we get the best accuracy score of 0.98 about 13 times.
Looks like weâre lucky indeed! What about the other 7 times when we didn't quite get the best accuracy score? These accuracy scores are around 0.973 which is pretty close to 0.98.
This observation convinces us that even though Randomized Search may not always give the hyperparameters of the best performing model, the models obtained by using these hyperparameters do not perform much worse compared to the best model obtained from Grid Search.
This means, the best models thus obtained, with the hyperparameters from randomized search are clearly very close to the optimal model.
In essence, these may not be the best hyperparameters, but certainly close to the best hyperparameters, except that these are found under resource-constrained settings.
Hope you all understood how we could use Randomized Search for hyperparameter tuning.
Hope you all enjoyed reading this post.đ
Happy Learning! Until next time!âš
[1]https://courses.dataschool.io/courses/introduction-to-machine-learning-with-scikit-learn
[2]http://scikitlearn.org/stable/modules/cross_validation.html
[3]http://scikitlearn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
[4]http://scikitlearn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
Cover Image: Photo by Ross Sneddon on Unsplash
Note: This post is a comprehensive post for understanding model evaluation, and hyperparameter search. This is a compiled, and improved version of a few of my earlier shorter posts.
25