Model Training Patterns - Hyperparameter Tuning

Model parameters vs Hyperparameters

Model parameters refer to the weights and biases learned by the model as it goes through training iterations.

Hyperparameters, on the other hand, are parameters that we as model builders set before training starts; they are not learned from the data.
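To make the distinction concrete, here is a minimal sketch that fits a one-weight linear model with plain gradient descent. The names `train`, `learning_rate`, and `epochs` are illustrative and not taken from any library: `w` is a model parameter that training learns, while `learning_rate` and `epochs` are hyperparameters chosen beforehand.

```python
def train(xs, ys, learning_rate, epochs):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0  # model parameter: learned during training
    n = len(xs)
    for _ in range(epochs):
        # Gradient of mean((w*x - y)^2) with respect to w
        grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # toy data generated by y = 2x

# Hyperparameters: chosen by us, not learned
learned_w = train(xs, ys, learning_rate=0.01, epochs=200)
print(learned_w)  # converges toward 2.0
```

Changing `learning_rate` or `epochs` changes how (and whether) `w` converges, which is why these values are worth tuning.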

Types of hyperparameters

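Model architecture hyperparameters - Hyperparameters that control the model's underlying mathematical function (for example, the number of layers or the number of neurons per layer)
Model training hyperparameters - Hyperparameters that control the training loop and the way the optimizer works (for example, the learning rate, batch size, or number of epochs)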

Finding the best possible values for hyperparameters

Grid search

Define a set of values for each hyperparameter that you want to optimize
Use grid search - it will try every combination of the specified values and return the combination that results in the best evaluation metric for the model
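The two steps above can be sketched in plain Python. This is only an illustration: `evaluate` is a hypothetical stand-in for "train the model and return its evaluation metric", and the grid values are made up.

```python
import itertools
import math

def evaluate(params):
    """Hypothetical stand-in for training a model and returning a validation score."""
    lr_term = (math.log10(params["learning_rate"]) + 2) ** 2
    units_term = ((params["units"] - 64) / 64) ** 2
    return -(lr_term + units_term)  # higher is better; peaks at lr=0.01, units=64

def grid_search(grid, evaluate):
    names = list(grid)
    best_params, best_score, trials = None, float("-inf"), 0
    # itertools.product enumerates every combination of the listed values
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        trials += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, trials

grid = {"learning_rate": [0.001, 0.01, 0.1], "units": [32, 64, 128]}
best_params, best_score, trials = grid_search(grid, evaluate)
print(best_params, trials)
```

With 3 values for each of 2 hyperparameters, the loop runs 3 x 3 = 9 trials, one per combination.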

Problems with this approach

As the number of hyperparameters and the number of values per hyperparameter increase, the number of combinations grows multiplicatively, and so does the time required to try them all => combinatorial explosion.
It's a brute-force solution => it doesn't learn from earlier trials. It will keep trying combinations even when earlier results already indicate that a direction is unpromising, say, after we reach a point where the error starts increasing instead of decreasing.
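A quick back-of-the-envelope calculation shows how fast the grid grows. The search space size and the 5 minutes per training run are assumed figures chosen for illustration only.

```python
import math

# Hypothetical search space: 4 hyperparameters, 10 candidate values each
values_per_hyperparameter = [10, 10, 10, 10]
n_combinations = math.prod(values_per_hyperparameter)

minutes_per_run = 5  # assumed cost of one full training run
total_days = n_combinations * minutes_per_run / (60 * 24)

print(n_combinations)        # 10000
print(round(total_days, 1))  # 34.7
```

Adding a fifth hyperparameter with 10 values would multiply both numbers by 10.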

Randomized search

A faster alternative to grid search.

Unlike grid search, this approach randomly samples a value for each hyperparameter and tries that combination, repeating for a fixed number of trials instead of covering every combination.

Define a range of values for each hyperparameter that you want to optimize
Specify how many times you want to randomly sample a combination of values
Run random search
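The steps above can be sketched as follows. As before, `evaluate` is a hypothetical stand-in for a real training run, and the ranges (learning rate in [1e-4, 1e-1], units in 32..256) are assumptions made for the example.

```python
import math
import random

def evaluate(params):
    """Hypothetical stand-in for training a model and returning a validation score."""
    lr_term = (math.log10(params["learning_rate"]) + 2) ** 2
    units_term = ((params["units"] - 64) / 64) ** 2
    return -(lr_term + units_term)  # higher is better

def random_search(evaluate, n_trials, seed=0):
    rng = random.Random(seed)  # seeded so the run is reproducible
    best_params, best_score = None, float("-inf")
    history = []
    for _ in range(n_trials):
        params = {
            # Sample the learning rate log-uniformly from [1e-4, 1e-1]
            "learning_rate": 10 ** rng.uniform(-4, -1),
            # Sample units from 32, 64, ..., 256
            "units": rng.randrange(32, 257, 32),
        }
        score = evaluate(params)
        history.append((params, score))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, history

best_params, best_score, history = random_search(evaluate, n_trials=20)
print(best_params, best_score)
```

The cost is fixed by `n_trials` regardless of how wide the ranges are, which is where the speed-up over grid search comes from. Each sample is still drawn independently of earlier results.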

`keras-tuner` library

This library provides a solution that scales and, with tuners such as Bayesian optimization, learns from previous trials to find a good combination of hyperparameter values.

EXAMPLE - tuning the number of neurons in the first and second hidden layers (and the learning rate) of an MNIST classification model

import keras_tuner as kt
from tensorflow import keras

def build_model(hp):
  model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(units=hp.Int('first_hidden', min_value=32,
                                    max_value=256, step=32), activation='relu'),
    keras.layers.Dense(units=hp.Int('second_hidden', min_value=32,
                                    max_value=256, step=32), activation='relu'),
    keras.layers.Dense(units=10, activation='softmax')
  ])

  model.compile(optimizer=keras.optimizers.Adam(
    hp.Float('learning_rate', min_value=.005, max_value=.01, sampling='log')),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])

  return model

tuner = kt.BayesianOptimization(
  build_model,
  objective='val_accuracy',
  max_trials=10,
)

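# Load MNIST and scale pixel values to [0, 1]
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train / 255.0

# Each trial trains a model for 10 epochs, holding out 10% of the data for validation
tuner.search(x_train, y_train, validation_split=0.1, epochs=10)

# Hyperparameter values from the best-scoring trial
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]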

Bayesian Optimization (B.O.)

Goal of this optimization approach - Train the model directly, i.e. call the objective function (the process of training the ML model), as few times as possible, because it is a costly operation

One issue with grid search and randomized search is that every new set of hyperparameters requires running the model through an entire training loop. This is the cost that Bayesian optimization tries to reduce.

How does this work?

Choose hyperparameters that need optimization
Define a range of values for these hyperparameters
Define the objective function
Bayesian optimization uses this objective function to build a new function that emulates our model and is much cheaper to run (the surrogate function)
B.O. uses the surrogate function to find the most promising combination of hyperparameters
Once that combination is found, the model is run through a full training loop using these values
The results of training are fed back into the surrogate function, and the process is repeated for the configured number of trials (max_trials in the keras-tuner example above)
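The loop above can be sketched in plain Python for a single hyperparameter. This is an illustrative sketch under stated assumptions, not how `keras-tuner` is implemented: it assumes a Gaussian-process surrogate with a squared-exponential kernel and an upper-confidence-bound rule for picking the next point, neither of which is specified in these notes. `objective` is a hypothetical stand-in for a full training run.

```python
import math

def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2 * length_scale ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def surrogate(xs, ys, x_new, noise=1e-4):
    """Gaussian-process posterior mean and standard deviation at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)
    v = solve(K, k_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return mean, math.sqrt(var)

def bayesian_optimize(objective, candidates, initial_xs, n_trials, kappa=2.0):
    xs = list(initial_xs)
    ys = [objective(x) for x in xs]  # expensive calls for the starting points
    for _ in range(n_trials):
        untried = [c for c in candidates if c not in xs]

        def ucb(c):
            # Cheap surrogate query: favor high predicted score or high uncertainty
            mean, std = surrogate(xs, ys, c)
            return mean + kappa * std

        x_next = max(untried, key=ucb)
        xs.append(x_next)
        ys.append(objective(x_next))  # one real (expensive) evaluation per trial
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs

def objective(x):
    """Hypothetical stand-in for a full training run; peaks at x = 0.3."""
    return -((x - 0.3) ** 2)

candidates = [i / 100 for i in range(101)]
best_x, best_y, tried = bayesian_optimize(
    objective, candidates, initial_xs=[0.0, 0.5, 1.0], n_trials=8)
print(best_x, best_y, len(tried))
```

The surrogate is queried about a hundred times per trial, but `objective` is called only once per trial, which is the saving the approach is after.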