Model Training Patterns - Useful Overfitting

Creating a machine learning model involves training loops. With each training iteration, we try to improve the model. Generally, we use gradient descent (GD) to determine the model parameters. For larger datasets, however, stochastic gradient descent (SGD) is preferred: it works the same way as gradient descent, but computes each parameter update on a mini-batch of the dataset rather than on the full dataset.
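As a rough illustration of the mini-batch idea, here is a minimal NumPy sketch of one epoch of mini-batch SGD. The names (sgd_epoch, grad_fn) and the learning rate are illustrative assumptions, not part of any framework:

import numpy as np

def sgd_epoch(w, X, y, grad_fn, lr=0.01, batch_size=64):
    # Shuffle once per epoch, then update the weights on each mini-batch.
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        # grad_fn returns the gradient of the loss on this mini-batch only.
        w = w - lr * grad_fn(w, X[batch], y[batch])
    return w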

Extensions of SGD, such as Adam and Adagrad, are the de facto optimizers used in ML frameworks.

A typical Keras training loop looks as follows:

from tensorflow import keras

model = keras.Model(...)  # build the model architecture here
model.compile(optimizer=keras.optimizers.Adam(),
              loss=keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

# Train on mini-batches, validating after every epoch.
history = model.fit(x_train, y_train,
                    batch_size=64,
                    epochs=3,
                    validation_data=(x_val, y_val))

# Evaluate on held-out test data and save the trained model to disk.
results = model.evaluate(x_test, y_test, batch_size=128)
model.save('model.keras')  # save() requires a file path

In this post, we are going to discuss the Useful Overfitting pattern.

Sometimes we have the entire domain of possible observations at our disposal. In such cases, it makes sense to intentionally overfit the model on the full training dataset, without regularization, dropout, or a validation or test set, because we want the model to compute the precise solution rather than an approximation of it.

Overfitting is not a concern if the model is trained on all possible inputs.
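For instance, here is a minimal sketch of this situation. The synthetic parity task, the architecture, and the epoch count are illustrative assumptions: every possible 8-bit input can be enumerated, so we train on all of them with no validation split and simply let the model memorize the full domain (it may need many epochs to do so):

import numpy as np
from tensorflow import keras

# Toy domain: all 256 possible 8-bit inputs, labeled with a deterministic rule (parity).
X = np.array([[int(b) for b in format(i, '08b')] for i in range(256)], dtype='float32')
y = (X.sum(axis=1) % 2).astype('float32')

model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer=keras.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# No validation set, no regularization: we *want* to memorize every input.
model.fit(X, y, batch_size=256, epochs=2000, verbose=0)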

Use cases for useful overfitting

  1. All possible inputs are available
  2. Distilling the knowledge of a neural network: training a smaller model on outputs generated by a much larger model (see the sketch after this list)
  3. Overfitting a batch
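As a rough sketch of the distillation case, a small student can be trained to reproduce a large teacher's softened predictions on unlabeled inputs; overfitting to those teacher outputs is exactly what we want. The function name, the temperature value, and the assumption that the teacher outputs logits while the student ends in a softmax are all illustrative choices, not a complete recipe:

import tensorflow as tf
from tensorflow import keras

def distill(teacher, student, x_unlabeled, temperature=5.0, epochs=10):
    # Soft targets: teacher logits scaled by a temperature, then softmaxed.
    # Assumes the teacher's output layer produces logits.
    teacher_logits = teacher.predict(x_unlabeled)
    soft_targets = tf.nn.softmax(teacher_logits / temperature).numpy()

    # Assumes the student's final layer is a softmax over the same classes.
    student.compile(optimizer=keras.optimizers.Adam(),
                    loss=keras.losses.CategoricalCrossentropy(),
                    metrics=['accuracy'])
    # No validation set: the goal is to fit the teacher's outputs as closely as possible.
    student.fit(x_unlabeled, soft_targets, batch_size=64, epochs=epochs)
    return student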

Overfitting a batch

Overfitting on a small batch is a good sanity check. If a model is not able to overfit a single small batch, something is probably wrong with the model or its training process.
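A minimal sketch of this sanity check, reusing the model, x_train, and y_train from the earlier training loop (the batch size and epoch count are arbitrary choices): train repeatedly on one small batch and confirm that the loss drops to nearly zero.

# Sanity check: can the model drive the loss on one small batch to ~0?
# Ideally run this on a freshly initialized model before full training.
x_batch, y_batch = x_train[:32], y_train[:32]

history = model.fit(x_batch, y_batch, batch_size=32, epochs=200, verbose=0)

final_loss = history.history['loss'][-1]
final_acc = history.history['accuracy'][-1]
print(f'loss={final_loss:.4f}, accuracy={final_acc:.4f}')
# If the loss does not approach zero, suspect a bug in the model or the training setup.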

So, ideally, we should look for a model that is able to overfit the training set. Once we find such a model, we can apply regularization to improve validation accuracy, while focusing less on training accuracy.
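Once the model can overfit, a hedged sketch of the next step might look like the following. The dropout rate, the L2 strength, and the placeholders num_features and num_classes are illustrative assumptions; the point is simply to reintroduce regularization and a validation set and tune for validation accuracy:

from tensorflow import keras

# Same kind of architecture as before, now with dropout and L2 weight decay added.
regularized_model = keras.Sequential([
    keras.Input(shape=(num_features,)),   # num_features: placeholder for your input size
    keras.layers.Dense(128, activation='relu',
                       kernel_regularizer=keras.regularizers.l2(1e-4)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(num_classes, activation='softmax'),  # num_classes: placeholder
])
regularized_model.compile(optimizer=keras.optimizers.Adam(),
                          loss=keras.losses.CategoricalCrossentropy(),
                          metrics=['accuracy'])
regularized_model.fit(x_train, y_train,
                      epochs=20,
                      validation_data=(x_val, y_val))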
