3 Mistakes to Avoid When You Write Your Machine Learning Model
At last I can breathe a sigh of relief: my Machine Learning model works perfectly on both the training set and the test set. All the metrics I use to measure its performance report very high values. I can finally say that my work is almost complete: just deploy, and that's it.
Instead, it is precisely there, in the deployment phase, that all the problems arise: the model seems to perform poorly on new data, and, above all, the code turns out to have implementation problems.
In this article I describe three common mistakes that must absolutely be avoided when developing a Machine Learning model, in order to prevent unpleasant surprises during the deployment phase.
In the first phase of developing a Machine Learning model, we clean the data and then normalize or standardize them.
One possible error during this preprocessing phase is to perform operations of this type:
# Normalize the column by dividing it by its maximum value
df['normalized_column'] = df['column']/df['column'].max()
In the previous example, everything seems to work on the original dataset. At deployment time, however, the problem of column normalization arises: which value should you use as the scale factor for the new data?
A possible solution is to store the scale factor somewhere:
# Compute the scale factor and store it for the deployment phase
scale_factor = df['column'].max()
file = open("column_scale_factor.txt", "w")
file.write(str(scale_factor))  # write() expects a string
file.close()
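At deployment time, the stored value can simply be read back and applied to the new data. Here is a minimal sketch of this counterpart, assuming the new data arrives in a hypothetical DataFrame called new_df:
# Read the scale factor computed during training
# (the file contains a single numeric value)
file = open("column_scale_factor.txt", "r")
scale_factor = float(file.read())
file.close()
# Apply the same scaling to the new data
# (new_df is a hypothetical DataFrame holding the incoming data)
new_df['normalized_column'] = new_df['column'] / scale_factor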
The previous solution is very simple. As an alternative, a more powerful scaler could be used, such as the ones provided by the Scikit-learn preprocessing package:
from sklearn.preprocessing import MaxAbsScaler
import numpy as np

scaler = MaxAbsScaler()
# Scikit-learn expects a 2D array, so reshape the single column
feature = np.array(df[column]).reshape(-1, 1)
scaler.fit(feature)
Once the scaler is fitted, do not forget to save it! You will use it during the deployment phase!
We can save the scaler by using the joblib Python library:
import joblib
# Serialize the fitted scaler to disk so it can be reloaded at deployment
joblib.dump(scaler, 'scaler_' + column + '.pkl')
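At deployment time, the saved scaler can be loaded back and applied to the new data, so that exactly the same transformation learned during training is reused. Here is a minimal sketch, assuming the new data arrives in a hypothetical DataFrame called new_df containing the same column:
import joblib
import numpy as np

# Load the scaler fitted during training
scaler = joblib.load('scaler_' + column + '.pkl')
# Reshape the new column and apply the transformation learned on the training data
# (new_df is a hypothetical DataFrame holding the incoming data)
new_feature = np.array(new_df[column]).reshape(-1, 1)
new_df[column] = scaler.transform(new_feature).ravel()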
When looking for the optimal model to represent our data, it may happen that we test different algorithms and different scenarios before obtaining the best possible model.
In this case, one possible mistake is to create a new notebook for every algorithm we test. Over time, the risk is ending up with a folder on the filesystem crammed with files, and with no way to accurately track the steps taken during model development.
So how can we solve this problem?
Commenting the code is not enough: we also need to give the notebooks meaningful, descriptive names. A possible solution is to prefix each notebook's name with a progressive number that indicates exactly at which point of the workflow a given step is performed.
Here is a possible example:
01_01_preprocessing_standard_scaler.ipynb
01_02_preprocessing_max_abs_scaler.ipynb
02_01_knn.ipynb
02_02_decisiontree.ipynb