Practical Steps for Improving the Accuracy of a Machine Learning Model
Getting the right accuracy on the first try is a dream, unless you are the luckiest data scientist. Reaching the desired accuracy is an iterative process: you try different things, start getting positive feedback, and then keep going in that direction.
Choosing the method that best suits your dataset can be really frustrating, but it becomes easier if you approach it systematically.
Note: These are some basic steps for improving model accuracy. It would be nice to cover everything, but this list is not exhaustive.
Make sure your data is cleaned
First, you definitely need to clean your data. Noise and missing values confuse the model, and they also make it harder for you to understand the data properly.
While cleaning, take care not to remove important information from the data.
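As a minimal sketch of the cleaning step, assuming the data lives in a pandas DataFrame, the snippet below counts missing values and then fills them instead of dropping rows, so that rows carrying useful information are kept. The column names and values are hypothetical, and filling with the median or a placeholder is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "age" and "city" each have one missing value.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 31.0],
    "city": ["Pune", "Delhi", None, "Pune"],
    "income": [40000, 52000, 61000, 45000],
})

# Count missing values per column before deciding what to do with them.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fill rather than drop, so the other columns of those rows are not lost.
cleaned = df.copy()
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
cleaned["city"] = cleaned["city"].fillna("unknown")

# Remove exact duplicate rows, if there are any.
cleaned = cleaned.drop_duplicates()
print(cleaned)
```

Whether to fill, drop, or flag missing values depends on the dataset, so treat the choices here as a starting point.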
Understand your data
You have to understand the structure of your data. You can do this with data visualization and with summary statistics such as the mean, minimum, and maximum. The pandas describe() method is very useful for this basic analysis.
Then plot some graphs using matplotlib or seaborn to understand the distribution of each feature.
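A small sketch of this first look at the data, assuming a pandas DataFrame with numeric columns. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 69, 80, 95],
})

# count, mean, std, min, quartiles, and max for every numeric column.
summary = df.describe()
print(summary)

# Pairwise Pearson correlation between the numeric columns.
correlations = df.corr()
print(correlations)
```

From here, df.hist() (which needs matplotlib installed) or a seaborn pair plot is a natural next step for looking at each feature visually.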
Data Pre-processing
There are two common major steps in data preprocessing:
Converting non-numerical data to binary or numeric values
Scaling the data so every feature gets the right attention
Converting non-numerical data
Datasets often contain text or categorical values, such as gender, that a machine learning model cannot accept directly, so make sure you convert them into numerical data correctly. Scikit-learn's LabelEncoder() can be useful for that. Note that LabelEncoder() is designed for the target column; for input features, scikit-learn's OrdinalEncoder() or OneHotEncoder() are the usual choices.
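Here is a minimal sketch of both conversions, assuming a pandas DataFrame with a categorical feature and a yes/no target. The column names are hypothetical. It uses LabelEncoder() for the target and pandas get_dummies() for the feature, which is one common way to get binary columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: one categorical feature and one yes/no target.
df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "bought": ["yes", "no", "yes", "yes"],
})

# LabelEncoder maps each distinct value to an integer, in sorted order
# (here: "no" -> 0, "yes" -> 1).
encoder = LabelEncoder()
df["bought_encoded"] = encoder.fit_transform(df["bought"])

# One-hot encoding gives each category its own 0/1 column.
gender_dummies = pd.get_dummies(df["gender"], prefix="gender", dtype=int)

print(df)
print(gender_dummies)
```

One-hot columns avoid implying an order between categories, which integer codes can accidentally suggest to some models.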
Scaling your data
When you have many features, some of them may have values that are much larger than the others. Because of those large values, some algorithms treat such a feature as very important even when it is not.
To deal with this, we rescale the data so that all features are on a comparable scale before we give it to the machine learning algorithm.
Scikit-learn's StandardScaler() can be useful for that.
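A short sketch of standardization with StandardScaler(), using made-up values where the second column is on a much larger scale than the first. After scaling, each column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: age (small values) and income (large values).
X = np.array([
    [25, 40000],
    [32, 52000],
    [47, 61000],
    [51, 45000],
], dtype=float)

# Subtract each column's mean and divide by its standard deviation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)
print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

In a real project, fit the scaler on the training split only and reuse it with scaler.transform() on the test split, so that no information from the test data leaks into training.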
Choose the right features
Choosing the right features is very important when you are dealing with a large number of features. Some features are not useful for the model, and some even act as noise.
Many feature selection methods help you choose the right features from your data. Most of them measure how strongly each feature is related to the target variable, so you can keep the useful features and remove the ones that only add noise to your model.
Here are some feature selection methods you can try (a detailed article will come soon):
Correlation coefficient
Information gain
Chi-square test
Random forest feature importance
Recursive Feature Elimination
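Two of the methods listed above can be sketched with scikit-learn as follows. The data is synthetic (generated with make_classification), and the choice of 3 features to keep and of LogisticRegression as the estimator inside RFE are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which only 3 carry signal.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)

# Random forest feature importance: one score per feature, summing to 1.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_
print(importances)

# Recursive Feature Elimination: repeatedly drop the weakest feature
# until only the requested number remains.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
selected_mask = rfe.support_  # True for the features that were kept
print(selected_mask)
```

The two methods do not always agree, so it is worth comparing their results and checking the model's score with and without the removed features.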
Choose the right evaluation metrics
How you test your model matters a lot, so choose the right evaluation metrics for your machine learning model.
In regression, evaluating the model is fairly straightforward, but in classification there are multiple classes, and you can get good accuracy on one class while getting horrible accuracy on another.
To deal with this kind of problem, use metrics other than plain accuracy: a confusion matrix or a classification report can be very useful.
Scikit-learn has prebuilt functions that you can use to test your model properly:
Confusion matrix
Classification report
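The small example below shows why accuracy alone can mislead. The labels are made up: class 1 is rare, and the hypothetical model finds only one of its two examples, yet overall accuracy still looks good.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and predictions for an imbalanced problem.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# 9 of 10 predictions are correct.
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)
print(matrix)

# Precision, recall, and F1-score for each class.
report = classification_report(y_true, y_pred)
print(report)
```

The confusion matrix shows that half of the class 1 examples were missed, which the single accuracy number hides.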
These are some ideas that can be useful for improving your model. There are many other things to try, which will be covered in future posts.