Handling Imbalanced Multiclass and Binary Classification Datasets
In this article, I’ll discuss how to handle imbalanced datasets in both multiclass and binary classification problems using XGBoost, along with the problems I ran into doing this for the first time. The setting is fairly specific, but the same ideas apply to other classification problems with similar issues.
Midway through my Capstone project in the Flatiron School’s Data Science program, I encountered an interesting issue. Using the default XGBoost settings, the recall and precision scores for one of my three classes were both sitting at an extremely low value of 0.22. I had never seen this in a model with an overall accuracy score in excess of 80%, until I realized that I had an imbalanced dataset. Sure enough, this is the distribution of classes and scores I observed in my classification report:
class         precision  recall  f1-score  support
-1                 0.27    0.24      0.25      142
0                  0.82    0.73      0.77      532
1                  0.74    0.81      0.78      750
accuracy                             0.72     1424
macro avg          0.61    0.59      0.60     1424
weighted avg       0.72    0.72      0.72     1424
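To see how the summary rows relate to the per-class rows, here is a small sketch that recomputes the macro and weighted averages from the precision, recall, and support values in the report above. The variable names are my own, and the inputs are the rounded figures from the report, so the results only agree with it to two decimal places.

```python
# Per-class (precision, recall, support), copied from the report above.
report = {
    -1: (0.27, 0.24, 142),
    0: (0.82, 0.73, 532),
    1: (0.74, 0.81, 750),
}

total = sum(support for _, _, support in report.values())

# Macro average: every class counts equally, whatever its size.
macro_precision = sum(p for p, _, _ in report.values()) / len(report)
macro_recall = sum(r for _, r, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_precision = sum(p * s for p, _, s in report.values()) / total
weighted_recall = sum(r * s for _, r, s in report.values()) / total

# Share of all rows that belong to the minority class.
minority_share = report[-1][2] / total

print(f"macro precision:    {macro_precision:.2f}")
print(f"macro recall:       {macro_recall:.2f}")
print(f"weighted precision: {weighted_precision:.2f}")
print(f"weighted recall:    {weighted_recall:.2f}")
print(f"minority share:     {minority_share:.1%}")
```

The “-1” class makes up only about a tenth of the rows, so its weak scores barely move the weighted average or the accuracy. The macro average, which treats all three classes equally, is where the problem becomes visible.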
If you find yourself in a similar scenario, the fix is fairly simple. There are two parameters you need to set on your XGBoost classifier: objective, which should be ‘multi:softmax’, and num_class, which should be your number of classes. The documentation was somewhat misleading, and it took me quite a while to get this running without warnings. The default objective for XGBClassifier is ‘binary:logistic’, which is simply not what our problem is (recent versions of the scikit-learn wrapper may switch to a multiclass objective on their own when they see more than two classes, but setting it explicitly removes the guesswork). As a result, we can see how our “-1” outcome class has such low values across the board. This is what our code should look like in this instance.
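from sklearn.metrics import classification_report
from xgboost import XGBClassifier

# Multiclass objective, with the number of classes stated explicitly.
model_xgb = XGBClassifier(objective='multi:softmax', num_class=3)
model_xgb.fit(X_train, y_train)
pred_xgb_test = model_xgb.predict(X_test)
print(classification_report(y_test, pred_xgb_test))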
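One thing to watch with this setup: recent XGBoost releases expect class labels to be the integers 0 through num_class - 1, so labels like -1, 0, and 1 may be rejected at fit time. Below is a minimal sketch of a reversible mapping. The helper name and the toy labels are hypothetical, and scikit-learn’s LabelEncoder does the same job if you would rather not roll your own.

```python
def build_label_maps(labels):
    """Map arbitrary class labels to 0..n-1, and build the reverse map."""
    classes = sorted(set(labels))
    to_index = {label: i for i, label in enumerate(classes)}
    to_label = {i: label for label, i in to_index.items()}
    return to_index, to_label


# Toy labels standing in for the real outcome column.
y_raw = [-1, 0, 1, 1, 0, -1, 1]

to_index, to_label = build_label_maps(y_raw)
num_class = len(to_index)

# Encoded labels are what would be handed to the classifier's fit method.
y_encoded = [to_index[label] for label in y_raw]

# Predictions come back as indices; translate them to the original labels.
predicted_indices = [2, 0, 1]  # stand-in for real model output
predicted_labels = [to_label[i] for i in predicted_indices]

print(y_encoded)
print(predicted_labels)
```

Keeping both directions of the mapping means the classification report can still be printed against the original -1, 0, 1 labels.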
Although recall did not improve for this class, its precision shot up to nearly 80%. This is a vast improvement over the previous model, now that the parameters and classifier are actually set up correctly. It was at this point in my project that a thought entered my mind: what if one of the class outcomes is irrelevant? Interestingly, dropping it turns this into a binary classification problem.
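Before moving on to the binary case, it is worth noting a commonly used option for a multiclass model where one class still lags: weight each training row inversely to its class frequency and pass those weights to fit through its sample_weight argument. This is an illustration, not part of the code shown in this article. The sketch uses the n_samples / (n_classes * class_count) formula that scikit-learn calls “balanced”, with a hypothetical helper name and the class sizes from the report above as stand-in training counts.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Return one weight per row, plus the per-class weight table."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    class_weight = {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }
    return [class_weight[label] for label in labels], class_weight


# Stand-in labels: class sizes borrowed from the report above.
y_train_demo = [-1] * 142 + [0] * 532 + [1] * 750

weights, class_weight = balanced_sample_weights(y_train_demo)

for label in sorted(class_weight):
    print(f"class {label:>2}: weight {class_weight[label]:.2f}")

# With XGBoost installed, the weights would be passed along these lines:
# model_xgb.fit(X_train, y_train, sample_weight=weights)
```

With these weights each class contributes the same total weight to training, so the minority class is no longer drowned out by sheer row count. Whether that actually helps recall is something to check on held-out data.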
At this point, I had already done quite a bit of research on XGBoost through its documentation and a variety of Stack Overflow posts from people having similar issues. During that search I had already found the solution for binary classification: the scale_pos_weight parameter needed to be set. Essentially, this parameter scales the weight given to the positive class (label 1) by whatever factor you provide, to compensate for the difference in class sizes. However, there was an issue: my positive class was the majority class, by a factor of 6. So what does one do in a scenario like this? There is no “scale_neg_weight” parameter, as I found out. This meant I was going to have to re-engineer my outcome feature to essentially invert my classes. Without modification, my outcome field took only one of these values:
[-1,0,1]
For context, this was originally a “favors_pitcher” field and the values speak for themselves. The class I was trying to remove was the “0” class, the neutral outcomes that don’t really favor either team in the short term. So I dropped these rows and engineered a new outcome field with some crafty Python.
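# Default every remaining row to 1 (the outcome favors the hitter)...
no_neutrals_df['favors_hitter_binary'] = 1
# ...then flip the rows that favor the pitcher to 0.
no_neutrals_df.loc[no_neutrals_df['favors_pitcher']==1, 'favors_hitter_binary'] = 0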
Perfect. Now I had only zeros and ones in my new “favors_hitter_binary” column for my new binary classification problem. There were six times as many “zero” entries as there were “one” entries, so I was now able to use the XGBoost parameter from earlier.
nn_model_xgb = XGBClassifier(scale_pos_weight=6)  # zeros outnumber ones about 6 to 1
nn_model_xgb.fit(X_train_nn, y_train_nn)
pred_xgb_nn = nn_model_xgb.predict(X_test_nn)
print(classification_report(y_test_nn, pred_xgb_nn))
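Rather than hard-coding the ratio, it can be computed from the training labels. The XGBoost documentation suggests sum(negative instances) / sum(positive instances) as a typical value for scale_pos_weight. Here is a minimal sketch of that calculation; the helper name is hypothetical and the labels are made up to give a 6-to-1 split.

```python
from collections import Counter


def suggested_scale_pos_weight(labels, positive=1):
    """Ratio of negative rows to positive rows in a binary label list."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos


# Made-up binary labels: 600 zeros and 100 ones.
y_demo = [0] * 600 + [1] * 100

ratio = suggested_scale_pos_weight(y_demo)
print(f"suggested scale_pos_weight: {ratio:.2f}")

# The result would then be passed to the classifier, for example:
# nn_model_xgb = XGBClassifier(scale_pos_weight=ratio)
```

Computing the ratio from the training split, rather than the full dataset, keeps the value in step with the data the model actually sees.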
Although my binary model ended up overfitting and not really meeting my needs in the grand scheme of the project, I felt it was a worthwhile endeavour to learn how to handle both scenarios. The model could have performed extremely well across the board for the positive hitter outcome class and I would have been one happy programmer. However, the precision and recall scores were too low, both around 0.33. Additionally, I had to redo my train-test split just for this tangent, which was not ideal, even though I deemed it worth investigating.
Regardless, I hope this retrospective is helpful as you tackle your own classification problems or work with XGBoost. This is definitely something I’ll be able to handle more easily in the future. If you’d like to see the full context of what I was working on, please check out my MLB pitch classification project here.
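Redoing the split for a relabeled dataset is also a good moment to make sure both classes keep their proportions in the train and test sets. scikit-learn’s train_test_split accepts a stratify argument for exactly this. The standard-library sketch below only illustrates the idea, using a hypothetical helper name and made-up labels.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, test_fraction=0.25, seed=42):
    """Split row indices so each class keeps its share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up binary labels with a 6-to-1 imbalance.
y_demo = [0] * 600 + [1] * 100

train_idx, test_idx = stratified_split_indices(y_demo)

test_positives = sum(y_demo[i] for i in test_idx)
train_positives = sum(y_demo[i] for i in train_idx)
print(f"test rows:  {len(test_idx)}, positives: {test_positives}")
print(f"train rows: {len(train_idx)}, positives: {train_positives}")
```

Without stratification, a random split of a small minority class can leave the test set with too few positive rows for precision and recall to mean much.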