Building Machine Learning Models with Imbalanced Data
In this post, I discuss a number of considerations and techniques for dealing with imbalanced data when training a machine learning model, relying heavily on imbalanced-learn, a scikit-learn-contrib package.
Imbalanced data typically refers to a classification problem where the number of observations per class is not equally distributed — often you’ll have a large amount of data for one class (the majority class) and much fewer observations for other classes (the minority classes). For example, suppose you’re building a classifier to classify a credit card transaction as fraudulent or authentic — you’ll likely have 10,000 authentic transactions for every 1 fraudulent transaction.
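As a quick illustration, you can generate a synthetic dataset with a skewed class distribution using scikit-learn's make_classification (the 99:1 split here is a hypothetical example, milder than the fraud scenario above):

```python
from collections import Counter

from sklearn.datasets import make_classification

# weights controls the approximate fraction of samples per class
X, y = make_classification(
    n_samples=10_000,
    n_classes=2,
    weights=[0.99, 0.01],  # ~99% majority class, ~1% minority class
    random_state=0,
)

# Counting labels makes the imbalance obvious before any modeling
print(Counter(y))
```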
When building a tree-based model, the top of the decision tree is likely to learn splits which separate out the majority class at the expense of learning rules for the minority class. For gradient-based models, most updates move in a direction that allows for correct classification of the majority class — a frequency bias.
Metrics
When dealing with imbalanced data, standard classification metrics do not adequately represent model performance. An accuracy of 99.5% might look great until you realize the model is simply labeling every patient as “disease-free”: it’s correct for the 99.5% of people who are healthy, and wrong for every sick patient.
Precision is the fraction of relevant examples among all examples predicted to belong to a class:
precision = true_positives / (true_positives + false_positives)
Recall is the fraction of examples predicted to belong to a class with respect to all examples that truly belong to that class:
recall = true_positives / (true_positives + false_negatives)
They can be combined into the F-score:
F_β = (1 + β²) * precision * recall / (β² * precision + recall)
β < 1 focuses more on precision; β > 1 focuses more on recall.
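These metrics are all available in scikit-learn. A small sketch on toy labels (the predictions here are made up purely to exercise the formulas):

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy imbalanced problem: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]  # 1 TP, 1 FP, 1 FN

precision = precision_score(y_true, y_pred)  # 1 / (1 + 1) = 0.5
recall = recall_score(y_true, y_pred)        # 1 / (1 + 1) = 0.5
f2 = fbeta_score(y_true, y_pred, beta=2)     # β > 1 emphasizes recall

print(precision, recall, f2)
```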
ROC Curves
An ROC curve visualizes an algorithm’s ability to discriminate the positive class from the rest. It plots the True Positive Rate against the False Positive Rate:
TPR = true_positives / (true_positives + false_negatives)
FPR = false_positives / (false_positives + true_negatives)
The area under the curve (AUC) summarizes the ROC curve. The ideal model has an AUC of 1; a random guess has an AUC of 0.5.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Use predicted probabilities for the positive class, not hard labels —
# roc_curve sweeps over score thresholds, which hard 0/1 predictions can't support
probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs, pos_label=1)
auc = roc_auc_score(y_test, probs)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label='model')
ax.plot([0, 1], [0, 1], color='navy', linestyle='--', label='random')
ax.set_title(f'AUC: {auc:.3f}')
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')
ax.legend()
Class Weight
One of the simplest ways to address class imbalance is to provide a weight for each class, placing more emphasis on the minority classes. You can calculate proper weights using sklearn:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# classes and y are keyword-only arguments in current scikit-learn
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
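In practice you often don't need to compute the weights yourself: most scikit-learn estimators accept a class_weight argument directly. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 95:5 imbalanced dataset for illustration
X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05], random_state=0)

# 'balanced' weights each class inversely to its frequency,
# so minority-class errors cost more during fitting
clf = LogisticRegression(class_weight='balanced', max_iter=1_000)
clf.fit(X, y)
```

You could equally pass an explicit dict such as `class_weight={0: 1.0, 1: 20.0}` if you want direct control over the trade-off.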
In a tree-based model, scale the entropy component of each class by the corresponding weight. As a reminder, the entropy of a node is:
-∑ᵢ pᵢ log(pᵢ)
where pᵢ is the fraction of data points within class i.
In a gradient-based model, scale the calculated loss for each observation by the appropriate class weight. A common loss function for classification is categorical cross entropy:
-∑ᵢ yᵢ log(ŷᵢ)
where yᵢ is 1 for the true class and 0 otherwise (a one-hot encoding), and ŷᵢ is the predicted probability for class i.
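A minimal NumPy sketch of that weighted loss (illustrative only — frameworks expose equivalents, e.g. the `weight` argument of PyTorch's CrossEntropyLoss or `class_weight` in Keras's `fit`):

```python
import numpy as np

def weighted_cross_entropy(y_true, y_pred, class_weights):
    """Class-weighted categorical cross entropy.

    y_true: one-hot labels, shape (n, k)
    y_pred: predicted probabilities, shape (n, k)
    class_weights: one weight per class, shape (k,)
    """
    # Standard cross entropy per example: -sum_i y_i * log(y_hat_i)
    per_example = -np.sum(y_true * np.log(y_pred), axis=1)
    # Scale each example's loss by the weight of its true class
    w = class_weights[np.argmax(y_true, axis=1)]
    return np.mean(w * per_example)

y_true = np.array([[1, 0], [0, 1]])
y_pred = np.array([[0.8, 0.2], [0.4, 0.6]])
# Class 1 (the minority class here) is weighted 3x
print(weighted_cross_entropy(y_true, y_pred, np.array([1.0, 3.0])))
```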