Evaluating class imbalance#

Class imbalance occurs when the distribution of data points across the known classes is skewed. It’s a common problem in machine learning and can hurt model performance, especially on the under-represented classes. Standard classification algorithms work well on fairly balanced datasets; when the data is imbalanced, however, the model tends to learn more from the majority classes than from the minority classes.

One common approach to solving this problem is to either decrease the number of samples in the majority class (under-sampling) or increase the number of samples in the minority class (over-sampling).
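As a rough illustration of the two resampling ideas, the sketch below applies random under-sampling and random over-sampling with plain NumPy. The toy arrays `X` and `y` and the 90/10 split are hypothetical and chosen only for this example; dedicated resampling libraries offer more sophisticated strategies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 90 samples of class 0 and 10 samples of class 1
y = np.array([0] * 90 + [1] * 10)
X = rng.normal(size=(y.size, 2))

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Under-sampling: keep a random subset of the majority class (no replacement)
kept_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
under_idx = np.concatenate([kept_majority, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

# Over-sampling: draw minority samples with replacement up to the majority size
drawn_minority = rng.choice(minority_idx, size=majority_idx.size, replace=True)
over_idx = np.concatenate([majority_idx, drawn_minority])
X_over, y_over = X[over_idx], y[over_idx]

print(np.bincount(y_under))  # [10 10]
print(np.bincount(y_over))   # [90 90]
```

Under-sampling discards majority-class information, while random over-sampling duplicates minority samples; either should normally be applied to the training split only, so that the test set keeps the original distribution.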

It’s essential to understand the class imbalance before applying any resampling technique. Target analysis helps visualise the imbalance by plotting a bar chart of how frequently each class occurs in the dataset.

import matplotlib
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from sklearn_evaluation import plot
matplotlib.rcParams["figure.figsize"] = (7, 7)
matplotlib.rcParams["font.size"] = 18
X, y = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,
    n_classes=2,
    # Assign roughly 85% of observations to label 0 and the remaining 15% to label 1
    weights=[0.85],
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

Balance Mode#

When only the training targets are passed, the plot is drawn in balance mode, which shows the number of samples in each class.

In the example below, class 0 is the dominant class, so a classifier may be biased towards it and predict class 0 most of the time.

plot.target_analysis(y_train)
<Axes: title={'center': 'Class Balance for 700 Instances'}, ylabel='support'>
[Figure: bar chart titled 'Class Balance for 700 Instances', y-axis 'support']
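To make the bias concern concrete, the following sketch computes class proportions and the accuracy of a trivial classifier that always predicts the majority class. The array `y_example` is a hypothetical stand-in with an exact 85/15 split (mirroring `weights=[0.85]`), not the actual `y_train` generated earlier.

```python
import numpy as np

# Hypothetical labels with an exact 85/15 split (not the real y_train)
y_example = np.array([0] * 595 + [1] * 105)

classes, counts = np.unique(y_example, return_counts=True)
proportions = counts / counts.sum()

# A trivial classifier that always predicts the most frequent class
majority_class = classes[np.argmax(counts)]
baseline_pred = np.full_like(y_example, majority_class)
baseline_accuracy = np.mean(baseline_pred == y_example)

# Recall on the minority class is zero because it is never predicted
minority_class = classes[np.argmin(counts)]
minority_mask = y_example == minority_class
minority_recall = np.mean(baseline_pred[minority_mask] == minority_class)

print(dict(zip(classes.tolist(), proportions.tolist())))
print(baseline_accuracy, minority_recall)
```

Here the baseline reaches about 85% accuracy while never identifying a single minority sample, which is why accuracy alone can be misleading on imbalanced data.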

Compare Mode#

When both the training and the test targets are passed, a side-by-side bar chart of the two sets is displayed.

The chart below shows that the distribution of samples is fairly similar across the train and test splits.

plot.target_analysis(y_train, y_test)
<Axes: title={'center': 'Class Balance for 1,000 Instances'}, ylabel='support'>
[Figure: side-by-side bar chart titled 'Class Balance for 1,000 Instances', y-axis 'support']
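The same comparison can also be checked numerically. The sketch below is one possible approach: `class_proportions` is a hypothetical helper written for this example, and the two label arrays are illustrative stand-ins for `y_train` and `y_test` rather than the values produced above.

```python
import numpy as np


def class_proportions(y, classes):
    """Return the fraction of samples in `y` belonging to each entry of `classes`."""
    y = np.asarray(y)
    return np.array([np.mean(y == c) for c in classes])


# Hypothetical label arrays standing in for y_train and y_test
y_train_example = np.array([0] * 595 + [1] * 105)
y_test_example = np.array([0] * 252 + [1] * 48)

classes = np.union1d(y_train_example, y_test_example)
train_prop = class_proportions(y_train_example, classes)
test_prop = class_proportions(y_test_example, classes)

# Largest per-class gap in proportion between the two splits
max_gap = np.max(np.abs(train_prop - test_prop))
print(train_prop, test_prop, max_gap)
```

If the gap turns out to be large, passing `stratify=y` to `train_test_split` keeps the class proportions approximately equal in both splits.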