Optimal number of clusters#

Learn how to evaluate clustering algorithms and determine the optimal number of clusters using the following methods:

  • Elbow curve plots the sum of squared errors (the squared distance from each point to its assigned cluster center, summed across all points) for each value of k.

  • Silhouette analysis measures how well each individual point fits its assigned cluster compared with the nearest other cluster.
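
To make the quantity on the elbow curve's y-axis concrete, the sketch below computes the sum of squared errors by hand for a toy dataset. The helper name sum_of_squared_errors and the toy points are illustrative assumptions and are not part of sklearn-evaluation.

```python
def sum_of_squared_errors(points, labels, centers):
    """Sum, over all points, of the squared distance to the assigned center."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Toy data (hypothetical): two tight groups of 2-D points
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k=1: every point is assigned to the overall mean
sse_k1 = sum_of_squared_errors(points, [0, 0, 0, 0], [(5.0, 1.0)])

# k=2: each group gets its own center
sse_k2 = sum_of_squared_errors(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])

print(sse_k1, sse_k2)
```

The large drop in error from k=1 to k=2 on this toy data is the kind of bend that the elbow curve makes visible.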

import matplotlib
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn_evaluation import plot

matplotlib.rcParams["figure.figsize"] = (7, 7)
matplotlib.rcParams["font.size"] = 18

# get data for clustering (arguments not listed here use scikit-learn's defaults)
X, y = datasets.make_blobs(
    center_box=(-10.0, 10.0),
)

# Create the KMeans estimator; the plotting functions below fit it
kmeans = KMeans(random_state=10, n_init=5)

Elbow curve#

The elbow curve helps identify the point at which the plot begins to flatten, that is, where adding more clusters yields only a small further drop in the sum of squared errors. The k value at this point is a good candidate for the optimal number of clusters. In the plot below, one is likely to select k=4. Currently, the kmeans argument only accepts KMeans, MiniBatchKMeans, and BisectingKMeans.

plot.elbow_curve(X, kmeans, range_n_clusters=range(1, 30))
<Axes: title={'center': 'Elbow Plot'}, xlabel='Number of clusters', ylabel='Sum of Squared Errors'>


If you want to train the models yourself, you can plot the results with elbow_curve_from_results.
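
One way to train the models yourself is to fit KMeans once per candidate k and record each model's inertia_, which scikit-learn defines as the sum of squared distances from samples to their closest cluster center. The sketch below does only that; the dataset, the range of k, and the variable names are assumptions for illustration. The resulting lists are the kind of precomputed input that elbow_curve_from_results is meant for, but its exact signature is not shown on this page, so check the API reference before calling it.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

# Hypothetical data; any array of shape (n_samples, n_features) works
X, _ = datasets.make_blobs(n_samples=300, centers=4, random_state=1)

n_clusters = list(range(1, 9))
sum_of_squares = []
for k in n_clusters:
    model = KMeans(n_clusters=k, random_state=10, n_init=5).fit(X)
    # inertia_: sum of squared distances from samples to their closest center
    sum_of_squares.append(model.inertia_)

for k, sse in zip(n_clusters, sum_of_squares):
    print(f"k={k}: SSE={sse:.1f}")
```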

Silhouette plot#

The plots below show that n_clusters values of 3, 5, and 6 are poor choices for the given data. The choice is likely to be between n_clusters=2 and n_clusters=4.

silhouette = plot.silhouette_analysis(X, kmeans)
[Figures: five silhouette plots, one for each n_clusters value from 2 to 6]
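
To make the per-point values in silhouette plots concrete, here is a minimal pure-Python sketch of the standard silhouette coefficient, s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points of the nearest other cluster. The function name silhouette_values and the toy points are assumptions for illustration; this is not how sklearn-evaluation computes the plots, and it expects at least two clusters.

```python
import math


def silhouette_values(points, labels):
    """Per-point silhouette coefficient s = (b - a) / max(a, b)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a point alone in its cluster gets a score of 0
            values.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the points of the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy data (hypothetical): two tight groups of 2-D points
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(points, [0, 0, 1, 1])  # groups match the labels
bad = silhouette_values(points, [0, 1, 0, 1])   # labels split each group
print(good)
print(bad)
```

Values close to 1 indicate points that sit well inside their cluster, while negative values indicate points that are closer to another cluster than to their own.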


If you want to train the models yourself, you can plot the results with silhouette_analysis_from_results.
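
A minimal sketch of training the models yourself for silhouette analysis might look like the following: fit KMeans for each candidate n_clusters, keep the predicted labels, and summarize each fit with scikit-learn's silhouette_score. The dataset, the range of n_clusters, and the variable names are assumptions for illustration. The stored labels are the kind of result silhouette_analysis_from_results is meant to plot, but its exact signature is not shown on this page and should be checked in the API reference.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data; any array of shape (n_samples, n_features) works
X, _ = datasets.make_blobs(n_samples=300, centers=4, random_state=1)

labels_per_k = {}
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=10, n_init=5).fit_predict(X)
    labels_per_k[k] = labels
    # Mean silhouette coefficient over all samples, in the range [-1, 1]
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(f"n_clusters={k}: mean silhouette={score:.3f}")
```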