{
"cells": [
{
"cell_type": "markdown",
"id": "4491676e",
"metadata": {},
"source": [
"# Clustering\n",
"\n",
"In this guide, you'll learn how to use `sklearn-evaluation`, and `sklearn` to evaluate clustering models.\n",
"\n",
"```{note}\n",
"This guide requires `scikit-learn>=1.2`\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "92451189",
"metadata": {},
"source": [
"## Sample clustering model\n",
"\n",
"Let's generate some sample data with 5 clusters; note that in most real-world use cases, you won't have ground truth data labels (which cluster a given observation belongs to). However, in this case, the ground truth data is available, which will help us explain the concepts more clearly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f16809",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_blobs\n",
"\n",
"X, y = make_blobs(\n",
" n_samples=1000,\n",
" centers=5,\n",
" n_features=20,\n",
" random_state=0,\n",
" cluster_std=3,\n",
" center_box=(-10, 10),\n",
")"
]
},
{
"cell_type": "markdown",
"id": "8ed3fc70",
"metadata": {},
"source": [
"## Visualizing clusters\n",
"\n",
"Visualizing high-dimensional data is difficult. A common approach is to reduce its dimensionality using PCA; this losses some information but can help us visualize the clusters. Let's run PCA on our data and plot it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6574057b",
"metadata": {},
"outputs": [],
"source": [
"from sklearn_evaluation import plot\n",
"\n",
"_ = plot.pca(X, y, n_components=2)"
]
},
{
"cell_type": "markdown",
"id": "2ae82273",
"metadata": {},
"source": [
"We can see the clusters in our synthetic data. However, the clusters won't be as transparent when using real-world datasets as in our example dataset."
]
},
{
"cell_type": "markdown",
"id": "5e334ad8",
"metadata": {},
"source": [
"## Evaluation metrics\n",
"\n",
"When clustering data, we want to find the number of clusters that better fit the data. Most models have `n_clusters` as a parameter, so we have to try different values and evaluate which number is the best. To find the *best model*, we need to quantify the quality of the clusters. Here are three metrics you can use that do not require ground truth data:\n",
"\n",
"- `silhouette_score`: goes from -1 to +1, **higher is better** defined clusters ([documentation](https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient))\n",
"- `calinski_harabasz_score`: a ratio, **higher is better** ([documentation](https://scikit-learn.org/stable/modules/clustering.html#calinski-harabasz-index))\n",
"- `davies_bouldin_score`: **lower is better**, minimum value is 0 ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score))\n",
"\n",
"Let's run a `KMeans` algorithm with different `n_clusters` and compute all three metrics; we'll highlight the metric value that best fits the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3831790",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.cluster import KMeans\n",
"from sklearn import metrics\n",
"\n",
"\n",
"def score(X, n_clusters):\n",
" model = KMeans(n_init=\"auto\", n_clusters=n_clusters, random_state=1)\n",
" model.fit(X)\n",
" predicted = model.predict(X)\n",
" return {\n",
" \"n_clusters\": n_clusters,\n",
" \"silhouette_score\": metrics.silhouette_score(X, predicted),\n",
" \"calinski_harabasz_score\": metrics.calinski_harabasz_score(X, predicted),\n",
" \"davies_bouldin_score\": metrics.davies_bouldin_score(X, predicted),\n",
" }\n",
"\n",
"\n",
"df_metrics = pd.DataFrame(\n",
" score(X, n_clusters) for n_clusters in (2, 3, 4, 5, 6, 7, 8, 9, 10)\n",
")\n",
"df_metrics.set_index(\"n_clusters\", inplace=True)\n",
"\n",
"(\n",
" df_metrics.style.highlight_max(\n",
" subset=[\"silhouette_score\", \"calinski_harabasz_score\"], color=\"lightgreen\"\n",
" ).highlight_min(subset=[\"davies_bouldin_score\"], color=\"lightgreen\")\n",
")"
]
},
{
"cell_type": "markdown",
"id": "bbabf57e",
"metadata": {},
"source": [
"All three metrics have their *best* value when `n_clusters=5`. We know this is the best value since our data has 5 clusters; however, when using real datasets, you might find that these metrics might not agree, so it's advisable to understand how each metric is computed and choose the best one for your project.\n",
"\n",
"You can also find the best number of clusters visually. Let's see how to do it using an elbow curve."
]
},
{
"cell_type": "markdown",
"id": "f593fbff",
"metadata": {},
"source": [
"## Optimal number of clusters\n",
"\n",
"### `plot.elbow_curve`\n",
"\n",
"```{important}\n",
"Currently, `plot.elbow_curve` only works with the following sklearn models: `KMeans`, `BisectingKMeans`, and `MiniBatchKMeans`\n",
"```\n",
"\n",
"An elbow curve evaluates the sum of squared errors (i.e., how far each point is from its assigned cluster center). So, naturally, the more centers you have, the lower this metric will be. However, a good clustering model does not necessarily minimize this metric but balances the trade-off between lowering the sum of squared errors and having a small number of clusters (since those two are competing objectives).\n",
"\n",
"The elbow curve will plot the sum of squared errors for a different number of clusters; then, you can visually evaluate when increasing the number of clusters is not worth it, as it'll not yield a significant decrease in the sum of squared errors."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f005c72",
"metadata": {},
"outputs": [],
"source": [
"from sklearn_evaluation import plot\n",
"\n",
"model = KMeans(n_init=\"auto\", random_state=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b53c7aa2",
"metadata": {},
"outputs": [],
"source": [
"_ = plot.elbow_curve(X, model, range_n_clusters=(2, 3, 4, 5, 6, 7, 8))"
]
},
{
"cell_type": "markdown",
"id": "90bb3d22",
"metadata": {},
"source": [
"In our curve, we see significant improvements when moving from 2 to 5 clusters; but increasing to 6 or larger does not yield substantial improvements; hence, we can conclude that 5 is the optimal number of clusters."
]
},
{
"cell_type": "markdown",
"id": "e6687823",
"metadata": {},
"source": [
"### `plot.silhouette_analysis`\n",
"\n",
"We can visually represent the `silhouette_score` to assess the number of clusters. Remember that values close to +1 indicate that the clusters are well-separated. Another characteristic to consider is the size of each silhouette plot. If they're too different, it means some clusters are tiny while others are too large (see, for example the plots with `n_clusters` from 6 to 8: they all have some tiny clusters.\n",
"\n",
"Note that the silhouette score reported on each plot (top-right corner) matches our previous table. Again, we see that the value is maximized when `n_clusters=5`, and that all clusters have similar size and silhouette scores, so we choose that as the optimal value."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3db3a83c",
"metadata": {},
"outputs": [],
"source": [
"_ = plot.silhouette_analysis(X, model, range_n_clusters=(2, 3, 4, 5, 6, 7, 8))"
]
}
],
"metadata": {
"jupytext": {
"text_representation": {
"extension": ".md",
"format_name": "myst",
"format_version": 0.13,
"jupytext_version": "1.14.1"
}
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"source_map": [
12,
22,
28,
39,
45,
49,
53,
65,
93,
99,
113,
119,
121,
125,
133
]
},
"nbformat": 4,
"nbformat_minor": 5
}