
Implementing K-Means in Python
Welcome to your practical guide on implementing the K-Means clustering algorithm from scratch in Python. Whether you’re a beginner looking to understand the nuts and bolts or an enthusiast aiming to implement it yourself, this article will walk you through every step of building K-Means without relying on libraries like Scikit-learn. Let’s get started!
Understanding K-Means Clustering
K-Means is one of the most popular clustering algorithms in unsupervised machine learning. Its goal is simple: partition the data points into K distinct, non-overlapping clusters. Each data point is assigned to the cluster with the nearest mean, and that mean serves as the cluster's prototype.
The algorithm works iteratively:
- Initialize K centroids randomly.
- Assign each data point to the nearest centroid.
- Recalculate centroids as the mean of all points in the cluster.
- Repeat assignment and recalculation until centroids stabilize.
This process minimizes the within-cluster variance, making clusters as compact as possible.
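More formally, for clusters $C_1, \dots, C_K$ with means $\mu_1, \dots, \mu_K$, the algorithm tries to minimize the within-cluster sum of squares (often called inertia):

$$ J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2 $$

Both the assignment step and the centroid update can only decrease (or leave unchanged) this value, which is why the iterations eventually stabilize.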
| Step | Description |
|---|---|
| 1 | Randomly initialize K centroids |
| 2 | Assign points to nearest centroid |
| 3 | Recompute centroids based on assigned points |
| 4 | Repeat steps 2-3 until convergence |
One key challenge in K-Means is choosing the right value for K. Methods like the Elbow Method or Silhouette Analysis can help, but for this implementation, we’ll assume K is given.
Let’s now look at how you can build this step-by-step in Python.
Implementing K-Means from Scratch
To implement K-Means, you’ll need basic Python and NumPy. NumPy helps with efficient numerical operations, which is crucial when working with large datasets.
Start by importing NumPy:
```python
import numpy as np
```
Next, define a function to compute the Euclidean distance between points. This will help in assigning points to the nearest centroid:
```python
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))
```
Now, create the KMeans class. The class will have methods for initialization, fitting, and prediction.
```python
class KMeans:
    def __init__(self, K=5, max_iters=100, plot_steps=False):
        self.K = K
        self.max_iters = max_iters
        self.plot_steps = plot_steps
        # list of sample indices for each cluster
        self.clusters = [[] for _ in range(self.K)]
        # the centers (mean feature vector) for each cluster
        self.centroids = []
```
The `__init__` method initializes the number of clusters K, the maximum number of iterations, and an optional plotting flag. It also sets up empty lists for the clusters and centroids.
The core of the algorithm lives in the `fit` method:
```python
def fit(self, X):
    self.X = X
    self.n_samples, self.n_features = X.shape

    # initialize centroids
    random_sample_idxs = np.random.choice(self.n_samples, self.K, replace=False)
    self.centroids = [self.X[idx] for idx in random_sample_idxs]

    # optimize clusters
    for _ in range(self.max_iters):
        # assign samples to closest centroids (create clusters)
        self.clusters = self._create_clusters(self.centroids)

        # calculate new centroids from the clusters
        centroids_old = self.centroids
        self.centroids = self._get_centroids(self.clusters)

        # check if converged
        if self._is_converged(centroids_old, self.centroids):
            break
```
Here, we first initialize centroids by randomly selecting K data points. Then, we iterate: assign points to clusters, recalculate centroids, and check for convergence.
Now, let’s implement the helper methods:
```python
def _create_clusters(self, centroids):
    # assign the index of each sample to its nearest centroid
    clusters = [[] for _ in range(self.K)]
    for idx, sample in enumerate(self.X):
        centroid_idx = self._closest_centroid(sample, centroids)
        clusters[centroid_idx].append(idx)
    return clusters

def _closest_centroid(self, sample, centroids):
    # index of the centroid with the smallest distance to the sample
    distances = [euclidean_distance(sample, point) for point in centroids]
    closest_idx = np.argmin(distances)
    return closest_idx

def _get_centroids(self, clusters):
    # mean of the samples assigned to each cluster
    centroids = np.zeros((self.K, self.n_features))
    for cluster_idx, cluster in enumerate(clusters):
        cluster_mean = np.mean(self.X[cluster], axis=0)
        centroids[cluster_idx] = cluster_mean
    return centroids

def _is_converged(self, centroids_old, centroids):
    # converged when no centroid moved between iterations
    distances = [euclidean_distance(centroids_old[i], centroids[i]) for i in range(self.K)]
    return sum(distances) == 0
```
Each method plays a specific role:
- `_create_clusters` assigns each sample to the nearest centroid.
- `_closest_centroid` finds the index of the closest centroid for a sample.
- `_get_centroids` calculates the new centroids as the mean of points in each cluster.
- `_is_converged` checks if centroids have stopped moving.
Finally, add a predict method to assign new points to clusters:
```python
def predict(self, X):
    predictions = []
    for sample in X:
        idx = self._closest_centroid(sample, self.centroids)
        predictions.append(idx)
    return np.array(predictions)
```
With this, your K-Means implementation is complete! Let’s test it on sample data.
Testing Your Implementation
Generate some sample data using NumPy or Scikit-learn's `make_blobs` for simplicity. Here's how you can create and visualize clusters:
```python
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)

kmeans = KMeans(K=4)
kmeans.fit(X)
predictions = kmeans.predict(X)

# Plotting clusters
fig, ax = plt.subplots(figsize=(12, 8))
for i, index in enumerate(kmeans.clusters):
    point = X[index].T
    ax.scatter(*point)
for point in kmeans.centroids:
    ax.scatter(*point, marker="x", color="black", linewidth=2)
plt.show()
```
This code generates blob data, fits the K-Means model, and plots the clusters with centroids.
When you run it, you should see four distinct clusters, with each centroid sitting near the middle of its group.
It’s important to note that K-Means is sensitive to initial centroid placement. Different runs might yield different results. To mitigate this, you can run the algorithm multiple times and choose the best outcome.
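One simple way to do this is a restart loop that keeps the run with the lowest total within-cluster squared distance. Here is a minimal sketch built on the class above; `fit_best_of` and `n_restarts` are just illustrative names:

```python
def fit_best_of(X, K, n_restarts=10):
    # run K-Means several times and keep the model whose clusters
    # have the smallest total squared distance to their centroids
    best_model, best_score = None, np.inf
    for _ in range(n_restarts):
        model = KMeans(K=K)
        model.fit(X)
        labels = model.predict(X)
        score = sum(
            np.sum((X[labels == k] - model.centroids[k]) ** 2)
            for k in range(K)
        )
        if score < best_score:
            best_model, best_score = model, score
    return best_model

best_kmeans = fit_best_of(X, K=4)
```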
Another point: K-Means works best on spherical clusters of similar size. It may struggle with elongated clusters or clusters of varying densities.
Now, let’s discuss how to evaluate your clusters.
Evaluating Cluster Quality
Since clustering is unsupervised, there’s no single right answer, but you can use metrics to assess quality. Two common metrics are:
- Inertia: Sum of squared distances of samples to their closest cluster center. Lower inertia means tighter clusters.
- Silhouette Score: Measures how similar a point is to its own cluster compared to other clusters. Ranges from -1 to 1; higher is better.
You can compute inertia easily:
```python
def inertia(X, centroids, labels):
    total_inertia = 0
    for i in range(len(centroids)):
        cluster_points = X[labels == i]
        total_inertia += np.sum((cluster_points - centroids[i]) ** 2)
    return total_inertia
```
For silhouette score, use Scikit-learn’s implementation:
```python
from sklearn.metrics import silhouette_score

score = silhouette_score(X, predictions)
```
These metrics help you compare different K values or initializations.
Choosing the Right K
Selecting the appropriate number of clusters is critical. Here’s how you can use the Elbow Method:
- Run K-Means for a range of K values.
- Calculate inertia for each K.
- Plot K against inertia.
- Look for the “elbow” where inertia stops decreasing significantly.
Example code:
```python
inertias = []
for k in range(1, 10):
    kmeans = KMeans(K=k)
    kmeans.fit(X)
    inertias.append(inertia(X, kmeans.centroids, kmeans.predict(X)))

plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
```
The plot helps you identify the optimal K: the point where adding more clusters no longer reduces inertia by much.
| K Value | Inertia |
|---|---|
| 1 | 2500.5 |
| 2 | 1200.3 |
| 3 | 700.8 |
| 4 | 300.2 |
| 5 | 290.1 |
| 6 | 285.0 |
From the table, you might choose K=4 since inertia drops significantly until then and plateaus afterward.
Remember, domain knowledge also plays a role. Sometimes, the right K is determined by the problem context.
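You can also cross-check the elbow with silhouette scores, which give an independent signal. A small sketch reusing `silhouette_score` from earlier (note that the silhouette is only defined for K ≥ 2):

```python
for k in range(2, 7):
    km = KMeans(K=k)
    km.fit(X)
    print(k, silhouette_score(X, km.predict(X)))
```

The K with the highest silhouette score should ideally agree with the elbow; if the two disagree, lean on domain knowledge.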
Limitations and Considerations
While K-Means is powerful, it has limitations:
- Sensitive to initialization: Different starting centroids can lead to different results. Use K-Means++ for better initialization.
- Assumes spherical clusters: Struggles with non-globular shapes.
- Requires specifying K: You need to choose the number of clusters beforehand.
For non-spherical clusters, consider algorithms like DBSCAN or Gaussian Mixture Models.
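For instance, a quick comparison with DBSCAN takes only a few lines with Scikit-learn. The `eps` and `min_samples` values below are illustrative and usually need tuning for your data:

```python
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)  # illustrative parameters, tune for your data
db_labels = db.fit_predict(X)        # points labeled -1 are treated as noise
```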
You can improve your implementation by adding K-Means++ initialization. Here’s a simple way:
```python
def initialize_centroids(X, K):
    # pick the first centroid uniformly at random
    centroids = [X[np.random.randint(X.shape[0])]]
    for _ in range(1, K):
        # squared distance from each point to its nearest chosen centroid
        dists = np.array([min([euclidean_distance(x, c) for c in centroids]) ** 2 for x in X])
        # sample the next centroid with probability proportional to that squared distance
        probs = dists / dists.sum()
        cumulative_probs = probs.cumsum()
        r = np.random.rand()
        for j, p in enumerate(cumulative_probs):
            if r < p:
                centroids.append(X[j])
                break
    return centroids
```
This method spreads out the initial centroids, often leading to better and faster convergence.
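To plug it into the class above, you could swap the random sampling in `fit` for this function, for example via a small subclass. This is a sketch assuming `initialize_centroids` is defined as shown; the name `KMeansPP` is just illustrative:

```python
class KMeansPP(KMeans):
    def fit(self, X):
        self.X = X
        self.n_samples, self.n_features = X.shape
        # K-Means++ seeding instead of uniform random sampling
        self.centroids = initialize_centroids(X, self.K)

        for _ in range(self.max_iters):
            self.clusters = self._create_clusters(self.centroids)
            centroids_old = self.centroids
            self.centroids = self._get_centroids(self.clusters)
            if self._is_converged(centroids_old, self.centroids):
                break
```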
Practical Tips for Use
When applying K-Means:
- Standardize your data: Features on different scales can distort distances. Use StandardScaler from Scikit-learn.
- Handle large datasets wisely: K-Means can be slow for huge data. Consider Mini-Batch K-Means.
- Use for preprocessing: Clustering can be a feature engineering step for supervised learning.
Example of standardization:
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans.fit(X_scaled)
```
This ensures all features contribute equally to the distance calculations.
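For the large-dataset tip above, Scikit-learn also ships a mini-batch variant that updates centroids from small random batches instead of the full dataset on every iteration. A minimal sketch (the `batch_size` value is illustrative):

```python
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=4, batch_size=256, random_state=42)
mbk_labels = mbk.fit_predict(X_scaled)
```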
Comparing with Scikit-learn
It’s good practice to compare your implementation with Scikit-learn’s KMeans to check for correctness:
```python
from sklearn.cluster import KMeans as SKKMeans

sk_kmeans = SKKMeans(n_clusters=4)
sk_labels = sk_kmeans.fit_predict(X)
```
Check if the clusters and centroids are similar. Small differences might occur due to random initialization, but overall patterns should match.
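Because cluster labels are arbitrary (your cluster 0 may be Scikit-learn's cluster 2), compare the partitions rather than the raw label values. The adjusted Rand index is a convenient permutation-invariant measure:

```python
from sklearn.metrics import adjusted_rand_score

# 1.0 means the two label assignments describe exactly the same partition
print(adjusted_rand_score(predictions, sk_labels))
```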
You’ve now built a functional K-Means algorithm in Python! This exercise deepens your understanding of how clustering works under the hood.
Feel free to experiment with different datasets, try improving the initialization, or even extend it to handle more complex scenarios. Happy coding!