
Implementing K-Means in Python
Welcome to your practical guide on implementing the K-Means clustering algorithm from scratch in Python. Whether you’re a beginner looking to understand the nuts and bolts or an enthusiast aiming to implement it yourself, this article will walk you through every step of building K-Means without relying on libraries like Scikit-learn. Let’s get started!
Understanding K-Means Clustering
K-Means is one of the most popular clustering algorithms in unsupervised machine learning. Its goal is simple: partition the data points into K distinct, non-overlapping clusters. Each data point is assigned to the cluster with the nearest mean, and that mean serves as the cluster's prototype.
The algorithm works iteratively:
- Initialize K centroids randomly.
- Assign each data point to the nearest centroid.
- Recalculate centroids as the mean of all points in the cluster.
- Repeat assignment and recalculation until centroids stabilize.
This process minimizes the within-cluster variance, making clusters as compact as possible.
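More formally, for clusters $C_1, \dots, C_K$ with means $\mu_1, \dots, \mu_K$, the algorithm tries to minimize the within-cluster sum of squares (often called inertia):

$$ J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2 $$

Both the assignment step and the centroid update can only decrease (or leave unchanged) this value, which is why the iterations eventually stabilize.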
| Step | Description |
|---|---|
| 1 | Randomly initialize K centroids |
| 2 | Assign points to nearest centroid |
| 3 | Recompute centroids based on assigned points |
| 4 | Repeat steps 2-3 until convergence |
One key challenge in K-Means is choosing the right value for K. Methods like the Elbow Method or Silhouette Analysis can help, but for this implementation, we’ll assume K is given.
Let’s now look at how you can build this step-by-step in Python.
Implementing K-Means from Scratch
To implement K-Means, you’ll need basic Python and NumPy. NumPy helps with efficient numerical operations, which is crucial when working with large datasets.
Start by importing NumPy:
```python
import numpy as np
```
Next, define a function to compute the Euclidean distance between points. This will help in assigning points to the nearest centroid:
```python
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))
```
Now, create the KMeans class. The class will have methods for initialization, fitting, and prediction.
```python
class KMeans:
    def __init__(self, K=5, max_iters=100, plot_steps=False):
        self.K = K
        self.max_iters = max_iters
        self.plot_steps = plot_steps
        # list of sample indices for each cluster
        self.clusters = [[] for _ in range(self.K)]
        # the centers (mean feature vector) for each cluster
        self.centroids = []
```
The `__init__` method initializes the number of clusters K, the maximum number of iterations, and an optional plotting flag. It also sets up empty lists for the clusters and centroids.
The core of the algorithm lives in the `fit` method:
```python
def fit(self, X):
    self.X = X
    self.n_samples, self.n_features = X.shape

    # initialize centroids
    random_sample_idxs = np.random.choice(self.n_samples, self.K, replace=False)
    self.centroids = [self.X[idx] for idx in random_sample_idxs]

    # optimize clusters
    for _ in range(self.max_iters):
        # assign samples to closest centroids (create clusters)
        self.clusters = self._create_clusters(self.centroids)

        # calculate new centroids from the clusters
        centroids_old = self.centroids
        self.centroids = self._get_centroids(self.clusters)

        # check if converged
        if self._is_converged(centroids_old, self.centroids):
            break
```
Here, we first initialize centroids by randomly selecting K data points. Then, we iterate: assign points to clusters, recalculate centroids, and check for convergence.
Now, let’s implement the helper methods:
```python
def _create_clusters(self, centroids):
    # assign the index of each sample to its nearest centroid
    clusters = [[] for _ in range(self.K)]
    for idx, sample in enumerate(self.X):
        centroid_idx = self._closest_centroid(sample, centroids)
        clusters[centroid_idx].append(idx)
    return clusters

def _closest_centroid(self, sample, centroids):
    # index of the centroid with the smallest distance to the sample
    distances = [euclidean_distance(sample, point) for point in centroids]
    closest_idx = np.argmin(distances)
    return closest_idx

def _get_centroids(self, clusters):
    # mean of the samples assigned to each cluster
    centroids = np.zeros((self.K, self.n_features))
    for cluster_idx, cluster in enumerate(clusters):
        cluster_mean = np.mean(self.X[cluster], axis=0)
        centroids[cluster_idx] = cluster_mean
    return centroids

def _is_converged(self, centroids_old, centroids):
    # converged when no centroid moved between iterations
    distances = [euclidean_distance(centroids_old[i], centroids[i]) for i in range(self.K)]
    return sum(distances) == 0
```
Each method plays a specific role:
- `_create_clusters` assigns each sample to the nearest centroid.
- `_closest_centroid` finds the index of the closest centroid for a sample.
- `_get_centroids` calculates the new centroids as the mean of points in each cluster.
- `_is_converged` checks if centroids have stopped moving.
Finally, add a predict method to assign new points to clusters:
```python
def predict(self, X):
    predictions = []
    for sample in X:
        idx = self._closest_centroid(sample, self.centroids)
        predictions.append(idx)
    return np.array(predictions)
```
With this, your K-Means implementation is complete! Let’s test it on sample data.
Testing Your Implementation
Generate some sample data using NumPy or Scikit-learn's `make_blobs` for simplicity. Here's how you can create and visualize clusters:
```python
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)

kmeans = KMeans(K=4)
kmeans.fit(X)
predictions = kmeans.predict(X)

# Plotting clusters
fig, ax = plt.subplots(figsize=(12, 8))
for i, index in enumerate(kmeans.clusters):
    point = X[index].T
    ax.scatter(*point)
for point in kmeans.centroids:
    ax.scatter(*point, marker="x", color="black", linewidth=2)
plt.show()
```
This code generates blob data, fits the K-Means model, and plots the clusters with centroids.
When you run it, you should see four distinct clusters, with each centroid sitting near the middle of its group.
It’s important to note that K-Means is sensitive to initial centroid placement. Different runs might yield different results. To mitigate this, you can run the algorithm multiple times and choose the best outcome.
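One simple way to do this is a restart loop that keeps the run with the lowest total within-cluster squared distance. Here is a minimal sketch built on the class above; `fit_best_of` and `n_restarts` are just illustrative names:

```python
def fit_best_of(X, K, n_restarts=10):
    # run K-Means several times and keep the model whose clusters
    # have the smallest total squared distance to their centroids
    best_model, best_score = None, np.inf
    for _ in range(n_restarts):
        model = KMeans(K=K)
        model.fit(X)
        labels = model.predict(X)
        score = sum(
            np.sum((X[labels == k] - model.centroids[k]) ** 2)
            for k in range(K)
        )
        if score < best_score:
            best_model, best_score = model, score
    return best_model

best_kmeans = fit_best_of(X, K=4)
```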
Another point: K-Means works best on spherical clusters of similar size. It may struggle with elongated clusters or clusters of varying densities.
Now, let’s discuss how to evaluate your clusters.
Evaluating Cluster Quality
Since clustering is unsupervised, there’s no single right answer, but you can use metrics to assess quality. Two common metrics are:
- Inertia: Sum of squared distances of samples to their closest cluster center. Lower inertia means tighter clusters.
- Silhouette Score: Measures how similar a point is to its own cluster compared to other clusters. Ranges from -1 to 1; higher is better.
You can compute inertia easily:
```python
def inertia(X, centroids, labels):
    total_inertia = 0
    for i in range(len(centroids)):
        cluster_points = X[labels == i]
        total_inertia += np.sum((cluster_points - centroids[i]) ** 2)
    return total_inertia
```
For silhouette score, use Scikit-learn’s implementation:
```python
from sklearn.metrics import silhouette_score

score = silhouette_score(X, predictions)
```
These metrics help you compare different K values or initializations.
Choosing the Right K
Selecting the appropriate number of clusters is critical. Here’s how you can use the Elbow Method:
- Run K-Means for a range of K values.
- Calculate inertia for each K.
- Plot K against inertia.
- Look for the “elbow” where inertia stops decreasing significantly.
Example code:
```python
inertias = []
for k in range(1, 10):
    kmeans = KMeans(K=k)
    kmeans.fit(X)
    inertias.append(inertia(X, kmeans.centroids, kmeans.predict(X)))

plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
```
The plot helps you identify the optimal K: the point where adding more clusters no longer reduces inertia by much.
| K Value | Inertia |
|---|---|
| 1 | 2500.5 |
| 2 | 1200.3 |
| 3 | 700.8 |
| 4 | 300.2 |
| 5 | 290.1 |
| 6 | 285.0 |
From the table, you might choose K=4 since inertia drops significantly until then and plateaus afterward.
Remember, domain knowledge also plays a role. Sometimes, the right K is determined by the problem context.
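You can also cross-check the elbow with silhouette scores, which give an independent signal. A small sketch reusing `silhouette_score` from earlier (note that the silhouette is only defined for K ≥ 2):

```python
for k in range(2, 7):
    km = KMeans(K=k)
    km.fit(X)
    print(k, silhouette_score(X, km.predict(X)))
```

The K with the highest silhouette score should ideally agree with the elbow; if the two disagree, lean on domain knowledge.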
Limitations and Considerations
While K-Means is powerful, it has limitations:
- Sensitive to initialization: Different starting centroids can lead to different results. Use K-Means++ for better initialization.
- Assumes spherical clusters: Struggles with non-globular shapes.
- Requires specifying K: You need to choose the number of clusters beforehand.
For non-spherical clusters, consider algorithms like DBSCAN or Gaussian Mixture Models.
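For instance, a quick comparison with DBSCAN takes only a few lines with Scikit-learn. The `eps` and `min_samples` values below are illustrative and usually need tuning for your data:

```python
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)  # illustrative parameters, tune for your data
db_labels = db.fit_predict(X)        # points labeled -1 are treated as noise
```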
You can improve your implementation by adding K-Means++ initialization. Here’s a simple way:
```python
def initialize_centroids(X, K):
    # pick the first centroid uniformly at random
    centroids = [X[np.random.randint(X.shape[0])]]
    for _ in range(1, K):
        # squared distance from each point to its nearest chosen centroid
        dists = np.array([min([euclidean_distance(x, c) for c in centroids]) ** 2 for x in X])
        # sample the next centroid with probability proportional to that squared distance
        probs = dists / dists.sum()
        cumulative_probs = probs.cumsum()
        r = np.random.rand()
        for j, p in enumerate(cumulative_probs):
            if r < p:
                centroids.append(X[j])
                break
    return centroids
```
This method spreads out the initial centroids, often leading to better and faster convergence.
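To plug it into the class above, you could swap the random sampling in `fit` for this function, for example via a small subclass. This is a sketch assuming `initialize_centroids` is defined as shown; the name `KMeansPP` is just illustrative:

```python
class KMeansPP(KMeans):
    def fit(self, X):
        self.X = X
        self.n_samples, self.n_features = X.shape
        # K-Means++ seeding instead of uniform random sampling
        self.centroids = initialize_centroids(X, self.K)

        for _ in range(self.max_iters):
            self.clusters = self._create_clusters(self.centroids)
            centroids_old = self.centroids
            self.centroids = self._get_centroids(self.clusters)
            if self._is_converged(centroids_old, self.centroids):
                break
```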
Practical Tips for Use
When applying K-Means:
- Standardize your data: Features on different scales can distort distances. Use StandardScaler from Scikit-learn.
- Handle large datasets wisely: K-Means can be slow for huge data. Consider Mini-Batch K-Means.
- Use for preprocessing: Clustering can be a feature engineering step for supervised learning.
Example of standardization:
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans.fit(X_scaled)
```
This ensures all features contribute equally to the distance calculations.
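For the large-dataset tip above, Scikit-learn also ships a mini-batch variant that updates centroids from small random batches instead of the full dataset on every iteration. A minimal sketch (the `batch_size` value is illustrative):

```python
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=4, batch_size=256, random_state=42)
mbk_labels = mbk.fit_predict(X_scaled)
```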
Comparing with Scikit-learn
It’s good practice to compare your implementation with Scikit-learn’s KMeans to check for correctness:
```python
from sklearn.cluster import KMeans as SKKMeans

sk_kmeans = SKKMeans(n_clusters=4)
sk_labels = sk_kmeans.fit_predict(X)
```
Check if the clusters and centroids are similar. Small differences might occur due to random initialization, but overall patterns should match.
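Because cluster labels are arbitrary (your cluster 0 may be Scikit-learn's cluster 2), compare the partitions rather than the raw label values. The adjusted Rand index is a convenient permutation-invariant measure:

```python
from sklearn.metrics import adjusted_rand_score

# 1.0 means the two label assignments describe exactly the same partition
print(adjusted_rand_score(predictions, sk_labels))
```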
You’ve now built a functional K-Means algorithm in Python! This exercise deepens your understanding of how clustering works under the hood.
Feel free to experiment with different datasets, try improving the initialization, or even extend it to handle more complex scenarios. Happy coding!