Master Clustering Techniques: K-Means, DBSCAN, and Hierarchical Clustering
In the last lesson, we covered model evaluation methods like cross-validation, precision, recall, and F1 score. These tools help us measure how well our models perform, which is key to building reliable machine learning systems. Now, we move to unsupervised learning, where we explore patterns in data without labeled outcomes. This lesson focuses on clustering, a method that groups similar data points together.
What is Clustering?
Clustering is a way to find hidden patterns in data. Unlike supervised learning, where we know the labels, clustering works with unlabeled data. It groups data points based on their similarities. For example, imagine you have customer data. You might want to group customers who buy similar products. Clustering helps you do that.
There are many clustering methods, but we will focus on three: K-Means, DBSCAN, and Hierarchical clustering. Each method has its strengths and weaknesses, which we will explore. By the end of this lesson, you will know how to choose the right method for your data and how to check if your clusters make sense.
K-Means Clustering
K-Means is one of the most widely used clustering methods. It divides data into K groups, where K is a number you choose. The algorithm starts by picking K starting points, called centroids. In the classic version these are chosen at random; scikit-learn defaults to a smarter scheme called k-means++. Then, it assigns each data point to the nearest centroid. After that, it moves each centroid to the mean of the points assigned to it. These two steps repeat until the centroids stop moving or an iteration limit is reached.
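To make the assign-and-update loop concrete, here is a minimal from-scratch sketch in NumPy. It is an illustration only, not how scikit-learn implements K-Means: the function name kmeans_sketch, its parameters, and the choice to start from randomly picked data points are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=0):
    """Minimal K-Means loop: assign points, update centroids, repeat."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (an assumption of this sketch)
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].astype(float)
    for _ in range(n_iters):
        # Distance from every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Assignment step: each point joins its nearest centroid
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points,
        # keeping the old centroid if a cluster ended up empty
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

The result of a loop like this depends on where the centroids start, and an unlucky start can settle on a poor grouping. That is one reason library implementations usually try several starts and keep the best one.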
Let me share a use-case I faced. I once worked on a project where I needed to group customers based on their spending habits. I used K-Means to create three groups: low, medium, and high spenders. This helped the marketing team target each group with tailored offers.
Here’s how you can implement K-Means in Python using Scikit-Learn:
from sklearn.cluster import KMeans
import numpy as np
# Sample data
data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
# Create K-Means model (n_init=10 tries 10 different starts and keeps the best)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(data)
# Get cluster labels
print(kmeans.labels_)
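This code groups the data into two clusters: the three points with x = 1 fall in one cluster and the three points with x = 10 in the other. Which group is called 0 and which is called 1 is arbitrary. You can change the number of clusters by adjusting n_clusters.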
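Once the model is fitted, you will often want more than the labels. The sketch below repeats the same sample data and shows three things a fitted KMeans object exposes: the centroid coordinates, the inertia, and a way to assign new points. The new_points array is made up for illustration.

```python
from sklearn.cluster import KMeans
import numpy as np

data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Coordinates of the two learned centroids
print(kmeans.cluster_centers_)

# Sum of squared distances from each point to its centroid (lower means tighter clusters)
print(kmeans.inertia_)

# Assign previously unseen points (hypothetical values) to the nearest learned centroid
new_points = np.array([[0, 0], [12, 3]])
new_labels = kmeans.predict(new_points)
print(new_labels)
```

Because label numbers are arbitrary, compare new labels against the labels of known points rather than against fixed numbers.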
DBSCAN Clustering
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. Unlike K-Means, DBSCAN does not need you to choose the number of clusters. Instead, it finds clusters based on the density of data points. It groups points that are close to each other and marks points in low-density areas as noise.
I once used DBSCAN to detect outliers in a dataset. The data had many points that did not fit into any group. DBSCAN helped me find these outliers and clean the data.
Here’s an example of DBSCAN in Python:
from sklearn.cluster import DBSCAN
import numpy as np
# Sample data
data = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
# Create DBSCAN model
dbscan = DBSCAN(eps=3, min_samples=2)
dbscan.fit(data)
# Get cluster labels
print(dbscan.labels_)
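In this code, eps is the maximum distance at which two points count as neighbors. min_samples is the number of points, including the point itself, that must lie within eps of a point for it to be treated as the core of a cluster. Points that end up in no cluster are labeled -1, which marks them as noise.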
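If the goal is to find and remove outliers, as in the project mentioned above, one simple approach is to filter on the noise label. The sketch below reuses the same sample data and settings; the variable names noise_mask and n_clusters are just illustrative.

```python
from sklearn.cluster import DBSCAN
import numpy as np

data = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
labels = DBSCAN(eps=3, min_samples=2).fit_predict(data)

# DBSCAN labels noise points as -1; real clusters are numbered from 0
noise_mask = labels == -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("Labels:", labels)
print("Number of clusters:", n_clusters)
print("Noise points:", data[noise_mask])
print("Cleaned data:", data[~noise_mask])
```

Here the point [25, 80] is far from everything else, so it is the only one flagged as noise. The other five points form two clusters.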
Hierarchical Clustering
Hierarchical clustering builds a tree-like structure of clusters. The most common version, called agglomerative clustering, works from the bottom up. It starts by treating each data point as its own cluster. Then, it repeatedly merges the two closest clusters until all points are in one cluster. You can cut the tree at any level to get the number of clusters you want.
I used hierarchical clustering in a project where I needed to group similar documents. The tree structure helped me see how documents were related at different levels.
Here’s how to implement hierarchical clustering in Python:
from sklearn.cluster import AgglomerativeClustering
import numpy as np
# Sample data
data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
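# Create hierarchical clustering model
# (Ward linkage always uses Euclidean distance; the old `affinity`
# argument was removed in scikit-learn 1.4)
cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')
# Fit the model and get cluster labels
labels = cluster.fit_predict(data)
print(labels)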
This code groups the data into two clusters using the Ward method, which minimizes the variance within clusters.
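AgglomerativeClustering returns one flat set of labels. To explore the tree itself and cut it at different levels, one option is SciPy's hierarchy module. SciPy is not used elsewhere in this lesson, so treat it as an assumption of this sketch. The sample data here is made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three small, well-separated groups (illustrative data, not from the lesson)
data = np.array([
    [0, 0], [0, 1], [1, 0],
    [10, 10], [10, 11], [11, 10],
    [20, 2], [20, 3], [21, 2],
])

# Build the full merge tree once, using Ward linkage
tree = linkage(data, method="ward")

# Cut the same tree at two different levels
labels_3 = fcluster(tree, t=3, criterion="maxclust")
labels_2 = fcluster(tree, t=2, criterion="maxclust")

print("3 clusters:", labels_3)
print("2 clusters:", labels_2)
```

The tree is built only once and then cut twice, which is the practical advantage of the hierarchical approach. If you want to see the tree, scipy.cluster.hierarchy.dendrogram can plot it from the same linkage result.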
Choosing the Right Clustering Algorithm
Choosing the right algorithm depends on your data and goals. K-Means works well when you know roughly how many clusters to expect and the clusters are compact, roughly round, and similar in size. DBSCAN suits data with noise and irregularly shaped clusters, though a single eps setting can struggle when clusters differ widely in density. Hierarchical clustering is useful when you want to explore relationships at different levels.
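A small experiment can make the difference visible. The sketch below builds two concentric rings by hand and runs K-Means and DBSCAN on them. The dataset, the eps and min_samples values, and the helper rings_recovered are all assumptions chosen for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Two concentric rings of 40 points each (hand-built, illustrative data)
angles = np.linspace(0, 2 * np.pi, 40, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])
rings = np.vstack([circle, 5 * circle])  # inner radius 1, outer radius 5

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(rings)
dbscan_labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(rings)

def rings_recovered(labels):
    """True if each ring received exactly one label and the two labels differ."""
    inner, outer = labels[:40], labels[40:]
    return bool(
        len(set(inner)) == 1 and len(set(outer)) == 1 and inner[0] != outer[0]
    )

print("K-Means recovered the rings:", rings_recovered(kmeans_labels))
print("DBSCAN recovered the rings:", rings_recovered(dbscan_labels))
```

K-Means with two centroids can only split the plane along a straight line, so it cannot separate a ring from the ring around it. DBSCAN follows the chain of nearby points along each ring instead.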
Evaluating Clustering Results
To check if your clusters make sense, you can use the silhouette score. This score measures how similar a point is to its own cluster compared to other clusters. A high score means the clusters are well-defined.
Here’s how to calculate the silhouette score in Python:
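from sklearn.metrics import silhouette_score
# The later examples reassigned `data`, so rebuild the array that
# K-Means was fitted on before scoring its labels
kmeans_data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
# Calculate silhouette score
score = silhouette_score(kmeans_data, kmeans.labels_)
print("Silhouette Score:", score)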
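The score ranges from -1 to 1. A score close to 1 means the clusters are well-separated. A score close to 0 means clusters overlap. A negative score suggests that many points may have been assigned to the wrong cluster.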
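The silhouette score can also help when you are unsure how many clusters to ask for. One common approach is to fit K-Means for several values of n_clusters and compare the scores. The sketch below does this on made-up data with three obvious groups; the range of k values tried is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three small, well-separated groups (illustrative data)
points = np.array([
    [0, 0], [0, 1], [1, 0],
    [10, 10], [10, 11], [11, 10],
    [20, 2], [20, 3], [21, 2],
])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

# Pick the k with the highest average silhouette score
best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

On real data the peak is rarely this clear, so treat the highest score as a suggestion and check that the resulting groups make sense for your problem.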
Conclusion
In this lesson, we explored clustering techniques like K-Means, DBSCAN, and Hierarchical clustering. We learned how to choose the right method and evaluate results using the silhouette score. Clustering is a powerful tool for finding patterns in unlabeled data.
In the next lesson, we will dive into Principal Component Analysis (PCA), a method to reduce the number of features in your data. This will help you work with large datasets more efficiently. Stay tuned!