Introduction
This lesson and the next utilize what are known as unsupervised learning algorithms. Unsupervised algorithms don't use a target; instead, their purpose is to learn a property of the data, representing the structure of the features in a specific way. In the context of feature engineering for prediction, you could think of an unsupervised algorithm as a "feature discovery" technique.
Clustering simply means assigning data points to groups based on how similar the points are to each other. A clustering algorithm makes "birds of a feather flock together," so to speak.
When used for feature engineering, we could attempt to discover groups of customers representing a market segment, for instance, or geographic areas that share similar weather patterns. Adding a feature of cluster labels can help machine learning models untangle complicated relationships of space or proximity.
Cluster Labels as a Feature
Applied to a single real-valued feature, clustering acts like a traditional "binning" or discretization transform. On multiple features, it's akin to "multi-dimensional binning" (sometimes referred to as vector quantization).
Left: Clustering a single feature. Right: Clustering across two features.
![Cluster Labels]()
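As an illustration of the point above, here is a minimal sketch (on a made-up `price` column, not the lesson's data) comparing ordinary binning with clustering the same single feature:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical single real-valued feature
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.gamma(shape=2.0, scale=100.0, size=500)})

# Traditional binning: equal-width intervals
df["price_bin"] = pd.cut(df["price"], bins=5, labels=False)

# Clustering the same feature: the "bins" land where the data is dense
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
df["price_cluster"] = kmeans.fit_predict(df[["price"]])

print(df.head())
```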
Added to a dataframe, a feature of cluster labels might appear as follows.
| Longitude | Latitude | Cluster |
|-----------|----------|---------|
| -93.619   | 42.054   | 3       |
| -93.619   | 42.053   | 3       |
| -93.638   | 42.060   | 1       |
| -93.602   | 41.988   | 0       |
It's important to remember that this Cluster feature is categorical. Here, it's shown with a label encoding (i.e., as a sequence of integers), as a typical clustering algorithm would produce; depending on your model, a one-hot encoding may be more appropriate.
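A minimal sketch of producing such a column with scikit-learn, using the handful of coordinates from the table above (the choice of two clusters is arbitrary):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Sample coordinates like those in the table above
X = pd.DataFrame({
    "Longitude": [-93.619, -93.619, -93.638, -93.602],
    "Latitude": [42.054, 42.053, 42.060, 41.988],
})

# Fit k-means and attach the labels as a new column
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
X["Cluster"] = kmeans.fit_predict(X[["Longitude", "Latitude"]])

# The labels are categorical, so cast them before encoding
X["Cluster"] = X["Cluster"].astype("category")

# If your model prefers it, use a one-hot encoding instead
X_onehot = pd.get_dummies(X, columns=["Cluster"], prefix="Cluster")
```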
The motivating idea for adding cluster labels is that they will break down complicated relationships across features into simpler, more manageable chunks. Our model can then just learn the simpler chunks one by one instead of having to understand the complex whole all at once. It's a "divide and conquer" strategy.
![Divide and conquer]()
Clustering the YearBuilt feature enables this linear model to learn its relationship with SalePrice.
The figure shows how clustering can improve a simple linear model. The curved relationship between YearBuilt and SalePrice is too complex for this type of model -- it underfits. On smaller chunks, however, the relationship is almost linear, and the model can learn easily.
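A rough sketch of that strategy, using synthetic data in place of the actual YearBuilt and SalePrice columns: cluster YearBuilt, then interact it with the one-hot cluster labels so a single linear model can fit a separate line inside each chunk.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for YearBuilt vs SalePrice with a curved relationship
rng = np.random.default_rng(0)
year = rng.integers(1880, 2011, size=400)
price = 50_000 + 0.02 * (year - 1880) ** 3 + rng.normal(0, 10_000, size=400)
df = pd.DataFrame({"YearBuilt": year, "SalePrice": price})

# Chunk YearBuilt into clusters
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
df["Cluster"] = kmeans.fit_predict(df[["YearBuilt"]])

# Interact YearBuilt with the one-hot cluster labels so the linear model
# can learn a separate slope and intercept inside each chunk
onehot = pd.get_dummies(df["Cluster"], prefix="Cluster")
X = pd.concat(
    [onehot, onehot.mul(df["YearBuilt"], axis=0).add_suffix("_x_Year")], axis=1
)

model = LinearRegression(fit_intercept=False).fit(X, df["SalePrice"])
print(model.score(X, df["SalePrice"]))
```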
k-Means Clustering
There are numerous clustering algorithms. They differ primarily in how they measure "similarity" or "proximity" and in what kinds of features they work with. The algorithm we'll use, k-means, is intuitive and easy to apply in a feature engineering context. Depending on your application, another algorithm might be more appropriate.
K-means clustering measures similarity using ordinary straight-line distance (also known as Euclidean distance). It creates clusters by placing a number of points, called centroids, inside the feature space. Each point in the dataset is assigned to the cluster of the centroid to which it's closest. The "k" in "k-means" is how many centroids (that is, clusters) it creates.
K-means clustering creates a Voronoi tessellation of the feature space.
Let's review how the k-means algorithm learns the clusters and what that means for feature engineering. We'll focus on three parameters from scikit-learn's implementation: n_clusters, max_iter, and n_init.
It's a simple two-step process. The algorithm starts by randomly initializing some predefined number (n_clusters) of centroids. It then iterates over these two operations.
- Assign points to the nearest cluster centroid
- Move each centroid to minimize the distance to its points
It iterates over these two steps until the centroids no longer move, or until a maximum number of iterations has been reached (max_iter).
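For concreteness, here is a bare-bones NumPy sketch of that two-step loop. It's a teaching illustration only (for instance, it doesn't handle clusters that end up empty), not a substitute for scikit-learn's KMeans.

```python
import numpy as np

def simple_kmeans(X, n_clusters, max_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    # Start from randomly chosen data points as the initial centroids
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        new_centroids = np.array(
            [X[labels == k].mean(axis=0) for k in range(n_clusters)]
        )
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return labels, centroids
```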
It often happens that the initial random position of the centroids results in a poor clustering. For this reason, the algorithm repeats a number of times (n_init) and returns the run with the least total distance between each point and its centroid -- the best of those attempts.
The animation below shows the algorithm in action. It illustrates the dependence of the result on the initial centroids and the importance of iterating until convergence.
![Initial Centroids]()
The K-means clustering algorithm on Airbnb rentals in NYC.
You may need to increase max_iter for a large number of clusters, or n_init for a complex dataset. Ordinarily, though, the only parameter you'll need to choose yourself is n_clusters (k, that is). The best partitioning for a set of features depends on the model you're using and what you're trying to predict, so it's best to tune it like any other hyperparameter (through cross-validation, for example).
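One simple way to do that tuning, sketched here on synthetic stand-in data: recompute the cluster feature for several candidate values of k, then cross-validate a downstream model and keep the value that scores best.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: two coordinates and a target to predict
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 2)), columns=["Longitude", "Latitude"])
y = X["Longitude"] * X["Latitude"] + rng.normal(scale=0.1, size=300)

for k in [2, 4, 8, 16]:
    # For simplicity the clustering is fit on all rows;
    # a pipeline with a custom transformer would avoid that leakage
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    X_k = X.assign(Cluster=labels)
    score = cross_val_score(RandomForestRegressor(random_state=0), X_k, y, cv=5).mean()
    print(f"k={k:>2}  CV R^2={score:.3f}")
```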
We will go through the practical in the following article.