K-Means#
K-Means is a popular clustering algorithm used in machine learning to partition data points into \( K \) distinct clusters. The objective is to minimize the variance within each cluster.
Objective Function#
The goal of K-Means is to minimize the following objective function:
where:
\( \mathbf{x}_i \) is a data point,
\( \mathbf{\mu}_k \) is the centroid of cluster \( k \),
\( C_k \) is the set of points assigned to cluster \( k \),
\( \| \cdot \| \) denotes the Euclidean distance.
Algorithm#
The K-Means algorithm proceeds as follows:
Initialization: Choose \( K \) initial centroids randomly.
Assignment Step: Assign each data point to the nearest centroid.
Update Step: Recompute the centroids as the mean of all points assigned to each centroid.
Convergence Check: Repeat the assignment and update steps until convergence (i.e., the centroids no longer change significantly).
Detailed Steps#
Step 1: Initialization#
The initial centroids are chosen randomly from the data points or using the K-Means++ algorithm, which spreads out the initial centroids to improve convergence.
Step 2: Assignment#
Each data point \( \mathbf{x}_i \) is assigned to the nearest centroid \( \mathbf{\mu}_k \):
Step 3: Update#
The centroids are updated as the mean of the points assigned to each cluster:
Initialization by K-Means++#
K-Means++ improves the initialization step by spreading out the initial centroids. The steps are:
Randomly choose the first centroid from the data points.
For each remaining centroid, choose a point from the data set with probability proportional to its distance squared from the nearest centroid already chosen.
Repeat until \( K \) centroids are chosen.
Extension: K-Medoids#
K-Medoids is a variant of K-Means that is more robust to noise and outliers. Instead of using the mean of the points in a cluster, K-Medoids uses the median. This makes it more suitable for scenarios with non-Euclidean distances.
Algorithm#
Initialization: Choose \( K \) initial medoids randomly.
Assignment Step: Assign each data point to the nearest medoid.
Update Step: For each cluster, choose the medoid that minimizes the sum of distances to other points in the cluster.
Convergence Check: Repeat the assignment and update steps until convergence.
Objective Function for K-Medoids#
The objective function for K-Medoids is to minimize the following:
where \( \mathbf{m}_k \) is the medoid of cluster \( k \), and \( d(\cdot, \cdot) \) is a general distance metric.
Real-World Examples#
Customer Segmentation: K-Means is often used in marketing to segment customers into different groups based on their purchasing behavior.
Image Compression: By clustering pixel colors, K-Means can reduce the number of colors in an image, leading to compression.
Anomaly Detection: K-Means can help identify unusual patterns in data, such as detecting fraud in financial transactions.
K-Means and its variants are powerful tools in the machine learning toolkit, providing simple yet effective ways to analyze and group data.