GMM#
GMM => Gaussian Mixture Models
Gaussian Mixture Models (GMM) are a type of probabilistic model used for clustering that generalizes the k-means algorithm by using a mixture of Gaussian distributions. Let’s dive deeper into the concepts and workings of GMM.
Differences from k-means#
K-means makes hard assignments: each data point belongs to exactly one cluster, the nearest one by distance, which produces linear (straight-line) boundaries between clusters.
GMM, on the other hand, uses probabilistic "soft" assignments, where each data point has a probability of belonging to each cluster. Because each cluster also has its own covariance, this results in more flexible, curved cluster boundaries.
Multivariate Gaussian Distribution#
A multivariate Gaussian distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. It is defined by a mean vector \(\mu\) and a covariance matrix \(\Sigma\).
The probability density function of a multivariate Gaussian distribution is given by:

\[ \mathcal{N}(x | \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right) \]
where:
\(x\) is a \(k\)-dimensional data point.
\(\mu\) is the mean vector.
\(\Sigma\) is the covariance matrix.
\(|\Sigma|\) is the determinant of the covariance matrix.
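The density formula above can be translated almost directly into NumPy. This is a minimal illustrative sketch (the function name `gaussian_pdf` is our own, not from any library):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a k-dimensional Gaussian N(mu, sigma) at point x."""
    k = len(mu)
    diff = x - mu
    # Normalizing constant: 1 / sqrt((2*pi)^k * |Sigma|)
    norm = 1.0 / np.sqrt((2 * np.pi) ** k * np.linalg.det(sigma))
    # Quadratic form in the exponent: -(1/2) (x - mu)^T Sigma^{-1} (x - mu)
    exponent = -0.5 * diff @ np.linalg.inv(sigma) @ diff
    return norm * np.exp(exponent)

# At the mean of a standard 2-D Gaussian, the density is 1 / (2*pi)
mu = np.zeros(2)
sigma = np.eye(2)
print(gaussian_pdf(mu, mu, sigma))
```

In practice, libraries compute this in log space with a Cholesky factorization for numerical stability, but the direct form above matches the equation term by term.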
Covariance#
The covariance matrix \(\Sigma\) represents the extent to which the dimensions vary from the mean with respect to each other. For a dataset \(X\) with \(n\) data points, each having \(k\) dimensions, the covariance matrix is calculated as:

\[ \Sigma = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T \]
where \(x_i\) is a data point and \(\mu\) is the mean of the data points.
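The covariance computation is a one-liner in NumPy once the data is centered; a small sketch with made-up data:

```python
import numpy as np

# Three 2-D data points (illustrative values)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

mu = X.mean(axis=0)          # mean vector over the n data points
centered = X - mu            # subtract mu from every row
cov = centered.T @ centered / len(X)  # (1/n) * sum (x_i - mu)(x_i - mu)^T
print(cov)
```

This uses the \(1/n\) normalization from the formula above; it matches `np.cov(X.T, bias=True)` (NumPy's default uses \(1/(n-1)\) instead).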
Gaussian Mixture Model#
A GMM assumes that the data points are generated from a mixture of several Gaussian distributions with unknown parameters. Each Gaussian distribution in the mixture represents a cluster.
Model#
A GMM with \(K\) components is defined as:

\[ p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x | \mu_k, \Sigma_k) \]
where:
\(\pi_k\) is the mixing coefficient for the \(k\)-th Gaussian component, representing the probability of selecting that component; the coefficients satisfy \(\pi_k \geq 0\) and \(\sum_{k=1}^{K} \pi_k = 1\).
\(\mathcal{N}(x | \mu_k, \Sigma_k)\) is the Gaussian distribution with mean \(\mu_k\) and covariance \(\Sigma_k\).
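The mixture density is just a weighted sum of Gaussian densities. A minimal sketch (the function name `gmm_density` is our own):

```python
import numpy as np

def gmm_density(x, pis, mus, sigmas):
    """Mixture density p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    total = 0.0
    for pi_k, mu_k, sigma_k in zip(pis, mus, sigmas):
        k = len(mu_k)
        diff = x - mu_k
        norm = 1.0 / np.sqrt((2 * np.pi) ** k * np.linalg.det(sigma_k))
        total += pi_k * norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma_k) @ diff)
    return total

# Two equally weighted components centered at (0, 0) and (4, 4)
pis = [0.5, 0.5]
mus = [np.zeros(2), np.full(2, 4.0)]
sigmas = [np.eye(2), np.eye(2)]
print(gmm_density(np.zeros(2), pis, mus, sigmas))
```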
Expectation-Maximization Algorithm#
Since maximizing the likelihood directly is difficult (the log-likelihood contains a logarithm of a sum, which has no closed-form maximum), we use the Expectation-Maximization (EM) algorithm to estimate the parameters.
Expectation Step (E-step): Calculate the posterior probability that each data point belongs to each Gaussian component.
\[ \gamma_{i,k} = \frac{\pi_k \mathcal{N}(x_i | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_i | \mu_j, \Sigma_j)} \]

Maximization Step (M-step): Update the parameters (means, covariances, and mixing coefficients) using the posterior probabilities.
\[ \mu_k = \frac{\sum_{i=1}^{n} \gamma_{i,k} x_i}{\sum_{i=1}^{n} \gamma_{i,k}} \]

\[ \Sigma_k = \frac{\sum_{i=1}^{n} \gamma_{i,k} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{n} \gamma_{i,k}} \]

\[ \pi_k = \frac{1}{n} \sum_{i=1}^{n} \gamma_{i,k} \]
Repeat the E-step and M-step until convergence.
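The E- and M-step updates above can be sketched in a few dozen lines of NumPy. This is an illustrative implementation under simplifying assumptions (fixed iteration count instead of a convergence check, a small ridge added to covariances for numerical stability); the function name `em_gmm` is our own:

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture with K components."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize: random data points as means, shared data covariance, uniform weights
    mus = X[rng.choice(n, K, replace=False)]
    sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] = pi_k N(x_i | mu_k, Sigma_k) / sum_j ...
        dens = np.empty((n, K))
        for k in range(K):
            diff = X - mus[k]
            inv = np.linalg.inv(sigmas[k])
            norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigmas[k]))
            dens[:, k] = pis[k] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted means, covariances, and mixing coefficients
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pis = Nk / n
    return pis, mus, sigmas, gamma
```

Each iteration alternates the two steps exactly as written in the equations: the E-step fills in `gamma`, and the M-step re-estimates \(\mu_k\), \(\Sigma_k\), and \(\pi_k\) from it.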
Hard Assignment vs Soft Assignment#
Hard Assignment#
In hard assignment, each data point belongs to a single cluster. This can lead to biased estimations, especially when data points are near cluster boundaries.
Soft Assignment#
In soft assignment, each data point has a probability of belonging to each cluster. This probabilistic approach leads to better estimations and more flexible cluster boundaries.
Example#
Consider a dataset with points that form two distinct clusters. Using GMM, we can model this data as a mixture of two Gaussian distributions.
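A two-cluster example like this can be fit with scikit-learn's `GaussianMixture`, which runs EM internally. A small sketch with synthetic data (the cluster centers and sizes here are made up for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two well-separated 2-D clusters
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.5, size=(100, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)       # hard assignments: one cluster per point
probs = gmm.predict_proba(X)  # soft assignments: responsibilities gamma[i, k]
print(gmm.means_)
```

`predict` gives the hard assignment (the component with the highest responsibility), while `predict_proba` exposes the soft assignments discussed above; points near the boundary between the two clusters get responsibilities close to 0.5 for each component.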