Density

Density#

Intuition#

The concept of density in the context of clustering refers to regions in the data space where data points are concentrated. Calculating the density involves determining how closely packed the points are within a given region.

Calculating the Density Threshold#

In high-dimensional spaces, computing connected components to identify dense regions becomes computationally expensive and often intractable. The challenge arises because the notion of distance becomes less meaningful as the number of dimensions increases. Therefore, density-based methods like DBSCAN are preferred for their efficiency and effectiveness in handling these issues.

DBSCAN#

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular density-based clustering algorithm that is capable of finding arbitrary-shaped clusters and identifying noise in the data.

Steps of DBSCAN:#

Find All Core Points:
- A core point is defined as a point that has at least min_samples points within a radius epsilon (including the point itself).
- These core points form the backbone of the clusters.
Find the Connected Components (Clusters):
- Clusters are formed by grouping all core points that are reachable from one another.
- Two core points are directly reachable if they are within epsilon distance of each other.
- A core point can reach another core point if there exists a path of core points within epsilon distance between them.
Assign Each Boundary Point to its Closest Cluster:
- A boundary point is not a core point but falls within the epsilon radius of a core point.
- Each boundary point is assigned to the cluster of the nearest core point it can reach.
All Other Points Are Noise:
- Points that are neither core points nor boundary points are labeled as noise.
- These points do not belong to any cluster and are considered outliers.

Mathematical Formulation#

Let $ \mathcal{D} $ be a dataset of points.
For a point $ p \in \mathcal{D} $, define the neighborhood $ N_{\epsilon}(p) $ as: $$ N_{\epsilon}(p) = \{q \in \mathcal{D} \mid \text{distance}(p, q) \leq \epsilon \} $$
A point $ p $ is a core point if: $$ |N_{\epsilon}(p)| \geq \text{min\_samples} $$
A point $ q $ is directly reachable from $ p $ if $ q \in N_{\epsilon}(p) $.
A point $ q $ is reachable from $ p $ if there exists a sequence of points $ p_1, p_2, \ldots, p_n $ such that: $$ p_1 = p, \; p_n = q, \; \text{and} \; p_{i+1} \in N_{\epsilon}(p_i) \; \text{for} \; 1 \leq i < n $$
Clusters are maximal sets of reachable points.

Real-World Example#

Imagine you are analyzing geographical data to find urban areas in a region. Urban areas can be seen as high-density clusters of buildings. Here’s how DBSCAN can be applied:

Core Points: Identify core points as locations with many buildings within a small radius.
Clusters: Form clusters by connecting core points that are within a certain distance of each other.
Boundary Points: Include surrounding areas that are close to core points but not densely packed themselves.
Noise: Isolated buildings or small groups of buildings far from any core points are labeled as noise.

Visualization#

A visual representation of DBSCAN clustering might look like this:

Core Points: Densely packed points, marked as solid circles.
Clusters: Groups of connected core points, each cluster in a different color.
Boundary Points: Points near the clusters, marked with a different symbol.
Noise: Scattered points far from any cluster, marked with an ‘X’.

By using DBSCAN, you can efficiently identify clusters and noise in complex datasets, making it a versatile tool for data analysis.

If you would like a Python code demonstration of DBSCAN using PyTorch, please let me know!