Clustering is the problem of decomposing a given set $X$ into smaller disjoint sets. In this post I want to expose what makes clustering a difficult problem and, along the way, provide some intuition in general.

## What is a cluster?

Intuitively, a cluster is a set of points which is dense i.e. points within the cluster are closer to each other than the points outside the cluster.

## General definitions

We will need the following definitions throughout this post:

1. The collection of data points is called a Space and referred to as $X$.
2. Two subsets $X_i\subseteq X$ and $X_j \subseteq X$ of the space $X$ are disjoint if they do not have any points in common $X_i \cap X_j = \emptyset$.
3. A distance metric is a function $d$ that takes two individual points $x_i \in X$ and $x_j \in X$ and computes how far they are from each other. A proper metric must satisfy the following properties:

1. Non-negativity $d(x_i, x_j) \ge 0$.
2. Identity $d(x_i, x_i) = 0$.
3. Symmetry $d(x_i, x_j) = d(x_j, x_i)$.
4. Triangle Inequality $d(x_i, x_j) \le d(x_i, x_k) | d(x_k, x_j)$.

A lot of the following discussion still applies if the metric is not proper. As an example, if a metric is not symmetric

4. The Power set is the set of all possible subsets of a given set. For example, given a set $S = \{ 1,2,3 \}$, the power set of $P(S) = \{\{\emptyset\}, \{1\}, \{2\}, \{3\}, \{1,2\}, \{2,3\}, \{3,1\}, \{1,2,3\}\}$.

5. A clustering is a set of subsets of $X$, such that their union covers the entire set $X$ and none the subsets overlap. Formally, $C(X) = \{S_i\ | S_i \in P(X), \cup_i S_i = X, S_i \cap S_j = \emptyset \forall{i,j}\}$.

6. The density of a given data point measures how closely packed it is with other points. There are many ways of approximating the density function. For now, let us define the density of a point $x\in X$ to be $f(x) = \frac{1}{d_k(x)}$, where $d_k(x)$ is the distance of $x$ to it’s $k$-th near neighbor.

## General methods for clustering

### Hierarchical clustering

#### TL;DR

1. High level algorithm
1. Construct a graph from the data
2. Prune the edges of the graph
3. Return the connected components
2. Computational Complexity - generally $O(n^2 log(n))$
3. Issues
1. Noise
2. Chaining
3. Multi-modality
4. Computational Cost

#### Details

This category of algorithms, progressively construct a graph $G$ whose vertices are the data points of $X$ and the edges weights are computed using some function. The trick here is to prune edges of the graph using some optimization criteria. Finally, the clusters are recovered using the union-find data structure. Note that the algorithms in this category are quadratic in the number of data points.

These are the considerations for this class of algorithm:

1. How is the graph constructed? This is usually done by constructing some sort of a spanning tree over the data. The single linkage clustering algorithm constructs a minimum spanning tree, while the DBSCAN family of algorithms constructs the minimum spanning tree by considering a modification of the original distance metric.
2. What are the weights of the edges of the graph? There are two major families of methods here:
1. Distance based. The distance based methods represent the edge weight by the distance between the vertices connected by the edge.
2. Density based. The density based methods represent edge weight by some notion of the average density of the points.
3. What criteria will be used to prune the edges? Most methods in this family prune the edges by determining some threshold $\epsilon$ and then removing all edges whose weight is larger than that. Note that HDBSCAN uses an optimization technique to arrive at the list of edges to prune.

#### Side note - why are these methods hierarchical?

In this class of algorithms, the edges are often pruned by determining some threshold $\epsilon$ and pruning all edges whose weight is below $\epsilon$. This induces the inclusion property. Say that graph $G_a$ is the graph where we have pruned away all edges whose weight was higher than $a$. The inclusion property states that the graph $G_a$ is a subgraph of $G_b$ if $a \lt b$. Thus, if we ran union find on $G_a$ the clusters that we will find will be subsets of clusters that we will find by running union find on $G_b$. Another way of thinking about this is that the clusters are built up by composition.

### Exemplar based clustering

#### TL;DR

1. High level algorithm
1. Pick a few exemplars
2. Decompose the data based on proximity to exemplars
3. Find better exmplars
4. Repeat
2. Computational Complexity - generally $O(n)$
3. Issues
1. Noise
2. Multi-modality
3. Cluster shape

#### Details

This is the most widely-used class of algorithms for clustering, because these are trivial to implement and very easy to parallelize. K-Means is the most well recognized algorithm in this category. The algorithms generally proceed by:

1. Picking $k$ exemplar points at random. Note that $k$ is a user-supplied parameter - the number of clusters.
2. Finding a partition of the data such that each partition $X_i$ contains points that are the closest to the exemplar $i$.
3. Picking a different exemplar from within $X_i$.
4. Repeat until the sets $X_i$ stop changing or a specific number of iterations have occurred.
5. Savvy users scores such as the information gain to find the best value of $k$ and then run this process a few times over to determine whether the clusters are stable.

This class of methods is known to be NP-hard i.e. there are no guarantees associated with different runs and no provably optimal method to judge the value of $k$.

In my opinion, this class of methods is horrible for finding clusters since there is no way to reason about the algorithm and what it finds in data. That said, this is still an excellent set of algorithms for other problems in machine learning, such as feature creation.

### Model based clustering

#### TL;DR

1. High level algorithm
1. Pick a generating distribution
2. Find parameters which best match the data
3. Update the generating distribution
4. Repeat
2. Computational Complexity - depends on the optimizer. Practical use cases are generally $O(n)$
3. Issues
1. Noise
2. Cluster shape

#### Details

In this class of methods, we assume a generating distribution i.e. we assume that the data is sampled from a parameterized density function. The trick is in finding the parameters of the assumed density function that best match the data. This is usually done using Expectation-Maximization. A classic method here assumes that the data is sampled from a mixture of Gaussians and the EM algorithm determines the parameters (such as the multi-dimensional mean, variance etc.).

In many applications where the underlying distributions are theoretically known, this tends to work exceedingly well.

## What makes clustering difficult?

### The distance metric

All clustering algorithms depend on a distance metric - either explicitly such as in the case of hierarchical algorithms OR implicitly such as in the case of model based algorithms. In general, there are no theoretically-justified metrics that work for all classes of data. My advice is to try as many different distance metrics as possible and try to retroactively justify the clusters.

### Noise

Say that we define clusters to be subsets of high density. What is noise? Intuitively, noise is the subset of data which is distributed uniformly, at random, throughout the sample. As you can see, these definitions are not precise and that is what makes it difficult to define noise.

Many algorithms simply say that regions of low density are noise - usually the user is left to decide thresholds e.g. DBSCAN requires the user to supply an argument $\epsilon$ and $k$ which says that noisy points are those points which don’t have at least $k$ points in an $\epsilon$ ball around them.

### Chaining

This is related to the noise point above. One pathological form of noise that affects hierarchical clustering is the chaining problem - where the spanning tree contains links that are spurious which end up connecting clusters that should not be connected. Spanning trees constructed based on criteria such as complete linkage somewhat avoid this problem, but generally do not respect the inclusion property which makes it hard to reason about results.

### Multi-modality

Most algorithms implicitly assume that each cluster that we are looking for is equally dense. This is rarely the case in real datasets.

Another difficulty along the same lines is that there are no good ways of checking whether the algorithm missed or merged clusters of different densities. Density based hierarchical algorithms can avoid this problem to some degree.

### Shape assumption

Many algorithms have implicit or explicit constraints on the shape of the clusters. As an example, the k-means family of algorithms implicitly assumes clusters to be ‘spherical’. Similarly, model based algorithms explicitly assume a model of the shape of the cluster. Imagine you find clusters using either of these methods - how might you discover the fact that the clusters are not faithful to the underlying distribution?

Hierarchical methods fare far better on this count.

### How do we know if the clusters are any good?

There are lots of ways of assessing clusters. The general issue with all of these methods is that they essentially enforce an implicit model. Thus, evaluating ‘perfectly good’ clusters, might result in bad scores.

### Sampling

In general, clustering is a perfect candidate for sampling. Intuitively, if we sampled from data points uniformly at random, we expect to draw samples from the clusters in proportion to the density of the cluster. This intuition fails for high dimensional spaces - imagine a very high dimensional space, say constructed using a bag-of-words model of text. A general problem with high dimensional spaces is that every point is on an average equi-distant from every other point. This has two effects:

1. The regions of high density can be very small and numerous. This makes the sampling argument difficult, since we simply might not be able to sample from enough clusters for it to be a meaningful representation of the space.
2. The sample might appear to be a bunch of singletons - where no point is connected to any other point.