For many problems in computer science, there is a natural notion of “distance”. For example, the input data might consist of real vectors, for which the usual Euclidean distance is a sensible measure. But in many cases, it makes sense to measure distances using a different “metric” that is not Euclidean distance.
There are many sophisticated algorithms for manipulating data involving these general metrics. One common paradigm for designing such algorithms is the familiar divide-and-conquer approach. Today we will discuss the fundamental algorithmic tool of partitioning metric spaces for the purpose of designing such divide-and-conquer algorithms.
This topic might seem a bit abstract and unmotivated. The next lecture will build on today’s ideas and present some algorithms whose usefulness is easier to appreciate.
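A metric is a set of points $X$ together with a “distance function” $d : X \times X \to \mathbb{R}$ such that
- “Non-negativity”: $d(x,y) \geq 0$ for all $x, y \in X$.
- $d(x,x) = 0$ for all $x \in X$ (but we allow $d(x,y) = 0$ for $x \neq y$).
- “Symmetry”: $d(x,y) = d(y,x)$ for all $x, y \in X$.
- “The triangle inequality”: $d(x,z) \leq d(x,y) + d(y,z)$ for all $x, y, z \in X$.
In some scenarios this would be called a “semimetric” or “pseudometric”; a “metric” would additionally require that $d(x,y) = 0 \iff x = y$.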
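On a finite point set, the four axioms above can be checked by brute force. The following is a minimal sketch, not part of the lecture: the function name `is_semimetric` and the tolerance parameter `tol` are assumptions introduced for illustration, and the distance function is passed in as an ordinary Python callable.

```python
from itertools import product

def is_semimetric(points, d, tol=1e-12):
    """Brute-force check of the semimetric axioms on a finite list of points.

    `d` is any callable d(x, y) -> float. `tol` (an illustrative assumption)
    absorbs floating-point rounding.
    """
    # d(x, x) = 0 for every point.
    for x in points:
        if abs(d(x, x)) > tol:
            return False
    # Non-negativity and symmetry for every ordered pair.
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol or abs(d(x, y) - d(y, x)) > tol:
            return False
    # Triangle inequality for every ordered triple.
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False
    return True
```

For instance, `is_semimetric([0, 1, 3], lambda a, b: abs(a - b))` holds, while the squared difference `(a - b) ** 2` on the points `[0, 1, 2]` violates the triangle inequality. Distinct points at distance zero are accepted, matching the semimetric convention used here.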
Here are some standard examples of metrics with which you should be familiar.
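- The Euclidean (or $\ell_2$) Metric: $X = \mathbb{R}^m$ and $d(x,y) = \|x-y\|_2 = \sqrt{\sum_{i=1}^m (x_i - y_i)^2}$.
- The Manhattan (or $\ell_1$) Metric: $X = \mathbb{R}^m$ and $d(x,y) = \|x-y\|_1 = \sum_{i=1}^m |x_i - y_i|$.
- Shortest Path Metrics: Let $G = (V,E)$ be an undirected graph with non-negative lengths associated with the edges. The set of points is $X = V$. The distance function is defined by letting $d(u,v)$ be the length of the shortest path between $u$ and $v$.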
Notice that in the first two examples the set $X$ is infinite, but in the last example $X$ is finite. When $X$ is finite we call the pair $(X,d)$ a finite metric space. We can also obtain finite metric spaces by restricting the first two examples to a finite subset of $\mathbb{R}^m$ and keeping the same distance functions. In computer science we are often only interested in finite metric spaces because the input data to the problem is finite.
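The three example metrics can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than anything prescribed by the lecture: the function names are hypothetical, the graph is assumed undirected and given as a dictionary of edge lengths, and the shortest path distances are computed with the Floyd-Warshall algorithm, which is one of several standard choices.

```python
import math

def euclidean(x, y):
    """l_2 distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """l_1 distance between two equal-length coordinate tuples."""
    return sum(abs(a - b) for a, b in zip(x, y))

def shortest_path_metric(vertices, edges):
    """All-pairs shortest path lengths via Floyd-Warshall.

    `edges` maps (u, v) to a non-negative length; each edge is treated as
    undirected (an assumption of this sketch). Returns a dict keyed by (u, v).
    Unreachable pairs keep distance math.inf.
    """
    dist = {(u, v): (0 if u == v else math.inf)
            for u in vertices for v in vertices}
    for (u, v), length in edges.items():
        dist[u, v] = min(dist[u, v], length)
        dist[v, u] = min(dist[v, u], length)
    for k in vertices:
        for u in vertices:
            for v in vertices:
                if dist[u, k] + dist[k, v] < dist[u, v]:
                    dist[u, v] = dist[u, k] + dist[k, v]
    return dist
```

On the triangle with edges `a-b` of length 1, `b-c` of length 2 and `a-c` of length 5, the distance between `a` and `c` is 3, via `b`.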
2. Lipschitz Random Partitions
Let $(X,d)$ be a metric space. Let $P = \{P_1, P_2, \ldots\}$ be a partition of $X$, i.e., the $P_i$’s are pairwise disjoint and their union is $X$. The $P_i$’s are called the parts. Let us define the following notation: for $x \in X$, let $P(x)$ be the unique part that contains $x$.
The diameter of a part $P_i$ is $\max \{\, d(x,y) : x, y \in P_i \,\}$. We say that the partition $P$ is $\Delta$-bounded if every $P_i$ has diameter at most $\Delta$.
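As a small illustration of these two definitions, the sketch below computes the diameter of a part and tests whether a partition is bounded. The names `diameter` and `is_bounded` are hypothetical, and a partition is represented simply as a list of lists of points.

```python
def diameter(part, d):
    """Largest distance between two points of `part` (0 for an empty part)."""
    return max((d(x, y) for x in part for y in part), default=0)

def is_bounded(partition, d, delta):
    """True if every part of `partition` has diameter at most `delta`."""
    return all(diameter(part, d) <= delta for part in partition)
```

With `d = lambda a, b: abs(a - b)`, the partition `[[1, 2, 3], [4, 5, 6, 7]]` has parts of diameter 2 and 3, so it is 3-bounded but not 2-bounded.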
As you will see soon, it will be very useful for us to choose a partition of $X$ randomly. We say that a random partition $\mathcal{P}$ is $\Delta$-bounded if every possible partition that can occur as a realization of the random $\mathcal{P}$ is a $\Delta$-bounded partition.
Our goal is that points that are “close” to each other should have good probability of ending up in the same part. Formally, let $\mathcal{P}$ be a randomly chosen partition. We say that $\mathcal{P}$ is $L$-Lipschitz if

$$\Pr[\, \mathcal{P}(x) \neq \mathcal{P}(y) \,] \;\leq\; L \cdot d(x,y) \qquad \forall x, y \in X.$$

So if $x$ and $y$ are close, they have a smaller probability of being assigned to different parts of the partition. Note that this definition is not “scale-invariant”, in the sense that we need to double $L$ if we halve all the distances.
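The scaling remark can be made concrete with a one-line calculation. Here $d'$ is a name introduced only for this illustration: suppose $\mathcal{P}$ is $L$-Lipschitz with respect to $d$, and let $d'(x,y) = d(x,y)/2$. Then

$$\Pr[\, \mathcal{P}(x) \neq \mathcal{P}(y) \,] \;\leq\; L \cdot d(x,y) \;=\; 2L \cdot d'(x,y) \qquad \forall x, y \in X,$$

so the same random partition is only guaranteed to be $2L$-Lipschitz with respect to $d'$. On the other hand, every diameter is halved under $d'$, so a $\Delta$-bounded partition becomes $(\Delta/2)$-bounded, and the product $L \cdot \Delta$ is unchanged by rescaling.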
Combining the $\Delta$-bounded and $L$-Lipschitz concepts is very interesting. Let us illustrate this with an example.
Consider the “line metric”, where $X = \{1, \ldots, n\}$, $n$ is odd, and $d(x,y) = |x-y|$. The diameter of this metric is clearly $n-1$. Consider the partition $P = \{P_1, P_2\}$ where $P_1 = \{1, \ldots, \lfloor n/2 \rfloor\}$ and $P_2 = \{\lceil n/2 \rceil, \ldots, n\}$. This partition is $\Delta$-bounded for $\Delta = \lfloor n/2 \rfloor$. Does it capture our goal that “close points should end up in the same part”?
In some sense, yes. The only two consecutive points that ended up in different parts are the points $\lfloor n/2 \rfloor$ and $\lceil n/2 \rceil$, so most pairs of consecutive points did end up in the same part. But if we modify our metric slightly, this is no longer true. Consider making $n$ copies of both of the points $\lfloor n/2 \rfloor$ and $\lceil n/2 \rceil$, keeping the copies at the same location in the metric. (This is valid because we’re really looking at semimetrics: we allow $d(x,y) = 0$ for different points $x$ and $y$.) After this change, a constant fraction of the pairs of consecutive points end up in different parts!
So, even in this simple example of a line metric, if we allow multiple copies of (equivalently, non-negative weights on) each point, it becomes much less clear how to choose a partition for which most close points end up in the same part.
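Choosing a random partition makes life much easier. Pick an index $k \in \{1, \ldots, n-1\}$ uniformly at random, then set $P_1 = \{1, \ldots, k\}$ and $P_2 = \{k+1, \ldots, n\}$. Let $\mathcal{P}$ be the resulting random partition. These partitions always have diameter at most $n-2$ (even if we made multiple copies of points). So $\mathcal{P}$ is $\Delta$-bounded with $\Delta = n-2$.
Now consider any two consecutive points $i$ and $i+1$. They end up in different parts of the partition only if $k = i$, which happens with probability at most $1/(n-1)$. Thus $\Pr[\, \mathcal{P}(i) \neq \mathcal{P}(i+1) \,] \leq 1/(n-1)$. More generally

$$\Pr[\, \mathcal{P}(x) \neq \mathcal{P}(y) \,] \;\leq\; \frac{|x-y|}{n-1} \;=\; \frac{d(x,y)}{n-1} \qquad \forall x, y \in X.$$

So $\mathcal{P}$ is an $L$-Lipschitz partition with $L = 1/(n-1)$. The key point is: this holds regardless of how many copies of the points we make. So this same random partition works under any scheme of copying (i.e., weighting) the points of this metric.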
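The random cut of the line metric is small enough to analyze exactly by enumerating every cut position. The sketch below assumes the points are $1, \ldots, n$ and the cut index $k$ is uniform on $\{1, \ldots, n-1\}$; the function names are hypothetical, and exact rational arithmetic is used so that no rounding enters the comparison.

```python
from fractions import Fraction

def cut_partition(n, k):
    """The two parts {1..k} and {k+1..n} of the line metric on {1..n}."""
    return [list(range(1, k + 1)), list(range(k + 1, n + 1))]

def separation_probability(n, x, y):
    """Exact Pr[P(x) != P(y)] when the cut index k is uniform on {1..n-1}.

    Points x and y are separated exactly when min(x, y) <= k < max(x, y).
    """
    lo, hi = min(x, y), max(x, y)
    separating_cuts = sum(1 for k in range(1, n) if lo <= k < hi)
    return Fraction(separating_cuts, n - 1)
```

Under these assumptions the separation probability of any pair equals $|x-y|/(n-1)$, and the largest part over all cut positions has diameter $n-2$, which can be confirmed for small $n$ by looping over `cut_partition(n, k)`.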
2.2. The General Theorem
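The previous example achieves our “gold standard” of a random partition: $L = O(1/\Delta)$. We can think of this as meaning that the probability of adjacent points ending up in different parts is roughly the inverse of the diameter of those parts. Our main theorem is that, by increasing $L$ by a logarithmic factor, we can obtain a similar partition of any finite metric.
Theorem 1. Let $(X,d)$ be a metric with $|X| = n$. For every $\Delta > 0$, there is a $\Delta$-bounded, $L$-Lipschitz random partition $\mathcal{P}$ of $X$ with $L = O(\log(n)/\Delta)$.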
This theorem is optimal: for any $n$ there are metrics on $n$ points for which every $\Delta$-bounded, $L$-Lipschitz partition has $L = \Omega(\log(n)/\Delta)$.
Theorem 1 is a corollary of the following more general theorem. The statement is a bit messy, but the mess will be important in proving the main result of the next lecture. Define the partial harmonic sum $H(a,b) = \sum_{i=a+1}^{b} \frac{1}{i}$. Let $B(x,r) = \{\, y \in X : d(x,y) \leq r \,\}$ be the ball of radius $r$ around $x$.
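Both pieces of notation are easy to compute on a finite metric. The sketch below is illustrative only and rests on two assumptions: the partial harmonic sum runs over $i = a+1, \ldots, b$, and the ball is closed (distance at most $r$). The function names are hypothetical.

```python
from fractions import Fraction

def partial_harmonic(a, b):
    """Sum of 1/i for i = a+1, ..., b (assumed indexing), as an exact fraction."""
    return sum((Fraction(1, i) for i in range(a + 1, b + 1)), Fraction(0))

def ball(points, d, x, r):
    """Closed ball: all points of the finite set within distance r of x."""
    return [y for y in points if d(x, y) <= r]
```

With this indexing, `partial_harmonic(0, n)` is the ordinary harmonic number $H_n = \Theta(\log n)$, which is where a logarithmic factor can enter.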
2.3. Proof of Theorem 2
We start off by presenting the algorithm that generates the random partition $\mathcal{P}$. As is often the case with the algorithms we have seen in this class, the algorithm is very short, yet extremely clever and subtle.
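- Pick $\alpha \in (1/4, 1/2]$ uniformly at random.
- Pick a bijection (i.e., ordering) $\pi : \{1, \ldots, n\} \to X$ uniformly at random.
- For $i = 1, \ldots, n$, set $P_i = B(\pi(i), \alpha \Delta) \setminus \bigcup_{j < i} P_j$.
- Output the random partition $\mathcal{P} = \{P_1, \ldots, P_n\}$.

Remark. Note that $\alpha$ is not an arbitrary constant; it is random. Its role is analogous to the random choice of $k$ in our previous example.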
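To prove Theorem 2, we first need to check that the algorithm indeed outputs a partition. By definition, each $P_i$ is disjoint from all earlier $P_j$ with $j < i$, so the $P_i$’s are pairwise disjoint. Next, each point $\pi(i)$ is either contained in $P_i$ or some earlier $P_j$, so the union of the $P_i$’s is $X$.
Next we should check that this partition is $\Delta$-bounded. That is also easy: since $P_i \subseteq B(\pi(i), \alpha\Delta)$ and $\alpha \leq 1/2$, the triangle inequality shows that the diameter of $P_i$ is at most $2\alpha\Delta \leq \Delta$.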
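The algorithm and the two checks above can be sketched as follows. This is a minimal illustration, assuming the points are distinct hashable Python objects, the balls are closed, and $\alpha$ is drawn uniformly from $(1/4, 1/2]$; the function name `random_partition` and the `rng` parameter are hypothetical. Empty parts are dropped from the output, which does not affect the partition.

```python
import random

def random_partition(points, d, delta, rng=random):
    """Carve balls of radius alpha*delta around the points in random order.

    `rng` may be the `random` module or a `random.Random` instance
    (an illustrative parameter, useful for reproducible runs).
    """
    # rng.random() lies in [0, 1), so alpha lies in (1/4, 1/2].
    alpha = 0.5 - rng.random() / 4
    order = list(points)
    rng.shuffle(order)  # uniformly random ordering pi
    assigned = set()
    parts = []
    for center in order:
        # P_i = B(pi(i), alpha*delta) minus everything assigned earlier.
        part = [y for y in points
                if y not in assigned and d(center, y) <= alpha * delta]
        assigned.update(part)
        if part:  # drop empty parts
            parts.append(part)
    return parts
```

For example, `random_partition(list(range(1, 21)), lambda a, b: abs(a - b), 6, random.Random(0))` returns a partition of the line metric on 20 points in which every part has diameter at most 6. Each center that is still unassigned lands in its own ball, which is why every point ends up covered.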
The difficult step is proving (1), which we do next time. Here is some vague intuition as to why it might be true. Condition (1) asks us to show that, with good probability, the ball $B(x,r)$ is not chopped into pieces by the partition. But the parts of the partition are themselves balls of radius at least $\Delta/4$ (minus all previous parts of the partition). So as long as $r \ll \Delta/4$, we might be optimistic that the ball $B(x,r)$ does not get chopped up.
Let $X$ be the following eight points in the plane, with the usual Euclidean distance. Let $\pi$ be the identity ordering: $\pi(i) = i$ for all $i$. The algorithm generates the following partition.