1. Online Steiner Tree
Let be a graph and let be lengths on the edges. Let be the shortest path metric on .
For any , a Steiner tree is a subtree of that spans (i.e., contains) all vertices in , but does not necessarily span all of . The vertices in are called “terminals”. Equivalently, we can define a Steiner tree to be an acyclic, connected subgraph of that spans all of . Computing a minimum-length Steiner tree is NP-hard.
Today we consider the problem of constructing a Steiner tree in an online setting. There is a sequence of time steps. In each time step , we are given a vertex . Our algorithm must then choose a connected subgraph of which spans (and possibly other vertices). The objective is to minimize the total length . Since we only care about the cost of the union of the ‘s, we may assume without loss of generality that . There is no restriction on computation time of the algorithm.
Remark. The ‘s are not actually Steiner trees because we did not insist that they are acyclic. If trees are desired, one could remove cycles from each arbitrarily. This is equivalent to the problem that we stated above.
If we knew the terminal set in advance then the problem would be trivial. The algorithm could compute in exponential time the minimum-length Steiner tree spanning , then set in every time step. Unfortunately the algorithm does not know in advance. Instead, our goal will be for the algorithm to behave almost as well as if it did know . Formally, define the competitive ratio of the algorithm to be the ratio
We want our algorithm to have small competitive ratio.
Theorem 1 There is a randomized, polynomial time algorithm with expected competitive ratio $O(\log n)$, where $n$ is the number of vertices of the graph.
I think this is optimal but I could not find a definitive reference. For very similar problems, Alon & Azar prove an $\Omega(\log n / \log \log n)$ lower bound and Imase & Waxman prove an $\Omega(\log n)$ lower bound.
1.1. The Algorithm
The main idea is to use the algorithm of the last lecture to approximate the metric by a tree with edge lengths . Let be the corresponding distance function on . Recall that the leaves of are identified with the vertices in . The algorithm will then build a sequence of Steiner trees that are subtrees of , where each spans . This is trivial: since is itself a tree, there is really only one reasonable choice for what should be. We set to be the unique minimal subtree of that spans .
Remark. This step of the algorithm illustrates the usefulness of probabilistically approximating the metric by a tree. Many problems can be solved either trivially or by very simple algorithms, when the underlying graph is a tree.
Clearly . We would like to understand how the length of our final Steiner tree compares to the optimal Steiner tree .
Unfortunately the tree itself isn’t a solution to our problem. Recall that is not a subtree of : the construction of required adding extra vertices and edges. So is not a subtree of either. To obtain our actual solution, we will see below how to use the trees to guide our construction of the desired subgraphs of .
Proof: (of Claim 2). Let be an ordering of the terminals given by a depth-first traversal of . Equivalently, let denote the graph obtained from by replacing every edge with two parallel copies. Perform an Euler tour of , and let be the order in which the terminals are visited.
The Euler tour traverses every edge of exactly once, so is exactly half the length of the Euler tour. Thus
Now consider performing a walk through , visiting the terminals in the order given by . Since this walk visits every leaf of , it is a traversal of the tree, and hence it crosses every edge at least once. (In fact, it is an Eulerian walk, so it crosses every edge at least twice.)
Remark. The analysis in (3) illustrates why our random tree is so useful. The quantity that we’re trying to analyze (namely ) is bounded by a linear function of some distances in the tree (namely ). Because we can bound by , and because of linearity of expectation, we obtain a bound on involving distances in .
Now let us explain how the algorithm actually solves the online Steiner tree problem. It will maintain a sequence of subgraphs of such that each spans . Initially . Then at every subsequent time step , we do the following.
The trees described above can also be viewed in this iterative fashion. Initially . Then at every subsequent time step , we do the following.
The following important claim relates and .
Proof: Let be the closest vertex in to , so is a – path. The vertex would only be added to if some leaf beneath belongs to . By the choice of weights in , the weight of the – path is no longer than the weight of the – path. Thus for all . Consequently
as required.
Consequently,
since is the union of the paths and is the disjoint union of the paths. So
by Claim 2. This proves that the expected competitive ratio is $O(\log n)$.
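The update step used by the online algorithm can be sketched as follows. This is a minimal illustration rather than the notes' exact procedure: it assumes that each arriving terminal is joined, by a shortest path in the graph, to the earlier terminal that is closest under some guiding distance function. The names `shortest_path`, `online_steiner` and `guide_dist` are hypothetical; in the algorithm above the guiding distance would be the tree metric, but any callable can be plugged in.

```python
import heapq

def shortest_path(graph, source, target):
    """Dijkstra on graph[u] = {neighbor: length}; returns the vertex list of a
    shortest source-target path. Assumes the graph is connected."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for w, length in graph[u].items():
            nd = d + length
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                prev[w] = u
                heapq.heappush(heap, (nd, w))
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

def online_steiner(graph, terminals, guide_dist):
    """Process terminals one at a time and return the set of edges bought.

    guide_dist(u, v) is an assumed stand-in for the tree distance: each new
    terminal is attached to the earlier terminal minimizing guide_dist.
    """
    bought = set()
    seen = []
    for v in terminals:
        if seen:
            y = min(seen, key=lambda u: guide_dist(u, v))
            path = shortest_path(graph, v, y)
            bought.update(frozenset(e) for e in zip(path, path[1:]))
        seen.append(v)
    return bought
```

Edges are never removed once bought, which mirrors the assumption that the subgraphs only grow over time.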
For many optimization problems, the input data involves some notion of distance, which we formalize as a metric. But unfortunately many optimization problems can be quite difficult to solve in an arbitrary metric. In this lecture we present an approach to dealing with such problems: a method to approximate any metric by much simpler metrics. The simpler metrics we will use are trees, i.e., the shortest path metric on a graph that is a tree. Many optimization problems are easy to solve on trees, so in one fell swoop we get algorithms to approximate a huge number of optimization problems.
Roughly speaking, our main result is: any metric on $n$ points can be represented by a distribution on trees, while preserving distances up to an $O(\log n)$ factor. Consequently: for many optimization problems involving distances in a metric, if you are content with an $O(\log n)$-approximate solution, you can assume that your metric is a tree.
In order to state our results more formally, we will need to deal with an important issue. To illustrate the issue, and how to deal with it, we first present an example.
1.1. Example: Approximating a cycle
Let $C$ be a cycle on $n$ nodes. The (spanning) subtrees of $C$ are simply the paths obtained by deleting a single edge. So let $e = \{u, v\}$ be an edge and let $T$ be the corresponding tree, obtained by deleting $e$ from $C$. Is the shortest path metric of $T$ a good approximation of the shortest path metric of $C$? The answer is no: the distance between $u$ and $v$ in $C$ is only $1$, whereas the distance between $u$ and $v$ in $T$ is $n-1$. So, no matter which subtree of $C$ we pick, there will be some pair of nodes whose distance is poorly approximated.
Is there some way around this problem? Perhaps we don’t need to be a subtree of . We could consider a tree (possibly with lengths on the edges) where and is completely unrelated to . Can such a tree do a better job of approximating distances in ? It turns out that the answer is still no: there will always be a pair of nodes whose distance is only preserved up to a factor . But here is a small observation: any subtree of approximately preserves the average distances. One can easily check that the total distance between all pairs of nodes is , for both and for any subtree of . Thus, subtrees approximate the distances in “on average”.
So for the $n$-cycle $C$, a subtree cannot approximate all distances, but it can approximate the average distance. This motivates us to apply a trick that is both simple and counterintuitive. It turns out that we can approximate all distances if we allow ourselves to pick the subtree randomly. (The trick is Von Neumann’s minimax theorem, and it implies that approximating the average distance is equivalent to finding a distribution on trees for which every distance is approximated in expectation.) To illustrate this, choose any pair of vertices $u, v$. Let $d$ be the distance between $u$ and $v$ in $C$. Pick a subtree $T$ by deleting an edge $e$ uniformly at random and let $d_T$ be the $u$–$v$ distance in $T$. Obviously $d_T \geq d$ since we constructed $T$ by removing $e$ from $C$. We now give an upper bound on $\mathbb{E}[d_T]$. If $e$ is on the shortest $u$–$v$ path then $d_T = n - d$; the probability of that happening is $d/n$. Otherwise, $d_T = d$. Thus,
$$\mathbb{E}[d_T] = \frac{d}{n}(n-d) + \Big(1 - \frac{d}{n}\Big) d = \frac{2d(n-d)}{n} \leq 2d.$$
So, every distance in $C$ is approximated to within a factor of $2$, in expectation.
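The expectation in the cycle example can be checked exactly by enumerating all edges that might be deleted. The sketch below assumes a cycle on vertices 0, ..., n-1 with unit-length edges {i, i+1 mod n}; the function names are illustrative only.

```python
from fractions import Fraction

def cycle_distance(n, u, v):
    """Shortest-path distance between u and v on the n-cycle."""
    k = abs(u - v) % n
    return min(k, n - k)

def path_distance_after_deleting(n, u, v, e):
    """Distance between u and v in the path left after deleting edge {e, e+1 mod n}."""
    # Relabel vertices by their position along the path, which starts at e+1.
    a = (u - (e + 1)) % n
    b = (v - (e + 1)) % n
    return abs(a - b)

def expected_tree_distance(n, u, v):
    """Exact expected u-v distance when the deleted edge is uniformly random."""
    total = sum(path_distance_after_deleting(n, u, v, e) for e in range(n))
    return Fraction(total, n)

# On the 10-cycle, a pair at distance 3 has expected tree distance 21/5 <= 2 * 3.
print(cycle_distance(10, 0, 3), expected_tree_distance(10, 0, 3))
```

For this pair, three of the ten deletions stretch the distance to 7 and the other seven leave it at 3, giving 42/10.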
1.2. Main Theorem
We now show that, for every metric $(X, d)$ with $|X| = n$, there is an algorithm that generates a random tree for which all distances are approximated to within a factor of $O(\log n)$, in expectation.
Theorem 1 Let be a finite metric with . There is a randomized algorithm that generates a set of vertices , a map , a tree , and weights such that
The main tool in the proof is the random partitioning algorithm that we developed in the last two lectures. For notational simplicity, let us scale our distances and pick a value such that for all distinct . Note that does not appear in the statement of the theorem, so we do not care how big it is.
The main idea is to generate a -bounded random partition of for every then assemble those partitions into the desired tree. Assembling them is not too difficult, but there is one annoyance: the parts of have absolutely no relation to the parts of for any . If the parts of were nicely nested inside the parts of then this would induce a natural hierarchy on the parts, and therefore give us a nice tree structure.
The solution to this annoyance is to forcibly construct a nice partition , for , that is nested inside all of . In lattice theory terminology, we define the partition
where is the meet operation in the partition lattice. If you’re not familiar with this notation, don’t worry; it is easy to explain. Simply define , then let
Note that is also a partition of . Furthermore, the parts of are nicely nested inside the parts of , so we have obtained the desired hierarchical structure.
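The meet of several partitions, as described above, places two points in the same part exactly when every one of the partitions does. A minimal sketch, with the hypothetical name `meet` and partitions represented as lists of Python sets:

```python
def meet(partitions, points):
    """Coarsest common refinement of the given partitions of `points`.

    Each partition is a list of disjoint sets covering `points`.
    """
    groups = {}
    for x in points:
        # The signature records which part of each partition contains x.
        signature = tuple(
            next(i for i, part in enumerate(p) if x in part) for p in partitions
        )
        groups.setdefault(signature, set()).add(x)
    return list(groups.values())

coarse = [{1, 2, 3}, {4}]
fine = [{1, 2}, {3, 4}]
print(meet([coarse, fine], [1, 2, 3, 4]))  # parts {1, 2}, {3}, {4} in some order
```

Every part of the result lies inside a single part of each input partition, which is the nesting property the construction relies on.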
1.3. Example
Consider the following example which shows some possible partitions for the points , and the corresponding partitions .
The tree corresponding to these partitions is as follows.
1.4. Algorithm
More formally, here is our algorithm for generating the random tree.
Claim 2 Fix any distinct points . Let be the largest index with . Then .
Proof: The level is the highest level of the partitions in which and are separated. A simple inductive argument shows that is also the highest level of the partitions in which and are separated. So the least common ancestor in of and is at level . Let us call the least common ancestor . Then
Since , the proof is complete.
Claim 3 (1) holds.
Proof: Let be such that . Since is -bounded, and must lie in different parts of , i.e., . By Claim 2,
as required.
Claim 4 (2) holds.
Proof: Fix any and let . We have
where the last inequality, proven in the following claim, applies Theorem 2 of Lecture 22 and performs a short calculation.
Claim 5 For any and ,
Proof: Let be the integer with . Then
since when . The final sum is upper bounded as follows.
This proves the claimed inequality.
1. Review of Previous Lecture
Define the partial Harmonic sum . Let be the ball of radius around .
Theorem 1 Let be a metric with . For every , there is a -bounded random partition of with
The algorithm to construct is as follows.
2. The Proof
Fix any point and radius . For brevity let . Let us order all points of as where . The proof involves two important definitions.
Obviously “cuts” implies “sees”. To help visualize these definitions, the following claim interprets their meaning in Euclidean space. (In a finite metric, the ball is not a continuous object, so it doesn’t really have a “boundary”.)
Claim 2 Consider the metric where and is the Euclidean metric. Then
- sees if and only if intersects .
- cuts if and only if intersects the boundary of .
The following claim is in the same spirit, but holds for any metric.
Claim 3 Let be an arbitrary metric. Then
- If does not see then .
- If sees but does not cut then .
To illustrate the definitions of “sees” and “cuts”, consider the following example. The blue ball around is . The points and both see ; does not. The point cuts ; and do not. This example illustrates Claim 3: sees but does not cut , and we have .
The most important point for us to consider is the first point under the ordering that sees . We call this point .
The first iterations of the algorithm did not assign any point in to any . To see this, note that do not see , by choice of . So Claim 3 implies that . Consequently
The point sees by definition, but it may or may not cut . If it does not cut then Claim 3 shows that . Thus
i.e., . Since , we have shown that
Taking the contrapositive of this statement, we obtain
Let us now simplify that sum by eliminating terms that are equal to .
So define and . Then we have shown that
The remainder of the proof is quite interesting. The main point is that these two events are “nearly independent”, since and are independent, “” depends only on , and “” depends primarily on . Formally, we write
and separately upper bound these two probabilities.
The first probability is easy to bound:
because is the length of the interval and is the length of the interval from which is randomly chosen.
Next we bound the second probability. Recall that is defined to be the first element in the ordering that sees . Since cuts , we know that . Every coming earlier in the ordering has , so also sees . This shows that there are at least elements that see . So the probability that is the first element in the random ordering to see is at most .
Combining these bounds on the two probabilities we get
as required.
3. Optimality of these partitions
Theorem 1 from the previous lecture shows that there is a universal constant such that every metric has a -bounded, -Lipschitz random partition. We now show that this is optimal.
Theorem 6 There exist graphs whose shortest path metric has the property that any -bounded, -Lipschitz random partition must have .
The graphs we need are expander graphs. In Lecture 20 we defined bipartite expanders. Today we need non-bipartite expanders. We say that is a non-bipartite expander if, for some constants and :
It is known that expanders exist for all , and . (The constant can of course be improved.)
Proof: Suppose has a -bounded, -Lipschitz random partition. Then there exists a particular partition that is -bounded and cuts at most an -fraction of the edges. Every part in the partition has diameter at most . Since the graph is -regular, the number of vertices in is at most . So every part has size less than . By the expansion condition, the number of edges cut is at least
So .
4. Appendix: Proofs of Claims
Proof: (of Claim 3) Suppose does not see . Then . Every point has , so , implying that .
Suppose sees but does not cut . Then . Every point has . So , implying that .
Proof: (of Claim 4) The hypothesis of the claim is that , which is at least . So , implying that does not see .
Proof: (of Claim 5) The hypothesis of the claim is that , which is strictly less than . So , which implies that sees but does not cut .
There are many sophisticated algorithms for manipulating data involving these general metrics. One common paradigm for designing such algorithms is the familiar divide-and-conquer approach. Today we will discuss the fundamental algorithmic tool of partitioning metric spaces for the purpose of designing such divide-and-conquer algorithms.
This topic might seem a bit abstract and unmotivated. The next lecture will build on today’s ideas and present some algorithms whose usefulness is more easily appreciable.
1. Metrics
A metric is a set of points together with a “distance function” such that
In some scenarios this would be called a “semimetric” or “pseudometric”; a “metric” would additionally require that .
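For a finite point set, the defining conditions can be checked by brute force. The sketch below assumes the usual semimetric axioms (zero distance from a point to itself, non-negativity, symmetry, and the triangle inequality); `is_semimetric` is a hypothetical helper name.

```python
from itertools import product

def is_semimetric(points, dist, tol=1e-12):
    """Brute-force check of the semimetric axioms on a finite point set."""
    for x in points:
        if abs(dist(x, x)) > tol:
            return False  # d(x, x) must be 0
    for x, y in product(points, repeat=2):
        if dist(x, y) < -tol or abs(dist(x, y) - dist(y, x)) > tol:
            return False  # non-negativity and symmetry
    for x, y, z in product(points, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:
            return False  # triangle inequality
    return True

print(is_semimetric([0, 1, 2], lambda x, y: abs(x - y)))     # line metric
print(is_semimetric([0, 1, 2], lambda x, y: (x - y) ** 2))   # squared distance
```

The squared distance fails because going from 0 to 2 directly costs 4, while going through 1 costs only 2.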
Here are some standard examples of metrics with which you should be familiar.
Notice that in the first two examples the set is infinite, but in the last example is finite. When is finite we call the pair a finite metric space. We can also obtain finite metric spaces by restricting the first two examples to a finite subset of and keeping the same distance functions. In computer science we are often only interested in finite metric spaces because the input data to the problem is finite.
2. Lipschitz Random Partitions
Let $(X, d)$ be a metric space. Let $\mathcal{P} = \{P_1, P_2, \ldots\}$ be a partition of $X$, i.e., the $P_i$'s are pairwise disjoint and their union is $X$. The $P_i$'s are called the parts. Let us define the following notation: for $x \in X$, let $\mathcal{P}(x)$ be the unique part that contains $x$.
The diameter of a part is . We say that the partition is -bounded if every has diameter at most .
As you will see soon, it will be very useful for us to choose a partition of randomly. We say that a random partition is -bounded if every possible partition that can occur as a realization of the random is a -bounded partition.
Our goal is that points that are “close” to each other should have good probability of ending up in the same part. Formally, let be a randomly chosen partition. We say that is -Lipschitz if
So if and are close, they have a smaller probability of being assigned to different parts of the partition. Note that this definition is not “scale-invariant”, in the sense that we need to double if we halve all the distances.
Combining the -bounded and -Lipschitz concepts is very interesting. Let us illustrate this with an example.
2.1. Example
Consider the “line metric”, where , is odd, and . The diameter of this metric is clearly . Consider the partition where and . This partition is -bounded for . Does it capture our goal that “close points should end up in the same part”?
In some sense, yes. The only two consecutive points that ended up in different parts are the points and , so most pairs of consecutive points did end up in the same part. But if we modify our metric slightly, this is no longer true. Consider making copies of both of the points and , keeping the copies at the same location in the metric. (This is valid because we’re really looking at semimetrics: we allow for different points and .) After this change, a constant fraction of the consecutive points now end up in different parts!
So, even in this simple example of a line metric, if we allow multiple copies of (equivalently, non-negative weights on) each point, it becomes much less clear how to choose a partition for which most close points end up in the same part.
Choosing a random partition makes life much easier. Pick an index uniformly at random, then set and . Let be the resulting random partition. These partitions always have diameter at most (even if we made multiple copies of points). So is -bounded with .
Now consider any two consecutive points and . They end up in different parts of the partition only if , which happens with probability at most . Thus . More generally
So is a -Lipschitz partition with . The key point is: this holds regardless of how many copies of the points we make. So this same random partition works under any scheme of copying (i.e., weighting) the points of this metric.
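The same idea can be explored numerically with a closely related construction. The sketch below is a variant, not the exact partition of the example: it assumes the points 1, ..., n are cut into consecutive blocks of `delta` integers, with the block boundaries shifted by a uniformly random offset. Each block has diameter at most `delta - 1`, and two points x and y are separated with probability min(1, |x - y| / delta). All names are illustrative.

```python
from fractions import Fraction

def shifted_blocks(n, delta, offset):
    """Partition {1, ..., n} into consecutive blocks of `delta` integers, shifted by `offset`."""
    parts = {}
    for x in range(1, n + 1):
        parts.setdefault((x + offset) // delta, []).append(x)
    return list(parts.values())

def separation_probability(delta, x, y):
    """Exact probability that x and y land in different blocks
    when the offset is uniform on {0, ..., delta - 1}."""
    cut = sum(
        1 for offset in range(delta)
        if (x + offset) // delta != (y + offset) // delta
    )
    return Fraction(cut, delta)

print(shifted_blocks(10, 4, 1))
print(separation_probability(4, 3, 4), separation_probability(4, 3, 5))
```

Copying points does not change these probabilities, since they depend only on the locations of x and y.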
2.2. The General Theorem
The previous example achieves our “gold standard” of a random partition: . We can think of this as meaning that the probability of adjacent points ending up in different parts is roughly the inverse of the diameter of those parts. Our main theorem is that, by increasing by a logarithmic factor, we can obtain a similar partition of any finite metric.
Theorem 1 Let be a metric with . For every , there is a -bounded, -Lipschitz random partition of with .
This theorem is optimal: for any there are metrics on points for which every -bounded, -Lipschitz partition has .
Theorem 1 is a corollary of the following more general theorem. The statement is a bit messy, but the mess will be important in proving the main result of the next lecture. Define the partial Harmonic sum . Let be the ball of radius around .
Theorem 2 Let be a metric with . For every , there is a -bounded random partition of with
Proof: (of Theorem 1) Let be the random partition from Theorem 2. Consider any and let . Note that if then . Thus
since .
2.3. Proof of Theorem 2
We start off by presenting the algorithm that generates the random partition . As is often the case with the algorithms we have seen in this class, the algorithm is very short, yet extremely clever and subtle.
To prove Theorem 2, we first need to check that the algorithm indeed outputs a partition. By definition, each is disjoint from all earlier with , so the ‘s are pairwise disjoint. Next, each point is either contained in or some earlier , so the union of the ‘s is .
Next we should check that this partition is -bounded. That is also easy: since , the diameter of is at most .
The difficult step is proving (1), which we do next time. Here is some vague intuition as to why it might be true. Condition (1) asks us to show that, with good probability, the ball is not chopped into pieces by the partition. But the parts of the partition are themselves balls of radius at least (minus all previous parts of the partition). So as long as , we might be optimistic that the ball does not get chopped up.
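The partitioning procedure discussed in this section can be sketched as ball carving. The details below are assumptions chosen to be consistent with the checks above: a radius multiplier `alpha` drawn uniformly from the interval between 1/4 and 1/2, a uniformly random ordering of the points, and each part defined as a ball of radius `alpha * delta` around the next point minus all earlier parts. The names `random_partition` and `diameter` are hypothetical.

```python
import math
import random

def random_partition(points, dist, delta, rng=random):
    """Ball carving: returns a list of disjoint sets covering `points`.

    Assumed parameters: alpha uniform in [1/4, 1/2], uniformly random ordering.
    """
    alpha = rng.uniform(0.25, 0.5)
    order = list(points)
    rng.shuffle(order)
    unassigned = set(points)
    parts = []
    for center in order:
        # Ball around `center`, restricted to points not claimed by earlier parts.
        part = {x for x in unassigned if dist(center, x) <= alpha * delta}
        if part:
            parts.append(part)
            unassigned -= part
    return parts

def diameter(part, dist):
    """Largest pairwise distance inside a part."""
    return max((dist(x, y) for x in part for y in part), default=0)
```

Since every ball has radius at most `delta / 2`, the triangle inequality bounds each part's diameter by `delta`. Every point is eventually assigned because an unassigned point always lies in its own ball.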
2.4. Example
Let be the following eight points in the plane, with the usual Euclidean distance. Let be the identity ordering: . The algorithm generates the following partition.
Property testing is a research area in theoretical computer science that has seen a lot of activity over the past 15 years or so. A one-sentence description of this area’s goal is: design algorithms that, given a very large object, examine the object in very few places and decide whether it either has a certain property, or is “far” from having that property.
As a simple example, let the object be the set of all people in Canada, and let the property be “does a majority like Stephen Harper?”. The first algorithm that comes to mind for this problem is: sample a few people at random, ask them if they like Stephen Harper, then return the majority vote of the sample.
Does this algorithm have a good probability of deciding whether a majority of people like Stephen Harper? The answer is no. Suppose there are 999,999 people in Canada. The sampling algorithm cannot reliably distinguish between the case that exactly 500,000 people like Stephen Harper (a majority) and exactly 499,999 people like Stephen Harper (not a majority).
But our algorithm is in fact a good property testing algorithm for this problem! The reason is that we have not yet discussed the important word “far”. That word allows us to ignore these scenarios that are right “on the boundary” of having the desired property.
In our example, we could formalize the word “far” as meaning “more than 5% of the population needs to change their vote in order for a majority to like Stephen Harper”. Equivalently, our algorithm only needs to distinguish two scenarios:
Our sampling algorithm has good probability of distinguishing these scenarios. The number of samples needed depends only on the fraction 5% and the desired probability of success, and does not depend on the size of the population. (This is easy to prove using Chernoff bounds.)
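To see why the sample size is independent of the population, here is a sketch using the Hoeffding form of the Chernoff bound. It assumes the two scenarios are "at least half like him" versus "at most a (1/2 - gap) fraction like him", and it thresholds the sample mean halfway between them. The names `sample_size`, `poll_says_majority` and the callback `likes` (which samples one random person and reports their answer) are hypothetical.

```python
import math
import random

def sample_size(gap, failure_prob):
    """Samples needed so the sample mean is off by gap/2 with probability
    at most failure_prob, by Hoeffding: exp(-2 m t^2) <= failure_prob with t = gap/2."""
    t = gap / 2
    return math.ceil(math.log(1 / failure_prob) / (2 * t * t))

def poll_says_majority(likes, gap, failure_prob, rng=random):
    """Sample people via likes(rng) -> bool and threshold at 1/2 - gap/2."""
    m = sample_size(gap, failure_prob)
    votes = sum(likes(rng) for _ in range(m))
    return votes / m >= 0.5 - gap / 2

# The population size never appears: only the gap and the failure probability matter.
print(sample_size(0.05, 0.01))
```

With a gap of 5% and a 1% failure probability this asks for a few thousand samples, whether the population is a thousand people or a billion.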
This example probably doesn’t excite you very much because statisticians have studied these sorts of polling problems for centuries, and we know all about confidence intervals, etc. The field of property testing does not focus on these simple polling problems, but instead tends to look at problems of a more algebraic or combinatorial flavour. Some central problems in property testing are:
2. Testing Sortedness
Let be a list of distinct numbers. The property we would like to test is “is the list sorted?”. Our definition of “far” from sorted will be “we need to remove at least numbers for the remaining list to be sorted”. In other words, we wish to distinguish the following two scenarios:
We will give a randomized algorithm which can distinguish these cases with constant probability by examining only entries of the list.
A natural algorithm to try is: pick at random and test whether . Consider the input
We need to remove at least half of the list to make it sorted. But that algorithm will only find an unsorted pair if it picks , which happens with low probability.
So instead we consider the following more sophisticated algorithm.
Theorem 1 This algorithm has constant probability of correctly distinguishing between lists that are sorted and those that are far from sorted.
Proof: If the given list is sorted, obviously every binary search will work correctly and the algorithm will return “Yes”.
So let us suppose that the list is far from sorted. Say that an index is “good” if a binary search for correctly terminates at position (without discovering any unsorted elements). We claim that these good indices form a sorted subsequence. To see this, consider any two good indices and with . Let be the last common index of their binary search paths. Then we must have , which implies that by distinctness.
Since the list is far from sorted, there can be at most good indices. So the probability of picking a good index in every iteration is .
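A tester in the spirit of the argument above can be sketched as follows. It assumes the simplest version of the procedure: repeatedly pick a random index, binary search for the value stored there, and reject if the search does not end at that index. The number of repetitions is left as a parameter, and the function names are illustrative.

```python
import random

def binary_search_ok(a, i):
    """Return True if a binary search for a[i] terminates at position i."""
    lo, hi = 0, len(a) - 1
    target = a[i]
    while lo <= hi:
        mid = (lo + hi) // 2
        if a[mid] == target:
            return mid == i
        if a[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False

def looks_sorted(a, trials, rng=random):
    """Accept if every sampled index is 'good'; always accepts a sorted list."""
    for _ in range(trials):
        i = rng.randrange(len(a))
        if not binary_search_ok(a, i):
            return False
    return True

# A list made of two sorted halves in the wrong order: half the indices are bad.
rotated = list(range(50, 100)) + list(range(50))
good = [i for i in range(len(rotated)) if binary_search_ok(rotated, i)]
print(len(good), looks_sorted(rotated, trials=20))
```

On the rotated list the good indices are exactly the first half, which is indeed a sorted subsequence, so each trial rejects with probability one half.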
3. Estimating Maximal Matching Size
Our next example deviates from the property testing model described above. Instead of testing whether an object has a property or not, we will estimate some real-valued statistic of that object.
Our example is estimating the size of a maximal matching in a bounded-degree graph. There is an important distinction here. A maximum matching is a matching in the graph such that no other matching contains more edges. A maximal matching is a matching for which it is impossible to get a bigger matching by adding a single edge. Whereas all maximum matchings have the same size, it is not necessarily true that all maximal matchings have the same size.
Here is a greedy algorithm for generating a maximal matching in a graph with , and maximum degree .
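A minimal sketch of such a greedy procedure, assuming the edges are scanned in a given order and an edge is kept whenever both of its endpoints are still unmatched (the function name is illustrative):

```python
def greedy_maximal_matching(edges):
    """Scan edges in order; keep an edge if neither endpoint is matched yet."""
    matched = set()
    matching = []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching

# On the path 1-2-3-4 the ordering decides the size of the result.
print(greedy_maximal_matching([(2, 3), (1, 2), (3, 4)]))  # [(2, 3)]
print(greedy_maximal_matching([(1, 2), (2, 3), (3, 4)]))  # [(1, 2), (3, 4)]
```

Both outputs are maximal, yet they have different sizes, which illustrates the distinction between maximal and maximum matchings.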
Theorem 3 Let be a graph with maximum degree . There is an algorithm to estimate the size of the maximal matching to within a multiplicative factor of in time .
Estimating given an oracle. Suppose we have an oracle which, assuming some fixed maximal matching , can answer queries about whether a given edge is covered by that matching . We will use this oracle to estimate .
The algorithm just involves simple sampling.
Let and . The actual number of edges in is
Our estimate for the size of is . Since , Fact 2 shows that , which implies that . Therefore
by a Chernoff bound.
Implementing the oracle. We will not actually implement the oracle for an arbitrary ; we will require to be generated by the greedy algorithm above with a random ordering .
Associate independent random numbers with each edge . They are all distinct with probability , and so they induce a uniformly random ordering on the edges. We let be the maximal matching created by the greedy algorithm above, using this ordering.
Our algorithm for implementing the oracle is as follows.
Consider an edge reachable by a path of length from the initial edge . The recursion only considers the edge if the edges on this path have the values in decreasing order. The probability of that event is . The number of edges reachable by a path of length is less than . So the expected number of nodes explored is less than . So the expected time required by an oracle call is .
Since the estimation algorithm calls the oracle times, the expected total running time is .
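The recursive oracle can be sketched under one assumption: an edge belongs to the greedy matching exactly when none of its adjacent edges with a smaller random value belongs to it. This follows from running the greedy algorithm in increasing order of the values. The names `edge_adjacency`, `in_greedy_matching` and `rank` are hypothetical.

```python
def edge_adjacency(edges):
    """Map each edge to the list of other edges sharing an endpoint with it."""
    by_vertex = {}
    for e in edges:
        for v in e:
            by_vertex.setdefault(v, []).append(e)
    return {e: [f for v in e for f in by_vertex[v] if f != e] for e in edges}

def in_greedy_matching(edge, rank, adjacent):
    """Local test: explores only chains of edges with decreasing rank."""
    for other in adjacent[edge]:
        if rank[other] < rank[edge] and in_greedy_matching(other, rank, adjacent):
            return False  # a lower-ranked neighbour was matched first
    return True

edges = [(1, 2), (2, 3), (3, 4)]
adjacent = edge_adjacency(edges)
rank = {(1, 2): 0.5, (2, 3): 0.1, (3, 4): 0.9}
print([e for e in edges if in_greedy_matching(e, rank, adjacent)])  # [(2, 3)]
```

The recursion only ever moves to an edge of strictly smaller rank, so it terminates, and it never needs to look at the whole graph.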
1. Expanders
The easiest one-sentence description of an expander graph is: a graph that looks in many important ways like a random graph. So, although expander graphs are often viewed as a difficult field of study, in many ways they are very simple. In particular, here’s an easy one-sentence construction of an expander graph: a random graph is very likely to be a good expander graph. Proving that sentence is not difficult, and we will do so today.
Beyond the basics, the study of expanders becomes quite non-trivial. For example, in many scenarios we need a deterministic algorithm to construct an expander; such constructions often involve very sophisticated mathematics. Alternatively, we might like to give very precise guarantees on the quality of randomly generated expanders. That also gets very difficult.
But today we will be content with a simple, crude, randomized construction. This suffices for many amazing applications.
The Definition. Let be a bipartite graph, where is the set of left-vertices, is the set of right-vertices, and is the set of edges (each of which has one endpoint in and one in ). For any subset , let
There are several different and interrelated definitions of expanders. We will say that is an -expander if , , every vertex in has degree , and
Theorem 1 For some sufficiently large constant , sufficiently large and , there exists an -expander.
Remark. The theorem is trivial when but it is interesting even in the case .
Proof: We generate the graph as follows. First we generate vertices to form the set , and vertices to form the set . Next, for every vertex , we randomly choose exactly vertices from (with repetition), and we add edges from to all of those vertices. (So technically the resulting graph will be a multigraph, but this is not a problem.)
Let be an arbitrary subset with . We need to show that . We do this in a very naive way. We simply consider every set with and show that it is unlikely that . That probability is easy to analyze:
Observe that we don’t need to consider all sets with , it suffices to consider those with . (In fact, we will simplify our notation by considering the negligibly harder case of .) As long as is not contained in any such , then we have as required.
The remainder of the proof simply does a union bound over all such and .
For , the base of the exponent is less than , and so the infinite series adds up to less than .
So the probability that the random graph fails to satisfy the second inequality in (1) is less than .
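The random construction and the expansion condition can be explored on small instances. In the sketch below the sizes, the degree `d`, the set-size limit `max_size` and the expansion `factor` are all free parameters chosen by the caller, not the constants of the theorem, and the expansion check is brute force over all small subsets, so it is only practical for tiny graphs.

```python
import random
from itertools import combinations

def random_left_regular_graph(n_left, n_right, d, rng=random):
    """Each left vertex picks d right vertices independently, with repetition."""
    return {u: [rng.randrange(n_right) for _ in range(d)] for u in range(n_left)}

def neighborhood(graph, subset):
    """Set of right vertices adjacent to at least one vertex of `subset`."""
    return {w for u in subset for w in graph[u]}

def expands(graph, max_size, factor):
    """Check |neighborhood(S)| >= factor * |S| for every nonempty S with |S| <= max_size."""
    for size in range(1, max_size + 1):
        for subset in combinations(graph, size):
            if len(neighborhood(graph, subset)) < factor * size:
                return False
    return True

small = {0: [0, 1], 1: [1, 2], 2: [2, 3]}
print(expands(small, max_size=2, factor=1.5), expands(small, max_size=2, factor=2))
```

In the small example the pair of left vertices 0 and 1 has only three neighbours, so it meets a factor of 1.5 but not a factor of 2.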
2. Superconcentrators
Let be a graph. There are two disjoint subsets and with . The vertices in are called the inputs and the vertices in are called the outputs. The graph is called a superconcentrator if, for every and every and with , there are vertex-disjoint paths from to .
It is important to note that we do not insist that the path starting at some has to arrive at a particular ; it is acceptable for it to arrive at an arbitrary . In other words, the paths from to induce some bijective mapping but we do not care what that mapping is.
Why might we be interested in such graphs? Suppose we want to design a telephone network with input lines and output lines such that any set of input lines can be connected to any set of output lines. To minimize the cost of this network, we would like the number of wires and routing points inside the network to be as small as possible. It is easy to design a superconcentrator with edges: simply add an edge between every vertex in and every vertex in . Can one do better than that? It was conjectured by Valiant that edges are necessary. His conjecture came from an attempt to prove a lower bound in circuit complexity. But a short while later he realized that his own conjecture was false:
Theorem 2 For any , there are superconcentrators with inputs, outputs, and edges.
The proof uses expanders, recursion and Hall’s theorem in a very clever way. First we observe the following interesting property of expanders: they have matchings covering any small set of vertices.
Claim 3 Let be a -expander. For every set of size , there is a matching in covering . In other words, there exists an injective map such that for every .
Proof: Hall’s marriage theorem says that has a matching covering if and only if every has . That condition holds by (1), since is an expander.
The Main Idea. The construction proceeds as follows.
The Size. First let us analyze the total number of edges in this construction. We use a simple recurrence relation: let be the number of edges used in a superconcentrator on inputs and outputs. The total number of edges used in the construction is:
Thus . For the base case, we can take whenever is a sufficiently small constant. This recurrence has the solution .
The Superconcentrator Property. Consider any and with . We must find vertex-disjoint paths from to .
Case 1: . By Claim 3, the first expander contains a matching covering . Let be the other endpoints of that matching. Similarly, the second expander contains a matching covering . Let be the other endpoints of that matching. Note that . By induction, the smaller superconcentrator contains vertex-disjoint paths between and . Combining those paths with the edges of the matchings gives the desired paths between and .
Case 2: . Claim 3 only provides a matching covering sets of size at most , so the previous argument cannot handle the case . The solution is to use the edges that directly connect and .
By the pigeonhole principle, there are at least vertices in that are directly connected by the matching to vertices in . The remaining vertices (of which there are at most ) are handled by the argument of Case 1.
1. The Lovász Local Lemma
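Suppose $E_1, \ldots, E_n$ are a collection of “bad” events. We would like to show that there is positive probability that none of them occur. If the events are mutually independent then this is simple:
$$\Pr[\bar{E}_1 \wedge \cdots \wedge \bar{E}_n] = \prod_{i=1}^n (1 - \Pr[E_i]) > 0$$
(assuming that $\Pr[E_i] < 1$ for every $i$). The LLL is a method for proving that $\Pr[\bar{E}_1 \wedge \cdots \wedge \bar{E}_n] > 0$ when the $E_i$’s are not mutually independent, but they can have some sort of limited dependencies.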
Formally, we say that an event does not depend on the events if
So, regardless of whether some of the events in occur, the probability of occurring is unaffected.
Theorem 1 (The “Symmetric” LLL) Let be events with for all . Suppose that every event does not depend on at least other events. If then .
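We will not prove this theorem. Instead, we will illustrate the LLL by considering a concrete application of it in showing satisfiability of $k$-CNF Boolean formulas. Recall that a $k$-CNF formula is a Boolean formula, involving any finite number of variables, where the formula is a conjunction (“and”) of any number of clauses, each of which is a disjunction (“or”) of exactly $k$ distinct literals (a variable or its negation).
For example, here is a $3$-CNF formula with three variables and eight clauses.
$$(x_1 \vee x_2 \vee x_3) \wedge (x_1 \vee x_2 \vee \bar{x}_3) \wedge (x_1 \vee \bar{x}_2 \vee x_3) \wedge (x_1 \vee \bar{x}_2 \vee \bar{x}_3) \wedge (\bar{x}_1 \vee x_2 \vee x_3) \wedge (\bar{x}_1 \vee x_2 \vee \bar{x}_3) \wedge (\bar{x}_1 \vee \bar{x}_2 \vee x_3) \wedge (\bar{x}_1 \vee \bar{x}_2 \vee \bar{x}_3)$$
This formula is obviously unsatisfiable, since each of the eight possible assignments violates exactly one clause. One can easily generalize this construction to get an unsatisfiable $k$-CNF formula with $k$ variables and $2^k$ clauses. Our next theorem says: the reason this formula is unsatisfiable is that we allowed each variable to appear in too many clauses.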
Theorem 2 There is a universal constant such that the following is true. Let be a -CNF formula where each variable appears in at most clauses. Then is satisfiable. Moreover, there is a randomized, polynomial time algorithm to find a satisfying assignment.
The theorem is stronger when is small. The proof that we will present can be optimized to get . By applying the full-blown LLL one can achieve .
Let be the number of variables and be the number of clauses in . Each clause contains variables, each of which can appear in only other clauses. So each clause shares a variable with less than other clauses.
The algorithm proving the theorem is perhaps the most natural algorithm that one could imagine. However, it took more than 30 years from the introduction of the LLL for this algorithm to be rigorously analyzed.
Solve()
Fix()
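Here is a minimal Python sketch of the two procedures, assuming the usual recursive formulation of this algorithm: Solve draws a uniformly random assignment and calls Fix on each clause that is unsatisfied, while Fix resamples the variables of its clause and then recurses on any unsatisfied clause sharing a variable with it (the clause itself included). The names `solve`, `fix`, `satisfied` and the encoding of literals as signed integers are choices made for this sketch, not notation from the lecture.

```python
import random

def satisfied(clause, assignment):
    """A clause is a tuple of nonzero ints: +v means variable v, -v its negation."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def solve(num_vars, clauses, rng):
    """Return an assignment (dict: variable -> bool), assuming every fix call terminates."""
    assignment = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
    var_sets = [frozenset(abs(lit) for lit in c) for c in clauses]

    def fix(i):
        # Resample every variable of clause i uniformly at random.
        for v in var_sets[i]:
            assignment[v] = rng.random() < 0.5
        # Repair, one at a time, any unsatisfied clause that shares a
        # variable with clause i (clause i itself is among them).
        while True:
            bad = next((j for j, c in enumerate(clauses)
                        if var_sets[j] & var_sets[i]
                        and not satisfied(c, assignment)), None)
            if bad is None:
                return
            fix(bad)

    # One pass suffices: a finished call to fix never breaks a clause
    # that was satisfied before the call.
    for i, c in enumerate(clauses):
        if not satisfied(c, assignment):
            fix(i)
    return assignment
```

For instance, `solve(6, [(1, 2, 3), (-1, 4, 5), (-3, -4, 6), (2, -5, -6)], random.Random(0))` returns an assignment satisfying all four clauses. The recursion is kept literal for readability; a production version would use an explicit stack to avoid Python's recursion limit.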
Claim 3 Suppose every call to Fix terminates. Then Solve calls Fix at most $m$ times, where $m$ is the number of clauses, and terminates with a satisfying assignment.
Proof: For any call to Fix, we claim that every clause that was satisfied before the call is still satisfied after the call completes. This follows by induction, starting at the deepest level of recursion. So, for every call from Solve to Fix($C$), the number of satisfied clauses increases by at least one, since $C$ must now be satisfied when Fix($C$) terminates.
So it remains to show that, with high probability, every call to Fix terminates.
Theorem 4 Let where is a sufficiently large constant. Then the probability that the algorithm makes more than calls to (including both the top-level and recursive calls) is at most .
The proof proceeds by considering the interactions between two agents: the “CPU” and the “Debugger”. The CPU runs the algorithm, periodically sending messages to the Debugger (we describe these messages in more detail below). However, if Fix gets called more than times the CPU interrupts the execution and halts the algorithm.
The CPU needs bits of randomness to generate the initial assignment in Solve, and needs bits to regenerate variables in each call to Fix. Since the CPU will not execute Fix more than times, it might as well generate all its random bits at the very start of the algorithm. So the first step performed by the CPU is to generate a random bitstring of length to provide all the randomness used in executing the algorithm.
The messages sent from the CPU to the Debugger are as follows.
Because the Debugger is notified when every call to Fix starts or finishes, he always knows which clause is currently being processed by Fix. A crucial detail is to figure out how many bits of communication are required to send these messages.
The main point of the proof is to show that, if Fix gets called times, then these messages reveal the random string to the Debugger.
Since each clause is a disjunction (an “or” of literals), there is exactly one assignment to those variables that does not satisfy the clause. So, whenever the CPU tells the Debugger that he is calling Fix(), the Debugger knows exactly what the current assignment to is. So, starting from the assignment that the Debugger received in the final message, he can work backwards and figure out what the previous assignment was before calling Fix. Repeating this process, he can figure out how the variables were set in each call to Fix, and also what the initial assignment was. Thus the Debugger can reconstruct the random string .
The total number of bits sent by the CPU is
So has been compressed from bits to
This is an overall shrinking of
bits, assuming that and are sufficiently big constants.
We have argued that, if Fix gets called times, then can be compressed by bits. The next claim argues that this happens with probability at most .
Claim 5 The probability that can be compressed by bits is at most .
Proof: Consider any deterministic algorithm for encoding all bit strings of length into bit strings of arbitrary length. The number of bit strings that are encoded into bits is at most . So, a random bit string has probability of being encoded into bits. (One can view this as a simple special case of the Kraft inequality.)
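The counting step can be written out as follows. The symbols $N$ (the length of the random string $R$) and $s$ (the number of bits saved) are placeholder names chosen for this sketch, and the encoder $\mathrm{enc}$ is assumed to be injective so that the string can be recovered from its encoding.

```latex
\#\{\text{binary strings of length } N-s\} = 2^{N-s}
\quad\Longrightarrow\quad
\Pr_{R \in \{0,1\}^{N}}\big[\, |\mathrm{enc}(R)| = N-s \,\big]
\;\le\; \frac{2^{N-s}}{2^{N}} \;=\; 2^{-s}.
```

If “compressed by $s$ bits” is read as “encoded into at most $N-s$ bits”, the same count gives fewer than $2^{N-s+1}$ codewords, and hence a probability below $2^{1-s}$.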
In the last lecture we defined the notion of pairwise independence, and saw some ways to use this concept in reducing the randomness of a randomized algorithm. Today we will apply these ideas to designing hash functions and “perfect” hash tables.
1. Universal Hash Families
Last time we showed how to create a joint distribution on random variables such that
Let be the sample space underlying the ‘s. Drawing a sample from determines values for all the ‘s. We can think of this another way: after choosing , then associated with every there is a value in (namely, the value of ).
Formally, for every there is a function given by . (The random variable is implicitly a function of .) This family of functions satisfies the following important property
A family of functions satisfying this property is called a 2-universal hash family. So we can restate our construction from last time as follows.
Theorem 1 For any , there exists a 2-universal hash family with , and . Sampling a function from this family requires mutually independent, uniform random bits. Consequently, each hash function can be represented using bits of space. Evaluating any hash function in this family at any point of takes only a constant number of arithmetic operations involving -bit integers.
One can easily modify this construction to handle any and with , and a power of two.
2. Perfect Hashing
Let $U$ be a “universe” of items. We will assume that our computational model has words of $\Theta(\log |U|)$ bits and that any standard arithmetic operation involving $O(\log |U|)$-bit words takes $O(1)$ time.
Suppose we wish to store a given subset $S \subseteq U$ of $n$ items as a “dictionary”, so that we can efficiently test whether any element $x \in U$ belongs to $S$. There are many well-known solutions to this problem. For example, a balanced binary tree allows us to store the dictionary in $O(n)$ words of space, while search operations take $O(\log n)$ time in the worst case. (Insertion and deletion of items also take $O(\log n)$ time.)
We will present an alternative dictionary design based on hashing. This is a “static dictionary”, which does not allow insertion or deletion of items.
Theorem 2 There is a randomized algorithm that, given a set $S \subseteq U$ with $|S| = n$, stores the set in a data structure using $O(n)$ words of space. There is a deterministic algorithm that, given any item $x \in U$, can search for $x$ in that data structure in $O(1)$ time in the worst case.
Let be any set with and a power of two. Theorem 1 gives us a 2-universal hash family mapping to for which any hash function can be evaluated in $O(1)$ time.
Using this family, consider the following simple design for our dictionary. First create a table of size and randomly choose a function from the hash family. Can we simply store each item in the table in location ? This would certainly be very efficient, because finding in the table only requires evaluating , which takes only time. But the trouble of course is that there might be collisions: there might be two items such that , meaning that and want to lie in the same location of the table.
Most likely you have discussed hash table collisions in an undergraduate class on data structures. One obvious way to try to avoid collisions is to take the table to be larger; the problem is that this increases the storage requirements, and we are aiming for a data structure that uses $O(n)$ words. Alternatively one can allow collisions to happen and use some other technique to deal with them. For example one could use chaining, in which all items hashing to the same location of the table are stored in a linked list, or linear probing, in which some items hashing to the same location are relocated to nearby locations. Unfortunately neither of those techniques solves our problem: when the table size is $O(n)$, both of those techniques will typically require super-constant time to search the table in the worst case.
The solution we present today is based on a simple two-level hierarchical hashing scheme. At the top-level, a single hash function is used to assign each item to a “bucket” in the top-level table. Then, for every bucket of the top-level table we create a new second-level hash table to store the items that hashed to that bucket.
Collisions. Using the pairwise independence property, we can easily determine the probability of any two items having a collision. For any unordered pair of distinct items ,
So the expected total number of collisions is
We would like to carefully choose the size of in order to have few collisions. A natural choice is to let be (rounded up to a power of two), so
by Markov’s inequality. The problem is that storing this would require space.
Our design. Instead, we will choose to be (rounded up to a power of two). This will typically result in some collisions, but the second-level hash tables are used to deal with those collisions. The top-level hash function is denoted , and the elements of are called buckets. By (1),
Let us condition on that event happening: the number of collisions of the top-level hash function is at most . For every bucket , let
be the set of items that are hashed to bucket by the top-level hash function. Let . For each we will sample a new second-level hash function mapping where . With probability at least there will be no collisions of the hash function , as shown by (2). We will repeatedly sample and count its number of collisions until finding a function with no collisions.
Search operation. Every item in is stored in exactly one location in these second-level tables. So to search our data structure for an item we follow a very simple procedure. First we compute , the top-level bucket containing item . Then we compute , which gives the location of in the second-level table . This process takes time.
Space requirements. The only thing left to analyze is the space requirements. The size of the top-level table is at most , by construction. The total size of the second-level tables is
We also need to write down the description of the hash functions themselves. By Theorem 1 each hash function can be represented using only a constant number of words. So the total space required is $O(n)$ words.
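The whole two-level design can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the construction of Theorem 1: keys are assumed to be distinct integers below the prime `P`, the hash family is the common stand-in $x \mapsto ((ax+b) \bmod P) \bmod m$, table sizes are not rounded to powers of two, and the acceptance threshold `4 * n` is a convenient constant. The names `random_hash`, `build` and `contains` are hypothetical.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; keys are assumed to lie in [0, P)

def random_hash(m, rng):
    """Sample h(x) = ((a*x + b) mod P) mod m, a stand-in for the family of Theorem 1."""
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m

def build(keys, rng):
    n = max(1, len(keys))
    # Top level: resample until the squared bucket sizes sum to O(n).
    while True:
        h = random_hash(n, rng)
        buckets = [[] for _ in range(n)]
        for x in keys:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    # Second level: each bucket of size n_i gets a collision-free table of size n_i^2.
    tables = []
    for bucket in buckets:
        size = len(bucket) ** 2
        while True:
            g = random_hash(size, rng) if size else None
            slots = [None] * size
            ok = True
            for x in bucket:
                j = g(x)
                if slots[j] is not None:   # collision: resample g
                    ok = False
                    break
                slots[j] = x
            if ok:
                break
        tables.append((g, slots))
    return h, tables

def contains(structure, x):
    """Two hash evaluations and one comparison: O(1) time in the worst case."""
    h, tables = structure
    g, slots = tables[h(x)]
    return bool(slots) and slots[g(x)] == x
```

Both resampling loops succeed with probability at least $1/2$ per attempt under the collision bounds above, so the expected construction time is linear in the total table size.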
1. Variance
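We begin by reviewing variance, and other related notions, which should be familiar from an introductory probability course. The variance of a random variable $X$ is
$$\mathrm{Var}[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2.$$
The covariance between two random variables $X$ and $Y$ is
$$\mathrm{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y].$$
This gives some measure of the correlation between $X$ and $Y$.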
Here are some properties of variance and covariance that follow from the definitions by simple calculations.
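Claim 1 If $X$ and $Y$ are independent then $\mathrm{Cov}[X, Y] = 0$.
Claim 2 $\mathrm{Cov}[X + Y, Z] = \mathrm{Cov}[X, Z] + \mathrm{Cov}[Y, Z]$.
More generally, induction shows
Claim 3 $\mathrm{Cov}[\sum_i X_i, Z] = \sum_i \mathrm{Cov}[X_i, Z]$.
Claim 4 $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}[X, Y]$.
More generally, induction shows
Claim 5 $\mathrm{Var}[\sum_i X_i] = \sum_i \mathrm{Var}[X_i] + 2 \sum_{i < j} \mathrm{Cov}[X_i, X_j]$.
In particular,
Claim 6 Let $X_1, \ldots, X_n$ be mutually independent random variables. Then $\mathrm{Var}[\sum_i X_i] = \sum_i \mathrm{Var}[X_i]$.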
2. Chebyshev’s Inequality
You have presumably also seen Chebyshev’s inequality before. It is a one-line consequence of Markov’s inequality.
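Theorem 7 For any $t > 0$,
$$\Pr[\,|X - E[X]| \geq t\,] \leq \frac{\mathrm{Var}[X]}{t^2}.$$
Proof:
$$\Pr[\,|X - E[X]| \geq t\,] = \Pr[\,(X - E[X])^2 \geq t^2\,] \leq \frac{E[(X - E[X])^2]}{t^2} = \frac{\mathrm{Var}[X]}{t^2},$$
where the inequality is by Markov’s inequality.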
As a quick example, suppose we independently flip fair coins. What’s the probability that we see at least heads? Let be the indicator random variable of the event “the $i$th toss is heads”. Let . So we want to analyze .
Bound from Chebyshev: Note that
By independence,
By Chebyshev’s inequality
Bound from Chernoff: Chernoff’s inequality gives
This is better than the bound from Chebyshev for .
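To make the comparison concrete, here is a small Python sketch that evaluates both bounds against the exact tail probability. The threshold of $3n/4$ heads is an assumption chosen for illustration, and the Chernoff-type bound used is Hoeffding's form $e^{-2t^2/n}$, which may differ by constants from the form used in lecture.

```python
from math import comb, exp

def exact_tail(n, k):
    """Pr[at least k heads among n independent fair coin flips]."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

def chebyshev_bound(n, t):
    """Var[X] / t^2 with Var[X] = n/4; bounds Pr[|X - n/2| >= t]."""
    return (n / 4) / t ** 2

def hoeffding_bound(n, t):
    """exp(-2 t^2 / n); a Chernoff-type bound on Pr[X - n/2 >= t]."""
    return exp(-2 * t ** 2 / n)

if __name__ == "__main__":
    for n in (20, 100, 400):
        t = n // 4  # deviation needed to reach 3n/4 heads
        print(n, exact_tail(n, n // 2 + t), chebyshev_bound(n, t), hoeffding_bound(n, t))
```

With $t = n/4$ the Chebyshev bound is $4/n$, which decays only polynomially, while the Hoeffding bound is $e^{-n/8}$, which decays exponentially.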
So Chebyshev is weaker than Chernoff, at least for analyzing sums of independent Bernoulli trials. So why do we bother studying Chebyshev? One reason is that Chernoff is designed for analyzing sums of mutually independent random variables. That is quite a strong assumption. In some scenarios, our random variables are not mutually independent, or perhaps we deliberately choose them not to be mutually independent.
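3. $k$-wise independence
A set of events $E_1, \ldots, E_n$ are called $k$-wise independent if for any set $I \subseteq \{1, \ldots, n\}$ with $|I| \leq k$ we have
$$\Pr\Big[\bigwedge_{i \in I} E_i\Big] = \prod_{i \in I} \Pr[E_i].$$
The term pairwise independence is a synonym for $2$-wise independence.
Similarly, a set of discrete random variables $X_1, \ldots, X_n$ are called $k$-wise independent if for any set $I \subseteq \{1, \ldots, n\}$ with $|I| \leq k$ and any values $x_i$ we have
$$\Pr\Big[\bigwedge_{i \in I} X_i = x_i\Big] = \prod_{i \in I} \Pr[X_i = x_i].$$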
Proof: For notational simplicity, consider the case . Then
Example: To get a feel for pairwise independence, consider the following three Bernoulli random variables that are pairwise independent but not mutually independent. There are 4 possible outcomes of these three random variables. Each of these outcomes has probability $1/4$.
They are certainly not mutually independent because the event has probability , whereas . But, by checking all cases, one can verify that they are pairwise independent.
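The case check can be done mechanically. The sketch below assumes the standard example of this kind, in which $X_1$ and $X_2$ are independent fair bits and $X_3 = X_1 \oplus X_2$; the four outcomes of the lecture's table may be listed differently, and the helper names are hypothetical.

```python
from itertools import product
from fractions import Fraction

# Assumed example: X1, X2 are independent fair bits and X3 = X1 XOR X2.
outcomes = [(x1, x2, x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]
prob = Fraction(1, len(outcomes))  # each of the 4 outcomes has probability 1/4

def pr(event):
    """Probability of the set of outcomes on which event(w) is true."""
    return sum(prob for w in outcomes if event(w))

def pairwise_independent():
    return all(
        pr(lambda w: w[i] == a and w[j] == b)
        == pr(lambda w: w[i] == a) * pr(lambda w: w[j] == b)
        for i in range(3) for j in range(i + 1, 3)
        for a in (0, 1) for b in (0, 1)
    )

def mutually_independent():
    return all(
        pr(lambda w: w == (a, b, c))
        == pr(lambda w: w[0] == a) * pr(lambda w: w[1] == b) * pr(lambda w: w[2] == c)
        for a, b, c in product((0, 1), repeat=3)
    )
```

Exact rational arithmetic via `Fraction` avoids any floating-point doubt in the equality tests.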
3.1. Constructing Pairwise Independent RVs
Let be a finite field and . We will construct RVs such that each is uniform over and the ‘s are pairwise independent. To do so, we need to generate only two independent RVs and that are uniformly distributed over . We then define
Claim 9 Each is uniformly distributed on .
Proof: For we have , which is uniform. For and any we have
since as ranges through , also ranges through all of . (In other words, the map is a bijection of to itself.) So is uniform.
Claim 10 The ‘s are pairwise independent.
Proof: We wish to show that, for any distinct RVs and and any values , we have
This event is equivalent to and . We can also rewrite that as:
This holds precisely when
Since and are independent and uniform over , this event holds with probability .
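Both claims can be verified by brute force for a small field. The sketch below assumes the affine form $Y_u = A \cdot u + B$ and takes the field to be the integers modulo a prime $p$; the exact roles of $A$ and $B$ in (1) may differ, and the function names are hypothetical.

```python
from itertools import product
from fractions import Fraction

def joint_distribution(p, u, v):
    """Distribution of (Y_u, Y_v), where Y_u = (A*u + B) mod p and A, B are uniform on Z_p."""
    dist = {}
    for a, b in product(range(p), repeat=2):
        key = ((a * u + b) % p, (a * v + b) % p)
        dist[key] = dist.get(key, 0) + Fraction(1, p * p)
    return dist

def is_pairwise_independent(p):
    """True if every pair (Y_u, Y_v) with u != v is uniform on Z_p x Z_p."""
    target = Fraction(1, p * p)
    return all(
        len(d) == p * p and all(q == target for q in d.values())
        for u in range(p) for v in range(u + 1, p)
        for d in [joint_distribution(p, u, v)]
    )
```

A uniform joint distribution on pairs implies that each $Y_u$ is uniform and that distinct $Y_u, Y_v$ are independent. The check fails for a non-prime modulus such as $4$, where the map $(a, b) \mapsto (au + b, av + b)$ is no longer a bijection; this is why the construction is stated over a field.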
Corollary 11 Given $2n$ mutually independent, uniformly random bits, we can construct $2^n$ pairwise independent, uniformly random strings in $\{0,1\}^n$.
Proof: Apply the previous construction to the finite field $\mathbb{F}_{2^n}$. The $2n$ mutually independent random bits are used to construct $A$ and $B$. The $2^n$ random strings are constructed as in (1).
3.2. Example: Max Cut with pairwise independent RVs
Once again let’s consider the Max Cut problem. We are given a graph where . We will choose -valued random variables . If then we add vertex to .
Our original algorithm chose to be mutually independent and uniform. Instead we will pick to be pairwise independent and uniform. Then
So the original algorithm works just as well if we make pairwise independent decisions instead of mutually independent decisions for placing vertices in . The following theorem shows the advantage of making pairwise independent decisions.
Theorem 12 There is a deterministic, polynomial time algorithm to find a cut containing at least $|E|/2$ edges.
Proof: By Corollary 11, we only need mutually independent, uniform random bits in order to generate our pairwise independent, uniform random bits . We have just argued that these pairwise independent ‘s will give us
So there must exist some particular bits such that fixing for all , we get . We can deterministically find such bits by exhaustive search in trials. This gives a deterministic, polynomial time algorithm.
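The exhaustive search over seeds can be sketched as follows. Note that this sketch swaps in a different, simpler pairwise independent construction than the field-based one of Corollary 11: bit $i$ is the parity of the bitwise AND of the seed with the index $i$, which is pairwise independent because distinct nonzero indices are linearly independent over $\mathbb{F}_2$. The function names are hypothetical, and vertices are assumed to be numbered $0, \ldots, n-1$.

```python
def pairwise_bits(seed, n):
    """X_i = parity of (seed AND i) for i = 1..n; pairwise independent for a uniform seed."""
    return [bin(seed & i).count("1") % 2 for i in range(1, n + 1)]

def cut_size(edges, bits):
    """Number of edges whose endpoints received different bits."""
    return sum(1 for u, v in edges if bits[u] != bits[v])

def deterministic_max_cut(n, edges):
    """Try every seed and return the side assignment that cuts the most edges."""
    b = n.bit_length()  # enough seed bits that the indices 1..n are distinct nonzero masks
    return max((pairwise_bits(seed, n) for seed in range(2 ** b)),
               key=lambda bits: cut_size(edges, bits))
```

There are at most $2n$ seeds, so the search is polynomial. Since the average cut size over all seeds is $|E|/2$, the best seed cuts at least that many edges.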
4. Chebyshev with pairwise independent RVs
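One of the main benefits of pairwise independent RVs is that Chebyshev’s inequality still works beautifully. Suppose that $X_1, \ldots, X_n$ are pairwise independent. For any $i \neq j$,
$$\mathrm{Cov}[X_i, X_j] = E[X_i X_j] - E[X_i]\,E[X_j] = 0$$
by Claim 8. So
$$\mathrm{Var}\Big[\sum_i X_i\Big] = \sum_i \mathrm{Var}[X_i]$$
by Claim 5. So
$$\Pr\Big[\,\Big|\sum_i X_i - E\Big[\sum_i X_i\Big]\Big| \geq t\,\Big] \leq \frac{\sum_i \mathrm{Var}[X_i]}{t^2}.$$
This is exactly the same bound that we would get if the $X_i$’s were mutually independent.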
1. Method of Conditional Expectations
One of the simplest methods for derandomizing an algorithm is the “method of conditional expectations”. In some contexts this is also called the “method of conditional probabilities”.
Let us start with a simple example. Let denote . Suppose is a random variable taking values in . Let be any function and suppose . How can we find an such that ? Well, the assumption guarantees that there exists with . So we can simply use exhaustive search to try all possible values for in only time. The same idea can also be used to find an with .
Now let’s make the example a bit more complicated. Suppose are independent random variables taking values in . Let be any function and suppose . How can we find a vector with ? Exhaustive search is again an option, but now it will take time, which might be too much.
The method of conditional expectations gives a more efficient solution, under some additional assumptions. Suppose that for any numbers we can efficiently evaluate
(If you prefer, you can think of this as , which is a conditional expectation of . This is where the method gets its name.) Then the following algorithm will produce a point with .
Just like in our simple example above, there exists an with , so we can find such an by exhaustive search. That is exactly what the repeat loop is doing.
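A generic version of this loop can be sketched in Python. The sketch assumes the goal is a point whose value is at most the expectation (for the opposite direction, replace `min` by `max`), and it assumes the caller supplies an oracle `cond_exp` for the conditional expectation given a prefix of fixed values. Both names are hypothetical.

```python
def conditional_expectations(n, values, cond_exp):
    """Fix x_1, ..., x_n one at a time without ever increasing the conditional expectation.

    cond_exp(prefix) is assumed to return E[f(X_1..X_n) | X_1..X_i = prefix].
    """
    prefix = []
    for _ in range(n):
        # The current conditional expectation is an average over the next
        # coordinate, so the minimizing value cannot exceed it.
        best = min(values, key=lambda v: cond_exp(prefix + [v]))
        prefix.append(best)
    return prefix
```

Each of the $n$ rounds makes one oracle call per candidate value, in contrast to the exponentially many points examined by exhaustive search over whole vectors.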
1.1. Example: Max Cut
To illustrate this method, let us consider our algorithm for the Max Cut problem from Lecture 1. We are given a graph . Recall that this algorithm generates a cut simply by picking a set uniformly at random. Equivalently, for each vertex , the algorithm independently flips a fair coin to decide whether to put . We argued that .
We will use the method of conditional expectations to derandomize this algorithm. Let the vertex set of the graph be . Let
Let be independent random variables where each is or with probability . We identify the event “” with the event “vertex ”. Then . We wish to deterministically find values for which .
To apply the method of conditional probabilities we must be able to efficiently compute
for any numbers . What is this quantity? It is the expected number of edges cut when we have already decided which vertices amongst belong to , and the remaining vertices are placed in randomly (independently, with probability $1/2$). This expectation is easy to compute! For any edge with both endpoints in we already know whether it will be cut or not. Every other edge has probability exactly $1/2$ of being cut. So we can compute that expected value in linear time.
In conclusion, the method of conditional expectations gives us a deterministic, polynomial time algorithm outputting a set with .
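For Max Cut the method collapses to a simple greedy rule, sketched below in Python. When vertex $v$ is placed, the only edges whose contribution to the conditional expectation depends on the choice are those joining $v$ to already-placed vertices, so maximizing the conditional expectation means placing $v$ opposite the majority of its placed neighbours. The function name and the vertex numbering $0, \ldots, n-1$ are assumptions of this sketch.

```python
def derandomized_max_cut(n, edges):
    """Return side[v] in {0, 1} for v = 0..n-1, cutting at least len(edges)/2 edges."""
    side = []
    for v in range(n):
        # Neighbours of v that have already been placed (self-loops ignored).
        earlier = [u if w == v else w
                   for u, w in edges if max(u, w) == v and u != w]
        ones = sum(side[u] for u in earlier)
        zeros = len(earlier) - ones
        # Side 0 cuts the edges to neighbours on side 1, and vice versa.
        side.append(0 if ones >= zeros else 1)
    return side
```

Each vertex cuts at least half of the edges to its earlier neighbours, and every edge is counted exactly once (at its later endpoint), which gives the $|E|/2$ guarantee directly.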
2. Method of Pessimistic Estimators
So far we have derandomized our very simple Max Cut algorithm, which doesn’t use any sophisticated probabilistic tools. Next we will see what happens when we try to apply these ideas to algorithms that use the Chernoff bound.
Let be independent random variables in . Define the function as follows:
So
which is the typical sort of quantity to which one would apply a Chernoff bound.
Can we apply the method of conditional expectations to this function ? For any numbers , we need to efficiently evaluate
Unfortunately, computing this is not so easy. If the ‘s were i.i.d. Bernoullis then we could compute that probability by expanding it in terms of binomial coefficients. But in the non-i.i.d. or non-Bernoulli case, there does not seem to be an efficient way to compute this probability.
Here is the main idea of “pessimistic estimators”: instead of defining to be equal to that probability, we will define to be an easily-computable upper-bound on that probability. Because is an upper bound on the probability of the bad event “”, the function is called a pessimistic estimate of that probability. So what upper bound should we use? The Chernoff bound, of course!
For simplicity, suppose that are independent Bernoulli random variables. The first step of the Chernoff bound (exponentiation and Markov’s inequality) shows that, for any parameter ,
Important Remark: This step holds for any joint distribution on the ‘s, including any non-independent or conditional distribution. This is because we have only used exponentiation and Markov’s inequality, which need no assumptions on the distribution.
We will use the upper bound in (1) to define our function . Specifically, define
Let’s check that the conditional expectations are easy to compute with this new definition of . Given any numbers , we have
This expectation is easy to compute in linear time, assuming we know the distribution of each (i.e., we know that ).
Applying the method of conditional expectations to the pessimistic estimator: Now we’ll see how to use this function to find with . Set , and . We have
where the first inequality is from (1) and the second inequality comes from the remainder of our Chernoff bound proof. Suppose and are such that this last quantity is strictly less than . Then we know that there exists a vector with .
We now explain how to efficiently and deterministically find such a vector. The method of conditional expectations will give us a vector for which . We now apply the same argument as in (1) to a conditional distribution:
But, under the conditional distribution “”, there is no randomness remaining. The sum is not a random variable; it is simply the number . Since the event “” has probability less than , it must have probability . In other words, we must have .
This example is actually quite silly. If we want to achieve , the best thing to do is obviously to set each . But the method is useful because we can apply it in more complicated scenarios that involve multiple Chernoff bounds.
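For concreteness, here is a minimal Python sketch of the single-sum case. It assumes the bad event is $\sum_i X_i \geq \alpha$ for independent Bernoulli variables with success probabilities `probs`, and it uses the estimator $e^{-\theta\alpha} \cdot E[e^{\theta \sum_i X_i}]$ with the already-fixed coordinates substituted in. The function names and the parameter names `alpha` and `theta` are choices made for this sketch.

```python
from math import exp

def pessimistic_estimator(prefix, probs, alpha, theta):
    """e^{-theta*alpha} * E[e^{theta * sum X_i} | X_1..X_i = prefix], X_j ~ Bernoulli(probs[j])."""
    value = exp(-theta * alpha) * exp(theta * sum(prefix))
    for p in probs[len(prefix):]:
        value *= 1 - p + p * exp(theta)  # E[e^{theta * X_j}] for an unfixed X_j
    return value

def derandomize(probs, alpha, theta):
    """Fix each coordinate to the value that minimizes the pessimistic estimator."""
    prefix = []
    for _ in probs:
        prefix.append(min((0, 1), key=lambda v: pessimistic_estimator(
            prefix + [v], probs, alpha, theta)))
    return prefix
```

If the estimator starts below $1$, it stays below $1$ at every step, and at the end it equals $e^{\theta(\sum_i x_i - \alpha)}$, which forces $\sum_i x_i < \alpha$. As noted above, in this toy case the greedy choice is always $0$; the same loop becomes interesting once the estimator is a sum of several such terms that pull in different directions.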
2.1. Congestion Minimization
In Lecture 3 we gave a randomized algorithm which gives an $O(\log n / \log\log n)$ approximation to the congestion minimization problem. We now get a deterministic algorithm by the method of pessimistic estimators.
Recall that an instance of the problem consists of a directed graph with and a sequence of pairs of vertices. We want to find $s_i$–$t_i$ paths such that each arc is contained in few paths. Let be the set of all paths in from $s_i$ to $t_i$. For every path , we create a variable .
We obtain a fractional solution to the problem by solving this LP.
Let be the optimal value of the LP.
We showed how randomized rounding gives us an integer solution (i.e., an actual set of paths). The algorithm chooses exactly one path from by setting with probability . For every arc let be the indicator of the event “”. Then the congestion on arc is . We showed that . Let . We applied Chernoff bounds to every arc and a union bound to show that
We will derandomize that algorithm with the function
How did we obtain this function? For each arc we applied a Chernoff bound, so each arc has a pessimistic estimator as in (2). We add all of those functions to give us this function .
Note that
Applying the method of conditional expectations, we can find a vector of paths for which . Thus,
Under that conditional distribution there is no randomness left, so the event “any has ” must have probability . So, if we choose the paths then every arc has congestion at most , as desired.