Lecture 25: Online Steiner Tree

The algorithm for probabilistically embedding metric spaces into trees has numerous theoretical applications. It is a key tool in the design of many approximation algorithms and online algorithms. Today we will illustrate the usefulness of these trees in designing an algorithm for the online Steiner tree problem.

1. Online Steiner Tree

Let {G=(V,E)} be a graph and let {c : E \rightarrow {\mathbb R}_{>0}} be lengths on the edges. Let {(V,d_G)} be the shortest path metric on {G}.

For any {U \subseteq V}, a Steiner tree is a subtree of {G} that spans (i.e., contains) all vertices in {U}, but does not necessarily span all of {V}. The vertices in {U} are called “terminals”. Equivalently, we can define a Steiner tree to be an acyclic, connected subgraph of {G} that spans all of {U}. Computing a minimum-length Steiner tree is NP-hard.

Today we consider the problem of constructing a Steiner tree in an online setting. There is a sequence of {k} time steps. In each time step {i}, we are given a vertex {u_i \in V}. Our algorithm must then choose a connected subgraph {C_i} of {G} which spans {\{u_1,\ldots,u_i\}} (and possibly other vertices). The objective is to minimize the total length {c( \cup_{i=1}^k C_i )}. Since we only care about the cost of the union of the {C_i}‘s, we may assume without loss of generality that {C_1 \subseteq \cdots \subseteq C_k}. There is no restriction on the computation time of the algorithm.

Remark. The {C_i}‘s are not actually Steiner trees because we did not insist that they are acyclic. If trees are desired, one could remove cycles from each {C_i} arbitrarily. This is equivalent to the problem that we stated above.

If we knew the terminal set {U = \{u_1,\ldots,u_k\}} in advance then the problem is trivial. The algorithm could compute in exponential time the minimum-length Steiner tree {C^*} spanning {U}, then set {C_i = C^*} in every time step. Unfortunately the algorithm does not know {U} in advance. Instead, our goal will be for the algorithm to behave almost as well as if it did know {U}. Formally, define the competitive ratio of the algorithm to be the ratio

\displaystyle \frac{ c( \cup_{i=1}^k C_i ) }{ c(C^*) }.

We want our algorithm to have small competitive ratio.

Theorem 1 There is a randomized, polynomial time algorithm with expected competitive ratio {O(\log n)}.

I think this is optimal but I could not find a definitive reference. For very similar problems, Alon & Azar prove an {\Omega(\log n / \log \log n)} lower bound and Imase & Waxman prove an {\Omega(\log n)} lower bound.

1.1. The Algorithm

The main idea is to use the algorithm of the last lecture to approximate the metric {(V,d_G)} by a tree {T} with edge lengths {w}. Let {d_T} be the corresponding distance function on {T}. Recall that the leaves of {T} are identified with the vertices in {V}. The algorithm will then build a sequence of Steiner trees {T_1,\ldots,T_k} that are subtrees of {T}, where each {T_i} spans {\{u_1,\ldots,u_i\}}. This is trivial: since {T} is itself a tree, there is really only one reasonable choice for what {T_i} should be. We set {T_i} to be the unique minimal subtree of {T} that spans {\{u_1,\ldots,u_i\}}.

Remark. This step of the algorithm illustrates the usefulness of probabilistically approximating the metric by a tree. Many problems can be solved either trivially or by very simple algorithms, when the underlying graph is a tree.

Clearly {T_1 \subseteq \cdots \subseteq T_k}. We would like to understand how the length of our final Steiner tree {T_k} compares to the optimal Steiner tree {C^*}.

Claim 2 {{\mathrm E}[w(T_k)] \leq O(\log n) \cdot c(C^*)}

Unfortunately the tree {T_k} itself isn’t a solution to our problem. Recall that {T} is not a subtree of {G}: the construction of {T} required adding extra vertices and edges. So {T_k} is not a subtree of {G} either. To obtain our actual solution, we will see below how to use the trees {T_i} to guide our construction of the desired subgraphs {C_i} of {G}.

Proof: (of Claim 2). Let {\pi : [k] \rightarrow U} be an ordering of the terminals given by a depth-first traversal of {C^*}. Equivalently, let {2 \cdot C^*} denote the graph obtained from {C^*} by replacing every edge with two parallel copies. Perform an Euler tour of {2 \cdot C^*}, and let {\pi} be the order in which the terminals are visited.

The Euler tour traverses every edge of {2 \cdot C^*} exactly once, so {c(C^*)} is exactly half the length of the Euler tour. Writing {\pi_{k+1}} for {\pi_1}, we thus obtain

\displaystyle c(C^*) ~=~ (1/2) \cdot (\mathrm{cost~of~Euler~tour}) ~\geq~ (1/2) \cdot \sum_{i=1}^k d_G(\pi_i,\pi_{i+1}). \ \ \ \ \ (1)

 

Now consider performing a walk through {T_k}, visiting the terminals in the order given by {\pi}. Since this walk visits every leaf of {T_k}, it is a traversal of the tree, and hence it crosses every edge at least once. (In fact, it is an Eulerian walk, so it crosses every edge at least twice.)

Thus

\displaystyle w(T_k) \leq \sum_{i=1}^k d_T(\pi_i,\pi_{i+1}). \ \ \ \ \ (2)

 

Combining (1) and (2), we get

\displaystyle {\mathrm E}[\, w(T_k) \,] ~\leq~ {\mathrm E}\Big[\: \sum_{i=1}^k d_T(\pi_i,\pi_{i+1}) \:\Big] \\ ~\leq~ O(\log n) \cdot \sum_{i=1}^k d_G(\pi_i,\pi_{i+1}) \\ ~\leq~ O(\log n) \cdot c(C^*), \ \ \ \ \ (3)

 

as required. \Box

Remark. The analysis in (3) illustrates why our random tree {T} is so useful. The quantity that we’re trying to analyze (namely {w(T_k)}) is bounded by a linear function of some distances in the tree (namely {\sum_i d_T(\pi_i,\pi_{i+1})}). Because we can bound {{\mathrm E}[ d_T(\pi_i,\pi_{i+1}) ]} by {O(\log n) \cdot d_G(\pi_i,\pi_{i+1})}, and because of linearity of expectation, we obtain a bound on {{\mathrm E}[w(T_k)]} involving distances in {G}.

Now let us explain how the algorithm actually solves the online Steiner tree problem. It will maintain a sequence {C_1 \subseteq \cdots \subseteq C_k} of subgraphs of {G} such that each {C_i} spans {\{u_1,\ldots,u_i\}}. Initially {C_1 = \{ u_1 \}}. Then at every subsequent time step {i}, we do the following.

  • Let {v_i} be the closest vertex to {u_i} amongst {\{u_1,\ldots,u_{i-1}\}}, using distances in {T}. In other words, {v_i \,=\, \mathrm{argmin}_{v \in \{u_1,\ldots,u_{i-1}\}} \: d_T(u_i,v) }.
  • Let {P_i} be a shortest path in {G} from {u_i} to {v_i}.
  • Let {C_i = C_{i-1} \cup P_i}.
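To make this concrete, here is a minimal Python sketch of the online procedure. The names graph (an adjacency dict for {G}) and tree_dist (the distance function {d_T} obtained from the random tree) are illustrative assumptions, not part of the lecture.

import heapq

def shortest_path_edges(graph, s, t):
    """Dijkstra on an adjacency dict {u: {v: length}}; returns the edges of a shortest s-t path.
    Assumes t is reachable from s."""
    dist, prev = {s: 0.0}, {}
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path = set()
    while t != s:
        path.add(frozenset((t, prev[t])))
        t = prev[t]
    return path

def online_steiner(graph, tree_dist, terminals):
    """Process the terminals online; returns the edge set of C_k."""
    C, seen = set(), [terminals[0]]
    for u in terminals[1:]:
        v = min(seen, key=lambda x: tree_dist(u, x))   # closest earlier terminal w.r.t. d_T
        C |= shortest_path_edges(graph, u, v)          # P_i, a shortest u-v path in G
        seen.append(u)
    return C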

The trees {T_1,\ldots,T_k} described above can also be viewed in this iterative fashion. Initially {T_1 = \{ u_1 \}}. Then at every subsequent time step {i}, we do the following.

  • Let {Q_i} be the unique path in {T} from {u_i} to the closest vertex in {T_{i-1}}.
  • Let {T_i = T_{i-1} \cup Q_i}.

The following important claim relates {P_i} and {Q_i}.

Claim 3 {c(P_i) \leq 2 \cdot w(Q_i)} for all {i}.

Proof: Let {z_i} be the closest vertex in {T_{i-1}} to {u_i}, so {Q_i} is a {u_i}–{z_i} path. The vertex {z_i} belongs to {T_{i-1}} only because some leaf {v} beneath {z_i} belongs to {\{u_1,\ldots,u_{i-1}\}}. By the choice of edge weights in {T}, the weight of the {z_i}–{v} path is no longer than the weight of the {u_i}–{z_i} path, so {d_T(u_i,v_i) \leq d_T(u_i,v) \leq 2 \cdot w(Q_i)}. Consequently

\displaystyle c(P_i) ~=~ d_G(u_i,v_i) ~\leq~ d_T(u_i,v_i) ~\leq~ 2 \cdot w(Q_i),

as required. \Box

Consequently,

\displaystyle c(C_k) ~\leq~ \sum_{i=1}^k c(P_i) ~\leq~ 2 \sum_{i=1}^k w(Q_i) ~=~ 2 \cdot w(T_k),

since {C_k} is the union of the {P_i} paths and {T_k} is the disjoint union of the {Q_i} paths. So

\displaystyle {\mathrm E}[ c(C_k) ] ~\leq~ O(\log n) \cdot c(C^*),

by Claim 2. This proves that the expected competitive ratio is {O(\log n)}.


Lecture 24: Probabilistic approximation of metrics by trees

1. Probabilistic Approximation of Metrics

For many optimization problems, the input data involves some notion of distance, which we formalize as a metric. But unfortunately many optimization problems can be quite difficult to solve in an arbitrary metric. In this lecture we present a very general approach to dealing with such problems: a method to approximate any metric by much simpler metrics. The simpler metrics we will use are trees, i.e., the shortest path metric on a graph that is a tree. Many optimization problems are easy to solve on trees, so in one fell swoop we get algorithms to approximate a huge number of optimization problems.

Roughly speaking, our main result is: any metric on {n} points can be represented by a distribution on trees, while preserving distances up to a {O(\log n)} factor. Consequently: for many optimization problems involving distances in a metric, if you are content with an {O(\log n)}-approximate solution, you can assume that your metric is a tree.

In order to state our results more formally, we will need to deal with an important issue. To illustrate the issue, and how to deal with it, we first present an example.

1.1. Example: Approximating a cycle

Let {G=(V,E)} be a cycle on {n} nodes. The (spanning) subtrees of {G} are simply the paths obtained by deleting a single edge. So let {uv} be an edge and let {T = G \setminus uv} be the corresponding tree. Is the shortest path metric of {T} a good approximation of the shortest path metric of {G}? The answer is no: the distance between {u} and {v} in {G} is only {1}, whereas the distance between {u} and {v} in {T} is {n-1}. So, no matter which subtree {T} of {G} we pick, there will be some pair of nodes whose distance is poorly approximated.

Is there some way around this problem? Perhaps we don’t need {T} to be a subtree of {G}. We could consider a tree {T=(U,F)} (possibly with lengths on the edges) where {U \supseteq V} and {F} is completely unrelated to {E}. Can such a tree do a better job of approximating distances in {G}? It turns out that the answer is still no: there will always be a pair of nodes whose distance is only preserved up to a factor {\Theta(n)}. But here is a small observation: any subtree of {G} approximately preserves the average distances. One can easily check that the total distance between all pairs of nodes is {\Theta(n^3)}, for both {G} and for any subtree of {G}. Thus, subtrees approximate the distances in {G} “on average”.

So for the {n}-cycle, a subtree cannot approximate all distances, but it can approximate the average distance. This motivates us to apply a trick that is both simple and counterintuitive. It turns out that we can approximate all distances if we allow ourselves to pick the subtree randomly. (The trick is Von Neumann’s minimax theorem, and it implies that approximating the average distance is equivalent to finding a distribution on trees for which every distance is approximated in expectation.) To illustrate this, choose any pair of vertices {u, v \in V}. Let {d=d_G(u,v)} be the distance between {u} and {v} in {G}. Pick a subtree {T} by deleting an edge {e} at random and let {d_T(u,v)} be the {u}–{v} distance in {T}. Obviously {d_T(u,v) \geq d_G(u,v)} since we constructed {T} by removing {e} from {G}. We now give an upper bound on {{\mathrm E}[ d_T(u,v) ]}. If {e} is on the shortest {u}–{v} path then {d_T(u,v) = n-d}; the probability of that happening is {d/n}. Otherwise, {d_T(u,v) = d}. Thus,

\displaystyle {\mathrm E}[ d_T(u,v) ] ~=~ (d/n) \cdot (n-d) + (1-d/n) \cdot d ~\leq~ 2d.

So, the distance between every pair of vertices of {G} is approximated to within a factor of {2}, in expectation.
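This calculation is easy to check empirically. Here is a small Python simulation (purely illustrative) that deletes a uniformly random edge of the {n}-cycle and averages the resulting {u}–{v} distance.

import random

def avg_tree_distance(n, u, v, trials=100000):
    """Average u-v distance over random subtrees of the n-cycle
    (each subtree is obtained by deleting one uniformly random edge)."""
    total = 0
    for _ in range(trials):
        e = random.randrange(n)    # delete the edge {e, (e+1) % n}
        pu = (u - e - 1) % n       # positions of u and v along the remaining path
        pv = (v - e - 1) % n
        total += abs(pu - pv)
    return total / trials

# d_G(0,7) = 7 on the 100-cycle; the average should be about 13.02 <= 2*7
print(avg_tree_distance(100, 0, 7))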

1.2. Main Theorem

We now show that, for every metric {(X,d)} with {|X|=n}, there is an algorithm that generates a random tree for which all distances are approximated to within a factor of {O(\log n)}, in expectation.

Theorem 1 Let {(X,d)} be a finite metric with {|X|=n}. There is a randomized algorithm that generates a set of vertices {Y}, a map {f : X \rightarrow Y}, a tree {T=(Y,F)}, and weights {w : F \rightarrow {\mathbb R}_{> 0}} such that

\displaystyle d_X(x,y) ~\leq~ d_T\big(f(x),f(y)\big) \quad\forall x,y \in X \ \ \ \ \ (1)

 

and

\displaystyle {\mathrm E}[ d_T\big(f(x),f(y)\big) ] ~\leq~ O(\log n) \cdot d_X(x,y) \quad\forall x,y \in X. \ \ \ \ \ (2)

 

The main tool in the proof is the random partitioning algorithm that we developed in the last two lectures. For notational simplicity, let us scale our distances and pick a value {m} such that {1 < d(x,y) \leq 2^m} for all distinct {x,y \in X}. Note that {m} does not appear in the statement of the theorem, so we do not care how big it is.

The main idea is to generate a {2^i}-bounded random partition {{\mathcal P}_i} of {X} for every {i=0,\ldots,m}, and then assemble those partitions into the desired tree. Assembling them is not too difficult, but there is one annoyance: the parts of {{\mathcal P}_i} have absolutely no relation to the parts of {{\mathcal P}_{i'}} for any {i \neq i'}. If the parts of {{\mathcal P}_i} were nicely nested inside the parts of {{\mathcal P}_{i+1}} then this would induce a natural hierarchy on the parts, and therefore give us a nice tree structure.

The solution to this annoyance is to forcibly construct a nice partition {{\mathcal Q}_i}, for {i=0,\ldots,m}, that is nested inside all of {{\mathcal P}_i,{\mathcal P}_{i+1},\ldots,{\mathcal P}_m}. In lattice theory terminology, we define the partition

\displaystyle {\mathcal Q}_i ~=~ {\mathcal P}_m \wedge \cdots \wedge {\mathcal P}_{i+1} \wedge {\mathcal P}_i,

where {\wedge} is the meet operation in the partition lattice. If you’re not familiar with this notation, don’t worry; it is easy to explain. Simply define {{\mathcal Q}_m = {\mathcal P}_m}, then let

\displaystyle {\mathcal Q}_i ~=~ \{\: A \cap B \::\: A \in {\mathcal Q}_{i+1} ,\: B \in {\mathcal P}_i \:\}.

Note that {{\mathcal Q}_i} is also a partition of {X}. Furthermore, the parts of {{\mathcal Q}_i} are nicely nested inside the parts of {{\mathcal Q}_{i+1}}, so we have obtained the desired hierarchical structure.

1.3. Example

Consider the following example which shows some possible partitions {{\mathcal P}_0,\ldots,{\mathcal P}_4} for the points {X=\{a,b,c,d,e,f\}}, and the corresponding partitions {{\mathcal Q}_0,\ldots,{\mathcal Q}_4}.

The tree corresponding to these partitions is as follows.

1.4. Algorithm

More formally, here is our algorithm for generating the random tree.

  • For {i=0,\ldots,m}, let {{\mathcal P}_i} be a {2^i}-bounded random partition generated by our algorithm from the last lecture.
  • {\rhd}: The vertices in {Y} will be pairs of the form {(i,S)} where {i \in \{0,\ldots,m\}} and {S \subseteq X}. The vertices and edges of the tree {T} are generated by the following steps.
  • Define {{\mathcal Q}_m = {\mathcal P}_m}. Add the vertex {(m,X)} as the root of the tree.
  • For {i=m-1} downto {0}
    • Define {{\mathcal Q}_i = \{\: A \cap B \::\: A \in {\mathcal Q}_{i+1} ,\: B \in {\mathcal P}_i \:\}}.
    • For every such set {A \cap B \in {\mathcal Q}_i}, add the vertex {(i,A \cap B)} to {T} as a child of {(i+1,A)}, connected by an edge of length {2^{i+1}}.
  • Since {1 < d(x,y)} for all distinct {x,y} and since {{\mathcal P}_0} is {1}-bounded, the partition {{\mathcal P}_0} must partition {X} into singletons. Therefore we may define the map {f : X \rightarrow Y} by {f(x) = (0,\{x\})}.

1.5. Analysis

    Claim 2 Fix any distinct points {x, y \in X}. Let {\ell \in \{0,\ldots,m-1\}} be the largest index with {{\mathcal P}_\ell(x) \neq {\mathcal P}_\ell(y)}. Then { d_T\big(f(x),f(y)\big) = 2^{\ell+3} - 4 }.

    Proof: The level {\ell} is the highest level of the {{\mathcal P}_i} partitions in which {x} and {y} are separated. A simple inductive argument shows that {\ell} is also the highest level of the {{\mathcal Q}_i} partitions in which {x} and {y} are separated. So the least common ancestor in {T} of {f(x)} and {f(y)} is at level {\ell+1}. Let us call the least common ancestor {v}. Then

    \displaystyle d_T\big(f(x),v\big) ~=~ d_T\big(f(y),v\big) ~=~ \sum_{i=0}^{\ell} 2^{i+1} ~=~ 2^{\ell+2} - 2.

    Since {d_T\big(f(x),f(y)\big) = d_T\big(f(x),v\big) + d_T\big(v,f(y)\big)}, the proof is complete. \Box

    Claim 3 (1) holds.

    Proof: Let {i} be such that {2^i < d_X(x,y) \leq 2^{i+1}}. Since {{\mathcal P}_i} is {2^i}-bounded, {x} and {y} must lie in different parts of {{\mathcal P}_i}, i.e., {{\mathcal P}_i(x) \neq {\mathcal P}_i(y)}. By Claim 2,

    \displaystyle d_T\big(f(x),f(y)\big) ~\geq~ 2^{i+3} - 4 ~>~ 2^{i+1} ~\geq~ d_X(x,y),

    as required. \Box

    Claim 4 (2) holds.

    Proof: Fix any {x,y \in X} and let {r = d_X(x,y)}. We have

    \displaystyle \begin{array}{rcl} {\mathrm E}[\: d_T\big(f(x),f(y)\big) \:] &=& \sum_{i=0}^m {\mathrm{Pr}}[\: i \mathrm{~is~the~largest~index~with~} {\mathcal P}_i(x) \neq {\mathcal P}_i(y) \:] \cdot (2^{i+3} - 4) \\ &<& \sum_{i=0}^m {\mathrm{Pr}}[ {\mathcal P}_i(x) \neq {\mathcal P}_i(y) ] \cdot 2^{i+3} \\ &\leq& \sum_{i=0}^m {\mathrm{Pr}}[ B(x,r) \not\subseteq {\mathcal P}_i(x) ] \cdot 2^{i+3} \\ &\leq& O(\log n) \cdot r, \end{array}

    where the last inequality, proven in the following claim, applies Theorem 2 of Lecture 22 and performs a short calculation. \Box

    Claim 5 For any {x \in X} and {r>1},

    \displaystyle \sum_{i=0}^m {\mathrm{Pr}}[ B(x,r) \not\subseteq {\mathcal P}_i(x) ] \cdot 2^{i+3} ~\leq~ O(\log n) \cdot r.

    Proof: Let {k} be the integer with {2^k < r \leq 2^{k+1}}. Then

    \displaystyle \begin{array}{rcl} & & \sum_{i=0}^m {\mathrm{Pr}}[ B(x,r) \not\subseteq {\mathcal P}_i(x) ] \cdot 2^{i+3} \\ &\leq& \sum_{i=0}^{k+3} 2^{i+3} ~+~ \sum_{i=k+4}^m \frac{8r}{2^i} \cdot H\Big( |B(x,2^{i-2}-r)|, |B(x,2^{i-1}+r)| \Big) \cdot 2^{i+3} \\ &\leq& 128r ~+~ 64r \cdot \sum_{i=k+4}^m H\Big( |B(x,2^{i-3})|, |B(x,2^i)| \Big), \end{array}

    since {r \leq 2^{i-3}} when {i \geq k+4}. The final sum is upper bounded as follows.

    \displaystyle \begin{array}{rcl} & & \sum_{i=k+4}^m H\big( |B(x,2^{i-3})|, |B(x,2^i)| \big) \\ &=& \sum_{i=k+4}^m \Big( H\big( |B(x,2^{i-3})|, |B(x,2^{i-2})| \big) + H\big( |B(x,2^{i-2})|, |B(x,2^{i-1})| \big) + H\big( |B(x,2^{i-1})|, |B(x,2^i)| \big) \Big) \\ &<& 3 \sum_{i=k+2}^m H\big( |B(x,2^{i-1})|, |B(x,2^i)| \big) \\ &=& 3 \cdot H\big(|B(x,2^{k+1})|,|B(x,2^m)|\big) \\ &=& O(\log n). \end{array}

    This proves the claimed inequality. \Box

 


Lecture 23: Random partitions of metric spaces (continued)

We continue the proof of our theorem from last time on random partitions of metric spaces.

1. Review of Previous Lecture

Define the partial Harmonic sum {H(a,b) = \sum_{i=a+1}^b 1/i}. Let {B(x,r) = \{\: y \in X \::\: d(x,y) \leq r \}} be the ball of radius {r} around {x}.

Theorem 1 Let {(X,d)} be a metric with {|X|=n}. For every {\Delta>0}, there is a {\Delta}-bounded random partition {{\mathcal P}} of {X} with

\displaystyle {\mathrm{Pr}}[ B(x,r) \not\subseteq {\mathcal P}(x) ] ~\leq~ \frac{ 8r }{ \Delta } \:\cdot\: H\big(\: |B(x,\Delta/4-r)|,\: |B(x,\Delta/2+r)| \:\big) \quad \forall x \in X ,\: \forall r>0. \ \ \ \ \ (1)

 

The algorithm to construct {{\mathcal P}} is as follows.

  • Pick {\alpha \in (1/4,1/2]} uniformly at random.
  • Pick a bijection (i.e., ordering) {\pi : \{1,\ldots,n\} \rightarrow X} uniformly at random.
  • For {i=1,\ldots,n}
  • Set { P_i \,=\, B(\pi(i),\alpha \Delta) \,\setminus\, \cup_{j=1}^{i-1} \, P_j }.

 

  • Output the random partition {{\mathcal P} = \{P_1,\ldots,P_n\}}.

We have already proven that this outputs a {\Delta}-bounded partition. So it remains to prove (1).
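For concreteness, here is a minimal Python sketch of this partitioning algorithm; representing the metric as a distance function dist is an assumption made for illustration.

import random

def random_partition(X, dist, Delta):
    """One sample of the Delta-bounded random partition of the point set X."""
    alpha = random.uniform(0.25, 0.5)      # random radius multiplier alpha in (1/4, 1/2]
    order = list(X)
    random.shuffle(order)                  # uniformly random ordering pi
    part_of = {}
    for i, c in enumerate(order):
        for y in X:                        # P_i = B(pi(i), alpha*Delta) minus earlier parts
            if y not in part_of and dist(c, y) <= alpha * Delta:
                part_of[y] = i
    parts = {}
    for y, i in part_of.items():
        parts.setdefault(i, set()).add(y)
    return list(parts.values())

# e.g. on the line metric {0,...,9} with Delta = 4:
print(random_partition(range(10), lambda a, b: abs(a - b), 4))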

2. The Proof

    Fix any point {x \in X} and radius {r>0}. For brevity let {B = B(x,r)}. Let us order all points of {X} as {\{y_1,\ldots,y_n\}} where {d(x,y_1) \leq \cdots \leq d(x,y_n)}. The proof involves two important definitions.

    • Sees: A point {y} sees {B} if {d(x,y) \leq \alpha \Delta+r}.
    • Cuts: A point {y} cuts {B} if {\alpha \Delta - r \leq d(x,y) \leq \alpha \Delta+r}.

    Obviously “cuts” implies “sees”. To help visualize these definitions, the following claim interprets their meaning in Euclidean space. (In a finite metric, the ball {B} is not a continuous object, so it doesn’t really have a “boundary”.)

    Claim 2 Consider the metric {(X,d)} where {X = {\mathbb R}^n} and {d} is the Euclidean metric. Then

    • {y} sees {B} if and only if {B=B(x,r)} intersects {B(y,\alpha \Delta)}.
    • {y} cuts {B} if and only if {B=B(x,r)} intersects the boundary of {B(y,\alpha \Delta)}.

     

    The following claim is in the same spirit, but holds for any metric.

    Claim 3 Let {(X,d)} be an arbitrary metric. Then

    • If {y} does not see {B} then {B \cap B(y,\alpha \Delta) = \emptyset}.
    • If {y} sees {B} but does not cut {B} then {B \subseteq B(y, \alpha \Delta)}.

     

    To illustrate the definitions of “sees” and “cuts”, consider the following example. The blue ball around {x} is {B}. The points {y_1} and {y_2} both see {B}; {y_3} does not. The point {y_2} cuts {B}; {y_1} and {y_3} do not. This example illustrates Claim 3: {y_1} sees {B} but does not cut {B}, and we have {B \subseteq B(y_1, \alpha \Delta)}.

    The most important point for us to consider is the first point under the ordering {\pi} that sees {B}. We call this point {y_{\pi(k)}}.

    The first {k-1} iterations of the algorithm did not assign any point in {B} to any {P_i}. To see this, note that {y_{\pi(1)},\ldots,y_{\pi(k-1)}} do not see {B}, by choice of {k}. So Claim 3 implies that {B \cap B(y_{\pi(i)},\alpha \Delta) = \emptyset ~\forall i<k}. Consequently

    \displaystyle B \cap P_i = \emptyset \quad\forall i<k. \ \ \ \ \ (2)

     

    The point {y_{\pi(k)}} sees {B} by definition, but it may or may not cut {B}. If it does not cut {B} then Claim 3 shows that {B \subseteq B(y_{\pi(k)},\alpha \Delta)}. Thus

    \displaystyle B \cap P_k ~=~ \Big(\underbrace{B \cap B(y_{\pi(k)},\alpha \Delta)}_{=\, B}\Big) ~\setminus~ \bigcup_{i=1}^{k-1} \underbrace{ B \!\cap\! P_i }_{=\, \emptyset} ~=~ B,

    i.e., {B \subseteq P_k}. Since {{\mathcal P}(x) = P_k}, we have shown that

    \displaystyle y_{\pi(k)} \mathrm{~does~not~cut~} B \quad\implies\quad B \subseteq {\mathcal P}(x).

    Taking the contrapositive of this statement, we obtain

    \displaystyle {\mathrm{Pr}}[ B \not\subseteq {\mathcal P}(x) ] ~\leq~ {\mathrm{Pr}}[ y_{\pi(k)} \mathrm{~cuts~} B ] ~=~ \sum_{i=1}^n {\mathrm{Pr}}[ y_{\pi(k)}=y_i ~\wedge~ y_i \mathrm{~cuts~} B ].

    Let us now simplify that sum by eliminating terms that are equal to {0}.

    Claim 4 If {y \not \in B(x,\Delta/2+r)} then {y} does not see {B}.

    Claim 5 If {y \in B(x,\Delta/4-r)} then {y} sees {B} but does not cut {B}.

    So define {a = |B(x,\Delta/4-r)|} and {b=|B(x,\Delta/2+r)|}. Then we have shown that

    \displaystyle {\mathrm{Pr}}[ B \not\subseteq {\mathcal P}(x) ] ~\leq~ \sum_{i=a+1}^b {\mathrm{Pr}}[ y_{\pi(k)}=y_i ~\wedge~ y_i \mathrm{~cuts~} B ].

    The remainder of the proof is quite interesting. The main point is that these two events are “nearly independent”, since {\alpha} and {\pi} are independent, “{y_i \mathrm{~cuts~} B}” depends only on {\alpha}, and “{y_{\pi(k)}=y_i}” depends primarily on {\pi}. Formally, we write

    \displaystyle {\mathrm{Pr}}[ B \not\subseteq {\mathcal P}(x) ] ~\leq~ \sum_{i=a+1}^b {\mathrm{Pr}}[ y_i \mathrm{~cuts~} B ] \cdot {\mathrm{Pr}}[\: y_{\pi(k)}=y_i \:|\: y_i \mathrm{~cuts~} B ]

    and separately upper bound these two probabilities.

    The first probability is easy to bound:

    \displaystyle {\mathrm{Pr}}[ y_i \mathrm{~cuts~} B ] ~=~ {\mathrm{Pr}}[\: \alpha \Delta \in [d(x,y_i)-r,d(x,y_i)+r] \:] ~\leq~ \frac{2r}{\Delta/4},

    because {2r} is the length of the interval {[d(x,y_i)-r,d(x,y_i)+r]} and {\Delta/4} is the length of the interval from which {\alpha \Delta} is randomly chosen.

    Next we bound the second probability. Recall that {y_{\pi(k)}} is defined to be the first element in the ordering {\pi} that sees {B}. Since {y_i} cuts {B}, we know that {d(x,y_i) \leq \alpha \Delta+r}. Every {y_j} with {j \leq i} has {d(x,y_j) \leq d(x,y_i) \leq \alpha \Delta+r}, so {y_j} also sees {B}. This shows that there are at least {i} elements that see {B}. So the probability that {y_i} is the first element in the random ordering to see {B} is at most {1/i}.

    Combining these bounds on the two probabilities we get

    \displaystyle {\mathrm{Pr}}[ B \not\subseteq {\mathcal P}(x) ] ~\leq~ \sum_{i=a+1}^b \frac{8r}{\Delta} \cdot \frac{1}{i} ~=~ \frac{8r}{\Delta} \cdot H(a,b),

    as required.

3. Optimality of these partitions

    Theorem 1 from the previous lecture shows that there is a universal constant {L=O(1)} such that every metric has a {\log(n)/10}-bounded, {L}-Lipschitz random partition. We now show that this is optimal.

    Theorem 6 There exist graphs {G} whose shortest path metric {(X,d)} has the property that any {\log(n)/10}-bounded, {L}-Lipschitz random partition must have {L = \Omega(1)}.

    The graphs we need are expander graphs. In Lecture 20 we defined bipartite expanders. Today we need non-bipartite expanders. We say that {G=(V,E)} is a non-bipartite expander if, for some constants {c > 0} and {d \geq 3}:

    • {G} is {d}-regular, and
    • {|\delta(S)| \geq c|S|} for all {|S| \leq |V|/2}.

    It is known that expanders exist for all {n=|V|}, {d=3} and {c \geq 1/1000}. (The constant {c} can of course be improved.)

    Proof: Suppose {(X,d)} has a {\log(n)/10}-bounded, {L}-Lipschitz random partition. Then there exists a particular partition {P} that is {\log(n)/10}-bounded and cuts at most an {L}-fraction of the edges. Every part {P_i} in the partition has diameter at most {\log(n)/10}. Since the graph is {3}-regular, the number of vertices in {P_i} is at most {3^{\log(n)/10} < n/2}. So every part {P_i} has size less than {n/2}. By the expansion condition, the number of edges cut is at least

    \displaystyle \frac{1}{2} \sum_{i} c \cdot |P_i| ~=~ cn/2 ~=~ \Omega(|E|).

    So {L = \Omega(1)}. \Box

4. Appendix: Proofs of Claims

    Proof: (of Claim 3) Suppose {y} does not see {B}. Then {d(x,y) > \alpha \Delta + r}. Every point {z \in B} has {d(x,z) \leq r}, so {d(y,z) \geq d(y,x) - d(x,z) > \alpha \Delta + r - r}, implying that {z \not \in B(y,\alpha \Delta)}.

    Suppose {y} sees {B} but does not cut {B}. Then {d(x,y) < \alpha \Delta - r}. Every point {z \in B} has {d(x,z) \leq r}. So {d(y,z) \leq d(y,x) + d(x,z) < \alpha \Delta - r + r}, implying that {z \in B(y,\alpha \Delta)}. \Box

    Proof: (of Claim 4) The hypothesis of the claim is that {d(x,y) > \Delta/2+r}. Since {\alpha \leq 1/2}, this gives {d(x,y) > \alpha \Delta+r}, implying that {y} does not see {B}. \Box

    Proof: (of Claim 5) The hypothesis of the claim is that {d(x,y) \leq \Delta/4-r}, which is strictly less than {\alpha \Delta-r} since {\alpha > 1/4}. So {d(x,y) < \alpha \Delta - r}, which implies that {y} sees {B} but does not cut {B}. \Box

 


Lecture 22: Random partitions of metric spaces

For many problems in computer science, there is a natural notion of “distance”. For example, perhaps the input data consists of real vectors for which it makes sense to measure their distance via the usual Euclidean distance. But in many cases, it makes sense to measure distances using a different “metric” that is not Euclidean distance.

There are many sophisticated algorithms for manipulating data involving these general metrics. One common paradigm for designing such algorithms is the familiar divide-and-conquer approach. Today we will discuss the fundamental algorithmic tool of partitioning metric spaces for the purpose of designing such divide-and-conquer algorithms.

This topic might seem a bit abstract and unmotivated. The next lecture will build on today’s ideas and present some algorithms whose usefulness is more easily appreciated.

1. Metrics

A metric is a set of points {X} together with a “distance function” {d : X \times X \rightarrow {\mathbb R}} such that

  • “Non-negativity”: {d(x,y) \geq 0} for all {x, y \in X}.
  • {d(x,x) = 0} for all {x \in X} (but we allow {d(x,y)=0} for {x \neq y}).
  • “Symmetry”: {d(x,y) = d(y,x)} for all {x,y \in X}.
  • “The triangle inequality”: {d(x,z) \leq d(x,y) + d(y,z)} for all {x,y,z \in X}.

In some scenarios this would be called a “semimetric” or “pseudometric”; a “metric” would additionally require that {d(x,y)=0 \iff x=y}.

Here are some standard examples of metrics with which you should be familiar.

  • The Euclidean (or {L_2}) Metric: {X={\mathbb R}^n} and {d(x,y) = \sqrt{\sum_i (x_i-y_i)^2}}.
  • The Manhattan (or {L_1}) Metric: {X={\mathbb R}^n} and {d(x,y) = \sum_i | x_i-y_i |}.
  • Shortest Path Metrics: Let {G=(V,E)} be a graph with non-negative lengths associated with the edges. The set of points is {X=V}. The distance function is defined by letting {d(x,y)} be the length of the shortest path between {x} and {y}.

Notice that in the first two examples the set {X} is infinite, but in the last example {X} is finite. When {X} is finite we call the pair {(X,d)} a finite metric space. We can also obtain finite metric spaces by restricting the first two examples to a finite subset of {{\mathbb R}^n} and keeping the same distance functions. In computer science we are often only interested in finite metric spaces because the input data to the problem is finite.

2. Lipschitz Random Partitions

Let {(X,d)} be a metric space. Let {P = \{ P_1, P_2, \ldots \}} be a partition of {X}, i.e., the {P_i}‘s are pairwise disjoint and their union is {X}. The {P_i}‘s are called the parts. Let us define the following notation: for {x \in X}, let {P(x)} be the unique part {P_i} that contains {x}.

The diameter of a part {P_i} is { \max \{\: d(x,y) \::\: x,y \in P_i \} }. We say that the partition {P} is {\Delta}-bounded if every {P_i} has diameter at most {\Delta}.

As you will see soon, it will be very useful for us to choose a partition of {(X,d)} randomly. We say that a random partition {{\mathcal P}} is {\Delta}-bounded if every possible partition that can occur as a realization of the random {{\mathcal P}} is a {\Delta}-bounded partition.

Our goal is that points that are “close” to each other should have good probability of ending up in the same part. Formally, let {{\mathcal P}} be a randomly chosen partition. We say that {{\mathcal P}} is {L}-Lipschitz if

\displaystyle {\mathrm{Pr}}[ {\mathcal P}(x) \neq {\mathcal P}(y) ] ~\leq~ L \cdot d(x,y) \qquad \forall x,y \in X.

So if {x} and {y} are close, they have a smaller probability of being assigned to different parts of the partition. Note that this definition is not “scale-invariant”, in the sense that we need to double {L} if we halve all the distances.

Combining the {\Delta}-bounded and {L}-Lipschitz concepts is very interesting. Let us illustrate this with an example.

2.1. Example

Consider the “line metric”, where {X=\{1,\ldots,n\}}, {n} is odd, and {d(x,y) = |x-y|}. The diameter of this metric is clearly {n-1}. Consider the partition {P} where {P_1 = \{1,\ldots,\lfloor n/2 \rfloor \}} and {P_2 = \{\lceil n/2 \rceil,\ldots,n\}}. This partition is {\Delta}-bounded for {\Delta = \lfloor n/2 \rfloor}. Does it capture our goal that “close points should end up in the same part”?

In some sense, yes. The only two consecutive points that ended up in different parts are the points {\lfloor n/2 \rfloor} and {\lceil n/2 \rceil}, so most pairs of consecutive points did end up in the same part. But if we modify our metric slightly, this is no longer true. Consider making {n} copies of both of the points {\lfloor n/2 \rfloor} and {\lceil n/2 \rceil}, keeping the copies at the same location in the metric. (This is valid because we’re really looking at semimetrics: we allow {d(x,y)=0} for different points {x} and {y}.) After this change, a constant fraction of the pairs of consecutive points now end up in different parts!

So, even in this simple example of a line metric, if we allow multiple copies of (equivalently, non-negative weights on) each point, it becomes much less clear how to choose a partition for which most close points end up in the same part.

Choosing a random partition makes life much easier. Pick an index {k \in \{n/3,\ldots,2n/3\}} uniformly at random, then set {P_1 = \{1,\ldots,k\}} and {P_2 = \{k+1,\ldots,n\}}. Let {{\mathcal P}} be the resulting random partition. These partitions always have diameter at most {2n/3} (even if we made multiple copies of points). So {{\mathcal P}} is {\Delta}-bounded with {\Delta=2n/3}.

Now consider any two consecutive points {i} and {i+1}. They end up in different parts of the partition only if {k=i}, which happens with probability at most {3/n}. Thus {{\mathrm{Pr}}[ {\mathcal P}(i) \neq {\mathcal P}(i+1) ] \leq 3/n}. More generally

\displaystyle {\mathrm{Pr}}[ {\mathcal P}(x) \neq {\mathcal P}(y) ] ~\leq~ \frac{3}{n} \cdot d(x,y).

So {{\mathcal P}} is an {L}-Lipschitz partition with {L=3/n}. The key point is: this holds regardless of how many copies of the points we make. So this same random partition {{\mathcal P}} works under any scheme of copying (i.e., weighting) the points of this metric.
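A quick simulation confirms this bound. The following Python snippet (illustrative only) estimates {{\mathrm{Pr}}[ {\mathcal P}(x) \neq {\mathcal P}(y) ]} for the random cut described above.

import random

def cut_probability(n, x, y, trials=200000):
    """Empirical Pr[P(x) != P(y)] for the random partition of the line metric:
    pick k uniformly in {n/3,...,2n/3} and split into {1..k} and {k+1..n}. Assumes x < y."""
    lo, hi = n // 3, 2 * n // 3
    cuts = 0
    for _ in range(trials):
        k = random.randint(lo, hi)
        if x <= k < y:             # x and y land in different parts
            cuts += 1
    return cuts / trials

# d(40,45) = 5 with n = 99, so the Lipschitz bound is (3/99)*5, about 0.152
print(cut_probability(99, 40, 45))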

2.2. The General Theorem

The previous example achieves our “gold standard” of a random partition: {L = O(1/\Delta)}. We can think of this as meaning that the probability of adjacent points ending up in different parts is roughly the inverse of the diameter of those parts. Our main theorem is that, by increasing {L} by a logarithmic factor, we can obtain a similar partition of any finite metric.

Theorem 1 Let {(X,d)} be a metric with {|X|=n}. For every {\Delta>0}, there is a {\Delta}-bounded, {L}-Lipschitz random partition {{\mathcal P}} of {X} with {L = O(\log(n)/\Delta)}.

This theorem is optimal: for any {n} there are metrics on {n} points for which every {\Delta}-bounded, {L}-Lipschitz partition has {L = \Omega(\log(n)/\Delta)}.

Theorem 1 is a corollary of the following more general theorem. The statement is a bit messy, but the mess will be important in proving the main result of the next lecture. Define the partial Harmonic sum {H(a,b) = \sum_{i=a+1}^b 1/i}. Let {B(x,r) = \{\: y \in X \::\: d(x,y) \leq r \}} be the ball of radius {r} around {x}.

Theorem 2 Let {(X,d)} be a metric with {|X|=n}. For every {\Delta>0}, there is a {\Delta}-bounded random partition {{\mathcal P}} of {X} with

\displaystyle {\mathrm{Pr}}[ B(x,r) \not\subseteq {\mathcal P}(x) ] ~\leq~ \frac{8r}{\Delta} \:\cdot\: H\big(\: |B(x,\Delta/4-r)|,\: |B(x,\Delta/2+r)| \:\big) \qquad \forall x \in X ,\: \forall r>0. \ \ \ \ \ (1)

 

Proof: (of Theorem 1) Let {{\mathcal P}} be the random partition from Theorem 2. Consider any {x,y \in X} and let {r = d(x,y)}. Note that if {{\mathcal P}(x) \neq {\mathcal P}(y)} then {B(x,r) \not\subseteq {\mathcal P}(x)}. Thus

\displaystyle \begin{array}{rcl} {\mathrm{Pr}}[ {\mathcal P}(x) \neq {\mathcal P}(y) ] &\leq& (8r/\Delta) \cdot H\big(\: |B(x,\Delta/4-r)|,\: |B(x,\Delta/2+r)| \:\big) \\ &\leq& (8r/\Delta) \cdot H(0, n ) \\ &=& O(\log(n)/\Delta) \cdot d(x,y), \end{array}

since {H(0,n) = \sum_{i=1}^{n} 1/i = O(\log n)}. \Box

2.3. Proof of Theorem 2

We start off by presenting the algorithm that generates the random partition {{\mathcal P}}. As is often the case with the algorithms we have seen in this class, the algorithm is very short, yet extremely clever and subtle.

  • Pick {\alpha \in (1/4,1/2]} uniformly at random.
  • Pick a bijection (i.e., ordering) {\pi : \{1,\ldots,n\} \rightarrow X} uniformly at random.
  • For {i=1,\ldots,n}
  • Set { P_i \,=\, B(\pi(i),\alpha \Delta) \,\setminus\, \cup_{j=1}^{i-1} \, P_j }.

 

  • Output the random partition {{\mathcal P} = \{P_1,\ldots,P_n\}}.

Remark. Note that {\alpha} is not an arbitrary constant; it is random. Its role is analogous to the random choice of {k} in our previous example.

    To prove Theorem 2, we first need to check that the algorithm indeed outputs a partition. By definition, each {P_i} is disjoint from all earlier {P_j} with {j < i}, so the {P_i}‘s are pairwise disjoint. Next, each point {\pi(i)} is either contained in {P_i} or some earlier {P_j}, so the union of the {P_i}‘s is {X}.

    Next we should check that this partition is {\Delta}-bounded. That is also easy: since {P_i \subseteq B(\pi(i),\alpha \Delta)}, the diameter of {P_i} is at most {2 \alpha \Delta \leq \Delta}.

    The difficult step is proving (1), which we do next time. Here is some vague intuition as to why it might be true. Condition (1) asks us to show that, with good probability, the ball {B(x,r)} is not chopped into pieces by the partition. But the parts of the partition are themselves balls of radius at least {\Delta/4} (minus all previous parts of the partition). So as long as {r \ll \Delta/4}, we might be optimistic that the ball {B(x,r)} does not get chopped up.

2.4. Example

    Let {X} be the following eight points in the plane, with the usual Euclidean distance. Let {\pi} be the identity ordering: {\pi(i)=i \:~\forall i}. The algorithm generates the following partition.

 


Lecture 21: Property Testing

1. Overview of Property Testing

Property testing is a research area in theoretical computer science that has seen a lot of activity over the past 15 years or so. A one-sentence description of this area’s goal is: design algorithms that, given a very large object, examine the object in very few places and decide whether it either has a certain property, or is “far” from having that property.

As a simple example, let the object be the set of all people in Canada, and let the property be “does a majority like Stephen Harper?”. The first algorithm that comes to mind for this problem is: sample a few people at random, ask them if they like Stephen Harper, then return the majority vote of the sample.

Does this algorithm have a good probability of deciding whether a majority of people likes Stephen Harper? The answer is no. Suppose there are 999,999 people in Canada. The sampling algorithm cannot reliably distinguish between the case that exactly 500,000 people like Stephen Harper (a majority) and exactly 499,999 people like Stephen Harper (not a majority).

But our algorithm is in fact a good property testing algorithm for this problem! The reason is that we have not yet discussed the important word “far”. That word allows us to ignore these scenarios that are right “on the boundary” of having the desired property.

In our example, we could formalize the word “far” as meaning “more than 5% of the population needs to change their vote in order for a majority to like Stephen Harper”. Equivalently, our algorithm only needs to distinguish two scenarios:

  • At least 500,000 people like Stephen Harper, or
  • Less than 450,000 people like Stephen Harper.

Our sampling algorithm has good probability of distinguishing these scenarios. The number of samples needed depends only on the fraction 5% and the desired probability of success, and does not depend on the size of the population. (This is easy to prove using Chernoff bounds.)

This example probably doesn’t excite you very much because statisticians have studied these sorts of polling problems for centuries, and we know all about confidence intervals, etc. The field of property testing does not focus on these simple polling problems, but instead tends to look at problems of a more algebraic or combinatorial flavour. Some central problems in property testing are:

  • Given a function {f : \{0,1\}^n \rightarrow \{0,1\}}, decide if it is either a linear function or far from a linear function. There is a property testing algorithm for this problem that uses only a constant number of queries. This algorithm is a key ingredient in proving the infamous PCP theorem.
  • Given a graph {G=(V,E)}, decide if {G} has no triangles, or is far from having no triangles. There is a property testing algorithm for this problem that uses only a constant number of queries, however the constant is a horrifying tower function (a.k.a., a tetration).

2. Testing Sortedness

Let {A_1,\ldots,A_n} be a list of {n} distinct numbers. The property we would like to test is “is the list sorted?”. Our definition of “far” from sorted will be “we need to remove at least {\epsilon n} numbers for the remaining list to be sorted”. In other words, we wish to distinguish the following two scenarios:

  • {A_1 < A_2 < \cdots < A_n}.
  • The longest sorted subsequence has length less than {(1-\epsilon)n}.

We will give a randomized algorithm which can distinguish these cases with constant probability by examining only {O(\log(n)/\epsilon)} entries of the list.

A natural algorithm to try is: pick {i} at random and test whether {A_i < A_{i+1}}. Consider the input

\displaystyle 1, 2, \ldots, n/2-1, n/2, 1, 2, \ldots, n/2-1, n/2.

We need to remove at least half of the list to make it sorted. But that algorithm will only find an unsorted pair if it picks {i=n/2}, which happens with low probability.

So instead we consider the following more sophisticated algorithm.

  • For {t=1,\ldots,1/\epsilon}
    • Pick {i \in \{1,\ldots,n\}} uniformly at random.
    • Search for the value {A_i} in the list using binary search.
    • If we discover some unsorted elements during the binary search, return “No”.
    • If the binary search does not end up at position {i}, return “No”.
  • Return “Yes”.
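Here is a minimal Python sketch of this tester. It implements the test used in the analysis below: an iteration rejects unless the binary search for {A_i} lands exactly at position {i}. The details are illustrative, not a definitive implementation.

import random

def test_sorted(A, eps):
    """Tester sketch: returns 'No' only on evidence that A is far from sorted.
    Assumes the entries of A are distinct."""
    n = len(A)
    for _ in range(max(1, int(1 / eps))):
        i = random.randrange(n)
        lo, hi = 0, n - 1
        landed = False
        while lo <= hi:                    # binary search for the value A[i]
            mid = (lo + hi) // 2
            if A[mid] == A[i]:
                landed = (mid == i)
                break
            elif A[mid] < A[i]:
                lo = mid + 1
            else:
                hi = mid - 1
        if not landed:                     # index i is not "good"
            return "No"
    return "Yes"

print(test_sorted(list(range(100)), 0.1))                         # sorted: always "Yes"
print(test_sorted(list(range(50, 100)) + list(range(50)), 0.1))   # far from sorted: likely "No"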

    Theorem 1 This algorithm has constant probability of correctly distinguishing between lists that are sorted and those that are far from sorted.

    Proof: If the given list is sorted, obviously every binary search will work correctly and the algorithm will return “Yes”.

    So let us suppose that the list is far from sorted. Say that an index {i} is “good” if a binary search for {A_i} correctly terminates at position {i} (without discovering any unsorted elements). We claim that the good indices form a sorted subsequence. To see this, consider any two good indices {i} and {j} with {i<j}. Let {k} be the last common index of their binary search paths. Then we must have {A_i \leq A_k \leq A_j}, which implies that {A_i < A_j} by distinctness.

    Since the list is far from sorted, there are fewer than {(1-\epsilon)n} good indices. So the probability of picking a good index in every iteration is at most {(1-\epsilon)^{1/\epsilon} \leq e^{-\epsilon (1/\epsilon)} = 1/e}. \Box

3. Estimating Maximal Matching Size

    Our next example deviates from the property testing model described above. Instead of testing whether an object has a property or not, we will estimate some real-valued statistic of that object.

    Our example is estimating the size of a maximal matching in a bounded-degree graph. There is an important distinction here. A maximum matching is a matching in the graph such that no other matching contains more edges. A maximal matching is a matching for which it is impossible to get a bigger matching by adding a single edge. Whereas all maximum matchings have the same size, it is not necessarily true that all maximal matchings have the same size.

    Here is a greedy algorithm for generating a maximal matching in a graph {G=(V,E)} with {n = |V|}, {m=|E|} and maximum degree {d}.

    • Let {M = \emptyset}.
    • Let {\pi : \{1,\ldots,m\} \rightarrow E} be an arbitrary bijection.
    • For {i=1,\ldots,m}
    • If both endpoints of edge {\pi(i)} are uncovered, add {\pi(i)} to {M}.

    Fact 2 The resulting matching {M} is maximal. Furthermore {|M| \geq n/2d}.

    Theorem 3 Let {G} be a graph with maximum degree {d}. There is an algorithm to estimate the size of the maximal matching to within a multiplicative factor of {1+\epsilon} in time {2^{O(d)} / \epsilon^2}.

    Estimating given an oracle. Suppose we have an oracle which, assuming some fixed maximal matching {M}, can answer queries about whether a given edge {e} is covered by that matching {M}. We will use this oracle to estimate {|M|}.

    The algorithm just involves simple sampling.

    • Fix a maximal matching {M}.
    • Set {k=9 d^2/\epsilon^2}.
    • For {i=1,\ldots,k}
    • Pick a random edge {e}.
    • Let {X_i=1} if {e} is in {M}, otherwise {X_i=0}.

    Let {X = \sum_{i=1}^k X_i} and {\mu = {\mathrm E}[X]}. The actual number of edges in {M} is

    \displaystyle |M| ~=~ {\mathrm E}[X_i] m ~=~ {\textstyle \frac{m}{k} \mu}.

    Our estimate for the size of {M} is {\frac{m}{k} X}. Since {m \leq nd/2}, Fact 2 shows that {{\mathrm E}[X_i] \geq 1/d^2}, which implies that {\mu \geq k/d^2}. Therefore

    \displaystyle \begin{array}{rcl} {\textstyle {\mathrm{Pr}}\Big[~ (1-\epsilon) \frac{m}{k} \mu \leq \frac{m}{k} X \leq (1+\epsilon) \frac{m}{k} \mu ~\Big] } &=& {\mathrm{Pr}}\Big[~ (1-\epsilon) \mu \leq X \leq (1+\epsilon) \mu ~\Big] \\ &\geq& 1-2\exp(-\epsilon^2 \mu / 3) ~\geq~ 0.9, \end{array}

    by a Chernoff bound.

    Implementing the oracle. We will not actually implement the oracle for an arbitrary {M}; we will require {M} to be generated by the greedy algorithm above with a random ordering {\pi}.

    Associate independent random numbers {r_e \in [0,1]} with each edge {e \in E}. They are all distinct with probability {1}, and so they induce a uniformly random ordering on the edges. We let {M} be the maximal matching created by the greedy algorithm above, using this ordering.

    Our algorithm for implementing the oracle is as follows.

    • Given an edge {e \in E}:
      • For every edge {e' \in E} sharing an endpoint with {e}:
        • If {r_{e'} < r_e}, recursively test if {e'} is in the matching. If so, return “No”.
      • Return “Yes”.

    Clearly this algorithm is always correct: its logic for deciding whether {e \in M} is equivalent to the greedy algorithm. The only question is: what is the running time of the oracle?
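The following Python sketch (illustrative; it adds memoization, which does not change the oracle's answers) implements both the recursive oracle and the sampling estimator described above.

import random

def make_oracle(edges, seed=0):
    """Oracle sketch: membership of an edge in the greedy maximal matching
    induced by independent random priorities r_e (lower value = earlier in the order)."""
    rng = random.Random(seed)
    r = {e: rng.random() for e in edges}
    nbr = {e: [f for f in edges if f != e and set(e) & set(f)] for e in edges}
    memo = {}
    def in_matching(e):
        if e not in memo:
            # e enters M iff no neighbouring edge earlier in the order entered M
            memo[e] = all(not in_matching(f) for f in nbr[e] if r[f] < r[e])
        return memo[e]
    return in_matching

def estimate_matching(edges, eps, d):
    """Sample k = 9 d^2 / eps^2 edges and scale the fraction the oracle accepts."""
    oracle = make_oracle(edges)
    k = int(9 * d * d / eps ** 2)
    hits = sum(oracle(random.choice(edges)) for _ in range(k))
    return len(edges) * hits / k

# path on 5 vertices, maximum degree d = 2; every maximal matching has size 2
print(estimate_matching([(0, 1), (1, 2), (2, 3), (3, 4)], 0.5, 2))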

    Consider an edge {f} reachable by a path of length {k} from the initial edge {e}. The recursion only considers the edge {f} if the edges on this path have the {r_e} values in decreasing order. The probability of that event is {1/k!}. The number of edges reachable by a path of length {k} is less than {(2d)^k}. So the expected number of edges explored is less than {\sum_{k=0}^\infty (2d)^k / k! = e^{2d}}. So the expected time required by an oracle call is {O(d) \cdot e^{2d} = 2^{O(d)}}.

    Since the estimation algorithm calls the oracle {k = 9d^2/\epsilon^2} times, the expected total running time is {2^{O(d)} / \epsilon^2}.

     

 


Lecture 20: Expanders, Superconcentrators

Today we will make a brief excursion into the fascinating world of expander graphs. It is hard to overstate the importance of these objects to theoretical computer science. They play an important role in some astonishing algorithms, the PCP theorem, derandomization, circuit complexity, etc. Today we will show how expanders are used in designing special routing networks called superconcentrators.

1. Expanders

The easiest one-sentence description of an expander graph is: a graph that looks in many important ways like a random graph. So, although expander graphs are often viewed as a difficult field of study, in many ways they are very simple. In particular, here’s an easy one-sentence construction of an expander graph: a random graph is very likely to be a good expander graph. Proving that sentence is not difficult, and we will do so today.

Beyond the basics, the study of expanders becomes quite non-trivial. For example, in many scenarios we need a deterministic algorithm to construct an expander; such constructions often involve very sophisticated mathematics. Alternatively, we might like to give very precise guarantees on the quality of randomly generated expanders. That also gets very difficult.

But today we will be content with a simple, crude, randomized construction. This suffices for many amazing applications.

The Definition. Let {G=(L,R,E)} be a bipartite graph, where {L} is the set of left-vertices, {R} is the set of right-vertices, and {E} is the set of edges (each of which have one endpoint in {L} and one in {R}). For any subset {S \subseteq L}, let

\displaystyle \Gamma(S) ~=~ \{~ v \in R \::\: \{u,v\} \in E ~\mathrm{for~some}~ u \in S ~\}.

There are several different and interrelated definitions of expanders. We will say that {G} is {(n,m,d)}-expander if {|L|=n}, {|R|=m}, every vertex in {L} has degree {d}, and

\displaystyle |\Gamma(S)| ~\geq~ |S| \qquad\forall S \subseteq L ~\mathrm{s.t.}~ |S| \leq \frac{n}{2}. \ \ \ \ \ (1)

 

Theorem 1 For some sufficiently large constant {d}, sufficiently large {n} and {m \geq \frac{7}{8} n}, there exists an {(n,m,d)}-expander.

Remark. The theorem is trivial when {m \geq n} but it is interesting even in the case {m=0.999n}.

Proof: We generate the graph as follows. First we generate {n} vertices to form the set {L}, and {m} vertices to form the set {R}. Next, for every vertex {u \in L}, we randomly choose exactly {d} vertices from {R} (with repetition), and we add edges from {u} to all of those vertices. (So technically the resulting graph will be a multigraph, but this is not a problem.)

Let {S \subseteq L} be an arbitrary subset with {s := |S| \leq \frac{n}{2}}. We need to show that {|\Gamma(S)| \geq |S|}. We do this in a very naive way. We simply consider every set {T \subseteq R} with {|T| < |S|} and show that it is unlikely that {\Gamma(S) \subseteq T}. That probability is easy to analyze:

\displaystyle {\mathrm{Pr}}[ \Gamma(S) \subseteq T ] ~=~ \Big( \frac{|T|}{m} \Big)^{|S|d}.

Observe that we don’t need to consider all sets {T} with {|T| < |S|}; it suffices to consider those with {|T|=|S|-1}. (In fact, we will simplify our notation by considering the negligibly harder case of {|T|=|S|}.) As long as {\Gamma(S)} is not contained in any such {T}, we have {|\Gamma(S)| \geq |S|} as required.

The remainder of the proof simply does a union bound over all such {S} and {T}.

\displaystyle \begin{array}{rcl} {\mathrm{Pr}}[ G \mathrm{~is~not~a~} (n,m,d)-\mathrm{expander} ] &\leq& \sum_{\substack{S \subseteq L \\ |S| \leq n/2}} ~ \sum_{\substack{T \subseteq R \\ |T|=|S|}} {\mathrm{Pr}}[ \Gamma(S) \subseteq T ] \\ &\leq& \sum_{s \leq n/2} \binom{n}{s} \binom{m}{s} (s/m)^{sd} \\ &\leq& \sum_{s \leq n/2} \Big( \frac{ne}{s} \Big)^s \Big(\frac{me}{s}\Big)^s \Big(\frac{s}{m}\Big)^{sd} \\ &\leq& \sum_{s \leq n/2} \Bigg( \frac{ne}{m} \frac{me}{m} \frac{s^{d-2}}{m^{d-2}}\Bigg)^{s} \\ &\leq& \sum_{s \leq n/2} \Bigg( \frac{ne^2}{(7/8)n} \frac{(n/2)^{d-2}}{\big((7/8)n\big)^{d-2}}\Bigg)^{s} \\ &\leq& \sum_{s \leq n/2} \Bigg( \frac{8e^2}{7} (4/7)^{d-2} \Bigg)^{s} \\ &\leq& \sum_{s=1}^{\infty} \Bigg( 8.5 \cdot 0.58^{d-2} \Bigg)^{s}. \end{array}

For {d \geq 9}, the base of the exponent is less than {0.19}, and so the infinite series adds up to less than {\frac{0.19}{1-0.19} < 0.25}.

So the probability that the random graph fails to satisfy (1) is less than {1/4}. In particular, the desired expander exists. \Box
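The construction in this proof is easy to run. Here is a small Python sketch (illustrative only; the expansion check is brute force and exponential, so it is usable only for tiny {n}).

import random
from itertools import combinations

def random_bipartite(n, m, d):
    """Each left vertex independently picks d right neighbours, with repetition."""
    return [{random.randrange(m) for _ in range(d)} for _ in range(n)]

def satisfies_expansion(nbrs, n):
    """Brute-force check of |Gamma(S)| >= |S| for all S with |S| <= n/2 (tiny n only)."""
    for s in range(1, n // 2 + 1):
        for S in combinations(range(n), s):
            if len(set().union(*(nbrs[u] for u in S))) < s:
                return False
    return True

# small sanity check; the theorem's regime is large n, m >= 7n/8, constant d
g = random_bipartite(10, 9, 5)
print(satisfies_expansion(g, 10))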

2. Superconcentrators

Let {G=(V,E)} be a graph. There are two disjoint subsets {I \subseteq V} and {O \subseteq V} with {|I|=|O|=n}. The vertices in {I} are called the inputs and the vertices in {O} are called the outputs. The graph {G} is called a superconcentrator if, for every {k \leq n} and every {S \subseteq I} and {T \subseteq O} with {|S|=|T|=k}, there are {k} vertex-disjoint paths from {S} to {T}.

It is important to note that we do not insist that the path starting at some {s \in S} has to arrive at a particular {t \in T}; it is acceptable for it to arrive at an arbitrary {t \in T}. In other words, the paths from {S} to {T} induce some bijective mapping {\pi : S \rightarrow T} but we do not care what that mapping is.

Why might we be interested in such graphs? Suppose we want to design a telephone network with {n} input lines and {n} output lines such that any set of {k} input lines can be connected to any set of {k} output lines. To minimize the cost of this network, we would like the number of wires and routing points inside the network to be as small as possible. It is easy to design a superconcentrator with {O(n^2)} edges: simply add an edge between every vertex in {I} and every vertex in {O}. Can one do better than that? It was conjectured by Valiant that {\Omega(n \log n)} edges are necessary. His conjecture came from an attempt to prove a lower bound in circuit complexity. But a short while later he realized that his own conjecture was false:

Theorem 2 For any {n}, there are superconcentrators with {n} inputs, {n} outputs, and {O(n)} edges.

The proof uses expanders, recursion and Hall’s theorem in a very clever way. First we observe the following interesting property of expanders: they have matchings covering any small set of vertices.

Claim 3 Let {H=(L,R,E)} be an {(n,7n/8,9)}-expander. For every set {S \subseteq L} of size {|S| \leq n/2}, there is a matching in {H} covering {S}. In other words, there exists an injective map {\pi_S : S \rightarrow R} such that {\{s,\pi_S(s)\} \in E} for every {s \in S}.

Proof: Hall’s marriage theorem says that {H} has a matching covering {S} if and only if every {S' \subseteq S} has {|\Gamma(S')| \geq |S'|}. That condition holds by (1), since {H} is an expander. \Box

The Main Idea. The construction proceeds as follows.

  • First we create the input vertices {I} and output vertices {O}, where {|I|=|O|=n}.
  • Next we create some internal vertices {I'} and {O'} with {|I'|=|O'|=7n/8}.
  • We directly connect {I} to {O} using {n} edges that form a matching. (Say, the {i}th vertex in {I} is connected to the {i}th vertex in {O}, for every {i}.)
  • Let {H} be a {(n,7n/8,9)}-expander. We make one copy of {H} with left-vertices {I} and right-vertices {I'}, and an additional copy with left-vertices {O} and right-vertices {O'}.
  • Finally, we create a smaller superconcentrator with inputs {I'} and outputs {O'}.

The Size. First let us analyze the total number of edges in this construction. We use a simple recurrence relation: let {f(n)} be the number of edges used in a superconcentrator on {n} inputs and outputs. The total number of edges used in the construction is:

  • {n} edges for the matching from {I} to {O}.
  • {9n} edges for each of the two expanders.
  • {f(7n/8)} edges for the smaller superconcentrator.

Thus {f(n) = f(7n/8) + 19n}. For the base case, we can take {f(n) = n^2} whenever {n} is a sufficiently small constant. This recurrence has the solution {f(n) = O(n)}.
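As a sanity check on the recurrence, the following few lines of Python (with an arbitrary illustrative base case) unroll {f(n) = f(7n/8) + 19n} and confirm that {f(n)} grows linearly, with constant roughly {19 \cdot 8 = 152}.

def f(n):
    """Edge count of the recursive construction: f(n) = f(7n/8) + 19n,
    with a quadratic base case for small n."""
    if n <= 64:
        return n * n
    return f(7 * n // 8) + 19 * n

# geometric series: f(n) is about 19n * (1 + 7/8 + (7/8)^2 + ...) = 152n
print(f(10 ** 6) / 10 ** 6)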

The Superconcentrator Property. Consider any {S \subseteq I} and {T \subseteq O} with {|S|=|T|=k}. We must find vertex-disjoint paths from {S} to {T}.

Case 1: {k \leq n/2}. By Claim 3, the first expander contains a matching covering {S}. Let {S' \subseteq I'} be the other endpoints of that matching. Similarly, the second expander contains a matching covering {T}. Let {T' \subseteq O'} be the other endpoints of that matching. Note that {|S'|=|T'|=k}. By induction, the smaller superconcentrator contains {k} vertex-disjoint paths between {S'} and {T'}. Combining those paths with the edges of the matchings gives the desired paths between {S} and {T}.

Case 2: {k > n/2}. Claim 3 only provides a matching covering sets of size at most {n/2}, so the previous argument cannot handle the case {k > n/2}. The solution is to use the edges that directly connect {I} and {O}.

By the pigeonhole principle, there are at least {k-n/2} vertices in {S} that are directly connected by the matching to vertices in {T}. The remaining vertices (of which there are at most {n/2}) are handled by the argument of Case 1.


Lecture 19: The Lovasz Local Lemma

Today’s lecture introduces a completely new topic: the Lovász Local Lemma (LLL). This is an important method for analyzing events that are not independent, but have some restricted sort of dependencies. It is not as widely applicable as many of the other techniques we have seen so far, but from time to time one encounters scenarios in which the LLL is the only technique that works.

1. The Lovász Local Lemma

Suppose {{\mathcal E}_1,\ldots,{\mathcal E}_n} are a collection of “bad” events. We would like to show that there is positive probability that none of them occur. If the events are mutually independent then this is simple:

\displaystyle {\mathrm{Pr}}[\: \wedge_{i=1}^n \overline{{\mathcal E}_i} \:] ~=~ \prod_{i=1}^n {\mathrm{Pr}}[ \overline{{\mathcal E}_i} ] ~>~ 0

(assuming that {{\mathrm{Pr}}[{\mathcal E}_i]<1} for every {i}). The LLL is a method for proving that {{\mathrm{Pr}}[ \wedge_{i=1}^n \overline{{\mathcal E}_i} ] > 0} when the {{\mathcal E}_i}‘s are not mutually independent, but they can have some sort of limited dependencies.

Formally, we say that an event {{\mathcal E}_j} does not depend on the events {\{ {\mathcal E}_i : i \in I \}} if

\displaystyle {\mathrm{Pr}}[ {\mathcal E}_j ] ~=~ {\mathrm{Pr}}[\: {\mathcal E}_j \:|\: \wedge_{i \in I'} {\mathcal E}_i \:] \qquad \forall I' \subseteq I.

So, regardless of whether some of the events in {I} occur, the probability of {{\mathcal E}_j} occurring is unaffected.

Theorem 1 (The “Symmetric” LLL) Let {{\mathcal E}_1,\ldots,{\mathcal E}_n} be events with {{\mathrm{Pr}}[{\mathcal E}_i] \leq p} for all {i}. Suppose that every event {{\mathcal E}_i} does not depend on at least {n-d} other events. If {pd \leq 1/e} then {{\mathrm{Pr}}[ \wedge_{i=1}^n \overline{{\mathcal E}_i} ] > 0}.

We will not prove this theorem. Instead, we will illustrate the LLL by considering a concrete application of it in showing satisfiability of {k}-CNF Boolean formulas. Recall that a {k}-CNF formula is a Boolean formula, involving any finite number of variables, where the formula is a conjunction (“and”) of any number of clauses, each of which is a disjunction (“or”) of exactly {k} distinct literals (a variable or its negation).

For example, here is a {3}-CNF formula with three variables and eight clauses.

\displaystyle \begin{array}{rcl} \phi(a,b,c) &=& (a \vee b \vee c) \wedge (a \vee b \vee \bar{c}) \wedge (a \vee \bar{b} \vee c) \wedge (a \vee \bar{b} \vee \bar{c}) \wedge \\ && (\bar{a} \vee b \vee c) \wedge (\bar{a} \vee b \vee \bar{c}) \wedge (\bar{a} \vee \bar{b} \vee c) \wedge (\bar{a} \vee \bar{b} \vee \bar{c}) \end{array}

This formula is obviously unsatisfiable. One can easily generalize this construction to get an unsatisfiable {k}-CNF formula with {k} variables and {2^k} clauses. Our next theorem says: the reason this formula is unsatisfiable is that we allowed each variable to appear in too many clauses.

Theorem 2 There is a universal constant {d \geq 1} such that the following is true. Let {\phi} be a {k}-CNF formula where each variable appears in at most {T := 2^{k-d} / k} clauses. Then {\phi} is satisfiable. Moreover, there is a randomized, polynomial time algorithm to find a satisfying assignment.

The theorem is stronger when {d} is small. The proof that we will present can be optimized to get {d = 3}. By applying the full-blown LLL one can achieve {d = \log_2(e) \approx 1.44}.

Let {n} be the number of variables and {m} be the number of clauses in {\phi}. Each clause contains {k} variables, each of which can appear in only {T-1} other clauses. So each clause shares a variable with fewer than {R := kT = 2^{k-d}} other clauses.

The algorithm proving the theorem is perhaps the most natural algorithm that one could imagine. However, it took more than 30 years after the introduction of the LLL before this algorithm was rigorously analyzed.

Solve({\phi})

  • Set each variable in {\phi} to either {0} or {1} randomly and independently.
  • While there is an unsatisfied clause {C}
  • {\qquad~} Fix({C})

Fix({C})

  • Set each variable in {C} to either {0} or {1} randomly and independently.
  • While there is an unsatisfied clause {D} sharing some variable with {C} (possibly {D=C})
  • {\qquad~} Fix({D})
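To make the recursion concrete, here is a minimal Python sketch of Solve and Fix. The clause encoding and helper names are our own choices (DIMACS-style: a clause is a list of nonzero integers, where literal {v} means variable {v} is true and {-v} means it is false); a serious implementation would replace the recursion and linear scans with explicit stacks and variable-to-clause indices.

import random

def satisfied(clause, assignment):
    # A clause (a list of nonzero ints) is satisfied if some literal is true.
    return any((lit > 0) == assignment[abs(lit)] for lit in clause)

def solve(clauses, num_vars):
    # Step 1: a uniformly random assignment to all variables.
    assignment = {v: random.random() < 0.5 for v in range(1, num_vars + 1)}
    while True:
        bad = next((c for c in clauses if not satisfied(c, assignment)), None)
        if bad is None:
            return assignment          # every clause is satisfied
        fix(bad, clauses, assignment)  # a top-level call to Fix

def fix(clause, clauses, assignment):
    # Resample the k variables of the violated clause uniformly at random.
    for lit in clause:
        assignment[abs(lit)] = random.random() < 0.5
    vars_c = {abs(lit) for lit in clause}
    while True:
        bad = next((d for d in clauses
                    if vars_c & {abs(lit) for lit in d}
                    and not satisfied(d, assignment)), None)
        if bad is None:
            return                     # no clause sharing a variable is violated
        fix(bad, clauses, assignment)  # a recursive call to Fix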

Claim 3 Suppose every call to Fix terminates. Then Solve calls Fix at most {m} times, and terminates with a satisfying assignment.

Proof: For any call to Fix, we claim that every clause that was satisfied before the call is still satisfied after the call completes. This follows by induction, starting at the deepest level of recursion. So, for every call from Solve to Fix({C}) the number of satisfied clauses increases by at least one, since {C} must be satisfied when Fix({C}) terminates. \Box

So it remains to show that, with high probability, every call to Fix terminates.

Theorem 4 Let {s = m( \log m + c) + \log \frac{1}{\delta}} where {c} is a sufficiently large constant. Then the probability that the algorithm makes more than {s} calls to {\textsc{Fix}} (including both the top-level and recursive calls) is at most {\delta}.

The proof proceeds by considering the interactions between two agents: the “CPU” and the “Debugger”. The CPU runs the algorithm, periodically sending messages to the Debugger (we describe these messages in more detail below). However, if Fix gets called more than {s} times the CPU interrupts the execution and halts the algorithm.

The CPU needs {n} bits of randomness to generate the initial assignment in Solve, and needs {k} bits to regenerate variables in each call to Fix. Since the CPU will not execute Fix more than {s} times, it might as well generate all its random bits at the very start of the algorithm. So the first step performed by the CPU is to generate a random bitstring {x} of length {n+sk} to provide all the randomness used in executing the algorithm.

The messages sent from the CPU to the Debugger are as follows.

  • Every time the CPU runs Fix({C}), he sends a message containing the identity of the clause {C}, and an extra bit indicating whether this is a top-level Fix (i.e., a call from Solve) or a recursive Fix.
  • Every time Fix({C}) finishes the CPU sends a message stating “recursive call finished”.
  • If Fix gets called {s} times, the CPU sends a message to the Debugger containing the current {\{0,1\}} assignment of all {n} variables.

Because the Debugger is notified when every call to Fix starts or finishes, he always knows which clause is currently being processed by Fix. A crucial detail is to figure out how many bits of communication are required to send these messages.

  • For a top-level Fix, {\log m + O(1)} bits suffice because there are only {m} clauses in {\phi}.
  • For a recursive Fix, {\log R + O(1)} bits suffice because the Debugger already knows what clause is currently being fixed, and that clause shares variables with only {R} other clauses, so only {R} possible clauses could be passed to the next call to Fix.
  • When each call to Fix({C}) finishes, the corresponding message takes {O(1)} bits.
  • When Fix gets called {s} times, the corresponding message takes {n+O(1)} bits.

The main point of the proof is to show that, if Fix gets called {s} times, then these messages reveal the random string {x} to the Debugger.

Since each clause is a disjunction (an “or” of {k} literals), there is exactly one assignment to those variables that does not satisfy the clause. So, whenever the CPU tells the Debugger that he is calling Fix({C}), the Debugger knows exactly what the current assignment to {C} is. So, starting from the assignment that the Debugger received in the final message, he can work backwards and figure out what the previous assignment was before calling Fix. Repeating this process, he can figure out how the variables were set in each call to Fix, and also what the initial assignment was. Thus the Debugger can reconstruct the random string {x}.

The total number of bits sent by the CPU is:

  • {m (\log m+O(1))} bits for all the messages sent when Solve calls Fix.
  • {s \cdot (\log R + O(1))} bits for all the messages sent in the {\leq s} recursive calls.
  • {n+O(1)} bits to send the final assignment.

So {x} has been compressed from {n+sk} bits to

\displaystyle m (\log m +O(1)) ~+~ s(\log R + O(1)) ~+~ n + O(1) ~~~\mathrm{bits}.

This is an overall shrinking of

\displaystyle \begin{array}{rcl} &&\Big( n + sk \Big) ~-~ \Big( m(\log m +O(1)) ~+~ s(\log R + O(1)) ~+~ n + O(1) \Big) \\ &=& s(k - \log R - O(1)) ~-~ m(\log m +O(1)) ~-~ O(1) \\ &=& s(d-O(1)) ~-~ m(\log m +O(1)) \qquad(\mathrm{since}~R=2^{k-d}) \\ &=& \big(m (\log m + c) + {\textstyle \log \frac{1}{\delta} } \big)(d-O(1)) ~-~ m(\log m +O(1)) \qquad(\mathrm{definition~of}~ s)\\ &\geq& \log \frac{1}{\delta} \end{array}

bits, assuming that {c} and {d} are sufficiently big constants.

We have argued that, if Fix gets called {s} times, then {x} can be compressed by {\log \frac{1}{\delta}} bits. The next claim argues that this happens with probability at most {\delta}.

Claim 5 The probability that {x} can be compressed by {\log \frac{1}{\delta}} bits is at most {\delta}.

Proof: Consider any deterministic, decodable (hence injective) algorithm for encoding all bit strings of length {\ell} into bit strings of arbitrary length. The number of bit strings that are encoded into {\ell-b} bits is at most {2^{\ell-b}}. So, a random bit string has probability at most {2^{-b}} of being encoded into {\ell-b} bits. (One can view this as a simple special case of the Kraft inequality.) \Box
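The same counting also covers variable-length codewords, at the cost of a single bit: writing {\mathrm{enc}} for the (injective) encoding map, which is our notation rather than the lecture’s,

\displaystyle \#\{\: x \in \{0,1\}^\ell \::\: |\mathrm{enc}(x)| \leq \ell-b \:\} ~\leq~ \sum_{j=0}^{\ell-b} 2^j ~<~ 2^{\ell-b+1},

so even under this weaker notion of “compressed by {b} bits” the probability is less than {2^{1-b}}, which only costs one extra bit in {s}.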


Lecture 18: Perfect Hashing

In the last lecture we defined the notion of pairwise independence, and saw some ways to use this concept in reducing the randomness of a randomized algorithm. Today we will apply these ideas to designing hash functions and “perfect” hash tables.

1. Universal Hash Families

Last time we showed how to create a joint distribution on random variables {\{\: Y_i \::\: i \in U \:\}} such that

  • each {Y_i} is uniformly distributed on some finite set {T}, and
  • the joint distribution on the {Y_i}‘s is pairwise independent.

Let {S} be the sample space underlying the {Y_i}‘s. Drawing a sample {s} from {S} determines values for all the {Y_i}‘s. We can think of this another way: after choosing {s \in S}, then associated with every {i \in U} there is a value in {T} (namely, the value of {Y_i}).

Formally, for every {s \in S} there is a function {h_s : U \rightarrow T} given by {h_s(i) = Y_i}. (The random variable {Y_i} is implicitly a function of {s}.) This family of functions {\{\: h_s \::\: s \in S \:\}} satisfies the following important property

\displaystyle  {\mathrm{Pr}}_{s \in S}[\: h_s(x)=\alpha \:\wedge\: h_s(y)=\beta \:] ~=~ \frac{1}{|T|^2} \qquad\forall x \neq y \in U, ~~ \forall \alpha, \beta \in T.

A family of functions satisfying this property is called a {2}-universal hash family. So we can restate our construction from last time as follows.

Theorem 1 For any {n}, there exists a {2}-universal hash family with {|S|=2^{2n}}, {|U|=2^n} and {|T|=2^n}. Sampling a function from this family requires {O(n)} mutually independent, uniform random bits. Consequently, each hash function can be represented using {O(n)} bits of space. Evaluating any hash function in this family at any point of {U} takes only a constant number of arithmetic operations involving {O(n)}-bit integers.

One can easily modify this construction to handle any {U} and {T} with {|U| \leq 2^n}, {|T| \leq 2^n} and {|T|} a power of two.

2. Perfect Hashing

Let {U} be a “universe” of items. We will assume that our computational model has words of {\Theta(\log |U|)} bits and that any standard arithmetic operation involving {\Theta(\log |U|)}-bit words takes {O(1)} time.

Suppose we wish to store a given subset {N \subseteq U} as a “dictionary”, so that we can efficiently test whether any element {x \in U} belongs to {N}. There are many well-known solutions to this problem. For example, a balanced binary tree allows us to store the dictionary in {O(|N|)} words of space, while search operations take {O(\log |N|)} time in the worst case. (Insertion and deletion of items also take {O(\log |N|)} time.)

We will present an alternative dictionary design based on hashing. This is a “static dictionary”, which does not allow insertion or deletion of items.

Theorem 2 There is a randomized algorithm that, given a set {N \subseteq U}, stores the set {N} in a data structure using {O(|N|)} words of space. There is a deterministic algorithm that, given any item {x \in U}, can search for {x} in that data structure in {O(1)} time in the worst case.

Let {T} be any set with {T \subseteq U} and {|T|} a power of two. Theorem 1 gives us a {2}-universal hash family mapping {U} to {T} for which any hash function can be evaluated in {O(1)} time.

Using this family, consider the following simple design for our dictionary. First create a table of size {|T|} and randomly choose a function {h_s} from the hash family. Can we simply store each item {x \in N} in the table in location {h_s(x)}? This would certainly be very efficient, because finding {x} in the table only requires evaluating {h_s(x)}, which takes only {O(1)} time. But the trouble of course is that there might be collisions: there might be two items {x, y \in N} such that {h_s(x) = h_s(y)}, meaning that {x} and {y} want to lie in the same location of the table.

Most likely you have discussed hash table collisions in an undergraduate class on data structures. One obvious way to try to avoid collisions is to take {|T|} to be larger; the problem is that this increases the storage requirements, and we are aiming for a data structure that uses {O(|N|)} words. Alternatively one can allow collisions to happen and use some other technique to deal with them. For example one could use chaining, in which all items hashing to the same location of the table are stored in a linked list, or linear probing, in which items hashing to an occupied location are relocated to nearby locations. Unfortunately neither of those techniques solves our problem: when the table size is {O(|N|)} both of those techniques will typically require super-constant time to search the table in the worst case.

The solution we present today is based on a simple two-level hierarchical hashing scheme. At the top-level, a single hash function is used to assign each item {x \in N} to a “bucket” in the top-level table. Then, for every bucket of the top-level table we create a new second-level hash table to store the items that hashed to that bucket.

Collisions. Using the pairwise independence property, we can easily determine the probability of any two items having a collision. For any unordered pair of distinct items {\{x, y\} \subset U},

\displaystyle  {\mathrm{Pr}}_{s \in S}[ h_s(x) = h_s(y) ] ~=~ \sum_{i \in T} {\mathrm{Pr}}_{s \in S}[\: h_s(x)=i \:\wedge\: h_s(y)=i \:] ~=~ \sum_{i \in T} \frac{1}{|T|^2} ~=~ \frac{1}{|T|}.

So the expected total number of collisions is

\displaystyle   {\mathrm E}[\#~\mathrm{collisions}] ~=~ \sum_{\{x,y\} \subset N ,\: x \neq y} {\mathrm{Pr}}_{s \in S}[ h_s(x) = h_s(y) ] ~=~ \binom{|N|}{2} / |T|. \ \ \ \ \ (1)

We would like to carefully choose the size of {T} in order to have few collisions. A natural choice is to let {|T|} be {|N|^2} (rounded up to a power of two), so

\displaystyle   {\mathrm E}[\#~\mathrm{collisions}] \leq 1/2 \qquad\implies\qquad {\mathrm{Pr}}[\mathrm{no~collisions}] \geq 1/2, \ \ \ \ \ (2)

by Markov’s inequality. The problem is that storing this {T} would require {\Theta(|N|^2)} space.

Our design. Instead, we will choose {|T|} to be {|N|} (rounded up to a power of two). This will typically result in some collisions, but the second-level hash tables are used to deal with those collisions. The top-level hash function is denoted {h : U \rightarrow T}, and the elements of {T} are called buckets. By (1),

\displaystyle  {\mathrm E}[\#~\mathrm{collisions}] \leq |N|/2 \qquad\text{and}\qquad {\mathrm{Pr}}[\#~\mathrm{collisions} \leq |N|] \geq 1/2.

Let us condition on that event happening: the number of collisions of the top-level hash function {h} is at most {|N|}. For every bucket {i \in T}, let

\displaystyle  N_i ~=~ \{\: x \in N \::\: h(x)=i \:\}

be the set of items that are hashed to bucket {i} by the top-level hash function. Let {n_i = |N_i|}. For each {i \in T} we will sample a new second-level hash function {h_i : U \rightarrow T_i}, where {|T_i| = n_i^2} (rounded up to a power of two). With probability at least {1/2} there will be no collisions of the hash function {h_i}, as shown by (2). We will repeatedly sample {h_i} and count its number of collisions until finding a function {h_i} with no collisions.

Search operation. Every item in {N} is stored in exactly one location in these second-level tables. So to search our data structure for an item {x \in U} we follow a very simple procedure. First we compute {i = h(x)}, the top-level bucket containing item {x}. Then we compute {h_i(x)}, which gives the location of {x} in the second-level table {T_i}. This process takes {O(1)} time.
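Here is a minimal Python sketch of the whole two-level construction. All names are ours, and it swaps in the standard modular family {h_{a,b}(x) = ((ax+b) \bmod P) \bmod \mathrm{size}} for the field-based family of Theorem 1 (its collision probabilities are only approximately {1/|T|}, which still leaves the expected number of retries {O(1)}); it assumes the items are distinct nonnegative integers below {P}.

import random

P = (1 << 61) - 1  # a Mersenne prime, assumed larger than every stored item

def random_hash(size):
    # Sample h(x) = ((a*x + b) mod P) mod size from a 2-universal family.
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda x: ((a * x + b) % P) % size

def build(items):
    n = len(items)
    while True:  # top level: retry until few collisions, O(1) expected tries
        h = random_hash(n)
        buckets = [[] for _ in range(n)]
        for x in items:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 3 * n:  # sum n_i^2 <= 3|N|
            break
    second = []
    for b in buckets:
        size = max(1, len(b) ** 2)       # |T_i| = n_i^2
        while True:                      # second level: must be collision-free
            g = random_hash(size)
            table = [None] * size
            for x in b:
                if table[g(x)] is not None:
                    break                # collision: resample g
                table[g(x)] = x
            else:
                break                    # no collisions: keep this g
        second.append((g, table))
    return h, second

def member(x, h, second):
    g, table = second[h(x)]
    return table[g(x)] == x              # worst-case O(1) search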

Space requirements. The only thing left to analyze is the space requirements. The size of the top-level table {T} is at most {2|N|}, by construction. The total size of the second-level tables is

\displaystyle  \sum_{i \in T} n_i^2 ~\leq~ \sum_{i \in T} \Big( 2 \binom{n_i}{2} + n_i \Big) ~=~ |N| + 2 \sum_{i \in T} \binom{ n_i }{ 2 } ~=~ |N| + 2 \cdot \#~\mathrm{collisions} ~\leq~ 3 \cdot |N|.

We also need to write down the description of the hash functions themselves. By Theorem 1 each hash function can be represented using only a constant number of words. So the total space required is {O(|N|)} words.


Lecture 17: Chebyshev’s Inequality, Pairwise Independence

So far we have seen two concentration bounds for scalar random variables: Markov and Chernoff. For our sort of applications, these are by far the most useful. In most introductory probability courses, you are likely to see another inequality, which is Chebyshev’s inequality. Its strength lies between the Markov and Chernoff inequalities: the concentration bound we get from Chebyshev is usually better than the one from Markov but worse than the one from Chernoff. On the other hand, Chebyshev requires stronger hypotheses than Markov but weaker hypotheses than Chernoff.

1. Variance

We begin by reviewing variance, and other related notions, which should be familiar from an introductory probability course. The variance of a random variable {X} is

\displaystyle \begin{array}{rcl} {\mathrm{Var}}[X] ~=~ {\mathrm E}\Big[ \big(X-{\mathrm E}[X]\big)^2 \Big] ~=~ {\mathrm E}[ X^2 ] - {\mathrm E}[X]^2. \end{array}

The covariance between two random variables {X} and {Y} is

\displaystyle {\mathrm{Cov}}[X,Y] ~=~ {\mathrm E}\Big[ \big(X-{\mathrm E}[X]\big) \big(Y-{\mathrm E}[Y]\big) \Big] ~=~ {\mathrm E}[XY] - {\mathrm E}[X]{\mathrm E}[Y].

This gives some measure of the correlation between {X} and {Y}.

Here are some properties of variance and covariance that follow from the definitions by simple calculations.

Claim 1 If {X} and {Y} are independent then {{\mathrm{Cov}}[X,Y] = 0}.

Claim 2 {{\mathrm{Cov}}[X+Y,Z] = {\mathrm{Cov}}[X,Z] + {\mathrm{Cov}}[Y,Z]}.

More generally, induction shows

Claim 3 {{\mathrm{Cov}}[\sum_i X_i,Z] = \sum_i {\mathrm{Cov}}[X_i,Z]}.

Claim 4 {{\mathrm{Var}}[X+Y] = {\mathrm{Var}}[X] + {\mathrm{Var}}[Y] + 2 \cdot {\mathrm{Cov}}[X,Y]}.

More generally, induction shows

Claim 5 Let {X_1,\ldots,X_n} be arbitrary random variables. Then

\displaystyle {\mathrm{Var}}\Bigg[\sum_{i=1}^n X_i\Bigg] ~=~ \sum_{i=1}^n {\mathrm{Var}}[X_i] + 2 \sum_{i=1}^n \sum_{j>i} {\mathrm{Cov}}[X_i,X_j].

In particular,

Claim 6 Let {X_1,\ldots,X_n} be mutually independent random variables. Then { {\mathrm{Var}}[\sum_{i=1}^n X_i]~= \sum_{i=1}^n {\mathrm{Var}}[X_i] }.

2. Chebyshev’s Inequality

You have presumably also seen Chebyshev’s inequality before. It is a one-line consequence of Markov’s inequality.

Theorem 7 For any {t > 0},

\displaystyle {\mathrm{Pr}}\Big[ \big|X-{\mathrm E}[X]\big|\geq t \Big] ~\leq~ \frac{{\mathrm{Var}}[X]}{t^2}.

Proof:

\displaystyle {\mathrm{Pr}}\Big[ \big|X-{\mathrm E}[X]\big|\geq t \Big] ~=~ {\mathrm{Pr}}\Big[ \big(X-{\mathrm E}[X]\big)^2\geq t^2 \Big] ~\leq~ \frac{ {\mathrm E}[ \big(X-{\mathrm E}[X]\big)^2 ]}{t^2} ~=~ \frac{ {\mathrm{Var}}[ X ]}{t^2},

where the inequality is by Markov’s inequality. \Box

As a quick example, suppose we independently flip {n} fair coins. What’s the probability that we see at least {3n/4} heads? Let {X_i} be the indicator random variable of the event “{i}th toss is heads”. Let {\mu = {\mathrm E}[\sum_i X_i] = n/2}. So we want to analyze {{\mathrm{Pr}}[ \sum_i X_i - \mu \geq n/4 ]}.

Bound from Chebyshev: Note that

\displaystyle {\mathrm{Var}}[X_i] ~=~ {\mathrm E}[X_i^2] - {\mathrm E}[X_i]^2 ~=~ 1/2 - 1/4 ~=~ 1/4.

By independence,

\displaystyle {\mathrm{Var}}[{\textstyle \sum_i X_i}] ~=~ \sum_i {\mathrm{Var}}[X_i] ~=~ n/4.

By Chebyshev’s inequality

\displaystyle {\mathrm{Pr}}[ {\textstyle \sum_i X_i - \mu \geq n/4 }] ~\leq~ {\mathrm{Pr}}\Big[ {\textstyle \big|\sum_i X_i-{\mathrm E}[\sum_i X_i]\big|\geq n/4 } \Big] ~\leq~ \frac{{\mathrm{Var}}[\sum_i X_i]}{(n/4)^2} ~=~ \frac{n/4}{(n/4)^2} ~=~ \frac{4}{n}.

Bound from Chernoff: Chernoff’s inequality gives

\displaystyle {\mathrm{Pr}}[ {\textstyle \sum_i X_i - \mu \geq n/4 }] ~=~ {\mathrm{Pr}}[ {\textstyle \sum_i X_i - \mu \geq \mu/2 }] ~\leq~ \exp( - (1/2)^2 \mu / 3 ) ~<~ 0.96^n.

This is better than the bound from Chebyshev for {n \geq 71}.
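A quick numeric check of that crossover (a minimal sketch; the helper names are ours, and the two formulas are exactly the bounds displayed above):

def chebyshev(n):
    return 4 / n                 # Var / t^2 = (n/4) / (n/4)^2

def chernoff(n):
    return 0.96 ** n             # the rounded form of exp(-n/24) used above

for n in (50, 70, 71, 100):
    print(n, chebyshev(n), chernoff(n))
# Chebyshev's bound is smaller up to n = 70; Chernoff's wins from n = 71 on.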

So Chebyshev is weaker than Chernoff, at least for analyzing sums of independent Bernoulli trials. So why do we bother studying Chebyshev? One reason is that Chernoff is designed for analyzing sums of mutually independent random variables. That is quite a strong assumption. In some scenarios, our random variables are not mutually independent, or perhaps we deliberately choose them not to be mutually independent.

  • For example, generating mutually independent random variables requires a lot of random bits and, as discussed last time, randomness is a “precious resource”. We will see that decreasing the number of random bits gives another method to derandomize algorithms.
  • Another important example is in constructing hash functions, which are random-like functions. Generating a completely random function takes a huge number of random bits. So instead we often try to use hash functions involving less randomness.

3. {k}-wise independence

A set of events {{\mathcal E}_1,\ldots,{\mathcal E}_n} is called {k}-wise independent if for any set {I \subseteq \{1,\ldots,n\}} with {|I| \leq k} we have

\displaystyle {\mathrm{Pr}}[ \wedge_{i \in I} {\mathcal E}_i ] ~=~ \prod_{i \in I} {\mathrm{Pr}}[ {\mathcal E}_i ].

The term pairwise independence is a synonym for {2}-wise independence.

Similarly, a set of discrete random variables {X_1,\ldots,X_n} is called {k}-wise independent if for any set {I \subseteq \{1,\ldots,n\}} with {|I| \leq k} and any values {x_i} we have

\displaystyle {\mathrm{Pr}}[~ \wedge_{i \in I} \: X_i \!=\! x_i ~] = \prod_{i \in I} {\mathrm{Pr}}[ X_i = x_i ].

Claim 8 Suppose {X_1,\ldots,X_n} are {k}-wise independent. Then

\displaystyle {\mathrm E}[~{\textstyle \prod_{i \in I} X_i }~] ~=~ \prod_{i \in I} {\mathrm E}[X_i] \qquad\forall I ~\mathrm{with}~ |I| \leq k.

Proof: For notational simplicity, consider the case {I = \{1,\ldots,k\}}. Then

\displaystyle \begin{array}{rcl} {\mathrm E}[~{\textstyle \prod_{i=1}^k X_i }~] &=& \sum_{x_1} \sum_{x_2} \cdots \sum_{x_k} ~ {\mathrm{Pr}}[~ \wedge_{i=1}^k \, X_i \!=\! x_i ~] \cdot \prod_{i=1}^k x_i \\ &=& \sum_{x_1} \sum_{x_2} \cdots \sum_{x_k} ~ \prod_{i=1}^k {\mathrm{Pr}}[X_i \!=\! x_i ] \cdot x_i \qquad(k\!-\!\mathrm{wise~independence})\\ &=& \Big(\sum_{x_1} {\mathrm{Pr}}[X_1 \!=\! x_1 ] \cdot x_1\Big) \cdots \Big(\sum_{x_k} {\mathrm{Pr}}[X_k \!=\! x_k ] \cdot x_k\Big) \\ &=& \prod_{i=1}^k {\mathrm E}[ X_i ]. \end{array}

\Box

Example: To get a feel for pairwise independence, consider the following three Bernoulli random variables that are pairwise independent but not mutually independent. There are 4 possible outcomes of these three random variables. Each of these outcomes has probability {1/4}.

\displaystyle \begin{array}{ccc} X_1 & X_2 & X_3 \\ \hline 0 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{array}

They are certainly not mutually independent because the event {X_1=X_2=X_3=1} has probability {0}, whereas {\prod_{i=1}^3 {\mathrm{Pr}}[X_i=1]=(1/2)^3}. But, by checking all cases, one can verify that they are pairwise independent.
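This can be verified mechanically. The following exhaustive check (in Python, with the table above hard-coded) confirms that every pair of the three variables is uniform on {\{0,1\}^2}; note that the table is simply {X_3 = X_1 \oplus X_2}.

from itertools import product

outcomes = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]  # each w.p. 1/4
for i, j in [(0, 1), (0, 2), (1, 2)]:        # all pairs of variables
    for a, b in product([0, 1], repeat=2):   # all pairs of values
        pr = sum(1 for o in outcomes if o[i] == a and o[j] == b) / 4
        assert pr == 1 / 4                   # = Pr[X_i = a] * Pr[X_j = b]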

3.1. Constructing Pairwise Independent RVs

Let {{\mathbb F}} be a finite field and {q=|{\mathbb F}|}. We will construct RVs {\{\: Y_i \::\: i \in {\mathbb F} \:\}} such that each {Y_i} is uniform over {{\mathbb F}} and the {Y_i}‘s are pairwise independent. To do so, we need to generate only two independent RVs {X_1} and {X_2} that are uniformly distributed over {{\mathbb F}}. We then define

\displaystyle Y_i ~=~ X_1 + i \cdot X_2. \ \ \ \ \ (1)

 

Claim 9 Each {Y_i} is uniformly distributed on {{\mathbb F}}.

Proof: For {i=0} we have {Y_i = X_1}, which is uniform. For {i \neq 0} and any {j \in {\mathbb F}} we have

\displaystyle \begin{array}{rcl} {\mathrm{Pr}}[ Y_i = j ] &=& {\mathrm{Pr}}[ X_1 + i \cdot X_2 = j ] \\ &=& {\mathrm{Pr}}[ X_2 = (j-X_1)/i ] \\ &=& \sum_{x \in {\mathbb F}} {\mathrm{Pr}}[~ X_2 = (j-x)/i ~\wedge~ X_1=x ~] \qquad(\mathrm{total~probability~law}) \\ &=& \sum_{x \in {\mathbb F}} {\mathrm{Pr}}[ X_2 = (j-x)/i] \cdot {\mathrm{Pr}}[ X_1=x ] \qquad(\mathrm{independence~of~} X_1 ~\mathrm{and}~ X_2) \\ &=& (1/q) \cdot \sum_{x \in {\mathbb F}} {\mathrm{Pr}}[ X_2 = (j-x)/i ] \\ &=& (1/q), \end{array}

since as {x} ranges through {{\mathbb F}}, {(j-x)/i} also ranges through all of {{\mathbb F}}. (In other words, the map {x \mapsto (j-x)/i} is a bijection of {{\mathbb F}} to itself.) So {Y_i} is uniform. \Box

Claim 10 The {Y_i}‘s are pairwise independent.

Proof: We wish to show that, for any distinct RVs {Y_i} and {Y_j} and any values {a,b \in {\mathbb F}}, we have

\displaystyle {\mathrm{Pr}}[~ Y_i=a \:\wedge\: Y_j=b ~] ~=~ {\mathrm{Pr}}[ Y_i=a ] \cdot {\mathrm{Pr}}[ Y_j=b ] ~=~ 1/q^2.

This event is equivalent to {X_1+i \cdot X_2 = a} and {X_1+j \cdot X_2 = b}. We can also rewrite that as:

\displaystyle \begin{pmatrix} 1 & i \\ 1 & j \end{pmatrix} \cdot \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} ~=~ \begin{pmatrix} a \\ b \end{pmatrix}.

This holds precisely when

\displaystyle \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} ~=~ \begin{pmatrix} 1 & i \\ 1 & j \end{pmatrix}^{-1} \cdot \begin{pmatrix} a \\ b \end{pmatrix} ~=~ \begin{pmatrix} ja-ib \\ b-a \end{pmatrix} /(j-i).

Since {X_1} and {X_2} are independent and uniform over {{\mathbb F}}, this event holds with probability {1/q^2}. \Box

Corollary 11 Given {2n} mutually independent, uniformly random bits, we can construct {2^n} pairwise independent, uniformly random strings in {\{0,1\}^n}.

Proof: Apply the previous construction to the finite field {{\mathbb F}_{2^n}}. The {2n} mutually independent random bits are used to construct {X_1} and {X_2}. The random strings {Y_1,\ldots,Y_{2^n}} are constructed as in (1). \Box
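In code the construction is two lines. The sketch below uses a prime field {{\mathbb F}_p} in place of {{\mathbb F}_{2^n}} (a prime modulus avoids implementing {\mathrm{GF}(2^n)} arithmetic, and the pairwise-independence proof is identical); the particular prime and the function names are our choices.

import random

p = (1 << 31) - 1  # a convenient prime, playing the role of q = |F|

def sample_Y():
    x1 = random.randrange(p)             # X_1: uniform on F_p
    x2 = random.randrange(p)             # X_2: uniform on F_p, independent
    return lambda i: (x1 + i * x2) % p   # Y_i = X_1 + i * X_2, as in (1)

Y = sample_Y()
values = [Y(i) for i in range(10)]  # pairwise independent, each uniform on F_p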

3.2. Example: Max Cut with pairwise independent RVs

Once again let’s consider the Max Cut problem. We are given a graph {G=(V,E)} where {V = \{1,\ldots,n\}}. We will choose {\{0,1\}}-valued random variables {Y_1,\ldots,Y_n}. If {Y_i=1} then we add vertex {i} to {U}.

Our original algorithm chose {Y_1,\ldots,Y_n} to be mutually independent and uniform. Instead we will pick {Y_1,\ldots,Y_n} to be pairwise independent and uniform. Then

\displaystyle \begin{array}{rcl} {\mathrm E}[ |\delta(U)| ] &=& \sum_{ij \in E} {\mathrm{Pr}}[~ (i \!\in\! U \wedge j \!\not\in\! U) \:\vee\: (i \!\not\in\! U \wedge j \!\in\! U) ~] \\ &=& \sum_{ij \in E} {\mathrm{Pr}}[~ i \!\in\! U \wedge j \!\not\in\! U ~] + {\mathrm{Pr}}[~ i \!\not\in\! U \wedge j \!\in\! U ~] \\ &=& \sum_{ij \in E} {\mathrm{Pr}}[Y_i\!=\!1]\,{\mathrm{Pr}}[Y_j\!=\!0] + {\mathrm{Pr}}[Y_i\!=\!0]\,{\mathrm{Pr}}[Y_j\!=\!1] \qquad(\mathrm{pairwise~independence}) \\ &=& \sum_{ij \in E} \Big( (1/2)^2 + (1/2)^2 \Big) \\ &=& |E|/2 \end{array}

So the original algorithm works just as well if we make pairwise independent decisions instead of mutually independent decisions for placing vertices in {U}. The following theorem shows the advantage of making pairwise independent decisions.

Theorem 12 There is a deterministic, polynomial time algorithm to find a cut {\delta(U)} with {|\delta(U)| \geq |E|/2}.

Proof: By Corollary 11, we only need {b = O(\log n)} mutually independent, uniform random bits {X_1,\ldots,X_b} in order to generate our pairwise independent, uniform random bits {Y_1,\ldots,Y_n}. We have just argued that these pairwise independent {Y_i}‘s will give us

\displaystyle {\mathrm E}_{X_1,\ldots,X_b}[ |\delta(U)| ] ~=~ |E|/2.

So there must exist some particular bits {(x_1,\ldots,x_b)} such that fixing {X_i=x_i} for all {i}, we get { |\delta(U)| \geq |E|/2 }. We can deterministically find such bits {(x_1,\ldots,x_b)} by exhaustive search in {2^b = 2^{O(\log n)} = \mathrm{poly}(n)} trials. This gives a deterministic, polynomial time algorithm. \Box

4. Chebyshev with pairwise independent RVs

One of the main benefits of pairwise independent RVs is that Chebyshev’s inequality still works beautifully. Suppose that {X_1,\ldots,X_n} are pairwise independent. For any {i \neq j},

\displaystyle {\mathrm{Cov}}[X_i,X_j] ~=~ {\mathrm E}[X_i X_j] - {\mathrm E}[X_i]{\mathrm E}[X_j] ~=~ 0,

by Claim 8. So

\displaystyle {\mathrm{Var}}[{\textstyle \sum_{i=1}^n X_i}] ~=~ \sum_{i=1}^n {\mathrm{Var}}[X_i],

by Claim 5. So

\displaystyle {\mathrm{Pr}}[{\textstyle |\sum_i X_i - {\mathrm E}[\sum_i X_i]| > t }] ~\leq~ \frac{{\mathrm{Var}}[\sum_i X_i]}{t^2}.

This is exactly the same bound that we would get if the {X_i}‘s were mutually independent.


Lecture 16: Derandomization: Method of Conditional Expectations, Method of Pessimistic Estimators

In this lecture we discuss the topic of derandomization — converting a randomized algorithm into a deterministic one.

1. Method of Conditional Expectations

One of the simplest methods for derandomizing an algorithm is the “method of conditional expectations”. In some contexts this is also called the “method of conditional probabilities”.

Let us start with a simple example. Let {[k]} denote {\{1,\ldots,k\}}. Suppose {X} is a random variable taking values in {[k]}. Let {f : [k] \rightarrow {\mathbb R}} be any function and suppose {{\mathrm E}[ f(X) ] \leq \mu}. How can we find an {x \in [k]} such that {f(x) \leq \mu}? Well, the assumption {{\mathrm E}[ f(X) ] \leq \mu} guarantees that such an {x} exists, so we can simply use exhaustive search to find one with only {O(k)} evaluations of {f}. The same idea can also be used to find an {x} with {f(x) \geq \mu}.

Now let’s make the example a bit more complicated. Suppose {X_1,\ldots,X_n} are independent random variables taking values in {[k]}. Let {f : [k]^n \rightarrow {\mathbb R}} be any function and suppose {{\mathrm E}[ f(X_1,\ldots,X_n) ] \leq \mu}. How can we find a vector {(x_1,\ldots,x_n) \in [k]^n} with {f(x_1,\ldots,x_n) \leq \mu}? Exhaustive search is again an option, but now it will take {O(k^n)} time, which might be too much.

The method of conditional expectations gives a more efficient solution, under some additional assumptions. Suppose that for any numbers {(x_1,\ldots,x_i)} we can efficiently evaluate

\displaystyle {\mathrm E}_{X_{i+1},\ldots,X_n}[\: f(x_1,\ldots,x_i,X_{i+1},\ldots,X_n) \:].

(If you prefer, you can think of this as {{\mathrm E}[~ f(X_1,\ldots,X_n) ~|~ X_1=x_1,\ldots,X_i=x_i ~]}, which is a conditional expectation of {f}. This is where the method gets its name.) Then the following algorithm will produce a point {x_1,\ldots,x_n} with {f(x_1,\ldots,x_n) \leq \mu}.

  • For {i=1,\ldots,n}
  • {\qquad~} Set {x_i=0}.
  • {\qquad~} Repeat
  • {\qquad\qquad~} Set {x_i = x_i + 1}.
  • {\qquad~} Until {{\mathrm E}[\: f(x_1,\ldots,x_i,X_{i+1},\ldots,X_n) \:] \leq {\mathrm E}[\: f(x_1,\ldots,x_{i-1},X_i,\ldots,X_n) \:]}
  • End

First we claim that the algorithm will terminate (i.e., the repeat loop will eventually succeed). To see this, define

\displaystyle g(y) ~=~ {\mathrm E}_{X_{i+1},\ldots,X_n}[\: f(x_1,\ldots,x_{i-1},y,X_{i+1},\ldots,X_n) \:].

Just like in our simple example above, there exists an {x_i} with {g(x_i) \leq {\mathrm E}_{X_i}[ g(X_i) ]}, so we can find such an {x_i} by exhaustive search. That is exactly what the repeat loop is doing.

1.1. Example: Max Cut

To illustrate this method, let us consider our algorithm for the Max Cut problem from Lecture 1. We are given a graph {G=(V,E)}. Recall that this algorithm generates a cut {\delta(U)} simply by picking a set {U \subseteq V} uniformly at random. Equivalently, for each vertex {v \in V}, the algorithm independently flips a fair coin to decide whether to put {v \in U}. We argued that {{\mathrm E}[|\delta(U)|] \geq |E|/2}.

We will use the method of conditional expectations to derandomize this algorithm. Let the vertex set of the graph be {V = \{1,\ldots,n\}}. Let

\displaystyle f(x_1,\ldots,x_n) = |\delta(U)| \qquad\text{where}\qquad U = \{\: i \::\: x_i=1 \}.

Let {X_1,\ldots,X_n} be independent random variables where each {X_i} is {0} or {1} with probability {1/2}. We identify the event “{X_i=1}” with the event “vertex {i \in U}”. Then {{\mathrm E}[ f(X_1,\ldots,X_n) ] = {\mathrm E}[ |\delta(U)| ] = |E|/2}. We wish to deterministically find values {x_1,\ldots,x_n} for which {f(x_1,\ldots,x_n) \geq |E|/2}.

To apply the method of conditional probabilities we must be able to efficiently compute

\displaystyle {\mathrm E}_{X_{i+1},\ldots,X_n}[\: f(x_1,\ldots,x_i,X_{i+1},\ldots,X_n) \:],

for any numbers {(x_1,\ldots,x_i)}. What is this quantity? It is the expected number of edges cut when we have already decided which vertices amongst {\{1,\ldots,i\}} belong to {U}, and the remaining vertices {\{i+1,\ldots,n\}} are placed in {U} randomly (independently, with probability {1/2}). This expectation is easy to compute! For any edge with both endpoints in {\{1,\ldots,i\}} we already know whether it will be cut or not. Every other edge has probability exactly {1/2} of being cut. So we can compute that expected value in linear time.

In conclusion, the method of conditional expectations gives us a deterministic, polynomial time algorithm outputting a set {U} with {|\delta(U)| \geq |E|/2}. A short Python sketch of this derandomization follows (function names are ours; vertices are {1,\ldots,n} and edges are pairs).
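def cond_exp_max_cut(n, edges):
    side = {}                     # decided vertices -> 0 or 1

    def cond_exp():
        # Edges with both endpoints decided contribute 0 or 1; every other
        # edge is cut with probability exactly 1/2.
        total = 0.0
        for (u, v) in edges:
            if u in side and v in side:
                total += 1.0 if side[u] != side[v] else 0.0
            else:
                total += 0.5
        return total

    for v in range(1, n + 1):
        side[v] = 0
        exp0 = cond_exp()
        side[v] = 1
        exp1 = cond_exp()
        side[v] = 0 if exp0 >= exp1 else 1    # keep the better branch
    return {v for v in side if side[v] == 1}  # |delta(U)| >= |E|/2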

2. Method of Pessimistic Estimators

So far we have derandomized our very simple Max Cut algorithm, which doesn’t use any sophisticated probabilistic tools. Next we will see what happens when we try to apply these ideas to algorithms that use the Chernoff bound.

Let {X_1,\ldots,X_n} be independent random variables in {[0,1]}. Define the function {f} as follows:

\displaystyle f(x_1,\ldots,x_n) ~=~ \begin{cases} 1 &\qquad\mathrm{if}~ \sum_i x_i \geq \alpha \\ 0 &\qquad\mathrm{if}~ \sum_i x_i < \alpha \end{cases}.

So

\displaystyle {\mathrm E}[f(X_1,\ldots,X_n)] ~=~ {\mathrm{Pr}}[\textstyle \sum_i X_i \geq \alpha ],

which is the typical sort of quantity to which one would apply a Chernoff bound.

Can we apply the method of conditional expectations to this function {f}? For any numbers {(x_1,\ldots,x_i)}, we need to efficiently evaluate

\displaystyle {\mathrm E}_{X_{i+1},\ldots,X_n}[\: f(x_1,\ldots,x_i,X_{i+1},\ldots,X_n) \:] ~=~ {\mathrm{Pr}}[~ \textstyle \sum_i X_i \geq \alpha ~|~ X_1=x_1,\ldots,X_i=x_i ~].

Unfortunately, computing this is not so easy. If the {X_i}‘s were i.i.d. Bernoullis then we could compute that probability by expanding it in terms of binomial coefficients. But in the non-i.i.d. or non-Bernoulli case, there does not seem to be an efficient way to compute this probability.

Here is the main idea of “pessimistic estimators”: instead of defining {f} to be equal to that probability, we will define {f} to be an easily-computable upper bound on that probability. Because {f} is an upper bound on the probability of the bad event “{\sum_i X_i \geq \alpha}”, the function {f} is called a pessimistic estimate of that probability. So what upper bound should we use? The Chernoff bound, of course!

For simplicity, suppose that {X_1,\ldots,X_n} are independent Bernoulli random variables. The first step of the Chernoff bound (exponentiation and Markov’s inequality) shows that, for any parameter {t>0},

\displaystyle {\mathrm{Pr}}[~ \textstyle \sum_i X_i \geq \alpha ~] ~\leq~ {\mathrm E}\Big[ e^{-t \alpha} \prod_{j=1}^n e^{t X_j} \Big]. \ \ \ \ \ (1)

Important Remark: This step holds for any joint distribution on the {X_i}‘s, including any non-independent or conditional distribution. This is because we have only used exponentiation and Markov’s inequality, which need no assumptions on the distribution.

We will use the upper bound in (1) to define our function {f}. Specifically, define

\displaystyle f(x_1,\ldots,x_n) ~=~ e^{-t \alpha} \cdot \prod_{j=1}^n e^{t x_j}. \ \ \ \ \ (2)

Let’s check that the conditional expectations are easy to compute with this new definition of {f}. Given any numbers {(x_1,\ldots,x_i)}, we have

\displaystyle \begin{array}{rcl} {\mathrm E}_{X_{i+1},\ldots,X_n}[\: f(x_1,\ldots,x_i,X_{i+1},\ldots,X_n) \:] &=& e^{-t \alpha} \cdot \prod_{j=1}^i e^{t x_j} ~\cdot~ {\mathrm E}_{X_{i+1},\ldots,X_n}\Bigg[\: \prod_{j=i+1}^n e^{t X_j} \:\Bigg] \\ &=& e^{-t \alpha} \cdot \prod_{j=1}^i e^{t x_j} ~\cdot~ \prod_{j=i+1}^n {\mathrm E}_{X_j}\big[\: e^{t X_j} \:\big]. \end{array}

This expectation is easy to compute in linear time, assuming we know the distribution of each {X_j} (i.e., we know that {{\mathrm{Pr}}[ X_j=1 ] = p_j}).

Applying the method of conditional expectations to the pessimistic estimator: Now we’ll see how to use this function {f} to find {(x_1,\ldots,x_n)} with {\sum_i x_i < \alpha}. Set {\mu = \sum_i {\mathrm E}[X_i]}, {\alpha = (1+\delta) \mu} and {t = \ln(1+\delta)}. We have

\displaystyle {\mathrm{Pr}}[ \textstyle \sum_i X_i > (1+\delta) \mu ] ~\leq~ {\mathrm E}[ f(X_1,\ldots,X_n) ] ~\leq~ \exp\Big( - \mu \big( (1+\delta) \ln(1+\delta) - \delta\big) \Big),

where the first inequality is from (1) and the second inequality comes from the remainder of our Chernoff bound proof. Suppose {\mu} and {\delta} are such that this last quantity is strictly less than {1}. Then we know that there exists a vector {(x_1,\ldots,x_n)} with {\sum_i x_i < \alpha}.

We now explain how to efficiently and deterministically find such a vector. The method of conditional expectations will give us a vector {(x_1,\ldots,x_n)} for which {f(x_1,\ldots,x_n) < 1}. We now apply the same argument as in (1) to a conditional distribution:

\displaystyle \begin{array}{rcl} {\mathrm{Pr}}[~ \textstyle \sum_i X_i \geq (1+\delta) \mu ~|~ X_1=x_1,\ldots,X_n=x_n ~] &\leq& {\mathrm E}\Big[ f(X_1,\ldots,X_n) ~|~ X_1=x_1,\ldots,X_n=x_n \Big] \\ &=& f(x_1,\ldots,x_n) ~<~ 1. \end{array}

But, under the conditional distribution “{X_1=x_1,\ldots,X_n=x_n}”, there is no randomness remaining. The sum {\sum_i X_i} is not a random variable; it is simply the number {\sum_i x_i}. Since the event “{\sum_i x_i \geq (1+\delta) \mu}” has probability less than {1}, it must have probability {0}. In other words, we must have {\sum_i x_i < (1+\delta) \mu}.

This example is actually quite silly. If we want to achieve {\sum_i x_i < \alpha}, the best thing to do is obviously to set each {x_i = 0}. But the method is useful because we can apply it in more complicated scenarios that involve multiple Chernoff bounds.
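Here is the silly example as a Python sketch (the names and the Bernoulli parameters are ours). The loop greedily fixes each {x_i} so that the estimator (2) never increases; exactly as the remark predicts, on this toy problem it just sets every {x_i = 0}, but the identical loop drives the congestion-minimization derandomization below.

import math

def pessimistic_round(probs, alpha):
    # probs[j] = Pr[X_j = 1]; requires alpha > mu so that t = ln(1+delta) > 0.
    mu = sum(probs)
    t = math.log(alpha / mu)

    def f(fixed, rest):
        val = math.exp(-t * alpha)
        for x in fixed:                       # already-fixed variables
            val *= math.exp(t * x)
        for q in rest:                        # E[e^{t X_j}] for Bernoulli(q)
            val *= (1 - q) + q * math.exp(t)
        return val

    xs = []
    for i in range(len(probs)):
        rest = probs[i + 1:]
        xs.append(0 if f(xs + [0], rest) <= f(xs + [1], rest) else 1)
    return xs   # sum(xs) < alpha whenever the initial bound E[f] < 1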

2.1. Congestion Minimization

In Lecture 3 we gave a randomized algorithm which gives an {O(\log n / \log \log n)} approximation to the congestion minimization problem. We now get a deterministic algorithm by the method of pessimistic estimators.

Recall that an instance of the problem consists of a directed graph {G=(V,A)} with {n = |V|} and a sequence {(s_1,t_1), \ldots, (s_k,t_k)} of pairs of vertices. We want to find {s_i}-{t_i} paths such that each arc {a} is contained in few paths. Let {{\mathcal P}_i} be the set of all paths in {G} from {s_i} to {t_i}. For every path {P \in {\mathcal P}_i}, we create a variable {x_P^i}.

We obtain a fractional solution to the problem by solving this LP.

\displaystyle \begin{array}{llll} \mathrm{min} & C && \\ \mathrm{s.t.} & {\displaystyle \sum_{P \in {\mathcal P}_i}} x_P^i &= 1 &\qquad\forall i=1,\ldots,k \\ & {\displaystyle \sum_i ~~ \sum_{P \in {\mathcal P}_i \mathrm{~with~} a \in P}} x_P^i &\leq C &\qquad\forall a \in A \\ & x_P^i &\geq 0 &\qquad\forall i=1,\ldots,k \mathrm{~and~} P \in {\mathcal P}_i \end{array}

Let {C^*} be the optimal value of the LP.

We showed how randomized rounding gives us an integer solution (i.e., an actual set of paths). The algorithm chooses exactly one path {P_i} from {{\mathcal P}_i} by setting {P_i = P} with probability {x^i_P}. For every arc {a} let {Y_i^a} be the indicator of the event “{a \in P_i}”. Then the congestion on arc {a} is {Y^a = \sum_i Y_i^a}. We showed that {{\mathrm E}[ Y^a ] \leq C^*}. Let {\alpha = 6 \log n / \log \log n}. We applied Chernoff bounds to every arc and a union bound to show that

\displaystyle {\mathrm{Pr}}[~ \mathrm{any} ~a~ \mathrm{has}~ Y^a > \alpha C^* ~] ~\leq~ \sum_{a \in A} {\mathrm{Pr}}[~ Y^a > \alpha C^* ~] ~\leq~ 1/n.

We will derandomize that algorithm with the function

\displaystyle f(P_1,\ldots,P_k) ~=~ \sum_{a \in A} e^{-t \alpha} \cdot \prod_{j=1}^k e^{t Y_j^a}.

How did we obtain this function? For each arc {a} we applied a Chernoff bound, so each arc {a} has a pessimistic estimator as in (2). We add all of those estimators together to give us this function {f}.

Note that

\displaystyle \textstyle {\mathrm{Pr}}[~ \mathrm{any} ~a~ \mathrm{has}~ Y^a > \alpha C^* ~] ~\leq~ \sum_{a \in A} {\mathrm{Pr}}[~ Y^a > \alpha C^* ~] ~\leq~ {\mathrm E}[ f(P_1,\ldots,P_k) ] ~\leq~ 1/n.

Applying the method of conditional expectations, we can find a vector of paths {(p_1,\ldots,p_k)} for which {f(p_1,\ldots,p_k) \leq 1/n}. Thus,

\displaystyle \begin{array}{rcl} {\mathrm{Pr}}[~ \mathrm{any} ~a~ \mathrm{has}~ Y^a > \alpha C^* \:|\: P_1=p_1,\ldots,P_k=p_k ~] &\leq& {\mathrm E}[~ f(P_1,\ldots,P_k) \:|\: P_1=p_1,\ldots,P_k=p_k ~] \\ &=& f(p_1,\ldots,p_k) ~\leq~ 1/n. \end{array}

Under that conditional distribution there is no randomness left, so the event “any {a} has {Y^a > \alpha C^*}” must have probability {0}. So, if we choose the paths {p_1,\ldots,p_k} then every arc {a} has congestion at most {\alpha C^*}, as desired.

 
