## Lecture 25: Online Steiner Tree

The algorithm for probabilistically embedding metric spaces into trees has numerous theoretical applications. It is a key tool in the design of many approximation algorithms and online algorithms. Today we will illustrate the usefulness of these trees in designing an algorithm for the online Steiner tree problem.

1. Online Steiner Tree

Let ${G=(V,E)}$ be a graph and let ${c : E \rightarrow {\mathbb R}_{>0}}$ be lengths on the edges. Let ${(V,d_G)}$ be the shortest path metric on ${G}$.

For any ${U \subseteq V}$, a Steiner tree is a subtree of ${G}$ that spans (i.e., contains) all vertices in ${U}$, but does not necessarily span all of ${V}$. The vertices in ${U}$ are called “terminals”. Equivalently, we can define a Steiner tree to be an acyclic, connected subgraph of ${G}$ that spans all of ${U}$. Computing a minimum-length Steiner tree is NP-hard.

Today we consider the problem of constructing a Steiner tree in an online setting. There is a sequence of ${k}$ time steps. In each time step ${i}$, we are given a vertex ${u_i \in V}$. Our algorithm must then choose a connected subgraph ${C_i}$ of ${G}$ which spans ${\{u_1,\ldots,u_i\}}$ (and possibly other vertices). The objective is to minimize the total length ${c( \cup_{i=1}^k C_i )}$. Since we only care about the cost of the union of the ${C_i}$‘s, we may assume without loss of generality that ${C_1 \subseteq \cdots \subseteq C_k}$. There is no restriction on computation time of the algorithm.

Remark. The ${C_i}$‘s are not actually Steiner trees because we did not insist that they are acyclic. If trees are desired, one could remove cycles from each ${C_i}$ arbitrarily. This is equivalent to the problem that we stated above.

If we knew the terminal set ${U = \{u_1,\ldots,u_k\}}$ in advance then the problem is trivial. The algorithm could compute in exponential time the minimum-length Steiner tree ${C^*}$ spanning ${U}$, then set ${C_i = C^*}$ in every time step. Unfortunately the algorithm does not know ${U}$ in advance. Instead, our goal will be for the algorithm to behave almost as well as if it did know ${U}$. Formally, define competitive ratio of the algorithm to be the ratio

$\displaystyle \frac{ c( \cup_{i=1}^k C_i ) }{ c(C^*) }.$

We want our algorithm to have small competitive ratio.

Theorem 1 There is a randomized, polynomial time algorithm with expected competitive ratio ${O(\log n)}$.

I think this is optimal but I could not find a definitive reference. For very similar problems, Alon & Azar prove a ${\Omega(\log n / \log \log n)}$ lower bound and Imase & Waxman prove a ${\Omega(\log n)}$ lower bound.

1.1. The Algorithm

The main idea is to use algorithm of the last lecture to approximate the metric ${(V,d_G)}$ by a tree ${T}$ with edge lengths ${w}$. Let ${d_T}$ be the corresponding distance function on ${T}$. Recall that the leaves of ${T}$ are identified with the vertices in ${V}$. The algorithm will then build a sequence of Steiner trees ${T_1,\ldots,T_k}$ that are subtrees of ${T}$, where each ${T_i}$ spans ${\{u_1,\ldots,u_i\}}$. This is trivial: since ${T}$ is itself a tree, there is really only one reasonable choice for what ${T_i}$ should be. We set ${T_i}$ to be the unique minimal subtree of ${T}$ that spans ${\{u_1,\ldots,u_i\}}$.

Remark. This step of the algorithm illustrates the usefulness of probabilistically approximating the metric by a tree. Many problems can be solved either trivially or by very simple algorithms, when the underlying graph is a tree.

Clearly ${T_1 \subseteq \cdots \subseteq T_k}$. We would like to understand how the length of our final Steiner tree ${T_k}$ compares to the optimal Steiner tree ${C^*}$.

Claim 2 ${{\mathrm E}[w(T_k)] \leq O(\log n) \cdot c(C^*)}$

Unfortunately the tree ${T_k}$ itself isn’t a solution to our problem. Recall that ${T}$ is not a subtree of ${G}$: the construction of ${T}$ required adding extra vertices and edges. So ${T_k}$ is not a subtree of ${G}$ either. To obtain our actual solution, we will see below how to use the trees ${T_i}$ to guide our construction of the desired subgraphs ${C_i}$ of ${G}$.

Proof: (of Claim 2). Let ${\pi : [k] \rightarrow U}$ be an ordering of the terminals given by a depth-first traversal of ${C^*}$. Equivalently, let ${2 \cdot C^*}$ denote the graph obtained from ${C^*}$ by replacing every edge with two parallel copies. Perform an Euler tour of ${2 \cdot C^*}$, and let ${\pi}$ be the order in which the terminals are visited.

The Euler tour traverses every edge of ${2 \cdot C^*}$ exactly once, so ${c(C^*)}$ is exactly half the length of the Euler tour. Thus

$\displaystyle c(C^*) ~=~ (1/2) \cdot (\mathrm{cost~of~Euler~tour}) ~\geq~ (1/2) \cdot \sum_{i=1}^k d_G(\pi_i,\pi_{i+1}). \ \ \ \ \ (1)$

Now consider performing a walk through ${T_k}$, visiting the terminals in the order given by ${\pi}$. Since this walk visits every leaf of ${T_k}$, it is a traversal of the tree, and hence it crosses every edge at least once. (In fact, it is an Eulerian walk, so it crosses every edge at least twice.)

Thus

$\displaystyle w(T_k) \leq \sum_{i=1}^k d_T(\pi_i,\pi_{i+1}). \ \ \ \ \ (2)$

Combining (1) and (2), we get

$\displaystyle {\mathrm E}[\, w(T_k) \,] ~\leq~ {\mathrm E}\Big[\: \sum_{i=1}^k d_T(\pi_i,\pi_{i+1}) \:\Big] \\ ~\leq~ O(\log n) \cdot \sum_{i=1}^k d_G(\pi_i,\pi_{i+1}) \\ ~\leq~ O(\log n) \cdot c(C^*), \ \ \ \ \ (3)$

as required. $\Box$

Remark. The analysis in (3) illustrates why our random tree ${T}$ is so useful. The quantity that we’re trying to analyze (namely ${w(T_k)}$) is bounded by a linear function of some distances in the tree (namely ${\sum_i d_T(\pi_i,\pi_{i+1})}$). Because we can bound ${{\mathrm E}[ d_T(\pi_i,\pi_{i+1}) ]}$ by ${O(\log n) \cdot d_G(\pi_i,\pi_{i+1})}$, and because of linearity of expectation, we obtain a bound on ${{\mathrm E}[w(T_k)]}$ involving distances in ${G}$.

Now let us explain how the algorithm actually solves the online Steiner tree problem. It will maintain a sequence ${C_1 \subseteq \cdots \subseteq C_k}$ of subgraphs of ${G}$ such that each ${C_i}$ spans ${\{u_1,\ldots,u_i\}}$. Initially ${C_1 = \{ u_1 \}}$. Then at every subsequent time step ${i}$, we do the following.

• Let ${v_i}$ be the closest vertex to ${u_i}$ amongst ${\{u_1,\ldots,u_{i-1}\}}$, using distances in ${T}$. In other words, ${v_i \,=\, \mathrm{argmin}_{v \in \{u_1,\ldots,u_{i-1}\}} \: d_T(u_i,v) }$.
• Let ${P_i}$ be a shortest path in ${G}$ from ${u_i}$ to ${v_i}$.
• Let ${C_i = C_{i-1} \cup P_i}$.

The trees ${T_1,\ldots,T_k}$ described above can also be viewed in this iterative fashion. Initially ${T_1 = \{ u_1 \}}$. Then at every subsequent time step ${i}$, we do the following.

• Let ${Q_i}$ be the unique path in ${T}$ from ${u_i}$ to the closest vertex in ${T_{i-1}}$.
• Let ${T_i = T_{i-1} \cup Q_i}$.

The following important claim relates ${P_i}$ and ${Q_i}$.

Claim 3 ${c(P_i) \leq 2 \cdot w(Q_i)}$ for all ${i}$.

Proof: Let ${z_i}$ be the closest vertex in ${T_{i-1}}$ to ${u_i}$, so ${Q_i}$ is a ${u_i}$${z_i}$ path. The vertex ${z_i}$ would only be added to ${T_{i-1}}$ if some leaf ${v}$ beneath ${z_i}$ belongs to ${\{u_1,\ldots,u_{i-1}\}}$. By the choice of weights in ${T_{i-1}}$, the weight of the ${z_i}$${v}$ path is no longer than the weight of the ${u_i}$${z_i}$ path. Thus ${d_T(u_i,v_i) \leq 2 \cdot w(Q_i)}$ for all ${i}$. Consequently

$\displaystyle c(P_i) ~=~ d_G(u_i,v_i) ~\leq~ d_T(u_i,v_i) ~\leq~ 2 \cdot w(Q_i),$

as required. $\Box$

Consequently,

$\displaystyle c(C_k) ~\leq~ \sum_{i=1}^k c(P_i) ~\leq~ 2 \sum_{i=1}^k w(Q_i) ~=~ 2 \cdot w(T_k),$

since ${C_k}$ is the union of the ${P_i}$ paths and ${T_k}$ is the disjoint union of the ${Q_i}$ paths. So

$\displaystyle {\mathrm E}[ c(C_k) ] ~\leq~ O(\log n) \cdot c(C^*),$

by Claim 2. This proves that the expected competitive ratio is ${O(\log n)}$.

## Lecture 24: Probabilistic approximation of metrics by trees

1. Probabilistic Approximation of Metrics

For many optimization problems, the input data involves some notion of distance, which we formalize as a metric. But unfortunately many optimization problems can be quite difficult to solve in an arbitrary metric. In this lecture we present a very approach to dealing with such problems, which is a method to approximate any metric by much simpler metrics. The simpler metrics we will use are trees, i.e., the shortest path metric on a graph that is a tree. Many optimization problems are easy to solve on trees, so in one fell swoop we get algorithms to approximate a huge number of optimization problems.

Roughly speaking, our main result is: any metric on ${n}$ points can be represented by a distribution on trees, while preserving distances up to a ${O(\log n)}$ factor. Consequently: for many optimization problems involving distances in a metric, if you are content with an ${O(\log n)}$-approximate solution, you can assume that your metric is a tree.

In order to state our results more formally, we will need to deal with a important issue. To illustrate the issue, and how to deal with it, we first present an example.

1.1. Example: Approximating a cycle

Let ${G=(V,E)}$ be a cycle on ${n}$ nodes. The (spanning) subtrees of ${G}$ are simply the paths obtained by deleting a single edge. So let ${uv}$ be an edge and let ${T = G \setminus uv}$ be the corresponding tree. Is the shortest path metric of ${T}$ a good approximation of the shortest path of ${G}$? The answer is no: the distance between ${u}$ and ${v}$ in ${G}$ is only ${1}$, whereas the distance between ${u}$ and ${v}$ in ${T}$ is ${n-1}$. So, no matter which subtree ${T}$ of ${G}$ we pick, there will be some pair of nodes whose distance is poorly approximated.

Is there some way around this problem? Perhaps we don’t need ${T}$ to be a subtree of ${G}$. We could consider a tree ${T=(U,F)}$ (possibly with lengths on the edges) where ${U \supseteq V}$ and ${F}$ is completely unrelated to ${E}$. Can such a tree do a better job of approximating distances in ${G}$? It turns out that the answer is still no: there will always be a pair of nodes whose distance is only preserved up to a factor ${\Theta(n)}$. But here is a small observation: any subtree of ${T}$ approximately preserves the average distances. One can easily check that the total distance between all pairs of nodes is ${\Theta(n^3)}$, for both ${G}$ and for any subtree of ${G}$. Thus, subtrees approximate the distances in ${G}$ “on average”.

So for the ${n}$-cycle, a subtree cannot approximate all distances, but it can approximate the average distance. This motivates us to apply a trick that is both simple and counterintuitive. It turns out that we can approximate all distances if we allow ourself to pick the subtree randomly. (The trick is Von Neumann’s minimax theorem, and it implies that approximating the average distance is equivalent to finding a distribution on trees for which every distance is approximated in expectation.) To illustrate this, choose any pair of vertices ${u, v \in V}$. Let ${d=d_G(u,v)}$ be the distance between ${u}$ and ${v}$ in ${G}$. Pick a subtree ${T}$ by deleting an edge ${e}$ at random and let ${d_T(u,v)}$ be the ${u}$${v}$ distance in ${T}$. Obviously ${d_T(u,v) \geq d_G(u,v)}$ since we constructed ${T}$ by removing ${e}$ from ${G}$. We now give an upper bound on ${{\mathrm E}[ d_T(u,v) ]}$. If ${e}$ is on the shortest ${u}$${v}$ path then ${d_T(u,v) = n-d}$; the probability of that happening is ${d/n}$. Otherwise, ${d_T(u,v) = d}$. Thus,

$\displaystyle {\mathrm E}[ d_T(u,v) ] ~=~ (d/n) \cdot (n-d) + (1-d/n) \cdot d ~\leq~ 2d.$

So, every edge of ${G}$ is approximated to within a factor of ${2}$, in expectation.

1.2. Main Theorem

We now show that, for every metric ${(X,d)}$ with ${|X|=n}$, there is an algorithm that generates a random tree for which all distances are approximated to within a factor of ${O(\log n)}$, in expectation.

Theorem 1 Let ${(X,d)}$ be a finite metric with ${|X|=n}$. There is a randomized algorithm that generates a set of vertices ${Y}$, a map ${f : X \rightarrow Y}$, a tree ${T=(Y,F)}$, and weights ${w : F \rightarrow {\mathbb R}_{> 0}}$ such that

$\displaystyle d_X(x,y) ~\leq~ d_T\big(f(x),f(y)\big) \ \ \ \ \ (1)$

and

$\displaystyle {\mathrm E}[ d_T\big(f(x),f(y)\big) ] ~\leq~ O(\log n) \cdot d_X(x,y) \quad\forall x,y \in X. \ \ \ \ \ (2)$

The main tool in the proof is the random partitioning algorithm that we developed in the last two lectures. For notational simplicity, let us scale our distances and pick a value ${m}$ such that such that ${1 < d(x,y) \leq 2^m}$ for all distinct ${x,y \in X}$. Note that ${m}$ does not appear in the statement of the theorem, so we do not care how big it is.

The main idea is to generate a ${2^i}$-bounded random partition ${{\mathcal P}_i}$ of ${X}$ for every ${i=0,\ldots,m}$ then assemble those partitions into the desired tree. Assembling them is not too difficult, but there is one annoyance: the parts of ${{\mathcal P}_i}$ have absolutely no relation to the parts of ${{\mathcal P}_{i'}}$ for any ${i \neq i'}$. If the parts of ${{\mathcal P}_i}$ were nicely nested inside the parts of ${{\mathcal P}_{i+1}}$ then this would induce a natural hierarchy on the parts, and therefore give us a nice tree structure.

The solution to this annoyance is to forcibly construct a nice partition ${{\mathcal Q}_i}$, for ${i=0,\ldots,m}$, that is nested inside all of ${{\mathcal P}_i,{\mathcal P}_{i+1},\ldots,{\mathcal P}_m}$. In lattice theory terminology, we define the partition

$\displaystyle {\mathcal Q}_i ~=~ {\mathcal P}_m \wedge \cdots \wedge {\mathcal P}_{i+1} \wedge {\mathcal P}_i,$

where ${\wedge}$ is the meet operation in the partition lattice. If you’re not familiar with this notation, don’t worry; it is easy to explain. Simply define ${{\mathcal Q}_m = {\mathcal P}_m}$, then let

$\displaystyle {\mathcal Q}_i ~=~ \{\: A \cap B \::\: A \in {\mathcal Q}_{i+1} ,\: B \in {\mathcal P}_i \:\}.$

Note that ${{\mathcal Q}_i}$ is also a partition of ${X}$. Furthermore, the parts of ${{\mathcal Q}_i}$ are nicely nested inside the parts of ${{\mathcal Q}_{i+1}}$, so we have obtained the desired hierarchical structure.

1.3. Example

Consider the following example which shows some possible partitions ${{\mathcal P}_0,\ldots,{\mathcal P}_4}$ for the points ${X=\{a,b,c,d,e,f\}}$, and the corresponding partitions ${{\mathcal Q}_0,\ldots,{\mathcal Q}_4}$.

The tree corresponding to these partitions is as follows.

1.4. Algorithm

More formally, here is our algorithm for generating the random tree.

• For ${i=0,\ldots,m}$, let ${{\mathcal P}_i}$ be a ${2^i}$-bounded random partition generated by our algorithm from the last lecture.
• ${\rhd}$: The vertices in ${Y}$ will be pairs of the form ${(i,S)}$ where ${i \in \{0,\ldots,m\}}$ and ${S \subseteq X}$. The vertices and edges of the tree ${T}$ are generated by the following steps.
• Define ${{\mathcal Q}_m = {\mathcal P}_m}$. Add the vertex ${(m,X)}$ as the root of the tree.
• For ${i=m-1}$ downto ${0}$
• Define ${{\mathcal Q}_i = \{\: A \cap B \::\: A \in {\mathcal Q}_{i+1} ,\: B \in {\mathcal P}_i \:\}}$.

• For every such set ${A \cap B \in {\mathcal Q}_i}$, add the vertex ${(i,A \cap B)}$ to ${T}$ as a child of ${(i+1,A)}$, connected by an edge of length ${2^{i+1}}$
• Since ${1 < d(x,y)}$ for all distinct ${x,y}$ and since ${{\mathcal P}_0}$ is ${1}$-bounded, the partition ${{\mathcal P}_0}$ must partition ${X}$ into singletons. Therefore we may define the map ${f : X \rightarrow Y}$ by ${f(x) = (0,\{x\})}$. 1.5. Analysis

Claim 2 Fix any distinct points ${x, y \in X}$. Let ${\ell \in \{0,\ldots,m-1\}}$ be the largest index with ${{\mathcal P}_\ell(x) \neq {\mathcal P}_\ell(y)}$. Then ${ d_T\big(f(x),f(y)\big) = 2^{\ell+3} - 4 }$.

Proof: The level ${\ell}$ is the highest level of the ${{\mathcal P}_i}$ partitions in which ${x}$ and ${y}$ are separated. A simple inductive argument shows that ${\ell}$ is also the highest level of the ${{\mathcal Q}_i}$ partitions in which ${x}$ and ${y}$ are separated. So the least common ancestor in ${T}$ of ${f(x)}$ and ${f(y)}$ is at level ${\ell+1}$. Let us call the least common ancestor ${v}$. Then

$\displaystyle d_T\big(f(x),v\big) ~=~ d_T\big(f(y),v\big) ~=~ \sum_{i=0}^{\ell} 2^{i+1} ~=~ 2^{\ell+2} - 2.$

Since ${d_T\big(f(x),f(y)\big) = d_T\big(f(x),v\big) + d_T\big(v,f(y)\big)}$, the proof is complete. $\Box$

Claim 3 (1) holds.

Proof: Let ${i}$ be such that ${2^i < d_X(x,y) \leq 2^{i+1}}$. Since ${{\mathcal P}_i}$ is ${2^i}$-bounded, ${x}$ and ${y}$ must lie in different parts of ${{\mathcal P}_i}$, i.e., ${{\mathcal P}(x) \neq {\mathcal P}(y)}$. By Claim 2,

$\displaystyle d_T\big(f(x),f(y)\big) ~\geq~ 2^{i+3} - 4 ~>~ 2^{i+1} ~\geq~ d_X(x,y),$

as required. $\Box$

Claim 4 (2) holds.

Proof: Fix any ${x,y \in X}$ and let ${r = d_X(x,y)}$. We have

$\displaystyle \begin{array}{rcl} {\mathrm E}[\: d_T\big(f(x),f(y)\big) \:] &=& \sum_{i=0}^m {\mathrm{Pr}}[\: i \mathrm{~is~the~largest~index~with~} {\mathcal P}_i(x) \neq {\mathcal P}_i(y) \:] \cdot (2^{i+3} - 4) \\ &<& \sum_{i=0}^m {\mathrm{Pr}}[ {\mathcal P}_i(x) \neq {\mathcal P}_i(y) ] \cdot 2^{i+3} \\ &\leq& \sum_{i=0}^m {\mathrm{Pr}}[ B(x,r) \not\subseteq {\mathcal P}_i(x) ] \cdot 2^{i+3} \\ &\leq& O(\log n) \cdot r, \end{array}$

where the last inequality, proven in the following claim, applies Theorem 2 of Lecture 22 and peforms a short calculation. $\Box$

Claim 5 For any ${x \in X}$ and ${r>1}$,

$\displaystyle \sum_{i=0}^m {\mathrm{Pr}}[ B(x,r) \not\subseteq {\mathcal P}_i(x) ] \cdot 2^{i+3} ~\leq~ O(\log n) \cdot r.$

Proof: Let ${k}$ be the integer with ${2^k < r \leq 2^{k+1}}$. Then

$\displaystyle \begin{array}{rcl} & & \sum_{i=0}^m {\mathrm{Pr}}[ B(x,r) \not\subseteq {\mathcal P}_i(x) ] \cdot 2^{i+3} \\ &\leq& \sum_{i=0}^{k+3} 2^{i+3} ~+~ \sum_{i=k+4}^m \frac{8r}{2^i} \cdot H\Big( |B(x,2^{i-2}-r)|, |B(x,2^{i-1}+r)| \Big) \cdot 2^{i+3} \\ &\leq& 128r ~+~ 64r \cdot \sum_{i=k+4}^m H\Big( |B(x,2^{i-3})|, |B(x,2^i)| \Big), \end{array}$

since ${r \leq 2^{i-3}}$ when ${i \geq k+4}$. The final sum is upper bounded as follows.

$\displaystyle \begin{array}{rcl} & & \sum_{i=k+4}^m H\big( |B(x,2^{i-3})|, |B(x,2^i)| \big) \\ &=& \sum_{i=k+4}^m \Big( H\big( |B(x,2^{i-3})|, |B(x,2^{i-2})| \big) + H\big( |B(x,2^{i-2})|, |B(x,2^{i-1})| \big) + H\big( |B(x,2^{i-1})|, |B(x,2^i)| \big) \Big) \\ &<& 3 \sum_{i=k+2}^m H\big( |B(x,2^{i-1})|, |B(x,2^i)| \big) \\ &=& 3 \cdot H\big(|B(x,2^{k+1})|,|B(x,2^m)|\big) \\ &=& O(\log n). \end{array}$

This proves the claimed inequality. $\Box$

## Lecture 23: Random partitions of metric spaces (continued)

We continue our theorem from last time on random partitions of metric spaces

1. Review of Previous Lecture

Define the partial Harmonic sum ${H(a,b) = \sum_{i=a+1}^b 1/i}$. Let ${B(x,r) = \{\: y \in X \::\: d(x,y) \leq r \}}$ be the ball of radius ${r}$ around ${x}$.

Theorem 1 Let ${(X,d)}$ be a metric with ${|X|=n}$. For every ${\Delta>0}$, there is ${\Delta}$-bounded random partition ${{\mathcal P}}$ of ${X}$ with

$\displaystyle {\mathrm{Pr}}[ B(x,r) \not\subseteq {\mathcal P}(x) ] ~\leq~ \frac{ 8r }{ \Delta } \:\cdot\: H\big(\: |B(x,\Delta/4-r)|,\: |B(x,\Delta/2+r)| \:\big) \quad \forall x \in X ,\: \forall r>0. \ \ \ \ \ (1)$

The algorithm to construct ${{\mathcal P}}$ is as follows.

• Pick ${\alpha \in (1/4,1/2]}$ uniformly at random.
• Pick a bijection (i.e., ordering) ${\pi : \{1,\ldots,n\} \rightarrow X}$ uniformly at random.
• For ${i=1,\ldots,n}$
• Set ${ P_i \,=\, B(\pi(i),\alpha \Delta) \,\setminus\, \cup_{j=1}^{i-1} \, P_j }$.

• Output the random partition ${{\mathcal P} = \{P_1,\ldots,P_n\}}$.We have already proven that this outputs a ${\Delta}$-bounded partition. So it remains to prove (1).

2. The Proof

Fix any point ${x \in X}$ and radius ${r>0}$. For brevity let ${B = B(x,r)}$. Let us order all points of ${X}$ as ${\{y_1,\ldots,y_n\}}$ where ${d(x,y_1) \leq \cdots \leq d(x,y_n)}$. The proof involves two important definitions.

• Sees: A point ${y}$ sees${B}$ if ${d(x,y) \leq \alpha \Delta+r}$.
• Cuts: A point ${y}$ cuts ${B}$ if ${\alpha \Delta - r \leq d(x,y) \leq \alpha \Delta+r}$.

Obviously “cuts” implies “sees”. To help visualize these definitions, the following claim interprets their meaning in Euclidean space. (In a finite metric, the ball ${B}$ is not a continuous object, so it doesn’t really have a “boundary”.)

Claim 2 Consider the metric ${(X,d)}$ where ${X = {\mathbb R}^n}$ and ${d}$ is the Euclidean metric. Then

• ${y}$ sees ${B}$ if and only if ${B=B(x,r)}$ intersects ${B(y,\alpha \Delta)}$.
• ${y}$ cuts ${B}$ if and only if ${B=B(x,r)}$ intersects the boundary of ${B(y,\alpha \Delta)}$.

The following claim is in the same spirit, but holds for any metric.

Claim 3 Let ${(X,d)}$ be an arbitrary metric. Then

• If ${y}$ does not see ${B}$ then ${B \cap B(y,\alpha \Delta) = \emptyset}$.
• If ${y}$ sees ${B}$ but does not cut ${B}$ then ${B \subseteq B(y, \alpha \Delta)}$.

To illustrate the definitions of “sees” and “cuts”, consider the following example. The blue ball around ${x}$ is ${B}$. The points ${y_1}$ and ${y_2}$ both see ${B}$; ${y_3}$ does not. The point ${y_2}$ cuts ${B}$; ${y_1}$ and ${y_3}$ do not. This example illustrates Claim 3: ${y_1}$ sees ${B}$ but does not cut ${B}$, and we have ${B \subseteq B(y, \alpha \Delta)}$.

The most important point for us to consider is the first point under the ordering ${\pi}$ that sees ${B}$. We call this point ${y_{\pi(k)}}$.

The first ${k-1}$ iterations of the algorithm did not assign any point in ${B}$ to any ${P_i}$. To see this, note that ${y_{\pi(1)},\ldots,y_{\pi(k-1)}}$ do not see ${B}$, by choice of ${k}$. So Claim 3 implies that ${B \cap B(y_{\pi(i)},\alpha \Delta) = \emptyset ~\forall i. Consequently

$\displaystyle B \cap P_i = \emptyset \quad\forall i

The point ${y_{\pi(k)}}$ sees ${B}$ by definition, but it may or may not cut ${B}$. If it does not cut ${B}$ then Claim 3 shows that ${B \subseteq B(y_{\pi(k)},\alpha \Delta)}$. Thus

$\displaystyle B \cap P_k ~=~ \Big(\underbrace{B \cap B(y_{\pi(k)},\alpha \Delta)}_{=\, B}\Big) ~\setminus~ \bigcup_{i=1}^{k-1} \underbrace{ B \!\cap\! P_i }_{=\, \emptyset} ~=~ B,$

i.e., ${B \subseteq P_k}$. Since ${{\mathcal P}(x) = P_k}$, we have shown that

$\displaystyle y \mathrm{~does~not~cut~} B \quad\implies\quad B \subseteq {\mathcal P}(x).$

Taking the contrapositive of this statement, we obtain

$\displaystyle {\mathrm{Pr}}[ B \not\subseteq {\mathcal P}(x) ] ~\leq~ {\mathrm{Pr}}[ y_{\pi(k)} \mathrm{~cuts~} B ] ~=~ \sum_{i=1}^n {\mathrm{Pr}}[ y_{\pi(k)}=y_i ~\wedge~ y_i \mathrm{~cuts~} B ].$

Let us now simplify that sum by eliminating terms that are equal to ${0}$.

Claim 4 If ${y \not \in B(x,\Delta/2+r)}$ then ${y}$ does not see ${B}$.

Claim 5 If ${y \in B(x,\Delta/4-r)}$ then ${y}$ sees ${B}$ but does not cut ${B}$.

So define ${a = |B(x,\Delta/4-r)|}$ and ${b=|B(x,\Delta/2+r)|}$. Then we have shown that

$\displaystyle {\mathrm{Pr}}[ B \not\subseteq {\mathcal P}(x) ] ~\leq~ \sum_{i=a+1}^b {\mathrm{Pr}}[ y_{\pi(k)}=y_i ~\wedge~ y_i \mathrm{~cuts~} B ].$

The remainder of the proof is quite interesting. The main point is that these two events are “nearly independent”, since ${\alpha}$ and ${\pi}$ are independent, “${y_i \mathrm{~cuts~} B}$” depends only on ${\alpha}$, and “${y_{\pi(k)}=y_i}$” depends primarily on ${\pi}$. Formally, we write

$\displaystyle {\mathrm{Pr}}[ B \not\subseteq {\mathcal P}(x) ] ~\leq~ \sum_{i=a+1}^b {\mathrm{Pr}}[ y_i \mathrm{~cuts~} B ] \cdot {\mathrm{Pr}}[\: y_{\pi(k)}=y_i \:|\: y_i \mathrm{~cuts~} B ]$

and separately upper bound these two probabilities.

The first probability is easy to bound:

$\displaystyle {\mathrm{Pr}}[ y_i \mathrm{~cuts~} B ] ~=~ {\mathrm{Pr}}[\: \alpha \Delta \in [d(x,y)-r,d(x,y)+r] \:] ~\leq~ \frac{2r}{\Delta/4},$

because ${2r}$ is the length of the interval ${[d(x,y)-r,d(x,y)+r]}$ and ${\Delta/4}$ is the length of the interval from which ${\alpha \Delta}$ is randomly chosen.

Next we bound the second probability. Recall that ${y_{\pi(k)}}$ is defined to be the first element in the ordering ${\pi}$ that sees ${B}$. Since ${y_i}$ cuts ${B}$, we know that ${d(x,y_i) \leq \alpha/2+r}$. Every ${y_j}$ coming earlier in the ordering has ${d(x,y_j) \leq d(x,y_i) \leq \alpha/2+r}$, so ${y_j}$ also sees ${B}$. This shows that there are at least ${i}$ elements that see ${B}$. So the probability that ${y_i}$ is the first element in the random ordering to see ${B}$ is at most ${1/i}$.

Combining these bounds on the two probabilities we get

$\displaystyle {\mathrm{Pr}}[ B \not\subseteq {\mathcal P}(x) ] ~\leq~ \sum_{i=a+1}^b \frac{8r}{\Delta} \cdot \frac{1}{i} ~=~ \frac{8r}{\Delta} \cdot H(a,b),$

as required.

3. Optimality of these partitions

Theorem 1 from the previous lecture shows that there is a universal constant ${L=O(1)}$ such that every metric has a ${\log(n)/10}$-bounded, ${L}$-Lipschitz random partition. We now show that this is optimal.

Theorem 6 There exist graphs ${G}$ whose shortest path metric ${(X,d)}$ has the property that any ${\log(n)/10}$-bounded, ${L}$-Lipschitz random partition must have ${L = \Omega(1)}$.

The graphs we need are expander graphs. In Lecture 20 we defined bipartite expanders. Today we need non-bipartite expanders. We say that ${G=(V,E)}$ is a non-bipartite expander if, for some constants ${c > 0}$ and ${d \geq 3}$:

• ${G}$ is ${d}$-regular, and
• ${|\delta(S)| \geq c|S|}$ for all ${|S| \leq |V|/2}$.

It is known that expanders exist for all ${n=|V|}$, ${d=3}$ and ${c \geq 1/1000}$. (The constant ${c}$ can of course be improved.)

Proof: Suppose ${(X,d)}$ has a ${\log(n)/10}$-bounded, ${L}$-Lipschitz random partition. Then there exists a particular partition ${P}$ that is ${\log(n)/10}$-bounded and cuts at most an ${L}$-fraction of the edges. Every part ${P_i}$ in the partition has diameter at most ${\log(n)/10}$. Since the graph is ${3}$-regular, the number of vertices in ${P_i}$ is at most ${3^{\log(n)/10} < n/2}$. So every part ${P_i}$ has size less than ${n/2}$. By the expansion condition, the number of edges cut is at least

$\displaystyle \frac{1}{2} \sum_{i} c \cdot |P_i| ~=~ cn/2 ~=~ \Omega(|E|).$

So ${L = \Omega(1)}$. $\Box$

4. Appendix: Proofs of Claims

Proof: (of Claim 3) Suppose ${y}$ does not see ${B}$. Then ${d(x,y) > \alpha \Delta + r}$. Every point ${z \in B}$ has ${d(x,z) \leq r}$, so ${d(y,z) \geq d(y,x) - d(x,z) > \alpha \Delta + r - r}$, implying that ${z \not \in B(y,\alpha \Delta)}$.

Suppose ${y}$ sees ${B}$ but does not cut ${B}$. Then ${d(x,y) < \alpha \Delta - r}$. Every point ${z \in B}$ has ${d(x,z) \leq r}$. So ${d(y,z) \leq d(y,x) + d(x,z) < \alpha \Delta - r + r}$, implying that ${z \in B(y,\alpha \Delta)}$. $\Box$

Proof: (of Claim 4) The hypothesis of the claim is that ${d(x,y) > \Delta/2+r}$, which is at least ${\alpha \Delta+r}$. So ${d(x,y) \geq \alpha \Delta+r}$, implying that ${y}$ does not see ${B}$. $\Box$

Proof: (of Claim 5) The hypothesis of the claim is that ${d(x,y) \leq \Delta/4-r}$, which is strictly less than ${\alpha \Delta-r}$. So ${d(x,y) < \alpha \Delta - r}$, which implies that ${y}$ sees ${B}$ but does not cut ${B}$. $\Box$

## Lecture 22: Random partitions of metric spaces

For many problems in computer science, there is a natural notion of “distance”. For example, perhaps the input data consists of real vectors for which it makes sense to measure their distance via the usual Euclidean distance. But in many cases, it makes sense to measure distances using a different “metric” that is not Euclidean distance.

There are many sophisticated algorithms for manipulating data involving these general metrics. One common paradigm for designing such algorithms is the familiar divide-and-conquer approach. Today we will discuss the fundamental algorithmic tool of partitioning metric spaces for the purpose of designing such divide-and-conquer algorithms.

This topic might seem a bit abstract and unmotivated. The next lecture will build on today’s ideas and present some algorithms whose usefulness is more easily appreciable.

1. Metrics

A metric is a set of points ${X}$ together with a “distance function” ${d : X \times X \rightarrow {\mathbb R}}$ such that

• “Non-negativity”: ${d(x,y) \geq 0}$ for all ${x, y \in X}$.
• ${d(x,x) = 0}$ for all ${x \in X}$ (but we allow ${d(x,y)=0}$ for ${x \neq y}$).
• “Symmetry”: ${d(x,y) = d(y,x)}$ for all ${x,y \in X}$.
• “The triangle inequality”: ${d(x,z) \leq d(x,y) + d(y,z)}$ for all ${x,y,z \in X}$.

In some scenarios this would be called a “semimetric” or “pseudometric”; a “metric” would additionally require that ${d(x,y)=0 \iff x=y}$.

Here are some standard examples of metrics with which you should be familiar.

• The Euclidean (or ${L_2}$) Metric: ${X={\mathbb R}^n}$ and ${d(x,y) = \sqrt{\sum_i (x_i-y_i)^2}}$.
• The Manhattan (or ${L_1}$) Metric: ${X={\mathbb R}^n}$ and ${d(x,y) = \sum_i | x_i-y_i |}$.
• Shortest Path Metrics: Let ${G=(V,E)}$ be a graph with non-negative lengths associated with the edges. The set of points is ${X=V}$. The distance function is defined by letting ${d(x,y)}$ be the length of the shortest path between ${x}$ and ${y}$.

Notice that in the first two examples the set ${X}$ is infinite, but in the last example ${X}$ is finite. When ${X}$ is finite we call the pair ${(X,d)}$ a finite metric space. We can also obtain finite metric spaces by restricting the first two examples to a finite subset of ${{\mathbb R}^n}$ and keeping the same distance functions. In computer science we are often only interested in finite metric spaces because the input data to the problem is finite.

2. Lipschitz Random Partitions

Let ${(X,d)}$ be a metric space. Let ${P = \{ P_1, P_2, \ldots \}}$ be a partition of ${X}$, i.e., the ${P_i}$‘s are pairwise disjoint and their union is ${X}$. The ${P_i}$‘s are called the parts. Let us define the following notation: for ${x \in X}$, let ${P(x)}$ be the unique part ${P_i}$ that contains ${x}$.

The diameter of a part ${P_i}$ is ${ \max \{\: d(x,y) \::\: x,y \in P_i \} }$. We say that the partition ${P}$ is ${\Delta}$-bounded if every ${P_i}$ has diameter at most ${\Delta}$.

As you will see soon, it will be very useful for us to choose a partition of ${(X,d)}$ randomly. We say that a random partition ${{\mathcal P}}$ is ${\Delta}$-bounded if every possible partition that can occur as a realization of the random ${{\mathcal P}}$ is a ${\Delta}$-bounded partition.

Our goal is that points that are “close” to each other should have good probability of ending up in the same part. Formally, let ${{\mathcal P}}$ be a randomly chosen partition. We say that ${{\mathcal P}}$ is ${L}$-Lipschitz if

$\displaystyle {\mathrm{Pr}}[ {\mathcal P}(x) \neq {\mathcal P}(y) ] ~\leq~ L \cdot d(x,y) \qquad \forall x,y \in X.$

So if ${x}$ and ${y}$ are close, they have a smaller probability of being assigned to different parts of the partition. Note that this definition is not “scale-invariant”, in the sense that we need to double ${L}$ if we halve all the distances.

Combining the ${\Delta}$-bounded and ${L}$-Lipschitz concepts is very interesting. Let us illustrate this with an example.

2.1. Example

Consider the “line metric”, where ${X=\{1,\ldots,n\}}$, ${n}$ is odd, and ${d(x,y) = |x-y|}$. The diameter of this metric is clearly ${n-1}$. Consider the partition ${P}$ where ${P_1 = \{1,\ldots,\lfloor n/2 \rfloor \}}$ and ${P_2 = \{\lceil n/2 \rceil,\ldots,n\}}$. This partition is ${\Delta}$-bounded for ${\Delta = \lfloor n/2 \rfloor-1}$. Does it capture our goal that “close points should end up in the same part”?

In some sense, yes. The only two consecutive points that ended up in different parts are the points ${\lfloor n/2 \rfloor}$ and ${\lceil n/2 \rceil}$, so most pairs of consecutive points did end up in the same part. But if we modify our metric slightly, this is no longer true. Consider making ${n}$ copies of both of the points ${\lfloor n/2 \rfloor}$ and ${\lfloor n/2 \rfloor}$, keeping the copies at the same location in the metric. (This is valid because we’re really looking at semimetrics: we allow ${d(x,y)=0}$ for different points ${x}$ and ${y}$.) After this change, a constant fraction of the consecutive points now ended up in different parts!

So, even in this simple example of a line metric, if we allow multiple copies of (equivalently, non-negative weights on) each point, it becomes much less clear how to choose a partition for which most close points end up in the same part.

Choosing a random partition makes life much easier. Pick an index ${k \in \{n/3,\ldots,2n/3\}}$ uniformly at random, then set ${P_1 = \{1,\ldots,k\}}$ and ${P_2 = \{k+1,\ldots,n\}}$. Let ${{\mathcal P}}$ be the resulting random partition. These partitions always have diameter at most ${2n/3}$ (even if we made multiple copies of points). So ${{\mathcal P}}$ is ${\Delta}$-bounded with ${\Delta=2n/3}$.

Now consider any two consecutive points ${i}$ and ${i+1}$. They end up in different parts of the partition only if ${k=i}$, which happens with probability at most ${3/n}$. Thus ${{\mathrm{Pr}}[ {\mathcal P}(i) \neq {\mathcal P}(i+1) ] \leq 3/n}$. More generally

$\displaystyle {\mathrm{Pr}}[ {\mathcal P}(x) \neq {\mathcal P}(y) ] ~\leq~ \frac{3}{n} \cdot d(x,y).$

So ${{\mathcal P}}$ is a ${L}$-Lipschitz partition with ${L=3/n}$. The key point is: this holds regardless of how many copies of the points we make. So this same random partition ${{\mathcal P}}$ works under any scheme of copying (i.e., weighting) the points of this metric.

2.2. The General Theorem

The previous example achieves our “gold standard” of a random partition: ${L = O(1/\Delta)}$. We can think of this as meaning that the probability of adjacent points ending up in different parts is roughly the inverse of the diameter of those parts. Our main theorem is that, by increasing ${L}$ by a logarithmic factor, we can obtain a similar partition of any finite metric.

Theorem 1 Let ${(X,d)}$ be a metric with ${|X|=n}$. For every ${\Delta>0}$, there is a ${\Delta}$-bounded, ${L}$-Lipschitz random partition ${{\mathcal P}}$ of ${X}$ with ${L = O(\log(n)/\Delta)}$.

This theorem is optimal: for any ${n}$ there are metrics on ${n}$ points for which every ${\Delta}$-bounded, ${L}$-Lipschitz partition has ${L = \Omega(\log(n)/\Delta)}$.

Theorem 1 is a corollary of the following more general theorem. The statement is a bit messy, but the mess will be important in proving the main result of the next lecture. Define the partial Harmonic sum ${H(a,b) = \sum_{i=a+1}^b 1/i}$. Let ${B(x,r) = \{\: y \in X \::\: d(x,y) \leq r \}}$ be the ball of radius ${r}$ around ${x}$.

Theorem 2 Let ${(X,d)}$ be a metric with ${|X|=n}$. For every ${\Delta>0}$, there is ${\Delta}$-bounded random partition ${{\mathcal P}}$ of ${X}$ with

$\displaystyle {\mathrm{Pr}}[ B(x,r) \not\subseteq {\mathcal P}(x) ] ~\leq~ \frac{8r}{\Delta} \:\cdot\: H\big(\: |B(x,\Delta/4-r)|,\: |B(x,\Delta/2+r)| \:\big) \qquad \forall x \in X ,\: \forall r>0. \ \ \ \ \ (1)$

Proof: (of Theorem~1) Let ${{\mathcal P}}$ be the random partition from Theorem 2. Consider any ${x,y \in X}$ and let ${r = d(x,y)}$. Note that if ${{\mathcal P}(x) \neq {\mathcal P}(y)}$ then ${B(x,r) \not\subseteq {\mathcal P}(x)}$. Thus

$\displaystyle \begin{array}{rcl} {\mathrm{Pr}}[ {\mathcal P}(x) \neq {\mathcal P}(y) ] &\leq& (8r/\Delta) \cdot H\big(\: |B(x,\Delta/4-r)|,\: |B(x,\Delta/2+r)| \:\big) \\ &\leq& (8r/\Delta) \cdot H(0, n ) \\ &=& O(\log(n)/\Delta) \cdot d(x,y), \end{array}$

since ${H(0,n) = \sum_{i=1}^{n+1} 1/i = O(\log n)}$. $\Box$

2.3. Proof of Theorem 2

We start off by presenting the algorithm that generates the random partition ${{\mathcal P}}$. As is often the case with the algorithms we have seen in this class, the algorithm is very short, yet extremely clever and subtle.

• Pick ${\alpha \in (1/4,1/2]}$ uniformly at random.
• Pick a bijection (i.e., ordering) ${\pi : \{1,\ldots,n\} \rightarrow X}$ uniformly at random.
• For ${i=1,\ldots,n}$
• Set ${ P_i \,=\, B(\pi(i),\alpha \Delta) \,\setminus\, \cup_{j=1}^{i-1} \, P_j }$.

• Output the random partition ${{\mathcal P} = \{P_1,\ldots,P_n\}}$.Remark. Note that ${\alpha}$ is not an arbitrary constant; it is random. Its role is analogous to the random choice of ${k}$ in our previous example.

To prove Theorem 2, we first need to check that the algorithm indeed outputs a partition. By definition, each ${P_i}$ is disjoint from all earlier ${P_j}$ with ${j < i}$, so the ${P_i}$‘s are pairwise disjoint. Next, each point ${\pi(i)}$ is either contained in ${P_i}$ or some earlier ${P_j}$, so the union of the ${P_i}$‘s is ${X}$.

Next we should check that this partition is ${\Delta}$-bounded. That is also easy: since ${P_i \subseteq B(\pi(i),\alpha \Delta)}$, the diameter of ${P_i}$ is at most ${\alpha \Delta < \Delta}$.

The difficult step is proving (1), which we do next time. Here is some vague intuition as to why it might be true. Condition (1) asks us to show that, with good probability, the ball ${B(x,r)}$ is not chopped into pieces by the partition. But the parts of the partition are themselves balls of radius at least ${\Delta/4}$ (minus all previous parts of the partition). So as long as ${r \ll \Delta/4}$, we might be optimistic that the ball ${B(x,r)}$ does not get chopped up.

2.4. Example

Let ${X}$ be the following eight points in the plane, with the usual Euclidean distance. Let ${\pi}$ be the identity ordering: ${\pi(i)=i \:~\forall i}$. The algorithm generates the following partition.

## Lecture 21: Property Testing

1. Overview of Property Testing

Property testing is a research area in theoretical computer science that has seen a lot of activity over the past 15 years or so. A one-sentence description of this area’s goal is: design algorithms that, given a very large object, examine the object in very few places and decide whether it either has a certain property, or is “far” from having that property.

As a simple example, let the object be the set of all people in Canada, and let the property be “does a majority like Stephen Harper?”. The first algorithm that comes to mind for this problem is: sample a few people at random, ask them if they like Stephen Harper, then return the majority vote of the sample.

Does this algorithm have a good probability of deciding whether a majority of people likes Stephen Harper? The answer is no. Suppose there are 999,999 people in Canada. The sampling algorithm cannot reliably distinguish between the case that exactly 500,000 people like Stephen Harper (a majority) and exactly 499,999 people like Stephen Harper (not a majority).

But our algorithm is in fact a good property testing algorithm for this problem! The reason is that we have not yet discussed the important word “far”. That word allows us to ignore these scenarios that are right “on the boundary” of having the desired property.

In our example, we could formalize the word “far” as meaning “more than 5% of the population needs to change their vote in order for a majority to like Stephen Harper”. Equivalently, our algorithm only needs to distinguish two scenarios:

• At least 500,000 people like Stephen Harper, or
• Less than 450,000 people like Stephen Harper.

Our sampling algorithm has good probability of distinguishing these scenarios. The number of samples needed depends only on the fraction 5% and the desired probability of success, and does not depend on the size of the population. (This is easy to prove using Chernoff bounds.)

This example probably doesn’t excite you very much because statisticians have studied these sorts of polling problems for centuries, and we know all about confidence intervals, etc. The field of property testing does not focus on these simple polling problems, but instead tends to look at problems of a more algebraic or combinatorial flavour. Some central problems in property testing are:

• Given a function ${f : \{0,1\}^n \rightarrow \{0,1\}}$, decide if it is either a linear function or far from a linear function. There is a property testing algorithm for this problem that uses only a constant number of queries. This algorithm is a key ingredient in proving the infamous PCP theorem.
• Given a graph ${G=(V,E)}$, decide if ${G}$ has no triangles, or is far from having no triangles. There is a property testing algorithm for this problem that uses only a constant number of queries, however the constant is a horrifying tower function (a.k.a., a tetration).

2. Testing Sortedness

Let ${A_1,\ldots,A_n}$ be a list of ${n}$ distinct numbers. The property we would like to test is “is the list sorted?”. Our definition of “far” from sorted will be “we need to remove at least ${\epsilon n}$ numbers for the remaining list to be sorted”. In other words, we wish to distinguish the following two scenarios:

• ${A_1 < A_2 < \cdots < A_n}$.
• The longest sorted subsequence has length less than ${(1-\epsilon)n}$.

We will give a randomized algorithm which can distinguish these cases with constant probability by examining only ${O(\log(n)/\epsilon)}$ entries of the list.

A natural algorithm to try is: pick ${i}$ at random and test whether ${A_i < A_{i+1}}$. Consider the input

$\displaystyle 1, 2, \ldots, n/2-1, n/2, 1, 2, \ldots, n/2-1, n/2.$

We need to remove at least half of the list to make it sorted. But that algorithm will only find an unsorted pair if it picks ${i=n/2}$, which happens with low probability.

So instead we consider the following more sophisticated algorithm.

• For ${t=1,\ldots,1/\epsilon}$
• Pick ${i \in \{1,\ldots,n\}}$ uniformly at random.
• Search for the value ${A_i}$ in the list using binary search.
• If we discover some unsorted elements during the binary search, return “No”.
• If the binary search does not end up at position ${i}$, return “No”.

• Return “Yes”.

Theorem 1 This algorithm has constant probability of correctly distinguishing between lists that are sorted and those that are far from sorted.

Proof: If the given list is sorted, obviously every binary search will work correctly and the algorithm will return “Yes”.

So let us suppose that the list is far from sorted. Say that an index ${i}$ is “good” if a binary search for ${y_i}$ correctly terminates at position ${i}$ (without discovering any unsorted elements). We claim that these good indices form a sorted subsequence. To see this, consider any two good indices ${i}$ and ${j}$ with ${i. Let ${k}$ be the last common index of their binary search paths. Then we must have ${y_i \leq y_k \leq y_j}$, which implies that ${y_i < y_j}$ by distinctness.

Since the list is far from sorted, there can be at most ${(1-\epsilon)n}$ good indices. So the probability of picking a good index in every iteration is ${(1-\epsilon)^{1/\epsilon} \leq e^{-\epsilon (1/\epsilon)} = 1/e}$. $\Box$

3. Estimating Maximal Matching Size

Our next example deviates from the property testing model described above. Instead of testing whether an object has a property or not, we will estimate some real-valued statistic of that object.

Our example is estimating the size of a maximal matching in a bounded-degree graph. There is an important distinction here. A maximum matching is a matching in the graph such that no other matching contains more edges. A maximal matching is a matching for which it is impossible to get a bigger matching by adding a single edge. Whereas all maximum matchings have the same size, it is not necessarily true that all maximal matchings have the same size.

Here is a greedy algorithm for generating a maximal matching in a graph ${G=(V,E)}$ with ${n = |V|}$, ${m=|E|}$ and maximum degree ${d}$.

• Let ${M = \emptyset}$.
• Let ${\pi : \{1,\ldots,m\} \rightarrow E}$ be an arbitrary bijection.
• For ${i=1,\ldots,m}$
• If both endpoints of edge ${\pi(i)}$ are uncovered, add ${\pi(i)}$ to ${M}$.

Fact 2 The resulting matching ${M}$ is maximal. Furthermore ${|M| \geq n/2d}$.

Theorem 3 Let ${G}$ be a graph with maximum degree ${d}$. There is an algorithm to estimate the size of the maximal matching to within a multiplicative factor of ${1+\epsilon}$ in time ${2^{O(d)} / \epsilon^2}$.

Estimating given an oracle. Suppose we have a oracle which, assuming some fixed maximal matching ${M}$, can answer queries about whether a given edge ${e}$ is covered by that matching ${M}$. We will use this oracle to estimate ${|M|}$.

The algorithm just involves simple sampling.

• Fix a maximal matching ${M}$.
• Set ${k=9 d^2/\epsilon^2}$.
• For ${i=1,\ldots,k}$
• Pick a random edge ${e}$.
• Let ${X_i=1}$ if ${e}$ is in ${M}$, otherwise ${X_i=0}$.

Let ${X = \sum_{i=1}^k X_i}$ and ${\mu = {\mathrm E}[X]}$. The actual number of edges in ${M}$ is

$\displaystyle |M| ~=~ {\mathrm E}[X_i] m ~=~ {\textstyle \frac{m}{k} \mu}.$

Our estimate for the size of ${M}$ is ${\frac{m}{k} X}$. Since ${m \leq nd/2}$, Fact 2 shows that ${{\mathrm E}[X_i] \geq 1/d^2}$, which implies that ${\mu \geq k/d^2}$. Therefore

$\displaystyle \begin{array}{rcl} {\textstyle {\mathrm{Pr}}\Big[~ (1-\epsilon) \frac{m}{k} \mu \leq \frac{m}{k} X \leq (1+\epsilon) \frac{m}{k} \mu ~\Big] } &=& {\mathrm{Pr}}\Big[~ (1-\epsilon) \mu \leq X \leq (1+\epsilon) \mu ~\Big] \\ &\geq& 1-2\exp(-\epsilon^2 \mu / 3) ~\geq~ 0.9, \end{array}$

by a Chernoff bound.

Implementing the oracle. We will not actually implement the oracle for an arbitrary ${M}$; we will require ${M}$ to be generated by the greedy algorithm above with a random ordering ${\pi}$.

Associate independent random numbers ${r_e \in [0,1]}$ with each edge ${e \in E}$. They are all distinct with probability ${1}$, and so they induce a uniformly random ordering on the edges. We let ${M}$ be the maximal matching created by the greedy algorithm above, using this ordering.

Our algorithm for implementing the oracle is as follows.

• Given an edge ${e \in E}$
• For every edge ${e' \in E}$ sharing an endpoint with ${e}$
• If ${r_{e'} < r_e}$, recursively test if ${e'}$ is in the matching. If so, return “No”.

• Return “Yes”.Clearly this algorithm is always correct: its logic for deciding whether ${e \in M}$ is equivalent to the greedy algorithm. The only question is: what is the running time of the oracle?

Consider an edge ${f}$ reachable by a path of length ${k}$ from the initial edge ${e}$. The recursion only considers the edge ${f}$ if the edges on this path have the ${r_e}$ values in decreasing order. The probability of that event is ${k!}$. The number of edges reachable by a path of length ${k}$ is less than ${(2d)^k}$. So the expected number of nodes explored is less than ${\sum_{k=0}^\infty (2d)^k / k! = e^{2d}}$. So the expected time required by an oracle call is ${O(d) \cdot e^{2d} = 2^{O(d)}}$.

Since the estimation algorithm calls the oracle ${9d/\epsilon^2}$ times, the expected total running time is ${2^{O(d)} / \epsilon^2}$.

## Lecture 20: Expanders, Superconcentrators

Today we will make a brief excursion into the fascinating world of expander graphs. It is hard to overstate the importance of these objects to theoretical computer science. They play an important role in some astonishing algorithms, the PCP theorem, derandomization, circuit complexity, etc. Today we will show how expanders are used in designing special routing networks called superconcentrators.

1. Expanders

The easiest one-sentence description of an expander graph is: a graph that looks in many important ways like a random graph. So, although expander graphs are often viewed as a difficult field of study, in many ways they are very simple. In particular, here’s an easy one-sentence construction of an expander graph: a random graph is very likely to be a good expander graph. Proving that sentence is not difficult, and we will do so today.

Beyond the basics, the study of expanders becomes quite non-trivial. For example, in many scenarios we need a deterministic algorithm to construct an expander; such constructions often involve very sophisticated mathematics. Alternatively, we might like to give very precise guarantees on the quality of randomly generated expanders. That also gets very difficult.

But today we will be content with a simple, crude, randomized construction. This suffices for many amazing applications.

The Definition. Let ${G=(L,R,E)}$ be a bipartite graph, where ${L}$ is the set of left-vertices, ${R}$ is the set of right-vertices, and ${E}$ is the set of edges (each of which have one endpoint in ${L}$ and one in ${R}$). For any subset ${S \subseteq L}$, let

$\displaystyle \Gamma(S) ~=~ \{~ v \in R \::\: \{u,v\} \in E ~\mathrm{for~some}~ u \in S ~\}.$

There are several different and interrelated definitions of expanders. We will say that ${G}$ is ${(n,m,d)}$-expander if ${|L|=n}$, ${|R|=m}$, every vertex in ${L}$ has degree ${d}$, and

$\displaystyle |\Gamma(S)| ~\geq~ |S| \qquad\forall S \subseteq L ~\mathrm{s.t.}~ |S| \leq \frac{n}{2}. \ \ \ \ \ (1)$

Theorem 1 For some sufficiently large constant ${d}$, sufficiently large ${n}$ and ${m \geq \frac{7}{8} n}$, there exists an ${(n,m,d)}$-expander.

Remark. The theorem is trivial when ${m \geq n}$ but it is interesting even in the case ${m=0.999n}$.

Proof: We generate the graph as follows. First we generate ${n}$ vertices to form the set ${L}$, and ${m}$ vertices to form the set ${R}$. Next, for every vertex ${u \in L}$, we randomly choose exactly ${d}$ vertices from ${R}$ (with repetition), and we add edges from ${u}$ to all of those vertices. (So technically the resulting graph will be a multigraph, but this is not a problem.)

Let ${S \subseteq L}$ be an arbitrary subset with ${s := |S| \leq \frac{n}{2}}$. We need to show that ${|\Gamma(S)| \geq |S|}$. We do this in a very naive way. We simply consider every set ${T \subseteq R}$ with ${|T| < |S|}$ and show that it is unlikely that ${\Gamma(S) \subseteq T}$. That probability is easy to analyze:

$\displaystyle {\mathrm{Pr}}[ \Gamma(S) \subseteq T ] ~=~ \Big( \frac{|T|}{m} \Big)^{|S|d}.$

Observe that we don’t need to consider all sets ${T}$ with ${|T| < |S|}$, it suffices to consider those with ${|T|=|S|-1}$. (In fact, we will simplify our notation by considering the negligibly harder case of ${|T|=|S|}$.) As long as ${\Gamma(S)}$ is not contained in any such ${T}$, then we have ${|\Gamma(S)| \geq |S|}$ as required.

The remainder of the proof simply does a union bound over all such ${S}$ and ${T}$.

$\displaystyle \begin{array}{rcl} {\mathrm{Pr}}[ G \mathrm{~is~not~a~} (n,m,d)-\mathrm{expander} ] &\leq& \sum_{\substack{S \subseteq L \\ |S| \leq n/2}} ~ \sum_{\substack{T \subseteq R \\ |T|=|S|}} {\mathrm{Pr}}[ \Gamma(S) \subseteq T ] \\ &\leq& \sum_{s \leq n/2} \binom{n}{s} \binom{m}{s} (s/m)^{sd} \\ &\leq& \sum_{s \leq n/2} \Big( \frac{ne}{s} \Big)^s \Big(\frac{me}{s}\Big)^s \Big(\frac{s}{m}\Big)^{sd} \\ &\leq& \sum_{s \leq n/2} \Bigg( \frac{ne}{m} \frac{me}{m} \frac{s^{d-2}}{m^{d-2}}\Bigg)^{s} \\ &\leq& \sum_{s \leq n/2} \Bigg( \frac{ne^2}{(7/8)n} \frac{(n/2)^{d-2}}{\big((7/8)n\big)^{d-2}}\Bigg)^{s} \\ &\leq& \sum_{s \leq n/2} \Bigg( \frac{8e^2}{7} (4/7)^{d-2} \Bigg)^{s} \\ &\leq& \sum_{s=1}^{\infty} \Bigg( 8.5 \cdot 0.58^{d-2} \Bigg)^{s}. \end{array}$

For ${d \geq 9}$, the base of the exponent is less than ${0.19}$, and so the infinite series adds up to less than ${\frac{0.19}{1-0.19} < 0.25}$.

So the probability that the random graph fails to satisfy the second inequality in (1) is less than ${1/4}$. $\Box$

2. Superconcentrators

Let ${G=(V,E)}$ be a graph. There are two disjoint subsets ${I \subseteq V}$ and ${O \subseteq V}$ with ${|I|=|O|=n}$. The vertices in ${I}$ are called the inputs and the vertices in ${O}$ are called the outputs. The graph ${G}$ is called a superconcentrator if, for every ${k \leq n}$ and every ${S \subseteq I}$ and ${T \subseteq O}$ with ${|S|=|T|=k}$, there are ${k}$ vertex-disjoint paths from ${S}$ to ${T}$.

It is important to note that we do not insist that the path starting at some ${s \in S}$ has to arrive at a particular ${t \in T}$; it is acceptable for it to arrive at an arbitrary ${t \in T}$. In other words, the paths from ${S}$ to ${T}$ induce some bijective mapping ${\pi : S \rightarrow T}$ but we do not care what that mapping is.

Why might we be interested in such graphs? Suppose we want to design a telephone network with ${n}$ input lines and ${n}$ output lines such that any set of ${k}$ input lines can be connected to any set of ${k}$ output lines. To minimize the cost of this network, we would like the number of wires and routing points inside the network to be as small as possible. It is easy to design a superconcentrator with ${O(n^2)}$ edges: simply add an edge between every vertex in ${I}$ and every vertex in ${O}$. Can one do better than that? It was conjectured by Valiant that ${\Omega(n \log n)}$ edges are necessary. His conjecture came from an attempt to prove a lower bound in circuit complexity. But a short while later he realized that his own conjecture was false:

Theorem 2 For any ${n}$, there are superconcentrators with ${n}$ inputs, ${n}$ outputs, and ${O(n)}$ edges.

The proof uses expanders, recursion and Hall’s theorem in a very clever way. First we observe the following interesting property of expanders: they have matchings covering any small set of vertices.

Claim 3 Let ${H=(L,R,E)}$ be a ${(n,7n/8,9)}$-expander. For every set ${S \subseteq L}$ of size ${|S| \leq n/2}$, there is a matching in ${H}$ covering ${S}$. In other words, there exists an injective map ${\pi_S : S \rightarrow R}$ such that ${\{s,\pi_S(s)\} \in E}$ for every ${s \in S}$.)

Proof: Hall’s marriage theorem says that ${H}$ has a matching covering ${S}$ if and only if every ${S' \subseteq S}$ has ${|\Gamma(S')| \geq |S'|}$. That condition holds by (1), since ${H}$ is an expander. $\Box$

The Main Idea. The construction proceeds as follows.

• First we create the input vertices ${I}$ and output vertices ${O}$, where ${|I|=|O|=n}$.
• Next we create some internal vertices ${I'}$ and ${O'}$ with ${|I'|=|O'|=7n/8}$.
• We directly connect ${I}$ to ${O}$ using ${n}$ edges that form a matching. (Say, the ${i}$th vertex in ${I}$ is connected to the ${i}$th vertex in ${O}$, for every ${i}$.)
• Let ${H}$ be a ${(n,7n/8,9)}$-expander. We make one copy of ${H}$ with left-vertices ${I}$ and right-vertices ${I'}$, and an additional copy with left-vertices ${O}$ and right-vertices ${O'}$.
• Finally, we create a smaller superconcentrator with inputs ${I'}$ and outputs ${O'}$.

The Size. First let us analyze the total number of edges in this construction. We use a simple recurrence relation: let ${f(n)}$ be the number of edges used in a superconcentrator on ${n}$ inputs and outputs. The total number of edges used in the construction is:

• ${n}$ edges for the matching from ${I}$ to ${O}$.
• ${9n}$ edges for each of the two expanders.
• ${f(7n/8)}$ edges for the smaller superconcentrator.

Thus ${f(n) = f(7n/8) + 19n}$. For the base case, we can take ${f(n) = n^2}$ whenever ${n}$ is a sufficiently small constant. This recurrence has the solution ${f(n) = O(n)}$.

The Superconcentrator Property. Consider any ${S \subseteq I}$ and ${T \subseteq O}$ with ${|S|=|T|=k}$. We must find vertex-disjoint paths from ${S}$ to ${T}$.

Case 1: ${k \leq n/2}$. By Claim 3, the first expander contains a matching covering ${S}$. Let ${S' \subseteq I'}$ be the other endpoints of that matching. Similarly, the second expander contains a matching covering ${T}$. Let ${T' \subseteq O'}$ be the other endpoints of that matching. Note that ${|S'|=|T'|=k}$. By induction, the smaller superconcentrator contains ${k}$ vertex-disjoint paths between ${S'}$ and ${T'}$. Combining those paths with the edges of the matchings gives the desired paths between ${S}$ and ${T}$.

Case 2: ${k > n/2}$. Claim 3 only provides a matching covering sets of size at most ${n/2}$, so the previous argument cannot handle the case ${k > n/2}$. The solution is to use the edges that directly connect ${I}$ and ${O}$.

By the pigeonhole principle, there are at least ${k-n/2}$ vertices in ${S}$ that are directly connected by the matching to vertices in ${T}$. The remaining vertices (of which there are at most ${n/2}$) are handled by the argument of Case 1.

Posted in Uncategorized | 1 Comment

## Lecture 19: The Lovasz Local Lemma

Today’s lecture introduces a completely new topic: the LovÃ¡sz Local Lemma (LLL). This is an important method for analyzing events that are not independent, but have some restricted sort of dependencies. It is not as widely applicable as many of the other the techniques we have seen so far, but from time to time one encounters scenarios in which the LLL is the only technique that works.

1. The LovÃ¡sz Local Lemma

Suppose ${{\mathcal E}_1,\ldots,{\mathcal E}_n}$ are a collection of “bad” events. We would like to show that there is positive probability that none of them occur. If the events are mutually independent then this is simple:

$\displaystyle {\mathrm{Pr}}[\: \wedge_{i=1}^n \overline{{\mathcal E}_i} \:] ~=~ \prod_{i=1}^n {\mathrm{Pr}}[ \overline{{\mathcal E}_i} ] ~>~ 0$

(assuming that ${{\mathrm{Pr}}[{\mathcal E}_i]<1}$ for every ${i}$). The LLL is a method for proving that ${{\mathrm{Pr}}[ \wedge_{i=1}^n \overline{{\mathcal E}_i} ] > 0}$ when the ${{\mathcal E}_i}$‘s are not mutually independent, but they can have some sort of limited dependencies.

Formally, we say that an event ${{\mathcal E}_j}$ does not depend on the events ${\{ {\mathcal E}_i : i \in I \}}$ if

$\displaystyle {\mathrm{Pr}}[ {\mathcal E}_j ] ~=~ {\mathrm{Pr}}[\: {\mathcal E}_j \:|\: \wedge_{i \in I'} {\mathcal E}_i \:] \qquad \forall I' \subseteq I.$

So, regardless of whether some of the events in ${I}$ occur, the probability of ${{\mathcal E}_j}$ occurring is unaffected.

Theorem 1 (The “Symmetric” LLL) Let ${{\mathcal E}_1,\ldots,{\mathcal E}_n}$ be events with ${{\mathrm{Pr}}[{\mathcal E}_i] \leq p}$ for all ${i}$. Suppose that every event ${{\mathcal E}_i}$ does not depend on at least ${n-d}$ other events. If ${pd \leq 1/e}$ then ${{\mathrm{Pr}}[ \wedge_{i=1}^n \overline{{\mathcal E}_i} ] > 0}$.

We will not prove this theorem. Instead, we will illustrate the LLL by considering a concrete application of it in showing satisfiability of ${k}$-CNF Boolean formulas. Recall that a ${k}$-CNF formula is a Boolean formula, involving any finite number of variables, where the formula is a conjunction (“and”) of any number of clauses, each of which is a disjunction (“or”) of exactly ${k}$ distinct literals (a variable or its negation).

For example, here is a ${3}$-CNF formula with three variables and eight clauses.

$\displaystyle \begin{array}{rcl} \phi(a,b,c) &=& (a \vee b \vee c) \wedge (a \vee b \vee \bar{c}) \wedge (a \vee \bar{b} \vee c) \wedge (a \vee \bar{b} \vee \bar{c}) \wedge \\ && (\bar{a} \vee b \vee c) \wedge (\bar{a} \vee b \vee \bar{c}) \wedge (\bar{a} \vee \bar{b} \vee c) \wedge (\bar{a} \vee \bar{b} \vee \bar{c}) \end{array}$

This formula is obviously unsatisfiable. One can easily generalize this construction to get an unsatisfiable ${k}$-CNF formula with ${k}$ variables and ${2^k}$ clauses. Our next theorem says: the reason this formula is unsatisfiable is that we allowed each variable to appear in too many clauses.

Theorem 2 There is a universal constant ${d \geq 1}$ such that the following is true. Let ${\phi}$ be a ${k}$-CNF formula where each variable appears in at most ${T := 2^{k-d} / k}$ clauses. Then ${\phi}$ is satisfiable. Moreover, there is a randomized, polynomial time algorithm to find a satisfying assignment.

The theorem is stronger when ${d}$ is small. The proof that we will present can be optimized to get ${d = 3}$. By applying the full-blown LLL one can achieve ${d = \log_2(e) \approx 1.44}$.

Let ${n}$ be the number of variables and ${m}$ be the number of clauses in ${\phi}$. Each clause contains ${k}$ variables, each of which can appear in only ${T-1}$ other clauses. So each clause shares a variable with less than ${R := kT = 2^{k-d}}$ other clauses.

The algorithm proving the theorem is perhaps the most natural algorithm that one could imagine. However it took more than 30 years from the introduction of the LLL for this algorithm to be provably analyzed.

Solve(${\phi}$)

• Set each variable in ${\phi}$ to either ${0}$ or ${1}$ randomly and independently.
• While there is an unsatisfied clause ${C}$
• ${\qquad~}$ Fix(${C}$)

Fix(${C}$)

• Set each variable in ${C}$ to either ${0}$ or ${1}$ randomly and independently.
• While there is an unsatisfied clause ${D}$ sharing some variable with ${C}$ (possibly ${D=C}$)
• ${\qquad~}$ Fix(${D}$)

Claim 3 Suppose every call to Fix terminates. Then Solve calls Fix at most ${m}$ times, and terminates with a satisfying assignment.

Proof: For any call to Fix, we claim that every clause that was satisfied before the call is still satisfied after the call completes. This follows by induction, starting at the deepest level of recursion. So, for every call from Solve to Fix(${C}$) the number of satisfied clauses increases by one, since ${C}$ must now be satisfied when Fix(${C}$) terminates. $\Box$

So it remains to show that, with high probability, every call to Fix terminates.

Theorem 4 Let ${s = m( \log m + c) + \log \frac{1}{\delta}}$ where ${c}$ is a sufficiently large constant. Then the probability that the algorithm makes more than ${s}$ calls to ${\textsc{Fix}}$ (including both the top-level and recursive calls) is at most ${\delta}$.

The proof proceeds by considering the interactions between two agents: the “CPU” and the “Debugger”. The CPU runs the algorithm, periodically sending messages to the Debugger (we describe these messages in more detail below). However, if Fix gets called more than ${s}$ times the CPU interrupts the execution and halts the algorithm.

The CPU needs ${n}$ bits of randomness to generate the initial assignment in Solve, and needs ${k}$ bits to regenerate variables in each call to Fix. Since the CPU will not execute Fix more than ${s}$ times, it might as well generate all its random bits at the very start of the algorithm. So the first step performed by the CPU is to generate a random bitstring ${x}$ of length ${n+sk}$ to provide all the randomness used in executing the algorithm.

The messages sent from the CPU to the Debugger are as follows.

• Every time the CPU runs Fix(${C}$), he sends a message containing the identity of the clause ${C}$, and an extra bit indicating whether this is a top-level Fix (i.e., a call from Solve) or a recursive Fix.
• Every time Fix(${C}$) finishes the CPU sends a message stating “recursive call finished”.
• If Fix gets called ${s}$ times, the CPU sends a message to the Debugger containing the current ${\{0,1\}}$ assignment of all ${n}$ variables.

Because the Debugger is notified when every call to Fix starts or finishes, he always knows which clause is currently being processed by Fix. A crucial detail is to figure out how many bits of communication are required to send these messages.

• For a top-level Fix, ${\log m + O(1)}$ bits suffice because there are only ${m}$ clauses in ${\phi}$.
• For a recursive Fix, ${\log R + O(1)}$ bits suffice because the Debugger already knows what clause is currently being fixed, and that clause shares variables with only ${R}$ other clauses, so only ${R}$ possible clauses could be passed to the next call to Fix.
• When each call to Fix(${C}$) finishes, the corresponding message takes ${O(1)}$ bits.
• When Fix gets called ${s}$ times, the corresponding message takes ${n+O(1)}$ bits.

The main point of the proof is to show that, if Fix gets called ${s}$ times, then these messages reveal the random string ${x}$ to the Debugger.

Since each clause is a disjunction (an “or” of ${k}$ literals), there is exactly one assignment to those variables that does not satisfy the clause. So, whenever the CPU tells the Debugger that he is calling Fix(${C}$), the Debugger knows exactly what the current assignment to ${C}$ is. So, starting from the assignment that the Debugger received in the final message, he can work backwards and figure out what the previous assignment was before calling Fix. Repeating this process, he can figure out how the variables were set in each call to Fix, and also what the initial assignment was. Thus the Debugger can reconstruct the random string ${x}$.

The total number of bits sent by the CPU are

• ${m (\log m+O(1))}$ bits for all the messages sent when Solve calls Fix.
• ${s \cdot (\log R + O(1))}$ for all the messages sent in the ${\leq s}$ recursive calls.
• ${n+O(1)}$ bits to send the final assignment.

So ${x}$ has been compressed from ${n+sk}$ bits to

$\displaystyle m (\log m +O(1)) ~+~ s(\log R + O(1)) ~+~ n + O(1) ~~~\mathrm{bits}.$

This is an overall shrinking of

$\displaystyle \begin{array}{rcl} &&\Big( n + sk \Big) ~-~ \Big( m(\log m +O(1)) ~+~ s(\log R + O(1)) ~+~ n + O(1) \Big) \\ &=& s(k - \log R - O(1)) ~-~ m(\log m +O(1)) ~-~ O(1) \\ &=& s(d-O(1)) ~-~ m(\log m +O(1)) \qquad(\mathrm{since}~R=2^{k-d}) \\ &=& \big(m (\log m + c) + {\textstyle \log \frac{1}{\delta} } \big)(d-O(1)) ~-~ m(\log m +O(1)) \qquad(\mathrm{definition~of}~ s)\\ &\geq& \log \frac{1}{\delta} \end{array}$

bits, assuming that ${c}$ and ${d}$ are sufficiently big constants.

We have argued that, if Fix gets called ${s}$ times, then ${x}$ can be compressed by ${\log \frac{1}{\delta}}$ bits. The next claim argues that this happens with probability at most ${\delta}$.

Claim 5 The probability that ${x}$ can be compressed by ${\log \frac{1}{\delta}}$ bits is at most ${\delta}$.

Proof: Consider any deterministic algorithm for encoding all bit strings of length ${\ell}$ into bit strings of arbitrary length. The number of bit strings that are encoded into ${\ell-b}$ bits is at most ${2^{\ell-b}}$. So, a random bit string has probability ${2^{-b}}$ of being encoded into ${\ell-b}$ bits. (One can view this as a simple special case of the Kraft inequality.) $\Box$