## Lecture 18: Perfect Hashing

In the last lecture we defined the notion of pairwise independence, and saw some ways to use this concept in reducing the randomness of a randomized algorithm. Today we will apply these ideas to designing hash functions and “perfect” hash tables.

1. Universal Hash Families

Last time we showed how to create a joint distribution on random variables ${\{\: Y_i \::\: i \in U \:\}}$ such that

• each ${Y_i}$ is uniformly distributed on some finite set ${T}$, and
• the joint distribution on the ${Y_i}$‘s is pairwise independent.

Let ${S}$ be the sample space underlying the ${Y_i}$‘s. Drawing a sample ${s}$ from ${S}$ determines values for all the ${Y_i}$‘s. We can think of this another way: after choosing ${s \in S}$, then associated with every ${i \in U}$ there is a value in ${T}$ (namely, the value of ${Y_i}$).

Formally, for every ${s \in S}$ there is a function ${h_s : U \rightarrow T}$ given by ${h_s(i) = Y_i}$. (The random variable ${Y_i}$ is implicitly a function of ${s}$.) This family of functions ${\{\: h_s \::\: s \in S \:\}}$ satisfies the following important property:

$\displaystyle {\mathrm{Pr}}_{s \in S}[\: h_s(x)=\alpha \:\wedge\: h_s(y)=\beta \:] ~=~ \frac{1}{|T|^2} \qquad\forall x \neq y \in U, ~~ \forall \alpha, \beta \in T.$

A family of functions satisfying this property is called a ${2}$-universal hash family. So we can restate our construction from last time as follows.

Theorem 1 For any ${n}$, there exists a ${2}$-universal hash family with ${|S|=2^{2n}}$, ${|U|=2^n}$ and ${|T|=2^n}$. Sampling a function from this family requires ${O(n)}$ mutually independent, uniform random bits. Consequently, each hash function can be represented using ${O(n)}$ bits of space. Evaluating any hash function in this family at any point of ${U}$ takes only a constant number of arithmetic operations involving ${O(n)}$-bit integers.

One can easily modify this construction to handle any ${U}$ and ${T}$ with ${|U| \leq 2^n}$, ${|T| \leq 2^n}$ and ${|T|}$ a power of two.
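To make the defining property concrete, here is a small Python check using the classical prime-field family ${h_{a,b}(x) = ax+b \bmod p}$. (This is an assumption for illustration: it is a standard family in the same spirit, not the ${GF(2^n)}$ construction of Theorem 1, and here ${U = T = {\mathbb Z}_p}$.)

```python
# Exhaustively verify the 2-universal property for the prime-field family
# h_{a,b}(x) = (a*x + b) mod p, with seed s = (a, b) uniform over Z_p^2.
# (Illustrative stand-in for the GF(2^n) family of Theorem 1.)
from itertools import product

p = 5  # a small prime, so U = T = {0, ..., 4}

def h(a, b, x):
    """Hash function h_{a,b}(x) = a*x + b over Z_p."""
    return (a * x + b) % p

# For every x != y and every (alpha, beta), exactly a 1/p^2 fraction of
# seeds (a, b) gives h_s(x) = alpha and h_s(y) = beta.
for x, y in product(range(p), repeat=2):
    if x == y:
        continue
    for alpha, beta in product(range(p), repeat=2):
        hits = sum(1 for a, b in product(range(p), repeat=2)
                   if h(a, b, x) == alpha and h(a, b, y) == beta)
        assert hits == 1  # 1 out of p^2 seeds, i.e. probability 1/p^2
print("2-universal property verified for p =", p)
```

The unique solvability of the linear system ${ax+b=\alpha,\ ay+b=\beta}$ over ${{\mathbb Z}_p}$ is exactly why each count is ${1}$.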

2. Perfect Hashing

Let ${U}$ be a “universe” of items. We will assume that our computational model has words of ${\Theta(\log |U|)}$ bits and that any standard arithmetic operation involving ${\Theta(\log |U|)}$-bit words takes ${O(1)}$ time.

Suppose we wish to store a given subset ${N \subseteq U}$ as a “dictionary”, so that we can efficiently test whether any element ${x \in U}$ belongs to ${N}$. There are many well-known solutions to this problem. For example, a balanced binary tree allows us to store the dictionary in ${O(|N|)}$ words of space, while search operations take ${O(\log |N|)}$ time in the worst case. (Insertion and deletion of items also take ${O(\log |N|)}$ time.)

We will present an alternative dictionary design based on hashing. This is a “static dictionary”, which does not allow insertion or deletion of items.

Theorem 2 There is a randomized algorithm that, given a set ${N \subseteq U}$, stores the set ${N}$ in a data structure using ${O(|N|)}$ words of space. There is a deterministic algorithm that, given any item ${x \in U}$, can search for ${x}$ in that data structure in ${O(1)}$ time in the worst case.

Let ${T}$ be any set with ${T \subseteq U}$ and ${|T|}$ a power of two. Theorem 1 gives us a ${2}$-universal hash family mapping ${U}$ to ${T}$ for which any hash function can be evaluated in ${O(1)}$ time.

Using this family, consider the following simple design for our dictionary. First create a table of size ${|T|}$ and randomly choose a function ${h_s}$ from the hash family. Can we simply store each item ${x \in N}$ in the table in location ${h_s(x)}$? This would certainly be very efficient, because finding ${x}$ in the table only requires evaluating ${h_s(x)}$, which takes only ${O(1)}$ time. But the trouble of course is that there might be collisions: there might be two items ${x, y \in N}$ such that ${h_s(x) = h_s(y)}$, meaning that ${x}$ and ${y}$ want to lie in the same location of the table.

Most likely you have discussed hash table collisions in an undergraduate class on data structures. One obvious way to try to avoid collisions is to take ${|T|}$ to be larger; the problem is that this increases the storage requirements, and we are aiming for a data structure that uses ${O(|N|)}$ words. Alternatively, one can allow collisions to happen and use some other technique to deal with them. For example, one could use chaining, in which all items hashing to the same location of the table are stored in a linked list, or linear probing, in which some items hashing to the same location are relocated to nearby locations. Unfortunately, neither of those techniques solves our problem: when the table size is ${O(|N|)}$, both techniques will typically require super-constant time to search the table in the worst case.

The solution we present today is based on a simple two-level hierarchical hashing scheme. At the top-level, a single hash function is used to assign each item ${x \in N}$ to a “bucket” in the top-level table. Then, for every bucket of the top-level table we create a new second-level hash table to store the items that hashed to that bucket.

Collisions. Using the pairwise independence property, we can easily determine the probability of any two items having a collision. For any unordered pair of distinct items ${\{x, y\} \subset U}$,

$\displaystyle {\mathrm{Pr}}_{s \in S}[ h_s(x) = h_s(y) ] ~=~ \sum_{i \in T} {\mathrm{Pr}}_{s \in S}[\: h_s(x)=i \:\wedge\: h_s(y)=i \:] ~=~ \sum_{i \in T} \frac{1}{|T|^2} ~=~ \frac{1}{|T|}.$

So the expected total number of collisions is

$\displaystyle {\mathrm E}[\#~\mathrm{collisions}] ~=~ \sum_{\{x,y\} \subset N ,\: x \neq y} {\mathrm{Pr}}_{s \in S}[ h_s(x) = h_s(y) ] ~=~ \binom{|N|}{2} / |T|. \ \ \ \ \ (1)$

We would like to carefully choose the size of ${T}$ in order to have few collisions. A natural choice is to let ${|T|}$ be ${|N|^2}$ (rounded up to a power of two), so

$\displaystyle {\mathrm E}[\#~\mathrm{collisions}] \leq 1/2 \qquad\implies\qquad {\mathrm{Pr}}[\mathrm{no~collisions}] \geq 1/2, \ \ \ \ \ (2)$

by Markov’s inequality. The problem is that storing this ${T}$ would require ${\Theta(|N|^2)}$ space.

Our design. Instead, we will choose ${|T|}$ to be ${|N|}$ (rounded up to a power of two). This will typically result in some collisions, but the second-level hash tables are used to deal with those collisions. The top-level hash function is denoted ${h : U \rightarrow T}$, and the elements of ${T}$ are called buckets. By (1),

$\displaystyle {\mathrm E}[\#~\mathrm{collisions}] \leq |N|/2 \qquad\text{and}\qquad {\mathrm{Pr}}[\#~\mathrm{collisions} \leq |N|] \geq 1/2.$

Let us condition on that event happening: the number of collisions of the top-level hash function ${h}$ is at most ${|N|}$. For every bucket ${i \in T}$, let

$\displaystyle N_i ~=~ \{\: x \in N \::\: h(x)=i \:\}$

be the set of items that are hashed to bucket ${i}$ by the top-level hash function. Let ${n_i = |N_i|}$. For each ${i \in T}$ we will sample a new second-level hash function ${h_i : U \rightarrow T_i}$, where ${|T_i|}$ is ${n_i^2}$ (rounded up to a power of two). With probability at least ${1/2}$ there will be no collisions of the hash function ${h_i}$, as shown by (2). We will repeatedly sample ${h_i}$ and count its number of collisions until finding a function ${h_i}$ with no collisions.

Search operation. Every item in ${N}$ is stored in exactly one location in these second-level tables. So to search our data structure for an item ${x \in U}$ we follow a very simple procedure. First we compute ${i = h(x)}$, the top-level bucket containing item ${x}$. Then we compute ${h_i(x)}$, which gives the location of ${x}$ in the second-level table ${T_i}$. This process takes ${O(1)}$ time.

Space requirements. The only thing left to analyze is the space requirements. The size of the top-level table ${T}$ is at most ${2|N|}$, by construction. The total size of the second-level tables is

$\displaystyle \sum_{i \in T} n_i^2 ~\leq~ \sum_{i \in T} \Big( 2 \binom{n_i}{2} + n_i \Big) ~=~ |N| + 2 \sum_{i \in T} \binom{ n_i }{ 2 } ~=~ |N| + 2 \cdot \#~\mathrm{collisions} ~\leq~ 3 \cdot |N|.$

We also need to write down the descriptions of the hash functions themselves. By Theorem 1 each hash function can be represented using only a constant number of words. So the total space required is ${O(|N|)}$ words.
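The whole two-level scheme can be sketched in a few dozen lines of Python. Two caveats: this sketch assumes the prime-field family ${((ax+b) \bmod P) \bmod \mathrm{size}}$ in place of Theorem 1's ${GF(2^n)}$ family, and it omits resampling the top-level function when it has too many collisions, so the ${O(|N|)}$ space bound holds here only in expectation.

```python
# Sketch of the two-level perfect hashing scheme, with a prime-field hash
# family standing in for the GF(2^n) family of Theorem 1 (an assumption).
import random

P = 10**9 + 7  # a prime larger than any item in the universe

def draw_hash(size):
    """Sample a random hash function U -> {0, ..., size-1}."""
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda x, a=a, b=b: ((a * x + b) % P) % size

def build(items):
    n = len(items)
    top = draw_hash(n)                       # top-level function h
    buckets = [[] for _ in range(n)]
    for x in items:
        buckets[top(x)].append(x)            # N_i = items hashed to bucket i
    tables = []
    for bucket in buckets:
        size = len(bucket) ** 2              # |T_i| = n_i^2
        while True:                          # resample until h_i is collision-free
            hi = draw_hash(size) if size else None
            slots = [None] * size
            ok = True
            for x in bucket:
                j = hi(x)
                if slots[j] is not None:     # collision: try a fresh h_i
                    ok = False
                    break
                slots[j] = x
            if ok:
                tables.append((hi, slots))
                break
    return top, tables

def lookup(structure, x):
    top, tables = structure
    hi, slots = tables[top(x)]               # bucket i = h(x), then slot h_i(x)
    return bool(slots) and slots[hi(x)] == x

N = [3, 1, 4, 15, 92, 65, 358]
S = build(N)
assert all(lookup(S, x) for x in N)
assert not lookup(S, 27)
```

Note that `lookup` compares the stored item against ${x}$, which is what lets the search reject items of ${U \setminus N}$ in ${O(1)}$ time.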

## Lecture 17: Chebyshev’s Inequality, Pairwise Independence

So far we have seen two concentration bounds for scalar random variables: Markov and Chernoff. For our sort of applications, these are by far the most useful. In most introductory probability courses, you are likely to see another inequality, which is Chebyshev’s inequality. Its strength lies between the Markov and Chernoff inequalities: the concentration bound we get from Chebyshev is usually better than Markov’s but worse than Chernoff’s. On the other hand, Chebyshev requires stronger hypotheses than Markov but weaker hypotheses than Chernoff.

1. Variance

We begin by reviewing variance, and other related notions, which should be familiar from an introductory probability course. The variance of a random variable ${X}$ is

$\displaystyle \begin{array}{rcl} {\mathrm{Var}}[X] ~=~ {\mathrm E}\Big[ \big(X-{\mathrm E}[X]\big)^2 \Big] ~=~ {\mathrm E}[ X^2 ] - {\mathrm E}[X]^2. \end{array}$

The covariance between two random variables ${X}$ and ${Y}$ is

$\displaystyle {\mathrm{Cov}}[X,Y] ~=~ {\mathrm E}\Big[ \big(X-{\mathrm E}[X]\big) \big(Y-{\mathrm E}[Y]\big) \Big] ~=~ {\mathrm E}[XY] - {\mathrm E}[X]{\mathrm E}[Y].$

This gives some measure of the correlation between ${X}$ and ${Y}$.

Here are some properties of variance and covariance that follow from the definitions by simple calculations.

Claim 1 If ${X}$ and ${Y}$ are independent then ${{\mathrm{Cov}}[X,Y] = 0}$.

Claim 2 ${{\mathrm{Cov}}[X+Y,Z] = {\mathrm{Cov}}[X,Z] + {\mathrm{Cov}}[Y,Z]}$.

More generally, induction shows

Claim 3 ${{\mathrm{Cov}}[\sum_i X_i,Z] = \sum_i {\mathrm{Cov}}[X_i,Z]}$.

Claim 4 ${{\mathrm{Var}}[X+Y] = {\mathrm{Var}}[X] + {\mathrm{Var}}[Y] + 2 \cdot {\mathrm{Cov}}[X,Y]}$.

More generally, induction shows

Claim 5 Let ${X_1,\ldots,X_n}$ be arbitrary random variables. Then

$\displaystyle {\mathrm{Var}}\Bigg[\sum_{i=1}^n X_i\Bigg] ~=~ \sum_{i=1}^n {\mathrm{Var}}[X_i] + 2 \sum_{i=1}^n \sum_{j>i} {\mathrm{Cov}}[X_i,X_j].$

In particular,

Claim 6 Let ${X_1,\ldots,X_n}$ be mutually independent random variables. Then ${ {\mathrm{Var}}[\sum_{i=1}^n X_i]~= \sum_{i=1}^n {\mathrm{Var}}[X_i] }$.
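These identities are easy to sanity-check by exhaustive enumeration over small distributions (no sampling, so the check is exact). A minimal sketch in Python:

```python
# Exact check of Claim 6 on two small independent discrete distributions.
from itertools import product

def stats(outcomes):
    """outcomes: list of (probability, value) pairs for one random variable."""
    mean = sum(p * v for p, v in outcomes)
    var = sum(p * (v - mean) ** 2 for p, v in outcomes)
    return mean, var

# X uniform on {0,1}, Y uniform on {0,1,2}, independent.
X = [(1/2, 0), (1/2, 1)]
Y = [(1/3, 0), (1/3, 1), (1/3, 2)]

# Distribution of X + Y under independence: product of the marginals.
Z = [(px * py, vx + vy) for (px, vx), (py, vy) in product(X, Y)]

_, var_x = stats(X)
_, var_y = stats(Y)
_, var_z = stats(Z)
assert abs(var_z - (var_x + var_y)) < 1e-12  # Var[X+Y] = Var[X] + Var[Y]
```

Here ${\mathrm{Var}[X]=1/4}$, ${\mathrm{Var}[Y]=2/3}$, and the enumerated ${\mathrm{Var}[X+Y]}$ comes out to ${11/12}$, as Claim 6 predicts.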

2. Chebyshev’s Inequality

Presumably you have also seen Chebyshev’s inequality before. It is a one-line consequence of Markov’s inequality.

Theorem 7 For any ${t > 0}$,

$\displaystyle {\mathrm{Pr}}\Big[ \big|X-{\mathrm E}[X]\big|\geq t \Big] ~\leq~ \frac{{\mathrm{Var}}[X]}{t^2}.$

Proof:

$\displaystyle {\mathrm{Pr}}\Big[ \big|X-{\mathrm E}[X]\big|\geq t \Big] ~=~ {\mathrm{Pr}}\Big[ \big(X-{\mathrm E}[X]\big)^2\geq t^2 \Big] ~\leq~ \frac{ {\mathrm E}[ \big(X-{\mathrm E}[X]\big)^2 ]}{t^2} ~=~ \frac{ {\mathrm{Var}}[ X ]}{t^2},$

where the inequality is by Markov’s inequality. $\Box$

As a quick example, suppose we independently flip ${n}$ fair coins. What’s the probability that we see at least ${3n/4}$ heads? Let ${X_i}$ be the indicator random variable of the event “${i}$th toss is heads”. Let ${\mu = {\mathrm E}[\sum_i X_i] = n/2}$. So we want to analyze ${{\mathrm{Pr}}[ \sum_i X_i - \mu \geq n/4 ]}$.

Bound from Chebyshev: Note that

$\displaystyle {\mathrm{Var}}[X_i] ~=~ {\mathrm E}[X_i^2] - {\mathrm E}[X_i]^2 ~=~ 1/2 - 1/4 ~=~ 1/4.$

By independence,

$\displaystyle {\mathrm{Var}}[{\textstyle \sum_i X_i}] ~=~ \sum_i {\mathrm{Var}}[X_i] ~=~ n/4.$

By Chebyshev’s inequality

$\displaystyle {\mathrm{Pr}}[ {\textstyle \sum_i X_i - \mu \geq n/4 }] ~\leq~ {\mathrm{Pr}}\Big[ {\textstyle \big|\sum_i X_i-{\mathrm E}[\sum_i X_i]\big|\geq n/4 } \Big] ~\leq~ \frac{{\mathrm{Var}}[\sum_i X_i]}{(n/4)^2} ~=~ \frac{n/4}{(n/4)^2} ~=~ \frac{4}{n}.$

Bound from Chernoff: Chernoff’s inequality gives

$\displaystyle {\mathrm{Pr}}[ {\textstyle \sum_i X_i - \mu \geq n/4 }] ~=~ {\mathrm{Pr}}[ {\textstyle \sum_i X_i - \mu \geq \mu/2 }] ~\leq~ \exp( - (1/2)^2 \mu / 3 ) ~<~ 0.96^n.$

This is better than the bound from Chebyshev for ${n \geq 71}$.
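Evaluating the two bounds numerically makes the comparison vivid; the sketch below just plugs ${n}$ into the formulas derived above.

```python
# The Chebyshev and Chernoff tail bounds for Pr[#heads >= 3n/4] with n coins.
import math

def chebyshev_bound(n):
    return (n / 4) / (n / 4) ** 2            # Var / t^2 = 4/n

def chernoff_bound(n):
    mu = n / 2
    return math.exp(-(1 / 2) ** 2 * mu / 3)  # exp(-delta^2 mu / 3), delta = 1/2

for n in (10, 71, 100, 1000):
    print(n, chebyshev_bound(n), chernoff_bound(n))

# Chebyshev wins for small n; Chernoff decays exponentially and soon dominates.
assert chebyshev_bound(10) < chernoff_bound(10)
assert chebyshev_bound(1000) > chernoff_bound(1000)
```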

So Chebyshev is weaker than Chernoff, at least for analyzing sums of independent Bernoulli trials. So why do we bother studying Chebyshev? One reason is that Chernoff is designed for analyzing sums of mutually independent random variables. That is quite a strong assumption. In some scenarios, our random variables are not mutually independent, or perhaps we deliberately choose them not to be mutually independent.

• For example, generating mutually independent random variables requires a lot of random bits and, as discussed last time, randomness is a “precious resource”. We will see that decreasing the number of random bits gives another method to derandomize algorithms.
• Another important example is in constructing hash functions, which are random-like functions. Generating a completely random function takes a huge number of random bits. So instead we often try to use hash functions involving less randomness.

3. ${k}$-wise independence

A set of events ${{\mathcal E}_1,\ldots,{\mathcal E}_n}$ are called ${k}$-wise independent if for any set ${I \subseteq \{1,\ldots,n\}}$ with ${|I| \leq k}$ we have

$\displaystyle {\mathrm{Pr}}[ \wedge_{i \in I} {\mathcal E}_i ] ~=~ \prod_{i \in I} {\mathrm{Pr}}[ {\mathcal E}_i ].$

The term pairwise independence is a synonym for ${2}$-wise independence.

Similarly, a set of discrete random variables ${X_1,\ldots,X_n}$ are called ${k}$-wise independent if for any set ${I \subseteq \{1,\ldots,n\}}$ with ${|I| \leq k}$ and any values ${x_i}$ we have

$\displaystyle {\mathrm{Pr}}[~ \wedge_{i \in I} \: X_i \!=\! x_i ~] = \prod_{i \in I} {\mathrm{Pr}}[ X_i = x_i ].$

Claim 8 Suppose ${X_1,\ldots,X_n}$ are ${k}$-wise independent. Then

$\displaystyle {\mathrm E}[~{\textstyle \prod_{i \in I} X_i }~] ~=~ \prod_{i \in I} {\mathrm E}[X_i] \qquad\forall I ~\mathrm{with}~ |I| \leq k.$

Proof: For notational simplicity, consider the case ${I = \{1,\ldots,k\}}$. Then

$\displaystyle \begin{array}{rcl} {\mathrm E}[~{\textstyle \prod_{i=1}^k X_i }~] &=& \sum_{x_1} \sum_{x_2} \cdots \sum_{x_k} ~ {\mathrm{Pr}}[~ \wedge_{i=1}^k \, X_i \!=\! x_i ~] \cdot \prod_{i=1}^k x_i \\ &=& \sum_{x_1} \sum_{x_2} \cdots \sum_{x_k} ~ \prod_{i=1}^k {\mathrm{Pr}}[X_i \!=\! x_i ] \cdot x_i \qquad(k\!-\!\mathrm{wise~independence})\\ &=& \Big(\sum_{x_1} {\mathrm{Pr}}[X_1 \!=\! x_1 ] \cdot x_1\Big) \cdots \Big(\sum_{x_k} {\mathrm{Pr}}[X_k \!=\! x_k ] \cdot x_k\Big) \\ &=& \prod_{i=1}^k {\mathrm E}[ X_i ]. \end{array}$

$\Box$

Example: To get a feel for pairwise independence, consider the following three Bernoulli random variables that are pairwise independent but not mutually independent. There are 4 possible outcomes of these three random variables. Each of these outcomes has probability ${1/4}$.

$\displaystyle \begin{array}{ccc} X_1 & X_2 & X_3 \\ \hline 0 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{array}$

They are certainly not mutually independent because the event ${X_1=X_2=X_3=1}$ has probability ${0}$, whereas ${\prod_{i=1}^3 {\mathrm{Pr}}[X_i=1]=(1/2)^3}$. But, by checking all cases, one can verify that they are pairwise independent.
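The table is exactly the distribution of ${(X_1, X_2, X_1 \oplus X_2)}$ with ${(X_1,X_2)}$ uniform on ${\{0,1\}^2}$, so "checking all cases" is a four-line enumeration. A quick Python sketch:

```python
# Verify the table above: pairwise independent but not mutually independent.
from itertools import product

# The 4 equally likely outcomes (X1, X2, X3) with X3 = X1 XOR X2.
outcomes = [(x1, x2, x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]

def pr(event):
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

# Pairwise independence: every pair of coordinates, every pair of values.
for i, j in [(0, 1), (0, 2), (1, 2)]:
    for a, b in product((0, 1), repeat=2):
        assert pr(lambda o: o[i] == a and o[j] == b) \
            == pr(lambda o: o[i] == a) * pr(lambda o: o[j] == b)

# Not mutually independent: Pr[X1 = X2 = X3 = 1] = 0, not (1/2)^3.
assert pr(lambda o: o == (1, 1, 1)) == 0
```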

3.1. Constructing Pairwise Independent RVs

Let ${{\mathbb F}}$ be a finite field and ${q=|{\mathbb F}|}$. We will construct RVs ${Y_1, \cdots, Y_q}$ such that each ${Y_i}$ is uniform over ${{\mathbb F}}$ and the ${Y_i}$‘s are pairwise independent. To do so, we need to generate only two independent RVs ${X_1}$ and ${X_2}$ that are uniformly distributed over ${{\mathbb F}}$. We then define

$\displaystyle Y_i ~=~ X_1 + i \cdot X_2. \ \ \ \ \ (1)$

Claim 9 Each ${Y_i}$ is uniformly distributed on ${{\mathbb F}}$.

Proof: For ${i=0}$ we have ${Y_i = X_1}$, which is uniform. For ${i \neq 0}$ and any ${j \in {\mathbb F}}$ we have

$\displaystyle \begin{array}{rcl} {\mathrm{Pr}}[ Y_i = j ] &=& {\mathrm{Pr}}[ X_1 + i \cdot X_2 = j ] \\ &=& {\mathrm{Pr}}[ X_2 = (j-X_1)/i ] \\ &=& \sum_{x \in {\mathbb F}} {\mathrm{Pr}}[~ X_2 = (j-x)/i ~\wedge~ X_1=x ~] \qquad(\mathrm{total~probability~law}) \\ &=& \sum_{x \in {\mathbb F}} {\mathrm{Pr}}[ X_2 = (j-x)/i] \cdot {\mathrm{Pr}}[ X_1=x ] \qquad(\mathrm{pairwise~independence}) \\ &=& (1/q) \cdot \sum_{x \in {\mathbb F}} {\mathrm{Pr}}[ X_2 = (j-x)/i ] \\ &=& (1/q), \end{array}$

since as ${x}$ ranges through ${{\mathbb F}}$, ${(j-x)/i}$ also ranges through all of ${{\mathbb F}}$. (In other words, the map ${x \mapsto (j-x)/i}$ is a bijection of ${{\mathbb F}}$ to itself.) So ${Y_i}$ is uniform. $\Box$

Claim 10 The ${Y_i}$‘s are pairwise independent.

Proof: We wish to show that, for any distinct RVs ${Y_i}$ and ${Y_j}$ and any values ${a,b \in {\mathbb F}}$, we have

$\displaystyle {\mathrm{Pr}}[~ Y_i=a \:\wedge\: Y_j=b ~] ~=~ {\mathrm{Pr}}[ Y_i=a ] \cdot {\mathrm{Pr}}[ Y_j=b ] ~=~ 1/q^2.$

This event is equivalent to ${X_1+i \cdot X_2 = a}$ and ${X_1+j \cdot X_2 = b}$. We can also rewrite that as:

$\displaystyle \begin{pmatrix} 1 & i \\ 1 & j \end{pmatrix} \cdot \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} ~=~ \begin{pmatrix} a \\ b \end{pmatrix}.$

This holds precisely when

$\displaystyle \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} ~=~ \begin{pmatrix} 1 & i \\ 1 & j \end{pmatrix}^{-1} \cdot \begin{pmatrix} a \\ b \end{pmatrix} ~=~ \begin{pmatrix} ja-ib \\ b-a \end{pmatrix} /(j-i).$

Since ${X_1}$ and ${X_2}$ are independent and uniform over ${{\mathbb F}}$, this event holds with probability ${1/q^2}$. $\Box$

Corollary 11 Given ${2n}$ mutually independent, uniformly random bits, we can construct ${2^n}$ pairwise independent, uniformly random strings in ${\{0,1\}^n}$.

Proof: Apply the previous construction to the finite field ${{\mathbb F}_{2^n}}$. The ${2n}$ mutually independent random bits are used to construct ${X_1}$ and ${X_2}$. The random strings ${Y_1,\ldots,Y_{2^n}}$ are constructed as in (1). $\Box$
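The construction and both claims can be verified exhaustively over a small field. The sketch below uses a prime field ${{\mathbb F}_q}$ rather than ${GF(2^n)}$ (an assumed substitution; the argument is identical, and prime-field arithmetic is just Python's `%`).

```python
# Verify Claims 9 and 10 for Y_i = X_1 + i*X_2 over a small prime field F_q.
from itertools import product

q = 7

def Y(i, x1, x2):
    """Construction (1): Y_i = X_1 + i*X_2 over F_q, with seed (x1, x2)."""
    return (x1 + i * x2) % q

seeds = list(product(range(q), repeat=2))  # (X_1, X_2) uniform over F_q^2

for i, j in product(range(q), repeat=2):
    if i == j:
        continue
    for a, b in product(range(q), repeat=2):
        hits = sum(1 for x1, x2 in seeds
                   if Y(i, x1, x2) == a and Y(j, x1, x2) == b)
        assert hits == 1  # probability 1/q^2, as in Claim 10
print("pairwise independence verified for q =", q)
```

The count is always ${1}$ because the ${2 \times 2}$ system in the proof of Claim 10 has a unique solution whenever ${i \neq j}$.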

3.2. Example: Max Cut with pairwise independent RVs

Once again let’s consider the Max Cut problem. We are given a graph ${G=(V,E)}$ where ${V = \{1,\ldots,n\}}$. We will choose ${\{0,1\}}$-valued random variables ${Y_1,\ldots,Y_n}$. If ${Y_i=1}$ then we add vertex ${i}$ to ${U}$.

Our original algorithm chose ${Y_1,\ldots,Y_n}$ to be mutually independent and uniform. Instead we will pick ${Y_1,\ldots,Y_n}$ to be pairwise independent and uniform. Then

$\displaystyle \begin{array}{rcl} {\mathrm E}[ |\delta(U)| ] &=& \sum_{ij \in E} {\mathrm{Pr}}[~ (i \!\in\! U \wedge j \!\not\in\! U) \:\vee\: (i \!\not\in\! U \wedge j \!\in\! U) ~] \\ &=& \sum_{ij \in E} {\mathrm{Pr}}[~ i \!\in\! U \wedge j \!\not\in\! U ~] + {\mathrm{Pr}}[~ i \!\not\in\! U \wedge j \!\in\! U ~] \\ &=& \sum_{ij \in E} {\mathrm{Pr}}[Y_i\!=\!1]\,{\mathrm{Pr}}[Y_j\!=\!0] + {\mathrm{Pr}}[Y_i\!=\!0]\,{\mathrm{Pr}}[Y_j\!=\!1] \qquad(\mathrm{pairwise~independence}) \\ &=& \sum_{ij \in E} \Big( (1/2)^2 + (1/2)^2 \Big) \\ &=& |E|/2 \end{array}$

So the original algorithm works just as well if we make pairwise independent decisions instead of mutually independent decisions for placing vertices in ${U}$. The following theorem shows the advantage of making pairwise independent decisions.

Theorem 12 There is a deterministic, polynomial time algorithm to find a cut ${\delta(U)}$ with ${|\delta(U)| \geq |E|/2}$.

Proof: By Corollary 11, we only need ${b = O(\log n)}$ mutually independent, uniform random bits ${X_1,\ldots,X_b}$ in order to generate our pairwise independent, uniform random bits ${Y_1,\ldots,Y_n}$. We have just argued that these pairwise independent ${Y_i}$‘s will give us

$\displaystyle {\mathrm E}_{X_1,\ldots,X_b}[ |\delta(U)| ] ~=~ |E|/2.$

So there must exist some particular bits ${(x_1,\ldots,x_b)}$ such that fixing ${X_i=x_i}$ for all ${i}$, we get ${ |\delta(U)| \geq |E|/2 }$. We can deterministically find such bits ${(x_1,\ldots,x_b)}$ by exhaustive search in ${2^b = 2^{O(\log n)} = \mathrm{poly}(n)}$ trials. This gives a deterministic, polynomial time algorithm. $\Box$
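The exhaustive search over seeds is short enough to write out. This sketch assumes a prime field ${{\mathbb F}_q}$ with ${q \geq n}$ instead of ${GF(2^n)}$, and takes the least significant bit of ${Y_i}$ as the membership bit for vertex ${i}$ (for odd ${q}$ this bit has a negligible ${O(1/q)}$ bias, which the sketch ignores).

```python
# Derandomized Max Cut: enumerate all q^2 seeds of the pairwise independent
# construction Y_i = X_1 + i*X_2 over F_q, keep the best cut.
from itertools import product

def cut_size(edges, in_U):
    return sum(1 for u, v in edges if in_U[u] != in_U[v])

def derandomized_max_cut(n, edges):
    # Find some prime q >= n (Bertrand's postulate guarantees one below 2n).
    q = next(p for p in range(max(n, 2), 2 * n + 2)
             if all(p % d for d in range(2, int(p ** 0.5) + 1)))
    best = None
    for x1, x2 in product(range(q), repeat=2):  # all q^2 seeds = O(log n) bits
        in_U = [((x1 + i * x2) % q) % 2 == 1 for i in range(n)]
        if best is None or cut_size(edges, in_U) > cut_size(edges, best):
            best = in_U
    return best

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # 4-cycle plus a chord
U = derandomized_max_cut(4, edges)
assert cut_size(edges, U) >= len(edges) / 2
```

Since the expected cut over seeds is ${\approx |E|/2}$, the maximizing seed must achieve at least that, which is what the final assertion checks.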

4. Chebyshev with pairwise independent RVs

One of the main benefits of pairwise independent RVs is that Chebyshev’s inequality still works beautifully. Suppose that ${X_1,\ldots,X_n}$ are pairwise independent. For any ${i \neq j}$,

$\displaystyle {\mathrm{Cov}}[X_i,X_j] ~=~ {\mathrm E}[X_i X_j] - {\mathrm E}[X_i]{\mathrm E}[X_j] ~=~ 0,$

by Claim 8. So

$\displaystyle {\mathrm{Var}}[{\textstyle \sum_{i=1}^n X_i}] ~=~ \sum_{i=1}^n {\mathrm{Var}}[X_i],$

by Claim 5. So

$\displaystyle {\mathrm{Pr}}[{\textstyle |\sum_i X_i - {\mathrm E}[\sum_i X_i]| > t }] ~\leq~ \frac{{\mathrm{Var}}[\sum_i X_i]}{t^2}.$

This is exactly the same bound that we would get if the ${X_i}$‘s were mutually independent.


## Lecture 16: Derandomization: Method of Conditional Expectations, Method of Pessimistic Estimators

In this lecture we discuss the topic of derandomization — converting a randomized algorithm into a deterministic one.

1. Method of Conditional Expectations

One of the simplest methods for derandomizing an algorithm is the “method of conditional expectations”. In some contexts this is also called the “method of conditional probabilities”.

Let us start with a simple example. Let ${[k]}$ denote ${\{1,\ldots,k\}}$. Suppose ${X}$ is a random variable taking values in ${[k]}$. Let ${f : [k] \rightarrow {\mathbb R}}$ be any function and suppose ${{\mathrm E}[ f(X) ] \leq \mu}$. How can we find an ${x \in \{1,\ldots,k\}}$ such that ${f(x) \leq \mu}$? Well, the assumption ${{\mathrm E}[ f(X) ] \leq \mu}$ guarantees that there exists ${x \in \{1,\ldots,k\}}$ with ${f(x) \leq \mu}$. So we can simply use exhaustive search to try all possible values for ${x}$ in only ${O(k)}$ time. The same idea can also be used to find an ${x}$ with ${f(x) \geq \mu}$.

Now let’s make the example a bit more complicated. Suppose ${X_1,\ldots,X_n}$ are independent random variables taking values in ${[k]}$. Let ${f : [k]^n \rightarrow {\mathbb R}}$ be any function and suppose ${{\mathrm E}[ f(X_1,\ldots,X_n) ] \leq \mu}$. How can we find a vector ${(x_1,\ldots,x_n) \in [k]^n}$ with ${f(x_1,\ldots,x_n) \leq \mu}$? Exhaustive search is again an option, but now it will take ${O(k^n)}$ time, which might be too much.

The method of conditional expectations gives a more efficient solution, under some additional assumptions. Suppose that for any numbers ${(x_1,\ldots,x_i)}$ we can efficiently evaluate

$\displaystyle {\mathrm E}_{X_{i+1},\ldots,X_n}[\: f(x_1,\ldots,x_i,X_{i+1},\ldots,X_n) \:].$

(If you prefer, you can think of this as ${{\mathrm E}[~ f(X_1,\ldots,X_n) ~|~ X_1=x_1,\ldots,X_i=x_i ~]}$, which is a conditional expectation of ${f}$. This is where the method gets its name.) Then the following algorithm will produce a point ${x_1,\ldots,x_n}$ with ${f(x_1,\ldots,x_n) \leq \mu}$.

• For ${i=1,\ldots,n}$
• Set ${x_i=0}$.
• Repeat
• Set ${x_i = x_i + 1}$.
• Until ${{\mathrm E}[\: f(x_1,\ldots,x_i,X_{i+1},\ldots,X_n) \:] \leq {\mathrm E}[\: f(x_1,\ldots,x_{i-1},X_i,\ldots,X_n) \:]}$
• End

First we claim that the algorithm will terminate (i.e., the repeat loop will eventually succeed). To see this, define

$\displaystyle g(y) ~=~ {\mathrm E}_{X_{i+1},\ldots,X_n}[\: f(x_1,\ldots,x_{i-1},y,X_{i+1},\ldots,X_n) \:].$

Just like in our simple example above, there exists an ${x_i}$ with ${g(x_i) \leq {\mathrm E}_{X_i}[ g(X_i) ]}$, so we can find such an ${x_i}$ by exhaustive search. That is exactly what the repeat loop is doing.

1.1. Example: Max Cut

To illustrate this method, let us consider our algorithm for the Max Cut problem from Lecture 1. We are given a graph ${G=(V,E)}$. Recall that this algorithm generates a cut ${\delta(U)}$ simply by picking a set ${U \subseteq V}$ uniformly at random. Equivalently, for each vertex ${v \in V}$, the algorithm independently flips a fair coin to decide whether to put ${v \in U}$. We argued that ${{\mathrm E}[|\delta(U)|] \geq |E|/2}$.

We will use the method of conditional expectations to derandomize this algorithm. Let the vertex set of the graph be ${V = \{1,\ldots,n\}}$. Let

$\displaystyle f(x_1,\ldots,x_n) = |\delta(U)| \qquad\text{where}\qquad U = \{\: i \::\: x_i=1 \}.$

Let ${X_1,\ldots,X_n}$ be independent random variables where each ${X_i}$ is ${0}$ or ${1}$ with probability ${1/2}$. We identify the event “${X_i=1}$” with the event “vertex ${i \in U}$”. Then ${{\mathrm E}[ f(X_1,\ldots,X_n) ] = {\mathrm E}[ |\delta(U)| ] = |E|/2}$. We wish to deterministically find values ${x_1,\ldots,x_n}$ for which ${f(x_1,\ldots,x_n) \geq |E|/2}$.

To apply the method of conditional probabilities we must be able to efficiently compute

$\displaystyle {\mathrm E}_{X_{i+1},\ldots,X_n}[\: f(x_1,\ldots,x_i,X_{i+1},\ldots,X_n) \:],$

for any numbers ${(x_1,\ldots,x_i)}$. What is this quantity? It is the expected number of edges cut when we have already decided which vertices amongst ${\{1,\ldots,i\}}$ belong to ${U}$, and the remaining vertices ${\{i+1,\ldots,n\}}$ are placed in ${U}$ randomly (independently, with probability ${1/2}$). This expectation is easy to compute! For any edge with both endpoints in ${\{1,\ldots,i\}}$ we already know whether it will be cut or not. Every other edge has probability exactly ${1/2}$ of being cut. So we can compute that expected value in linear time.

In conclusion, the method of conditional expectations gives us a deterministic, polynomial time algorithm outputting a set ${U}$ with ${|\delta(U)| \geq |E|/2}$.
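The resulting algorithm can be sketched directly: fix the vertices one at a time, and for each one keep the choice with the larger conditional expectation, computed edge by edge as described above.

```python
# Derandomized Max Cut via the method of conditional expectations.
def cond_exp(edges, side):
    """E[|cut|] when side[v] in {0,1} is fixed and side[v] is None means
    vertex v is still assigned uniformly at random."""
    total = 0.0
    for u, v in edges:
        if side[u] is None or side[v] is None:
            total += 0.5      # an undecided edge is cut with probability 1/2
        elif side[u] != side[v]:
            total += 1.0      # both endpoints fixed, edge is cut
    return total

def derandomize(n, edges):
    side = [None] * n
    for v in range(n):
        # Try both choices; keep the one with the larger conditional expectation.
        options = []
        for bit in (0, 1):
            side[v] = bit
            options.append((cond_exp(edges, side), bit))
        side[v] = max(options)[1]
    return side

edges = [(0, 1), (1, 2), (2, 0), (0, 3)]   # triangle plus a pendant edge
side = derandomize(4, edges)
cut = sum(1 for u, v in edges if side[u] != side[v])
assert cut >= len(edges) / 2   # the conditional expectation never decreases
```

Each step of the loop is one run of the "repeat" loop from the algorithm above, and each `cond_exp` call is the linear-time expectation computation described in the preceding paragraph.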

2. Method of Pessimistic Estimators

So far we have derandomized our very simple Max Cut algorithm, which doesn’t use any sophisticated probabilistic tools. Next we will see what happens when we try to apply these ideas to algorithms that use the Chernoff bound.

Let ${X_1,\ldots,X_n}$ be independent random variables in ${[0,1]}$. Define the function ${f}$ as follows:

$\displaystyle f(x_1,\ldots,x_n) ~=~ \begin{cases} 1 &\qquad\mathrm{if}~ \sum_i x_i \geq \alpha \\ 0 &\qquad\mathrm{if}~ \sum_i x_i < \alpha \end{cases}.$

So

$\displaystyle {\mathrm E}[f(X_1,\ldots,X_n)] ~=~ {\mathrm{Pr}}[\textstyle \sum_i X_i \geq \alpha ],$

which is the typical sort of quantity to which one would apply a Chernoff bound.

Can we apply the method of conditional expectations to this function ${f}$? For any numbers ${(x_1,\ldots,x_i)}$, we need to efficiently evaluate

$\displaystyle {\mathrm E}_{X_{i+1},\ldots,X_n}[\: f(x_1,\ldots,x_i,X_{i+1},\ldots,X_n) \:] ~=~ {\mathrm{Pr}}[~ \textstyle \sum_i X_i \geq \alpha ~|~ X_1=x_1,\ldots,X_i=x_i ~].$

Unfortunately, computing this is not so easy. If the ${X_i}$‘s were i.i.d. Bernoullis then we could compute that probability by expanding it in terms of binomial coefficients. But in the non-i.i.d. or non-Bernoulli case, there does not seem to be an efficient way to compute this probability.

Here is the main idea of “pessimistic estimators”: instead of defining ${f}$ to be equal to that probability, we will define ${f}$ to be an easily-computable upper-bound on that probability. Because ${f}$ is an upper bound on the probability of the bad event “${\sum_i X_i \geq \alpha}$”, the function ${f}$ is called a pessimistic estimate of that probability. So what upper bound should we use? The Chernoff bound, of course!

For simplicity, suppose that ${X_1,\ldots,X_n}$ are independent Bernoulli random variables. The first step of the Chernoff bound (exponentiation and Markov’s inequality) shows that, for any parameter ${t>0}$,

$\displaystyle {\mathrm{Pr}}[~ \textstyle \sum_i X_i \geq \alpha ~] ~\leq~ {\mathrm E}\Big[ e^{-t \alpha} \prod_{j=1}^n e^{t X_j} \Big]. \ \ \ \ \ (1)$

Important Remark: This step holds for any joint distribution on the ${X_i}$‘s, including any non-independent or conditional distribution. This is because we have only used exponentiation and Markov’s inequality, which need no assumptions on the distribution.

We will use the upper bound in (1) to define our function ${f}$. Specifically, define

$\displaystyle f(x_1,\ldots,x_n) ~=~ e^{-t \alpha} \cdot \prod_{j=1}^n e^{t x_j}. \ \ \ \ \ (2)$

Let’s check that the conditional expectations are easy to compute with this new definition of ${f}$. Given any numbers ${(x_1,\ldots,x_i)}$, we have

$\displaystyle \begin{array}{rcl} {\mathrm E}_{X_{i+1},\ldots,X_n}[\: f(x_1,\ldots,x_i,X_{i+1},\ldots,X_n) \:] &=& e^{-t \alpha} \cdot \prod_{j=1}^i e^{t x_j} ~\cdot~ {\mathrm E}_{X_{i+1},\ldots,X_n}\Bigg[\: \prod_{j=i+1}^n e^{t X_j} \:\Bigg] \\ &=& e^{-t \alpha} \cdot \prod_{j=1}^i e^{t x_j} ~\cdot~ \prod_{j=i+1}^n {\mathrm E}_{X_j}\big[\: e^{t X_j} \:\big]. \end{array}$

This expectation is easy to compute in linear time, assuming we know the distribution of each ${X_j}$ (i.e., we know that ${{\mathrm{Pr}}[ X_j=1 ] = p_j}$).

Applying the method of conditional expectations to the pessimistic estimator: Now we’ll see how to use this function ${f}$ to find ${(x_1,\ldots,x_n)}$ with ${\sum_i x_i < \alpha}$. Set ${\mu = \sum_i {\mathrm E}[X_i]}$, ${\alpha = (1+\delta) \mu}$ and ${t = \ln(1+\delta)}$. We have

$\displaystyle {\mathrm{Pr}}[ \textstyle \sum_i X_i > (1+\delta) \mu ] ~\leq~ {\mathrm E}[ f(X_1,\ldots,X_n) ] ~\leq~ \exp\Big( - \mu \big( (1+\delta) \ln(1+\delta) - \delta\big) \Big),$

where the first inequality is from (1) and the second inequality comes from the remainder of our Chernoff bound proof. Suppose ${\mu}$ and ${\delta}$ are such that this last quantity is strictly less than ${1}$. Then we know that there exists a vector ${(x_1,\ldots,x_n)}$ with ${\sum_i x_i < \alpha}$.

We now explain how to efficiently and deterministically find such a vector. The method of conditional expectations will give us a vector ${(x_1,\ldots,x_n)}$ for which ${f(x_1,\ldots,x_n) < 1}$. We now apply the same argument as in (1) to a conditional distribution:

$\displaystyle \begin{array}{rcl} {\mathrm{Pr}}[~ \textstyle \sum_i X_i \geq (1+\delta) \mu ~|~ X_1=x_1,\ldots,X_n=x_n ~] &\leq& {\mathrm E}\Big[ f(X_1,\ldots,X_n) ~|~ X_1=x_1,\ldots,X_n=x_n \Big] \\ &=& f(x_1,\ldots,x_n) ~<~ 1. \end{array}$

But, under the conditional distribution “${X_1=x_1,\ldots,X_n=x_n}$”, there is no randomness remaining. The sum ${\sum_i X_i}$ is not a random variable; it is simply the number ${\sum_i x_i}$. Since the event “${\sum_i x_i \geq (1+\delta) \mu}$” has probability less than ${1}$, it must have probability ${0}$. In other words, we must have ${\sum_i x_i < (1+\delta) \mu}$.

This example is actually quite silly. If we want to achieve ${\sum_i x_i < \alpha}$, the best thing to do is obviously to set each ${x_i = 0}$. But the method is useful because we can apply it in more complicated scenarios that involve multiple Chernoff bounds.
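For completeness, here is the silly example carried out in code: the pessimistic estimator (2) is exactly computable, and the greedy choice at every step is ${x_i = 0}$, just as noted above. (The helper names are hypothetical; only the formula comes from the text.)

```python
# Derandomization via the pessimistic estimator (2) for independent
# Bernoulli(p_j) variables: fix each x_i to minimize the conditional estimator.
import math

def estimator(xs_fixed, ps_rest, t, alpha):
    """f = e^{-t*alpha} * prod e^{t x_j} over fixed coordinates, times
    E[e^{t X_j}] = 1 - p_j + p_j e^t over the still-random coordinates."""
    val = math.exp(-t * alpha)
    for x in xs_fixed:
        val *= math.exp(t * x)
    for p in ps_rest:
        val *= 1 - p + p * math.exp(t)
    return val

def derandomize(ps, delta):
    mu = sum(ps)
    alpha = (1 + delta) * mu
    t = math.log(1 + delta)
    xs = []
    for i in range(len(ps)):
        # Pick the value of x_i giving the smaller conditional estimator.
        xs.append(min((0, 1),
                      key=lambda x: estimator(xs + [x], ps[i + 1:], t, alpha)))
    return xs

ps = [0.3] * 20
xs = derandomize(ps, delta=1.0)
assert sum(xs) < (1 + 1.0) * sum(ps)  # guaranteed whenever E[f] < 1
```

On this instance the output is the all-zeros vector, but the identical greedy loop drives the congestion-minimization derandomization below, where the right choice is far from obvious.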

2.1. Congestion Minimization

In Lecture 3 we gave a randomized algorithm which gives a ${O(\log n / \log \log n)}$ approximation to the congestion minimization problem. We now get a deterministic algorithm by the method of pessimistic estimators.

Recall that an instance of the problem consists of a directed graph ${G=(V,A)}$ with ${n = |V|}$ and a sequence ${(s_1,t_1), \ldots, (s_k,t_k)}$ of pairs of vertices. We want to find ${s_i}$–${t_i}$ paths such that each arc ${a}$ is contained in few paths. Let ${{\mathcal P}_i}$ be the set of all paths in ${G}$ from ${s_i}$ to ${t_i}$. For every path ${P \in {\mathcal P}_i}$, we create a variable ${x_P^i}$.

We obtain a fractional solution to the problem by solving this LP.

$\displaystyle \begin{array}{llll} \mathrm{min} & C && \\ \mathrm{s.t.} & {\displaystyle \sum_{P \in {\mathcal P}_i}} x_P^i &= 1 &\qquad\forall i=1,\ldots,k \\ & {\displaystyle \sum_i ~~ \sum_{P \in {\mathcal P}_i \mathrm{~with~} a \in P}} x_P^i &\leq C &\qquad\forall a \in A \\ & x_P^i &\geq 0 &\qquad\forall i=1,\ldots,k \mathrm{~and~} P \in {\mathcal P}_i \end{array}$

Let ${C^*}$ be the optimal value of the LP.

We showed how randomized rounding gives us an integer solution (i.e., an actual set of paths). The algorithm chooses exactly one path ${P_i}$ from ${{\mathcal P}_i}$ by setting ${P_i = P}$ with probability ${x^i_P}$. For every arc ${a}$ let ${Y_i^a}$ be the indicator of the event “${a \in P_i}$”. Then the congestion on arc ${a}$ is ${Y^a = \sum_i Y_i^a}$. We showed that ${{\mathrm E}[ Y^a ] \leq C^*}$. Let ${\alpha = 6 \log n / \log \log n}$. We applied Chernoff bounds to every arc and a union bound to show that

$\displaystyle {\mathrm{Pr}}[~ \mathrm{any} ~a~ \mathrm{has}~ Y^a > \alpha C^* ~] ~\leq~ \sum_{a \in A} {\mathrm{Pr}}[~ Y^a > \alpha C^* ~] ~\leq~ 1/n.$

We will derandomize that algorithm with the function

$\displaystyle f(P_1,\ldots,P_k) ~=~ \sum_{a \in A} e^{-t \alpha} \cdot \prod_{j=1}^k e^{t Y_j^a}.$

How did we obtain this function? For each arc ${a}$ we applied a Chernoff bound, so each arc ${a}$ has a pessimistic estimator as in (2). We add all of those functions to give us this function ${f}$.

Note that

$\displaystyle \textstyle {\mathrm{Pr}}[~ \mathrm{any} ~a~ \mathrm{has}~ Y^a > \alpha C^* ~] ~\leq~ \sum_{a \in A} {\mathrm{Pr}}[~ Y^a > \alpha C^* ~] ~\leq~ {\mathrm E}[ f(P_1,\ldots,P_k) ] ~\leq~ 1/n.$

Applying the method of conditional expectations, we can find a vector of paths ${(p_1,\ldots,p_k)}$ for which ${f(p_1,\ldots,p_k) \leq 1/n}$. Thus,

$\displaystyle \begin{array}{rcl} {\mathrm{Pr}}[~ \mathrm{any} ~a~ \mathrm{has}~ Y^a > \alpha C^* \:|\: P_1=p_1,\ldots,P_k=p_k ~] &\leq& {\mathrm E}[~ f(P_1,\ldots,P_k) \:|\: P_1=p_1,\ldots,P_k=p_k ~] \\ &=& f(p_1,\ldots,p_k) ~\leq~ 1/n. \end{array}$

Under that conditional distribution there is no randomness left, so the event “any ${a}$ has ${Y^a > \alpha C^*}$” must have probability ${0}$. So, if we choose the paths ${p_1,\ldots,p_k}$ then every arc ${a}$ has congestion at most ${\alpha C^*}$, as desired.
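To make the conditional-expectation step concrete, here is a minimal Python sketch on a toy instance. The representation is our own illustration, not the lecture's notation: each commodity is a list of `(path, probability)` pairs, a path is a set of arcs, and `bound` stands in for ${\alpha C^*}$.

```python
import math

def cond_expectation(fixed, dists, arcs, t, bound):
    """E[f | the first len(fixed) paths are fixed]; the expectation over
    the remaining commodities factors into a product by independence."""
    total = 0.0
    for a in arcs:
        term = math.exp(-t * bound)
        for p in fixed:                          # already-chosen paths
            term *= math.exp(t * (a in p))
        for dist in dists[len(fixed):]:          # still-random commodities
            term *= sum(x * math.exp(t * (a in p)) for p, x in dist)
        total += term
    return total

def choose_paths(dists, arcs, t, bound):
    """Method of conditional expectations: fix one path per commodity,
    each time picking the choice minimizing the pessimistic estimator."""
    fixed = []
    for dist in dists:
        best = min((p for p, _ in dist),
                   key=lambda p: cond_expectation(fixed + [p], dists, arcs, t, bound))
        fixed.append(best)
    return fixed

# Toy instance: two commodities, two arcs, each commodity splits 1/2-1/2.
arcs = ["a", "b"]
dists = [[(frozenset({"a"}), 0.5), (frozenset({"b"}), 0.5)],
         [(frozenset({"a"}), 0.5), (frozenset({"b"}), 0.5)]]
paths = choose_paths(dists, arcs, t=1.0, bound=1.5)
```

Since each greedy choice is a minimum over options whose average is the current conditional expectation, the final value ${f(p_1,\ldots,p_k)}$ never exceeds ${{\mathrm E}[f]}$.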


## Lecture 15: Low-rank approximation of matrices

1. Low-rank approximation of matrices

Let ${A}$ be an arbitrary ${n \times m}$ matrix. We assume ${n \leq m}$. We consider the problem of approximating ${A}$ by a low-rank matrix. For example, we could seek to find a rank ${s}$ matrix ${B}$ minimizing ${ \lVert A - B \rVert }$.

It is known that a truncated singular value decomposition gives an optimal solution to this problem. Formally, let ${A = U \Sigma V ^T}$ be the singular value decomposition of ${A}$. Let ${\sigma_1(A) \geq \cdots \geq \sigma_n(A)}$ be the singular values (i.e., diagonal entries of ${\Sigma}$.) Let ${u_1, \ldots, u_n}$ be the left singular vectors (i.e., columns of ${U}$). Let ${v_1, \ldots, v_n}$ be the right singular vectors (i.e., columns of ${V}$).

Fact 1 ${ \sum_{i=1}^s \sigma_i u_i v_i ^T }$ is a solution to ${ \min_{B \::\: \mathrm{rank}(B) \leq s} \lVert A - B \rVert }$, and the minimum value equals ${\sigma_{s+1}}$.

Another way of stating this same fact is as follows.

Fact 2 Let ${V_s = [ v_1, \ldots, v_s ]}$ be the ${m \times s}$ matrix whose columns are the top ${s}$ right singular vectors. Let ${P_s = V_s V_s ^T}$ be the orthogonal projection onto the span of ${\{ v_1, \ldots, v_s \}}$. Then ${ B = A P_s}$ is a solution to ${ \min_{B \::\: \mathrm{rank}(B) \leq s} \lVert A - B \rVert }$. Furthermore, ${\lVert A - A P_s \rVert = \sigma_{s+1}}$.

The SVD can be computed in ${O(mn^2)}$ time. (Strictly speaking, this is not correct — the singular values can be irrational, so realistically we can only compute an ${\epsilon}$-approximate SVD.) With the recent trend towards analyzing “big data”, a running time of ${O(mn^2)}$ might be too slow.
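Fact 2 is easy to check numerically. The following NumPy sketch (on an arbitrary random matrix) verifies that projecting onto the span of the top ${s}$ right singular vectors achieves spectral-norm error exactly ${\sigma_{s+1}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s = 20, 30, 5
A = rng.standard_normal((n, m))

# Rows of Vt are the right singular vectors; singular values come sorted.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# P_s: orthogonal projection onto the span of the top s right singular vectors.
Vs = Vt[:s].T                  # m x s
Ps = Vs @ Vs.T                 # m x m

# Fact 2: the error of the best rank-s approximation is sigma_{s+1}.
err = np.linalg.norm(A - A @ Ps, 2)
assert np.isclose(err, sigma[s])   # sigma[s] is sigma_{s+1} (0-indexed)
```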

In the past 15 years there has been a lot of work on sophisticated algorithms to quickly compute low-rank approximations. For example, one could measure the approximate error in different norms, sample more or fewer vectors, improve the running time, reduce the number of passes over the data, improve numerical stability, etc. Much more information can be found in the survey of Mahoney, the review article of Halko-Martinsson-Tropp, the PhD thesis of Boutsidis, etc.

2. Rudelson & Vershynin’s Algorithm

Let ${A}$ be an ${n \times m}$ matrix. The Frobenius norm ${\lVert A \rVert_F}$ is defined by

$\displaystyle \lVert A \rVert_F^2 ~:=~ \mathop{\mathrm{tr}}\,( A A ^T ) ~=~ \sum_{i,j} A_{i,j}^2 ~=~ \sum_{i} \sigma_i^2.$

The stable rank (or numerical rank) of ${A}$ is

$\displaystyle \frac{ \lVert A \rVert_F^2 }{ \lVert A \rVert^2 } ~=~ \frac{ \sum_{i} \sigma_i^2 }{ \max_i \sigma_i^2 }.$

Clearly the stable rank cannot exceed the usual rank, which is the number of strictly positive singular values. The stable rank is a useful surrogate for the rank because it is largely unaffected by tiny singular values.
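A quick numerical illustration of the difference between rank and stable rank, using a matrix with one tiny singular value:

```python
import numpy as np

# Rank 3, but the tiny third singular value barely affects the stable rank.
A = np.diag([1.0, 1.0, 1e-6])
stable_rank = np.linalg.norm(A, 'fro')**2 / np.linalg.norm(A, 2)**2
rank = np.linalg.matrix_rank(A)   # 3, since 1e-6 exceeds the default tolerance
```

Here `stable_rank` is essentially ${2}$ while the usual rank is ${3}$.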

Our theorem will be invariant under scaling of ${A}$, so for convenience let us assume that ${\lVert A \rVert = 1}$.

Let ${r = \lVert A \rVert_F^2}$ denote the stable rank of ${A}$ and let ${a_1,\ldots,a_n}$ be the rows of ${A}$. We consider the following algorithm for computing a low-rank approximation ${\tilde{A}}$ to ${A}$.

• Initially ${\tilde{A}}$ is the empty matrix.

• Fix any ${k \geq 32 r \log(n) / \epsilon^4}$. (Here we are assuming that the algorithm knows ${r}$, or at least reasonable bounds on ${r}$.)
• For ${i=1,\ldots,k}$
• Pick a row ${a_i}$ with probability ${\lVert a_i \rVert^2 / \lVert A \rVert_F^2}$.
• Add the row ${\frac{ \lVert A \rVert_F }{ \sqrt{k} \, \lVert a_i \rVert } a_i}$ to ${\tilde{A}}$.

• Compute the SVD of ${\tilde{A}}$.

The runtime of this algorithm is dominated by two main tasks. (1) The computation of the sampling probabilities. This can be done in time linear in the number of non-zero entries of ${A}$. (2) Computing the SVD of ${\tilde{A}}$. Since ${\tilde{A}}$ has size ${k \times m}$, this takes ${O(mk^2) = O(m \cdot \mathrm{poly}(r,\log n,1/\epsilon) )}$ time.
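The sampling loop above can be sketched in NumPy as follows (a minimal illustration; the function name and the vectorized draw of all ${k}$ rows at once are ours):

```python
import numpy as np

def sample_rows(A, k, rng):
    """Sample k rescaled rows of A: row a_i is picked with probability
    ||a_i||^2 / ||A||_F^2 and rescaled by ||A||_F / (sqrt(k) ||a_i||)."""
    row_norms2 = np.einsum('ij,ij->i', A, A)      # squared row norms
    fro2 = row_norms2.sum()                       # ||A||_F^2
    probs = row_norms2 / fro2
    idx = rng.choice(A.shape[0], size=k, p=probs) # k draws, with replacement
    scale = np.sqrt(fro2) / (np.sqrt(k) * np.sqrt(row_norms2[idx]))
    return scale[:, None] * A[idx]

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))
A_tilde = sample_rows(A, k=400, rng=rng)
```

By construction every sampled row of ${\tilde{A}}$ has squared norm ${\lVert A \rVert_F^2 / k}$, so ${\lVert \tilde{A} \rVert_F = \lVert A \rVert_F}$, and the rescaling makes ${{\mathrm E}[\tilde{A} ^T \tilde{A}] = A ^T A}$.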

Theorem 3 Fix any ${s \in \{1,\ldots,n\}}$. Let ${P_s}$ be the orthogonal projection onto the span of the top ${s}$ right singular vectors of ${\tilde{A}}$. With probability at least ${1-2/n}$,

$\displaystyle \lVert A - A P_s \rVert ~\leq~ \sigma_{s+1}(A) + \epsilon \lVert A \rVert.$

In other words, the best rank ${s}$ projection obtained from ${\tilde{A}}$ does nearly as well as the best rank ${s}$ projection obtained from ${A}$. (Compare against Fact 2.) Since our algorithm explicitly computes the SVD of ${\tilde{A}}$, it can easily compute the matrix ${P_s}$. We can then use ${P_s}$ to efficiently compute an approximate SVD of ${A}$ as well; see the survey of Halko, Martinsson and Tropp.

Corollary 4 Set ${s = r / \epsilon^2}$. Let ${P_s}$ be the orthogonal projection onto the span of the top ${s}$ right singular vectors of ${\tilde{A}}$. Then, with probability at least ${1-2/n}$,

$\displaystyle \lVert A - A P_s \rVert ~\leq~ 2 \epsilon \lVert A \rVert. \ \ \ \ \ (1)$

Let us contrast the error guarantee of (1) with the guarantee we achieved in the previous lecture. Last time we sampled a matrix ${L_w}$ and showed that it approximates ${L_G}$ in the sense that

$\displaystyle \big|\: x ^T L_G x - x ^T L_w x \:\big| ~\leq~ \epsilon x ^T L_G x \qquad \forall x \in {\mathbb R}^n.$

We say that our result from last time achieves a “multiplicative” or “relative” error guarantee. In contrast, Corollary 4 only guarantees that

$\displaystyle \big|\: \lVert A x \rVert - \lVert A P_s x \rVert \:\big| ~\leq~ 2 \epsilon \lVert A \rVert \qquad\forall x \in {\mathbb R}^n \mathrm{~with~} \lVert x \rVert = 1,$

even though ${\lVert A x \rVert}$ may be significantly smaller than ${\epsilon \lVert A \rVert}$. We say that today’s theorem only achieves an “additive” or “absolute” error guarantee. To prove today’s theorem, we will use a version of the Ahlswede-Winter inequality that provides an additive error guarantee.

Theorem 5 Let ${Y}$ be a random, symmetric, positive semi-definite ${n \times n}$ matrix such that ${\lVert {\mathrm E}[ Y ] \rVert \leq 1}$. Suppose ${\lVert Y \rVert \leq R}$ for some fixed scalar ${R \geq 1}$. Let ${Y_1, \ldots, Y_k}$ be independent copies of ${Y}$ (i.e., independently sampled matrices with the same distribution as ${Y}$). For any ${\epsilon \in (0,1)}$, we have

$\displaystyle {\mathrm{Pr}}\Bigg[~ \Big\lVert \frac{1}{k} \sum_{i=1}^k Y_i - {\mathrm E}[Y_i] \Big\rVert > \epsilon ~\Bigg] ~\leq~ 2n \cdot \exp( - k \epsilon^2 / 4R ).$

The proof is almost identical to the proof of Theorem 1 in Lecture 13. The only difference is that the final sentence of that proof should be deleted.

Our Theorem 3 is actually weaker than Rudelson & Vershynin’s result. They show that one can take ${k}$ to be roughly ${r \log r / \epsilon^4}$, which is quite remarkable because it is “dimension free”: the number of samples does not depend on the dimension ${n}$. Unfortunately our proof, which uses little more than the Ahlswede-Winter inequality, does not give that stronger bound because the failure probability in the Ahlswede-Winter inequality depends on the dimension. Rudelson & Vershynin prove an (additive error) variant of Ahlswede-Winter which avoids this dependence on the dimension. Oliveira 2010 and Hsu-Kakade-Zhang 2011 give further progress in this direction.

2.1. Proofs

The proof of Theorem 3 follows quite straightforwardly from Theorem 5 and the following lemma, which we prove later.

Lemma 6 Let ${B}$ be an arbitrary ${n \times m}$ matrix. Let ${P_s}$ be the orthogonal projection onto the span of the top ${s}$ right singular vectors of ${B}$. Then

$\displaystyle \lVert A - A P_s \rVert^2 ~\leq~ \sigma_{s+1}(A)^2 + 2 \lVert A ^T A - B ^T B \rVert.$

Proof: (of Theorem 3.) We defined ${a_1,\ldots,a_n}$ to be the rows of ${A}$, but let us now transpose them to become column vectors, so

$\displaystyle A ^T A ~=~ \sum_{i=1}^n a_i a_i ^T.$

Let ${Y_1,\ldots,Y_k}$ be independent, identically distributed random matrices with the following distribution:

$\displaystyle {\mathrm{Pr}}\Bigg[~ Y_j ~=~ \lVert A \rVert_F^2 \cdot \frac{ a_i a_i ^T}{ a_i ^T a_i } ~\Bigg] ~=~ \frac{a_i ^T a_i}{ \lVert A \rVert_F^2 }.$

This is indeed a probability distribution because

$\displaystyle \lVert A \rVert_F^2 ~=~ \mathop{\mathrm{tr}}\,(A ^T A) ~=~ \mathop{\mathrm{tr}}\,(\sum_i a_i a_i ^T) ~=~ \sum_i a_i ^T a_i.$

Note that the change to ${\tilde{A} ^T \tilde{A}}$ during the ${j}$th iteration of the algorithm is ${Y_j / k}$.

We will apply Theorem 5 to ${Y_1,\ldots, Y_k}$. We have

$\displaystyle {\mathrm E}[ Y_j ] ~=~ \sum_{i=1}^n \frac{a_i ^T a_i}{ \lVert A \rVert_F^2 } \cdot \Bigg(\lVert A \rVert_F^2 \cdot \frac{ a_i a_i ^T}{ a_i ^T a_i } \Bigg) ~=~ \sum_{i=1}^n a_i a_i ^T ~=~ A ^T A,$

so ${\lVert {\mathrm E}[ Y_j ] \rVert \leq 1}$. We may take ${R := \lVert Y_j \rVert = \lVert A \rVert_F^2 = r}$, the stable rank of ${A}$ (since we assume ${\lVert A \rVert = 1}$).

Since ${\tilde{A} ^T \tilde{A} = \sum_{j=1}^k Y_j / k}$ and ${{\mathrm E}[ Y_j ] = A ^T A}$, we get

$\displaystyle {\mathrm{Pr}}\Bigg[~ \Big\lVert \tilde{A} ^T \tilde{A} - A ^T A \Big\rVert > \epsilon^2 / 2 ~\Bigg] ~=~ {\mathrm{Pr}}\Bigg[~ \Big\lVert \frac{1}{k} \sum_{j=1}^k Y_j - {\mathrm E}[ Y_j ] \Big\rVert > \epsilon^2 / 2 ~\Bigg] ~\leq~ 2n \cdot \exp( - k \epsilon^4 / 16r ) ~\leq~ 2/n,$

by Theorem 5.

Now we apply Lemma 6 to ${A}$ and ${\tilde{A}}$. So, with probability at least ${1-2/n}$,

$\displaystyle \lVert A - A P_s \rVert^2 ~\leq~ \sigma_{s+1}(A)^2 + \epsilon^2 ~\leq~ (\sigma_{s+1}(A) + \epsilon)^2.$

Taking square roots completes the proof. $\Box$

${~}$

Proof: (of Corollary 4). By the ordering of the singular values,

$\displaystyle s \sigma_s(A)^2 ~\leq~ \sum_{i=1}^s \sigma_i(A)^2 ~\leq~ \lVert A \rVert_F^2,$

implying ${\sigma_s(A)^2 \leq \lVert A \rVert_F^2 / s}$. In particular, if ${s = \lVert A \rVert_F^2 / \epsilon^2 \lVert A \rVert^2 = r / \epsilon^2}$ then ${ \sigma_s(A) ~\leq~ \epsilon \lVert A \rVert }$. $\Box$
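The inequality ${\sigma_s(A)^2 \leq \lVert A \rVert_F^2 / s}$ used above is easy to sanity-check numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 12))
sigma = np.linalg.svd(A, compute_uv=False)   # sorted in decreasing order
fro2 = (sigma ** 2).sum()                    # ||A||_F^2

# s * sigma_s^2 <= sum_{i<=s} sigma_i^2 <= ||A||_F^2 for every s.
for s in range(1, len(sigma) + 1):
    assert s * sigma[s - 1] ** 2 <= fro2 + 1e-9
```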

${~}$

Proof: (of Lemma 6). Let ${Q_s}$ be the orthogonal projection onto the kernel of ${P_s}$ (i.e., the span of the bottom ${m-s}$ right singular vectors of ${B}$). Then

$\displaystyle \lVert A - A P_s \rVert ~=~ \lVert A(I - P_s) \rVert ~=~ \lVert A Q_s \rVert ~=~ \sup_{x \::\: \lVert x \rVert = 1} \lVert A Q_s x \rVert ~=~ \sup_{x \in \mathrm{span}(Q_s) \::\: \lVert x \rVert = 1} \lVert A x \rVert.$

So,

$\displaystyle \begin{array}{rcl} \lVert A - A P_s \rVert^2 &=& \sup_{x \in \mathrm{ker}(P_s) \::\: \lVert x \rVert = 1} \lVert A x \rVert^2 \\ &=& \sup_{x \in \mathrm{ker}(P_s) \::\: \lVert x \rVert = 1} \langle A ^T A x, x \rangle \\ &\leq& \sup_{x \in \mathrm{ker}(P_s) \::\: \lVert x \rVert = 1} \langle (A ^T A - B ^T B) x, x \rangle ~+~ \sup_{x \in \mathrm{ker}(P_s) \::\: \lVert x \rVert = 1} \langle B ^T B x, x \rangle \\ &\leq& \lVert A ^T A - B ^T B \rVert ~+~ \sigma_{s+1}(B ^T B) \\ &\leq& 2 \lVert A ^T A - B ^T B \rVert ~+~ \sigma_{s+1}(A ^T A), \end{array}$

where the last step uses the following fact. $\Box$

Fact 7 Let ${X}$ and ${Y}$ be symmetric matrices of the same size. Let ${\lambda_i(X)}$ and ${\lambda_i(Y)}$ respectively denote the ${i}$th largest eigenvalues of ${X}$ and ${Y}$. Then

$\displaystyle \max_i |\lambda_i(X) - \lambda_i(Y)| ~\leq~ \lVert X-Y \rVert.$

Proof: See Horn & Johnson, “Matrix Analysis”, page 370. $\Box$
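Fact 7 is Weyl's perturbation inequality. A quick numerical check on random symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
X = rng.standard_normal((d, d)); X = (X + X.T) / 2   # random symmetric matrices
Y = rng.standard_normal((d, d)); Y = (Y + Y.T) / 2

lam_x = np.linalg.eigvalsh(X)[::-1]   # eigenvalues, largest first
lam_y = np.linalg.eigvalsh(Y)[::-1]

# Fact 7: matched sorted eigenvalues move by at most ||X - Y||.
assert np.max(np.abs(lam_x - lam_y)) <= np.linalg.norm(X - Y, 2) + 1e-9
```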

## Lecture 14: Spectral sparsifiers

1. Spectral Sparsifiers

1.1. Graph Laplacians

Let ${G=(V,E)}$ be an unweighted graph. For notational simplicity, we will think of the vertex set as ${V = \{1,\ldots,n\}}$. Let ${e_i \in {\mathbb R}^n}$ be the ${i}$th standard basis vector, meaning that ${e_i}$ has a ${1}$ in the ${i}$th coordinate and ${0}$s in all other coordinates. For an edge ${uv \in E}$, define the vector ${x_{uv}}$ and the matrix ${X_{uv}}$ as follows:

$\displaystyle \begin{array}{rcl} x_{uv} &:=& e_u - e_v \\ X_{uv} &:=& x_{uv} x_{uv} ^T \end{array}$

In the definition of ${x_{uv}}$ it does not matter which vertex gets the ${+1}$ and which gets the ${-1}$ because the matrix ${X_{uv}}$ is the same either way.

Definition 1 The Laplacian matrix of ${G}$ is the matrix

$\displaystyle L_G ~:=~ \sum_{uv \in E} X_{uv}$


Note that each matrix ${X_{uv}}$ has only four non-zero entries: its ${(u,u)}$ and ${(v,v)}$ entries equal ${1}$, and its ${(u,v)}$ and ${(v,u)}$ entries equal ${-1}$. Consequently, the ${u}$th diagonal entry of ${L_G}$ is simply the degree of vertex ${u}$. Moreover, we have the following fact.

Fact 2 Let ${D}$ be the diagonal matrix with ${D_{u,u}}$ equal to the degree of vertex ${u}$. Let ${A}$ be the adjacency matrix of ${G}$. Then ${L_G = D - A}$.
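Fact 2 can be verified by building ${L_G}$ both ways. A small sketch on a triangle, constructing ${L_G}$ as a sum of the rank-one matrices ${X_{uv}}$:

```python
import numpy as np

def laplacian(n, edges):
    """Sum of the rank-one matrices X_uv = x_uv x_uv^T over the edges."""
    L = np.zeros((n, n))
    for u, v in edges:
        x = np.zeros(n)
        x[u], x[v] = 1.0, -1.0
        L += np.outer(x, x)
    return L

# Triangle on vertices {0, 1, 2}.
edges = [(0, 1), (1, 2), (0, 2)]
L = laplacian(3, edges)

# Fact 2: L_G = D - A for the degree and adjacency matrices.
A = np.zeros((3, 3))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
D = np.diag(A.sum(axis=1))
assert np.allclose(L, D - A)
```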

If ${G}$ had weights ${w : E \rightarrow {\mathbb R}}$ on the edges we could define the weighted Laplacian as follows:

$\displaystyle L_G ~=~ \sum_{uv \in E} w_{uv} \cdot X_{uv}.$

Claim 3 Let ${G=(V,E)}$ be a graph with non-negative weights ${w : E \rightarrow {\mathbb R}}$. Then the weighted Laplacian ${L_G}$ is positive semi-definite.

Proof: Since ${X_{uv} = x_{uv} x_{uv} ^T}$, it is positive semi-definite. So ${L_G}$ is a weighted sum of positive semi-definite matrices with non-negative coefficients. Fact 5 in the Notes on Symmetric Matrices implies ${L_G}$ is positive semi-definite. $\Box$

The Laplacian can tell us many interesting things about the graph. For example:

Claim 4 Let ${G=(V,E)}$ be a graph with Laplacian ${L_G}$. For any ${U \subseteq V}$, let ${\chi(U) \in {\mathbb R}^n}$ be the characteristic vector of ${U}$, i.e., the vector with ${\chi(U)_v}$ equal to ${1}$ if ${v \in U}$ and equal to ${0}$ otherwise. Let ${\delta(U)}$ denote the set of edges with exactly one endpoint in ${U}$. Then ${\chi(U) ^T \, L_G \, \chi(U) = | \delta(U) |}$.

Proof: For any edge ${uv}$ we have ${ \chi(U) ^T \, X_{uv} \, \chi(U) = ( \chi(U) ^T \, x_{uv} )^2 }$. But ${| \chi(U) ^T \, x_{uv} |}$ is ${1}$ if exactly one of ${u}$ or ${v}$ is in ${U}$, and otherwise it is ${0}$. So ${\chi(U) ^T \, X_{uv} \, \chi(U) = 1}$ if ${uv \in \delta(U)}$, and otherwise it is ${0}$. Summing over all edges proves the claim. $\Box$

Similarly, if ${G=(V,E)}$ is a graph with edge weights ${w : E \rightarrow {\mathbb R}}$ and ${L_G}$ is the weighted Laplacian, then ${\chi(U) ^T \, L_G \, \chi(U) = w( \delta(U) )}$.

Fact 5 If ${G}$ is connected then ${\mathrm{image}(L_G) ~=~ \{\: x \::\: \sum_i x_i = 0 \:\} }$, which is an ${(n-1)}$-dimensional subspace.

1.2. Main Theorem

Theorem 6 Let ${G=(V,E)}$ be a graph with ${n = |V|}$. There is a randomized algorithm to compute weights ${w : E \rightarrow {\mathbb R}}$ such that:

• only ${O(n \log n / \epsilon^2)}$ of the weights are non-zero, and
• with probability at least ${1-2/n}$,

$\displaystyle (1-\epsilon) \cdot L_G ~\preceq~ L_w ~\preceq~ (1+\epsilon) \cdot L_G,$

where ${L_w}$ denotes the weighted Laplacian of ${G}$ with weights ${w}$. By Fact 4 in Notes on Symmetric Matrices, this is equivalent to

$\displaystyle (1-\epsilon) x ^T L_G x ~\leq~ x ^T L_w x ~\leq~ (1+\epsilon) x ^T L_G x \qquad\forall x \in {\mathbb R}^n. \ \ \ \ \ (1)$

By (1) and Claim 4, the resulting weights are a graph sparsifier of ${G}$:

$\displaystyle (1-\epsilon) \cdot |\delta(U)| ~\leq~ w(\delta(U)) ~\leq~ (1+\epsilon) \cdot |\delta(U)| \qquad\forall U \subseteq V.$

The algorithm that proves Theorem 6 is as follows.

• Initially ${w = 0}$.
• Set ${k=8 n \log(n) / \epsilon^2}$.
• For every edge ${e \in E}$ compute ${r_e = \mathop{\mathrm{tr}}\,( X_e L_G^+ )}$.
• For ${i=1,\ldots,k}$
• ${\quad}$ Let ${e}$ be a random edge chosen with probability ${r_e/(n-1)}$.
• ${\quad}$ Increase ${w_e}$ by ${\frac{n-1}{r_e \, k}}$.

Claim 7 The values ${\{ r_e/(n-1) \::\: e \in E \}}$ indeed form a probability distribution.
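Here is a compact NumPy sketch of the algorithm above (a naive dense implementation whose names are ours; it computes the values ${r_e}$ via the pseudoinverse, whereas practical implementations approximate them much faster). It also lets us sanity-check Claim 7.

```python
import numpy as np

def sparsify(n, edges, k, rng):
    """Sample k edges with probability r_e/(n-1) and accumulate weights w."""
    xs = []
    for u, v in edges:
        x = np.zeros(n)
        x[u], x[v] = 1.0, -1.0
        xs.append(x)
    L = sum(np.outer(x, x) for x in xs)      # the Laplacian L_G
    Lp = np.linalg.pinv(L)                   # pseudoinverse L_G^+
    r = np.array([x @ Lp @ x for x in xs])   # r_e = tr(X_e L_G^+) = x_e^T L_G^+ x_e
    probs = r / (n - 1)                      # Claim 7: these sum to 1
    w = np.zeros(len(edges))
    for e in rng.choice(len(edges), size=k, p=probs):
        w[e] += (n - 1) / (r[e] * k)
    return w, probs

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # a 4-cycle plus a chord
w, probs = sparsify(4, edges, k=200, rng=rng)
```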

Proof: (of Theorem 6). How does the matrix ${L_w}$ change during the ${i}$th iteration? The edge ${e}$ is chosen with probability ${\frac{r_e}{n-1}}$ and then ${L_w}$ increases by ${\frac{n-1}{r_e \cdot k} X_e}$. Let ${Z_i}$ be this random change in ${L_w}$ during the ${i}$th iteration. So ${Z_i}$ equals ${\frac{n-1}{r_e \cdot k} X_e}$ with probability ${\frac{r_e}{n-1}}$. The random matrices ${Z_1,\ldots,Z_k}$ are mutually independent and they all have this same distribution. Note that

$\displaystyle {\mathrm E}[Z_i] ~=~ \sum_{e \in E} \frac{r_e}{n-1} \cdot \frac{n-1}{r_e \cdot k} X_e ~=~ \frac{1}{k} \sum_e X_e ~=~ \frac{L_G}{k}.$

The final matrix ${L_w}$ is simply ${\sum_{i=1}^k Z_i}$. To analyze this final matrix, we will use the Ahlswede-Winter inequality. All that we require is the following claim, which we prove later.

Claim 8 ${Z_i \preceq (n-1) \cdot {\mathrm E}[Z_i]}$.

We apply Corollary 2 from the previous lecture with ${R=n-1}$, obtaining

$\displaystyle \begin{array}{rcl} {\mathrm{Pr}}\big[~ (1-\epsilon) L_G \:\preceq\: L_w \:\preceq\: (1+\epsilon) L_G \mathrm{~fails~} \big] &=& {\mathrm{Pr}}\Bigg[~ (1-\epsilon) \frac{L_G}{k} \:\preceq\: \frac{1}{k} \sum_{i=1}^k Z_i \:\preceq\: (1+\epsilon) \frac{L_G}{k} \mathrm{~fails~} \Bigg] \\ &\leq& 2n \cdot \exp\big( - \epsilon^2 k / 4 (n-1) \big) \\ &\leq& 2n \cdot \exp\big( - 2 \ln n \big) ~=~ 2/n. \end{array}$

$\Box$

Proof: (of Claim 7) First we check that the ${r_e}$ values are non-negative. By the cyclic property of trace

$\displaystyle \mathop{\mathrm{tr}}\,( X_e L_G^+ ) ~=~ \mathop{\mathrm{tr}}\,( x_e ^T L_G^+ x_e ) ~=~ x_e ^T L_G^+ x_e.$

This is non-negative since ${L_G^+ \succeq 0}$ because ${L_G \succeq 0}$. Thus ${r_e \geq 0}$. Next, note that

$\displaystyle \sum_e \mathop{\mathrm{tr}}\,( X_e L_G^+ ) ~=~ \mathop{\mathrm{tr}}\,( \sum_e X_e L_G^+ ) ~=~ \mathop{\mathrm{tr}}\,( L_G L_G^+ ) ~=~ \mathop{\mathrm{tr}}\,( I_{\mathrm{im}~L_G} ),$

where ${I_{\mathrm{im}~L_G}}$ is the orthogonal projection onto the image of ${L_G}$. The image has dimension ${n-1}$ by Fact 5, and so

$\displaystyle \sum_e \frac{r_e}{n-1} ~=~ \frac{1}{n-1} \sum_e \mathop{\mathrm{tr}}\,( X_e L_G^+ ) ~=~ \frac{1}{n-1} \mathop{\mathrm{tr}}\,( I_{\mathrm{im}~L_G} ) ~=~ 1.$

$\Box$

Proof: (of Claim 8). The maximum eigenvalue of a positive semi-definite matrix never exceeds its trace, so

$\displaystyle \lambda_{\mathrm{max}}( L_G^{+/2} \cdot X_e \cdot L_G^{+/2} ) ~\leq~ \mathop{\mathrm{tr}}\,( L_G^{+/2} \cdot X_e \cdot L_G^{+/2} ) ~=~ r_e.$

By Fact 8 in the Notes on Symmetric Matrices,

$\displaystyle L_G^{+/2} \cdot X_e \cdot L_G^{+/2} ~\preceq~ r_e \cdot I.$

So, by Fact 4 in the Notes on Symmetric Matrices, for every vector ${v}$,

$\displaystyle v ^T \frac{L_G^{+/2} \cdot X_e \cdot L_G^{+/2}}{r_e} v ~\leq~ v ^T v.$

Now let us write ${v=v_1 + v_2}$ where ${v_1 = I_{\mathrm{im}~L_G} \, v}$ is the projection onto the image of ${L_G}$ and ${v_2 = I_{\mathrm{ker}~L_G} \, v}$ is the projection onto the kernel of ${L_G}$. Then ${L_G \, v_2 = 0}$ and ${L_G^{+/2} \, v_2 = 0}$. So

$\displaystyle \begin{array}{rcl} v ^T \frac{L_G^{+/2} \cdot X_e \cdot L_G^{+/2}}{r_e} v &=& v_1 ^T \frac{L_G^{+/2} \cdot X_e \cdot L_G^{+/2}}{r_e} v_1 ~+~ \underbrace{v_2 ^T \frac{L_G^{+/2} \cdot X_e \cdot L_G^{+/2}}{r_e} v_2}_{=0} \\ &=& v_1 ^T \frac{L_G^{+/2} \cdot X_e \cdot L_G^{+/2}}{r_e} v_1 \\ &\leq& v_1 ^T v_1 ~=~ v ^T I_{\mathrm{im}~L_G} v. \end{array}$

Since this holds for every vector ${v}$, Fact 4 in the Notes on Symmetric Matrices again implies

$\displaystyle \frac{L_G^{+/2} \cdot X_e \cdot L_G^{+/2}}{r_e} ~\preceq~ I_{\mathrm{im}~L_G}.$

Since ${\mathrm{im}~X_e \subseteq \mathrm{im}~L_G}$, Claim 16 in the Notes on Symmetric Matrices shows this is equivalent to ${X_e / r_e \preceq L_G}$. Multiplying both sides by ${\frac{n-1}{k}}$ gives

$\displaystyle \frac{n-1}{r_e \cdot k} X_e ~\preceq~ \frac{n-1}{k} L_G.$

This completes the proof of the claim. $\Box$

## Lecture 13: The Ahlswede-Winter Inequality

1. Useful versions of the Ahlswede-Winter Inequality

Theorem 1 Let ${Y}$ be a random, symmetric, positive semi-definite ${d \times d}$ matrix such that ${{\mathrm E}[ Y ] = I}$. Suppose ${\lVert Y \rVert \leq R}$ for some fixed scalar ${R \geq 1}$. Let ${Y_1, \ldots, Y_k}$ be independent copies of ${Y}$ (i.e., independently sampled matrices with the same distribution as ${Y}$). For any ${\epsilon \in (0,1)}$, we have

$\displaystyle {\mathrm{Pr}}\Bigg[ (1-\epsilon) I \:\preceq\: \frac{1}{k} \sum_{i=1}^k Y_i \:\preceq\: (1+\epsilon) I \Bigg] ~\geq~ 1 - 2d \cdot \exp( - \epsilon^2 k / 4 R ).$

This event is equivalent to the sample average ${\frac{1}{k} \sum_{i=1}^k Y_i}$ having minimum eigenvalue at least ${1-\epsilon}$ and maximum eigenvalue at most ${1+\epsilon}$.

Proof: We apply the Ahlswede-Winter inequality with ${ X_i = \big(Y_i - {\mathrm E}[ Y_i ] \big) / R}$. Note that ${{\mathrm E}[ X_i ] = 0}$, ${\lVert X_i \rVert \leq 1}$, and

$\displaystyle \begin{array}{rcl} {\mathrm E}[ X_i^2 ] &=& \frac{1}{R^2} {\mathrm E}\big[ \big(Y_i - {\mathrm E}[Y_i] )^2 \big] \\ &=& \frac{1}{R^2} \Big( {\mathrm E}[ Y_i^2 ] - 2 {\mathrm E}[Y_i]^2 + {\mathrm E}[Y_i]^2 \Big) \\ &\preceq& \frac{1}{R^2} {\mathrm E}[ Y_i^2 ] \qquad(\mathrm{since}~ {\mathrm E}[ Y_i ]^2 \succeq 0)\\ &\preceq& \frac{1}{R^2} {\mathrm E}[ \lVert Y_i \rVert \cdot Y_i ] \\ &\preceq& \frac{R}{R^2} {\mathrm E}[ Y_i ] \end{array}$

Finally, since ${0 \preceq {\mathrm E}[Y_i] \preceq I}$, we get

$\displaystyle \lambda_{\mathrm{max}} \big( {\mathrm E}[X_i^2] \big) ~\leq~ 1/R. \ \ \ \ \ (1)$

Now we use Claim 15 from the Notes on Symmetric Matrices, together with the inequalities

$\displaystyle \begin{array}{rcl} 1 + x &\leq& e^x \quad\forall x \in {\mathbb R} \\ e^x &\leq& 1 + x + x^2 \quad\forall x \in [-1,1]. \end{array}$

Since ${\lVert X_i \rVert \leq 1}$, for any ${\lambda \in [0,1]}$, we have ${ e^{\lambda X_i} \preceq I + \lambda X_i + \lambda^2 X_i^2 }$, and so

$\displaystyle {\mathrm E}[ e^{\lambda X_i} ] ~\preceq~ {\mathrm E}[ I + \lambda X_i + \lambda^2 X_i^2 ] ~\preceq~ I + \lambda^2 {\mathrm E}[ X_i^2 ] ~\preceq~ e^{ \lambda^2 {\mathrm E}[ X_i^2 ] }.$

Thus by (1) we have

$\displaystyle \lVert {\mathrm E}[ e^{\lambda X_i} ] \rVert ~\leq~ \lVert e^{ \lambda^2 {\mathrm E}[ X_i^2 ] } \rVert ~\leq~ e^{ \lambda^2 / R }.$

The same analysis also shows that ${\lVert {\mathrm E}[ e^{-\lambda X_i} ] \rVert \leq e^{ \lambda^2 / R }}$. Substituting these two bounds into the basic Ahlswede-Winter inequality from the previous lecture, we obtain

$\displaystyle {\mathrm{Pr}}\Bigg[~ \Big\lVert \sum_{i=1}^k \frac{1}{R} \big(Y_i - {\mathrm E}[Y_i] \big) \Big\rVert >t ~\Bigg] ~\leq~ 2d \cdot e^{-\lambda t} \prod_{i=1}^k e^{ \lambda^2 / R} ~=~ 2d \cdot \exp( -\lambda t + k \lambda^2 / R ).$

Substituting ${t = k \epsilon / R}$ and ${\lambda = \epsilon/2}$ we get

$\displaystyle {\mathrm{Pr}}\Bigg[~ \Big\lVert \frac{1}{R} \sum_{i=1}^k Y_i - \frac{k}{R} {\mathrm E}[Y_i] \Big\rVert > \frac{k \epsilon}{R} ~\Bigg] ~\leq~ 2d \cdot \exp( - k \epsilon^2 / 4R ).$

Multiplying by ${R/k}$ and using the fact that ${{\mathrm E}[Y_i]=I}$, we have bounded the probability that any eigenvalue of the sample average matrix ${\sum_{i=1}^k Y_i/k}$ is less than ${1-\epsilon}$ or greater than ${1+\epsilon}$. $\Box$

Corollary 2 Let ${Z}$ be a random, symmetric, positive semi-definite ${d \times d}$ matrix. Define ${U := {\mathrm E}[ Z ]}$ and suppose ${Z \preceq R \cdot U}$ for some scalar ${R \geq 1}$. Let ${Z_1, \ldots, Z_k}$ be independent copies of ${Z}$. For any ${\epsilon \in (0,1)}$, we have

$\displaystyle {\mathrm{Pr}}\Bigg[ (1-\epsilon) U \:\preceq\: \frac{1}{k} \sum_{i=1}^k Z_i \:\preceq\: (1+\epsilon) U \Bigg] ~\geq~ 1 - 2d \cdot \exp( - \epsilon^2 k / 4 R ).$

Proof: Let ${U^{+/2} := (U^+)^{1/2}}$ denote the square root of the pseudoinverse of ${U}$. Let ${I_{\mathrm{im}~U}}$ denote the orthogonal projection on the image of ${U}$. Define the random, positive semi-definite matrices

$\displaystyle Y ~:=~ U^{+/2} \cdot Z \cdot U^{+/2} \qquad\mathrm{and}\qquad Y_i ~:=~ U^{+/2} \cdot Z_i \cdot U^{+/2}.$

Because ${Z_i \succeq 0}$ and ${U = {\mathrm E}[Z_i]}$, we have ${\mathrm{im}(Z_i) \subseteq \mathrm{im}(U)}$. So Claim 16 in the Notes on Symmetric Matrices implies

$\displaystyle (1-\epsilon) U \:\preceq\: \frac{1}{k} \sum_{i=1}^k Z_i \:\preceq\: (1+\epsilon) U \qquad\Longleftrightarrow\qquad (1-\epsilon) I_{\mathrm{im}~U} \:\preceq\: \frac{1}{k} \sum_{i=1}^k Y_i \:\preceq\: (1+\epsilon) I_{\mathrm{im}~U}.$

We would like to use Theorem 1 to obtain our desired bound. We just need to check that the hypotheses of the theorem are satisfied. By Fact 6 from the Notes on Symmetric Matrices, we have

$\displaystyle Y ~=~ U^{+/2} \cdot Z \cdot U^{+/2} ~\preceq~ U^{+/2} \cdot (R \cdot U) \cdot U^{+/2} ~=~ R \cdot I_{\mathrm{im}~U},$

showing that ${\lVert Y \rVert \leq R}$. Next,

$\displaystyle {\mathrm E}[Y] ~=~ U^{+/2} \cdot {\mathrm E}[Z] \cdot U^{+/2} ~=~ U^{+/2} \cdot U \cdot U^{+/2} ~=~ I_{\mathrm{im}~U}.$

So the hypotheses of Theorem 1 are almost satisfied, with the small issue that ${{\mathrm E}[Y]}$ is not actually the identity, but merely the identity on the image of ${U}$. But, one may check that the proof of Theorem 1 still goes through as long as every eigenvalue of ${{\mathrm E}[Y]}$ is either ${0}$ or ${1}$, i.e., ${{\mathrm E}[Y]}$ is an orthogonal projection matrix. The details are left as an exercise. $\Box$


## Lecture 12: Concentration for sums of random matrices

1. Concentration for sums of random matrices

Let ${X}$ be a random real matrix of size ${d \times d}$. In other words, we have some probability distribution on the space of all ${d \times d}$ matrices, and we let ${X}$ be a matrix obtained by sampling from that distribution. Alternatively, we can think of ${X}$ as a matrix whose entries are real-valued random variables (that are not necessarily independent).

As usual, the expectation of ${X}$ is simply the weighted average of the possible matrices that ${X}$ could be, i.e., ${{\mathrm E}[X] = \sum_A {\mathrm{Pr}}[ X=A ] \cdot A}$. Alternatively, we can think of ${{\mathrm E}[X]}$ as the matrix whose entries are the expectations of the entries of ${X}$.

Many concentration results are known for matrices whose entries are independent random variables from certain real-valued distributions (e.g., Gaussian, subgaussian, etc.). In fact, in Lecture 8 on Compressed Sensing, we proved concentration of the singular values of a matrix whose entries are independent Gaussians. In this lecture, we will look at random matrices whose entries are not independent, and we will obtain concentration results by summing multiple independent copies of those matrices.

1.1. The Ahlswede-Winter Inequality

The Chernoff bound is a very powerful tool for proving concentration for sums of independent, real-valued random variables. Today we will prove the Ahlswede-Winter inequality, which is a generalization of the Chernoff bound for proving concentration for sums of independent, matrix-valued random variables.

Let ${X_1, \ldots, X_k}$ be random, independent, symmetric matrices of size ${d \times d}$. Define the partial sums ${S_j = \sum_{i=1}^j X_i}$. We would like to analyze the probability that all eigenvalues of ${S_k}$ are at most ${t}$ (i.e., ${S_k \preceq tI}$). For any ${\lambda > 0}$, this is equivalent to all eigenvalues of ${e^{\lambda S_k}}$ being at most ${e^{\lambda t}}$ (i.e., ${e^{\lambda S_k} \preceq e^{\lambda t} I}$). If this event fails to hold then certainly ${\mathop{\mathrm{tr}}\,{e^{\lambda S_k}} > e^{\lambda t}}$, since all eigenvalues of ${e^{\lambda S_k}}$ are non-negative. Thus we have bounded the probability that some eigenvalue of ${S_k}$ is greater than ${t}$ as follows:

$\displaystyle {\mathrm{Pr}}[ S_k \not\preceq tI ] ~\leq~ {\mathrm{Pr}}[ \mathop{\mathrm{tr}}\,{e^{\lambda S_k}} > e^{\lambda t} ] ~\leq~ {\mathrm E}[ \mathop{\mathrm{tr}}\,{e^{\lambda S_k}} ] / e^{\lambda t}, \ \ \ \ \ (1)$

by Markov’s inequality.

Now let us observe a useful property of the trace. Since it is linear, it commutes with expectation:

$\displaystyle \begin{array}{rcl} {\mathrm E}[ \mathop{\mathrm{tr}}\, X ] &=& \sum_{A} {\mathrm{Pr}}[ X=A ] \cdot \sum_i A_{i,i} ~=~ \sum_i \sum_{A} {\mathrm{Pr}}[ X=A ] \cdot A_{i,i} \\ &=& \sum_i \sum_{a} {\mathrm{Pr}}[ X_{i,i}=a ] \cdot a ~=~ \sum_i {\mathrm E}[ X_{i,i} ] ~=~ \mathop{\mathrm{tr}}\,\big( {\mathrm E}[X] \big). \end{array}$

The proof of the Ahlswede-Winter inequality is very similar to the proof of the Chernoff bound; one just has to be a bit careful to do the matrix algebra properly. As in the proof of the Chernoff bound, the main technical step is to bound the expectation in (1) by a product of expectations that each involve a single ${X_i}$, because those individual expectations are much easier to analyze. This is where the Golden-Thompson inequality (Theorem 17 in the Notes on Symmetric Matrices) is needed.

$\displaystyle \begin{array}{rcl} {\mathrm E}[ \mathop{\mathrm{tr}}\,{e^{\lambda S_k}} ] &=& \mathrm{~} {\mathrm E}[ \mathop{\mathrm{tr}}\,{e^{\lambda X_k + \lambda S_{k-1}}} ] \qquad(\mathrm{since}~ S_k = X_k + S_{k-1})\\ ~ &\leq& \mathrm{~} {\mathrm E}[ \mathop{\mathrm{tr}}\,(e^{\lambda X_k} \cdot e^{\lambda S_{k-1}}) ] \qquad(\mathrm{by~Golden-Thompson})\\ ~ &=& \mathrm{~} {\mathrm E}_{X_1,...,X_{k-1}}\big[ {\mathrm E}_{X_k}[ \mathop{\mathrm{tr}}\,(e^{\lambda X_k} \cdot e^{\lambda S_{k-1}}) ] \big] \qquad(\mathrm{by~independence}) \\ ~ &=& \mathrm{~} {\mathrm E}_{X_1,...,X_{k-1}}\Big[ \mathop{\mathrm{tr}}\,\big( {\mathrm E}_{X_k}[ e^{\lambda X_k} \cdot e^{\lambda S_{k-1}} ] \big) \Big] \qquad(\mathrm{trace~and~expectation~commute})\\ ~ &=& \mathrm{~} {\mathrm E}_{X_1,...,X_{k-1}}\Big[ \mathop{\mathrm{tr}}\,\big( {\mathrm E}_{X_k}[ e^{\lambda X_k} ] \cdot e^{\lambda S_{k-1}} \big) \Big] \qquad (X_k ~\mathrm{and}~ S_{k-1} ~\mathrm{are~independent}) \\ ~ &\leq& \mathrm{~} {\mathrm E}_{X_1,...,X_{k-1}}\big[ \lVert {\mathrm E}_{X_k}[ e^{\lambda X_k} ] \rVert \cdot \mathop{\mathrm{tr}}\, e^{\lambda S_{k-1}} \big] \\ ~ &=& \mathrm{~} \lVert {\mathrm E}_{X_k}[ e^{\lambda X_k} ] \rVert \cdot {\mathrm E}_{X_1,...,X_{k-1}}[ \mathop{\mathrm{tr}}\, e^{\lambda S_{k-1}} ], \end{array}$

where the last inequality follows from Corollary 14 in the Notes on Symmetric Matrices. Applying this inequality inductively, we get

$\displaystyle {\mathrm E}[ \mathop{\mathrm{tr}}\,{e^{\lambda S_k}} ] ~\leq~ \prod_{i=1}^k \lVert {\mathrm E}[ e^{\lambda X_i} ] \rVert \cdot \mathop{\mathrm{tr}}\, e^{\lambda 0} ~=~ d \cdot \prod_{i=1}^k \lVert {\mathrm E}[ e^{\lambda X_i} ] \rVert,$

since ${e^{\lambda 0} = I}$ and ${\mathop{\mathrm{tr}}\, I = d}$. Combining this with (1), we obtain

$\displaystyle {\mathrm{Pr}}[ S_k \not \preceq tI ] ~\leq~ d e^{-\lambda t} \prod_{i=1}^k \lVert {\mathrm E}[ e^{\lambda X_i} ] \rVert.$

We can also bound the probability that any eigenvalue of ${S_k}$ is less than ${-t}$ by applying the same argument to ${-S_k}$. This shows that the probability that any eigenvalue of ${S_k}$ lies outside ${[-t,t]}$ is

$\displaystyle {\mathrm{Pr}}[ \lVert S_k \rVert > t ] ~\leq~ d e^{-\lambda t} \Bigg( \prod_{i=1}^k \lVert {\mathrm E}[ e^{\lambda X_i} ] \rVert + \prod_{i=1}^k \lVert {\mathrm E}[ e^{-\lambda X_i} ] \rVert \Bigg). \ \ \ \ \ (2)$

This is the basic inequality. Much like the Chernoff bound, there are many variations. We will see some next time.
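The key step ${{\mathrm E}[ \mathop{\mathrm{tr}}\, e^{\lambda S_k} ] \leq d \cdot \prod_i \lVert {\mathrm E}[ e^{\lambda X_i} ] \rVert}$ can be verified by exact enumeration for a tiny two-point distribution. This is only a sanity-check sketch; the matrices and parameters are arbitrary choices of ours, and `sym_expm` computes the matrix exponential via an eigendecomposition.

```python
import itertools
import numpy as np

def sym_expm(M):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.exp(w)) @ V.T

# A two-point distribution on symmetric 2x2 matrices; k = 3 i.i.d. copies.
M1 = np.array([[1.0, 0.5], [0.5, -1.0]])
M2 = np.array([[-0.5, 0.2], [0.2, 0.5]])
support, k, lam, d = [M1, M2], 3, 0.3, 2

# Exact E[tr e^{lam S_k}]: enumerate all 2^k outcomes, each with prob 2^-k.
lhs = np.mean([np.trace(sym_expm(lam * sum(ms)))
               for ms in itertools.product(support, repeat=k)])

# d * prod_i ||E[e^{lam X_i}]||, with identical factors here.
E_exp = (sym_expm(lam * M1) + sym_expm(lam * M2)) / 2
rhs = d * np.linalg.norm(E_exp, 2) ** k

assert lhs <= rhs + 1e-9
```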