**1. Compressed Sensing**

Let $x \in \mathbb{R}^n$ be some unknown vector. We can learn information about $x$ through the following “measurement” operation: we can choose a vector $a \in \mathbb{R}^n$ and we are told the value of $\langle a, x \rangle$. We would like to minimize the number of measurements needed to learn the vector $x$. Note that performing $m$ measurements is equivalent to choosing an $m \times n$ matrix $A$ then asking for the vector of measurements $Ax$.

Without any further assumptions, $n$ measurements are clearly necessary and sufficient. This is because (assuming that the query vectors are linearly independent) after $m < n$ measurements, $x$ could be any vector in the $(n-m)$-dimensional affine subspace of points consistent with the measurements.

Now suppose that we know some additional properties of $x$. We say that a vector is **$k$-sparse** if it has only $k$ non-zero components. Suppose that $x$ is *almost $k$-sparse*, meaning that $x = z + w$ where $z$ is $k$-sparse and $w$ is a small “noise” vector. Can we perform $m \ll n$ queries and compute a good approximation to $x$? Intuitively, we want to learn $z$ exactly and we don’t care about learning $w$.

We present a solution which involves two interesting ideas. The first idea is to perform *random measurements*, much like Johnson-Lindenstrauss measures the projection of vectors on a random subspace. We will set $m = O(k \log(n/k))$, let $A$ be an $m \times n$ matrix of independent Gaussians, then ask for the vector of measurements $b = Ax$.

The second idea is to use “small $\ell_1$-norm” as a proxy for sparsity. After obtaining the measurements $b = Ax$, we will find the vector with smallest $\ell_1$-norm that is consistent with those measurements. In other words, we solve the optimization problem

$$\min \big\{\, \|\hat{x}\|_1 \;:\; A\hat{x} = b \,\big\} \qquad (1)$$

where $\|v\|_1 = \sum_i |v_i|$ is the $\ell_1$-norm. This optimization problem can be reformulated as a linear program, although the details of that transformation are not crucial for us.
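To make the linear-programming reformulation concrete, here is a small sketch (the helper name `l1_min` and the toy parameters are ours, using `scipy.optimize.linprog`): introduce auxiliary variables $t_i \ge |\hat{x}_i|$ and minimize $\sum_i t_i$.

```python
import numpy as np
from scipy.optimize import linprog

def l1_min(A, b):
    """Solve min ||xhat||_1 s.t. A xhat = b, via the standard LP:
    variables (xhat, t), minimize sum(t) s.t. -t <= xhat <= t, A xhat = b."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])   # objective: sum of t
    A_ub = np.block([[ np.eye(n), -np.eye(n)],      #  xhat - t <= 0
                     [-np.eye(n), -np.eye(n)]])     # -xhat - t <= 0
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([A, np.zeros((m, n))])         # A xhat = b
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    return res.x[:n]

# toy instance: a 2-sparse x in R^20, observed via 10 Gaussian measurements
rng = np.random.default_rng(0)
n, m = 20, 10
x = np.zeros(n)
x[[3, 11]] = [1.0, -2.0]
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
x_hat = l1_min(A, A @ x)
```

On instances like this the minimizer typically recovers $x$ exactly, though the analysis below only promises an error bound.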

Our main theorem is:

**Theorem 1.** Define $\mathrm{Err}^k_1(x) = \min \|x - z\|_1$, where the minimum is taken over all $k$-sparse vectors $z$. Let $x^*$ be an optimal solution of (1). With probability at least $1 - 1/n$ over the choice of $A$,

$$\|x - x^*\|_2 \;\le\; O\Big(\frac{\mathrm{Err}^k_1(x)}{\sqrt{k}}\Big). \qquad (2)$$

To interpret the error guarantee in (2), note that if $x = z + w$ as above, then taking the $k$-sparse vector in the definition of $\mathrm{Err}^k_1$ to be $z$, we get $\mathrm{Err}^k_1(x) \le \|x - z\|_1 = \|w\|_1$. So the approximation error of $x^*$ only depends on the magnitude of the noise: $\|x - x^*\|_2 \le O(\|w\|_1 / \sqrt{k})$.

In fact, we will prove a stronger result than Theorem 1 states. We will prove that, with probability at least $1 - 1/n$, we obtain a matrix $A$ such that (2) holds *for every* vector $x$.

To prove such a result, we need to define a condition on the matrix $A$ that will ensure that (2) holds for all $x$. We say that $A$ satisfies the $(k, \epsilon)$–**Restricted Isometry Property** (RIP) if, for every $k$-sparse vector $x$, we have

$$(1 - \epsilon)\,\|x\|_2 \;\le\; \|Ax\|_2 \;\le\; (1 + \epsilon)\,\|x\|_2. \qquad (3)$$

This condition is equivalent to requiring that, for every submatrix of $A$ consisting of $k$ of its columns, all singular values lie in the interval $[1 - \epsilon, 1 + \epsilon]$. (An $m \times n$ matrix with $m \ge n$ for which all singular values are exactly equal to $1$ is an orthogonal matrix if $m = n$, and if $m > n$ it is called an isometry.)
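The singular-value characterization is easy to probe numerically. The sketch below (an empirical check, not a proof; the parameters $m, n, k$ are chosen small for speed) generates a Gaussian matrix with entries of variance $1/m$ and computes the extreme singular values over all $k$-column submatrices.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
m, n, k = 60, 12, 3
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))   # entries ~ N(0, 1/m)

# RIP for k-sparse vectors supported on S  <=>  all singular values of the
# m x k submatrix A[:, S] lie in [1 - eps, 1 + eps].
sigmas = [np.linalg.svd(A[:, list(S)], compute_uv=False)
          for S in combinations(range(n), k)]
lo = min(s.min() for s in sigmas)
hi = max(s.max() for s in sigmas)
eps = max(1.0 - lo, hi - 1.0)   # smallest eps for which A is (k, eps)-RIP
print(f"singular values across {k}-column submatrices: [{lo:.3f}, {hi:.3f}]")
```

Even at these small sizes the singular values cluster around $1$, as the RIP predicts.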

We will prove two theorems, which directly imply our main theorem.

**Theorem 2.** If $m = O(k \log(n/k) / \epsilon^2)$ and $A$ is an $m \times n$ matrix whose entries are independent normal random variables with distribution $N(0, 1/m)$, then $A$ is $(k, \epsilon)$-RIP with probability at least $1 - 1/n$.

**Theorem 3.** If $A$ is $(26k, 1/5)$-RIP then (2) holds.

The second theorem does not involve any new concepts or any randomization; it just performs some clever algebraic manipulations. So we will not discuss it in class, but we include the proof in Section 3 for completeness.

**2. Proof of Theorem 2**

First of all, let us derive a very crude bound on the operator norm of $A$. If the entries of $A$ all have absolute value at most $\sqrt{6 \ln(n)/m}$ then, bounding the operator norm by the Frobenius norm, $\|Av\|_2 \le \sqrt{mn \cdot 6\ln(n)/m}\,\|v\|_2 = \sqrt{6 n \ln n}\,\|v\|_2 \le n \|v\|_2$ for all $v$ (the last step assuming $6 \ln n \le n$). By standard bounds on the Gaussian tail, each entry exceeds this threshold with probability at most $2/n^3$, so the probability that our crude bound fails to hold is less than $2mn/n^3 \le 2/n$. (Alternatively, we could construct $A$ with random $\pm 1/\sqrt{m}$ entries instead of Gaussians, in which case our crude bound certainly holds.)

To prove the theorem we must show that (3) holds for every $k$-sparse vector $x$. The argument proceeds by a union bound over all sets $T$ of indices with $|T| = k$, then considering all vectors $x$ whose non-zeros all lie in the coordinates in $T$.

So restricting to the coordinates in $T$, our desired statement is: if $A$ is an $m \times k$ matrix of independent Gaussians with distribution $N(0, 1/m)$, then

$$(1 - \epsilon)\,\|x\|_2 \;\le\; \|Ax\|_2 \;\le\; (1 + \epsilon)\,\|x\|_2 \qquad \text{for all } x \in \mathbb{R}^k. \qquad (4)$$

This is almost identical to the statement of the Johnson-Lindenstrauss lemma, except that previously we only wanted those inequalities to hold for a fixed vector, or a small collection of vectors. Now we want them to hold for *all* vectors.

Previously we analyzed the probability of (4) holding for a single vector, then extended that to a small collection of vectors by taking a union bound. Unfortunately we now want (4) to hold for infinitely many vectors, and we don’t want to union bound over an infinitely large set!

The solution is to observe that the function $x \mapsto \|Ax\|_2$ is a continuous function of $x$, so if (4) holds for some $x$, then it approximately holds for every nearby $x'$. Also, since the ratio $\|Ax\|_2 / \|x\|_2$ only depends on the direction of $x$, not its norm, it suffices to consider vectors with $\|x\|_2 = 1$.

So define the sphere $S = \{\, x \in \mathbb{R}^k : \|x\|_2 = 1 \,\}$. Since we only need to consider points $x \in S$, it seems conceivable that we could find *finitely* many points $p \in S$ such that *every* point in $S$ is “nearby” to some $p$. Such a set is called a *$\delta$-net*, and nets are well-studied objects. The following fact is known.

**Fact 1.** For every $\delta \in (0, 1)$ there exists a set $P \subseteq S$ of size $|P| \le (3/\delta)^k$ that is a $\delta$-net of $S$, meaning that $\min_{p \in P} \|x - p\|_2 \le \delta$ for all $x \in S$.
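Fact 1 is usually proved by taking a *maximal* $\delta$-separated subset of $S$: maximality makes it a $\delta$-net, and a volume argument bounds its size. The greedy sketch below illustrates this on points sampled from the sphere (the helper `greedy_net`, the dimension, and $\delta$ are our own illustrative choices):

```python
import numpy as np

def greedy_net(points, delta):
    """Greedily keep points that are > delta away from all kept points.
    The kept set is delta-separated, and by maximality every input point
    lies within delta of some kept point, i.e. it is a delta-net."""
    net = []
    for p in points:
        if all(np.linalg.norm(p - q) > delta for q in net):
            net.append(p)
    return net

rng = np.random.default_rng(2)
k, delta = 3, 0.5
pts = rng.normal(size=(2000, k))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)   # points on the unit sphere
net = greedy_net(pts, delta)
# covering radius of the sampled points: <= delta by construction
cover = max(min(np.linalg.norm(p - q) for q in net) for p in pts)
print(len(net), cover)
```

The size stays bounded: $\delta/2$-balls around the net points are disjoint and fit inside a ball of radius $1 + \delta/2$, giving at most $(1 + 2/\delta)^k$ points, which is $125$ here.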

We set $\delta = \epsilon/2n$ and let $P$ be a $\delta$-net of size $(6n/\epsilon)^k$, as given by Fact 1. Then we apply the Johnson-Lindenstrauss lemma with error parameter $\epsilon/2$ to the points in $P$. By a union bound, with probability at least $1 - 2|P|e^{-\Omega(\epsilon^2 m)}$ we have

$$(1 - \epsilon/2)\,\|p\|_2 \;\le\; \|Ap\|_2 \;\le\; (1 + \epsilon/2)\,\|p\|_2 \qquad \text{for all } p \in P.$$

Now consider any point $x \in S$, so $\|x\|_2 = 1$. Because $P$ is a $\delta$-net, there is some point $p \in P$ of distance at most $\delta$ to $x$. That is, letting $v = x - p$, we have $\|v\|_2 \le \delta$. Now using the crude bound from above, we have $\|Av\|_2 \le n\|v\|_2 \le n\delta = \epsilon/2$. So, assuming that the Johnson-Lindenstrauss guarantee holds for the points in $P$, we get

$$\|Ax\|_2 \;\le\; \|Ap\|_2 + \|Av\|_2 \;\le\; (1 + \epsilon/2)\,\|p\|_2 + \epsilon/2 \;=\; 1 + \epsilon \;=\; (1 + \epsilon)\,\|x\|_2.$$

Similarly,

$$\|Ax\|_2 \;\ge\; \|Ap\|_2 - \|Av\|_2 \;\ge\; (1 - \epsilon/2)\,\|p\|_2 - \epsilon/2 \;=\; 1 - \epsilon \;=\; (1 - \epsilon)\,\|x\|_2.$$

So (4) holds for every $x \in S$, hence for every $x \in \mathbb{R}^k$.

So far we have only considered a single set $T$ of indices and shown that the failure probability is at most $2|P|e^{-\Omega(\epsilon^2 m)} = 2(6n/\epsilon)^k e^{-\Omega(\epsilon^2 m)}$. We analyze the total failure probability by a union bound over all choices of $T$. The number of sets of size $k$ is $\binom{n}{k} \le (en/k)^k$. So the probability that (4) fails to hold for any point in any of our $\delta$-nets is at most

$$\binom{n}{k} \cdot 2\Big(\frac{6n}{\epsilon}\Big)^k e^{-\Omega(\epsilon^2 m)} \;\le\; 2\exp\Big( k \ln\frac{en}{k} + k \ln\frac{6n}{\epsilon} - \Omega(\epsilon^2 m) \Big),$$

which is less than $1/n$ if $m \ge c\,k \log(n/\epsilon)/\epsilon^2$ for a sufficiently large constant $c$. (For constant $\epsilon$ this is $m = O(k \log n)$; a more careful version of the net argument, avoiding the crude bound, yields the $m = O(k \log(n/k)/\epsilon^2)$ claimed in Theorem 2.)

**3. Appendix: Proof of Theorem 3**

Let $h = x^* - x$. Our goal is to bound $\|h\|_2$. To start off, let us order the coordinates such that

- $|x_1|, \ldots, |x_k|$ are all at least $\max_{i > k} |x_i|$ (the first $k$ coordinates carry the largest entries of $x$ in absolute value).
- $|h_{k+1}| \ge |h_{k+2}| \ge \cdots \ge |h_n|$ (the remaining coordinates are sorted by decreasing magnitude of $h$).

Now define the sets of indices

$$T_0 = \{1, \ldots, k\} \qquad \text{and} \qquad T_j = \big\{\, k + 25k(j-1) + 1, \;\ldots,\; k + 25kj \,\big\} \;\text{ for } j \ge 1,$$

so $|T_0| = k$ and $|T_j| = 25k$ for $j \ge 1$. Also, define

$$T_{01} = T_0 \cup T_1.$$
We will use the notation $h_T$ to denote the restriction of $h$ to the coordinates in $T$ (setting the other coordinates to zero). For example, $h_{T_0}$ is the vector $(h_1, \ldots, h_k, 0, \ldots, 0)$. Finally, define

$$\mathrm{Err} \;=\; \mathrm{Err}^k_1(x) \;=\; \|x_{T_0^c}\|_1,$$

since the smallest error of any $k$-sparse approximation is obtained by letting $z$ be the $k$-sparse vector that contains $x$’s $k$ largest coordinates (in absolute value).
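In code, $\mathrm{Err}^k_1(x)$ is just the sum of all but the $k$ largest magnitudes. A quick sanity check (the function name `err_k1` and the example vector are ours):

```python
import numpy as np

def err_k1(x, k):
    """Err^k_1(x) = min over k-sparse z of ||x - z||_1.
    The best k-sparse z keeps the k largest |x_i|, so the
    error is the sum of the remaining magnitudes."""
    a = np.sort(np.abs(x))                    # ascending order
    return float(a[:max(len(a) - k, 0)].sum())

x = np.array([5.0, -0.1, 3.0, 0.2, -4.0, 0.05])
print(err_k1(x, 3))   # sum of the three smallest magnitudes
```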

The proof involves three claims.

**Claim 1.** $\|h_{T_0^c}\|_1 \le \|h_{T_0}\|_1 + 2\,\mathrm{Err}$.

**Claim 2.** $\|h_{T_{01}^c}\|_2 \le \|h_{T_0}\|_2 + \frac{2\,\mathrm{Err}}{\sqrt{k}}$.

**Claim 3.** $\|h_{T_{01}}\|_2 \le \frac{\mathrm{Err}}{\sqrt{k}}$.

These claims easily imply Theorem 3, which states that $\|x - x^*\|_2 \le O(\mathrm{Err}/\sqrt{k})$. We have

$$\|h\|_2 \;\le\; \|h_{T_{01}}\|_2 + \|h_{T_{01}^c}\|_2 \;\le\; 2\,\|h_{T_{01}}\|_2 + \frac{2\,\mathrm{Err}}{\sqrt{k}} \;\le\; \frac{4\,\mathrm{Err}}{\sqrt{k}},$$

where the second inequality combines Claim 2 with $\|h_{T_0}\|_2 \le \|h_{T_{01}}\|_2$, and the third is by Claim 3.
*Proof:* (of Claim 1) The main facts we need are that the $\ell_1$-norm satisfies the triangle inequality and that it is additive on any partition of the coordinates, i.e., $\|v\|_1 = \|v_S\|_1 + \|v_{S^c}\|_1$ for any set $S$ of indices. Since $x^* = x + h$ is an optimal solution to (1) and $x$ itself is feasible, we have

$$\|x_{T_0}\|_1 + \|x_{T_0^c}\|_1 \;=\; \|x\|_1 \;\ge\; \|x + h\|_1 \;=\; \|x_{T_0} + h_{T_0}\|_1 + \|x_{T_0^c} + h_{T_0^c}\|_1 \;\ge\; \|x_{T_0}\|_1 - \|h_{T_0}\|_1 + \|h_{T_0^c}\|_1 - \|x_{T_0^c}\|_1$$

by the triangle inequality. Rearranging, we get

$$\|h_{T_0^c}\|_1 \;\le\; \|h_{T_0}\|_1 + 2\,\|x_{T_0^c}\|_1 \;=\; \|h_{T_0}\|_1 + 2\,\mathrm{Err},$$

which proves the claim.

*Proof:* (of Claim 2) By our ordering of the coordinates, for $i > k$ the coordinate $h_i$ is the $(i-k)$th largest coordinate of $h_{T_0^c}$ in absolute value, so $|h_i| \le \|h_{T_0^c}\|_1 / (i - k)$. So

$$\|h_{T_{01}^c}\|_2^2 \;=\; \sum_{i > 26k} h_i^2 \;\le\; \|h_{T_0^c}\|_1^2 \sum_{j > 25k} \frac{1}{j^2} \;\le\; \frac{\|h_{T_0^c}\|_1^2}{k},$$

where we have bounded the sum by $1/k$. Thus

$$\|h_{T_{01}^c}\|_2 \;\le\; \frac{\|h_{T_0^c}\|_1}{\sqrt{k}} \;\le\; \frac{\|h_{T_0}\|_1 + 2\,\mathrm{Err}}{\sqrt{k}} \;\le\; \|h_{T_0}\|_2 + \frac{2\,\mathrm{Err}}{\sqrt{k}}.$$

Here the second inequality is by Claim 1 and the third is because $\|v\|_1 \le \sqrt{k}\,\|v\|_2$ for any $k$-dimensional vector.

*Proof:* (of Claim 3) Our optimization problem ensures that $Ax^* = Ax$, so $Ah = 0$, and hence $Ah_{T_{01}} = -\sum_{j \ge 2} A h_{T_j}$. So,

$$\|A h_{T_{01}}\|_2 \;=\; \Big\| \sum_{j \ge 2} A h_{T_j} \Big\|_2 \;\le\; \sum_{j \ge 2} \|A h_{T_j}\|_2 \;\le\; \frac{6}{5} \sum_{j \ge 2} \|h_{T_j}\|_2, \qquad (5)$$

where the first inequality is by the triangle inequality, and the second is by the RIP property (each $h_{T_j}$ is $25k$-sparse).

Next we derive an upper bound on $\sum_{j \ge 2} \|h_{T_j}\|_2$. By our ordering of the coordinates, every coordinate of $h_{T_j}$ (for $j \ge 2$) is smaller (in absolute value) than the minimum coordinate of $h_{T_{j-1}}$. This means it is also smaller than the average coordinate of $h_{T_{j-1}}$, which is $\|h_{T_{j-1}}\|_1 / 25k$. So $\|h_{T_j}\|_\infty \le \|h_{T_{j-1}}\|_1 / 25k$, which implies that

$$\|h_{T_j}\|_2 \;\le\; \sqrt{25k}\;\|h_{T_j}\|_\infty \;\le\; \frac{\|h_{T_{j-1}}\|_1}{5\sqrt{k}}.$$

Summing over all $j \ge 2$,

$$\sum_{j \ge 2} \|h_{T_j}\|_2 \;\le\; \frac{1}{5\sqrt{k}} \sum_{j \ge 1} \|h_{T_j}\|_1 \;=\; \frac{\|h_{T_0^c}\|_1}{5\sqrt{k}} \;\le\; \frac{\|h_{T_0}\|_1 + 2\,\mathrm{Err}}{5\sqrt{k}} \;\le\; \frac{\|h_{T_0}\|_2}{5} + \frac{2\,\mathrm{Err}}{5\sqrt{k}}.$$

Here the second inequality is by Claim 1, and the third is again because $\|h_{T_0}\|_1 \le \sqrt{k}\,\|h_{T_0}\|_2$. Combining this with (5) and the RIP lower bound $\|A h_{T_{01}}\|_2 \ge \frac{4}{5}\|h_{T_{01}}\|_2$ (note that $h_{T_{01}}$ is $26k$-sparse), we get

$$\frac{4}{5}\,\|h_{T_{01}}\|_2 \;\le\; \frac{6}{5}\Big( \frac{\|h_{T_0}\|_2}{5} + \frac{2\,\mathrm{Err}}{5\sqrt{k}} \Big) \;\le\; \frac{6}{25}\,\|h_{T_{01}}\|_2 + \frac{12}{25} \cdot \frac{\mathrm{Err}}{\sqrt{k}},$$

which gives $\|h_{T_{01}}\|_2 \le \frac{6}{7} \cdot \frac{\mathrm{Err}}{\sqrt{k}} \le \frac{\mathrm{Err}}{\sqrt{k}}$, which implies the claim.