**1. Low-rank approximation of matrices**

Let $A$ be an arbitrary $n \times m$ matrix. We assume $n \leq m$. We consider the problem of approximating $A$ by a low-rank matrix. For example, we could seek a rank-$k$ matrix $B$ minimizing $\|A - B\|$, where $\|\cdot\|$ denotes the spectral (operator) norm.

It is known that a truncated singular value decomposition gives an optimal solution to this problem. Formally, let $A = U \Sigma V^T$ be the singular value decomposition of $A$. Let $\sigma_1 \geq \cdots \geq \sigma_n \geq 0$ be the singular values (i.e., the diagonal entries of $\Sigma$). Let $u_1, \ldots, u_n$ be the left singular vectors (i.e., the columns of $U$). Let $v_1, \ldots, v_n$ be the right singular vectors (i.e., the columns of $V$).

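**Fact 1** *The matrix $\sum_{i=1}^{k} \sigma_i u_i v_i^T$ is a solution to $\min_{B \,:\, \mathrm{rank}(B) \leq k} \|A - B\|$, and the minimum value equals $\sigma_{k+1}$.*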

Another way of stating this same fact is as follows.

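**Fact 2** *Let $U_k = [u_1, \ldots, u_k]$ be the $n \times k$ matrix consisting of the top $k$ left singular vectors. Let $P_k = U_k U_k^T$ be the orthogonal projection onto the span of $\{u_1, \ldots, u_k\}$. Then $B = P_k A$ is a solution to $\min_{B \,:\, \mathrm{rank}(B) \leq k} \|A - B\|$. Furthermore, $\|A - P_k A\| = \sigma_{k+1}$.*

The same matrix is obtained by projecting from the right: if $V_k = [v_1, \ldots, v_k]$ holds the top $k$ right singular vectors, then $A V_k V_k^T = P_k A$. Section 2 works with projections onto right singular vectors.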
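Facts 1 and 2 are easy to check numerically. The following is a minimal NumPy sketch, not part of the original notes; the matrix sizes, the seed, and the variable names are illustrative assumptions. It builds the truncated SVD and both projection forms, then compares their spectral-norm error with the next singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 6, 9, 2                      # illustrative sizes (assumptions), n <= m
A = rng.standard_normal((n, m))

# Thin SVD: A = U @ diag(sigma) @ Vt, singular values in decreasing order.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Fact 1: keep only the top k singular triples.
B = (U[:, :k] * sigma[:k]) @ Vt[:k, :]

# Fact 2: project onto the span of the top k left singular vectors.
U_k = U[:, :k]
P_left = U_k @ U_k.T                   # n x n orthogonal projection

# Equivalent form using the top k right singular vectors.
V_k = Vt[:k, :].T
P_right = V_k @ V_k.T                  # m x m orthogonal projection

err_truncated = np.linalg.norm(A - B, 2)          # ord=2 is the spectral norm
err_left = np.linalg.norm(A - P_left @ A, 2)
err_right = np.linalg.norm(A - A @ P_right, 2)

# sigma[k] is sigma_{k+1} in the 1-indexed notation of the text.
print(err_truncated, err_left, err_right, sigma[k])
```

All four printed numbers should agree up to rounding error.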

The SVD can be computed in $O(mn^2)$ time. (Strictly speaking, this is not correct: the singular values can be irrational, so realistically we can only compute an $\epsilon$-approximate SVD.) With the recent trend towards analyzing “big data”, a running time of $O(mn^2)$ might be too slow.

In the past 15 years there has been a lot of work on sophisticated algorithms to quickly compute low-rank approximations. For example, one could measure the approximation error in different norms, sample more or fewer vectors, improve the running time, reduce the number of passes over the data, improve numerical stability, etc. Much more information can be found in the survey of Mahoney, the review article of Halko-Martinsson-Tropp, the PhD thesis of Boutsidis, etc.

**2. Rudelson & Vershynin’s Algorithm**

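Let $A$ be an $n \times m$ matrix. The **Frobenius norm** $\|A\|_F$ is defined by

$$\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2} = \sqrt{\mathrm{tr}(A^T A)} = \sqrt{\sum_{i} \sigma_i^2}.$$

The **stable rank** (or **numerical rank**) of $A$ is

$$\frac{\|A\|_F^2}{\|A\|^2} = \frac{\sum_{i} \sigma_i^2}{\sigma_1^2}.$$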

Clearly the stable rank cannot exceed the usual rank, which is the number of strictly positive singular values. The stable rank is a useful surrogate for the rank because it is largely unaffected by tiny singular values.
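To see how the stable rank ignores tiny singular values, here is a small NumPy sketch. The helper name `stable_rank` and the example matrix are assumptions made for illustration.

```python
import numpy as np

def stable_rank(A):
    """Squared Frobenius norm divided by squared spectral norm."""
    sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending
    return float(np.sum(sigma**2) / sigma[0]**2)

# One dominant singular value and three tiny (but nonzero) ones.
A = np.diag([1.0, 1e-3, 1e-3, 1e-3])

print(np.linalg.matrix_rank(A))   # the usual rank counts all four directions
print(stable_rank(A))             # the stable rank stays very close to 1
```

Here the usual rank is 4, while the stable rank is $1 + 3 \cdot 10^{-6}$. For the identity matrix the two notions coincide.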

Our theorem will be invariant under scaling of $A$, so for convenience let us assume that $\|A\| = 1$.

Let $r$ denote the stable rank of $A$ and let $a_1, \ldots, a_n$ be the rows of $A$. We consider the following algorithm for computing a low-rank approximation to $A$.

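- Initially $\tilde{A}$ is the empty matrix.
- Fix any $\epsilon \in (0,1)$ and any integer $d \geq 32 \, r \ln(m) / \epsilon^4$. (Here we are assuming that the algorithm knows $r$, or at least reasonable bounds on $r$.)
- For $i = 1, \ldots, d$
    - Pick a row $a_j$ with probability proportional to $\|a_j\|^2$, i.e., with probability $\|a_j\|^2 / \|A\|_F^2$.
    - Add the row $\frac{\|A\|_F}{\sqrt{d} \, \|a_j\|} \, a_j$ to $\tilde{A}$.
- Compute the SVD of $\tilde{A}$.

The runtime of this algorithm is dominated by two main tasks. (1) The computation of the sampling probabilities. This can be done in time linear in the number of non-zero entries of $A$. (2) Computing the SVD of $\tilde{A}$. Since $\tilde{A}$ has size $d \times m$, this takes $O(md^2)$ time.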
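The sampling procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation. The function names `sample_rows` and `top_k_projection`, the test matrix, and the hand-picked value of `d` are all assumptions. Each sampled row is rescaled to have norm $\|A\|_F / \sqrt{d}$, so that the expectation of $\tilde{A}^T \tilde{A}$ equals $A^T A$.

```python
import numpy as np

def sample_rows(A, d, rng):
    """Return a d x m sketch: rows drawn i.i.d. with probability proportional
    to their squared norm, each rescaled to have norm ||A||_F / sqrt(d)."""
    row_norms_sq = np.sum(A**2, axis=1)
    fro_sq = row_norms_sq.sum()                     # ||A||_F^2
    probs = row_norms_sq / fro_sq
    idx = rng.choice(A.shape[0], size=d, p=probs)   # sampling with replacement
    scale = np.sqrt(fro_sq / (d * row_norms_sq[idx]))
    return A[idx] * scale[:, None]

def top_k_projection(A_sketch, k):
    """Orthogonal projection onto the top k right singular vectors."""
    _, _, Vt = np.linalg.svd(A_sketch, full_matrices=False)
    V_k = Vt[:k].T
    return V_k @ V_k.T

rng = np.random.default_rng(0)
# Illustrative input (an assumption): a rank-3 matrix plus small noise.
A = rng.standard_normal((2000, 3)) @ rng.standard_normal((3, 30))
A += 0.01 * rng.standard_normal(A.shape)
A /= np.linalg.norm(A, 2)                           # normalise so ||A|| = 1

k, d = 3, 400                                       # d picked by hand for the demo
A_sketch = sample_rows(A, d, rng)
P_k = top_k_projection(A_sketch, k)

sigma = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A @ P_k, 2)
print("optimal error sigma_{k+1}:", sigma[k])
print("error of the sampled projection:", err)
```

The sketch has only `d` rows, so its SVD is much cheaper than the SVD of `A` when `d` is far smaller than the number of rows. By Fact 1, the sampled projection can never beat $\sigma_{k+1}$; the interesting question is how close it gets.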

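**Theorem 3** *Fix any integer $k$ with $1 \leq k < n$. Let $P_k$ be the orthogonal projection onto the top $k$ right singular vectors of $\tilde{A}$. With probability at least $1 - 2/m$,*

$$\|A - A P_k\| \leq \sigma_{k+1} + \epsilon. \qquad (1)$$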

In other words, the best rank-$k$ projection obtained from $\tilde{A}$ does nearly as well as the best rank-$k$ projection obtained from $A$. (Compare against Fact 2.) Since our algorithm explicitly computes the SVD of $\tilde{A}$, it can easily compute the matrix $P_k$. We can then use $P_k$ to efficiently compute an approximate SVD of $A$ as well; see the survey of Halko, Martinsson and Tropp.

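**Corollary 4** *Set $k = \lceil r / \epsilon^2 \rceil$ and suppose $k < n$. Let $P_k$ be the orthogonal projection onto the top $k$ right singular vectors of $\tilde{A}$. Then, with probability at least $1 - 2/m$,*

$$\|A - A P_k\| \leq 2 \epsilon.$$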

Let us contrast the error guarantee of (1) with the guarantee we achieved in the previous lecture. Last time we approximated a matrix $L_G$ by a sampled matrix $L_H$ and showed that $L_H$ approximates $L_G$ in the sense that

$$(1 - \epsilon) L_G \preceq L_H \preceq (1 + \epsilon) L_G.$$

We say that our result from last time achieves a “multiplicative” or “relative” error guarantee. In contrast, Corollary 4 only guarantees that

$$\|A - A P_k\| \leq 2 \epsilon \|A\|,$$

even though the optimal error $\sigma_{k+1}$ may be significantly smaller than $\epsilon \|A\|$. We say that today’s theorem only achieves an “additive” or “absolute” error guarantee. To prove today’s theorem, we will use a version of the Ahlswede-Winter inequality that provides an additive error guarantee.

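**Theorem 5** *Let $Y$ be a random, symmetric, positive semi-definite $m \times m$ matrix such that $\|\mathrm{E}[Y]\| \leq 1$. Suppose $\|Y\| \leq R$ for some fixed scalar $R \geq 1$. Let $Y_1, \ldots, Y_d$ be independent copies of $Y$ (i.e., independently sampled matrices with the same distribution as $Y$). For any $t \in (0,1)$, we have*

$$\Pr\left[ \, \left\| \frac{1}{d} \sum_{i=1}^{d} Y_i - \mathrm{E}[Y] \right\| > t \, \right] \leq 2m \cdot \exp\left( - \frac{d \, t^2}{4R} \right).$$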

The proof is almost identical to the proof of Theorem 1 in Lecture 13. The only difference is that the final sentence of that proof should be deleted.

Our Theorem 3 is actually weaker than Rudelson & Vershynin’s result. They show that one can take $d$ to be roughly $r \log(r) / \epsilon^4$, which is quite remarkable because it is “dimension free”: the number of samples does not depend on the dimension $m$. Unfortunately our proof, which uses little more than the Ahlswede-Winter inequality, does not give that stronger bound because the failure probability in the Ahlswede-Winter inequality depends on the dimension. Rudelson & Vershynin prove an (additive error) variant of Ahlswede-Winter which avoids this dependence on the dimension. Oliveira 2010 and Hsu-Kakade-Zhang 2011 give further progress in this direction.

**2.1. Proofs**

The proof of Theorem 3 follows quite straightforwardly from Theorem 5 and the following lemma, which we prove later.

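**Lemma 6** *Let $A$ and $B$ be arbitrary matrices with the same number of columns. Let $P_k$ be the orthogonal projection onto the top $k$ right singular vectors of $B$. Then*

$$\|A - A P_k\|^2 \leq \sigma_{k+1}(A)^2 + 2 \, \|A^T A - B^T B\|.$$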

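*Proof:* (of Theorem 3.) We defined $a_1, \ldots, a_n$ to be the rows of $A$, but let us now transpose them to become column vectors, so

$$A^T A = \sum_{j=1}^{n} a_j a_j^T.$$

Let $Y_1, \ldots, Y_d$ be independent, identically distributed random matrices with the following distribution:

$$\Pr\left[ \, Y = \frac{r}{a_j^T a_j} \, a_j a_j^T \, \right] = \frac{a_j^T a_j}{r} \qquad \text{for } j = 1, \ldots, n.$$

This is indeed a probability distribution because

$$\sum_{j=1}^{n} \frac{a_j^T a_j}{r} = \frac{\|A\|_F^2}{r} = 1,$$

where the last equality holds because $\|A\| = 1$ implies $r = \|A\|_F^2$.

Note that the change to $\tilde{A}^T \tilde{A}$ during the $i$th iteration of the algorithm is $Y_i / d$.

We will apply Theorem 5 to $Y_1, \ldots, Y_d$. We have

$$\mathrm{E}[Y] = \sum_{j=1}^{n} \frac{a_j^T a_j}{r} \cdot \frac{r}{a_j^T a_j} \, a_j a_j^T = \sum_{j=1}^{n} a_j a_j^T = A^T A,$$

so $\|\mathrm{E}[Y]\| = \|A\|^2 = 1$. Every outcome of $Y$ has norm $r \, \|a_j a_j^T\| / (a_j^T a_j) = r$, so we may take $R = r$, the stable rank of $A$ (since we assume $\|A\| = 1$).

Since $\tilde{A}^T \tilde{A} = \frac{1}{d} \sum_{i=1}^{d} Y_i$ and $\mathrm{E}[Y] = A^T A$, we get

$$\Pr\left[ \, \|\tilde{A}^T \tilde{A} - A^T A\| > \epsilon^2 / 2 \, \right] \leq 2m \cdot \exp\left( - \frac{d \, \epsilon^4}{16 \, r} \right) \leq \frac{2}{m}$$

by Theorem 5 with $t = \epsilon^2 / 2$ and the choice $d \geq 32 \, r \ln(m) / \epsilon^4$.

Now we apply Lemma 6 to $A$ and $B = \tilde{A}$. So, with probability at least $1 - 2/m$,

$$\|A - A P_k\|^2 \leq \sigma_{k+1}^2 + 2 \, \|A^T A - \tilde{A}^T \tilde{A}\| \leq \sigma_{k+1}^2 + \epsilon^2.$$

Taking square roots completes the proof, since $\sqrt{x^2 + y^2} \leq x + y$ for $x, y \geq 0$.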
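The two properties of $Y$ that feed into Theorem 5 in the proof above, namely $\mathrm{E}[Y] = A^T A$ and $\|Y\| = r$ for every outcome, can be verified exactly on a small example. The sketch below enumerates all $n$ outcomes rather than sampling. The matrix sizes and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
A /= np.linalg.norm(A, 2)                  # enforce ||A|| = 1

row_norms_sq = np.sum(A**2, axis=1)        # a_j^T a_j for each row j
r = row_norms_sq.sum()                     # stable rank, because ||A|| = 1

probs = row_norms_sq / r                   # probability of the j-th outcome
# Outcome j is r * a_j a_j^T / (a_j^T a_j); stack them into shape (n, m, m).
Ys = r * np.einsum("ji,jk->jik", A, A) / row_norms_sq[:, None, None]

expected_Y = np.einsum("j,jik->ik", probs, Ys)     # sum_j Pr[outcome j] * Y_j
outcome_norms = np.array([np.linalg.norm(Y, 2) for Y in Ys])

print(np.isclose(probs.sum(), 1.0))        # a valid probability distribution
print(np.allclose(expected_Y, A.T @ A))    # E[Y] = A^T A
print(np.allclose(outcome_norms, r))       # every outcome has spectral norm r
```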

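*Proof:* (of Corollary 4). By the ordering of the singular values,

$$k \, \sigma_k^2 \leq \sum_{i=1}^{k} \sigma_i^2 \leq \|A\|_F^2 = r,$$

implying $\sigma_k \leq \sqrt{r/k}$. In particular, if $k \geq r / \epsilon^2$ then $\sigma_{k+1} \leq \sigma_k \leq \epsilon$. Combining this with (1) gives $\|A - A P_k\| \leq 2 \epsilon$.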
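The singular value bound used in this proof, that the $k$th singular value is at most $\sqrt{r/k}$ once the matrix is scaled to have spectral norm 1, holds for every matrix and every $k$. A quick numerical sanity check, with an arbitrarily chosen random matrix as an assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 50))
A /= np.linalg.norm(A, 2)                      # enforce ||A|| = 1

sigma = np.linalg.svd(A, compute_uv=False)     # sigma_1 >= ... >= sigma_n
r = np.sum(sigma**2)                           # stable rank, because ||A|| = 1

ks = np.arange(1, len(sigma) + 1)
bound = np.sqrt(r / ks)                        # upper bound on sigma_k

# The small slack only guards against floating-point rounding.
print(np.all(sigma <= bound + 1e-12))
```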

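*Proof:* (of Lemma 6). Let $Q = I - P_k$ be the orthogonal projection onto the kernel of $P_k$ (i.e., the span of the bottom right singular vectors of $B$). Then

$$\|Q B^T B Q\| = \|B Q\|^2 = \sigma_{k+1}(B)^2 = \lambda_{k+1}(B^T B).$$

So,

$$\|A - A P_k\|^2 = \|A Q\|^2 = \|Q A^T A Q\| \leq \|Q B^T B Q\| + \|Q (A^T A - B^T B) Q\| \leq \lambda_{k+1}(B^T B) + \|A^T A - B^T B\| \leq \lambda_{k+1}(A^T A) + 2 \, \|A^T A - B^T B\|,$$

and $\lambda_{k+1}(A^T A) = \sigma_{k+1}(A)^2$. Here the last step uses the following fact.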

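**Fact 7** *Let $X$ and $Y$ be symmetric matrices of the same size. Let $\lambda_i(X)$ and $\lambda_i(Y)$ respectively denote the $i$th largest eigenvalues of $X$ and $Y$. Then*

$$\max_i \, |\lambda_i(X) - \lambda_i(Y)| \leq \|X - Y\|.$$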

*Proof:* See Horn & Johnson, “Matrix Analysis”, page 370.
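Fact 7 says that the sorted eigenvalues of a symmetric matrix move by no more than the spectral norm of the perturbation. It can be illustrated numerically; in the sketch below the matrix size, the perturbation scale, and the helper `random_symmetric` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
q = 6                                          # illustrative size (assumption)

def random_symmetric(rng, q):
    """A random symmetric q x q matrix."""
    M = rng.standard_normal((q, q))
    return (M + M.T) / 2

X = random_symmetric(rng, q)
Y = X + 0.1 * random_symmetric(rng, q)         # a perturbation of X

# eigvalsh returns eigenvalues in ascending order; reverse so that index i
# corresponds to the (i+1)-th largest eigenvalue.
lam_X = np.linalg.eigvalsh(X)[::-1]
lam_Y = np.linalg.eigvalsh(Y)[::-1]

gap = np.max(np.abs(lam_X - lam_Y))            # largest eigenvalue shift
pert = np.linalg.norm(X - Y, 2)                # spectral norm of the perturbation
print(gap, pert)                               # Fact 7: gap is at most pert
```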