1. Low-rank approximation of matrices
Let $A$ be an arbitrary $n \times m$ matrix. We assume $n \geq m$. We consider the problem of approximating $A$ by a low-rank matrix. For example, we could seek to find a rank $k$ matrix $B$ minimizing $\|A - B\|$, where $\|\cdot\|$ denotes the operator norm (the largest singular value).
It is known that a truncated singular value decomposition gives an optimal solution to this problem. Formally, let $A = U \Sigma V^{\mathsf T}$ be the singular value decomposition of $A$. Let $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_m \geq 0$ be the singular values (i.e., diagonal entries of $\Sigma$). Let $u_1, \ldots, u_m$ be the left singular vectors (i.e., columns of $U$). Let $v_1, \ldots, v_m$ be the right singular vectors (i.e., columns of $V$).
Fact 1 $A_k := \sum_{i=1}^{k} \sigma_i u_i v_i^{\mathsf T}$ is a solution to $\min \{ \|A - B\| : \operatorname{rank}(B) \leq k \}$, and the minimum value equals $\sigma_{k+1}$.
Another way of stating this same fact is as follows.
Fact 2 Let $U_k = [\, u_1 \ \cdots \ u_k \,]$ be the $n \times k$ matrix consisting of the top $k$ left singular vectors. Let $P := U_k U_k^{\mathsf T}$ be the orthogonal projection onto the span of the columns of $U_k$. Then $P A$ is a solution to $\min \{ \|A - B\| : \operatorname{rank}(B) \leq k \}$. Furthermore, $P A = A_k$.
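As a quick numerical sanity check of Facts 1 and 2, here is a small sketch in numpy (our own illustration; the matrix, dimensions and variable names are arbitrary choices):

```python
import numpy as np

# Check Facts 1 and 2 on a random matrix: the truncated SVD A_k achieves
# operator-norm error sigma_{k+1}, and P A = A_k for P = U_k U_k^T.
rng = np.random.default_rng(0)
n, m, k = 50, 30, 5
A = rng.standard_normal((n, m))

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: A = U diag(sigma) V^T
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]        # truncated SVD (Fact 1)
print(np.linalg.norm(A - A_k, 2), sigma[k])            # both equal sigma_{k+1}

U_k = U[:, :k]
P = U_k @ U_k.T                                        # projection onto top-k left
print(np.allclose(P @ A, A_k))                         # singular vectors (Fact 2)
```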
The SVD can be computed in $O(n m^2)$ time. (Strictly speaking, this is not correct: the singular values can be irrational, so realistically we can only compute an $\epsilon$-approximate SVD.) With the recent trend towards analyzing “big data”, a running time of $O(n m^2)$ might be too slow.
In the past 15 years there has been a lot of work on sophisticated algorithms to quickly compute low-rank approximations. For example, one could measure the approximate error in different norms, sample more or fewer vectors, improve the running time, reduce the number of passes over the data, improve numerical stability, etc. Much more information can be found in the survey of Mahoney, the review article of Halko-Martinsson-Tropp, the PhD thesis of Boutsidis, etc.
2. Rudelson & Vershynin’s Algorithm
Let $A$ be an $n \times m$ matrix. The Frobenius norm $\|A\|_F$ is defined by
$$\|A\|_F \;=\; \Big( \sum_{i,j} A_{i,j}^2 \Big)^{1/2} \;=\; \Big( \sum_{i} \sigma_i^2 \Big)^{1/2}.$$
The stable rank (or numerical rank) of $A$ is
$$\tilde{r} \;:=\; \frac{\|A\|_F^2}{\|A\|^2} \;=\; \frac{\sum_i \sigma_i^2}{\sigma_1^2}.$$
Clearly the stable rank cannot exceed the usual rank, which is the number of strictly positive singular values (indeed, $\sum_i \sigma_i^2 \leq \operatorname{rank}(A) \cdot \sigma_1^2$). The stable rank is a useful surrogate for the rank because it is largely unaffected by tiny singular values.

Our theorem will be invariant under scaling of $A$, so for convenience let us assume that $\|A\| = 1$. Under this normalization, $\|A\|_F^2 = \tilde{r}$.
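To make the definition concrete, here is a small numpy illustration (our own toy example) of a matrix whose usual rank is large but whose stable rank is small:

```python
import numpy as np

# A matrix with 5 singular values equal to 1 and the rest tiny: its rank is m,
# but its stable rank ||A||_F^2 / ||A||^2 is close to 5.
rng = np.random.default_rng(1)
n, m = 100, 60
U, _ = np.linalg.qr(rng.standard_normal((n, m)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
sigma = np.concatenate([np.ones(5), 1e-6 * np.ones(m - 5)])
A = U @ np.diag(sigma) @ V.T

stable_rank = np.linalg.norm(A, 'fro') ** 2 / np.linalg.norm(A, 2) ** 2
print(np.linalg.matrix_rank(A), stable_rank)           # prints 60 and roughly 5
```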
Let $\tilde{r}$ denote the stable rank of $A$ and let $a_1, \ldots, a_n$ be the rows of $A$. We consider the following algorithm for computing a low-rank approximation to $A$ via a sampled matrix $\tilde{A}$.
- Initially $\tilde{A}$ is the empty matrix.
- Fix parameters $\epsilon, \delta \in (0,1)$ and any $s \geq 4 \tilde{r} \ln(2m/\delta) / \epsilon^2$. (Here we are assuming that the algorithm knows $\tilde{r}$, or at least reasonable bounds on $\tilde{r}$.)
- For $t = 1, \ldots, s$:
  - Pick a row $a_i$ with probability $p_i := \|a_i\|^2 / \|A\|_F^2$, i.e., with probability proportional to $\|a_i\|^2$.
  - Add the row $a_i / \sqrt{s \, p_i}$ to $\tilde{A}$.
- Compute the SVD of $\tilde{A}$.

The runtime of this algorithm is dominated by two main tasks. (1) The computation of the sampling probabilities. This can be done in time linear in the number of non-zero entries of $A$. (2) Computing the SVD of $\tilde{A}$. Since $\tilde{A}$ has size $s \times m$, this takes $O(s^2 m)$ time.
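Here is a sketch of the sampling step in numpy (our own rendering of the procedure above, on a toy matrix with $\|A\| = 1$ and small stable rank; the choice of $s$ below is arbitrary rather than taken from the bound above):

```python
import numpy as np

# Sample s rows with probability proportional to squared row norms and rescale
# each chosen row by 1/sqrt(s p_i), so that E[ A_tilde^T A_tilde ] = A^T A.
rng = np.random.default_rng(2)
n, m, s = 2000, 100, 400
U, _ = np.linalg.qr(rng.standard_normal((n, m)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = U @ np.diag(1.0 / np.arange(1, m + 1)) @ V.T       # ||A|| = 1, decaying spectrum

row_norms_sq = np.einsum('ij,ij->i', A, A)             # squared row norms ||a_i||^2
p = row_norms_sq / row_norms_sq.sum()                  # p_i = ||a_i||^2 / ||A||_F^2
idx = rng.choice(n, size=s, replace=True, p=p)
A_tilde = A[idx] / np.sqrt(s * p[idx])[:, None]        # the s x m sampled matrix

stable_rank = row_norms_sq.sum() / np.linalg.norm(A, 2) ** 2
print(stable_rank, np.linalg.norm(A.T @ A - A_tilde.T @ A_tilde, 2))
```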
Theorem 3 Fix any $k \geq 1$. Let $P$ be the orthogonal projection onto the span of the top $k$ right singular vectors of $\tilde{A}$. With probability at least $1 - \delta$,
$$\|A - AP\| \;\leq\; \sqrt{\sigma_{k+1}(A)^2 + 2\epsilon}. \qquad (1)$$
In other words, the best rank $k$ projection obtained from $\tilde{A}$ does nearly as well as the best rank $k$ projection obtained from $A$ itself. (Compare against Fact 2.) Since our algorithm explicitly computes the SVD of $\tilde{A}$, it can easily compute the matrix $P$. We can then use $P$ to efficiently compute an approximate SVD of $A$ as well; see the survey of Halko, Martinsson and Tropp.
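For concreteness, here is an end-to-end sketch of that procedure (again our own toy example, not code from the notes): sample rows, take the top $k$ right singular vectors of $\tilde{A}$, and compare the resulting projection error with the optimal error $\sigma_{k+1}(A)$.

```python
import numpy as np

# Build P from the top-k right singular vectors of the sampled matrix A_tilde
# and compare ||A - A P|| with sigma_{k+1}(A); Theorem 3 bounds their gap additively.
rng = np.random.default_rng(3)
n, m, s, k = 2000, 100, 400, 10
U, _ = np.linalg.qr(rng.standard_normal((n, m)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = U @ np.diag(1.0 / np.arange(1, m + 1)) @ V.T       # sigma_i(A) = 1/i, so ||A|| = 1

row_norms_sq = np.einsum('ij,ij->i', A, A)
p = row_norms_sq / row_norms_sq.sum()
idx = rng.choice(n, size=s, replace=True, p=p)
A_tilde = A[idx] / np.sqrt(s * p[idx])[:, None]

_, _, Vt_tilde = np.linalg.svd(A_tilde, full_matrices=False)
P = Vt_tilde[:k].T @ Vt_tilde[:k]                      # projection onto top-k right
                                                       # singular vectors of A_tilde
print(np.linalg.norm(A - A @ P, 2), 1.0 / (k + 1))     # sampled error vs. sigma_{k+1}(A)
```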
Corollary 4 Set $k = \lceil \tilde{r} / \epsilon \rceil$. Let $P$ be the orthogonal projection onto the span of the top $k$ right singular vectors of $\tilde{A}$. Then, with probability at least $1 - \delta$,
$$\|A - AP\| \;\leq\; \sqrt{3\epsilon}.$$
Let us contrast the error guarantee of (1) with the guarantee we achieved in the previous lecture. Last time we sampled a random matrix $\tilde{B}$ and showed that it approximates the original matrix $B$ in the sense that
$$(1-\epsilon)\, B \;\preceq\; \tilde{B} \;\preceq\; (1+\epsilon)\, B.$$
We say that our result from last time achieves a “multiplicative” or “relative” error guarantee. In contrast, Corollary 4 only guarantees that $\|A - AP\| \leq \sqrt{3\epsilon} = \sqrt{3\epsilon}\,\|A\|$, even though the optimal error $\sigma_{k+1}(A)$ may be significantly smaller than this. We say that today’s theorem only achieves an “additive” or “absolute” error guarantee. To prove today’s theorem, we will use a version of the Ahlswede-Winter inequality that provides an additive error guarantee.
Theorem 5 Let $Y$ be a random, symmetric, positive semi-definite $m \times m$ matrix such that $\|\mathbb{E}[Y]\| \leq 1$. Suppose $\|Y\| \leq R$ for some fixed scalar $R \geq 1$. Let $Y_1, \ldots, Y_s$ be independent copies of $Y$ (i.e., independently sampled matrices with the same distribution as $Y$). For any $\epsilon \in (0,1)$, we have
$$\Pr\Big[ \Big\| \frac{1}{s} \sum_{t=1}^{s} Y_t - \mathbb{E}[Y] \Big\| > \epsilon \Big] \;\leq\; 2m \cdot e^{-s \epsilon^2 / 4R}.$$
The proof is almost identical to the proof of Theorem 1 in Lecture 13. The only difference is that the final sentence of that proof should be deleted.
Our Theorem 3 is actually weaker than Rudelson & Vershynin’s result. They show that one can take $s$ to be roughly $(\tilde{r}/\epsilon^4) \log(\tilde{r}/\epsilon^4)$, which is quite remarkable because it is “dimension free”: the number of samples does not depend on the dimension $m$. Unfortunately our proof, which uses little more than the Ahlswede-Winter inequality, does not give that stronger bound because the failure probability in the Ahlswede-Winter inequality depends on the dimension. Rudelson & Vershynin prove an (additive error) variant of Ahlswede-Winter which avoids this dependence on the dimension. Oliveira 2010 and Hsu-Kakade-Zhang 2011 give further progress in this direction.
2.1. Proofs
The proof of Theorem 3 follows quite straightforwardly from Theorem 5 and the following lemma, which we prove later.
Lemma 6 Let $B$ and $C$ be arbitrary matrices, each with $m$ columns. Let $P$ be the orthogonal projection onto the span of the top $k$ right singular vectors of $C$. Then
$$\|B - BP\|^2 \;\leq\; \sigma_{k+1}(B)^2 \,+\, 2\,\|B^{\mathsf T} B - C^{\mathsf T} C\|.$$
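A quick numerical check of the inequality in Lemma 6, on random matrices of our own choosing:

```python
import numpy as np

# Verify ||B - B P||^2 <= sigma_{k+1}(B)^2 + 2 ||B^T B - C^T C|| when P projects
# onto the top-k right singular vectors of C.
rng = np.random.default_rng(5)
n, m, k = 80, 40, 5
B = rng.standard_normal((n, m))
C = B + 0.1 * rng.standard_normal((n, m))              # C is a perturbation of B

_, _, Vt_C = np.linalg.svd(C, full_matrices=False)
P = Vt_C[:k].T @ Vt_C[:k]

lhs = np.linalg.norm(B - B @ P, 2) ** 2
sigma_B = np.linalg.svd(B, compute_uv=False)
rhs = sigma_B[k] ** 2 + 2 * np.linalg.norm(B.T @ B - C.T @ C, 2)
print(lhs <= rhs, lhs, rhs)
```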
Proof: (of Theorem 3.) We defined $a_1, \ldots, a_n$ to be the rows of $A$, but let us now transpose them to become column vectors, so
$$A^{\mathsf T} A \;=\; \sum_{i=1}^{n} a_i a_i^{\mathsf T}.$$
Let $Y_1, \ldots, Y_s$ be independent, identically distributed random matrices with the following distribution:
$$Y_t \;=\; \frac{a_i a_i^{\mathsf T}}{p_i} \qquad \text{with probability } p_i = \frac{\|a_i\|^2}{\|A\|_F^2} \quad (i = 1, \ldots, n).$$
This is indeed a probability distribution because
$$\sum_{i=1}^{n} p_i \;=\; \frac{\sum_{i=1}^{n} \|a_i\|^2}{\|A\|_F^2} \;=\; 1.$$
Note that the change to $\tilde{A}^{\mathsf T} \tilde{A}$ during the $t$th iteration of the algorithm is exactly $Y_t / s$.
We will apply Theorem 5 to $Y_1, \ldots, Y_s$. We have
$$\mathbb{E}[Y_t] \;=\; \sum_{i=1}^{n} p_i \cdot \frac{a_i a_i^{\mathsf T}}{p_i} \;=\; A^{\mathsf T} A,$$
so $\|\mathbb{E}[Y_t]\| = \|A\|^2 = 1$. Since $\|Y_t\| = \|a_i\|^2 / p_i = \|A\|_F^2$, we may take $R = \|A\|_F^2 = \tilde{r}$, the stable rank of $A$ (since we assume $\|A\| = 1$).

Since $s \geq 4 \tilde{r} \ln(2m/\delta) / \epsilon^2$ and $\tilde{A}^{\mathsf T} \tilde{A} = \frac{1}{s} \sum_{t=1}^{s} Y_t$, we get
$$\Pr\Big[ \big\| \tilde{A}^{\mathsf T} \tilde{A} - A^{\mathsf T} A \big\| > \epsilon \Big] \;\leq\; 2m \cdot e^{-s \epsilon^2 / 4\tilde{r}} \;\leq\; \delta$$
by Theorem 5.
Now we apply Lemma 6 to $B = A$ and $C = \tilde{A}$. So, with probability at least $1 - \delta$,
$$\|A - AP\|^2 \;\leq\; \sigma_{k+1}(A)^2 + 2\,\big\| A^{\mathsf T} A - \tilde{A}^{\mathsf T} \tilde{A} \big\| \;\leq\; \sigma_{k+1}(A)^2 + 2\epsilon.$$
Taking square roots completes the proof.
Proof: (of Corollary 4). By the ordering of the singular values,
$$(k+1)\,\sigma_{k+1}(A)^2 \;\leq\; \sum_{i=1}^{k+1} \sigma_i(A)^2 \;\leq\; \|A\|_F^2 \;=\; \tilde{r},$$
implying $\sigma_{k+1}(A)^2 \leq \tilde{r}/(k+1)$. In particular, if $k \geq \tilde{r}/\epsilon$ then $\sigma_{k+1}(A)^2 \leq \epsilon$, so the bound (1) becomes $\|A - AP\| \leq \sqrt{3\epsilon}$.
Proof: (of Lemma 6). Let $Q := I - P$ be the orthogonal projection onto the kernel of $P$ (i.e., the span of the bottom $m - k$ right singular vectors of $C$). Then
$$\|B - BP\|^2 \;=\; \|BQ\|^2 \;=\; \|Q B^{\mathsf T} B\, Q\|.$$
So,
$$\begin{aligned}
\|Q B^{\mathsf T} B\, Q\| &\;\leq\; \|Q (B^{\mathsf T} B - C^{\mathsf T} C) Q\| + \|Q\, C^{\mathsf T} C\, Q\| \\
&\;\leq\; \|B^{\mathsf T} B - C^{\mathsf T} C\| + \sigma_{k+1}(C)^2 \\
&\;\leq\; 2\,\|B^{\mathsf T} B - C^{\mathsf T} C\| + \sigma_{k+1}(B)^2,
\end{aligned}$$
where the last step uses the following fact, applied to $X = C^{\mathsf T} C$ and $Y = B^{\mathsf T} B$, whose $(k+1)$st largest eigenvalues are $\sigma_{k+1}(C)^2$ and $\sigma_{k+1}(B)^2$ respectively.
Fact 7 Let $X$ and $Y$ be symmetric matrices of the same size. Let $\lambda_j(X)$ and $\lambda_j(Y)$ respectively denote the $j$th largest eigenvalues of $X$ and $Y$. Then
$$|\lambda_j(X) - \lambda_j(Y)| \;\leq\; \|X - Y\| \qquad \text{for every } j.$$
Proof: See Horn & Johnson, “Matrix Analysis”, page 370.