**Dimensionality reduction** is the process of mapping a high dimensional dataset to a lower dimensional space, while preserving much of the important structure. In statistics and machine learning, this often refers to the process of finding a few directions in which a high dimensional random vector has maximum variance. Principal component analysis is a standard technique for that purpose.

In this lecture, we consider a different sort of dimensionality reduction where the goal is to preserve *pairwise distances* between the data points. We present a technique, known as the **random projection method**, for solving this problem. The analysis of this technique is known as the **Johnson-Lindenstrauss lemma**.

In the past few lectures, our main tool has been the Chernoff bound. In this lecture we will not directly use the Chernoff bound, but the main proof uses very similar ideas.

**1. Dimensionality Reduction **

Suppose we have points $x_1, \dots, x_n \in \mathbb{R}^d$. We would like to find points $y_1, \dots, y_n \in \mathbb{R}^m$, where $m \ll d$, such that

$$\| y_i - y_j \| \;\approx\; \| x_i - x_j \| \qquad \forall\, i,j.$$

Here the notation $\|x\|$ refers to the usual Euclidean norm of the vector $x$. We will show that this can be accomplished while taking $m$ to be surprisingly small.

The main result is:

**Theorem 1** Let $x_1, \dots, x_n \in \mathbb{R}^d$ be arbitrary. Pick any $\epsilon \in (0,1)$. Then for some $m = O(\log(n)/\epsilon^2)$ there exist points $y_1, \dots, y_n \in \mathbb{R}^m$ such that

$$(1-\epsilon)\, \|x_i - x_j\| \;\leq\; \|y_i - y_j\| \;\leq\; (1+\epsilon)\, \|x_i - x_j\| \qquad \forall\, i,j. \qquad (1)$$

Moreover, in polynomial time we can compute a linear transformation $L : \mathbb{R}^d \rightarrow \mathbb{R}^m$ such that, defining $y_i = L(x_i)$, the inequalities in (1) are satisfied with probability at least $1 - 1/n$.

Whereas principal component analysis is only useful when the original data points are inherently low dimensional, this theorem requires *absolutely no assumption* on the original data. Also, note that the dimension $m$ of the final data points has no dependence on $d$: the original data could live in an arbitrarily high dimension!

Let me now spoil the surprise: the linear transformation $L$ in Theorem 1 is simply multiplication by a (suitably scaled) matrix whose entries are independent Gaussian random variables.

Formally, for $j = 1, \dots, m$, let $g_j \in \mathbb{R}^d$ be a vector whose entries are independently drawn from $N(0,1)$, the normal distribution with mean 0 and variance 1. Define a linear map $T : \mathbb{R}^d \rightarrow \mathbb{R}^m$ as follows: the $j$th coordinate of $T(v)$ is simply $g_j \cdot v$. We now prove a lemma about $T$, which easily leads to our desired linear transformation $L$.
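
In code, the map is just multiplication by an $m \times d$ matrix of independent standard Gaussians. The sketch below is illustrative (the dimensions, seed, and names `G`, `T` are my own choices, not from the notes); it also checks the concentration that the coming lemma formalizes:

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 1000, 2000  # illustrative dimensions; m is large here only to reduce sampling noise
G = rng.standard_normal((m, d))  # row j of G is the Gaussian vector g_j

def T(v):
    # The j-th coordinate of T(v) is the inner product of g_j with v.
    return G @ v

v = rng.standard_normal(d)
v /= np.linalg.norm(v)  # a unit vector, as in the lemma below

# Each coordinate of T(v) is N(0,1), so ||T(v)||^2 concentrates around m.
print(np.linalg.norm(T(v)) ** 2 / m)  # close to 1
```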

**Lemma 2 (Johnson-Lindenstrauss)** Fix any vector $v \in \mathbb{R}^d$ with $\|v\| = 1$. For some $m = O(\log(1/\delta)/\epsilon^2)$ we have

$$\Pr\big[\, (1-\epsilon)\sqrt{m} \;\leq\; \|T(v)\| \;\leq\; (1+\epsilon)\sqrt{m} \,\big] \;\geq\; 1 - \delta.$$

Given this lemma, our main theorem follows easily.

*Proof:* Define the linear map $L(v) = T(v)/\sqrt{m}$. Since $T$ and $L$ are both linear, the lemma implies that for *any* $u \neq w$, we have

$$\Pr\big[\, (1-\epsilon)\, \|u - w\| \;\leq\; \|L(u) - L(w)\| \;\leq\; (1+\epsilon)\, \|u - w\| \,\big] \;\geq\; 1 - \delta,$$

by applying the lemma to the unit vector $(u-w)/\|u-w\|$.

Apply this result, with $\delta = 1/n^3$, to all vectors $x_i - x_j$ (with $i \neq j$). Since there are fewer than $n^2$ such vectors, a union bound shows that the probability of failing to satisfy (1) is at most $n^2 \cdot 1/n^3 = 1/n$. $\Box$
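
The whole argument can be checked numerically. The sketch below (hypothetical code with illustrative parameters) applies the scaled Gaussian map to random high-dimensional points and verifies every pairwise distance; with $m$ this generous, the per-pair failure probability is astronomically small, so all pairs should pass:

```python
import numpy as np

rng = np.random.default_rng(1)

n, d, eps = 50, 1000, 0.5
m = 500  # generous choice of m, so every pair succeeds with overwhelming probability

X = rng.standard_normal((n, d))  # arbitrary points x_1, ..., x_n in R^d
G = rng.standard_normal((m, d))
Y = X @ G.T / np.sqrt(m)         # y_i = L(x_i), where L divides the Gaussian map by sqrt(m)

ok = True
for i in range(n):
    for j in range(i + 1, n):
        ratio = np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
        ok = ok and (1 - eps <= ratio <= 1 + eps)
print(ok)  # True: all pairwise distances preserved up to 1 +/- eps
```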

** 1.1. Discussion **

First of all, you have probably noticed that we’ve now jumped from the world of discrete probability to continuous probability. This is to make our lives easier. The same theorem would be true if we picked the coordinates of the $g_j$’s to be uniform in $\{+1,-1\}$ rather than Gaussian. But the analysis of the $\{+1,-1\}$ case is trickier, and most proofs analyze that case by showing that its failure probability is not much worse than in the Gaussian case. So the Gaussian case is really the central problem.

Second of all, you might be wondering where the **random projection method** name comes from. Earlier versions of the Johnson-Lindenstrauss lemma used a slightly different function $L$. Specifically, they chose $L = c\, P$, where $P$ is a *projection* onto a *uniformly random subspace* of dimension $m$ and $c$ is an appropriate scaling factor. (Recall that an orthogonal projection matrix is any symmetric, positive semidefinite matrix whose eigenvalues are either $0$ or $1$.) One advantage of that setup is its symmetry: one can argue that the failure probability in Lemma 2 would be the same if one instead chose a *fixed* subspace of dimension $m$ and a *random* unit vector $v$. The latter problem can be analyzed by choosing the subspace to be the most convenient one of all: the span of the first $m$ vectors in the standard basis.

So how is our mapping $T$ different? It is almost a projection, but not quite. When we choose $T$ to be a matrix of independent Gaussians, it turns out that the row space of $T$ is indeed a uniformly random subspace of dimension $m$, but the eigenvalues of $T^{\mathsf T} T$ are not necessarily in $\{0,1\}$. If we had insisted that the random vectors $g_1, \dots, g_m$ that we choose were *orthonormal*, then we would have obtained a projection matrix. We could explicitly orthonormalize them by the Gram-Schmidt method, but fortunately that turns out to be unnecessary: the Johnson-Lindenstrauss lemma remains true even if we ignore orthonormality of the $g_j$’s.

Our definition of $T$ turns out to be a bit more convenient in some algorithmic applications, because we avoid the awkward Gram-Schmidt step.
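
For comparison, the projection-based version can be sketched by orthonormalizing a Gaussian matrix; NumPy’s QR factorization plays the role of Gram-Schmidt here (illustrative code, my own parameter choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 100, 20

G = rng.standard_normal((m, d))
Q, _ = np.linalg.qr(G.T)  # columns of Q: orthonormal basis of a uniformly random m-dim subspace
P = Q @ Q.T               # d x d orthogonal projection onto that subspace

# P is symmetric and idempotent, so its eigenvalues are 0 or 1 (with trace(P) = m).
print(np.allclose(P, P.T), np.allclose(P @ P, P))
```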

** 1.2. The proof **

We need just one fact from probability theory: the sum of independent Gaussians is again Gaussian.

**Fact 1** Let $X$ and $Y$ be independent random variables where $X$ has distribution $N(\mu_1, \sigma_1^2)$ and $Y$ has distribution $N(\mu_2, \sigma_2^2)$. Then $X + Y$ has distribution $N(\mu_1 + \mu_2,\, \sigma_1^2 + \sigma_2^2)$.

Recall that if $X$ has distribution $N(\mu, \sigma^2)$ then $cX$ has distribution $N(c\mu, c^2\sigma^2)$. So by induction we get:

**Fact 2** Let $X_1, \dots, X_k$ be independent random variables where $X_i$ has distribution $N(0, \sigma_i^2)$. Then, for any scalars $c_1, \dots, c_k$, the sum $\sum_{i=1}^k c_i X_i$ has distribution $N\big(0,\, \sum_{i=1}^k c_i^2 \sigma_i^2\big)$.
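
Fact 2 is easy to sanity-check by Monte Carlo simulation (illustrative code; the constants and sample size below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(3)

sigma = np.array([1.0, 2.0, 0.5])  # X_i ~ N(0, sigma_i^2)
c = np.array([3.0, -1.0, 2.0])     # arbitrary scalars c_i
N = 200_000

X = rng.standard_normal((N, 3)) * sigma
S = X @ c                          # samples of sum_i c_i * X_i

# Fact 2 predicts S ~ N(0, sum_i c_i^2 sigma_i^2) = N(0, 14)
predicted_var = np.sum(c**2 * sigma**2)
print(S.mean(), S.var(), predicted_var)
```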

The proof of Lemma 2 uses separate but similar arguments to analyze the upper and lower tails, as was the case with Chernoff bounds. We will prove only the upper tail. For convenience we square both sides, so our goal is to prove that

$$\Pr\big[\, \|T(v)\|^2 \;>\; (1+\epsilon)^2\, m \,\big] \;\leq\; \delta/2. \qquad (2)$$

Define $Y_j = g_j \cdot v$, which is the $j$th coordinate of $T(v)$. By Fact 2, $Y_j$ has distribution $N(0, \sum_i v_i^2) = N(0,1)$, since $\|v\| = 1$.

We get the following expansion:

$$\|T(v)\|^2 \;=\; \sum_{j=1}^m Y_j^2.$$

Our goal is to prove an upper tail bound on $\sum_{j=1}^m Y_j^2$. Fortunately, this random variable has a well-known distribution. We have just written $\|T(v)\|^2$ as the sum-of-squares of $m$ standard normal random variables, which is called the chi-squared distribution with parameter $m$. It is easy to see that

$$\mathop{\mathbb E}\big[\, \|T(v)\|^2 \,\big] \;=\; \sum_{j=1}^m \mathop{\mathbb E}\big[ Y_j^2 \big] \;=\; m,$$

since $\mathop{\mathbb E}[Y_j^2]$ is the variance of $Y_j$, which we have shown is $1$.

So our desired inequality (2) is asking for a bound on the probability that a chi-squared random variable slightly exceeds its expectation. Since the chi-squared distribution is a sum of $m$ independent random variables, we know by the central limit theorem that it converges to a normal distribution as $m \rightarrow \infty$. We just need to quantify the rate of convergence, and this is where the Chernoff-style ideas arise.

**Claim 1** Let $X$ have the chi-squared distribution with parameter $m$. Set $a = (1+\epsilon)^2 m$ for some $\epsilon \in (0,1)$. Then $\Pr[\, X \geq a \,] \leq e^{-m\epsilon^2/2}$.

Applying Claim 1 to $X = \|T(v)\|^2$, with $m = \lceil 2\ln(2/\delta)/\epsilon^2 \rceil$ so that $e^{-m\epsilon^2/2} \leq \delta/2$, completes the proof of (2).

*Proof:* Pick any parameter $t \in (0, 1/2)$. Just like with Chernoff bounds, we write

$$\Pr[\, X \geq a \,] \;=\; \Pr\big[\, e^{tX} \geq e^{ta} \,\big] \;\leq\; e^{-ta} \cdot \mathop{\mathbb E}\big[ e^{tX} \big]. \qquad (3)$$

As with Chernoff bounds, the bulk of the effort is in analyzing $\mathop{\mathbb E}[e^{tX}]$, but we can use independence to write

$$\mathop{\mathbb E}\big[ e^{tX} \big] \;=\; \mathop{\mathbb E}\Big[ \prod_{j=1}^m e^{t Y_j^2} \Big] \;=\; \prod_{j=1}^m \mathop{\mathbb E}\big[ e^{t Y_j^2} \big] \;=\; \mathop{\mathbb E}\big[ e^{t Y^2} \big]^m, \qquad (4)$$

where $Y$ has distribution $N(0,1)$.

Expanding the expectation we get

$$\mathop{\mathbb E}\big[ e^{t Y^2} \big] \;=\; \int_{-\infty}^{\infty} e^{t y^2} \cdot \frac{e^{-y^2/2}}{\sqrt{2\pi}} \, dy \;=\; \int_{-\infty}^{\infty} \frac{e^{-(1-2t)\,y^2/2}}{\sqrt{2\pi}} \, dy.$$

If that factor were simply $e^{-y^2/2}$ then we could evaluate that integral using the fact that $e^{-y^2/2}/\sqrt{2\pi}$ is the PDF of a standard normal random variable, so it integrates to 1. We can accomplish that with a change of variables. Using $z = \sqrt{1-2t}\, y$, we get

$$\mathop{\mathbb E}\big[ e^{t Y^2} \big] \;=\; \frac{1}{\sqrt{1-2t}} \int_{-\infty}^{\infty} \frac{e^{-z^2/2}}{\sqrt{2\pi}} \, dz \;=\; \frac{1}{\sqrt{1-2t}}.$$
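
The closed form $\mathop{\mathbb E}[e^{tY^2}] = 1/\sqrt{1-2t}$ (valid for $t < 1/2$) can also be sanity-checked by simulation (illustrative parameters of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(4)
t = 0.2
Y = rng.standard_normal(500_000)  # samples of Y ~ N(0,1)

empirical = np.exp(t * Y**2).mean()   # Monte Carlo estimate of E[exp(t Y^2)]
exact = 1 / np.sqrt(1 - 2 * t)        # closed form, valid since t < 1/2
print(empirical, exact)
```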

Combining this with (3) and (4) we get

$$\Pr[\, X \geq a \,] \;\leq\; \frac{e^{-ta}}{(1-2t)^{m/2}}.$$

The last step is to plug in an appropriate choice of $t$. We set $t = \frac{1}{2}\big(1 - \frac{m}{a}\big)$, which minimizes the right-hand side, giving

$$\Pr[\, X \geq a \,] \;\leq\; e^{-(a-m)/2} \cdot \Big(\frac{a}{m}\Big)^{m/2}.$$

Plugging in $a = (1+\epsilon)^2 m$, this becomes

$$\Pr\big[\, X \geq (1+\epsilon)^2 m \,\big] \;\leq\; \Big( (1+\epsilon)\, e^{-\epsilon - \epsilon^2/2} \Big)^{m}.$$

Using our usual techniques from Notes on Convexity Inequalities, one can show that $1 + x \leq e^{x}$ for all $x$, and hence $(1+\epsilon)\, e^{-\epsilon - \epsilon^2/2} \leq e^{-\epsilon^2/2}$. So this shows that

$$\Pr\big[\, X \geq (1+\epsilon)^2 m \,\big] \;\leq\; e^{-m\epsilon^2/2},$$

as required. $\Box$
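
As a final sanity check, the Chernoff-style tail bound for the chi-squared distribution (which works out to $e^{-m\epsilon^2/2}$ at the threshold $(1+\epsilon)^2 m$) can be compared against an empirical tail estimate (illustrative parameters, hypothetical code):

```python
import numpy as np

rng = np.random.default_rng(5)
m, eps = 50, 0.5
N = 100_000

# N samples of a chi-squared random variable with parameter m
chi2 = (rng.standard_normal((N, m)) ** 2).sum(axis=1)

empirical_tail = (chi2 >= (1 + eps) ** 2 * m).mean()
chernoff_bound = np.exp(-m * eps**2 / 2)

# The true tail is far below the bound; the empirical estimate should be too.
print(empirical_tail, chernoff_bound)
```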

**2. Remarks **

It turns out that the Johnson-Lindenstrauss lemma is almost optimal. Alon proved the following lower bound.

**Theorem 3 (Alon)** Let $y_1, \dots, y_n \in \mathbb{R}^m$ be vectors such that $1 \leq \|y_i - y_j\| \leq 1+\epsilon$ for all $i \neq j$. Then $m = \Omega\big( \frac{\log n}{\epsilon^2 \log(1/\epsilon)} \big)$.

To understand this theorem, let $x_1, \dots, x_n$ be the vertices of a simplex, i.e., $\|x_i - x_j\| = 1$ for all $i \neq j$. Then, if we map the $x_i$’s to points $y_i$ in $\mathbb{R}^m$ while preserving distances up to a factor $1+\epsilon$, then the dimension $m$ must be at least $\Omega\big( \frac{\log n}{\epsilon^2 \log(1/\epsilon)} \big)$, which is nearly what the Johnson-Lindenstrauss lemma would give. The only discrepancy is the small factor of $\log(1/\epsilon)$.

The Johnson-Lindenstrauss lemma very strongly depends on properties of the Euclidean norm. For other norms, this remarkable dimensionality reduction is not necessarily possible. For example, for the $\ell_1$ norm $\|x\|_1 = \sum_i |x_i|$, it is known that any map into $(\mathbb{R}^m, \|\cdot\|_1)$ that preserves pairwise distances between $n$ points up to a factor $D$ must have $m \geq n^{\Omega(1/D^2)}$. (See Brinkman-Charikar 2003 and Lee-Naor 2004.)
