In this lecture we will see two applications of the Johnson-Lindenstrauss lemma.
1. Streaming Algorithms
In 1996, Alon, Matias and Szegedy introduced the streaming model of computation. Given that they all were at AT&T Labs at the time, their work was presumably motivated by the problem of monitoring network traffic. Their paper was highly influential: it earned them the Gödel prize, and it motivated a huge amount of follow-up work, both on the theory of streaming algorithms, and on applications in networking, databases, etc.
The model aims to capture scenarios in which a computing device with a very limited amount of storage must process a huge amount of data, and must compute some aggregate statistics about that data. The motivating example is a network switch which must process dozens of gigabytes per second, and may only have a few kilobytes or megabytes of fast memory. We might like to compute, for example, the number of distinct traffic flows (source/destination pairs) traversing the switch, the variance of the packet sizes, etc.
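Let us formalize this model using histograms. The data is a sequence $(i_1, i_2, \ldots, i_n)$ of indices, where each $i_j \in \{1, \ldots, d\}$. Intuitively, we want to compute the histogram $x \in \mathbb{Z}^d$, where
$$x_k = |\{ j : i_j = k \}|.$$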
If the algorithm were to maintain $x$ explicitly, it could start with $x = 0$, then at each time step $j$, it receives the index $i_j$ and increments $x_{i_j}$ by 1. After receiving all the data, the algorithm could compute various statistics of $x$, such as a norm $\|x\|_p$, or the number of non-zero entries, etc.
So far the problem is trivial. What makes the model interesting is that we will only allow the algorithm space $O(\log n)$. (We will assume that $d = \mathrm{poly}(n)$, so it takes $O(\log n)$ bits to represent any integer in $\{1, \ldots, d\}$.) So the algorithm cannot store the data explicitly, nor can it store the histogram explicitly!
Remarkably, numerous interesting statistics can still be computed in this model, if we allow randomized algorithms that output approximate answers. Nearly optimal bounds are known for the amount of space required to estimate many interesting statistics.
Today we will give a simple algorithm to estimate the $\ell_2$-norm, namely $\|x\|_2 = (\sum_k x_k^2)^{1/2}$. As an example of a scenario where this would be useful, consider a database table $((a_1, b_1), \ldots, (a_n, b_n))$. A self-join with the predicate $a = a$ would output all triples $(a, b, b')$ where $(a, b)$ and $(a, b')$ belong to the table. What is the size of this self-join? It is simply $\|x\|_2^2$, where $x$ is the histogram for the $a$ values in the table. So a streaming algorithm for estimating $\|x\|_2$ could be quite useful in database query optimization.
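As a small sanity check of the self-join claim, the following sketch counts the join output by brute force and compares it with the sum of squared histogram entries. The function names and the toy table are made up for illustration.

```python
from collections import Counter


def self_join_size(rows):
    """Count pairs of rows that agree on the first column (the join key)."""
    return sum(1 for (a1, _) in rows for (a2, _) in rows if a1 == a2)


def squared_l2_norm_of_histogram(rows):
    """Sum of squared counts of the first-column values."""
    hist = Counter(a for a, _ in rows)
    return sum(c * c for c in hist.values())


# Hypothetical toy table: key "x" appears 3 times, key "y" once.
rows = [("x", 1), ("x", 2), ("y", 3), ("x", 4)]
print(self_join_size(rows), squared_l2_norm_of_histogram(rows))
```

Both quantities equal 3^2 + 1^2 = 10 on this table.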
The Algorithm. The idea is very simple: instead of storing $x$ explicitly, we will store a dimensionality reduced form of $x$. Let $L$ be a $t \times d$ matrix whose entries are drawn independently from the distribution $N(0, 1/t)$. (This is the same as the linear map defined in the previous lecture.) The algorithm will explicitly maintain the vector $y$, defined as $y = Lx$. At time step $j$, the algorithm receives the index $i_j$, so (implicitly) the $i_j$th coordinate of $x$ increases by $1$. The corresponding change in $y$ is to add the $i_j$th column of $L$ to $y$.
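To analyze this algorithm, we use the Johnson-Lindenstrauss lemma. Our results from last time imply that
$$\Pr\big[ (1-\epsilon) \|x\| \le \|y\| \le (1+\epsilon) \|x\| \big] \ge 1 - \exp(-\Omega(\epsilon^2 t)).$$
So if we set $t = \Theta(1/\epsilon^2)$, then $\|y\|$ gives a $(1+\epsilon)$ approximation of $\|x\|$ with constant probability. Or, if we want $y$ to give an accurate estimate at each of the $n$ time steps, we can take a union bound over the time steps and set $t = \Theta(\log(n)/\epsilon^2)$.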
How much space? At first glance, it seems that the space used by this algorithm is just the space needed to store the vector $y$, which is $t$ words of space. (There is also the issue of how many bits of accuracy are needed when generating the Gaussian random variables, but we will ignore that issue. As discussed last time, the Johnson-Lindenstrauss lemma works equally well with $\pm 1$ random variables, so numerical accuracy is not a concern.)
There is one small problem: the matrix $L$ must not change during the execution of the algorithm. So, every time the algorithm sees the same index $i$ in the data stream, it must add the same $i$th column of $L$ to $y$. We cannot generate a new random column each time. The naive way to accomplish this would be to generate $L$ at the beginning of the algorithm and explicitly store it so that we can use its columns in each time step. However, $L$ has $td$ entries, so storing $L$ is even worse than storing $x$!
The solution to this problem is to observe that $L$ is a random object, so we may not need to store it explicitly. In a practical implementation, $L$ will be generated by a pseudorandom generator initialized by some seed, so we can regenerate columns of $L$ at will by resetting the seed.
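The practical variant can be sketched as follows. This is only an illustration under stated assumptions: the class name `L2Sketch` and its methods are hypothetical, Python's built-in `random.Random` stands in for the pseudorandom generator, and each column is regenerated by re-seeding with the pair (seed, index). It carries no provable guarantee of the kind Nisan's generator provides.

```python
import math
import random


class L2Sketch:
    """Maintains y = Lx for an implicit t-by-d Gaussian matrix L."""

    def __init__(self, t, seed=0):
        self.t = t
        self.seed = seed
        self.y = [0.0] * t  # the only state stored besides the seed

    def _column(self, i):
        # Re-seed from (seed, i) so index i always yields the same column.
        rng = random.Random(f"{self.seed}:{i}")
        sigma = 1.0 / math.sqrt(self.t)  # entries are N(0, 1/t)
        return [rng.gauss(0.0, sigma) for _ in range(self.t)]

    def update(self, i):
        # x_i increases by 1, so y gains the i-th column of L.
        for k, value in enumerate(self._column(i)):
            self.y[k] += value

    def estimate(self):
        # ||y|| approximates ||x||.
        return math.sqrt(sum(v * v for v in self.y))
```

A caller would invoke `update(i)` once per stream element and read `estimate()` whenever an approximation of the norm is needed. The value of `t` is chosen by the caller, since the constants hidden in the bound on $t$ are not specified here.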
Alternatively, there is another solution which has provable guarantees but is probably too complicated to use in practice. Long before the streaming model was introduced, Nisan designed a beautiful pseudorandom generator which produces provably good random bits, but only for algorithms which use a small amount of space. Streaming algorithms meet that requirement, so we can simply use Nisan’s method to regenerate the matrix as necessary. Unfortunately, we do not have time to discuss the details.
2. Nearest Neighbor
The nearest neighbor problem is a classic problem involving high-dimensional data. Given points $P = \{p_1, \ldots, p_n\} \subset \mathbb{R}^d$, preprocess $P$ so that, given a query point $q \in \mathbb{R}^d$, we can quickly find $i$ minimizing $\|q - p_i\|$. As usual we focus on the Euclidean norm, but this problem is interesting for many norms.
This problem can trivially be solved in polynomial time. We could do no processing of $P$, then for each query find the closest point by exhaustive search. This requires time $O(nd)$ for each query. An alternative approach is to use a kd-tree, which is a well-known data structure for representing geometric points. Unfortunately this could take $O(d n^{1-1/d})$ time for each query, which is only a substantial improvement over exhaustive search when the dimension $d$ is a constant. This phenomenon, the failure of low dimensional methods when applied in high dimensions, is known as the “curse of dimensionality”.
We will present an improved solution, by allowing the use of randomization and approximation. We will instead solve the $\epsilon$-approximate nearest neighbor problem. Given a query point $q$, we must find a point $p \in P$ such that
$$\|p - q\| \le (1+\epsilon) \|p' - q\| \qquad \forall p' \in P.$$
Our solution is based on a reduction to a simpler problem, the $\epsilon$-Point Location in Equal Balls problem. The input data is a collection of $n$ balls of radius $r$, centered at points $P = \{p_1, \ldots, p_n\}$. Let $B(p, r)$ denote the ball of radius $r$ around $p$. Given a query point $q$ we must answer the query as follows:
- If there is any $p_i$ with $q \in B(p_i, r)$, we must say Yes, and we must output any point $p_j$ with $q \in B(p_j, (1+\epsilon) r)$.
- If there is no point $p_i$ with $q \in B(p_i, (1+\epsilon) r)$, we must say No.
- Otherwise (meaning that the closest $p_i$ to $q$ has $r < \|p_i - q\| \le (1+\epsilon) r$), we can say either Yes or No. (As before, if we say Yes we must also output a point $p_j$ with $q \in B(p_j, (1+\epsilon) r)$.)
Let us call this problem $PLEB(r)$.
In other words, let us call a ball of radius $r$ a “green ball” and a ball of radius $(1+\epsilon) r$ a “red ball”. If $q$ is contained in any green ball, we must say Yes and output the center of any red ball containing $q$. If $q$ is not contained in any red ball, we must say No. Otherwise, we could say either Yes or No, but in the former case we must again output the center of any red ball containing $q$.
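The three cases can be restated as a checker that decides whether a given answer is acceptable. This is only an illustrative sketch: the function name and the convention that `None` means No and a point means Yes are assumptions made here.

```python
import math


def is_valid_pleb_answer(points, q, r, eps, answer):
    """Return True if `answer` is an allowed response to query q.

    `answer` is None for No, or one of `points` (the witness) for Yes.
    """
    nearest = min(math.dist(p, q) for p in points)
    if answer is None:
        # No is forbidden only when q lies in some green ball.
        return nearest > r
    # Yes requires a witness whose red ball contains q.
    return answer in points and math.dist(answer, q) <= (1 + eps) * r
```

For a query inside a green ball only Yes passes, for a query outside every red ball only No passes, and in between both answers pass.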
2.1. Reducing Approximate Nearest Neighbor to PLEB
We now explain how to solve the $\epsilon$-approximate nearest neighbor problem using any solution to the $\epsilon$-Point Location in Equal Balls problem. Scale the point set so that the minimum interpoint distance is at least $1$, then let $R$ be the maximum interpoint distance. So $1 \le \|p - p'\| \le R$ for all distinct $p, p' \in P$. For every radius $r = (1+\epsilon)^0, (1+\epsilon)^1, \ldots, R$, we initialize our solution to $PLEB(r)$. Given any query point $q$, we use binary search to find the minimum $r$ for which $PLEB(r)$ says Yes. Let $p$ be the point that it returns.
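The requirements of $PLEB(r)$ guarantee that $\|p - q\| \le (1+\epsilon) r$. On the other hand, since $PLEB(r/(1+\epsilon))$ said No, we know that there is no point $p'$ with $\|p' - q\| \le r/(1+\epsilon)$. Thus $p$ satisfies
$$\|p - q\| \le (1+\epsilon) r < (1+\epsilon)^2 \|p' - q\| \qquad \forall p' \in P.$$
And so this gives a solution to $\epsilon$-approximate nearest neighbor, with a slightly different $\epsilon$.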
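The reduction can be sketched in code as follows. The names `brute_force_pleb` and `build_ann` are hypothetical, and the oracle used here is an exhaustive-search stand-in, not an efficient solution to the ball problem. The sketch assumes the points are already scaled so that the minimum interpoint distance is at least 1. The text does not discuss queries that lie farther than the largest radius from every point, or closer than the smallest radius; here the former returns `None` and the latter is not treated specially.

```python
import math


def brute_force_pleb(points, r, eps):
    """Stand-in oracle: says Yes exactly when some point is within r of q."""
    def query(q):
        for p in points:
            if math.dist(p, q) <= r:
                return p  # Yes, with witness p
        return None  # No
    return query


def build_ann(points, eps, make_pleb=brute_force_pleb):
    """One oracle per radius 1, (1+eps), (1+eps)^2, ... up to at least R."""
    R = max(math.dist(p, p2) for p in points for p2 in points)
    radii = [1.0]
    while radii[-1] < R:
        radii.append(radii[-1] * (1 + eps))
    oracles = [make_pleb(points, r, eps) for r in radii]

    def query(q):
        lo, hi = 0, len(oracles) - 1
        if oracles[hi](q) is None:
            return None  # even the largest radius says No
        # Invariant: oracles[hi] says Yes.
        while lo < hi:
            mid = (lo + hi) // 2
            if oracles[mid](q) is not None:
                hi = mid
            else:
                lo = mid + 1
        return oracles[hi](q)

    return query
```

Any function with the same signature as `brute_force_pleb` can be passed as `make_pleb`, which is the sense in which the reduction works with any solution to the ball problem.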
2.2. Solving PLEB
The main idea here is quite simple. We discretize the space, then use a hash table to identify locations belonging to a ball.
Preprocessing. In more detail, the preprocessing step for $PLEB(r)$ proceeds as follows. We first partition the space into cuboids ($d$-dimensional cubes) of side length $\epsilon r / \sqrt{d}$. Note that the diameter of a cuboid is its side length times $\sqrt{d}$, which is $\epsilon r$. Each cuboid is identified by a canonical point, say the minimal point contained in the cuboid. We then create a hash table, initially empty. For each point $p_i$ and each cuboid $C$ that intersects $B(p_i, r)$, we insert the pair $(C, p_i)$ into the hash table.
Queries. Now consider how to perform a query for a point $q$. The first step is to determine the cuboid $C$ that contains $q$, by simple arithmetic. Next, we look up $C$ in the hash table. If there are no matches, that means that no ball $B(p_i, r)$ intersects $C$, and therefore $q$ is not contained in any ball of radius $r$ (a green ball). So, by the requirements of $PLEB(r)$, we can say No.
Suppose that $C$ is in the hash table. Then the hash table can return us an arbitrary pair $(C, p_j)$, which tells us that $B(p_j, r)$ intersects $C$. By the triangle inequality, the distance from $p_j$ to $q$ is at most $r$ plus the diameter of the cuboid. So $\|p_j - q\| \le (1+\epsilon) r$, i.e., $q$ is contained in the red ball around $p_j$. By the requirements of $PLEB(r)$, we can say Yes and we can return the point $p_j$.
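The preprocessing and query steps might look as follows in code. This is a sketch under assumptions: `build_pleb` is a hypothetical name, a Python dict plays the role of the hash table, each cuboid is keyed by its integer grid coordinates, and only one point is stored per cuboid since a query needs only an arbitrary match. Candidate cuboids are enumerated from the bounding box of each ball, which is only sensible in low dimension.

```python
import itertools
import math


def build_pleb(points, r, eps):
    d = len(points[0])
    side = eps * r / math.sqrt(d)  # cuboid diameter is then eps * r
    table = {}
    for p in points:
        lo = [math.floor((c - r) / side) for c in p]
        hi = [math.floor((c + r) / side) for c in p]
        ranges = [range(a, b + 1) for a, b in zip(lo, hi)]
        for cell in itertools.product(*ranges):
            # Squared distance from p to the closest point of the cuboid.
            gap2 = 0.0
            for c, k in zip(p, cell):
                closest = min(max(c, k * side), (k + 1) * side)
                gap2 += (c - closest) ** 2
            if gap2 <= r * r:  # the cuboid intersects B(p, r)
                table.setdefault(cell, p)

    def query(q):
        cell = tuple(math.floor(c / side) for c in q)
        return table.get(cell)  # a point means Yes, None means No

    return query
```

For example, with centers `(0.0, 0.0)` and `(10.0, 0.0)`, `r=1.0` and `eps=0.5`, the query `(0.3, 0.4)` lies in a green ball and returns `(0.0, 0.0)`, while `(5.0, 5.0)` returns `None`.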
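Time and Space Analysis. To analyze this algorithm, we first need to determine the number of cuboids that intersect a ball of radius $r$. The volume of a ball of radius $r$ is roughly $2^{O(d)} r^d / d^{d/2}$. On the other hand, the volume of a cuboid is $(\epsilon r / \sqrt{d})^d$. So the number of cuboids that intersect this ball is roughly
$$\frac{2^{O(d)} r^d / d^{d/2}}{(\epsilon r / \sqrt{d})^d} = O(1/\epsilon)^d.$$
Therefore the time and space used by the preprocessing step, over all $n$ balls, is roughly $n \cdot O(1/\epsilon)^d$.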
To perform a query, we just need to compute the cuboid containing $q$, then look up that cuboid in the hash table. This takes $O(d)$ time, which is optimal, since we must examine all coordinates of the vector $q$.
Unfortunately the preprocessing time and space is exponential in $d$, which is terrible. The curse of dimensionality has struck again! The next section gives an improved solution.
2.3. Approximate Nearest Neighbor by Johnson-Lindenstrauss
Our last key observation is that, by applying the Johnson-Lindenstrauss lemma, we can assume that our points lie in a low-dimensional space. Specifically, we can randomly generate a $t \times d$ matrix which maps our point set to $\mathbb{R}^t$ with $t = O(\log(n)/\epsilon^2)$, while approximately preserving distances between all points in $P$ with high probability. That same map will also approximately preserve distances between any query points and the points in $P$, as long as the number of queries performed is at most $\mathrm{poly}(n)$.
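A minimal sketch of this projection step is given below. The name `make_jl_map` is hypothetical, the Gaussian entries come from Python's `random` module, and the target dimension `t` is picked by hand, because the constant hidden in the bound on the target dimension is not specified here. The same `project` function would be applied to every data point during preprocessing and to every query point at query time.

```python
import math
import random


def make_jl_map(d, t, seed=0):
    """Return a function applying a fixed random t-by-d Gaussian matrix."""
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(t)  # entries are N(0, 1/t)
    matrix = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(t)]

    def project(p):
        # Each output coordinate is one row of the matrix dotted with p.
        return tuple(sum(row[j] * p[j] for j in range(d)) for row in matrix)

    return project


# Hypothetical usage: 5 random points in dimension 500, mapped to dimension 250.
rng = random.Random(123)
points = [tuple(rng.uniform(-1, 1) for _ in range(500)) for _ in range(5)]
project = make_jl_map(d=500, t=250, seed=7)
low_dim_points = [project(p) for p in points]
```

Comparing `math.dist` on pairs from `points` and on the corresponding pairs from `low_dim_points` gives an empirical view of how well distances are preserved for a given `t`.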
The analysis of $PLEB(r)$ changes as follows. The preprocessing step must apply the matrix to all points in $P$, which takes time $O(ndt) = O(nd \log(n)/\epsilon^2)$. The time to set up the hash table improves to $n \cdot O(1/\epsilon)^t = n^{O(\log(1/\epsilon)/\epsilon^2)}$. So assuming $\epsilon$ is a constant, the preprocessing step runs in polynomial time. Each query must also apply the Johnson-Lindenstrauss matrix to the query point, which takes time $O(dt) = O(d \log(n)/\epsilon^2)$.
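Finally, we analyze the reduction which allowed us to solve Approximate Nearest Neighbor. The preprocessing step simply initializes $PLEB(r)$ for all values of $r$, of which there are $\log_{1+\epsilon} R = O(\log(R)/\epsilon)$. So the total preprocessing time is
$$O(\log(R)/\epsilon) \cdot n^{O(\log(1/\epsilon)/\epsilon^2)},$$
which is horrible, but polynomial time assuming $R$ is reasonable and $\epsilon$ is a constant. Each query must perform binary search to find the minimum radius $r$ for which $PLEB(r)$ says Yes, so the total query time is
$$O\Big( \frac{d \log n}{\epsilon^2} \cdot \log\frac{\log R}{\epsilon} \Big).$$
Assuming $R$ is reasonable, say $R = \mathrm{poly}(n)$, this is $O\big( \frac{d \log n}{\epsilon^2} \cdot \log\frac{\log n}{\epsilon} \big)$.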