In this lecture we continue to discuss applications of randomized algorithms in computer networking.

**1. Analysis of SkipNet Routing**

Last time we introduced the SkipNet peer-to-peer system and we discussed its routing protocol. Recall that every node $u$ has a string identifier and a random bitstring $r(u) = r_1(u) r_2(u) \cdots r_L(u) \in \{0,1\}^L$, where $L = \lceil 3 \log_2 n \rceil$ and $n$ is the number of nodes. The nodes are organized into several doubly-linked lists, each sorted by identifier. For every bitstring $w$ of length at most $L$, there is a list containing all nodes $u$ for which $w$ is a prefix of $r(u)$; we call $|w|$ the *level* of this list.

We proved that:

**Claim 1** With probability at least $1 - 1/n$, every list with $|w| = L$ contains at most one node.

To route messages from a node $s$ to a node $t$, we use the following protocol.

- Send the message through node $s$'s level $L$ list as far as possible *towards* the destination without going *beyond* the destination, arriving at node $u_{L-1}$.
- Send the message from $u_{L-1}$ through $u_{L-1}$'s level $L-1$ list as far as possible *towards* the destination without going *beyond* the destination, arriving at node $u_{L-2}$.
- Send the message from $u_{L-2}$ through $u_{L-2}$'s level $L-2$ list as far as possible *towards* the destination without going *beyond* the destination, arriving at node $u_{L-3}$.
- …
- Send the message from $u_0$ through the level 0 list to the destination $t$ (which we can also call $u_{-1}$).
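To make the protocol concrete, here is a small Python simulation (my own illustration, not part of the original notes; the integer identifiers and helper names are invented). Nodes have identifiers $0, \ldots, n-1$, and at level $i$ the message may only walk along the list of nodes agreeing with the source on the first $i$ random bits:

```python
import random

def build_nodes(n, L, seed=0):
    """Each node (identifier 0..n-1) picks a random L-bit string."""
    rng = random.Random(seed)
    return {u: tuple(rng.randrange(2) for _ in range(L)) for u in range(n)}

def route(bits, L, s, t):
    """Route a message from s to t; return the number of hops taken."""
    cur, hops = s, 0
    for level in range(L, -1, -1):
        # The level-`level` list containing the current node: all nodes
        # agreeing with the source on the first `level` random bits.
        members = sorted(u for u in bits if bits[u][:level] == bits[s][:level])
        # Move as far as possible towards t without going beyond it.
        lo, hi = min(cur, t), max(cur, t)
        reachable = [u for u in members if lo <= u <= hi]
        nxt = max(reachable) if t >= cur else min(reachable)
        hops += abs(members.index(nxt) - members.index(cur))
        cur = nxt
    assert cur == t  # the level-0 list contains every node
    return hops
```

Routing between many pairs and averaging the hop counts gives path lengths on the order of $\log n$, which is what the analysis below establishes.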

Our main theorem shows that any message traverses only $O(\log n)$ nodes to reach its destination.

**Theorem 1** With probability at least $1 - 2/n$ (over the choice of the nodes' random bitstrings), this routing scheme can send a message from any node to any other node by traversing $O(\log n)$ intermediate nodes.

*Proof:* Let the source node $s$ have identifier $\mathrm{id}(s)$ and random bits $r(s) = r_1 r_2 \cdots r_L$. Let the destination node $t$ have identifier $\mathrm{id}(t)$. Let $P$ be the path giving the sequence of nodes traversed when routing a message from the source $s$ to the destination $t$.

Intuitively, we want to show that the path traverses only a *constant* number of connections at each level, and since the number of levels is $L+1 = O(\log n)$, that would complete the proof. Unfortunately, this statement is too strong: there might be a level where we traverse many connections (even $\Omega(\log n)$ connections). But as long as this doesn't happen too often, we can still show that the total path length is $O(\log n)$.

So instead, our analysis is a bit more subtle. The main trick is to figure out a protocol for routing the message *backwards along the same path* $P$ from the destination $t$ to the source $s$. Recall that each node $u_i$ is the node in $u_{i+1}$'s level $i+1$ list which is closest to the destination (without going beyond it). Since $u_i$ is also contained in $u_{i-1}$'s level $i$ list, we can find $u_i$ in the following way: Starting from $u_{i-1}$, move *backward* through $u_{i-1}$'s level $i$ list towards $s$, until we encounter the first node lying in $s$'s level $i+1$ list. That node must be $u_i$.

In other words, the following protocol routes backwards along path $P$ from $t$ to $s$.

- **Phase 0.** Send the message from $t$ through the level 0 list towards $s$, until we reach a node in $s$'s level 1 list. (This node is $u_0$.)
- **Phase 1.** Send the message from $u_0$ through $s$'s level 1 list towards $s$, until we reach a node in $s$'s level 2 list. (This node is $u_1$.)
- …
- **Phase $i$.** Send the message from $u_{i-1}$ through $s$'s level $i$ list towards $s$, until we reach a node in $s$'s level $i+1$ list. (This node is $u_i$.)
- **Phase $L$.** Send the message from $u_{L-1}$ through $s$'s level $L$ list until reaching node $s$.

This protocol is easier to analyze than the protocol for routing forwards along path $P$ because it allows us to "expose the random bits" in a natural order. To start off, imagine that only node $s$ has chosen its random bitstring $r(s) = r_1 \cdots r_L$.

In Phase 0, we walk backward through the level 0 list towards node $s$. At each node, we flip a coin to choose the first random bit of its bitstring. If that coin matches the first bit $r_1$ of $r(s)$ then that node is in $s$'s level 1 list and so that node must be $u_0$. So the number of steps in Phase 0 is a random variable with the following distribution: the number of fair coin flips needed to see the first head. This is a geometric random variable with parameter $1/2$. (Actually it is not quite that distribution because we would stop if we were to arrive at $s$ without ever seeing a head. But the number of steps is certainly upper bounded by this geometric random variable.)
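The per-phase step count is easy to simulate. The sketch below (my own illustration, not from the notes) flips a fair coin per node visited and stops at the first head; the empirical mean is close to 2, the mean of a geometric random variable with parameter $1/2$:

```python
import random

def phase_steps(rng):
    """Number of nodes visited until one's newly flipped bit matches
    the source's bit, i.e., fair coin flips until the first head."""
    steps = 1
    while rng.randrange(2) != 1:
        steps += 1
    return steps

rng = random.Random(0)
samples = [phase_steps(rng) for _ in range(100_000)]
print(sum(samples) / len(samples))  # close to 2
```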

We then flip coins to choose *every* node's first bit in their random bitstring, except for the bits that are already determined (i.e., the first bit of $s$ and the bits that we just chose in Phase 0). Now $s$'s level 1 list is completely determined, so we can proceed with Phase 1. We walk backward through $s$'s level 1 list from $u_0$ towards node $s$. At each node, we flip a coin to choose the *second* random bit of its bitstring. If that coin matches the second bit $r_2$ of $r(s)$ then that node is in $s$'s level 2 list and so that node must be $u_1$. As before, the number of steps in Phase 1 is at most the number of fair coin flips needed to see the first head.

This process continues in a similar fashion until the end of Phase $L-1$.

So how long is the path $P$ in total? Our discussion just showed that the number of steps needed to arrive in $s$'s level $L$ list is upper bounded by the number of fair coin flips needed to see $L$ heads, which is a random variable with the negative binomial distribution. But once we arrive in $s$'s level $L$ list, we are done because Claim 1 shows that this list contains only node $s$ (with probability at least $1 - 1/n$). So it remains to analyze the negative binomial random variable, which we do with our tail bound from Lecture 3.

**Claim 2** Let $B$ have the negative binomial distribution with parameters $k$ and $p$; that is, $B$ is the number of flips of a coin with heads-probability $p$ needed to see $k$ heads. Pick $\delta \in (0,1)$ and set $m = \frac{k}{(1-\delta)p}$. Then $\Pr[B > m] \le e^{-\delta^2 k / (2(1-\delta))}$.

Apply this claim with parameters $k = L$, $p = 1/2$, and $\delta = 3/4$. This shows that the probability of $B$ being larger than $m = 8L = O(\log n)$ is at most $e^{-(9/8)L} \le 1/n^3$. Taking a union bound over all $n^2$ possible pairs of source and destination nodes, the probability that any pair has a path longer than $O(\log n)$ is less than $1/n$. As argued above, the probability that any node's level $L$ list contains multiple nodes is also at most $1/n$. So, with probability at least $1 - 2/n$, any source node can send a message to any destination node while traversing at most $O(\log n)$ nodes. $\Box$
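Since the negative binomial tail bound is stated from memory of Lecture 3, a quick numerical sanity check is worthwhile. The identity $\Pr[B > m] = \Pr[\mathrm{Bin}(m,p) < k]$ (fewer than $k$ heads in the first $m$ flips) lets us compute the tail exactly and compare it with the Chernoff-style bound; the parameter choices below are arbitrary:

```python
import math

def negbin_tail(k, p, m):
    """Pr[B > m], where B = flips of a p-coin needed to see k heads.
    B > m exactly when fewer than k heads occur in the first m flips."""
    return sum(math.comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k))

def claimed_bound(k, delta):
    """The bound exp(-delta^2 k / (2(1 - delta)))."""
    return math.exp(-delta**2 * k / (2 * (1 - delta)))

for k, p, delta in [(5, 0.5, 0.5), (10, 0.5, 0.75)]:
    m = round(k / ((1 - delta) * p))
    assert negbin_tail(k, p, m) <= claimed_bound(k, delta)
```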

**1.1. Handling multiple types of error**

In the previous proof, there were two possible bad events. The first bad event $\mathcal{E}_1$ is that the source node $s$'s list at level $L$ might have multiple nodes. The second bad event $\mathcal{E}_2$ is that the path from the destination back to $s$'s list at level $L$ might be too long. How can we show that neither of these happen? Can we condition on the event $\mathcal{E}_1$ not happening, then analyze $\Pr[\mathcal{E}_2 \mid \neg \mathcal{E}_1]$?

Such an analysis is often difficult. The reason is that our analysis of $\mathcal{E}_2$ was based on performing several independent trials. Those trials would presumably not be independent if we condition on the event $\mathcal{E}_1$ not happening.

Instead, we use a union bound, which avoids conditioning and doesn't require independence. We separately showed that $\Pr[\mathcal{E}_1] \le 1/n$ and $\Pr[\mathcal{E}_2] \le 1/n$. So $\Pr[\mathcal{E}_1 \cup \mathcal{E}_2] \le 2/n$, and the probability of no bad event occurring is at least $1 - 2/n$.

**2. Consistent Hashing**

Now we switch topics and discuss another use of randomized algorithms in computer networking. It is an approach for storing and retrieving data in a distributed system. There are several design goals, many of which are similar to the goals for peer-to-peer systems.

- There should be no centralized authority who decides where the data is stored. Indeed, no single node should know who all the other nodes in the system are, or even how many nodes there are.

- The system must efficiently support dynamic addition and removal of nodes.
- Each node should store roughly the same amount of data.

The motivation for these design goals is the hypothesis that it is too expensive or infeasible to maintain large, centralized, fault-tolerant data centers for storing data. Ultimately I think that hypothesis has turned out to be incorrect. Google and others have shown that large, fault-tolerant data centers are certainly feasible. The main advantage of peer-to-peer systems has been in avoiding a single point of *legal* failure, which is why systems like BitTorrent have thrived for distributing illicit content. That said, Skype at least partly uses peer-to-peer technology, and Akamai's design was originally inspired by these ideas, so this academic work has certainly made a lasting, valuable contribution to modern technology.

We now describe (a simplification of) the **consistent hashing** method, which meets all of the design goals. It uses a clever twist on the traditional hash table data structure. Recall that with a traditional hash table, there is a universe $U$ of "keys" and a collection $B$ of "buckets". A function $h : U \to B$ is called a "hash function". The intention is that $h$ nicely "scrambles" the set $U$. Perhaps $h$ is pseudorandom in some informal sense, or perhaps $h$ is actually chosen at random from some family of functions.

For our purposes, the key point is that traditional hash tables have a *fixed* collection of buckets. In our distributed system, the nodes are the buckets, and our goal is that the nodes should be dynamic. So we want a hashing scheme which can gracefully deal with a dynamically changing set of buckets.

The main idea can be explained in two sentences. The nodes are given random locations on the unit circle, and the data is hashed to the unit circle; each data item is stored on the node whose location is closest. In more detail, let $C = [0,1)$ denote the unit circle. (In practice we can discretize it and let $C = \{\, i/2^b : 0 \le i < 2^b \,\}$ for some sufficiently large $b$.) Let $V$ be our set of $n$ nodes. Every node $v \in V$ chooses its "location" to be some point $p_v \in C$, uniformly at random. We have a function $h : U \to C$ which maps data to the circle, in a pseudorandom way. But what we really want is to map the data to the nodes (the buckets), so we also need some method of mapping points in $C$ to the nodes. To do this, we map each point $x \in C$ to the node $v$ whose location is closest to $x$ (i.e., the distance around the circle between $x$ and $p_v$ is as small as possible).

The system’s functionality is implemented as follows.

**Initial setup.** Setting up the system is quite trivial. The nodes choose their locations randomly from $C$, then arrange themselves into a doubly-linked, circular linked list, sorted by their locations. (Network connections are formed to represent the links in the list.) Then the hash function $h$ is chosen, and made known to all users and nodes.

**Storing/retrieving data.** Suppose a user wishes to store or retrieve some data with a key $k$. She first applies the function $h$, obtaining a point $x = h(k)$ on the circle. Then she searches through the linked list of nodes to find the node $v$ whose location is closest to $x$. The data is stored on or retrieved from node $v$. (To search through the list of nodes, one could use naive exhaustive search, or perhaps smarter strategies. See Section 2.2.)

**Adding a node.** Suppose a new node $u$ is added to the system. He chooses his random location, then inserts himself into the sorted linked list of nodes at the appropriate position. And now something interesting happens. There might be some existing data in the system for which the new node's location is now the closest. That data is currently stored on some other node $v$, so it must now **migrate** to node $u$. Note that $v$ must necessarily be a neighbor of $u$ in the linked list. So $u$ can simply ask his two neighbors to send him all of their data for which $u$'s location is now the closest.

**Removing a node.** To remove a node $u$, we do the opposite of addition. Before $u$ is removed from the system, it first contacts its two neighbors and sends them the data which they are now responsible for storing.

**2.1. Analysis**

By randomly mapping nodes and data to the unit circle, the consistent hashing scheme tries to ensure that no node stores a disproportionate fraction of the data.
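A minimal sketch of the scheme in Python (my own illustration; SHA-256 stands in for the pseudorandom hash, and the class and method names are invented). It keeps node locations in a sorted list and maps each key to the closest location on the circle:

```python
import bisect
import hashlib

def h(key, b=32):
    """Pseudorandom hash of a key to the discretized unit circle
    {0, 1, ..., 2^b - 1} (a stand-in for the function h in the text)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % (2**b)

class ConsistentHashRing:
    """Sorted list of node locations; each key is owned by the node
    whose location is closest on the circle."""
    SIZE = 2**32

    def __init__(self):
        self.locs = []    # sorted node locations
        self.nodes = {}   # location -> node name

    def _dist(self, a, b):
        d = abs(a - b) % self.SIZE
        return min(d, self.SIZE - d)  # circular distance

    def owner(self, point):
        """Node whose location is closest to `point` on the circle."""
        i = bisect.bisect_left(self.locs, point) % len(self.locs)
        # the closest node is either the successor or the predecessor
        succ, pred = self.locs[i], self.locs[i - 1]
        best = min((succ, pred), key=lambda loc: self._dist(loc, point))
        return self.nodes[best]

    def add_node(self, name):
        loc = h(("node", name))  # random-looking location for the node
        bisect.insort(self.locs, loc)
        self.nodes[loc] = name

    def lookup(self, key):
        return self.owner(h(key))
```

Adding a node then reassigns only the keys that now land closest to the new node's location, which is exactly the migration property described above.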

Suppose there are $n$ nodes. For any node $v$, the expected fraction of the circle for which it is responsible is clearly $1/n$. (In other words, the arc corresponding to points that would be stored on $v$ has expected length $1/n$.)

**Claim 3** *With probability at least $1 - 1/n$, every node is responsible for at most an $(8 \ln n)/n$ fraction of the circle. This is just an $O(\log n)$ factor larger than the expectation.*

*Proof:* Let $k$ be the integer such that $2^k \le \frac{n}{2 \ln n} < 2^{k+1}$, and set $m = 2^k$. Define $m$ arcs $A_1, \ldots, A_m$ on the circle as follows: $A_j$ is the arc from $(j-1)/m$ to $j/m$. We will show that every such arc probably contains a node. That implies that the fraction of the circle for which any node is responsible is at most twice the length of an arc, i.e., at most $2/m \le (8 \ln n)/n$.

Pick $n$ points independently at random on the circle. Note that $A_j$ occupies a $1/m \ge (2 \ln n)/n$ fraction of the unit circle. The probability that none of the $n$ points lie in $A_j$ is:

$$\Big(1 - \frac{1}{m}\Big)^n \;\le\; e^{-n/m} \;\le\; e^{-2 \ln n} \;=\; \frac{1}{n^2}.$$

By a union bound, the probability that there exists an $A_j$ containing no node is at most $m/n^2 \le 1/n$. $\Box$
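This bound is easy to test empirically. The following sketch (illustration only; the function name is invented) places $n$ random nodes on the circle and measures the largest fraction of the circle any single node is responsible for; it comes out well below an $O(\log n)/n$-sized bound, an $O(\log n)$ factor above the average $1/n$:

```python
import math
import random

def max_responsibility(n, seed=0):
    """Place n nodes uniformly on the unit circle and return the largest
    fraction of the circle for which any single node is responsible."""
    rng = random.Random(seed)
    locs = sorted(rng.random() for _ in range(n))
    # gaps[i] is the arc between node i and its clockwise neighbour
    gaps = [(locs[(i + 1) % n] - locs[i]) % 1.0 for i in range(n)]
    # each node owns half of the gap on either side of it
    return max((gaps[i - 1] + gaps[i]) / 2 for i in range(n))

frac = max_responsibility(1000, seed=42)  # typically a small multiple of 1/n
```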

This claim doesn't tell the whole story. One would additionally like to say that, when storing multiple items of data in the system, each node is responsible for a fair fraction of that data. So we should argue that the hash function $h$ distributes the data sufficiently uniformly around the circle, and that the distribution of nodes and the distribution of data interact nicely.

We will not prove this. But, let us assume that it is true: each node stores a nearly-equal fraction of the data. Then there is a nice consequence for data migration. When a node $u$ is added to (or removed from) the system, recall that the only data that migrates is the data that is newly (or no longer) stored on node $u$. So the system migrates a nearly-minimal amount of data each time a node is added or removed.

**2.2. Is this system efficient?**

To store/retrieve data with key $k$, we need to find the server closest to $h(k)$. This is done by a linear search through the list of nodes. That may be acceptable if the number of nodes is small. But, if one is happy to do a linear search of all nodes for each store/retrieve operation, why not simply store the data on the least-loaded node, and retrieve the data by exhaustive search over all nodes?

The original consistent hashing paper overcomes this problem by arguing that, roughly, if the nodes don’t change too rapidly then each user can keep track of a subset of nodes that are reasonably well spread out throughout the circle. So the store/retrieve operations don’t need to examine all nodes.

But there is another approach. We have just discussed the peer-to-peer system SkipNet, which forms an efficient routing structure between a system of distributed nodes. Each node can have an arbitrary identifier (e.g., its location on the unit circle), and $O(\log n)$ messages suffice to find the node whose identifier equals some value $x$, or even the node whose identifier is closest to $x$.

Thus, by combining the data distribution method of consistent hashing and the routing method of a peer-to-peer routing system, one obtains a highly efficient method for storing data in a distributed system. Such a storage system is called a **distributed hash table**. Actually our discussion is chronologically backwards: consistent hashing came first, then came the distributed hash tables such as Chord and Pastry. SkipNet is a variation on those ideas.