The Best Algorithm No One Knows About

Here’s a program roughly 0% of programmers know how to write: generate a list of random tweets, without duplication. You may think you know how to do this, but you almost assuredly don’t.

Say you work at Twitter and have to select just one tweet at random. This is an easy problem. Generate a random number less than the total number of tweets, and draw the tweet that corresponds somehow to your number. If you have to make a list of tweets, then, provided you don’t mind duplicates, you can repeat this same process over and over. You get a list of tweets drawn at random, possibly with some duplication. Any programmer could write this program.
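For reference, here is a minimal sketch of that draw-with-replacement loop. It assumes a hypothetical rand_index(n) helper that returns a uniform integer in [0, n); nothing else is needed.

//sketch of "draw": k random indices, duplicates allowed
//rand_index(n) is an assumed helper returning a uniform integer in [0, n)
#include <stdint.h>

int64_t rand_index(int64_t n); //assumed, not defined here

void draw(int64_t *a, int64_t k, int64_t n)
{
  for(int64_t i = 0; i < k; i++)
  {
    a[i] = rand_index(n); //each pick is independent, so repeats can occur
  }
}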

The catch comes when you add the requirement that the selected tweets be unique. The whole problem changes. Once uniqueness is required, you get trapped by another, implicit requirement: every tweet must still be equally likely to appear.

(Stated more formally: given non-negative integers k and n with k <= n, generate a list of k distinct random numbers, each less than n. We reasonably ask that this be done in O(k) time and O(1) additional space. Finally, the distribution must be uniform.)

We started with the language of tweets, but it’s also possible to use the language of cards: deal k cards from a deck of size n, without replacement. So a poker hand like A2345 of clubs is valid for k=5 and n=52, but a hand containing AAAAA of clubs is not.

The misleading part in using the language of cards is that you don’t often consider a deck of size 2^64. What’s nice about the card formulation though is that it conveys how simple the problem statement really is. This is a fundamental problem that was open for a long time. Nobody knew how to deal cards.

The first obstacle that makes “dealing” (without replacement) hard is that the “draw” method doesn’t work. If you draw (with replacement) over and over again, ignoring cards you’ve already picked, you can simulate a deal, but you run into problems. The main one is that eventually you’re ignoring too many cards: as the hand size approaches the deck size, the draws you throw away dominate, and the algorithm no longer finishes in O(k) time (this is the coupon collector’s problem).
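To make that concrete, here is a hedged sketch of the draw-and-reject approach, again assuming the hypothetical rand_index(n) helper. The seen[] bitmap alone already blows the O(1) space budget, and once most of the deck has been dealt the rejection loop dominates the running time.

//sketch of dealing by repeated drawing and rejecting repeats; shown only to illustrate the problem
//rand_index(n) is an assumed helper returning a uniform integer in [0, n)
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

int64_t rand_index(int64_t n); //assumed, not defined here

void deal_by_rejection(int64_t *a, int64_t k, int64_t n)
{
  bool *seen = calloc(n, sizeof(bool)); //O(n) space: hopeless for a deck of size 2^64

  for(int64_t i = 0; i < k; i++)
  {
    int64_t c;

    do
    {
      c = rand_index(n); //draw with replacement...
    } while(seen[c]);    //...and reject repeats; as k approaches n this loop grinds (coupon collector)

    seen[c] = true;
    a[i] = c;
  }

  free(seen);
}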

There are several things to try next, and none of them work. You can’t generate a random number for every card in the deck (too many cards). You can’t rearrange the deck (too much space). Once you get the basic complexity requirements down, what makes the problem slippery is keeping the distribution uniform. You can try to play with “resizing the window” you’re looking at, but most of these methods add bias (usually, you undersample early items and oversample later ones).

Card dealing was unsolved for a long time. Finally, Jeffrey Vitter published the solution in 1984, and a cleaned-up version appeared in 1987. To put this in context, public-key cryptography algorithms appeared about a decade earlier, in the 1970s. Basically all other fundamental problems that sound like this were solved in the 1950s and 1960s, so we might have expected a solution from that era. But in the 1980s, dealing cards was still a research-level problem in Knuth.

The algorithm is still relatively unknown. When I came across a need for it in my work, it took a lot of detective work to even find it. There was no source code online. The algorithm is rarely referenced, and when it is, the references are obscure. Even after you reach Vitter’s papers, it takes a concentrated effort to figure out what you need. This was a big surprise to me.

I think the explanation for this is that the method was originally presented as a capstone effort in the area of “reservoirs” and “tape sampling”. These days, while it’s possible you’ve worked at a place that uses tape drives for backup, chances are you’ve never programmed for a tape drive. I certainly haven’t. Based on my reading of the research, by the mid-80s subjects like tape drive sampling were considered antique. The most important algorithm appeared after people had stopped looking at that field. The only reason I know the field exists is that it’s a giant section in Knuth that I’ve flipped past many times. Earlier algorithms for “card dealing” or “urn sampling” are hidden in there.

Today tape drive algorithms seem pointless, but there’s a good reason for approaching the problem from this perspective. Random access is costly on a tape drive. Regular access is costly on a tape drive. Tape drives are much less nimble than disk drives, and this enforces a strongly sequential read mentality. A sequential, long-duration read pattern is a useful way to frame the problem, as it results in a solution which is both sorted and “online”. Sorted means the dealt cards are in order, and online means you can do them one at a time, perhaps while you wait for a spindle to turn. This completist approach may be how you need to think about the problem in order to solve it in the first place.

The reason it’s nice for an algorithm to produce results in sorted order is that it’s easy to randomize the order after the fact. The reverse is not true. Sorting algorithms are linear only in restricted cases, so tacking on an O(k log k) sort would forfeit our original O(k) runtime. List shuffling algorithms, by contrast, are O(k) and easy to implement.
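A minimal sketch of that shuffle, the standard Fisher–Yates pass, again assuming the hypothetical rand_index(n) helper:

//Fisher-Yates shuffle of the k dealt values: O(k) time, O(1) extra space
//rand_index(n) is an assumed helper returning a uniform integer in [0, n)
#include <stdint.h>

int64_t rand_index(int64_t n); //assumed, not defined here

void shuffle(int64_t *a, int64_t k)
{
  for(int64_t i = k - 1; i > 0; i--)
  {
    int64_t r = rand_index(i + 1); //uniform in [0, i]
    int64_t swap = a[i];
    a[i] = a[r];
    a[r] = swap;
  }
}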

That it’s so easy to shuffle a small list is probably why the algorithm is overlooked. When most people need to deal cards they shuffle the entire deck. This works as long as the size of the deck is within reason. What happens is that once the deck is over a certain size, people resort to simple drawing algorithms, and nobody looks too closely at bias or inefficiency in the randomized application.

This is unfortunate. If you’ve ever sat through a statistics or combinatorics class, you know drawing balls from urns without replacement is a fundamental sampling application. In array languages like APL, sampling with and without replacement are implemented as first-class primitives “deal” and “draw”. It’s nice in theory to have sampling as a primitive since then you don’t have to worry about screwing it up. You do have to worry about the language implementors getting it wrong, though. I’ve talked with several people who’ve implemented deal-like primitives in the past, and Vitter’s name only comes up when I bring it up, so I assume many existing deal implementations either use slower methods, or, I guess, some kind of ad hoc method with an incorrect distribution.

We use Vitter’s algorithm in Kerf, and deal() is a first-class primitive. Kerf’s deal() is great for when banks need to sample a random collection of stock tickers, or a random collection of trades from throughout the day. Uses appear all the time.

I’ve cleaned up an older version of the C source code we use and made it available below. This source first appeared in a previous project. A description of the algorithm and pseudocode can be found in Vitter’s paper.


//Copyright Kevin Lawler, released under ISC License

#include <math.h>   //exp, log, floor, round
#include <stdint.h> //int64_t

double genrand64_real2(void); //[0,1) generator from the reference mt19937-64.c

double random_double()  //your random double function
{
  //[0,1) uniformly random double, mt19937-64.c source is fine
  //keep in mind most double-based implementations drop randomness to 52-bits
  return genrand64_real2();
}

//POTENTIAL_OPTIMIZATION_POINT: Christian Neukirchen points out we can replace exp(log(x)*y) by pow(x,y)
//POTENTIAL_OPTIMIZATION_POINT: Vitter paper points out an exponentially distributed random var can provide speed ups
//Vitter, J.S. - An Efficient Algorithm for Sequential Random Sampling - ACM Trans. Math. Software 13 (1987), 58-67.
void vitter_a(int64_t *a, int64_t n, int64_t N, int64_t j); //Method A, defined below, used as a fallback

//'a' is space allocated for the hand
//'n' is the size of the hand
//'N' is the upper bound on the random card values
//fills 'a' with n distinct values in [0,N), in increasing order
void vitter(int64_t *a, int64_t n, int64_t N) //Method D
{
 

  int64_t i = 0;
  int64_t j = -1;
  int64_t t; 
  int64_t qu1 = -n + 1 + N;  //qu1 == N - n + 1
  int64_t S; 
  int64_t negalphainv = -13; //-1/alpha from the paper; Method D is used while 13*n < N
  int64_t threshold = -negalphainv*n;
  int64_t limit;

  double nreal = n;
  double Nreal = N;
  double ninv = 1.0/n;
  double nmin1inv = 1.0/(n-1);
  double Vprime = exp(log(random_double())*ninv); //Vprime is distributed as U^(1/n) for uniform U in [0,1)

  double qu1real = -nreal + 1.0 + Nreal;
  double negSreal;
  double U; 
  double X; 
  double y1; 
  double y2; 
  double top; 
  double bottom;
  

  //Method D proper: loop while more than one selection remains and N is large relative to n
  while(n > 1 && threshold < N)
  {
    nmin1inv=1.0/(-1.0+nreal);

    //acceptance-rejection loop for the skip value S
    while(1)
    {
      //generate a candidate skip S from the approximating distribution (driven by Vprime)
      while(1)
      {
        X = Nreal * (-Vprime + 1.0);
        S = floor(X);

        if(S < qu1) //candidate must leave at least n-1 records after this selection, i.e. S <= N-n
        {
          break;
        }

        Vprime = exp(log(random_double()) * ninv); //candidate too large: redraw Vprime
      }

      U = random_double();
      
      negSreal = -S;
      
      y1 = exp(log(U*Nreal/qu1real) * nmin1inv);
      
      Vprime = y1 * (-X / Nreal+1.0) * (qu1real / (negSreal + qu1real));
      
      if(Vprime <= 1.0) //quick acceptance (squeeze) test: accept S and reuse this Vprime for the next selection
      {
        break;
      }
      
      //quick test failed: evaluate the exact acceptance condition by building the product y2 term by term
      y2 = 1.0; 
      
      top = -1.0 + Nreal;       
      
      if(-1+n > S)
      {
        bottom = -nreal + Nreal;
        limit = -S + N;
      }
      else
      {
        bottom = -1.0 + negSreal + Nreal; 
        limit = qu1;
      }
      
      for(t = N-1; t >= limit; t--)
      {
        y2=(y2 * top) / bottom;
        top--; 
        bottom--;
      }
      
      if(Nreal / (-X + Nreal) >= y1 * exp(log(y2) * nmin1inv)) //exact acceptance test
      {
        Vprime = exp(log(random_double()) * nmin1inv); //accepted: draw a fresh Vprime for the next selection
        break;
      }

      Vprime = exp(log(random_double()) * ninv); //rejected: redraw Vprime and try again
    }

    //skip S records, then select the next one as index j
    j += S + 1;

    a[i++] = j;

    //update the running state: N-S-1 records and n-1 selections remain
    N = -S + (-1 + N); 

    Nreal = negSreal + (-1.0 + Nreal);

    n--;
    nreal--;
    ninv = nmin1inv;

    qu1 = -S + qu1;
    qu1real = negSreal + qu1real;

    threshold += negalphainv;
  }

  if(n>1)
  {
    vitter_a(a+i,n,N,j); //N is no longer large relative to n: finish with Method A (if i>0 then n has been decremented)
  }
  else
  {
    S = floor(N * Vprime); //only one selection remains: place it using the current Vprime

    j += S + 1;

    a[i++] = j;
  }

}

void vitter_a(int64_t *a, int64_t n, int64_t N, int64_t j) //Method A: simple sequential method, used when N is not much larger than n
{
  int64_t S, i = 0;
  double top = N - n, Nreal = N, V, quot;

  while(n >= 2)
  {
    //inverse-transform search for the skip S: quot is the probability that the first S+1 remaining records are all skipped
    V = random_double(); 
    S=0;
    quot=top/Nreal;

    while (quot>V)
    {
      S++; top--; Nreal--;
      quot = (quot * top)/Nreal;
    }

    j+=S+1;

    a[i++]=j;

    Nreal--;

    n--;
  }

  S = floor(round(Nreal) * random_double()); //one selection left: uniform over the remaining records

  j+=S+1;

  a[i++]=j;

}
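For completeness, here is a hedged usage sketch. It assumes the code above is compiled together with the reference mt19937-64.c, whose init_genrand64() seeds the generator behind random_double(); the seed value is arbitrary.

//example usage: deal 5 distinct cards from a deck of 52
#include <stdint.h>
#include <stdio.h>

void init_genrand64(unsigned long long seed); //from the reference mt19937-64.c

int main(void)
{
  int64_t hand[5];

  init_genrand64(12345ULL); //seed the generator however you like

  vitter(hand, 5, 52);      //hand now holds 5 distinct values in [0, 52), in increasing order

  for(int i = 0; i < 5; i++)
  {
    printf("%lld\n", (long long)hand[i]);
  }

  return 0;
}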
