A CVM algorithm demo

The CVM online algorithm estimates the number of unique words in a very very long input stream (think 'one day worth of Google queries' for instance. Or IP addresses.) in an space and time efficient manner.
This is known as the Count-distinct problem.
The CVM algorithm maintains a fixed-size buffer of unique words (see the 'Max buffering size' slider below) along with a decimation probability 'p'.
The algorithm operates with 'rounds': unique words are collected in the buffer. When full, The buffer is partially flushed with a probability 'p' and another round starts with a lowered probability 'p'.
Initially, the probability is halved, but we can adjust the flushing rate (and thus, the average number of rounds) by adjusting the 'rejection proba' parameter below.
This allows different tradeoff in precision / rounds. Each round's flush has a higher computational cost than just looking up the set of unique words so far.

The demo below will take the input text below (which you can change) and run the algorithm 10000 times, collecting histogram of the estimated number of unique words (which depends on the parameters like 'Max buffering size').
This can be compared to the expected exact value (drawn in red in the histogram). The deviation (sigma) and average number of rounds is collected too.
Note: the algorithm doesn't return the list of unique words, not even a weighted estimation! It still is useful to estimate the space complexity of a proper second pass, e.g.

Predefined inputs:

Max buffering size for unique words:
Rejection proba: .5

Distribution of estimates:

number of bins:

Input text:


You can copy-paste the entire 'Hamlet' play above from this page.

References:

The CVM code itself is pretty straightforward:


function CVM(words,       /* the stream */
             max_size,    /* max buffer size */
             reject_proba /* 0.5 by default */) {
  let proba = 1.;
  const words_set = new Set();
  for (const w of words) {
    words_set.delete(w);  // <- probably slow op overall
    if (Math.random() < proba) words_set.add(w);
    if (words_set.size == max_size) {   // buffer full?
      const words_array = Array.from(words_set);
      words_set.clear();
      for (const W of words_array) {
        if (Math.random() < reject_proba) words_set.add(W);
      }
      proba *= reject_proba;   // new round starts
    }
  }
  return Math.floor(words_set.size / proba);  // estimate
}
BACK to Skal's page.