A CVM algorithm demo
The CVM online algorithm estimates the number of unique words in a
very very long input stream (think 'one day worth of Google
queries' for instance. Or IP addresses.) in an space and time efficient
manner.
This is known as the
Count-distinct
problem.
The CVM algorithm maintains a fixed-size buffer of unique words (see the
'Max buffering size' slider below) along with a decimation probability
'p'.
The algorithm operates with 'rounds': unique words are collected in the
buffer. When full, the buffer is partially flushed with a probability 'p'
and another round starts with a lowered probability 'p'.
Initially, the probability is halved, but we can adjust the flushing rate
(and thus, the average number of rounds) by adjusting the 'rejection
proba' parameter below.
This allows different tradeoff in precision / rounds. Each round's flush
has a higher computational cost than just looking up the set of unique
words so far.
The demo below will take the input text below (which you can change) and
run the algorithm 10000 times, collecting histogram of the estimated
number of unique words (which depends on the parameters like 'Max
buffering size').
This can be compared to the expected exact value (drawn in red in the
histogram). The deviation (sigma) and average number of rounds is
collected too.
Note: the algorithm doesn't return the list of unique words, not even a
weighted estimation! It still is useful to estimate the space complexity
of a proper second pass, e.g.
Distribution of estimates:
Input text:
You can copy-paste the entire 'Hamlet' play above from this page.
References:
by Sourav Chakraborty, N.V. Vinodchandran, Kuldeep S. Meel.
To appear in the Proceedings of 30th Annual European Symposium on Algorithms (ESA 2022).
The CVM code itself is pretty straightforward:
function CVM(words, /* the stream */
max_size, /* max buffer size */
reject_proba /* 0.5 by default */) {
let proba = 1.;
const words_set = new Set();
for (const w of words) {
words_set.delete(w); // <- probably slow op overall
if (Math.random() < proba) words_set.add(w);
if (words_set.size == max_size) { // buffer full?
const words_array = Array.from(words_set);
words_set.clear();
for (const W of words_array) {
if (Math.random() < reject_proba) words_set.add(W);
}
proba *= reject_proba; // new round starts
}
}
return Math.floor(words_set.size / proba); // estimate
}