The demo below will take the input text below (which you can change) and
run the algorithm 10000 times, collecting histogram of the estimated number
of unique words (which depends on the parameters like 'Max buffering size').
This can be compared to the expected exact value (drawn in red in the
histogram). The deviation (sigma) and average number of rounds is collected
too.
Note: the algorithm doesn't return the list of unique words, not even a weighted
estimation! It still is useful to estimate the space complexity of a proper
second pass, e.g.
Predefined inputs:
Distribution of estimates:
number of bins:
References:
by Sourav Chakraborty, N.V. Vinodchandran, Kuldeep S. Meel.
To appear in the Proceedings of 30th Annual European Symposium on Algorithms (ESA 2022).
The CVM code itself is pretty straightforward:
function CVM(words, /* the stream */
max_size, /* max buffer size */
reject_proba /* 0.5 by default */) {
let proba = 1.;
const words_set = new Set();
for (const w of words) {
words_set.delete(w); // <- probably slow op overall
if (Math.random() < proba) words_set.add(w);
if (words_set.size == max_size) { // buffer full?
const words_array = Array.from(words_set);
words_set.clear();
for (const W of words_array) {
if (Math.random() < reject_proba) words_set.add(W);
}
proba *= reject_proba; // new round starts
}
}
return Math.floor(words_set.size / proba); // estimate
}
BACK to Skal's page.