In the heavy hitters or frequent items problem, one must process a
stream of items and report those items that occur frequently. For
example a telecommunications company may wish to find popular
destination IP addresses in a packet stream across one of their links,
or a search engine may wish to report popular query words. A more
general problem is when there are two streams and we must report those
items whose frequencies significantly deviate between them; the former
problem is a special case since we can artificially pretend the first
of the two streams the empty stream. Such a problem naturally arises
in trend detection and anomaly detection.
We give a new algorithm, ExpanderSketch, for the more general problem,
which outperforms other well-known algorithms for the same problem such
as the CountMin sketch and CountSketch. Our main innovation is a novel
reduction to a new graph-clustering problem we formulate, in which finding
most of every cluster guarantees finding all the frequent items. We then solve
this problem by devising a novel spectral clustering algorithm potentially of
independent interest, based on divide-and-conquer and local search.