Counting distinct values is a trivial matter with small datasets, but it gets dramatically harder for streams with millions of distinct points. Absolute counting accuracy can be attained through a set, but then you have the undesirable tradeoff of linear memory growth.
What you really want when dealing with enormous datasets is something with predictable storage and performance characteristics. That is precisely what HyperLogLog (HLL) is — an algorithm optimized for counting the distinct values within a stream of data.
Nice introduction to HTTL from Codeship.