incolumitas.com

In this blog post we will introduce three different Big Data algorithms. More specifically, we will learn about probabilistic data structures that allow us to estimate cardinalities and frequencies of elements that originate from a massive stream of data. This blog post is heavily inspired by the well-written article on probabilistic data structures for web analytics and data mining. I will not cover the mathematics behind those data structures; the aforementioned article does that much better. And if it does not, you should probably consult the original papers.

What is Big Data anyways?

Everybody talks about Big Data nowadays, but what does it mean? For example, if we want to count the number of distinct IP addresses that a very large website encounters each day, we need new approaches. Consider the following straightforward algorithm:

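# stream_of_ip_addresses, end_of_day() and time are placeholders for the
# site's own request stream, day-boundary check and current timestamp.
unique_ip_addresses = set()
for ip in stream_of_ip_addresses:
    unique_ip_addresses.add(ip)
    if end_of_day(time):
        print('We got {} distinct IP addresses'.format(len(unique_ip_addresses)))
        # Start counting afresh for the next day.
        unique_ip_addresses = set()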
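
To get a feel for how this exact approach behaves, here is a minimal runnable sketch of the same idea on synthetic data. The helpers `random_ipv4` and `count_distinct`, the pool size and the number of visits are all made up for illustration and are not taken from any real traffic. The point is only that the set has to hold every distinct address it has ever seen.

```python
import random
import sys


def random_ipv4(rng):
    """Return a random dotted-quad IPv4 address (synthetic test data)."""
    return ".".join(str(rng.randrange(256)) for _ in range(4))


def count_distinct(stream):
    """Exact distinct counting: every element seen so far is kept in a set."""
    seen = set()
    for item in stream:
        seen.add(item)
    return seen


rng = random.Random(42)
# Hypothetical traffic: 100,000 visits drawn from a pool of 20,000 addresses.
pool = [random_ipv4(rng) for _ in range(20_000)]
visits = [rng.choice(pool) for _ in range(100_000)]

seen = count_distinct(visits)
# Shallow size of the set's hash table only; the string objects it
# references need additional memory on top of this.
table_bytes = sys.getsizeof(seen)
print(f"{len(seen)} distinct addresses, set table alone uses {table_bytes} bytes")
```

Increasing the pool size in this sketch makes the memory figure grow roughly in proportion to the number of distinct addresses, which is exactly the cost that becomes a problem at very large scale.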

This way of counting distinct elements works fine for millions of visitors. But what happens if a website is visited 10 billion times a day? Then we would need to maintain a set with space 10^10 * 4 …