Analyzing data streams can be challenging. Imagine trying to analyze a stream of text messages that contain the phone numbers of the senders and recipients. To analyze the structure of the correspondence, you are interested in finding out the sender-recipient pairs.
The number of these distinct pairs grows quadratically with the number of phone numbers and can be calculated as
To find out whether a distinct phone number has ever texted another phone number could be a very interesting question to answer. It could also be interesting to know, whether certain phone numbers exchange messages very frequently. But how can we find these pairs?
To answer this kind of question efficiently, we make use of a data structure that is specifically convenient for this purpose: A bloomfilter. This is a probabilistic data structure designed to quickly analyze which data has previously occurred in a data stream. It is a suitable way of answering the question in our example, whether a specific individual has ever texted another individual.
A bloomfilter consists of an array of
To find out whether a specific element
False positives are possible, if we assume that our hash functions are not guaranteed to be free of collissions. This means that, with a certain probability, in our example we would potentially identify communicating pairs that do not in fact communicate. The likelihood of this event is increasing in the length of the data stream, as the likelihood of hash collissions increases with each stream element. The probability of a false positive can be calculated as:
where
The takeaway from this is that the length of the bloomfilter array should increase with the length of the data stream, if one wants to keep the probability of false positives constant.
Some further classic bloomfilter applications are for quickly deciding whether a specific username is free to be registered, or has already been taken by another user, or to decide whether a chosen password is considered safe. In the latter case, the bloomfilter array is initialized using the extensive list of unsafe passwords.
I hope you enjoyed this article, that’s it for now - thanks for stopping by!