Analyzing data streams can be challenging. Imagine trying to analyze a stream of text messages that contain the phone numbers of the senders and recipients. To analyze the structure of the correspondence, you are interested in finding out the sender-recipient pairs.
The number of these distinct pairs grows quadratically with the number of phone numbers and can be calculated as \(n^2 - n\). The quadratic term results from all possible pairs we can create using the available numbers, but we are not interested in identical pairs, i.e. people who text themselves, which is why we subtract \(n\). With large \(n\), the quadratic term dominates the linear term, so we disregard it.
To find out whether a distinct phone number has ever texted another phone number could be a very interesting question to answer. It could also be interesting to know, whether certain phone numbers exchange messages very frequently. But how can we find these pairs?
To answer this kind of question efficiently, we make use of a data structure that is specifically convenient for this purpose: A bloomfilter. This is a probabilistic data structure designed to quickly analyze which data has previously occurred in a data stream. It is a suitable way of answering the question in our example, whether a specific individual has ever texted another individual.
A bloomfilter consists of an array of \(m\) bits, initially consisting of only zeros. Using \(k\) differing hash functions, each element of the stream is randomly mapped to a position of this array. The hash functions mapping must be uniformly distributed across the array. Whenever a position of the array is hit by a hash function, that corresponding position is set from 0 to 1. The hash functions should yield independent and uniformly distributed results for the optimal use of the arrays length.
To find out whether a specific element \(x\) occurs in the data stream, all we have to do is evaluate the hash functions of \(x\) and check whether every one of these positions of the bit array contain a 1. If there is at least one position that contains a 0 instead of a 1, this means that the element cannot have occurred in the stream.
False positives are possible, if we assume that our hash functions are not guaranteed to be free of collissions. This means that, with a certain probability, in our example we would potentially identify communicating pairs that do not in fact communicate. The likelihood of this event is increasing in the length of the data stream, as the likelihood of hash collissions increases with each stream element. The probability of a false positive can be calculated as:
\[P( \text{false positive} ) = (1 -e^{(-n*w/m)})^w\]where \(n\) is the number of elements in the data stream, \(w\) is the number of hash functions, and \(m\) the length of the bloomfilter array.
The takeaway from this is that the length of the bloomfilter array should increase with the length of the data stream, if one wants to keep the probability of false positives constant.
Some further classic bloomfilter applications are for quickly deciding whether a specific username is free to be registered, or has already been taken by another user, or to decide whether a chosen password is considered safe. In the latter case, the bloomfilter array is initialized using the extensive list of unsafe passwords.
I hope you enjoyed this article, that’s it for now - thanks for stopping by!