Advanced hash tables
By Afshine Amidi and Shervine Amidi
Bloom filters
Definition
A bloom filter is a space-efficient data structure that can check whether a given element is:
either potentially in the set
or definitely not in the set
It is a data structure composed of:
independent hash functions with uniform distributions over integer values between and
an array of size of elements initially all equal to
Element of the array is set to 1 when an element is inserted and verifies for a given , with .
The false positive rate quantifies how unreliable a positive prediction given by the bloom filter is:
It is approximated using the number of bits , the number of hash functions and the number of elements that we want to insert as follows:
We want to be as small as possible. Given and , the optimal value of the number of hash functions is such that:
Application Suppose we would like to check whether an element belongs to a set of data stored on a disk that is expensive to call.
A naive approach would consist of calling the disk every time we want to check whether an element is there. This process would be very slow.
Now, let's assume we have a bloom filter that checks whether the element is in the set (fast operation) and only lets us call the disk if it is predicted to be there. We would be cutting the number of unnecessary call requests, thus making the process faster.
Operations
An initialized bloom filter has all the bits of its array set to 0.
Insertion In order to insert an element to the bloom filter, we follow the steps below:
Compute step: Compute hash values .
Update step: Update the -bit array by turning on the corresponding bits:
Deletion Deleting an element from the bloom filter is not possible, since removing associated bits may inadvertently remove other elements.
Search In order to check whether an element is potentially in the bloom filter, we follow the steps below:
Compute step: Compute hash values .
Check step: We have two possible outcomes:
All corresponding bits are on: This means that the element is possibly in the set. It can also not be, in which case the prediction is a false positive.
One of the bits is off: This means that the element is definitely not in the set, and we are sure of it.
Count-min sketches
Definition
A count-min sketch is a space-efficient data structure that provides an upper bound for the number of times a given element appeared.
It is a data structure composed of:
independent hash functions with uniform distributions over values between and
a 2D array with elements initially all equal to 0, with:
rows, corresponding to the output of each of the hash functions
columns, corresponding to the different values that each hash function can take
Each element is an integer that corresponds to the number of times an element was inserted and verified for a given , with .
and are hyperparameters that are chosen when initializing the data structure. The bigger they are, the more accurate the approximation will be, but also the more space the data structure will take and the more time each operation will take.
Application Suppose we would like to check how many times an element appeared in a large stream of data.
A naive approach would consist of keeping track of the stream of data and counting the number of times each element appeared in a large entry table. The space taken by this approach is as big as the number of distinct elements seen. This could be very big.
Now let's assume we add a count-min sketch that approximately counts each element. We would keep a fixed-size data structure that could handle large streams of data and provide reasonable approximations.
Operations
An initialized count-min sketch has its 2D array filled with zeros.
Insertion In order to insert an element to the count-min sketch, we follow the steps below:
Compute step: Compute hash values .
Update step: Update the 2D array by incrementing the corresponding counts:
Deletion Deleting an element from the count-min sketch is not possible, since removing associated counts may inadvertently remove other elements.
Search In order to compute an estimate of the number of times a given element was previously inserted into the count-min sketch, we follow the steps below:
Compute step: Compute hash values .
Check step: Take the minimum across all the corresponding counts:
This estimate is an upper bound because it might have been inflated by hash collisions.
Subscribe here to be notified of new Super Study Guide releases!