By Afshine Amidi and Shervine Amidi

## Bloom filters​

### Definition​

A bloom filter $\mathcal{B}$ is a space-efficient data structure that can check whether a given element $x$ is:

• either potentially in the set • or definitely not in the set It is a data structure composed of:

• $k$ independent hash functions $f_0, f_1, ..., f_{k-1}$ with uniform distributions over integer values between $0$ and $m-1$

• an array $B$ of size $m$ of elements $b_{0}, b_{1}, ..., b_{m-1}$ initially all equal to $0$ Element $b_j$ of the array is set to 1 when an element $x$ is inserted and verifies $f_i(x)=j$ for a given $i\in[\![0, k-1]\!]$, with $j\in[\![0, m-1]\!]$.

The false positive rate $\epsilon$ quantifies how unreliable a positive prediction given by the bloom filter is:

$\epsilon=\frac{\textrm{\# elements \textbf{falsely} predicted to be in the set}}{\textrm{\# elements predicted to be in the set}}$

It is approximated using the number of bits $m$, the number of hash functions $k$ and the number of elements $n$ that we want to insert as follows:

$\boxed{\epsilon\approx\left(1-e^{-\frac{kn}{m}}\right)^k}$

We want $\epsilon$ to be as small as possible. Given $m$ and $n$, the optimal value of the number of hash functions $k$ is such that:

$\boxed{k=\frac{m}{n}\log_e(2)}$

Application Suppose we would like to check whether an element belongs to a set of data stored on a disk that is expensive to call.

• A naive approach would consist of calling the disk every time we want to check whether an element is there. This process would be very slow.

• Now, let's assume we have a bloom filter that checks whether the element is in the set (fast operation) and only lets us call the disk if it is predicted to be there. We would be cutting the number of unnecessary call requests, thus making the process faster. ### Operations​

An initialized bloom filter has all the bits of its array $B$ set to 0.

Insertion In order to insert an element $x$ to the bloom filter, we follow the steps below:

• Compute step: Compute hash values $f_0(x), ..., f_{k-1}(x)$.

• Update step: Update the $m$-bit array $B$ by turning on the corresponding bits:

$\textrm{for }i\in[\![0, k-1]\!], \quad \boxed{b_{f_i(x)}\longleftarrow 1}$ Deletion Deleting an element from the bloom filter is not possible, since removing associated bits may inadvertently remove other elements.

Search In order to check whether an element $x$ is potentially in the bloom filter, we follow the steps below:

• Compute step: Compute hash values $f_0(x), ..., f_{k-1}(x)$.

• Check step: We have two possible outcomes:

• All corresponding bits are on: This means that the element $x$ is possibly in the set. It can also not be, in which case the prediction is a false positive. • One of the bits is off: This means that the element is definitely not in the set, and we are sure of it. ## Count-min sketches​

### Definition​

A count-min sketch $\mathcal{C}$ is a space-efficient data structure that provides an upper bound $\widehat{\# x}$ for the number of times $\# x$ a given element $x$ appeared. It is a data structure composed of:

• $k$ independent hash functions $f_0, ..., f_{k-1}$ with uniform distributions over values between $0$ and $m-1$

• a 2D array $C$ with $k\times m$ elements $c_{i, j}$ initially all equal to 0, with:

• $k$ rows, corresponding to the output of each of the $k$ hash functions

• $m$ columns, corresponding to the $m$ different values that each hash function can take Each element $c_{i,j}$ is an integer that corresponds to the number of times an element $x$ was inserted and verified $f_i(x)=j$ for a given $i\in[\![0, k-1]\!]$, with $j\in[\![0, m-1]\!]$.

$k$ and $m$ are hyperparameters that are chosen when initializing the data structure. The bigger they are, the more accurate the approximation will be, but also the more space the data structure will take and the more time each operation will take.

Application Suppose we would like to check how many times an element appeared in a large stream of data.

• A naive approach would consist of keeping track of the stream of data and counting the number of times each element appeared in a large entry table. The space taken by this approach is as big as the number of distinct elements seen. This could be very big.

• Now let's assume we add a count-min sketch that approximately counts each element. We would keep a fixed-size data structure that could handle large streams of data and provide reasonable approximations. ### Operations​

An initialized count-min sketch has its 2D array $C$ filled with zeros.

Insertion In order to insert an element $x$ to the count-min sketch, we follow the steps below:

• Compute step: Compute hash values $f_0(x), ..., f_{k-1}(x)$.

• Update step: Update the 2D array $C$ by incrementing the corresponding counts:

$\textrm{for }i\in[\![0, k-1]\!], \quad \boxed{c_{i, f_i(x)}\longleftarrow c_{i, f_i(x)}+1}\quad\quad$ Deletion Deleting an element from the count-min sketch is not possible, since removing associated counts may inadvertently remove other elements.

Search In order to compute an estimate of the number of times a given element $x$ was previously inserted into the count-min sketch, we follow the steps below:

• Compute step: Compute hash values $f_0(x), ..., f_{k-1}(x)$.

• Check step: Take the minimum across all the corresponding counts:

$\boxed{\widehat{\# x}=\min\left(c_{0, f_0(x)}, ..., c_{k-1, f_{k-1}(x)}\right)}$

This estimate is an upper bound because it might have been inflated by hash collisions. Want more content like this?

Subscribe here to be notified of new Super Study Guide releases!