
Advanced hash tables

By Afshine Amidi and Shervine Amidi

Bloom filters

Definition

A Bloom filter $\mathcal{B}$ is a space-efficient data structure that can check whether a given element $x$ is:

  • either potentially in the set

  • or definitely not in the set

It is a data structure composed of:

  • $k$ independent hash functions $f_0, f_1, ..., f_{k-1}$ with uniform distributions over integer values between $0$ and $m-1$

  • an array $B$ of size $m$ of elements $b_0, b_1, ..., b_{m-1}$ initially all equal to $0$

Element $b_j$ of the array is set to 1 when an element $x$ is inserted and verifies $f_i(x)=j$ for a given $i\in[\![0, k-1]\!]$, with $j\in[\![0, m-1]\!]$.

The false positive rate $\epsilon$ quantifies how unreliable a positive prediction given by the Bloom filter is:

\epsilon=\frac{\textrm{\# elements \textbf{falsely} predicted to be in the set}}{\textrm{\# elements predicted to be in the set}}

It is approximated using the number of bits $m$, the number of hash functions $k$ and the number of elements $n$ that we want to insert as follows:

\boxed{\epsilon\approx\left(1-e^{-\frac{kn}{m}}\right)^k}
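
For instance, with illustrative values of $n = 1{,}000$ elements to insert, $m = 10{,}000$ bits and $k = 7$ hash functions, this gives $\epsilon \approx \left(1 - e^{-0.7}\right)^7 \approx 0.8\%$.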

We want $\epsilon$ to be as small as possible. Given $m$ and $n$, the optimal value of the number of hash functions $k$ is such that:

\boxed{k=\frac{m}{n}\log_e(2)}
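
To make these formulas concrete, the short Python sketch below plugs in illustrative values of $n$ and $m$ (these numbers are assumptions, not taken from the text) and computes the optimal $k$ and the resulting approximate false positive rate:

```python
import math

# Illustrative sizing example (assumed values):
n = 1_000   # number of elements we plan to insert
m = 8_000   # number of bits in the array B

# Optimal number of hash functions: k = (m / n) * ln(2)
k = round((m / n) * math.log(2))

# Approximate false positive rate: epsilon ≈ (1 - e^(-k*n/m))^k
epsilon = (1 - math.exp(-k * n / m)) ** k

print(f"k = {k}, epsilon ≈ {epsilon:.2%}")  # k = 6, epsilon ≈ 2.16%
```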

Application Suppose we would like to check whether an element belongs to a set of data stored on a disk that is expensive to access.

  • A naive approach would consist of querying the disk every time we want to check whether an element is there. This process would be very slow.

  • Now, let's assume we have a Bloom filter that checks whether the element is in the set (a fast operation) and only lets us query the disk if the element is predicted to be there. This cuts down on unnecessary disk requests, making the overall process faster.
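
A minimal sketch of this pattern is shown below; `bloom_might_contain` and `read_from_disk` are hypothetical helpers standing in for the fast Bloom filter check and the expensive disk query:

```python
def lookup(key, bloom_might_contain, read_from_disk):
    """Gate the expensive disk read behind a cheap Bloom filter check
    (sketch only; both helpers are hypothetical)."""
    if not bloom_might_contain(key):
        # Definitely not in the set: skip the disk entirely.
        return None
    # Possibly in the set: the read may still come back empty
    # (false positive), but most unnecessary disk accesses are avoided.
    return read_from_disk(key)
```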

Operations

An initialized Bloom filter has all the bits of its array $B$ set to 0.

Insertion In order to insert an element $x$ into the Bloom filter, we follow the steps below (see the code sketch after the search operation):

  • Compute step: Compute hash values $f_0(x), ..., f_{k-1}(x)$.

  • Update step: Update the $m$-bit array $B$ by turning on the corresponding bits:

    \textrm{for }i\in[\![0, k-1]\!], \quad \boxed{b_{f_i(x)}\longleftarrow 1}

Deletion Deleting an element from the Bloom filter is not possible, since turning off its associated bits may inadvertently remove other elements that share those bits.

Search In order to check whether an element $x$ is potentially in the Bloom filter, we follow the steps below:

  • Compute step: Compute hash values $f_0(x), ..., f_{k-1}(x)$.

  • Check step: We have two possible outcomes:

    • All corresponding bits are on: This means that the element $x$ is possibly in the set. It may also not be in the set, in which case the prediction is a false positive.

    • At least one of the bits is off: This means that the element is definitely not in the set, and we are sure of it.
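
To make these operations concrete, here is a minimal Python sketch of a Bloom filter covering both insertion and search. The array size $m$, the number of hash functions $k$, and the SHA-256-based construction of the hash functions are illustrative assumptions, not prescribed above:

```python
import hashlib

class BloomFilter:
    """Minimal illustrative Bloom filter: an m-bit array and k hash functions."""

    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m  # array B, initialized to all zeros

    def _f(self, i, x):
        # i-th hash function, uniform over [0, m-1]; derived here from
        # SHA-256 seeded with the index i (one possible choice).
        digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
        return int.from_bytes(digest, "big") % self.m

    def insert(self, x):
        # Update step: turn on bit b_{f_i(x)} for every hash function f_i.
        for i in range(self.k):
            self.bits[self._f(i, x)] = 1

    def might_contain(self, x):
        # If any corresponding bit is off, x is definitely not in the set;
        # if all bits are on, x is only *possibly* in the set.
        return all(self.bits[self._f(i, x)] == 1 for i in range(self.k))

# Usage
bf = BloomFilter()
bf.insert("apple")
print(bf.might_contain("apple"))   # True
print(bf.might_contain("banana"))  # False (or, rarely, a false positive)
```

In practice, the $k$ hash values are often derived from one or two base hashes (e.g., via double hashing) rather than $k$ independent digests, but the behavior is the same as described above.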

Count-min sketches

Definition

A count-min sketch $\mathcal{C}$ is a space-efficient data structure that provides an upper bound $\widehat{\# x}$ for the number of times $\# x$ that a given element $x$ appeared.

It is a data structure composed of:

  • $k$ independent hash functions $f_0, ..., f_{k-1}$ with uniform distributions over values between $0$ and $m-1$

  • a 2D array $C$ with $k\times m$ elements $c_{i, j}$ initially all equal to 0, with:

    • $k$ rows, corresponding to the output of each of the $k$ hash functions

    • $m$ columns, corresponding to the $m$ different values that each hash function can take

Each element $c_{i,j}$ is an integer that corresponds to the number of times an element $x$ was inserted and verified $f_i(x)=j$ for a given $i\in[\![0, k-1]\!]$, with $j\in[\![0, m-1]\!]$.

$k$ and $m$ are hyperparameters that are chosen when initializing the data structure. The bigger they are, the more accurate the approximation will be, but also the more space the data structure will take and the more time each operation will take.

Application Suppose we would like to check how many times an element appeared in a large stream of data.

  • A naive approach would consist of keeping track of the stream of data and counting the number of times each element appeared in a large table. The space taken by this approach grows with the number of distinct elements seen, which could be very large.

  • Now, let's assume we use a count-min sketch that approximately counts each element. We would keep a fixed-size data structure that can handle large streams of data while providing reasonable approximations.

Operations

An initialized count-min sketch has its 2D array $C$ filled with zeros.

Insertion In order to insert an element $x$ into the count-min sketch, we follow the steps below (see the code sketch at the end of this section):

  • Compute step: Compute hash values $f_0(x), ..., f_{k-1}(x)$.

  • Update step: Update the 2D array $C$ by incrementing the corresponding counts:

    \textrm{for }i\in[\![0, k-1]\!], \quad \boxed{c_{i, f_i(x)}\longleftarrow c_{i, f_i(x)}+1}

Deletion Deleting an element from the count-min sketch is not possible, since removing its associated counts may inadvertently affect other elements that share those counters.

Search In order to compute an estimate of the number of times a given element $x$ was previously inserted into the count-min sketch, we follow the steps below:

  • Compute step: Compute hash values $f_0(x), ..., f_{k-1}(x)$.

  • Check step: Take the minimum across all the corresponding counts:

    \boxed{\widehat{\# x}=\min\left(c_{0, f_0(x)}, ..., c_{k-1, f_{k-1}(x)}\right)}

This estimate is an upper bound because it might have been inflated by hash collisions.
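
To tie these operations together, here is a minimal Python sketch of a count-min sketch; the values of $k$ and $m$ and the SHA-256-based construction of the hash functions are illustrative assumptions, not prescribed above:

```python
import hashlib

class CountMinSketch:
    """Minimal illustrative count-min sketch: a k x m array of counters."""

    def __init__(self, k=4, m=256):
        self.k, self.m = k, m
        self.counts = [[0] * m for _ in range(k)]  # 2D array C, all zeros

    def _f(self, i, x):
        # i-th hash function, uniform over [0, m-1]; derived here from
        # SHA-256 seeded with the row index i (one possible choice).
        digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
        return int.from_bytes(digest, "big") % self.m

    def insert(self, x):
        # Update step: increment c_{i, f_i(x)} in every row i.
        for i in range(self.k):
            self.counts[i][self._f(i, x)] += 1

    def estimate(self, x):
        # Check step: the minimum across rows is an upper bound on #x.
        return min(self.counts[i][self._f(i, x)] for i in range(self.k))

# Usage
cms = CountMinSketch()
for word in ["apple", "apple", "banana", "apple"]:
    cms.insert(word)
print(cms.estimate("apple"))   # 3 (or slightly more if collisions occur)
print(cms.estimate("cherry"))  # 0, unless "cherry" collides in every row
```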

