# Hash tables

By Afshine Amidi and Shervine Amidi

## General concepts​

### Hash set​

A hash set is a data structure that stores a collection of unique values $\{s_{1}, ..., s_{n}\}$ with fast search times.

A hash function $f$ is used to link an element $s$ to the address $f(s)$ of the bucket that contains it. $f(s)$ is called hash value of $s$.

### Hash table​

A hash table, also called hash map, is a data structure used to index large amounts of data with fast key search times. It consists of an unordered collection of key-value pairs $\left\{(k_1, v_1), ..., (k_{n}, v_{n})\right\}$.

A hash function $f$ is used to link a key $k$ to the address $f(k)$ of the bucket containing the associated value $v$. $f(k)$ is called hash value of $k$.

An ideal hash function fulfills the following properties:

• Be easy to compute, to minimize runtimes
• Have a resolution method if two distinct keys have the same hash value, i.e. form a hash collision
• Have a uniform distribution of hash values to ensure a minimal amount of collisions
Remark

Basic hash functions include summing the ASCII representation of characters, whereas more sophisticated ones use advanced algebraic properties.

### Operations​

The following operations can be performed on a hash table with a reasonably good hash function:

TypeTimeDescriptionIllustration
AccessO(1)Using the key k, compute the associated bucket f(k) and access the corresponding value v.
SearchO(n)For each key k:
1. Compute the associated hash value f(k).
2. Check if the value v is in the bucket associated with f(k).
InsertionO(1)Given the key k, compute the corresponding bucket f(k) and add the value v there.
DeletionO(1)Given the key k, compute the corresponding bucket f(k) and remove the value v from there.
Remark

It is crucial to have a good hash function to ensure the $\mathcal{O}(1)$ runtimes.

## Collisions​

### Definition​

A hash collision happens when two different keys have the same hash value. When this is the case, we need a way to resolve it.

The load factor $\rho$ of a hash table is defined based on the total number of items stored $n=\#\{(k_i, v_i)\}$ and the number of buckets $b$ as follows:

$\boxed{\rho\triangleq\frac{n}{b}}$
Remark

If there are more items than there are buckets, then we have $\rho>1$. In this situation, the pigeonhole principle guarantees the presence of hash collisions.

### Collision resolution​

The most common resolution methods can be put into two categories:

• Open addressing: We try to find an open spot by looking at other buckets. This works best when $\rho$ is close to 0.

MethodDescriptionIllustration
Linear probingTries to put the value in the next bucket until there is no more collision.
Quadratic probingTries to put the value in buckets that are further and further away in a quadratic fashion, until there is no more collision.
Double hashingUses a secondary hash function if the first one leads to a collision.
• Closed addressing: We append the value to the existing bucket. This works best when when $\rho$ is close to 1.

MethodDescriptionIllustration
ChainingUses linked lists to sequentially add values assigned to the same bucket.

The risk of collisions highlight the importance of choosing a hash function that leads to a uniform distribution of hash values.

Want more content like this?

Subscribe here to be notified of new Super Study Guide releases!