# Hash tables

*By Afshine Amidi and Shervine Amidi*

## General concepts

### Hash set

A hash set is a data structure that stores a collection of unique values $\{s_{1}, ..., s_{n}\}$ with fast search times.

A hash function $f$ is used to link an element $s$ to the address $f(s)$ of the bucket that contains it. $f(s)$ is called the hash value of $s$.
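As a quick illustration, Python's built-in `set` is a hash set: it stores unique values and offers average-case constant-time membership tests.

```python
# Python's built-in set is a hash set: values are unique and
# membership tests run in O(1) on average.
s = {"apple", "banana", "cherry"}

s.add("date")        # insert a value
has_apple = "apple" in s   # fast membership test
has_fig = "fig" in s
s.discard("banana")  # remove a value if present
```

Note that adding a value already in the set leaves it unchanged, since a hash set only stores unique values.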

### Hash table

A hash table, also called hash map, is a data structure used to index large amounts of data with fast key search times. It consists of an unordered collection of key-value pairs $\left\{(k_1, v_1), ..., (k_{n}, v_{n})\right\}$.

A hash function $f$ is used to link a key $k$ to the address $f(k)$ of the bucket containing the associated value $v$. $f(k)$ is called the hash value of $k$.

An ideal hash function fulfills the following properties:

- *Be easy to compute*, to minimize runtimes.
- *Have a resolution method* if two distinct keys have the same hash value, i.e. form a hash collision.
- *Have a uniform distribution of hash values*, to ensure a minimal amount of collisions.

Basic hash functions include summing the ASCII representation of characters, whereas more sophisticated ones use advanced algebraic properties.
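The ASCII-summing approach mentioned above can be sketched as follows (the function name and bucket count are illustrative):

```python
def ascii_sum_hash(key: str, num_buckets: int) -> int:
    """Basic hash function: sum the ASCII codes of the characters,
    then map the sum to a bucket index with the modulo operator."""
    return sum(ord(c) for c in key) % num_buckets
```

This simple scheme distributes hash values poorly: any two anagrams, such as `"listen"` and `"silent"`, produce the same sum and therefore always collide.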

### Operations

The following operations can be performed on a hash table with a reasonably good hash function:

| Type | Time | Description |
|---|---|---|
| Access | $\mathcal{O}(1)$ | Using the key $k$, compute the associated bucket $f(k)$ and access the corresponding value $v$. |
| Search | $\mathcal{O}(n)$ | For each key $k$: (1) compute the associated hash value $f(k)$; (2) check whether the value $v$ is in the bucket associated with $f(k)$. |
| Insertion | $\mathcal{O}(1)$ | Given the key $k$, compute the corresponding bucket $f(k)$ and add the value $v$ there. |
| Deletion | $\mathcal{O}(1)$ | Given the key $k$, compute the corresponding bucket $f(k)$ and remove the value $v$ from there. |

It is crucial to have a good hash function to ensure the $\mathcal{O}(1)$ runtimes.
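These operations map directly onto Python's built-in `dict`, which is a hash table; the example data below is illustrative.

```python
# A dict is a hash table: keys are hashed to locate their values.
phone_book = {"alice": "555-0100", "bob": "555-0199"}

phone_book["carol"] = "555-0123"   # insertion: O(1) on average
number = phone_book["alice"]       # access by key: O(1) on average
del phone_book["bob"]              # deletion by key: O(1) on average

# Searching by *value* cannot use the hash function, so it is O(n).
found = "555-0123" in phone_book.values()
```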

## Collisions

### Definition

A hash collision happens when two different keys have the same hash value. When this is the case, we need a way to resolve it.

### Load factor

The load factor $\rho$ of a hash table is defined based on the total number of items stored $n=\#\{(k_i, v_i)\}$ and the number of buckets $b$ as follows:

$$\rho = \frac{n}{b}$$

If there are more items than there are buckets, then we have $\rho>1$. In this situation, the pigeonhole principle guarantees the presence of hash collisions.
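A quick numerical check of the definition (the function name is illustrative):

```python
def load_factor(num_items: int, num_buckets: int) -> float:
    """Load factor rho = n / b."""
    return num_items / num_buckets

# 12 items spread over 8 buckets: rho = 1.5 > 1, so by the
# pigeonhole principle at least one bucket holds several items.
rho = load_factor(12, 8)
```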

### Collision resolution

The most common resolution methods can be put into two categories:

*Open addressing:* we try to find an open spot by looking at other buckets. This works best when $\rho$ is close to 0.

| Method | Description |
|---|---|
| Linear probing | Tries to put the value in the next bucket until there is no more collision. |
| Quadratic probing | Tries to put the value in buckets that are further and further away in a quadratic fashion, until there is no more collision. |
| Double hashing | Uses a secondary hash function if the first one leads to a collision. |
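Linear probing can be sketched as a minimal open-addressing hash set; this is an illustrative sketch (no resizing or deletion handling), not a production implementation.

```python
class LinearProbingSet:
    """Minimal open-addressing hash set using linear probing.
    Sketch only: assumes the table never fills up completely."""

    def __init__(self, num_buckets: int = 16):
        self.buckets = [None] * num_buckets

    def _probe(self, value) -> int:
        # Start at the hashed bucket; on a collision, try the next
        # bucket (wrapping around) until we find the value or an
        # empty slot.
        i = hash(value) % len(self.buckets)
        while self.buckets[i] is not None and self.buckets[i] != value:
            i = (i + 1) % len(self.buckets)
        return i

    def add(self, value) -> None:
        self.buckets[self._probe(value)] = value

    def __contains__(self, value) -> bool:
        return self.buckets[self._probe(value)] == value
```

Quadratic probing and double hashing follow the same pattern but change how the next candidate bucket is chosen.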

*Closed addressing:* we append the value to the existing bucket. This works best when $\rho$ is close to 1.

| Method | Description |
|---|---|
| Chaining | Uses linked lists to sequentially add values assigned to the same bucket. |
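Chaining can be sketched as follows; for simplicity this illustrative sketch uses Python lists as the per-bucket chains instead of linked lists.

```python
class ChainingMap:
    """Minimal hash map using chaining: each bucket holds a chain
    of (key, value) pairs whose keys share the same bucket."""

    def __init__(self, num_buckets: int = 16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key) -> list:
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value) -> None:
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite existing key
                return
        bucket.append((key, value))       # append on collision or new key

    def get(self, key):
        # Scan the chain of the bucket associated with the key.
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)
```

With only a few buckets, distinct keys routinely land in the same bucket, and the chain resolves the collision at the cost of a short scan.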

The risk of collisions highlights the importance of choosing a hash function that leads to a uniform distribution of hash values.
