
Hash tables

By Afshine Amidi and Shervine Amidi

General concepts

Hash set

A hash set is a data structure that stores a collection of unique values $\{s_1, ..., s_n\}$ with fast search times.

A hash function $f$ is used to link an element $s$ to the address $f(s)$ of the bucket that contains it. $f(s)$ is called the hash value of $s$.
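As a concrete illustration, Python's built-in set is backed by a hash table, so membership tests run in constant time on average (a minimal sketch; the values used here are arbitrary):

```python
# Python's built-in set is a hash set: each value is hashed to a bucket,
# so membership tests are O(1) on average instead of O(n).
values = {"apple", "banana", "cherry"}  # arbitrary example values

values.add("date")          # insert a new value
print("banana" in values)   # True: fast lookup via the hash of "banana"
print("grape" in values)    # False: absent values are detected just as quickly
```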

Hash table

A hash table, also called a hash map, is a data structure used to index large amounts of data with fast key search times. It consists of an unordered collection of key-value pairs $\{(k_1, v_1), ..., (k_n, v_n)\}$.

A hash function $f$ is used to link a key $k$ to the address $f(k)$ of the bucket containing the associated value $v$. $f(k)$ is called the hash value of $k$.

An ideal hash function fulfills the following properties:

  • It is easy to compute, to minimize runtimes.
  • It comes with a resolution method for when two distinct keys have the same hash value, i.e. form a hash collision.
  • It distributes hash values uniformly, to ensure a minimal number of collisions.
Remark

Basic hash functions include summing the ASCII codes of a key's characters, whereas more sophisticated ones rely on advanced algebraic properties.
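The ASCII-summing approach mentioned above can be sketched as follows (a toy example for illustration, not a production-quality hash function):

```python
def ascii_sum_hash(key: str, num_buckets: int) -> int:
    """Toy hash function: sum the ASCII codes of the characters,
    then reduce modulo the number of buckets."""
    return sum(ord(c) for c in key) % num_buckets

# "ab" and "ba" contain the same characters, so this simple scheme
# gives them the same hash value -- a hash collision.
print(ascii_sum_hash("ab", 10))  # (97 + 98) % 10 = 5
print(ascii_sum_hash("ba", 10))  # same sum, same bucket: 5
```

This also shows why a naive hash function distributes values poorly: any two anagrams collide by construction.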

Operations

The following operations can be performed on a hash table with a reasonably good hash function:

  • Access, $\mathcal{O}(1)$: Using the key $k$, compute the associated bucket $f(k)$ and access the corresponding value $v$.
  • Search, $\mathcal{O}(n)$: To search for a value $v$, go through each key $k$: compute the associated hash value $f(k)$ and check whether $v$ is in the bucket associated with $f(k)$.
  • Insertion, $\mathcal{O}(1)$: Given the key $k$, compute the corresponding bucket $f(k)$ and add the value $v$ there.
  • Deletion, $\mathcal{O}(1)$: Given the key $k$, compute the corresponding bucket $f(k)$ and remove the value $v$ from there.
Remark

It is crucial to have a good hash function to ensure the $\mathcal{O}(1)$ runtimes.
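The four operations above map directly onto Python's dict, which is itself a hash table (a minimal sketch with arbitrary example keys and values):

```python
ages = {"alice": 34, "bob": 27}  # arbitrary example key-value pairs

ages["carol"] = 41     # insertion: hash "carol", place 41 in its bucket -- O(1)
value = ages["alice"]  # access: hash "alice", read its bucket -- O(1)
del ages["bob"]        # deletion: hash "bob", remove the pair -- O(1)

# Searching for a *value* cannot use the hash function, so every
# entry must be visited -- O(n).
found = any(v == 41 for v in ages.values())
print(value, found)  # 34 True
```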

Collisions

Definition

A hash collision happens when two different keys have the same hash value. When this is the case, we need a way to resolve it.

Load factor

The load factor $\rho$ of a hash table is defined based on the total number of items stored $n=\#\{(k_i, v_i)\}$ and the number of buckets $b$ as follows:

$$\boxed{\rho\triangleq\frac{n}{b}}$$
Remark

If there are more items than there are buckets, then we have $\rho>1$. In this situation, the pigeonhole principle guarantees the presence of hash collisions.
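The load factor can be computed directly from these two counts (a small worked example with made-up numbers):

```python
def load_factor(num_items: int, num_buckets: int) -> float:
    """Load factor rho = n / b."""
    return num_items / num_buckets

# 12 items spread over 8 buckets: rho > 1, so by the pigeonhole
# principle at least one bucket must hold two or more items.
rho = load_factor(12, 8)
print(rho)  # 1.5
```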

Collision resolution

The most common resolution methods can be put into two categories:

  • Open addressing: We try to find an open spot by looking at other buckets. This works best when $\rho$ is close to 0.

      • Linear probing: Tries to put the value in the next bucket until there is no more collision.
      • Quadratic probing: Tries to put the value in buckets that are further and further away in a quadratic fashion, until there is no more collision.
      • Double hashing: Uses a secondary hash function if the first one leads to a collision.
  • Closed addressing: We append the value to the existing bucket. This works best when $\rho$ is close to 1.

      • Chaining: Uses linked lists to sequentially add values assigned to the same bucket.
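Chaining can be sketched as follows: each bucket holds a list, and colliding keys are simply appended to the same bucket (a minimal illustrative implementation using Python lists in place of linked lists; the class and method names are my own):

```python
class ChainedHashMap:
    """Minimal hash map using chaining: each bucket is a list of
    (key, value) pairs, so colliding keys share a bucket."""

    def __init__(self, num_buckets: int = 8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # The hash value, reduced modulo b, selects the bucket.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key: chain it onto the bucket

    def get(self, key):
        for k, v in self._bucket(key):   # scan only this key's chain
            if k == key:
                return v
        raise KeyError(key)

    def remove(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return
        raise KeyError(key)


m = ChainedHashMap(num_buckets=2)  # tiny table to force collisions
m.put("a", 1)
m.put("b", 2)
m.put("c", 3)  # with 3 keys and 2 buckets, some chain holds 2+ entries
print(m.get("c"))  # 3
```

Because collisions only lengthen a chain rather than blocking an insert, chaining keeps working even when $\rho$ exceeds 1, at the cost of longer scans within each bucket.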

The risk of collisions highlights the importance of choosing a hash function that leads to a uniform distribution of hash values.

