
Hash tables

By Afshine Amidi and Shervine Amidi

General concepts

Hash set

A hash set is a data structure that stores a collection of unique values $\{s_1, ..., s_n\}$ with fast search times.

A hash function $f$ is used to link an element $s$ to the address $f(s)$ of the bucket that contains it. $f(s)$ is called the hash value of $s$.
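As a concrete illustration, Python's built-in set is backed by a hash table, so membership tests run in constant time on average (a minimal sketch; the values used here are arbitrary):

```python
# Python's built-in set is a hash set: each value is hashed to a bucket,
# so membership tests are O(1) on average instead of O(n).
values = {"apple", "banana", "cherry"}  # arbitrary example values

values.add("date")          # insert a new value
print("banana" in values)   # True: fast lookup via the hash of "banana"
print("grape" in values)    # False: absent values are detected just as quickly
```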

Hash table

A hash table, also called a hash map, is a data structure used to index large amounts of data with fast key search times. It consists of an unordered collection of key-value pairs $\{(k_1, v_1), ..., (k_n, v_n)\}$.

A hash function $f$ is used to link a key $k$ to the address $f(k)$ of the bucket containing the associated value $v$. $f(k)$ is called the hash value of $k$.

An ideal hash function fulfills the following properties:

  • It is easy to compute, to minimize runtimes.
  • It comes with a resolution method for when two distinct keys have the same hash value, i.e. form a hash collision.
  • It distributes hash values uniformly, to ensure a minimal number of collisions.
Remark

Basic hash functions include summing the ASCII codes of a key's characters, whereas more sophisticated ones rely on advanced algebraic properties.
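The ASCII-summing approach mentioned above can be sketched as follows (a toy example for illustration, not a production-quality hash function):

```python
def ascii_sum_hash(key: str, num_buckets: int) -> int:
    """Toy hash function: sum the ASCII codes of the characters,
    then reduce modulo the number of buckets."""
    return sum(ord(c) for c in key) % num_buckets

# "ab" and "ba" contain the same characters, so this simple scheme
# gives them the same hash value -- a hash collision.
print(ascii_sum_hash("ab", 10))  # (97 + 98) % 10 = 5
print(ascii_sum_hash("ba", 10))  # same sum, same bucket: 5
```

This also shows why a naive hash function distributes values poorly: any two anagrams collide by construction.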

Operations

The following operations can be performed on a hash table with a reasonably good hash function:

  • Access, $\mathcal{O}(1)$: Using the key $k$, compute the associated bucket $f(k)$ and access the corresponding value $v$.
  • Search, $\mathcal{O}(n)$: To search for a value $v$, go through each key $k$: compute the associated hash value $f(k)$ and check whether $v$ is in the bucket associated with $f(k)$.
  • Insertion, $\mathcal{O}(1)$: Given the key $k$, compute the corresponding bucket $f(k)$ and add the value $v$ there.
  • Deletion, $\mathcal{O}(1)$: Given the key $k$, compute the corresponding bucket $f(k)$ and remove the value $v$ from there.
Remark

It is crucial to have a good hash function to ensure the $\mathcal{O}(1)$ runtimes.
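The four operations above map directly onto Python's dict, which is itself a hash table (a minimal sketch with arbitrary example keys and values):

```python
ages = {"alice": 34, "bob": 27}  # arbitrary example key-value pairs

ages["carol"] = 41     # insertion: hash "carol", place 41 in its bucket -- O(1)
value = ages["alice"]  # access: hash "alice", read its bucket -- O(1)
del ages["bob"]        # deletion: hash "bob", remove the pair -- O(1)

# Searching for a *value* cannot use the hash function, so every
# entry must be visited -- O(n).
found = any(v == 41 for v in ages.values())
print(value, found)  # 34 True
```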

Collisions

Definition

A hash collision happens when two different keys have the same hash value. When this is the case, we need a way to resolve it.

Load factor

The load factor $\rho$ of a hash table is defined based on the total number of items stored $n=\#\{(k_i, v_i)\}$ and the number of buckets $b$ as follows:

$$\boxed{\rho\triangleq\frac{n}{b}}$$
Remark

If there are more items than there are buckets, then we have $\rho>1$. In this situation, the pigeonhole principle guarantees the presence of hash collisions.
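The load factor can be computed directly from these two counts (a small worked example with made-up numbers):

```python
def load_factor(num_items: int, num_buckets: int) -> float:
    """Load factor rho = n / b."""
    return num_items / num_buckets

# 12 items spread over 8 buckets: rho > 1, so by the pigeonhole
# principle at least one bucket must hold two or more items.
rho = load_factor(12, 8)
print(rho)  # 1.5
```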

Collision resolution

The most common resolution methods can be put into two categories:

  • Open addressing: We try to find an open spot by looking at other buckets. This works best when $\rho$ is close to 0.

      • Linear probing: Tries to put the value in the next bucket until there is no more collision.
      • Quadratic probing: Tries to put the value in buckets that are further and further away in a quadratic fashion, until there is no more collision.
      • Double hashing: Uses a secondary hash function if the first one leads to a collision.
  • Closed addressing: We append the value to the existing bucket. This works best when $\rho$ is close to 1.

      • Chaining: Uses linked lists to sequentially add values assigned to the same bucket.
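Chaining can be sketched as follows: each bucket holds a list, and colliding keys are simply appended to the same bucket (a minimal illustrative implementation using Python lists in place of linked lists; the class and method names are my own):

```python
class ChainedHashMap:
    """Minimal hash map using chaining: each bucket is a list of
    (key, value) pairs, so colliding keys share a bucket."""

    def __init__(self, num_buckets: int = 8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # The hash value, reduced modulo b, selects the bucket.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key: chain it onto the bucket

    def get(self, key):
        for k, v in self._bucket(key):   # scan only this key's chain
            if k == key:
                return v
        raise KeyError(key)

    def remove(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return
        raise KeyError(key)


m = ChainedHashMap(num_buckets=2)  # tiny table to force collisions
m.put("a", 1)
m.put("b", 2)
m.put("c", 3)  # with 3 keys and 2 buckets, some chain holds 2+ entries
print(m.get("c"))  # 3
```

Because collisions only lengthen a chain rather than blocking an insert, chaining keeps working even when $\rho$ exceeds 1, at the cost of longer scans within each bucket.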

The risk of collisions highlights the importance of choosing a hash function that leads to a uniform distribution of hash values.

