Hash tables
By Afshine Amidi and Shervine Amidi
General concepts
Hash set
A hash set is a data structure that stores a collection of unique values with fast search times.
A hash function f is used to link an element x to the address f(x) of the bucket that contains it. f(x) is called the hash value of x.
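As an illustration, Python's built-in `set` is a hash set. The sketch below (a minimal example, not the interpreter's actual implementation) shows the fast membership test and the uniqueness of stored values:

```python
# Python's built-in set is a hash set: membership tests are O(1) on average.
colors = {"red", "green", "blue"}

print("green" in colors)   # fast lookup via the hash of "green" -> True
print("purple" in colors)  # absent elements are detected just as fast -> False

colors.add("red")          # duplicates are ignored: stored values are unique
print(len(colors))         # still 3 elements
```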
Hash table
A hash table, also called a hash map, is a data structure used to index large amounts of data with fast key search times. It consists of an unordered collection of key-value pairs (k, v).
A hash function f is used to link a key k to the address f(k) of the bucket containing the associated value v. f(k) is called the hash value of k.
An ideal hash function should:
- Be easy to compute, to minimize runtimes
- Have a resolution method if two distinct keys have the same hash value, i.e. form a hash collision
- Have a uniform distribution of hash values to ensure a minimal amount of collisions
Basic hash functions include summing the ASCII codes of the characters of a key, whereas more sophisticated ones rely on advanced algebraic properties.
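A character-sum hash can be sketched as follows (a deliberately naive illustration; the name `ascii_sum_hash` is made up for this example). Note that it hashes anagrams to the same bucket, which is exactly the kind of non-uniformity a better hash function avoids:

```python
def ascii_sum_hash(key: str, num_buckets: int) -> int:
    """Naive hash: sum the character code points, then reduce modulo the bucket count."""
    return sum(ord(c) for c in key) % num_buckets

# Anagrams collide, since summation ignores character order:
print(ascii_sum_hash("listen", 16))  # same bucket...
print(ascii_sum_hash("silent", 16))  # ...as this one
```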
Operations
The following operations can be performed on a hash table with a reasonably good hash function:
Type | Time | Description |
---|---|---|
Access | O(1) | Using the key k, compute the associated bucket f(k) and access the corresponding value v. |
Search | O(n) | For each key k, compute the hash value f(k) and check whether the value v is in the bucket associated with f(k). |
Insertion | O(1) | Given the key k, compute the corresponding bucket f(k) and add the value v there. |
Deletion | O(1) | Given the key k, compute the corresponding bucket f(k) and remove the value v from there. |
A good hash function is crucial to guarantee these runtimes.
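These operations map directly onto Python's built-in `dict`, which is implemented as a hash table (a brief sketch):

```python
ages = {}                  # empty hash table
ages["alice"] = 31         # insertion: O(1) on average
ages["bob"] = 27

print(ages["alice"])       # access by key: O(1) on average -> 31

# Searching for a *value* requires scanning every entry: O(n)
found = any(v == 27 for v in ages.values())
print(found)               # -> True

del ages["bob"]            # deletion by key: O(1) on average
print(len(ages))           # -> 1
```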
Collisions
Definition
A hash collision happens when two different keys have the same hash value. When this is the case, we need a way to resolve it.
Load factor
The load factor α of a hash table storing n items across k buckets is defined as:

α = n / k

If there are more items than there are buckets, then we have α > 1. In this situation, the pigeonhole principle guarantees the presence of hash collisions.
Collision resolution
The most common resolution methods can be put into two categories:
Open addressing: We try to find an open spot by looking at other buckets. This works best when the load factor α is close to 0.
Method | Description |
---|---|
Linear probing | Tries to put the value in the next bucket until there is no more collision. |
Quadratic probing | Tries to put the value in buckets that are further and further away, in a quadratic fashion, until there is no more collision. |
Double hashing | Uses a secondary hash function if the first one leads to a collision. |
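Linear probing can be sketched as follows (a minimal illustration storing bare keys, with `None` marking empty buckets; not a production implementation):

```python
def insert_linear_probing(buckets, key, hash_fn):
    """Insert key, scanning forward from hash_fn(key) until an empty bucket is found."""
    n = len(buckets)
    i = hash_fn(key) % n
    for _ in range(n):
        if buckets[i] is None:   # open spot found
            buckets[i] = key
            return i
        i = (i + 1) % n          # collision: probe the next bucket
    raise RuntimeError("hash table is full")

buckets = [None] * 8
insert_linear_probing(buckets, 5, lambda k: k)   # lands in bucket 5
insert_linear_probing(buckets, 13, lambda k: k)  # 13 % 8 = 5, collision -> bucket 6
print(buckets)
```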
Closed addressing: We append the value to the existing bucket. This works best when the load factor α is close to 1.
Method | Description |
---|---|
Chaining | Uses linked lists to sequentially add values assigned to the same bucket. |
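Chaining can be sketched with Python lists standing in for the linked lists (an illustrative class, not a production implementation):

```python
class ChainedHashTable:
    """Hash table resolving collisions by chaining (lists play the role of linked lists)."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def insert(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:               # key already present: update in place
                chain[i] = (key, value)
                return
        chain.append((key, value))     # otherwise append to the bucket's chain

    def get(self, key):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for k, v in chain:             # walk the chain to find the key
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.insert("a", 1)
table.insert("i", 2)   # even if "a" and "i" share a bucket, the chain keeps both
print(table.get("a"), table.get("i"))
```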
The risk of collisions highlights the importance of choosing a hash function that leads to a uniform distribution of hash values.