Data Structures: Hash Tables
Hash tables are one of the most practical data structures in everyday software. They power fast key-based lookups, caches, symbol tables in compilers, routing tables, session stores, and countless “find this by ID” operations in applications. Their appeal is straightforward: on average, a hash table can insert, find, and delete entries in constant time, often written as O(1).
That speed is not magic. It comes from combining a good hash function with an array-like storage strategy and a plan for handling collisions. Understanding those ingredients helps you choose the right implementation and avoid performance surprises.
What a hash table is
A hash table stores key–value pairs and supports operations such as:
- insert(key, value)
- lookup(key)
- delete(key)
Internally, it maintains an array of buckets or slots. A hash function converts a key into an integer hash code, and that hash code is mapped to an index in the array, typically using a modulo operation:
Inline, the mapping looks like index = hash(key) mod m, where m is the array size.
If each key lands in its own slot, lookup is essentially one array access. The core challenge is that real inputs are messy, and different keys can map to the same slot.
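As a minimal sketch of that mapping (the function name is mine, not from any library), using Python’s built-in hash():

```python
def bucket_index(key, m):
    """Map a key to a slot in a table of m buckets.

    Relies on the language's built-in hash(); real tables often
    mix the hash bits further before taking the remainder.
    """
    return hash(key) % m

# Same key -> same index, for any table size m.
print(bucket_index("alice", 8))
```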
Hash functions: the quality lever
A hash function should be:
- Deterministic: the same key always yields the same hash.
- Fast: hashing should not dominate runtime.
- Well-distributed: it should spread typical inputs uniformly across buckets.
For primitive types like integers, the hash may be the integer itself (possibly mixed). For strings and composite keys, practical hashes combine characters or fields in a way that reduces patterns. A poor hash can turn a hash table into a slow structure even if the collision strategy is correct, because too many keys concentrate into too few buckets.
It is also important to distinguish hashing for data structures from cryptographic hashing. Hash tables generally use non-cryptographic hashes that prioritize speed and distribution, not resistance to adversaries.
Collisions: why they happen and why they matter
A collision occurs when two different keys map to the same index. Collisions are inevitable because the key space is usually far larger than the number of buckets.
Collision resolution determines how the hash table behaves under load. There are two dominant families:
- Chaining (separate chaining)
- Open addressing (probing)
Both can achieve average-case O(1) operations when the table is sized appropriately and the hash function distributes keys well.
Separate chaining
With chaining, each array bucket holds a small collection of entries that share that bucket index, commonly a linked list, dynamic array, or another small structure.
How lookup works with chaining
- Compute the index from the hash.
- Go to that bucket.
- Search within the bucket’s list for the key.
If the hash distribution is good and the table is not overloaded, each bucket’s list stays short. The average work is proportional to the expected bucket length, which is tightly connected to the load factor.
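A minimal chained table might look like the following sketch (class and method names are mine, assuming Python lists as the per-bucket structure):

```python
class ChainedHashTable:
    """Separate chaining: each bucket is a list of (key, value) pairs."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key: append to the chain

    def lookup(self, key):
        for k, v in self._bucket(key):   # scan only this bucket's chain
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i)            # deletion is just a list removal
                return
        raise KeyError(key)
```

Note that every operation touches only one bucket, which is why the average cost tracks the expected chain length rather than the total number of entries.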
Pros and cons of chaining
Pros
- Simple to implement and reason about.
- Deletion is straightforward (remove from the bucket’s list).
- Table can exceed its bucket count without immediate failure; performance degrades gradually.
Cons
- Extra memory per entry (pointers or list overhead).
- Worse cache locality than contiguous arrays.
- Performance depends on the efficiency of the per-bucket structure.
Chaining is a strong default when deletions are common or when you want predictable behavior as the table grows.
Open addressing
With open addressing, all entries live directly in the array. When a collision happens, the table probes other slots according to a probe sequence until it finds an empty slot (for insertion) or the target key (for lookup).
Common probing strategies include:
- Linear probing: try i, i+1, i+2, ...
- Quadratic probing: use a quadratic step pattern
- Double hashing: use a second hash to determine the step size
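As a sketch of how these three strategies generate candidate slots (helper names are hypothetical; h, h1, and h2 stand for already-computed hash indices):

```python
def linear_probes(h, m):
    """Yield h, h+1, h+2, ... modulo the table size m."""
    for i in range(m):
        yield (h + i) % m

def quadratic_probes(h, m):
    """Yield h, h+1, h+4, h+9, ... (the step grows quadratically)."""
    for i in range(m):
        yield (h + i * i) % m

def double_hash_probes(h1, h2, m):
    """Use a second hash as the stride between probes.

    The stride is forced odd so it is coprime with a power-of-two m,
    guaranteeing the sequence can reach every slot.
    """
    step = h2 | 1
    for i in range(m):
        yield (h1 + i * step) % m
```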
Deletion and tombstones
Deletion is more subtle in open addressing. If you simply clear a slot, you can break probe chains and cause future lookups to stop early. A common approach is to mark deleted slots with a special marker called a tombstone. Tombstones preserve probe continuity but can accumulate and slow down operations, which is one reason many implementations periodically rehash.
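Here is a sketch of how tombstones keep probe chains intact in a linear-probing table (the sentinel objects and class name are illustrative, not from any particular library):

```python
EMPTY = None               # slot never used
TOMBSTONE = object()       # slot used, then deleted

class OpenAddressedTable:
    """Linear probing with tombstone deletion (illustrative sketch)."""

    def __init__(self, num_slots=8):
        self.slots = [EMPTY] * num_slots

    def _probe(self, key):
        m = len(self.slots)
        i = hash(key) % m
        for _ in range(m):
            yield i
            i = (i + 1) % m            # linear probe sequence

    def insert(self, key, value):
        target = None                  # first reusable tombstone we pass
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:          # true gap: key cannot exist past here
                self.slots[target if target is not None else i] = (key, value)
                return
            if slot is TOMBSTONE:
                if target is None:
                    target = i         # remember it, keep scanning for the key
            elif slot[0] == key:
                self.slots[i] = (key, value)  # overwrite existing key
                return
        if target is not None:         # no gap found, but a tombstone is free
            self.slots[target] = (key, value)
            return
        raise RuntimeError("table full; a real implementation would resize")

    def lookup(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                break                  # clearing this slot instead of using a
                                       # tombstone would have broken the chain
            if slot is not TOMBSTONE and slot[0] == key:
                return slot[1]
        raise KeyError(key)

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                break
            if slot is not TOMBSTONE and slot[0] == key:
                self.slots[i] = TOMBSTONE   # mark, don't clear
                return
        raise KeyError(key)
```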
Pros and cons of open addressing
Pros
- Better cache locality; often fast in practice for reads.
- No per-entry pointer overhead.
- Memory layout is compact.
Cons
- Performance degrades sharply as the table fills.
- Deletion requires tombstones or equivalent handling.
- Sensitive to clustering effects (especially with linear probing).
Open addressing can be excellent for read-heavy workloads when you maintain a conservative load factor and have a solid resizing strategy.
Load factor and resizing
The load factor is a central concept:
α = n / m
where n is the number of stored entries and m is the number of buckets/slots.
- In chaining, α approximates the average bucket length.
- In open addressing, α represents how full the table is, which strongly influences the average probe length.
As α rises, collisions become more frequent, and operations slow down. To keep average-case performance near O(1), hash tables resize when the load factor crosses a threshold. Resizing typically means allocating a larger array and reinserting (rehashing) all existing keys, because the index mapping depends on m.
This rehashing cost is expensive in the moment, but with a doubling strategy it is usually amortized: most inserts are cheap, and the occasional resize spreads out over many operations.
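A sketch of the doubling strategy, layered on the hypothetical ChainedHashTable sketched earlier (the 0.75 threshold is a common choice, not a universal rule):

```python
MAX_LOAD = 0.75  # common threshold; real libraries vary

def insert_with_resize(table, key, value):
    """Insert, then double the bucket array if load exceeds MAX_LOAD.

    Every key must be rehashed on resize because its index
    depends on the new bucket count m.
    """
    table.insert(key, value)
    n = sum(len(b) for b in table.buckets)  # entries (a real table keeps a counter)
    m = len(table.buckets)                  # buckets
    if n / m > MAX_LOAD:
        old_buckets = table.buckets
        table.buckets = [[] for _ in range(2 * m)]  # double m
        for bucket in old_buckets:
            for k, v in bucket:
                table.insert(k, v)          # reinsert: index = hash(k) % (2m)
```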
Complexity: average case vs worst case
Under common assumptions (good hashing, controlled load factor):
- Average insert, lookup, delete: O(1)
Worst-case behavior can be:
- O(n) for chaining if all keys land in one bucket.
- O(n) for open addressing if the table becomes extremely full or probing degenerates.
In practice, worst cases matter when:
- The hash function is weak for the input distribution.
- The table is allowed to reach high load factors.
- Inputs can be adversarial, such as public-facing systems where an attacker can choose keys.
If adversarial behavior is a concern, using robust hashing strategies and carefully chosen implementations becomes part of the security posture, not just a performance choice.
Real-world applications
Hash tables are used whenever you need fast membership tests or key-based retrieval:
- Dictionaries and maps: implementing Map<K,V> or Dictionary.
- Sets: storing unique items and testing membership quickly.
- Caching: mapping request keys to cached responses.
- Databases and indexing: in-memory hash indexes and join algorithms.
- Compilers and interpreters: symbol tables mapping identifiers to metadata.
- Networking: fast lookup of routes, sessions, or connection state.
A common example is de-duplicating items: store each seen key in a hash set, and treat a lookup hit as “already processed.” This pattern stays fast even at large scale when the load factor remains under control.
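For example, using Python’s built-in set (itself a hash table under the hood):

```python
def dedupe(items):
    """Yield each item the first time it is seen."""
    seen = set()
    for item in items:
        if item not in seen:   # average-case constant-time membership test
            seen.add(item)
            yield item

print(list(dedupe(["a", "b", "a", "c", "b"])))  # ['a', 'b', 'c']
```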
Practical guidance for choosing and using hash tables
Choose the collision strategy that matches the workload
- Prefer chaining when deletions are frequent or when you want graceful degradation.
- Prefer open addressing when memory overhead matters and the workload is read-heavy, with careful load management.
Pay attention to load factor thresholds
Even if an API hides resizing details, performance often hinges on not letting the table get too full. If your environment lets you pre-size a hash table based on expected volume, do it to reduce rehashing and keep operations stable.
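If your environment exposes an initial-capacity knob, one way to size it is from the expected volume and a target load factor. A sketch, assuming a power-of-two bucket count (the numbers are illustrative):

```python
def initial_buckets(expected_entries, max_load=0.75):
    """Pick a power-of-two bucket count that keeps the load factor
    under max_load after expected_entries inserts, so the initial
    fill triggers no rehashes."""
    m = 8
    while expected_entries / m > max_load:
        m *= 2
    return m

print(initial_buckets(1_000_000))  # 2097152 buckets for ~1M entries
```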
Ensure keys hash well
When you define custom keys, implement hashing and equality consistently. Two keys that are equal must produce the same hash. If hashing ignores important fields, you may create accidental clustering and slowdowns.
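In Python terms, that means __hash__ and __eq__ must be derived from the same fields. A sketch using a frozen dataclass, which generates both consistently (the key type is hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RouteKey:
    """Custom key: frozen=True makes the dataclass generate __hash__
    and __eq__ over the same fields (host, port). Hashing only host
    while comparing both would still be correct, but could cluster
    many distinct keys into the same bucket."""
    host: str
    port: int

table = {RouteKey("example.com", 443): "session-1"}
print(table[RouteKey("example.com", 443)])  # equal keys -> same hash -> hit
```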
Closing perspective
Hash tables are a prime example of a data structure whose everyday performance depends on a few core ideas: hashing, collision resolution, and load factor control. Get those right, and you get near-constant-time behavior that makes many systems feel instant. Get them wrong, and the same structure can quietly become a bottleneck. Understanding how chaining and open addressing behave under real loads is what turns “it works” into “it scales.”