Collision Resolution: Open Addressing

When you implement a hash table, collisions are inevitable. Open addressing is a powerful strategy that handles these collisions not by chaining items into external lists, but by systematically searching—or probing—for the next available empty slot within the table's own array. This approach stores all key-value pairs directly in the table, leading to excellent memory locality and cache performance, which is critical in high-performance engineering systems. Mastering its variants and trade-offs is essential for designing efficient data storage and retrieval mechanisms.

Core Concepts of Probing

At its heart, open addressing resolves a collision by computing a sequence of alternative indices to probe. When the primary hash index is occupied, the algorithm applies a probe sequence to find the next candidate slot. This sequence is determined by a probing function. The table's load factor, denoted α = n/m (number of elements divided by table size), becomes a crucial metric; as α approaches 1, the table becomes full, and probe sequences grow impractically long. A common rule is to resize the table (typically double its size and rehash all items) when α exceeds a threshold like 0.7.
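The load-factor-triggered resize rule can be sketched in Python. This is an illustrative miniature, not a reference implementation: the `OpenTable` class name, the use of linear probing, and the 0.7 threshold are all assumptions for the sketch.

```python
# Minimal sketch of load-factor-triggered resizing in an open-addressed
# table. Linear probing and the 0.7 threshold are illustrative choices.

class OpenTable:
    LOAD_LIMIT = 0.7

    def __init__(self, size=8):
        self.slots = [None] * size   # None marks a truly empty slot
        self.count = 0

    def _probe(self, key):
        """Yield the linear probe sequence of indices for `key`."""
        start = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            yield (start + i) % len(self.slots)

    def insert(self, key, value):
        # Resize before the insert would push the load factor past the limit.
        if (self.count + 1) / len(self.slots) > self.LOAD_LIMIT:
            self._resize()
        for idx in self._probe(key):
            if self.slots[idx] is None or self.slots[idx][0] == key:
                if self.slots[idx] is None:
                    self.count += 1
                self.slots[idx] = (key, value)
                return
        raise RuntimeError("table full")  # unreachable with resizing

    def _resize(self):
        # Double the array and rehash every surviving entry.
        old = [s for s in self.slots if s is not None]
        self.slots = [None] * (len(self.slots) * 2)
        self.count = 0
        for key, value in old:
            self.insert(key, value)
```

Note that `_resize` reinserts entries rather than copying slots: each key's home index depends on the table size, so every item must be rehashed into the larger array.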

The primary challenge is designing a probe sequence that distributes keys uniformly to avoid clustering, where consecutive groups of occupied slots form, drastically increasing search times. There are three classical probing methods, each with distinct cluster behaviors.

Linear Probing

Linear probing is the simplest method. When a collision occurs at index h(k), you check the next sequential slots: h(k) + 1, h(k) + 2, and so on, wrapping around to the start of the array. The probe function is h(k, i) = (h(k) + c·i) mod m, where c is a constant (often 1), so the sequence is: h(k), h(k) + c, h(k) + 2c, … (all mod m).

Its primary advantage is simplicity and excellent cache performance due to checking adjacent memory locations. However, it suffers severely from primary clustering. This occurs when keys hash to the same or neighboring indices, creating long, contiguous runs of occupied cells. New keys that hash anywhere within a run will be placed at its end, making the run even longer. Search times degrade significantly as clusters grow.

For example, imagine a hash table of size 7. Keys A and B both hash to index 2. A occupies 2. When B is inserted, linear probing finds the next open slot, index 3. If a new key C hashes to 3, it must now probe to index 4. A cluster from indices 2-4 has formed. Any key hashing to 2, 3, or 4 will lengthen this cluster further.
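This walk-through can be reproduced with a short sketch. The `insert_linear` helper is illustrative, and the home indices for A, B, and C are fixed to match the example rather than computed from a real hash function.

```python
# Reproduce the size-7 example: A and B hash to index 2, C hashes to 3.
# Linear probing places each colliding key in the next free slot.

def insert_linear(table, key, home):
    """Insert `key` starting at its home index, probing linearly."""
    m = len(table)
    for i in range(m):
        idx = (home + i) % m
        if table[idx] is None:
            table[idx] = key
            return idx
    raise RuntimeError("table full")

t = [None] * 7
print(insert_linear(t, "A", 2))  # A lands at 2
print(insert_linear(t, "B", 2))  # collides with A, lands at 3
print(insert_linear(t, "C", 3))  # collides with B, lands at 4
print(t)                         # ['A', 'B', 'C'] occupy the 2-4 cluster
```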

Quadratic Probing

Quadratic probing attempts to mitigate primary clustering by using a quadratic function to determine the probe offset. The sequence is defined as: h(k, i) = (h(k) + c₁·i + c₂·i²) mod m, where i is the probe attempt number (0, 1, 2, …), and c₁ and c₂ are constants.

This method spreads subsequent probes farther away from the original collision point. If the first probe is at index h(k), subsequent probes check h(k) + 1, h(k) + 4, h(k) + 9, etc. (taking c₁ = 0 and c₂ = 1, with wrap-around). This breaks up the long runs characteristic of linear probing. However, it introduces a subtler issue: secondary clustering. Two keys with the same initial hash value will follow the exact same probe sequence. While they don't create long linear blocks, they still contend for the same sequence of slots, which can increase search times.

A critical consideration with quadratic probing is that it is not guaranteed to find an empty slot even if one exists. It can cycle through the same set of indices without covering the entire table. Choosing a table size that is a prime number and keeping the load factor α below 0.5 guarantees success: with c₁ = 0 and c₂ = 1, the first (m + 1)/2 probes into a prime-sized table are all distinct.
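A quick sketch illustrates the coverage problem. The `quadratic_indices` helper is hypothetical and uses the common c₁ = 0, c₂ = 1 variant: a power-of-two table revisits the same few slots forever, while a prime-sized table reaches (m + 1)/2 distinct slots in its first (m + 1)/2 probes.

```python
# Indices visited by quadratic probing h(k, i) = (home + i*i) % m.

def quadratic_indices(home, m, attempts):
    return [(home + i * i) % m for i in range(attempts)]

# Size 16 (not prime): only 4 of the 16 slots are ever reachable from
# home index 0, no matter how many probes are attempted.
print(sorted(set(quadratic_indices(0, 16, 16))))   # [0, 1, 4, 9]

# Prime size 7: the first (7 + 1) // 2 = 4 probes hit 4 distinct slots,
# which is why a load factor below 0.5 guarantees an insertion succeeds.
print(sorted(set(quadratic_indices(0, 7, 4))))     # [0, 1, 2, 4]
```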

Double Hashing

Double hashing is the most effective technique at minimizing both primary and secondary clustering. It uses two independent hash functions. The first, h1(k), determines the initial index. The second, h2(k), calculates the step size for probing. The probe sequence is: h(k, i) = (h1(k) + i·h2(k)) mod m.

The second hash must never evaluate to zero, and it's ideal if it is relatively prime to the table size. This ensures the probe sequence explores a unique permutation of table slots for each unique initial hash. Different keys that collide at the first hash will almost always have different step sizes, leading to completely different probe sequences. This results in a probe sequence distribution that closely mimics perfect random probing, offering the best theoretical performance among the classic open addressing methods.

Consider a table of size 10. For a key K1, let h1(K1) = 3 and h2(K1) = 7. Its probe sequence is 3, 0, 7, 4, 1, 8, 5, 2, 9, 6, covering every slot. For a colliding key K2 with h1(K2) = 3 but h2(K2) = 4, the sequence is 3, 7, 1, 5, 9, 3, … entirely different, though it cycles after five slots because the step size 4 shares a factor with the table size 10.
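These probe sequences can be verified with a few lines; the `double_hash_indices` helper is illustrative.

```python
# Probe sequences for the size-10 example: h(k, i) = (h1 + i * h2) % m.

def double_hash_indices(h1, h2, m):
    return [(h1 + i * h2) % m for i in range(m)]

# K1: step 7 is coprime to 10, so all ten slots are visited.
print(double_hash_indices(3, 7, 10))  # [3, 0, 7, 4, 1, 8, 5, 2, 9, 6]

# K2: step 4 shares a factor with 10, so the sequence cycles through
# only five slots - the reason h2 should be relatively prime to m.
print(double_hash_indices(3, 4, 10))  # [3, 7, 1, 5, 9, 3, 7, 1, 5, 9]
```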

The Complication of Deletion

Deletion in an open-addressed hash table is not straightforward. You cannot simply nullify a slot. If you do, you risk breaking the probe sequence for subsequent searches. A search for a key probes until it finds either the key or an empty slot. If you create a "hole" by setting a slot to empty, a search for a key that probed through that slot before deletion will stop prematurely at the hole, incorrectly concluding the key is not present.

The standard solution is to mark a deleted slot with a special tombstone (or "deleted") marker. During insertion, a tombstone is treated as an occupied slot for the purpose of stopping a search (so you don't insert a duplicate key prematurely) but as an empty slot for the purpose of placing a new key. During a search, you must probe past tombstones. Over time, an accumulation of tombstones can degrade performance, as they increase the average probe length. Periodic rehashing of the entire table is required to clean them out.
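The tombstone rules can be sketched as follows. The `TOMBSTONE` sentinel name and the use of linear probing are illustrative choices, assumed for the sketch.

```python
# Tombstone-aware search and delete over a linear-probing table.
# Each slot holds None (empty), TOMBSTONE (deleted), or a (key, value) pair.

TOMBSTONE = object()  # unique sentinel marking a deleted slot

def find(table, key, home):
    """Return the value for `key`, or None; probes past tombstones."""
    m = len(table)
    for i in range(m):
        slot = table[(home + i) % m]
        if slot is None:          # truly empty: the key cannot be further on
            return None
        if slot is TOMBSTONE:     # deleted slot: keep probing
            continue
        if slot[0] == key:
            return slot[1]
    return None

def delete(table, key, home):
    """Replace the key's slot with a tombstone, never with None."""
    m = len(table)
    for i in range(m):
        idx = (home + i) % m
        slot = table[idx]
        if slot is None:
            return False          # hit a real empty slot: key is absent
        if slot is not TOMBSTONE and slot[0] == key:
            table[idx] = TOMBSTONE
            return True
    return False
```

Deleting a key that sits mid-chain leaves a tombstone, so a later `find` for a key stored further along the same probe sequence still succeeds.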

Common Pitfalls

  1. Ignoring Tombstones During Search: A frequent implementation error is treating a tombstone as an empty slot during a get() operation. This causes the search to terminate incorrectly. You must code your search loop to continue probing when it encounters a tombstone, stopping only on a truly empty slot.
  • Correction: The search logic should be: "If slot contains key, return value. If slot is truly null/empty, return 'not found'. If slot is a tombstone, continue probing."
  2. Failing to Handle Wrap-Around: Forgetting to apply the modulo operation correctly on each probe calculation can lead to index-out-of-bounds errors. The probe sequence must cyclically explore the entire array.
  • Correction: Always compute the next index as (current_index + probe_offset) % table_size.
  3. Allowing the Load Factor to Grow Too High: Open addressing performance collapses as the table nears capacity. An insert operation can degenerate into an O(n) linear scan if the table is almost full.
  • Correction: Implement automatic resizing and rehashing when the load factor exceeds a predefined threshold (e.g., α > 0.7). All elements, including those behind tombstones, must be reinserted into the new, larger table.
  4. Poor Choice of Second Hash in Double Hashing: Using a second hash function that returns zero or a value that shares factors with the table size can cause the probe sequence to repeat prematurely, failing to check all slots.
  • Correction: Ensure h2(k) ≠ 0. A common strategy is to derive it as hash2(key) = R - (key % R), where R is a prime smaller than the table size, and use a prime table size.
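The R - (key % R) construction can be sanity-checked with a short sketch; R = 7 is an illustrative choice of prime.

```python
# hash2(key) = R - (key % R) with prime R = 7: key % R is in 0..6, so the
# step is always in 1..7 and can never be zero. Paired with a prime table
# size such as 11, every step is also coprime to the table size.

R = 7

def hash2(key):
    return R - (key % R)

steps = [hash2(k) for k in range(100)]
print(min(steps), max(steps))  # 1 7
```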

Summary

  • Open addressing resolves hash collisions by probing for empty slots within the table's own array, using methods like linear probing, quadratic probing, and double hashing.
  • Linear probing is cache-friendly but causes primary clustering, while quadratic probing reduces primary clustering but can lead to secondary clustering.
  • Double hashing, using two independent hash functions, provides the best theoretical performance by minimizing clustering and creating near-uniform probe sequence distributions.
  • Deletion requires special tombstone markers to avoid breaking probe sequences during subsequent searches, though tombstones themselves can degrade performance over time.
  • The table's load factor must be carefully monitored, with resizing and rehashing implemented to maintain efficient expected time complexity for operations.
