Feb 25

Cuckoo Hashing

Mindli Team

AI-Generated Content

In computer science, designing a hash table that guarantees fast lookups is a classic challenge. While standard chaining or open addressing can suffer from slow worst-case search times, Cuckoo hashing offers an elegant solution: worst-case constant-time lookups. This comes at the cost of a more complex insertion process that can evict existing items, much like a cuckoo bird displaces eggs from a nest. Understanding this trade-off is key to implementing efficient, predictable data structures for caching, routers, and databases where lookup speed is critical.

The Core Idea: Multiple Tables and Hash Functions

At its heart, Cuckoo hashing maintains two separate hash tables, often called Table 1 and Table 2. Each table has its own independent hash function. For any given key x, there are exactly two possible locations where it can reside: one in Table 1 (determined by hash function h₁) and one in Table 2 (determined by hash function h₂).

This design is what enables the worst-case lookup guarantee. To find a key x, you simply compute h₁(x) and check that slot in Table 1. If it's not there, you compute h₂(x) and check that slot in Table 2. That's it—just two memory accesses, regardless of how full the table is. This is a stark contrast to chaining, where a long linked list must be traversed, or open addressing, which may require a long probe sequence.
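The two-probe lookup can be sketched in a few lines of Python. The plain-list table layout and the salted second hash are illustrative assumptions, not a reference implementation:

```python
def h1(key, size):
    """Slot for key in Table 1."""
    return hash(key) % size

def h2(key, size):
    """Slot for key in Table 2 (salted so it differs from h1)."""
    return hash(("salt", key)) % size

def lookup(key, table1, table2):
    """At most two probes, no matter how full the tables are."""
    size = len(table1)
    return table1[h1(key, size)] == key or table2[h2(key, size)] == key
```

Note that `lookup` never loops: the cost is bounded by two slot reads even at high occupancy.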

The magic—and complexity—lies in how insertion maintains this invariant. A key must always be found in one of its two designated slots. Insertion is not merely about placing an item in an empty slot; it's about ensuring the two-slot rule holds true for every key in the tables, even if it requires moving other keys around.

The Insertion Process and Displacement Chains

Inserting a new key x begins with a simple check. You look in its slot in Table 1, index h₁(x). If empty, you place x there and are done. If occupied, you do not give up. Instead, you displace the existing key, say y, and place x in that slot. Now, the evicted key y must find a new home. It has a second chance: its alternate slot in the other table.

So, you check y's other possible location. For example, if y was evicted from Table 1, its other home is in Table 2, at index h₂(y). If that slot is empty, you place y there and the insertion is complete. If it is also occupied by another key z, you repeat the process: evict z, place y in its spot, and then try to relocate z. This creates a displacement chain, a sequence of evictions and relocations.

This process can be visualized as a "cuckoo" moving from nest to nest, displacing occupants. The algorithm continues this chain until it either finds an empty slot or a condition is met that forces a different resolution.
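The displacement loop can be sketched as follows, assuming plain-list tables and externally supplied hash functions. The eviction limit `MAX_LOOP` is an illustrative choice; a `False` return signals a likely cycle:

```python
MAX_LOOP = 16  # illustrative eviction limit guarding against cycles

def cuckoo_insert(key, t1, t2, h1, h2):
    """Insert key via a displacement chain.

    Returns True on success; False means the chain hit MAX_LOOP,
    which usually indicates a cycle and calls for a rehash.
    """
    for _ in range(MAX_LOOP):
        i = h1(key)
        key, t1[i] = t1[i], key      # place key, pick up any evicted occupant
        if key is None:              # the slot was empty: done
            return True
        j = h2(key)                  # evicted key's alternate home in Table 2
        key, t2[j] = t2[j], key
        if key is None:
            return True
    return False                     # displacement limit reached
```

Each loop iteration performs one eviction into Table 1 and one relocation attempt into Table 2, exactly mirroring the x-evicts-y, y-evicts-z chain described above.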

Handling Cycles and The Need for Rehashing

A major challenge arises when the displacement chain forms a cycle. Imagine inserting key A which displaces B, which displaces C, which tries to displace A again, returning to the start. The algorithm would loop indefinitely. To prevent this, implementations typically set a maximum limit on the number of displacements (e.g., a fixed MaxLoop such as 6, or a small multiple of log n for a table with n slots).

When this limit is exceeded, it indicates the current hash functions and table state cannot accommodate the new key without violating the two-slot rule for all items. The solution is rehashing. This involves:

  1. Choosing a new pair of hash functions h₁ and h₂.
  2. Allocating new empty tables (sometimes of a larger size to reduce the chance of future cycles).
  3. Re-inserting all keys from the old tables, plus the new key causing the insertion failure, using the new hash functions.
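Those three steps might look like the following sketch, assuming hash functions drawn from a seeded family and a displacement-style insert helper. All names here are illustrative:

```python
import random

def make_hash(seed, size):
    """One member of a seeded hash family; a new seed gives a new function."""
    return lambda key: hash((seed, key)) % size

def displace_insert(key, t1, t2, h1, h2, max_loop=16):
    """Displacement-chain insert; False means the chain hit the loop limit."""
    for _ in range(max_loop):
        i = h1(key)
        key, t1[i] = t1[i], key
        if key is None:
            return True
        j = h2(key)
        key, t2[j] = t2[j], key
        if key is None:
            return True
    return False

def rehash(old_t1, old_t2, failed_key):
    keys = [k for k in old_t1 + old_t2 if k is not None] + [failed_key]
    size = 2 * len(old_t1)                      # grow to lower the load factor
    while True:
        h1 = make_hash(random.random(), size)   # 1. new pair of hash functions
        h2 = make_hash(random.random(), size)
        t1, t2 = [None] * size, [None] * size   # 2. fresh empty tables
        if all(displace_insert(k, t1, t2, h1, h2) for k in keys):
            return t1, t2, h1, h2               # 3. everything re-inserted
```

If any re-insertion fails under the new pair of functions, the loop simply discards the partial tables and tries another seed pair.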

Rehashing is an expensive operation, but it occurs with low probability when the tables are kept sufficiently sparse. The load factor (the number of items divided by the total number of slots) is crucial. For two-table cuckoo hashing, the maximum practical load factor is around 50%. Exceeding this significantly increases the probability of cycles and failed insertions, triggering frequent rehashing.
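The load-factor bookkeeping described above is simple to sketch. The 0.5 threshold follows the two-table bound; the resize trigger is an illustrative policy choice:

```python
def load_factor(t1, t2):
    """Fraction of all slots (across both tables) that are occupied."""
    occupied = sum(slot is not None for slot in t1 + t2)
    return occupied / (len(t1) + len(t2))

def should_grow(t1, t2, threshold=0.5):
    """Resize before crossing the ~50% practical limit for two tables."""
    return load_factor(t1, t2) >= threshold
```

Checking this before every insertion lets the table resize proactively instead of discovering the problem through a burst of failed displacement chains.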

Analyzing Expected Performance

The performance profile of cuckoo hashing is its defining feature. Let's break it down:

  • Lookup Time: This is the strongest guarantee. A lookup requires at most two probes, making it O(1) in the worst case. This predictability is invaluable for real-time systems.
  • Deletion Time: Deletion is equally simple and fast. To delete a key, you check its two possible locations and remove it from where it is found. This is also a constant-time operation.
  • Expected Insertion Time: Unlike lookup and delete, insertion is constant time only in an expected, amortized sense. A single insertion typically involves only a few displacements. The analysis, often done using random graph theory, shows that as long as the load factor is kept below a threshold (e.g., 50%), the probability of a long chain or cycle remains low. The expensive rehash operation happens infrequently enough that its cost, amortized over many insertions, remains constant.
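The constant-time deletion described above is just the two-probe lookup with a removal. The plain-list tables and externally supplied hash functions are the same illustrative assumptions as before:

```python
def cuckoo_delete(key, t1, t2, h1, h2):
    """Remove key if present; at most two probes either way."""
    i = h1(key)
    if t1[i] == key:
        t1[i] = None
        return True
    j = h2(key)
    if t2[j] == key:
        t2[j] = None
        return True
    return False  # key was in neither of its two possible slots
```

Because a key can only ever live in one of two slots, deletion needs no tombstones or probe-sequence repair, unlike open addressing.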

The key takeaway from the analysis is the trade-off: you exchange the possibility of a costly insertion/rehash for the guarantee of a blisteringly fast lookup. This is an excellent trade-off for read-heavy workloads.

Practical Considerations and Variations

Basic cuckoo hashing has limitations, but engineers have developed powerful variations.

  • More Hash Functions/Tables: Using 3 or 4 hash functions (and corresponding tables) increases the number of possible slots for each key from 2 to 3 or 4. This allows for much higher load factors (over 90% in some cases) while maintaining a constant number of lookup probes, though insertion logic becomes more complex.
  • Stash: A small auxiliary "stash" (e.g., a small list) can be added to hold keys that cause cycles. Instead of rehashing immediately, a few problematic keys can be placed in the stash. Lookups must then also check the stash, but its small, fixed size preserves the lookup guarantee. This dramatically reduces the need for rehashing.
  • Choice of Hash Functions: The hash functions h₁ and h₂ must be independent and distribute keys uniformly. In practice, the second function is often derived from the first, for example by re-hashing with a different seed. Using a seeded family of hash functions, where a new seed generates a new function, also simplifies rehashing.
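A stash-augmented lookup might look like the following sketch. The stash is a plain list with a small fixed capacity, and all names are illustrative:

```python
STASH_CAPACITY = 4   # small and fixed, so lookup stays O(1)

def lookup_with_stash(key, t1, t2, h1, h2, stash):
    """Three-step check: Table 1, Table 2, then the stash."""
    if t1[h1(key)] == key:
        return True
    if t2[h2(key)] == key:
        return True
    return key in stash  # bounded scan over at most STASH_CAPACITY items

def stash_or_rehash(key, stash):
    """When a displacement chain fails, park the key if there is room."""
    if len(stash) < STASH_CAPACITY:
        stash.append(key)
        return True
    return False         # stash full: fall back to a real rehash
```

The fixed capacity is what preserves the lookup guarantee: the stash scan is bounded by a constant, so the total probe count remains constant.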

Common Pitfalls

  1. Ignoring the Load Factor: Attempting to maintain a load factor above 50% in a two-table setup will lead to catastrophic performance. The system will become stuck in a near-continuous cycle of failed insertions and rehashing. Always monitor the load and trigger a table resize (increase the number of slots) well before the critical threshold is reached.
  2. Inadequate Cycle Detection: Implementing displacement without a loop limit (MaxLoop) will cause the program to hang on a cyclic insertion. Conversely, setting MaxLoop too low will trigger unnecessary rehashes. A good rule of thumb is to set it proportional to the logarithm of the table size.
  3. Poor Hash Function Design: If the two hash functions are not independent, they may map many keys to the same two slots, creating "hotspots" and drastically increasing collisions. This violates the probabilistic assumptions of the algorithm and leads to poor performance. Always use well-tested, randomized hash functions.
  4. Forgetting the Stash on Lookup: If you implement a stash to handle cycles, a common bug is to perform lookups only in the two main tables. This will cause items in the stash to be "lost." Every lookup operation must be a three-step check: Table 1, Table 2, then the stash.

Summary

  • Cuckoo hashing guarantees worst-case O(1) lookup time by restricting any key to one of exactly two possible locations, determined by two separate hash functions.
  • Insertion uses displacement chains, where a new key may evict existing keys, which are then recursively re-inserted into their alternate locations until an empty slot is found.
  • Cycles in displacement chains are detected by a loop limit and resolved by rehashing—selecting new hash functions and rebuilding the entire table, which is an expensive but infrequent operation.
  • Maintaining a low load factor (below 50% for two tables) is essential for keeping the probability of insertion failure and rehashing low, ensuring amortized O(1) insertion time.
  • Practical implementations often use enhancements like more hash functions or a small stash to improve the load factor and stability while preserving the constant-time lookup guarantee.
