DS: Concurrent Data Structures

In an era of multi-core processors and highly concurrent applications, data structures must do more than just store data—they must orchestrate safe, efficient access for multiple threads simultaneously. Understanding concurrent data structures, which are designed for this exact purpose, is fundamental to writing correct, scalable, and performant software. This guide will walk you through the core techniques, from basic locks to advanced lock-free algorithms, and the critical trade-offs you must evaluate when designing your own.

The Foundation: Lock-Based Synchronization

The most intuitive way to make a data structure thread-safe is to use locks. A lock is a synchronization primitive that ensures only one thread can execute a critical section of code at a time. The simplest approach is coarse-grained locking, where a single lock protects the entire data structure. For a stack or queue, every operation (push, pop, enqueue, dequeue) acquires the same lock.
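As a minimal sketch of coarse-grained locking, here is a stack in which a single mutex guards every operation (the `CoarseStack` name is illustrative, not from any library):

```cpp
#include <mutex>
#include <optional>
#include <vector>

// Coarse-grained thread-safe stack: one mutex serializes every operation.
template <typename T>
class CoarseStack {
public:
    void push(T value) {
        std::lock_guard<std::mutex> guard(mutex_);  // serialize all access
        items_.push_back(std::move(value));
    }

    // Returns std::nullopt when the stack is empty.
    std::optional<T> pop() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) return std::nullopt;
        T value = std::move(items_.back());
        items_.pop_back();
        return value;
    }

private:
    std::mutex mutex_;   // single lock: correct, but a contention bottleneck
    std::vector<T> items_;
};
```

The single `mutex_` is exactly the bottleneck described next: every thread, regardless of what it wants to do, queues up on the same lock.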

While coarse-grained locking guarantees correctness, it creates a major bottleneck. If Thread A is adding an item to the front of a linked list, Thread B is blocked from removing an item from the back, even though these operations might not logically interfere. This severely limits scalability, the ability to perform more work with more threads.

To improve scalability, fine-grained locking employs multiple locks to protect different parts of the data structure. In a linked-list-based queue, you might have one lock for the head pointer and a separate lock for the tail pointer. This allows one thread to enqueue (modifying tail) while another dequeues (modifying head), provided the queue isn't empty or near-empty. Implementing fine-grained locking is more complex and increases the risk of deadlock, a situation where two or more threads are permanently blocked, each waiting for a lock the other holds. You must establish and adhere to a strict global lock acquisition order to prevent this.
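A sketch of the head-lock/tail-lock queue just described (a dummy node keeps the two locks from touching the same node in the common case; in a strict C++ memory-model sense the `next` field of the boundary node would also need to be atomic, which this simplified version glosses over):

```cpp
#include <memory>
#include <mutex>
#include <optional>

// Two-lock queue: enqueue takes only tail_lock_, dequeue only head_lock_,
// so the two operations can proceed in parallel. Names are illustrative.
template <typename T>
class TwoLockQueue {
    struct Node {
        T value{};
        Node* next = nullptr;
    };

public:
    TwoLockQueue() : head_(new Node), tail_(head_) {}  // start with a dummy
    ~TwoLockQueue() {
        while (head_) { Node* n = head_; head_ = head_->next; delete n; }
    }

    void enqueue(T value) {
        Node* node = new Node{std::move(value)};
        std::lock_guard<std::mutex> guard(tail_lock_);  // only touches tail
        tail_->next = node;
        tail_ = node;
    }

    std::optional<T> dequeue() {
        std::lock_guard<std::mutex> guard(head_lock_);  // only touches head
        Node* old_dummy = head_;
        Node* first = old_dummy->next;
        if (first == nullptr) return std::nullopt;      // queue is empty
        T value = std::move(first->value);
        head_ = first;                                  // first is the new dummy
        delete old_dummy;
        return value;
    }

private:
    std::mutex head_lock_, tail_lock_;  // disjoint locks -> more parallelism
    Node* head_;  // always points at the dummy node
    Node* tail_;
};
```

Because each operation takes exactly one lock and never waits for the other, a fixed acquisition order is trivially satisfied and deadlock cannot occur between these two methods.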

Moving Beyond Locks: Atomic Operations and Lock-Free Concepts

Locks have inherent drawbacks: they cause blocking (threads waiting idly), priority inversion, and convoying. Lock-free algorithms aim to solve these by guaranteeing system-wide progress: if some threads are executing operations, at least one will complete in a finite number of steps, even if others are delayed. They achieve this using atomic read-modify-write (RMW) operations provided directly by the hardware.

The most crucial RMW operation is compare-and-swap (CAS). Conceptually, CAS(addr, expected, new) performs this logic atomically:

if value at memory address addr equals expected:
    set value at addr to new
    return true
else:
    return false

A thread can read a shared pointer, calculate a new value, and use CAS to update it only if no other thread has changed it in the interim. If the CAS fails, the thread simply retries its operation. This "optimistic" retry loop is the heart of many lock-free algorithms.
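The optimistic retry loop can be written directly against C++'s `std::atomic::compare_exchange_weak` (the `fetch_and_double` function is an illustrative example, not a standard API):

```cpp
#include <atomic>

// Optimistic CAS retry loop: read, compute, attempt to publish; retry if
// another thread changed the value in the interim.
int fetch_and_double(std::atomic<int>& value) {
    int observed = value.load();
    // On failure, compare_exchange_weak refreshes `observed` with the
    // current value, so each retry recomputes from a fresh snapshot.
    while (!value.compare_exchange_weak(observed, observed * 2)) {
        // another thread won the race (or the CAS failed spuriously); retry
    }
    return observed;  // the value we successfully doubled from
}
```

Note the idiomatic detail: `compare_exchange_weak` writes the actual current value back into `observed` when it fails, so the loop needs no explicit reload.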

However, lock-free programming introduces subtle hazards. The ABA problem is a classic issue. Imagine a thread reads a pointer from shared memory and observes the value A. It prepares a new node to swing the pointer to, but before it can execute its CAS, another thread performs two operations: it changes the pointer from A to B, then later changes it back from B to A. To the first thread, the pointer still appears to be A, so its CAS succeeds, even though the underlying state has changed (e.g., the node at A might have been recycled and now holds different data). This can corrupt the data structure. Solutions often involve atomic "tagged pointers" that pair the pointer with an incrementing version number.
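One way to sketch the tagged/versioned defense is to pack a 32-bit value and a 32-bit version counter into a single 64-bit atomic word, so a CAS succeeds only if neither has changed (the `Versioned` helper and `versioned_set` are illustrative names, under the assumption that a 64-bit CAS is available):

```cpp
#include <atomic>
#include <cstdint>

// Pack a 32-bit value with a 32-bit version counter into one 64-bit word.
struct Versioned {
    static uint64_t pack(uint32_t value, uint32_t version) {
        return (uint64_t(version) << 32) | value;
    }
    static uint32_t value(uint64_t word)   { return uint32_t(word); }
    static uint32_t version(uint64_t word) { return uint32_t(word >> 32); }
};

// Bumps the version on every successful update, so an A -> B -> A sequence
// leaves a different word behind and a stale CAS fails instead of silently
// succeeding.
bool versioned_set(std::atomic<uint64_t>& cell, uint64_t expected_word,
                   uint32_t new_value) {
    uint64_t desired = Versioned::pack(
        new_value, Versioned::version(expected_word) + 1);
    return cell.compare_exchange_strong(expected_word, desired);
}
```

Real implementations apply the same idea to pointers, either via double-width CAS or by stealing unused pointer bits for the tag.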

Implementing a Lock-Free Queue: The Michael-Scott Algorithm

A canonical example of a production-quality lock-free data structure is the Michael-Scott lock-free queue. It's a linked-list-based queue where the enqueue and dequeue operations can proceed concurrently without locks, using CAS.

The queue has a head pointer (pointing to a dummy node) and a tail pointer. The core enqueue logic for a new node new_node is:

  1. Read the current tail and its next pointer.
  2. If tail->next is not null, it means another thread is mid-update, so help it by advancing the tail pointer (this is the "helping" characteristic that ensures lock-freedom).
  3. Otherwise, attempt to CAS the tail->next from null to new_node.
  4. Whether your CAS succeeded or you helped, finally attempt to CAS the tail pointer itself to point to the last node.

The dequeue operation follows a similar pattern, carefully checking if the queue is empty, has one element, or has many elements, and using CAS to update the head pointer. The key to its correctness is that enqueue only modifies tail and the next pointer of the last node, while dequeue only modifies head. Their interference is managed through the careful sequence of reads and CAS operations. Implementing this algorithm correctly requires deep attention to memory ordering and safe memory reclamation (like hazard pointers) to avoid the ABA problem.
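The enqueue and dequeue logic described above can be sketched as follows. This is a deliberately simplified version: it never frees nodes (real implementations need hazard pointers or another safe-reclamation scheme, as noted above), uses default sequentially consistent atomics rather than tuned memory orderings, and uses illustrative names throughout:

```cpp
#include <atomic>

// Simplified Michael-Scott lock-free queue (int payload, no reclamation).
struct MSQueue {
    struct Node {
        int value = 0;
        std::atomic<Node*> next{nullptr};
    };

    std::atomic<Node*> head;  // points at a dummy node
    std::atomic<Node*> tail;

    MSQueue() {
        Node* dummy = new Node;
        head.store(dummy);
        tail.store(dummy);
    }

    void enqueue(int value) {
        Node* node = new Node;
        node->value = value;
        for (;;) {
            Node* last = tail.load();
            Node* next = last->next.load();
            if (last != tail.load()) continue;          // snapshot went stale
            if (next != nullptr) {
                // Another enqueue is mid-flight: help by swinging tail.
                tail.compare_exchange_weak(last, next);
                continue;
            }
            Node* expected = nullptr;
            if (last->next.compare_exchange_weak(expected, node)) {
                // Linked in; try to swing tail (failure is fine -- another
                // thread may already have helped).
                tail.compare_exchange_weak(last, node);
                return;
            }
        }
    }

    // Returns false when the queue is empty. Leaks the old dummy node.
    bool dequeue(int& out) {
        for (;;) {
            Node* first = head.load();
            Node* last = tail.load();
            Node* next = first->next.load();
            if (first != head.load()) continue;         // snapshot went stale
            if (first == last) {
                if (next == nullptr) return false;      // queue is empty
                tail.compare_exchange_weak(last, next); // help lagging tail
                continue;
            }
            out = next->value;                          // read before the CAS
            if (head.compare_exchange_weak(first, next)) return true;
        }
    }
};
```

Note how both operations re-validate their snapshot (`last != tail.load()`) and help a lagging `tail` forward rather than waiting: that helping step is what makes the algorithm lock-free rather than merely lockless.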

Performance Tradeoffs: Choosing the Right Approach

Selecting a concurrency strategy is a balancing act with no single best answer. You must evaluate the tradeoffs based on your application's contention profile, latency requirements, and complexity budget.

  • Coarse-Grained Locking offers simplicity and guaranteed correctness. Its performance is poor under high contention but can be acceptable for low-concurrency scenarios or as a correct first implementation. The primary cost is the lack of scalability.
  • Fine-Grained Locking increases scalability for certain access patterns by allowing more parallelism. However, it adds significant implementation complexity, increases memory overhead for multiple locks, and remains vulnerable to deadlock if not designed meticulously. Performance can degrade under high contention due to lock acquisition overhead.
  • Lock-Free Algorithms provide the best scalability under high contention, as delaying one thread does not block others. They are immune to deadlock, priority inversion, and convoying. The trade-offs are steep: implementation complexity is highest, they are notoriously difficult to prove correct, and they often incur higher overhead in low-contention scenarios due to the cost of CAS operations and retry loops. In short, you gain predictable progress (no thread can block the rest) at the cost of potentially higher average-case overhead and immense development cost.

Common Pitfalls

  1. Misjudging Lock Granularity: Using a single lock for a large, complex structure (coarse-grained) when fine-grained locking could yield significant parallelism. Conversely, implementing overly fine-grained locking with dozens of locks adds massive complexity for negligible performance gain. Profile your application to find the real contention bottlenecks.
  2. Ignoring the ABA Problem: When designing with CAS, assuming a pointer's value hasn't changed because it's the same address is a fatal error. Always incorporate a defense, such as an associated version counter (tagged pointer) or using a safe memory reclamation scheme that guarantees a node is not reused while any thread might hold a reference to it.
  3. Incorrect Memory Ordering: On modern relaxed-memory architectures, reads and writes can be reordered. Using a plain load to read a shared pointer and then dereferencing it can lead to accessing stale or incorrect data. You must use atomic operations with the appropriate memory ordering constraints (e.g., acquire semantics for loads, release semantics for stores) to establish correct "happens-before" relationships between threads.
  4. Over-Optimizing Prematurely: Starting with a lock-free design is rarely the right choice. A well-implemented fine-grained lock can often deliver the required performance. Begin with the simplest correct solution (often coarse-grained), measure its performance under realistic loads, and only introduce the complexity of lock-free algorithms if profiling confirms a specific scalability bottleneck that locks cannot address.
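The acquire/release pairing from pitfall 3 can be sketched as a classic publish/consume pattern (names like `Published` and `publish` are illustrative):

```cpp
#include <atomic>

// The release store publishes the preceding plain write; the acquire load
// establishes a happens-before edge so the reader sees that write.
struct Published {
    int payload = 0;
    std::atomic<bool> ready{false};
};

void publish(Published& p, int value) {
    p.payload = value;                               // plain write...
    p.ready.store(true, std::memory_order_release);  // ...made visible here
}

// Returns true and fills `out` only once the payload is safely visible.
bool try_consume(const Published& p, int& out) {
    if (!p.ready.load(std::memory_order_acquire)) return false;
    out = p.payload;  // happens-after the release store: sees the write
    return true;
}
```

Had `ready` been a plain `bool`, the compiler or CPU could reorder the flag write ahead of the payload write, and a reader could observe `ready == true` with a stale `payload`.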

Summary

  • Concurrent data structures enable safe, simultaneous access by multiple threads, which is essential for leveraging multi-core processors.
  • Lock-based approaches range from simple coarse-grained locks to more scalable fine-grained locks, with the latter introducing complexity and deadlock risk.
  • Lock-free algorithms use atomic operations like compare-and-swap (CAS) to avoid blocking, guaranteeing system-wide progress but introducing challenges like the ABA problem.
  • The Michael-Scott lock-free queue is a seminal design that demonstrates how careful use of CAS can enable non-blocking enqueue and dequeue operations.
  • Choosing a concurrency strategy involves evaluating performance tradeoffs between simplicity, scalability, and implementation complexity; lock-free is not always faster and is almost always more difficult to implement correctly.
