CA: Memory Interleaving and Banking
Memory interleaving and banking are foundational techniques in computer architecture that directly address the processor-memory performance gap. By organizing memory into multiple, independent banks and distributing addresses strategically, these methods enable overlapped data access, significantly boosting effective bandwidth. This is especially crucial for data-intensive applications like scientific computing, graphics rendering, and AI, where contiguous data streams are common.
Fundamentals of Memory Interleaving
At its core, memory interleaving is a design strategy that spreads consecutive memory addresses across multiple physical memory banks. Each bank can operate independently, with its own address decoder and data pathways. The primary goal is to increase aggregate bandwidth—the amount of data transferred per unit time—by allowing the memory system to service multiple access requests in parallel. Imagine a multi-lane highway where cars (data requests) can travel simultaneously in separate lanes (banks) instead of queuing in a single lane; interleaving organizes the memory "addresses" onto these lanes to minimize traffic jams.
In a non-interleaved (single-bank) memory, a second access cannot begin until the first one completes, leading to idle cycles. Interleaving exploits the fact that many programs access memory in sequential patterns. By mapping sequential addresses to different banks, the system can initiate an access to one bank while others are still busy with previous requests, thereby overlapping operations. The number of banks is typically a power of two (e.g., 2, 4, 8), and the specific mapping scheme determines which bits of the memory address select the bank.
Low-Order vs. High-Order Interleaving
The two primary schemes for distributing addresses are low-order interleaving and high-order interleaving. Your choice between them has profound implications for access patterns and system performance. In low-order interleaving, the least significant bits (LSBs) of the memory address are used to select the bank. For example, with 4 banks, address bits A1 and A0 (the two LSBs) might determine the bank, while the remaining higher-order bits specify the location within each bank. This means consecutive addresses (e.g., 0, 1, 2, 3) map to different banks (0, 1, 2, 3), which is ideal for streaming sequential data.
Conversely, high-order interleaving uses the most significant bits (MSBs) for bank selection. With the same 4-bank system, the two MSBs might choose the bank. This results in large, contiguous blocks of addresses residing in the same bank—addresses 0-1023 might be in bank 0, 1024-2047 in bank 1, and so on. High-order interleaving is beneficial when programs access large, contiguous data structures (like arrays) that are processed in parallel, as different processors or threads can work on separate blocks in different banks simultaneously without interference.
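The two mapping schemes can be sketched as a few lines of Python. This is an illustrative sketch, not from the text: it assumes 4 banks and a 12-bit address space (4096 words), so that with high-order interleaving each bank holds a contiguous 1024-word block, matching the example above.

```python
# Sketch: bank selection for low-order vs. high-order interleaving.
# Assumed parameters (not from the text): 4 banks, 12-bit addresses.

NUM_BANKS = 4   # power of two, so bit selection is simple
BANK_BITS = 2   # log2(NUM_BANKS)
ADDR_BITS = 12  # 4096 words total, 1024 words per bank

def low_order_bank(addr):
    """LSBs pick the bank: consecutive addresses rotate across banks."""
    return addr & (NUM_BANKS - 1)

def high_order_bank(addr):
    """MSBs pick the bank: each bank holds one large contiguous block."""
    return addr >> (ADDR_BITS - BANK_BITS)

print([low_order_bank(a) for a in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]
print([high_order_bank(a) for a in range(8)])  # [0, 0, 0, 0, 0, 0, 0, 0]
print(high_order_bank(1024))                   # 1 (addresses 1024-2047 -> bank 1)
```

Note how the same address stream behaves very differently under the two schemes: low-order interleaving spreads a sequential stream across all four banks, while high-order interleaving keeps it pinned to one.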
Bandwidth Calculation and Bank Conflicts
The theoretical peak bandwidth of an interleaved memory system is the number of banks multiplied by the bandwidth of a single bank. However, bank conflicts—situations where two or more simultaneous access requests target the same bank—degrade this ideal performance. You must calculate effective bandwidth by considering the access pattern and probability of conflicts.
Assume a system with m banks, each with a cycle time of T nanoseconds. If accesses are perfectly sequential and interleaved, one access can start every T/m nanoseconds, yielding a peak bandwidth of m/T accesses per nanosecond. However, with random accesses, the probability of a conflict rises. For example, if requests are independent and uniformly distributed, the chance that a new request hits any one busy bank is 1/m (and b/m when b banks are busy). The average access time can be modeled, and for a stream of N requests containing C conflicts, each stalling a full cycle T, the effective bandwidth becomes: N / (N·T/m + C·T).
Consider a step-by-step scenario: a 4-bank system (m = 4) has a bank cycle time of 10 ns. Under perfect sequential access, you can initiate a request every 2.5 ns (10/4). For a stream of 100 requests, the ideal total time is 100 × 2.5 = 250 ns. But if 20% of requests cause a conflict, each adding a 10 ns delay, the stalls contribute 20 × 10 = 200 ns, bringing the total to 450 ns and cutting the effective bandwidth to roughly 0.22 accesses/ns, barely over half the 0.4 accesses/ns peak.
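The scenario above can be checked with a short calculation. This sketch just encodes the numbers given in the text (m = 4 banks, T = 10 ns, 100 requests, 20% conflicts, each stalling one full cycle):

```python
# Worked example from the text: m = 4 banks, bank cycle time T = 10 ns,
# N = 100 requests, 20% of which conflict and each add a full 10 ns stall.

m = 4            # number of banks
T = 10.0         # bank cycle time in ns
N = 100          # requests in the stream

ideal_time = N * T / m               # one access initiated every T/m = 2.5 ns
conflicts = int(0.20 * N)            # 20 conflicting requests
stall_time = conflicts * T           # each conflict stalls one full cycle
total_time = ideal_time + stall_time

peak_bw = m / T                      # accesses per ns, ideal
eff_bw = N / total_time              # accesses per ns, with stalls

print(ideal_time)        # 250.0 ns
print(total_time)        # 450.0 ns
print(eff_bw / peak_bw)  # ~0.56: conflicts cost almost half the peak bandwidth
```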
Designing Interleaved Memory Systems for Vector Processors
Vector processors excel at performing the same operation on large arrays of data, making memory bandwidth a critical bottleneck. Designing an interleaved memory for such processors requires careful alignment between the interleaving scheme and the vector access stride. The stride is the distance between consecutive elements accessed by the vector instruction. For optimal performance, the stride and the number of banks should be relatively prime—meaning they share no common factors other than 1.
This relative primality ensures that successive vector elements map to different banks, preventing conflicts and enabling sustained high bandwidth. For instance, if a vector processor accesses data with a stride of 2 (e.g., every other element) and you have 4 banks, the accesses will cycle through only 2 banks, causing conflicts and halving potential bandwidth. A better design might use a prime number of banks or implement skewed interleaving schemes to mitigate this. You must also consider the memory address mapping logic to support various strides efficiently, often integrated into the processor's load/store unit.
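The relative-primality rule has a compact closed form: a stride-s stream on m banks visits the banks {(s·i) mod m}, a set of size m / gcd(m, s). A minimal sketch:

```python
from math import gcd

# Sketch: how many of m banks a stride-s access stream actually touches.
# The banks visited are {(s * i) mod m}, a set of size m / gcd(m, s);
# all m banks are used only when s and m are relatively prime.

def banks_touched(m, s):
    return m // gcd(m, s)

for m, s in [(4, 1), (4, 2), (4, 3), (8, 4), (7, 2)]:
    print(f"{m} banks, stride {s}: {banks_touched(m, s)} banks used")
# 4 banks, stride 2 -> 2 banks (halved bandwidth, as in the text)
# 8 banks, stride 4 -> 2 banks
# 7 banks (prime), stride 2 -> all 7 banks
```

The last case shows why a prime bank count is attractive: any stride smaller than m is automatically relatively prime to it, so no power-of-two stride can collapse the stream onto a subset of banks.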
Reducing Average Access Latency with Banking
Beyond increasing bandwidth, memory banking effectively reduces average access latency for a sequence of requests. Latency is the time from issuing a request to receiving the data. In a single-bank system, average latency is simply the bank cycle time. With multiple banks, while the latency for an individual access remains unchanged, the system can pipeline requests. When one request is being serviced by a bank, other banks can be starting new accesses, so the average time between completions drops.
This is analogous to a supermarket with multiple checkout counters. Even if each counter takes the same time to process a customer, having more counters reduces the average wait time for customers in the queue. Technically, for m banks with cycle time T, the average time between completions for a long stream of sequential accesses approaches T/m, as requests are overlapped. However, this reduction is contingent on having enough independent accesses to keep the banks busy; under low load, latency benefits may not be fully realized.
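A deliberately simplified model makes the distinction concrete: each individual access still takes a full cycle T, but with perfect overlap across m banks, completions arrive every T/m. This sketch assumes an idealized sequential stream with no conflicts:

```python
# Minimal sketch: completion times for a sequential stream on m
# interleaved banks, assuming perfect overlap and no conflicts.
# Individual latency stays T; the spacing between completions is T/m.

def completion_times(n, m, T):
    """Finish time of each of n sequential accesses, one bank apart."""
    # Access i targets bank (i mod m) and can issue i * (T / m) after
    # the first access; it completes a full cycle T later.
    return [i * (T / m) + T for i in range(n)]

times = completion_times(8, 4, 10.0)
gaps = [b - a for a, b in zip(times, times[1:])]
print(times[0])  # 10.0 -- the first access still takes a full cycle
print(gaps)      # [2.5, 2.5, ...] -- completions then arrive every T/m
```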
Common Pitfalls
- Confusing Interleaving Schemes: A frequent error is misapplying low-order and high-order interleaving. Using low-order interleaving for programs that access large contiguous blocks can lead to poor bank utilization, as sequential accesses within a block might span banks unnecessarily. Conversely, using high-order interleaving for fine-grained sequential streams can cause hot spots in one bank. Always analyze the dominant access patterns of the target workload before choosing a scheme.
- Ignoring Stride-Based Bank Conflicts: When designing for vector or array processing, neglecting the relationship between access stride and number of banks is a critical mistake. As mentioned, a stride that shares a factor with the bank count leads to conflicts. For example, with 8 banks and a stride of 4, accesses will only utilize 2 banks. To correct this, consider using a prime number of banks or implementing dynamic bank reconfiguration to avoid these periodic conflicts.
- Overestimating Bandwidth Without Considering Contention: Simply multiplying the number of banks by single-bank bandwidth gives peak theoretical performance. In practice, bus arbitration, queueing delays, and request scheduling can introduce contention. Effective design must include analysis of realistic traffic models and may require advanced controllers that prioritize requests to minimize conflict-induced stalls.
- Incorrect Bank Conflict Calculation in Models: When calculating effective bandwidth, a common oversight is assuming conflicts are independent events. In reality, access patterns often have dependencies, leading to burst conflicts. Use simulation or more sophisticated probabilistic models that account for sequential dependencies or specific programming patterns to get accurate performance estimates.
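The last pitfall — assuming conflicts are independent events — can be illustrated with a small Monte Carlo comparison. This is a toy model of my own construction, not from the text: "conflict" here simply means two back-to-back requests hitting the same bank, and the "bursty" stream revisits a bank in short runs to mimic sequential dependencies.

```python
import random

# Sketch: why the independent-conflict assumption can mislead.
# Compare a uniform random address stream against a "bursty" stream
# that revisits the same bank in runs of 1-4, on m = 4 banks.
# Conflict = two consecutive requests to the same bank (toy model).

random.seed(1)
m = 4
N = 10_000

def conflict_rate(banks):
    """Fraction of adjacent request pairs that hit the same bank."""
    return sum(a == b for a, b in zip(banks, banks[1:])) / (len(banks) - 1)

uniform = [random.randrange(m) for _ in range(N)]

bursty = []
while len(bursty) < N:
    bank = random.randrange(m)
    bursty.extend([bank] * random.randint(1, 4))  # runs of 1-4 repeats
bursty = bursty[:N]

print(round(conflict_rate(uniform), 2))  # ~0.25, matching 1/m for independent requests
print(round(conflict_rate(bursty), 2))   # far higher: bursts cluster conflicts together
```

The uniform stream matches the simple 1/m prediction, but the bursty stream conflicts several times as often — exactly the gap that a model of independent conflicts fails to capture.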
Summary
- Memory interleaving distributes consecutive addresses across multiple independent banks to enable overlapped access, directly increasing system bandwidth and reducing average latency for sequential operations.
- Low-order interleaving uses least significant address bits for bank selection, ideal for fine-grained sequential streams, while high-order interleaving uses most significant bits, better suited for large contiguous blocks accessed in parallel.
- Effective bandwidth must account for bank conflicts; calculations should model the probability of conflicts based on access patterns and strides, moving beyond simple peak formulas.
- Designing for vector processors requires aligning the interleaving scheme with vector access strides, often aiming for stride and bank count to be relatively prime to minimize conflicts and maximize sustained bandwidth.
- Banking reduces average access latency by pipelining requests across multiple banks, though the benefit is fully realized only with sufficient memory traffic to keep banks utilized.
- Avoid common design pitfalls by carefully matching the interleaving scheme to workload patterns, analyzing stride conflicts, and modeling contention realistically rather than relying on theoretical peaks.