Computer Architecture

Computer architecture is the discipline that explains how computers are organized and why they perform the way they do. It connects software-visible concepts like instruction sets and memory addressing with hardware realities like pipelines, caches, and multicore execution. Understanding architecture is less about memorizing components and more about reasoning: what happens when a program runs, where time is spent, and which design tradeoffs lead to predictable performance.

At a practical level, architecture shapes everything from how fast a loop executes to how well an application scales across cores. It also determines the boundaries that compilers, operating systems, and developers must respect.

CPU organization: the core building blocks

A modern CPU is typically described in terms of a few major subsystems:

  • Control and execution: fetches instructions, decodes them, and executes operations using functional units (integer ALUs, floating-point units, vector units).
  • Registers: small, fast storage locations used for operands and intermediate results. Register availability influences how compilers schedule and optimize code.
  • Instruction pipeline: a staged assembly line that overlaps instruction work to increase throughput.
  • Memory interface: connects the core to caches and, eventually, main memory.

Even in this simplified view, CPU behavior is shaped by two related performance measures:

  • Latency: time to complete one operation.
  • Throughput: number of operations completed per unit time.

Pipelining mainly improves throughput, while microarchitectural tricks (out-of-order execution, speculative execution, branch prediction) attempt to reduce stalls that would otherwise waste pipeline capacity.
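
To make the distinction concrete, here is a minimal back-of-the-envelope sketch: on an idealized k-stage pipeline with one cycle per stage and no stalls, N instructions take about k + (N - 1) cycles instead of k * N, yet each individual instruction still spends k cycles in flight. The depth and instruction count below are assumptions chosen only for illustration.

    #include <stdio.h>

    /* Idealized pipeline timing: k-stage pipeline, one cycle per stage,
     * no stalls. Unpipelined, every instruction occupies the machine for
     * k cycles; pipelined, the first instruction takes k cycles and each
     * later one completes one cycle after the previous (throughput of
     * one instruction per cycle), while per-instruction latency stays k. */
    int main(void) {
        const long k = 5;          /* pipeline depth (assumed) */
        const long n = 1000000;    /* instructions executed (assumed) */

        long unpipelined = k * n;        /* latency paid for every instruction */
        long pipelined   = k + (n - 1);  /* fill the pipeline once, then 1 per cycle */

        printf("unpipelined: %ld cycles\n", unpipelined);
        printf("pipelined:   %ld cycles (per-instruction latency still %ld)\n",
               pipelined, k);
        return 0;
    }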

Instruction sets and the RISC vs CISC distinction

The instruction set architecture (ISA) is the contract between hardware and software. It defines instructions, registers, addressing modes, privilege levels, and the memory model. Two classic ISA design philosophies are:

RISC (Reduced Instruction Set Computer)

RISC designs emphasize a smaller set of simple instructions, typically with:

  • Load/store architecture (arithmetic on registers; memory accessed via explicit loads and stores)
  • Fixed or regular instruction formats
  • Many general-purpose registers

This regularity can make decoding and pipelining easier, and it often pairs well with aggressive compiler optimization.
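
As a small sketch of the load/store style, consider how a one-line C statement is typically broken into separate load, compute, and store steps. The commented instruction sequence is hand-written in RISC-V-like notation purely for illustration, not actual compiler output.

    /* On a load/store architecture, arithmetic operates only on registers;
     * memory is touched only by explicit load and store instructions. */
    void add_through_memory(const long *b, const long *c, long *a) {
        /* Conceptually (RISC-V-style, illustrative only):
         *   ld  t0, 0(b)     load *b into a register
         *   ld  t1, 0(c)     load *c into a register
         *   add t2, t0, t1   register-to-register add
         *   sd  t2, 0(a)     store the result back to memory
         */
        *a = *b + *c;
    }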

CISC (Complex Instruction Set Computer)

CISC designs include a richer set of instructions and addressing modes, often allowing single instructions to perform multi-step operations. Historically, this reduced code size and simplified assembly programming.

In modern processors, the practical line between RISC and CISC is blurrier than the labels suggest. Many CISC processors translate complex instructions into simpler internal operations, while many RISC processors implement sophisticated microarchitectural features. The key takeaway is that the ISA influences compiler strategy and software compatibility, while performance depends heavily on the underlying microarchitecture.

Pipelining: keeping the CPU busy

A basic instruction pipeline might include:

  1. Fetch: read the next instruction from memory (often via the instruction cache).
  2. Decode: interpret the instruction and read registers.
  3. Execute: perform the operation or compute an address.
  4. Memory: access data cache if needed.
  5. Write-back: store results in registers.

Pipelines face hazards that reduce efficiency:

  • Data hazards: an instruction needs a result that is not ready yet (see the sketch after this list).
  • Control hazards: a branch changes which instruction should be fetched next.
  • Structural hazards: two instructions need the same hardware resource at the same time.
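
In ordinary source code, a data hazard shows up as a dependency chain: each operation waits on the previous result. The sketch below contrasts a single accumulator with two independent accumulators; the actual benefit depends on the compiler and the core's add latency, so treat it as illustrative rather than a guaranteed win.

    #include <stddef.h>

    /* Single accumulator: every add depends on the previous add's result,
     * so the loop is bounded by add latency (a data hazard the hardware
     * cannot hide, only wait out or forward around). */
    double sum_chained(const double *x, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* Two independent accumulators: the two chains can overlap in the
     * pipeline, often improving throughput (assuming the compiler keeps
     * them separate; exact gains vary by microarchitecture). */
    double sum_two_chains(const double *x, size_t n) {
        double s0 = 0.0, s1 = 0.0;
        size_t i = 0;
        for (; i + 1 < n; i += 2) {
            s0 += x[i];
            s1 += x[i + 1];
        }
        if (i < n) s0 += x[i];
        return s0 + s1;
    }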

Architectural solutions include forwarding paths, pipeline stalling, and branch prediction. Branch prediction is especially important because the cost of guessing wrong grows with pipeline depth. In many workloads, control flow behavior can dominate performance, so compilers and programmers often try to reduce unpredictable branching in hot paths.
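
As a hedged illustration of taming control hazards in a hot path, the two functions below compute the same result. The second expresses the selection as arithmetic, which compilers can often lower to a conditional move instead of a hard-to-predict branch; whether that actually happens depends on the compiler and target.

    #include <stddef.h>

    /* Branchy version: if the comparison outcome is data-dependent and
     * unpredictable, the branch predictor will frequently guess wrong. */
    long sum_above_branchy(const long *x, size_t n, long threshold) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (x[i] > threshold)
                sum += x[i];
        }
        return sum;
    }

    /* Branch-free formulation: the selection is written as a value choice,
     * which compilers can often compile to a conditional move rather than
     * a conditional jump. */
    long sum_above_branchless(const long *x, size_t n, long threshold) {
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += (x[i] > threshold) ? x[i] : 0;
        return sum;
    }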

Memory hierarchy: why caches matter

CPU cores run far faster than main memory. To bridge this gap, systems rely on a memory hierarchy:

  • Registers: fastest, smallest
  • L1 cache: very fast, very small, per core
  • L2/L3 cache: larger, slower, sometimes shared
  • Main memory (DRAM): much larger, much slower
  • Storage: orders of magnitude slower than DRAM, but persistent

Caches exploit locality:

  • Temporal locality: recently used data is likely to be used again soon.
  • Spatial locality: nearby data is likely to be used soon.

Cache design essentials

Key cache design choices include:

  • Cache lines: data moves between memory levels in fixed-size blocks.
  • Associativity: determines how many possible locations a line can occupy, influencing conflict misses (see the sketch after this list).
  • Replacement policy: approximations of “least recently used” are common.
  • Write policy: write-through vs write-back affects bandwidth and complexity.
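
To make the cache-line and associativity terms concrete, the sketch below shows how a simple set-associative cache maps an address to a set. The geometry (32 KiB, 64-byte lines, 8-way) is an assumption chosen for illustration, not a description of any particular CPU.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative cache geometry (assumed): 32 KiB capacity,
     * 64-byte lines, 8-way set associative => 64 sets. */
    #define CACHE_BYTES  (32 * 1024)
    #define LINE_BYTES   64
    #define WAYS         8
    #define NUM_SETS     (CACHE_BYTES / (LINE_BYTES * WAYS))

    int main(void) {
        uint64_t addr   = 0x7ffd1234abcdULL;      /* example address */
        uint64_t offset = addr % LINE_BYTES;      /* byte within the line */
        uint64_t line   = addr / LINE_BYTES;      /* line-aligned address */
        uint64_t set    = line % NUM_SETS;        /* which set it maps to */

        /* A line can live in any of the WAYS slots of its set; addresses
         * that map to the same set but differ in their upper bits compete
         * for those slots, which is where conflict misses come from. */
        printf("offset=%llu set=%llu (one of %d ways)\n",
               (unsigned long long)offset, (unsigned long long)set, WAYS);
        return 0;
    }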

From a software perspective, the biggest wins often come from improving locality. For example, iterating through arrays in contiguous order tends to be cache-friendly, while pointer-chasing through scattered nodes often incurs frequent misses and stalls.
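
A common way to see this effect is a 2D array traversal: the work is identical in both functions below, but one walks memory contiguously while the other strides across it, wasting most of each fetched cache line when the array is large. The dimensions are arbitrary illustrative values.

    #include <stddef.h>

    #define ROWS 1024
    #define COLS 1024

    /* Row-major traversal: consecutive iterations touch adjacent memory,
     * so each cache line fetched is fully used (good spatial locality). */
    long sum_row_major(long m[ROWS][COLS]) {
        long sum = 0;
        for (size_t i = 0; i < ROWS; i++)
            for (size_t j = 0; j < COLS; j++)
                sum += m[i][j];
        return sum;
    }

    /* Column-major traversal of the same array: consecutive iterations
     * jump COLS * sizeof(long) bytes, so on large arrays each fetched
     * cache line supplies a single element before being evicted. */
    long sum_col_major(long m[ROWS][COLS]) {
        long sum = 0;
        for (size_t j = 0; j < COLS; j++)
            for (size_t i = 0; i < ROWS; i++)
                sum += m[i][j];
        return sum;
    }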

Virtual memory: abstraction and protection

Virtual memory provides each process with its own address space, which the operating system maps to physical memory. This delivers three major benefits:

  • Isolation and protection: processes cannot freely read or write each other’s memory.
  • Simplified programming model: programs can assume a large, contiguous address space.
  • Efficient sharing: shared libraries and memory-mapped files can be mapped into multiple processes.

Address translation is performed using page tables, with hardware acceleration through a Translation Lookaside Buffer (TLB), a cache of recent translations. When a translation is missing from the TLB, the processor may need to walk page tables, adding latency. If a page is not resident in physical memory, a page fault occurs and the OS must fetch data from storage, which is vastly slower than DRAM.
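
As a small sketch of the translation step, assuming 4 KiB pages, hardware splits a virtual address into a virtual page number (looked up in the TLB or page tables) and a page offset that passes through unchanged. The frame number below is a made-up stand-in for whatever the page tables would return.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                     /* assumes 4 KiB pages */
    #define PAGE_SIZE  (1ULL << PAGE_SHIFT)

    int main(void) {
        uint64_t vaddr  = 0x00007f3a1c2d5e7fULL;   /* example virtual address */
        uint64_t vpn    = vaddr >> PAGE_SHIFT;      /* looked up via TLB / page tables */
        uint64_t offset = vaddr & (PAGE_SIZE - 1);  /* unchanged by translation */

        /* The physical frame number comes from the page tables; here we
         * just pretend translation returned some frame number. */
        uint64_t pfn   = 0x1a2b3cULL;               /* hypothetical result */
        uint64_t paddr = (pfn << PAGE_SHIFT) | offset;

        printf("vpn=0x%llx offset=0x%llx -> paddr=0x%llx\n",
               (unsigned long long)vpn, (unsigned long long)offset,
               (unsigned long long)paddr);
        return 0;
    }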

Practically, performance-sensitive applications pay attention to working set size and access patterns. Excessive page faults or high TLB miss rates can dwarf CPU-level optimizations.
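
One rough way to reason about working-set pressure is to count how many distinct pages (and therefore TLB entries and potential faults) a buffer spans. A sketch, again assuming 4 KiB pages and an arbitrary buffer size:

    #include <stdio.h>

    #define PAGE_SIZE 4096ULL   /* assumed page size */

    int main(void) {
        unsigned long long buffer_bytes = 256ULL * 1024 * 1024;   /* 256 MiB working set */
        unsigned long long pages = (buffer_bytes + PAGE_SIZE - 1) / PAGE_SIZE;

        /* Touching even one byte per page requires a valid mapping for the
         * whole page, so a sparse, large-stride access pattern can demand
         * tens of thousands of translations while doing little useful work
         * per page, while a dense pattern reuses each translation heavily. */
        printf("%llu MiB spans %llu pages\n", buffer_bytes >> 20, pages);
        return 0;
    }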

Parallelism: multicore processors and shared resources

As clock speeds plateaued due to power and thermal limits, mainstream performance gains shifted toward multicore designs. Multiple cores allow true parallel execution, but speedups are constrained by:

  • Serial portions of the workload (the Amdahl's law bound sketched after this list)
  • Synchronization overhead (locks, barriers, atomic operations)
  • Contention for shared resources (last-level cache, memory bandwidth)
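
The first of these limits is captured by Amdahl's law: if a fraction s of the work is inherently serial, the speedup on N cores is bounded by 1 / (s + (1 - s) / N). The 10% serial fraction below is an assumption chosen only to show the shape of the curve.

    #include <stdio.h>

    /* Amdahl's law: upper bound on speedup when a given fraction of the
     * work cannot be parallelized. */
    static double amdahl_speedup(double serial, int cores) {
        return 1.0 / (serial + (1.0 - serial) / cores);
    }

    int main(void) {
        double serial = 0.10;                    /* assumed serial fraction */
        int core_counts[] = { 2, 4, 8, 16, 64 };
        int n_counts = sizeof core_counts / sizeof core_counts[0];

        for (int i = 0; i < n_counts; i++)
            printf("%2d cores: at most %.2fx speedup\n",
                   core_counts[i], amdahl_speedup(serial, core_counts[i]));

        /* Even with unlimited cores the bound here is 1/serial = 10x,
         * which is why serial sections and synchronization dominate at scale. */
        return 0;
    }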

Even when a program is “parallel,” shared memory introduces challenges. Cache coherence protocols keep per-core caches consistent, but coherence traffic can become expensive when many cores frequently modify shared data. A common architectural lesson for software is to reduce false sharing and minimize shared writable state.
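
A classic instance is false sharing: two threads update different counters that happen to live on the same cache line, so the line bounces between cores even though no data is logically shared. The sketch below shows the layout problem and the usual padding fix, assuming 64-byte cache lines.

    #include <stdalign.h>

    /* Two per-thread counters packed next to each other share a cache
     * line, so writes from different cores repeatedly invalidate each
     * other's copy (false sharing), even though the counters are
     * logically independent. */
    struct counters_shared_line {
        long thread0_count;
        long thread1_count;
    };

    /* Aligning each counter to an assumed 64-byte line keeps them on
     * separate lines, eliminating the coherence ping-pong at the cost
     * of some wasted space. */
    struct counters_padded {
        alignas(64) long thread0_count;
        alignas(64) long thread1_count;
    };

Seeing the difference requires actually running threads that hammer the two fields, but the structural fix is the same either way: keep independently written data on separate cache lines.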

GPU architectures: throughput machines

GPUs are designed for massive throughput rather than low latency per task. While a CPU is optimized for fast execution of a single thread with complex control flow, a GPU is optimized to run many threads performing similar operations.

Typical GPU strengths include:

  • High arithmetic throughput
  • Massive parallelism for data-parallel workloads
  • High memory bandwidth (which is sustained only with regular, carefully structured access patterns)

GPU programming models generally benefit when work can be expressed as the same operation over many data elements, such as image processing, linear algebra, and many machine learning kernels. Irregular control flow and scattered memory access patterns can reduce efficiency because GPU hardware relies on keeping many execution lanes busy.
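
In CPU form, the kind of code that maps well to a GPU looks like the loop below: the same arithmetic applied independently to every element, with contiguous memory access. The comments describe how it would typically map onto GPU threads; this is a plain C sketch rather than actual GPU kernel code.

    #include <stddef.h>

    /* SAXPY: y[i] = a * x[i] + y[i]. Every iteration is independent and
     * touches adjacent memory, which is the shape GPUs are built for: on
     * a GPU, each iteration would typically be executed by its own
     * thread, with neighboring threads reading neighboring elements.
     * Irregular branching or scattered indices into x and y would erode
     * that efficiency by leaving execution lanes idle. */
    void saxpy(size_t n, float a, const float *x, float *y) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }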

Putting it together: architectural thinking in practice

Computer architecture becomes most useful when it informs concrete decisions:

  • When optimizing performance, start by asking whether the workload is compute-bound, memory-bound, or limited by branching and synchronization.
  • Improve locality before chasing instruction-level tweaks. Better cache behavior often produces larger gains than micro-optimizations.
  • For parallel programs, measure scalability and watch for shared bottlenecks like memory bandwidth, lock contention, and coherence effects.
  • For heterogeneous systems, decide which parts belong on the CPU versus the GPU based on control complexity and data parallelism.

Architecture is ultimately about tradeoffs: simplicity versus flexibility, latency versus throughput, and compatibility versus innovation. The more clearly you understand those tradeoffs, the easier it becomes to predict performance, explain bottlenecks, and design software that aligns with the hardware it runs on.
