OS: Write-Ahead Logging and Recovery

In database and operating systems, ensuring that transactions survive system crashes is non-negotiable for data integrity. Write-ahead logging (WAL) is the cornerstone technique that guarantees durability by recording changes to a persistent log before they are applied to the database itself. Mastering WAL not only prevents data loss but also underpins efficient recovery mechanisms used in modern systems like PostgreSQL and MySQL.

The Principle of Write-Ahead Logging

Write-ahead logging (WAL) is a protocol that ensures transaction durability by mandating that all modifications to data are first recorded in a persistent log before any changes are written to the actual database files on disk. This log is an append-only sequence of records, each describing a single update, such as "change byte X at page Y from value A to B." The core rule, often called the Write-Ahead Log Protocol, states: for any data page modified in the buffer pool (the in-memory cache of database pages), the log records corresponding to those modifications must be forced to stable storage before the dirty page itself is written back to disk. This order is critical: if a crash occurs after a page write but before its log record is persisted, the on-disk page contains a change the log knows nothing about, so recovery can neither undo it nor tell whether it belonged to a committed transaction, and the database may be left corrupted.
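The ordering rule can be illustrated with a minimal sketch. It assumes a single flat data file and a text log with one record per line; the function name wal_update and the record layout (offset, before-image, after-image in hex) are hypothetical choices for illustration, not the format of any real system.

```python
import os
import tempfile

def wal_update(log_path, data_path, offset, new_bytes):
    """Log the change, force the log to disk, and only then touch the data file."""
    with open(data_path, "r+b") as data:
        data.seek(offset)
        old_bytes = data.read(len(new_bytes))
        # 1. Append a record holding the before-image and after-image.
        record = f"{offset} {old_bytes.hex()} {new_bytes.hex()}\n".encode()
        with open(log_path, "ab") as log:
            log.write(record)
            log.flush()
            os.fsync(log.fileno())  # force the record to stable storage
        # 2. Only now is it safe to modify the data file.
        data.seek(offset)
        data.write(new_bytes)
    return record

# Demo in a scratch directory.
workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "wal.log")
data_path = os.path.join(workdir, "data.bin")
with open(data_path, "wb") as f:
    f.write(b"AAAA")
record = wal_update(log_path, data_path, 1, b"ZZ")
```

If the process dies between step 1 and step 2, the log still holds enough information to redo the change; if it dies before step 1 completes, the data file was never touched.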

Think of WAL as keeping a detailed journal before making entries in a ledger. If the ledger is damaged, you can reconstruct it exactly from the journal. The log enables two key recovery objectives: redo, reapplying committed changes that may not have reached disk, and undo, rolling back uncommitted changes that may have partially persisted. This systematic approach transforms chaotic crash scenarios into a manageable replay process.

Implementing WAL Protocols

Implementing WAL involves designing log records and enforcing strict write sequences. A typical log record contains a unique Log Sequence Number (LSN), the transaction ID, the type of operation (e.g., update, commit, abort), the before-image and after-image of the data (for undo and redo), and a pointer to the previous record of the same transaction. The protocol requires that when a transaction commits, all of its log records, up to and including the commit record, are forced to the log device; this is the commit rule. Only after this force-write is acknowledged does the system consider the transaction durable and notify the user.
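As a rough sketch of the record layout and the commit rule described above, the following in-memory model may help. The class and field names (LogRecord, LogManager, prev_lsn, stable) are assumptions made for illustration, and a Python list stands in for the log device.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class LogRecord:
    lsn: int
    txn_id: int
    kind: str                       # "update", "commit", or "abort"
    prev_lsn: Optional[int]         # previous record of the same transaction
    page_id: Optional[int] = None
    before: Optional[bytes] = None  # before-image, used by undo
    after: Optional[bytes] = None   # after-image, used by redo

class LogManager:
    def __init__(self):
        self.buffer: List[LogRecord] = []  # in-memory log tail
        self.stable: List[LogRecord] = []  # stands in for the log device
        self.next_lsn = 1
        self.last_lsn: Dict[int, int] = {}

    def append(self, txn_id, kind, **fields):
        rec = LogRecord(self.next_lsn, txn_id, kind,
                        self.last_lsn.get(txn_id), **fields)
        self.next_lsn += 1
        self.last_lsn[txn_id] = rec.lsn
        self.buffer.append(rec)
        return rec.lsn

    def force(self):
        # The log is sequential, so everything buffered so far goes out together.
        self.stable.extend(self.buffer)
        self.buffer.clear()

    def commit(self, txn_id):
        lsn = self.append(txn_id, "commit")
        self.force()  # commit rule: the commit record must be stable before we return
        return lsn

log = LogManager()
log.append(1, "update", page_id=7, before=b"A", after=b"B")
log.append(2, "update", page_id=8, before=b"X", after=b"Y")
commit_lsn = log.commit(1)
```

Note that forcing the log for transaction 1 also persists the record of transaction 2, which has not committed; that is harmless, because recovery will undo it if needed.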

From an engineering perspective, you must coordinate the buffer manager and the log manager. When a transaction modifies a page in the buffer pool, the system first creates a log record and appends it to a log buffer in memory. This log buffer is periodically flushed to disk, but the force-write is triggered at commit time or when a dirty page is about to be evicted from the buffer pool. Efficient implementation often uses steal/no-force buffer management: "steal" means uncommitted dirty pages can be written to disk (requiring undo capability), and "no-force" means committed pages need not be forced immediately to disk (requiring redo capability). WAL supports both, providing flexibility.

The ARIES Recovery Algorithm: Phases Explained

ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) is a widely studied WAL-based recovery algorithm that operates in three distinct phases: analysis, redo, and undo. It is designed to be highly flexible and efficient, handling complex scenarios like nested top actions.

The analysis phase begins after a crash. The system reads the log forward from the last checkpoint to reconstruct the state at the time of the crash. It determines which transactions were still active (i.e., neither committed nor fully rolled back) and builds two tables: the transaction table (active transactions and the LSN of their most recent log record) and the dirty page table (pages that were modified in the buffer pool and may not have been written to disk, each with the LSN of the first record that dirtied it, its recLSN). This phase also identifies the starting point for the redo phase, known as the RedoLSN, which is the smallest recLSN in the dirty page table, that is, the oldest change that might not have reached disk.
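A simplified sketch of the analysis pass is shown below. It assumes the log is a list of dictionaries with hypothetical keys (lsn, kind, txn, page), that the checkpoint record carries copies of both tables, and that a commit or end record simply removes the transaction from the table; full ARIES tracks more state than this.

```python
def analyze(log, checkpoint_index):
    """Rebuild the transaction table and dirty page table from the last checkpoint."""
    ckpt = log[checkpoint_index]
    txn_table = dict(ckpt["txn_table"])      # txn id -> LSN of its latest record
    dirty_pages = dict(ckpt["dirty_pages"])  # page id -> recLSN (first dirtying LSN)
    for rec in log[checkpoint_index + 1:]:
        if rec["kind"] in ("update", "clr"):
            txn_table[rec["txn"]] = rec["lsn"]
            # Keep the earliest LSN that dirtied the page.
            dirty_pages.setdefault(rec["page"], rec["lsn"])
        elif rec["kind"] in ("commit", "end"):
            txn_table.pop(rec["txn"], None)
    # Redo must start at the oldest change that may be missing from disk.
    redo_lsn = min(dirty_pages.values(), default=None)
    return txn_table, dirty_pages, redo_lsn

log = [
    {"lsn": 1, "kind": "update", "txn": 1, "page": "P1"},
    {"lsn": 2, "kind": "checkpoint", "txn_table": {1: 1}, "dirty_pages": {"P1": 1}},
    {"lsn": 3, "kind": "update", "txn": 2, "page": "P2"},
    {"lsn": 4, "kind": "commit", "txn": 1},
    {"lsn": 5, "kind": "update", "txn": 2, "page": "P1"},
]
txn_table, dirty_pages, redo_lsn = analyze(log, 1)
```

In this example transaction 2 is the only one left in the transaction table, and redo has to start at LSN 1, which lies before the checkpoint record at LSN 2.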

Next, the redo phase replays history to restore the database to its pre-crash state. Starting from the RedoLSN, the system scans the log forward. For each update log record, it re-applies the change if the affected page's pageLSN (the LSN of the last update applied to that page) is less than the log record's LSN. This pageLSN check ensures idempotence—changes are not applied twice even if recovery is interrupted. The redo phase repeats all actions, including those of transactions that later aborted, because undo will clean them up.
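The pageLSN test can be sketched as follows. This is a minimal model under stated assumptions: pages are dictionaries holding a value and a page_lsn, log records carry an after-image under the hypothetical key after, and the additional dirty-page-table filtering that full ARIES performs is omitted.

```python
def redo(log, redo_lsn, pages):
    """Repeat history from redo_lsn; returns the LSNs that were actually reapplied."""
    applied = []
    for rec in log:
        if rec["lsn"] < redo_lsn or rec["kind"] not in ("update", "clr"):
            continue
        page = pages[rec["page"]]
        # Reapply only if the page has not yet seen this record.
        if page["page_lsn"] < rec["lsn"]:
            page["value"] = rec["after"]
            page["page_lsn"] = rec["lsn"]
            applied.append(rec["lsn"])
    return applied

log = [
    {"lsn": 1, "kind": "update", "page": "P1", "after": "a"},
    {"lsn": 2, "kind": "update", "page": "P1", "after": "b"},
    {"lsn": 3, "kind": "update", "page": "P2", "after": "y"},
    {"lsn": 4, "kind": "update", "page": "P1", "after": "c"},
]
# P1 reached disk with LSN 2 applied; P2 never reached disk.
pages = {
    "P1": {"page_lsn": 2, "value": "b"},
    "P2": {"page_lsn": 0, "value": "x"},
}
first_pass = redo(log, 1, pages)
second_pass = redo(log, 1, pages)  # simulates recovery restarting after a second crash
```

The first pass reapplies only LSNs 3 and 4; the second pass reapplies nothing, which is the idempotence property the pageLSN comparison provides.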

Finally, the undo phase rolls back incomplete transactions. Starting from the end of the log and moving backward, the system processes log records of transactions that were active at crash time. For each such update record, it applies the before-image to restore the old value and writes a compensation log record (CLR) to the log. The CLR documents the undo action and includes an UndoNxtLSN pointer to the next log record to undo for that transaction. CLRs are never undone themselves: if the system crashes again during recovery, undo follows the UndoNxtLSN pointer and skips work that has already been compensated, so repeated crashes cannot cause the same update to be rolled back twice. Undo continues until all active transactions are rolled back.
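The backward pass and the role of CLRs can be sketched as below. The sketch assumes the log is a list of dictionaries with hypothetical keys (prev for the per-transaction back pointer, undo_next for the UndoNxtLSN pointer), that pages are plain values, and that the loser transactions have only update and CLR records; pageLSN bookkeeping is left out for brevity.

```python
def undo(log, losers, pages):
    """Roll back loser transactions; losers maps txn id -> LSN of its latest record."""
    by_lsn = {rec["lsn"]: rec for rec in log}
    to_undo = dict(losers)
    next_lsn = max(by_lsn) + 1
    while to_undo:
        # Always process the largest outstanding LSN, so the log is walked backward.
        txn, lsn = max(to_undo.items(), key=lambda item: item[1])
        rec = by_lsn[lsn]
        if rec["kind"] == "clr":
            nxt = rec["undo_next"]             # skip what was already compensated
        else:
            pages[rec["page"]] = rec["before"]  # restore the before-image
            log.append({"lsn": next_lsn, "kind": "clr", "txn": txn,
                        "page": rec["page"], "after": rec["before"],
                        "undo_next": rec["prev"]})
            next_lsn += 1
            nxt = rec["prev"]
        if nxt is None:
            del to_undo[txn]                   # transaction fully rolled back
        else:
            to_undo[txn] = nxt
    return log

log = [
    {"lsn": 1, "kind": "update", "txn": 1, "page": "P1", "before": "a0", "after": "a1", "prev": None},
    {"lsn": 2, "kind": "update", "txn": 2, "page": "P2", "before": "b0", "after": "b1", "prev": None},
    {"lsn": 3, "kind": "update", "txn": 1, "page": "P1", "before": "a1", "after": "a2", "prev": 1},
    # Written by an earlier, interrupted recovery that had already undone LSN 3.
    {"lsn": 4, "kind": "clr", "txn": 1, "page": "P1", "after": "a1", "undo_next": 1},
]
pages = {"P1": "a1", "P2": "b1"}  # state after redo has repeated history
undo(log, {1: 4, 2: 2}, pages)
```

Because the CLR at LSN 4 points past LSN 3, that update is not undone a second time; only LSNs 2 and 1 are compensated, producing two new CLRs.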

Checkpoints: Minimizing Recovery Time

Without checkpoints, recovery would require scanning the entire log, which could be impractically long. Checkpoints are periodic operations that reduce recovery time by creating a snapshot of the system state. There are two primary types: fuzzy checkpoints and sharp checkpoints.

A fuzzy checkpoint is non-blocking and allows normal transaction processing to continue. It involves writing a checkpoint log record that contains the transaction table and dirty page table at that moment. However, dirty pages are not forced to disk during the checkpoint. This means recovery must still perform redo from a point earlier than the checkpoint, but the analysis phase can start from the checkpoint record, limiting log scanning. Most production systems use fuzzy checkpoints for minimal disruption.
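A small sketch may make the relationship between the checkpoint record and the redo starting point concrete. The function names and record keys here are hypothetical, and the dirty page table is assumed to map each page to the LSN that first dirtied it.

```python
def take_fuzzy_checkpoint(log, txn_table, dirty_pages):
    """Append a checkpoint record holding copies of both tables; no pages are flushed."""
    record = {
        "lsn": len(log) + 1,
        "kind": "checkpoint",
        "txn_table": dict(txn_table),      # copied, so later changes do not leak in
        "dirty_pages": dict(dirty_pages),
    }
    log.append(record)
    return record

def redo_start(checkpoint):
    """Redo begins at the oldest change that may still be missing from disk."""
    return min(checkpoint["dirty_pages"].values(), default=checkpoint["lsn"])

log = [{"lsn": 1, "kind": "update"}, {"lsn": 2, "kind": "update"}, {"lsn": 3, "kind": "update"}]
txn_table = {7: 3}
dirty_pages = {"P1": 1, "P4": 3}
ckpt = take_fuzzy_checkpoint(log, txn_table, dirty_pages)
dirty_pages["P9"] = 5            # normal processing continues after the checkpoint
start = redo_start(ckpt)
```

Here the checkpoint record is written at LSN 4, yet redo must start at LSN 1 because page P1 was dirtied there and has not been flushed.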

In contrast, a sharp checkpoint temporarily halts transactions to force all dirty pages to disk, ensuring that the log prior to the checkpoint is no longer needed for redo. This simplifies recovery but impacts performance due to the I/O burst and blocking. How often checkpoints are triggered, typically based on log size or a time interval, is a key tuning parameter: frequent checkpoints shorten recovery at the cost of more work during normal operation. Checkpoints interact closely with WAL: they rely on the log being consistent, and the dirty page table they record is what lets recovery determine the RedoLSN.

WAL and Buffer Pool Management: A Symbiotic Relationship

The buffer pool and WAL must interact seamlessly for both performance and correctness. The buffer pool caches database pages in memory to reduce disk I/O. When a page is modified, it becomes "dirty" and must be managed with respect to the log. The steal policy mentioned earlier is enabled by WAL: a dirty page from an uncommitted transaction can be written to disk (e.g., to free buffer space) because the log contains the before-image for potential undo.

Conversely, the no-force policy means that pages of committed transactions are not immediately written to disk; they remain in the buffer pool until evicted by a page replacement algorithm like LRU. This reduces write I/O but requires the redo capability from the log during recovery. The system tracks each page's pageLSN to coordinate with log records. When the buffer manager decides to write a dirty page to disk, it must ensure that all log records up to that page's pageLSN are already on stable storage—this is the WAL rule in action.
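The eviction-time check can be sketched with a toy buffer pool. All names here (Log, BufferPool, flushed_lsn, evict) are illustrative assumptions, log contents are omitted, and dictionaries stand in for memory frames and disk.

```python
class Log:
    def __init__(self):
        self.next_lsn = 1
        self.flushed_lsn = 0   # highest LSN known to be on stable storage

    def append(self):
        lsn = self.next_lsn
        self.next_lsn += 1
        return lsn

    def flush_to(self, lsn):
        self.flushed_lsn = max(self.flushed_lsn, lsn)

class BufferPool:
    def __init__(self, log):
        self.log = log
        self.frames = {}       # page id -> {"value", "page_lsn", "dirty"}
        self.disk = {}         # page id -> (value, page_lsn)

    def update(self, page_id, value):
        lsn = self.log.append()
        self.frames[page_id] = {"value": value, "page_lsn": lsn, "dirty": True}
        return lsn

    def evict(self, page_id):
        frame = self.frames.pop(page_id)
        if frame["dirty"]:
            # WAL rule: the log must be stable up to pageLSN before the page is written.
            if self.log.flushed_lsn < frame["page_lsn"]:
                self.log.flush_to(frame["page_lsn"])
            self.disk[page_id] = (frame["value"], frame["page_lsn"])

log = Log()
pool = BufferPool(log)
pool.update("P1", "v1")
pool.update("P2", "v2")
pool.evict("P2")   # steal: the page may be written out before any commit
```

After the eviction, the log is flushed through LSN 2 and P2 is on disk, while P1 stays dirty in memory with nothing forcing it out, which is the no-force side of the policy.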

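In practice, you analyze this interaction by considering scenarios. For example, if a crash occurs after a dirty page has been written to disk but before its transaction commits, the on-disk page contains uncommitted data (the buffer pool contents themselves are lost in the crash). Because the WAL rule guarantees that the corresponding log records reached stable storage first, the undo phase can restore the before-images from the log. This symbiosis allows databases to optimize memory usage while guaranteeing durability, a hallmark of robust system design.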

Common Pitfalls

Ignoring the Force-Write Order: A frequent mistake is to assume that writing a dirty page to disk anytime is safe without first forcing its log records. If a crash happens after the page write but before the log force, the change might be irrecoverable or lead to corruption. Correction: Always enforce the WAL protocol strictly—log records must precede data page writes to stable storage.

Misunderstanding Checkpoint Contents: Engineers sometimes treat a fuzzy checkpoint as a complete backup, expecting recovery to start exactly from that point without redo. However, fuzzy checkpoints do not force dirty pages, so redo must still process earlier log records. Correction: Use the checkpoint's transaction and dirty page tables to initialize recovery state, but plan to redo from the RedoLSN, which may be before the checkpoint.

Skipping Compensation Log Records During Undo: When manually tracing ARIES undo, it's easy to forget to log the undo actions. If a second crash occurs during undo, without compensation log records (CLRs) the system has no record of which updates were already rolled back and might undo the same update a second time. Correction: Always write CLRs during undo so that recovery can track its progress and remains safe to restart.

Confusing PageLSN Comparisons in Redo: During the redo phase, applying a log record even when the pageLSN is equal to or greater than the log LSN can cause data corruption by overwriting newer updates. Correction: The redo condition is strict: only reapply if pageLSN < log LSN. This ensures that recovery does not repeat operations that are already reflected on disk.

Summary

Write-ahead logging (WAL) guarantees durability by recording all changes to a persistent log before applying them to the database, enabling reliable crash recovery.
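The ARIES recovery algorithm processes crashes in three phases: analysis to determine system state, redo to repeat history by reapplying all logged changes that may not have reached disk (including those of uncommitted transactions), and undo to roll back incomplete transactions, writing compensation log records so that undo itself survives repeated crashes.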
Checkpoints, especially fuzzy checkpoints, minimize recovery time by providing snapshots of transaction and dirty page states, but redo may still be needed from earlier points in the log.
WAL and buffer pool management are interdependent: WAL supports steal/no-force policies, allowing flexible memory management while ensuring data integrity through log synchronization.
Effective implementation requires strict adherence to the WAL protocol, correct handling of log sequence numbers (LSNs), and careful design of checkpoints to balance performance and recovery speed.
