Mar 1

Columnar Storage and Compression for Analytics

Mindli Team

AI-Generated Content

If you’ve ever waited minutes or hours for an analytical query to scan billions of rows, you’ve encountered the fundamental limitation of traditional row-based storage. Columnar storage and its companion compression techniques are the engineered solution to this problem, transforming analytical performance by rethinking how data is laid out on disk. Understanding these formats is essential for designing modern data warehouses, lakes, and efficient query engines.

The Core Idea: Storing by Column

Traditional row-based storage (like in a typical OLTP database) writes all the data for a single row contiguously to disk. This is ideal for operations that need the entire row, such as inserting a customer order or updating a record. However, for analytical queries that aggregate values from a few columns across millions of rows—like calculating the average sales per region—row storage is spectacularly inefficient. The query engine must read every row from disk, loading entire rows into memory just to extract and process the few relevant columns.

Columnar storage flips this model. Instead of storing all data for a row together, it stores all the values for each column together. In a table with columns CustomerID, Region, and SaleAmount, all SaleAmount values are stored contiguously in one block, all Region values in another, and so on. When a query asks for SUM(SaleAmount) GROUP BY Region, the database can read only the SaleAmount and Region column blocks, skipping over all other column data entirely. This dramatically reduces I/O, which is often the primary bottleneck in analytics.
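The I/O saving can be sketched in plain Python. In this toy model (hypothetical data, not a real storage engine), each column is its own list; the aggregation touches only the Region and SaleAmount "blocks" and never reads CustomerID:

```python
from collections import defaultdict

# Hypothetical table stored column-wise: one list per column.
columns = {
    "CustomerID": [101, 102, 103, 104, 105],
    "Region":     ["EMEA", "APAC", "EMEA", "APAC", "EMEA"],
    "SaleAmount": [250.0, 120.0, 80.0, 300.0, 45.0],
}

def sum_by_region(cols):
    """SELECT Region, SUM(SaleAmount) ... GROUP BY Region,
    reading only the two column blocks the query needs."""
    totals = defaultdict(float)
    for region, amount in zip(cols["Region"], cols["SaleAmount"]):
        totals[region] += amount
    return dict(totals)

print(sum_by_region(columns))  # {'EMEA': 375.0, 'APAC': 420.0}
```

In a real engine the unread column lives in separate disk blocks or files, so skipping it avoids physical I/O, not just a dictionary lookup.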

Popular implementations of this concept include file formats like Apache Parquet and Apache ORC (Optimized Row Columnar), as well as columnar databases like ClickHouse, Amazon Redshift, and Google BigQuery. These systems are engineered from the ground up to exploit the performance characteristics of column-oriented layouts.

Compression: The Secret Weapon of Columnar Storage

The columnar layout unlocks highly effective compression schemes that are far less practical on row-oriented data. Because all values in a column are of the same data type and often exhibit low cardinality (a small number of distinct values), they compress extremely well.

  • Run-Length Encoding (RLE): This is exceptionally powerful in sorted columns. If the Region column is sorted, you might have a long sequence of the value "EMEA". Instead of storing "EMEA" thousands of times, RLE stores the value once with a count of its repetitions (e.g., ["EMEA", 5000]). This can reduce storage by several orders of magnitude.
  • Dictionary Encoding: This is arguably the most important compression method for columnar formats. The system builds a dictionary of all unique values in a column (e.g., 0="EMEA", 1="APAC", 2="Americas"). It then replaces the actual string values in the column data with compact integer IDs (e.g., 0,0,0,1,2...). Queries can operate directly on these compact integers, and the dictionary is only referenced when the final human-readable value is needed for output.
  • Delta Encoding: This is ideal for columns with sequentially increasing values, like timestamps or primary keys. Instead of storing the full values, the system stores the difference (delta) from the previous value. For example, the sequence 10001, 10002, 10003, 10005 becomes 10001, +1, +1, +2. These smaller deltas are much easier to compress further with generic algorithms like Snappy or Zstd.
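The three encodings above can each be sketched in a few lines of Python. These are simplified illustrations of the ideas, not the exact byte layouts Parquet or ORC use:

```python
def rle_encode(values):
    """Run-length encoding: collapse repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def dict_encode(values):
    """Dictionary encoding: map unique values to compact integer IDs."""
    dictionary, ids = {}, []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        ids.append(dictionary[v])
    return dictionary, ids

def delta_encode(values):
    """Delta encoding: keep the first value, then store differences."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

print(rle_encode(["EMEA"] * 4 + ["APAC"] * 2))    # [('EMEA', 4), ('APAC', 2)]
print(dict_encode(["EMEA", "EMEA", "APAC"]))      # ({'EMEA': 0, 'APAC': 1}, [0, 0, 1])
print(delta_encode([10001, 10002, 10003, 10005])) # [10001, 1, 1, 2]
```

Note how the delta output matches the example in the text: small integers like 1 and 2 compress far better under a generic codec than the original five-digit values.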

The synergy is clear: columnar storage organizes data to make these powerful encodings possible, and the resulting compression further reduces I/O and increases the amount of data that can be cached in memory.

Query Optimization: Predicate Pushdown

Compression isn't just about saving disk space; it directly accelerates queries through a critical optimization called predicate pushdown. This is the ability of a query engine to apply filters (the WHERE clause) as early as possible in the data retrieval process, ideally before reading unnecessary data.

In a columnar file like Parquet, data is divided into row groups (horizontal partitions), and each column within a row group is stored in pages. Metadata for each page—like minimum and maximum values—is stored in the file footer. When a query executes WHERE SaleAmount > 1000, the engine can first read this lightweight metadata. If a page's maximum value is 500, the entire page can be skipped without being read or decompressed. Furthermore, with dictionary-encoded columns, the filter can be applied to the integer dictionary IDs directly, which is vastly faster than comparing strings.
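A minimal sketch of this skipping logic, assuming each page carries min/max statistics as Parquet metadata does (the page contents here are invented for illustration):

```python
# Each "page" carries min/max statistics, as in Parquet metadata.
pages = [
    {"min": 10,  "max": 500,  "values": [10, 250, 500]},
    {"min": 900, "max": 2000, "values": [900, 1500, 2000]},
    {"min": 30,  "max": 400,  "values": [30, 400]},
]

def scan_greater_than(pages, threshold):
    """Evaluate WHERE value > threshold, skipping any page whose
    max statistic proves no row inside it can match."""
    hits, pages_read = [], 0
    for page in pages:
        if page["max"] <= threshold:   # metadata rules the page out
            continue                   # skipped: never read or decompressed
        pages_read += 1
        hits.extend(v for v in page["values"] if v > threshold)
    return hits, pages_read

print(scan_greater_than(pages, 1000))  # ([1500, 2000], 1)
```

For the filter `> 1000`, two of the three pages are eliminated from the metadata alone; only one page is ever read.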

This combination of columnar I/O reduction, efficient compression, and metadata-driven skipping is what makes scanning terabytes of data feasible in seconds.

Columnar vs. Row-Based: Choosing the Right Tool

Choosing a storage format is a trade-off based on your dominant query patterns.

Choose Columnar Storage (Parquet, ORC, Columnar DBs) when:

  • Your workload is analytical (OLAP): reads dominate, with complex queries involving aggregations, scans, and joins.
  • Queries typically access a subset of a table's many columns.
  • Data is often read in large batches.

Choose Row-Based Storage (CSV, Avro, traditional RDBMS) when:

  • Your workload is transactional (OLTP): you require frequent single-row inserts, updates, or deletes.
  • Queries almost always need to access all columns of a row (e.g., fetching a user's complete profile).
  • You are dealing with real-time event streaming where data is written row-by-row.

Data characteristics also matter. Columnar storage benefits most from:

  • Tables with many columns (wide tables).
  • Columns with low to moderate cardinality, enabling effective dictionary and RLE compression.
  • Data that can be sorted to maximize run-length encoding before being written.
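The effect of sorting on run-length encoding is easy to demonstrate. This small sketch (with made-up Region values) counts RLE runs before and after sorting; fewer runs means better compression:

```python
def run_count(values):
    """Number of RLE runs in a sequence: fewer runs = better compression."""
    return sum(1 for i, v in enumerate(values) if i == 0 or values[i - 1] != v)

unsorted_regions = ["EMEA", "APAC", "EMEA", "APAC", "EMEA", "APAC"]
print(run_count(unsorted_regions))          # 6 runs: RLE gains nothing
print(run_count(sorted(unsorted_regions)))  # 2 runs: one per distinct value
```

At scale the same principle holds: a randomly ordered low-cardinality column produces roughly as many runs as rows, while a sorted one produces only as many runs as distinct values.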

Common Pitfalls

  1. Using Columnar Storage for OLTP Workloads: Attempting to use a format like Parquet as the primary store for a high-transaction application will result in terrible write performance. Writes require restructuring entire row groups and recomputing compression, which is a batch-oriented, not row-oriented, process.
  • Correction: Use the right tool for the job. Row-stores for OLTP, column-stores for OLAP. A common architecture is to ingest data into a row-store, then periodically ETL it into a columnar format in a data lake for analytics.
  2. Poor Data Sorting Before Ingestion: Writing data in a random order wastes the potential of run-length encoding. If a low-cardinality column like Region is scattered randomly, its values won't form long runs.
  • Correction: Sort your data by key low-cardinality columns (e.g., Region, Department) before writing it to Parquet or ORC. This one-time sorting cost pays massive dividends in compression ratio and query speed for all subsequent reads.
  3. Ignoring File and Block Sizing: Creating many tiny Parquet files or extremely large row groups can hurt performance. Many small files create overhead for the query engine (the "small file problem"), while a single huge row group reduces opportunities for parallel I/O and predicate pushdown.
  • Correction: Aim for reasonably sized files (e.g., 256MB to 1GB) and row groups (e.g., 128MB). This balances I/O efficiency, parallelism, and the granularity of data skipping.
  4. Assuming All Queries Will Be Faster: While aggregate and scan queries accelerate dramatically, queries that retrieve most columns of many rows (a "select *" with a large result set) may see less benefit or could even be slower. The engine must now stitch together data from many separate column blocks, adding overhead.
  • Correction: Profile your query patterns. Understand that columnar storage is an optimization for specific access patterns, not a universal performance panacea.

Summary

  • Columnar storage (e.g., Parquet, ORC) stores data by column, not by row, drastically reducing I/O for analytical queries that read only a subset of columns.
  • The column-oriented layout enables powerful compression techniques like dictionary encoding and run-length encoding, which save space and further speed up queries by allowing operations on compact data representations.
  • Predicate pushdown uses file metadata (min/max values) to skip entire blocks of data during a scan, making filtered queries extremely fast.
  • Columnar formats are ideal for read-heavy, analytical workloads (OLAP), while row-based formats remain superior for write-heavy, transactional workloads (OLTP) where full rows are accessed.
  • To maximize benefits, data should be sorted before ingestion and organized into optimally sized files and row groups.
