Mar 1

dbt Models and Materializations

MT
Mindli Team

AI-Generated Content


dbt (data build tool) transforms how analytics engineers work by turning SQL SELECT statements into reliable, documented, and tested data assets. By adopting software engineering principles like modularity and dependency management, dbt enables you to build a data transformation pipeline that is transparent, maintainable, and efficient.

What is a dbt Model?

At its core, a dbt model is a single SQL file containing a SELECT statement. Think of it as a blueprint for a dataset you want to create. dbt executes this blueprint using a process called materialization, which determines the physical form the dataset takes in your data warehouse (e.g., a view or a table). Models are the fundamental building blocks in a dbt project; you organize them into directories that reflect their purpose within your data pipeline. The power of dbt lies not just in running SQL, but in orchestrating the execution order of these models based on their dependencies and applying the chosen materialization logic.
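As a concrete sketch (the file path and column names here are invented for illustration), a complete model is nothing more than a SQL file:

```sql
-- models/staging/stg_payments.sql (hypothetical path and names)
-- The file contains only a SELECT; dbt wraps it in the DDL dictated
-- by the chosen materialization (CREATE VIEW AS ... by default).
SELECT
    id                    AS payment_id,
    amount_cents / 100.0  AS amount_usd,
    created_at            AS paid_at
FROM {{ source('raw', 'payments') }}
```

Running `dbt run` compiles the Jinja, resolves the dependency, and issues the appropriate DDL against your warehouse.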

The Layered Architecture: Staging, Intermediate, and Mart

A best-practice dbt project structures models into logical layers, each with a distinct responsibility. This promotes clarity, reusability, and easier debugging.

  1. Staging Layer (staging/): This is where you interact with your raw data. Models in this layer perform light transformations like renaming columns, casting data types, and basic cleaning. Their primary goal is to create a consistent, typed interface to your source data. You will use the source() function here to reference raw tables. For example, a stg_customers model might select from a raw users table, standardizing the signup_date field.
  2. Intermediate Layer (intermediate/): These models sit between staging and final data products. They are often used for complex joins, aggregations, or business logic that is shared across multiple final datasets. An int_user_orders model that joins stg_users and stg_orders to calculate lifetime value for each user is a classic intermediate model. This layer prevents code duplication in your final marts.
  3. Mart Layer (marts/): This layer contains business-facing data products, organized by department or function (e.g., finance/, marketing/). Models here are fully transformed, ready for analysis, and often modeled as dimensions or facts. A mart_finance.customer_revenue table would be a final output built from intermediate and staging models.
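The intermediate model described above might be sketched like this (table and column names are assumed):

```sql
-- models/intermediate/int_user_orders.sql (hypothetical)
-- Joins two staging models via ref(), which also registers
-- the dependencies in dbt's DAG.
SELECT
    u.user_id,
    SUM(o.order_total) AS lifetime_value
FROM {{ ref('stg_users') }} AS u
LEFT JOIN {{ ref('stg_orders') }} AS o
    ON o.user_id = u.user_id
GROUP BY u.user_id
```

Because both inputs are already cleaned in staging, this model contains only the business logic, and any mart needing lifetime value can reuse it.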

Materialization Types: View, Table, Incremental, and Ephemeral

The materialization is a strategy dbt uses to build your model in the warehouse. You specify it in a model's configuration. Choosing the right one balances performance, cost, and data freshness.

  • View: dbt runs CREATE VIEW AS .... This is the default. Pros: No data duplication, always reflects the latest source data. Cons: Slow for downstream queries if the underlying transformation is complex.
  • Table: dbt runs CREATE TABLE AS ... or DROP TABLE ... CREATE TABLE AS ... on each run. Pros: Fast query performance for end-users. Cons: Full rebuild can be slow and costly for large datasets; data is only as fresh as the last run.
  • Incremental: This powerful materialization allows dbt to insert or update only new or changed rows since the last run. You must provide logic for dbt to identify "new" data, typically via a WHERE clause on an updated timestamp or incrementing ID column. It is ideal for large fact tables where only a small percentage of rows change daily.
  • Ephemeral: dbt does not create a database object. Instead, it compiles the model's logic as a Common Table Expression (CTE) into the parent models that reference it. It helps break down complex SQL without polluting the database schema, but overuse can make debugging difficult.
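Materializations can be set per model in a config block, or for whole folders of models in dbt_project.yml. A sketch (the project name is hypothetical):

```yaml
# dbt_project.yml -- folder-level defaults; an individual model can
# override these with {{ config(materialized='...') }} in its SQL file.
models:
  my_project:            # hypothetical project name
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
```

This keeps the cheap, always-fresh views in staging and reserves full table builds for the business-facing marts.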

Dependency Management: ref() and source()

dbt automatically builds a Directed Acyclic Graph (DAG) of your models to determine the correct execution order. You create this graph using two critical functions.

  • ref(): This function is used to reference other dbt models. Instead of writing FROM analytics.schema.stg_customers, you write FROM {{ ref('stg_customers') }}. This creates a dependency, telling dbt that this model depends on stg_customers. dbt uses this to build models in the correct order and allows you to change a model's materialization without breaking downstream code.
  • source(): This function references your raw data tables that are not managed by dbt. You first define these sources in a sources.yml file, giving them a name and table. In a staging model, you then write FROM {{ source('raw_database', 'users_table') }}. This explicitly separates "raw" from "transformed" data, which is crucial for data lineage and documentation.
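The source referenced above is declared once in YAML; the names below mirror the example in the bullet:

```yaml
# models/staging/sources.yml
version: 2

sources:
  - name: raw_database      # first argument to source()
    schema: raw             # actual warehouse schema (assumed)
    tables:
      - name: users_table   # second argument to source()
```

With this declaration, `{{ source('raw_database', 'users_table') }}` compiles to the fully qualified table name, and the raw table appears as a node in dbt's lineage graph.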

Strategies for Incremental Models

Incremental models are essential for performance but require careful design. The core pattern is expressed in your model's SQL with the is_incremental() macro, which returns true only when the model already exists in the warehouse and dbt is not performing a full refresh.

A typical pattern for a table appended with new rows looks like this:

{{
    config(
        materialized='incremental'
    )
}}

SELECT
    event_id,
    user_id,
    event_time,
    -- ... other columns
FROM {{ source('application', 'events') }}
WHERE 1=1
{% if is_incremental() %}
  -- Look for records newer than the max in the current target table.
  AND event_time > (SELECT MAX(event_time) FROM {{ this }})
{% endif %}

For more complex upsert logic (insert new records, update changed ones), you configure a unique_key and, on adapters that support it, an incremental_strategy such as merge; dbt then generates the warehouse-specific MERGE statement for you. Snowflake and BigQuery both support this strategy. The key is to correctly identify the unique key of your records and the condition for what constitutes a change. Misidentifying this logic is the most common source of errors in incremental models, leading to duplicate or missing data.
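As a sketch of the upsert pattern (the status column is an assumed example of a field that mutates after insert), the append-only example above becomes:

```sql
{{
    config(
        materialized='incremental',
        unique_key='event_id',           -- rows matching on this key are updated
        incremental_strategy='merge'     -- supported on Snowflake, BigQuery, and others
    )
}}

SELECT
    event_id,
    user_id,
    event_time,
    status                               -- mutable column that can change after insert
FROM {{ source('application', 'events') }}
{% if is_incremental() %}
WHERE event_time > (SELECT MAX(event_time) FROM {{ this }})
{% endif %}
```

With unique_key set, rows already in the target are updated rather than duplicated; without it, the same filter would append duplicates whenever a row's event_time changes.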

Common Pitfalls

  1. Defaulting to Table Materialization for Everything: While tables are fast to query, rebuilding multi-terabyte tables on every run is wasteful. Analyze your data volume and freshness needs. Use views for lightly transformed, frequently updated staging models and incremental models for large fact tables.
  2. Overcomplicating Staging Models: The staging layer should be simple. Avoid complex business logic or joins here. Its job is standardization, not calculation. Introducing business logic this early makes debugging harder and staging models less reusable.
  3. Incorrect Incremental Logic: The conditional WHERE clause in an incremental model must perfectly capture all new/changed data and no existing, unchanged data. Using a non-monotonic field (like status that can flip back and forth) or a field with late-arriving data can cause gaps. Always test incremental logic with historical data backfills.
  4. Ignoring the DAG with Raw SQL: Never reference database objects by their direct names (e.g., prod_db.schema.table). Always use {{ ref() }} or {{ source() }}. Bypassing these functions breaks dbt's dependency graph: models may build in the wrong order, and lineage and documentation become incomplete.
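For pitfall 3, the standard way to validate incremental logic is to rebuild the model from scratch and compare it against an incremental run (the model name here is hypothetical):

```shell
# Rebuild the model from scratch, bypassing the is_incremental() filter
dbt run --select fct_events --full-refresh

# Run it incrementally, then check for duplicates or gaps
dbt run --select fct_events
dbt test --select fct_events   # e.g. unique and not_null tests on event_id
```

If the full-refresh and incremental builds disagree on row counts or key uniqueness, the incremental WHERE clause is missing or double-counting data.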

Summary

  • A dbt model is a modular SQL SELECT statement that is materialized into a database object like a view or table, forming the foundation of a transformed, reliable dataset.
  • Organizing models into staging, intermediate, and mart layers creates a clean, maintainable transformation pipeline that separates raw data access, business logic, and final presentation.
  • Choosing the correct materialization (view, table, incremental, ephemeral) is a critical performance decision, balancing compute cost, storage, and data freshness.
  • The ref() and source() functions are essential for declarative dependency management, enabling dbt to build models in the correct order and maintain clear data lineage.
  • Incremental models are key for efficiently transforming large datasets, but they require precise logic to correctly identify new and updated records, making them a powerful yet error-prone feature that demands careful testing.
