Mar 1

dbt Tests and Documentation

Mindli Team

AI-Generated Content


In modern data engineering, reliable data is non-negotiable. dbt (data build tool) empowers you to transform data in your warehouse, but without rigorous testing and clear documentation, your pipelines are brittle and opaque. Implementing robust data quality checks and generating comprehensive documentation turns your dbt project from a script collection into a trustworthy, maintainable data product.

Writing Schema Tests and Custom Data Tests

At its core, dbt testing is about asserting the correctness of your data models. You'll primarily work with two test types: schema tests (called generic tests in dbt 1.0 and later) and custom data tests. Schema tests are built-in, declarative checks you configure in YAML files to validate common constraints on your models. The four built-in tests are:

  • unique: Ensures all values in a specified column are distinct. This is crucial for primary key columns.
  • not_null: Confirms that no null values exist in a column, often applied to critical business identifiers.
  • accepted_values: Validates that every entry in a column matches one from a predefined list. For example, a status column might only accept ('shipped', 'pending', 'cancelled').
  • relationships: Verifies referential integrity by checking that every value in a column exists in a column of another model, defining a parent-child relationship.

You define these tests in your schema.yml files under the model or source definition. Here is a practical configuration for an orders model:

version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']

When the built-in tests are insufficient, you write custom data tests (called singular tests in dbt 1.0 and later). These are SQL SELECT statements saved in your tests directory that return failing rows: dbt executes the query, and if any rows come back, the test fails. For instance, to assert that no order has a negative total amount, you would create a file like tests/positive_order_total.sql:

SELECT order_id, total_amount
FROM {{ ref('orders') }}
WHERE total_amount < 0

This modular approach allows you to encode any business logic or complex data rule as a test, making data quality an integral part of your development workflow.
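The failing-rows convention is easy to demonstrate outside dbt itself. Here is a minimal sketch using Python's sqlite3 as a stand-in for the warehouse; the table and values are illustrative only, not part of any real project:

```python
import sqlite3

# Stand-in for the warehouse: a tiny orders table with one bad row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, total_amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 19.99), (2, 0.0), (3, -5.00)],
)

# The same query a custom dbt test compiles to: any rows returned are failures.
failing_rows = conn.execute(
    "SELECT order_id, total_amount FROM orders WHERE total_amount < 0"
).fetchall()

test_passed = len(failing_rows) == 0
print(failing_rows)  # [(3, -5.0)]
print(test_passed)   # False
```

An empty result set means the assertion holds; dbt applies exactly this pass/fail rule to each custom test file.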

Configuring Test Severity Levels

Not all test failures are equally urgent. dbt allows you to set severity levels to control how a failure impacts your project's execution. The two levels are error (the default) and warn. When a test with severity error fails, dbt test exits with a failure, and in a dbt build run the resources downstream of the failed test are skipped. This is appropriate for critical data integrity rules, like a broken primary key. A test with severity warn logs the failure but does not fail the run, useful for monitoring non-blocking issues like a gradual increase in null values in a non-critical field.

You configure severity in your YAML file or directly in a custom test using a config block. For example, to set a not_null test on a notes column as a warning:

columns:
  - name: notes
    tests:
      - not_null:
          severity: warn

In a custom SQL test, you set it at the top of the file:

{{ config(severity = 'warn') }}
SELECT user_id FROM {{ ref('users') }} WHERE last_login_date IS NULL

Understanding and applying severity strategically lets you balance rigorous quality checks with operational flexibility, ensuring critical pipelines aren't halted for minor anomalies while still tracking them.
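Beyond the binary error/warn split, severity can also be made conditional on the number of failing rows using the warn_if and error_if configs. A sketch, reusing the notes column from the example above:

```yaml
columns:
  - name: notes
    tests:
      - not_null:
          config:
            severity: error
            warn_if: ">10"     # warn when more than 10 rows fail
            error_if: ">100"   # error only when more than 100 rows fail
```

With severity set to error, dbt evaluates error_if first and falls back to warn_if, so small numbers of failures surface as warnings while large ones still block.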

Generating Documentation with Descriptions and Doc Blocks

Documentation in dbt serves a dual purpose: it describes your data assets for stakeholders and renders in the auto-generated documentation site. You write descriptions for your models, columns, sources, and macros in the same YAML files used for tests, and longer reusable text can be pulled in with the {{ doc() }} function. This inline approach keeps documentation close to the code it describes.

For instance, augmenting the earlier orders model with descriptions:

models:
  - name: orders
    description: "Core fact table recording all customer transactions."
    columns:
      - name: order_id
        description: "Primary key for the orders table, generated sequentially."
        tests:
          - unique
          - not_null

For longer or reusable text, you use doc blocks. These are named snippets of Markdown defined in .md files inside your project (for example, a dedicated docs.md) using the {% docs %} Jinja tag. This avoids duplication and centralizes common explanations. First, define a block:

{% docs revenue_calculation %}
Revenue is calculated as `quantity * unit_price` after applying any applicable promotional discounts. This logic is consistent across all fact tables.
{% enddocs %}

Then, reference it in a model or column description using the {{ doc() }} Jinja function:

- name: revenue
  description: "{{ doc('revenue_calculation') }}"

When you generate your project's documentation site, these descriptions and doc blocks render as readable text, turning your data dictionary into a living, queryable resource.
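On most adapters, these same descriptions can also be persisted into the warehouse itself as table and column comments via the persist_docs config. A sketch for dbt_project.yml, where my_project is a placeholder for your own project name:

```yaml
models:
  my_project:           # placeholder: replace with your project name
    +persist_docs:
      relation: true    # write model descriptions as table comments
      columns: true     # write column descriptions as column comments
```

This lets analysts who query the warehouse directly see your documentation without opening the dbt docs site.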

Serving the dbt Documentation Site

Once you've written tests and documentation, you can compile and serve a professional, web-based dbt documentation site. This site automatically generates a dependency graph (DAG) of your models and presents all the descriptions, column metadata, and test results in a searchable interface. It's the single source of truth for your data team's work.

You generate the site by running dbt docs generate. This command compiles your project's metadata, including the catalog of models and their documentation, into static JSON files. To view it locally, you then run dbt docs serve, which starts a local web server (typically on port 8080). In cloud-based dbt environments like dbt Cloud, the documentation site is automatically generated and hosted for you after each run. The site allows any analyst or engineer to explore data lineage, understand column definitions, and see when a model was last updated, dramatically reducing the time spent answering "what does this column mean?" or "where does this data come from?".
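The two commands can be run in sequence from the project root:

```shell
dbt docs generate            # compiles manifest.json and catalog.json into target/
dbt docs serve --port 8080   # serves the compiled site on localhost:8080
```

The --port flag is optional; 8080 is the default shown here for clarity.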

Integrating dbt Test Results into CI/CD Pipelines

For production-grade data pipelines, testing shouldn't be an afterthought; it must be an automated gate. Integrating dbt test results into CI/CD pipelines creates data quality gates that prevent flawed code from reaching your data warehouse. A typical workflow involves running dbt test (or dbt build) as a step in your continuous integration process whenever a pull request is opened or code is merged.

In a tool like GitHub Actions, you would create a workflow file that sets up your data warehouse credentials, checks out the code, installs dbt, and executes tests. The pipeline is configured to fail if any test with severity error does not pass. Here’s a conceptual outline:

name: dbt Test Suite
on: [pull_request]
jobs:
  run-dbt-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: Install dbt and dependencies
        run: |
          pip install dbt-<your-adapter>
      - name: Install dbt packages
        run: dbt deps
      - name: Run dbt tests
        env:
          DBT_PROFILES_DIR: ./
        run: dbt test

This automation ensures that every change is validated against your data quality rules. For advanced monitoring, you can pipe test results to logging platforms or set up alerts for warn-level failures. By making tests a mandatory checkpoint, you institutionalize data reliability and foster a culture where data quality is everyone's responsibility.
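To keep CI fast on large projects, dbt also supports state-based selection, so a pull request only builds and tests what actually changed. A sketch, assuming production artifacts (including manifest.json) have been downloaded to ./prod-artifacts:

```shell
# Build and test only modified models and everything downstream of them,
# deferring references to unchanged models to the production environment
dbt build --select state:modified+ --defer --state ./prod-artifacts
```

This "slim CI" pattern trades full-project coverage for much shorter feedback loops on each pull request.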

Common Pitfalls

  1. Over-testing Everything: Applying not_null and unique tests to every column can cripple performance and create noise. Correction: Adopt a risk-based approach. Prioritize tests on key business metrics, identifiers, and columns used in critical joins. Use generic tests for broad coverage and custom tests for specific, high-value business logic.
  2. Ignoring Test Severity: Treating all tests as error-level can make pipelines fragile, while setting all to warn means critical failures are ignored. Correction: Classify tests intentionally. Use error for integrity-breaking conditions (e.g., broken foreign keys) and warn for informational or trending checks (e.g., a dip in row count that needs investigation).
  3. Sparse or Inconsistent Documentation: Writing descriptions for only some models or using inconsistent terminology renders the documentation site unreliable. Correction: Enforce a project standard. dbt has no built-in test for missing descriptions, but packages such as dbt-project-evaluator or pre-commit tooling like dbt-checkpoint can flag undocumented models and columns; enforce one of these in code reviews or CI.
  4. Letting CI/CD Tests Run on Production Data: Running your full test suite in CI/CD against production can be slow and resource-intensive. Correction: Use a dedicated staging or CI database. Configure your dbt profile in CI to target a clone of production or a subset of data to validate logic efficiently without impacting live systems.

Summary

  • dbt schema tests like unique, not_null, accepted_values, and relationships provide declarative, built-in data quality checks configured in YAML files, while custom data tests written in SQL allow you to enforce any business rule.
  • Test severity levels (error and warn) give you control over pipeline behavior, letting you distinguish between blocking failures and advisory warnings.
  • Comprehensive documentation is created by adding descriptions to models and columns in YAML and using reusable doc blocks for consistent, centralized explanations.
  • The dbt documentation site, generated via dbt docs generate, offers an interactive, web-based portal to explore data lineage and metadata, serving as a vital resource for your entire team.
  • Integrating dbt tests into CI/CD pipelines automates data quality gating, ensuring every code change is validated against your quality standards before deployment, which is essential for maintaining trustworthy data products.
