dbt Seeds and External Sources
In any data pipeline, the data you transform is only as reliable as the data you bring in. While dbt excels at transforming data already in your warehouse, you must also manage static reference files and integrate raw tables from other ingestion tools. dbt seeds and dbt sources are the two primary mechanisms for this, forming the critical foundation upon which all your models depend. Mastering when and how to use each ensures your project is robust, documented, and easy to maintain.
The Two Pillars: Seeds for Static Data, Sources for Raw Tables
Before building models, you need to establish the origin points of your data. dbt provides two distinct, complementary features for this purpose. dbt seeds are CSV files stored within your dbt project directory that are loaded directly into your data warehouse as tables. They are ideal for small, static datasets that change infrequently, such as country codes, product categories, or static mapping files. Conversely, dbt sources are used to define and document the raw tables already existing in your warehouse that were loaded by an external tool (like Fivetran, Stitch, or an in-house pipeline). Using sources creates a layer of abstraction, allowing you to reference, test, and monitor these upstream tables without hard-coding their names.
Working with dbt Seeds for Reference Data
A seed is fundamentally a CSV file placed in your project's seeds directory. When you run the dbt seed command, dbt reads the CSV and creates a corresponding table in your schema. The primary use case is for lookup tables and mapping files. For example, a seeds/country_codes.csv file containing columns for country_id and country_name can be seeded to create a reliable, version-controlled reference table.
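For instance, a minimal seed file might look like the following (the exact columns and values are illustrative):

```csv
country_id,country_name
US,United States
GB,United Kingdom
DE,Germany
```

Running `dbt seed` would materialize this file as a `country_codes` table in your target schema, ready to be joined in downstream models.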
Configuration for seeds is typically done in your dbt_project.yml file. You can specify the schema where seed tables are built, apply column data types, and define tests. Since seeds are part of your codebase, they benefit from version control, providing a clear audit trail for changes to critical reference data. However, seeds are not designed for large or frequently updated datasets; each dbt seed run performs a full refresh, which becomes inefficient as row counts grow into the thousands.
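As a sketch, seed configuration in dbt_project.yml might look like this (the project name `my_project` and the `reference_data` schema are assumptions for illustration):

```yaml
seeds:
  my_project:
    # Build all seed tables in a dedicated schema
    +schema: reference_data
    country_codes:
      # Enforce explicit column types instead of letting dbt infer them
      +column_types:
        country_id: varchar(2)
        country_name: varchar(100)
```

Pinning column types this way prevents type inference surprises, such as a zero-padded code being loaded as an integer.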
Configuring dbt Sources for External Data
Sources declare the raw tables in your warehouse that are produced by external loaders. You define them in a YAML file (commonly models/sources.yml). This configuration does not create tables; it informs dbt about tables that already exist. A basic source definition includes the source name, database and schema, and the list of tables. The power of sources comes from their advanced configuration options.
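A basic definition, assuming a Stripe sync landing in a `raw` database, could look like this:

```yaml
version: 2

sources:
  - name: stripe          # logical name used in source() calls
    database: raw         # assumed database name
    schema: stripe        # schema the external loader writes to
    tables:
      - name: payments
      - name: customers
```

Models can then reference these tables as `{{ source('stripe', 'payments') }}` instead of hard-coding `raw.stripe.payments`.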
Source quoting handles instances where your source table or schema name contains special characters, spaces, or is a reserved SQL keyword. By setting quoting in your source definition, you instruct dbt to wrap the identifier in the appropriate characters (e.g., double quotes for Snowflake, backticks for BigQuery). Loader-specific metadata can be added using the meta key. For instance, you might tag tables loaded by Fivetran to easily group them in documentation.
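A hedged sketch of both options together, assuming a legacy schema whose name contains a space and a table named after a reserved keyword:

```yaml
sources:
  - name: legacy_app
    schema: "App Data"      # schema name contains a space
    quoting:
      database: false
      schema: true          # emit "App Data" rather than App Data
      identifier: true      # quote table names too
    meta:
      loader: fivetran      # surfaces in generated documentation
    tables:
      - name: user          # reserved SQL keyword; needs quoting
```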
The most critical feature for pipeline reliability is the freshness check. You can configure a loaded_at_field (like a _loaded_at timestamp column added by your loader) and a freshness threshold (e.g., warn_after: {count: 12, period: hour}). Running dbt source freshness checks if the most recent record in that field is within the threshold, alerting you if an upstream data pipeline has stalled.
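Putting this together, a freshness configuration might look like the following (the `_loaded_at` column name is whatever timestamp your loader actually writes):

```yaml
sources:
  - name: stripe
    schema: stripe
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: payments
```

With this in place, `dbt source freshness` emits a warning once the newest `_loaded_at` value is more than 12 hours old, and an error past 24 hours.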
When to Use Seeds Versus Sources
Choosing the right tool is essential for a clean project architecture. Use dbt seeds when you have complete ownership over a small, static dataset that is integral to your project's logic. The data is stored in your dbt repository. Examples include:
- Static dimension mappings (e.g., internal status codes to human-readable labels).
- A fixed list of valid product SKUs for validation.
- Holiday calendars or fiscal period definitions.
Use dbt sources for any table that is created outside of dbt's control. This is the majority of your raw data. Examples include:
- Tables synced by SaaS ELT tools (Fivetran, Stitch).
- Data piped in via custom Apache Airflow DAGs or Spark jobs.
- Raw tables created by another team's process within the same data platform.
A key heuristic: if you need to edit the data values, it's likely a candidate for a seed (because you edit the CSV). If you only configure how dbt should react to the data (test it, check its freshness), it's a source.
Organizing Dependencies and Automating Freshness
A well-structured dbt project clearly organizes these external data dependencies. All source definitions should be consolidated in a dedicated sources.yml file or directory for easy discovery. Your models should reference sources using the {{ source('source_name', 'table_name') }} Jinja function, not raw table names. This abstraction is powerful; if a raw table changes location, you only update the sources.yml file, not every model that references it.
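A staging model built on top of such a source might look like this (the model name, column names, and the cents-to-dollars conversion are illustrative):

```sql
-- models/staging/stg_payments.sql
select
    id as payment_id,
    amount / 100.0 as amount_usd,   -- assumes the loader stores cents
    created_at
from {{ source('stripe', 'payments') }}
```

Because the model depends on `source()` rather than a literal table name, dbt also draws this dependency in the lineage graph automatically.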
Automating source freshness monitoring is a cornerstone of pipeline reliability. The dbt source freshness command should be integrated into your orchestration tool (like Airflow, Prefect, or Dagster) to run on a schedule, independent of your model runs. The output can be logged and connected to alerting systems (Slack, PagerDuty). This creates a proactive check, ensuring that failures in upstream ingestion are caught before they cascade into your downstream dbt models, which might otherwise run on stale or missing data.
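As one assumed setup (GitHub Actions is just an example scheduler; the adapter and paths are placeholders), a scheduled job running the freshness check independently of model builds might look like:

```yaml
# .github/workflows/source_freshness.yml (illustrative)
name: source-freshness
on:
  schedule:
    - cron: "0 * * * *"   # every hour
jobs:
  freshness:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake        # adapter is an assumption
      - run: dbt source freshness --profiles-dir ./profiles
```

A non-zero exit code from `dbt source freshness` can then trigger your alerting channel before any downstream `dbt run` consumes stale data.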
Common Pitfalls
- Using Seeds for Large or Dynamic Data: Attempting to seed a 500MB CSV file will cause slow runs and warehouse inefficiency. Seeds are for reference data, not operational data. For large static files, use your warehouse's native bulk load utility and then define the resulting table as a source.
- Neglecting Freshness Configuration on Critical Sources: Failing to define a loaded_at_field and freshness criteria for core data pipelines leaves you blind to upstream failures. Always configure freshness for mission-critical sources.
- Hard-Coding Raw Table Names in Models: Writing FROM raw_payments instead of {{ source('stripe', 'payments') }} creates brittle code. If the raw_payments table moves, you face a labor-intensive search-and-replace task. Using the source() function future-proofs your SQL.
- Misconfiguring Source Quoting: If a source table is named user (a reserved SQL keyword) and you don't enable quoting, your dbt run will fail with a syntax error. Always review your external table names and set the quoting key appropriately in your source definition.
Summary
- dbt seeds are for small, static CSV files (like lookup tables) stored within your dbt project and loaded directly into the warehouse.
- dbt sources define and document raw tables already in your warehouse that are loaded by external tools, enabling abstraction, testing, and monitoring.
- Configure source freshness checks using a loaded_at_field to proactively monitor the health of your upstream data pipelines and ensure reliability.
- Use source quoting and loader-specific metadata to correctly handle complex table identifiers and enrich your data documentation.
- Structure your project by referencing sources with the {{ source() }} function and automate freshness checks in your orchestration to build a resilient data foundation.