Data Versioning with DVC
Machine learning projects are fundamentally about the interplay between code and data, yet traditional version control systems like Git fail miserably at handling large datasets. Data Version Control (DVC) solves this critical gap, transforming chaotic experiments into reproducible, collaborative workflows. By treating your data and machine learning models with the same rigor as your source code, DVC enables you to track every change, reproduce any past result, and confidently share your work with teams.
The Core Problem: Why Git Isn't Enough for Data
When you commit a 100MB dataset change to a Git repository, you bloat the repository history, slow down every clone operation, and quickly hit platform file size limits. Git was designed for source code—text files that are relatively small and change in discrete, line-based increments. Data files, model weights, and evaluation metrics are large binary files that change in their entirety. DVC acts as an extension to Git, creating a seamless bridge between your code repository and your data. Instead of committing the actual data file to Git, DVC stores a small, human-readable .dvc file that points to the real data, which is stored elsewhere. This .dvc file contains a unique identifier (hash) for the data, which DVC calculates. You version this lightweight pointer file in Git, while DVC manages the heavy data elsewhere.
For example, if you have a dataset file train.csv, running dvc add train.csv does two things: it moves train.csv to a special DVC cache directory (usually .dvc/cache) and creates a train.csv.dvc file. You then commit train.csv.dvc to Git. The actual CSV file is listed in your .gitignore automatically. This elegantly separates the versioning of code and data while keeping them perfectly synchronized.
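The workflow above can be sketched as a short shell session (this requires Git and DVC installed; the file names are illustrative):

```shell
# One-time setup inside an existing Git repository
git init
dvc init

# Track the dataset: moves train.csv into .dvc/cache and writes train.csv.dvc
dvc add train.csv

# Version the lightweight pointer file, not the data itself
# (dvc add also appends train.csv to .gitignore for you)
git add train.csv.dvc .gitignore
git commit -m "Track train.csv with DVC"
```

From this point on, Git history records which version of the data each commit used, without ever containing the data itself.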
Setting Up Remote Storage: S3, GCS, and Azure
Storing versioned data only on your local machine defeats the purpose of collaboration and backup. DVC integrates with all major cloud and on-premises storage solutions as remote storage backends. This means your data cache can be pushed to a centralized location that your entire team can access. Configuring a remote is a one-time setup that unlocks true collaboration.
The process is straightforward. First, choose your storage provider. For Amazon S3, the command would be:
```shell
dvc remote add -d myremote s3://mybucket/dvc-storage
```

The -d flag sets this as your default remote. You'll need to configure your cloud credentials (e.g., via environment variables like AWS_ACCESS_KEY_ID). The commands for Google Cloud Storage (GCS) or Microsoft Azure Blob Storage are nearly identical, simply changing the protocol to gs:// or azure://. Once configured, you use dvc push to upload cached data to the remote and dvc pull to download it. This ensures that when a colleague clones your Git repository and runs dvc pull, they automatically retrieve the correct version of the data referenced by the .dvc files in the commit they checked out.
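As a sketch, the equivalent setup for the other providers and the day-to-day sync commands look like this (the bucket and container names are hypothetical):

```shell
# Equivalent remotes for Google Cloud Storage and Azure Blob Storage
dvc remote add gcsremote gs://mybucket/dvc-storage
dvc remote add azremote azure://mycontainer/dvc-storage

# Upload the local DVC cache to the default remote
dvc push

# On another machine: check out the code, then fetch the matching data
git pull
dvc pull
```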
Building Reproducible Pipelines with dvc.yaml
Reproducibility is more than just tracking data; it's about capturing the entire process that transforms that data into a model or result. DVC pipelines allow you to define a sequence of pipeline stages—such as preprocessing, feature engineering, training, and evaluation—in a dvc.yaml file. Each stage specifies its dependencies (input data and code), command to run, and outputs (new data, models, or metrics). DVC then builds a dependency graph and only runs stages whose dependencies have changed, similar to a build system like Make.
Here is a simplified example of a dvc.yaml file:
```yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw.csv
      - src/prepare.py
    outs:
      - data/prepared.csv
  train:
    cmd: python src/train.py
    deps:
      - data/prepared.csv
      - src/train.py
    params:
      - train.learning_rate
    outs:
      - model.pkl
    metrics:
      - metrics.json:
          cache: false
```

To run the pipeline, you execute dvc repro. DVC checks the hashes of all dependencies. If data/raw.csv changes, it will rerun the prepare stage and, because data/prepared.csv is now different, it will automatically rerun the train stage. This guarantees that your final model is always in sync with your data and code, providing a single command to reproduce the entire experiment.
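The params entry in the train stage refers to a params.yaml file in the project root, which DVC reads by default. A minimal sketch (the values and the extra epochs key are illustrative):

```yaml
# params.yaml -- hyperparameters referenced by dvc.yaml
train:
  learning_rate: 0.01
  epochs: 20
```

Because train.learning_rate is declared as a dependency of the train stage, changing it in params.yaml and running dvc repro will rerun training, just as a code or data change would.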
Tracking Experiments with Metrics and Plots
Iterative development requires comparing experiments. DVC provides built-in, lightweight tooling for tracking metrics and generating plots. In the pipeline example above, the train stage outputs a metrics.json file. DVC can track this file's contents over different Git commits or branches. After running several experiments, you can compare all results with a simple command like dvc metrics diff or dvc params diff.
Plots take this further. You can configure DVC to track structured outputs like validation loss curves or confusion matrices. By defining plots in dvc.yaml, DVC can aggregate these files across commits and generate comparative visualizations. For instance, you can plot the AUC-ROC curves from five different experiments on a single chart with dvc plots diff. This native experiment tracking is purpose-built for the ML workflow, allowing you to quickly identify which model configuration performed best without needing a separate, heavy-weight platform during early research.
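As a sketch, a top-level plots section in dvc.yaml might describe a per-epoch log file so that dvc plots diff can overlay experiments; the file name and column names here are assumptions:

```yaml
# Top-level plots section in dvc.yaml (alongside the stages section)
plots:
  - losses.csv:
      x: epoch
      y: val_loss
```

With this in place, dvc plots diff renders the val_loss curves from the selected commits or branches on one chart.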
Integrating DVC with CI/CD for Automation
The ultimate test of a reproducible pipeline is its ability to run automatically. Integrating DVC with CI/CD systems like GitHub Actions or GitLab CI ensures that every code change triggers validation of the corresponding data pipeline. This automates testing and validation, catching errors before they reach production.
A typical CI/CD pipeline for a DVC project follows these steps:
- Checkout Code & DVC Cache: The CI runner clones the Git repo and fetches the DVC remote configuration.
- Pull Data: It runs dvc pull to download the data artifacts associated with the current commit.
- Reproduce Pipeline: It executes dvc repro to run the pipeline from scratch, verifying that all stages complete successfully with the pulled data.
- Run Tests: It executes any unit or integration tests on the newly generated outputs or models.
- Push New Data (Optional): If the CI run was triggered by a process that generates new data (e.g., retraining on a schedule), it can then run dvc push to update the remote storage.
This automation enforces that the pipeline is self-contained and reproducible in a clean environment. It also allows for automated reporting, such as commenting a metrics comparison on a Pull Request whenever new model code is submitted.
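A minimal GitHub Actions workflow following these steps might look like the sketch below; the secret names, Python version, and requirements file are assumptions about the project:

```yaml
# .github/workflows/dvc-pipeline.yml -- illustrative sketch
name: dvc-pipeline
on: [pull_request]

jobs:
  reproduce:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt  # assumed to include dvc[s3]
      - name: Pull data for this commit
        run: dvc pull
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Reproduce pipeline
        run: dvc repro
      - name: Run tests
        run: pytest tests/
```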
Common Pitfalls
1. Forgetting to Configure and Use a Remote Storage Backend.
- The Mistake: Using DVC only locally without setting up a remote like S3. Your .dvc files are in Git, but the actual data exists only in your local cache. If your hard drive fails or you need to collaborate, the data is lost or inaccessible.
- The Correction: Always configure a remote storage backend immediately after initializing DVC. Make dvc push and dvc pull habitual parts of your workflow, just like git push and git pull.
2. Manually Modifying DVC-Tracked Files.
- The Mistake: Directly editing a file that is under DVC control (e.g., opening data/prepared.csv in Excel and saving it). This breaks the link between the .dvc file and the data in the cache, causing dvc status to show the file as "modified" without a clear way to commit the change.
- The Correction: Use DVC commands or your pipeline to update data. To properly update a tracked file, run dvc unprotect on it first, make your changes, then run dvc commit to recalculate its hash and update the .dvc file. Better yet, design your workflow so that all data modifications happen as the output of a defined pipeline stage.
3. Confusing DVC's Role with Git's.
- The Mistake: Thinking dvc commit is a substitute for git commit. DVC only updates the state of data files in the project and their corresponding .dvc pointer files. It does not version anything in the Git history.
- The Correction: Follow a two-step commit process: First, run dvc commit to finalize changes to data (updating .dvc files). Second, run git commit to snapshot the updated .dvc files and your code changes into the Git repository. They are complementary, not interchangeable, commands.
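The two-step process can be sketched as a shell session (file paths are illustrative):

```shell
# Step 1: update DVC's record of the data
# (rewrites the hashes in the .dvc pointer files and dvc.lock)
dvc commit

# Step 2: snapshot the updated pointer files and code into Git history
git add data/prepared.csv.dvc dvc.lock src/
git commit -m "Update prepared data and training code"

# Finally, sync the data itself to the remote so teammates can pull it
dvc push
```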
Summary
- DVC extends Git to handle large data and models by storing lightweight pointer files (.dvc) in Git and the actual content in a separate cache, which can be synced to cloud storage.
- Remote storage backends (S3, GCS, Azure) are essential for collaboration and backup, allowing team members to dvc pull the correct data version for any Git commit.
- Reproducible pipelines are defined in dvc.yaml, where DVC manages a dependency graph to automatically run only the stages whose inputs have changed, guaranteeing consistent results.
- Built-in experiment tracking with dvc metrics and dvc plots provides a straightforward way to compare model performance across different Git commits or branches without external tools.
- CI/CD integration automates pipeline execution and testing, validating that every code change can reproduce its associated data artifacts and models in a clean environment, a cornerstone of robust MLOps.