DVC - Data Versioning
This article also lives on GitHub
About DVC (Data Version Control)
- What's DVC?
  - version control system for data science and machine learning
  - compatible with git (it's based on git)
- What can DVC do?
  - track
    - data
    - model
    - pipeline
    - metrics
  - use storage directly
    - no external services needed
- Who are the targeted users of DVC?
  - ML researchers / engineers
  - DevOps & Engineers
- Why DVC?
  - It links your data, model, and pipelines with your metrics.
    - reproducibility
    - trackable
Read DVC - Versioning Data and Models for more use cases
I suggest using pipx if you want to install DVC globally. However, an even better way is to install it inside your project's virtual environment.
$ pip install pipx
$ pipx install dvc
$ dvc --version
2.3.0
DVC also provides Shell Completion and Syntax Highlighting Plugins for popular editors.
I'll use dvc_example to demonstrate how I applied DVC to an existing machine learning project. The example is based on Recognizing hand-written digits from the scikit-learn documentation. All the DVC parts start from the v1-base tag. You can git checkout to that tag to follow along.
$ git clone https://github.com/Lee-W/dvc_example/ --branch v1-base
$ cd dvc_example
$ tree
.
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── digit_recognizer
│ ├── __init__.py
│ └── digit_recognizer.py
├── docs
│ └── README.md
├── mkdocs.yml
├── output
└── tasks.py
To set up the development environment, you'll need pipenv and invoke. If you run into an error when running pipenv install, you can run export SYSTEM_VERSION_COMPAT=1 before it. It's an open pipenv issue as of now (Issue with NumPy, macOS 11 Big Sur, Python 3.9.1; Does pipenv not use the latest pip? #4564). Or, you can just run the following commands.
# install needed tools
pipx install pipenv
pipx install invoke
# set up environments
invoke init-dev
We'll use digit_recognizer/digit_recognizer.py for training a model that can recognize handwritten digits.
def main():
    X, y = load_data()
    X_train, X_test, y_train, y_test = process_data(X, y)
    model = train_model(X_train, y_train)
    predicted_y = model.predict(X_test)
    output_results(y_test, predicted_y)
    output_metrics(y_test, predicted_y)
pipenv install dvc
If you want to save data to remote storage, you might need to install extra dependencies (e.g., pipenv install dvc[s3]).
- Supported types
  - [s3]
  - [azure]
  - [gdrive]
  - [gs]
  - [oss]
  - [ssh]
Or, use pipenv install dvc[all] to install them all.
Read dvc remote for more information
# initialize DVC configurations
$ pipenv run dvc init
# see what's created by DVC
$ tree .dvc
.dvc
├── config
└── plots
    ├── confusion.json
    ├── confusion_normalized.json
    ├── default.json
    ├── linear.json
    ├── scatter.json
    └── smooth.json
# track DVC configuration through git
$ git add .dvc
# git commit
$ pipenv run cz commit
I'll use another local directory, ../dvc_remote, as our remote storage. You can change it to S3 or other remote storage.
mkdir ../dvc_remote
dvc remote add --default local ../dvc_remote
With the --default flag, we can push to and pull from the local remote without specifying the remote name.
Let's see what's changed in .dvc/config.
$ cat .dvc/config
[core]
    remote = local
['remote "local"']
    url = ../../dvc_remote
The url is ../../dvc_remote instead of ../dvc_remote because it's relative to the location of the config file, the .dvc directory. As we've not yet pushed anything to our pseudo remote, ../dvc_remote is still empty.
So far, the data is loaded through sklearn.datasets.load_digits. We're going to change it to read from static files in data/.
def load_data():
    # Load data
    digits = datasets.load_digits()
    ...
We can use the following script to write the digit data into data/. Note that it's a one-time script, so we won't add it to git.
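import os

import pandas as pd
from sklearn import datasets

# Create the output directory (no error if it already exists)
os.makedirs("data", exist_ok=True)

digits = datasets.load_digits()

# Features: one row per image, one column per pixel
df = pd.DataFrame(digits.data)
df.to_csv("data/digit_data.csv", header=False, index=False)

# Labels: one digit per row
df = pd.DataFrame(digits.target)
df.to_csv("data/digit_target.csv", header=False, index=False)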
We'll need to change the load_data and main functions to read data from these files.
import csv


def load_data(X_path, y_path):
    # Read the features: one list of numbers per image
    with open(X_path, newline="") as input_file:
        csv_reader = csv.reader(input_file, quoting=csv.QUOTE_NONNUMERIC)
        X = list(csv_reader)

    # Read the labels: the single column of each row
    with open(y_path, newline="") as input_file:
        csv_reader = csv.reader(input_file, quoting=csv.QUOTE_NONNUMERIC)
        y = [row[0] for row in csv_reader]

    return X, y

......

def main():
    X, y = load_data("data/digit_data.csv", "data/digit_target.csv")
    ......
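The quoting=csv.QUOTE_NONNUMERIC argument is what turns the CSV text into numbers. The following small sketch shows the effect on an in-memory sample. The values are made up and only mimic the format of data/digit_data.csv.

```python
import csv
import io

# Two rows in the same headerless, unquoted format as data/digit_data.csv
# (the values themselves are made up for illustration)
sample = io.StringIO("0.0,5.0,13.0\n1.0,2.0,16.0\n")

# With QUOTE_NONNUMERIC, csv.reader converts every unquoted field to float
rows = list(csv.reader(sample, quoting=csv.QUOTE_NONNUMERIC))

print(rows)  # [[0.0, 5.0, 13.0], [1.0, 2.0, 16.0]]
```

Without that argument, every field would come back as a string and would need to be converted before training.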
Run pipenv run python digit_recognizer/digit_recognizer.py to check whether everything works as expected. If so, commit these code changes to git.
Next, add data/ to DVC.
$ pipenv run dvc add data
100% Add|████████████████|1/1 [00:00, 2.14file/s]
To track the changes with git, run:
git add data.dvc .gitignore
dvc add creates a data.dvc file to track data/ and adds data/ to .gitignore so that data/ is tracked only through DVC, not git.
# Track the DVC files with git
git add .gitignore data.dvc
# git commit
pipenv run cz commit
In data.dvc, we can see that 2 files (digit_data.csv and digit_target.csv) are tracked.
$ cat data.dvc
outs:
- md5: b8d81f4964ecb86739c79c833fb491f3.dir
  size: 494728
  nfiles: 2
  path: data
Push the tracked data to the DVC remote.
dvc push
See what's changed in our remote storage, ../dvc_remote.
$ tree ../dvc_remote
../dvc_remote
├── 02
│ └── b861b6dc8e08da6d66547860f69277
├── 8c
│ └── ba569595920d230ade453b150f372b
└── b8
    └── d81f4964ecb86739c79c833fb491f3.dir
3 directories, 3 files
The md5 value of our tracked data is b8d81f4964ecb86739c79c833fb491f3.dir. There's also a corresponding file at ../dvc_remote/b8/d81f4964ecb86739c79c833fb491f3.dir.
$ cat ../dvc_remote/b8/d81f4964ecb86739c79c833fb491f3.dir
[{"md5": "02b861b6dc8e08da6d66547860f69277", "relpath": "digit_data.csv"}, {"md5": "8cba569595920d230ade453b150f372b", "relpath": "digit_target.csv"}]
This file lists the md5 value of each data file, which tells us where the actual data is stored in ../dvc_remote.
In conclusion, if we want to know how data is stored through DVC, we can
- find the md5 value in *.dvc in our project
- find the path that matches this md5 value in our remote storage
- use the md5 values listed in that file to find the data files in our remote storage
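The lookup steps above can be sketched in Python. This assumes the directory layout shown in the tree output (the first two characters of the md5 value form the directory name, the rest the file name). object_path and files_in_directory are hypothetical helpers, not DVC APIs, and the demo runs against a throwaway directory that mimics ../dvc_remote.

```python
import json
import os
import tempfile


def object_path(remote_dir, md5):
    """Return where an object with this md5 lives: <first 2 chars>/<rest>."""
    return os.path.join(remote_dir, md5[:2], md5[2:])


def files_in_directory(remote_dir, dir_md5):
    """Read a .dir listing and map each relpath to its object path."""
    with open(object_path(remote_dir, dir_md5)) as listing_file:
        entries = json.load(listing_file)
    return {e["relpath"]: object_path(remote_dir, e["md5"]) for e in entries}


# Demo: rebuild the .dir listing from this post inside a temporary directory
with tempfile.TemporaryDirectory() as remote:
    dir_md5 = "b8d81f4964ecb86739c79c833fb491f3.dir"
    listing = [
        {"md5": "02b861b6dc8e08da6d66547860f69277", "relpath": "digit_data.csv"},
        {"md5": "8cba569595920d230ade453b150f372b", "relpath": "digit_target.csv"},
    ]
    os.makedirs(os.path.join(remote, "b8"))
    with open(object_path(remote, dir_md5), "w") as f:
        json.dump(listing, f)

    located = files_in_directory(remote, dir_md5)
    # Show the paths relative to the remote, as in the tree output
    relative = {name: os.path.relpath(path, remote) for name, path in located.items()}

print(relative)
```

Other DVC versions may lay out the remote differently, so treat the two-character split as an observation from this example rather than a guarantee.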
But most of the time, we don't need to do so. We can leave the tracking work to DVC.
# temporarily delete our data locally
$ rm -rf data
# check whether DVC actually tracks our data
$ dvc status
data.dvc:
changed outs:
deleted: data
# bring our data back from remote storage
$ dvc checkout data
data
├── digit_data.csv
└── digit_target.csv
To demonstrate how DVC tracks data changes, let's remove the last 2 rows from data/digit_data.csv and data/digit_target.csv.
# check what's changed
$ dvc status
data.dvc:
changed outs:
modified: data
# Add these changes to DVC and git
$ dvc add data
$ git add data.dvc
# git commit
$ pipenv run cz commit
# Push these changes to our remote storage
$ dvc push
The md5 value has been changed, and the size of our data is smaller than our previous record, 494728.
$ cat data.dvc
outs:
- md5: a333e114a49194e823ab9a4fa9e33ee9.dir
  size: 494172
  nfiles: 2
  path: data
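Why does removing rows produce a new md5 value? A hash is computed from the content, so any change in content yields a different value. The sketch below uses plain hashlib on made-up bytes. DVC's own hashing of files and directories may differ in details, so don't expect these digests to match the ones in data.dvc.

```python
import hashlib


def md5_of(content: bytes) -> str:
    """Hex md5 digest of some bytes (plain hashlib, not DVC's implementation)."""
    return hashlib.md5(content).hexdigest()


# A tiny stand-in for a CSV file; the values are made up
original = b"0.0,5.0\n1.0,2.0\n3.0,4.0\n"

# Drop the last row, as we did with the real data
trimmed = b"".join(original.splitlines(keepends=True)[:-1])

print(md5_of(original) == md5_of(trimmed))  # False
print(len(original) - len(trimmed))         # 8 (bytes in the removed row)
```

The same content always hashes to the same value, which is why unchanged files are not stored twice in the remote.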
More files are added to ../dvc_remote due to the data changes. You can follow the steps in the previous section to see what is actually stored.
$ tree ../dvc_remote
../dvc_remote
├── 02
│ └── b861b6dc8e08da6d66547860f69277
├── 2a
│ └── 6cfa13365ac9b3af5146133aca6789
├── 8c
│ └── ba569595920d230ade453b150f372b
├── 94
│ └── 2481fce846fb9750b7b8023c80a5ef
├── a3
│ └── 33e114a49194e823ab9a4fa9e33ee9.dir
└── b8
    └── d81f4964ecb86739c79c833fb491f3.dir
6 directories, 6 files
Let's git checkout to the previous git commit to see what happens if we only revert the changes in data.dvc.
# or "git checkout v2-track-data"
git checkout HEAD~1
After running wc -l data/digit_data.csv, we'll still find 1795 rows instead of the 1797 rows from the previous stage. That's because we need to run dvc checkout as well.
We might easily forget this step. Thus, DVC provides a git hook that triggers dvc checkout right after git checkout. You can install these git hooks through dvc install. The hooks are added to .git/hooks. If you want to know the details of what's added, read dvc install.
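If you'd like to check whether the hooks are in place, a small Python sketch like the following can help. It assumes the three hook names listed in the dvc install documentation (post-checkout, pre-commit, pre-push). missing_hooks is a hypothetical helper, and it only checks that the files exist, not that they were written by DVC.

```python
import os
import tempfile

# Hook names assumed from the `dvc install` documentation
DVC_HOOKS = ("post-checkout", "pre-commit", "pre-push")


def missing_hooks(repo_dir):
    """Return the expected hook files that are absent from .git/hooks."""
    hooks_dir = os.path.join(repo_dir, ".git", "hooks")
    return [
        name for name in DVC_HOOKS
        if not os.path.isfile(os.path.join(hooks_dir, name))
    ]


# Demo against a throwaway directory that mimics a git repository
with tempfile.TemporaryDirectory() as repo:
    os.makedirs(os.path.join(repo, ".git", "hooks"))
    before = missing_hooks(repo)

    # Simulate one installed hook by creating an empty file
    open(os.path.join(repo, ".git", "hooks", "post-checkout"), "w").close()
    after = missing_hooks(repo)

print(before)  # ['post-checkout', 'pre-commit', 'pre-push']
print(after)   # ['pre-commit', 'pre-push']
```

In a real project, calling missing_hooks(".") from the repository root should return an empty list once dvc install has run.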
Test these steps again. There should be an additional line after running git checkout. This is the output message of dvc checkout.
M data/
Push our code to a remote git repository
git remote add origin <REMOTE GIT REPO>
git push origin main
We've already pushed all the code and data changes to the remotes. Let's see how we could reproduce the project in another environment.
# check what's in our repo
$ dvc list <REMOTE GIT REPO>
.dvcignore
.github
.gitignore
LICENSE
Pipfile
Pipfile.lock
data
data.dvc
digit_recognizer
docs
mkdocs.yml
output
tasks.py
Although git does not track data/, we can still list it through DVC.
Because we use the relative path ../dvc_remote as DVC remote storage, we need to create the new project at the same directory level as dvc_example. We'll clone the project into ../dvc_example_on_another_machine.
# Clone the git repo
$ git clone <YOUR REMOTE GIT REPO> ../dvc_example_on_another_machine
$ cd ../dvc_example_on_another_machine
$ tree .
.
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── data.dvc
├── digit_recognizer
│ ├── __init__.py
│ └── digit_recognizer.py
├── docs
│ └── README.md
├── mkdocs.yml
├── output
└── tasks.py
3 directories, 9 files
As you can see, data/ has not yet been added to the project. We can now pull data from our DVC remote storage.
# pull data from default DVC remote storage
$ dvc pull
A data/
1 file added and 2 files fetched
# `data` has now been added to the project
$ tree .
.
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── data
│ ├── digit_data.csv
│ └── digit_target.csv
├── data.dvc
├── digit_recognizer
│ ├── __init__.py
│ └── digit_recognizer.py
├── docs
│ └── README.md
├── mkdocs.yml
├── output
└── tasks.py
4 directories, 11 files
That's all for data versioning in DVC. In the next post, we'll continue with versioning a data pipeline and tracking parameters and metrics. We won't need dvc_example_on_another_machine for the following steps. Feel free to remove it and change directory back to dvc_example.