DVC - Data Versioning
This article also lives on GitHub
About DVC (Data Version Control)
- version control system for data science and machine learning
- compatible with git (it's based on git)
- track
  - data
  - model
  - pipeline
  - metrics
- use storage directly
  - no external services needed
- ML research / engineer
- DevOps & Engineers
- It links your data, model, and pipelines with your metrics.
- reproducibility
- trackable
Read DVC - Versioning Data and Models for more use cases
I suggest using pipx if you want to install DVC globally. However, an even better way is to install it inside your project's virtual environment.
$ pip install pipx
$ pipx install dvc
$ dvc --version
2.3.0
DVC also provides Shell Completion and Syntax Highlighting Plugins for popular editors.
I'll use dvc_example to demonstrate how I applied DVC to an existing machine learning project. The example is based on Recognizing hand-written digits from the scikit-learn documentation. All the DVC parts start from v1-base. You can git checkout to the tag to follow along.
$ git clone https://github.com/Lee-W/dvc_example/ --branch v1-base
$ cd dvc_example
$ tree
.
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── digit_recognizer
│ ├── __init__.py
│ └── digit_recognizer.py
├── docs
│ └── README.md
├── mkdocs.yml
├── output
└── tasks.py
To set up the development environment, you'll need pipenv and invoke. If you run into an error when running pipenv install, you can run export SYSTEM_VERSION_COMPAT=1 before it. It's an open issue (Issue with NumPy, macOS 11 Big Sur, Python 3.9.1 Does pipenv not use the latest pip? #4564) of pipenv as of now. Or, you can just run the following commands.
# install needed tools
pipx install pipenv invoke
# set up environments
invoke init-dev
We'll use digit_recognizer/digit_recognizer.py for training a model that can recognize handwritten digits.
def main():
    X, y = load_data()
    X_train, X_test, y_train, y_test = process_data(X, y)
    model = train_model(X_train, y_train)
    predicted_y = model.predict(X_test)
    output_results(y_test, predicted_y)
    output_metrics(y_test, predicted_y)
pipenv install dvc
If you want to save data to remote storage, you might need to install extra dependencies (e.g., pipenv install dvc[s3]). The available extras are:
- [s3]
- [azure]
- [gdrive]
- [gs]
- [oss]
- [ssh]
Or, use pipenv install dvc[all] to install them all. Read dvc remote for more information.
# initialize DVC configurations
$ pipenv run dvc init
# see what's created by DVC
$ tree .dvc
.dvc
├── config
└── plots
    ├── confusion.json
    ├── confusion_normalized.json
    ├── default.json
    ├── linear.json
    ├── scatter.json
    └── smooth.json
# track DVC configuration through git
$ git add .dvc
# git commit
$ pipenv run cz commit
I'll use another local directory, ../dvc_remote, as our remote storage. You can change it to s3 or other remote storage.
mkdir ../dvc_remote
dvc remote add --default local ../dvc_remote
Through the --default flag, we can push/pull from the local remote without specifying the remote name.
Let's see what's changed in .dvc/config.
$ cat .dvc/config
[core]
remote = local
['remote "local"']
url = ../../dvc_remote
The url is ../../dvc_remote instead of ../dvc_remote because it's relative to the .dvc directory, where the config file lives. As we've not yet pushed anything to our pseudo remote, ../dvc_remote is still empty.
At this point, the data is loaded through sklearn.datasets.load_digits. We're going to change it to read from static files in data/.
def load_data():
    # Load data
    digits = datasets.load_digits()
    ...
We can use the following script to output the digit data into data/. Note that it's a one-time script, so we won't add it to git.
import os
import pandas as pd
from sklearn import datasets
# create data/ if needed; unlike os.mkdir, this doesn't fail when it already exists
os.makedirs("data", exist_ok=True)
digits = datasets.load_digits()
df = pd.DataFrame(digits.data)
df.to_csv("data/digit_data.csv", header=False, index=False)
df = pd.DataFrame(digits.target)
df.to_csv("data/digit_target.csv", header=False, index=False)
We'll need to change the load_data and main functions to read data from these files.
import csv
def load_data(X_path, y_path):
    # QUOTE_NONNUMERIC converts every unquoted field to float
    with open(X_path) as input_file:
        csv_reader = csv.reader(input_file, quoting=csv.QUOTE_NONNUMERIC)
        X = list(csv_reader)
    with open(y_path) as input_file:
        csv_reader = csv.reader(input_file, quoting=csv.QUOTE_NONNUMERIC)
        # each target row holds a single value
        y = [row[0] for row in csv_reader]
    return X, y
......
def main():
    X, y = load_data("data/digit_data.csv", "data/digit_target.csv")
    ......
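To see why load_data can pass the rows on without further conversion, here is a minimal sketch of what quoting=csv.QUOTE_NONNUMERIC does: the reader turns every unquoted field into a float. The in-memory sample is made up for illustration and only stands in for data/digit_data.csv.

```python
import csv
import io

# Hypothetical two-row sample standing in for data/digit_data.csv
sample = io.StringIO("0.0,5.0,13.0\n0.0,0.0,12.0\n")

# QUOTE_NONNUMERIC tells the reader to convert unquoted fields to float
rows = list(csv.reader(sample, quoting=csv.QUOTE_NONNUMERIC))

print(rows)               # [[0.0, 5.0, 13.0], [0.0, 0.0, 12.0]]
print(type(rows[0][0]))   # <class 'float'>
```

Without this option, every field would come back as a string and would have to be converted before training.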
Run pipenv run python digit_recognizer/digit_recognizer.py to check whether everything works as expected. If so, add these code changes into git.
Next, add data/ to DVC.
$ pipenv run dvc add data
100% Add|████████████████|1/1 [00:00, 2.14file/s]
To track the changes with git, run:
git add data.dvc .gitignore
dvc add creates a data.dvc file to track data/ and adds data/ to .gitignore, so that data/ is tracked only through DVC and not through git.
# Add DVC files into git track
git add .gitignore data.dvc
# git commit
pipenv run cz commit
In data.dvc, we can see that 2 files (digit_data.csv and digit_target.csv) are tracked.
$ cat data.dvc
outs:
- md5: b8d81f4964ecb86739c79c833fb491f3.dir
  size: 494728
  nfiles: 2
  path: data
Push the tracked data to the DVC remote.
dvc push
See what's changed in our remote storage, ../dvc_remote.
$ tree ../dvc_remote
../dvc_remote
├── 02
│ └── b861b6dc8e08da6d66547860f69277
├── 8c
│ └── ba569595920d230ade453b150f372b
└── b8
    └── d81f4964ecb86739c79c833fb491f3.dir
3 directories, 3 files
The md5 value of our tracked data is b8d81f4964ecb86739c79c833fb491f3.dir. There's also a corresponding file at ../dvc_remote/b8/d81f4964ecb86739c79c833fb491f3.dir.
$ cat ../dvc_remote/b8/d81f4964ecb86739c79c833fb491f3.dir
[{"md5": "02b861b6dc8e08da6d66547860f69277", "relpath": "digit_data.csv"}, {"md5": "8cba569595920d230ade453b150f372b", "relpath": "digit_target.csv"}]
This file indicates where the actual data sources are stored in ../dvc_remote.
In conclusion, if we want to know how data is stored through DVC, we can start from the md5 value recorded in *.dvc in our project and follow it to the matching files in the remote storage.
But most of the time, we don't need to do so. We can leave the tracking work to DVC.
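If you do want to trace the lookup by hand, the layout in the tree output above suggests a simple rule: the first two characters of an md5 value name the directory, and the rest names the file. The following is a minimal sketch under that assumption only; object_path and list_tracked_files are hypothetical helper names, not part of DVC, and the layout may differ in other DVC versions or remote types.

```python
import json
from pathlib import Path


def object_path(remote_dir, md5):
    """Map an md5 value to a path, assuming the <first 2 chars>/<rest> layout seen above."""
    return Path(remote_dir) / md5[:2] / md5[2:]


def list_tracked_files(remote_dir, dir_md5):
    """Read a .dir manifest and return {relpath: path of the stored object}."""
    manifest = json.loads(object_path(remote_dir, dir_md5).read_text())
    return {
        entry["relpath"]: object_path(remote_dir, entry["md5"])
        for entry in manifest
    }
```

Calling list_tracked_files("../dvc_remote", "b8d81f4964ecb86739c79c833fb491f3.dir") would read the manifest shown above and map digit_data.csv and digit_target.csv to their objects under ../dvc_remote.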
# temporarily delete our data locally
$ rm -rf data
# check whether DVC actually tracks our data
$ dvc status
data.dvc:
    changed outs:
        deleted: data
# bring our data back from the DVC cache
$ dvc checkout data
data
├── digit_data.csv
└── digit_target.csv
To demonstrate how DVC tracks data changes, let's remove the last 2 rows from data/digit_data.csv and data/digit_target.csv.
# check what's changed
$ dvc status
data.dvc:
    changed outs:
        modified: data
# Add these changes to DVC and git
$ dvc add data
$ git add data.dvc
# git commit
$ pipenv run cz commit
# Push these changes to our remote storage
$ dvc push
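The row removal that opens this step can be done in any editor. As a minimal sketch, it could also be scripted as below; drop_last_rows is a hypothetical helper written for this illustration, and it would be run before the dvc status call above.

```python
from pathlib import Path


def drop_last_rows(path, n=2):
    """Rewrite the file at `path` without its last `n` lines; return the remaining line count."""
    lines = Path(path).read_text().splitlines(keepends=True)
    if n > 0:
        # a plain lines[:-0] would drop everything, hence the guard
        lines = lines[:-n]
    Path(path).write_text("".join(lines))
    return len(lines)
```

For this walkthrough, that would mean calling drop_last_rows("data/digit_data.csv") and drop_last_rows("data/digit_target.csv") once each.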
The md5 value has changed, and the size of our data is smaller than the previously recorded 494728.
$ cat data.dvc
outs:
- md5: a333e114a49194e823ab9a4fa9e33ee9.dir
  size: 494172
  nfiles: 2
  path: data
More files are added to ../dvc_remote due to the data changes. You can follow the steps in the previous section to see what's actually stored.
$ tree ../dvc_remote
../dvc_remote
├── 02
│ └── b861b6dc8e08da6d66547860f69277
├── 2a
│ └── 6cfa13365ac9b3af5146133aca6789
├── 8c
│ └── ba569595920d230ade453b150f372b
├── 94
│ └── 2481fce846fb9750b7b8023c80a5ef
├── a3
│ └── 33e114a49194e823ab9a4fa9e33ee9.dir
└── b8
    └── d81f4964ecb86739c79c833fb491f3.dir
6 directories, 6 files
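The old objects stay in place while new ones appear because files are stored under names derived from their content. The sketch below illustrates that idea with plain hashlib; object_name is a hypothetical helper, the sample bytes are made up, and DVC's own hashing may involve details (such as how text files are normalized) that this sketch does not reproduce.

```python
import hashlib


def object_name(content: bytes) -> str:
    """Derive a <first 2 chars>/<rest> name from the md5 of the content."""
    digest = hashlib.md5(content).hexdigest()
    return f"{digest[:2]}/{digest[2:]}"


# Made-up contents standing in for a data file before and after trimming rows
original = b"0,1,2\n3,4,5\n6,7,8\n"
trimmed = b"0,1,2\n"

# Different content gives a different name, so the old object is never overwritten
print(object_name(original))
print(object_name(trimmed))
```

Because a modified file ends up under a new name, both versions remain in the remote, and an earlier data.dvc can still be resolved later.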
Let's git checkout to the previous git commit to see what happens if we only revert the changes in data.dvc.
# or "git checkout v2-track-data"
git checkout HEAD~1
After running wc -l data/digit_data.csv, we'll still find 1795 rows instead of the 1797 rows of the previous stage. That's because we need to run dvc checkout as well.
We might easily forget this step. Thus, DVC provides a git hook that triggers dvc checkout right after git checkout. You can install these git hooks through dvc install. The hooks are added into .git/hooks. If you want to know the details of what's added, read dvc install.
Test these steps again. There should be an additional line after running git checkout. This is the output message of dvc checkout.
M data/
Push our code to a remote git repository
git remote add origin <REMOTE GIT REPO>
git push origin main
We've already pushed all the code and data changes to the remotes. Let's see how we can reproduce the project in another environment.
# check what's in our repo
$ dvc list <REMOTE GIT REPO>
.dvcignore
.github
.gitignore
LICENSE
Pipfile
Pipfile.lock
data
data.dvc
digit_recognizer
docs
mkdocs.yml
output
tasks.py
Although git does not track data/, we can still list it through DVC.
Because we use the relative path ../dvc_remote as DVC remote storage, we need to create the new project at the same level as dvc_example. We'll clone the project into ../dvc_example_on_another_machine.
# Clone the git repo
$ git clone <YOUR REMOTE GIT REPO> ../dvc_example_on_another_machine
$ cd ../dvc_example_on_another_machine
$ tree .
.
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── data.dvc
├── digit_recognizer
│ ├── __init__.py
│ └── digit_recognizer.py
├── docs
│ └── README.md
├── mkdocs.yml
├── output
└── tasks.py
3 directories, 9 files
As you can see, data/ has not yet been added to the project. We can now pull data from our DVC remote storage.
# pull data from default DVC remote storage
$ dvc pull
A data/
1 file added and 2 files fetched
# `data` has now been added to the project
$ tree .
.
├── LICENSE
├── Pipfile
├── Pipfile.lock
├── data
│ ├── digit_data.csv
│ └── digit_target.csv
├── data.dvc
├── digit_recognizer
│ ├── __init__.py
│ └── digit_recognizer.py
├── docs
│ └── README.md
├── mkdocs.yml
├── output
└── tasks.py
4 directories, 11 files
That's all for data versioning in DVC. In the next post, we'll continue with versioning a data pipeline and tracking parameters and metrics. We won't need dvc_example_on_another_machine for the following steps. Feel free to remove it and change directory back to dvc_example.