DVC is a free and open-source, platform-agnostic version control system for data, machine learning models, and experiments. It is designed to make ML models shareable, experiments reproducible, and to track versions of models, data, and pipelines. DVC works on top of Git repositories and cloud storage.
The first (beta) version of DVC 0.6 was launched in May 2017. In June 2020, DVC 1.0 was publicly released by Iterative.ai.
Overview
DVC is designed to incorporate the best practices of software development into machine learning workflows. It does this by extending the traditional software tool Git with cloud storage for datasets and machine learning models.
Specifically, DVC makes machine learning operations:
* Codified: it codifies datasets and models by storing pointers to the data files in cloud storage.
* Reproducible: it allows users to reproduce experiments and rebuild datasets from raw data. These features also make it possible to automate the construction of datasets and the training, evaluation, and deployment of ML models.
DVC and Git
DVC stores large files and datasets in separate storage, outside of Git. This storage can be on the user’s computer or hosted on any major cloud storage provider, such as AWS S3, Google Cloud Storage, and Microsoft Azure Blob Storage. DVC users may also set up a remote repository on any server and connect to it remotely.
When a user stores their data and models in the remote repository, a text file is created in their Git repository which points to the actual data in remote storage.
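Such a pointer file (a ''.dvc'' metafile) is a small, human-readable text file. An illustrative example, with a placeholder hash and invented file name, might look like:

```yaml
# data.xml.dvc — committed to Git in place of the real file (illustrative values)
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e   # content hash of the tracked file
  size: 14445097                          # size in bytes
  path: data.xml
```

The hash identifies the file's contents in remote storage, so Git only ever versions this small metafile.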
Features
DVC's features can be divided into three categories: data management, pipelines, and experiment tracking.
Data management
Data and model versioning is the base layer of DVC for large files, datasets, and machine learning models. It allows the use of a standard Git workflow, but without the need to store those files in the repository. Large files, directories and ML models are replaced with small metafiles, which in turn point to the original data. Data is stored separately, allowing data scientists to transfer large datasets or share a model with others.
DVC enables data versioning through codification. When a user creates metafiles, describing what datasets, ML artifacts and other features to track, DVC makes it possible to capture versions of data and models, create and restore from snapshots, record evolving metrics, switch between versions, etc.
Unique versions of data files and directories are
cached in a systematic way (also preventing file duplication). The working datastore is separated from the user’s workspace to keep the project light, but stays connected via file links handled automatically by DVC.
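The caching scheme can be illustrated with a minimal content-addressing sketch; this is an illustration of the idea only, not DVC's actual cache implementation or file-link strategy:

```python
import hashlib
import shutil
from pathlib import Path

def cache_file(path: Path, cache_dir: Path) -> str:
    """Store a file in a content-addressed cache and return its hash.

    Because files are keyed by the hash of their contents, two identical
    files are stored only once, which prevents duplication.
    """
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    target = cache_dir / digest[:2] / digest[2:]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return digest

def restore_file(digest: str, cache_dir: Path, dest: Path) -> None:
    """Restore a cached file into the workspace by its content hash."""
    shutil.copy2(cache_dir / digest[:2] / digest[2:], dest)
```

A workspace metafile only needs to record the returned hash to be able to restore any past version of the data.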
Pipelines
DVC provides a mechanism to define and execute pipelines.
Pipelines represent the process of building ML datasets and models, from how data is preprocessed to how models are trained and evaluated.
Pipelines can also be used to deploy models into production environments.
A DVC pipeline is focused on the experimentation phase of the ML process. Users can run multiple copies of a pipeline by cloning the Git repository that contains it or by running ML experiments. They can also record the workflow as a pipeline and reproduce it in the future.
Pipelines are represented in code as YAML configuration files. These files define the stages of the pipeline and how data and information flow from one step to the next.
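For example, a two-stage ''dvc.yaml'' file might look like the following (the script and file names are hypothetical):

```yaml
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/prepared.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/prepared.csv
  train:
    cmd: python train.py data/prepared.csv model.pkl
    deps:
      - train.py
      - data/prepared.csv
    outs:
      - model.pkl
```

Because the train stage declares the prepare stage's output as a dependency, DVC can infer the execution order and rerun only the stages whose inputs have changed.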
When a pipeline is run, the artifacts produced by that pipeline are registered in a dvc.lock file. The lockfile records the stages that were run, and stores a hash of the resulting output for each stage.
Not only is it a record of the execution of the pipeline, but it is also useful when deciding which steps must be rerun on subsequent executions of the pipeline.
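The rerun decision can be sketched as a simple hash comparison; this is an illustration of the idea, not DVC's actual lockfile logic:

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """Hash a file's contents, standing in for DVC's dependency hashing."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def stage_is_stale(deps: list[Path], recorded: dict[str, str]) -> bool:
    """A stage must be rerun if any dependency's current hash differs
    from the hash recorded at the last execution."""
    return any(file_hash(d) != recorded.get(str(d)) for d in deps)
```

If no dependency hash has changed, the stage's cached outputs can be reused instead of recomputed.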
Experiment tracking
Experiment tracking allows developers to explore, iterate and compare different machine learning experiments.
Each experiment represents a variation of a data science project defined by changes in the workspace. Experiments maintain a link to the commit in the current branch (Git HEAD) as their parent or baseline. However, they do not form part of the regular Git tree (unless they are made persistent). This stops temporary commits and branches from overflowing a user's repository.
Common use cases for experiments are:
# Comparison of model architectures
# Comparison of training or evaluation datasets
# Selection of model hyperparameters
DVC experiments can be managed and visualized either from the VS Code IDE or online using Iterative Studio.
Visualization allows users to compare experiment results visually, track plots, and generate them with library integrations.
DVC offers several options for using visualization in a regular workflow:
* DVC can generate HTML files that include interactive plots from data series in JSON, YAML, CSV, or TSV format
* DVC can keep track of image files produced as plot outputs from the training/evaluation scripts
* DVCLive integrations can produce plots automatically during training
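The data series behind such plots are plain files that a training script can emit itself. A hand-rolled sketch (not the DVCLive API) of writing a loss history in two of the formats listed above:

```python
import csv
import json
from pathlib import Path

def log_metrics(history: list[dict], out_dir: Path) -> None:
    """Write a training history as JSON and CSV data series,
    plain-text formats that plotting tools can consume."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # JSON: a list of {"step": ..., "loss": ...} records.
    (out_dir / "loss.json").write_text(json.dumps(history, indent=2))
    # CSV: one header row, then one row per training step.
    with (out_dir / "loss.csv").open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["step", "loss"])
        writer.writeheader()
        writer.writerows(history)
```

Either file can then be rendered as a plot of loss against training step.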
The DVC VS Code extension
In 2022, Iterative released a free extension for Visual Studio Code (VS Code), a source-code editor made by Microsoft, which provides VS Code users with the ability to use DVC in their editors with additional user interface functionality.
History
In 2017, the first (beta) version of DVC 0.6 was publicly released as a simple command-line tool.
It allowed data scientists to keep track of their machine learning processes and file dependencies in the simple form of git-like commands. It also allowed them to transform existing machine learning processes into reproducible DVC pipelines. DVC 0.6 addressed most of the common problems that machine learning engineers and data scientists were facing: irreproducible machine learning experiments, unversioned data, and poor collaboration between teams.
Created by ex-Microsoft data scientist Dmitry Petrov, DVC aimed to integrate the best existing software development practices into machine learning operations.
In 2018, Dmitry Petrov together with Ivan Shcheklein, an engineer and entrepreneur, founded Iterative.ai,
an MLOps company that continued the development of DVC. Besides DVC, Iterative.ai is also behind open source tools like CML, MLEM, and Studio, the enterprise version of the open source tools.
In June 2020, the Iterative.ai team released DVC 1.0. New features like multi-stage DVC files, run cache, plots, data transfer optimizations, hyperparameter tracking, and stable release cycles were added as a result of discussions and contributions from the community.
In March 2021, DVC released DVC 2.0, which introduced ML experiments (experiment management), model checkpoints and metrics logging.
ML experiments: To solve the problem of Git overhead, when hundreds of experiments need to be run in a single day and each experiment run requires additional Git commands, DVC 2.0 introduced the lightweight ''experiments'' feature. It allows its users to auto-track ML experiments and capture code changes.
This eliminated the dependence upon additional services by saving data versions as metadata in Git, as opposed to relegating it to external databases or APIs.
ML model checkpoints versioning: The new release also enables versioning of all checkpoints with corresponding code and data.
Metrics logging: DVC 2.0 introduced a new open-source library, ''DVCLive'', that provides functionality for tracking model metrics and organizing metrics in a way that DVC can visualize with navigation in Git history.
Alternative solutions to DVC
There are several open source projects that provide similar data version control capabilities to DVC, such as Git LFS, Dolt, Nessie, and lakeFS. These projects vary in their fit to the different needs of data engineers and data scientists, such as scalability, supported file formats, support for tabular and unstructured data, the volume of data that is supported, and more.
External links
* {{GitHub, iterative/dvc}}
* VS Code extension