Manage, fetch and modify test datasets stored in DigitalOcean
Test data is stored in a `data` directory created in the repo root. There are three levels of data:
- Platform -- technology used, or goldstandard
- Data type -- sub-types of data; applies to all but
- Sample ID -- data pertaining to a specific individual/trio
For example, `data/target/analyses/Diag-EHG122-NA12878` contains the `.analysis` file for that analysis, and `data/target/samples/Diag-EHG122-NA12878/` contains all the sample-related files.
Data is versioned, but version info is not stored in the directory path. `datasets.json` contains all data info for programmatic use (downloading, uploading, packaging). To view the version of the data currently downloaded, check the `DATA_READY` file at the sample level of the directory tree. You can also use `jq` to quickly extract the version info from a `DATA_READY` file.
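For example, if `DATA_READY` is JSON with a top-level `version` key (a hypothetical layout -- inspect one of your downloaded files to confirm the actual keys), a quick check might look like:

```shell
# Hypothetical layout: assumes DATA_READY is JSON with a "version" key
jq '.version' data/target/samples/Diag-EHG122-NA12878/DATA_READY
```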
The package can be installed locally or used via Docker to keep your environment clean. Both methods require you to have working DigitalOcean Spaces credentials. For instructions on creating these creds, see: https://www.digitalocean.com/community/tutorials/how-to-create-a-digitalocean-space-and-api-key#creating-an-access-key.
The best way to store these is in a file that can be sourced/exported, or passed as a parameter to `make`. Credentials are stored in environment variables.
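For example, you could keep them in a small env file and source it before running the tool. The variable names below (`SPACES_KEY`, `SPACES_SECRET`) are assumptions for illustration -- check the tool's configuration for the names it actually reads:

```shell
# Hypothetical credentials file (e.g. ~/.digitalocean/do_creds).
# Variable names are assumptions; confirm against the tool's docs.
cat > ./do_creds <<'EOF'
export SPACES_KEY="your-access-key"
export SPACES_SECRET="your-secret-key"
EOF

# Load the credentials into the current shell session
. ./do_creds
echo "credentials loaded: ${SPACES_KEY:+yes}"
```

The same file can then be handed to the Docker workflow via the `DO_CREDS` makefile variable described below.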
- Python >= 3.6
- Clone the repo: `git clone email@example.com:data/vcpipe-testdata.git`
- Enter the repo: `cd vcpipe-testdata`
- Install the package: `python setup.py install`
Before you do anything, you must load your credentials into your environment. If necessary, you can specify them on the command line with the `--spaces-key` and `--spaces-secret` options, but this reveals them in `ps` output and in your bash history, and is not recommended.
Commands and shared options
```
Usage: vcpipe_testdata [OPTIONS] COMMAND [ARGS]...

Options:
  --spaces-key TEXT     DigitalOcean Spaces API key
  --spaces-secret TEXT  DigitalOcean Spaces API secret
  --verbose             Increase log level
  --debug               Max logging
  -d, --datasets FILE   JSON file containing datasets and versions
                        [default: datasets.json]
  --threads INTEGER     Maximum number of threads to use  [default: (20)]
  --help                Show this message and exit.

Commands:
  download      Download specified test data
  list-samples  List samples available for each selected platform
  package       Marks new test data as ready for upload
  upload        Upload specified test data (must be packaged first)
```
The default number of threads is determined dynamically: it is either the number of available CPUs (`nproc`) or 20, whichever is smaller. You can also specify Spaces credentials via the `--spaces-key` / `--spaces-secret` options, but this is not recommended, as they will then be shown in `ps` output and your shell history.
To modify the options shared by all commands, specify them before the command name. For example, to download data using only 4 threads instead of the maximum available, you would do:
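```shell
vcpipe_testdata --threads 4 download
```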
```
Usage: vcpipe_testdata download [OPTIONS]

  Download specified test data

Options:
  --data-type [all|analyses|samples|results]
                         Type of data to download  [default: all]
  --platform [all|goldstandard|target|wes-hiseq|wes-novaseq|wgs-hiseq|wgs-novaseq]
                         Only download data from a specific platform, can be
                         repeated for multiple platforms  [default: all]
  --sample-id SAMPLE_ID  Only download data from a specific sample ID, can be
                         repeated for multiple samples
  -h, --help             Show this message and exit.
```
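The platform and sample filters can be combined and repeated. For example, to fetch only the target-platform data for a single sample (using a sample ID that appears elsewhere in this document):

```shell
vcpipe_testdata download --platform target --sample-id Diag-EHG122-NA12878
```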
Packaging and Uploading data
When adding a new dataset, or updating an existing one, you must:
- Package the data before you upload it. This writes the `DATA_READY` file to the directory, with the timestamp of when it was packaged and the version number. Data that has not been packaged will not be recognized as ready to upload.
- Add a description of the new data instances to `datasets.json`, in the standardized structure (samples/analyses).
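The exact `datasets.json` schema is not reproduced here; as a purely hypothetical sketch (field names illustrative, not authoritative), an entry for a new sample might look something like:

```json
{
  "target": {
    "samples": {
      "Diag-EHG122-NA12878": { "version": "1.0" }
    }
  }
}
```

Check an existing entry in the repo's `datasets.json` for the real structure before adding your own.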
`vcpipe_testdata package` will look for new data that has an entry in `datasets.json` but no `DATA_READY` file, and will generate the `DATA_READY` files appropriately. You can use the options to restrict which directories it packages.
`vcpipe_testdata upload` will attempt to upload any directory with a `DATA_READY` file that does not already exist in DigitalOcean. You can use additional options to restrict which directories are uploaded.
```
Usage: vcpipe_testdata package [OPTIONS]

  Marks new test data as ready for upload

Options:
  --platform [all|goldstandard|target|wes-hiseq|wes-novaseq|wgs-hiseq|wgs-novaseq]
                         Only package data from a specific platform, can be
                         repeated for multiple platforms  [default: all]
  --sample-id SAMPLE_ID  Only package data from a specific sample ID, can be
                         repeated for multiple samples
  -h, --help             Show this message and exit.
```
```
Usage: vcpipe_testdata upload [OPTIONS]

  Upload specified test data (must be packaged first)

Options:
  --platform [all|goldstandard|target|wes-hiseq|wes-novaseq|wgs-hiseq|wgs-novaseq]
                         Only upload data from a specific platform, can be
                         repeated for multiple platforms  [default: all]
  --sample-id SAMPLE_ID  Only upload data from a specific sample ID, can be
                         repeated for multiple samples
  -h, --help             Show this message and exit.
```
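Putting the two steps together for a single new sample (option values here are illustrative):

```shell
# 1. Write the DATA_READY file for the new data
vcpipe_testdata package --platform target --sample-id Diag-EHG122-NA12878
# 2. Upload anything packaged but not yet present in DigitalOcean
vcpipe_testdata upload --platform target --sample-id Diag-EHG122-NA12878
```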
All sample IDs are stored in `datasets.json`, but there is a helper command for listing which samples are available for each platform.
```
Usage: vcpipe_testdata list-samples [OPTIONS]

  List samples available for each selected platform

Options:
  --platform [all|goldstandard|target|wes-hiseq|wes-novaseq|wgs-hiseq|wgs-novaseq]
                     Only list samples from the specified platforms, can be
                     repeated for multiple platforms  [default: all]
  --show-data-types  List data types available for each sample
                     [default: False]
  -h, --help         Show this message and exit.
```
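For example, to see which samples and data types exist for the two WES platforms:

```shell
vcpipe_testdata list-samples --platform wes-hiseq --platform wes-novaseq --show-data-types
```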
Credentials are loaded into the Docker container from an environment file whose path is stored in the `DO_CREDS` makefile variable.
- Clone the repo: `git clone firstname.lastname@example.org:ousamg/data/vcpipe-testdata.git`
- Enter the repo: `cd vcpipe-testdata`
- Build the local image:
Currently, the makefile actions only download / upload all data. They do check to see what data has changed, if any, and only upload new data / download data you do not have. For more complicated options, install locally, or use `make shell` to enter the container and run `vcpipe_testdata` directly.
- Full function list and help
- Downloading data: `make download-data DO_CREDS=~/.digitalocean/do_creds`
- Uploading data:
  - `make package-data` -- only changes local files, no creds needed
  - `make upload-data DO_CREDS=~/.digitalocean/do_creds`
- Get a Docker shell to run `vcpipe_testdata` as if in a local install: `make shell DO_CREDS=~/.digitalocean/do_creds`