Manage, fetch and modify test datasets stored in DigitalOcean
Test data is stored in a `data` directory created in the repo root. There are three levels of data:
- Platform -- technology used, or goldstandard
- Data type -- sub-types of data; applies to all but
- Sample ID -- data pertaining to a specific individual/trio
For example, `data/target/analyses/Diag-EHG122-NA12878` contains the `.analysis` file for that analysis, and `data/target/samples/Diag-EHG122-NA12878/` contains all the sample-related files.
Data is versioned, but version info is not stored in the directory path. `datasets.json` contains all data info for programmatic use (downloading, uploading, packaging). To view the version of the data currently downloaded, check the `DATA_READY` file at the sample level of the directory tree. You can also use `jq` to quickly extract the version info from a `DATA_READY` file.
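For example, if `DATA_READY` is JSON with a top-level `version` key (a hypothetical layout -- inspect one of your downloaded files to confirm the actual keys), a quick check might look like:

```shell
# Hypothetical layout: assumes DATA_READY is JSON with a "version" key
jq '.version' data/target/samples/Diag-EHG122-NA12878/DATA_READY
```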
The package can be installed locally or used via Docker to keep your environment clean. Both methods require you to have working DigitalOcean Spaces credentials. For instructions on creating these creds, see: https://www.digitalocean.com/community/tutorials/how-to-create-a-digitalocean-space-and-api-key#creating-an-access-key.
The best way to store these is in a file that can be sourced/exported, or passed as a parameter to `make`. Credentials are stored in environment variables.
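For example, you could keep them in a small env file and source it before running the tool. The variable names below (`SPACES_KEY`, `SPACES_SECRET`) are assumptions for illustration -- check the tool's configuration for the names it actually reads:

```shell
# Hypothetical credentials file (e.g. ~/.digitalocean/do_creds).
# Variable names are assumptions; confirm against the tool's docs.
cat > ./do_creds <<'EOF'
export SPACES_KEY="your-access-key"
export SPACES_SECRET="your-secret-key"
EOF

# Load the credentials into the current shell session
. ./do_creds
echo "credentials loaded: ${SPACES_KEY:+yes}"
```

The same file can then be handed to the Docker workflow via the `DO_CREDS` makefile variable described below.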
- Python >= 3.6
- Clone the repo: `git clone email@example.com:data/vcpipe-testdata.git`
- Enter the repo: `cd vcpipe-testdata`
- Install the package: `python setup.py install`
Before you do anything, you must load your credentials into your environment. If necessary, you can specify them on the command line with the `--spaces-key` and `--spaces-secret` options, but this reveals them in `ps` output and in your bash history, and is not recommended.
Commands and shared options
```
Usage: vcpipe_testdata [OPTIONS] COMMAND [ARGS]...

Options:
  --spaces-key TEXT     DigitalOcean Spaces API key
  --spaces-secret TEXT  DigitalOcean Spaces API secret
  --verbose             Increase log level
  --debug               Max logging
  -d, --datasets FILE   JSON file containing datasets and versions
                        [default: datasets.json]
  --threads INTEGER     Maximum number of threads to use  [default: (20)]
  --help                Show this message and exit.

Commands:
  download      Download specified test data
  list-samples  List samples available for each selected platform
  package       Marks new test data as ready for upload
  upload        Upload specified test data (must be packaged first)
```
The default number of threads is determined dynamically: it is either the number of available CPUs (`nproc`) or 20, whichever is smaller. You can also specify Spaces credentials via the `--spaces-key` / `--spaces-secret` options, but this is not recommended, as they will then be shown in `ps` output and your shell history.
To modify the options shared by all commands, specify them before the command name. For example, to download data using only 4 threads instead of the maximum available, you would do:
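```shell
vcpipe_testdata --threads 4 download
```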
```
Usage: vcpipe_testdata download [OPTIONS]

  Download specified test data

Options:
  --data-type [all|analyses|samples|results]
                         Type of data to download  [default: all]
  --platform [all|goldstandard|target|wes-hiseq|wes-novaseq|wgs-hiseq|wgs-novaseq]
                         Only download data from a specific platform, can be
                         repeated for multiple platforms  [default: all]
  --sample-id SAMPLE_ID  Only download data from a specific sample ID, can be
                         repeated for multiple samples
  -h, --help             Show this message and exit.
```
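The platform and sample filters can be combined and repeated. For example, to fetch only the target-platform data for a single sample (using a sample ID that appears elsewhere in this document):

```shell
vcpipe_testdata download --platform target --sample-id Diag-EHG122-NA12878
```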
Packaging and Uploading data
When adding a new dataset, or updating an existing one, you must:
- Package the data before you upload it. This writes the `DATA_READY` file to the directory, with the timestamp of when it was packaged and the version number. Data that has not been packaged will not be recognized as ready to upload.
- Add a description of the new data instances to `datasets.json`, in the standardized structure (samples/analyses).
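The exact `datasets.json` schema is not reproduced here; as a purely hypothetical sketch (field names illustrative, not authoritative), an entry for a new sample might look something like:

```json
{
  "target": {
    "samples": {
      "Diag-EHG122-NA12878": { "version": "1.0" }
    }
  }
}
```

Check an existing entry in the repo's `datasets.json` for the real structure before adding your own.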
`vcpipe_testdata package` will look for new data that has an entry in `datasets.json` but no `DATA_READY` file, and will generate the `DATA_READY` files appropriately. You can use the options to restrict which directories it packages.
`vcpipe_testdata upload` will attempt to upload any directory with a `DATA_READY` file that does not already exist in DigitalOcean. You can use additional options to restrict which directories are uploaded.
```
Usage: vcpipe_testdata package [OPTIONS]

  Marks new test data as ready for upload

Options:
  --platform [all|goldstandard|target|wes-hiseq|wes-novaseq|wgs-hiseq|wgs-novaseq]
                         Only package data from a specific platform, can be
                         repeated for multiple platforms  [default: all]
  --sample-id SAMPLE_ID  Only package data from a specific sample ID, can be
                         repeated for multiple samples
  -h, --help             Show this message and exit.
```
```
Usage: vcpipe_testdata upload [OPTIONS]

  Upload specified test data (must be packaged first)

Options:
  --platform [all|goldstandard|target|wes-hiseq|wes-novaseq|wgs-hiseq|wgs-novaseq]
                         Only upload data from a specific platform, can be
                         repeated for multiple platforms  [default: all]
  --sample-id SAMPLE_ID  Only upload data from a specific sample ID, can be
                         repeated for multiple samples
  -h, --help             Show this message and exit.
```
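Putting the two steps together for a single new sample (option values here are illustrative):

```shell
# 1. Write the DATA_READY file for the new data
vcpipe_testdata package --platform target --sample-id Diag-EHG122-NA12878
# 2. Upload anything packaged but not yet present in DigitalOcean
vcpipe_testdata upload --platform target --sample-id Diag-EHG122-NA12878
```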
All sample IDs are stored in `datasets.json`, but there is a helper command for listing which samples are available for each platform.
```
Usage: vcpipe_testdata list-samples [OPTIONS]

  List samples available for each selected platform

Options:
  --platform [all|goldstandard|target|wes-hiseq|wes-novaseq|wgs-hiseq|wgs-novaseq]
                     Only list samples from the specified platforms, can be
                     repeated for multiple platforms  [default: all]
  --show-data-types  List data types available for each sample
                     [default: False]
  -h, --help         Show this message and exit.
```
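For example, to see which samples and data types exist for the two WES platforms:

```shell
vcpipe_testdata list-samples --platform wes-hiseq --platform wes-novaseq --show-data-types
```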
Credentials are loaded into the Docker container from an environment file whose path is stored in the `DO_CREDS` makefile variable.
- Clone the repo: `git clone firstname.lastname@example.org:ousamg/data/vcpipe-testdata.git`
- Enter the repo: `cd vcpipe-testdata`
- Build the local image:
Currently, the makefile actions only download / upload all data. They do check to see what data has changed, if any, and only upload new data / download data you do not have. For more complicated options, install locally, or use `make shell` to enter the container and run `vcpipe_testdata` directly.
- Full function list and help
- Downloading data: `make download-data DO_CREDS=~/.digitalocean/do_creds`
- Uploading data:
  - `make package-data` -- only changes local files, no creds needed
  - `make upload-data DO_CREDS=~/.digitalocean/do_creds`
- Get a Docker shell to run `vcpipe_testdata` as if in a local install: `make shell DO_CREDS=~/.digitalocean/do_creds`