[[TOC]]

Description

Manage, fetch and modify test datasets stored in DigitalOcean Spaces.

Test data is stored in a data directory created in the repo root. There are three levels of data:

  1. Platform -- the technology used, or goldstandard
     e.g., goldstandard, target, wes-hiseq, wes-novaseq, wgs-hiseq
  2. Data type -- sub-type of data; applies to all platforms except goldstandard
     e.g., analyses, samples, results
  3. Sample ID -- data pertaining to a specific individual/trio
     e.g., NA12878_HG001, Diag-EHG122-NA12878, Diag-excap136-HG002C2-PM, Diag-TestNovaSeq-NA12878N6, Diag-wgs27-NA12878N6

For example, data/target/analyses/Diag-EHG122-NA12878 contains the .analysis file for Diag-EHG122-NA12878 and data/target/samples/Diag-EHG122-NA12878/ contains all the sample-related files.

Data is versioned, but version info is not stored in the directory path. The datasets.json file contains all data info for programmatic use (downloading, uploading, packaging). To view the version of data currently downloaded, check the DATA_READY file in the sample level of the directory tree. e.g.,

$ cat data/target/samples/Diag-EHG122-NA12878/DATA_READY
timestamp: 2020-05-07 12:37:50.224142
version: v3.0-rel

$ cat data/goldstandard/AshkenazimTrio/HG002_NA24385_son/DATA_READY
timestamp: 2020-05-07 12:37:50.216316
version: NISTv3.3.2/GRCh37

You can also use jq to quickly extract the version info from datasets.json.

$ jq -r '.target.samples["Diag-EHG122-NA12878"].version' datasets.json
v3.0-rel

$ jq -r '.goldstandard.samples["AshkenazimTrio/HG002_NA24385_son"].version' datasets.json
NISTv3.3.2/GRCh37
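The same lookups can be done from Python. The nesting shown here (platform → "samples" → sample ID → "version") is inferred from the jq paths above; a real datasets.json contains more fields than this minimal mock:

```python
import json

# Minimal mock of datasets.json, structured as the jq examples above imply.
datasets = json.loads("""
{
  "target": {
    "samples": {"Diag-EHG122-NA12878": {"version": "v3.0-rel"}}
  },
  "goldstandard": {
    "samples": {"AshkenazimTrio/HG002_NA24385_son": {"version": "NISTv3.3.2/GRCh37"}}
  }
}
""")

print(datasets["target"]["samples"]["Diag-EHG122-NA12878"]["version"])
# v3.0-rel
print(datasets["goldstandard"]["samples"]["AshkenazimTrio/HG002_NA24385_son"]["version"])
# NISTv3.3.2/GRCh37
```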

Using vcpipe-testdata

The package can be installed locally or used via Docker to keep your environment clean. Both methods require you to have working DigitalOcean Spaces credentials. For instructions on creating these creds, see: https://www.digitalocean.com/community/tutorials/how-to-create-a-digitalocean-space-and-api-key#creating-an-access-key.

The best way to store these is in a file that can be sourced/exported or passed as a parameter to make. e.g.,

$ cat ~/.digitalocean/do_creds
SPACES_KEY=<api_key>
SPACES_SECRET=<api_secret>

Local installation

Credentials are stored in the environment variables SPACES_KEY and SPACES_SECRET.

Setup

Requires:

  • Python >= 3.6
  • libcurl4

Installation:

  1. clone the repo: git clone git@git.ousamg.io:data/vcpipe-testdata.git
  2. enter the repo: cd vcpipe-testdata
  3. install: python setup.py install

Usage

Using credentials

Before you do anything, you must load your credentials into your environment. If necessary, you can specify them on the command line with the --spaces-key and --spaces-secret options, but this reveals them on ps and in your bash history and is not recommended.

source ~/.digitalocean/do_creds; export SPACES_KEY SPACES_SECRET

Commands and shared options

Usage: vcpipe_testdata [OPTIONS] COMMAND [ARGS]...

Options:
  --spaces-key TEXT     DigitalOcean Spaces API key
  --spaces-secret TEXT  DigitalOcean Spaces API secret
  --verbose             Increase log level
  --debug               Max logging
  -d, --datasets FILE   JSON file containing datasets and versions  [default:
                        datasets.json]

  --threads INTEGER     Maximum number of threads to use  [default: (20)]
  --help                Show this message and exit.

Commands:
  download      Download specified test data
  list-samples  List samples available for each selected platform
  package       Marks new test data as ready for upload
  upload        Upload specified test data (must be packaged first)

The default number of threads to use is determined dynamically. It is either the number of processors (nproc) or 20, whichever is smaller. You can also specify spaces credentials via the options, but this is not recommended as they will then be shown in ps output.
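The thread-count default described above amounts to the following rule (a sketch, not the package's actual code):

```python
import os

def default_threads(cap=20):
    """Default thread count: the number of processors (nproc) or `cap`,
    whichever is smaller. os.cpu_count() may return None, so fall back to 1."""
    return min(os.cpu_count() or 1, cap)

print(default_threads())
```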

To modify the options shared by all commands, pass them before the command name. For example, to download data using only 4 threads instead of the default maximum:

vcpipe_testdata --threads 4 download

Downloading data

Usage: vcpipe_testdata download [OPTIONS]

  Download specified test data

Options:
  --data-type [all|analyses|samples|results]
                                  Type of data to download  [default: all]
  --platform [all|goldstandard|target|wes-hiseq|wes-novaseq|wgs-hiseq|wgs-novaseq]
                                  Only download data from a specific platform,
                                  can be repeated for multiple platforms
                                  [default: all]

  --sample-id SAMPLE_ID           Only download data from a specific sample
                                  ID, can be repeated for multiple samples

  -h, --help                      Show this message and exit.

Packaging and Uploading data

When adding a new dataset, or updating an existing one, you must:

  • package the data before you upload it. This writes the DATA_READY file to the directory with the version number and the timestamp of when it was packaged. Data that has not been packaged will not be recognized as ready to upload.
  • add a description of the new data instances to datasets.json in the standardized structure (samples/analyses).

Running vcpipe_testdata package will look for new data that has an entry in datasets.json but no DATA_READY file, and will generate the DATA_READY files appropriately. You can use the options to restrict which directories it packages.

Running vcpipe_testdata upload will attempt to upload any directory with a DATA_READY file that does not already exist in DigitalOcean. You can use additional options to restrict which directories are uploaded.
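What packaging produces can be sketched as follows. The DATA_READY format is taken from the `cat DATA_READY` examples earlier in this document; the function itself is an illustration, not the package's actual implementation:

```python
import datetime
import tempfile
from pathlib import Path

def write_data_ready(sample_dir, version):
    """Write a DATA_READY file recording the packaging timestamp and the
    dataset version, in the format shown by the cat examples above."""
    sample_dir = Path(sample_dir)
    sample_dir.mkdir(parents=True, exist_ok=True)
    content = (
        f"timestamp: {datetime.datetime.now()}\n"
        f"version: {version}\n"
    )
    (sample_dir / "DATA_READY").write_text(content)

# Demo in a temporary directory so no real data tree is touched:
demo = Path(tempfile.mkdtemp()) / "Diag-EHG122-NA12878"
write_data_ready(demo, "v3.0-rel")
print((demo / "DATA_READY").read_text())
```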

Packaging data
Usage: vcpipe_testdata package [OPTIONS]

  Marks new test data as ready for upload

Options:
  --platform [all|goldstandard|target|wes-hiseq|wes-novaseq|wgs-hiseq|wgs-novaseq]
                                  Only package data from a specific platform,
                                  can be repeated for multiple platforms
                                  [default: all]

  --sample-id SAMPLE_ID           Only package data from a specific sample ID,
                                  can be repeated for multiple samples

  -h, --help                      Show this message and exit.
Uploading data
Usage: vcpipe_testdata upload [OPTIONS]

  Upload specified test data (must be packaged first)

Options:
  --platform [all|goldstandard|target|wes-hiseq|wes-novaseq|wgs-hiseq|wgs-novaseq]
                                  Only upload data from a specific platform,
                                  can be repeated for multiple platforms
                                  [default: all]

  --sample-id SAMPLE_ID           Only upload data from a specific sample ID,
                                  can be repeated for multiple samples

  -h, --help                      Show this message and exit.

Other functionality

list-samples

All Sample IDs are stored in datasets.json, but there is a helper command for listing which samples are available on each platform.

Usage: vcpipe_testdata list-samples [OPTIONS]

  List samples available for each selected platform

Options:
  --platform [all|goldstandard|target|wes-hiseq|wes-novaseq|wgs-hiseq|wgs-novaseq]
                                  Only list samples from the specified
                                  platforms, can be repeated for multiple
                                  platforms  [default: all]

  --show-data-types               List data types available for each sample
                                  [default: False]

  -h, --help                      Show this message and exit.
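The platform filter behaves like the sketch below, which walks the datasets.json structure inferred from the jq examples earlier (hypothetical minimal data; not the command's actual code):

```python
def list_samples(datasets, platforms=None):
    """Return (platform, sample_id) pairs, optionally restricted to the
    given platforms, mirroring the list-samples --platform option."""
    out = []
    for platform, info in sorted(datasets.items()):
        if platforms and platform not in platforms:
            continue
        for sample_id in sorted(info.get("samples", {})):
            out.append((platform, sample_id))
    return out

datasets = {
    "target": {"samples": {"Diag-EHG122-NA12878": {"version": "v3.0-rel"}}},
    "goldstandard": {"samples": {"AshkenazimTrio/HG002_NA24385_son": {"version": "NISTv3.3.2/GRCh37"}}},
}

for platform, sample_id in list_samples(datasets, platforms=["target"]):
    print(f"{platform}: {sample_id}")
# target: Diag-EHG122-NA12878
```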

Docker installation

Credentials are loaded into the Docker container by an environment file whose path is stored in the DO_CREDS makefile variable.

Setup

Requires:

  • Docker
  • make

Installation:

  1. clone the repo: git clone git@gitlab.com:ousamg/data/vcpipe-testdata.git
  2. enter the repo: cd vcpipe-testdata
  3. Build the local image: make build

Usage

Currently, the makefile targets only download / upload all data. They do check what data, if any, has changed, and only upload new data / download data you do not have. For more complicated options, install locally or use make shell to enter the container and run vcpipe_testdata directly.

  • Full function list and help: make help
  • Downloading data: make download-data DO_CREDS=~/.digitalocean/do_creds
  • Packaging data (only changes local files, no creds needed): make package-data
  • Uploading data: make upload-data DO_CREDS=~/.digitalocean/do_creds
  • Get a Docker shell to run vcpipe_testdata as if in a local install: make shell DO_CREDS=~/.digitalocean/do_creds