
[[TOC]]

Description

Manage, fetch and modify reference datasets stored in DigitalOcean

Data is stored in the data directory in the repo root. There are three levels of data:

  1. Nature -- the purpose or nature of the data, e.g. capturekit, mutect2
  2. Data type -- sub-types of data (common is used if it doesn't make sense to define one), e.g. common
  3. Kind -- a specific set of data within a nature, like 'agilent_sureselect_v05' in capturekit

On disk the data is laid out like capturekit/common/agilent_sureselect_v05 (generally {nature}/{data type}/{kind ID}) with the corresponding key in datasets.json: "capturekit": { "kinds": { "agilent_sureselect_v05":
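For illustration, a minimal complete entry could look like the following sketch; the nesting and the version field are taken from the jq example below, anything beyond that is an assumption:

{
  "capturekit": {
    "kinds": {
      "agilent_sureselect_v05": {
        "version": "ousamgv1.0/GRCh37"
      }
    }
  }
}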

In DigitalOcean the layout is: {nature} / {data type} / {kind id} / {version} / {files}, e.g. capturekit / common / agilent_cre_v02 / ousamgv1.0 / GRCh37 / {files} (note that the version itself can contain a slash, e.g. ousamgv1.0/GRCh37).

Data is versioned, but version info is not stored in the directory path. The datasets.json file contains all data info for programmatic use (downloading, uploading, packaging). To view the version of data currently downloaded, check the DATA_READY file at the kind level of the directory tree.
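For example (a hypothetical listing; the file holds the packaging timestamp and the version, but the exact format may differ):

$ cat data/capturekit/common/agilent_sureselect_v05/DATA_READY
2021-03-15 10:42:01 ousamgv1.0/GRCh37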

You can also use jq to quickly extract the version info from datasets.json.

$ jq -r '.capturekit.kinds["agilent_sureselect_v05"].version' datasets.json
ousamgv1.0/GRCh37
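Similarly, assuming the same structure, you can list all kind IDs of a nature (output depends on your datasets.json):

$ jq -r '.capturekit.kinds | keys[]' datasets.json
agilent_cre_v02
agilent_sureselect_v05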

Update production

A tar of the dataset is uploaded to NSC and TSD. In CI there are jobs to automatically download from DO and create tar files for new/updated datasets. Once uploaded to TSD/NSC, run the script production/refdata/deploy-refdata.sh. Typically you'd also want to create a version file (part of the deploy script). You should tag this repo's master branch with the same version as created by the above script.
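A sketch of those final steps (vX.Y is a placeholder; use the version produced by the deploy script):

$ ./production/refdata/deploy-refdata.sh    # on TSD/NSC
$ git tag vX.Y && git push origin vX.Y      # in this repo, same version as above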

Using vcpipe-refdata

The package can be installed locally or used via Docker to keep your environment clean. Both methods require you to have working DigitalOcean Spaces credentials. For instructions on creating these credentials, see: https://www.digitalocean.com/community/tutorials/how-to-create-a-digitalocean-space-and-api-key#creating-an-access-key.

The best way to store these is in a file that can be sourced/exported or passed as a parameter to make. For example:

$ cat ~/.digitalocean/do_creds
SPACES_KEY=<api_key>
SPACES_SECRET=<api_secret>
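Since this file contains secrets, it is sensible to make it readable only by you:

$ chmod 600 ~/.digitalocean/do_creds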

Local installation

Credentials are stored in the environment variables SPACES_KEY and SPACES_SECRET.

Setup

Requires:

  • Python >= 3.6
  • libcurl4

Installation:

  1. clone the repo: git clone git@gitlab.com:ousamg/data/vcpipe-refdata.git
  2. enter the repo: cd vcpipe-refdata
  3. install: python setup.py install
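To verify the installation, check that the entry point is on your PATH (it should print the usage shown below):

$ vcpipe_refdata --help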

Usage

Using credentials

Before you do anything, you must load your credentials into your environment. If necessary, you can specify them on the command line with the --spaces-key and --spaces-secret options, but this reveals them in ps output and in your bash history and is not recommended.

source ~/.digitalocean/do_creds; export SPACES_KEY SPACES_SECRET

Commands and shared options

Usage: vcpipe_refdata [OPTIONS] COMMAND [ARGS]...

Options:
  --spaces-key TEXT     DigitalOcean Spaces API key
  --spaces-secret TEXT  DigitalOcean Spaces API secret
  --verbose             Increase log level
  --debug               Max logging
  -d, --datasets FILE   JSON file containing datasets and versions  [default:
                        datasets.json]

  --threads INTEGER     Maximum number of threads to use  [default: (20)]
  --help                Show this message and exit.

Commands:
  download      Download specified test data
  list-kinds    List the various kinds of data of specified nature
  package       Marks new test data as ready for upload
  upload        Upload specified test data (must be packaged first)

The default number of threads to use is determined dynamically: it is either the number of processors (nproc) or 20, whichever is smaller. You can also specify Spaces credentials via the options, but this is not recommended as they will then be shown in ps output.

To modify the options shared by all commands, specify them before the command, e.g. to download data using only 4 threads instead of the maximum available:

vcpipe_refdata --threads 4 download

Downloading data

Usage: vcpipe_refdata download [OPTIONS]

  Download specified test data

Options:
  --data-type [all|common]
                                  Type of data to download  [default: all]
  --nature [all|capturekit|mutect2]
                                  Only download data from a specific platform,
                                  can be repeated for multiple platforms
                                  [default: all]

  --kind-id KIND_ID           Only download data of a specific kind ID,
                                  can be repeated for multiple samples

  --version VERSION               Only download a specific version. Default:
                                  the version in datasets.json

  -h, --help                      Show this message and exit.

The local DATA_READY file can become "stale" when downloading multiple times or versions: if the new DATA_READY file has the same size as the existing one, it won't be updated. The date always has the same number of characters (and thus size), and the version part has the same size for versions with the same number of characters, like v1.0 and v1.1.

To prevent this, you can manually delete the DATA_READY file before downloading.
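For example, to force a clean re-download of a single kind (path per the on-disk layout described above):

$ rm data/capturekit/common/agilent_sureselect_v05/DATA_READY
$ vcpipe_refdata download --nature capturekit --kind-id agilent_sureselect_v05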

Packaging and Uploading data

When adding a new dataset, or updating an existing one, you must package the data before you upload it. The new or updated dataset should be placed under data/NATURE/DATA-TYPE/KIND, e.g. data/capturekit/common/agilent_sureselect_v05. Packaging writes the DATA_READY file to the directory with the timestamp of when it was packaged and the version number. Data that has not been packaged will not be recognized as ready to upload.

Running vcpipe_refdata package will look for new data that has an entry in datasets.json but no DATA_READY file, and generate the DATA_READY files appropriately. You can use the options to restrict which directories it packages.

Running vcpipe_refdata upload will attempt to upload any directory with a DATA_READY file that does not already exist in DigitalOcean. You can use additional options to restrict which directories are uploaded.
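A typical sequence for a new or updated capturekit kind (mirroring the tomato example at the end of this README):

$ vcpipe_refdata package --nature capturekit --kind-id agilent_sureselect_v05
$ vcpipe_refdata upload --nature capturekit --kind-id agilent_sureselect_v05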

Packaging data
Usage: vcpipe_refdata package [OPTIONS]

  Marks new test data as ready for upload

Options:
  --nature [capturekit|mutect2]
                                  [default: all]

  --kind-id KIND_ID           Only package specific data of the given nature,
                                  can be repeated for multiple samples

  -h, --help                      Show this message and exit.
Uploading data
Usage: vcpipe_refdata upload [OPTIONS]

  Upload specified test data (must be packaged first)

Options:
  see the 'package' command

Other functionality

list-kinds

List the various kinds of data

Usage: vcpipe_refdata list-kinds [OPTIONS]

  List data for each selected nature

Options:
  --nature [all|capturekit|mutect2]
                                  [default: all]
  --show-data-types               List data types available for each sample
                                  [default: False]

  -h, --help                      Show this message and exit.
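For example, to list the capturekit kinds along with their data types:

$ vcpipe_refdata list-kinds --nature capturekit --show-data-types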

Docker installation

Credentials are loaded into the Docker container by an environment file whose path is stored in the DO_CREDS makefile variable.

Setup

Requires:

  • Docker
  • make

Installation:

  1. clone the repo: git clone git@git.ousamg.io:data/vcpipe-refdata.git
  2. enter the repo: cd vcpipe-refdata
  3. Build the local image: USER=ousamg make build

Do this on the build server (tomato) so that the vcpipe CI pipeline can use the image to download the data it needs.

Usage

Manually enter commands inside Docker: DATA_DIR=/Users/severin/tmp/vcpipe-refdata DO_CREDS="~/.digitalocean/do_creds" make shell

Then: vcpipe_refdata package --nature capturekit --kind-id agilent_cre_v02

Currently, the makefile actions only work to download/upload all data. They do check to see what data has changed, if any, and only upload new data / download data you do not have. For more complicated options, install locally or use make shell to enter the container and run vcpipe_refdata directly.

  • Full function list and help: make help
  • Download data: make download-data DO_CREDS=~/.digitalocean/do_creds
  • Upload data:
      • make package-data -- only changes local files, no creds needed
      • make upload-data DO_CREDS=~/.digitalocean/do_creds
  • Get a docker shell to run vcpipe_refdata as if in a local install: make shell DO_CREDS=~/.digitalocean/do_creds

Example on tomato: We want data to be accessible for gitlab-runner in the CI pipeline.

  • USER=ousamg GROUP_ID=$(id -g gitlab-runner) make build # needed if datasets.json or a .py file changed
  • DATA_DIR=/storage/pipeline-refdata DO_CREDS="~/.digitalocean/do_creds" make shell
  • vcpipe_refdata package --nature capturekit --kind-id agilent_cre_v02
  • vcpipe_refdata upload --nature capturekit --kind-id agilent_cre_v02
  • vcpipe_refdata package --nature genomic --kind-id general
  • vcpipe_refdata upload --nature genomic --kind-id general
  • vcpipe_refdata package --nature funcAnnot --kind-id master-genepanel
  • vcpipe_refdata upload --nature funcAnnot --kind-id master-genepanel