[[TOC]]
Description
Manage, fetch and modify reference datasets stored in DigitalOcean
Data is stored in the data directory, created in the repo root. There are three levels of data:
- Nature -- the purpose or nature of the data: capturekit, mutect2
- Data type -- sub-types of data (common is used if it doesn't make sense to define one): common
- Kind -- a specific set of data in a nature, like 'agilent_sureselect_v05' in capturekit
On disk the data is laid out like capturekit/common/agilent_sureselect_v05 (generally {nature}/common/{kind ID})
with the corresponding key in datasets.json:
"capturekit": {
"kinds": {
"agilent_sureselect_v05":
In DigitalOcean the layout is: {nature} / {data type} / {kind id} / {version} / {files}, e.g. capturekit / common / agilent_cre_v02 / ousamgv1.0 / GRCh37 / {files}
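The two layouts differ only in the version component. A minimal sketch of how they relate, using hypothetical helper names (`local_path`, `remote_prefix`) that are not part of the package's API:

```python
from pathlib import PurePosixPath

# Hypothetical helpers for illustration only; vcpipe_refdata may build
# these paths differently internally.

def local_path(nature, data_type, kind_id, data_root="data"):
    """Local layout: {data_root}/{nature}/{data type}/{kind id} (no version)."""
    return str(PurePosixPath(data_root, nature, data_type, kind_id))

def remote_prefix(nature, data_type, kind_id, version):
    """DigitalOcean layout: {nature}/{data type}/{kind id}/{version}."""
    return str(PurePosixPath(nature, data_type, kind_id, version))

if __name__ == "__main__":
    print(local_path("capturekit", "common", "agilent_sureselect_v05"))
    print(remote_prefix("capturekit", "common", "agilent_cre_v02", "ousamgv1.0"))
```

Files for a dataset then sit below the remote prefix, as in the example above.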
Data is versioned, but version info is not stored in the local directory path. The datasets.json file contains all data info for programmatic use (downloading, uploading, packaging). To view the version of data currently downloaded, check the DATA_READY file in the sample level of the directory tree. You can also use jq to quickly extract the version info from datasets.json.
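If jq is not at hand, the same lookup can be sketched in Python. This assumes only the nesting shown in the excerpt above (nature, then "kinds", then the kind ID). The fields inside each entry are not shown in this README, so the sketch returns the whole entry rather than guessing a version key. The function name `kind_entry` is hypothetical.

```python
import json
from pathlib import Path

def kind_entry(datasets_file, nature, kind_id):
    """Return the datasets.json entry for one kind.

    Assumes the structure {nature: {"kinds": {kind_id: {...}}}}.
    Raises KeyError if the nature or kind is not present.
    """
    datasets = json.loads(Path(datasets_file).read_text())
    return datasets[nature]["kinds"][kind_id]

if __name__ == "__main__":
    entry = kind_entry("datasets.json", "capturekit", "agilent_sureselect_v05")
    print(json.dumps(entry, indent=2))
```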
Update production
A tar of the dataset is uploaded to NSC and TSD. In CI there are jobs that automatically download from DO and create files for new/updated datasets. Once uploaded to TSD/NSC, run the script production/refdata/deploy-refdata.sh. Typically you'd also want to create a version file (part of the deploy script). You should tag this repo's master branch with the same version as created by the above script.
Using vcpipe-refdata
The package can be installed locally or used via Docker to keep your environment clean. Both methods require you to have working DigitalOcean Spaces credentials. For instructions on creating these creds, see: https://www.digitalocean.com/community/tutorials/how-to-create-a-digitalocean-space-and-api-key#creating-an-access-key.
The best way to store these is in a file that can be sourced/exported or passed as a parameter to make (see the DO_CREDS variable under Docker installation).
Local installation
Credentials are stored in the environment variables SPACES_KEY and SPACES_SECRET.
Setup
Requires:
- Python >= 3.6
- libcurl4
Installation:
- clone the repo:
git clone git@gitlab.com:ousamg/data/vcpipe-refdata.git
- enter the repo:
cd vcpipe-refdata
- install:
python setup.py install
Usage
Using credentials
Before you do anything, you must load your credentials into your environment. If necessary, you can specify them on the command line with the --spaces-key and --spaces-secret options, but this exposes them in ps output and in your bash history and is not recommended.
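A quick way to confirm the credentials are loaded before running any command is to check the two environment variables. The sketch below is an illustration, not part of vcpipe_refdata. The function name `missing_credentials` is hypothetical.

```python
import os

REQUIRED = ("SPACES_KEY", "SPACES_SECRET")

def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        raise SystemExit("Missing credentials: " + ", ".join(missing))
    print("Credentials found in environment")
```

Only the variable names are reported, so the secret values never end up in logs or terminal output.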
Commands and shared options
Usage: vcpipe_refdata [OPTIONS] COMMAND [ARGS]...
Options:
--spaces-key TEXT DigitalOCean Spaces API key
--spaces-secret TEXT DigitalOcean Spaces API secret
--verbose Increase log level
--debug Max logging
-d, --datasets FILE JSON file containing datasets and versions [default:
datasets.json]
--threads INTEGER Maximum number of threads to use [default: (20)]
--help Show this message and exit.
Commands:
download Download specified test data
list-kinds List the various kinds of data of specified nature
package Marks new test data as ready for upload
upload Upload specified test data (must be packaged first)
The default number of threads is determined dynamically: it is either the number of processors (nproc) or 20, whichever is smaller. You can also specify Spaces credentials via the options, but this is not recommended as they will then be shown in ps output.
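The thread default described above amounts to a capped processor count. A minimal sketch, assuming `os.cpu_count()` as a stand-in for nproc (the two can differ when CPU affinity is restricted), with a hypothetical function name:

```python
import os

def default_threads(cap=20):
    """Return the smaller of the processor count and the cap.

    os.cpu_count() can return None on some platforms, so fall back to 1.
    """
    return min(os.cpu_count() or 1, cap)

if __name__ == "__main__":
    print(default_threads())
```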
To modify the options shared by all commands, use them before specifying the command. For example, to download data using only 4 threads instead of the maximum available you would run vcpipe_refdata --threads 4 download.
Downloading data
Usage: vcpipe_refdata download [OPTIONS]
Download specified test data
Options:
--data-type [all|common]
Type of data to download [default: all]
--nature [all|capturekit|mutect2]
Only download data from a specific platform,
can be repeated for multiple platforms
[default: all]
--kind-id KIND_ID Only download data from of specific kind
ID, can be repeated for multiple samples
--version VERSION Only download a specific version. Default:
the version in datasets.json
-h, --help Show this message and exit.
The local DATA_READY file can become "stale" when downloading multiple times/versions. If the DATA_READY file has the same size it won't be updated. The date always has the same number of characters and thus the same size; the version part will have the same size for versions with the same number of characters, like v1.0 and v1.1.
To prevent this you can manually delete the DATA_READY file before downloading.
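The manual cleanup can be scripted. The sketch below assumes the DATA_READY file sits directly inside the kind directory (for example data/capturekit/common/agilent_sureselect_v05). The function name `clear_data_ready` is hypothetical.

```python
from pathlib import Path

def clear_data_ready(kind_dir):
    """Delete the DATA_READY marker in kind_dir if present.

    Returns True if a file was removed, False if there was nothing to remove.
    """
    marker = Path(kind_dir) / "DATA_READY"
    if marker.is_file():
        marker.unlink()
        return True
    return False

if __name__ == "__main__":
    removed = clear_data_ready("data/capturekit/common/agilent_sureselect_v05")
    print("removed stale marker" if removed else "no marker found")
```

Running this before a download forces the marker to be fetched again, whatever its size.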
Packaging and Uploading data
When adding a new dataset, or updating an existing one, you must package the data before you upload it. The new or updated dataset should be placed under data/NATURE/DATA-TYPE/KIND, e.g. data/capturekit/common/agilent_sureselect_v05. Packaging writes the DATA_READY file to the directory with the timestamp of when it was packaged and the version number. Data that has not been packaged will not be recognized as ready to upload.
Running vcpipe_refdata package will look for new data with an entry in datasets.json but without DATA_READY files and generate them appropriately. You can use the options to restrict which directories it packages.
Running vcpipe_refdata upload will attempt to upload any directory with a DATA_READY file that does not already exist in DigitalOcean. You can use additional options to restrict which directories are uploaded.
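To preview which directories the package step would pick up, the rule above (an entry in datasets.json but no DATA_READY file) can be sketched as follows. This is an illustration under assumptions, not the package's implementation: it assumes the nature / "kinds" / kind ID nesting of datasets.json, the local layout {data root}/{nature}/{data type}/{kind id}, and a single data type (common by default). The function name `unpackaged_kinds` is hypothetical.

```python
from pathlib import Path

MARKER = "DATA_READY"

def unpackaged_kinds(data_root, datasets, data_type="common"):
    """Yield kind directories that are listed in datasets but lack a marker.

    datasets is the parsed content of datasets.json (a dict).
    Kinds without a local directory are skipped.
    """
    root = Path(data_root)
    for nature, info in datasets.items():
        for kind_id in info.get("kinds", {}):
            kind_dir = root / nature / data_type / kind_id
            if kind_dir.is_dir() and not (kind_dir / MARKER).exists():
                yield kind_dir

if __name__ == "__main__":
    import json
    datasets = json.loads(Path("datasets.json").read_text())
    for path in unpackaged_kinds("data", datasets):
        print(path)
```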
Packaging data
Usage: vcpipe_refdata package [OPTIONS]
Marks new test data as ready for upload
Options:
--nature [capturekit|mutect2]
[default: all]
--kind-id KIND_ID Only package specific data of the given nature,
can be repeated for multiple samples
-h, --help Show this message and exit.
Uploading data
Usage: vcpipe_refdata upload [OPTIONS]
Upload specified test data (must be packaged first)
Options:
see the 'package' command
Other functionality
list-kinds
List the various data
Usage: vcpipe_refdata list-kinds [OPTIONS]
List data for each selected nature
Options:
--nature [all|capturekit|mutect2]
[default: all]
--show-data-types List data types available for each sample
[default: False]
-h, --help Show this message and exit.
Docker installation
Credentials are loaded into the Docker container by an environment file whose path is stored in the DO_CREDS makefile variable.
Setup
Requires:
- Docker
- make
Installation:
- clone the repo:
git clone git@git.ousamg.io:data/vcpipe-refdata.git
- enter the repo:
cd vcpipe-refdata
- Build the local image:
USER=ousamg make build
Do this on the build server (tomato) so that the vcpipe CI pipeline can use the image to download the data it needs.
Usage
Manually enter commands inside docker:
DATA_DIR=/Users/severin/tmp/vcpipe-refdata DO_CREDS="~/.digitalocean/do_creds" make shell
Then:
vcpipe_refdata package --nature capturekit --kind-id agilent_cre_v02
Currently, makefile actions only work to download / upload all data. They do check what data has changed, if any, and only upload new data / download data you do not have. For more complicated options, install locally or use make shell to enter the container and run vcpipe_refdata directly.
- Full function list and help
make help
- Downloading data
make download-data DO_CREDS=~/.digitalocean/do_creds
- Uploading data
make package-data (only changes local files, no creds needed)
make upload-data DO_CREDS=~/.digitalocean/do_creds
- Get a docker shell to run vcpipe_refdata as if in a local install
make shell DO_CREDS=~/.digitalocean/do_creds
Example on tomato: We want data to be accessible for gitlab-runner in CI pipeline.
USER=ousamg GROUP_ID=$(id -g gitlab-runner) make build # needed if datasets.json or a py file has changed
DATA_DIR=/storage/pipeline-refdata DO_CREDS="~/.digitalocean/do_creds" make shell
vcpipe_refdata package --nature capturekit --kind-id agilent_cre_v02
vcpipe_refdata upload --nature capturekit --kind-id agilent_cre_v02
vcpipe_refdata package --nature genomic --kind-id general
vcpipe_refdata upload --nature genomic --kind-id general
vcpipe_refdata package --nature funcAnnot --kind-id master-genepanel
vcpipe_refdata upload --nature funcAnnot --kind-id master-genepanel