Variant calling pipeline

  • Single and trio sample pipelines (mapping, variant calling, annotation)
  • Automatic start of pipelines on the local machine or on a compute cluster using the slurm queue manager
  • Web interface showing live status, logs and the ability to restart/delete analyses
  • Pipelines are defined using Nextflow. Since vcpipe release 6, DSL2 is used.

The execution platform is Linux and Python 3.6.x.

The env variable 'SYSTEM' must be defined and is used to pick the relevant settings and Nextflow profile. See the various pipeline startup scripts in exe/pipeline.
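
A minimal sketch of a pipeline start (both values are placeholders; see the scripts in exe/pipeline for real usage):

   SYSTEM=<system-name> exe/pipeline/<startup-script>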

The main Nextflow scripts are in src/pipeline. The individual processes and some common workflows are located in src/pipeline/modules.

Test data is downloaded from DigitalOcean and managed using the project 'https://gitlab.com/dpipe/variantcalling/vcpipe-testdata'.

More documentation can be found at:

Development

The various components run in Docker. To adapt to your specific host system, some variables need to be set properly; see more below.

Getting started

To start the execution environment (Docker) where you can start a pipeline:

  • make build
  • make dev

To start the executor CI environment:

  • make build
  • make executor-ci-setup

Make sure to define your environment by setting these variables before 'make executor-ci-setup':

  • MOUNTROOT
  • SINGULARITY_LOCATION
  • TESTDATA_LOCATION
  • REFDATA_LOCATION
  • SCRATCH_LOCATION

Like: TESTDATA_LOCATION="~/tmp/" REFDATA_LOCATION="~/tmp/" MOUNTROOT="~/tmp" SINGULARITY_LOCATION="~/tmp/singularity" SCRATCH_LOCATION="~/tmp" make dev

Then you can poke around the environment with 'make shell'. This is convenient for trying out build commands before putting them in the Dockerfile.

Local executor tests

To start postgres, webui, executor and some executor tests:

  • make executor-ci-setup # with the proper env vars set, see the example below
  • make test-executor
  • make executor-ci-shell (enter container for debugging)

Like: GENEPANELS_LOCATION="~/work/genepanels" TESTDATA_LOCATION=~/work/vcpipe-testdata/data REFDATA_LOCATION="~/tmp/" MOUNTROOT="~/tmp" SINGULARITY_LOCATION="~/tmp/singularity" SCRATCH_LOCATION="~/tmp" make executor-ci-setup

This will start supervisor, whose config files define the postgres, webui and executor services. The webui and executor are started after postgres is ready.

Open the WebUI in a browser: http://localhost:8000/ (the port might be different depending on what is available on your host)

You can then run make executor-ci-shell to enter the container and then throw some jobs at the executor using /vcpipe/ops/test/executor.sh.

You can enter the environment using:

  • make executor-ci-shell

In case of errors, it's useful to enter the container and manually start the various services:

First start the container:

   GENEPANELS_LOCATION="~/work/genepanels" TESTDATA_LOCATION=~/work/vcpipe-testdata/data REFDATA_LOCATION="~/tmp/" MOUNTROOT="~/tmp" SINGULARITY_LOCATION="~/tmp/singularity" SCRATCH_LOCATION="~/tmp" make executor-ci-setup

Then enter the container with make executor-ci-shell, start the various services and execute the tests:

  • supervisord -c /vcpipe/ops/dev/supervisor-rootprocess.cfg
  • supervisorctl -c /vcpipe/ops/dev/supervisor-rootprocess.cfg status
  • supervisorctl -c /vcpipe/ops/dev/supervisor-rootprocess.cfg start vcpipe:webui
  • supervisorctl -c /vcpipe/ops/dev/supervisor-rootprocess.cfg start vcpipe:executor
  • supervisorctl -c /vcpipe/ops/dev/supervisor-rootprocess.cfg status
  • echo "development versions" > /run-refdata/refdata-current-versions.txt
  • /vcpipe/ops/test/executor.sh
  • su postgres -c /usr/bin/psql
  • in psql: select enum_range(null::analysis_status);

To play around with executor/webui without using supervisor, make sure to set the proper python environment first:

  • cd /dist/.venv (might not be needed)
  • pipenv shell

Local testing

Testing the executor itself:

 GENEPANELS_LOCATION="~/work/genepanels/legacy" MOUNTROOT="~/tmp" SINGULARITY_LOCATION="~/tmp/singularity" make executor-ci-setup test-executor

This starts a container with user-specific mounts and runs tests inside the container.

To test a bioinformatics pipeline (base, anno, trio):

   CI_RUN_LOCALLY=true MOUNTROOT="~/tmp" SINGULARITY_LOCATION="~/tmp/singularity" CI_PROJECT_DIR=~/tmp CAPTURE_KIT="CuCaV1"  make basepipe-integration

To run python unit tests:

  1. Start the container: TESTDATA_LOCATION="~/tmp/" REFDATA_LOCATION="~/tmp/" MOUNTROOT="~/tmp" SINGULARITY_LOCATION="~/tmp/singularity" SCRATCH_LOCATION="~/tmp" make dev
  2. make shell
  3. /vcpipe/ops/test/unit-test

The pipelines depend on some Singularity images, so make sure to have them installed into SINGULARITY_LOCATION. To verify that they are there, run the various make targets:

make test-singularity{-XXX}
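
For example (the target suffix here is hypothetical; see the Makefile for the actual target names):

   make test-singularity-base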

The test data also needs to be downloaded from DigitalOcean using the project 'vcpipe-testdata'. The location of the downloaded test data needs to be mounted into the container; see the Makefile.

Variables

  • VCP_OPTS can be used to specify additional Docker flags, such as if you want to mount additional folders
  • WEBUI_PORT determines which port is available externally for the webui
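
For example (the extra mount and port here are arbitrary; -v is the standard Docker volume flag):

   VCP_OPTS="-v $HOME/extra-data:/extra-data" WEBUI_PORT=8001 make dev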

Supervisor

The various processes running inside Docker are managed by supervisor. There are two config files; one is used when supervisor runs as the container's foreground process.

More info

To control startup of the various processes:

  • start container: .. set vars .. make dev
  • enter container: make shell
  • set vars: EXECUTOR_SETTINGS="/vcpipe/config/settings-docker_executor_ci.json"
  • start processes: supervisord -c /vcpipe/ops/dev/supervisor-rootprocess.cfg

Unit testing

On each commit some quick tests are run, like python unit tests and linting. Building the Docker image and preparing the folders with vcpipe-bundle is also done on each commit.

Run unit tests locally

Create the docker container:

   GENEPANELS_LOCATION="~/work/genepanels" TESTDATA_LOCATION=~/work/vcpipe-testdata/data REFDATA_LOCATION="~/tmp/" MOUNTROOT="~/tmp" SINGULARITY_LOCATION="~/tmp/singularity" SCRATCH_LOCATION="~/tmp" make executor-ci-setup

Enter the container: make executor-ci-shell

Run unit tests: pipenv run python -m pytest --color=yes /vcpipe/src/

Run the vcfped test: pipenv run python3 src/vcfped/tests/vcfped_specialtest.py

Change the source code of the test file to run different tests.

CI tests

The integration and regression tests are usually run on the build server.

The CI pipeline is parameterized so one can easily test various parts.

These types of tests/runs can be triggered via the Gitlab API:

  • base pipeline
  • anno pipeline
  • trio pipeline
  • executor (check that executor can be started and show correct status)
  • refdata (downloads refdata from DO release folder)

For each of the Nextflow pipelines there's an integration and a regression stage. The regression stage can be skipped, but it's run by default.

To initiate a test, use GITLAB_TOKEN_VCPIPE=<your trigger token> ops/trigger-ci-test.sh <type of test> [true/false] [true/false], where <type of test> is one of basepipe, triopipe, annopipe, refdata, executor or deploy. The boolean options determine whether to run the regression test and whether to build a genepanel overview.
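
For example, to run the base pipeline with the regression test enabled but without building a genepanel overview:

   GITLAB_TOKEN_VCPIPE=<your trigger token> ops/trigger-ci-test.sh basepipe true false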

A token is required to access the Gitlab API. You must be made a maintainer of the vcpipe project to be allowed to create one. Once given the proper role, generate a token in the Settings page of the project (https://gitlab.com/ousamg/apps/vcpipe/settings/ci_cd).

The testing is done in Docker containers. The Docker image is rebuilt at the start of the CI pipeline, which should only take a few seconds as Docker caches previous builds. The container has various python packages installed. Other important parts, such as genepanels, are mounted into the container.

The testing of the Nextflow pipelines is divided in two parts: integration and regression. Integration runs the actual Nextflow pipeline script, which is time, memory, CPU and IO demanding; expect > 10 hours. The regression part compares the output of the integration part to predefined values. In Gitlab, the regression jobs have a declared dependency on the integration job, and Gitlab makes the integration artifact available to the regression job in the root of the project (CI_PROJECT_DIR). The first part of the regression job moves this to 'ci-input' (which is listed in .gitignore to avoid making the repo dirty), as sketched below.
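
A sketch of that first step (the artifact name is hypothetical):

   mkdir -p ci-input
   mv $CI_PROJECT_DIR/<integration-artifact> ci-input/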

Because vcpipe has several dependencies and produces a lot of data during a run, several paths are mounted as volumes so that output is written outside of the running container. See the mount options in the 'docker run' commands in the Makefile.

The test data used is fetched from DigitalOcean. The data is downloaded to the build server in a separate stage in the CI pipeline. Data is not downloaded if a copy already exists on the build server. The data is made available to the tests by mounting/copying it into the Docker/Singularity container. Details are in the Makefile and in the test/{basepipe,annopipe,triopipe}-{integration,regression} shell scripts.

Data is downloaded using a docker container created from the project vcpipe-testdata. The Docker image must be created using the appropriate build command of your local clone of that project. The CI tests described in this document assume the image is available on the build server, and the download jobs will fail if it's not.

Singularity in Docker

Singularity has various host requirements that we need to get right in Docker, like:

  • the various mounts (either user-specified or system/default ones like /home, /tmp and /dev) must be owned by the user running singularity (which most often means root in Docker); otherwise singularity gives an ownership error.
  • the identity of the user must be known, so we mount passwd from the host.
  • the version of Singularity installed in Docker should be the same as the version that built the Singularity images used inside Docker (a segmentation fault occurred when mixing 2.5.0 and 2.6.1)
  • The Docker container must run with the '--privileged' flag

Run analyses in the staging environment

A prod-like environment has been set up on both NSC and TSD, in /boston/diag/staging and /ess/p22/data/durable/staging, respectively (hereafter called STAGING_ROOT). The environment is made up of $STAGING_ROOT/sw/vcpipe/vcpipe (executor, webui, Nextflow pipelines), genepanels, refdata, testdata, various python modules, folders for analysis and sample metadata, and result files from the pipeline runs (preprocessed).

The files generated by the pipeline are located in a date-stamped folder inside the analysis folder or in the preprocessed folder. Data for ELLA is copied to /boston/diag/staging/transfer/ (NSC) or to /ess/p22/data/durable/production/ella/ella-staging/data/analyses/incoming (TSD). These paths are defined in config/settings-{tsd,nsc}-staging.json.

There are scripts in $STAGING_ROOT/sw to start and stop the executor and webui. Do 'source $STAGING_ROOT/sw/staging-screen-commands.txt' to make aliases for the start/stop commands available in your shell. The port of the webui is found in config/settings-*.json.

To update the various parts of the environment, use $STAGING_ROOT/sw/deploy.sh (for vcpipe, genepanels, sensitive-db and the python modules) or $STAGING_ROOT/refdata/deploy-refdata.sh. To show the versions installed, use $STAGING_ROOT/sw/show-versions.sh.

When the executor is running, an analysis is automatically started once its metadata, input dependencies and READY file are present in the $STAGING_ROOT/analyses folder. The corresponding sample data must also be present in the $STAGING_ROOT/samples folder (annopipe requires the sample to be present in the database, even though it does not use it).
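
A minimal sketch of staging an analysis by hand (the metadata layout is not shown here; the analysis name matches the example below):

   cp -r Diag-excap24-12445 $STAGING_ROOT/analyses/        # analysis folder with metadata and input dependencies
   cp -r <sample-folders> $STAGING_ROOT/samples/           # corresponding sample data
   touch $STAGING_ROOT/analyses/Diag-excap24-12445/READY   # the executor picks up the analysis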

The pipeline result is generated below the analysis folder itself, e.g., $STAGING_ROOT/analyses/Diag-excap24-12445/result/2022-05-17-12-33-51. Once an analysis has been run, it is recorded in the database. To re-run an analysis, use the delete button in the webui. If you need the old result data, copy/move it before clicking delete in the webui. Also copy/move result data in preprocessed or in ella-staging's incoming or imported folder. If an analysis has already been imported into ELLA, you also need to delete it from ELLA's database using the CLI (ella(/ops/staging-cli.sh)).

Release

Process

This is described in detail in ehåndbok.

Create an issue

An issue is created to keep track of the release process.

Tagging

  • Tag the various modules that should be released in git.
  • Update release-history-*.txt with the names of the git tags.

Trigger CI pipeline

  • Trigger one or several CI pipeline executions as described above, e.g. ops/trigger-ci-test.sh annopipe true
  • Once the execution has finished without error, you can create release artifacts by triggering the CI again using 'deploy' as the argument.
  • Click the play button for 'create-change-overview' to create a file with a change description. The file will be available in the Gitlab CI web interface and should be attached to the Jira issue accompanying the release process.

Transfer artifacts

Get the artifacts from Gitlab.

  • Transfer the release artifact (.tgz) and the corresponding checksum file (.sha1) to the archive folder of the production servers.
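
On the production server, the transfer can be verified against the checksum file, assuming the .sha1 file is in sha1sum's format (the filename is hypothetical):

   sha1sum -c vcpipe-<version>.tgz.sha1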

Deploy on TSD/NSC

  • Stop the running services and deploy the artifacts. The release versions will be registered in the file 'sw/release-history.txt'.
  • sw/deploy.sh archive/{artifact.tgz} {vcpipe,genepanels,sensitive-db,...}

Creating release artifacts using Gitlab's CI pipeline

Release artifacts are created by triggering the CI with 'deploy' as the pipeline type. This will package the vcpipe source files and the python modules into separate tar archives. The tar files are put in the build server's /storage/releases folder.

A checksum file (*.sha1) is created in the same location. These files must then be transferred to the production servers. Another job tars the various python modules (managed by Pipenv in Pipfile/Pipfile.lock); that tar file will also be available in /storage/releases. The filename contains both the vcpipe git sha and the Pipfile.lock sha. If the Pipfile.lock sha is unchanged, there is no need to update the Python modules in production.

There is a manual CI job to create a report with details about the changes between two consecutive versions of the pipeline. When creating a release, the last two lines in release-history*.csv contain two release tags. When developing, the last line should reference a branch, like origin/dev, a hotfix branch, or a release branch. The job is typically played after having tagged the repos for a release, when you want a change summary put in a text file alongside the release tar itself.
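
For illustration, a hypothetical tail of a release-history file during development (the exact format may differ):

   v6.0
   origin/dev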

Download and deploy refdata

There are CI jobs to download refdata from DigitalOcean and package them as tar files; one tar for each versioned dataset, like for the capture kit agilent_sureselect_v05.

The python code (in vcpipe-refdata, usually run in Docker) downloads the various files in the dataset from DO, and then the files are tarred. If a tar file is already present, the call to vcpipe-refdata's download is skipped. To save space on the CI server, the real tar file can be replaced with an empty file with the proper name (*-*.tar).

To create empty files to skip re-download:

cd /storage/releases/refdata
# upload each dataset tar to the p22 project with tacl, then replace it with an
# empty placeholder of the same name so the CI job skips re-downloading it
for t in `ls *-*.tar`; do tacl --upload $t p22; sudo rm $t; sudo touch $t; done

The CI jobs are defined using the Gitlab CI matrix feature, matching on the keywords in 'datasets.json'. Currently, the versions are found in 'datasets.json' in the root of vcpipe. Having version info in vcpipe itself is similar to how versions of images are defined in config/analysis/singularity.json.

On TSD the refdata are found in /ess/p22/cluster/{production,staging}/refdata/data. The 'refdata' folder has scripts to untar the .tar files when deploying new datasets and to gather the versions of the datasets currently in the data folder.

When deploying new refdata:

  • create/update datasets on disk and upload them to DigitalOcean using scripts in vcpipe-refdata
  • update vcpipe's datasets.json
  • run the CI pipeline to download the datasets (only dataset tar files not already existing in /storage/releases/refdata are downloaded/created)
  • when triggering the CI pipeline, set the variable RUN_TEST='refdata'
  • transfer the new dataset tar files to TSD
  • use the script in {production,staging}/refdata to deploy the dataset
  • optionally have the above script also archive the datasets.json found in the deployed copy of the vcpipe module

Python modules

Requires Python 3.6 (NSC's RHEL servers don't have 3.7 as of Feb 2022).

Modules are handled with 'pipenv' (https://realpython.com/pipenv-guide/)

See Pipfile and Pipfile.lock.

To install (preferably inside Docker): pipenv install --ignore-pipfile

To generate the Pipfile.lock (based on versions in Pipfile), run inside docker (with the correct version of python installed): /src/vcpipe$ pipenv lock

The Pipfile.lock is used to install the specific versions of the python modules in the Docker image. There is a Gitlab CI job to gather all python modules that are needed when running executor/webui in production.

Used when running executor/webui:

pipenv install <module>

Used in CI tests:

pipenv install --dev <module>

  • numpy
  • pandas
  • pybedtools
  • openpyxl
  • supervisor

Don't install supervisor through yum; you'll get an old version.

Pipenv issues

Sometimes strange things happen...

Pipenv and version of python: https://pipenv-fork.readthedocs.io/en/latest/diagnose.html#pipenv-does-not-respect-pyenvs-global-and-local-python-versions

Pipenv by default uses the Python it is installed against to create the virtualenv. You can set the --python option to override this.

Local python environment

To juggle between various python versions, use the tool 'pyenv' (https://chamikakasun.medium.com/how-to-manage-multiple-python-versions-in-macos-2021-guide-f86ef81095a6).

  • List available python versions: pyenv install --list
  • Install a version: pyenv install 3.7.10
  • Choose it for the current folder: pyenv local 3.7.10
  • Create the virtualenv with that Python: pipenv --python $(pyenv which python) install

Miscellaneous

Playground for developing the Dockerfile

To avoid lengthy builds during development, it's useful to try out Dockerfile instructions before putting them in the Dockerfile. By using multiple 'FROM' statements in the Dockerfile, we can build an image up until a specific 'FROM' and try out the later instructions in a container based on that image:

  • build up until a 'FROM': FROM=core-tools make build
  • then enter a container based on the image: make docker-dev
  • inside you can try out, say, 'pip install foo', and if it works you put it in the Dockerfile as 'RUN pip install foo'
  • run make kill to stop the playground container

Testing Nextflow scripts

With DSL2 it's become easier to run parts of the Nextflow scripts:

  • assemble the processes and workflows you need to test by creating a workflow to wrap the process and workflows under test, see src/pipeline/test.nf
  • run a specific wrapper workflow (ops/test/run-nxf-test)
  • any binaries mentioned in a process's script section can be "simulated" by creating a bash function or script that is made available to the Nextflow execution environment. There will be an error if the declared output files aren't created; this can be handled by having the bash function touch the relevant files. Note that a bash alias can't be exported to other sub shells, but functions can; see the sketch below.
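
A minimal sketch, assuming a process whose script calls a tool named 'mytool' and declares out.txt as an output (both names are hypothetical):

   mytool() { touch out.txt; }   # fake binary that creates the declared output file
   export -f mytool              # functions (unlike aliases) can be exported to sub shells
   ops/test/run-nxf-test <wrapper-workflow>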

Tools

Postgres (used in staging, production and CI): https://www.postgresql.org/docs/

API

Misc. tools for working with the API and its requests/responses.
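
For example, to drop the (typically large) log attribute from a response copied to the clipboard (pbpaste reads the clipboard on macOS):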

pbpaste | jq 'del(.data[0].attributes.log)'