
HTS Bioinf - Anno with anno-targets

Scope

This procedure describes the anno system, which comprises the ella-anno and anno-targets repositories. It is the annotation part of the pipeline as well as the annotation service used by ELLA.

Definitions

ELLA: software for interpretation of genetic variants.

Supervisor: software module to monitor processes.

Supervisor UI: the webpage to access Supervisor functionality

TSD: Tjenester for Sensitive Data (infrastructure at USIT, University of Oslo)


What is anno and anno-targets?

anno is an annotation service, with an API and an internal queue system. It can accept requests for annotation of specific samples (or vcf files), and then run a specified target.

anno-targets are the available targets for anno. A target is basically a postprocessing job that is run after the annotation has been performed. It can, for example, consist of creating reports, statistics or IGV tracks from the incoming data.

We currently run anno with anno-targets as a service on TSD, but it is also integrated in the pipeline and the backup pipeline.

anno-target: ella

There is currently just one target in our anno-targets, named ella. It performs many different tasks, all aimed at preparing the data for import into ELLA. These tasks include creating a pipeline report (with CNV information, coverage information, potential low coverage warnings, sample metadata and version information), an Excel file for backup use, a coverage report PDF and IGV tracks for ELLA. It also formats all files and places them in the directory structure required by ELLA.

Sample repo

anno utilises a sample repo definition file located in sample-repo/samples.json.

A process controlled by Supervisor updates samples.json at regular intervals. The code is located in sample-repo/, which is a git repo inside TSD. Any changes should be committed to git immediately.

The sample repo script scans the vcpipe database along with /tsd/p22/data/durable{,2,3}/production/preprocessed to find all samples that should be available for annotation via anno (and, by extension, within ELLA).
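As an illustration of the directory scan described above, the following is a minimal sketch, not the actual sample repo script. It only covers the file system part (the vcpipe database lookup is left out), and it assumes that every immediate subdirectory of a preprocessed directory is one sample. The function names and the JSON layout are hypothetical.

```python
import json
from pathlib import Path


def expand_roots(base="/tsd/p22/data", suffixes=("", "2", "3")):
    """Expand durable{,2,3} into the three preprocessed directories."""
    return [
        Path(base) / f"durable{suffix}" / "production" / "preprocessed"
        for suffix in suffixes
    ]


def scan_samples(roots):
    """Map sample name to directory for all samples under the given roots.

    Assumption: each immediate subdirectory of a root is one sample.
    """
    samples = {}
    for root in roots:
        if not root.is_dir():
            continue  # skip durable areas that do not exist on this host
        for entry in sorted(root.iterdir()):
            if entry.is_dir():
                # Keep the first hit if a name occurs under several roots.
                samples.setdefault(entry.name, str(entry))
    return samples


def write_samples_json(samples, target):
    """Write the mapping as JSON (hypothetical layout, not anno's format)."""
    Path(target).write_text(json.dumps(samples, indent=2, sort_keys=True))
```

A periodic job could then call write_samples_json(scan_samples(expand_roots()), "sample-repo/samples.json"), although the real samples.json is likely to carry more information per sample than this sketch does.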

Background

The anno system consists of two linked code repositories, ella-anno and anno-targets, and is dedicated to the annotation of variant call files produced by the variant calling pipeline.

The repository ella-anno administers generic and openly available annotation sources. The purpose of the repository anno-targets is two-fold:

  1. It is for our internal (and/or restricted) annotation sources, such as our internal databases and HGMD Pro.
  2. It contains the ella-target that postprocesses the data that is specific to our users. This includes for example markdown reports displayed in ELLA (report.txt and warning.txt), coverage reports, IGV tracks, attachments, and backup excel sheets.

To ensure smooth operations, the core functionalities of ella-anno and anno-targets are implemented in software containers (docker >> singularity). The repositories' Dockerfiles provide the instructions to build such containers. The anno-targets build starts from ella-anno and extends its Dockerfile to integrate our ella-target into a new, extended image. The resulting image can be used both to spin up a service (which is what ELLA uses) and to execute annotation plus target from the command line (which is what the pipeline does).

In summary:

ella-anno contains:

  • Code for the API
  • Administration of generic and open annotation sources (VEP, FASTA, gnomAD++), stored on DigitalOcean
  • Possibility of extension with additional targets

anno-targets contains:

  • Code for pre- and post-processing data in the ella-target
  • Administration of restricted and internal annotation sources (inDB, HGMD++), stored on DigitalOcean
  • With anno-targets available in the image under /anno-targets, and an environment variable pointing anno to this directory, one can trigger the API or CLI of anno with target ella

Two modes of running anno

The pipeline (and the repository tests) trigger annotation through annotate_with_target, while ELLA's anno service does so through the task.py Python module. Both use the master script annotate.sh.

Development

During development, it is desirable to run specific pieces of software within the containers. This is simplified by each repository's Makefile, which takes care of setting and propagating all necessary variables and of building the needed targets.

  • dev/run: run functional containers
  • build: build complete images
  • build-*: build partial images; in particular, the build-ella-annobuilder target builds the [ella]annobuilder image on top of which the anno-targets builder is itself constructed.
  • generate-*: generate all data or parts thereof (*package targets)
  • download-*: download all data or parts thereof (download-package target)
  • upload-*: upload all data or parts thereof (upload-package target)
  • test-*: simulate production pipeline calls on pre-packaged test cases
  • singularity-*: --

The root directory /anno is implicitly created (by WORKDIR instructions) in anno's base Docker image. Some other directories, e.g., the ops directory, are implicitly created by ADD or COPY commands issued for parts of their content in anno's base Docker image.

The data management targets in anno's Makefile are wrappers for specific calls to that Makefile's own annobuilder-template. They do not build any Docker images themselves but rely on the "super" targets in anno-targets's Makefile to do so.

By default, the data management targets in anno's Makefile sync the datasets defined in anno's own ops/datasets.json file. When they are triggered by the "super" targets in anno-targets's Makefile, the datasets are instead looked up in anno-targets's datasets.json file. The exception to this rule is the download-data target in anno-targets's Makefile (see the DATASETS_OPT variable).

Note! Targets like download-data currently raise a RuntimeError in sync_data.py whenever the data directory contains a version of some dataset other than the one expected from datasets.json. Set the variable RUN_CMD_ARGS to --force before the make command to force removal and subsequent update of any offenders.
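The behaviour in the note above can be pictured with the following minimal sketch. It is not the code of sync_data.py: the function name, its parameters and the assumed directory layout (data directory, then dataset name, then one subdirectory per version) are hypothetical and only serve to show the "fail on unexpected version unless forced" logic.

```python
import shutil
from pathlib import Path


def check_dataset(data_dir, name, expected_version, force=False):
    """Return the directory where the expected dataset version should live.

    Assumed (hypothetical) layout: <data_dir>/<name>/<version>/
    Raises RuntimeError if other versions are present and force is False.
    """
    dataset_dir = Path(data_dir) / name
    wanted = dataset_dir / expected_version
    stale = []
    if dataset_dir.is_dir():
        stale = [
            path
            for path in dataset_dir.iterdir()
            if path.is_dir() and path.name != expected_version
        ]
    if stale and not force:
        found = ", ".join(sorted(path.name for path in stale))
        raise RuntimeError(
            f"{name}: found version(s) {found}, expected {expected_version}"
        )
    for path in stale:
        shutil.rmtree(path)  # force: remove offenders before the update
    return wanted
```

With force=False the sketch stops at the first dataset with an unexpected version, which corresponds to the RuntimeError described above. With force=True the unexpected versions are removed so that the expected one can be downloaded afterwards.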

The environment variable REGIONS is exported by ${SCRATCH}/workdir/target.source, a dynamically generated set of instructions. This file is sourced by ${SCRATCH}/workdir/cmd.sh, itself a dynamically generated script and the one responsible for spawning child shell processes via ella preprocess, annotate and annotate.sh, and main ella scripts.

Note! REGIONS is initialized to point to a copy of the gene panel's transcripts regions file but is subsequently processed by the ella preprocess script.

Change management

Change requests and errors are registered as issues in GitLab (https://www.gitlab.com/alleles/ella-anno for general issues). Each issue is discussed with the system responsible before work is started.

The source code is stored in git repositories https://www.gitlab.com/alleles/ella-anno and https://gitlab.com/alleles/anno-targets.

Work is done in a separate branch, and the app is tested in a dedicated test environment. When approved by the users and the system responsible, the changes are added to the main branch.

Before deployment, the source code is tagged, and an application is created and transferred to TSD. The application is started in the staging environment and, if approved by the superuser and the system responsible, it can be deployed to production. The superuser verifies the changes.

Changes are documented by filling out "Endringskontroll for bioinformatikk". Hotfixes do not need an "Endringskontroll". Hotfixes are defined as updates fixing bugs in functionality (but not changing the functionality itself). Hotfix versions are denoted by the PATCH number increasing (version is given by MAJOR.MINOR.PATCH).
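The hotfix rule above can be expressed as a small check on two version strings. This is a minimal sketch under the assumption that versions are plain MAJOR.MINOR.PATCH strings with integer parts. The function names are hypothetical and not part of the anno repositories.

```python
def parse_version(text):
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of three integers."""
    major, minor, patch = text.split(".")
    return int(major), int(minor), int(patch)


def is_hotfix(previous, new):
    """Return True when only the PATCH number increases between versions."""
    p_major, p_minor, p_patch = parse_version(previous)
    n_major, n_minor, n_patch = parse_version(new)
    return (p_major, p_minor) == (n_major, n_minor) and n_patch > p_patch
```

For example, is_hotfix("2.3.1", "2.3.2") is True, so no "Endringskontroll" would be needed for that release, whereas going from "2.3.1" to "2.4.0" changes the MINOR number and is not a hotfix.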

See HTS Bioinf - Release and deployment of anno system for release and deployment details.

References

HTS Bioinf - Running annoservice