HTS Bioinf - Anno with anno-targets

Scope

This procedure describes the Anno system, i.e. the software contained in the ella-anno and anno-targets repositories. The Anno system provides the annotation part of the pipeline as well as the annotation service used by ELLA.

Definitions

ELLA: software for interpretation of genetic variants.

Supervisor: software module to monitor processes.

Supervisor UI: the web page to access Supervisor functionality.

TSD: Tjenester for Sensitive Data (infrastructure at USIT, University of Oslo).

What are anno and anno-targets

anno, also called ella-anno after its code repository, is an annotation service with an API and an internal queue system. It accepts requests to annotate specific samples (or VCF files) and then runs a specified target.

anno-targets holds the targets available to anno. A target is a post-processing job that runs after the annotation is performed. It can, for example, create reports, statistics or IGV tracks from the incoming data.

We currently run anno with anno-targets as a service on TSD and as part of the pipeline (annopipe).

The anno-target: ella

At the time of writing, anno-targets contains just one target, called ella, which performs many different tasks that essentially prepare the data for import into ELLA. It is chiefly in charge of creating:

  • a pipeline report containing:
    • CNV information
    • coverage information
    • potential low coverage warnings
    • sample metadata info
    • version info
  • a coverage report PDF
  • IGV tracks for ELLA

The ella target also formats all files and organizes them in the directory structure required by ELLA.

ELLA classifications database

A database of in-house deep-intronic variants classified as ACMG class 4 or 5 is maintained on TSD and regularly updated. Variants in this database are spared by Anno's filters. The database is stored as a VCF file, and its path is passed to Anno through the environment variable ELLA_CLASSIFICATIONS_DB.
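As an illustration, a script could resolve and check this path before starting an annotation run. This is a minimal sketch: only the variable name ELLA_CLASSIFICATIONS_DB comes from this procedure, while the helper name and its error handling are hypothetical and not taken from Anno's code.

```python
import os
from pathlib import Path


def resolve_classifications_db(environ=None):
    """Return the path of the classifications VCF named by ELLA_CLASSIFICATIONS_DB.

    Hypothetical helper for illustration; not part of ella-anno or anno-targets.
    """
    # Fall back to the process environment when no mapping is passed in.
    environ = os.environ if environ is None else environ
    raw = environ.get("ELLA_CLASSIFICATIONS_DB")
    if not raw:
        raise RuntimeError("ELLA_CLASSIFICATIONS_DB is not set")
    path = Path(raw)
    if not path.is_file():
        raise FileNotFoundError(f"Classifications VCF not found: {path}")
    return path
```

Failing early on a missing variable or file makes a misconfigured instance visible before any annotation is attempted.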

Sample repo

Anno utilises a sample repo definition JSON file (samples.json) located in an instance-specific sample repo directory, e.g., for TSD production, /ess/p22/data/durable/production/anno/sample-repo-prod.

An automatic process updates the samples.json file at regular intervals. The code is located at /ess/p22/data/durable/production/dev-ops/src in TSD and is under Git version control. Any changes should be committed immediately.

The sample repo script scans the vcpipe database alongside the /ess/p22/data/durable/<instance>/preprocessed directory to find all samples that should be available for annotation via Anno (and by extension, within ELLA).

Background

The Anno system consists of two linked code repositories -- ella-anno and anno-targets -- and is dedicated to the annotation of variant call files produced by the variant calling pipeline.

The repository ella-anno administers generic and openly available annotation sources. The purpose of the repository anno-targets is two-fold:

  1. It provides access to our internal (and/or restricted) annotation sources, such as our internal databases and HGMD Pro.
  2. It contains the ella target, which post-processes the data in a manner tailored to our specific needs. This includes, for example, markdown reports displayed in ELLA (report.txt and warning.txt), coverage reports, IGV tracks, attachments, and backup Excel sheets.

To ensure smooth operations, the core functionalities of ella-anno and anno-targets are implemented in software containers (built with Docker, then converted to Singularity). The repositories' Dockerfiles provide the instructions to build these containers. The anno-targets build starts from ella-anno and extends its Dockerfile to integrate our ella target into a new, extended image. The resulting image can be used both to spin up a service (which is what ELLA uses) and to execute annotation plus target from the command line (which is what the pipeline does).

In summary:

ella-anno contains:

  • Code for the API
  • Administration of generic and open annotation sources (VEP, FASTA, gnomAD++), stored on DigitalOcean
  • Possibility of extension with additional targets

anno-targets contains:

  • Code for pre- and post-processing data (currently limited to the ella target)
  • Administration of restricted and internal annotation sources (inDB, HGMD++), stored on DigitalOcean
  • With anno-targets available in the image under /anno-targets, and an environment variable pointing to this directory, the possibility to trigger API or CLI calls of anno with target ella

Two modes of running anno

The pipeline (and the repository tests) trigger annotation through annotate_with_target, while ELLA's anno service does so through the task.py Python module. Both use the master script annotate.sh.

Development

During development, it is desirable to run specific pieces of software within the containers. The Docker containers are orchestrated by the repositories' Makefiles, which set and propagate all necessary variables and build the needed targets. The main groups of Makefile targets are:

  • dev/run: run functional containers
  • build: build complete images
  • build-*: build partial images; in particular, the build-ella-annobuilder target builds the [ella]annobuilder image on top of which the anno-targets builder is itself constructed.
  • generate-*: generate all data or parts thereof (*package targets)
  • download-*: download all data or parts thereof (download-package target)
  • upload-*: upload all data or parts thereof (upload-package target)
  • test-*: simulate production pipeline calls on pre-packaged test cases
  • singularity-*: various operations on Singularity containers.

Good to know

The root directory /anno is implicitly created (by WORKDIR instructions) in anno's base Docker image. Some other directories, e.g., the ops directory, are implicitly created by ADD or COPY commands issued for parts of their content in anno's base Docker image.

Anno's Makefile data management targets are wrappers for specific calls to the Makefile's own annobuilder-template. They do not build any Docker images themselves but rely on anno-targets's Makefile "super" targets to do so.

Anno's Makefile data management targets default to syncing data sets defined in ella-anno's own ops/datasets.json file. When triggered by anno-targets's Makefile's "super" targets, the data sets are looked up in anno-targets's datasets.json file. The exception to this rule is anno-targets's Makefile's download-data target (see DATASETS_OPT variable).

Note! Targets like download-data currently raise a RuntimeError in sync_data.py whenever the data directory contains a version of some piece of data other than the one expected from datasets.json. Set the variable RUN_CMD_ARGS to --force before the make command to force removal and subsequent update of any offending data.

Note! The environment variable REGIONS is exported by ${SCRATCH}/workdir/target.source, a dynamically generated set of instructions. This file is sourced by ${SCRATCH}/workdir/cmd.sh, itself a dynamically generated script, which is responsible for spawning child shell processes via ella preprocess, annotate and annotate.sh, and main ella scripts. REGIONS is initialized to point to a copy of the gene panel's transcripts regions file, but is subsequently processed by the ella preprocess script.
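When the make call is scripted, the forced variant of the download can be prepared as in the sketch below. It assumes that RUN_CMD_ARGS is read from the environment by the Makefile; the helper name is hypothetical, and only the variable name, the --force value and the download-data target come from this procedure.

```python
import os


def forced_make_invocation(target="download-data", base_env=None):
    """Build the command and environment for a forced data sync.

    Assumption: the Makefile picks up RUN_CMD_ARGS from the environment.
    Hypothetical helper for illustration; it does not run anything itself.
    """
    # Copy the environment so the caller's mapping is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["RUN_CMD_ARGS"] = "--force"
    return ["make", target], env
```

Passing the result to subprocess.run(cmd, env=env, check=True) from the repository root would correspond to prefixing the make command with RUN_CMD_ARGS=--force in a shell.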

Change management

General change requests and errors are registered as issues in GitLab at https://www.gitlab.com/alleles/ella-anno. The issues are discussed with the system responsible before work is started.

The source code is stored in the git repositories https://www.gitlab.com/alleles/ella-anno and https://gitlab.com/alleles/anno-targets.

Work is done in a separate branch and the app is tested in a dedicated test environment. When approved by the users and the system responsible, the changes are merged into the main branch.

Before deployment, the source code is tagged and an application is created and transferred to TSD. The application is started in the staging environment and, if approved by superuser and system responsible, can be deployed to production. The superuser verifies the changes.

Changes are documented by filling out "Endringskontroll for bioinformatikk". Data-only releases and hotfixes do not need an "Endringskontroll". Hotfixes are defined as updates fixing bugs in functionality (but not changing the functionality itself). Hotfix versions are denoted by the PATCH number increasing (version is given by MAJOR.MINOR.PATCH).
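The versioning rule above can be sketched as a small check. This is an illustration only: the function names and the data_only flag are hypothetical, and the sketch assumes plain MAJOR.MINOR.PATCH strings without prefixes or suffixes.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text):
    """Parse a plain MAJOR.MINOR.PATCH string into a Version."""
    parts = text.split(".")
    if len(parts) != 3:
        raise ValueError(f"Expected MAJOR.MINOR.PATCH, got {text!r}")
    return Version(*(int(part) for part in parts))


def is_hotfix(previous, new):
    """True when only the PATCH number increases between two versions."""
    old, cur = parse_version(previous), parse_version(new)
    return (cur.major, cur.minor) == (old.major, old.minor) and cur.patch > old.patch


def needs_change_control(previous, new, data_only=False):
    """True when a release requires an "Endringskontroll".

    Data-only releases and hotfixes are exempt, as stated in this procedure.
    """
    return not (data_only or is_hotfix(previous, new))
```

For example, going from 2.3.0 to 2.3.1 counts as a hotfix under this check, while going from 2.3.1 to 2.4.0 does not.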

See HTS Bioinf - Release and deployment of anno system for release and deployment details.

References

HTS Bioinf - Running annoservice