HTS Bioinf - Anno with anno-targets
Scope
This procedure describes the Anno system, i.e. the software contained in the ella-anno and anno-targets repositories. The Anno system provides the annotation part of the pipeline as well as the annotation service used by ELLA.
Definitions
ELLA: software for interpretation of genetic variants.
Supervisor: software module to monitor processes.
Supervisor UI: the web page to access Supervisor functionality
TSD: Tjenester for Sensitive Data (infrastructure at USIT, University of Oslo)
What are anno and anno-targets
anno (also called ella-anno, after its code repository) is an annotation service with an API and an internal queuing system. It accepts requests for annotation of specific samples (or VCF files) and then runs a specified target.
anno-targets are the available targets for anno. A target is essentially a post-processing job that runs after the annotation has been performed. It can, for example, create reports, statistics or IGV tracks from the incoming data.
We currently run anno with anno-targets as a service on TSD and as part of the pipeline (annopipe).
The anno-target: ella
At the time of writing, anno-targets contains just one target, called ella, which performs many different tasks, essentially preparing the data for import into ELLA. It is chiefly in charge of creating:
- a pipeline report containing:
  - CNV information
  - coverage information
  - potential low coverage warnings
  - sample metadata info
  - version info
- a coverage report PDF
- IGV tracks for ELLA
The ella target also formats all files and organizes them in the directory structure required by ELLA.
ELLA classifications database
A database of in-house deep-intronic variants deemed to belong to ACMG class 4 or 5, and therefore to be spared by Anno's filters, is maintained on TSD and regularly updated. The database is stored as a VCF file whose path is provided to Anno through the environment variable ELLA_CLASSIFICATIONS_DB.
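The role of ELLA_CLASSIFICATIONS_DB can be illustrated with a minimal Python sketch. It is not Anno's actual code: the function names load_classified_variants and is_spared are hypothetical, and the sketch assumes an uncompressed, tab-separated VCF whose variants are matched on chromosome, position, reference and alternate allele.

```python
import os
from pathlib import Path


def load_classified_variants(env=None):
    """Return the set of (chrom, pos, ref, alt) keys found in the classifications VCF.

    Hypothetical helper: assumes a plain-text (uncompressed) VCF file.
    """
    env = os.environ if env is None else env
    vcf_path = env.get("ELLA_CLASSIFICATIONS_DB")
    if not vcf_path:
        raise RuntimeError("ELLA_CLASSIFICATIONS_DB is not set")
    path = Path(vcf_path)
    if not path.is_file():
        raise FileNotFoundError(f"classifications VCF not found: {path}")

    keys = set()
    with path.open() as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            # The first five fixed VCF columns are CHROM, POS, ID, REF, ALT.
            chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
            for alt in alts.split(","):  # one key per alternate allele
                keys.add((chrom, int(pos), ref, alt))
    return keys


def is_spared(variant, classified):
    """True if the variant is in the classifications set and should bypass filtering."""
    return variant in classified
```

Loading the keys once into a set keeps the per-variant lookup cheap; how Anno itself matches variants against the database may differ.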
Sample repo
Anno uses a sample repo definition JSON file (samples.json) located in an instance-specific sample repo directory, e.g. /ess/p22/data/durable/production/anno/sample-repo-prod for TSD production.
An automatic process updates the samples.json file at regular intervals. The code is located at /ess/p22/data/durable/production/dev-ops/src on TSD and is under Git version control. Any changes should be committed immediately.
The sample repo script scans the vcpipe database alongside the durable/<instance>/data/analyses-results/{singles,trios} directories to find all samples that should be available for annotation via Anno (and, by extension, within ELLA).
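The directory part of this scan can be sketched in Python as follows. This is an illustration only, not the actual sample repo script: the vcpipe database lookup is left out, the function names are hypothetical, and the JSON layout written here (a list of name/type/path records) is invented for the example and is not the real samples.json schema.

```python
import json
from pathlib import Path


def scan_analyses(results_root):
    """List analysis directories found under <results_root>/singles and <results_root>/trios."""
    entries = []
    for kind in ("singles", "trios"):
        kind_dir = Path(results_root) / kind
        if not kind_dir.is_dir():
            continue  # tolerate a missing subdirectory
        for analysis_dir in sorted(p for p in kind_dir.iterdir() if p.is_dir()):
            entries.append(
                {"name": analysis_dir.name, "type": kind, "path": str(analysis_dir)}
            )
    return entries


def write_sample_repo(entries, samples_json):
    """Write the entries to a temporary file, then rename it over the target."""
    target = Path(samples_json)
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(entries, indent=2))
    tmp.replace(target)  # readers see either the old or the new file, never a partial one
```

Writing to a temporary file and renaming it is a deliberate choice for a file that is refreshed at regular intervals while a service may be reading it.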
Background
The Anno system consists of two linked code repositories, ella-anno and anno-targets, and is dedicated to the annotation of variant call files produced by the variant calling pipeline.
The ella-anno repository administers generic and openly available annotation sources. The purpose of the anno-targets repository is two-fold:
- It provides access to our internal (and/or restricted) annotation sources, such as our internal databases and HGMD Pro.
- It contains the ella target, which post-processes the data in a manner tailored to our specific needs. This includes, for example, markdown reports displayed in ELLA (report.txt and warning.txt), coverage reports, IGV tracks, attachments, and backup Excel sheets.
To ensure smooth operations, the core functionalities of ella-anno and anno-targets are implemented in software containers (docker >> singularity). The repositories' Dockerfiles provide the instructions to build such containers. The anno-targets build starts from ella-anno and extends its Dockerfile to integrate our ella target into a new, extended image. The resulting image can be used both to spin up a service (which is what ELLA uses) and to execute annotation+target from the command line (which is what the pipeline does).
In summary:
ella-anno contains:
- Code for the API
- Administration of generic and open annotation sources (VEP, FASTA, gnomAD++), stored on DigitalOcean
- Possibility of extension with additional targets
anno-targets contains:
- Code for pre- and post-processing data (currently limited to the ella target)
- Administration of restricted and internal annotation sources (inDB, HGMD++), stored on DigitalOcean
- With anno-targets available in the image under /anno-targets, and an environment variable pointing to this directory, the possibility to trigger API or CLI calls of anno with target ella
Two modes of running anno
The pipeline (and the repository tests) trigger annotation through annotate_with_target, while ELLA's anno service does so through the task.py Python module. Both use the master script annotate.sh.
Development
During development, it is desirable to run specific pieces of software within the containers. The Docker containers are orchestrated by the repositories' Makefiles, which take care of setting and propagating all necessary variables and of building the needed targets.
- dev/run: run functional containers
- build: build complete images
- build-*: build partial images; in particular, the build-ella-annobuilder target builds the [ella]annobuilder image on top of which the anno-targets builder is itself constructed.
- generate-*: generate all data or parts thereof (*package targets)
- download-*: download all data or parts thereof (download-package target)
- upload-*: upload all data or parts thereof (upload-package target)
- test-*: simulate production pipeline calls on pre-packaged test cases
- singularity-*: various operations on Singularity containers.
Good to know
The root directory /anno is implicitly created (by WORKDIR instructions) in anno's base Docker image. Some other directories, e.g. the ops directory, are implicitly created by ADD or COPY commands issued for parts of their content in anno's base Docker image.
Anno's Makefile data management targets are wrappers for specific calls to the Makefile's own annobuilder-template. They do not build any Docker images themselves but rely on the "super" targets of anno-targets's Makefile to do so.
Anno's Makefile data management targets default to syncing the data sets defined in ella-anno's own ops/datasets.json file. When triggered by the "super" targets of anno-targets's Makefile, the data sets are instead looked up in anno-targets's datasets.json file. The exception to this rule is the download-data target of anno-targets's Makefile (see the DATASETS_OPT variable).
Note! Targets like download-data currently result in a RuntimeError in sync_data.py whenever a version other than the expected one (from datasets.json) of some piece of data exists in the data directory; set the variable RUN_CMD_ARGS to --force before the make command to force removal and subsequent update of any offenders.
Note! The environment variable REGIONS is exported by ${SCRATCH}/workdir/target.source, a dynamically generated set of instructions sourced by ${SCRATCH}/workdir/cmd.sh, itself a dynamically generated script and the one responsible for spawning child shell processes via the ella preprocess, annotate and annotate.sh, and main ella scripts. REGIONS is initialized to point to a copy of the gene panel's transcripts regions file but is subsequently processed by the ella preprocess script.
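The version check described in the first note (a RuntimeError on a version mismatch unless --force is given) can be pictured with the following Python sketch. It is not the code of sync_data.py: the function plan_sync, its arguments and the action names are hypothetical, and both inputs are assumed to be simple name-to-version mappings.

```python
def plan_sync(expected, present, force=False):
    """Decide what to do with each data set.

    expected: data set name -> version required (as would come from datasets.json)
    present:  data set name -> version currently found in the data directory
    force:    mimics the effect of passing --force
    """
    actions = {}
    for name, version in expected.items():
        found = present.get(name)
        if found is None:
            actions[name] = "download"  # nothing on disk yet
        elif found == version:
            actions[name] = "keep"  # already up to date
        elif force:
            actions[name] = "replace"  # remove the offender, then fetch the expected version
        else:
            raise RuntimeError(
                f"{name}: found version {found}, expected {version}; rerun with --force"
            )
    return actions
```

For example, plan_sync({"vep": "1.0"}, {"vep": "0.9"}) raises RuntimeError, while the same call with force=True returns {"vep": "replace"}. The data set name and version strings are made up for the example.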
Change management
Change requests and errors are registered in GitLab (https://www.gitlab.com/alleles/ella-anno for general issues). The issues are discussed with the system responsible before work is started.
The source code is stored in the Git repositories https://www.gitlab.com/alleles/ella-anno and https://gitlab.com/alleles/anno-targets.
Work is done in a separate branch and the app is tested in a dedicated test environment. When approved by the users and the system responsible, the changes are added to the main branch.
Before deployment, the source code is tagged and an application is created and transferred to TSD. The application is started in the staging environment and, if approved by the superuser and the system responsible, can be deployed to production. The superuser verifies the changes.
Changes are documented by filling out "Endringskontroll for bioinformatikk". Data-only releases and hotfixes do not need an "Endringskontroll". Hotfixes are defined as updates that fix bugs in functionality without changing the functionality itself. Hotfix versions are denoted by an increase in the PATCH number (the version is given as MAJOR.MINOR.PATCH).
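The hotfix rule can be expressed as a small check on two version strings. The Python sketch below is illustrative only: parse_version and is_hotfix are hypothetical helpers, not part of the Anno repositories, and they assume plain MAJOR.MINOR.PATCH strings without prefixes or suffixes.

```python
def parse_version(version):
    """Split a MAJOR.MINOR.PATCH string into a tuple of three integers."""
    parts = version.split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return tuple(int(part) for part in parts)


def is_hotfix(previous, new):
    """True if only the PATCH number increased between the two versions."""
    prev_major, prev_minor, prev_patch = parse_version(previous)
    new_major, new_minor, new_patch = parse_version(new)
    return (new_major, new_minor) == (prev_major, prev_minor) and new_patch > prev_patch
```

Going from 1.4.2 to 1.4.3 counts as a hotfix under this check, whereas 1.4.2 to 1.5.0 does not. Data-only releases are also exempt from the "Endringskontroll" but cannot be recognised from the version number alone in this sketch.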
See HTS Bioinf - Release and deployment of anno system for release and deployment details.