HTS Bioinf - Anno with anno-targets
Scope
This procedure describes what we call the anno system, which comprises the ella-anno and anno-targets repositories. It is the annotation part of the pipeline as well as the annotation service used by ELLA.
Definitions
ELLA: software for interpretation of genetic variants.
Supervisor: software module to monitor processes.
Supervisor UI: the webpage to access Supervisor functionality.
TSD: Tjenester for Sensitive Data (infrastructure at USIT, University of Oslo)
What is anno and anno-targets?
anno is an annotation service with an API and an internal queue system. It can accept requests for annotation of specific samples (or VCF files), and then run a specified target.
anno-targets are the available targets for anno. A target is basically a postprocessing job that is run after the annotation is performed. It can, for example, consist of creating reports, statistics or IGV tracks of the incoming data.
We currently run anno with anno-targets as a service on TSD, but it is also integrated in the pipeline and the backup pipeline.
anno-target: ella
There is currently just one target in our anno-targets, named ella. It prepares the data for import into ELLA, which involves several tasks: creating a pipeline report (with CNV information, coverage information, potential low coverage warning, sample metadata info and version info), an excel file for backup use, a coverage report pdf and IGV tracks for ELLA. It also formats all files and places them in the directory structure required by ELLA.
Sample repo
anno utilises a sample repo definition file located in sample-repo/samples.json.
A process controlled by Supervisor updates samples.json at regular intervals. The code is located in sample-repo/ and is a git repo inside TSD. Any changes should be committed to git immediately.
The sample repo script scans the vcpipe database along with /tsd/p22/data/durable{,2,3}/production/preprocessed to find all samples that should be available for annotation via anno (and by extension, within ELLA).
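The filesystem part of that scan can be illustrated with a minimal sketch. It is not the actual sample repo script: the function names, the assumption of one subdirectory per sample, and the JSON layout (sample name mapped to directory) are all hypothetical, and the vcpipe database lookup is left out entirely.

```python
import json
from pathlib import Path

# Shell brace expansion of durable{,2,3} yields durable, durable2 and durable3.
DURABLE_SUFFIXES = ["", "2", "3"]


def preprocessed_roots(base="/tsd/p22/data"):
    """Return the three directories the brace pattern expands to."""
    return [
        Path(base) / f"durable{suffix}" / "production" / "preprocessed"
        for suffix in DURABLE_SUFFIXES
    ]


def scan_samples(roots):
    """Map sample name -> directory (assumes one subdirectory per sample)."""
    samples = {}
    for root in roots:
        if not root.is_dir():
            continue  # skip storage areas that are not present
        for entry in sorted(root.iterdir()):
            if entry.is_dir():
                samples[entry.name] = str(entry)
    return samples


def write_samples_json(samples, target):
    """Write to a temporary file first, then rename it over the target."""
    target = Path(target)
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(samples, indent=2, sort_keys=True))
    tmp.replace(target)
```

Writing to a temporary file and renaming it is one way to avoid a reader picking up a half-written samples.json while the periodic update is running.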
Background
The anno system consists of two linked code repositories, ella-anno and anno-targets, and is dedicated to the annotation of variant call files produced by the variant calling pipeline.
The repository ella-anno administers generic and openly available annotation sources. The purpose of the repository anno-targets is two-fold:
- It is for our internal (and/or restricted) annotation sources, such as our internal databases and HGMD Pro.
- It contains the ella-target that postprocesses the data that is specific to our users. This includes for example markdown reports displayed in ELLA (report.txt and warning.txt), coverage reports, IGV tracks, attachments, and backup excel sheets.
In order to ensure smooth operations, the core functionalities of ella-anno and anno-targets are implemented in software containers (docker >> singularity). The repositories' Dockerfiles provide the instructions to build such containers. The build process of anno-targets starts from ella-anno and extends its Dockerfile to integrate our ella-target into a new, extended image. The resulting image can be used both to spin up a service (which is what ELLA uses) and to execute annotation+target from the command line (which is what the pipeline does).
In summary:
ella-anno contains:
- Code for the API
- Administration of generic and open annotation sources (VEP, FASTA, gnomAD++), stored on DigitalOcean
- Possibility of extension with additional targets
anno-targets contains:
- Code for pre- and post-processing data in the ella-target
- Administration of restricted and internal annotation sources (inDB, HGMD++), stored on DigitalOcean
- With anno-targets available in the image under /anno-targets, and an environment variable pointing anno to this directory, one can trigger the API or CLI of anno with target ella
Two modes of running anno
The pipeline (and the repository tests) triggers annotation through annotate_with_target, while ELLA's anno service does so through the task.py Python module. Both use the master script annotate.sh.
Development
During development, it is desirable to run specific pieces of software within the containers. This is simplified by each repository's Makefile, which takes care of setting and propagating all necessary variables and of building the needed targets.
- dev/run: run functional containers
- build: build complete images
- build-*: build partial images; in particular, the build-ella-annobuilder target builds the [ella]annobuilder image on top of which the anno-targets builder is itself constructed.
- generate-*: generate all data or parts thereof (*package targets)
- download-*: download all data or parts thereof (download-package target)
- upload-*: upload all data or parts thereof (upload-package target)
- test-*: targets simulate production pipeline calls on pre-packaged test cases.
- singularity-*: --
The root directory /anno is implicitly created (by WORKDIR instructions) in anno's base Docker image. Some other directories, e.g. the ops directory, are implicitly created by ADD or COPY commands issued for parts of their content in anno's base Docker image.
anno's Makefile data management targets are wrappers for specific calls to the Makefile's own annobuilder-template. They do not build any Docker images themselves but rely on the "super" targets of anno-targets's Makefile to do so.
anno's Makefile data management targets default to syncing the datasets defined in anno's own ops/datasets.json file. When triggered by the "super" targets of anno-targets's Makefile, the datasets are looked up in anno-targets's datasets.json file. The exception to this rule is the download-data target of anno-targets's Makefile (see the DATASETS_OPT variable).
Note!
Targets like download-data currently result in a RuntimeError in sync_data.py whenever a version other than expected (from datasets.json) of some piece of data exists in the data directory; set the variable RUN_CMD_ARGS to --force before the make command to force removal and subsequent update of any offenders.
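The behaviour described in this note can be illustrated with a small sketch. It is not the code of sync_data.py: the function name, the force parameter and the assumed directory layout (one subdirectory per version under each dataset) are hypothetical and only mirror the logic stated above.

```python
import shutil
from pathlib import Path


def needs_download(data_dir, name, expected_version, force=False):
    """Return True if the expected version still has to be downloaded.

    Assumes a hypothetical layout <data_dir>/<name>/<version>/.
    """
    dataset_dir = Path(data_dir) / name
    present = []
    if dataset_dir.is_dir():
        present = sorted(p.name for p in dataset_dir.iterdir() if p.is_dir())

    # Any version other than the expected one is an "offender".
    offenders = [version for version in present if version != expected_version]
    if offenders and not force:
        raise RuntimeError(
            f"{name}: found version(s) {offenders}, expected {expected_version}"
        )

    # With force, offenders are removed so the expected version can replace them.
    for version in offenders:
        shutil.rmtree(dataset_dir / version)

    return expected_version not in present
```

Failing by default rather than silently deleting data means that removal only happens when explicitly requested, which matches the opt-in nature of --force.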
The environment variable REGIONS is exported by ${SCRATCH}/workdir/target.source, a dynamically generated set of instructions sourced by ${SCRATCH}/workdir/cmd.sh. The latter is itself a dynamically generated script and is the one responsible for spawning child shell processes via ella preprocess, annotate and annotate.sh, and main ella scripts.
Note!
REGIONS is initialized to point to a copy of the gene panel's transcripts regions file but is subsequently processed by the ella preprocess script.
Change management
Change requests and errors are registered as issues in GitLab (https://www.gitlab.com/alleles/ella-anno for general issues). Each issue is discussed with the system responsible before work is started.
The source code is stored in git repositories https://www.gitlab.com/alleles/ella-anno and https://gitlab.com/alleles/anno-targets.
Work is done in a separate branch, and the app is tested in a dedicated test environment. When approved by users and the system responsible, the changes are added to the main branch.
Before deployment, the source code is tagged and an application is created and transferred to TSD. The application is started in the staging environment and, if approved by the superuser and the system responsible, it can be deployed to production. The superuser verifies the changes.
Changes are documented by filling out "Endringskontroll for bioinformatikk". Hotfixes do not need an "Endringskontroll". Hotfixes are defined as updates fixing bugs in functionality (but not changing the functionality itself). Hotfix versions are denoted by the PATCH number increasing (version is given by MAJOR.MINOR.PATCH).
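As an illustration of the versioning rule, the sketch below checks whether a version bump qualifies as a hotfix, i.e. only the PATCH number increases. The function names are hypothetical, and it assumes plain MAJOR.MINOR.PATCH strings without any prefix or suffix.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_hotfix(old, new):
    """True if MAJOR and MINOR are unchanged and PATCH has increased."""
    old_major, old_minor, old_patch = parse_version(old)
    new_major, new_minor, new_patch = parse_version(new)
    return (old_major, old_minor) == (new_major, new_minor) and new_patch > old_patch
```

For example, going from 1.4.2 to 1.4.3 counts as a hotfix under this rule, while going from 1.4.2 to 1.5.0 does not and would therefore need an "Endringskontroll".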
See HTS Bioinf - Release and deployment of anno system for release and deployment details.