HTS Bioinf - Process for updates to production pipelines
Scope
This document contains a description of the procedure for introducing major changes to production pipelines.
The production pipeline referred to here consists of the in-house bioinformatic analysis scripts, the necessary third-party software, and the reference data used for variant calling on diagnostic samples. Except for hotfixes, updates to production should be done in planned releases and follow the procedure below.
Responsibility
Release coordinator (automation, variant calling, annotation, or interpretation).
Release and development process
Updates to the system are usually done in a planned and controlled manner where work is defined and documented in Gitlab. Changes are tested on the developers' personal machines, on development servers, on NSC or on TSD. An "endringskontroll" (EK, change control) is filled in and approved when needed (see the attachment "Saksgang for endringskontroll med eksempler" under procedure Endringskontroll - AMG), and the software is assigned a version number according to the conventions described below. The deployment of the software on the various platforms (NSC and TSD) is coordinated with the person currently on production duty.
- The release coordinator for the relevant system creates a Gitlab issue by using the respective template issue.
- The release coordinator gathers the changes that should go into the release and makes sure the issues are appropriately tagged in Gitlab.
- A release branch is created after most or all development has been done. The branch is typically named after the planned version (e.g. 4.3). Testing is done on tagged versions of this release branch (use the tag 4.3-rcN, where N is an integer). When testing is completed, the branch is merged into master and master is tagged (v4.3-rel in this example).
- An EK is filled in if necessary (with the exception of hotfixes and patches).
- When EK has been approved, the software can be deployed to the relevant platforms (NSC and TSD). See HTS Bioinf - Deployment of vcpipe for production, HTS Bioinf - Release and deployment of tsd-import, HTS Bioinf - Release and deployment of anno system.
- The release coordinator coordinates with the bioinformatician in production when the software can be deployed.
- New gene panels are imported into ELLA.
- When the changes have been applied to production, all units are informed via the GDx operational Teams channel.
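The tag naming used during a release can be sketched in a few lines of Python. This is only an illustration of the convention described above (4.3-rcN for test tags, v4.3-rel for the final tag). The function names are hypothetical, and starting the numbering at rc1 is an assumption.

```python
import re

# Matches release-candidate tags such as "4.3-rc2".
RC_TAG = re.compile(r"^(?P<version>\d+\.\d+)-rc(?P<n>\d+)$")


def next_rc_tag(version, existing_tags):
    """Return the next release-candidate tag for `version`.

    Assumption: numbering starts at rc1 and increases by one.
    """
    numbers = [
        int(match.group("n"))
        for match in map(RC_TAG.match, existing_tags)
        if match and match.group("version") == version
    ]
    return f"{version}-rc{max(numbers, default=0) + 1}"


def release_tag(version):
    """Return the tag put on master after the release branch is merged."""
    return f"v{version}-rel"


if __name__ == "__main__":
    tags = ["4.2-rc1", "v4.2-rel", "4.3-rc1", "4.3-rc2"]
    print(next_rc_tag("4.3", tags))
    print(release_tag("4.3"))
```

In practice the list of existing tags would come from the repository (for example from the output of git tag); that step is left out here.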
Endringskontroll (EK) and documentation
The EK briefly describes the changes in a manner understandable by non-technical personnel. In addition, we need to document how the changes were tested, by whom and, if relevant, who approved them. The EK must refer to relevant issues, merge requests (MRs) or milestones in Gitlab. If sensitive information is involved, create a document on TSD in a meaningfully named directory in /ess/p22/data/durable/production/investigations.
Generally, the success criteria for a change/feature should be agreed on early in the development phase and documented clearly in Gitlab. This should be discussed with an end-user representative.
When preparing EK, the template "Endringskontroll_for_bioinformatikk_ELLA.docx" (attached to this procedure in eHåndbok) is to be used.
Some changes do not require an Endringskontroll, for example:
- changes related to testing (CI)
- minor refactorings of code not affecting any functionality
- urgent bug fixes that are done "in-place" in production (and are verified to work)
Code review
Generally, all changes are done via an MR in Gitlab. They need another developer's approval to be merged into the codebase. The items to consider in an MR are listed in the template selected when the MR was created. See Software development and review for more details.
Some fairly trivial changes can be merged without prior approval:
- changes already reviewed in another MR
- documentation
Types of testing
CI testing: Some testing is done in an automated fashion using Gitlab's CI (continuous integration) system. These tests are defined in the file .gitlab-ci.yml at the root of the respective repository and are triggered automatically when changes are "pushed" or manually using the Gitlab CI API. See the README file in the respective repository for more details.
Manually running parts or the whole of the bioinformatics pipeline. This can be done either locally on the developers' personal machines or using any of the development servers.
Pipeline runs on NSC and TSD. Complete analyses including mapping, variant calling and annotation can be run using either patient samples or non-sensitive test data (commercial control samples). This is most easily done using the automation system (the executor program in the vcpipe repo) set up on the staging environment on either NSC or TSD. The staging environment mimics the actual production environment. The relevant fastq/bam files and metadata files (.analysis and .sample) are copied to the directories monitored by the automation system, and the result files can be inspected and verified. When the analyses are run on the TSD staging environment, the results will be sent to ELLA staging, and can be verified there. Whenever changes in the pipelines will change files used by ELLA, the test results must be checked in ELLA staging. Typically, results are compared to previous results available in the production environment (including ELLA production). When relevant, lab engineers from GDx or superusers from user units in the department should be involved in this process.
CI Testing
The various pipelines are tested using Gitlab's CI and specifying the version of the modules.
vcpipe
Run the script in the vcpipe repo (preferably the one from the branch with the relevant changes):
where the token is the personal token you created in vcpipe's Gitlab project settings page, <pipeline> is one of annopipe, basepipe or triopipe, and <version> can be either a branch name or a tag.
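The vcpipe script itself is not reproduced here. As a generic illustration of what a trigger request to Gitlab's pipeline trigger API looks like, the following Python sketch builds (but does not send) such a request. The endpoint and the token, ref and variables[...] form fields are part of Gitlab's API; the base URL, project ID and the variable names PIPELINE and VERSION are hypothetical and may differ from what the vcpipe script actually uses.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_trigger_request(base_url, project_id, token, ref, variables):
    """Build a POST request for Gitlab's pipeline trigger endpoint.

    `variables` is a dict of CI variables passed to the triggered pipeline.
    The request is only constructed here, not sent.
    """
    url = f"{base_url}/api/v4/projects/{project_id}/trigger/pipeline"
    fields = {"token": token, "ref": ref}
    for key, value in variables.items():
        fields[f"variables[{key}]"] = value
    return Request(url, data=urlencode(fields).encode("ascii"), method="POST")


if __name__ == "__main__":
    request = build_trigger_request(
        "https://gitlab.example.org",  # hypothetical Gitlab host
        123,                           # hypothetical project ID
        "MY_TRIGGER_TOKEN",            # placeholder, never commit a real token
        "my-feature-branch",
        {"PIPELINE": "basepipe", "VERSION": "my-feature-branch"},  # hypothetical names
    )
    print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to urllib.request.urlopen; the README of the vcpipe repo remains the authoritative description of how the tests are triggered.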
A regression test suited for exome, trio and target (cancer) pipelines is triggered as part of a release. Further details are found in the README file of the vcpipe repo.
- Gitlab's CI runs the pipeline on our CI/development servers. Errors are registered as issues in Gitlab.
- The pilot genome Reference Material NA12878 is used in each integration test to call variants. The called variants are compared against the "Gold Standard" high confidence variants by the Genome in a Bottle (GIAB) Consortium. Deviations from the benchmark are reported and cause the run to be marked as failed.
- The exome pipeline is run with a large gene panel. The number of filtered variants reported in the interpretation sheet is compared against previous runs. Any deviations are flagged and cause the run to be marked as failed.
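The comparison of filtered variant counts against previous runs can be pictured as follows. This is a minimal sketch, not the check implemented in the vcpipe repo: the function names and the dict-based input format are assumptions, and "any deviation" is interpreted as any difference in the count.

```python
def compare_variant_counts(previous, current):
    """Return the samples whose filtered-variant count differs between runs.

    Both arguments map a sample name to the number of filtered variants
    reported in the interpretation sheet (hypothetical input format).
    A sample missing from one of the runs is reported with None.
    """
    deviations = {}
    for sample in sorted(set(previous) | set(current)):
        before = previous.get(sample)
        after = current.get(sample)
        if before != after:
            deviations[sample] = (before, after)
    return deviations


def run_status(deviations):
    """Any deviation marks the run as failed."""
    return "failed" if deviations else "passed"


if __name__ == "__main__":
    previous = {"exome-large-panel": 412}  # made-up count
    current = {"exome-large-panel": 415}   # made-up count
    deviations = compare_variant_counts(previous, current)
    print(deviations, run_status(deviations))
```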
Testing on TSD
The staging environment in TSD (/ess/p22/data/durable/staging) can be used to test any available NA (public reference) or diagnostic sample.
The environment is similar to the production one to allow for realistic testing using diagnostic samples. The database, the directory structure and the software are independent of the ones in production. The computation is done on the regular compute cluster, including Dragen.
As part of a release, the new software modules must be put in the staging environment and tested. At a minimum, run an exome sample and a target (cancer) sample and make sure that they go through the pipeline successfully and that the result files are present on disk. If running the analysis type 'annopipe', the result files should be available in the ELLA staging environment. When the release introduces fundamental changes, the result files must be checked more thoroughly and approved by lab engineers.
The testing must use either finalized and tagged modules or modules that contain most of the changes. Due to long computation times, it is often not possible to run tests again once the modules have been tagged.
The relevant analyses files must be created in their respective analyses directories under staging/analyses-work. Previous production analyses, found under /ess/p22/data/durable/production/analyses-work, can be copied to the staging environment for this purpose.
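Copying a previous production analysis into staging could look like the sketch below. It assumes that each analysis is a directory named after the analysis directly under analyses-work, and that the whole directory is wanted; both are assumptions, as is the function name. It refuses to overwrite an existing staging analysis.

```python
import shutil
from pathlib import Path


def copy_analysis_to_staging(analysis_name, production_root, staging_root):
    """Copy one analysis directory from production to staging.

    Raises instead of overwriting, so an existing staging analysis
    is never silently replaced.
    """
    source = Path(production_root) / analysis_name
    target = Path(staging_root) / analysis_name
    if not source.is_dir():
        raise FileNotFoundError(f"No such production analysis: {source}")
    if target.exists():
        raise FileExistsError(f"Already present in staging: {target}")
    shutil.copytree(source, target)
    return target
```

On TSD the two roots would be /ess/p22/data/durable/production/analyses-work and /ess/p22/data/durable/staging/analyses-work. Whether result files from the production run should be removed from the copy before the executor picks it up is not covered here.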
The pipelines are managed by executor and webui in the p22-hpc-03 VM. Simply start those services using the scripts in staging/sw.
Check with a bioinformatician in production whether that VM is actually available.
Versioning and hotfixes
Important updates to the pipeline software are tagged with versions of the form v4.3.2-rel.
- The first number is incremented when there are major changes like introducing a new structure or a new set of tools.
- The second number (minor) is incremented for normal updates that might add new or improve current functionality.
- The last number (micro) is incremented for smaller, often trivial changes, or for an unplanned change that was initially done directly in the production environment (a hotfix). The hotfix change is also made in the code repository (i.e. Gitlab) and the code repository is tagged. Often the hotfix version is released through standard procedures soon after to make sure that production is in sync with the tagged code repository. The practice of code review and approval in a (Gitlab) Merge Request also applies to hotfix changes that are later done in the code repository.
- A hotfix is done in-place in production without "Endringskontroll". Such hotfixes are only done for bugs or updates that need to be done quickly.
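The numbering rules above can be expressed as a small helper. This is a sketch only: the function name is hypothetical, it handles only three-part tags of the form v4.3.2-rel, and resetting the lower numbers to zero when a higher one is incremented is an assumption that the procedure does not state.

```python
import re

# Matches three-part release tags such as "v4.3.2-rel".
VERSION_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)-rel$")


def bump(tag, part):
    """Return the next version tag after incrementing `part`.

    `part` is "major", "minor" or "micro" (micro covers hotfixes).
    Assumption: lower-order numbers are reset to 0.
    """
    match = VERSION_TAG.match(tag)
    if match is None:
        raise ValueError(f"Not a release tag: {tag!r}")
    major, minor, micro = map(int, match.groups())
    if part == "major":
        major, minor, micro = major + 1, 0, 0
    elif part == "minor":
        minor, micro = minor + 1, 0
    elif part == "micro":
        micro += 1
    else:
        raise ValueError(f"Unknown part: {part!r}")
    return f"v{major}.{minor}.{micro}-rel"


if __name__ == "__main__":
    print(bump("v4.3.2-rel", "micro"))  # a hotfix or trivial change
    print(bump("v4.3.2-rel", "minor"))  # a normal update
```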
Creating a deployable package
A deployable package is created by triggering a job in Gitlab's CI or locally.
Refer to the respective documentation and the Makefile help in each repository.
Transfer both the tgz and the sha1 files to TSD/NSC. If you haven't changed the Python modules, there is no point in transferring the Python module package.
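After the transfer, the package can be checked against its checksum file. The sketch below assumes that the sha1 file follows the output format of the sha1sum tool (hex digest first, then the file name); if the repositories produce a different format, the parsing line needs adjusting. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_package(package_path, checksum_path):
    """Return True if the package matches the digest in the checksum file.

    Assumption: the first whitespace-separated token of the checksum
    file is the expected hex digest (sha1sum format).
    """
    expected = Path(checksum_path).read_text().split()[0].lower()
    return sha1_of_file(package_path) == expected
```

Running sha1sum -c on the checksum file does the same job on the command line, provided the file is in that format.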
Updates to proprietary software (Dragen)
For updates to the third-party licensed software (Dragen) we currently use the following process.
First, an EK is initiated and filled in with the new software version and a brief summary of the changes. After the testing is done, the results are added to the EK document as a summary, together with an additional report document containing tables and details.
Test plan
- Create new reference genome index as described here
- Run the latest NA12878 and HG002 samples with the new software version. Check with the Bioinformatics Operations Coordinator for the latest WGS projects that included those reference samples
- Compare command line options of the software (make issues for necessary adjustments in production pipeline if needed)
- NA12878 sample (we restrict the comparison to the high-confidence regions):
  - Compare genome-wide sensitivity and positive predictive value (PPV) for SNV and indels between the current and previous version runs, report in a table for each version:

    | Dragen vN.N.N | Filter | Sensitivity | PPV | TP | FN | FP |
    |---|---|---|---|---|---|---|
    | SNP | ALL | value | value | value | value | value |
    | INDEL | ALL | value | value | value | value | value |

  - Compare values of quality control parameters between the current and previous version runs

    | Quality control parameters | PASS/FAIL | Dragen new | Dragen previous | Acceptable range |
    |---|---|---|---|---|
    | Coverage >10x | value | value | value | [0.89,1.01] |
    | Coverage >20x | value | value | value | [0.83,1.01] |
    | Median coverage | value | value | value | [29,1000000000] |
    | Transitions/transversions | value | value | value | [1.5,2.5] |
    | Ratio skewed allele depth | value | value | value | [0,0.05] |
    | Base quality >30 | value | value | value | 80 |

- HG002 sample (we restrict the comparison to the high-confidence regions):
  - Compare genome-wide TP, FN and FP for CNV/SV with Truvari tool, using current run and run by the previous version of the software (see scripts in the previous run folder)

    | | TP | FN | FP |
    |---|---|---|---|
    | Dragen new | value | value | value |
    | Dragen previous | value | value | value |

  - Examine and compare SNV and indels on the X chromosome
- Diagnostic single sample for the GDx-lab engineer to examine in ELLA staging using Mitokon panel (additionally, report count of variants reported by both versions and exclusive to each version)
- Diagnostic trio for the GDx-lab engineer to examine in ELLA staging using UtAv panel (additionally, report count of variants reported by both versions and exclusive to each version)

| | No. variants in both | Only in previous | Only in new |
|---|---|---|---|
| Mitokon vX | value | value | value |
| UtAv vY | value | value | value |
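The three counts in the table above (variants in both versions, only in previous, only in new) are plain set operations. The sketch below assumes that a variant is identified by the tuple (chromosome, position, ref, alt); how variants are actually keyed when the lists are exported is not specified in this procedure, and the function name is hypothetical.

```python
def overlap_counts(previous_variants, new_variants):
    """Count variants shared by, and exclusive to, two software versions.

    Each variant is expected to be a hashable key, here assumed to be
    a (chromosome, position, ref, alt) tuple.
    """
    previous = set(previous_variants)
    new = set(new_variants)
    return {
        "both": len(previous & new),
        "only_previous": len(previous - new),
        "only_new": len(new - previous),
    }


if __name__ == "__main__":
    # Made-up variants for illustration only.
    previous = [("1", 1000, "A", "G"), ("2", 2000, "C", "T")]
    new = [("1", 1000, "A", "G"), ("X", 3000, "G", "A")]
    print(overlap_counts(previous, new))
```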
Acceptance criteria
- SNVs: both sensitivity and PPV should be >99%
- indels: both sensitivity and PPV should be >95%
- CNV/SV: no large differences
- SNVs and indels on chromosome X: no large differences
- Quality parameter values: no large differences
- Production pipeline is compatible with the new version of software command options
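For reference, sensitivity is TP / (TP + FN) and PPV is TP / (TP + FP). The sketch below computes both from the TP, FN and FP counts and applies the SNV and indel thresholds listed above. It also checks quality control values against the acceptable ranges from the test plan table, treating the ranges as closed intervals, which is an assumption. The function and constant names are hypothetical.

```python
# Thresholds from the acceptance criteria, as fractions.
THRESHOLDS = {"SNV": 0.99, "INDEL": 0.95}

# Acceptable ranges from the quality control table in the test plan.
# Assumption: the bounds are inclusive.
QC_RANGES = {
    "Coverage >10x": (0.89, 1.01),
    "Coverage >20x": (0.83, 1.01),
    "Median coverage": (29, 1000000000),
    "Transitions/transversions": (1.5, 2.5),
    "Ratio skewed allele depth": (0, 0.05),
}


def sensitivity(tp, fn):
    """Fraction of true variants that were called."""
    return tp / (tp + fn)


def ppv(tp, fp):
    """Fraction of called variants that are true (positive predictive value)."""
    return tp / (tp + fp)


def meets_criteria(variant_type, tp, fn, fp):
    """True if both sensitivity and PPV exceed the threshold for the type."""
    threshold = THRESHOLDS[variant_type]
    return sensitivity(tp, fn) > threshold and ppv(tp, fp) > threshold


def qc_pass(parameter, value):
    """True if a quality control value lies within its acceptable range."""
    low, high = QC_RANGES[parameter]
    return low <= value <= high


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(meets_criteria("SNV", tp=990, fn=5, fp=5))
    print(qc_pass("Median coverage", 35))
```

The criteria phrased as "no large differences" are left to manual judgement and are not covered by this sketch.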
See also
- The technical steps of vcpipe release, deployment, and archiving are described in the procedure Release and deployment of vcpipe.
- Development guide for a general overview of the development process
- The general Endringskontroll - AMG procedure
Attachments
See template "Endringskontroll for bioinformatiske endringer" in EHB.