HTS Bioinf - Process for updates to production pipelines
Scope
This is a procedure for the formal process of introducing major changes to production pipelines.
The variant calling pipeline referred to here consists of the in-house bioinformatic analysis scripts, the necessary third-party software, and the reference data used for variant calling on diagnostic samples. Except for hotfixes, updates to production should be done in planned releases and follow the procedure below.
Responsibility
The pipeline release bioinformatician.
Release and development process
Updates to the system are usually done in a planned and controlled manner, where work is defined and documented in Gitlab. Changes are tested on the developers' personal machines, on development servers, on NSC or on TSD. An "endringskontroll" (EK) is filled in and approved, and the software is assigned a version number according to the conventions described in the Versioning and hotfixes section below. The deployment of the software on the various platforms (NSC and TSD) is coordinated with the person on production duty.
- The release manager will create a Gitlab issue by using the respective template issue.
- The release manager gathers the changes that should go into the release and makes sure the issues are appropriately tagged in Gitlab.
- A release branch is created after most or all development has been done. The branch is typically named after the planned version (e.g. 4.3). Testing is done on tagged versions of this release branch (use tag 4.3-rcN, where N is an integer). When testing is completed, the branch is merged into master and master is tagged (v4.3-rel in this example).
- An "Endringskontroll" is filled in.
- When the "Endringskontroll" has been approved, the software can be deployed to the relevant platforms (NSC and TSD). See HTS Bioinf - Deployment of vcpipe for production.
- The release manager coordinates with the bioinformatician on production duty about when the pipeline can be put into use.
- New gene panels are imported into ELLA.
- Information is sent to all units when the changes have been applied to production.
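The branch-and-tag flow in the steps above can be sketched with plain git commands. This is a throwaway local demonstration, with branch and tag names following the 4.3 example used above:

```shell
# Demonstration in a throwaway repository; in a real release this is done
# in the vcpipe repo, with 4.3 standing in for the planned version.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q -b master
git config user.name dev
git config user.email dev@example.com
git commit -q --allow-empty -m "development done"

# Create the release branch, named after the planned version
git checkout -q -b 4.3
git commit -q --allow-empty -m "release prep"

# Tag a release candidate on the release branch for testing (N = 1, 2, ...)
git tag 4.3-rc1

# When testing is complete: merge into master and tag the release
git checkout -q master
git merge -q --no-ff -m "Merge release branch 4.3" 4.3
git tag v4.3-rel
git tag --list   # prints the tags: 4.3-rc1 and v4.3-rel
```

In a real release the branch and tags are of course pushed to Gitlab so that CI can run against them.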
Endringskontroll (EK) and documentation
The EK briefly describes the changes in a manner understandable by non-technical personnel. In addition, we need to document how the changes were tested, by whom, and, if relevant, who approved them. The EK must refer to relevant issues, merge requests (MRs) or milestones in Gitlab. If sensitive information is involved, create a document on TSD in a meaningfully named directory in /tsd/p22/data/durable2/investigations.
Generally, the success criteria for a change/feature should be agreed on early in the development phase and documented clearly in Gitlab. This should be discussed with an end-user representative.
EK should have a list or a table containing:
- Short title
- References to Gitlab
- Description of success criteria/requirements for approval
- Names of analyses/samples
- Coverage/precision/specificity for the pipeline (using the NA12878N6 sample)
- Type of test (see below)
- Name of developer/tester/approving user
- Date of testing/approval
Some changes do not require an Endringskontroll, for example:
- changes related to testing (CI)
- minor refactorings of code not affecting any functionality
- urgent bug fixes that are done "in-place" in production (and are verified to work)
Code review
Generally, all changes are done via an MR in Gitlab. They need another developer's approval to be merged into the codebase. The items to consider in an MR are listed in the template selected when the MR was created. See Development and review for more details.
Some changes can be merged without another developer's approval. Such changes should be fairly trivial:
- changes already reviewed in another MR
- documentation
Types of testing
CI testing: some testing is done in an automated fashion using Gitlab's CI (continuous integration) system. These tests are defined in the file .gitlab-ci.yml at the root of the vcpipe repository and are triggered automatically when changes are committed, or manually via Gitlab's CI API. See the README file of the vcpipe repository for more details.
Manual testing: running parts of, or the whole of, the bioinformatics pipeline. This can be done either locally on the developers' personal machines or on any of the test servers (like tomato or focus).
Pipeline runs on NSC and TSD: complete analyses including mapping, variant calling and annotation can be run using either patient samples or non-sensitive test data (NA). This is most easily done using the automation system (the executor program in the vcpipe repo) on either NSC or TSD. The relevant fastq/bam files and metadata files (.analysis and .sample) are copied to the directories monitored by the automation system, and the result files can then be inspected and verified. On TSD, a staging environment is set up to mimic the production one. The analyses run there are sent to ELLA staging and can be verified there. Typically, results are compared to previous results available in the production environment (including ELLA production).
CI Testing
The various pipelines are tested using Gitlab's CI, specifying the version of the modules. Run the script in the vcpipe repo (preferably the one from the branch with the relevant changes), where the token is the personal token you created on vcpipe's Gitlab project settings page, <pipeline> is one of annopipe, basepipe or triopipe, and <version> can be either a branch name or a tag.
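The exact command line is not reproduced here. As a hypothetical sketch only (the script name ops/trigger-ci-test.sh is taken from the packaging step later in this procedure; the argument order is an assumption, so check the script's own usage message before running it):

```shell
# Hypothetical invocation sketch; the argument order is an assumption.
TOKEN="personal-trigger-token"   # created on vcpipe's Gitlab settings page
PIPELINE="basepipe"              # one of: annopipe, basepipe, triopipe
VERSION="4.3-rc1"                # a branch name or a tag

# Printed as a dry run here rather than executed:
cmd="ops/trigger-ci-test.sh $TOKEN $PIPELINE $VERSION"
echo "$cmd"
```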
A regression test suited for the exome, trio and target (cancer) pipelines is triggered as part of a release. Further details are found in the README file of the vcpipe repo.
- Gitlab's CI runs the pipeline on tomato.uio.no, our CI server. Errors are registered as issues in Gitlab.
- The pilot genome Reference Material NA12878 is used in each integration test to call variants. The calls are compared against the "Gold Standard" high-confidence variants from the Genome in a Bottle (GIAB) Consortium. Deviations from the benchmark are reported and cause the run to be marked as failed.
- The exome pipeline is run with a large gene panel. The number of filtered variants reported in the interpretation sheet is compared against previous runs. Any deviations are flagged and cause the run to be marked as failed.
Testing on TSD
The staging environment in TSD (/cluster/projects/p22/staging) can be used to test any available NA or diagnostic sample.
The environment is similar to the production one to allow for realistic testing using diagnostic samples. The database, the directory structure and the software are independent of the ones in production. The computation is done on the regular compute cluster, including Dragen.
As part of a release, one must put the new software modules in the staging environment and run tests. At a minimum, one should run an exome sample and a target (cancer) sample and make sure that the pipeline completes successfully and that the result files are present on disk. If running the analysis type annopipe, the result files should be available in the ELLA staging environment. When the release introduces fundamental changes, the result files must be checked more thoroughly and approved by lab engineers.
The testing must use either finalized and tagged modules or modules that contain most of the changes. Due to long computation times, it is often not possible to run tests again once the modules have been tagged.
Relevant analyses and samples directories are copied to staging/{analyses,samples}. Existing samples/analyses are found in /tsd/p22/data/durable{,2,3,4}.
The vcpipe module is deployed by placing it in staging/sw/archive and then running staging/sw/deploy.sh.
The pipeline is started in the VM p22-submit-dev by starting executor and webui using the scripts in staging/sw.
Check with the bioinformatician on production duty whether that VM is available.
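Put together, the staging steps can be sketched as a dry run. The analysis directory and archive name below are hypothetical placeholders, and the commands are printed rather than executed:

```shell
# Dry-run sketch of the TSD staging sequence; nothing is executed against
# TSD here. <analysis-dir> and vcpipe-4.3-rc1.tgz are hypothetical names.
staging=/cluster/projects/p22/staging

cmds="cp -r /tsd/p22/data/durable2/<analysis-dir> $staging/analyses/
cp vcpipe-4.3-rc1.tgz $staging/sw/archive/
$staging/sw/deploy.sh"

echo "$cmds"
```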
Versioning and hotfixes
Important updates to the pipeline software are tagged with versions of the form v4.3.2-rel.
- The first number is incremented when there are major changes like introducing a new structure or a new set of tools.
- The second number (minor) is incremented for normal updates that might add new or improve current functionality.
- The last number (micro) is incremented for smaller, often trivial changes, or for an unplanned change that was initially made directly in the production environment (a hotfix). The hotfix change is also made in the code repository (i.e. Gitlab), and the repository is tagged. Often the hotfix version is released through the standard procedure soon after, to make sure that production is in sync with the tagged code repository. The practice of code review and approval in a (Gitlab) merge request also applies to hotfix changes that are later made in the code repository.
- A hotfix is done in-place in production without an "Endringskontroll". Such hotfixes are only done for bugs or updates that need to be done quickly.
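As an illustration of the scheme, a hotfix bumps only the micro number. The helper below is a sketch for illustration, not an existing vcpipe script:

```shell
# Illustrative helper for the v<major>.<minor>.<micro>-rel scheme: a hotfix
# bumps only the micro number. This is a sketch, not an existing vcpipe script.
bump_micro() {
    ver=${1#v}           # strip leading "v":    v4.3.2-rel -> 4.3.2-rel
    ver=${ver%-rel}      # strip "-rel" suffix:  4.3.2-rel  -> 4.3.2
    major=${ver%%.*}
    rest=${ver#*.}
    minor=${rest%%.*}
    micro=${rest#*.}
    echo "v${major}.${minor}.$((micro + 1))-rel"
}

bump_micro v4.3.2-rel   # prints: v4.3.3-rel
```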
Creating a deployable package
A deployable package is created by triggering a job in Gitlab's CI.
Trigger a run of CI, giving deploy as the argument to the ops/trigger-ci-test.sh script.
The packages for vcpipe and its Python modules will be available on the build server (tomato) in the directory /storage/releases.
Transfer both the tgz and the sha1 file to TSD/NSC. If you haven't changed the Python modules, there is no need to transfer the Python module package.
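On the receiving side, the sha1 file lets you verify that the package survived the transfer intact. A local demonstration (the file names are stand-ins for the real artifacts from /storage/releases):

```shell
# Local demonstration of verifying a package against its checksum file.
# vcpipe-4.3.tgz is a stand-in; the real files come from /storage/releases.
set -e
workdir=$(mktemp -d)
cd "$workdir"

echo "pretend package contents" > vcpipe-4.3.tgz
sha1sum vcpipe-4.3.tgz > vcpipe-4.3.tgz.sha1   # normally produced by CI

# After transferring both files, verify integrity:
sha1sum -c vcpipe-4.3.tgz.sha1   # prints: vcpipe-4.3.tgz: OK
```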
See also
- The technical steps of vcpipe release, deployment, and archiving in the procedure "Release and deployment"
- The guide "Development guide" for a general overview of the development process
- The general "Endringskontroll - AMG" procedure
Attachments
See template "Endringskontroll for bioinformatiske endringer" in EHB.