HTS Bioinf - Update of in-house annotation data
Scope
The aim of this procedure is to provide instructions for updating the four HTS-Bioinf sensitive databases:
1. Structural variant database for whole genome sequencing
This section provides instructions for creating databases of copy number variation (CNV) and general structural variation (SV) from whole genome sequencing data produced by the Dragen analysis platform. The CNV database is based on measurements of changes in read depth. The SV database is based mainly on measurements of aberrant insert sizes in read mapping. In addition, this section describes how to make visualization tracks of the combined CNV and SV database for the analysis software ELLA.
2. CNV exome database
This section provides instructions for creating two databases of copy number variation (CNV) from whole exome sequencing data. The first database is a serialized R object (RDS) that can be read by the R statistical package. It contains the median read depth per exon for a particular capture kit. It is used as a background reference for the CNV caller. The second database is a BED file. It contains the actual CNVs, that is, a collection of genomic intervals that were either deleted or duplicated in a sample. It is used to annotate CNVs in order to facilitate interpretation.
3. CoNVaDING background data
This section provides instructions for creating/updating CoNVaDING background data. CoNVaDING uses a set of background controls for calling CNVs on a new sample. The basic principle is that the coverage of a region in the new sample is compared against the coverage of the same region in background control samples with similar coverage characteristics. This section details how to create/update new data sets, which is necessary whenever there are substantial changes to the lab or bioinformatic processes.
4. Tumor pipeline background data
In order to distinguish efficiently between noise and mosaic variants, a panel of normals is used in the tumor pipeline variant calling, as recommended by GATK best practices. At least 40 samples, prepared and sequenced in the same way as the proband, will be included.
Responsibility
The bioinformatician responsible for annotation updates is responsible for keeping sensitive databases updated. EKG shares the responsibility to keep CoNVaDING background data updated.
Release of sensitive databases
Make a Gitlab issue from the template for sensitive-db releases, listing which databases to update:
- Structural variant WGS database
- CNV exome database
- CoNVaDING background data
- Tumor pipeline background data
The sensitive databases are under version control in two separate Git repositories, currently located in /ess/p22/data/durable/development/sw/sensitive-db-factory:
- sensitive-db for:
  - CNV exome database
  - CoNVaDING background data
  - Tumor pipeline background data
- sv-indb for:
  - Structural variant WGS database
Releases should follow the same procedures designated for software repositories.
NOTE: the following steps are automated for the structural variant WGS database, and integrated into the procedure for updating it.
- Check out a new Git feature branch. The branch name [gitlab-issue-feature-branch] should contain the Gitlab issue number.
- Add all relevant files (note that these are database-specific) and commit the changes to the Git feature branch.
- It is advisable to test any changes to the database in staging before merging the feature branch into master.
- Merge the feature branch into master and tag the latter with a semantic VERSION, e.g. vX.X.X.
- Export the release .tgz archive and its .sha1 file (note that the Git archive's name contains both the release version tag and the commit-id the tag points to).
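For illustration only, the steps above might look like this on the command line. The issue number, branch name, version, and file name below are invented placeholders; the exact archive naming should follow the repository's release procedure:

# Hypothetical release walk-through; numbers and names are placeholders.
git checkout -b 42-update-cnv-exome-db          # branch name contains Gitlab issue #42
git add exCopyDepth/cnv-background-trio-20990101.rds
git commit -m "Update CNV exome background (#42)"
git checkout master
git merge 42-update-cnv-exome-db
git tag v1.2.3
# The archive name carries both the tag and the commit-id it points to:
COMMIT=$(git rev-parse --short "v1.2.3^{commit}")
git archive -o "sensitive-db-v1.2.3-${COMMIT}.tgz" v1.2.3
sha1sum "sensitive-db-v1.2.3-${COMMIT}.tgz" > "sensitive-db-v1.2.3-${COMMIT}.tgz.sha1"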
Structural variant WGS database
When to update
The database should be updated whenever there are significant changes to the database creation tools or to the Dragen SV and CNV callers. The frequency database's purpose is to identify common variants. Common variants can be false positives or true positives but in either case they are considered to be likely non-pathogenic. If the database is not updated after a Dragen update, its filtering efficacy may be impaired.
A change to the Dragen SV and CNV callers is regarded as significant if its effect on the database's frequency filtering efficacy is significant.
It is important to note that the database's frequency filtering efficacy depends on the number of samples used to compute the frequencies, and that this number is in many cases more important than the Dragen version the samples were run on. Therefore, it is often better to have a large database based on an older Dragen version than a small database based on the latest Dragen version.
Location of the database
After deployment, the database is stored in a sub-directory of sensitive-db on TSD. Its location at the time of writing is /ess/p22/data/durable/production/reference/sensitive/sensitive-db/wgs-sv.
Prerequisites
The database generation scripts and a Singularity container with the required tools should be placed on TSD as described here.
Whenever the sample selection criterion for the database changes due to an upgrade of Dragen, a new version of make-sv-indb containing the new sample selection criterion should be released.
Generating new data
- Log in to p22-hpc-03.
- Go to /ess/p22/data/durable/development/sw/sensitive-db-factory/sv-indb/make-sv-indb. This is where make-sv-indb should be.
- Check which samples will be considered for inclusion into the database by running:
SETUP_FILE=config/manta_dr.json make transfer-dryrun
SETUP_FILE=config/canvas_dr.json make transfer-dryrun
- Create the databases. This must be done on the compute cluster and may take several days. Because the runs can span days, it is advised to manually set today's date in the version field of all JSON files used in the following steps (a scripted sketch is given after this list):
SETUP_FILE=config/manta_dr.json make full-process
SETUP_FILE=config/canvas_dr.json make full-process
- Post-process the databases to minimize file sizes:
SETUP_FILE=config/manta_dr.json make postprocess
SETUP_FILE=config/canvas_dr.json make postprocess
- Move the databases to version control and release a tar archive:
  - Add the first database and choose not to release when prompted:
SETUP_FILE=config/manta_dr.json make store-database
  - Add the second database and choose to release when prompted:
SETUP_FILE=config/canvas_dr.json make store-database
- Deploy by moving the tar archive created by the previous step and extracting it inside the sensitive-db/ directory, as shown in the standard deployment procedure.
- Make tracks for ELLA. Make sure that the version field in the JSON file corresponds to the one from the previous steps.
  - Combine the two databases into one database:
SETUP_FILE=config/merged_dr.json make merge
  - Make tracks in BED format and BigWig format:
SETUP_FILE=config/merged_dr.json make bigwig
- Notify the ELLA development team that the database tracks indb_dr_merged*.vcf.gz* and indb_dr_merged*.bigWig* are available and point them to their location.
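Setting the version field mentioned in the steps above can be scripted. A minimal sketch, assuming jq is available on the host and that version is a top-level key in these setup files (both assumptions; verify against the actual JSON structure):

# ASSUMPTION: jq is installed and "version" is a top-level JSON key.
DATE=$(date +"%Y%m%d")
for f in config/manta_dr.json config/canvas_dr.json config/merged_dr.json; do
    jq --arg v "${DATE}" '.version = $v' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done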
CNV exome database
When to update
The database should be updated only when there is a specific reason to do so, e.g. when the capture protocol changes.
Tools
The tool for calling CNVs is exCopyDepth, written in R. It is accompanied by the modules symlinkRefBam.R and computeBackground.R for generating the median read depth per exon database. The tool for annotating the CNVs is cnvScan, written in Python. There is no specific tool for collecting its output; instead, instructions are given below for creating the CNV database.
Input
The samples' BAM files are the input for generating the median read depth per exon database. The *.cnv.bed result files produced by cnvScan are used to create the CNV database.
Procedure
Default instructions are given for trio pipelines. For other pipelines (exome or target), the reader is invited to adapt the commands accordingly (as explained in the remarks).
Setup
Prerequisites:
- refdata repository, available at /ess/p22/data/durable/production/reference/public/refdata
- vcpipe repository, available at /ess/p22/data/durable/production/sw/variantcalling/vcpipe
- vcpipe-essentials repository, available at /ess/p22/data/durable/production/sw/variantcalling/vcpipe-essentials
Open a terminal and set up your environment:
export SETTINGS=/ess/p22/data/durable/production/sw/variantcalling/vcpipe/config/settings-tsd.json
source /ess/p22/data/durable/production/sw/variantcalling/vcpipe/exe/setup.source
source /ess/p22/data/durable/production/sw/sourceme
Make a new directory and cd into it, for example:
DATE=$(date +"%Y%m%d")
cd /ess/p22/data/durable/development/personal/${USER}
mkdir ${DATE}-inDB-CNV && cd ${DATE}-inDB-CNV
Compute the reference read depth per exon
- Collect the BAM files in a directory:
Rscript /ess/p22/data/durable/production/sw/variantcalling/vcpipe/src/annotation/cnv/exCopyDepth/symlinkRefBam.R Trio1
The defaults are:
- Repeat for Trio2, Trio3, etc.
- Make sure NA or HG samples were not added twice (in two projects). If so, remove duplicates (a duplicate check is sketched after these steps).
- For exomes, the first parameter becomes something like excap.
- Compute the background read depth per exon:
Rscript /ess/p22/data/durable/production/sw/variantcalling/vcpipe/src/annotation/cnv/exCopyDepth/computeBackground.R
The defaults are (REFPATH=/ess/p22/data/durable/production/reference/public/refdata/data):
--reference refBam --probes ${REFPATH}/captureKit/common/illumina_trusightone_v01/illumina_trusightone_v01.probes.bed --fasta ${REFPATH}/genomic/common/general/human_g1k_v37_decoy.fasta --output reference.rds
Rename the output file to cnv-background-trio-${DATE}.rds (see the sketch after these steps).
NOTE: For exome, the probes parameter should point to the exome capture kit's probes BED file, and the output file should be renamed to cnv-background-exome-${DATE}.rds.
- Update the CNV calls:
Run indbCNV_creator.py, which generates two files: cnv-calls-excap-{yyyymmdd}.sorted.bed.gz and cnv-calls-excap-{yyyymmdd}.sorted.bed.gz.tbi (indbCNV_creator.py is in the amg repo's src/indb directory).
directory). -
Move the produced background read depth and CNV calls files into the
exCopyDepth
subdirectory:mv cnv-background-trio-${DATE}.rds /ess/p22/data/durable/development/sw/sensitive-db-factory/sensitive-db/exCopyDepth/ mv cnv-calls-trio-${DATE}.sorted.bed.gz* /ess/p22/data/durable/development/sw/sensitive-db-factory/sensitive-db/exCopyDepth/
- Update /ess/p22/data/durable/development/sw/sensitive-db-factory/sensitive-db/sensitive-db.json with the new values (relative paths) for the background and database keys. At the time of writing, the file contains:
{
  "cnv": {
    "exCopyDepth": {
      "trio": {
        "background": "exCopyDepth/cnv-background-trio-20161004.rds",
        "database": "exCopyDepth/cnv-calls-trio-20161021.sorted.bed.gz"
      },
      "excap": {
        "background": "exCopyDepth/cnv-background-excap-20170209.rds",
        "database": "exCopyDepth/cnv-calls-excap-20170209.sorted.bed.gz"
      }
    }
  }
}
NOTE: the index file does not get its own entry.
- Update the release notes /ess/p22/data/durable/development/sw/sensitive-db-factory/sensitive-db/release-notes.md:

| Tag | Date | File | Who | Notes |
|------------|----------|-------------------------------------------|----------------------|----------------|
| v1.0.0-rel | 20161021 | cnv-background-trio-20161021.rds | p22-huguesfo (pilot) | Trio1-5 P,M,F |
| | | cnv-calls-trio-20161021.sorted.bed.gz.tbi | p22-huguesfo (pilot) | Trio1-5 P only |
| | | cnv-calls-trio-20161021.sorted.bed.gz | p22-huguesfo (pilot) | Trio1-5 P only |
| v1.0.1-rel | 20161111 | cnv-background-trio-20161111.rds | p22-huguesfo | Trio1-5 P,M,F |
| | | cnv-calls-trio-20161111.sorted.bed.gz.tbi | p22-huguesfo | Trio1-5 P,M,F |
| | | cnv-calls-trio-20161111.sorted.bed.gz | p22-huguesfo | Trio1-5 P,M,F |
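As referenced above, two of the remarks in these steps can be scripted. A minimal sketch, assuming the refBam directory and the reference.rds output name from the defaults shown above, and that NA/HG control samples are identifiable by their ID prefix:

# Remark in the BAM collection step: list NA/HG control sample IDs that occur
# in more than one BAM link (i.e. were added from two projects).
ls refBam/ | grep -oE '(NA|HG)[0-9]+' | sort | uniq -d

# Background computation step: rename the computeBackground.R output; use
# cnv-background-exome-${DATE}.rds instead for exome data.
DATE=$(date +"%Y%m%d")
mv reference.rds cnv-background-trio-${DATE}.rds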
CoNVaDING background data
CoNVaDING background data are generated and added to sensitive-db according to the procedure Update of CoNVaDING background data.
NOTE: the custom capture kit currently used for CoNVaDING background data is CuCaV3.
- Update the release notes /ess/p22/data/durable/development/sw/sensitive-db-factory/sensitive-db/release-notes.md
- Deploy as part of sensitive-db.
Tumor pipeline background data
Update data
The Tumor pipeline background data need only be re-generated when sample preparation or sequencing changes. For each sample included in the panel of normals, run the following commands:
gatk4 \
Mutect2 \
-R ${bundle.reference.fasta} \
--germline-resource ${bundle.mutect2.germline_resource} \
--genotype-germline-sites true \
-I ${bam_file} \
-tumor ${analysis_name} \
-L ${calling_region} \
-O ${analysis_name}${output_suffix}.raw.vcf.gz
gatk4 \
CreateSomaticPanelOfNormals \
--vcfs ${analysis_name}${output_suffix}.raw.vcf.gz \
-O panelOfNormals.${DATE}.vcf.gz
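A hedged sketch of how these commands could be wired together for all samples in the panel. Here samples.tsv, the upper-case shell variables (standing in for the pipeline's ${bundle.*} template values), and the omission of ${output_suffix} are all assumptions, not part of the production pipeline:

#!/usr/bin/env bash
# ASSUMPTIONS: samples.tsv holds "analysis_name<TAB>bam_file" per normal
# sample; REFERENCE_FASTA, GERMLINE_RESOURCE and CALLING_REGION stand in for
# the ${bundle.*} template values above (${output_suffix} omitted for brevity).
set -euo pipefail
DATE=$(date +"%Y%m%d")
VCF_ARGS=()
while IFS=$'\t' read -r analysis_name bam_file; do
    # Per-sample Mutect2 call, as in the command block above.
    gatk4 Mutect2 \
        -R "${REFERENCE_FASTA}" \
        --germline-resource "${GERMLINE_RESOURCE}" \
        --genotype-germline-sites true \
        -I "${bam_file}" \
        -tumor "${analysis_name}" \
        -L "${CALLING_REGION}" \
        -O "${analysis_name}.raw.vcf.gz"
    VCF_ARGS+=(--vcfs "${analysis_name}.raw.vcf.gz")
done < samples.tsv
# Build the panel of normals from all per-sample VCFs.
gatk4 CreateSomaticPanelOfNormals "${VCF_ARGS[@]}" -O "panelOfNormals.${DATE}.vcf.gz"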
Release data
- Move the produced VCF file and index file, e.g. panelOfNormals.20190308.vcf.gz and panelOfNormals.20190308.vcf.gz.tbi, to /ess/p22/data/durable/development/sw/sensitive-db-factory/sensitive-db/tumor
- Update the /ess/p22/data/durable/development/sw/sensitive-db-factory/sensitive-db/sensitive-db.json file's tumor.panel_of_normal entry with the new version
- Update the release notes /ess/p22/data/durable/development/sw/sensitive-db-factory/sensitive-db/release-notes.md
- Deploy as part of sensitive-db.
Deployment of sensitive databases
Obtain the latest:
- sensitive-db Git archive (and its sha1 file)
- sv-indb Git archive (and its sha1 file)
Upload them to the respective production archive directories {production}/sw/archive on TSD and NSC (see the main production routines).
Deploy to the production pipeline
Deploy by executing the following two commands on TSD and NSC.
- Sensitive DB:
- Structural variation DB:
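The commands themselves are part of the standard deployment procedure. Purely as an illustration, unpacking a release archive could look like the sketch below; the archive name and target path are placeholders and must be verified against that procedure:

# Illustration only; verify names and target paths before running in production.
cd /ess/p22/data/durable/production/sw/archive
sha1sum -c sensitive-db-vX.X.X-<commit-id>.tgz.sha1
tar -xzf sensitive-db-vX.X.X-<commit-id>.tgz -C /ess/p22/data/durable/production/reference/sensitive/sensitive-db/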
Deploy to ELLA anno
Deployment for ELLA anno is part of the deployment for the production pipeline. It is only required when there are changes to either of:
- CNV exome database
- Structural variant WGS database
Deploy to ELLA anno on durable
This deployment will be shared by all ELLA anno services on TSD.
- Check that no production jobs are running, e.g. using ./ops/num-active-jobs.sh
- Stop production using ./ops/supervisor-ctl.sh -e master
- Update the CNV exome database if necessary:
  - copy cnv-calls-excap-{yyyymmdd}.sorted.bed.gz and cnv-calls-excap-{yyyymmdd}.sorted.bed.gz.tbi to /ess/p22/data/durable/production/anno/sensitive-db (illustrated in the sketch below)
  - update /ess/p22/data/durable/production/anno/ops/start_instance.sh (change EXOME_CNV_INHOUSE_DB to the new version)
- Update the structural variant WGS databases if necessary
- Start production using ./ops/supervisor-ctl.sh -e master
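For reference, the CNV exome update step above could look like the following sketch; the date and source path are placeholders, production must be stopped first, and the grep is only a sanity check:

# Copy the new CNV exome database files into the anno sensitive-db directory
# (the date 20990101 and the source path are placeholders):
cd /ess/p22/data/durable/production/anno
cp /path/to/cnv-calls-excap-20990101.sorted.bed.gz{,.tbi} sensitive-db/
# Edit ops/start_instance.sh so that EXOME_CNV_INHOUSE_DB points to the new
# version, then confirm the change:
grep EXOME_CNV_INHOUSE_DB ops/start_instance.sh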