
HTS Bioinf - Update of in-house annotation data

Scope

The aim of this procedure is to provide instructions for updating the four HTS-Bioinf sensitive databases:

1. Structural variant database for whole genome sequencing

This section provides instructions for creating databases of copy number variation (CNV) and general structural variation (SV) from whole genome sequencing data produced by the Dragen analysis platform. The CNV database is based on measurements of changes in read depth, while the SV database is based mainly on measurements of aberrant insert sizes in read mapping. In addition, this section describes how to make visualization tracks of the combined CNV and SV database for the analysis software ELLA.

2. CNV exome database

This section provides instructions for creating two databases of copy number variation (CNV) from whole exome sequencing data. The first database is a serialized R object (RDS) that can be read by the R statistical package. It contains the median read depth per exon for a particular capture kit. It is used as a background reference for the CNV caller. The second database is a BED file. It contains the actual CNVs, that is, a collection of genomic intervals that were either deleted or duplicated in a sample. It is used to annotate CNVs in order to facilitate interpretation.

3. CoNVaDING background data

This section provides instructions for creating and updating CoNVaDING background data. CoNVaDING calls CNVs on a new sample by comparing the coverage of each region against the coverage of the same region in a set of background control samples with similar coverage characteristics. This section details how to create or update these data sets, which is necessary whenever there are substantial changes to the lab or bioinformatic processes.

4. Tumor pipeline background data

In order to efficiently distinguish noise from mosaic variants, the tumor pipeline's variant calling uses a panel of normals, as recommended by GATK best practices. At least 40 samples, prepared and sequenced in the same way as the proband, are included.

Responsibility

The bioinformatician responsible for annotation updates is responsible for keeping sensitive databases updated. EKG shares the responsibility to keep CoNVaDING background data updated.

 
Release of sensitive databases

Make a Gitlab issue from the template for sensitive-db releases, listing which databases are to be updated.

The sensitive databases are under version control in two separate Git repositories currently located in /ess/p22/cluster/dev/sensitive:

  • sensitive-db for:
    • CNV exome database
    • CoNVaDING background data
    • Tumor pipeline background data
  • sv-indb for the Structural variant WGS database

Releases should follow the same procedures designated for software repositories.

NOTE: the following steps are automated for the structural variant WGS database, and integrated into the procedure for updating it.

  1. Check out a new Git feature branch. The branch name [gitlab-issue-feature-branch] should contain the Gitlab issue number.

    git checkout -b [gitlab-issue-feature-branch]
    
  2. Add all relevant files (note that these are database-specific) and commit the changes to the Git feature branch.

  3. It is advisable to test any changes to the database in staging before merging the feature branch into master.

  4. Merge the feature branch into master and tag the latter with a semantic VERSION, e.g. vX.X.X:

    git switch master
    git merge [gitlab-issue-feature-branch]
    git tag -a ${VERSION}-rel -m "Tag release ${VERSION}"
    
  5. Export the release .tgz archive and its .sha1 file (note that the Git archive's name contains both the release version tag and the commit-id the tag points to):

    ./export.sh ${VERSION}-rel
    

 

Structural variant WGS database

When to update

The database should be updated whenever there are significant changes to the database creation tools or to the Dragen SV and CNV callers. The frequency database's purpose is to identify common variants. Common variants can be false positives or true positives, but in either case they are considered likely non-pathogenic. If the database is not updated after a Dragen update, its filtering efficacy may be impaired.

A change to the Dragen SV and CNV callers is regarded as significant if its effect on the database's frequency filtering efficacy is significant.

It is important to note that the database's frequency filtering efficacy depends on the number of samples used to compute the frequencies, and that this number is in many cases more important than the Dragen version the samples were analyzed with. Therefore, it is often better to have a large database based on an older Dragen version than a small database based on the latest one.

Location of the database

After deployment, the database is stored in a sub-directory of sensitive-db on TSD. Its location at the time of writing is /ess/p22/data/durable/production/reference/sensitive/sensitive-db/wgs-sv.

Prerequisites

The database generation scripts and a Singularity container with the required tools should be placed in TSD as described here.

Whenever the sample selection criterion for the database changes due to an upgrade of Dragen, a new version of make-sv-indb containing the new sample selection criterion should be released.

Generating new data

  1. Log in to p22-submit-dev.
  2. Go to /ess/p22/cluster/dev/sensitive/sv-indb/make-sv-indb. This is where make-sv-indb should be.
  3. Check which samples will be considered for inclusion into the database by running:
    • SETUP_FILE=config/manta_dr.json make transfer-dryrun
    • SETUP_FILE=config/canvas_dr.json make transfer-dryrun
  4. Create the databases. This must be done on the compute cluster and may take several days. Because of this, it is advised to manually set the version field in all JSON files used in the following steps to today's date before starting:
    • SETUP_FILE=config/manta_dr.json make full-process
    • SETUP_FILE=config/canvas_dr.json make full-process
  5. Post-process the databases to minimize file sizes:
    • SETUP_FILE=config/manta_dr.json make postprocess
    • SETUP_FILE=config/canvas_dr.json make postprocess
  6. Move databases to version control and release a tar archive:
    • Add the first database and choose not to release when prompted: SETUP_FILE=config/manta_dr.json make store-database
    • Add the second database and choose to release when prompted: SETUP_FILE=config/canvas_dr.json make store-database
  7. Deploy by moving the tar archive created in the previous step and extracting it inside the sensitive-db/ directory, as shown in the standard deployment procedure.
  8. Make tracks for ELLA. Make sure that the version field in the JSON file corresponds to the one from the previous steps.
    • Combine the two databases into one database: SETUP_FILE=config/merged_dr.json make merge
    • Make tracks on BED format and BigWig format: SETUP_FILE=config/merged_dr.json make bigwig
  9. Notify the ELLA development team that the following new database tracks have been made:
    • /ess/p22/cluster/dev/shared/sv-indb/svdb/indb_dr_merged*.vcf.gz*
    • /ess/p22/cluster/dev/shared/sv-indb/svdb/indb_dr_merged*.bigWig*
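
Before notifying the ELLA team, it can be worth a quick sanity check that the released compressed tracks decompress cleanly. The sketch below runs against a stand-in file created on the fly, since the real paths are production-only; .vcf.gz files are BGZF-compressed, which is gzip-compatible, so gzip -t can verify them.

```shell
# Sketch: verify that every compressed track decompresses cleanly.
# The demo file stands in for /ess/p22/cluster/dev/shared/sv-indb/svdb/indb_dr_merged*.
set -euo pipefail
demo=$(mktemp -d)
printf 'chr1\t100\t200\n' | gzip > "$demo/indb_dr_merged_demo.vcf.gz"
for f in "$demo"/indb_dr_merged*.vcf.gz; do
    gzip -t "$f" && echo "OK: $(basename "$f")"
done
```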

 

CNV exome database

When to update

The database should be updated when there is a specific reason for updating, e.g. the capture protocol is changed.

Tools

The tool for calling CNVs is exCopyDepth, written in R. It is accompanied by the modules symlinkRefBam.R and computeBackground.R for generating the median read depth per exon database. The tool for annotating the CNVs is called cnvScan and is written in Python. There is no dedicated tool for collecting its output; instead, instructions for creating the CNV database are given below.

Input

The samples' BAM files are the input for generating the median read depth per exon database. The *.cnv.bed result files produced by cnvScan are used to create the CNV database.

Procedure

Default instructions are given for trio pipelines. For other pipelines (exome or target), the reader is invited to adapt the commands accordingly (as explained in the remarks).

Setup

Prerequisites:

  • refdata repository, available at /ess/p22/data/durable/production/reference/public/refdata
  • vcpipe repository, available at /ess/p22/data/durable/production/sw/variantcalling/vcpipe
  • vcpipe-essentials repository, available at /ess/p22/data/durable/production/sw/variantcalling/vcpipe-essentials

Open a terminal and set up your environment:

export SETTINGS=/ess/p22/data/durable/production/sw/variantcalling/vcpipe/config/settings-tsd.json
source /ess/p22/data/durable/production/sw/variantcalling/vcpipe/exe/setup.source
source /ess/p22/cluster/sw/sourceme

Make a new directory and cd into it, for example:

DATE=$(date +"%Y%m%d")
cd /ess/p22/cluster/dev/${USER}
mkdir ${DATE}-inDB-CNV && cd ${DATE}-inDB-CNV

Compute the reference read depth per exon

  1. Collect the BAM files in a directory

    Rscript /ess/p22/data/durable/production/sw/variantcalling/vcpipe/src/annotation/cnv/exCopyDepth/symlinkRefBam.R Trio1
    

    The defaults are:

    --output_path refBam
    --interpretation /ess/p22/data/durable/production/interpretations
    
    • Repeat for Trio2, Trio3, etc.
    • Make sure NA or HG samples were not added twice (in two projects). If so, remove duplicates.
    • For exomes, the first parameter becomes something like excap.

     

  2. Compute the background read depth per exon

    Rscript /ess/p22/data/durable/production/sw/variantcalling/vcpipe/src/annotation/cnv/exCopyDepth/computeBackground.R
    

    The defaults are (REFPATH=/ess/p22/data/durable/production/reference/public/refdata/data):

    --reference refBam
    --probes ${REFPATH}/captureKit/common/illumina_trusightone_v01/illumina_trusightone_v01.probes.bed
    --fasta ${REFPATH}/genomic/common/general/human_g1k_v37_decoy.fasta
    --output reference.rds
    

    Rename the output file:

    mv reference.rds cnv-background-trio-${DATE}.rds
    

    NOTE: For exome, the probes parameter should be:

    --probes ${REFPATH}/captureKit/agilent_sureselect_v05/wex_Agilent_SureSelect_v05_b37.baits.bed
    

    and the output file should be renamed to cnv-background-exome-${DATE}.rds

     

  3. Update the CNV calls

    Run indbCNV_creator.py (found in the amg repository's src/indb directory), which generates two files: cnv-calls-excap-{yyyymmdd}.sorted.bed.gz and its index cnv-calls-excap-{yyyymmdd}.sorted.bed.gz.tbi.

     

  4. Move the produced background read depth and CNV calls files into the exCopyDepth subdirectory:

    mv cnv-background-trio-${DATE}.rds /ess/p22/cluster/dev/sensitive/sensitive-db/exCopyDepth/
    mv cnv-calls-trio-${DATE}.sorted.bed.gz* /ess/p22/cluster/dev/sensitive/sensitive-db/exCopyDepth/
    

     

  5. Update /ess/p22/cluster/dev/sensitive/sensitive-db/sensitive-db.json with the new values (relative paths) for the background and database keys. At the time of writing, the file contains:

    {
        "cnv": {
            "exCopyDepth": {
                "trio": {
                    "background": "exCopyDepth/cnv-background-trio-20161004.rds",
                    "database": "exCopyDepth/cnv-calls-trio-20161021.sorted.bed.gz"
                },
                "excap": {
                    "background": "exCopyDepth/cnv-background-excap-20170209.rds",
                    "database": "exCopyDepth/cnv-calls-excap-20170209.sorted.bed.gz"
                }
            }
        }
    }
    

    NOTE: the index file does not get its own entry.

     

  6. Update the release notes /ess/p22/cluster/dev/sensitive/sensitive-db/release-notes.md:

    Tag         Date      File                                       Who                   Notes
    v1.0.0-rel  20161021  cnv-background-trio-20161021.rds           p22-huguesfo (pilot)  Trio1-5 P,M,F
                          cnv-calls-trio-20161021.sorted.bed.gz.tbi  p22-huguesfo (pilot)  Trio1-5 P only
                          cnv-calls-trio-20161021.sorted.bed.gz      p22-huguesfo (pilot)  Trio1-5 P only
    v1.0.1-rel  20161111  cnv-background-trio-20161111.rds           p22-huguesfo          Trio1-5 P,M,F
                          cnv-calls-trio-20161111.sorted.bed.gz.tbi  p22-huguesfo          Trio1-5 P,M,F
                          cnv-calls-trio-20161111.sorted.bed.gz      p22-huguesfo          Trio1-5 P,M,F
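
Step 1 above advises verifying that no NA or HG control sample was added twice. A hedged sketch of such a duplicate check follows; the directory and file names are invented stand-ins, and the real refBam layout may differ.

```shell
# Sketch: flag NA*/HG* control samples that appear more than once.
# The directory and file names below are invented stand-ins for refBam/.
set -euo pipefail
demo=$(mktemp -d)
touch "$demo"/NA12878-Trio1.bam "$demo"/NA12878-Trio2.bam "$demo"/HG002-Trio1.bam
# Extract the control-sample identifier from each file name and report duplicates
dups=$(ls "$demo" | grep -oE '^(NA|HG)[0-9]+' | sort | uniq -d)
echo "duplicated controls: ${dups:-none}"
```

Any identifier printed here would need its extra symlink removed before computing the background.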

     

CoNVaDING background data

CoNVaDING background data are generated and added to sensitive-db according to the procedure Update of CoNVaDING background data.

NOTE: the custom capture kit currently used for CoNVaDING background data is CuCaV3.

  1. Update the release notes /ess/p22/cluster/dev/sensitive/sensitive-db/release-notes.md
  2. Deploy as part of sensitive-db.
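
The comparison principle described in the Scope section, a region's coverage in the sample versus the same region in matched background controls, can be illustrated with toy numbers. All values below are invented; real CoNVaDING normalizes per-target coverage before comparing.

```shell
# Toy illustration of the CoNVaDING idea: coverage ratio vs. matched controls.
# All numbers are invented stand-ins.
set -euo pipefail
sample_cov=48                    # sample's coverage for one target region
controls="100 95 102 98 97"      # same region in five matched background controls
# Median of the control coverages
median=$(printf '%s\n' $controls | sort -n | awk '{a[NR]=$1} END {print a[int((NR+1)/2)]}')
ratio=$(awk -v s="$sample_cov" -v m="$median" 'BEGIN {printf "%.2f", s/m}')
echo "median=$median ratio=$ratio"   # a ratio near 0.5 hints at a heterozygous deletion
```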

 

Tumor pipeline background data

Update data

The Tumor pipeline background data need only be re-generated when sample preparation or sequencing changes. For each sample included in the panel of normals, run the following commands:

gatk4 \
    Mutect2 \
    -R ${bundle.reference.fasta} \
    --germline-resource ${bundle.mutect2.germline_resource} \
    --genotype-germline-sites true \
    -I ${bam_file} \
    -tumor ${analysis_name} \
    -L ${calling_region} \
    -O ${analysis_name}${output_suffix}.raw.vcf.gz

gatk4 \
    CreateSomaticPanelOfNormals \
    --vcfs ${analysis_name}${output_suffix}.raw.vcf.gz \
    -O panelOfNormals.${DATE}.vcf.gz
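
When more than one normal sample goes into the panel, CreateSomaticPanelOfNormals is given one --vcfs argument per sample. A sketch of assembling that argument list in bash follows; the sample names are invented and the gatk4 command is only echoed, not run.

```shell
# Sketch: build repeated --vcfs arguments, one per normal sample's raw VCF.
# Sample files are empty stand-ins; the gatk4 command is echoed, not executed.
set -euo pipefail
demo=$(mktemp -d) && cd "$demo"
touch sampleA.raw.vcf.gz sampleB.raw.vcf.gz
VCF_ARGS=()
for vcf in *.raw.vcf.gz; do
    VCF_ARGS+=(--vcfs "$vcf")
done
DATE=$(date +%Y%m%d)
echo gatk4 CreateSomaticPanelOfNormals "${VCF_ARGS[@]}" -O "panelOfNormals.${DATE}.vcf.gz"
```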

Release data

  1. Move the produced VCF file and its index file, e.g. panelOfNormals.20190308.vcf.gz and panelOfNormals.20190308.vcf.gz.tbi, to /ess/p22/cluster/dev/sensitive/sensitive-db/tumor
  2. Update the /ess/p22/cluster/dev/sensitive/sensitive-db/sensitive-db.json file's tumor.panel_of_normal entry with the new version
  3. Update the release notes /ess/p22/cluster/dev/sensitive/sensitive-db/release-notes.md
  4. Deploy as part of sensitive-db.

 

Deployment of sensitive databases

Obtain the latest

  • sensitive-db Git archive (and sha1)
  • sv-indb Git archive

Upload them to the respective production archive directories {production}/sw/archive on TSD and NSC (see the main production routines).
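
Before deploying, the archive can be verified against its .sha1 file. The sketch below uses a stand-in archive created on the fly, since real archive names are release-specific.

```shell
# Sketch: verify a release archive against its .sha1 file before extracting it.
# The archive here is a stand-in created on the fly.
set -euo pipefail
demo=$(mktemp -d) && cd "$demo"
echo "payload" > db.txt
tar czf sensitive-db-demo.tgz db.txt
sha1sum sensitive-db-demo.tgz > sensitive-db-demo.tgz.sha1
sha1sum -c sensitive-db-demo.tgz.sha1   # only proceed with deployment if this prints OK
```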

Deploy to the production pipeline

Deploy by executing the following two commands on TSD and NSC.

  • Sensitive DB:

    {production}/sw/deploy.sh ${ARCHIVE}/[sensitive-db-git-archive] prod sensitive-db
    
  • Structural variation DB:

    tar xvf ${ARCHIVE}/[sv-indb-git-archive] -C {production}/reference/sensitive/sensitive-db
    

Deploy to ELLA anno

Deployment for ELLA anno is part of the deployment for the production pipeline. It is only required when there are changes to either of:

  1. CNV exome database
  2. Structural variant WGS database

Deploy to ELLA anno on durable

This deployment will be shared by all ELLA anno services on TSD.

  1. Check that no production jobs are running, e.g. using ./ops/num-active-jobs.sh

  2. Stop production using ./ops/supervisor-ctl.sh -e master

  3. Update the CNV exome database if necessary

    • copy cnv-calls-excap-{yyyymmdd}.sorted.bed.gz and cnv-calls-excap-{yyyymmdd}.sorted.bed.gz.tbi to /ess/p22/data/durable/production/anno/sensitive-db
    • update /ess/p22/data/durable/production/anno/ops/start_instance.sh (change EXOME_CNV_INHOUSE_DB to the new version)
  4. Update the structural variant WGS databases if necessary

    tar xvf [path-to-wgs-sv-git-archive] -C /ess/p22/data/durable/production/anno/sensitive-db
    
  5. Start production using ./ops/supervisor-ctl.sh -e master