HTS Bioinf - Update sensitive databases

Scope

The aim of this procedure is to provide instructions for updating the five HTS-bioinf sensitive databases:

1. CNV exome database

This section provides instructions for creating two databases of copy number variation (CNV) from whole exome sequencing data. The first database is a serialized R object (RDS) that can be read by the R statistical package. It contains the median read depth per exon for a particular capture kit and is used as a background reference for the CNV caller. The second database is a BED file. It contains the actual CNVs, that is, a collection of genomic intervals that were either deleted or duplicated in a sample. It is used to annotate the CNVs in order to facilitate interpretation.

2. Structural variant database for whole genome sequencing

This section provides instructions for creating databases for structural variants (SV) and copy number variation (CNV) from whole genome sequencing data from the Dragen analysis platform. The CNV database is based on measurements of changes in read depth. The SV database is mainly based on measurements of aberrant insert sizes for read mapping. In addition, this section describes how to make visualization tracks for the analysis software Ella of the combined SV and CNV database.

3. Convading background data

This section provides instructions for creating/updating CoNVaDING background data. CoNVaDING uses a set of background controls for calling CNVs on a new sample. The basic principle is that the coverage of a region is compared against the cross-section of the coverage in samples with similar coverage characteristics (from the background controls).

This section details how to create/update new data sets. This is necessary whenever there are substantial changes to the lab or bioinformatic processes.

4. Tumor pipeline background data

In order to efficiently distinguish between noise and mosaic variants, a panel of normals is used in the tumor pipeline variant calling, as recommended by GATK best practices. At least 40 samples, prepared and sequenced in the same way as the sample under analysis, will be included.

5. Inhouse frequency databases

All the publicly available datasets used in annotation can be downloaded directly using the make commands in ella-anno or anno-targets. The sensitive datasets, however, must be generated on TSD.

Responsibility

The bioinformatician responsible for annotation updates is responsible for keeping the sensitive databases updated. EKG shares the responsibility for keeping the Convading background data updated.


Release and deployment of sensitive databases

Make a JIRA issue from the sensitive-db release template, listing which databases to update.

Release of the sensitive-db git archive

The sensitive-db git archive contains only three databases:

  • CNV exome database
  • Convading background data
  • Tumor pipeline background data

Release procedure sensitive-db

  1. Check out a new git feature branch in /cluster/projects/p22/dev/sensitive-db. The branch name [jira-issue-feature-branch] should contain the JIRA issue number:

git checkout -b [jira-issue-feature-branch]

  2. Add all relevant files to the git feature branch. Relevant files are database specific for CNV exome, Convading background or Tumor background.

  3. Commit the changes to the git feature branch:

git commit -m "[sensitive-db version] JIRA issue: Update data for [exome,convading,tumor]"

  4. If updating CNV exome or Tumor background, do testing.

  5. Tag the feature branch by the convention:

git tag -a vX.X.X-rc -m "Tag release candidate vX.X.X"

  6. Export the release candidate .tgz archive:

./export.sh vX.X.X-rc

  7. Deploy to TSD staging.

  8. Run a relevant sample to check that nothing fails (choose a recent exome sample or a recent tumor sample).

  9. Check out the master branch and merge the feature branch:

git merge [jira-issue-feature-branch]

  10. Tag the release by convention:

git tag -a vX.X.X-rel -m "Tag release vX.X.X"

  11. Export the release .tgz archive and the accompanying .sha1 file:

./export.sh vX.X.X-rel

  12. Note that the name of the git archive contains both the release version tag and the commit id that the git tag points to.
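
Taken together, a release might look like the following shell session. This is a sketch only: the bracketed placeholders and vX.X.X version numbers are as above, and the testing, staging deployment and test-sample steps are elided.

cd /cluster/projects/p22/dev/sensitive-db
git checkout -b [jira-issue-feature-branch]
git add [relevant database files]
git commit -m "[sensitive-db version] JIRA issue: Update data for [exome,convading,tumor]"
git tag -a vX.X.X-rc -m "Tag release candidate vX.X.X"
./export.sh vX.X.X-rc
# ... deploy the release candidate to TSD staging and verify with a test sample ...
git checkout master
git merge [jira-issue-feature-branch]
git tag -a vX.X.X-rel -m "Tag release vX.X.X"
./export.sh vX.X.X-rel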

Deployment of the sensitive-db and wgs-sv git archives

Obtain the latest:

  1. sensitive-db git archive (and its .sha1 file)
  2. structural variant WGS database git archive

Make sure that they are present at the four archive paths:

  • TSD cluster
    • Staging: /cluster/projects/p22/staging/sw/archive
    • Production: /cluster/projects/p22/production/sw/archive
  • NSC cluster
    • Staging: /boston/diag/staging/sw/archive
    • Production: /boston/diag/production/sw/archive

Transfer the git archives to NSC via the TSD directory /tsd/p22/data/durable/file-export. The transfer command on sleipnir is:

tacl --download [git-archive-and-sha1] --env alt p

Deploy to production pipeline

Deploy by executing the two commands:

1 - Sensitive DB:

./deploy.sh archive/[sensitive-db-git-archive] sensitive-db

2 - Structural variant DB:

tar xvf archive/[wgs-sv-git-archive] -C vcpipe/sensitive-db

in each of the four sw paths (a consolidated sketch for the TSD cluster follows the list):

  • TSD cluster
    • Staging: /cluster/projects/p22/staging/sw
    • Production: /cluster/projects/p22/production/sw
  • NSC cluster
    • Staging: /boston/diag/staging/sw
    • Production: /boston/diag/production/sw
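
On the TSD cluster, for example, both environments can be handled in one pass. A minimal sketch, assuming the archive files are already in place (the NSC cluster uses the same loop with the /boston/diag paths):

for env in staging production; do
    cd /cluster/projects/p22/${env}/sw
    ./deploy.sh archive/[sensitive-db-git-archive] sensitive-db
    tar xvf archive/[wgs-sv-git-archive] -C vcpipe/sensitive-db
done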

Deploy to ELLA anno

Deployment for ELLA anno is part of the deployment for the production pipeline. It is only required when there are changes to either of:

  1. CNV exome database
  2. Structural variant WGS database

Deploy to ELLA anno on durable

This deployment is shared by all the ELLA anno services on TSD. A consolidated sketch of these steps follows the list below.

  1. Check that no jobs are running on prod by running ./ops/num-active-jobs.sh
  2. Stop prod using ./ops/supervisor-ctl.sh -e master
  3. Update the CNV exome database if necessary:
    • Copy cnv-calls-excap-{yyyymmdd}.sorted.bed.gz and cnv-calls-excap-{yyyymmdd}.sorted.bed.gz.tbi to /tsd/p22/data/durable/production/anno/sensitive-db
    • Update /tsd/p22/data/durable/production/anno/ops/start_instance.sh, changing EXOME_CNV_INHOUSE_DB to the new version.
  4. Update the structural variant WGS databases if necessary:
    tar xvf [path-to-wgs-sv-git-archive] -C /tsd/p22/data/durable/production/anno/sensitive-db
  5. Start prod using ./ops/supervisor-ctl.sh -e master
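
Taken together, the steps might look like the following sketch, assuming the ops/ and sensitive-db/ directories both sit under /tsd/p22/data/durable/production/anno as the paths above suggest:

cd /tsd/p22/data/durable/production/anno
./ops/num-active-jobs.sh                 # must report no active jobs before proceeding
./ops/supervisor-ctl.sh -e master        # stop prod
cp [path-to]/cnv-calls-excap-{yyyymmdd}.sorted.bed.gz* sensitive-db/
# edit ops/start_instance.sh so EXOME_CNV_INHOUSE_DB points to the new version
tar xvf [path-to-wgs-sv-git-archive] -C sensitive-db
./ops/supervisor-ctl.sh -e master        # start prod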

CNV exome database

Tools

The tool for calling CNVs is called exCopyDepth and is written in R. It is accompanied by two scripts for generating the median read depth per exon database, symlinkRefBam.R and computeBackground.R. The tool for annotating the CNVs is called cnvScan and is written in Python. There is no specific tool for gathering its output; instead, instructions are given below for creating the CNV database.

Input

The BAM files of the samples are the input for the median read depth per exon database. The *.cnv.bed result files produced by cnvScan are used to create the CNV database.

Procedure

Default instructions are given for trios. For other pipelines (exome or target), the reader is invited to adapt the commands (as explained in the remarks).

Setup

Prerequisites:

  • vcpipe repository is available at /cluster/projects/p22/production/sw/vcpipe/vcpipe
  • vcpipe-bundle repository is available at /cluster/projects/p22/production/sw/vcpipe/vcpipe-bundle
  • vcpipe-bin repository is available at /cluster/projects/p22/production/sw/vcpipe/vcpipe-bin

Open a Terminal and set up your environment:

export SETTINGS=/cluster/projects/p22/production/sw/vcpipe/vcpipe/config/settings-tsd.json
source /cluster/projects/p22/production/sw/vcpipe/vcpipe/exe/setup.source
source /cluster/projects/p22/sw/sourceme

Make a new directory and cd into it, for example:

cd /cluster/projects/p22/dev/p22-${USER}
today=$(date +"%Y%m%d")
mkdir ${today}-inDB-CNV && cd ${today}-inDB-CNV

Compute the reference read-depth per exon

Collect the BAM files in a directory

Rscript /cluster/projects/p22/production/sw/vcpipe/vcpipe/src/annotation/cnv/exCopyDepth/symlinkRefBam.R Trio

The defaults are:

--output_path refBam \
--interpretation /tsd/p22/data/durable/production/interpretations
  • Repeat for Trio2, etc.
  • Check that NA or HG samples did not get added twice (in two projects). If so, remove the duplicates.
  • For exomes, the first parameter becomes something like excap
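
Written out with the defaults made explicit, the trio invocation is equivalent to the following. The exome variant below is a sketch, assuming excap is the correct first parameter for the capture kit in question:

Rscript /cluster/projects/p22/production/sw/vcpipe/vcpipe/src/annotation/cnv/exCopyDepth/symlinkRefBam.R Trio \
    --output_path refBam \
    --interpretation /tsd/p22/data/durable/production/interpretations

# Exome variant (first parameter names the capture kit, e.g. excap):
Rscript /cluster/projects/p22/production/sw/vcpipe/vcpipe/src/annotation/cnv/exCopyDepth/symlinkRefBam.R excap \
    --output_path refBam \
    --interpretation /tsd/p22/data/durable/production/interpretations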

Compute the background read-depth per exon

Rscript /cluster/projects/p22/production/sw/vcpipe/vcpipe/src/annotation/cnv/exCopyDepth/computeBackground.R

The defaults are:

--reference refBam \
--probes /cluster/projects/p22/production/sw/vcpipe/vcpipe-bundle/captureKit/illumina_trusightone_v01/illumina_trusightone_v01.probes.bed \
--fasta /cluster/projects/p22/production/sw/vcpipe/vcpipe-bundle/genomic/gatkBundle_2.5/human_g1k_v37_decoy.fasta \
--output reference.rds
  • Rename the output file to something like cnv-background-trio-20161004.rds:
mv reference.rds cnv-background-trio-${today}.rds
  • For exome, the probes parameter becomes:
--probes /cluster/projects/p22/production/sw/vcpipe/vcpipe-bundle/captureKit/agilent_sureselect_v05/wex_Agilent_SureSelect_v05_b37.baits.bed

and the output file should be renamed to cnv-background-exome-${today}.rds. Follow the procedure below to add the file to version control.
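
Written out in full, the exome invocation and rename might look like this sketch (flags and paths exactly as listed above):

Rscript /cluster/projects/p22/production/sw/vcpipe/vcpipe/src/annotation/cnv/exCopyDepth/computeBackground.R \
    --reference refBam \
    --probes /cluster/projects/p22/production/sw/vcpipe/vcpipe-bundle/captureKit/agilent_sureselect_v05/wex_Agilent_SureSelect_v05_b37.baits.bed \
    --fasta /cluster/projects/p22/production/sw/vcpipe/vcpipe-bundle/genomic/gatkBundle_2.5/human_g1k_v37_decoy.fasta \
    --output reference.rds
mv reference.rds cnv-background-exome-${today}.rds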

Update the CNV calls

  • Run indbCNV_creator.py, which generates two files: cnv-calls-excap-{yyyymmdd}.sorted.bed.gz and cnv-calls-excap-{yyyymmdd}.sorted.bed.gz.tbi (indbCNV_creator.py is found in the src/indb directory of the amg repository)
  • Follow the steps below to add the two result files to version control.

Check-in files in version control

  • /cluster/projects/p22/dev/sensitive-db is a git repository. Move the produced background read-depth and CNV calls files into the exCopyDepth subdirectory:
mv cnv-background-trio-${today}.rds /cluster/projects/p22/dev/sensitive-db/exCopyDepth/
mv cnv-calls-trio-${today}.sorted.bed.gz* /cluster/projects/p22/dev/sensitive-db/exCopyDepth/
  • Update /cluster/projects/p22/dev/sensitive-db/sensitive-db.json file with the new values (relative paths) for the background and database keys. For reference, that json file looks like this:
{
    "cnv": {
        "exCopyDepth": {
            "trio": {
                "background": "exCopyDepth/cnv-background-trio-20161004.rds",
                "database": "exCopyDepth/cnv-calls-trio-20161021.sorted.bed.gz"
            },
            "excap": {
                "background": "exCopyDepth/cnv-background-excap-20170209.rds",
                "database": "exCopyDepth/cnv-calls-excap-20170209.sorted.bed.gz"
            }
        }
    }
}
  • Note: the index file does not get its own entry.
  • Update the release notes /cluster/projects/p22/dev/sensitive-db/release-notes.md
Tag         Date      File                                       Who                   Notes
v1.0.0-rel  20161021  cnv-background-trio-20161021.rds           p22-huguesfo (pilot)  Trio1-5 P,M,F
                      cnv-calls-trio-20161021.sorted.bed.gz.tbi  p22-huguesfo (pilot)  Trio1-5 P only
                      cnv-calls-trio-20161021.sorted.bed.gz      p22-huguesfo (pilot)  Trio1-5 P only
v1.0.1-rel  20161111  cnv-background-trio-20161111.rds           p22-huguesfo          Trio1-5 P,M,F
                      cnv-calls-trio-20161111.sorted.bed.gz.tbi  p22-huguesfo          Trio1-5 P,M,F
                      cnv-calls-trio-20161111.sorted.bed.gz      p22-huguesfo          Trio1-5 P,M,F
  • Add the background file, the database and its index, and the json file to git, and commit:
cd /cluster/projects/p22/dev/sensitive-db
git add exCopyDepth/cnv-background-trio-${today}.rds
git add exCopyDepth/cnv-calls-trio-${today}.sorted.bed.gz
git add exCopyDepth/cnv-calls-trio-${today}.sorted.bed.gz.tbi
git add sensitive-db.json
git commit -m "CNV trio release ${today}
  • Continue with the standard sensitive-db git archive release

Update

The database is updated only when there is a specific reason to do so, e.g. when the capture protocol changes.

Remarks

For the data in sensitive-db-v1.0.2, the scripts in vcpipe-v1.4.4 were used: all Proband, Mother and Father samples were included in the reference read-depth of trios, and chrX and chrY were excluded for both trios and exomes. From vcpipe-v1.5 onward, however, only Mother and Father are included in the reference read-depth of trios, and chrX and chrY are always included.

Structural variant WGS database

When to create new databases

The databases need to be updated whenever there are significant changes to the database creation tools or to the Dragen SV and CNV callers. The frequency databases are meant for filtering out common variants. Common variants can be true positive or false positive, but in either case they are considered to be likely non-pathogenic. If the database is not updated after a Dragen update, it may lose part of its ability to filter false positives.

A change to the Dragen SV and CNV callers can be regarded as significant if the frequency filtering capabilities change significantly.

It is important to note that the frequency filtering capabilities of the database depend on the number of samples in the database, and that this number is in many cases more important than the Dragen version it has been run on. Therefore, it is often better to have a large database based on a previous Dragen version than a small database based on the latest version.

Location of databases

After deployment the databases are stored in a sub-directory in the sensitive-db on TSD. Current location is /cluster/projects/p22/production/sw/vcpipe/sensitive-db/wgs-sv.

Prerequisites

Database generation scripts and a Singularity container with the required tools should be placed in TSD as described in https://gitlab.com/ousamg/apps/make-sv-indb.

If changing the sample selection criterion for the database due to an upgrade of Dragen, a new release of https://gitlab.com/ousamg/apps/make-sv-indb should be made containing the new sample selection criterion.

Creating a new data set

  1. Log in to p22-submit-dev.
  2. Go to /cluster/projects/p22/dev/shared/sv-indb/make-sv-indb. This is where https://gitlab.com/ousamg/apps/make-sv-indb should have been extracted.
  3. Check which samples will be considered for inclusion in the database by running:
    • SETUP_FILE=config/manta_dr.json make transfer-dryrun
    • SETUP_FILE=config/canvas_dr.json make transfer-dryrun
  4. Make the databases. Database generation is performed on the compute cluster and may take several days. To take this into account, it is advised to manually add today's date to the "version" field in all the JSON files used in the following steps:
    • SETUP_FILE=config/manta_dr.json make full-process
    • SETUP_FILE=config/canvas_dr.json make full-process
  5. Postprocess the databases to minimize file sizes:
    • SETUP_FILE=config/manta_dr.json make postprocess
    • SETUP_FILE=config/canvas_dr.json make postprocess
  6. Move the databases to version control and release a tar archive:
    • Add the first database and choose not to release when prompted: SETUP_FILE=config/manta_dr.json make store-database
    • Add the second database and choose to release when prompted: SETUP_FILE=config/canvas_dr.json make store-database
  7. Deploy by moving the tar archive created in the previous step and extracting it inside the sensitive-db/ directory, as shown in the standard deployment procedure.
  8. Make tracks for ELLA, making sure that the "version" field in the JSON file corresponds to the previous steps:
    • Combine the two databases into one: SETUP_FILE=config/merged_dr.json make merge
    • Make tracks in BED and BigWig format: SETUP_FILE=config/merged_dr.json make bigwig
  9. Notify the ELLA development team that the following new database tracks have been made:
    • /cluster/projects/p22/dev/shared/sv-indb/svdb/indb_dr_merged*.vcf.gz*
    • /cluster/projects/p22/dev/shared/sv-indb/svdb/indb_dr_merged*.bigWig*
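
For reference, the full sequence of make targets run from the make-sv-indb directory might look like the sketch below. Note that the two store-database invocations prompt interactively, and full-process may keep cluster jobs running for days:

cd /cluster/projects/p22/dev/shared/sv-indb/make-sv-indb
SETUP_FILE=config/manta_dr.json  make transfer-dryrun    # inspect sample selection
SETUP_FILE=config/canvas_dr.json make transfer-dryrun
SETUP_FILE=config/manta_dr.json  make full-process       # long-running cluster jobs
SETUP_FILE=config/canvas_dr.json make full-process
SETUP_FILE=config/manta_dr.json  make postprocess        # minimize file sizes
SETUP_FILE=config/canvas_dr.json make postprocess
SETUP_FILE=config/manta_dr.json  make store-database     # answer "no" to release
SETUP_FILE=config/canvas_dr.json make store-database     # answer "yes" to release
SETUP_FILE=config/merged_dr.json make merge              # combine the two databases
SETUP_FILE=config/merged_dr.json make bigwig             # BED and BigWig tracks for ELLA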

Convading background data

Convading background data are generated and added to sensitive-db according to the procedure "Update of Convading background data".

  1. Update the release notes /cluster/projects/p22/dev/sensitive-db/release-notes.md
  2. Deploy as part of sensitive-db.

Tumor pipeline background data

Update data

For each sample included in the panel of normals, run the following command:

gatk4 \
    Mutect2 \
    -R ${bundle.reference.fasta} \
    --germline-resource ${bundle.mutect2.germline_resource} \
    --genotype-germline-sites true \
    -I ${bam_file} \
    -tumor ${analysis_name} \
    -L ${calling_region} \
    -O ${analysis_name}${output_suffix}.raw.vcf.gz

For all samples included in the panel of normals, run the following command:

gatk4 \
    CreateSomaticPanelOfNormals \
    --vcfs ${analysis_name}${output_suffix}.raw.vcf.gz \
    -O panelOfNormals.DATE_OF_GENERATED.vcf.gz

for example, panelOfNormals.20190308.vcf.gz.

The panel only needs to be re-generated when sample preparation or sequencing is changed. Version control follows the common rules for sensitive databases.
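
A minimal bash sketch of the two steps above, assuming a (hypothetical) samples.txt listing "analysis_name bam_file" per line, with REFERENCE_FASTA, GERMLINE_RESOURCE and CALLING_REGION standing in for the bundle paths used by the pipeline, the output suffix omitted for brevity, and --vcfs repeated once per sample as is usual for GATK list arguments:

vcf_args=()
while read -r analysis_name bam_file; do
    # Call variants on each normal sample individually.
    gatk4 Mutect2 \
        -R "${REFERENCE_FASTA}" \
        --germline-resource "${GERMLINE_RESOURCE}" \
        --genotype-germline-sites true \
        -I "${bam_file}" \
        -tumor "${analysis_name}" \
        -L "${CALLING_REGION}" \
        -O "${analysis_name}.raw.vcf.gz"
    vcf_args+=(--vcfs "${analysis_name}.raw.vcf.gz")
done < samples.txt

# Combine the per-sample VCFs into a date-stamped panel of normals.
gatk4 CreateSomaticPanelOfNormals \
    "${vcf_args[@]}" \
    -O "panelOfNormals.$(date +%Y%m%d).vcf.gz"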

Release and deploy

  1. Move the produced vcf file and index file, e.g. panelOfNormals.20190308.vcf.gz and panelOfNormals.20190308.vcf.gz.tbi to /cluster/projects/p22/dev/sensitive-db/tumor
  2. Update the /cluster/projects/p22/dev/sensitive-db/sensitive-db.json file with the new version for the tumor.panel_of_normal entry.
  3. Update the release notes /cluster/projects/p22/dev/sensitive-db/release-notes.md
  4. Release and deploy as part of sensitive-db

Inhouse frequency databases

On TSD, ssh into a VM with access to /cluster and run:

source /cluster/projects/p22/sw/sourceme
export SETTINGS=/cluster/projects/p22/production/sw/vcpipe/vcpipe/config/settings-tsd.json
cd /cluster/projects/p22/production/sw/vcpipe/vcpipe/exe
source setup.source
python indbcreator.py # for inDB
python indbWGS_creator.py # for wgsDB

Export the resulting VCF files.

References

CoNVaDING paper: https://www.ncbi.nlm.nih.gov/pubmed/