HTS Bioinf - Update sensitive databases
Scope
The aim of this procedure is to provide instructions for updating the five HTS-bioinf sensitive databases:
1. CNV exome database
This section provides instructions for creating two databases of copy number variation (CNV) from whole exome sequencing data. The first database is a serialized R object (RDS) that can be read by the R statistical package. It contains the median read depth per exon for a particular capture kit and is used as a background reference for the CNV caller. The second database is a BED file. It contains the actual CNVs, that is, a collection of genomic intervals that were either deleted or duplicated in a sample. It is used to annotate the CNVs in order to facilitate interpretation.
2. Structural variant database for whole genome sequencing
This section provides instructions for creating databases for structural variants (SV) and copy number variation (CNV) from whole genome sequencing data from the Dragen analysis platform. The CNV database is based on measurements of changes in read depth. The SV database is mainly based on measurements of aberrant insert sizes for read mapping. In addition, this section describes how to make visualization tracks for the analysis software Ella of the combined SV and CNV database.
3. CoNVaDING background data
This section provides instructions for creating/updating CoNVaDING background data. CoNVaDING uses a set of background controls for calling CNVs on a new sample. The basic principle is that the coverage of a region is compared against the cross section of the coverage in samples with similar coverage characteristics (from the background controls).
This section details how to create/update new data sets. This is necessary whenever there are substantial changes to the lab or bioinformatic processes.
4. Tumor pipeline background data
In order to efficiently distinguish between noise and mosaic variants, a panel of normals is used in the tumor pipeline variant calling, as recommended by GATK best practice. At least 40 samples, prepared and sequenced in the same way as the sample under analysis, are included.
5. Inhouse frequency databases
All the publicly available datasets used in annotation can be downloaded directly using the `make` commands in `ella-anno` or `anno-targets`. The sensitive datasets, however, must be generated on TSD.
Responsibility
The bioinformatician responsible for annotation updates is responsible for keeping the sensitive databases updated. EKG shares the responsibility for keeping the CoNVaDING background data updated.
Release and deployment of sensitive databases
Create a JIRA issue from the sensitive-db release template, listing which databases to update:
- CNV exome database
- Structural variant WGS database
- CoNVaDING background data
- Tumor pipeline background data
- Inhouse frequency databases
Release of the sensitive-db git archive
The `sensitive-db` git archive contains only three of the databases:
- CNV exome database
- CoNVaDING background data
- Tumor pipeline background data
Release procedure sensitive-db
- Check out a new git feature branch in `/cluster/projects/p22/dev/sensitive-db`. The branch name [jira-issue-feature-branch] should contain the JIRA issue number.
- Add all relevant files to the git feature branch. Relevant files are database specific: CNV exome, CoNVaDING background or Tumor background.
- Commit the changes to the git feature branch.
- If updating CNV exome or Tumor background, do testing.
- Tag the feature branch by the convention `git tag -a vX.X.X-rc -m "Tag release candidate vX.X.X"`
- Export the release candidate `.tgz` archive: `./export.sh vX.X.X-rc`
- Deploy to TSD staging.
- Run a relevant sample to check that nothing fails (choose a recent exome sample or a recent tumor sample).
- Check out the `master` branch and merge the feature branch.
- Tag the release by convention.
- Export the release `.tgz` archive and the accompanying `.sha1` file: `./export.sh vX.X.X-rel`
- The name of the git archive contains both the release version tag and the commit id that the git tag points to.
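As a minimal sketch of the sequence above, assuming a hypothetical JIRA issue DIAG-1234, release v1.2.0 and a CNV exome update (all placeholder values):

```bash
cd /cluster/projects/p22/dev/sensitive-db
git checkout -b DIAG-1234-update-cnv-exome            # hypothetical branch name with JIRA issue
git add exCopyDepth/cnv-background-trio-${today}.rds  # example database-specific file
git commit -m "Update CNV exome database (DIAG-1234)"
git tag -a v1.2.0-rc -m "Tag release candidate v1.2.0"
./export.sh v1.2.0-rc                                 # release candidate archive for staging
# ...deploy the candidate to TSD staging and run a test sample...
git checkout master
git merge DIAG-1234-update-cnv-exome
git tag -a v1.2.0-rel -m "Tag release v1.2.0"
./export.sh v1.2.0-rel                                # release archive plus .sha1 file
```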
Deployment of the sensitive-db and wgs-sv git archives
Obtain:
- the latest `sensitive-db` git archive (and `.sha1` file)
- the latest structural variant WGS database git archive
Make sure that they are present at the four archive paths:
- TSD cluster
- Staging:
/cluster/projects/p22/staging/sw/archive
- Production:
/cluster/projects/p22/production/sw/archive
- NSC cluster
- Staging:
/boston/diag/staging/sw/archive
- Production:
/boston/diag/production/sw/archive
Transfer the git archives to NSC via the TSD directory `/tsd/p22/data/durable/file-export`. The transfer command on `sleipnir` is:
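A minimal sketch, assuming the transfer is a plain copy of the archive files into the export directory (the archive name below is a placeholder, not the actual command):

```bash
# Hypothetical sketch only; substitute the real archive name and verify
# against the actual transfer procedure for sleipnir.
cp sensitive-db-vX.X.X-rel-COMMITID.tgz \
   sensitive-db-vX.X.X-rel-COMMITID.tgz.sha1 \
   /tsd/p22/data/durable/file-export/
```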
Deploy to production pipeline
Deploy by executing the two commands:
1. Sensitive DB:
2. Structural variant DB:
in each of the four sw paths (a sketch follows the path list below):
- TSD cluster
- Staging:
/cluster/projects/p22/staging/sw
- Production:
/cluster/projects/p22/production/sw
- NSC cluster
- Staging:
/boston/diag/staging/sw
- Production:
/boston/diag/production/sw
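Purely as a minimal sketch, assuming deployment amounts to extracting each archive in place (consistent with the tar-archive extraction described for the SV database below; archive names are placeholders):

```bash
# Hypothetical sketch for one of the four sw paths; repeat for the others.
cd /cluster/projects/p22/staging/sw
tar xzf archive/sensitive-db-vX.X.X-rel-COMMITID.tgz   # placeholder archive name
tar xzf archive/wgs-sv-vX.X.X.tgz                      # placeholder archive name
```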
Deploy to ELLA anno
Deployment for ELLA anno is part of the deployment for the production pipeline. It is only required when there are changes to either of:
- CNV exome database
- Structural variant WGS database
Deploy to ELLA anno on durable
This deployment is shared by all the ELLA anno services on TSD.
- Check that no jobs are running on prod, by running `./ops/num-active-jobs.sh`
- Stop prod using `./ops/supervisor-ctl.sh -e master`
- Update the CNV exome database if necessary:
  - Copy `cnv-calls-excap-{yyyymmdd}.sorted.bed.gz` and `cnv-calls-excap-{yyyymmdd}.sorted.bed.gz.tbi` to `/tsd/p22/data/durable/production/anno/sensitive-db`
  - Update `/tsd/p22/data/durable/production/anno/ops/start_instance.sh`, changing `EXOME_CNV_INHOUSE_DB` to the new version.
- Update the structural variant WGS databases if necessary
- Start prod using `./ops/supervisor-ctl.sh -e master`
CNV exome database
Tools
The tool for calling the CNVs is called `exCopyDepth` and is written in R. It is accompanied by the scripts `symlinkRefBam.R` and `computeBackground.R`, which generate the median read depth per exon database. The tool for annotating the CNVs is called `cnvScan` and is written in Python. There is no specific tool for gathering its output; instead, instructions are given below for creating the CNV database.
Input
The BAM files of the samples are the input for the median read depth per exon database. The `*.cnv.bed` result files produced by cnvScan are used to create the CNV database.
Procedure
Default instructions are given for trios. For other pipelines (exome or target), the reader is invited to adapt the commands (as explained in the remarks).
Setup
Prerequisites:
- vcpipe repository is available at `/cluster/projects/p22/production/sw/vcpipe/vcpipe`
- vcpipe-bin repository is available at `/cluster/projects/p22/production/sw/vcpipe/vcpipe-bin`
- vcpipe-bundle repository is available at `/cluster/projects/p22/production/sw/vcpipe/vcpipe-bundle`
Open a Terminal and set up your environment:
export SETTINGS=/cluster/projects/p22/production/sw/vcpipe/vcpipe/config/settings-tsd.json
source /cluster/projects/p22/production/sw/vcpipe/vcpipe/exe/setup.source
source /cluster/projects/p22/sw/sourceme
Make a new directory and cd into it, for example:
cd /cluster/projects/p22/dev/p22-${USER}
today=$(date +"%Y%m%d")
mkdir ${today}-inDB-CNV && cd ${today}-inDB-CNV
Compute the reference read-depth per exon
Collect the BAM files in a directory
Rscript /cluster/projects/p22/production/sw/vcpipe/vcpipe/src/annotation/cnv/exCopyDepth/symlinkRefBam.R Trio
The defaults are:
- Repeat for Trio2, etc.
- Check that NA or HG samples did not get added twice (in two projects). If so, remove the duplicates.
- For exomes, the first parameter becomes something like `excap` (see the example below).
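For instance, an exome invocation would then look like this, mirroring the trio command above with `excap` as the first parameter:

```bash
# Same script as for trios; only the pipeline-type parameter changes.
Rscript /cluster/projects/p22/production/sw/vcpipe/vcpipe/src/annotation/cnv/exCopyDepth/symlinkRefBam.R excap
```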
Compute the background read-depth per exon
Rscript /cluster/projects/p22/production/sw/vcpipe/vcpipe/src/annotation/cnv/exCopyDepth/computeBackground.R
The defaults are:
--reference refBam \
--probes /cluster/projects/p22/production/sw/vcpipe/vcpipe-bundle/captureKit/illumina_trusightone_v01/illumina_trusightone_v01.probes.bed \
--fasta /cluster/projects/p22/production/sw/vcpipe/vcpipe-bundle/genomic/gatkBundle_2.5/human_g1k_v37_decoy.fasta \
--output reference.rds
- Rename the output file to something like `cnv-background-trio-20161004.rds` (see the sketch below).
- For exome, the probes parameter becomes `--probes /cluster/projects/p22/production/sw/vcpipe/vcpipe-bundle/captureKit/agilent_sureselect_v05/wex_Agilent_SureSelect_v05_b37.baits.bed` and the output file should be renamed to `cnv-background-exome-${today}.rds`
- Follow the procedure below to add the file to version control.
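A minimal sketch of the rename step, assuming the default output name `reference.rds` from the command above:

```bash
# Rename the background file to include the pipeline type and date.
mv reference.rds cnv-background-trio-${today}.rds
```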
Update the CNV calls
- Run `indbCNV_creator.py` (located in the `src/indb` directory of the amg repo), which generates two files: `cnv-calls-excap-{yyyymmdd}.sorted.bed.gz` and `cnv-calls-excap-{yyyymmdd}.sorted.bed.gz.tbi`. A sketch of the invocation follows this list.
- Follow the steps in the next section to add the two result files to version control.
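A sketch of the `indbCNV_creator.py` step, assuming the amg repo is checked out locally and that the script needs no extra arguments (both assumptions; check the script's usage before running):

```bash
cd /path/to/amg                      # hypothetical checkout location of the amg repo
python src/indb/indbCNV_creator.py   # assumption: no additional arguments required
```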
Check-in files in version control
- `/cluster/projects/p22/dev/sensitive-db` is a `git` directory. Move the produced background read-depth and CNV calls files into the exCopyDepth subdirectory:
mv cnv-background-trio-${today}.rds /cluster/projects/p22/dev/sensitive-db/exCopyDepth/
mv cnv-calls-trio-${today}.sorted.bed.gz* /cluster/projects/p22/dev/sensitive-db/exCopyDepth/
- Update the `/cluster/projects/p22/dev/sensitive-db/sensitive-db.json` file with the new values (relative paths) for the background and database keys. For reference, that JSON file looks like this:
{
"cnv": {
"exCopyDepth": {
"trio": {
"background": "exCopyDepth/cnv-background-trio-20161004.rds",
"database": "exCopyDepth/cnv-calls-trio-20161021.sorted.bed.gz"
},
"excap": {
"background": "exCopyDepth/cnv-background-excap-20170209.rds",
"database": "exCopyDepth/cnv-calls-excap-20170209.sorted.bed.gz"
}
}
}
}
- Note: the index file does not get its own entry.
- Update the release notes `/cluster/projects/p22/dev/sensitive-db/release-notes.md`. The release notes table looks like this:

| Tag | Date | File | Who | Notes |
|---|---|---|---|---|
| v1.0.0-rel | 20161021 | cnv-background-trio-20161021.rds | p22-huguesfo | (pilot) Trio1-5 P,M,F |
| | | cnv-calls-trio-20161021.sorted.bed.gz.tbi | p22-huguesfo | (pilot) Trio1-5 P only |
| | | cnv-calls-trio-20161021.sorted.bed.gz | p22-huguesfo | (pilot) Trio1-5 P only |
| v1.0.1-rel | 20161111 | cnv-background-trio-20161111.rds | p22-huguesfo | Trio1-5 P,M,F |
| | | cnv-calls-trio-20161111.sorted.bed.gz.tbi | p22-huguesfo | Trio1-5 P,M,F |
| | | cnv-calls-trio-20161111.sorted.bed.gz | p22-huguesfo | Trio1-5 P,M,F |
- Add the background file, the database and its index, and the JSON file to git, and commit:
cd /cluster/projects/p22/dev/sensitive-db
git add exCopyDepth/cnv-background-trio-${today}.rds
git add exCopyDepth/cnv-calls-trio-${today}.sorted.bed.gz
git add exCopyDepth/cnv-calls-trio-${today}.sorted.bed.gz.tbi
git add sensitive-db.json
git commit -m "CNV trio release ${today}"
- Continue with the standard sensitive-db git archive release
Update
The database is updated only when there is a specific reason to do so, e.g. when the capture protocol changes.
Remarks
For the data in `sensitive-db-v1.0.2`, the scripts in `vcpipe-v1.4.4` were used: all Proband, Mother and Father samples were included in the reference read-depth of trios, and chrX and chrY were excluded for both trios and exomes. From `vcpipe-v1.5` onwards, however, only Mother and Father are included in the reference read-depth of trios, and chrX and chrY are always included.
Structural variant WGS database
When to create new databases
The databases need to be updated whenever there are significant changes to the database creation tools or to the Dragen SV and CNV callers. The frequency databases are meant for filtering out common variants. Common variants can be true positives or false positives, but in either case they are considered likely non-pathogenic. If the database is not updated after a Dragen update, it may lose part of its ability to filter out false positives.
A change to the Dragen SV and CNV callers can be regarded as significant if the frequency filtering capabilities change significantly.
It is important to note that the frequency filtering capability of a database depends on the number of samples in it, and that this number is in many cases more important than which Dragen version the samples were run on. It is therefore often better to have a large database based on a previous Dragen version than a small database based on the latest version.
Location of databases
After deployment the databases are stored in a sub-directory of `sensitive-db` on TSD. The current location is `/cluster/projects/p22/production/sw/vcpipe/sensitive-db/wgs-sv`.
Prerequisites
Database generation scripts and a Singularity container with the required tools should be placed on TSD as described in https://gitlab.com/ousamg/apps/make-sv-indb.
If the sample selection criterion for the database changes due to an upgrade of Dragen, a new release of https://gitlab.com/ousamg/apps/make-sv-indb containing the new selection criterion should be made.
Creating a new data set
- Log in to `p22-submit-dev`.
- Go to `/cluster/projects/p22/dev/shared/sv-indb/make-sv-indb`. This is where https://gitlab.com/ousamg/apps/make-sv-indb should have been extracted.
- Check which samples will be considered for inclusion in the database:
- Run:
SETUP_FILE=config/manta_dr.json make transfer-dryrun
- Run:
SETUP_FILE=config/canvas_dr.json make transfer-dryrun
- Make the databases. Database generation is performed on the compute cluster and may take several days. To account for this, it is advised to manually add today's date to the "version" field in all the JSON files used in the following steps.
SETUP_FILE=config/manta_dr.json make full-process
- Run:
SETUP_FILE=config/canvas_dr.json make full-process
- Postprocess databases to minimize file sizes
- Run:
SETUP_FILE=config/manta_dr.json make postprocess
- Run:
SETUP_FILE=config/canvas_dr.json make postprocess
- Move databases to version control and release a tar archive
- Add the first database and choose not to release when prompted:
SETUP_FILE=config/manta_dr.json make store-database
- Add the second database and choose to release when prompted:
SETUP_FILE=config/canvas_dr.json make store-database
- Deploy by moving the tar archive created in the previous step and extracting it inside the `sensitive-db/` directory, as shown in the standard deployment procedure.
- Make tracks for ELLA. Make sure that the "version" field in the JSON file corresponds to the previous steps.
- Combine the two databases into one database. Run:
SETUP_FILE=config/merged_dr.json make merge
- Make tracks on BED format and BigWig format. Run:
SETUP_FILE=config/merged_dr.json make bigwig
- Notify the ELLA development team that the following new database tracks have been made:
/cluster/projects/p22/dev/shared/sv-indb/svdb/indb_dr_merged*.vcf.gz*
/cluster/projects/p22/dev/shared/sv-indb/svdb/indb_dr_merged*.bigWig*
CoNVaDING background data
CoNVaDING background data are generated and added to `sensitive-db` according to the procedure Update of Convading background data.
- Update the release notes `/cluster/projects/p22/dev/sensitive-db/release-notes.md`
- Deploy as part of `sensitive-db`.
Tumor pipeline background data
Update data
For each sample included in the panel of normals, run the following command:
gatk4 \
Mutect2 \
-R ${bundle.reference.fasta} \
--germline-resource ${bundle.mutect2.germline_resource} \
--genotype-germline-sites true \
-I ${bam_file} \
-tumor ${analysis_name} \
-L ${calling_region} \
-O ${analysis_name}${output_suffix}.raw.vcf.gz
For all samples included in the panel of normals, run the following command:
gatk4 \
CreateSomaticPanelOfNormals \
--vcfs ${analysis_name}${output_suffix}.raw.vcf.gz \
-O panelOfNormals.DATE_OF_GENERATED.vcf.gz
for example, `panelOfNormals.20190308.vcf.gz`.
The panel only needs to be re-generated when sample preparation or sequencing changes. Version control follows the common rules for sensitive databases.
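A minimal bash sketch of the two steps above, assuming a hypothetical `samples.txt` listing one analysis name per line and BAM files named `${analysis_name}.bam` (the file naming and the path variables at the top are placeholder conventions, standing in for the bundle paths used in the commands above):

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the two gatk4 commands above.
set -euo pipefail

bundle_reference_fasta=/path/to/human_g1k_v37_decoy.fasta  # placeholder
germline_resource=/path/to/germline_resource.vcf.gz        # placeholder
calling_region=/path/to/calling_region.bed                 # placeholder

# Step 1: call each normal sample with Mutect2.
while read -r analysis_name; do
  gatk4 Mutect2 \
    -R "${bundle_reference_fasta}" \
    --germline-resource "${germline_resource}" \
    --genotype-germline-sites true \
    -I "${analysis_name}.bam" \
    -tumor "${analysis_name}" \
    -L "${calling_region}" \
    -O "${analysis_name}.raw.vcf.gz"
done < samples.txt

# Step 2: combine the per-sample VCFs into the panel of normals,
# passing one --vcfs argument per sample.
vcf_args=()
while read -r analysis_name; do
  vcf_args+=(--vcfs "${analysis_name}.raw.vcf.gz")
done < samples.txt

gatk4 CreateSomaticPanelOfNormals "${vcf_args[@]}" \
  -O "panelOfNormals.$(date +%Y%m%d).vcf.gz"
```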
Release and deploy
- Move the produced VCF file and index file, e.g. `panelOfNormals.20190308.vcf.gz` and `panelOfNormals.20190308.vcf.gz.tbi`, to `/cluster/projects/p22/dev/sensitive-db/tumor`
- Update the `/cluster/projects/p22/dev/sensitive-db/sensitive-db.json` file with the new version for the `tumor.panel_of_normal` entry.
- Update the release notes `/cluster/projects/p22/dev/sensitive-db/release-notes.md`
- Release and deploy as part of `sensitive-db`
Inhouse frequency databases
On TSD, `ssh` into a VM with access to `/cluster` and follow:
source /cluster/projects/p22/sw/sourceme
export SETTINGS=/cluster/projects/p22/production/sw/vcpipe/vcpipe/config/settings-tsd.json
cd /cluster/projects/p22/production/sw/vcpipe/vcpipe/exe
source setup.source
python indbcreator.py # for inDB
python indbWGS_creator.py # for wgsDB
Export the resulting VCF files.
References
CoNVaDING paper: https://www.ncbi.nlm.nih.gov/pubmed/