HTS Bioinf - Basepipe pipeline
Summary
This document describes the data processing after the sequencing data have been imported into the pipeline system.
The basepipe pipeline is a pipeline implemented in Nextflow.io for processing high throughput sequencing data. In short, it takes in paired-end reads, creates quality-annotated variant files and performs automatic quality control. The pipeline consists of different stages: mapping, SNP and small indel calling, CNV calling, quality control and SNP fingerprint check. For control samples, it also compares the SNP and small indel results to a high confidence variant set from the Genome In A Bottle (GIAB) consortium to output sensitivity and positive predictive value (PPV) for trend analysis. The pipeline adheres as much as possible to the best practices recommended by the chosen software providers, as given at the time of implementation.
Responsibility
Execution of the pipeline is performed by a bioinformatician from "Enhet for Genomdiagnostikk" (GDx). The bioinformatician must have been trained for production according to procedure HTS Bioinf - Training for running pipeline.
Table: Sample types and typical runtime
Sample type | Capture kit size | Typical runtime (without waiting time) |
---|---|---|
Cancer samples | Genes and regions from 21 genes | 1 hour |
Tumor samples | Genes and regions from 21 genes, for detection of somatic variants | 1 hour |
Cardio samples | 174 genes | 1 hour |
Exome samples | Whole exome | 8-12 hours |
Genome samples analyzed with GATK | Whole genome | 50 hours |
Genome samples analyzed with Dragen | Whole genome | 1 hour |
Input
Paired-end sequencing data in FASTQ format, the sample metadata file in JSON format and the analysis configuration file in JSON format as described in HTS Bioinf – Importing samples in TSD.
Output
The important output files of the pipeline are listed below:
- SNP and small indel calling results in VCF format after quality annotation for the whole capture kit region and chrM regions
.bam
and.bai
files, containing the mapped data, used for variant calling and visualization- CNV calling results in
.vcf
format or.bed
format - Quality control metrics file
- SNP fingerprinting report.
Usage and requirements
The pipeline is implemented in the vcpipe
repository. It is primarily meant to be run as part of the automation system included. For more information, see scripts in the vcpipe
repository. For full pipeline functionality vcpipe-essentials
, vcpipe-refdata
and sensitive-db
are are also required.
Pipeline stages
The following table summarizes the stages performed for different types of samples:
Cancer samples | Tumor samples | Cardio samples | Exome samples | Genome samples | |
---|---|---|---|---|---|
basepipe_mapping | yes | yes | yes | yes | yes |
basepipe_variantcalling_parallel | yes | - | yes | yes | yes |
basepipe_mutect2 | - | yes | - | - | - |
basepipe_variantcalling_join | yes | - | yes | yes | yes |
basepipe_dragen | - | - | - | - | yes |
basepipe_read_depth_stats | yes | yes | yes | yes | yes |
basepipe_read_depth_stats_bigwig | yes | yes | yes | yes | yes |
basepipe_as_metrics | yes | yes | yes | yes | yes |
basepipe_is_metrics | yes | yes | yes | yes | yes |
basepipe_hs_metrics | yes | yes | yes | yes | - |
basepipe_qc_prepare_vcf | yes | yes | yes | yes | yes |
basepipe_qc | yes | yes | yes | yes | yes |
basepipe_qc_report | yes | yes | yes | yes | yes |
basepipe_mt_variantcalling | - | - | - | - | yes |
basepipe_tiddit | - | - | - | - | yes |
basepipe_svaba | - | - | - | - | yes |
basepipe_manta | - | - | - | - | yes |
basepipe_cnvnator | - | - | - | - | yes |
basepipe_delly | - | - | - | - | yes |
basepipe_svdb_merge | - | - | - | - | yes |
basepipe_svdb_dragen_merge | - | - | - | - | yes |
basepipe_cnv | - | - | - | yes | - |
basepipe_convading | yes | yes | - | - | - |
basepipe_convading_to_vcf | yes | yes | - | - | - |
basepipe_fingerprinting_vcf | yes | - | - | yes | yes |
basepipe_fingerprinting | yes | - | - | yes | yes |
basepipe_happy_target_regions | yes | yes | yes | yes | yes |
basepipe_happy | yes | yes | yes | yes | yes |
basepipe_multiqc | yes | yes | yes | yes | yes |
Mapping stage
The mapping stage performs normal mapping and refinement processes following GATK best practices.
Input: Paired-end sequencing data in FASTQ format
Output: .bam
and .bai
files, containing the mapped data after refinement, used for variant calling
Processes included in the Nextflow script: basepipe_mapping
It has the following steps:
- Mapping using
bwa-mem
withsamtools
for sorting and indexing (Mapping
) - Marking duplicates using the Picard tool
MarkDuplicates
- Base quality score recalibration (
BQSR
) withBaseRecalibrator
andPrintReads
usinggatk
. Note: in case a sample has too few reads (which do not provide enough data) or too many reads (which take too long to process), this step will be dropped.
Mapping
and MarkDuplicates
apply to all samples, BQSR
only apply to exome samples.
SNP and small indel calling stage
The SNP and small indel calling stage performs SNP and small indel calling and quality filtering following standard GATK best practices.
Input: .bam
and .bai
files, containing the mapped data after refinement
Output: Quality annotated VCF files for the whole capture kit region
Processes included in the Nextflow script: basepipe_variantcalling_parallel
, basepipe_variantcalling_join
and basepipe_mutect
It has the following steps:
- Variant calling by tools in GATK
- Variant quality filtration by tools in GATK
The following table summarizes the tools used for different types of samples in different steps:
Variant caller | Variant filtration for SNPs | Variant filtration for Indels | |
---|---|---|---|
Cancer samples | HaplotypeCaller | VariantFiltration | VariantFiltration |
Tumor samples (GATK version 4+) | Mutect2 | FilterMutectCalls | GetPileupSummaries, CalculateContamination, CollectSequencingArtifactMetrics, FilterByOrientationBias and FilterMutectCalls |
Cardio samples | HaplotypeCaller | VariantFiltration | VariantFiltration |
Exome samples | HaplotypeCaller | VariantFiltration | VariantFiltration |
Genome samples | HaplotypeCaller | VariantRecalibrator and ApplyRecalibration | VariantRecalibrator and ApplyRecalibration |
Variant filtration is described in detail later in this document.
Dragen pipeline stage
The Dragen pipeline is a pipeline implemented for fast processing of genome sequencing data by using specialized software and hardware offered by Illumina DRAGEN Bio-IT Platform. The pipeline takes the sequencing data as the input. The steps include mapping, SNP and small indel calling, CNV calling, structural variant (SV) calling and repeat expansion calling on the whole genome. These are implemented in one process in a Nextflow script. The commands and options are based on DRAGEN recommendations. This means that the default setting are always used unless explicitly specified. You will need to refer to the respective DRAGEN version documentations if you need to find out the values for specific default settings.
Input: Paired-end sequencing data in FASTQ format
Output: .bam
and .bai
files, quality annotated VCF files, quality control reports and warnings, CNVs (.cnv.vcf
), SVs (.sv.vcf
), merged output of CNV and SV callers (merged_
) and repeat expansion (.repeats.vcf
) calls in VCF format
Processes included in the Nextflow script: basepipe_dragen
, basepipe_svdb_dragen_merge
Genome reference: It is recommended to always regenerate the Dragen reference genome hash table whenever Dragen software is updated.
Login to p22-hpc-01
, p22-hpc-02
or p22-hpc-03
and run
sbatch /ess/p22/data/durable/production/sw/automation/tsd-import/script/tsd/build-dragen-refgenome.sh
Check the log file build-dragen-refgenome.log
created in the working directory and make sure the job finished without errors.
SNP and small indel calling on mitochondrial genome stage
The SNP and small indel calling on mitochondrial genome stage performs SNP and small indel calling and quality filtering on the mitochondrial genome following GATK best Practices for Mitochondrial Analysis.
Input: .bam
and .bai
files, containing the mapped data for the whole genome region after refinement
Output: Quality annotated VCF files for the mitochondrial genome, haplogroup prediction and mitochondrial coverage statistics
Processes included in the Nextflow script: basepipe_mt_variantcalling
Quality control stage
In the quality control stage, the quality control metrics are generated. These will be described in detail later in this document.
The multiQC
process summarizes quality metrics from the quality control run.
Input: .bam
and .bai
files, containing the mapped data after refinement, quality annotated VCF files for the whole capture kit region
Output: quality control metrics
Processes included in the Nextflow script: basepipe_read_depth_stats
, basepipe_as_metrics
(generates alignment summary metrics), basepipe_is_metrics
(generates insert size summary metrics), basepipe_hs_metrics
(generates coverage summary metrics), basepipe_qc_prepare_vcf
, basepipe_qc
(generates .qc_result.json
), basepipe_qc_report
(generates .qc_report.md
and .qc_warnings.md
) and basepipe_multiqc
CNV detection stage for genome sequencing samples
For genome sequencing samples, when the sample is analyzed with the GATK pipeline, several copy number variant (CNV) callers run in parallel with the SNP and small indel calling. The results from the different callers are then merged by svdb
.
Input: .bam
and .bai
files, containing the mapped data after refinement.
Output: Quality annotated VCF files from individual callers and merged VCF file for the whole genome.
Processes included in the Nextflow script: basepipe_tiddit
, basepipe_svaba
, basepipe_manta
, basepipe_cnvnator
, basepipe_delly
and basepipe_svdb_merge
.
CNV detection stage for exome sequencing samples
For exome sequencing samples, a copy number variant (CNV) detection (exCopyDepth
) runs in parallel with the SNP and small indel calling. The background data are provided by sensitive-db
.
Input: .bam
and .bai
files, containing the mapped data after refinement
Output: Quality annotated CNVs written in .bed
format files for the whole capture kit region
Processes included in the Nextflow script: basepipe_cnv
CNV detection stage for target sequencing samples
For cancer samples and tumor samples*, a copy number variant (CNV) detection tool (CoNVaDING
) runs in parallel with the SNP and small indel calling. The background data are provided by sensitive-db
.
* The CNV results from tumor samples are not used by the lab for interpretation.
Input: .bam
and .bai
files, containing the mapped data after refinement
Output: Quality annotated VCF files for the whole capture kit region
Processes included in the Nextflow script: basepipe_convading
and basepipe_convading_to_vcf
SNP fingerprinting check stage
The purpose of the fingerprinting process is to identify potential sample swaps happening as part of the sample preparation in the lab. It is implemented by an in-house script, that checks genotype data called by HaplotypeCaller
on 16 selected SNP sites, against the genotypes of the same sites as called by a TaqMan machine.
This stage is only applied to cancer samples, single exome samples and single genome samples. The criteria for failing are described in the procedure: HTS Lab – TaqMan SNP-ID.
If the fingerprinting check fails, the pipeline will abort. Data will still be available for error tracking. The steps that need to be taken when the check fails are described in the procedure: HTS - Mismatch between TaqMan SNP-ID and sequencing data.
Input: .bam
and .bai
files, containing the mapped data after refinement
Output: Genotypes of the 16 checked sites detected by HaplotypeCaller
and comparison of HaplotypeCaller
and TaqMan results
Processes included in the Nextflow script: basepipe_fingerprinting_vcf
and basepipe_fingerprinting
hap.py stage
In the hap.py
stage, the predicted SNP and small indel results will be compared with the high confidence SNP and small indel calls provided by GIAB. This stage is only applied on control samples (NA12878, HG002, HG003 and HG004). The results will be used exclusively for samples that need trend analysis (see procedure: HTS - Use of reference materials for internal quality control).
Input: Quality annotated VCF files for the whole capture kit region
Output: Sensitivity and PPV in the defined region
Processes included in the Nextflow script: basepipe_happy_target_regions
and basepipe_happy
Implementation specifics
This section is dedicated to the most important implementation specifics.
SNP and small indel filtration
Variant filtration steps follow GATK best practices. When there are too few variants and therefore not enough data to perform soft-filtering, hard-filtering is applied.
For tumor samples, which are different from other samples, SNPs and small indels are called with mutect2
in addition. Variant filtration for these samples also differs from that for other samples, as described in detail in the "Somatic variant filtration" section below.
The following table summarizes which filtering strategy is applied on SNPs/indels for different types of samples:
SNP filtering | Indel filtering | |
---|---|---|
Cancer samples | hard | hard |
Cardio samples | hard | hard |
Exome samples | hard | hard |
Genome samples | soft | soft |
Parameters
QD: QualByDepth, Variant call confidence normalized by depth of sample reads supporting a variant.
FS: Strand bias estimated using Fisher's Exact Test.
MQ: Root Mean Square of the mapping quality of reads across all samples. Provides an estimation of the overall mapping quality of reads supporting a variant call.
MQRankSum: Rank Sum Test for mapping qualities of REF versus ALT reads.
ReadPosRankSum: Rank Sum Test for relative positioning of REF versus ALT alleles within reads.
SOR: Strand bias estimated by the Symmetric Odds Ratio test.
DP: Depth of coverage.
Parameters and thresholds in soft-filtering and hard-filtering
SNP soft-filtering | Indel soft-filtering | SNP hard-filtering | Indel hard-filtering | |
---|---|---|---|---|
QD | yes | yes | < 2.0 | < 2.0 |
FS | yes | yes | > 60.0 | > 200.0 |
MQ | - | - | < 40.0 | - |
MQRankSum | yes | yes | < -12.5 | - |
ReadPosRankSum | yes | yes | < -8.0 | < -20.0 |
SQR | yes | yes | - | - |
DP | yes (genome samples only) | yes | - | - |
Somatic variant filtration
Variant quality annotation using the following tools from GATK version 4+:
GetPileupSummaries
andCalculateContamination
: Estimate cross-sample contamination, the output is required byFilterMutectCall
.FilterMutectCalls
: Determines whether a call is a confident somatic callCollectSequencingArtifactMetrics
andFilterByOrientationBias
: Filtering the variant call set based on sequence context artifacts, e.g. OxoG and FFPE
Quality control
Quality control checks and thresholds
For samples imported to ELLA, the values for quality control parameters are shown on the INFO page in ELLA under the header "Pipeline QC". For quick reference, the following table lists quality checks, thresholds and the corresponding names in ELLA for different samples*:
Parameter | ELLA | Cancer samples | Cardio samples | Tumor samples*** | Exome samples | Genome samples |
---|---|---|---|---|---|---|
PCT_TARGET_BASES_10X | "Coverage > 10x" | (0.97, 1.01) | (0.97, 1.01) | (0.97, 1.01) | (0.97, 1.01) | - |
PCT_10X | "Coverage > 10x" | - | - | - | - | (0.89, 1.01) |
PCT_TARGET_BASES_20X | "Coverage > 20x" | (0.8, 1.01) | (0.8, 1.01) | (0.8, 1.01) | (0.80, 1.01) | - |
PCT_20X | "Coverage > 10x" | - | - | - | - | [0.83, 1.01] |
MEAN_TARGET_COVERAGE | "Mean coverage" | (75, 1000000000) | (75, 1000000000) | (350, 1000000000) | (75, 1000000000) | - |
MEDIAN COVERAGE | "Median coverage" | - | - | - | - | (29, 1000000000) |
Q30** | "Base quality > 30" | 75 | 80 | 75 | 80 | 80 |
contamination | "Number of skewed AD variants" | (-1,6) | (-1,6) | - | - | - |
skewedRatio | "Variants skewed allele depth" | - | - | - | (0, 0.05) | (0, 0.05) |
TiTvRatio | "Transitions/transversions" | - | - | - | (2, 3) | (1.5, 2.5) |
*: '()' are open intervals, i.e. down/up to but not including extrema, e.g. (2,3) means >2 and <3.
**: The thresholds shown here are percentages, i.e. 80 means 80%. The sample meets the corresponding quality criterion when its value is larger than the given threshold, e.g. a threshold of 80 means the sample's value needs to be >80 to pass.
***: The Department of Pathology will make sure the sample contains at least 30% of tumor tissues.
Quality control implementation
Quality control is implemented using an in-house script that collects metrics on the .bam
file with third party tools as well as different metrics on the VCF with in-house modules.
The following metrics are collected for the .bam
file but not checked against a threshold:
- Various data from
CollectAlignmentSummaryMetrics
using Picard tools - Various data from
CollectInsertSizeMetrics
using Picard tools - Various data from
mapping_metrics.csv
using Dragen - Various data from
insert-stats.tab
using Dragen
The following data are collected for the .bam
file and are checked against a threshold:
- PCT_TARGET_BASES_10X from
CalculateHsMetrics
using Picard tools: The fraction of all target bases achieving 10X or greater coverage. - PCT_10X from
results
generated bymosdepth
: The fraction of bases that attained at least 10X sequence coverage in post-filtering bases. - PCT_TARGET_BASES_20X from
CalculateHsMetrics
using Picard tools: The fraction of all target bases achieving 20X or greater coverage. - PCT_20X from
results
generated bymosdepth
: The fraction of bases that attained at least 20X sequence coverage in post-filtering bases. - MEAN_TARGET_COVERAGE from
CalculateHsMetrics
using Picard tools: The mean coverage of a target region. - MEDIAN_COVERAGE from
results
generated bymosdepth
: The median coverage in bases of the genome territory, after all filters are applied.
The following metrics are collected from the VCF file and checked against thresholds:
- skewedRatio: contamination level of exome samples and genome samples. First, the allelic read depth ratio is calculated for each heterozygous variant. Then, the variants with ratio above 3 (skewed variants) are counted. Finally, skewedRatio is calculated as (num of skewed variants) / (total num of heterozygous variants). A typical value of skewedRatio for an exome sample is around 0.01.
- tiTvRatio: ratio of transversion vs transitions.
- contamination: contamination level of cancer samples and cardio samples. First, the allelic read depth ratio is calculated for each variant present in (at least) half of the European population. Then, the heterozygous variants with ratio above 3 or below 0.25 and the homozygous variants with ratio above 0.1 are counted. Their number should be less than the given threshold.
The following metrics are collected from sequencing statistics (Demultiplex_Stats.htm
) and checked against thresholds:
- Q30 : Percentage number of bases having phred score quality >30.
The quality control data and summary are written to a JSON file. The pipeline will proceed regardless of whether quality control succeeds or not, so if running manually, the users must check the logs themselves. The rationale for not halting the pipeline is that in some cases one may still want to use the result files. In day-to-day use, it is included in an automation system that checks the result automatically after running the pipeline.
CNV and SV filtration and quality control
The CNV and SV variants are considered to have sufficient quality when they are given a PASS filter by the callers. In addition, the quality of variants is verified by visual inspection in the ELLA variant interpretation software.