Skip to content

HTS Bioinf - Basepipe pipeline

Summary

This document describes the data processing after the sequencing data have been imported into the pipeline system.

The basepipe pipeline is a pipeline implemented in Nextflow.io for processing high throughput sequencing data. In short, it takes in paired-end reads, creates quality-annotated variant files and performs automatic quality control. The pipeline consists of different stages: mapping, SNP and small indel calling, CNV calling, quality control and SNP fingerprint check. For control samples, it also compares the SNP and small indel results to a high confidence variant set from the Genome In A Bottle (GIAB) consortium to output sensitivity and positive predictive value (PPV) for trend analysis. The pipeline adheres as much as possible to the best practices recommended by the chosen software providers, as given at the time of implementation.

Responsibility

Execution of the pipeline is performed by a bioinformatician from "Enhet for Genomdiagnostikk" (GDx). The bioinformatician must have been trained for production according to procedure HTS Bioinf - Training for running pipeline.


Table: Sample types and typical runtime

Sample type Capture kit size Typical runtime (without waiting time)
Cancer samples Genes and regions from 21 genes 1 hour
Tumor samples Genes and regions from 21 genes, for detection of somatic variants 1 hour
Cardio samples 174 genes 1 hour
Exome samples Whole exome 8-12 hours
Genome samples analyzed with GATK Whole genome 50 hours
Genome samples analyzed with Dragen Whole genome 1 hour

Input

Paired-end sequencing data in FASTQ format, the sample metadata file in JSON format and the analysis configuration file in JSON format as described in HTS Bioinf – Importing samples in TSD.

Output

The important output files of the pipeline are listed below:

  • SNP and small indel calling results in VCF format after quality annotation for the whole capture kit region and chrM regions
  • .bam and .bai files, containing the mapped data, used for variant calling and visualization
  • CNV calling results in .vcf format or .bed format
  • Quality control metrics file
  • SNP fingerprinting report.

Usage and requirements

The pipeline is implemented in the vcpipe repository. It is primarily meant to be run as part of the automation system included. For more information, see scripts in the vcpipe repository. For full pipeline functionality vcpipe-essentials, vcpipe-refdata and sensitive-db are are also required.

Pipeline stages

The following table summarizes the stages performed for different types of samples:

Cancer samples Tumor samples Cardio samples Exome samples Genome samples
basepipe_mapping yes yes yes yes yes
basepipe_variantcalling_parallel yes - yes yes yes
basepipe_mutect2 - yes - - -
basepipe_variantcalling_join yes - yes yes yes
basepipe_dragen - - - - yes
basepipe_read_depth_stats yes yes yes yes yes
basepipe_read_depth_stats_bigwig yes yes yes yes yes
basepipe_as_metrics yes yes yes yes yes
basepipe_is_metrics yes yes yes yes yes
basepipe_hs_metrics yes yes yes yes -
basepipe_qc_prepare_vcf yes yes yes yes yes
basepipe_qc yes yes yes yes yes
basepipe_qc_report yes yes yes yes yes
basepipe_mt_variantcalling - - - - yes
basepipe_tiddit - - - - yes
basepipe_svaba - - - - yes
basepipe_manta - - - - yes
basepipe_cnvnator - - - - yes
basepipe_delly - - - - yes
basepipe_svdb_merge - - - - yes
basepipe_svdb_dragen_merge - - - - yes
basepipe_cnv - - - yes -
basepipe_convading yes yes - - -
basepipe_convading_to_vcf yes yes - - -
basepipe_fingerprinting_vcf yes - - yes yes
basepipe_fingerprinting yes - - yes yes
basepipe_happy_target_regions yes yes yes yes yes
basepipe_happy yes yes yes yes yes
basepipe_multiqc yes yes yes yes yes

Mapping stage

The mapping stage performs normal mapping and refinement processes following GATK best practices.

Input: Paired-end sequencing data in FASTQ format

Output: .bam and .bai files, containing the mapped data after refinement, used for variant calling

Processes included in the Nextflow script: basepipe_mapping

It has the following steps:

  • Mapping using bwa-mem with samtools for sorting and indexing (Mapping)
  • Marking duplicates using the Picard tool MarkDuplicates
  • Base quality score recalibration (BQSR) with BaseRecalibrator and PrintReads using gatk. Note: in case a sample has too few reads (which do not provide enough data) or too many reads (which take too long to process), this step will be dropped.

Mapping and MarkDuplicates apply to all samples, BQSR only apply to exome samples.

SNP and small indel calling stage

The SNP and small indel calling stage performs SNP and small indel calling and quality filtering following standard GATK best practices.

Input: .bam and .bai files, containing the mapped data after refinement

Output: Quality annotated VCF files for the whole capture kit region

Processes included in the Nextflow script: basepipe_variantcalling_parallel, basepipe_variantcalling_join and basepipe_mutect

It has the following steps:

  • Variant calling by tools in GATK
  • Variant quality filtration by tools in GATK

The following table summarizes the tools used for different types of samples in different steps:

Variant caller Variant filtration for SNPs Variant filtration for Indels
Cancer samples HaplotypeCaller VariantFiltration VariantFiltration
Tumor samples (GATK version 4+) Mutect2 FilterMutectCalls GetPileupSummaries, CalculateContamination, CollectSequencingArtifactMetrics, FilterByOrientationBias and FilterMutectCalls
Cardio samples HaplotypeCaller VariantFiltration VariantFiltration
Exome samples HaplotypeCaller VariantFiltration VariantFiltration
Genome samples HaplotypeCaller VariantRecalibrator and ApplyRecalibration VariantRecalibrator and ApplyRecalibration

Variant filtration is described in detail later in this document.

Dragen pipeline stage

The Dragen pipeline is a pipeline implemented for fast processing of genome sequencing data by using specialized software and hardware offered by Illumina DRAGEN Bio-IT Platform. The pipeline takes the sequencing data as the input. The steps include mapping, SNP and small indel calling, CNV calling, structural variant (SV) calling and repeat expansion calling on the whole genome. These are implemented in one process in a Nextflow script. The commands and options are based on DRAGEN recommendations. This means that the default setting are always used unless explicitly specified. You will need to refer to the respective DRAGEN version documentations if you need to find out the values for specific default settings.

Input: Paired-end sequencing data in FASTQ format

Output: .bam and .bai files, quality annotated VCF files, quality control reports and warnings, CNVs (.cnv.vcf), SVs (.sv.vcf), merged output of CNV and SV callers (merged_) and repeat expansion (.repeats.vcf) calls in VCF format

Processes included in the Nextflow script: basepipe_dragen, basepipe_svdb_dragen_merge

Genome reference: It is recommended to always regenerate the Dragen reference genome hash table whenever Dragen software is updated. Login to p22-hpc-01, p22-hpc-02 or p22-hpc-03 and run

sbatch /ess/p22/data/durable/production/sw/automation/tsd-import/script/tsd/build-dragen-refgenome.sh

Check the log file build-dragen-refgenome.log created in the working directory and make sure the job finished without errors.

SNP and small indel calling on mitochondrial genome stage

The SNP and small indel calling on mitochondrial genome stage performs SNP and small indel calling and quality filtering on the mitochondrial genome following GATK best Practices for Mitochondrial Analysis.

Input: .bam and .bai files, containing the mapped data for the whole genome region after refinement

Output: Quality annotated VCF files for the mitochondrial genome, haplogroup prediction and mitochondrial coverage statistics

Processes included in the Nextflow script: basepipe_mt_variantcalling

Quality control stage

In the quality control stage, the quality control metrics are generated. These will be described in detail later in this document. The multiQC process summarizes quality metrics from the quality control run.

Input: .bam and .bai files, containing the mapped data after refinement, quality annotated VCF files for the whole capture kit region

Output: quality control metrics

Processes included in the Nextflow script: basepipe_read_depth_stats, basepipe_as_metrics (generates alignment summary metrics), basepipe_is_metrics (generates insert size summary metrics), basepipe_hs_metrics (generates coverage summary metrics), basepipe_qc_prepare_vcf, basepipe_qc (generates .qc_result.json), basepipe_qc_report (generates .qc_report.md and .qc_warnings.md) and basepipe_multiqc

CNV detection stage for genome sequencing samples

For genome sequencing samples, when the sample is analyzed with the GATK pipeline, several copy number variant (CNV) callers run in parallel with the SNP and small indel calling. The results from the different callers are then merged by svdb.

Input: .bam and .bai files, containing the mapped data after refinement.

Output: Quality annotated VCF files from individual callers and merged VCF file for the whole genome.

Processes included in the Nextflow script: basepipe_tiddit, basepipe_svaba, basepipe_manta, basepipe_cnvnator, basepipe_delly and basepipe_svdb_merge.

CNV detection stage for exome sequencing samples

For exome sequencing samples, a copy number variant (CNV) detection (exCopyDepth) runs in parallel with the SNP and small indel calling. The background data are provided by sensitive-db.

Input: .bam and .bai files, containing the mapped data after refinement

Output: Quality annotated CNVs written in .bed format files for the whole capture kit region

Processes included in the Nextflow script: basepipe_cnv

CNV detection stage for target sequencing samples

For cancer samples and tumor samples*, a copy number variant (CNV) detection tool (CoNVaDING) runs in parallel with the SNP and small indel calling. The background data are provided by sensitive-db.

* The CNV results from tumor samples are not used by the lab for interpretation.

Input: .bam and .bai files, containing the mapped data after refinement

Output: Quality annotated VCF files for the whole capture kit region

Processes included in the Nextflow script: basepipe_convading and basepipe_convading_to_vcf

SNP fingerprinting check stage

The purpose of the fingerprinting process is to identify potential sample swaps happening as part of the sample preparation in the lab. It is implemented by an in-house script, that checks genotype data called by HaplotypeCaller on 16 selected SNP sites, against the genotypes of the same sites as called by a TaqMan machine.

This stage is only applied to cancer samples, single exome samples and single genome samples. The criteria for failing are described in the procedure: HTS Lab – TaqMan SNP-ID.

If the fingerprinting check fails, the pipeline will abort. Data will still be available for error tracking. The steps that need to be taken when the check fails are described in the procedure: HTS - Mismatch between TaqMan SNP-ID and sequencing data.

Input: .bam and .bai files, containing the mapped data after refinement

Output: Genotypes of the 16 checked sites detected by HaplotypeCaller and comparison of HaplotypeCaller and TaqMan results

Processes included in the Nextflow script: basepipe_fingerprinting_vcf and basepipe_fingerprinting

hap.py stage

In the hap.py stage, the predicted SNP and small indel results will be compared with the high confidence SNP and small indel calls provided by GIAB. This stage is only applied on control samples (NA12878, HG002, HG003 and HG004). The results will be used exclusively for samples that need trend analysis (see procedure: HTS - Use of reference materials for internal quality control).

Input: Quality annotated VCF files for the whole capture kit region

Output: Sensitivity and PPV in the defined region

Processes included in the Nextflow script: basepipe_happy_target_regions and basepipe_happy

Implementation specifics

This section is dedicated to the most important implementation specifics.

SNP and small indel filtration

Variant filtration steps follow GATK best practices. When there are too few variants and therefore not enough data to perform soft-filtering, hard-filtering is applied.

For tumor samples, which are different from other samples, SNPs and small indels are called with mutect2 in addition. Variant filtration for these samples also differs from that for other samples, as described in detail in the "Somatic variant filtration" section below.

The following table summarizes which filtering strategy is applied on SNPs/indels for different types of samples:

SNP filtering Indel filtering
Cancer samples hard hard
Cardio samples hard hard
Exome samples hard hard
Genome samples soft soft

Parameters

QD: QualByDepth, Variant call confidence normalized by depth of sample reads supporting a variant.

FS: Strand bias estimated using Fisher's Exact Test.

MQ: Root Mean Square of the mapping quality of reads across all samples. Provides an estimation of the overall mapping quality of reads supporting a variant call.

MQRankSum: Rank Sum Test for mapping qualities of REF versus ALT reads.

ReadPosRankSum: Rank Sum Test for relative positioning of REF versus ALT alleles within reads.

SOR: Strand bias estimated by the Symmetric Odds Ratio test.

DP: Depth of coverage.

Parameters and thresholds in soft-filtering and hard-filtering

SNP soft-filtering Indel soft-filtering SNP hard-filtering Indel hard-filtering
QD yes yes < 2.0 < 2.0
FS yes yes > 60.0 > 200.0
MQ - - < 40.0 -
MQRankSum yes yes < -12.5 -
ReadPosRankSum yes yes < -8.0 < -20.0
SQR yes yes - -
DP yes (genome samples only) yes - -

Somatic variant filtration

Variant quality annotation using the following tools from GATK version 4+:

  • GetPileupSummaries and CalculateContamination: Estimate cross-sample contamination, the output is required by FilterMutectCall.
  • FilterMutectCalls: Determines whether a call is a confident somatic call
  • CollectSequencingArtifactMetrics and FilterByOrientationBias: Filtering the variant call set based on sequence context artifacts, e.g. OxoG and FFPE

Quality control

Quality control checks and thresholds

For samples imported to ELLA, the values for quality control parameters are shown on the INFO page in ELLA under the header "Pipeline QC". For quick reference, the following table lists quality checks, thresholds and the corresponding names in ELLA for different samples*:

Parameter ELLA Cancer samples Cardio samples Tumor samples*** Exome samples Genome samples
PCT_TARGET_BASES_10X "Coverage > 10x" (0.97, 1.01) (0.97, 1.01) (0.97, 1.01) (0.97, 1.01) -
PCT_10X "Coverage > 10x" - - - - (0.89, 1.01)
PCT_TARGET_BASES_20X "Coverage > 20x" (0.8, 1.01) (0.8, 1.01) (0.8, 1.01) (0.80, 1.01) -
PCT_20X "Coverage > 10x" - - - - [0.83, 1.01]
MEAN_TARGET_COVERAGE "Mean coverage" (75, 1000000000) (75, 1000000000) (350, 1000000000) (75, 1000000000) -
MEDIAN COVERAGE "Median coverage" - - - - (29, 1000000000)
Q30** "Base quality > 30" 75 80 75 80 80
contamination "Number of skewed AD variants" (-1,6) (-1,6) - - -
skewedRatio "Variants skewed allele depth" - - - (0, 0.05) (0, 0.05)
TiTvRatio "Transitions/transversions" - - - (2, 3) (1.5, 2.5)

*: '()' are open intervals, i.e. down/up to but not including extrema, e.g. (2,3) means >2 and <3.

**: The thresholds shown here are percentages, i.e. 80 means 80%. The sample meets the corresponding quality criterion when its value is larger than the given threshold, e.g. a threshold of 80 means the sample's value needs to be >80 to pass.

***: The Department of Pathology will make sure the sample contains at least 30% of tumor tissues.

Quality control implementation

Quality control is implemented using an in-house script that collects metrics on the .bam file with third party tools as well as different metrics on the VCF with in-house modules.

The following metrics are collected for the .bam file but not checked against a threshold:

  • Various data from CollectAlignmentSummaryMetrics using Picard tools
  • Various data from CollectInsertSizeMetrics using Picard tools
  • Various data from mapping_metrics.csv using Dragen
  • Various data from insert-stats.tab using Dragen

The following data are collected for the .bam file and are checked against a threshold:

  • PCT_TARGET_BASES_10X from CalculateHsMetrics using Picard tools: The fraction of all target bases achieving 10X or greater coverage.
  • PCT_10X from results generated by mosdepth: The fraction of bases that attained at least 10X sequence coverage in post-filtering bases.
  • PCT_TARGET_BASES_20X from CalculateHsMetrics using Picard tools: The fraction of all target bases achieving 20X or greater coverage.
  • PCT_20X from results generated by mosdepth: The fraction of bases that attained at least 20X sequence coverage in post-filtering bases.
  • MEAN_TARGET_COVERAGE from CalculateHsMetrics using Picard tools: The mean coverage of a target region.
  • MEDIAN_COVERAGE from results generated by mosdepth: The median coverage in bases of the genome territory, after all filters are applied.

The following metrics are collected from the VCF file and checked against thresholds:

  • skewedRatio: contamination level of exome samples and genome samples. First, the allelic read depth ratio is calculated for each heterozygous variant. Then, the variants with ratio above 3 (skewed variants) are counted. Finally, skewedRatio is calculated as (num of skewed variants) / (total num of heterozygous variants). A typical value of skewedRatio for an exome sample is around 0.01.
  • tiTvRatio: ratio of transversion vs transitions.
  • contamination: contamination level of cancer samples and cardio samples. First, the allelic read depth ratio is calculated for each variant present in (at least) half of the European population. Then, the heterozygous variants with ratio above 3 or below 0.25 and the homozygous variants with ratio above 0.1 are counted. Their number should be less than the given threshold.

The following metrics are collected from sequencing statistics (Demultiplex_Stats.htm) and checked against thresholds:

  • Q30 : Percentage number of bases having phred score quality >30.

The quality control data and summary are written to a JSON file. The pipeline will proceed regardless of whether quality control succeeds or not, so if running manually, the users must check the logs themselves. The rationale for not halting the pipeline is that in some cases one may still want to use the result files. In day-to-day use, it is included in an automation system that checks the result automatically after running the pipeline.

CNV and SV filtration and quality control

The CNV and SV variants are considered to have sufficient quality when they are given a PASS filter by the callers. In addition, the quality of variants is verified by visual inspection in the ELLA variant interpretation software.