HTS Bioinf - ELLA CNV module: CNV calling and filtering
Scope
This document describes CNV calling in the WGS DRAGEN pipeline, covering CNV and SV integration, variant quality criteria, annotation, filtering, and pipeline limitations.
CNV calling by DRAGEN in the whole genome data
CNV calling in DRAGEN uses two callers, Canvas and Manta, which detect structural variants with distinct algorithms (see Table 2). A Canvas call and a Manta call are considered to be the same variant when they meet the following criteria:
- Overlap: At least 70% overlap between the variants.
- Breakpoint proximity: Breakpoints must be less than 2.5 kb apart.
When variants partially overlap but do not meet these criteria, both variants are reported separately in ELLA.
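The merge criteria can be illustrated with a minimal sketch. It is not the pipeline's implementation. It assumes that both criteria must hold, that "overlap" means the shared length divided by the longer of the two calls, and that breakpoint proximity is checked separately for start and end. The `Call` class and function names are hypothetical.

```python
from dataclasses import dataclass

MIN_OVERLAP = 0.70           # "at least 70% overlap"
MAX_BREAKPOINT_DIST = 2500   # "less than 2.5 kb apart", in bp


@dataclass(frozen=True)
class Call:
    """Hypothetical container for one call; coordinates assumed 1-based, inclusive."""
    chrom: str
    start: int
    end: int


def overlap_fraction(a: Call, b: Call) -> float:
    """Shared length divided by the longer call (assumed definition of overlap)."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    longest = max(a.end - a.start + 1, b.end - b.start + 1)
    return shared / longest


def same_variant(canvas: Call, manta: Call) -> bool:
    """True if the two calls would be merged under the assumptions above."""
    breakpoints_close = (
        abs(canvas.start - manta.start) < MAX_BREAKPOINT_DIST
        and abs(canvas.end - manta.end) < MAX_BREAKPOINT_DIST
    )
    return overlap_fraction(canvas, manta) >= MIN_OVERLAP and breakpoints_close
```

Under these assumptions, two large calls can overlap by more than 70% and still be reported separately if one breakpoint differs by 2.5 kb or more.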
Table 1: Representation of variants in ELLA
Variants called by the two callers are represented differently in ELLA, as summarized below:
Caller(s) | Representation in ELLA | Annotation in "CNV VCF" track |
---|---|---|
Manta (SV) | Manta variant | MANTAID |
Canvas (CNV) | Canvas variant | CANVASID, canvasCN |
Manta and Canvas | Manta variant | MANTAID and CANVASID, canvasCN |
Note on copy number
The sample CN field provides copy number information for Canvas calls and merged variants, ensuring consistent representation. Pure Manta calls, however, do not include a CN value.
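Table 1 and the note on copy number can be summarized as a small lookup. This is only a sketch of the rules stated above; the function name and the returned dictionary keys are hypothetical and do not correspond to ELLA's data model.

```python
from typing import Dict, List, Union


def ella_representation(
    called_by_manta: bool, called_by_canvas: bool
) -> Dict[str, Union[str, bool, List[str]]]:
    """Return how a variant is shown in ELLA according to Table 1."""
    if not (called_by_manta or called_by_canvas):
        raise ValueError("a variant must come from at least one caller")

    annotations: List[str] = []
    if called_by_manta:
        annotations.append("MANTAID")
    if called_by_canvas:
        annotations += ["CANVASID", "canvasCN"]

    return {
        # Merged variants are represented as the Manta variant.
        "representation": "Manta variant" if called_by_manta else "Canvas variant",
        "annotations": annotations,
        # Copy number is only available when Canvas contributed to the call.
        "has_copy_number": called_by_canvas,
    }
```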
Table 2: Details of DRAGEN CNV and SV callers
Caller | Scope | Variant type | Data evidence and breakpoints | Weaknesses/Artifacts | Technical indication in CNV VCF |
---|---|---|---|---|---|
Canvas | CNV caller (unbalanced variants) | DEL and DUP*, >5 kb | Read depth changes, improperly paired end reads near CNV boundaries | Small read depth changes can indicate false positives; uncertain breakpoints; large variants may be split into smaller ones; skews in improperly paired reads at borders (i.e. the PE format field in the VCF) may also be indicative of technical variants | Few supporting sources (little or no read depth change) |
Manta | SV caller (balanced and unbalanced variants) | DEL and DUP:TANDEM*, >=50 bp | Split reads, discordant reads (insert size and relative orientation of read pairs), and local re-assembly | False positives for small variants with only split reads and no paired ends; imprecise breakpoints if only paired ends and no split reads; challenges with large balanced false positive variants | Breakpoints in repeats: "Repeat_type_left" and "Repeat_type_right" |
*Difference between DUP and DUP:TANDEM: DUP indicates the source of the duplication in the genome but not its destination, hence no genotype is available. DUP:TANDEM specifies where the duplication occurred (genotype available) but does not provide the copy number, unless it was merged with the Canvas call.
Variants identified by both Canvas and Manta simultaneously show broad technical support and are highly likely to be true positives. The three technical principles behind the calls (read depth change, split reads, and discordant reads) are further detailed in the attached "CNV quality parameters".
CNV breakpoints
Because Manta and Canvas rely on different detection principles, their calls for the same event may differ in size and breakpoint coordinates. When the breakpoints differ too much for the calls to be merged, the same finding is listed twice in the variant list.
It is also essential to be aware that CNV callers may sometimes fragment an actual CNV, especially when the CNV spans genomic regions without a valid reference genome. It is always necessary to review raw data in IGV to define the CNV's breakpoints and location, as described in the attached "CNV quality parameters".
Note that DRAGEN and ELLA use different conventions for reporting the start and end positions of called CNVs:
- DRAGEN reports variants using left normalization, starting at the last base upstream of the variant (the last normal base) and including the last base of the variant at the end.
- ELLA excludes the last base, showing variant length as 1 base shorter. The start base in ELLA is reported the same as in DRAGEN.
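The two conventions can be made concrete with a small worked example. This is one reading of the description above, assuming that the displayed length is computed as end minus start in both systems; the function names are hypothetical.

```python
from typing import Tuple


def dragen_to_ella(pos: int, end: int) -> Tuple[int, int]:
    """Convert a DRAGEN POS/END pair to the start/end shown in ELLA.

    Assumption: ELLA keeps the DRAGEN start (the last normal base)
    and drops the last base of the variant from the end.
    """
    if end <= pos:
        raise ValueError("END must be greater than POS")
    return pos, end - 1


def displayed_length(start: int, end: int) -> int:
    """Length as end minus start (assumed for both systems)."""
    return end - start


# A deletion reported by DRAGEN with POS=1000 and END=2000
# (deleted bases 1001..2000, i.e. 1000 bases).
ella_start, ella_end = dragen_to_ella(1000, 2000)
dragen_len = displayed_length(1000, 2000)
ella_len = displayed_length(ella_start, ella_end)
```

In this example ELLA would show the variant as 1000 to 1999 with a length one base shorter than in DRAGEN.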
Occasionally, Manta reports INDEL variants (deletions with inserted sequences), in which case SVLEN does not match the difference between END and POS.
Reported SV types
The current pipeline only reports deletions and duplications, not other structural variant types (such as insertions).
Filtering of CNVs
The minimum solution currently lacks in-ELLA filtering capabilities, such as trio filtering or combining SNV and CNV findings for recessive genes. Thus, the minimum solution is primarily suited for smaller gene panels where a large number of CNVs is not expected. CNVs are annotated and filtered upstream of ELLA, as shown in Figure 1 below.
Figure 1: Data preprocessing, including filtering, prior to import into ELLA. Yellow rectangles indicate steps that remove variants. Due to technical limitations, VEP only annotates variants in the size category <10 Mb; variants >10 Mb lack a CSQ (consequence) value.
Only CNVs within 1 kb of transcripts in the relevant gene panel are retained. Frequency filtering is conducted using caller-specific databases to account for caller-specific artifacts. These frequency databases do not differentiate between heterozygous and homozygous variants or account for sex for X-chromosome variants. The databases are generated as described in the procedure HTS Bioinf - Update of in-house annotation data.
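The panel proximity rule can be sketched as an interval distance check. This is an illustration only, assuming closed 1-based intervals and that "within 1 kb" includes a gap of exactly 1000 bp. The coordinates in the usage lines are made up, and the names are hypothetical.

```python
from typing import Iterable, Tuple

PADDING = 1000  # bp, "within 1 kb of transcripts"

Interval = Tuple[str, int, int]  # (chromosome, start, end)


def gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Distance in bp between two intervals; 0 if they overlap."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def near_panel(cnv: Interval, transcripts: Iterable[Interval],
               padding: int = PADDING) -> bool:
    """True if the CNV overlaps or lies within `padding` bp of any transcript."""
    chrom, start, end = cnv
    return any(
        t_chrom == chrom and gap(start, end, t_start, t_end) <= padding
        for t_chrom, t_start, t_end in transcripts
    )


# Made-up panel with a single transcript.
panel = [("1", 100_000, 150_000)]
kept = near_panel(("1", 98_000, 99_500), panel)      # 500 bp upstream
dropped = near_panel(("1", 90_000, 98_000), panel)   # 2000 bp upstream
```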
Calling small CNVs is known to be technically challenging (see Table 2), leading to a large number of calls in this size range. Starting with anno-targets version 2.17.0, CNVs <1 kb that do not overlap coding regions are hard-filtered during preprocessing (see Figure 1).
A CNV is considered not to overlap coding regions when its consequence (VEP CSQ field) in all transcripts is one of the following:
- intron_variant
- intron_variant&non_coding_transcript_variant
- intron_variant&downstream_gene_variant
- downstream_gene_variant
- feature_truncation&intron_variant
- a set of (intron_variant, downstream_gene_variant).
It is acknowledged and accepted that some true-positive intronic CNVs may be filtered out using this method.
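The hard filter for small CNVs can be sketched as follows. This is not the anno-targets implementation. It assumes that consequence strings are compared exactly as listed, that the last list item means any mix of the listed consequences across transcripts, and that variants without any annotated consequence are kept. The function name is hypothetical.

```python
from typing import List

SMALL_CNV_MAX = 1000  # bp; the filter applies to CNVs shorter than 1 kb

# Consequences listed above, treated as "not overlapping coding regions".
NON_CODING_CSQ = frozenset({
    "intron_variant",
    "intron_variant&non_coding_transcript_variant",
    "intron_variant&downstream_gene_variant",
    "downstream_gene_variant",
    "feature_truncation&intron_variant",
})


def is_hard_filtered(length: int, consequences: List[str]) -> bool:
    """True if a CNV would be removed by the small-CNV filter.

    `consequences` holds one VEP consequence string per transcript.
    """
    if length >= SMALL_CNV_MAX:
        return False  # only CNVs < 1 kb are affected
    if not consequences:
        return False  # assumption: unannotated variants are not filtered here
    # Removed only if every transcript has a listed consequence.
    return all(csq in NON_CODING_CSQ for csq in consequences)
```

A single transcript with any other consequence, for example a coding one, is enough to retain the variant.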