CNV filtering
CNVs are provided by the VCF file defined in the environment variable $CNV_VCF. The content of $CNV_VCF is iteratively updated by the operations in target/modules/wgs_cnv_annotation.sh.
Note: In the script, the intermediate processing steps of $CNV_VCF are stored in separate files, but in this documentation we use $CNV_VCF as an alias for each intermediate step as well.
The CNVs in $CNV_VCF are subject to annotation and filtering operations:
- FILTER annotation. In the $CNV_VCF FILTER column, one or more filter tags are given in a ;-separated list. Based on rules defined in config/analysistypeconfig.json, filter tags can be added to the FILTER column.
- Filtering. Variants with certain filter tags in the FILTER list or other properties are removed or included.
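As a rough illustration of the FILTER annotation step, the sketch below appends a tag to a ;-separated FILTER value. The function name `add_filter_tag` is hypothetical, and the handling of 'PASS' and '.' (both treated as "no tags yet") is an assumption based on the VCF convention, not something taken from the pipeline script.

```python
def add_filter_tag(filter_value: str, tag: str) -> str:
    """Return the FILTER value with `tag` appended, keeping existing tags.

    Assumption: "PASS" and "." (missing) carry no filter tags, so the
    first custom tag replaces them instead of being appended.
    """
    if filter_value in ("PASS", ".", ""):
        tags = []
    else:
        tags = filter_value.split(";")
    if tag not in tags:  # avoid adding the same tag twice
        tags.append(tag)
    return ";".join(tags)


print(add_filter_tag("PASS", "HiFreqGnomad"))
print(add_filter_tag("MinSizeIntronVariant", "HiFreqGnomad"))
```

The second call keeps the existing tag and adds the new one after a semicolon, which is the list form described above.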
All custom FILTER annotation tags and filtering criteria are defined in
Requirements on the filter annotation and filtering definitions are defined in the JSON schema
Description of filters
Input filters and the canonical interpretation group
Filters provided by the callers are present in the FILTER column of $CNV_VCF. Variants with such filters are silently removed because we only consider variants in the canonical interpretation group. The canonical interpretation group is defined in
Transcript filter
Variants that are smaller than a certain threshold will only be included if they overlap with a genepanel transcript. Small variants are defined in
Overlap is defined as overlapping by at least one base within a region around the genepanel transcript defined by the interval length
The FILTER value of such small variants outside the vicinity of the genepanel transcript is subject to the filter
This filter is quite ad hoc, and assumes that the canonical interpretation group only allows variants with FILTER='PASS'.
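The overlap rule can be sketched as an interval test with padding. This is only an illustration: the function names, the closed 1-based coordinates, and the example numbers are assumptions, and the actual size threshold and interval length come from the configuration.

```python
def overlaps_padded(var_start, var_end, tx_start, tx_end, padding):
    """True if the variant shares at least one base with the transcript
    interval extended by `padding` bases on both sides.
    Coordinates are assumed to be closed (inclusive) intervals."""
    return var_start <= tx_end + padding and var_end >= tx_start - padding


def keep_small_variant(var_start, var_end, min_length, transcripts, padding):
    """Large variants are always kept; small ones only if they overlap a
    padded genepanel transcript. `transcripts` is a list of (start, end)."""
    length = var_end - var_start + 1
    if length >= min_length:
        return True
    return any(
        overlaps_padded(var_start, var_end, tx_start, tx_end, padding)
        for tx_start, tx_end in transcripts
    )


# Hypothetical transcript at 250-300 with 50 bases of padding on each side.
print(overlaps_padded(100, 200, 250, 300, 50))
print(overlaps_padded(100, 199, 250, 300, 50))
```

In the first call the variant ends exactly on the first padded base (200), so a single shared base is enough to count as overlap; in the second it ends one base short.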
Quality filter annotation
New quality filter tags are added by interpreting the filter definitions in
Filters are defined as JSON strings in the format
{
  "$FILTER_NAME": {
    "common": {
      "$DESCRIPTIVE_NAME_FOR_FILTER": {
        "$VCF_COLUMN_NAME_LOWER_CASE[:INFO_OR_FORMAT_FIELD_AS_IS]": {
          "$IF_NUMERICAL_ONE_OF_KEYWORDS(lt|le|gt|ge|eq|ne)": "$POSITIVE_NUMBER",
          "$IF_EXACT_MATCH_USE(in)": [
            "$LIST_OF_EXACT_STRING_MATCHES"
          ],
          "$IF_SEMICOLON_SEPARATED_KEYWORD_USE(contains)": [
            "$LIST_OF_STRINGS_CONTAINED_IN_FORMAT"
          ],
          "$IF_COLON_SEPARATED_USE(search)": "$REGEX_TO_MATCH_ALL_TRANSCRIPTS"
        }
      }
    },
    "exception": {
      "... USE SAME FORMAT TO DEFINE EXCEPTIONS TO IGNORE THE COMMON FILTERS ..."
    }
  }
}
Note that one can use the search or contains keywords for any string, not only those containing , or ;.
Note that only non-negative numbers are understood by the numerical filter. Internally in the filter parser we use absolute values.
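To make the keyword semantics concrete, here is a minimal sketch of how a single condition could be evaluated. It is not the pipeline's parser: `condition_matches` is a hypothetical name, and two points are assumptions, namely that all keywords in one condition must hold together and that `contains` is satisfied when any listed string occurs in the value.

```python
import operator
import re

NUMERIC_OPS = {
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
    "eq": operator.eq, "ne": operator.ne,
}


def condition_matches(value, condition):
    """Evaluate a condition such as {"lt": 300} against a field value."""
    for keyword, operand in condition.items():
        if keyword in NUMERIC_OPS:
            # Absolute value is used, so a deletion with SVLEN=-250
            # is compared as 250.
            if not NUMERIC_OPS[keyword](abs(float(value)), float(operand)):
                return False
        elif keyword == "in":
            if str(value) not in operand:  # exact string match
                return False
        elif keyword == "contains":
            if not any(item in str(value) for item in operand):
                return False
        elif keyword == "search":
            if re.search(operand, str(value)) is None:
                return False
        else:
            raise ValueError(f"Unknown keyword: {keyword}")
    return True


print(condition_matches("-250", {"lt": 300}))
print(condition_matches("4", {"in": ["4", "5"]}))
```

Taking the absolute value before comparing is why thresholds are written as non-negative numbers: a negative SVLEN is compared by its magnitude.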
There is currently one quality FILTER name that will be used for removing variants
There is currently one quality FILTER name that will be used for rescuing variants from other filters
In detail these filters are:
{
  "MinSizeIntronVariant": {
    "common": {
      "small_intron_variants": {
        "info:SVLEN": {
          "lt": 300
        },
        "info:CSQ": {
          "search": "\\|intron_variant\\||\\|intron_variant&non_coding_transcript_variant\\|"
        }
      }
    }
  },
  "HighACMGClass": {
    "common": {
      "rescue_hi_ACMG_class": {
        "info:ACMG_class": {
          "in": ["4", "5"]
        }
      }
    }
  }
}
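The MinSizeIntronVariant definition combines a size threshold with a regular expression on the CSQ annotation. The sketch below shows what that pattern does and does not match. The CSQ strings are made-up, shortened examples, and `is_small_intron_variant` is a hypothetical helper that assumes both conditions must hold.

```python
import re

# The pattern from the definition above, after JSON unescaping ("\\|" -> "\|").
INTRON_PATTERN = r"\|intron_variant\||\|intron_variant&non_coding_transcript_variant\|"


def is_small_intron_variant(svlen, csq, max_length=300):
    """True if |SVLEN| is below the threshold and the consequence field
    is exactly one of the two intron-only consequences."""
    return abs(svlen) < max_length and re.search(INTRON_PATTERN, csq) is not None


# Hypothetical, shortened CSQ values.
intron_only = "A|intron_variant|MODIFIER|GENE1"
splice_and_intron = "A|splice_region_variant&intron_variant|LOW|GENE1"

print(is_small_intron_variant(-120, intron_only))
print(is_small_intron_variant(-120, splice_and_intron))
print(is_small_intron_variant(-5000, intron_only))
```

Because the pattern requires a `|` on both sides of the consequence, a combined consequence such as `splice_region_variant&intron_variant` is not tagged, only purely intronic annotations are.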
Frequency filter annotation, and quality and frequency filtering
Frequency FILTER annotation, quality filtering and frequency filtering are performed in one operation, so the conceptual difference between the three operations is blurred.
Frequency FILTER annotation is performed using the filter definitions in
The filter definitions are similar to the quality filters, except that the FILTER names are hardcoded and not part of the filter definitions. The FILTER frequency names are
Here 'HiFreqInHdb' annotates high frequency for any of the inhouse databases, 'HiFreqSwegen' for any of the Swegen databases and 'HiFreqGnomad' for the Gnomad database.
Frequency filter definitions are written in the form
{
  "(common|manta|canvas)": {
    "(gnomad|swegen|indb)": {
      "$VCF_COLUMN_NAME_LOWER_CASE[:INFO_OR_FORMAT_FIELD_AS_IS]": {
        "... SAME FILTER DEFINITION AS FOR QUALITY FILTERS ..."
      }
    }
  }
}
In our case, this gives the frequency annotation
{
  "common": {
    "gnomad": {
      "info:FRQ_GNOMAD": {
        "gt": 0.01
      },
      "info:OCC_GNOMAD": {
        "gt": 50
      }
    }
  },
  "manta": {
    "swegen": {
      "info:OCC_SWEGEN_MANTA": {
        "gt": 10
      },
      "info:FRQ_SWEGEN_MANTA": {
        "gt": 0.01
      }
    },
    "indb": {
      "info:OCC_INDB_MANTA": {
        "gt": 10
      },
      "info:FRQ_INDB_MANTA": {
        "gt": 0.01
      }
    }
  },
  "canvas": {
    "swegen": {
      "info:OCC_SWEGEN_CNVNATOR": {
        "gt": 10
      },
      "info:FRQ_SWEGEN_CNVNATOR": {
        "gt": 0.01
      }
    },
    "indb": {
      "info:OCC_INDB_CANVAS": {
        "gt": 10
      },
      "info:FRQ_INDB_CANVAS": {
        "gt": 0.01
      }
    }
  }
}
Note: Filters are defined so that both the frequency and the number of occurrences of the variant need to be above the defined thresholds for the filter annotation to apply.
Note: 'common' filters are applied to all variants, 'manta' is valid for all variants that were called by Manta, and 'canvas' is only valid if the variant was only called by Canvas.
Note: In ${ANALYSIS_TYPE_CONFIG} we also have definitions for the callers Delly, Tiddit and CNVnator, which are not currently in use.
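The notes above (both thresholds must be exceeded, and the caller decides which groups apply) can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, only the `gt` keyword is handled, a missing INFO field is treated as "threshold not exceeded", and the configuration is an excerpt of the one shown above.

```python
FREQ_CONFIG = {  # excerpt of the frequency annotation shown above
    "common": {
        "gnomad": {"info:FRQ_GNOMAD": {"gt": 0.01}, "info:OCC_GNOMAD": {"gt": 50}},
    },
    "manta": {
        "swegen": {"info:OCC_SWEGEN_MANTA": {"gt": 10},
                   "info:FRQ_SWEGEN_MANTA": {"gt": 0.01}},
    },
}
TAGS = {"gnomad": "HiFreqGnomad", "swegen": "HiFreqSwegen", "indb": "HiFreqInHdb"}


def applicable_groups(callers):
    """'common' always applies, 'manta' if Manta called the variant,
    'canvas' only if Canvas was the sole caller."""
    groups = ["common"]
    if "manta" in callers:
        groups.append("manta")
    if callers == {"canvas"}:
        groups.append("canvas")
    return groups


def exceeds_all(info, conditions):
    """True only if every listed INFO field is above its threshold."""
    for key, condition in conditions.items():
        field = key.split(":", 1)[1]  # "info:FRQ_GNOMAD" -> "FRQ_GNOMAD"
        if field not in info or not info[field] > condition["gt"]:
            return False
    return True


def frequency_tags(info, callers, config=FREQ_CONFIG):
    tags = []
    for group in applicable_groups(callers):
        for database, conditions in config.get(group, {}).items():
            if exceeds_all(info, conditions):
                tags.append(TAGS[database])
    return tags


print(frequency_tags({"FRQ_GNOMAD": 0.05, "OCC_GNOMAD": 60}, {"manta"}))
print(frequency_tags({"FRQ_GNOMAD": 0.05, "OCC_GNOMAD": 40}, {"manta"}))
```

The second call returns no tag because the frequency is above 0.01 but the occurrence count is not above 50.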
Quality filtering is based on the definitions in
In our case this is
Note: The 'search' regular expression means that variants outside of transcripts are removed, and variants with 'PASS' or one of the quality inclusion filters are kept.
Note: The interpretation group is selected before frequency annotation. Therefore the frequency FILTER names are not in the 'search' string.
Frequency filtering is performed for all variants with a high frequency FILTER tag. Exceptions to this rule are defined in
In our case this is
{
  "rescue_homozygote_DEL_on_X": {
    "info:SVTYPE": {
      "in": ["DEL"]
    },
    "chrom": {
      "in": ["X"]
    },
    "format:GT": {
      "in": ["1", "1/1", "1|1"]
    }
  },
  "rescue_ACMG_class": {
    "filter": {
      "contains": ["HighACMGClass"]
    }
  }
}
For the resulting variants, FILTER is either 'PASS', or FILTER contains "HighACMGClass", or FILTER contains a high frequency tag for homozygote DEL on X.
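The frequency filtering step with its two exceptions can be sketched like this. The dictionary layout of a variant and the function names are hypothetical. Treating the conditions inside one exception as "all must hold" and the exceptions themselves as alternatives is an assumption, chosen because it reproduces the outcome described in the previous sentence.

```python
HIGH_FREQ_TAGS = {"HiFreqInHdb", "HiFreqSwegen", "HiFreqGnomad"}


def is_rescued(variant):
    """Apply the two exceptions from the definition above."""
    homozygote_del_on_x = (
        variant["svtype"] == "DEL"
        and variant["chrom"] == "X"
        and variant["gt"] in ("1", "1/1", "1|1")
    )
    high_acmg_class = "HighACMGClass" in variant["filter"].split(";")
    return homozygote_del_on_x or high_acmg_class


def keep_after_frequency_filtering(variant):
    """Remove variants with a high frequency tag unless an exception applies."""
    tags = set(variant["filter"].split(";"))
    if not tags & HIGH_FREQ_TAGS:
        return True  # no frequency tag, nothing to filter on
    return is_rescued(variant)


# Hypothetical variants.
del_on_x = {"filter": "HiFreqGnomad", "svtype": "DEL", "chrom": "X", "gt": "1/1"}
common_dup = {"filter": "HiFreqGnomad", "svtype": "DUP", "chrom": "X", "gt": "1/1"}

print(keep_after_frequency_filtering(del_on_x))
print(keep_after_frequency_filtering(common_dup))
```

The homozygote deletion on X is kept despite its frequency tag, while the duplication with the same tag is removed.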