CNV filtering

CNVs are provided by the VCF file defined in the environment variable $CNV_VCF. The content of $CNV_VCF is iteratively updated by the operations in target/modules/wgs_cnv_annotation.sh.

Note: In the script, the intermediate processing steps applied to $CNV_VCF are stored in separate files, but in this documentation we use $CNV_VCF as an alias for each intermediate step as well.

The CNVs in $CNV_VCF are subject to annotation and filtering operations:

  1. FILTER annotation. In the $CNV_VCF FILTER column, one or more filter tags are given in a ;-separated list. Based on rules defined in config/analysistypeconfig.json, filter tags can be added to the FILTER column.
  2. Filtering. Variants with certain filter tags in the FILTER list or other properties are removed or included.

All custom FILTER annotation tags and filtering criteria are defined in

ANALYSIS_TYPE_CONFIG=config/analysistypeconfig.json

Requirements on the filter annotation and filtering definitions are defined in the JSON schema

ANALYSIS_TYPE_CONFIG_SCHEMA=config/analysistypeconfig.schema.json

Description of filters

Input filters and the canonical interpretation group

Filters provided by the callers are present in the FILTER column of $CNV_VCF. Variants with such filters are silently removed because we only consider variants in the canonical interpretation group. The canonical interpretation group is defined in

jq '.WGS.svparams.interpretation_groups.canonical' "${ANALYSIS_TYPE_CONFIG}"

Transcript filter

Variants that are smaller than a certain threshold will only be included if they overlap with a genepanel transcript. Small variants are defined in

jq '.WGS.svparams.interpretation_groups.small' "${ANALYSIS_TYPE_CONFIG}"

A variant overlaps a genepanel transcript if it shares at least one base with a region around the transcript, where the extent of the region is given by the interval length

jq '.WGS.svparams.gene_panel_slop' "${ANALYSIS_TYPE_CONFIG}"

For small variants outside the vicinity of any genepanel transcript, the FILTER value is rewritten as

'PASS' -> '.'

This filter is quite ad hoc, and assumes that the canonical interpretation group only allows variants with FILTER='PASS'.
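The transcript filter described above can be sketched in a few lines of Python. This is an illustration only, not the code in wgs_cnv_annotation.sh: the function names, the 1-based inclusive coordinates, the single numeric `small_threshold`, and the assumption that the slop extends the transcript on both sides are all hypothetical.

```python
def overlaps_with_slop(var_start, var_end, tx_start, tx_end, slop):
    """True if the variant shares at least one base with the transcript
    interval extended by `slop` bases on each side (assumed 1-based, inclusive)."""
    return var_start <= tx_end + slop and var_end >= tx_start - slop


def transcript_filter(filter_value, var_start, var_end, transcripts, slop, small_threshold):
    """Return the FILTER value after the transcript filter has been applied."""
    size = var_end - var_start + 1
    if size >= small_threshold:
        # Large variants are not affected by this filter.
        return filter_value
    if any(overlaps_with_slop(var_start, var_end, s, e, slop) for s, e in transcripts):
        # Small variant near a genepanel transcript: keep FILTER as is.
        return filter_value
    # Small variant away from all transcripts: 'PASS' -> '.'
    return "." if filter_value == "PASS" else filter_value
```

For example, with one transcript at 1000-2000 and a slop of 50, a small variant at 960-980 keeps 'PASS', while a small variant at 100-199 ends up with '.'.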

Quality filter annotation

New quality filter tags are added by interpreting the filter definitions in

jq '.WGS.svparams.filters' "${ANALYSIS_TYPE_CONFIG}"

Filters are defined as JSON strings in the format

{
    "$FILTER_NAME": {
        "common": {
            "$DESCRIPTIVE_NAME_FOR_FILTER": {
                "$VCF_COLUMN_NAME_LOWER_CASE[:INFO_OR_FORMAT_FIELD_AS_IS]": {
                    "$IF_NUMERICAL_ONE_OF_KEYWORDS(lt|le|gt|ge|eq|ne)": "$POSITIVE_NUMBER",
                    "$IF_EXACT_MATCH_USE(in)": [
                        "$LIST_OF_EXACT_STRING_MATCHES"
                        ],
                    "$IF_SEMICOLON_SEPARATED_KEYWORD_USE(contains)": [
                        "$LIST_OF_STRINGS_CONTAINED_IN_FORMAT"
                        ],
                    "$IF_COLON_SEPARATED_USE(search)": "$REGEX_TO_MATCH_ALL_TRANSCRIPTS" 
                }
            }

        },
        "exception": {
            "... USE SAME FORMAT TO DEFINE EXCEPTIONS TO IGNORE THE COMMON FILTERS ..."
        }
    }
}

Note that one can use the search or contains keywords for any string, not only those containing , or ;.

Note that only non-negative numbers are understood by the numerical filter. Internally in the filter parser we use absolute values.
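As a rough illustration of how the keywords above could be evaluated, the sketch below checks one field value against one condition block. It is not the actual filter parser. The function name is hypothetical, and three choices are assumptions: several keywords in one block must all hold, "contains" succeeds if any listed string is a substring of the value, and "search" uses Python regular expression syntax.

```python
import operator
import re

NUMERIC_OPS = {
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
    "eq": operator.eq, "ne": operator.ne,
}


def condition_holds(value, condition):
    """Check one VCF field value (a string) against one condition dict."""
    for keyword, expected in condition.items():
        if keyword in NUMERIC_OPS:
            # Numerical comparison on the absolute value, as described above.
            if not NUMERIC_OPS[keyword](abs(float(value)), float(expected)):
                return False
        elif keyword == "in":
            # Exact match against a list of strings.
            if value not in expected:
                return False
        elif keyword == "contains":
            # Assumption: any listed string found in the value is enough.
            if not any(item in value for item in expected):
                return False
        elif keyword == "search":
            if re.search(expected, value) is None:
                return False
        else:
            raise ValueError(f"unknown keyword: {keyword}")
    return True
```

With this sketch, `condition_holds("-250", {"lt": 300})` is true because the comparison is made on the absolute value 250.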

There is currently one quality FILTER name that will be used for removing variants

MinSizeIntronVariant

There is currently one quality FILTER name that will be used for rescuing variants from other filters

HighACMGClass

In detail these filters are:

{
    "MinSizeIntronVariant": {
        "common": {
            "small_intron_variants": {
                "info:SVLEN": {
                    "lt": 300
                },
                "info:CSQ": {
                    "search": "\\|intron_variant\\||\\|intron_variant&non_coding_transcript_variant\\|"
                }
            }
        }
    },
    "HighACMGClass": {
        "common": {
            "rescue_hi_ACMG_class": {
                "info:ACMG_class": {
                    "in": ["4", "5"]
                }
            }
        }
    }
}
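To make the MinSizeIntronVariant definition concrete, the sketch below applies it to hand-written INFO values. The CSQ strings are invented for illustration and do not follow any particular VEP field layout, the helper name is hypothetical, and it is assumed that both conditions must hold for the tag to apply. Note that the doubled backslashes are JSON escaping: after decoding, the regular expression contains a single backslash before each pipe.

```python
import json
import re

# The two conditions of "small_intron_variants", copied from the config above.
definition = json.loads(r'''
{
    "info:SVLEN": {"lt": 300},
    "info:CSQ": {"search": "\\|intron_variant\\||\\|intron_variant&non_coding_transcript_variant\\|"}
}
''')


def is_small_intron_variant(info):
    """True if both the size and the consequence condition hold (assumed AND)."""
    svlen_ok = abs(float(info["SVLEN"])) < definition["info:SVLEN"]["lt"]
    csq_ok = re.search(definition["info:CSQ"]["search"], info["CSQ"]) is not None
    return svlen_ok and csq_ok


# Hypothetical INFO values for three variants.
small_intronic = {"SVLEN": "-120", "CSQ": "T|intron_variant|MODIFIER|GENE1"}
small_coding = {"SVLEN": "-120", "CSQ": "T|missense_variant|MODERATE|GENE1"}
large_intronic = {"SVLEN": "-5000", "CSQ": "T|intron_variant|MODIFIER|GENE1"}
```

Only `small_intronic` satisfies both conditions. A consequence such as `splice_region_variant&intron_variant` would not match, because the pattern requires a pipe directly before `intron_variant`.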

Frequency filter annotation, and quality and frequency filtering

Frequency FILTER annotation, quality filtering and frequency filtering are performed in one operation, so the conceptual difference between the three is blurred.

Frequency FILTER annotation is performed using the filter definitions in

jq '.WGS.svdb.criteria' "${ANALYSIS_TYPE_CONFIG}"

The filter definitions are similar to the quality filters, except that the FILTER names are hardcoded and not part of the filter definitions. The frequency FILTER names are

HiFreqInHdb
HiFreqSwegen
HiFreqGnomad

Here 'HiFreqInHdb' annotates high frequency in any of the in-house databases, 'HiFreqSwegen' in any of the Swegen databases and 'HiFreqGnomad' in the Gnomad database.

Frequency filter definitions are written in the form

{
    "(common|manta|canvas)": {
        "(gnomad|swegen|indb)": {
            "$VCF_COLUMN_NAME_LOWER_CASE[:INFO_OR_FORMAT_FIELD_AS_IS]": {
                "... SAME FILTER DEFINITION AS FOR QUALITY FILTERS ..."
            }
        }
    }
}

In our case, this gives the frequency annotation

{
    "common": {
       "gnomad": {
           "info:FRQ_GNOMAD": {
               "gt": 0.01
           },
           "info:OCC_GNOMAD": {
               "gt": 50
           }
       }
    },
    "manta": {
       "swegen": {
           "info:OCC_SWEGEN_MANTA": {
               "gt": 10
           },
           "info:FRQ_SWEGEN_MANTA": {
               "gt": 0.01
           }
       },
       "indb": {
           "info:OCC_INDB_MANTA": {
               "gt": 10
           },
           "info:FRQ_INDB_MANTA": {
               "gt": 0.01
           }
       }
    },
    "canvas": {
        "swegen": {
            "info:OCC_SWEGEN_CNVNATOR": {
                "gt": 10
            },
            "info:FRQ_SWEGEN_CNVNATOR": {
                "gt": 0.01
            }
        },
        "indb": {
            "info:OCC_INDB_CANVAS": {
                "gt": 10
            },
            "info:FRQ_INDB_CANVAS": {
                "gt": 0.01
            }
        }
    }
}

Note: Filters are defined so that both the frequency and the number of occurrences of the variant need to be above the defined thresholds for the filter annotation to apply.

Note: 'common' filters are applied to all variants, 'manta' filters to all variants that were called by Manta, and 'canvas' filters only to variants that were called by Canvas alone.
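The requirement that both the frequency and the occurrence threshold are exceeded can be sketched as follows, using the 'manta' criteria from the configuration above. The function name and the mapping from database key to FILTER name are assumptions made for this example, as is the choice to treat a missing INFO field as "threshold not exceeded".

```python
# Assumed mapping from database key in the config to hardcoded FILTER name.
TAG_FOR_DATABASE = {
    "gnomad": "HiFreqGnomad",
    "swegen": "HiFreqSwegen",
    "indb": "HiFreqInHdb",
}

# The 'manta' criteria from the configuration above.
MANTA_CRITERIA = {
    "swegen": {
        "info:OCC_SWEGEN_MANTA": {"gt": 10},
        "info:FRQ_SWEGEN_MANTA": {"gt": 0.01},
    },
    "indb": {
        "info:OCC_INDB_MANTA": {"gt": 10},
        "info:FRQ_INDB_MANTA": {"gt": 0.01},
    },
}


def frequency_tags(info, criteria):
    """Return the frequency FILTER names whose conditions all hold."""
    tags = []
    for database, conditions in criteria.items():
        exceeded = all(
            field.split(":", 1)[1] in info
            and float(info[field.split(":", 1)[1]]) > limits["gt"]
            for field, limits in conditions.items()
        )
        if exceeded:
            tags.append(TAG_FOR_DATABASE[database])
    return tags


# Hypothetical INFO values: frequent in Swegen, but only 3 in-house occurrences.
info = {
    "OCC_SWEGEN_MANTA": "25", "FRQ_SWEGEN_MANTA": "0.03",
    "OCC_INDB_MANTA": "3", "FRQ_INDB_MANTA": "0.2",
}
print(frequency_tags(info, MANTA_CRITERIA))  # ['HiFreqSwegen']
```

The in-house frequency of 0.2 is above its threshold, but the tag is not added because the occurrence count of 3 is not above 10.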

Note: In ${ANALYSIS_TYPE_CONFIG} we also have definitions for the callers Delly, Tiddit and CNVnator, which are not currently in use.

Quality filtering is based on the definitions in

jq '.WGS.svparams.interpretation_groups.ella' "${ANALYSIS_TYPE_CONFIG}"

In our case this is

{
    "info:SVTYPE": {
        "in": ["DEL", "DUP"]
    },
    "filter": {
        "search": "^[^.]*(PASS|HighACMGClass)"
    }
}

Note: The 'search' regular expression means that variants outside of transcripts (FILTER starting with '.') are removed, and variants with 'PASS' or one of the quality inclusion filters are kept.

Note: The interpretation group is selected before frequency annotation. Therefore, the frequency FILTER names are not in the 'search' string.
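The behaviour of the 'search' expression can be checked directly with Python's `re` module. The FILTER strings below are hypothetical examples chosen to show each case; whether the pipeline ever produces all of them is not stated here.

```python
import re

# The 'search' pattern from the ella interpretation group above.
ELLA_FILTER_PATTERN = r"^[^.]*(PASS|HighACMGClass)"


def passes_filter_search(filter_value):
    """True if the FILTER string matches the interpretation group pattern."""
    return re.search(ELLA_FILTER_PATTERN, filter_value) is not None


print(passes_filter_search("PASS"))                                # True
print(passes_filter_search("."))                                   # False
print(passes_filter_search("MinSizeIntronVariant"))                # False
print(passes_filter_search("MinSizeIntronVariant;HighACMGClass"))  # True
print(passes_filter_search(".;HighACMGClass"))                     # False
```

The `[^.]*` part only allows non-dot characters before 'PASS' or 'HighACMGClass', so any FILTER string with a '.' ahead of those tags fails to match.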

Frequency filtering is performed for all variants with a high frequency FILTER tag. Exceptions to this rule are defined in

jq '.WGS.svdb.exceptions' "${ANALYSIS_TYPE_CONFIG}"

In our case this is

{
    "rescue_homozygote_DEL_on_X": {
        "info:SVTYPE": {
            "in": ["DEL"]
        },
        "chrom": {
            "in": ["X"]
        },
        "format:GT": {
            "in": ["1", "1/1", "1|1"]
        }
    },
    "rescue_ACMG_class": {
        "filter": {
            "contains": ["HighACMGClass"]
        }
    }
}

For the resulting variants, FILTER is either 'PASS', or contains 'HighACMGClass', or contains a high frequency tag on a homozygote DEL on X.
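The frequency filtering step with its two exceptions can be summarised in a short sketch. The variant is represented as a plain dictionary with hypothetical keys ('filter', 'svtype', 'chrom', 'gt'), and the function names are invented for this example; the actual implementation reads the rules from the configuration rather than hardcoding them.

```python
HIGH_FREQUENCY_TAGS = {"HiFreqInHdb", "HiFreqSwegen", "HiFreqGnomad"}


def is_rescued(variant):
    """True if one of the two exceptions above applies."""
    tags = variant["filter"].split(";")
    homozygote_del_on_x = (
        variant["svtype"] == "DEL"
        and variant["chrom"] == "X"
        and variant["gt"] in {"1", "1/1", "1|1"}
    )
    return homozygote_del_on_x or "HighACMGClass" in tags


def keep_after_frequency_filtering(variant):
    """Remove variants with a high frequency tag unless an exception applies."""
    tags = set(variant["filter"].split(";"))
    if not tags & HIGH_FREQUENCY_TAGS:
        return True
    return is_rescued(variant)


# Hypothetical variants.
common_dup = {"filter": "HiFreqGnomad", "svtype": "DUP", "chrom": "7", "gt": "0/1"}
hom_del_x = {"filter": "HiFreqSwegen", "svtype": "DEL", "chrom": "X", "gt": "1"}
class5 = {"filter": "HighACMGClass;HiFreqInHdb", "svtype": "DUP", "chrom": "2", "gt": "0/1"}
```

Here `common_dup` is removed, while `hom_del_x` and `class5` are kept by the two exceptions.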