Tools for postprocessing of structural variants

Standardizing a VCF file from our CNV-callers

The runnable script sv_standardizer will print a standardized VCF to stdout when executed as

sv_standardizer VCF_FILENAME --caller CALLER_NAME --sample SAMPLE_NAMES

where

  • VCF_FILENAME is the VCF from one of the callers ["manta", "delly", "tiddit", "cnvnator", "svaba", "canvas", "cnv_sv"] or one of the end states ["merged", "header", "ella"]
  • If no filename is given, the function will take input from stdin
  • CALLER_NAME explicitly sets the source of the VCF_FILENAME
  • CALLER_NAME can be omitted if the VCF_FILENAME starts with {CALLER_NAME}_
  • SAMPLE_NAMES is a comma-separated list of SAMPLE column names to ensure equal naming for all callers
  • It is expected that the order of SAMPLE columns is the same for all callers
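
The prefix rule for CALLER_NAME can be illustrated with a minimal Python sketch. The helper infer_caller and the constant KNOWN_SOURCES are hypothetical names, not part of sv_standardizer. The sketch also assumes that only the basename of the path is inspected and that end-state names are accepted as prefixes; neither is stated above.

```python
from pathlib import Path
from typing import Optional

# Caller and end-state names as listed in this section.
KNOWN_SOURCES = [
    "manta", "delly", "tiddit", "cnvnator", "svaba", "canvas", "cnv_sv",
    "merged", "header", "ella",
]


def infer_caller(vcf_filename: str, caller: Optional[str] = None) -> str:
    """Return the explicit caller if given, else infer it from the filename prefix."""
    if caller is not None:
        # An explicit --caller value always wins.
        return caller
    basename = Path(vcf_filename).name  # assumption: directories are ignored
    # Check longer names first so a longer name is never shadowed by a shorter one.
    for name in sorted(KNOWN_SOURCES, key=len, reverse=True):
        if basename.startswith(f"{name}_"):
            return name
    raise ValueError(f"Cannot infer caller from {basename!r}; pass it explicitly")
```

With this sketch, infer_caller("/data/manta_sample1.vcf") yields "manta", while a filename without a known prefix raises an error unless the caller is passed explicitly.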

Note: For filenames that contain Diag-, a workaround is applied when standardizing a file in the "merged" state, in order to remove unwanted INFO fields produced by v2.8.2 of SVDB merge

Postprocessing of the FILTER column

The FILTER column can be updated when a variant record fails a quality test. Postprocessing is needed for callers such as CNVnator, Delly, Manta and the Dragen SV and CNV callers.

The runnable script sv_postprocessing will print a VCF with updated FILTER column to stdout when executed as

sv_postprocessing STANDARDIZED_MERGED_VCF --set-filters WGS_SVPARAMS_FILTER --caller-priority CALLERS --set-filter-descriptions SET_FILTER_DESCRIPTIONS

where

  • STANDARDIZED_MERGED_VCF has been standardized by sv_standardizer and merged by SVDB
  • If no filename is given, the function will take input from stdin
  • WGS_SVPARAMS_FILTER is a JSON-formatted string defined in analysistypeconfig.schema.json under WGS.svparams.filter. There is no additional schema in sv_postprocessing to check the filter definitions.
  • CALLERS is a prioritized list of callers. Only the filter of the most prioritized caller will be used.
  • Optional: SET_FILTER_DESCRIPTIONS is a dictionary {FILTER_NAME: "Description of filter"} for the VCF header. FILTER_NAMEs are the keys of WGS_SVPARAMS_FILTER. If not set, the description will be the generic: "Custom filter FILTER_NAME".
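
The fallback for filter descriptions can be sketched as follows. This is an illustration only, not the code of sv_postprocessing: the function filter_header_lines and the filter names in the usage example are hypothetical. The header line layout follows the ##FILTER format of the VCF specification.

```python
from typing import Dict, Iterable, List, Optional


def filter_header_lines(
    filter_names: Iterable[str],
    descriptions: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build one ##FILTER header line per filter name."""
    descriptions = descriptions or {}
    lines = []
    for name in filter_names:
        # Fall back to the generic description when none was supplied.
        description = descriptions.get(name, f"Custom filter {name}")
        lines.append(f'##FILTER=<ID={name},Description="{description}">')
    return lines


# Hypothetical filter names, for illustration only.
for line in filter_header_lines(
    ["EXAMPLE_A", "EXAMPLE_B"],
    {"EXAMPLE_A": "First example filter"},
):
    print(line)
```

Here EXAMPLE_A gets its supplied description, and EXAMPLE_B gets the generic "Custom filter EXAMPLE_B".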

Filtering and selecting an interpretation group

Filtering removes records from the VCF based on certain conditions:

  1. Remove variants with high frequencies or other properties (set filters). Exceptions are allowed (set rescue filters).
  2. Return a subset of variants (set interpretation group).

sv_wgs_filtering STANDARDIZED_MERGED_VCF
  --output-format OUTPUT_FORMAT
  --set-interpretation-group SET_INTERPRETATION_GROUP
  --set-filters SET_FILTERS
  --set-rescue-filters SET_RESCUE_FILTERS
  --caller-priority CALLER_PRIORITY

where

  • STANDARDIZED_MERGED_VCF has been standardized by sv_standardizer
  • If no filename is given, the function will take input from stdin
  • Optional: STANDARDIZED_MERGED_VCF may also have been postprocessed by sv_postprocessing
  • OUTPUT_FORMAT can take values such as vcf, tsv, bed or md
  • SET_INTERPRETATION_GROUP is a JSON-formatted string defined in analysistypeconfig.schema.json under svparams.interpretation_groups
  • The interpretation group is applied prior to filtering
  • SET_FILTERS is a JSON-formatted string defined in analysistypeconfig.schema.json under svdb.criteria. There is no additional schema in sv_wgs_filtering to check the filter definitions.
  • Note that frequency filtering by the gnomAD database is applied to variants from all callers, whereas filtering by the INDB and SweGen databases is caller-specific
  • SET_RESCUE_FILTERS is a JSON-formatted string defined in analysistypeconfig.schema.json under svdb.exceptions
  • CALLER_PRIORITY is a list of callers in order of priority. When multiple callers have been merged, the frequency database for the caller of highest priority will be used for filtering
  • Optional: --debug will show debug information
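
The role of CALLER_PRIORITY for merged records can be sketched as below. The function pick_priority_caller is a hypothetical name, and returning None when no listed caller supports the record is an assumption; the behaviour of sv_wgs_filtering in that case is not described here.

```python
from typing import Collection, Optional, Sequence


def pick_priority_caller(
    record_callers: Collection[str],
    caller_priority: Sequence[str],
) -> Optional[str]:
    """Return the highest-priority caller that supports a merged record."""
    for caller in caller_priority:  # caller_priority is ordered, highest first
        if caller in record_callers:
            return caller
    return None  # assumption: no listed caller supports this record


# A record merged from delly and manta, with manta ranked first.
chosen = pick_priority_caller({"delly", "manta"}, ["manta", "delly", "tiddit"])
print(chosen)
```

In this example the caller-specific frequency database for manta would be the one consulted.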

Note: There are additional options for this function. For the latest options, check sv_wgs_filtering --help
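
The order of operations in this section (interpretation group first, then filters, with rescue filters as exceptions) can be sketched as below. This is a simplified illustration under stated assumptions, not the implementation of sv_wgs_filtering. Records are plain dictionaries and the three conditions are passed in as hypothetical callables, whereas the real tool takes JSON-formatted definitions.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]  # hypothetical stand-in for a parsed VCF record
Predicate = Callable[[Record], bool]


def filter_records(
    records: Iterable[Record],
    in_group: Predicate,
    fails_filter: Predicate,
    is_rescued: Predicate,
) -> List[Record]:
    """Keep records in the interpretation group that pass the filters or are rescued."""
    kept = []
    for record in records:
        # Step 1: the interpretation group is applied before any filtering.
        if not in_group(record):
            continue
        # Step 2: drop records that fail a filter, unless a rescue filter applies.
        if fails_filter(record) and not is_rescued(record):
            continue
        kept.append(record)
    return kept
```

A record outside the interpretation group is dropped even if a rescue filter would match it, because the group selection happens first.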

Validating a VCF file

A VCF file can be validated with the script sv_validator, which prints an error message if the VCF does not comply with a pydantic model for the VCF records

sv_validator VCF_FILENAME
  --datamodel DATAMODEL

where

  • VCF_FILENAME is the VCF file to be validated. If no filename is given, the function will take input from stdin.
  • DATAMODEL is the pydantic model to be used for validation. For datamodel options and the default model, check sv_validator --help
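
To give a feel for record-level validation, here is a minimal sketch that uses a standard-library dataclass as a stand-in for a pydantic model. It is not one of the data models shipped with sv_validator. The names VcfRecord and parse_record are hypothetical, and only the eight fixed VCF columns and an integer POS are checked.

```python
from dataclasses import dataclass


@dataclass
class VcfRecord:
    """The eight fixed columns of a VCF data line."""
    chrom: str
    pos: int
    id: str
    ref: str
    alt: str
    qual: str
    filter: str
    info: str


def parse_record(line: str) -> VcfRecord:
    """Parse one tab-separated VCF data line, raising ValueError if it is malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated columns, got {len(fields)}")
    chrom, pos, *rest = fields[:8]
    if not pos.isdigit():
        raise ValueError(f"POS must be a non-negative integer, got {pos!r}")
    return VcfRecord(chrom, int(pos), *rest)
```

A pydantic model would express the same constraints declaratively through field types and validators; the sketch only shows the kind of check a record-level model performs.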