HTS bioinf - Genomic coordinates liftover
Scope
This document summarizes the implementation and outputs of the coordinate liftover from the hg19 to the hg38 genome build.
Reference genome coordinates mapping
The mapping between reference genome assemblies (e.g., GRCh37/hg19 to GRCh38/hg38) involves remapping genomic coordinates to account for multiple types of reference changes:
- Sequence additions and refinements
- Representation of complex and repetitive regions
- Correction of assembly errors
- Chromosomal rearrangements and alternate loci
Our implementation uses the CrossMap tool with custom post-processing to ensure high-quality coordinate conversion.
Implementation
The coordinate liftover process follows these key steps:
- Variant positions are extracted from the source VCF file (either SNVs or structural variants)
- Coordinates are converted using CrossMap and the UCSC chain file
- Multi-mapped cases are resolved through a region-based secondary analysis
- The original VCF file is annotated with the target genome coordinates and a mapping quality indicator
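The first step above (extracting variant positions) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names `vcf_line_to_bed` and `vcf_to_bed` are hypothetical, and it assumes positions are handed to the conversion step as BED-style intervals, using INFO/END when present and the REF allele span otherwise.

```python
def vcf_line_to_bed(line):
    """Convert one VCF data line to a (chrom, start0, end, name) tuple.

    Hypothetical helper: the interval end is taken from INFO/END when
    present, otherwise from the span of the REF allele.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref = fields[0], int(fields[1]), fields[3]
    info = fields[7] if len(fields) > 7 else "."

    end = pos + len(ref) - 1  # 1-based inclusive end of the REF allele
    for item in info.split(";"):
        if item.startswith("END="):
            end = int(item[4:])

    # VCF positions are 1-based inclusive; BED is 0-based half-open,
    # so only the start is shifted by one.
    return chrom, pos - 1, end, f"{chrom}:{pos}"


def vcf_to_bed(lines):
    """Skip header and blank lines, convert the remaining data lines."""
    return [
        vcf_line_to_bed(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]
```

The `name` element keeps the source position so that converted intervals can be matched back to the original VCF records when annotating.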
Liftover annotations
Liftover annotations (explained in Table 1) are shown in the Region section in ELLA (see Figures 1 and 2).
Figure 1: Display of hg38 liftover annotations in the ELLA Region section. When hg38_map is not FAILED, the links are active and the target genome coordinates are non-empty.
Figure 2: Display of hg38 liftover annotations in the ELLA Region section. When hg38_map is FAILED, the gnomAD v4 link and the coordinates are empty.
Notes
- The vertical header HG38 LIFTOVER contains the link to the UCSC LiftOver tool
- The vertical header GNOMAD v4 contains the link to the corresponding region in the gnomAD v4 browser (GRCh38 build)
- The gnomAD v4 link is inactive when the liftover has failed
Table 1: Hg38 liftover annotations
Field | Description |
---|---|
hg38_chr | Target chromosome in hg38 reference |
hg38_start | Starting position in hg38 (1-based) |
hg38_end | End position in hg38 |
hg38_coord | Complete coordinate representation |
hg38_map | Mapping quality indicator |
If the mapping status is FAILED (hg38_map=FAILED), only the hg38_map tag is added.
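The annotation rule in Table 1 can be illustrated with a short sketch. The field names come from the table; everything else is an assumption made for illustration: the helper names `liftover_tags` and `format_info`, the `chr:start-end` layout of hg38_coord, and the rendering of the tags as `key=value` pairs joined by semicolons.

```python
def liftover_tags(hg38_map, chrom=None, start=None, end=None):
    """Build the liftover annotation tags for one variant.

    Hypothetical helper. For a FAILED mapping only hg38_map is emitted,
    as stated in the text; otherwise all five fields are filled in.
    """
    if hg38_map == "FAILED":
        return {"hg38_map": "FAILED"}
    return {
        "hg38_chr": chrom,
        "hg38_start": start,
        "hg38_end": end,
        # Assumed layout; the source only says "complete coordinate
        # representation".
        "hg38_coord": f"{chrom}:{start}-{end}",
        "hg38_map": hg38_map,
    }


def format_info(tags):
    """Render the tags as semicolon-separated key=value pairs."""
    return ";".join(f"{key}={value}" for key, value in tags.items())
```

For example, `format_info(liftover_tags("FAILED"))` yields only the `hg38_map` tag, while a successful mapping yields all five fields.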
Mapping quality indicator
The hg38_map field classifies coordinate conversion confidence as follows:
UNIQUE: The source coordinates mapped unambiguously to a single location in the target reference genome. This represents the highest confidence level.
REGION: The source coordinates initially mapped to multiple potential locations, but region-based analysis determined a most likely mapping. These coordinates are generally reliable but may warrant additional confirmation.
FAILED: The source coordinates could not be reliably mapped to the target reference genome. This may occur due to significant reference differences, complex regions, or deleted sequences. These variants require manual review before clinical interpretation for the target genome.
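The three categories can be expressed as a small decision function. This is a sketch under stated assumptions: the source does not describe how the region-based analysis picks a location, so it is represented here only by an optional, already-computed result (`region_hit`), and the function names are hypothetical.

```python
def classify_mapping(hits, region_hit=None):
    """Return (hg38_map, chosen_location) for one source interval.

    hits:       list of (chrom, start, end) locations from the direct
                coordinate conversion.
    region_hit: location chosen by a secondary region-based analysis,
                or None if that analysis found nothing (hypothetical
                input; the analysis itself is not modelled here).
    """
    if len(hits) == 1:
        # Exactly one target location: unambiguous mapping.
        return "UNIQUE", hits[0]
    if len(hits) > 1 and region_hit is not None:
        # Multiple candidates, resolved by the region-based analysis.
        return "REGION", region_hit
    # No candidates, or multiple candidates that could not be resolved.
    return "FAILED", None


def needs_manual_review(hg38_map):
    """FAILED mappings require manual review before interpretation."""
    return hg38_map == "FAILED"
```

Keeping the classification separate from the region analysis makes the status logic easy to check in isolation, whatever resolution strategy is used.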
Structural variant considerations
For structural variants, the liftover process includes specialized handling:
- END positions are extracted from INFO fields
- For variants with SVLEN but no END, the appropriate end position is calculated
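The end-position handling above might look like the following sketch. The exact rule used by the pipeline is not given in the text, so this assumes a common VCF convention: END is used when present; otherwise END = POS + |SVLEN| for symbolic alleles such as deletions, and END = POS for insertions. The helper names are hypothetical.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict (flags map to an empty string)."""
    tags = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        tags[key] = value
    return tags


def sv_end(pos, info):
    """Return the 1-based end position of a structural variant.

    Assumed convention, not confirmed by the source:
      - INFO/END wins when present
      - insertions occupy a single reference position (END = POS)
      - otherwise END = POS + |SVLEN|
      - with neither END nor SVLEN, fall back to POS
    """
    tags = parse_info(info)
    if "END" in tags:
        return int(tags["END"])
    if "SVLEN" in tags:
        if tags.get("SVTYPE") == "INS":
            return pos
        # SVLEN may be comma-separated (one value per ALT allele) and is
        # negative for deletions; use the first value's magnitude.
        return pos + abs(int(tags["SVLEN"].split(",")[0]))
    return pos
```

Some callers instead report END as POS + |SVLEN| - 1, so the offset should be checked against the variant caller that produced the input.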