Skip to content

Setup coverage and segmental duplications

Information about coverage and segmental duplications (SegDup) is displayed for each selected transcript in the Gene panel builder (GPB) frontend as an indication of the variant calling quality that can be expected for a given method.

Creating files

Coverage and SegDup files must be pre-generated for the corresponding information to be displayed in the GPB frontend. The necessary scripts are located in scripts/genomic-data/.

Coverage file

Input:

  • Alignment files (.bam or .cram) with matching indexes (.bai / .crai ).
  • Text file listing the alignment files (.bam or .cram), one per line.
  • Reference genome file (.fasta) with index (.fai) to use with CRAM files (needed for mosdepth to convert CRAM to per-base BED).
  • Genome sizes file (.genome) that lists the length of each chromosome, one per line (needed for BED inputs and d4tools create).

Output:

  • Coverage file coverage.d4 in the desired directory

From scripts/genomic-data, run run_coverage.py in a coverage-tools Docker container (the recipe for building the coverage-tools Docker image is in scripts/genomic-data/Dockerfile) including all software necessary to compute the coverage (rust, d4tools, samtools and mosdepth):

docker run \
    --rm \
    --workdir /mounted/$(pwd) \
    -v $(pwd):/mounted/$(pwd) \
    coverage-tools \
    python3 src/run_coverage.py \
        -i <bams_and_crams.txt> \
        -o <output_directory> \
        -f <reference_genome.fasta> \
        -g <genome_sizes.genome> \
        -mq <mapq>

Replace the placeholder values (<...>) above with paths to real files or directories:

  • <bams_and_crams.txt>: List of alignment files.
  • <output_directory>: Output directory.
  • <reference_genome.fasta>: Reference genome file (e.g. GRCh37.fasta).
  • <genome_sizes.genome>: Genome sizes file.
  • <mapq>: Minimum mapping quality value of alignment files to include in coverage calculations (optional; default=60).

SegDup file

Input:

  • Genome sizes file (.genome) that lists the length of each chromosome, one per line (needed for BED inputs and d4tools create).

Output:

  • segdup.d4 file in the data directory.

Run script make_segdup.py with the make command:

make make-segdup

Running coverage/SegDup scripts in another environment

To run the scripts in another environment (e.g. TSD; see also Containers in TSD), the following steps are needed:

  1. Clone the repository

  2. Build Docker image from the genepanel-builder/scripts/genomic-data directory

    • The Docker image can be built using make (holds also in environments without internet access):
    make build-image
    
    • If make isn't available, the Docker image can be built directly with the following command:
    DOCKER_BUILDKIT=1 docker build -t coverage-tools .
    
  3. Save the Docker image in a tar file:

    make save-image
    

    This will run the command:

    docker save coverage-tools:latest | gzip > coverage-tools.tar.gz
    
  4. Copy the tar file (for example to the TSD server) and load the image:

    docker load -i coverage-tools.tar.gz
    
  5. Run coverage script as above with docker

Handy commands

  • Convert CRAM to BAM:
samtools view -b \
    -T ref/GRCh38DH.fa \
    -o data/s1/HG00140.alt_bwamem_GRCh38DH.20150826.GBR.exome.bam \
    data/s1/HG00140.alt_bwamem_GRCh38DH.20150826.GBR.exome.cram
  • Index alignment file (.bam or .cram):
samtools index file.bam

Troubleshooting coverage/SegDup scripts

  • If per-base files are already in the output directory, it is assumed they are the correct files and the generation of a new per-base file is skipped. The same goes for the merged.bed file and the decompressed per-base.bed files. If you are unsure about the correctness of the files, it is recommended to delete them before running the script.
  • Aligmnent files should have a matching index file (.bai or .crai).
  • Reference genome FASTA file (e.g. GRCh37.fasta) should have a matching index file (.fai).
  • If samtools outputs samtools index: failed to create or write index, make sure the directory where the BAM or CRAM file is located has write permissions. Samtools also may require that the BAM or CRAM file has read permissions. The index is created by default in the same directory as the BAM or CRAM file.

Configure display of coverage and SegDup in the frontend

Use of the coverage and SegDup files in the GPB frontend is configured in gpbuilder/resources/settings.json; see example_settings.json (same folder) for an example. The parameters you can configure are:

  • build_summary_script: Path to script that creates a summary of a gene panel build.

  • coverage: Regions to include in calculation of average coverage for a gene.

    Subkey Explanation Values
    slop.upstream / slop.downstream Number of bp to include upstream/downstream of exons [integer]
    use_cds_if_available If set to true, only coding regions (not UTRs) for protein coding genes are considered when calculating coverage. Note: This has no effect on non-protein coding genes. true/false
  • d4: Paths and thresholds for d4 files for coverage/SegDups.

    Subkey Explanation Values
    [file] Short name of d4 file [string]
    [file].long_name Long name of d4 file [string]
    [file].path Path to d4 file [string]
    [file].threshold Threshold for including a bp in calculation of average sequencing coverage or SegDup overlap for a gene * [integer 0-100]

    * Notes on thresholds:

    • For coverage, this is the minimum number of reads (e.g. 10) to count a bp as covered; this should usually be set in the lower range, e.g. 10-20.
    • For SegDups, this is the minimum degree of homology at a bp to count as covered by a SegDup. Reasonable values here would be in the range 90-100 (meaning 90%-100% homology; homology below 90% does not result in a region being designated as a segmental duplication by UCSC).