Skip to content

Setup coverage and segmental duplications

Information about coverage and segmental duplications (SegDup) is displayed for all transcripts and regions in the Gene panel builder (GPB) frontend as an indication of the variant calling quality that can be expected for a given method.

Creating files

Coverage and SegDup files must be pre-generated for the corresponding information to be displayed in the GPB frontend. The necessary scripts are located in scripts/genomic-data. They are meant to run in a Docker container built according to the recipe in scripts/genomic-data/Dockerfile, which ensures that all the necessary software (Rust, SAMtools, d4tools and mosdepth) is installed.

The Docker image is built as follows:

docker build -t coverage-tools scripts/genomic-data

Note

If no truthful information is required, mock d4 files (with zero values everywhere) can be generated by issuing the following command:

docker compose run \
    --no-deps \
    --entrypoint python \
    gpbuilder \
    gpbuilder/coverage/mock_d4_file.py \
        --output-file wgs.d4 \
        --high 0 \
        --low 0

This will create a d4 file in the gpbuilder directory, with zero coverage everywhere.

Coverage file

For example:

docker run \
    --rm \
    --volume /home/gpb:/host/gpb \
    --volume /home/gpb/bams_or_crams.txt:/host/bams_or_crams.txt \
    --volume /home/gpb/genome_reference.fasta:/host/genome_reference.fasta \
    --volume /home/gpb/genome_sizes.txt:/host/genome_sizes.txt \
    coverage-tools \
    python /app/src/run_coverage.py \
        -i /host/bams_or_crams.txt \
        -o /host/gpb \
        --genome-reference /host/genome_reference.fasta \
        --genome-sizes /host/genome_sizes.txt \
        --mapq 50

where:

  • bams_or_crams.txt: List of alignment files (.bam or .cram) with matching indexes (.bai / .crai ).
  • genome_reference.fasta: Genome reference file in FASTA format with index (.fai) to use with CRAM files (needed for mosdepth to convert CRAM to per-base BED).
  • genome_sizes.txt: Genome sizes file (.genome) listing the length of each chromosome, one per line (needed for BED inputs and d4tools's create command).
  • 50: Minimum mapping quality value of alignment files to include in coverage calculations (optional; default=60)

This will write a coverage file coverage.d4 in the /home/gpb directory.

SegDup file

For example:

docker run \
    --rm \
    --volume /home/gpb/genome_sizes.txt:/host/genome_sizes.txt \
    --volume /home/gpb/:/host/gpb \
    --user 1000:1000 \
    coverage-tools \
    python /app/src/make_segdup.py \
        --genome /host/genome_sizes.txt \
        --output /host/gpb

where:

  • genome_sizes.txt: Genome sizes file (.genome) listing the length of each chromosome, one per line (needed for BED inputs and d4tools's create command).

This will write a segmental duplications file segdup.d4 in the /home/gpb directory.

Running scripts in an environment without internet access

To run the scripts in closed environment (e.g. UiO's service for sensitive data, TSD; see also Containers in TSD), the following steps are needed:

  1. Clone the repository.

  2. Build a Docker image according to the recipe in genepanel-builder/scripts/genomic-data/Dockerfile. This can be done in two ways:

    DOCKER_BUILDKIT=1 docker build -t coverage-tools scripts/genomic-data
    
  3. Save the Docker image in a tar file. Again, this can be done in two ways:

    docker save coverage-tools:latest | gzip > coverage-tools.tar.gz
    
  4. Upload the newly created archive to its destination, for example TSD, and from there load the Docker image:

    docker load -i coverage-tools.tar.gz
    
  5. Run coverage script as shown above (using the Docker command)

Handy commands

  • Convert CRAM to BAM:

    samtools view -b \
        -T ref/GRCh38DH.fa \
        -o data/s1/HG00140.alt_bwamem_GRCh38DH.20150826.GBR.exome.bam \
        data/s1/HG00140.alt_bwamem_GRCh38DH.20150826.GBR.exome.cram
    
  • Index alignment file (.bam or .cram):

    samtools index file.bam
    

Troubleshooting

  • If per-base files are already in the output directory, it is assumed they are the correct files and the generation of a new per-base file is skipped. The same goes for the merged.bed file and the decompressed per-base.bed files. If you are unsure about the correctness of the files, it is recommended to delete them before running the script.
  • Aligmnent files should have a matching index file (.bai or .crai).
  • Reference genome FASTA file (e.g. GRCh37.fasta) should have a matching index file (.fai).
  • If SAMtools outputs samtools index: failed to create or write index, make sure the directory where the BAM or CRAM file is located has write permissions. SAMtools also may require that the BAM or CRAM file has read permissions. The index is created by default in the same directory as the BAM or CRAM file.

Configure display of coverage and segmental duplications in the frontend

The way coverage and segmental duplications are displayed in the GPB frontend is configured in a JSON file the path to which is stored in the GENERAL_SETTINGS environment variable; see gpbuilder/resources/example_settings.json for an example. The configurable parameters are:

  • coverage: Regions to include in the calculation of a gene's average coverage.

    Subkey Explanation Values
    slop.upstream / slop.downstream Number of bps to include upstream/downstream of exons [integer]
    use_cds_if_available If set to true, only coding regions (i.e. not UTRs) of protein coding genes are taken into account. [Note: this has no effect on non-coding genes.] true/false
  • d4: D4 files paths and thresholds for computing coverage and segmental duplications statistics.

    Subkey Explanation Values
    [file] Short name of d4 file [string]
    [file].long_name Long name of d4 file [string]
    [file].path Path to d4 file [string]
    [file].threshold Threshold for inclusion of a bp in the calculation of a gene's average sequencing coverage or segmental duplication overlap * [integer 0-100]

    * Notes on thresholds:

    • For coverage, this is the minimum number of reads to count a bp as covered; this should usually be set in the lower range, e.g. 10-20.
    • For segmental duplications, this is the minimum degree of homology at a bp to count it as covered by a segmental duplication. Reasonable values here would be in the range 90-100 (meaning 90%-100% homology; homology below 90% does not result in a region being designated as a segmental duplication by UCSC).