Setup coverage and segmental duplications
Information about coverage and segmental duplications (SegDup) is displayed for each selected transcript in the Gene panel builder (GPB) frontend as an indication of the variant calling quality that can be expected for a given method.
Creating files
Coverage and SegDup files must be pre-generated for the corresponding information to be displayed in the GPB frontend. The necessary scripts are located in scripts/genomic-data/
.
Coverage file
Input:
- Alignment files (
.bam
or.cram
) with matching indexes (.bai
/.crai
). - Text file listing the alignment files (
.bam
or.cram
), one per line. - Reference genome file (
.fasta
) with index (.fai
) to use with CRAM files (needed for mosdepth to convert CRAM to per-base BED). - Genome sizes file (
.genome
) that lists the length of each chromosome, one per line (needed for BED inputs and d4tools create).
Output:
- Coverage file
coverage.d4
in the desired directory
From scripts/genomic-data
, run run_coverage.py
in a coverage-tools
Docker container (the recipe for building the coverage-tools
Docker image is in scripts/genomic-data/Dockerfile
) including all software necessary to compute the coverage (rust, d4tools, samtools and mosdepth):
docker run \
--rm \
--workdir /mounted/$(pwd) \
-v $(pwd):/mounted/$(pwd) \
coverage-tools \
python3 src/run_coverage.py \
-i <bams_and_crams.txt> \
-o <output_directory> \
-f <reference_genome.fasta> \
-g <genome_sizes.genome> \
-mq <mapq>
Replace the placeholder values (<...>
) above with paths to real files or directories:
<bams_and_crams.txt>
: List of alignment files.<output_directory>
: Output directory.<reference_genome.fasta>
: Reference genome file (e.g.GRCh37.fasta
).<genome_sizes.genome>
: Genome sizes file.<mapq>
: Minimum mapping quality value of alignment files to include in coverage calculations (optional; default=60).
SegDup file
Input:
- Genome sizes file (
.genome
) that lists the length of each chromosome, one per line (needed for BED inputs and d4tools create).
Output:
segdup.d4
file in thedata
directory.
Run script make_segdup.py
with the make
command:
Running coverage/SegDup scripts in another environment
To run the scripts in another environment (e.g. TSD; see also Containers in TSD), the following steps are needed:
-
Clone the repository
-
Build Docker image from the
genepanel-builder/scripts/genomic-data
directory- The Docker image can be built using
make
(holds also in environments without internet access):
- If make isn't available, the Docker image can be built directly with the following command:
- The Docker image can be built using
-
Save the Docker image in a tar file:
This will run the command:
-
Copy the tar file (for example to the TSD server) and load the image:
-
Run coverage script as above with docker
Handy commands
- Convert CRAM to BAM:
samtools view -b \
-T ref/GRCh38DH.fa \
-o data/s1/HG00140.alt_bwamem_GRCh38DH.20150826.GBR.exome.bam \
data/s1/HG00140.alt_bwamem_GRCh38DH.20150826.GBR.exome.cram
- Index alignment file (
.bam
or.cram
):
Troubleshooting coverage/SegDup scripts
- If per-base files are already in the output directory, it is assumed they are the correct files and the generation of a new per-base file is skipped. The same goes for the
merged.bed
file and the decompressedper-base.bed
files. If you are unsure about the correctness of the files, it is recommended to delete them before running the script. - Aligmnent files should have a matching index file (
.bai
or.crai
). - Reference genome FASTA file (e.g.
GRCh37.fasta
) should have a matching index file (.fai
). - If samtools outputs
samtools index: failed to create or write index
, make sure the directory where the BAM or CRAM file is located has write permissions. Samtools also may require that the BAM or CRAM file has read permissions. The index is created by default in the same directory as the BAM or CRAM file.
Configure display of coverage and SegDup in the frontend
Use of the coverage and SegDup files in the GPB frontend is configured in gpbuilder/resources/settings.json
; see example_settings.json
(same folder) for an example. The parameters you can configure are:
-
build_summary_script
: Path to script that creates a summary of a gene panel build. -
coverage
: Regions to include in calculation of average coverage for a gene.Subkey Explanation Values slop.upstream
/slop.downstream
Number of bp to include upstream/downstream of exons [integer] use_cds_if_available
If set to true
, only coding regions (not UTRs) for protein coding genes are considered when calculating coverage. Note: This has no effect on non-protein coding genes.true
/false
-
d4
: Paths and thresholds for d4 files for coverage/SegDups.Subkey Explanation Values [file] Short name of d4 file [string] [file]. long_name
Long name of d4 file [string] [file]. path
Path to d4 file [string] [file]. threshold
Threshold for including a bp in calculation of average sequencing coverage or SegDup overlap for a gene * [integer 0-100] * Notes on thresholds:
- For coverage, this is the minimum number of reads (e.g. 10) to count a bp as covered; this should usually be set in the lower range, e.g. 10-20.
- For SegDups, this is the minimum degree of homology at a bp to count as covered by a SegDup. Reasonable values here would be in the range 90-100 (meaning 90%-100% homology; homology below 90% does not result in a region being designated as a segmental duplication by UCSC).