Setup coverage and segmental duplications
Information about coverage and segmental duplications (SegDup) is displayed for all transcripts and regions in the Gene panel builder (GPB) frontend as an indication of the variant calling quality that can be expected for a given method.
Creating files
Coverage and SegDup files must be pre-generated for the corresponding information to be displayed in the GPB frontend. The necessary scripts are located in scripts/genomic-data
. They are meant to run in a Docker container built according to the recipe in scripts/genomic-data/Dockerfile
, which ensures that all the necessary software (Rust, SAMtools, d4tools and mosdepth) is installed.
The Docker image is built as follows:
Note
If no truthful information is required, mock d4 files (with zero values everywhere) can be generated by issuing the following command:
docker compose run \
--no-deps \
--entrypoint python \
gpbuilder \
gpbuilder/coverage/mock_d4_file.py \
--output-file wgs.d4 \
--high 0 \
--low 0
This will create a d4 file in the gpbuilder
directory, with zero coverage everywhere.
Coverage file
For example:
docker run \
--rm \
--volume /home/gpb:/host/gpb \
--volume /home/gpb/bams_or_crams.txt:/host/bams_or_crams.txt \
--volume /home/gpb/genome_reference.fasta:/host/genome_reference.fasta \
--volume /home/gpb/genome_sizes.txt:/host/genome_sizes.txt \
coverage-tools \
python /app/src/run_coverage.py \
-i /host/bams_or_crams.txt \
-o /host/gpb \
--genome-reference /host/genome_reference.fasta \
--genome-sizes /host/genome_sizes.txt \
--mapq 50
where:
bams_or_crams.txt
: List of alignment files (.bam
or.cram
) with matching indexes (.bai
/.crai
).genome_reference.fasta
: Genome reference file in FASTA format with index (.fai
) to use with CRAM files (needed for mosdepth to convert CRAM to per-base BED).genome_sizes.txt
: Genome sizes file (.genome
) listing the length of each chromosome, one per line (needed for BED inputs and d4tools'screate
command).50
: Minimum mapping quality value of alignment files to include in coverage calculations (optional; default=60)
This will write a coverage file coverage.d4
in the /home/gpb
directory.
SegDup file
For example:
docker run \
--rm \
--volume /home/gpb/genome_sizes.txt:/host/genome_sizes.txt \
--volume /home/gpb/:/host/gpb \
--user 1000:1000 \
coverage-tools \
python /app/src/make_segdup.py \
--genome /host/genome_sizes.txt \
--output /host/gpb
where:
genome_sizes.txt
: Genome sizes file (.genome
) listing the length of each chromosome, one per line (needed for BED inputs and d4tools'screate
command).
This will write a segmental duplications file segdup.d4
in the /home/gpb
directory.
Running scripts in an environment without internet access
To run the scripts in closed environment (e.g. UiO's service for sensitive data, TSD; see also Containers in TSD), the following steps are needed:
-
Clone the repository.
-
Build a Docker image according to the recipe in
genepanel-builder/scripts/genomic-data/Dockerfile
. This can be done in two ways: -
Save the Docker image in a tar file. Again, this can be done in two ways:
-
Upload the newly created archive to its destination, for example TSD, and from there load the Docker image:
-
Run coverage script as shown above (using the Docker command)
Handy commands
-
Convert CRAM to BAM:
-
Index alignment file (
.bam
or.cram
):
Troubleshooting
- If per-base files are already in the output directory, it is assumed they are the correct files and the generation of a new per-base file is skipped. The same goes for the
merged.bed
file and the decompressedper-base.bed
files. If you are unsure about the correctness of the files, it is recommended to delete them before running the script. - Aligmnent files should have a matching index file (
.bai
or.crai
). - Reference genome FASTA file (e.g.
GRCh37.fasta
) should have a matching index file (.fai
). - If SAMtools outputs
samtools index: failed to create or write index
, make sure the directory where the BAM or CRAM file is located has write permissions. SAMtools also may require that the BAM or CRAM file has read permissions. The index is created by default in the same directory as the BAM or CRAM file.
Configure display of coverage and segmental duplications in the frontend
The way coverage and segmental duplications are displayed in the GPB frontend is configured in a JSON file the path to which is stored in the GENERAL_SETTINGS
environment variable; see gpbuilder/resources/example_settings.json
for an example. The configurable parameters are:
-
coverage
: Regions to include in the calculation of a gene's average coverage.Subkey Explanation Values slop.upstream
/slop.downstream
Number of bps to include upstream/downstream of exons [integer] use_cds_if_available
If set to true
, only coding regions (i.e. not UTRs) of protein coding genes are taken into account. [Note: this has no effect on non-coding genes.]true
/false
-
d4
: D4 files paths and thresholds for computing coverage and segmental duplications statistics.Subkey Explanation Values [file] Short name of d4 file [string] [file]. long_name
Long name of d4 file [string] [file]. path
Path to d4 file [string] [file]. threshold
Threshold for inclusion of a bp in the calculation of a gene's average sequencing coverage or segmental duplication overlap * [integer 0-100] * Notes on thresholds:
- For coverage, this is the minimum number of reads to count a bp as covered; this should usually be set in the lower range, e.g. 10-20.
- For segmental duplications, this is the minimum degree of homology at a bp to count it as covered by a segmental duplication. Reasonable values here would be in the range 90-100 (meaning 90%-100% homology; homology below 90% does not result in a region being designated as a segmental duplication by UCSC).