HTS Bioinf - Create region files for new capture kits

Scope

This document describes the procedure for generating region files for a new capture kit so that the kit can be used in the variant calling pipeline.

Responsibility

A bioinformatician is responsible for running the commands that act on the input data to generate region files for the new capture kit. The input files can be downloaded from the capture kit provider.

Procedure

Tools used: standard Linux commands, bedtools and in-house scripts.
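
Before running the steps below, it can help to confirm that the command-line tools are available on PATH. The following is a minimal sketch only; the function name `require_tools` is hypothetical and is not one of the in-house scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the in-house scripts):
# report every tool in the argument list that cannot be found on PATH.
require_tools() {
    local tool missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Tools invoked by the commands in this procedure.
if require_tools bedtools awk sort sed grep cut; then
    echo "all required tools found"
fi
```

The function returns a non-zero status if any tool is missing, so it can also be used as a guard at the top of a wrapper script.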

  1. Obtain the probe/bait regions from the capture kit provider (e.g. the probe/bait regions for the Agilent SureSelect Human All Exon capture kit are in *_Covered.bed).

  2. Create the noslop_bed file in the vcpipe-bundle. The command below produces a BED format region file that is sorted by chromosome and position, with overlapping regions merged:

    e.g., for the Agilent SureSelect Human All Exon capture kit:

    cat S30409818_Covered.bed \
        | grep "^chr" \
        | sort -k1,1V -k2,2n -k3,3n \
        | sed 's/^chr//g' \
        | bedtools merge -c 4 -o distinct -i - \
        | awk 'BEGIN {OFS = "\t"} {print $0, "0", "+"}' \
        > agilent_cre_v02.baits.bed
    
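  3. Create the noslop_list file in the vcpipe-bundle. The commands below produce a LIST format region file that is sorted by chromosome and position, with overlapping regions merged. The file starts with the @SQ header lines from the reference sequence dictionary: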

    e.g., for the Agilent SureSelect Human All Exon capture kit:

    grep "^@SQ" /bundle/genomic/gatkBundle_2.5/human_g1k_v37_decoy.dict > agilent_cre_v02.baits.list
    
    cat S30409818_Covered.bed \
        | grep "^chr" \
        | sort -k1,1V -k2,2n -k3,3n \
        | sed 's/^chr//g' \
        | bedtools merge -c 4 -o distinct -i - \
        | awk 'BEGIN {FS = OFS = "\t"} {print $1, $2, $3, "+", $4}' \
        >> agilent_cre_v02.baits.list
    
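  4. Create the slop50_list file in the vcpipe-bundle (the directory containing the bedtools executable must be in PATH). The commands below build a genome file with chromosome lengths, extend each region by 50 bp on both sides, and merge the resulting overlaps: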

    e.g., for the Agilent SureSelect Human All Exon capture kit:

    grep "^@SQ" /bundle/genomic/gatkBundle_2.5/human_g1k_v37_decoy.dict \
        | cut -f2,3 \
        | awk -F"[:\t]" 'BEGIN {OFS = "\t"} {print $2, $4}' \
        > hg19.genome
    
    bedtools slop -b 50 -i agilent_cre_v02.baits.bed -g hg19.genome \
        > agilent_cre_v02.baits.slop50.bed
    
    grep "^@SQ" /bundle/genomic/gatkBundle_2.5/human_g1k_v37_decoy.dict \
        > agilent_cre_v02.baits.slop50.merged.list
    
    bedtools merge -c 4 -o distinct -i agilent_cre_v02.baits.slop50.bed \
        | awk 'BEGIN {FS = OFS = "\t"} {print $1, $2, $3, "+", $4}' \
        >> agilent_cre_v02.baits.slop50.merged.list
    
  5. Create a directory for the new capture kit in the captureKit directory of the vcpipe-bundle repository and store all the newly generated files there.
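
Before committing the files, it may be worth checking that a generated BED file really is position-sorted and free of overlaps, as steps 2 and 4 intend. The following is a minimal sketch using only awk; the function name `check_bed` and the sample data are hypothetical and are not part of the in-house scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the in-house scripts):
# return non-zero if consecutive regions on the same chromosome are
# out of order or overlap. Input is a tab-separated BED file.
check_bed() {
    awk 'BEGIN { FS = "\t"; bad = 0 }
         $1 == chrom {
             if ($2 + 0 < prev_start) {
                 printf "unsorted at line %d\n", NR; bad = 1
             } else if ($2 + 0 < prev_end) {
                 printf "overlap at line %d\n", NR; bad = 1
             }
         }
         { chrom = $1; prev_start = $2 + 0; prev_end = $3 + 0 }
         END { exit bad }' "$1"
}

# Demonstration on a small made-up file in the same six-column layout
# as the noslop_bed output (chrom, start, end, name, score, strand).
sample=$(mktemp)
printf '1\t100\t200\tA\t0\t+\n1\t300\t400\tB\t0\t+\n2\t50\t80\tC\t0\t+\n' > "$sample"
if check_bed "$sample"; then
    echo "sample BED is sorted and non-overlapping"
fi
rm -f "$sample"
```

In practice the function would be pointed at the real output, for example `check_bed agilent_cre_v02.baits.bed`. The check only compares neighbouring lines, so it assumes all regions of a chromosome are grouped together, which the sort in step 2 provides.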