HTS Bioinf - Create region files for the new capture kits

Scope

This procedure explains how to generate region files for a new capture kit. The region files are used in the variant calling pipeline.

Responsibility

A bioinformatician is responsible for running the commands that turn the input data into region files. The input files can be downloaded from the capture kit provider.


Procedure

Tools used: Linux commands, bedtools and in-house scripts.

  1. Obtain the probe/bait regions from the capture kit provider (e.g. the probe/bait regions for the Agilent SureSelect Human All Exon capture kit are in *_Covered.bed).
  2. Create the noslop_bed file in the vcpipe-bundle. The commands below produce a BED format region file that is sorted by chromosome and position and contains no overlapping regions:

    e.g. for Agilent SureSelect Human All Exon capture kit:

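    # Keep only "chr" lines, sort by chromosome/start/end, drop the "chr" prefix,
    # merge overlapping regions and append score ("0") and strand ("+") columns.
    cat S30409818_Covered.bed |\
        grep "^chr" |\
        sort -k1,1V -k2,2n -k3,3n |\
        sed 's/^chr//' |\
        bedtools merge -c 4 -o distinct -i - |\
        awk 'BEGIN {OFS="\t"} {print $0, "0", "+"}' > agilent_cre_v02.baits.bed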
    
  3. Create the noslop_list file in the vcpipe-bundle. The commands below produce a LIST format region file that is sorted by chromosome and position and contains no overlapping regions:

    e.g. for Agilent SureSelect Human All Exon capture kit:

    # Start the list file with the @SQ header lines from the reference dictionary.
    grep "^@SQ" /bundle/genomic/gatkBundle_2.5/human_g1k_v37_decoy.dict > agilent_cre_v02.baits.list
    
    # Append the merged regions as: chromosome, start, end, strand, name.
    cat S30409818_Covered.bed |\
        grep "^chr" |\
        sort -k1,1V -k2,2n -k3,3n |\
        sed 's/^chr//' |\
        bedtools merge -c 4 -o distinct -i - |\
        awk 'BEGIN {FS=OFS="\t"} {print $1, $2, $3, "+", $4}' >> agilent_cre_v02.baits.list
    
  4. Create the slop50_list file in the vcpipe-bundle (bedtools must be in the PATH; this step applies only to the Agilent SureSelect Human All Exon capture kit):

    grep "^@SQ" /bundle/genomic/gatkBundle_2.5/human_g1k_v37_decoy.dict |\
        cut -f2,3 |\
        awk -F"[:\t]" 'OFS="\t" {print $2,$4}' > hg19.genome
    
    bedtools slop -b 50 -i agilent_cre_v02.baits.bed -g hg19.genome > agilent_cre_v02.baits.slop50.bed
    
    grep "^@SQ" /bundle/genomic/gatkBundle_2.5/human_g1k_v37_decoy.dict > agilent_cre_v02.baits.slop50.merged.list
    
    bedtools merge -c 4 -o distinct -i agilent_cre_v02.baits.slop50.bed |\
        awk 'BEGIN {FS=OFS="\t"} {print $1, $2, $3, "+", $4}' >> agilent_cre_v02.baits.slop50.merged.list
    
  5. Create a directory for the capture kit in the captureKit directory of the vcpipe-bundle repository and store all newly generated files there.
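
Before storing the files, it may be worth confirming that a generated BED file really is sorted and non-overlapping, as steps 2 and 4 intend. The following is a minimal sketch of such a check, not part of the documented procedure: the function name check_bed is hypothetical, and it assumes a tab-separated file in which all lines for a chromosome are contiguous. It uses only awk, so bedtools is not required.

```shell
# Hypothetical helper (not an in-house script): report lines in a BED file that
# still carry a "chr" prefix, have end <= start, or start before the previous
# region on the same chromosome ends (overlap or wrong order).
# Returns a non-zero exit status if any problem is found.
check_bed() {
    awk -F'\t' '
        $1 ~ /^chr/         { printf "line %d: chr prefix not removed\n", NR; bad = 1 }
        $3 + 0 <= $2 + 0    { printf "line %d: end <= start\n", NR; bad = 1 }
        $1 == chrom && $2 + 0 < prev_end {
            printf "line %d: overlaps previous region or is out of order\n", NR
            bad = 1
        }
        { chrom = $1; prev_end = $3 + 0 }
        END { exit bad + 0 }
    ' "$1"
}

# Example usage (file name taken from step 2):
#   check_bed agilent_cre_v02.baits.bed && echo "BED file looks consistent"
```

The check is deliberately narrow: it does not verify that chromosome names exist in the reference dictionary or that the LIST files match the BED files.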