HTS Bioinf - Trio pipeline
Summary
The trio pipeline is implemented in Nextflow.io for processing trio data. It jointly analyses sequencing data from a proband (child) and the proband's mother and father to detect de novo variants or proband-specific recessive variants.
This document describes the data processing in the pipeline environment under the assumption that the basepipe pipeline was run successfully for all members of the trio.
Responsibility
Execution of the production pipeline is performed by a bioinformatician from "Enhet for Genomdiagnostikk" (GDx). The bioinformatician must have been trained for production according to procedure HTS Bioinf - Training for running pipeline.
Trio pipeline implementation
Input
The important inputs are the .g.vcf.gz files generated by the basepipe pipeline for all three members of the trio.
Output
The important output files of the pipeline are explained below:
- The joint variant calling VCF file after quality annotation
- Gender and pedigree check results
Usage and requirements
The pipeline is implemented in the vcpipe repository. It is primarily meant to be run as part of the included automation system. For more information, see the scripts in the vcpipe repository. Its dependencies include vcpipe-essentials, vcpipe-refdata and sensitive-db.
Pipeline stages
Pre-processing
In this stage, the utility code verifies that:
- the .analysis and .sample configuration files are sane;
- the pedigree described in those configuration files is consistent;
- gene panel, capture kit, and bundle files are present;
- basepipe results are available.
A pedigree .ped file is created at this stage.
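The layout of such a pedigree file can be illustrated with a minimal sketch. It follows the standard six-column PED format (family ID, individual ID, father ID, mother ID, sex, phenotype). The function name `trio_ped_lines`, the sample IDs, and the choice to mark the proband as affected and the parents as unaffected are assumptions made for illustration, not the pipeline's own utility code.

```python
from dataclasses import dataclass

# Standard PED codes:
#   sex:       1 = male, 2 = female, 0 = unknown
#   phenotype: 1 = unaffected, 2 = affected, 0 = missing


@dataclass
class TrioMember:
    sample_id: str
    sex: int
    affected: bool
    father_id: str = "0"  # "0" means the parent is not part of the pedigree
    mother_id: str = "0"


def trio_ped_lines(family_id, proband_id, proband_sex, father_id, mother_id):
    """Return tab-separated PED lines for a proband, father and mother.

    Hypothetical helper: assumes the proband is affected and both
    parents are unaffected.
    """
    members = [
        TrioMember(proband_id, proband_sex, True, father_id, mother_id),
        TrioMember(father_id, 1, False),
        TrioMember(mother_id, 2, False),
    ]
    return [
        "\t".join(
            [
                family_id,
                m.sample_id,
                m.father_id,
                m.mother_id,
                str(m.sex),
                "2" if m.affected else "1",
            ]
        )
        for m in members
    ]
```

Writing the returned lines to a file, one per line, gives a .ped file that tools reading the PED format can consume.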
Joint variant calling stage
For samples processed with the GATK pipeline, the .g.vcf.gz files generated by the basepipe pipeline for proband, mother and father are merged in the GenotypeGVCFs step, following the best practice workflow for variant discovery.
After that, variant filtering is performed in the same way as for single samples. See procedure HTS Bioinf – Basepipe pipeline for details on tools and specifications.
For samples processed with the Dragen pipeline, the .g.vcf.gz files generated by the basepipe pipeline for proband, mother and father are merged by the Dragen software based on DRAGEN recommendations (default settings unless explicitly set).
Input: .g.vcf.gz files generated for all three members of the trio by the basepipe pipeline
Output: the joint variant calling VCF file after quality annotation.
Processes included in the Nextflow script: triopipe_variantcalling_join and dragen_trio
Merge mitochondrial variants stage
The mitochondrial SNP and small indel VCF files generated by the basepipe pipeline for proband and mother are merged using BCFtools.
Input: mitochondrial variants in .vcf.gz files generated by the basepipe pipeline for proband and mother.
Output: the joint variant calling VCF file after quality annotation
Processes included in the Nextflow script: triopipe_mt_variantcalling_join
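As a rough illustration of the merge step, the sketch below builds a `bcftools merge` command line for two or more per-sample VCF files. The exact options used by the pipeline are not stated in this document, so the function name `bcftools_merge_command`, the file names and the chosen options (`--output-type z`, `--output`) are assumptions. Note that `bcftools merge` expects bgzip-compressed, indexed input files.

```python
def bcftools_merge_command(input_vcfs, output_vcf):
    """Build (but do not run) a bcftools merge command.

    Hypothetical helper: the real pipeline may use different options.
    """
    if len(input_vcfs) < 2:
        raise ValueError("merging needs at least two input VCF files")
    return [
        "bcftools",
        "merge",
        "--output-type", "z",  # write bgzip-compressed VCF
        "--output", output_vcf,
        *input_vcfs,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a system where BCFtools is installed.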
Gender and pedigree test stage
The open source software VCFped enables detection of trios and close pairwise relationships alongside sex estimation in a multi-sample VCF file. A .ped format pedigree file can be provided to VCFped for comparison. The test fails if its results do not match the provided pedigree. Sex is estimated by calculating the heterozygosity rate in chromosome X. When the rate is lower than 10%, the estimate is male. When the rate exceeds 25%, the estimate is female.
Input: the joint variant calling VCF file after quality annotation and a .ped format pedigree file
Output: outputs from VCFped
Processes included in the Nextflow script: pedigree_gender_check
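The threshold rule for sex estimation can be sketched as follows. This is not VCFped's code: the function name `estimate_sex`, the definition of the rate as heterozygous calls divided by all called genotypes on chromosome X, and the "unknown" label for rates between the two thresholds are assumptions for illustration.

```python
def estimate_sex(het_calls, total_calls, male_max=0.10, female_min=0.25):
    """Estimate sex from the chromosome X heterozygosity rate.

    Assumed rule: rate < 10% -> male, rate > 25% -> female,
    anything in between (or no calls at all) -> unknown.
    """
    if total_calls <= 0:
        return "unknown"
    rate = het_calls / total_calls
    if rate < male_max:
        return "male"
    if rate > female_min:
        return "female"
    return "unknown"
```

For example, a sample with 3 heterozygous calls out of 100 called chromosome X genotypes would be estimated as male, and one with 40 out of 100 as female.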
Integrated quality control report stage
The sex and pedigree test results will be incorporated into different quality control reports with detailed quality control metrics for all three members of the trio.
Input: results from VCFped and quality control results for all members of the trio
Output: .qc_result.json, .qc_report.md and .qc_warnings.md
hap.py stage
In the hap.py stage, the predicted SNP and small indel results from the joint variant calling are compared with the high confidence SNP and small indel calls provided by GIAB. This stage is applied only to control samples (HG002, HG003 and HG004). The results are used for the samples that need trend analysis (see procedure: HTS - Use of reference materials for internal quality control).
Input: Quality annotated VCF files for the whole capture kit region
Output: Sensitivity and PPV in the defined region
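The two reported metrics follow the usual definitions: sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP), where TP, FP and FN are the true positive, false positive and false negative variant counts from the comparison against the reference calls. A minimal sketch, with the function name `sensitivity_and_ppv` and the example counts chosen for illustration only:

```python
def sensitivity_and_ppv(tp, fp, fn):
    """Return (sensitivity, PPV) from true/false positive and false negative counts.

    A metric is NaN when its denominator is zero.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv
```

With 90 true positives, 10 false positives and 30 false negatives, this gives a sensitivity of 0.75 and a PPV of 0.9.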
POST command stage
In this stage, the quality control parameters will be checked and the files listed in the configuration files will be copied to the durable/production/data/analyses-results/trios directory.
Input: No input (this stage will start only when all other processes are finished)
Output: No output
Processes included in the Nextflow script: postcmd_triopipe