
WES/WGS Cohort Joint-Genotyping Pipeline

A user-oriented guide for multi-sample joint genotyping using GenomicsDB, GenotypeGVCFs, and VQSR.

Sources: Bash, Snakemake, Nextflow, Cromwell


Diagram: Cohort Joint-Genotyping Workflow



Purpose

This pipeline combines many per-sample gVCFs and performs cohort-level joint genotyping, producing a single VCF for all samples with consistent genotype calls and variant filtering.

Use this when:

  • You want consistent genotypes across a family or population.
  • You plan to build VQSR models on cohort-level variant distributions.
  • You are preparing a joint VCF for association or segregation analyses.

Inputs

  • Sample map file (sample_map in the CBIcall YAML): TSV used by GenomicsDBImport (sample name → gVCF path).
  • Per-sample gVCFs from the single-sample pipeline.
  • Reference genome (REF from env.sh).
  • VQSR resources (SNP and INDEL training sets).
  • Optional interval list for WES mode.

WES interval source

The interval resource depends on the software stack. Legacy gatk-3.5 Bash WES uses Agilent SureSelect hg19 BED files, while current gatk-4.6 native WES uses the GATK bundle / Broad b37 exome interval list. See the FAQ for details.
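The sample map listed under Inputs is a plain two-column TSV (sample name, then gVCF path), which is the format GenomicsDBImport reads through its --sample-name-map option. A minimal sketch of writing one is shown here; the function name write_sample_map and the example paths are hypothetical and not part of CBIcall.

```python
import csv


def write_sample_map(gvcfs, out_path):
    """Write one 'sample<TAB>gVCF path' row per sample and return the row count.

    `gvcfs` is an iterable of (sample_name, gvcf_path) pairs.
    """
    seen = set()
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sample, gvcf in gvcfs:
            # Reject repeated sample names early instead of writing an
            # ambiguous map.
            if sample in seen:
                raise ValueError(f"duplicate sample name: {sample}")
            seen.add(sample)
            writer.writerow([sample, str(gvcf)])
    return len(seen)
```

For example, write_sample_map([("S1", "/data/S1.g.vcf.gz")], "sample_map.tsv") writes a single row, and the resulting file path is what sample_map in the YAML would point to.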


Execution Modes

Standard cohort mode runs as a single job, from GenomicsDB import through final filtering, in the steps below.

1. GenomicsDBImport

  • Imports all gVCFs into a GenomicsDB workspace.
  • Handles both WES (interval-limited) and WGS (whole genome) modes.
  • Output: on-disk database under 01_genomicsdb/, accessed as gendb://<workspace>.
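The import step can be sketched as a GATK4 command line assembled in Python. This is an illustration under assumptions, not the pipeline's actual invocation: the helper name and the workspace name in the usage note are hypothetical, and only the standard GenomicsDBImport options --sample-name-map, --genomicsdb-workspace-path and -L are used.

```python
def genomicsdb_import_cmd(sample_map, workspace, intervals):
    """Build (but do not run) a GATK4 GenomicsDBImport argument list."""
    if not intervals:
        # GenomicsDBImport needs at least one interval, so a WGS run still
        # has to pass contigs or an interval list.
        raise ValueError("at least one interval is required")
    cmd = [
        "gatk", "GenomicsDBImport",
        "--sample-name-map", sample_map,
        "--genomicsdb-workspace-path", workspace,
    ]
    for interval in intervals:
        # Each entry may be a contig name, a region, or an interval file.
        cmd += ["-L", interval]
    return cmd
```

A WES run would pass the exome interval list as the single interval, for example genomicsdb_import_cmd("sample_map.tsv", "01_genomicsdb/cohort", ["exome.interval_list"]); the list can then be handed to subprocess.run(cmd, check=True).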

2. Joint Genotyping (GenotypeGVCFs)

  • Runs GenotypeGVCFs on gendb://<workspace> and the reference.

  • Produces the cohort-level VCF:

    • cohort.gv.raw.vcf.gz
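As a rough sketch, the joint-genotyping call reduces to three arguments: the reference, the workspace addressed with the gendb:// prefix, and the output VCF. The helper name below is hypothetical, and the pipeline may pass additional options that are not shown in this guide.

```python
def genotype_gvcfs_cmd(reference, workspace, out_vcf="cohort.gv.raw.vcf.gz"):
    """Build (but do not run) a GATK4 GenotypeGVCFs argument list."""
    return [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        # The GenomicsDB workspace is passed as a gendb:// URI, not a path.
        "-V", f"gendb://{workspace}",
        "-O", out_vcf,
    ]
```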

3. Count Variants and Decide on VQSR

  • Counts SNPs and INDELs in the raw cohort VCF.
  • Compares counts to configurable thresholds:
    • MIN_SNP_FOR_VQSR (default 1000)
    • MIN_INDEL_FOR_VQSR (default 8000)
  • Determines whether to build SNP and/or INDEL VQSR models.
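The counting and threshold check above can be illustrated with a short sketch. It assumes a simple allele-length classification (all alleles one base long means SNP, any length difference means INDEL) and a "greater than or equal" comparison; the guide does not state how the pipeline itself classifies records or treats a count exactly at the threshold, so both are assumptions.

```python
def classify(ref, alt):
    """Return 'SNP', 'INDEL', or None for one record's REF and ALT fields."""
    # Ignore missing, spanning-deletion and symbolic alleles.
    alleles = [a for a in alt.split(",")
               if a not in (".", "*") and not a.startswith("<")]
    if not alleles:
        return None
    if len(ref) == 1 and all(len(a) == 1 for a in alleles):
        return "SNP"
    if any(len(a) != len(ref) for a in alleles):
        return "INDEL"
    return None  # e.g. multi-nucleotide substitutions


def count_variants(vcf_lines):
    """Count SNP and INDEL records in an iterable of VCF text lines."""
    counts = {"SNP": 0, "INDEL": 0}
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        kind = classify(fields[3], fields[4])  # REF and ALT columns
        if kind:
            counts[kind] += 1
    return counts


def decide_vqsr(counts, min_snp=1000, min_indel=8000):
    """Compare counts with MIN_SNP_FOR_VQSR / MIN_INDEL_FOR_VQSR defaults."""
    return {
        "SNP": counts["SNP"] >= min_snp,
        "INDEL": counts["INDEL"] >= min_indel,
    }
```

For a compressed VCF, the lines can come from gzip.open(path, "rt"). With the default thresholds, a small family cohort restricted to exome intervals may fall below both limits, in which case neither model is built.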

4. Build VQSR SNP Model

  • If enough SNPs:
    • Run VariantRecalibrator in SNP mode.
    • Uses training resources and multiple annotations:
      • QD, FS, MQ, MQRankSum, ReadPosRankSum.
    • Outputs:
      • cohort.snp.recal.vcf.gz
      • cohort.snp.tranches.txt

5. Build VQSR INDEL Model

  • If enough INDELs:
    • Run VariantRecalibrator in INDEL mode.
    • Uses annotations:
      • QD, FS, ReadPosRankSum.
    • Outputs:
      • cohort.indel.recal.vcf.gz
      • cohort.indel.tranches.txt
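Steps 4 and 5 differ only in the mode, the annotation list, and the output names, so a single sketch can cover both. The annotation sets and filenames come from the lists above; the helper name is hypothetical, and the training resources are left as a parameter because this guide does not list them.

```python
ANNOTATIONS = {
    "SNP": ["QD", "FS", "MQ", "MQRankSum", "ReadPosRankSum"],
    "INDEL": ["QD", "FS", "ReadPosRankSum"],
}


def variant_recalibrator_cmd(mode, reference, in_vcf, resources,
                             basename="cohort"):
    """Build (but do not run) a GATK4 VariantRecalibrator argument list.

    `resources` is a list of (tags, path) pairs, where `tags` is the text
    that follows '--resource:' (name, known/training/truth flags, prior).
    """
    prefix = f"{basename}.{mode.lower()}"
    cmd = ["gatk", "VariantRecalibrator",
           "-R", reference, "-V", in_vcf, "-mode", mode]
    for tags, path in resources:
        cmd += [f"--resource:{tags}", path]
    for annotation in ANNOTATIONS[mode]:
        cmd += ["-an", annotation]
    cmd += ["-O", f"{prefix}.recal.vcf.gz",
            "--tranches-file", f"{prefix}.tranches.txt"]
    return cmd
```

A resource entry would look like ("hapmap,known=false,training=true,truth=true,prior=15.0", "hapmap.vcf.gz"); the file name and prior here are illustrative placeholders, not the pipeline's configured values.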

6. Apply VQSR

  • If SNP model exists:
    • Apply SNP VQSR → cohort.post_snp.vcf.gz.
  • If INDEL model exists:
    • Apply INDEL VQSR → cohort.vqsr.vcf.gz.

The most processed VCF available (VQSR-filtered if at least one model was applied, otherwise the raw cohort VCF) is used as input to the next step.
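The chaining logic of this step can be sketched as a small planning function that returns the ApplyVQSR passes to run and the VCF handed to hard filtering. It assumes the INDEL pass reads the SNP-filtered VCF when one exists and the raw VCF otherwise; the guide does not spell out the INDEL-only case, so that branch is an assumption.

```python
def plan_apply_vqsr(has_snp_model, has_indel_model, basename="cohort"):
    """Return ([(mode, input_vcf, output_vcf), ...], final_vcf)."""
    current = f"{basename}.gv.raw.vcf.gz"
    steps = []
    if has_snp_model:
        out = f"{basename}.post_snp.vcf.gz"
        steps.append(("SNP", current, out))
        current = out
    if has_indel_model:
        # Assumption: with no SNP model, the INDEL pass starts from the raw VCF.
        out = f"{basename}.vqsr.vcf.gz"
        steps.append(("INDEL", current, out))
        current = out
    return steps, current
```

When neither model was built, the plan is empty and the raw cohort VCF goes straight to hard filtering.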

7. Hard Filtering and QC VCF

  • Run VariantFiltration with the GATK 4.6 hard filters below.
  • Output: cohort.gv.QC.vcf.gz.

Filter name  Expression
LowQUAL      QUAL < 30.0
QD2          QD < 2.0, when QD is present
FS60         FS > 60.0
MQ40         MQ < 40.0
MQRS-12.5    MQRankSum < -12.5, when MQRankSum is present
RPRS-8       ReadPosRankSum < -8.0, when ReadPosRankSum is present
QD2_indel    QD < 2.0, when QD is present
FS200        FS > 200.0
RPRS-20      ReadPosRankSum < -20.0, when ReadPosRankSum is present
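One way to turn the filter table into VariantFiltration arguments is sketched below. The "when ... is present" condition is written with the JEXL guard vc.hasAttribute('X'), a common GATK idiom; whether the pipeline uses this exact form, and how it restricts filters to SNPs or INDELs, is not stated in this guide, so treat the expressions as assumptions.

```python
# (filter name, expression, annotation that must be present or None)
HARD_FILTERS = [
    ("LowQUAL", "QUAL < 30.0", None),
    ("QD2", "QD < 2.0", "QD"),
    ("FS60", "FS > 60.0", None),
    ("MQ40", "MQ < 40.0", None),
    ("MQRS-12.5", "MQRankSum < -12.5", "MQRankSum"),
    ("RPRS-8", "ReadPosRankSum < -8.0", "ReadPosRankSum"),
    ("QD2_indel", "QD < 2.0", "QD"),
    ("FS200", "FS > 200.0", None),
    ("RPRS-20", "ReadPosRankSum < -20.0", "ReadPosRankSum"),
]


def filter_args(filters=HARD_FILTERS):
    """Build --filter-name / --filter-expression pairs for VariantFiltration."""
    args = []
    for name, expression, guard in filters:
        if guard:
            # Only evaluate the comparison when the annotation exists.
            expression = f"vc.hasAttribute('{guard}') && {expression}"
        args += ["--filter-name", name, "--filter-expression", expression]
    return args
```

The returned list can be appended to a base command such as ["gatk", "VariantFiltration", "-R", reference, "-V", in_vcf, "-O", "cohort.gv.QC.vcf.gz"].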

This QC VCF is the primary cohort workflow output for downstream tools and project-level review.


Output Files

In these filenames, gv means GenotypeGVCFs: the raw VCF is the direct joint-genotyped output from that GATK step, and the QC VCF is the filtered version used downstream.

The default basename is cohort. If output_basename is set, replace cohort below with that value.

File                              Description
cohort.gv.raw.vcf.gz              Raw cohort joint-genotyped VCF
cohort.snp.recal.vcf.gz           SNP VQSR model VCF
cohort.snp.tranches.txt           SNP VQSR tranches and diagnostics
cohort.indel.recal.vcf.gz         INDEL VQSR model VCF
cohort.indel.tranches.txt         INDEL VQSR tranches and diagnostics
cohort.post_snp.vcf.gz            VCF after applying SNP VQSR
cohort.vqsr.vcf.gz                VCF after applying SNP and INDEL VQSR
cohort.gv.QC.vcf.gz               Final hard-filtered cohort QC VCF (recommended)
logs/cohort_joint_genotyping.log  Main log file for the cohort pipeline

When to Use This Pipeline

  • After you have gVCFs from the single-sample pipeline.
  • When you need a single VCF for all samples for:
    • Family-based segregation analysis.
    • Case/control or population cohorts.
    • Downstream tools that expect joint genotypes.