Skip to content

WES/WGS Cohort Joint-Genotyping Pipeline

A user-oriented guide for multi-sample joint genotyping using GenomicsDB, GenotypeGVCFs, and VQSR.

See Bash pipeline:
#!/usr/bin/env bash
#
# w[eg]s_cohort_genomicsdb_with_vqsr.sh
# Cohort joint-genotyping wrapper using GenomicsDBImport -> GenotypeGVCFs -> VQSR/Hard-filters (GATK 4.6)
# Last Modified: 2025-10-13
set -eu
set -o pipefail

function usage {
  cat <<EOF
Usage: $0 -m <sample_map.tsv> [-p wes|wgs] [-w <workspace>] [-t <threads>]

  -s  --sample-map   Sample map file for --sample-name-map (required)
  -p  --pipeline    'wes' (default) or 'wgs'
  -w  --workspace   GenomicsDB workspace path (default: ./cohort.genomicsdb.<job_id>)
  -h  --help        Show this help
EOF
  exit 1
}

# Parse args
PIPELINE="wes"
WORKSPACE=""
SAMPLE_MAP=""

while [[ $# -gt 0 ]]; do
  case "$1" in
    -m|--sample-map) SAMPLE_MAP="$2"; shift 2;;
    -p|--pipeline) PIPELINE="$2"; shift 2;;
    -w|--workspace) WORKSPACE="$2"; shift 2;;
    -t) THREADS="$2"; shift 2;;
    -h|--help) usage ;;
    *) echo "Unknown arg: $1" >&2; usage ;;
  esac
done

if [ -z "$SAMPLE_MAP" ]; then
  echo "Error: sample_map is required." >&2
  usage
fi

# Load parameters
BINDIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$BINDIR/parameters.sh"

# Prepare output directories and logging
dir=$(pwd)
VARCALLDIR=$dir/02_varcall
LOGDIR=$dir/logs
mkdir -p "$VARCALLDIR"
mkdir -p "$LOGDIR"

cd $VARCALLDIR

LOG="$LOGDIR/cohort_joint_genotyping.log"

# pipeline mode
# Convert to Uppercase
PIPELINE=${PIPELINE^^}
if [[ "$PIPELINE" != "WES" && "$PIPELINE" != "WGS" ]]; then
  echo "Error: pipeline must be 'wes' or 'wgs'." >&2; exit 1
fi

# sample count & workspace default naming
if [ ! -s "$SAMPLE_MAP" ]; then
  echo "Error: sample_map '$SAMPLE_MAP' not found or empty." >&2; exit 1
fi
SAMPLE_COUNT=$(wc -l < "$SAMPLE_MAP" | tr -d ' ')
if [ -z "$WORKSPACE" ]; then
  WORKSPACE="${WORKSPACE}_${SAMPLE_COUNT}"
fi

# interval argument for WES vs WGS
# Set interval argument for WES vs WGS
if [ "$PIPELINE" = "WES" ]; then
  INTERVAL_ARG="-L $INTERVAL_LIST"
  echo "WES mode: restricting to $INTERVAL_LIST"
else
  INTERVAL_ARG=""
  echo "WGS mode: processing whole genome"
fi

# Derived output names
COHORT_RAW_VCF="cohort.gv.raw.vcf.gz"
COHORT_VQSR_SNP="cohort.snp.recal.vcf.gz"
COHORT_SNP_TRANCHES="cohort.snp.tranches.txt"
COHORT_VQSR_INDEL="cohort.indel.recal.vcf.gz"
COHORT_INDEL_TRANCHES="cohort.indel.tranches.txt"
COHORT_POST_SNP="cohort.post_snp.vcf.gz"
COHORT_POST_VQSR="cohort.vqsr.vcf.gz"
COHORT_QC_VCF="cohort.gv.QC.vcf.gz"

echo "## Cohort GenomicsDBImport -> Genotype -> VQSR/Hard-filter"
echo "sample_map: $SAMPLE_MAP"
echo "pipeline: $PIPELINE"
echo "sample_count: $SAMPLE_COUNT"
echo "workspace: $WORKSPACE"
echo "out_vcf: $COHORT_RAW_VCF"
echo "tmpdir: $TMPDIR"
echo "log: $LOG"
echo "" | tee -a "$LOG"

# -----------------------------------------------------------------------------
# Step 1: GenomicsDBImport
# -----------------------------------------------------------------------------
echo ">>> Step 1: GenomicsDBImport" | tee -a "$LOG"
mkdir -p "$(dirname "$WORKSPACE")"

set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS_64G GenomicsDBImport \
  --sample-name-map "$SAMPLE_MAP" \
  --genomicsdb-workspace-path "$WORKSPACE" \
  --merge-input-intervals true \
  $INTERVAL_ARG \
  --tmp-dir "$TMPDIR" \
  2>> "$LOG"
set +x

# -----------------------------------------------------------------------------
# Step 2: GenotypeGVCFs
# -----------------------------------------------------------------------------
echo ">>> Step 2: GenotypeGVCFs" | tee -a "$LOG"
if [ -z "${REF:-}" ]; then
  echo "Error: REF not set (expected to be defined in parameters.sh)." >&2
  exit 1
fi

set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS_64G GenotypeGVCFs \
  -R "$REF" \
  -V "gendb://$WORKSPACE" \
  -O "$COHORT_RAW_VCF" \
  --stand-call-conf 10 \
  --tmp-dir "$TMPDIR" \
  $INTERVAL_ARG \
  2>> "$LOG"
set +x

if [ $? -ne 0 ]; then
  echo "ERROR: GenotypeGVCFs failed. See log: $LOG" >&2
  exit 1
fi
echo "Genotyping completed. Raw cohort VCF: $COHORT_RAW_VCF" | tee -a "$LOG"

# -----------------------------------------------------------------------------
# Step 3: Count SNPs/INDELs and decide on VQSR
# -----------------------------------------------------------------------------
echo ">>> Step 3: Count variants and decide on VQSR" | tee -a "$LOG"
nSNP=$(zgrep -v '^#' "$COHORT_RAW_VCF" | awk 'length($5)==1' | wc -l)
nINDEL=$(zgrep -v '^#' "$COHORT_RAW_VCF" | awk 'length($5)!=1' | wc -l)
nSNP=$(echo "$nSNP" | tr -d ' ')
nINDEL=$(echo "$nINDEL" | tr -d ' ')
echo "Found SNPs: $nSNP ; INDELs: $nINDEL" | tee -a "$LOG"

apply_snp=false
apply_indel=false
minSNP=${MIN_SNP_FOR_VQSR:-1000}
minINDEL=${MIN_INDEL_FOR_VQSR:-8000}

# -----------------------------------------------------------------------------
# Step 4: VariantRecalibrator (SNP)
# -----------------------------------------------------------------------------
if (( nSNP >= minSNP )); then
  echo ">>> Building SNP VQSR model (VariantRecalibrator)" | tee -a "$LOG"
  set -x
  "$GATK4_BIN" $GATK4_JAVA_OPTS VariantRecalibrator \
    -R "$REF" \
    -V "$COHORT_RAW_VCF" \
    $SNP_RES \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
    --mode SNP \
    -O "$COHORT_VQSR_SNP" \
    --tranches-file "$COHORT_SNP_TRANCHES" \
    --max-gaussians 6 \
    --tmp-dir "$TMPDIR" \
    2>> "$LOG"
  set +x
  apply_snp=true
else
  echo "Skipping SNP VQSR (found $nSNP < min $minSNP)" | tee -a "$LOG"
fi

# -----------------------------------------------------------------------------
# Step 5: VariantRecalibrator (INDEL)
# -----------------------------------------------------------------------------
if (( nINDEL >= minINDEL )); then
  echo ">>> Building INDEL VQSR model (VariantRecalibrator)" | tee -a "$LOG"
  set -x
  "$GATK4_BIN" $GATK4_JAVA_OPTS VariantRecalibrator \
    -R "$REF" \
    -V "$COHORT_RAW_VCF" \
    $INDEL_RES \
    -an QD -an FS -an ReadPosRankSum \
    --mode INDEL \
    -O "$COHORT_VQSR_INDEL" \
    --tranches-file "$COHORT_INDEL_TRANCHES" \
    --max-gaussians 4 \
    --tmp-dir "$TMPDIR" \
    2>> "$LOG"
  set +x
  apply_indel=true
else
  echo "Skipping INDEL VQSR (found $nINDEL < min $minINDEL)" | tee -a "$LOG"
fi

# -----------------------------------------------------------------------------
# Step 6: Apply VQSR (if models exist)
# -----------------------------------------------------------------------------
echo ">>> Step 6: Apply VQSR (if available)" | tee -a "$LOG"
tmp_vcf="$COHORT_RAW_VCF"
if [ "$apply_snp" = true ]; then
  echo "Applying SNP recalibration..." | tee -a "$LOG"
  set -x
  "$GATK4_BIN" $GATK4_JAVA_OPTS ApplyVQSR \
    -R "$REF" -V "$tmp_vcf" \
    --recal-file "$COHORT_VQSR_SNP" \
    --tranches-file "$COHORT_SNP_TRANCHES" \
    --mode SNP --truth-sensitivity-filter-level 99.0 \
    -O "$COHORT_POST_SNP" \
    --tmp-dir "$TMPDIR" \
    2>> "$LOG"
  set +x
  tmp_vcf="$COHORT_POST_SNP"
fi

if [ "$apply_indel" = true ]; then
  echo "Applying INDEL recalibration..." | tee -a "$LOG"
  set -x
  "$GATK4_BIN" $GATK4_JAVA_OPTS ApplyVQSR \
    -R "$REF" -V "$tmp_vcf" \
    --recal-file "$COHORT_VQSR_INDEL" \
    --tranches-file "$COHORT_INDEL_TRANCHES" \
    --mode INDEL --truth-sensitivity-filter-level 95.0 \
    -O "$COHORT_POST_VQSR" \
    --tmp-dir "$TMPDIR" \
    2>> "$LOG"
  set +x
  tmp_vcf="$COHORT_POST_VQSR"
fi

# -----------------------------------------------------------------------------
# Step 7: Hard filters & write cohort QC VCF (always run for QC)
# (filters copied from your single-sample script)
# -----------------------------------------------------------------------------
echo ">>> Step 7: Hard-filter & write cohort QC VCF" | tee -a "$LOG"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS VariantFiltration \
  -R "$REF" \
  -V "$tmp_vcf" \
  --filter-name "LowQUAL" --filter-expression "QUAL < 30.0" \
  --filter-name "QD2"        --filter-expression "QD < 2.0" \
  --filter-name "FS60"       --filter-expression "FS > 60.0" \
  --filter-name "MQ40"       --filter-expression "MQ < 40.0" \
  --filter-name "MQRS-12.5"  --filter-expression "MQRankSum < -12.5" \
  --filter-name "RPRS-8"     --filter-expression "ReadPosRankSum < -8.0" \
  --filter-name "QD2_indel"  --filter-expression "QD < 2.0" \
  --filter-name "FS200"      --filter-expression "FS > 200.0" \
  --filter-name "RPRS-20"    --filter-expression "ReadPosRankSum < -20.0" \
  -O "$COHORT_QC_VCF" \
  --tmp-dir "$TMPDIR" \
  2>> "$LOG"
set +x

if [ $? -ne 0 ]; then
  echo "ERROR: VariantFiltration (cohort QC) failed. See log: $LOG" >&2
  exit 1
fi

echo "Cohort QC VCF written: $COHORT_QC_VCF" | tee -a "$LOG"

# Final message
echo "All done. Cohort raw: $COHORT_RAW_VCF ; Cohort QC: $COHORT_QC_VCF" | tee -a "$LOG"
exit 0

Diagram: Cohort Joint-Genotyping Workflow

flowchart TD A["Input per sample gVCF files"] B["Import gVCFs into GenomicsDB workspace"] C["Run GenotypeGVCFs to create cohort raw VCF"] D["Count SNP and INDEL variants"] E["Check SNP count threshold"] F["Check INDEL count threshold"] G["Build SNP VQSR model"] H["Build INDEL VQSR model"] I["Apply SNP VQSR"] J["Apply INDEL VQSR"] K["Choose best available VCF"] L["Apply hard filters"] M["Write cohort QC VCF"] A --> B --> C --> D D --> E -->|Enough SNPs| G --> I D --> F -->|Enough INDELs| H --> J C --> K I --> K J --> K K --> L --> M

Purpose

This pipeline combines many per-sample gVCFs and performs cohort-level joint genotyping, producing a single VCF for all samples with consistent genotype calls and variant filtering.

Use this when:

  • You want consistent genotypes across a family or population.
  • You plan to build VQSR models on cohort-level variant distributions.
  • You are preparing a joint VCF for association or segregation analyses.

Inputs

  • Sample map file (--sample-map):
    TSV used by GenomicsDBImport (sample name → gVCF path).
  • Per-sample gVCFs from the single-sample pipeline.
  • Reference genome (REF from parameters.sh).
  • VQSR resources (SNP and INDEL training sets).
  • Optional interval list for WES mode.

Workflow

1. GenomicsDBImport

  • Imports all gVCFs into a GenomicsDB workspace.
  • Handles both WES (interval-limited) and WGS (whole genome) modes.
  • Output: on-disk database accessed as gendb://<workspace>.

2. Joint Genotyping (GenotypeGVCFs)

  • Runs GenotypeGVCFs on gendb://<workspace> and the reference.
  • Produces the cohort-level VCF:

  • cohort.gv.raw.vcf.gz

3. Count Variants and Decide on VQSR

  • Counts SNPs and INDELs in the raw cohort VCF.
  • Compares counts to configurable thresholds:
  • MIN_SNP_FOR_VQSR (default 1000)
  • MIN_INDEL_FOR_VQSR (default 8000)
  • Determines whether to build SNP and/or INDEL VQSR models.

4. Build VQSR SNP Model

  • If enough SNPs:
  • Run VariantRecalibrator in SNP mode.
  • Uses training resources and multiple annotations:
    • QD, FS, MQ, MQRankSum, ReadPosRankSum.
  • Outputs:
    • cohort.snp.recal.vcf.gz
    • cohort.snp.tranches.txt.

5. Build VQSR INDEL Model

  • If enough INDELs:
  • Run VariantRecalibrator in INDEL mode.
  • Uses annotations:
    • QD, FS, ReadPosRankSum.
  • Outputs:
    • cohort.indel.recal.vcf.gz
    • cohort.indel.tranches.txt.

6. Apply VQSR

  • If SNP model exists:
  • Apply SNP VQSR → cohort.post_snp.vcf.gz.
  • If INDEL model exists:
  • Apply INDEL VQSR → cohort.vqsr.vcf.gz.

The best available VCF (VQSR-filtered or raw) is used as input to the next step.

7. Hard Filtering and QC VCF

  • Run VariantFiltration with a set of hard filters on:
  • QUAL, QD, FS, MQ, MQRankSum, ReadPosRankSum.
  • Output: cohort.gv.QC.vcf.gz.

This QC VCF is the recommended cohort VCF for downstream analysis.


Output Files

File Description
cohort.gv.raw.vcf.gz Raw cohort joint-genotyped VCF
cohort.snp.recal.vcf.gz SNP VQSR model VCF
cohort.snp.tranches.txt SNP VQSR tranches and diagnostics
cohort.indel.recal.vcf.gz INDEL VQSR model VCF
cohort.indel.tranches.txt INDEL VQSR tranches and diagnostics
cohort.post_snp.vcf.gz VCF after applying SNP VQSR
cohort.vqsr.vcf.gz VCF after applying SNP and INDEL VQSR
cohort.gv.QC.vcf.gz Final hard-filtered cohort QC VCF (recommended)
logs/cohort_joint_genotyping.log Main log file for the cohort pipeline

When to Use This Pipeline

  • After you have gVCFs from the single-sample pipeline.
  • When you need a single VCF for all samples for:
  • Family-based segregation analysis.
  • Case/control or population cohorts.
  • Downstream tools that expect joint genotypes.