WES/WGS Cohort Joint-Genotyping Pipeline¶
A user-oriented guide for multi-sample joint genotyping using GenomicsDB, GenotypeGVCFs, and VQSR.
See Bash pipeline:
#!/usr/bin/env bash
#
# w[eg]s_cohort_genomicsdb_with_vqsr.sh
# Cohort joint-genotyping wrapper using GenomicsDBImport -> GenotypeGVCFs -> VQSR/Hard-filters (GATK 4.6)
# Last Modified: 2025-10-13
set -eu
set -o pipefail
function usage {
cat <<EOF
Usage: $0 -m <sample_map.tsv> [-p wes|wgs] [-w <workspace>] [-t <threads>]
-s --sample-map Sample map file for --sample-name-map (required)
-p --pipeline 'wes' (default) or 'wgs'
-w --workspace GenomicsDB workspace path (default: ./cohort.genomicsdb.<job_id>)
-h --help Show this help
EOF
exit 1
}
# Parse args
PIPELINE="wes"
WORKSPACE=""
SAMPLE_MAP=""
while [[ $# -gt 0 ]]; do
case "$1" in
-m|--sample-map) SAMPLE_MAP="$2"; shift 2;;
-p|--pipeline) PIPELINE="$2"; shift 2;;
-w|--workspace) WORKSPACE="$2"; shift 2;;
-t) THREADS="$2"; shift 2;;
-h|--help) usage ;;
*) echo "Unknown arg: $1" >&2; usage ;;
esac
done
if [ -z "$SAMPLE_MAP" ]; then
echo "Error: sample_map is required." >&2
usage
fi
# Load parameters
BINDIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$BINDIR/parameters.sh"
# Prepare output directories and logging
dir=$(pwd)
VARCALLDIR=$dir/02_varcall
LOGDIR=$dir/logs
mkdir -p "$VARCALLDIR"
mkdir -p "$LOGDIR"
cd $VARCALLDIR
LOG="$LOGDIR/cohort_joint_genotyping.log"
# pipeline mode
# Convert to Uppercase
PIPELINE=${PIPELINE^^}
if [[ "$PIPELINE" != "WES" && "$PIPELINE" != "WGS" ]]; then
echo "Error: pipeline must be 'wes' or 'wgs'." >&2; exit 1
fi
# sample count & workspace default naming
if [ ! -s "$SAMPLE_MAP" ]; then
echo "Error: sample_map '$SAMPLE_MAP' not found or empty." >&2; exit 1
fi
SAMPLE_COUNT=$(wc -l < "$SAMPLE_MAP" | tr -d ' ')
if [ -z "$WORKSPACE" ]; then
WORKSPACE="${WORKSPACE}_${SAMPLE_COUNT}"
fi
# interval argument for WES vs WGS
# Set interval argument for WES vs WGS
if [ "$PIPELINE" = "WES" ]; then
INTERVAL_ARG="-L $INTERVAL_LIST"
echo "WES mode: restricting to $INTERVAL_LIST"
else
INTERVAL_ARG=""
echo "WGS mode: processing whole genome"
fi
# Derived output names
COHORT_RAW_VCF="cohort.gv.raw.vcf.gz"
COHORT_VQSR_SNP="cohort.snp.recal.vcf.gz"
COHORT_SNP_TRANCHES="cohort.snp.tranches.txt"
COHORT_VQSR_INDEL="cohort.indel.recal.vcf.gz"
COHORT_INDEL_TRANCHES="cohort.indel.tranches.txt"
COHORT_POST_SNP="cohort.post_snp.vcf.gz"
COHORT_POST_VQSR="cohort.vqsr.vcf.gz"
COHORT_QC_VCF="cohort.gv.QC.vcf.gz"
echo "## Cohort GenomicsDBImport -> Genotype -> VQSR/Hard-filter"
echo "sample_map: $SAMPLE_MAP"
echo "pipeline: $PIPELINE"
echo "sample_count: $SAMPLE_COUNT"
echo "workspace: $WORKSPACE"
echo "out_vcf: $COHORT_RAW_VCF"
echo "tmpdir: $TMPDIR"
echo "log: $LOG"
echo "" | tee -a "$LOG"
# -----------------------------------------------------------------------------
# Step 1: GenomicsDBImport
# -----------------------------------------------------------------------------
echo ">>> Step 1: GenomicsDBImport" | tee -a "$LOG"
mkdir -p "$(dirname "$WORKSPACE")"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS_64G GenomicsDBImport \
--sample-name-map "$SAMPLE_MAP" \
--genomicsdb-workspace-path "$WORKSPACE" \
--merge-input-intervals true \
$INTERVAL_ARG \
--tmp-dir "$TMPDIR" \
2>> "$LOG"
set +x
# -----------------------------------------------------------------------------
# Step 2: GenotypeGVCFs
# -----------------------------------------------------------------------------
echo ">>> Step 2: GenotypeGVCFs" | tee -a "$LOG"
if [ -z "${REF:-}" ]; then
echo "Error: REF not set (expected to be defined in parameters.sh)." >&2
exit 1
fi
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS_64G GenotypeGVCFs \
-R "$REF" \
-V "gendb://$WORKSPACE" \
-O "$COHORT_RAW_VCF" \
--stand-call-conf 10 \
--tmp-dir "$TMPDIR" \
$INTERVAL_ARG \
2>> "$LOG"
set +x
if [ $? -ne 0 ]; then
echo "ERROR: GenotypeGVCFs failed. See log: $LOG" >&2
exit 1
fi
echo "Genotyping completed. Raw cohort VCF: $COHORT_RAW_VCF" | tee -a "$LOG"
# -----------------------------------------------------------------------------
# Step 3: Count SNPs/INDELs and decide on VQSR
# -----------------------------------------------------------------------------
echo ">>> Step 3: Count variants and decide on VQSR" | tee -a "$LOG"
nSNP=$(zgrep -v '^#' "$COHORT_RAW_VCF" | awk 'length($5)==1' | wc -l)
nINDEL=$(zgrep -v '^#' "$COHORT_RAW_VCF" | awk 'length($5)!=1' | wc -l)
nSNP=$(echo "$nSNP" | tr -d ' ')
nINDEL=$(echo "$nINDEL" | tr -d ' ')
echo "Found SNPs: $nSNP ; INDELs: $nINDEL" | tee -a "$LOG"
apply_snp=false
apply_indel=false
minSNP=${MIN_SNP_FOR_VQSR:-1000}
minINDEL=${MIN_INDEL_FOR_VQSR:-8000}
# -----------------------------------------------------------------------------
# Step 4: VariantRecalibrator (SNP)
# -----------------------------------------------------------------------------
if (( nSNP >= minSNP )); then
echo ">>> Building SNP VQSR model (VariantRecalibrator)" | tee -a "$LOG"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS VariantRecalibrator \
-R "$REF" \
-V "$COHORT_RAW_VCF" \
$SNP_RES \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
--mode SNP \
-O "$COHORT_VQSR_SNP" \
--tranches-file "$COHORT_SNP_TRANCHES" \
--max-gaussians 6 \
--tmp-dir "$TMPDIR" \
2>> "$LOG"
set +x
apply_snp=true
else
echo "Skipping SNP VQSR (found $nSNP < min $minSNP)" | tee -a "$LOG"
fi
# -----------------------------------------------------------------------------
# Step 5: VariantRecalibrator (INDEL)
# -----------------------------------------------------------------------------
if (( nINDEL >= minINDEL )); then
echo ">>> Building INDEL VQSR model (VariantRecalibrator)" | tee -a "$LOG"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS VariantRecalibrator \
-R "$REF" \
-V "$COHORT_RAW_VCF" \
$INDEL_RES \
-an QD -an FS -an ReadPosRankSum \
--mode INDEL \
-O "$COHORT_VQSR_INDEL" \
--tranches-file "$COHORT_INDEL_TRANCHES" \
--max-gaussians 4 \
--tmp-dir "$TMPDIR" \
2>> "$LOG"
set +x
apply_indel=true
else
echo "Skipping INDEL VQSR (found $nINDEL < min $minINDEL)" | tee -a "$LOG"
fi
# -----------------------------------------------------------------------------
# Step 6: Apply VQSR (if models exist)
# -----------------------------------------------------------------------------
echo ">>> Step 6: Apply VQSR (if available)" | tee -a "$LOG"
tmp_vcf="$COHORT_RAW_VCF"
if [ "$apply_snp" = true ]; then
echo "Applying SNP recalibration..." | tee -a "$LOG"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS ApplyVQSR \
-R "$REF" -V "$tmp_vcf" \
--recal-file "$COHORT_VQSR_SNP" \
--tranches-file "$COHORT_SNP_TRANCHES" \
--mode SNP --truth-sensitivity-filter-level 99.0 \
-O "$COHORT_POST_SNP" \
--tmp-dir "$TMPDIR" \
2>> "$LOG"
set +x
tmp_vcf="$COHORT_POST_SNP"
fi
if [ "$apply_indel" = true ]; then
echo "Applying INDEL recalibration..." | tee -a "$LOG"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS ApplyVQSR \
-R "$REF" -V "$tmp_vcf" \
--recal-file "$COHORT_VQSR_INDEL" \
--tranches-file "$COHORT_INDEL_TRANCHES" \
--mode INDEL --truth-sensitivity-filter-level 95.0 \
-O "$COHORT_POST_VQSR" \
--tmp-dir "$TMPDIR" \
2>> "$LOG"
set +x
tmp_vcf="$COHORT_POST_VQSR"
fi
# -----------------------------------------------------------------------------
# Step 7: Hard filters & write cohort QC VCF (always run for QC)
# (filters copied from your single-sample script)
# -----------------------------------------------------------------------------
echo ">>> Step 7: Hard-filter & write cohort QC VCF" | tee -a "$LOG"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS VariantFiltration \
-R "$REF" \
-V "$tmp_vcf" \
--filter-name "LowQUAL" --filter-expression "QUAL < 30.0" \
--filter-name "QD2" --filter-expression "QD < 2.0" \
--filter-name "FS60" --filter-expression "FS > 60.0" \
--filter-name "MQ40" --filter-expression "MQ < 40.0" \
--filter-name "MQRS-12.5" --filter-expression "MQRankSum < -12.5" \
--filter-name "RPRS-8" --filter-expression "ReadPosRankSum < -8.0" \
--filter-name "QD2_indel" --filter-expression "QD < 2.0" \
--filter-name "FS200" --filter-expression "FS > 200.0" \
--filter-name "RPRS-20" --filter-expression "ReadPosRankSum < -20.0" \
-O "$COHORT_QC_VCF" \
--tmp-dir "$TMPDIR" \
2>> "$LOG"
set +x
if [ $? -ne 0 ]; then
echo "ERROR: VariantFiltration (cohort QC) failed. See log: $LOG" >&2
exit 1
fi
echo "Cohort QC VCF written: $COHORT_QC_VCF" | tee -a "$LOG"
# Final message
echo "All done. Cohort raw: $COHORT_RAW_VCF ; Cohort QC: $COHORT_QC_VCF" | tee -a "$LOG"
exit 0
Diagram: Cohort Joint-Genotyping Workflow¶
flowchart TD
A["Input per sample gVCF files"]
B["Import gVCFs into GenomicsDB workspace"]
C["Run GenotypeGVCFs to create cohort raw VCF"]
D["Count SNP and INDEL variants"]
E["Check SNP count threshold"]
F["Check INDEL count threshold"]
G["Build SNP VQSR model"]
H["Build INDEL VQSR model"]
I["Apply SNP VQSR"]
J["Apply INDEL VQSR"]
K["Choose best available VCF"]
L["Apply hard filters"]
M["Write cohort QC VCF"]
A --> B --> C --> D
D --> E -->|Enough SNPs| G --> I
D --> F -->|Enough INDELs| H --> J
C --> K
I --> K
J --> K
K --> L --> M
Purpose¶
This pipeline combines many per-sample gVCFs and performs cohort-level joint genotyping, producing a single VCF for all samples with consistent genotype calls and variant filtering.
Use this when:
- You want consistent genotypes across a family or population.
- You plan to build VQSR models on cohort-level variant distributions.
- You are preparing a joint VCF for association or segregation analyses.
Inputs¶
- Sample map file (
--sample-map):
TSV used byGenomicsDBImport(sample name → gVCF path). - Per-sample gVCFs from the single-sample pipeline.
- Reference genome (
REFfromparameters.sh). - VQSR resources (SNP and INDEL training sets).
- Optional interval list for WES mode.
Workflow¶
1. GenomicsDBImport¶
- Imports all gVCFs into a GenomicsDB workspace.
- Handles both WES (interval-limited) and WGS (whole genome) modes.
- Output: on-disk database accessed as
gendb://<workspace>.
2. Joint Genotyping (GenotypeGVCFs)¶
- Runs
GenotypeGVCFsongendb://<workspace>and the reference. -
Produces the cohort-level VCF:
-
cohort.gv.raw.vcf.gz
3. Count Variants and Decide on VQSR¶
- Counts SNPs and INDELs in the raw cohort VCF.
- Compares counts to configurable thresholds:
MIN_SNP_FOR_VQSR(default 1000)MIN_INDEL_FOR_VQSR(default 8000)- Determines whether to build SNP and/or INDEL VQSR models.
4. Build VQSR SNP Model¶
- If enough SNPs:
- Run
VariantRecalibratorin SNP mode. - Uses training resources and multiple annotations:
- QD, FS, MQ, MQRankSum, ReadPosRankSum.
- Outputs:
cohort.snp.recal.vcf.gzcohort.snp.tranches.txt.
5. Build VQSR INDEL Model¶
- If enough INDELs:
- Run
VariantRecalibratorin INDEL mode. - Uses annotations:
- QD, FS, ReadPosRankSum.
- Outputs:
cohort.indel.recal.vcf.gzcohort.indel.tranches.txt.
6. Apply VQSR¶
- If SNP model exists:
- Apply SNP VQSR →
cohort.post_snp.vcf.gz. - If INDEL model exists:
- Apply INDEL VQSR →
cohort.vqsr.vcf.gz.
The best available VCF (VQSR-filtered or raw) is used as input to the next step.
7. Hard Filtering and QC VCF¶
- Run
VariantFiltrationwith a set of hard filters on: - QUAL, QD, FS, MQ, MQRankSum, ReadPosRankSum.
- Output:
cohort.gv.QC.vcf.gz.
This QC VCF is the recommended cohort VCF for downstream analysis.
Output Files¶
| File | Description |
|---|---|
cohort.gv.raw.vcf.gz |
Raw cohort joint-genotyped VCF |
cohort.snp.recal.vcf.gz |
SNP VQSR model VCF |
cohort.snp.tranches.txt |
SNP VQSR tranches and diagnostics |
cohort.indel.recal.vcf.gz |
INDEL VQSR model VCF |
cohort.indel.tranches.txt |
INDEL VQSR tranches and diagnostics |
cohort.post_snp.vcf.gz |
VCF after applying SNP VQSR |
cohort.vqsr.vcf.gz |
VCF after applying SNP and INDEL VQSR |
cohort.gv.QC.vcf.gz |
Final hard-filtered cohort QC VCF (recommended) |
logs/cohort_joint_genotyping.log |
Main log file for the cohort pipeline |
When to Use This Pipeline¶
- After you have gVCFs from the single-sample pipeline.
- When you need a single VCF for all samples for:
- Family-based segregation analysis.
- Case/control or population cohorts.
- Downstream tools that expect joint genotypes.