WES/WGS Single-Sample Pipeline¶
A user-focused guide to processing whole-exome (WES) and whole-genome (WGS) data using GATK Best Practices.
The full Bash pipeline:
#!/usr/bin/env bash
#
# WGS/WES pipeline Bash script (GATK4.6 best-practices edition).
# Last Modified: 2025-05-02
#
# README:
# This script runs either WES or WGS pipelines based on the '-p' mode flag.
# Mode 'wes': uses provided exome interval list to restrict variant calling.
# Mode 'wgs': processes the entire genome (no intervals).
# Common steps (GATK Best Practices):
# 1) Align & add read groups
# 2) Merge lane-level BAMs
# 3) Mark duplicates
# 4) Base Quality Score Recalibration (BQSR)
# 5) HaplotypeCaller (per-sample gVCF)
# 6) GenotypeGVCFs (raw VCF)
# 7) VariantRecalibrator (VQSR) if variant counts suffice
# 8) Apply VQSR models or fallback to hard filters
# 9) Hard-filter & write QC VCF
# 10) Coverage stats & sex determination
set -euo pipefail  # pipefail so a failed bwa aborts the bwa | gatk pipe below
CLEANUP_BAM=false
function usage {
cat <<EOF
Usage: $0 -t <n_threads> -p <wes|wgs> [-c]
-t, --threads Number of CPU threads for GATK tools
-p, --pipeline Pipeline mode: 'wes' or 'wgs'
-c, --cleanup-bam If set, delete all 01_bam/*.bam when done (default: off)
EOF
exit 1
}
# Parse command-line arguments
while [[ $# -gt 0 ]]; do
case "$1" in
-t|--threads)
THREADS="$2"
shift 2
;;
-p|--pipeline)
PIPELINE="$2"
shift 2
;;
-c|--cleanup-bam)
CLEANUP_BAM=true
shift
;;
*)
usage
;;
esac
done
# make sure the required ones really got set
if [ -z "${THREADS:-}" ] || [ -z "${PIPELINE:-}" ]; then
usage
fi
# Convert to Uppercase
PIPELINE=${PIPELINE^^}
if [[ "$PIPELINE" != "WES" && "$PIPELINE" != "WGS" ]]; then
echo "Error: Mode must be 'wes' or 'wgs'." >&2
usage
fi
# Load GATK parameters (reference, known-sites, interval list, tool paths)
BINDIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$BINDIR/parameters.sh"
# Prepare output directories
dir=$(pwd)
BAMDIR=$dir/01_bam
VARCALLDIR=$dir/02_varcall
STATSDIR=$dir/03_stats
LOGDIR=$dir/logs
mkdir -p "$BAMDIR" "$VARCALLDIR" "$STATSDIR" "$LOGDIR"
# Determine sample ID & log file
rawid=$(basename "$(dirname "$PWD")")
id=${rawid%%_*}
LOG=$LOGDIR/${id}.log
# Set interval argument for WES vs WGS
if [ "$PIPELINE" = "WES" ]; then
INTERVAL_ARG="-L $INTERVAL_LIST"
echo "WES mode: restricting to $INTERVAL_LIST"
else
INTERVAL_ARG=""
echo "WGS mode: processing whole genome"
fi
#------------------------------------------------------------------------------
# STEP 1: Align & Add Read Groups (per-lane BAMs)
# Theory: Read groups are required for GATK to tag sample, library, and platform
# information, enabling per-sample metrics and proper handling of multi-library data.
#------------------------------------------------------------------------------
echo ">>> STEP 1: Align & add read groups"
for R1 in ../*_R1_*fastq.gz; do
fn=$(basename "$R1" .fastq.gz)
base=${fn%_R1*}
R2=${R1/_R1_/_R2_}
SAMPLE=$(echo "$base" | cut -d'_' -f1-2)
LANE=$(echo "$base" | cut -d'_' -f3)
# ensure RGIDs are unique by appending a timestamp
RGID="${SAMPLE}.${LANE}.$(date +%s)"
RGPU="${SAMPLE}.${LANE}.unit1"
out_bam="$BAMDIR/${base}.rg.bam"
echo "Aligning $fn -> $(basename "$out_bam")"
# Align and drop secondary (0x100) & supplementary (0x800) alignments before RG tagging
#$BWA mem -M -t "$THREADS" "$REFGZ" "$R1" "$R2" \
# | $SAM view -b -F 0x900 - \
# ...
# Align without filtering (keep all alignments):
$BWA mem -M -t "$THREADS" "$REFGZ" "$R1" "$R2" \
| "$GATK4_BIN" $GATK4_JAVA_OPTS AddOrReplaceReadGroups \
--INPUT /dev/stdin \
--OUTPUT "$out_bam" \
--TMP_DIR "$TMPDIR" \
--RGPL ILLUMINA \
--RGLB sureselect \
--RGSM "$SAMPLE" \
--RGID "$RGID" \
--RGPU "$RGPU" \
2>> "$LOG"
done
#------------------------------------------------------------------------------
# STEP 2: Merge Lane-level BAMs
# Theory: Merging ensures downstream duplicate marking and BQSR
# operate on the full sample, minimizing lane-specific biases.
#------------------------------------------------------------------------------
echo ">>> STEP 2: Merge lane-level BAMs"
rg_bams=( $BAMDIR/*.rg.bam )
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS MergeSamFiles \
${rg_bams[@]/#/-I } \
-O "$BAMDIR/${id}.rg.merged.bam" \
--CREATE_INDEX true --VALIDATION_STRINGENCY SILENT --TMP_DIR "$TMPDIR" \
2>> "$LOG"
set +x
#------------------------------------------------------------------------------
# STEP 3: Mark Duplicates
# Theory: PCR/optical duplicates inflate support for artefactual variants,
# so marking them reduces false positives in variant calling.
#------------------------------------------------------------------------------
echo ">>> STEP 3: Mark duplicates"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS MarkDuplicates \
-I "$BAMDIR/${id}.rg.merged.bam" \
-O "$BAMDIR/${id}.rg.merged.dedup.bam" \
--METRICS_FILE "$BAMDIR/${id}.rg.merged.dedup.metrics.txt" \
--CREATE_INDEX true --TMP_DIR "$TMPDIR" \
2>> "$LOG"
set +x
$SAM index "$BAMDIR/${id}.rg.merged.dedup.bam"
#------------------------------------------------------------------------------
# STEP 4: Base Quality Score Recalibration (BQSR)
# Theory: Models systematic error in reported base qualities using known-sites,
# improving the accuracy of variant calls by correcting quality scores.
#------------------------------------------------------------------------------
echo ">>> STEP 4: Base recalibration"
bqsr_table="$BAMDIR/${id}.rg.merged.dedup.recal.table"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS BaseRecalibrator \
-R "$REF" \
-I "$BAMDIR/${id}.rg.merged.dedup.bam" \
--known-sites "$dbSNP" --known-sites "$MILLS_INDELS" --known-sites "$KG_INDELS" \
-O "$bqsr_table" --tmp-dir "$TMPDIR" 2>> "$LOG"
"$GATK4_BIN" $GATK4_JAVA_OPTS ApplyBQSR \
-R "$REF" -I "$BAMDIR/${id}.rg.merged.dedup.bam" \
--bqsr-recal-file "$bqsr_table" \
-O "$BAMDIR/${id}.rg.merged.dedup.recal.bam" --tmp-dir "$TMPDIR" 2>> "$LOG"
$SAM index "$BAMDIR/${id}.rg.merged.dedup.recal.bam"
set +x
#------------------------------------------------------------------------------
# STEP 5: HaplotypeCaller -> gVCF
# Theory: Local de Bruijn graph assembly per active region improves indel calling;
# emits reference-confidence gVCF required for joint genotyping.
#------------------------------------------------------------------------------
echo ">>> STEP 5: HaplotypeCaller -> gVCF"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS HaplotypeCaller \
-R "$REF" -I "$BAMDIR/${id}.rg.merged.dedup.recal.bam" \
-O "$VARCALLDIR/${id}.hc.g.vcf.gz" \
$INTERVAL_ARG \
--native-pair-hmm-threads "$THREADS" -ERC GVCF 2>> "$LOG"
set +x
#------------------------------------------------------------------------------
# STEP 6: GenotypeGVCFs -> Raw VCF
# Theory: Jointly genotype one or more gVCFs to produce high-confidence sample VCF.
#------------------------------------------------------------------------------
echo ">>> STEP 6: GenotypeGVCFs -> raw VCF"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS GenotypeGVCFs \
-R "$REF" -V "$VARCALLDIR/${id}.hc.g.vcf.gz" \
-O "$VARCALLDIR/${id}.hc.raw.vcf.gz" --stand-call-conf 10 2>> "$LOG"
set +x
#------------------------------------------------------------------------------
# STEP 7: Build VQSR Models
# Theory: Use Gaussian mixture models on variant annotations
# to distinguish true variants from artefacts; requires sufficient counts.
#------------------------------------------------------------------------------
echo ">>> STEP 7: Optional VQSR if enough variants"
# Approximate SNP/indel split: a SNP has a single-base REF and ALT (multi-allelic
# records with comma-separated ALTs are counted as indels by this heuristic)
nSNP=$(zgrep -v '^#' "$VARCALLDIR/${id}.hc.raw.vcf.gz" | awk 'length($4)==1 && length($5)==1' | wc -l)
nINDEL=$(zgrep -v '^#' "$VARCALLDIR/${id}.hc.raw.vcf.gz" | awk '!(length($4)==1 && length($5)==1)' | wc -l)
minSNP=1000; minINDEL=8000; apply_snp=false; apply_indel=false
if (( nSNP >= minSNP )); then
"$GATK4_BIN" $GATK4_JAVA_OPTS VariantRecalibrator \
-R "$REF" \
-V "$VARCALLDIR/${id}.hc.raw.vcf.gz" \
$SNP_RES \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
--mode SNP \
-O "$VARCALLDIR/${id}.hc.snp.recal.vcf.gz" \
--tranches-file "$VARCALLDIR/${id}.hc.snp.tranches.txt" \
--max-gaussians 6 2>> "$LOG"
apply_snp=true
fi
if (( nINDEL >= minINDEL )); then
"$GATK4_BIN" $GATK4_JAVA_OPTS VariantRecalibrator \
-R "$REF" \
-V "$VARCALLDIR/${id}.hc.raw.vcf.gz" \
$INDEL_RES \
-an QD -an FS -an ReadPosRankSum \
--mode INDEL \
-O "$VARCALLDIR/${id}.hc.indel.recal.vcf.gz" \
--tranches-file "$VARCALLDIR/${id}.hc.indel.tranches.txt" \
--max-gaussians 4 2>> "$LOG"
apply_indel=true
fi
#------------------------------------------------------------------------------
# STEP 8: Apply VQSR Models or Fallback
#------------------------------------------------------------------------------
echo ">>> STEP 8: Apply VQSR or fallback"
tmp_vcf="$VARCALLDIR/${id}.hc.raw.vcf.gz"
if [ "$apply_snp" = true ]; then
"$GATK4_BIN" $GATK4_JAVA_OPTS ApplyVQSR \
-R "$REF" -V "$tmp_vcf" \
--recal-file "$VARCALLDIR/${id}.hc.snp.recal.vcf.gz" \
--tranches-file "$VARCALLDIR/${id}.hc.snp.tranches.txt" \
--mode SNP --truth-sensitivity-filter-level 99.0 \
-O "$VARCALLDIR/${id}.hc.post_snp.vcf.gz" 2>> "$LOG"
tmp_vcf="$VARCALLDIR/${id}.hc.post_snp.vcf.gz"
fi
if [ "$apply_indel" = true ]; then
"$GATK4_BIN" $GATK4_JAVA_OPTS ApplyVQSR \
-R "$REF" -V "$tmp_vcf" \
--recal-file "$VARCALLDIR/${id}.hc.indel.recal.vcf.gz" \
--tranches-file "$VARCALLDIR/${id}.hc.indel.tranches.txt" \
--mode INDEL --truth-sensitivity-filter-level 95.0 \
-O "$VARCALLDIR/${id}.hc.vqsr.vcf.gz" 2>> "$LOG"
tmp_vcf="$VARCALLDIR/${id}.hc.vqsr.vcf.gz"
fi
#------------------------------------------------------------------------------
# STEP 9: Hard-Filter & Write QC VCF
# Theory: Apply recommended hard filters on annotations when VQSR isn't applied or as QC.
# SNPs: QD<2.0, FS>60.0, MQ<40.0, MQRankSum<-12.5, ReadPosRankSum<-8.0
# INDELs: QD<2.0, FS>200.0, ReadPosRankSum<-20.0
#------------------------------------------------------------------------------
echo ">>> STEP 9: Hard-filter & write QC.vcf"
"$GATK4_BIN" $GATK4_JAVA_OPTS VariantFiltration \
-R "$REF" \
-V "$tmp_vcf" \
--filter-name "LowQUAL" --filter-expression "QUAL < 30.0" \
--filter-name "QD2" --filter-expression "QD < 2.0" \
--filter-name "FS60" --filter-expression "FS > 60.0" \
--filter-name "MQ40" --filter-expression "MQ < 40.0" \
--filter-name "MQRS-12.5" --filter-expression "MQRankSum < -12.5" \
--filter-name "RPRS-8" --filter-expression "ReadPosRankSum < -8.0" \
--filter-name "QD2_indel" --filter-expression "QD < 2.0" \
--filter-name "FS200" --filter-expression "FS > 200.0" \
--filter-name "RPRS-20" --filter-expression "ReadPosRankSum < -20.0" \
-O "$VARCALLDIR/${id}.hc.QC.vcf.gz" \
2>> "$LOG"
#------------------------------------------------------------------------------
# STEP 10: Coverage Stats & Sex Determination
#------------------------------------------------------------------------------
echo ">>> STEP 10: Coverage & Sex Determination" 2>> "$LOG"
# choose chromosome 1 naming
if [[ "$REF" == *b37*.fasta ]]; then
chrN=1
else
chrN=chr1
fi
bam_raw="$BAMDIR/${id}.rg.merged.dedup.bam"
bam_recal="$BAMDIR/${id}.rg.merged.dedup.recal.bam"
out_raw="$STATSDIR/${chrN}.raw.bam"
out_dedup="$STATSDIR/${chrN}.dedup.bam"
# extract chr1 (or "1") reads and redirect errors
$SAM view -b "$bam_raw" "$chrN" > "$out_raw" 2>> "$LOG"
$SAM view -b "$bam_recal" "$chrN" > "$out_dedup" 2>> "$LOG"
# index them, also logging STDERR
$SAM index "$out_raw" 2>> "$LOG"
$SAM index "$out_dedup" 2>> "$LOG"
# run coverage & sex scripts: STDOUT -> stats file, STDERR -> main log
"$BINDIR"/coverage.sh "$id" "$out_raw" "$out_dedup" "$PIPELINE" \
> "$STATSDIR/${id}.coverage.txt" 2>> "$LOG"
"$BINDIR"/vcf2sex.sh "$VARCALLDIR/${id}.hc.QC.vcf.gz" \
> "$STATSDIR/${id}.sex.txt" 2>> "$LOG"
# Delete $STATSDIR/*bam
rm "$out_raw" "$out_dedup" "$out_raw.bai" "$out_dedup.bai"
#------------------------------------------------------------------------------
# STEP 11: Cleanup BAMs (optional)
#------------------------------------------------------------------------------
if [ "$CLEANUP_BAM" = true ]; then
echo ">>> STEP 11: Cleanup BAMs" 2>> "$LOG"
# -f silences “no such file” errors if the glob is empty
rm -f "$BAMDIR"/*.{bam,bai} 2>> "$LOG"
fi
# End
echo "All done! QC VCF: $VARCALLDIR/${id}.hc.QC.vcf.gz"
The equivalent Snakemake pipeline:
# wes_single.smk - bash-replica of wes_single.sh (GATK4.6) with per-rule logs
import os
import glob
import platform
import shlex
from pathlib import Path
snakefile_dir = Path(workflow.snakefile).parent
configfile: str(snakefile_dir / "config.yaml")
# -----------------------------
# Environment (match bash)
# -----------------------------
os.environ["TMPDIR"] = config["tmpdir"]
os.environ["LC_ALL"] = "C"
os.environ["GATK_DISABLE_AUTO_S3_UPLOAD"] = "true"
# -----------------------------
# Config / tools
# -----------------------------
DATADIR = config["datadir"]
DBDIR = config["dbdir"].format(datadir=DATADIR)
NGSUTILS = config["ngsutils"].format(datadir=DATADIR)
TMPDIR = config["tmpdir"].format(datadir=DATADIR)
MEM = config.get("mem", "8G")
PIPELINE = config.get("pipeline", "wes").lower()
if PIPELINE not in ("wes", "wgs"):
raise ValueError("config[pipeline] must be 'wes' or 'wgs'")
CLEANUP_BAM = bool(config.get("cleanup_bam", False))
THREADS = workflow.cores or int(config.get("threads", 4))
ARCH = platform.machine()
if ARCH == "aarch64":
JAVA = config["java"]["aarch64"]
BWA = config["tools"]["aarch64"]["bwa"].format(ngsutils=NGSUTILS)
SAM = config["tools"]["aarch64"]["samtools"].format(ngsutils=NGSUTILS)
else:
JAVA = config["java"]["amd64"]
BWA = config["tools"]["amd64"]["bwa"].format(ngsutils=NGSUTILS)
SAM = config["tools"]["amd64"]["samtools"].format(ngsutils=NGSUTILS)
GATK4 = config["gatk4_cmd"].format(ngsutils=NGSUTILS, mem=MEM)
COV = str(snakefile_dir / "coverage.sh")
VCF2SEX = str(snakefile_dir / "vcf2sex.sh")
bundle = config["bundle"].format(dbdir=DBDIR)
REF = config["ref"].format(bundle=bundle)
REFGZ = config["refgz"].format(bundle=bundle)
dbSNP = config["dbsnp"].format(dbdir=DBDIR)
MILLS_INDELS = config["mills_indels"].format(bundle=bundle)
KG_INDELS = config["kg_indels"].format(bundle=bundle)
HAPMAP = config["hapmap"].format(bundle=bundle)
OMNI = config["omni"].format(bundle=bundle)
SNP_RES = config["snp_res"].format(hapmap=HAPMAP, omni=OMNI, dbsnp=dbSNP)
INDEL_RES = config["indel_res"].format(mills_indels=MILLS_INDELS)
INTERVAL_LIST = config["interval_list"].format(bundle=bundle)
INTERVAL_ARG = f"-L {shlex.quote(INTERVAL_LIST)}" if PIPELINE == "wes" else ""
MIN_SNP_FOR_VQSR = int(config.get("min_snp_for_vqsr", 1000))
MIN_INDEL_FOR_VQSR = int(config.get("min_indel_for_vqsr", 8000))
# Output dirs
BAMDIR = "01_bam"
VARCALLDIR = "02_varcall"
STATSDIR = "03_stats"
LOGDIR = "logs"
for d in (BAMDIR, VARCALLDIR, STATSDIR, LOGDIR):
os.makedirs(d, exist_ok=True)
# -----------------------------
# Sample ID (match bash)
# -----------------------------
rawid = Path.cwd().parent.name
ID = rawid.split("_", 1)[0]
# -----------------------------
# FASTQ pairs (match bash glob)
# -----------------------------
FASTQ_DIR = "../"
FASTQ_R1 = sorted(glob.glob(os.path.join(FASTQ_DIR, "*_R1_*fastq.gz")))
if not FASTQ_R1:
raise ValueError("No FASTQs found matching ../*_R1_*fastq.gz")
FASTQ_BASES = []
FASTQ_DICT = {}
for r1 in FASTQ_R1:
r2 = r1.replace("_R1_", "_R2_")
base = os.path.basename(r1).replace(".fastq.gz", "")
base = base.split("_R1_", 1)[0]
FASTQ_BASES.append(base)
FASTQ_DICT[base] = {"r1": r1, "r2": r2}
RG_BAMS = [os.path.join(BAMDIR, f"{b}.rg.bam") for b in FASTQ_BASES]
CHR1 = "1" if "b37" in str(REF) else "chr1"
# -----------------------------
# Targets
# -----------------------------
FINAL_INPUTS = [
os.path.join(VARCALLDIR, f"{ID}.hc.QC.vcf.gz"),
os.path.join(STATSDIR, f"{ID}.coverage.txt"),
os.path.join(STATSDIR, f"{ID}.sex.txt"),
]
if CLEANUP_BAM:
FINAL_INPUTS.append(os.path.join(LOGDIR, f"{ID}.cleanup.done"))
rule all:
input:
FINAL_INPUTS
# -----------------------------
# STEP 1: Align & AddReadGroups
# -----------------------------
rule align_rg:
input:
r1=lambda wc: FASTQ_DICT[wc.base]["r1"],
r2=lambda wc: FASTQ_DICT[wc.base]["r2"],
output:
bam=os.path.join(BAMDIR, "{base}.rg.bam"),
threads: THREADS
log:
os.path.join(LOGDIR, f"{ID}.01_align_rg.{{base}}.log")
shell:
r"""
set -eu
SAMPLE=$(echo {wildcards.base} | cut -d'_' -f1-2)
LANE=$(echo {wildcards.base} | cut -d'_' -f3)
RGID="${{SAMPLE}}.${{LANE}}.$(date +%s)"
RGPU="${{SAMPLE}}.${{LANE}}.unit1"
{BWA} mem -M -t {threads} {REFGZ} {input.r1} {input.r2} \
| {GATK4} AddOrReplaceReadGroups \
--INPUT /dev/stdin \
--OUTPUT {output.bam} \
--TMP_DIR {TMPDIR} \
--RGPL ILLUMINA \
--RGLB sureselect \
--RGSM "$SAMPLE" \
--RGID "$RGID" \
--RGPU "$RGPU" \
2>> {log}
"""
# -----------------------------
# STEP 2: Merge lane BAMs
# -----------------------------
rule merge_bams:
input:
rg_bams=RG_BAMS
output:
merged=os.path.join(BAMDIR, f"{ID}.rg.merged.bam")
params:
merge_inputs=lambda wc, input: " ".join([f"-I {b}" for b in input.rg_bams])
log:
os.path.join(LOGDIR, f"{ID}.02_merge_bams.log")
shell:
r"""
set -eu
{GATK4} MergeSamFiles \
{params.merge_inputs} \
-O {output.merged} \
--CREATE_INDEX true \
--VALIDATION_STRINGENCY SILENT \
--TMP_DIR {TMPDIR} \
2>> {log}
"""
# -----------------------------
# STEP 3: MarkDuplicates (+ samtools index)
# -----------------------------
rule mark_duplicates:
input:
merged=os.path.join(BAMDIR, f"{ID}.rg.merged.bam")
output:
dedup=os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.bam"),
metrics=os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.metrics.txt")
log:
os.path.join(LOGDIR, f"{ID}.03_mark_duplicates.log")
shell:
r"""
set -eu
{GATK4} MarkDuplicates \
-I {input.merged} \
-O {output.dedup} \
--METRICS_FILE {output.metrics} \
--CREATE_INDEX true \
--TMP_DIR {TMPDIR} \
2>> {log}
{SAM} index {output.dedup} 2>> {log}
"""
# -----------------------------
# STEP 4: BQSR
# -----------------------------
rule bqsr:
input:
dedup=os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.bam")
output:
table=os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.recal.table"),
recal=os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.recal.bam")
log:
os.path.join(LOGDIR, f"{ID}.04_bqsr.log")
shell:
r"""
set -eu
{GATK4} BaseRecalibrator \
-R {REF} \
-I {input.dedup} \
--known-sites {dbSNP} \
--known-sites {MILLS_INDELS} \
--known-sites {KG_INDELS} \
-O {output.table} \
--tmp-dir {TMPDIR} \
2>> {log}
{GATK4} ApplyBQSR \
-R {REF} \
-I {input.dedup} \
--bqsr-recal-file {output.table} \
-O {output.recal} \
--tmp-dir {TMPDIR} \
2>> {log}
{SAM} index {output.recal} 2>> {log}
"""
# -----------------------------
# STEP 5: HaplotypeCaller -> gVCF
# -----------------------------
rule haplotypecaller:
input:
recal=os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.recal.bam")
output:
gvcf=os.path.join(VARCALLDIR, f"{ID}.hc.g.vcf.gz")
threads: THREADS
log:
os.path.join(LOGDIR, f"{ID}.05_haplotypecaller.log")
shell:
r"""
set -eu
{GATK4} HaplotypeCaller \
-R {REF} \
-I {input.recal} \
-O {output.gvcf} \
{INTERVAL_ARG} \
--native-pair-hmm-threads {threads} \
-ERC GVCF \
2>> {log}
"""
# -----------------------------
# STEP 6: GenotypeGVCFs -> raw VCF
# -----------------------------
rule genotype_gvcfs:
input:
gvcf=os.path.join(VARCALLDIR, f"{ID}.hc.g.vcf.gz")
output:
raw=os.path.join(VARCALLDIR, f"{ID}.hc.raw.vcf.gz")
log:
os.path.join(LOGDIR, f"{ID}.06_genotype_gvcfs.log")
shell:
r"""
set -eu
{GATK4} GenotypeGVCFs \
-R {REF} \
-V {input.gvcf} \
-O {output.raw} \
--stand-call-conf 10 \
2>> {log}
"""
# -----------------------------
# STEPS 7-9: Conditional VQSR + always QC VariantFiltration
# -----------------------------
rule vqsr_and_qc:
input:
raw=os.path.join(VARCALLDIR, f"{ID}.hc.raw.vcf.gz")
output:
qc=os.path.join(VARCALLDIR, f"{ID}.hc.QC.vcf.gz")
log:
os.path.join(LOGDIR, f"{ID}.07_vqsr_and_qc.log")
shell:
r"""
set -eu
rawvcf={input.raw}
tmp_vcf="$rawvcf"
# Approximate SNP/indel split: a SNP has a single-base REF and ALT
nSNP=$(zgrep -v '^#' "$rawvcf" | awk 'length($4)==1 && length($5)==1' | wc -l | tr -d ' ')
nINDEL=$(zgrep -v '^#' "$rawvcf" | awk '!(length($4)==1 && length($5)==1)' | wc -l | tr -d ' ')
apply_snp=false
apply_indel=false
if [ "$nSNP" -ge "{MIN_SNP_FOR_VQSR}" ]; then
{GATK4} VariantRecalibrator \
-R {REF} \
-V "$rawvcf" \
{SNP_RES} \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
--mode SNP \
-O {VARCALLDIR}/{ID}.hc.snp.recal.vcf.gz \
--tranches-file {VARCALLDIR}/{ID}.hc.snp.tranches.txt \
--max-gaussians 6 \
2>> {log}
apply_snp=true
fi
if [ "$nINDEL" -ge "{MIN_INDEL_FOR_VQSR}" ]; then
{GATK4} VariantRecalibrator \
-R {REF} \
-V "$rawvcf" \
{INDEL_RES} \
-an QD -an FS -an ReadPosRankSum \
--mode INDEL \
-O {VARCALLDIR}/{ID}.hc.indel.recal.vcf.gz \
--tranches-file {VARCALLDIR}/{ID}.hc.indel.tranches.txt \
--max-gaussians 4 \
2>> {log}
apply_indel=true
fi
if [ "$apply_snp" = true ]; then
{GATK4} ApplyVQSR \
-R {REF} -V "$tmp_vcf" \
--recal-file {VARCALLDIR}/{ID}.hc.snp.recal.vcf.gz \
--tranches-file {VARCALLDIR}/{ID}.hc.snp.tranches.txt \
--mode SNP --truth-sensitivity-filter-level 99.0 \
-O {VARCALLDIR}/{ID}.hc.post_snp.vcf.gz \
2>> {log}
tmp_vcf="{VARCALLDIR}/{ID}.hc.post_snp.vcf.gz"
fi
if [ "$apply_indel" = true ]; then
{GATK4} ApplyVQSR \
-R {REF} -V "$tmp_vcf" \
--recal-file {VARCALLDIR}/{ID}.hc.indel.recal.vcf.gz \
--tranches-file {VARCALLDIR}/{ID}.hc.indel.tranches.txt \
--mode INDEL --truth-sensitivity-filter-level 95.0 \
-O {VARCALLDIR}/{ID}.hc.vqsr.vcf.gz \
2>> {log}
tmp_vcf="{VARCALLDIR}/{ID}.hc.vqsr.vcf.gz"
fi
{GATK4} VariantFiltration \
-R {REF} \
-V "$tmp_vcf" \
--filter-name "LowQUAL" --filter-expression "QUAL < 30.0" \
--filter-name "QD2" --filter-expression "QD < 2.0" \
--filter-name "FS60" --filter-expression "FS > 60.0" \
--filter-name "MQ40" --filter-expression "MQ < 40.0" \
--filter-name "MQRS-12.5" --filter-expression "MQRankSum < -12.5" \
--filter-name "RPRS-8" --filter-expression "ReadPosRankSum < -8.0" \
--filter-name "QD2_indel" --filter-expression "QD < 2.0" \
--filter-name "FS200" --filter-expression "FS > 200.0" \
--filter-name "RPRS-20" --filter-expression "ReadPosRankSum < -20.0" \
-O {output.qc} \
2>> {log}
"""
# -----------------------------
# STEP 10: Coverage
# -----------------------------
rule coverage_stats:
input:
raw = os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.bam"),
recal = os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.recal.bam")
output:
cov = os.path.join(STATSDIR, f"{ID}.coverage.txt")
log:
os.path.join(LOGDIR, f"{ID}.08_coverage_stats.log")
shell:
r"""
set -eu
chrN="{CHR1}"
bam_raw="{input.raw}"
bam_recal="{input.recal}"
out_raw="{STATSDIR}/{CHR1}.raw.bam"
out_dedup="{STATSDIR}/{CHR1}.dedup.bam"
{SAM} view -b "$bam_raw" "$chrN" > "$out_raw" 2>> {log}
{SAM} view -b "$bam_recal" "$chrN" > "$out_dedup" 2>> {log}
{SAM} index "$out_raw" 2>> {log}
{SAM} index "$out_dedup" 2>> {log}
bash {COV} {ID} "$out_raw" "$out_dedup" {PIPELINE} \
> {output.cov} 2>> {log}
rm -f "$out_raw" "$out_dedup" "$out_raw.bai" "$out_dedup.bai"
"""
# -----------------------------
# STEP 10 cont: Sex
# -----------------------------
rule sex_determination:
input:
qc=os.path.join(VARCALLDIR, f"{ID}.hc.QC.vcf.gz")
output:
sex=os.path.join(STATSDIR, f"{ID}.sex.txt")
log:
os.path.join(LOGDIR, f"{ID}.09_sex_determination.log")
shell:
r"""
set -eu
bash {VCF2SEX} {input.qc} > {output.sex} 2>> {log}
"""
# -----------------------------
# STEP 11: Optional cleanup
# -----------------------------
rule cleanup_bams:
input:
os.path.join(VARCALLDIR, f"{ID}.hc.QC.vcf.gz")
output:
done=os.path.join(LOGDIR, f"{ID}.cleanup.done")
log:
os.path.join(LOGDIR, f"{ID}.10_cleanup_bams.log")
run:
# write any python errors to the log file
try:
if CLEANUP_BAM:
for pat in (os.path.join(BAMDIR, "*.bam"), os.path.join(BAMDIR, "*.bai")):
for fp in glob.glob(pat):
try:
os.remove(fp)
except FileNotFoundError:
pass
Path(output.done).write_text("ok\n")
except Exception as e:
Path(log[0]).write_text(str(e) + "\n")
raise
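The Snakefile reads its settings from a config.yaml located next to it. A minimal sketch of the expected keys follows; every path and filename is a placeholder to adapt to your installation, and the resource strings use the standard GATK priors:
datadir: /data
dbdir: "{datadir}/db"
ngsutils: "{datadir}/ngsutils"
tmpdir: "{datadir}/tmp"
mem: 8G
pipeline: wes            # 'wes' or 'wgs'
cleanup_bam: false
threads: 4
java:
  amd64: java
  aarch64: java
tools:
  amd64:
    bwa: "{ngsutils}/bwa"
    samtools: "{ngsutils}/samtools"
  aarch64:
    bwa: "{ngsutils}/bwa"
    samtools: "{ngsutils}/samtools"
gatk4_cmd: "{ngsutils}/gatk --java-options -Xmx{mem}"
bundle: "{dbdir}/gatk-bundle"
ref: "{bundle}/reference.fasta"
refgz: "{bundle}/reference.fasta.gz"
dbsnp: "{dbdir}/dbsnp.vcf.gz"
mills_indels: "{bundle}/mills_gold_standard.indels.vcf.gz"
kg_indels: "{bundle}/1000G_known_indels.vcf.gz"
hapmap: "{bundle}/hapmap.vcf.gz"
omni: "{bundle}/omni.vcf.gz"
snp_res: >-
  --resource:hapmap,known=false,training=true,truth=true,prior=15.0 {hapmap}
  --resource:omni,known=false,training=true,truth=false,prior=12.0 {omni}
  --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 {dbsnp}
indel_res: >-
  --resource:mills,known=false,training=true,truth=true,prior=12.0 {mills_indels}
interval_list: "{bundle}/exome_targets.interval_list"
min_snp_for_vqsr: 1000
min_indel_for_vqsr: 8000
With the config in place, run from the sample working directory: snakemake -s wes_single.smk --cores 16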
Diagram: Single-Sample WES/WGS Workflow¶
flowchart TD
A["Input FASTQ reads"]
B["Align with BWA MEM and add read groups"]
C["Merge lane BAM files"]
D["Mark duplicates"]
E["Base recalibration BQSR"]
F["Run HaplotypeCaller to produce gVCF"]
G["Run GenotypeGVCFs to produce raw VCF"]
H["Check if enough variants for VQSR"]
I["Build SNP VQSR model"]
J["Build INDEL VQSR model"]
K["Apply VQSR"]
L["Apply hard filters"]
M["Write final QC VCF"]
N["Compute coverage and infer sex"]
A --> B --> C --> D --> E --> F --> G --> H
H -->|Yes| I --> J --> K --> L
H -->|No| L
L --> M --> N
Purpose¶
This pipeline processes one sample at a time and produces a high-quality, filtered VCF suitable for downstream analysis. It automatically adapts to:
- WES: restricted to an exome interval list
- WGS: whole genome (no interval restriction)
What the Pipeline Does¶
1. Alignment & Read Groups¶
- Align paired-end FASTQ files using BWA-MEM.
- Add read groups (sample, library, lane, platform) required by GATK.
- Output: lane-level BAMs with correct RG tags.
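To sanity-check the read-group tags on a lane-level BAM (samtools on PATH; filename illustrative):
samtools view -H 01_bam/<base>.rg.bam | grep '^@RG'
Each lane BAM should carry exactly one @RG line with the ID, PL, LB, SM, and PU fields set by the pipeline.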
2. Lane Merging¶
- Merge all lane BAMs for the same sample into a single BAM.
- Ensures duplicate marking and BQSR operate on the full dataset.
3. Duplicate Marking¶
- Use GATK MarkDuplicates on the merged BAM.
- Flags PCR/optical duplicates to prevent them from inflating support for artefactual variants.
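A quick check of the duplicate fraction after this step (samtools on PATH):
samtools flagstat 01_bam/<id>.rg.merged.dedup.bam
# the "duplicates" line reports reads flagged by MarkDuplicates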
4. Base Quality Score Recalibration (BQSR)¶
- Two-step process: BaseRecalibrator then ApplyBQSR.
- Uses known variant databases (dbSNP, Mills, 1000G indels) to model and correct systematic base-quality errors.
- Output: recalibrated BAM used for variant calling.
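To see what recalibration changed, GATK's AnalyzeCovariates can plot the table produced by BaseRecalibrator (needs a local R installation; output filename illustrative):
"$GATK4_BIN" $GATK4_JAVA_OPTS AnalyzeCovariates \
    -bqsr 01_bam/<id>.rg.merged.dedup.recal.table \
    -plots 03_stats/<id>.bqsr_covariates.pdf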
5. Variant Calling (HaplotypeCaller, gVCF)¶
- Run GATK HaplotypeCaller in GVCF mode (-ERC GVCF).
- WES: uses exome intervals; WGS: full genome.
- Output: <id>.hc.g.vcf.gz (per-sample gVCF).
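Most gVCF records are <NON_REF> reference-confidence blocks rather than variant sites; a rough tally using the same zgrep/awk idiom the script relies on:
zgrep -v '^#' 02_varcall/<id>.hc.g.vcf.gz | \
    awk '$5=="<NON_REF>" {b++} $5!="<NON_REF>" {s++} END {print b" ref blocks, "s" variant sites"}'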
6. GenotypeGVCFs (Raw VCF)¶
- Run GATK GenotypeGVCFs on the sample gVCF.
- Output: <id>.hc.raw.vcf.gz (raw VCF with SNPs and indels).
7. Variant Quality Score Recalibration (VQSR)¶
- If there are enough variants, build VQSR models with VariantRecalibrator for SNPs and indels separately.
- Uses multiple annotations (QD, MQ, FS, MQRankSum, ReadPosRankSum).
- Output: recalibration VCFs and tranche files.
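The $SNP_RES and $INDEL_RES placeholders expand to VariantRecalibrator --resource arguments. A typical definition in parameters.sh, using the standard GATK priors (a sketch; resource paths depend on your bundle):
SNP_RES="--resource:hapmap,known=false,training=true,truth=true,prior=15.0 $HAPMAP \
  --resource:omni,known=false,training=true,truth=false,prior=12.0 $OMNI \
  --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 $dbSNP"
INDEL_RES="--resource:mills,known=false,training=true,truth=true,prior=12.0 $MILLS_INDELS"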
8. Apply VQSR or Hard Filters¶
- If models exist:
  - Apply SNP VQSR, then INDEL VQSR.
  - Output: <id>.hc.vqsr.vcf.gz.
- If not:
  - The raw VCF (or, if only the SNP model was built, the post-SNP VQSR VCF) passes straight to the hard-filtering step below.
9. Generate Final QC VCF¶
- Run VariantFiltration with recommended hard filters on annotations.
- Output: <id>.hc.QC.vcf.gz (final QC VCF).
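Downstream analyses usually keep only records whose FILTER column is PASS; a quick count with the script's own zgrep/awk idiom:
zgrep -v '^#' 02_varcall/<id>.hc.QC.vcf.gz | awk '$7=="PASS"' | wc -l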
10. Coverage & Sex Determination¶
- Extract chromosome 1 reads from raw and recalibrated BAMs.
- Compute coverage statistics.
- Infer sample sex from final VCF using a dedicated script.
- Outputs:
  - 03_stats/<id>.coverage.txt
  - 03_stats/<id>.sex.txt
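coverage.sh and vcf2sex.sh are separate helper scripts not shown in this guide. As a minimal sketch of the coverage idea (illustrative only, not the actual script), mean depth over the extracted chromosome 1 BAM can be computed with samtools depth; a WES run would further restrict to target intervals:
samtools depth -a 03_stats/chr1.dedup.bam | awk '{sum+=$3; n++} END {if (n) printf "mean depth: %.1fx\n", sum/n}'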
Output Files¶
| File | Meaning |
|---|---|
| 02_varcall/<id>.hc.g.vcf.gz | Per-sample gVCF (HaplotypeCaller) |
| 02_varcall/<id>.hc.raw.vcf.gz | Raw VCF after GenotypeGVCFs |
| 02_varcall/<id>.hc.vqsr.vcf.gz | VCF after VQSR (if VQSR was applied) |
| 02_varcall/<id>.hc.QC.vcf.gz | Final QC-filtered VCF (recommended) |
| 03_stats/<id>.coverage.txt | Coverage metrics |
| 03_stats/<id>.sex.txt | Sex determination result |
| logs/<id>.log | Main pipeline log |
When to Use This Pipeline¶
- Standard clinical or research WES or WGS processing.
- Generating gVCFs for cohort joint genotyping.
- Producing high-quality single-sample VCFs for interpretation.