Skip to content

WES/WGS Single-Sample Pipeline

A user-focused guide to processing whole-exome (WES) and whole-genome (WGS) data using GATK Best Practices.

See Bash pipeline:
#!/usr/bin/env bash
#
#   WGS/WES pipeline Bash script (GATK4.6 best-practices edition).
#   Last Modified: 2025-05-02
#
# README:
#   This script runs either WES or WGS pipelines based on the '-p' mode flag.
#   Mode 'wes': uses provided exome interval list to restrict variant calling.
#   Mode 'wgs': processes the entire genome (no intervals).
#   Common steps (GATK Best Practices):
#     1) Align & add read groups
#     2) Merge lane-level BAMs
#     3) Mark duplicates
#     4) Base Quality Score Recalibration (BQSR)
#     5) HaplotypeCaller (per-sample gVCF)
#     6) GenotypeGVCFs (raw VCF)
#     7) VariantRecalibrator (VQSR) if variant counts suffice
#     8) Apply VQSR models or fallback to hard filters
#     9) Hard-filter & write QC VCF
#    10) Coverage stats & sex determination

set -eu

CLEANUP_BAM=false

function usage {
    cat <<EOF
Usage: $0 -t <n_threads> -p <wes|wgs> [-c]

  -t, --threads         Number of CPU threads for GATK tools
  -p, --pipeline        Pipeline mode: 'wes' or 'wgs'
  -c, --cleanup-bam     If set, delete all 01_bam/*.bam when done (default: off)
EOF
    exit 1
}

# Parse command-line arguments
while [[ $# -gt 0 ]]; do
  case "$1" in
    -t|--threads)
      THREADS="$2"
      shift 2
      ;;
    -p|--pipeline)
      PIPELINE="$2"
      shift 2
      ;;
    -c|--cleanup-bam)
      CLEANUP_BAM=true
      shift
      ;;
    *)
      usage
      ;;
  esac
done

# make sure the required ones really got set
if [ -z "${THREADS:-}" ] || [ -z "${PIPELINE:-}" ]; then
  usage
fi

# Convert to Uppercase
PIPELINE=${PIPELINE^^}
if [[ "$PIPELINE" != "WES" && "$PIPELINE" != "WGS" ]]; then
  echo "Error: Mode must be 'wes' or 'wgs'." >&2
  usage
fi

# Load GATK parameters (reference, known-sites, interval list, tool paths)
BINDIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$BINDIR/parameters.sh"

# Prepare output directories
dir=$(pwd)
BAMDIR=$dir/01_bam
VARCALLDIR=$dir/02_varcall
STATSDIR=$dir/03_stats
LOGDIR=$dir/logs
mkdir -p "$BAMDIR" "$VARCALLDIR" "$STATSDIR" "$LOGDIR"

# Determine sample ID & log file
rawid=$(basename "$(dirname "$PWD")")
id=${rawid%%_*}
LOG=$LOGDIR/${id}.log

# Set interval argument for WES vs WGS
if [ "$PIPELINE" = "WES" ]; then
  INTERVAL_ARG="-L $INTERVAL_LIST"
  echo "WES mode: restricting to $INTERVAL_LIST"
else
  INTERVAL_ARG=""
  echo "WGS mode: processing whole genome"
fi

#------------------------------------------------------------------------------
# STEP 1: Align & Add Read Groups (per-lane BAMs)
# Theory: Read groups are required for GATK to tag sample, library, and platform
# information, enabling per-sample metrics and proper handling of multi-library data.
#------------------------------------------------------------------------------
echo ">>> STEP 1: Align & add read groups"
for R1 in ../*_R1_*fastq.gz; do
  fn=$(basename "$R1" .fastq.gz)
  base=${fn%_R1*}
  R2=${R1/_R1_/_R2_}

  SAMPLE=$(echo "$base" | cut -d'_' -f1-2)
  LANE=$(echo   "$base" | cut -d'_' -f3)
  # ensure RGIDs are unique by appending a timestamp
  RGID="${SAMPLE}.${LANE}.$(date +%s)"
  RGPU="${SAMPLE}.${LANE}.unit1"

  out_bam="$BAMDIR/${base}.rg.bam"
  echo "Aligning $fn -> $(basename "$out_bam")"

  # Align and drop secondary (0x100) & supplementary (0x800) alignments before RG tagging
  #$BWA mem -M -t "$THREADS" "$REFGZ" "$R1" "$R2" \
  #  | $SAM view -b -F 0x900 - \
  # ...

  # Align without filtering (keep all alignments):
  $BWA mem -M -t "$THREADS" "$REFGZ" "$R1" "$R2" \
    | "$GATK4_BIN" $GATK4_JAVA_OPTS AddOrReplaceReadGroups \
        --INPUT /dev/stdin \
        --OUTPUT "$out_bam" \
        --TMP_DIR "$TMPDIR" \
        --RGPL ILLUMINA \
        --RGLB sureselect \
        --RGSM "$SAMPLE" \
        --RGID "$RGID" \
        --RGPU "$RGPU" \
    2>> "$LOG"
done

#------------------------------------------------------------------------------
# STEP 2: Merge Lane-level BAMs
# Theory: Merging ensures downstream duplicate marking and BQSR
# operate on the full sample, minimizing lane-specific biases.
#------------------------------------------------------------------------------
echo ">>> STEP 2: Merge lane-level BAMs"
rg_bams=( $BAMDIR/*.rg.bam )
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS MergeSamFiles \
  ${rg_bams[@]/#/-I } \
  -O "$BAMDIR/${id}.rg.merged.bam" \
  --CREATE_INDEX true --VALIDATION_STRINGENCY SILENT --TMP_DIR "$TMPDIR" \
  2>> "$LOG"
set +x

#------------------------------------------------------------------------------
# STEP 3: Mark Duplicates
# Theory: PCR/optical duplicates inflate support for artefactual variants,
# so marking them reduces false positives in variant calling.
#------------------------------------------------------------------------------
echo ">>> STEP 3: Mark duplicates"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS MarkDuplicates \
  -I "$BAMDIR/${id}.rg.merged.bam" \
  -O "$BAMDIR/${id}.rg.merged.dedup.bam" \
  --METRICS_FILE "$BAMDIR/${id}.rg.merged.dedup.metrics.txt" \
  --CREATE_INDEX true --TMP_DIR "$TMPDIR" \
  2>> "$LOG"
set +x
$SAM index "$BAMDIR/${id}.rg.merged.dedup.bam"

#------------------------------------------------------------------------------
# STEP 4: Base Quality Score Recalibration (BQSR)
# Theory: Models systematic error in reported base qualities using known-sites,
# improving the accuracy of variant calls by correcting quality scores.
#------------------------------------------------------------------------------
echo ">>> STEP 4: Base recalibration"
bqsr_table="$BAMDIR/${id}.rg.merged.dedup.recal.table"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS BaseRecalibrator \
  -R "$REF" \
  -I "$BAMDIR/${id}.rg.merged.dedup.bam" \
  --known-sites "$dbSNP" --known-sites "$MILLS_INDELS" --known-sites "$KG_INDELS" \
  -O "$bqsr_table" --tmp-dir "$TMPDIR" 2>> "$LOG"

"$GATK4_BIN" $GATK4_JAVA_OPTS ApplyBQSR \
  -R "$REF" -I "$BAMDIR/${id}.rg.merged.dedup.bam" \
  --bqsr-recal-file "$bqsr_table" \
  -O "$BAMDIR/${id}.rg.merged.dedup.recal.bam" --tmp-dir "$TMPDIR" 2>> "$LOG"
$SAM index "$BAMDIR/${id}.rg.merged.dedup.recal.bam"
set +x

#------------------------------------------------------------------------------
# STEP 5: HaplotypeCaller -> gVCF
# Theory: Local de Bruijn graph assembly per active region improves indel calling;
# emits reference-confidence gVCF required for joint genotyping.
#------------------------------------------------------------------------------
echo ">>> STEP 5: HaplotypeCaller -> gVCF"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS HaplotypeCaller \
  -R "$REF" -I "$BAMDIR/${id}.rg.merged.dedup.recal.bam" \
  -O "$VARCALLDIR/${id}.hc.g.vcf.gz" \
  $INTERVAL_ARG \
  --native-pair-hmm-threads "$THREADS" -ERC GVCF 2>> "$LOG"
set +x

#------------------------------------------------------------------------------
# STEP 6: GenotypeGVCFs -> Raw VCF
# Theory: Jointly genotype one or more gVCFs to produce high-confidence sample VCF.
#------------------------------------------------------------------------------
echo ">>> STEP 6: GenotypeGVCFs -> raw VCF"
set -x
"$GATK4_BIN" $GATK4_JAVA_OPTS GenotypeGVCFs \
  -R "$REF" -V "$VARCALLDIR/${id}.hc.g.vcf.gz" \
  -O "$VARCALLDIR/${id}.hc.raw.vcf.gz" --stand-call-conf 10 2>> "$LOG"
set +x

#------------------------------------------------------------------------------
# STEP 7: Build VQSR Models
# Theory: Use Gaussian mixture models on variant annotations
# to distinguish true variants from artefacts; requires sufficient counts.
#------------------------------------------------------------------------------
echo ">>> STEP 7: Optional VQSR if enough variants"
nSNP=$(zgrep -v '^#' "$VARCALLDIR/${id}.hc.raw.vcf.gz" | awk 'length($5)==1' | wc -l)
nINDEL=$(zgrep -v '^#' "$VARCALLDIR/${id}.hc.raw.vcf.gz" | awk 'length($5)!=1' | wc -l)
minSNP=1000; minINDEL=8000; apply_snp=false; apply_indel=false

if (( nSNP >= minSNP )); then
  "$GATK4_BIN" $GATK4_JAVA_OPTS VariantRecalibrator \
    -R "$REF" \
    -V "$VARCALLDIR/${id}.hc.raw.vcf.gz" \
    $SNP_RES \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
    --mode SNP \
    -O "$VARCALLDIR/${id}.hc.snp.recal.vcf.gz" \
    --tranches-file "$VARCALLDIR/${id}.hc.snp.tranches.txt" \
    --max-gaussians 6 2>> "$LOG"
  apply_snp=true
fi

if (( nINDEL >= minINDEL )); then
  "$GATK4_BIN" $GATK4_JAVA_OPTS VariantRecalibrator \
    -R "$REF" \
    -V "$VARCALLDIR/${id}.hc.raw.vcf.gz" \
    $INDEL_RES \
    -an QD -an FS -an ReadPosRankSum \
    --mode INDEL \
    -O "$VARCALLDIR/${id}.hc.indel.recal.vcf.gz" \
    --tranches-file "$VARCALLDIR/${id}.hc.indel.tranches.txt" \
    --max-gaussians 4 2>> "$LOG"
  apply_indel=true
fi

#------------------------------------------------------------------------------
# STEP 8: Apply VQSR Models or Fallback
#------------------------------------------------------------------------------
echo ">>> STEP 8: Apply VQSR or fallback"
tmp_vcf="$VARCALLDIR/${id}.hc.raw.vcf.gz"

if [ "$apply_snp" = true ]; then
  "$GATK4_BIN" $GATK4_JAVA_OPTS ApplyVQSR \
    -R "$REF" -V "$tmp_vcf" \
    --recal-file "$VARCALLDIR/${id}.hc.snp.recal.vcf.gz" \
    --tranches-file "$VARCALLDIR/${id}.hc.snp.tranches.txt" \
    --mode SNP --truth-sensitivity-filter-level 99.0 \
    -O "$VARCALLDIR/${id}.hc.post_snp.vcf.gz" 2>> "$LOG"
  tmp_vcf="$VARCALLDIR/${id}.hc.post_snp.vcf.gz"
fi

if [ "$apply_indel" = true ]; then
  "$GATK4_BIN" $GATK4_JAVA_OPTS ApplyVQSR \
    -R "$REF" -V "$tmp_vcf" \
    --recal-file "$VARCALLDIR/${id}.hc.indel.recal.vcf.gz" \
    --tranches-file "$VARCALLDIR/${id}.hc.indel.tranches.txt" \
    --mode INDEL --truth-sensitivity-filter-level 95.0 \
    -O "$VARCALLDIR/${id}.hc.vqsr.vcf.gz" 2>> "$LOG"
  tmp_vcf="$VARCALLDIR/${id}.hc.vqsr.vcf.gz"
fi

#------------------------------------------------------------------------------
# STEP 9: Hard-Filter & Write QC VCF
# Theory: Apply recommended hard filters on annotations when VQSR isn't applied or as QC.
#   SNPs: QD<2.0, FS>60.0, MQ<40.0, MQRankSum<-12.5, ReadPosRankSum<-8.0
#   INDELs: QD<2.0, FS>200.0, ReadPosRankSum<-20.0
#------------------------------------------------------------------------------
echo ">>> STEP 9: Hard-filter & write QC.vcf"
"$GATK4_BIN" $GATK4_JAVA_OPTS VariantFiltration \
  -R "$REF" \
  -V "$tmp_vcf" \
  --filter-name "LowQUAL" --filter-expression "QUAL < 30.0" \
  --filter-name "QD2"        --filter-expression "QD < 2.0" \
  --filter-name "FS60"       --filter-expression "FS > 60.0" \
  --filter-name "MQ40"       --filter-expression "MQ < 40.0" \
  --filter-name "MQRS-12.5"  --filter-expression "MQRankSum < -12.5" \
  --filter-name "RPRS-8"     --filter-expression "ReadPosRankSum < -8.0" \
  --filter-name "QD2_indel"  --filter-expression "QD < 2.0" \
  --filter-name "FS200"      --filter-expression "FS > 200.0" \
  --filter-name "RPRS-20"    --filter-expression "ReadPosRankSum < -20.0" \
  -O "$VARCALLDIR/${id}.hc.QC.vcf.gz" \
  2>> "$LOG"

#------------------------------------------------------------------------------
# STEP 10: Coverage Stats & Sex Determination
#------------------------------------------------------------------------------
echo ">>> STEP 10: Coverage & Sex Determination" 2>> "$LOG"

# choose chromosome 1 naming
if [[ "$REF" == *b37*.fasta ]]; then
  chrN=1
else
  chrN=chr1
fi

bam_raw="$BAMDIR/${id}.rg.merged.dedup.bam"
bam_recal="$BAMDIR/${id}.rg.merged.dedup.recal.bam"
out_raw="$STATSDIR/${chrN}.raw.bam"
out_dedup="$STATSDIR/${chrN}.dedup.bam"

# extract chr1 (or “1”) reads and redirect errors
$SAM view -b "$bam_raw" "$chrN"  > "$out_raw"   2>> "$LOG"
$SAM view -b "$bam_recal" "$chrN"  > "$out_dedup" 2>> "$LOG"

# index them, also logging STDERR
$SAM index "$out_raw"    2>> "$LOG"
$SAM index "$out_dedup"  2>> "$LOG"

# run coverage & sex scripts: STDOUT -> stats file, STDERR -> main log
"$BINDIR"/coverage.sh "$id" "$out_raw" "$out_dedup" "$PIPELINE" \
    > "$STATSDIR/${id}.coverage.txt" 2>> "$LOG"
"$BINDIR"/vcf2sex.sh "$VARCALLDIR/${id}.hc.QC.vcf.gz" \
    > "$STATSDIR/${id}.sex.txt" 2>> "$LOG"

# Delete $STATSDIR/*bam
rm "$out_raw" "$out_dedup" "$out_raw.bai" "$out_dedup.bai"

#------------------------------------------------------------------------------
# STEP 11: Cleanup BAMs (optional)
#------------------------------------------------------------------------------
if [ "$CLEANUP_BAM" = true ]; then
  echo ">>> STEP 11: Cleanup BAMs" 2>> "$LOG"
  # -f silences “no such file” errors if the glob is empty
  rm -f "$BAMDIR"/*.{bam,bai} 2>> "$LOG"
fi

# End
echo "All done! QC VCF: $VARCALLDIR/${id}.hc.QC.vcf.gz"
See Snakemake pipeline:
# wes_single.smk - bash-replica of wes_single.sh (GATK4.6) with per-rule logs
import os
import glob
import platform
import shlex
from pathlib import Path

snakefile_dir = Path(workflow.snakefile).parent
configfile: str(snakefile_dir / "config.yaml")

# -----------------------------
# Environment (match bash)
# -----------------------------
os.environ["TMPDIR"] = config["tmpdir"]
os.environ["LC_ALL"] = "C"
os.environ["GATK_DISABLE_AUTO_S3_UPLOAD"] = "true"

# -----------------------------
# Config / tools
# -----------------------------
DATADIR  = config["datadir"]
DBDIR    = config["dbdir"].format(datadir=DATADIR)
NGSUTILS = config["ngsutils"].format(datadir=DATADIR)
TMPDIR   = config["tmpdir"].format(datadir=DATADIR)
MEM      = config.get("mem", "8G")

PIPELINE = config.get("pipeline", "wes").lower()
if PIPELINE not in ("wes", "wgs"):
    raise ValueError("config[pipeline] must be 'wes' or 'wgs'")

CLEANUP_BAM = bool(config.get("cleanup_bam", False))

THREADS = workflow.cores or int(config.get("threads", 4))

ARCH = platform.machine()
if ARCH == "aarch64":
    JAVA = config["java"]["aarch64"]
    BWA  = config["tools"]["aarch64"]["bwa"].format(ngsutils=NGSUTILS)
    SAM  = config["tools"]["aarch64"]["samtools"].format(ngsutils=NGSUTILS)
else:
    JAVA = config["java"]["amd64"]
    BWA  = config["tools"]["amd64"]["bwa"].format(ngsutils=NGSUTILS)
    SAM  = config["tools"]["amd64"]["samtools"].format(ngsutils=NGSUTILS)

GATK4 = config["gatk4_cmd"].format(ngsutils=NGSUTILS, mem=MEM)

COV     = str(snakefile_dir / "coverage.sh")
VCF2SEX = str(snakefile_dir / "vcf2sex.sh")

bundle       = config["bundle"].format(dbdir=DBDIR)
REF          = config["ref"].format(bundle=bundle)
REFGZ        = config["refgz"].format(bundle=bundle)
dbSNP        = config["dbsnp"].format(dbdir=DBDIR)
MILLS_INDELS = config["mills_indels"].format(bundle=bundle)
KG_INDELS    = config["kg_indels"].format(bundle=bundle)
HAPMAP       = config["hapmap"].format(bundle=bundle)
OMNI         = config["omni"].format(bundle=bundle)

SNP_RES   = config["snp_res"].format(hapmap=HAPMAP, omni=OMNI, dbsnp=dbSNP)
INDEL_RES = config["indel_res"].format(mills_indels=MILLS_INDELS)

INTERVAL_LIST = config["interval_list"].format(bundle=bundle)
INTERVAL_ARG  = f"-L {shlex.quote(INTERVAL_LIST)}" if PIPELINE == "wes" else ""

MIN_SNP_FOR_VQSR   = int(config.get("min_snp_for_vqsr", 1000))
MIN_INDEL_FOR_VQSR = int(config.get("min_indel_for_vqsr", 8000))

# Output dirs
BAMDIR     = "01_bam"
VARCALLDIR = "02_varcall"
STATSDIR   = "03_stats"
LOGDIR     = "logs"
for d in (BAMDIR, VARCALLDIR, STATSDIR, LOGDIR):
    os.makedirs(d, exist_ok=True)

# -----------------------------
# Sample ID (match bash)
# -----------------------------
rawid = Path.cwd().parent.name
ID = rawid.split("_", 1)[0]

# -----------------------------
# FASTQ pairs (match bash glob)
# -----------------------------
FASTQ_DIR = "../"
FASTQ_R1 = sorted(glob.glob(os.path.join(FASTQ_DIR, "*_R1_*fastq.gz")))
if not FASTQ_R1:
    raise ValueError("No FASTQs found matching ../*_R1_*fastq.gz")

FASTQ_BASES = []
FASTQ_DICT = {}
for r1 in FASTQ_R1:
    r2 = r1.replace("_R1_", "_R2_")
    base = os.path.basename(r1).replace(".fastq.gz", "")
    base = base.split("_R1_", 1)[0]
    FASTQ_BASES.append(base)
    FASTQ_DICT[base] = {"r1": r1, "r2": r2}

RG_BAMS = [os.path.join(BAMDIR, f"{b}.rg.bam") for b in FASTQ_BASES]

CHR1 = "1" if "b37" in str(REF) else "chr1"

# -----------------------------
# Targets
# -----------------------------
FINAL_INPUTS = [
    os.path.join(VARCALLDIR, f"{ID}.hc.QC.vcf.gz"),
    os.path.join(STATSDIR,   f"{ID}.coverage.txt"),
    os.path.join(STATSDIR,   f"{ID}.sex.txt"),
]
if CLEANUP_BAM:
    FINAL_INPUTS.append(os.path.join(LOGDIR, f"{ID}.cleanup.done"))

rule all:
    input:
        FINAL_INPUTS

# -----------------------------
# STEP 1: Align & AddReadGroups
# -----------------------------
rule align_rg:
    input:
        r1=lambda wc: FASTQ_DICT[wc.base]["r1"],
        r2=lambda wc: FASTQ_DICT[wc.base]["r2"],
    output:
        bam=os.path.join(BAMDIR, "{base}.rg.bam"),
    threads: THREADS
    log:
        os.path.join(LOGDIR, f"{ID}.01_align_rg.{{base}}.log")
    shell:
        r"""
        set -eu
        SAMPLE=$(echo {wildcards.base} | cut -d'_' -f1-2)
        LANE=$(echo   {wildcards.base} | cut -d'_' -f3)
        RGID="${{SAMPLE}}.${{LANE}}.$(date +%s)"
        RGPU="${{SAMPLE}}.${{LANE}}.unit1"

        {BWA} mem -M -t {threads} {REFGZ} {input.r1} {input.r2} \
          | {GATK4} AddOrReplaceReadGroups \
              --INPUT /dev/stdin \
              --OUTPUT {output.bam} \
              --TMP_DIR {TMPDIR} \
              --RGPL ILLUMINA \
              --RGLB sureselect \
              --RGSM "$SAMPLE" \
              --RGID "$RGID" \
              --RGPU "$RGPU" \
          2>> {log}
        """

# -----------------------------
# STEP 2: Merge lane BAMs
# -----------------------------
rule merge_bams:
    input:
        rg_bams=RG_BAMS
    output:
        merged=os.path.join(BAMDIR, f"{ID}.rg.merged.bam")
    params:
        merge_inputs=lambda wc, input: " ".join([f"-I {b}" for b in input.rg_bams])
    log:
        os.path.join(LOGDIR, f"{ID}.02_merge_bams.log")
    shell:
        r"""
        set -eu
        {GATK4} MergeSamFiles \
          {params.merge_inputs} \
          -O {output.merged} \
          --CREATE_INDEX true \
          --VALIDATION_STRINGENCY SILENT \
          --TMP_DIR {TMPDIR} \
          2>> {log}
        """

# -----------------------------
# STEP 3: MarkDuplicates (+ samtools index)
# -----------------------------
rule mark_duplicates:
    input:
        merged=os.path.join(BAMDIR, f"{ID}.rg.merged.bam")
    output:
        dedup=os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.bam"),
        metrics=os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.metrics.txt")
    log:
        os.path.join(LOGDIR, f"{ID}.03_mark_duplicates.log")
    shell:
        r"""
        set -eu
        {GATK4} MarkDuplicates \
          -I {input.merged} \
          -O {output.dedup} \
          --METRICS_FILE {output.metrics} \
          --CREATE_INDEX true \
          --TMP_DIR {TMPDIR} \
          2>> {log}
        {SAM} index {output.dedup} 2>> {log}
        """

# -----------------------------
# STEP 4: BQSR
# -----------------------------
rule bqsr:
    input:
        dedup=os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.bam")
    output:
        table=os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.recal.table"),
        recal=os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.recal.bam")
    log:
        os.path.join(LOGDIR, f"{ID}.04_bqsr.log")
    shell:
        r"""
        set -eu
        {GATK4} BaseRecalibrator \
          -R {REF} \
          -I {input.dedup} \
          --known-sites {dbSNP} \
          --known-sites {MILLS_INDELS} \
          --known-sites {KG_INDELS} \
          -O {output.table} \
          --tmp-dir {TMPDIR} \
          2>> {log}

        {GATK4} ApplyBQSR \
          -R {REF} \
          -I {input.dedup} \
          --bqsr-recal-file {output.table} \
          -O {output.recal} \
          --tmp-dir {TMPDIR} \
          2>> {log}

        {SAM} index {output.recal} 2>> {log}
        """

# -----------------------------
# STEP 5: HaplotypeCaller -> gVCF
# -----------------------------
rule haplotypecaller:
    input:
        recal=os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.recal.bam")
    output:
        gvcf=os.path.join(VARCALLDIR, f"{ID}.hc.g.vcf.gz")
    threads: THREADS
    log:
        os.path.join(LOGDIR, f"{ID}.05_haplotypecaller.log")
    shell:
        r"""
        set -eu
        {GATK4} HaplotypeCaller \
          -R {REF} \
          -I {input.recal} \
          -O {output.gvcf} \
          {INTERVAL_ARG} \
          --native-pair-hmm-threads {threads} \
          -ERC GVCF \
          2>> {log}
        """

# -----------------------------
# STEP 6: GenotypeGVCFs -> raw VCF
# -----------------------------
rule genotype_gvcfs:
    input:
        gvcf=os.path.join(VARCALLDIR, f"{ID}.hc.g.vcf.gz")
    output:
        raw=os.path.join(VARCALLDIR, f"{ID}.hc.raw.vcf.gz")
    log:
        os.path.join(LOGDIR, f"{ID}.06_genotype_gvcfs.log")
    shell:
        r"""
        set -eu
        {GATK4} GenotypeGVCFs \
          -R {REF} \
          -V {input.gvcf} \
          -O {output.raw} \
          --stand-call-conf 10 \
          2>> {log}
        """

# -----------------------------
# STEPS 7-9: Conditional VQSR + always QC VariantFiltration
# -----------------------------
rule vqsr_and_qc:
    input:
        raw=os.path.join(VARCALLDIR, f"{ID}.hc.raw.vcf.gz")
    output:
        qc=os.path.join(VARCALLDIR, f"{ID}.hc.QC.vcf.gz")
    log:
        os.path.join(LOGDIR, f"{ID}.07_vqsr_and_qc.log")
    shell:
        r"""
        set -eu

        rawvcf={input.raw}
        tmp_vcf="$rawvcf"

        nSNP=$(zgrep -v '^#' "$rawvcf" | awk 'length($5)==1' | wc -l | tr -d ' ')
        nINDEL=$(zgrep -v '^#' "$rawvcf" | awk 'length($5)!=1' | wc -l | tr -d ' ')

        apply_snp=false
        apply_indel=false

        if [ "$nSNP" -ge "{MIN_SNP_FOR_VQSR}" ]; then
          {GATK4} VariantRecalibrator \
            -R {REF} \
            -V "$rawvcf" \
            {SNP_RES} \
            -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
            --mode SNP \
            -O {VARCALLDIR}/{ID}.hc.snp.recal.vcf.gz \
            --tranches-file {VARCALLDIR}/{ID}.hc.snp.tranches.txt \
            --max-gaussians 6 \
            2>> {log}
          apply_snp=true
        fi

        if [ "$nINDEL" -ge "{MIN_INDEL_FOR_VQSR}" ]; then
          {GATK4} VariantRecalibrator \
            -R {REF} \
            -V "$rawvcf" \
            {INDEL_RES} \
            -an QD -an FS -an ReadPosRankSum \
            --mode INDEL \
            -O {VARCALLDIR}/{ID}.hc.indel.recal.vcf.gz \
            --tranches-file {VARCALLDIR}/{ID}.hc.indel.tranches.txt \
            --max-gaussians 4 \
            2>> {log}
          apply_indel=true
        fi

        if [ "$apply_snp" = true ]; then
          {GATK4} ApplyVQSR \
            -R {REF} -V "$tmp_vcf" \
            --recal-file {VARCALLDIR}/{ID}.hc.snp.recal.vcf.gz \
            --tranches-file {VARCALLDIR}/{ID}.hc.snp.tranches.txt \
            --mode SNP --truth-sensitivity-filter-level 99.0 \
            -O {VARCALLDIR}/{ID}.hc.post_snp.vcf.gz \
            2>> {log}
          tmp_vcf="{VARCALLDIR}/{ID}.hc.post_snp.vcf.gz"
        fi

        if [ "$apply_indel" = true ]; then
          {GATK4} ApplyVQSR \
            -R {REF} -V "$tmp_vcf" \
            --recal-file {VARCALLDIR}/{ID}.hc.indel.recal.vcf.gz \
            --tranches-file {VARCALLDIR}/{ID}.hc.indel.tranches.txt \
            --mode INDEL --truth-sensitivity-filter-level 95.0 \
            -O {VARCALLDIR}/{ID}.hc.vqsr.vcf.gz \
            2>> {log}
          tmp_vcf="{VARCALLDIR}/{ID}.hc.vqsr.vcf.gz"
        fi

        {GATK4} VariantFiltration \
          -R {REF} \
          -V "$tmp_vcf" \
          --filter-name "LowQUAL" --filter-expression "QUAL < 30.0" \
          --filter-name "QD2"        --filter-expression "QD < 2.0" \
          --filter-name "FS60"       --filter-expression "FS > 60.0" \
          --filter-name "MQ40"       --filter-expression "MQ < 40.0" \
          --filter-name "MQRS-12.5"  --filter-expression "MQRankSum < -12.5" \
          --filter-name "RPRS-8"     --filter-expression "ReadPosRankSum < -8.0" \
          --filter-name "QD2_indel"  --filter-expression "QD < 2.0" \
          --filter-name "FS200"      --filter-expression "FS > 200.0" \
          --filter-name "RPRS-20"    --filter-expression "ReadPosRankSum < -20.0" \
          -O {output.qc} \
          2>> {log}
        """

# -----------------------------
# STEP 10: Coverage
# -----------------------------
rule coverage_stats:
    input:
        raw   = os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.bam"),
        recal = os.path.join(BAMDIR, f"{ID}.rg.merged.dedup.recal.bam")
    output:
        cov   = os.path.join(STATSDIR, f"{ID}.coverage.txt")
    log:
        os.path.join(LOGDIR, f"{ID}.08_coverage_stats.log")
    shell:
        r"""
        set -eu

        chrN="{CHR1}"
        bam_raw="{input.raw}"
        bam_recal="{input.recal}"

        out_raw="{STATSDIR}/{CHR1}.raw.bam"
        out_dedup="{STATSDIR}/{CHR1}.dedup.bam"

        {SAM} view -b "$bam_raw"   "$chrN" > "$out_raw"   2>> {log}
        {SAM} view -b "$bam_recal" "$chrN" > "$out_dedup" 2>> {log}

        {SAM} index "$out_raw"   2>> {log}
        {SAM} index "$out_dedup" 2>> {log}

        bash {COV} {ID} "$out_raw" "$out_dedup" {PIPELINE} \
          > {output.cov} 2>> {log}

        rm -f "$out_raw" "$out_dedup" "$out_raw.bai" "$out_dedup.bai"
        """

# -----------------------------
# STEP 10 cont: Sex
# -----------------------------
rule sex_determination:
    input:
        qc=os.path.join(VARCALLDIR, f"{ID}.hc.QC.vcf.gz")
    output:
        sex=os.path.join(STATSDIR, f"{ID}.sex.txt")
    log:
        os.path.join(LOGDIR, f"{ID}.09_sex_determination.log")
    shell:
        r"""
        set -eu
        bash {VCF2SEX} {input.qc} > {output.sex} 2>> {log}
        """

# -----------------------------
# STEP 11: Optional cleanup
# -----------------------------
rule cleanup_bams:
    input:
        os.path.join(VARCALLDIR, f"{ID}.hc.QC.vcf.gz")
    output:
        done=os.path.join(LOGDIR, f"{ID}.cleanup.done")
    log:
        os.path.join(LOGDIR, f"{ID}.10_cleanup_bams.log")
    run:
        # write any python errors to the log file
        try:
            if CLEANUP_BAM:
                for pat in ("01_bam/*.bam", "01_bam/*.bai"):
                    for fp in glob.glob(pat):
                        try:
                            os.remove(fp)
                        except FileNotFoundError:
                            pass
            Path(output.done).write_text("ok\n")
        except Exception as e:
            Path(log[0]).write_text(str(e) + "\n")
            raise

Diagram: Single-Sample WES/WGS Workflow

flowchart TD A["Input FASTQ reads"] B["Align with BWA MEM and add read groups"] C["Merge lane BAM files"] D["Mark duplicates"] E["Base recalibration BQSR"] F["Run HaplotypeCaller to produce gVCF"] G["Run GenotypeGVCFs to produce raw VCF"] H["Check if enough variants for VQSR"] I["Build SNP VQSR model"] J["Build INDEL VQSR model"] K["Apply VQSR"] L["Apply hard filters"] M["Write final QC VCF"] N["Compute coverage and infer sex"] A --> B --> C --> D --> E --> F --> G --> H H -->|Yes| I --> J --> K --> M H -->|No| L --> M M --> N

Purpose

This pipeline processes one sample at a time and produces a high-quality, filtered VCF suitable for downstream analysis. It automatically adapts to:

  • WES: restricted to an exome interval list
  • WGS: whole genome (no interval restriction)

What the Pipeline Does

1. Alignment & Read Groups

  • Align paired-end FASTQ files using BWA-MEM.
  • Add read groups (sample, library, lane, platform) required by GATK.
  • Output: lane-level BAMs with correct RG tags.

2. Lane Merging

  • Merge all lane BAMs for the same sample into a single BAM.
  • Ensures duplicate marking and BQSR operate on the full dataset.

3. Duplicate Marking

  • Use GATK MarkDuplicates on the merged BAM.
  • Flags PCR/optical duplicates to prevent them from inflating support for artefactual variants.

4. Base Quality Score Recalibration (BQSR)

  • Two-step process: BaseRecalibrator then ApplyBQSR.
  • Uses known variant databases (dbSNP, Mills, 1000G indels) to model and correct systematic base-quality errors.
  • Output: recalibrated BAM used for variant calling.

5. Variant Calling (HaplotypeCaller, gVCF)

  • Run GATK HaplotypeCaller in GVCF mode (-ERC GVCF).
  • WES: uses exome intervals; WGS: full genome.
  • Output: <id>.hc.g.vcf.gz (per-sample gVCF).

6. GenotypeGVCFs (Raw VCF)

  • Run GATK GenotypeGVCFs on the sample gVCF.
  • Output: <id>.hc.raw.vcf.gz (raw VCF with SNPs and indels).

7. Variant Quality Score Recalibration (VQSR)

  • If there are enough variants (SNPs and indels), build VQSR models:
  • VariantRecalibrator for SNPs and indels separately.
  • Uses multiple annotations (QD, MQ, FS, MQRankSum, ReadPosRankSum).
  • Output: recalibration VCFs and tranche files.

8. Apply VQSR or Hard Filters

  • If models exist:
  • Apply SNP VQSR.
  • Then apply INDEL VQSR.
  • Output: <id>.hc.vqsr.vcf.gz.
  • If not:
  • Skip directly to hard filters on the raw VCF or post-SNP VQSR VCF.

9. Generate Final QC VCF

  • Run VariantFiltration with recommended hard filters on annotations.
  • Output: <id>.hc.QC.vcf.gz (final QC VCF).

10. Coverage & Sex Determination

  • Extract chromosome 1 reads from raw and recalibrated BAMs.
  • Compute coverage statistics.
  • Infer sample sex from final VCF using a dedicated script.
  • Outputs:
  • 03_stats/<id>.coverage.txt
  • 03_stats/<id>.sex.txt

Output Files

File Meaning
02_varcall/<id>.hc.g.vcf.gz Per-sample gVCF (HaplotypeCaller)
02_varcall/<id>.hc.raw.vcf.gz Raw VCF after GenotypeGVCFs
02_varcall/<id>.hc.vqsr.vcf.gz VCF after VQSR (if VQSR was applied)
02_varcall/<id>.hc.QC.vcf.gz Final QC-filtered VCF (recommended)
03_stats/<id>.coverage.txt Coverage metrics
03_stats/<id>.sex.txt Sex determination result
logs/<id>.log Main pipeline log

When to Use This Pipeline

  • Standard clinical or research WES or WGS processing.
  • Generating gVCFs for cohort joint genotyping.
  • Producing high-quality single-sample VCFs for interpretation.