WES/WGS Single-Sample Pipeline
A user-focused guide to processing whole-exome (WES) and whole-genome (WGS) data using GATK Best Practices.
- Bash
- Snakemake
Source: View source
Source: View source
Diagram: Single-Sample WES/WGS Workflow
Purpose
This pipeline processes one sample at a time and produces a filtered VCF and gVCF organized for downstream tools and project QC. It automatically adapts to:
- WES: restricted to an exome interval list
- WGS: whole genome (no interval restriction)
What the Pipeline Does
1. Alignment & Read Groups
- Align paired-end FASTQ files using BWA-MEM.
- Add read groups (sample, library, lane, platform) required by GATK.
- Output: lane-level BAMs with correct RG tags.
2. Lane Merging
- Merge all lane BAMs for the same sample into a single BAM.
- Ensures duplicate marking and BQSR operate on the full dataset.
3. Duplicate Marking
- Use GATK
MarkDuplicateson the merged BAM. - Flags PCR/optical duplicates to prevent them from inflating support for artefactual variants.
4. Base Quality Score Recalibration (BQSR)
- Two-step process:
BaseRecalibratorthenApplyBQSR. - Uses known variant databases (dbSNP, Mills, 1000G indels) to model and correct systematic base-quality errors.
- Output: recalibrated BAM used for variant calling.
5. Variant Calling (HaplotypeCaller, gVCF)
- Run GATK
HaplotypeCallerin GVCF mode (-ERC GVCF). - WES: uses exome intervals; WGS: full genome.
- Output:
<id>.hc.g.vcf.gz(per-sample gVCF).
6. GenotypeGVCFs (Raw VCF)
- Run GATK
GenotypeGVCFson the sample gVCF. - Output:
<id>.hc.raw.vcf.gz(raw VCF with SNPs and indels).
7. Variant Quality Score Recalibration (VQSR)
- If there are enough variants (SNPs and indels), build VQSR models:
VariantRecalibratorfor SNPs and indels separately.- Uses multiple annotations (QD, MQ, FS, MQRankSum, ReadPosRankSum).
- Output: recalibration VCFs and tranche files.
8. Apply VQSR or Hard Filters
- If models exist:
- Apply SNP VQSR.
- Then apply INDEL VQSR.
- Output:
<id>.hc.vqsr.vcf.gz.
- If not:
- Skip directly to hard filters on the raw VCF or post-SNP VQSR VCF.
9. Generate Final QC VCF
- Run
VariantFiltrationwith recommended hard filters on annotations. - Output:
<id>.hc.QC.vcf.gz(final QC VCF).
10. Coverage & Sex Determination
- Extract chromosome 1 reads from raw and recalibrated BAMs.
- Compute coverage statistics.
- Infer sample sex from final VCF using a dedicated script.
- Outputs:
03_stats/<id>.coverage.txt03_stats/<id>.sex.txt
Output Files
| File | Meaning |
|---|---|
02_varcall/<id>.hc.g.vcf.gz | Per-sample gVCF (HaplotypeCaller) |
02_varcall/<id>.hc.raw.vcf.gz | Raw VCF after GenotypeGVCFs |
02_varcall/<id>.hc.vqsr.vcf.gz | VCF after VQSR (if VQSR was applied) |
02_varcall/<id>.hc.QC.vcf.gz | Final QC-filtered VCF (recommended) |
03_stats/<id>.coverage.txt | Coverage metrics |
03_stats/<id>.sex.txt | Sex determination result |
logs/<id>.log | Main pipeline log |
When to Use This Pipeline
- Standard research WES or WGS processing.
- Generating gVCFs for cohort joint genotyping.
- Producing filtered single-sample VCFs for downstream review or interpretation.