Frequently Asked Questions¶
WES / WGS¶
What are the reference genomes used?
GRCh37 (b37) - GATK-compatible reference genome
GRCh38 (hg38) - GATK-compatible reference genome
last change 2025-10-15 by Manuel Rueda ¶
What are the capture kits for WES?
-
For GATK version 3.5: Exome capture is based on Agilent SureSelect.
-
For GATK version 4.6: Exome and WGS reference is based on the GATK bundle (b37).
last change 2025-10-15 by Manuel Rueda ¶
mtDNA (MToolBox)¶
What is the reference genome used?
RSRS (rsrs) - Reconstructed Sapiens Reference Sequence
last change 2025-10-15 by Manuel Rueda ¶
What does GT=1 mean in results?
In variant reports, the Genotype (GT) field shows the observed allele using VCF allele indices:
0= reference allele1= first alternate (ALT) allele2,3, ... = additional ALT alleles (multiallelic)
For chrM/MT (mtDNA), callers typically encode genotypes as haploid (not allele pairs).
Meaning¶
GT = 1β ALT allele detected in that sample- No
/or|separator because only one allele index is stored - Biological interpretation relies on:
HFβ heteroplasmy fraction (molecules supporting ALT)DPβ read depth (total support)
Examples¶
| GT | Interpretation (mtDNA) |
|---|---|
0 |
Only reference allele observed |
1 |
ALT allele present (homoplasmic or heteroplasmic, check HF + DP) |
0/1, 1/2 (rare) |
Multiallelic call, still haploid encoding β not diploid zygosity |
TL;DR:
GT = 1= ALT detected. CheckHFandDPfor biology.
Tip
For mtDNA, GT tells you which allele, not how much.
Use HF + DP to interpret heteroplasmy or homoplasmy.
last change 2025-10-15 by Manuel Rueda ¶
General¶
How do I set up cbicall to work on an HPC system?
To adapt cbicall for your HPC environment, update the file
workflows/bash/gatk_3.5/parameters.sh so that it reflects your local module
system, paths, and resource settings.
Below is the configuration used at CNAG-HPC, which you can use as a template:
# $VERSION taken from CBICall
# Paths
DATADIR=/media/mrueda/2TBS
#DATADIR=/cbicall-data # From inside the container
DBDIR=$DATADIR/Databases
NGSUTILS=$DATADIR/NGSutils
# Environment
export TMPDIR=$DATADIR/tmp
export LC_ALL=C
export GATK_DISABLE_AUTO_S3_UPLOAD=true # disable unintended S3 uploads
export ARCH=$(uname -m)
# Genome selection (default b37)
: "${GENOME:=b37}"
# Memory & architecture
MEM=8G
MEM_GENOTYPE=64G
# Java & tool binaries per architecture
if [ "$ARCH" == "aarch64" ]; then
export JAVA8=/usr/lib/jvm/java-8-openjdk-arm64/bin/java
BWA=$NGSUTILS/bwa-0.7.18_arm64/bwa
SAM=$NGSUTILS/samtools-0.1.19_arm64/samtools
BED=$NGSUTILS/bedtools2_arm64/bin/bedtools
# Mtoolbox bundled binaries do not work with aarch64
# PY27_PREFIX=$NGSUTILS/python_2.7/linux-aarch64/Python-2.7.18
else
export JAVA8=$NGSUTILS/java8/amazon-corretto-8.472.08.1-linux-x64/bin/java
BWA=$NGSUTILS/bwa-0.7.18/bwa
SAM=$NGSUTILS/samtools-0.1.19/samtools
BED=$NGSUTILS/bedtools2/bin/bedtools
PY27_PREFIX=$NGSUTILS/python_2.7/linux-x86_64/python27_portable
# Python 2 module to use on HPC (fallback)
PY27_MODULE="Python/2.7.18-GCCcore-11.2.0"
fi
# Picard (shared by GATK3 & bed conversion)
PIC="$JAVA8 -Xmx$MEM -Djava.io.tmpdir=$TMPDIR -jar $NGSUTILS/picard-2.25/build/libs/picard.jar"
# GATK 3.5 (legacy)
GATK="$JAVA8 -Xmx$MEM -Djava.io.tmpdir=$TMPDIR -jar $NGSUTILS/gatk/gatk-3.5/GenomeAnalysisTK.jar"
# GATK 4+ launcher (recommended)
GATK4_BIN="$NGSUTILS/gatk/gatk-4.6.2.0/gatk"
GATK4_JAVA_OPTS="--java-options -Xmx${MEM}"
GATK4_JAVA_OPTS_64G="--java-options -Xmx${MEM_GENOTYPE}"
# MToolBox directory and DB
MTOOLBOXDIR=$NGSUTILS/MToolBox-master/MToolBox
MTOOLBOXDB=$DBDIR/mtDNA
############################################
# Reference bundle & resources by GENOME
############################################
if [ "$GENOME" = "hg38" ]; then
# GATK bundle & reference (hg38) - WGS only
BUNDLE=$DBDIR/GATK_bundle/hg38
REF=$BUNDLE/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
REFGZ=$REF # not gz in this bundle; keep var for compatibility
REF_DICT=$BUNDLE/resources_broad_hg38_v0_Homo_sapiens_assembly38.dict
# Known-sites / resources (hg38)
dbSNP="$DBDIR/dbSNP/human_9606_b146_GRCh38p2/All_20160407.renamed.vcf.gz"
MILLS_INDELS=$BUNDLE/resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
KG_INDELS=$BUNDLE/resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf.gz
HAPMAP=$BUNDLE/resources_broad_hg38_v0_hapmap_3.3.hg38.vcf.gz
OMNI=$BUNDLE/resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz
# No WES intervals/capture on hg38 (WGS only)
EXOME_BED=""
INTERVAL_LIST=""
EXOM=""
else
# Default b37
BUNDLE=$DBDIR/GATK_bundle/b37
REF=$BUNDLE/references_b37_Homo_sapiens_assembly19.fasta
REFGZ=$BUNDLE/references_b37_Homo_sapiens_assembly19.fasta.gz
REF_DICT=$BUNDLE/references_b37_Homo_sapiens_assembly19.dict
# Variant resources
dbSNP=$DBDIR/dbSNP/human_9606_b144_GRCh37p13/All_20160408.vcf.gz
MILLS_INDELS=$BUNDLE/b37_Mills_and_1000G_gold_standard.indels.b37.vcf.gz
KG_INDELS=$BUNDLE/b37_1000G_phase1.indels.b37.vcf.gz
HAPMAP=$BUNDLE/b37_hapmap_3.3.b37.vcf.gz
OMNI=$BUNDLE/b37_1000G_omni2.5.b37.vcf.gz
# Exome targets
EXOME_BED=$BUNDLE/b37_Broad.human.exome.b37.bed
INTERVAL_LIST=$BUNDLE/b37_Broad.human.exome.b37.interval_list
# Agilent SureSelect Whole Exome (your current setup)
EXOM=$DBDIR/Agilent_SureSelect/hg19/bed
fi
# Training sets for VQSR
SNP_RES="-resource:hapmap,known=false,training=true,truth=true,prior=15.0 $HAPMAP \
-resource:omni,known=false,training=true,truth=false,prior=12.0 $OMNI \
-resource:dbsnp,known=true,training=false,truth=false,prior=6.0 $dbSNP"
INDEL_RES="-resource:mills,known=true,training=true,truth=true,prior=12.0 $MILLS_INDELS"
# Joint variant calling
BATCH_SIZE=50
MIN_SNP_FOR_VQSR=1000
MIN_INDEL_FOR_VQSR=8000
# UnifiedGenotyper parameters (legacy)
DCOV=1000
UG_CALL=50
UG_EMIT=10
Do you have an example in how to run cbicall in Slurm HPC?
#!/bin/bash
#
# run_cbicall_slurm.sh
# usage: ./run_cbicall_slurm.sh <sample_id> <pipeline: wes|wgs>
if [ "$#" -ne 2 ]; then
echo "Usage: $0 <sample_id> <pipeline: wes|wgs>"
exit 1
fi
SAMPLE_ID=$1
PIPELINE=$2
if [[ "$PIPELINE" != "wes" && "$PIPELINE" != "wgs" ]]; then
echo "Error: pipeline must be 'wes' or 'wgs'"
exit 1
fi
# choose SLURM settings based on pipeline
if [ "$PIPELINE" = "wes" ]; then
QUEUE="normal"
TIME="10:00:00"
elif [ "$PIPELINE" = "wgs" ]; then
QUEUE="vlong"
TIME="2-00:00:00"
fi
# Uppercase version of pipeline
PIPELINE_UC=${PIPELINE^^}
# where your data and logs live
WORKDIR="/scratch_isilon/projects/0012-hereditary/dbgap/fastq/phs001585/${PIPELINE_UC}/${SAMPLE_ID}"
# name the generated job script
JOB_SCRIPT="job_${SAMPLE_ID}_${PIPELINE}.slurm"
# Number of threads
THREADS=4
# RAM (x1.5 to help prevent oom-kills)
MEM="24G"
cat > "${JOB_SCRIPT}" <<EOF
#!/bin/bash
#SBATCH --job-name=cbicall
#SBATCH -q ${QUEUE}
#SBATCH -D ${WORKDIR}
#SBATCH -e ${WORKDIR}/slurm-%N.%j.err
#SBATCH -o ${WORKDIR}/slurm-%N.%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=${THREADS}
#SBATCH --mem=${MEM}
#SBATCH -t ${TIME}
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=manuel.rueda@cnag.eu
# Load Python + modules
module load Python/3.10.8-GCCcore-12.2.0
export PYTHONPATH="/software/biomed/cbi_py3/lib/python3.10/site-packages:${PYTHONPATH}"
# Set CBICALL exe
CBICALL_DIR="/software/biomed/cbicall"
CBICALL="\$CBICALL_DIR/bin/cbicall"
cd \$SLURM_SUBMIT_DIR
# write a pipelineβspecific yaml
YAML_FILE="${SAMPLE_ID}_${PIPELINE}_param.yaml"
cat <<YAML > "\${YAML_FILE}"
mode: single
pipeline: ${PIPELINE}
workflow_engine: bash
gatk_version: gatk-4.6
sample: ${WORKDIR}
projectdir: ${SAMPLE_ID}_cbicall
cleanup_bam: false
YAML
srun "\$CBICALL" \\
-p "\$YAML_FILE" \\
-t $THREADS \\
--no-color \\
--no-emoji
EOF
# submit it
sbatch "${JOB_SCRIPT}"
last change 2025-10-15 by Manuel Rueda ¶
How do I cite CBIcall?
You can cite the CBIcall paper. Thx!
Citation
CBIcall: a configuration-driven framework for variant calling in large DNA-seq cohorts. Manuscript In preparation.