Skip to content

Frequently Asked Questions

WES / WGS

What are the reference genomes used?

GRCh37 (b37) - GATK-compatible reference genome

GRCh38 (hg38) - GATK-compatible reference genome

last change 2025-10-15 by Manuel Rueda
What are the capture kits for WES?
  • For GATK version 3.5: Exome capture is based on Agilent SureSelect.

  • For GATK version 4.6: Exome and WGS reference is based on the GATK bundle (b37).

last change 2025-10-15 by Manuel Rueda

mtDNA (MToolBox)

What is the reference genome used?

RSRS (rsrs) - Reconstructed Sapiens Reference Sequence

last change 2025-10-15 by Manuel Rueda
What does GT=1 mean in results?

In variant reports, the Genotype (GT) field shows the observed allele using VCF allele indices:

  • 0 = reference allele
  • 1 = first alternate (ALT) allele
  • 2, 3, ... = additional ALT alleles (multiallelic)

For chrM/MT (mtDNA), callers typically encode genotypes as haploid (not allele pairs).

Meaning

  • GT = 1 β†’ ALT allele detected in that sample
  • No / or | separator because only one allele index is stored
  • Biological interpretation relies on:
    • HF β†’ heteroplasmy fraction (molecules supporting ALT)
    • DP β†’ read depth (total support)

Examples

GT Interpretation (mtDNA)
0 Only reference allele observed
1 ALT allele present (homoplasmic or heteroplasmic, check HF + DP)
0/1, 1/2 (rare) Multiallelic call, still haploid encoding β€” not diploid zygosity

TL;DR: GT = 1 = ALT detected. Check HF and DP for biology.

Tip

For mtDNA, GT tells you which allele, not how much.

Use HF + DP to interpret heteroplasmy or homoplasmy.

last change 2025-10-15 by Manuel Rueda

General

How do I set up cbicall to work on an HPC system?

To adapt cbicall for your HPC environment, update the file
workflows/bash/gatk_3.5/parameters.sh so that it reflects your local module system, paths, and resource settings.

Below is the configuration used at CNAG-HPC, which you can use as a template:

# $VERSION taken from CBICall

# Paths
DATADIR=/media/mrueda/2TBS
#DATADIR=/cbicall-data  # From inside the container
DBDIR=$DATADIR/Databases
NGSUTILS=$DATADIR/NGSutils

# Environment
export TMPDIR=$DATADIR/tmp
export LC_ALL=C
export GATK_DISABLE_AUTO_S3_UPLOAD=true   # disable unintended S3 uploads
export ARCH=$(uname -m)

# Genome selection (default b37)
: "${GENOME:=b37}"

# Memory & architecture
MEM=8G
MEM_GENOTYPE=64G

# Java & tool binaries per architecture
if [ "$ARCH" == "aarch64" ]; then
    export JAVA8=/usr/lib/jvm/java-8-openjdk-arm64/bin/java
    BWA=$NGSUTILS/bwa-0.7.18_arm64/bwa
    SAM=$NGSUTILS/samtools-0.1.19_arm64/samtools
    BED=$NGSUTILS/bedtools2_arm64/bin/bedtools
    # Mtoolbox bundled binaries do not work with aarch64
    # PY27_PREFIX=$NGSUTILS/python_2.7/linux-aarch64/Python-2.7.18
else
    export JAVA8=$NGSUTILS/java8/amazon-corretto-8.472.08.1-linux-x64/bin/java
    BWA=$NGSUTILS/bwa-0.7.18/bwa
    SAM=$NGSUTILS/samtools-0.1.19/samtools
    BED=$NGSUTILS/bedtools2/bin/bedtools
    PY27_PREFIX=$NGSUTILS/python_2.7/linux-x86_64/python27_portable
    # Python 2 module to use on HPC (fallback)
    PY27_MODULE="Python/2.7.18-GCCcore-11.2.0"
fi

# Picard (shared by GATK3 & bed conversion)
PIC="$JAVA8 -Xmx$MEM -Djava.io.tmpdir=$TMPDIR -jar $NGSUTILS/picard-2.25/build/libs/picard.jar"

# GATK 3.5 (legacy)
GATK="$JAVA8 -Xmx$MEM -Djava.io.tmpdir=$TMPDIR -jar $NGSUTILS/gatk/gatk-3.5/GenomeAnalysisTK.jar"

# GATK 4+ launcher (recommended)
GATK4_BIN="$NGSUTILS/gatk/gatk-4.6.2.0/gatk"
GATK4_JAVA_OPTS="--java-options -Xmx${MEM}"
GATK4_JAVA_OPTS_64G="--java-options -Xmx${MEM_GENOTYPE}"

# MToolBox directory and DB
MTOOLBOXDIR=$NGSUTILS/MToolBox-master/MToolBox
MTOOLBOXDB=$DBDIR/mtDNA

############################################
# Reference bundle & resources by GENOME
############################################
if [ "$GENOME" = "hg38" ]; then
    # GATK bundle & reference (hg38) - WGS only
    BUNDLE=$DBDIR/GATK_bundle/hg38

    REF=$BUNDLE/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
    REFGZ=$REF   # not gz in this bundle; keep var for compatibility
    REF_DICT=$BUNDLE/resources_broad_hg38_v0_Homo_sapiens_assembly38.dict

    # Known-sites / resources (hg38)
    dbSNP="$DBDIR/dbSNP/human_9606_b146_GRCh38p2/All_20160407.renamed.vcf.gz"
    MILLS_INDELS=$BUNDLE/resources_broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
    KG_INDELS=$BUNDLE/resources_broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf.gz
    HAPMAP=$BUNDLE/resources_broad_hg38_v0_hapmap_3.3.hg38.vcf.gz
    OMNI=$BUNDLE/resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz

    # No WES intervals/capture on hg38 (WGS only)
    EXOME_BED=""
    INTERVAL_LIST=""
    EXOM=""

else
    # Default b37
    BUNDLE=$DBDIR/GATK_bundle/b37
    REF=$BUNDLE/references_b37_Homo_sapiens_assembly19.fasta
    REFGZ=$BUNDLE/references_b37_Homo_sapiens_assembly19.fasta.gz
    REF_DICT=$BUNDLE/references_b37_Homo_sapiens_assembly19.dict

    # Variant resources
    dbSNP=$DBDIR/dbSNP/human_9606_b144_GRCh37p13/All_20160408.vcf.gz
    MILLS_INDELS=$BUNDLE/b37_Mills_and_1000G_gold_standard.indels.b37.vcf.gz
    KG_INDELS=$BUNDLE/b37_1000G_phase1.indels.b37.vcf.gz
    HAPMAP=$BUNDLE/b37_hapmap_3.3.b37.vcf.gz
    OMNI=$BUNDLE/b37_1000G_omni2.5.b37.vcf.gz

    # Exome targets
    EXOME_BED=$BUNDLE/b37_Broad.human.exome.b37.bed
    INTERVAL_LIST=$BUNDLE/b37_Broad.human.exome.b37.interval_list

    # Agilent SureSelect Whole Exome (your current setup)
    EXOM=$DBDIR/Agilent_SureSelect/hg19/bed
fi

# Training sets for VQSR
SNP_RES="-resource:hapmap,known=false,training=true,truth=true,prior=15.0 $HAPMAP \
         -resource:omni,known=false,training=true,truth=false,prior=12.0 $OMNI \
         -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 $dbSNP"
INDEL_RES="-resource:mills,known=true,training=true,truth=true,prior=12.0 $MILLS_INDELS"

# Joint variant calling
BATCH_SIZE=50
MIN_SNP_FOR_VQSR=1000
MIN_INDEL_FOR_VQSR=8000

# UnifiedGenotyper parameters (legacy)
DCOV=1000
UG_CALL=50
UG_EMIT=10
Do you have an example in how to run cbicall in Slurm HPC?
#!/bin/bash
#
# run_cbicall_slurm.sh
# usage: ./run_cbicall_slurm.sh <sample_id> <pipeline: wes|wgs>

if [ "$#" -ne 2 ]; then
  echo "Usage: $0 <sample_id> <pipeline: wes|wgs>"
  exit 1
fi

SAMPLE_ID=$1
PIPELINE=$2

if [[ "$PIPELINE" != "wes" && "$PIPELINE" != "wgs" ]]; then
  echo "Error: pipeline must be 'wes' or 'wgs'"
  exit 1
fi

# choose SLURM settings based on pipeline
if [ "$PIPELINE" = "wes" ]; then
  QUEUE="normal"
  TIME="10:00:00"
elif [ "$PIPELINE" = "wgs" ]; then
  QUEUE="vlong"
  TIME="2-00:00:00"
fi

# Uppercase version of pipeline
PIPELINE_UC=${PIPELINE^^}

# where your data and logs live
WORKDIR="/scratch_isilon/projects/0012-hereditary/dbgap/fastq/phs001585/${PIPELINE_UC}/${SAMPLE_ID}"

# name the generated job script
JOB_SCRIPT="job_${SAMPLE_ID}_${PIPELINE}.slurm"

# Number of threads
THREADS=4

# RAM (x1.5 to help prevent oom-kills)
MEM="24G"

cat > "${JOB_SCRIPT}" <<EOF
#!/bin/bash
#SBATCH --job-name=cbicall
#SBATCH -q ${QUEUE}
#SBATCH -D ${WORKDIR}
#SBATCH -e ${WORKDIR}/slurm-%N.%j.err
#SBATCH -o ${WORKDIR}/slurm-%N.%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=${THREADS}
#SBATCH --mem=${MEM}
#SBATCH -t ${TIME}
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=manuel.rueda@cnag.eu

# Load Python + modules
module load Python/3.10.8-GCCcore-12.2.0
export PYTHONPATH="/software/biomed/cbi_py3/lib/python3.10/site-packages:${PYTHONPATH}"

# Set CBICALL exe
CBICALL_DIR="/software/biomed/cbicall"
CBICALL="\$CBICALL_DIR/bin/cbicall"

cd \$SLURM_SUBMIT_DIR

# write a pipeline‐specific yaml
YAML_FILE="${SAMPLE_ID}_${PIPELINE}_param.yaml"
cat <<YAML > "\${YAML_FILE}"
mode: single
pipeline: ${PIPELINE}
workflow_engine: bash
gatk_version: gatk-4.6
sample: ${WORKDIR}
projectdir: ${SAMPLE_ID}_cbicall
cleanup_bam: false
YAML

srun "\$CBICALL" \\
     -p "\$YAML_FILE" \\
     -t $THREADS \\
     --no-color \\
     --no-emoji
EOF

# submit it
sbatch "${JOB_SCRIPT}"
last change 2025-10-15 by Manuel Rueda
How do I cite CBIcall?

You can cite the CBIcall paper. Thx!

Citation

CBIcall: a configuration-driven framework for variant calling in large DNA-seq cohorts. Manuscript In preparation.

last change 2025-10-15 by Manuel Rueda