End-to-end examples (GATK 4.6)¶
Prerequisites
Installation, reference bundles, and all dependencies must be completed beforehand.
This example demonstrates how to run CBIcall on a real WES sample from FASTQ files through final VCF and QC outputs.
1. Prepare your FASTQ files¶
CBIcall expects paired-end FASTQ files with a shared prefix, for example:
# Project / Sample (Proband WES)
CNAG999_exome/CNAG99901P_ex/
CNAG99901P_ex_S1_L001_R1_001.fastq.gz
CNAG99901P_ex_S1_L001_R2_001.fastq.gz
Note on nomenclature
Please see this page.
2. Create a parameters file¶
Create a YAML file, e.g. wes_single.yaml:
mode: single
pipeline: wes
workflow_engine: bash
gatk_version: gatk-4.6
input_dir: CNAG999_exome/CNAG99901P_ex
genome: b37
cleanup_bam: false
Notes:
modeselects single-sample or cohort (joint genotyping).pipelineswitches between WES, WGS or mtDNA.workflow_enginechooses the backend (bash or snakemake).
How can I perform WGS?
Simply change the parameter pipeline to wgs. Like this:
3. Run CBIcall¶
-pselects the YAML parameters file-tsets the number of threads
4. Inspect outputs¶
After completion, you will find:
CNAG999_exome/CNAG99901P_ex/cbicall_bash_wes_single_b37_gatk-4.6_*/
01_bam/
02_varcall/
03_stats/
logs/
Where:
- VCF files are stored in
02_varcall/ - QC metrics (coverage, sample stats, sex prediction) are in
03_stats - Logs for all pipeline steps are under
logs/
These files are ready for downstream analysis, annotation or integration with cohort-level studies.
For advanced parameters, multi-sample analyses, mtDNA workflows and troubleshooting, see the Usage and FAQ sections.
Important
In order to run a cohort based calculation you first have to create GVCF for each sample. This is being done by running wes mode single.
1. Create a sample map file like the one we display below:¶
CNAG99901P_ex /media/mrueda/2TBS/CNAG/Project_CBI_Call/cbicall/examples/input/CNAG999_exome/CNAG99901P_ex/ref_cbicall_bash_wes_single_b37_gatk-4.6_765963065360466/02_varcall/CNAG99901P.hc.QC.vcf.gz
CNAG99902P_ex /media/mrueda/2TBS/CNAG/Project_CBI_Call/cbicall/examples/input/CNAG999_exome/CNAG99901P_ex/ref_cbicall_bash_wes_single_b37_gatk-4.6_765963065360466/02_varcall/CNAG99901P.hc.QC.vcf.gz
GATK needs absolute paths for the files.
Scaling joint genotyping for large cohorts
For very large cohorts (hundreds to thousands of samples), the joint genotyping step can become computationally demanding when executed as a single job.
A common strategy is to split the analysis by chromosome and run one job per chromosome through the HPC scheduler. This reduces memory usage per job and allows parallel execution across compute nodes.
After all chromosomes have finished, the resulting VCF files can be merged (if needed) into a single cohort callset.
Example:
In this example, chr*.vcf.gz simply represents a set of per-chromosome
VCF files (e.g., chr1.vcf.gz, chr2.vcf.gz, โฆ). The naming pattern is
arbitrary and should be adapted to the filenames generated by your
workflow.
2. Create a parameters file¶
Create a YAML file, e.g. wes_cohort.yaml:
mode: cohort
pipeline: wes
workflow_engine: bash
gatk_version: gatk-4.6
genome: b37
sample_map: ./sample_map.tsv
3. Run CBIcall¶
-pselects the YAML parameters file-tsets the number of threads
4. Inspect outputs¶
After completion, you will find:
Where:
- Final VCF files are stored in
02_varcall/ - Logs for all pipeline steps are under
logs/
Do you have examples in how to run CBIcall programatically?
Yes, you can find examples at https://github.com/CNAG-Biomedical-Informatics/cbicall/tree/main/examples/scripts.
Any suggestions for performing annotation?
We recommend using beacon2-cbi-tools. This tool allows you not only to annotate data, but also to convert it into a data exchange format compatible with the Beacon v2 API.