Performance

Runtime Behavior

The -t/--threads value is passed to the selected workflow backend. Most memory and CPU usage comes from external tools:

  • BWA-MEM: memory usage increases with thread count and reference size. BWA does not provide an internal memory cap, so limiting RAM requires an external mechanism such as ulimit.

  • GATK and Picard: these tools default to 8 GB of memory. This value can be adjusted through the CBIcall GATK 4.6 environment file or the Snakemake configuration file.
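
As an illustration of an external memory cap, the following is a minimal Python sketch of what ulimit -v does in a shell: it lowers the address-space limit of a child process before the command starts. It is not part of CBIcall; the function name run_with_memory_cap is hypothetical, the approach is Unix-only, and a tool that hits the cap will see failed allocations rather than being throttled.

```python
import resource
import subprocess
import sys


def run_with_memory_cap(cmd, max_bytes, **run_kwargs):
    """Run cmd in a child process whose address space is capped at max_bytes.

    Hypothetical helper, not a CBIcall function. Extra keyword arguments
    are forwarded to subprocess.run.
    """

    def limit_address_space():
        # Runs in the child just before exec. Only the soft limit is lowered,
        # and it is never set above the hard limit already in force.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard == resource.RLIM_INFINITY:
            soft = max_bytes
        else:
            soft = min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(cmd, preexec_fn=limit_address_space, **run_kwargs)


if __name__ == "__main__":
    # Stand-in for a memory-hungry tool: a child Python process that tries
    # to allocate 2 GiB while capped at 1 GiB of address space.
    hungry = [sys.executable, "-c", "data = bytearray(2 * 1024**3)"]
    result = run_with_memory_cap(hungry, 1024**3)
    print("exit code:", result.returncode)
```

In practice the stand-in command would be replaced by the aligner invocation, and the cap would be chosen from the reference size and thread count in use.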

Cohort mode with GATK 4.6

Joint genotyping defaults to 64 GB of RAM for GenomicsDBImport and GenotypeGVCFs. The value is controlled by MEM_GENOTYPE in the GATK 4.6 environment file and by mem_genotype in the Snakemake workflow.

Python driver overhead

CBIcall adds negligible orchestration overhead. The Python wrapper typically stays below 2% of memory on a 16 GB system (roughly 0.3 GB), does not process reads or variants, and does not create Python worker threads. It is expected to need one CPU core, and only during short setup phases. For long-running variant-calling jobs, size scheduler CPU and memory requests for the selected external tools and workflow threads, not for the CBIcall Python process itself.
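
The sizing rule above can be sketched as a small helper that derives Slurm resource flags from the thread count and the tool memory setting. This is a sketch under stated assumptions: sbatch_resource_flags is a hypothetical name, and headroom_gb is an illustrative allowance on top of the tool memory, not a value taken from CBIcall. The flags --cpus-per-task and --mem are standard sbatch options.

```python
def sbatch_resource_flags(threads, tool_mem_gb, headroom_gb=2):
    """Return sbatch flags sized for the external tools, not the Python driver.

    threads      -- value passed to -t/--threads
    tool_mem_gb  -- memory setting of the heaviest tool (e.g. 8 or 64)
    headroom_gb  -- assumed allowance above the tool memory setting
    """
    if threads < 1 or tool_mem_gb < 1 or headroom_gb < 0:
        raise ValueError("threads and tool_mem_gb must be positive")
    return [
        f"--cpus-per-task={threads}",
        f"--mem={tool_mem_gb + headroom_gb}G",
    ]


if __name__ == "__main__":
    # Single-sample job with the default 8 GB GATK/Picard setting.
    print(sbatch_resource_flags(threads=4, tool_mem_gb=8))
    # Cohort job with the default 64 GB joint-genotyping setting.
    print(sbatch_resource_flags(threads=4, tool_mem_gb=64))
```

The request follows the tool settings only; the wrapper's own footprint is small enough to fit inside the headroom.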

Parallelization

Total CPU time

Some workflow steps can be split further, for example across FASTQ chunks or genomic intervals. This may shorten one job, but it does not necessarily reduce total CPU time. On a Slurm cluster with a fixed CPU allocation, 1 job with 24 threads and 6 jobs with 4 threads each use the same simultaneous CPU budget. CBIcall deliberately favors moderate per-job thread counts because the intended production use case is running thousands of jobs on Slurm. This matters most for WGS, where one job can run for days.

Parallel execution is supported, but performance does not scale linearly with additional threads. In practice, optimal throughput is usually achieved with 4-6 threads per task.

For example, on a 12-core workstation, running 3 tasks with 4 threads each is typically preferable to running 1 task with all 12 threads.
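
The packing arithmetic behind this example can be sketched as follows. The helper concurrent_tasks is hypothetical and not part of CBIcall; it takes the smaller of the CPU-bound and memory-bound task counts. The memory figures in the usage lines are illustrative (31 GB total, 8 GB per task, matching the default GATK/Picard setting).

```python
def concurrent_tasks(cores, threads_per_task, total_mem_gb, mem_per_task_gb):
    """Number of tasks that fit at once, limited by both CPU and memory."""
    if threads_per_task < 1 or mem_per_task_gb <= 0:
        raise ValueError("threads_per_task and mem_per_task_gb must be positive")
    by_cpu = cores // threads_per_task
    by_mem = int(total_mem_gb // mem_per_task_gb)
    return max(0, min(by_cpu, by_mem))


if __name__ == "__main__":
    # 12-core workstation: 4 threads per task versus all 12 threads in one task.
    print(concurrent_tasks(12, 4, total_mem_gb=31, mem_per_task_gb=8))
    print(concurrent_tasks(12, 12, total_mem_gb=31, mem_per_task_gb=8))
    # Fixed 24-CPU allocation split into 4-thread jobs (memory not limiting).
    print(concurrent_tasks(24, 4, total_mem_gb=256, mem_per_task_gb=8))
```

With these inputs memory and CPU both allow 3 concurrent 4-thread tasks on the workstation; with a larger per-task memory setting, memory rather than cores would become the limit.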

The benchmark below shows the shape of this scaling for WES single-sample calling on the 1000 Genomes sample HG00103 using run accession SRR1596639. The paired FASTQ inputs were 1.9 GB (R1) and 2.0 GB (R2). Runs used an HP Z2 G8 Tower Workstation (x86_64) with an Intel Xeon W-1350P @ 4.00 GHz, 6 physical cores / 12 hardware threads, and 31 GiB RAM. The GATK/Picard memory setting was fixed at MEM=8G in env.sh for all thread counts.

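The biggest gain comes from moving from 2 to 4 threads (130.9 to 105.7 minutes, a reduction of about 19%). Beyond 6 threads the improvement is small: 96.9 minutes at 6 threads versus 90.9 minutes at 12.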

Run time versus number of threads for WES single-mode

Threads    Elapsed seconds    Runtime (minutes)
2          7856.302           130.9
4          6342.747           105.7
6          5812.155           96.9
8          5695.981           94.9
10         5525.616           92.1
12         5452.732           90.9

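
The scaling in the benchmark table can be quantified with a short Python sketch that computes speedup and parallel efficiency relative to the 2-thread run, plus a rough throughput comparison. The function names are hypothetical. The throughput figure for several concurrent tasks assumes each task runs as fast as it did alone, which was not measured here and is an optimistic upper bound on a machine with 6 physical cores.

```python
# Elapsed seconds from the benchmark table, keyed by thread count.
ELAPSED = {
    2: 7856.302,
    4: 6342.747,
    6: 5812.155,
    8: 5695.981,
    10: 5525.616,
    12: 5452.732,
}


def scaling_report(elapsed, baseline=2):
    """Return (threads, speedup, efficiency) tuples relative to the baseline run."""
    base = elapsed[baseline]
    rows = []
    for threads in sorted(elapsed):
        speedup = base / elapsed[threads]
        # Efficiency: achieved speedup divided by the ideal linear speedup.
        efficiency = speedup / (threads / baseline)
        rows.append((threads, speedup, efficiency))
    return rows


def samples_per_hour(elapsed_seconds, concurrent_tasks=1):
    """Throughput if concurrent_tasks each finish in elapsed_seconds (no contention)."""
    return concurrent_tasks * 3600 / elapsed_seconds


if __name__ == "__main__":
    for threads, speedup, efficiency in scaling_report(ELAPSED):
        print(f"{threads:>2} threads  speedup {speedup:.2f}  efficiency {efficiency:.2f}")
    print(f"1 task x 12 threads:  {samples_per_hour(ELAPSED[12]):.2f} samples/hour")
    print(f"3 tasks x 4 threads: {samples_per_hour(ELAPSED[4], 3):.2f} samples/hour (upper bound)")
```

Going from 2 to 12 threads gives a speedup of roughly 1.44 for six times the threads, which is why adding tasks tends to pay off more than adding threads to a single task.
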
Practical default

For batch processing, start with 4 threads per task and scale by running more tasks in parallel when the machine or scheduler has available cores.