
Adding a Pipeline

CBIcall is built so most new workflows can be added with two things:

  1. Workflow script(s)
  2. A registry entry in workflows/registry/workflows.yaml

Most additions do not need Python changes

If the new pipeline fits the existing concepts of pipeline, mode, workflow_engine, gatk_version, input_dir, and sample_map, start with workflow scripts and the registry only.

What You Are Adding

Before editing files, define the shape of the workflow.

| Decision | Options | Why it matters |
| --- | --- | --- |
| Engine | bash, snakemake | Determines where the entrypoint lives and how CBIcall launches it. |
| Version | gatk-3.5, gatk-4.6 | Determines the workflow subdirectory and shared helper files. |
| Pipeline name | e.g. mypipe | Becomes the value users set as pipeline: mypipe. |
| Mode | single, cohort, or both | Determines which entrypoint filenames are needed. |
| Inputs | input_dir, sample_map, or both | Determines how users configure the run. |

For example, a Bash single-sample pipeline named mypipe for GATK 4.6 would use:

workflows/bash/gatk-4.6/mypipe_single.sh

and users would select it with:

pipeline: mypipe
mode: single
workflow_engine: bash
gatk_version: gatk-4.6

How Execution Works

At runtime, CBIcall:

  1. Reads the YAML parameters file
  2. Validates values and compatibility rules
  3. Loads the workflow registry
  4. Resolves the requested workflow script
  5. Creates a run directory
  6. Launches the workflow from inside that run directory

The main implementation layers are:

| File | Responsibility |
| --- | --- |
| src/cbicall/config.py | Parameter defaults, semantic validation, runtime metadata. |
| src/cbicall/workflow_registry.py | Registry loading, workflow resolution, referenced-file validation. |
| src/cbicall/dnaseq.py | Engine-specific execution through BashRunner and SnakemakeRunner. |
| workflows/registry/workflows.yaml | Declares available workflow scripts. |
| workflows/schema/workflows.schema.json | Validates the registry structure. |

What is the registry?

The workflow registry is CBIcall's developer-facing workflow map: workflows/registry/workflows.yaml. It tells CBIcall which Bash or Snakemake file to launch when a parameters YAML selects values such as workflow_engine, gatk_version, pipeline, mode, and pipeline_version.

Normal users do not edit this file. Pipeline maintainers edit it when adding or changing workflow implementations. After editing it, run:

bin/cbicall validate-registry

This checks that the workflow registry is structurally valid against workflows/schema/workflows.schema.json; it does not run a workflow.

1. Add the Workflow Entrypoint

Workflow filenames follow:

{pipeline}_{mode}.{sh|smk}
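The naming convention can be expressed as a small helper. This is an illustrative sketch only: the function name expected_entrypoint and the ENGINE_EXTENSIONS mapping are hypothetical and are not part of CBIcall.

```python
from pathlib import Path

# Assumption: one file extension per engine, mirroring {sh|smk} above.
ENGINE_EXTENSIONS = {"bash": "sh", "snakemake": "smk"}


def expected_entrypoint(engine: str, gatk_version: str, pipeline: str, mode: str) -> Path:
    """Return the conventional entrypoint path, relative to the repository root."""
    ext = ENGINE_EXTENSIONS[engine]  # raises KeyError for an unknown engine
    return Path("workflows") / engine / gatk_version / f"{pipeline}_{mode}.{ext}"


# Example: the Bash single-sample entrypoint for mypipe on GATK 4.6.
print(expected_entrypoint("bash", "gatk-4.6", "mypipe", "single").as_posix())
```

A helper like this is handy in tests that check every registered script follows the convention.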

Bash Layout

workflows/bash/gatk-4.6/
    env.sh
    mypipe_single.sh
    mypipe_cohort.sh

Bash workflows are executed directly. CBIcall sets GENOME in the environment and launches the script from inside the generated run directory.

Entrypoint location

CBIcall does not copy Bash workflow scripts into the run directory. It launches the registered script from workflows/bash/... while setting the process working directory to the generated run directory.

This keeps workflow code centralized and keeps helper paths such as env.sh stable through BASH_SOURCE[0]. The tradeoff is that workflow .sh files should not be edited while jobs are running.
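The launch model described here (script stays in place, working directory is the run directory, GENOME comes from the environment) can be sketched in Python as follows. This is not the actual BashRunner code; the function name launch_bash_workflow and its parameters are assumptions made for illustration.

```python
import os
import subprocess
from pathlib import Path


def launch_bash_workflow(script: Path, run_dir: Path, genome: str) -> subprocess.CompletedProcess:
    """Run `script` from its original location with `run_dir` as the working directory."""
    # Copy the current environment and add GENOME for the workflow to read.
    env = dict(os.environ, GENOME=genome)
    return subprocess.run(
        [str(script.resolve())],  # absolute path: the script is not copied into run_dir
        cwd=run_dir,              # relative paths inside the script resolve against run_dir
        env=env,
        check=True,               # raise CalledProcessError on a non-zero exit status
        capture_output=True,
        text=True,
    )
```

Because the script path stays absolute, BASH_SOURCE[0] inside the workflow still points at workflows/bash/..., so a neighbouring env.sh can be sourced reliably.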

Minimal Bash example:

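#!/usr/bin/env bash
# Exit on errors, unset variables, and failures inside pipelines.
set -euo pipefail
# Let the loop below run zero times when no FASTQ files match.
shopt -s nullglob

# Directory holding this script, so env.sh resolves regardless of the working directory.
BINDIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# CBIcall launches the script from inside the generated run directory.
RUNDIR="$(pwd)"

source "$BINDIR/env.sh"

mkdir -p logs results

echo "Running mypipe in $RUNDIR" | tee logs/mypipe.log

# Pair each R1 FASTQ in the parent directory with its R2 mate.
for R1 in ../*_R1_*fastq.gz; do
    R2="${R1/_R1_/_R2_}"
    echo "Pair: $R1 $R2" >> logs/mypipe.log
done

echo "done" > results/mypipe.done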

Make it executable:

chmod +x workflows/bash/gatk-4.6/mypipe_single.sh

Snakemake Layout

workflows/snakemake/gatk-4.6/
    config.yaml
    mypipe_single.smk
    mypipe_cohort.smk

CBIcall launches Snakemake with:

  • the resolved Snakefile through -s
  • the shared config file through --configfile
  • genome through --config
  • sample_map and workspace for cohort workflows when needed
  • workflow_rule as the Snakemake target when partial execution is requested
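The launch arguments listed above could be assembled roughly as follows. This is a sketch, not CBIcall's actual SnakemakeRunner code: the function name, the parameter names, the config key spellings (genome=, sample_map=, workspace=), and the use of --cores for the thread count are all assumptions. The target is placed before the options because Snakemake's --configfile and --config both accept multiple values and would otherwise absorb it.

```python
from pathlib import Path
from typing import List, Optional


def build_snakemake_command(
    snakefile: Path,
    configfile: Path,
    genome: str,
    threads: int,
    sample_map: Optional[Path] = None,
    workspace: Optional[Path] = None,
    workflow_rule: Optional[str] = None,
) -> List[str]:
    """Assemble a snakemake argv list from the resolved workflow settings."""
    cmd = ["snakemake"]
    if workflow_rule:
        cmd.append(workflow_rule)  # partial execution: build only this target
    cmd += ["-s", str(snakefile), "--configfile", str(configfile)]
    cmd += ["--cores", str(threads)]  # assumption: thread count maps to --cores

    # key=value overrides passed through --config
    config_items = [f"genome={genome}"]
    if sample_map is not None:
        config_items.append(f"sample_map={sample_map}")
    if workspace is not None:
        config_items.append(f"workspace={workspace}")
    return cmd + ["--config", *config_items]
```

Returning a list rather than a single string avoids shell quoting problems when paths contain spaces.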

2. Register the Pipeline

Edit:

workflows/registry/workflows.yaml

Add the pipeline under the correct engine and version.

workflows:
  bash:
    base_dir: "workflows/bash"
    versions:
      gatk-4.6:
        helpers:
          env: "env.sh"
        pipelines:
          mypipe:
            single:
              default: "v1"
              versions:
                v1:
                  script: "mypipe_single.sh"

If cohort mode is supported:

pipelines:
  mypipe:
    single:
      default: "v1"
      versions:
        v1:
          script: "mypipe_single.sh"
    cohort:
      default: "v1"
      versions:
        v1:
          script: "mypipe_cohort.sh"

Registry paths

Registry paths are relative to the engine/version directory. For Bash GATK 4.6, mypipe_single.sh resolves below workflows/bash/gatk-4.6/.

Pipeline implementation versions

The version key directly under the engine, for example gatk-4.6, is the tool family version. The nested v1 is the CBIcall pipeline implementation version. Normal run YAML files do not need to set it because the registry default is used. If a pipeline script changes in a way that may affect outputs, add a new implementation such as v2, point it to the new script, and decide whether to move default to that version.
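The lookup described in this section (fall back to the registry default, then resolve the script below the engine/version directory) can be sketched as follows. This is not the code in workflow_registry.py: the function name resolve_script is hypothetical, and the REGISTRY dictionary is a hand-written stand-in whose nesting is an assumption based on the registry example for mypipe.

```python
from pathlib import Path
from typing import Optional

# Stand-in for the parsed workflows.yaml (assumed nesting).
REGISTRY = {
    "workflows": {
        "bash": {
            "base_dir": "workflows/bash",
            "versions": {
                "gatk-4.6": {
                    "helpers": {"env": "env.sh"},
                    "pipelines": {
                        "mypipe": {
                            "single": {
                                "default": "v1",
                                "versions": {"v1": {"script": "mypipe_single.sh"}},
                            }
                        }
                    },
                }
            },
        }
    }
}


def resolve_script(registry: dict, engine: str, gatk_version: str, pipeline: str,
                   mode: str, pipeline_version: Optional[str] = None) -> Path:
    """Return the script path for one engine/version/pipeline/mode selection."""
    engine_entry = registry["workflows"][engine]
    mode_entry = engine_entry["versions"][gatk_version]["pipelines"][pipeline][mode]
    # Use the registry default unless the run explicitly pins an implementation.
    chosen = pipeline_version or mode_entry["default"]
    script = mode_entry["versions"][chosen]["script"]
    # Script names are relative to the engine/version directory.
    return Path(engine_entry["base_dir"]) / gatk_version / script


print(resolve_script(REGISTRY, "bash", "gatk-4.6", "mypipe", "single").as_posix())
```

Under this scheme, adding v2 without changing default leaves existing runs on v1 until a run pins pipeline_version or the default is moved.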

3. Make the Pipeline Selectable

The registry makes the script discoverable, but Python validation still controls which pipeline, mode, and gatk_version combinations are allowed.

If mypipe is a new pipeline name, update the validation sets in src/cbicall/config.py:

PIPELINE_VALUES = {"wes", "wgs", "mit", "mypipe"}

_ALLOWED_COMBOS = {
    "gatk-4.6": {
        "wes": ["single", "cohort"],
        "wgs": ["single", "cohort"],
        "mypipe": ["single"],
    },
}

If you are only adding a missing mode or script for an existing pipeline and version, the registry entry may be enough.
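To make the role of the validation sets concrete, a check against them might look like the sketch below. The two constants repeat the example values from this section; the function check_combo and its error messages are hypothetical and do not reflect the real validation code in config.py.

```python
PIPELINE_VALUES = {"wes", "wgs", "mit", "mypipe"}

_ALLOWED_COMBOS = {
    "gatk-4.6": {
        "wes": ["single", "cohort"],
        "wgs": ["single", "cohort"],
        "mypipe": ["single"],
    },
}


def check_combo(pipeline: str, mode: str, gatk_version: str) -> None:
    """Raise ValueError when the requested combination is not allowed."""
    if pipeline not in PIPELINE_VALUES:
        raise ValueError(f"unknown pipeline: {pipeline!r}")
    # A missing version or pipeline yields an empty list of allowed modes.
    allowed_modes = _ALLOWED_COMBOS.get(gatk_version, {}).get(pipeline, [])
    if mode not in allowed_modes:
        raise ValueError(
            f"pipeline {pipeline!r} does not support mode {mode!r} with {gatk_version!r}"
        )


check_combo("mypipe", "single", "gatk-4.6")  # passes silently
```

With these example values, requesting mode: cohort for mypipe is rejected until "cohort" is added to its list, even if a cohort script is already registered.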

4. Add a User-Facing YAML Example

Create a minimal parameters file that exercises the new workflow:

mode: single
pipeline: mypipe
workflow_engine: bash
gatk_version: gatk-4.6
input_dir: SAMPLE01
genome: b37

Run it:

bin/cbicall run -p mypipe.yaml -t 4

CBIcall should create a run directory similar to:

SAMPLE01/cbicall_bash_mypipe_single_b37_gatk-4.6_<run-id>/
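The directory name above appears to combine the run settings in a fixed order. The sketch below reproduces that pattern as inferred from this single example; the function name run_dir_path is hypothetical, and the run-id format is not specified here, so it is passed in as an opaque string.

```python
from pathlib import Path


def run_dir_path(input_dir: str, engine: str, pipeline: str, mode: str,
                 genome: str, gatk_version: str, run_id: str) -> Path:
    """Build a run directory path following the pattern shown above (inferred)."""
    name = f"cbicall_{engine}_{pipeline}_{mode}_{genome}_{gatk_version}_{run_id}"
    # The run directory is created inside the sample's input directory.
    return Path(input_dir) / name


print(run_dir_path("SAMPLE01", "bash", "mypipe", "single", "b37", "gatk-4.6", "<run-id>").as_posix())
```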

5. Validate the Addition

Check the pipeline at four levels.

| Level | What to verify |
| --- | --- |
| Registry | The workflow appears under the correct engine/version/pipeline/mode. |
| Resource catalog | Compatible bundle entries point to real registry workflow keys. |
| Files | Referenced scripts exist; Bash scripts are executable. |
| Runtime | CBIcall creates a run directory, writes log.json, and produces expected outputs. |
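The Files-level check can also be scripted when many entrypoints are added at once. The following is a minimal sketch; check_entrypoint is a hypothetical helper, not a CBIcall function.

```python
import os
from pathlib import Path
from typing import List


def check_entrypoint(script: Path) -> List[str]:
    """Return a list of problems found for one workflow script (empty if none)."""
    problems = []
    if not script.is_file():
        problems.append(f"missing: {script}")
    elif script.suffix == ".sh" and not os.access(script, os.X_OK):
        # Bash entrypoints are executed directly, so they need the execute bit.
        problems.append(f"not executable: {script}")
    return problems


for problem in check_entrypoint(Path("workflows/bash/gatk-4.6/mypipe_single.sh")):
    print(problem)
```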

Good first checks:

bin/cbicall validate-registry
bin/cbicall validate-resources
bin/cbicall validate-param -p mypipe.yaml
bin/cbicall run -p mypipe.yaml -t 2

Then inspect:

<run-dir>/log.json
<run-dir>/logs/

When Python Changes Are Needed

Most workflow additions need, at most, validation changes in config.py, and only when the pipeline name is new. Broader Python changes are needed when the execution model changes.

| Change | Likely file |
| --- | --- |
| New YAML key or default | src/cbicall/config.py |
| New value in the typed runtime model | src/cbicall/models.py |
| New registry resolution behavior | src/cbicall/workflow_registry.py |
| New execution engine | src/cbicall/dnaseq.py |
| Different command-line launch behavior | src/cbicall/dnaseq.py |

New engine

Adding an engine is different from adding a pipeline. Prefer adding a new runner class rather than expanding conditional logic inside an existing runner.
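The one-runner-per-engine advice could take a shape like the sketch below. It does not reproduce the real classes in dnaseq.py: the Runner protocol, both example classes, the RUNNERS table, and the choice of Nextflow as the new engine are hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Protocol


class Runner(Protocol):
    """Minimal interface each engine-specific runner implements (assumed)."""

    def build_command(self, entrypoint: Path) -> List[str]: ...


class DirectScriptRunner:
    """Executes the entrypoint directly, as described for Bash workflows."""

    def build_command(self, entrypoint: Path) -> List[str]:
        return [str(entrypoint)]


class NextflowRunner:
    """Hypothetical new engine, added as its own class."""

    def build_command(self, entrypoint: Path) -> List[str]:
        return ["nextflow", "run", str(entrypoint)]


# Dispatch table: a new engine is one new class plus one new entry here,
# with no extra if/elif branches inside the existing runners.
RUNNERS: Dict[str, Runner] = {
    "bash": DirectScriptRunner(),
    "nextflow": NextflowRunner(),
}


def command_for(engine: str, entrypoint: Path) -> List[str]:
    """Look up the runner for `engine` and build its launch command."""
    try:
        runner = RUNNERS[engine]
    except KeyError:
        raise ValueError(f"unsupported workflow_engine: {engine!r}") from None
    return runner.build_command(entrypoint)
```

Keeping engine-specific logic in separate classes means a change to one engine's launch behavior cannot accidentally alter another's.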

Contributor Checklist

  • Pick engine, version, pipeline name, and mode.
  • Inspect the closest existing workflow in workflows/{bash|snakemake}/{gatk-version}/.
  • Add workflow entrypoint scripts.
  • Make Bash scripts executable.
  • Register scripts in workflows/registry/workflows.yaml.
  • Update Python validation if the pipeline name or compatibility matrix changes.
  • Add a minimal YAML example.
  • Run CBIcall and inspect log.json, logs/, and expected outputs.
  • Add or update tests when validation or execution behavior changes.

Next Steps