Adding a Pipeline
CBIcall is built so most new workflows can be added with two things:
- Workflow script(s)
- A registry entry in workflows/registry/workflows.yaml
If the new pipeline fits the existing concepts of pipeline, mode, workflow_engine, gatk_version, input_dir, and sample_map, start with workflow scripts and the registry only.
What You Are Adding
Before editing files, define the shape of the workflow.
| Decision | Options | Why it matters |
|---|---|---|
| Engine | bash, snakemake | Determines where the entrypoint lives and how CBIcall launches it. |
| Version | gatk-3.5, gatk-4.6 | Determines the workflow subdirectory and shared helper files. |
| Pipeline name | e.g. mypipe | Becomes the value users set as pipeline: mypipe. |
| Mode | single, cohort, or both | Determines which entrypoint filenames are needed. |
| Inputs | input_dir, sample_map, or both | Determines how users configure the run. |
For example, a Bash single-sample pipeline named mypipe for GATK 4.6 would use:
workflows/bash/gatk-4.6/mypipe_single.sh
and users would select it with:
pipeline: mypipe
mode: single
workflow_engine: bash
gatk_version: gatk-4.6
How Execution Works
At runtime, CBIcall:
- Reads the YAML parameters file
- Validates values and compatibility rules
- Loads the workflow registry
- Resolves the requested workflow script
- Creates a run directory
- Launches the workflow from inside that run directory
The main implementation layers are:
| File | Responsibility |
|---|---|
| src/cbicall/config.py | Parameter defaults, semantic validation, runtime metadata. |
| src/cbicall/workflow_registry.py | Registry loading, workflow resolution, referenced-file validation. |
| src/cbicall/dnaseq.py | Engine-specific execution through BashRunner and SnakemakeRunner. |
| workflows/registry/workflows.yaml | Declares available workflow scripts. |
| workflows/schema/workflows.schema.json | Validates the registry structure. |
The workflow registry, workflows/registry/workflows.yaml, is CBIcall's developer-facing workflow map. It tells CBIcall which Bash or Snakemake file to launch when a parameters YAML selects values such as workflow_engine, gatk_version, pipeline, mode, and pipeline_version.
Normal users do not edit this file. Pipeline maintainers edit it when adding or changing workflow implementations. After editing it, run:
bin/cbicall validate-registry
This checks that the workflow registry is structurally valid against
workflows/schema/workflows.schema.json; it does not run a workflow.
1. Add the Workflow Entrypoint
Workflow filenames follow:
{pipeline}_{mode}.{sh|smk}
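The naming rule can be sketched as a small helper. This is only an illustration of the convention, not CBIcall's own code; the function name `entrypoint_path`, the `ENGINE_EXTENSIONS` mapping, and the default `root` argument are assumptions made for the example.

```python
from pathlib import Path

# Assumed mapping from workflow engine to entrypoint file extension.
ENGINE_EXTENSIONS = {"bash": "sh", "snakemake": "smk"}


def entrypoint_path(engine, gatk_version, pipeline, mode, root="workflows"):
    """Build the expected entrypoint path for one engine/version/pipeline/mode."""
    try:
        ext = ENGINE_EXTENSIONS[engine]
    except KeyError:
        raise ValueError(f"unknown engine: {engine!r}") from None
    # Filenames follow {pipeline}_{mode}.{sh|smk} inside the engine/version directory.
    return Path(root) / engine / gatk_version / f"{pipeline}_{mode}.{ext}"


print(entrypoint_path("bash", "gatk-4.6", "mypipe", "single").as_posix())
print(entrypoint_path("snakemake", "gatk-4.6", "mypipe", "cohort").as_posix())
```

A helper like this is handy in tests that check every registered script follows the convention.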
Bash Layout
workflows/bash/gatk-4.6/
env.sh
mypipe_single.sh
mypipe_cohort.sh
Bash workflows are executed directly. CBIcall sets GENOME in the environment and launches the script from inside the generated run directory.
CBIcall does not copy Bash workflow scripts into the run directory. It launches the registered script from workflows/bash/... while setting the process working directory to the generated run directory.
This keeps workflow code centralized and keeps helper paths such as env.sh stable through BASH_SOURCE[0]. The tradeoff is that workflow .sh files should not be edited while jobs are running.
Minimal Bash example:
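#!/usr/bin/env bash
set -euo pipefail

# Directory holding this script, so helpers resolve no matter where it is launched from.
BINDIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# CBIcall launches the script from inside the generated run directory.
RUNDIR="$(pwd)"
source "$BINDIR/env.sh"

mkdir -p logs results
echo "Running mypipe in $RUNDIR" | tee logs/mypipe.log

# FASTQ pairs are looked up one level above the run directory.
shopt -s nullglob   # skip the loop when no R1 files match
for R1 in ../*_R1_*fastq.gz; do
  R2="${R1/_R1_/_R2_}"
  echo "Pair: $R1 $R2" >> logs/mypipe.log
done

echo "done" > results/mypipe.done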
Make it executable:
chmod +x workflows/bash/gatk-4.6/mypipe_single.sh
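The launch behavior described in this section (script left in place, working directory set to the run directory, GENOME passed through the environment) might look roughly like the sketch below. It is not the actual BashRunner code: the function name `launch_bash_workflow` is hypothetical, and the script is invoked through `bash` here only to keep the sketch portable, whereas CBIcall executes the script directly.

```python
import os
import subprocess
from pathlib import Path


def launch_bash_workflow(script, run_dir, genome):
    """Run a registered script in place, using run_dir as the working directory."""
    script = Path(script).resolve()      # the script is not copied anywhere
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    env = dict(os.environ, GENOME=genome)  # GENOME reaches the script via the environment
    return subprocess.run(
        ["bash", str(script)],  # assumption: CBIcall runs the executable script directly
        cwd=run_dir,            # relative outputs such as logs/ land in the run directory
        env=env,
        check=True,
        capture_output=True,
        text=True,
    )
```

Because the script path is absolute and unchanged, `BASH_SOURCE[0]` inside the workflow still points below workflows/bash/..., which is why a sibling `env.sh` can be sourced reliably.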
Snakemake Layout
workflows/snakemake/gatk-4.6/
config.yaml
mypipe_single.smk
mypipe_cohort.smk
CBIcall launches Snakemake with:
- the resolved Snakefile through -s
- the shared config file through --configfile
- genome through --config
- sample_map and workspace for cohort workflows when needed
- workflow_rule as the Snakemake target when partial execution is requested
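As a rough illustration of how those pieces fit into one Snakemake command line, the sketch below assembles an argument list. The function name `snakemake_command` is hypothetical, passing the thread count through `--cores` is an assumption, and the exact argument order CBIcall uses is not shown in this guide.

```python
def snakemake_command(snakefile, configfile, genome, cores,
                      sample_map=None, workspace=None, workflow_rule=None):
    """Assemble a Snakemake argv list from the resolved workflow settings."""
    cmd = ["snakemake"]
    if workflow_rule:
        # Target goes first: --config accepts several values and would
        # otherwise swallow a trailing positional target.
        cmd.append(workflow_rule)
    cmd += ["-s", str(snakefile), "--configfile", str(configfile)]
    cmd += ["--cores", str(cores)]  # assumption: threads map to --cores
    config = [f"genome={genome}"]
    if sample_map:
        config.append(f"sample_map={sample_map}")   # cohort workflows only
    if workspace:
        config.append(f"workspace={workspace}")     # cohort workflows only
    cmd += ["--config", *config]
    return cmd


print(" ".join(snakemake_command(
    "workflows/snakemake/gatk-4.6/mypipe_single.smk",
    "workflows/snakemake/gatk-4.6/config.yaml",
    genome="b37",
    cores=4,
)))
```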
2. Register the Pipeline
Edit:
workflows/registry/workflows.yaml
Add the pipeline under the correct engine and version.
workflows:
  bash:
    base_dir: "workflows/bash"
    versions:
      gatk-4.6:
        helpers:
          env: "env.sh"
        pipelines:
          mypipe:
            single:
              default: "v1"
              versions:
                v1:
                  script: "mypipe_single.sh"
If cohort mode is supported:
pipelines:
  mypipe:
    single:
      default: "v1"
      versions:
        v1:
          script: "mypipe_single.sh"
    cohort:
      default: "v1"
      versions:
        v1:
          script: "mypipe_cohort.sh"
Registry paths are relative to the engine/version directory. For Bash GATK 4.6, mypipe_single.sh resolves below workflows/bash/gatk-4.6/.
The registry version above is the tool family version, for example gatk-4.6. The nested v1 is the CBIcall pipeline implementation version. Normal run YAML files do not need to set it because the registry default is used. If a pipeline script changes in a way that may affect outputs, add a new implementation such as v2, point it to the new script, and choose whether to move default to that version.
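To make the lookup concrete, here is a minimal sketch of resolving a script from a registry shaped like the example above, including the fallback to `default` when no implementation version is requested. The registry is written as a Python dict to keep the sketch dependency-free; the function name `resolve_script` is hypothetical and the real logic in src/cbicall/workflow_registry.py may differ.

```python
from pathlib import Path

# The registry entry from this section, expressed as a plain dict.
REGISTRY = {
    "workflows": {
        "bash": {
            "base_dir": "workflows/bash",
            "versions": {
                "gatk-4.6": {
                    "helpers": {"env": "env.sh"},
                    "pipelines": {
                        "mypipe": {
                            "single": {
                                "default": "v1",
                                "versions": {"v1": {"script": "mypipe_single.sh"}},
                            }
                        }
                    },
                }
            },
        }
    }
}


def resolve_script(registry, engine, gatk_version, pipeline, mode,
                   pipeline_version=None):
    """Return the script path for a selection, using the default version if unset."""
    engine_entry = registry["workflows"][engine]
    mode_entry = engine_entry["versions"][gatk_version]["pipelines"][pipeline][mode]
    chosen = pipeline_version or mode_entry["default"]
    script = mode_entry["versions"][chosen]["script"]
    # Script paths are relative to the engine/version directory.
    return Path(engine_entry["base_dir"]) / gatk_version / script


print(resolve_script(REGISTRY, "bash", "gatk-4.6", "mypipe", "single").as_posix())
```

Requesting a mode or version that is not registered raises `KeyError` in this sketch; a real implementation would likely turn that into a clearer error message.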
3. Make the Pipeline Selectable
The registry makes the script discoverable, but Python validation still controls which pipeline, mode, and gatk_version combinations are allowed.
If mypipe is a new pipeline name, update the validation sets in src/cbicall/config.py:
PIPELINE_VALUES = {"wes", "wgs", "mit", "mypipe"}

_ALLOWED_COMBOS = {
    "gatk-4.6": {
        "wes": ["single", "cohort"],
        "wgs": ["single", "cohort"],
        "mypipe": ["single"],
    },
}
If you are only adding a missing mode or script for an existing pipeline and version, the registry entry may be enough.
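A compatibility check built on those two structures could look like the following sketch. The function name `check_combo` and the error messages are hypothetical; only `PIPELINE_VALUES` and `_ALLOWED_COMBOS` come from the excerpt above, and the real validation in config.py may be organized differently.

```python
PIPELINE_VALUES = {"wes", "wgs", "mit", "mypipe"}

_ALLOWED_COMBOS = {
    "gatk-4.6": {
        "wes": ["single", "cohort"],
        "wgs": ["single", "cohort"],
        "mypipe": ["single"],
    },
}


def check_combo(pipeline, mode, gatk_version):
    """Raise ValueError unless the pipeline/mode/version combination is allowed."""
    if pipeline not in PIPELINE_VALUES:
        raise ValueError(f"unknown pipeline: {pipeline!r}")
    allowed_modes = _ALLOWED_COMBOS.get(gatk_version, {}).get(pipeline, [])
    if mode not in allowed_modes:
        raise ValueError(
            f"{pipeline!r} does not support mode {mode!r} with {gatk_version!r}"
        )


check_combo("mypipe", "single", "gatk-4.6")  # passes silently
```

With the table above, `check_combo("mypipe", "cohort", "gatk-4.6")` raises, because only single mode was registered for mypipe.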
4. Add a User-Facing YAML Example
Create a minimal parameters file that exercises the new workflow:
mode: single
pipeline: mypipe
workflow_engine: bash
gatk_version: gatk-4.6
input_dir: SAMPLE01
genome: b37
Run it:
bin/cbicall run -p mypipe.yaml -t 4
CBIcall should create a run directory similar to:
SAMPLE01/cbicall_bash_mypipe_single_b37_gatk-4.6_<run-id>/
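The directory name appears to combine the run's settings in a fixed order. The sketch below reproduces that pattern as inferred from the example path; the function name `run_dir_path` is hypothetical, and the format of the run id is not specified here, so it is taken as a plain argument.

```python
from pathlib import Path


def run_dir_path(input_dir, engine, pipeline, mode, genome, gatk_version, run_id):
    """Build a run directory path following the pattern shown above."""
    name = f"cbicall_{engine}_{pipeline}_{mode}_{genome}_{gatk_version}_{run_id}"
    # The run directory is created below the input directory.
    return Path(input_dir) / name


# "RUNID" is a placeholder; CBIcall generates the real run id.
print(run_dir_path("SAMPLE01", "bash", "mypipe", "single", "b37", "gatk-4.6", "RUNID").as_posix())
```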
5. Validate the Addition
Check the pipeline at four levels.
| Level | What to verify |
|---|---|
| Registry | The workflow appears under the correct engine/version/pipeline/mode. |
| Resource catalog | Compatible bundle entries point to real registry workflow keys. |
| Files | Referenced scripts exist; Bash scripts are executable. |
| Runtime | CBIcall creates a run directory, writes log.json, and produces expected outputs. |
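The file-level check in the table can also be done by hand with a few lines of Python. This is a standalone sketch, not part of CBIcall; the function name `check_script` is hypothetical.

```python
import os
from pathlib import Path


def check_script(path):
    """Return a list of problems for one referenced script (empty if none)."""
    path = Path(path)
    problems = []
    if not path.is_file():
        problems.append(f"missing: {path}")
    elif path.suffix == ".sh" and not os.access(path, os.X_OK):
        # Only Bash entrypoints need the executable bit.
        problems.append(f"not executable: {path}")
    return problems


for script in ["workflows/bash/gatk-4.6/mypipe_single.sh"]:
    for problem in check_script(script):
        print(problem)
```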
Good first checks:
bin/cbicall validate-registry
bin/cbicall validate-resources
bin/cbicall validate-param -p mypipe.yaml
bin/cbicall run -p mypipe.yaml -t 2
Then inspect:
<run-dir>/log.json
<run-dir>/logs/
When Python Changes Are Needed
Most workflow additions need Python changes only in config.py, and only when the pipeline name or its allowed combinations are new. Broader Python changes are needed when the execution model changes.
| Change | Likely file |
|---|---|
| New YAML key or default | src/cbicall/config.py |
| New value in the typed runtime model | src/cbicall/models.py |
| New registry resolution behavior | src/cbicall/workflow_registry.py |
| New execution engine | src/cbicall/dnaseq.py |
| Different command-line launch behavior | src/cbicall/dnaseq.py |
Adding an engine is different from adding a pipeline. Prefer adding a new runner class rather than expanding conditional logic inside an existing runner.
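The "one runner class per engine" advice can be illustrated with the pattern below. The class names BashRunner and SnakemakeRunner echo the ones mentioned for src/cbicall/dnaseq.py, but the `Runner` base class, the `build_command` method, the `RUNNERS` table, and `NewEngineRunner` with its `new-engine` command are all hypothetical stand-ins; the real classes may have a different interface.

```python
from abc import ABC, abstractmethod


class Runner(ABC):
    """Hypothetical common interface: each engine knows how to build its command."""

    @abstractmethod
    def build_command(self, script):
        ...


class BashRunner(Runner):
    def build_command(self, script):
        return [script]  # Bash entrypoints are executed directly


class SnakemakeRunner(Runner):
    def build_command(self, script):
        return ["snakemake", "-s", script]


class NewEngineRunner(Runner):
    """Placeholder for a future engine; 'new-engine' is not a real tool."""

    def build_command(self, script):
        return ["new-engine", script]


# Adding an engine means adding one class and one table entry,
# instead of another if/elif branch inside an existing runner.
RUNNERS = {
    "bash": BashRunner,
    "snakemake": SnakemakeRunner,
    "new-engine": NewEngineRunner,
}


def get_runner(engine):
    try:
        return RUNNERS[engine]()
    except KeyError:
        raise ValueError(f"unsupported workflow_engine: {engine!r}") from None


print(get_runner("snakemake").build_command("mypipe_single.smk"))
```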
Contributor Checklist
- Pick engine, version, pipeline name, and mode.
- Inspect the closest existing workflow in workflows/{bash|snakemake}/{gatk-version}/.
- Add workflow entrypoint scripts.
- Make Bash scripts executable.
- Register scripts in workflows/registry/workflows.yaml.
- Update Python validation if the pipeline name or compatibility matrix changes.
- Add a minimal YAML example.
- Run CBIcall and inspect log.json, logs/, and expected outputs.
- Add or update tests when validation or execution behavior changes.
Next Steps
- Review the execution model in Architecture.
- Check YAML behavior in Configuration Reference.
- Document generated files in Outputs if the new pipeline produces user-facing outputs.