Google Cloud with Docker

This page is a practical recipe for running the CBIcall one-sample WES reproducibility check on a Google Cloud Compute Engine VM with Docker.

Use this when you want evidence that the same CBIcall checkout, Docker image, parameters YAML, and selected resource resolve and run outside your local or institutional environment.

Scope

This is not a cloud-scale production benchmark. It runs the bundled one-sample WES integration test and checks framework portability.

What This Checks

The recipe runs:

bin/cbicall validate-resources
bin/cbicall validate-param -p examples/input/param.yaml
bin/cbicall test --wes -t 1

Expected evidence:

Evidence | Meaning
validate-resources succeeds | The resource catalog is well formed and points to declared workflows.
validate-param resolves the run | The parameters YAML resolves to a concrete workflow, profile, pipeline version, and resource.
test --wes succeeds | The generated WES VCF matches the shipped reference output under the integration-test comparison.
Normalized SHA-256 is printed | The filtered, sorted variant records have a compact reporting fingerprint.
run-report.json exists | The cloud run records workflow and resource provenance.

Prerequisites

On your local machine:

  • a Google Cloud project with billing enabled
  • the gcloud CLI installed and initialized
  • permission to create Compute Engine VMs

1. Create a VM

Set your project, zone, and VM name:

export PROJECT_ID="your-google-cloud-project"
export ZONE="europe-west1-b"
export VM_NAME="cbicall-docker-test"

gcloud config set project "$PROJECT_ID"
gcloud services enable compute.googleapis.com

Create an Ubuntu VM. The disk is intentionally larger than the example data because the CBIcall-provided bundle is the large part of the test.

gcloud compute instances create "$VM_NAME" \
--zone "$ZONE" \
--machine-type "e2-standard-4" \
--boot-disk-size "200GB" \
--boot-disk-type "pd-balanced" \
--image-family "ubuntu-2204-lts" \
--image-project "ubuntu-os-cloud"

Connect to the VM:

gcloud compute ssh "$VM_NAME" --zone "$ZONE"

Cost

The VM and disk can generate charges while they exist. Delete the VM at the end of the recipe if you only need a reproducibility check.

2. Install Docker on the VM

For a short-lived reproducibility VM, the Ubuntu package is sufficient:

sudo apt-get update
sudo apt-get install -y docker.io git python3
sudo systemctl enable --now docker
sudo docker run --rm hello-world

For long-lived systems, follow the current Docker Engine installation instructions in the official Docker documentation.

3. Clone CBIcall and Pull the Image

git clone https://github.com/CNAG-Biomedical-Informatics/cbicall.git
cd cbicall

export CBICALL_IMAGE="manuelrueda/cbicall:latest"
sudo docker pull "$CBICALL_IMAGE"

Record the checkout and image identity:

git rev-parse HEAD
sudo docker image inspect "$CBICALL_IMAGE" --format '{{index .RepoDigests 0}}'

For a manuscript or rebuttal, keep those two values with the run artifacts.

4. Prepare the Resource Directory

Create a host directory that will be mounted as /cbicall-data inside Docker:

export CBICALL_DATA="$HOME/cbicall-data"
mkdir -p "$CBICALL_DATA"

Define a small helper so every CBIcall command uses the same image, checkout, and resource directory. The helper mounts the current directory, so call it from the root of the cbicall checkout:

cbicall_docker() {
  sudo docker run --rm \
    --user "$(id -u):$(id -g)" \
    -v "$PWD":/usr/share/cbicall \
    -v "$CBICALL_DATA":/cbicall-data \
    -w /usr/share/cbicall \
    "$CBICALL_IMAGE" "$@"
}

Download, assemble, verify, and extract the CBIcall-provided bundle:

cbicall_docker python3 scripts/download_cbicall_bundle.py \
--outdir /cbicall-data \
--catalog resources/cbicall-resource-catalog.json

If Google Drive throttles or stalls, print the manual download URLs:

cbicall_docker python3 scripts/download_cbicall_bundle.py \
--outdir /cbicall-data \
--catalog resources/cbicall-resource-catalog.json \
--print-manual-download

After copying the listed files into $CBICALL_DATA, finish setup with:

cbicall_docker python3 scripts/download_cbicall_bundle.py \
--outdir /cbicall-data \
--catalog resources/cbicall-resource-catalog.json \
--skip-download

5. Point CBIcall to /cbicall-data

The Docker container sees the mounted resource directory as /cbicall-data. Update the local checkout before launching CBIcall:

sed -i 's|^DATADIR=.*|DATADIR=/cbicall-data|' workflows/bash/gatk-4.6/env.sh
sed -i 's|^DATADIR=.*|DATADIR=/cbicall-data|' workflows/bash/gatk-3.5/env.sh
sed -i 's|^datadir:.*|datadir: "/cbicall-data"|' workflows/snakemake/gatk-4.6/config.yaml
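
The sed commands above change nothing, and report no error, when a file has no matching line. A quick check can catch that before the test is launched. The sketch below is only an illustration: check_datadir is a hypothetical helper name, not part of CBIcall, and it assumes each setting sits on a line of its own exactly as the sed commands write it.

```shell
# Hypothetical helper (not part of CBIcall): succeed only if FILE contains
# a line that is exactly EXPECTED; otherwise report the file and fail.
check_datadir() {
  local file="$1" expected="$2"
  if grep -qxF -- "$expected" "$file"; then
    echo "ok: $file"
  else
    echo "MISSING: $file has no line '$expected'" >&2
    return 1
  fi
}
```

Possible usage from the checkout root:

check_datadir workflows/bash/gatk-4.6/env.sh 'DATADIR=/cbicall-data'
check_datadir workflows/bash/gatk-3.5/env.sh 'DATADIR=/cbicall-data'
check_datadir workflows/snakemake/gatk-4.6/config.yaml 'datadir: "/cbicall-data"'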

6. Run the Reproducibility Check

cbicall_docker bin/cbicall validate-resources
cbicall_docker bin/cbicall validate-param -p examples/input/param.yaml
cbicall_docker bin/cbicall test --wes -t 1
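
Because the reproducibility record keeps the output of each of these commands, it can help to save the output while the commands run. The following is a minimal sketch under stated assumptions: run_logged and EVIDENCE_DIR are hypothetical names introduced here, not part of CBIcall. The wrapper shows the output live, writes a copy to a per-command log, and returns the exit status of the wrapped command rather than that of tee.

```shell
# Assumed location for saved logs; override EVIDENCE_DIR to change it.
EVIDENCE_DIR="${EVIDENCE_DIR:-$HOME/cbicall-evidence}"

# Hypothetical wrapper (not part of CBIcall).
# Usage: run_logged LABEL COMMAND [ARGS...]
run_logged() {
  local label="$1"
  shift
  mkdir -p "$EVIDENCE_DIR" || return 1
  # A pipeline reports the status of tee, so the command's own exit status
  # is written to a side file inside the group and read back afterwards.
  { "$@" 2>&1; echo "$?" >"$EVIDENCE_DIR/$label.status"; } \
    | tee "$EVIDENCE_DIR/$label.log"
  return "$(cat "$EVIDENCE_DIR/$label.status")"
}
```

Possible usage, stopping at the first failure:

run_logged validate-resources cbicall_docker bin/cbicall validate-resources \
&& run_logged validate-param cbicall_docker bin/cbicall validate-param -p examples/input/param.yaml \
&& run_logged test-wes cbicall_docker bin/cbicall test --wes -t 1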

The final command should print the run directory, workflow log, run-report.json, launcher log, and the output VCF used for comparison.

7. Capture Evidence

Find the newest WES test run:

RUN_DIR=$(find examples/input/CNAG999_exome/CNAG99901P_ex \
-maxdepth 1 \
-type d \
-name 'cbicall_bash_wes_single_b37_gatk-4.6_*' \
| sort \
| tail -n 1)

echo "$RUN_DIR"
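
If the test stopped before creating a run directory, find prints nothing and RUN_DIR is empty, so the later commands would point at paths such as /run-report.json. A small guard makes that failure obvious. This is a sketch only; check_run_dir is a hypothetical helper name, not part of CBIcall.

```shell
# Hypothetical helper (not part of CBIcall): fail early if the run directory
# was not found or does not contain a non-empty run-report.json.
check_run_dir() {
  local run_dir="$1"
  if [ -z "$run_dir" ] || [ ! -d "$run_dir" ]; then
    echo "no WES test run directory found" >&2
    return 1
  fi
  if [ ! -s "$run_dir/run-report.json" ]; then
    echo "run-report.json missing or empty in $run_dir" >&2
    return 1
  fi
  echo "using $run_dir"
}
```

Possible usage:

check_run_dir "$RUN_DIR"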

Inspect the compact run report:

grep -E '"status"|"pipeline_version"|"resources"|"bundle"|"fingerprint"|"workflow_log"' \
"$RUN_DIR/run-report.json"

Compute a normalized variant-record hash for the generated VCF:

# LC_ALL=C pins byte-order sorting so the hash does not depend on the VM locale.
zgrep -v -E '^#' "$RUN_DIR/02_varcall/CNAG99901P.hc.QC.vcf.gz" \
| LC_ALL=C sort \
| sha256sum

Do not use the raw .vcf.gz file hash for reproducibility reporting. VCF headers and gzip metadata can differ between otherwise equivalent runs. The integration test compares and reports the normalized variant-record stream instead.
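
The reasoning above can be sketched as a small function: two compressed VCFs that differ only in header lines or record order should produce the same digest, while a changed record should not. normalized_vcf_sha256 is a hypothetical helper name, not part of CBIcall. It pins the sort order with LC_ALL=C so the result does not depend on the machine's locale. Whether the digest is identical to the fingerprint printed by test --wes depends on how the integration test orders records, which is not stated here.

```shell
# Hypothetical helper (not part of CBIcall): print the SHA-256 of the
# header-free, byte-order-sorted variant records of a gzip-compressed VCF.
normalized_vcf_sha256() {
  gzip -dc -- "$1" \
    | grep -v '^#' \
    | LC_ALL=C sort \
    | sha256sum \
    | cut -d' ' -f1
}
```

Possible usage:

normalized_vcf_sha256 "$RUN_DIR/02_varcall/CNAG99901P.hc.QC.vcf.gz"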

For a reproducibility record, keep:

  • CBIcall commit: git rev-parse HEAD
  • Docker image digest
  • the three command outputs from the reproducibility check
  • run-report.json
  • the normalized variant-record SHA-256
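
The identifying values in this list can be kept together in one small file next to run-report.json. The sketch below is an assumption-laden illustration: write_manifest, the manifest file name, and the field names are hypothetical and not a CBIcall format.

```shell
# Hypothetical helper (not part of CBIcall): write the three identifying
# values as tab-separated key/value lines.
# Usage: write_manifest OUT_FILE COMMIT IMAGE_DIGEST VCF_SHA256
write_manifest() {
  local out="$1" commit="$2" digest="$3" vcf_hash="$4"
  {
    printf 'cbicall_commit\t%s\n' "$commit"
    printf 'docker_image_digest\t%s\n' "$digest"
    printf 'normalized_vcf_sha256\t%s\n' "$vcf_hash"
  } >"$out"
}
```

Possible usage, where VCF_HASH is assumed to hold the normalized variant-record SHA-256 computed earlier:

write_manifest manifest.tsv \
"$(git rev-parse HEAD)" \
"$(sudo docker image inspect "$CBICALL_IMAGE" --format '{{index .RepoDigests 0}}')" \
"$VCF_HASH"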

8. Clean Up

Exit the VM:

exit

Delete the VM from your local terminal:

gcloud compute instances delete "$VM_NAME" --zone "$ZONE"

If you created a separate disk or copied resources to Cloud Storage, delete those separately.

Suggested Manuscript Wording

To directly assess portability, we executed the bundled single-sample WES integration test on a Google Cloud Compute Engine VM using Docker. The cloud run used the same CBIcall checkout, parameters YAML, Docker image, and selected resource as the local test. CBIcall resolved the same workflow implementation and resource identity, and the generated VCF matched the shipped reference output under the deterministic comparison used by the integration test.