Google Cloud
This page is a practical recipe for running the CBIcall one-sample WES reproducibility check on a Google Cloud Compute Engine VM, starting from a fresh GitHub checkout.
The result is a small, auditable cloud run. The recipe installs the CBIcall resource bundle, runs the bundled WES integration test, optionally runs one public 1000 Genomes WES sample, and archives STDOUT, run reports, fingerprints, and run directories as evidence.
Flow
Create VM -> install CBIcall -> install cbicall-data -> validate -> run WES test
-> optionally run one public 1000G WES sample -> archive evidence -> delete VM
What This Checks
The recipe runs:
# vm$
bin/cbicall validate-resources
bin/cbicall validate-parameters -p examples/input/param.yaml
bin/cbicall test --wes-bash -t 1
Expected evidence:
| Evidence | Meaning |
|---|---|
| validate-resources succeeds | The resource catalog is well formed and points to declared workflows. |
| validate-parameters resolves the run | The parameters YAML resolves to a concrete workflow, profile, registry version, and resource. |
| test --wes-bash succeeds | The generated WES VCF matches the normalized hash declared by the integration-test contract. |
| Normalized SHA-256 is printed | The test computes this on the fly from the filtered, sorted reference and test VCF records. |
| run-report.json exists | The cloud run records workflow and resource provenance. |
The integration test does not need a copied ref_* run directory. It computes
the normalized hash from the generated VCF and compares it with the small
contract fixture. New WES/WGS runs also write 03_stats/*.vcf.sha256.txt as a
provenance artifact for later run-report and compare-runs workflows.
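As a rough illustration of what a normalized hash means here, the sketch below hashes only the sorted, non-header records of a VCF, so differences in header lines or record order do not change the digest. The helper name normalized_vcf_sha256 and the exact normalization steps are assumptions made for illustration; the CBIcall test may filter and sort differently.

```shell
# Hypothetical helper: hash only the variant records of a VCF, ignoring
# header lines and record order. This illustrates the idea and is not
# CBIcall's actual normalization.
normalized_vcf_sha256() {
    # Accept plain or gzip-compressed VCFs.
    case "$1" in
        *.gz) gzip -dc -- "$1" ;;
        *)    cat -- "$1" ;;
    esac |
        grep -v '^#' |      # drop ## metadata and the #CHROM header line
        LC_ALL=C sort |     # byte-wise sort so locale does not affect order
        sha256sum |
        cut -d ' ' -f 1     # keep only the hex digest
}

# Example: normalized_vcf_sha256 path/to/sample.vcf.gz
```

Two VCFs with the same records but different headers or record order give the same digest under this scheme, which is the property a reproducibility contract needs.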
Prerequisites
On your local machine:
- a Google Cloud project with billing enabled
- the gcloud CLI installed and initialized
- permission to create Compute Engine VMs
1. Create a VM
Commands marked local$ run on your workstation, for example mrueda@mrueda-ws1. Commands marked vm$ run after connecting to the Google Cloud VM, for example mrueda@cbicall-cloud-test. The prompt comments are only labels; the commands remain pasteable.
Set your project, zone, and VM name:
# local$
export PROJECT_ID="your-google-cloud-project"
export ZONE="europe-west1-b"
export VM_NAME="cbicall-cloud-test"
gcloud config set project "$PROJECT_ID"
gcloud services enable compute.googleapis.com
Create an Ubuntu VM. The disk is intentionally much larger than the example data because the CBIcall-provided resource bundle accounts for most of the disk usage in this test.
# local$
gcloud compute instances create "$VM_NAME" \
--zone "$ZONE" \
--machine-type "e2-standard-4" \
--boot-disk-size "200GB" \
--boot-disk-type "pd-balanced" \
--image-family "ubuntu-2204-lts" \
--image-project "ubuntu-os-cloud"
Connect to the VM:
# local$
gcloud compute ssh "$VM_NAME" --zone "$ZONE"
The VM and disk can generate charges while they exist. Delete the VM at the end of the recipe if you only need a reproducibility check.
2. Install and Run CBIcall from GitHub
Install the system packages needed by the native Bash WES test. GATK 4.6 requires Java 17, and the bundled legacy samtools-0.1.19 needs the ncurses compatibility libraries:
# vm$
sudo apt-get update
sudo apt-get install -y git python3 python3-pip openjdk-17-jdk libncurses5 libtinfo5
Clone CBIcall and record the checkout:
# vm$
git clone https://github.com/CNAG-Biomedical-Informatics/cbicall.git
cd cbicall
git rev-parse HEAD
Install the Python dependencies for the source checkout. The --upgrade flag is important on Ubuntu images because the system jsonschema package can be too old for CBIcall resource-catalog validation:
# vm$
python3 -m pip install --user --upgrade -r requirements.txt
export PATH="$HOME/.local/bin:$PATH"
Prepare the native resource directory:
# vm$
export CBICALL_DATA="$HOME/cbicall-data"
mkdir -p "$CBICALL_DATA"
Download, assemble, verify, and extract the CBIcall-provided resource bundle:
# vm$
python3 scripts/download_cbicall_bundle.py \
--outdir "$CBICALL_DATA" \
--catalog resources/cbicall-resource-catalog.json
If Google Drive throttles or stalls, print the manual download URLs:
# vm$
python3 scripts/download_cbicall_bundle.py \
--outdir "$CBICALL_DATA" \
--catalog resources/cbicall-resource-catalog.json \
--print-manual-download
After copying the listed files into $CBICALL_DATA, finish setup with the command below. This step can take time because it assembles, verifies, and extracts the full resource bundle. On an e2-standard-4 VM with a balanced persistent disk, expect roughly 20-50 minutes once all parts are present; faster disks may take less time.
# vm$
python3 scripts/download_cbicall_bundle.py \
--outdir "$CBICALL_DATA" \
--catalog resources/cbicall-resource-catalog.json \
--skip-download
Expected output includes the long assembly and extraction steps:
Resource key: cbicall-germline-resources-v1
Assembling data.tar.gz from split parts...
adding data.tar.gz.part-00
adding data.tar.gz.part-01
adding data.tar.gz.part-02
adding data.tar.gz.part-03
adding data.tar.gz.part-04
adding data.tar.gz.part-05
Verifying split archive parts with data.tar.gz.md5...
Checksum OK.
Renamed archive to cbicall-germline-resources-v1.tar.gz.
Extracting cbicall-germline-resources-v1.tar.gz into /home/mrueda/cbicall-data...
Wrote installation manifest: /home/mrueda/cbicall-data/cbicall-resource-installation.json
Resource setup complete.
Set DATADIR to: /home/mrueda/cbicall-data
Google Drive Quota Recovery
If the automatic download stops with a Google Drive message such as "Too many users have viewed or downloaded this file recently", keep the files that have
already downloaded and fetch only the missing parts manually. The downloader
skips existing non-empty files.
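To see which parts still need a manual download, a small check like the one below can help. It is a sketch, not part of CBIcall: the function name missing_parts is hypothetical, and the part names (data.tar.gz.part-00 to data.tar.gz.part-05) are taken from the expected output shown earlier on this page, so adjust the list if the bundle layout changes.

```shell
# Hypothetical helper: list expected split parts that are absent or empty
# in a directory. Part names follow the expected output shown on this page.
missing_parts() {
    dir="$1"
    for i in 00 01 02 03 04 05; do
        part="data.tar.gz.part-$i"
        # -s is true only for files that exist and are non-empty.
        [ -s "$dir/$part" ] || printf '%s\n' "$part"
    done
}

# Example (vm$): missing_parts "$CBICALL_DATA"
```

An empty result means all six parts are present and non-empty, which matches the condition under which the downloader skips them.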
Print the manual URLs on the VM:
# vm$
python3 scripts/download_cbicall_bundle.py \
--outdir "$CBICALL_DATA" \
--catalog resources/cbicall-resource-catalog.json \
--print-manual-download
Download the missing file or files from a browser on your workstation. They
will usually land in ~/Downloads. Then copy only the missing parts to the VM
resource directory. For example, if data.tar.gz.part-04 and
data.tar.gz.part-05 were throttled:
# local$
gcloud compute scp \
~/Downloads/data.tar.gz.part-04 \
~/Downloads/data.tar.gz.part-05 \
"${VM_NAME}:~/cbicall-data/" \
--zone "$ZONE"
If only one part is missing, keep only that file in the gcloud compute scp
command.
Resume assembly and validation on the VM without using Google Drive again. This step can take time because it assembles, verifies, and extracts the full resource bundle. On an e2-standard-4 VM with a balanced persistent disk, expect roughly 20-50 minutes once all parts are present; faster disks may take less time.
# vm$
python3 scripts/download_cbicall_bundle.py \
--outdir "$CBICALL_DATA" \
--catalog resources/cbicall-resource-catalog.json \
--skip-download
Optional: Free Disk Space After Resource Setup
After checksum verification and extraction have completed, the extracted
Databases/ and NGSutils/ directories are the files CBIcall needs at runtime.
If disk space is tight on the VM, you can remove the downloaded split parts and
the assembled compressed archive:
# vm$
du -sh "$CBICALL_DATA"/data.tar.gz.part-* \
"$CBICALL_DATA"/cbicall-germline-resources-v1.tar.gz \
"$CBICALL_DATA"/Databases \
"$CBICALL_DATA"/NGSutils
rm -f "$CBICALL_DATA"/data.tar.gz.part-* \
"$CBICALL_DATA"/data.tar.gz \
"$CBICALL_DATA"/cbicall-germline-resources-v1.tar.gz
Keep Databases/, NGSutils/, cbicall-resource-id.json,
cbicall-resource-installation.json, and data.tar.gz.md5.
For future runs, the downloader can remove split parts automatically after
checksum verification with --remove-parts, but it intentionally keeps the
assembled archive unless you delete it yourself.
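If you prefer a safety net around the manual cleanup, the deletion can be wrapped in a guard that refuses to run unless the extracted directories and the installation manifest are present. This is only a sketch: safe_remove_archives is a hypothetical name, and the presence checks are an assumption about what counts as a completed setup, not a CBIcall feature.

```shell
# Hypothetical guard: delete the split parts and compressed archives only
# when the extracted runtime directories and the installation manifest exist.
safe_remove_archives() {
    dir="$1"
    for required in Databases NGSutils; do
        if [ ! -d "$dir/$required" ]; then
            echo "Refusing to clean up: $dir/$required is missing" >&2
            return 1
        fi
    done
    if [ ! -s "$dir/cbicall-resource-installation.json" ]; then
        echo "Refusing to clean up: installation manifest is missing" >&2
        return 1
    fi
    # Same files as the manual rm command; data.tar.gz.md5 is left in place.
    rm -f "$dir"/data.tar.gz.part-* \
          "$dir"/data.tar.gz \
          "$dir"/cbicall-germline-resources-v1.tar.gz
}

# Example (vm$): safe_remove_archives "$CBICALL_DATA"
```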
Point native workflows to the VM resource path:
# vm$
sed -i "s|^DATADIR=.*|DATADIR=${CBICALL_DATA}|" workflows/bash/gatk-3.5/env.sh
sed -i "s|^datadir:.*|datadir: \"${CBICALL_DATA}\"|" workflows/snakemake/gatk-4.6/config.yaml
The GATK 4.6 Bash env.sh is a symlink to the GATK 3.5 Bash env.sh,
so one Bash edit is enough. Likewise, the native Nextflow and Cromwell configs
are symlinks to the shared GATK 4.6 backend config edited above
(workflows/snakemake/gatk-4.6/config.yaml), so that single edit updates the
Snakemake, native Nextflow, and Cromwell workflows.
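Because the sed commands edit files in place without printing anything, it can be worth confirming that both lines now point at the resource directory. The sketch below only reads the two files named in the sed commands; the function name show_datadir_settings is hypothetical.

```shell
# Hypothetical helper: print the DATADIR line from the Bash env file and the
# datadir line from the shared backend config so the edits can be checked.
# The optional argument is the checkout root (defaults to the current dir).
show_datadir_settings() {
    root="${1:-.}"
    grep -H '^DATADIR=' "$root/workflows/bash/gatk-3.5/env.sh"
    grep -H '^datadir:' "$root/workflows/snakemake/gatk-4.6/config.yaml"
}

# Example (vm$, from the cbicall checkout): show_datadir_settings
```

Both printed values should match the path stored in CBICALL_DATA. A missing line means the corresponding sed pattern did not match and the file needs a manual edit.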
Run the checks and keep STDOUT as evidence:
# vm$
mkdir -p cloud-evidence
{
date -u
git rev-parse HEAD
bin/cbicall validate-resources
bin/cbicall validate-parameters -p examples/input/param.yaml
bin/cbicall test --wes-bash -t 1
} 2>&1 | tee cloud-evidence/cbicall-google-cloud-wes-stdout.txt
The final command should print the run directory, workflow log,
run-report.json, run-report.html, launcher log, and the output VCF hash used
for comparison. The tee command keeps the same output in
cloud-evidence/cbicall-google-cloud-wes-stdout.txt.
3. Optional: Run One Public WES Sample
The bundled integration test is the reproducibility check because it has a small known output contract. For reviewer evidence, it can be useful to also run one real public exome sample, for example a 1000 Genomes WES sample. CBIcall does not ship those FASTQs; download a public sample yourself from the data provider, or copy FASTQs that you already have on your workstation.
For native single-sample WES, put paired FASTQs in one sample directory. FASTQ
names must contain matching _R1_ and _R2_ tokens, for example
SRR1596639_ex_S14_L001_R1_001.fastq.gz and
SRR1596639_ex_S14_L001_R2_001.fastq.gz. CBIcall writes the run directory under
that sample directory.
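A quick way to catch a missing mate before launching a long run is to check that every R1 file has a matching R2 file. The sketch below assumes gzip-compressed FASTQs named like the example above; the function name unpaired_fastqs is hypothetical and the check is not part of CBIcall.

```shell
# Hypothetical helper: list R1 FASTQs in a sample directory that have no
# matching R2 file. Assumes names like ..._R1_001.fastq.gz / ..._R2_001.fastq.gz.
unpaired_fastqs() {
    dir="$1"
    for r1 in "$dir"/*_R1_*.fastq.gz; do
        [ -e "$r1" ] || continue    # the glob matched nothing
        name=${r1##*/}
        # Swap the first _R1_ token for _R2_ to get the expected mate name.
        r2="$dir/${name%%_R1_*}_R2_${name#*_R1_}"
        [ -e "$r2" ] || printf '%s\n' "$name"
    done
}

# Example (vm$): unpaired_fastqs examples/input/1000g/HG00103/SRR1596639
```

No output means every R1 file found has its R2 counterpart in the same directory.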
Create a sample directory on the VM:
# vm$
cd ~/cbicall
mkdir -p examples/input/1000g/HG00103/SRR1596639
If the FASTQs are already on your workstation, copy them to the VM. This
example uses the 1000 Genomes sample HG00103, run SRR1596639, stored in a
directory next to your local checkout:
# local$
gcloud compute scp \
../1000g/HG00103/SRR1596639/SRR1596639_ex_S14_L001_R1_001.fastq.gz \
../1000g/HG00103/SRR1596639/SRR1596639_ex_S14_L001_R2_001.fastq.gz \
"${VM_NAME}:~/cbicall/examples/input/1000g/HG00103/SRR1596639/" \
--zone "$ZONE"
If your browser saved the FASTQs under ~/Downloads, use those local paths
instead. The important point is that the two files land in the same sample
directory on the VM, preserving the 1000g/HG00103/SRR1596639 hierarchy.
Alternatively, download the public FASTQs directly on the VM into the same directory using the URLs provided by the data source.
Create a parameters YAML for the real-data run:
# vm$
cat > cloud-wes-real.yaml <<YAML
mode: single
pipeline: wes
workflow_backend: bash
software_stack: gatk-4.6
genome: b37
resource: cbicall-germline-resources-v1
cleanup_bam: true
input_dir: ${HOME}/cbicall/examples/input/1000g/HG00103/SRR1596639
YAML
cleanup_bam: true deletes intermediate BAM/BAI files after a successful
single-sample WES run, which helps avoid filling the VM disk.
Validate the YAML contract first:
# vm$
bin/cbicall validate-parameters -p cloud-wes-real.yaml
Run the real-data WES job in the background so it keeps running if the SSH connection drops. STDOUT and STDERR are written to the evidence file:
# vm$
cd ~/cbicall
mkdir -p cloud-evidence
nohup bash -lc '
date -u
git rev-parse HEAD
bin/cbicall validate-parameters -p cloud-wes-real.yaml
bin/cbicall -p cloud-wes-real.yaml -t 4
' > cloud-evidence/cbicall-google-cloud-public-wes-stdout.txt 2>&1 &
Follow the run without attaching the job to the terminal:
# vm$
tail -f cloud-evidence/cbicall-google-cloud-public-wes-stdout.txt
Check whether CBIcall is still running:
# vm$
ps -ef | grep '[c]bicall'
This real-data run is not compared against the bundled integration-test fixture.
It should be reported as execution evidence: CBIcall validated the YAML contract,
resolved the workflow and resource bundle, completed the WES pipeline, and wrote
run-report.json, run-report.html, logs, and VCF fingerprints for later audit
or compare-runs checks.
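To collect the VCF fingerprints for a report, something like the following sketch can list them together with their contents. It assumes the fingerprint files sit in a 03_stats directory inside each cbicall_* run directory, which is inferred from the file pattern mentioned earlier on this page; the function name show_vcf_fingerprints is hypothetical.

```shell
# Hypothetical helper: print each VCF fingerprint file found under the
# CBIcall run directories, followed by its contents. The directory layout
# (cbicall_*/.../03_stats/*.vcf.sha256.txt) is an assumption.
show_vcf_fingerprints() {
    find "${1:-examples/input}" -type f \
        -path '*/cbicall_*/03_stats/*.vcf.sha256.txt' |
        LC_ALL=C sort |
        while IFS= read -r f; do
            printf '== %s\n' "$f"
            cat -- "$f"
        done
}

# Example (vm$, from the cbicall checkout): show_vcf_fingerprints
```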
4. Keep the Evidence
If you did not already use the captured check block above, capture STDOUT while running the checks. This keeps the exact command output, including the normalized VCF hash and the paths to the generated run artifacts:
# vm$
mkdir -p cloud-evidence
{
date -u
git rev-parse HEAD
bin/cbicall validate-resources
bin/cbicall validate-parameters -p examples/input/param.yaml
bin/cbicall test --wes-bash -t 1
} 2>&1 | tee cloud-evidence/cbicall-google-cloud-wes-stdout.txt
The test --wes-bash output prints the run directory, workflow log,
run-report.json, run-report.html, launcher log, and contract fixture. Archive
STDOUT plus the CBIcall run directories created under examples/input:
# vm$
{
printf '%s\n' cloud-evidence examples/input/param.yaml
test -f cloud-wes-real.yaml && printf '%s\n' cloud-wes-real.yaml
find examples/input -type d -name 'cbicall_*' -prune -print
} | sort -u > cloud-evidence/cbicall-evidence-files.txt
tar -czf cbicall-google-cloud-wes-evidence.tar.gz \
--files-from cloud-evidence/cbicall-evidence-files.txt
The find command captures run directories such as
examples/input/CNAG999_exome/CNAG99901P_ex/cbicall_* and, if you ran the
public WES sample, examples/input/1000g/HG00103/SRR1596639/cbicall_*.
Copy the evidence archive back to your local machine from a local terminal:
# local$
gcloud compute scp \
"${VM_NAME}:~/cbicall/cbicall-google-cloud-wes-evidence.tar.gz" \
. \
--zone "$ZONE"
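Since the VM is deleted in the next step, it can be worth confirming that the local copy is intact first. One possible approach, sketched below, is to write a SHA-256 checksum file next to the archive on the VM, copy that file alongside the archive, and verify both locally. The function names write_checksum and verify_archive are hypothetical, and the extra .sha256 file is not something CBIcall produces.

```shell
# Hypothetical helpers for checking the evidence archive.

# Run on the VM: write <archive>.sha256 next to the archive.
write_checksum() {
    ( cd "$(dirname "$1")" &&
      sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Run locally after copying the archive and its .sha256 file:
# check the digest, then confirm that tar can read the whole archive.
verify_archive() {
    ( cd "$(dirname "$1")" &&
      sha256sum -c "$(basename "$1").sha256" &&
      tar -tzf "$(basename "$1")" > /dev/null )
}

# Example (vm$):    write_checksum cbicall-google-cloud-wes-evidence.tar.gz
# Example (local$): verify_archive cbicall-google-cloud-wes-evidence.tar.gz
```

The verification returns a non-zero status if the digest differs or the archive cannot be listed, in which case copy the archive again before deleting the VM.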
5. Clean Up
Deleting the VM removes the instance and its boot disk contents, including the CBIcall checkout, downloaded resources, FASTQs, run directories, and evidence files. Copy any evidence archive you want to keep back to your local machine before running this command.
From your local machine:
# local$
gcloud compute instances delete "$VM_NAME" --zone "$ZONE"