Google Cloud

This page is a practical recipe for running the CBIcall one-sample WES reproducibility check on a Google Cloud Compute Engine VM, starting from a fresh GitHub source checkout.

Outcome

Use this recipe to create a small, auditable cloud run from a fresh VM and a fresh GitHub checkout. It installs the CBIcall resource bundle, runs the bundled WES integration test, optionally runs one public 1000 Genomes WES sample, and archives STDOUT, run reports, fingerprints, and run directories as evidence.

Flow

Create VM -> install CBIcall -> install cbicall-data -> validate -> run WES test
          -> optionally run one public 1000G WES sample -> archive evidence -> delete VM

What This Checks

The recipe runs:

# vm$
cbicall doctor
cbicall validate-parameters -p examples/input/param.yaml
cbicall test --wes-bash -t 1

Expected evidence:

Evidence	Meaning
`doctor` reports `READY`	The CBIcall contracts, installed bundle metadata, resource layout, and required execution backend are available.
`validate-parameters` resolves the run	The parameters YAML resolves to a concrete workflow, profile, registry version, and resource.
`test --wes-bash` succeeds	The generated WES VCF matches the normalized hash declared by the integration-test contract.
Normalized SHA-256 is printed	The test computes this on the fly from the filtered, sorted reference and test VCF records.
`run-report.json` exists	The cloud run records workflow and resource provenance.

VCF hash files

The integration test does not need a copied ref_* run directory. It computes the normalized hash from the generated VCF and compares it with the small contract fixture. New WES/WGS runs also write 03_stats/*.vcf.sha256.txt as a provenance artifact for later run-report and compare-runs workflows.
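
To illustrate why a normalized hash stays stable across runs, the sketch below hashes only the sorted, non-header records of an uncompressed VCF. The function name normalized_vcf_sha256 is hypothetical, and CBIcall's own filtering and normalization rules may differ, so the digest it prints is not expected to match the one reported by cbicall.

```shell
# Hypothetical helper, not part of CBIcall: hash the sorted, non-header
# records of an uncompressed VCF so that header lines (dates, command
# lines) and record order do not change the fingerprint.
normalized_vcf_sha256() {
  grep -v '^#' "$1" | LC_ALL=C sort | sha256sum | cut -d ' ' -f 1
}

# Usage (the path is a placeholder):
#   normalized_vcf_sha256 path/to/sample.vcf
```

Two VCFs that differ only in header content or record order produce the same digest under this scheme, while any change to a record changes it.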

Prerequisites

On your local machine:

a Google Cloud project with billing enabled
the gcloud CLI installed and initialized
permission to create Compute Engine VMs
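
A quick preflight on the workstation can catch a missing tool before any VM is created. The helper below is a minimal sketch; require_cmds is a hypothetical name, and it only checks that commands are on PATH, not that billing or permissions are in place.

```shell
# Hypothetical preflight helper: report every missing command on STDERR,
# then return non-zero if at least one was missing.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage on the workstation:
#   require_cmds gcloud && gcloud config get-value project
```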

1. Create a VM

Commands marked local$ run on your workstation, for example mrueda@mrueda-ws1. Commands marked vm$ run after connecting to the Google Cloud VM, for example mrueda@cbicall-cloud-test. The prompt comments are only labels; the commands remain pasteable.

Set your project, zone, and VM name:

# local$
export PROJECT_ID="your-google-cloud-project"
export ZONE="europe-west1-b"
export VM_NAME="cbicall-cloud-test"

gcloud config set project "$PROJECT_ID"
gcloud services enable compute.googleapis.com

Create an Ubuntu VM. The disk is intentionally much larger than the example data because the CBIcall resource bundle accounts for most of the storage this test needs.

# local$
gcloud compute instances create "$VM_NAME" \
  --zone "$ZONE" \
  --machine-type "e2-standard-4" \
  --boot-disk-size "200GB" \
  --boot-disk-type "pd-balanced" \
  --image-family "ubuntu-2204-lts" \
  --image-project "ubuntu-os-cloud"
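
Before connecting, it can help to confirm that the instance has finished starting. The sketch below wraps gcloud compute instances describe with the value(status) output format; vm_status is a hypothetical helper name, and a started instance is expected to report RUNNING.

```shell
# Hypothetical helper: print the lifecycle status of an instance
# (for example RUNNING) given its name and zone.
vm_status() {
  gcloud compute instances describe "$1" --zone "$2" --format='value(status)'
}

# Usage on the workstation:
#   vm_status "$VM_NAME" "$ZONE"
```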

Connect to the VM:

# local$
gcloud compute ssh "$VM_NAME" --zone "$ZONE"

Cost

The VM and disk can generate charges while they exist. Delete the VM at the end of the recipe if you only need a reproducibility check.

2. Install and Run CBIcall from GitHub

Install the system packages needed by the native Bash WES test. GATK 4.6 requires Java 17, and the bundled legacy samtools-0.1.19 needs the ncurses compatibility libraries:

# vm$
sudo apt-get update
sudo apt-get install -y git python3 python3-pip openjdk-17-jdk libncurses5 libtinfo5

Clone CBIcall and record the checkout:

# vm$
git clone https://github.com/CNAG-Biomedical-Informatics/cbicall.git
cd cbicall
git rev-parse HEAD

Install the Python dependencies for the source checkout. The --upgrade flag is important on Ubuntu images because the system jsonschema package can be too old for CBIcall resource-catalog validation:

# vm$
python3 -m pip install --user --upgrade -e ".[all,test]"
export PATH="$HOME/.local/bin:$PATH"
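
To see which jsonschema version Python actually resolves after the install, a small lookup can be wrapped in a shell function. This is a sketch: py_pkg_version is a hypothetical helper, and the minimum version CBIcall requires is not stated here, so compare the result against the project's own requirements.

```shell
# Hypothetical helper: print the installed version of a Python package,
# or "not-installed" if Python cannot find its metadata.
py_pkg_version() {
  python3 - "$1" <<'PY'
import sys
from importlib import metadata

try:
    print(metadata.version(sys.argv[1]))
except metadata.PackageNotFoundError:
    print("not-installed")
PY
}

# Usage on the VM:
#   py_pkg_version jsonschema
```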

Prepare the native resource directory:

# vm$
export CBICALL_DATA="$HOME/cbicall-data"
mkdir -p "$CBICALL_DATA"

Download, assemble, verify, and extract the CBIcall-provided resource bundle:

# vm$
python3 scripts/download_cbicall_bundle.py \
  --outdir "$CBICALL_DATA" \
  --catalog resources/cbicall-resource-catalog.json

If Google Drive throttles or stalls, print the manual download URLs:

# vm$
python3 scripts/download_cbicall_bundle.py \
  --outdir "$CBICALL_DATA" \
  --catalog resources/cbicall-resource-catalog.json \
  --print-manual-download

After copying the listed files into $CBICALL_DATA, finish setup with the command below. This step can take time because it assembles, verifies, and extracts the full resource bundle. On an e2-standard-4 VM with a balanced persistent disk, expect roughly 20-50 minutes once all parts are present; faster disks may shorten this.

# vm$
python3 scripts/download_cbicall_bundle.py \
  --outdir "$CBICALL_DATA" \
  --catalog resources/cbicall-resource-catalog.json \
  --skip-download

Expected output includes the long assembly and extraction steps:

Resource key: cbicall-germline-resources-v1
Assembling data.tar.gz from split parts...
  adding data.tar.gz.part-00
  adding data.tar.gz.part-01
  adding data.tar.gz.part-02
  adding data.tar.gz.part-03
  adding data.tar.gz.part-04
  adding data.tar.gz.part-05
Verifying split archive parts with data.tar.gz.md5...
Checksum OK.
Renamed archive to cbicall-germline-resources-v1.tar.gz.
Extracting cbicall-germline-resources-v1.tar.gz into /home/mrueda/cbicall-data...
Wrote installation manifest: /home/mrueda/cbicall-data/cbicall-resource-installation.json

Resource setup complete.
Set CBICALL_DATA to: /home/mrueda/cbicall-data

Google Drive Quota Recovery

If automatic download stops with a Google Drive message such as Too many users have viewed or downloaded this file recently, keep the files that already downloaded and fetch only the missing shard manually. The downloader skips existing non-empty files.
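
To work out which shards still need a manual download, you can list the parts that are absent or empty. The sketch below assumes the six parts (part-00 to part-05) shown in the sample output on this page; missing_parts is a hypothetical helper, and the part count may differ for other bundle versions.

```shell
# Hypothetical helper: list split parts that are absent or empty in a
# resource directory. The part-00..part-05 range is an assumption taken
# from the sample output on this page.
missing_parts() {
  for n in 00 01 02 03 04 05; do
    # -s is true only for files that exist and are non-empty.
    [ -s "$1/data.tar.gz.part-$n" ] || echo "data.tar.gz.part-$n"
  done
}

# Usage on the VM:
#   missing_parts "$CBICALL_DATA"
```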

Print the manual URLs on the VM:

# vm$
python3 scripts/download_cbicall_bundle.py \
  --outdir "$CBICALL_DATA" \
  --catalog resources/cbicall-resource-catalog.json \
  --print-manual-download

Download the missing file or files from a browser on your workstation. They will usually land in ~/Downloads. Then copy only the missing parts to the VM resource directory. For example, if data.tar.gz.part-04 and data.tar.gz.part-05 were throttled:

# local$
gcloud compute scp \
  ~/Downloads/data.tar.gz.part-04 \
  ~/Downloads/data.tar.gz.part-05 \
  "${VM_NAME}:~/cbicall-data/" \
  --zone "$ZONE"

If only one part is missing, keep only that file in the gcloud compute scp command.

Resume assembly and validation on the VM without using Google Drive again. This step can take time because it assembles, verifies, and extracts the full resource bundle. On an e2-standard-4 VM with a balanced persistent disk, expect roughly 20-50 minutes once all parts are present; faster disks may shorten this.

# vm$
python3 scripts/download_cbicall_bundle.py \
  --outdir "$CBICALL_DATA" \
  --catalog resources/cbicall-resource-catalog.json \
  --skip-download

Optional: Free Disk Space After Resource Setup

After checksum verification and extraction have completed, the extracted Databases/ and NGSutils/ directories contain the files CBIcall needs at runtime. If disk space is tight on the VM, you can remove the downloaded split parts and the assembled compressed archive:

# vm$
du -sh "$CBICALL_DATA"/data.tar.gz.part-* \
  "$CBICALL_DATA"/cbicall-germline-resources-v1.tar.gz \
  "$CBICALL_DATA"/Databases \
  "$CBICALL_DATA"/NGSutils

rm -f "$CBICALL_DATA"/data.tar.gz.part-* \
  "$CBICALL_DATA"/data.tar.gz \
  "$CBICALL_DATA"/cbicall-germline-resources-v1.tar.gz

Keep Databases/, NGSutils/, cbicall-resource-id.json, cbicall-resource-installation.json, and data.tar.gz.md5.
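
Because the rm command is irreversible, a guard that confirms the extracted content exists first can be worthwhile. The following is a minimal sketch; resources_extracted is a hypothetical helper that only tests for the presence of the directories and metadata files named above, not their completeness.

```shell
# Hypothetical guard: succeed only if the extracted directories and the
# metadata files are present under the given resource directory.
resources_extracted() {
  for item in Databases NGSutils cbicall-resource-id.json \
              cbicall-resource-installation.json; do
    if [ ! -e "$1/$item" ]; then
      echo "missing: $1/$item" >&2
      return 1
    fi
  done
}

# Usage on the VM, before running the rm command:
#   resources_extracted "$CBICALL_DATA" && echo "extracted content present"
```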

For future runs, the downloader can remove split parts automatically after checksum verification with --remove-parts, but it intentionally keeps the assembled archive unless you delete it yourself.

Expose the VM resource path to every native backend:

# vm$
export CBICALL_DATA

CBIcall passes this location to Bash, Snakemake, native Nextflow, and Cromwell without modifying workflow files.

Run the checks and keep STDOUT as evidence:

# vm$
mkdir -p cloud-evidence

{
  date -u
  git rev-parse HEAD
  cbicall doctor
  cbicall validate-parameters -p examples/input/param.yaml
  cbicall test --wes-bash -t 1
} 2>&1 | tee cloud-evidence/cbicall-google-cloud-wes-stdout.txt

The final command should print the run directory, workflow log, run-report.json, run-report.html, launcher log, and the output VCF hash used for comparison. The tee command keeps the same output in cloud-evidence/cbicall-google-cloud-wes-stdout.txt.
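
To pull the printed digests back out of the captured file for a report, you can search for 64-character hexadecimal tokens. This is a sketch under the assumption that the hash is printed as lowercase hex; list_sha256 is a hypothetical helper and it returns any SHA-256-shaped token in the file, not only the VCF hash.

```shell
# Hypothetical helper: print each distinct 64-character lowercase hex
# token found in a text file (the shape of a SHA-256 digest).
list_sha256() {
  grep -Eow '[0-9a-f]{64}' "$1" | sort -u
}

# Usage on the VM:
#   list_sha256 cloud-evidence/cbicall-google-cloud-wes-stdout.txt
```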

3. Optional: Run One Public WES Sample

The bundled integration test is the reproducibility check because it has a small known output contract. For reviewer evidence, it can be useful to also run one real public exome sample, for example a 1000 Genomes WES sample. CBIcall does not ship those FASTQs; download a public sample yourself from the data provider, or copy FASTQs that you already have on your workstation.

Input naming

For native single-sample WES, put paired FASTQs in one sample directory. FASTQ names must contain matching _R1_ and _R2_ tokens, for example SRR1596639_ex_S14_L001_R1_001.fastq.gz and SRR1596639_ex_S14_L001_R2_001.fastq.gz. CBIcall writes the run directory under that sample directory.
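
The naming rule can be checked before launching a long run. The sketch below is an illustration only: check_fastq_pairs is a hypothetical helper, it assumes a .fastq.gz suffix as in the example names, and it does not reproduce whatever validation CBIcall performs internally.

```shell
# Hypothetical check: for every *_R1_* FASTQ in a sample directory, expect
# a sibling file whose name differs only by _R1_ -> _R2_. Fails if a mate
# is missing or if the directory holds no R1 files at all.
check_fastq_pairs() {
  found=0
  for r1 in "$1"/*_R1_*.fastq.gz; do
    [ -e "$r1" ] || continue   # the glob matched nothing
    found=1
    base=${r1##*/}             # file name without the directory part
    r2="$1/$(printf '%s\n' "$base" | sed 's/_R1_/_R2_/')"
    if [ ! -e "$r2" ]; then
      echo "missing mate for: $r1" >&2
      return 1
    fi
  done
  [ "$found" -eq 1 ]
}

# Usage on the VM:
#   check_fastq_pairs examples/input/1000g/HG00103/SRR1596639
```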

Create a sample directory on the VM:

# vm$
cd ~/cbicall
mkdir -p examples/input/1000g/HG00103/SRR1596639

If the FASTQs are already on your workstation, copy them to the VM. This example uses the 1000 Genomes sample HG00103, run SRR1596639, stored in a directory next to your local checkout:

# local$
gcloud compute scp \
  ../1000g/HG00103/SRR1596639/SRR1596639_ex_S14_L001_R1_001.fastq.gz \
  ../1000g/HG00103/SRR1596639/SRR1596639_ex_S14_L001_R2_001.fastq.gz \
  "${VM_NAME}:~/cbicall/examples/input/1000g/HG00103/SRR1596639/" \
  --zone "$ZONE"

If your browser saved the FASTQs under ~/Downloads, use those local paths instead. The important point is that the two files land in the same sample directory on the VM, preserving the 1000g/HG00103/SRR1596639 hierarchy.

Alternatively, download the public FASTQs directly on the VM into the same directory using the URLs provided by the data source.

Create a parameters YAML for the real-data run:

# vm$
cat > cloud-wes-real.yaml <<YAML
mode: single
pipeline: wes
workflow_backend: bash
software_stack: gatk-4.6
genome: b37
resource: cbicall-germline-resources-v1
cleanup_bam: true
input_dir: ${HOME}/cbicall/examples/input/1000g/HG00103/SRR1596639
YAML

cleanup_bam: true deletes intermediate BAM/BAI files after a successful single-sample WES run, which helps avoid filling the VM disk.

Validate the YAML contract first:

# vm$
cbicall validate-parameters -p cloud-wes-real.yaml

Run the real-data WES job in the background so it keeps running if the SSH connection drops. STDOUT and STDERR are written to the evidence file:

# vm$
cd ~/cbicall
mkdir -p cloud-evidence

nohup bash -lc '
date -u
git rev-parse HEAD
cbicall validate-parameters -p cloud-wes-real.yaml
cbicall run -p cloud-wes-real.yaml -t 4
' > cloud-evidence/cbicall-google-cloud-public-wes-stdout.txt 2>&1 &

Follow the run without attaching the job to the terminal:

# vm$
tail -f cloud-evidence/cbicall-google-cloud-public-wes-stdout.txt

Check whether CBIcall is still running:

# vm$
ps -ef | grep '[c]bicall'

This real-data run is not compared against the bundled integration-test fixture. It should be reported as execution evidence: CBIcall validated the YAML contract, resolved the workflow and resource bundle, completed the WES pipeline, and wrote run-report.json, run-report.html, logs, and VCF fingerprints for later audit or compare-runs checks.

4. Keep the Evidence

If you did not already use the captured check block above, capture STDOUT while running the checks. This keeps the exact command output, including the normalized VCF hash and the paths to the generated run artifacts:

# vm$
mkdir -p cloud-evidence

{
  date -u
  git rev-parse HEAD
  cbicall doctor
  cbicall validate-parameters -p examples/input/param.yaml
  cbicall test --wes-bash -t 1
} 2>&1 | tee cloud-evidence/cbicall-google-cloud-wes-stdout.txt

The test --wes-bash output prints the run directory, workflow log, run-report.json, run-report.html, launcher log, and contract fixture. Archive STDOUT plus the CBIcall run directories created under examples/input:

# vm$
{
  printf '%s\n' cloud-evidence examples/input/param.yaml
  test -f cloud-wes-real.yaml && printf '%s\n' cloud-wes-real.yaml
  find examples/input -type d -name 'cbicall_*' -prune -print
} | sort -u > cloud-evidence/cbicall-evidence-files.txt

tar -czf cbicall-google-cloud-wes-evidence.tar.gz \
  --files-from cloud-evidence/cbicall-evidence-files.txt

The find command captures run directories such as examples/input/CNAG999_exome/CNAG99901P_ex/cbicall_* and, if you ran the public WES sample, examples/input/1000g/HG00103/SRR1596639/cbicall_*.
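
Recording a digest of the archive on the VM, and again after copying it, gives a simple way to confirm the transfer was complete. The helper below is a sketch; archive_summary is a hypothetical name and the output layout (digest, entry count, path) is an arbitrary choice.

```shell
# Hypothetical helper: print the SHA-256 digest, the number of entries,
# and the path of a gzip-compressed tar archive. Returns non-zero if tar
# cannot list the archive.
archive_summary() {
  listing=$(tar -tzf "$1") || return 1
  entries=$(printf '%s\n' "$listing" | wc -l)
  digest=$(sha256sum "$1" | cut -d ' ' -f 1)
  echo "$digest $((entries)) $1"
}

# Usage on the VM and again on the workstation:
#   archive_summary cbicall-google-cloud-wes-evidence.tar.gz
```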

Copy the evidence archive back to your local machine from a local terminal:

# local$
gcloud compute scp \
  "${VM_NAME}:~/cbicall/cbicall-google-cloud-wes-evidence.tar.gz" \
  . \
  --zone "$ZONE"

5. Clean Up

Deleting the VM removes the instance and its boot disk contents, including the CBIcall checkout, downloaded resources, FASTQs, run directories, and evidence files. Copy any evidence archive you want to keep back to your local machine before running this command.

From your local machine:

# local$
gcloud compute instances delete "$VM_NAME" --zone "$ZONE"
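
Since deletion is irreversible, one option is to wrap it in a guard that refuses to run unless the evidence archive is already on the workstation. This is a minimal sketch; delete_vm_if_archived is a hypothetical helper that reuses the VM_NAME and ZONE variables set earlier and only checks that the archive is a non-empty local file.

```shell
# Hypothetical guard: delete the VM only if the evidence archive already
# exists as a non-empty file on the workstation.
delete_vm_if_archived() {
  if [ ! -s "$1" ]; then
    echo "evidence archive not found locally: $1" >&2
    return 1
  fi
  gcloud compute instances delete "$VM_NAME" --zone "$ZONE"
}

# Usage on the workstation:
#   delete_vm_if_archived ./cbicall-google-cloud-wes-evidence.tar.gz
```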
