Google Cloud

This page is a practical recipe for running the CBIcall one-sample WES reproducibility check on a Google Cloud Compute Engine VM, starting from a fresh GitHub source checkout.

Outcome

Use this recipe to create a small, auditable cloud run from a fresh VM and a fresh GitHub checkout. It installs the CBIcall resource bundle, runs the bundled WES integration test, optionally runs one public 1000 Genomes WES sample, and archives STDOUT, run reports, fingerprints, and run directories as evidence.

Flow

Create VM -> install CBIcall -> install cbicall-data -> validate -> run WES test
-> optionally run one public 1000G WES sample -> archive evidence -> delete VM

What This Checks

The recipe runs:

# vm$
bin/cbicall validate-resources
bin/cbicall validate-parameters -p examples/input/param.yaml
bin/cbicall test --wes-bash -t 1

Expected evidence:

Evidence | Meaning
validate-resources succeeds | The resource catalog is well formed and points to declared workflows.
validate-parameters resolves the run | The parameters YAML resolves to a concrete workflow, profile, registry version, and resource.
test --wes-bash succeeds | The generated WES VCF matches the normalized hash declared by the integration-test contract.
Normalized SHA-256 is printed | The test computes this on the fly from the filtered, sorted reference and test VCF records.
run-report.json exists | The cloud run records workflow and resource provenance.

VCF hash files

The integration test does not need a copied ref_* run directory. It computes the normalized hash from the generated VCF and compares it with the small contract fixture. New WES/WGS runs also write 03_stats/*.vcf.sha256.txt as a provenance artifact for later run-report and compare-runs workflows.
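The idea behind a normalized hash can be illustrated with a minimal sketch. This is not CBIcall's implementation: it assumes that normalization means dropping header lines and sorting the remaining records of an uncompressed VCF, and the function name normalized_vcf_sha256 is hypothetical.

```shell
# vm$
# Hypothetical helper, not part of CBIcall: hash only the sorted, non-header
# records of an uncompressed VCF, so that header differences (dates, command
# lines) and record order do not change the digest.
normalized_vcf_sha256() {
  grep -v '^#' "$1" | LC_ALL=C sort | sha256sum | awk '{print $1}'
}

# Example: normalized_vcf_sha256 sample.vcf
```

Two VCFs that differ only in header content or record order should give the same digest under this scheme, which is the property that makes a small contract fixture sufficient.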

Prerequisites

On your local machine:

  • a Google Cloud project with billing enabled
  • the gcloud CLI installed and initialized
  • permission to create Compute Engine VMs
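A short pre-flight check can confirm that the required tools are on your PATH before you start. The helper below is a sketch; the name require_cmds is hypothetical and it only tests for command presence, not for project permissions or billing.

```shell
# local$
# Hypothetical pre-flight helper: report every required command that is not
# on PATH and return non-zero if any is missing.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: require_cmds gcloud && gcloud config list
```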

1. Create a VM

Commands marked local$ run on your workstation, for example mrueda@mrueda-ws1. Commands marked vm$ run after connecting to the Google Cloud VM, for example mrueda@cbicall-cloud-test. The prompt comments are only labels; the commands remain pasteable.

Set your project, zone, and VM name:

# local$
export PROJECT_ID="your-google-cloud-project"
export ZONE="europe-west1-b"
export VM_NAME="cbicall-cloud-test"

gcloud config set project "$PROJECT_ID"
gcloud services enable compute.googleapis.com

Create an Ubuntu VM. The boot disk is intentionally much larger than the example data because the CBIcall resource bundle accounts for most of the disk usage.

# local$
gcloud compute instances create "$VM_NAME" \
--zone "$ZONE" \
--machine-type "e2-standard-4" \
--boot-disk-size "200GB" \
--boot-disk-type "pd-balanced" \
--image-family "ubuntu-2204-lts" \
--image-project "ubuntu-os-cloud"
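Before connecting, it can help to confirm that the instance has reached the RUNNING state. The wrapper below is a sketch around gcloud compute instances describe; the function name vm_status is hypothetical.

```shell
# local$
# Hypothetical helper: print the instance status (for example RUNNING).
# Arguments: VM name, zone.
vm_status() {
  gcloud compute instances describe "$1" --zone "$2" --format='value(status)'
}

# Example: vm_status "$VM_NAME" "$ZONE"
```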

Connect to the VM:

# local$
gcloud compute ssh "$VM_NAME" --zone "$ZONE"

Cost

The VM and disk can generate charges while they exist. Delete the VM at the end of the recipe if you only need a reproducibility check.

2. Install and Run CBIcall from GitHub

Install the system packages needed by the native Bash WES test. GATK 4.6 requires Java 17, and the bundled legacy samtools-0.1.19 needs the ncurses compatibility libraries:

# vm$
sudo apt-get update
sudo apt-get install -y git python3 python3-pip openjdk-17-jdk libncurses5 libtinfo5

Clone CBIcall and record the checkout:

# vm$
git clone https://github.com/CNAG-Biomedical-Informatics/cbicall.git
cd cbicall
git rev-parse HEAD

Install the Python dependencies for the source checkout. The --upgrade flag is important on Ubuntu images because the system jsonschema package can be too old for CBIcall resource-catalog validation:

# vm$
python3 -m pip install --user --upgrade -r requirements.txt
export PATH="$HOME/.local/bin:$PATH"

Prepare the native resource directory:

# vm$
export CBICALL_DATA="$HOME/cbicall-data"
mkdir -p "$CBICALL_DATA"

Download, assemble, verify, and extract the CBIcall-provided resource bundle:

# vm$
python3 scripts/download_cbicall_bundle.py \
--outdir "$CBICALL_DATA" \
--catalog resources/cbicall-resource-catalog.json

If Google Drive throttles or stalls, print the manual download URLs:

# vm$
python3 scripts/download_cbicall_bundle.py \
--outdir "$CBICALL_DATA" \
--catalog resources/cbicall-resource-catalog.json \
--print-manual-download

After copying the listed files into $CBICALL_DATA, finish setup with the command below. This step can take time because it assembles, verifies, and extracts the full resource bundle. On an e2-standard-4 VM with a balanced persistent disk, expect roughly 20-50 minutes once all parts are present; faster disks may shorten this.

# vm$
python3 scripts/download_cbicall_bundle.py \
--outdir "$CBICALL_DATA" \
--catalog resources/cbicall-resource-catalog.json \
--skip-download

Expected output includes the long assembly and extraction steps:

Resource key: cbicall-germline-resources-v1
Assembling data.tar.gz from split parts...
adding data.tar.gz.part-00
adding data.tar.gz.part-01
adding data.tar.gz.part-02
adding data.tar.gz.part-03
adding data.tar.gz.part-04
adding data.tar.gz.part-05
Verifying split archive parts with data.tar.gz.md5...
Checksum OK.
Renamed archive to cbicall-germline-resources-v1.tar.gz.
Extracting cbicall-germline-resources-v1.tar.gz into /home/mrueda/cbicall-data...
Wrote installation manifest: /home/mrueda/cbicall-data/cbicall-resource-installation.json

Resource setup complete.
Set DATADIR to: /home/mrueda/cbicall-data
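The assembly step in that output can be pictured as a plain concatenation followed by a checksum. The sketch below is not the downloader's code: it assumes the parts concatenate in lexical order, and it only prints the MD5 of the result without assuming anything about the format of data.tar.gz.md5. The function name assemble_and_md5 is hypothetical.

```shell
# vm$
# Hypothetical sketch, not the downloader's code: concatenate split parts in
# lexical order (part-00, part-01, ...) into an output file and print the
# MD5 digest of that file.
# Arguments: directory holding the parts, output file path.
assemble_and_md5() {
  parts_dir=$1
  out=$2
  cat "$parts_dir"/data.tar.gz.part-* > "$out" || return 1
  md5sum "$out" | awk '{print $1}'
}

# Example: assemble_and_md5 "$CBICALL_DATA" /tmp/data-check.tar.gz
```

Writing to a separate output path keeps the sketch from interfering with the files the downloader manages.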

Google Drive Quota Recovery

If automatic download stops with a Google Drive message such as Too many users have viewed or downloaded this file recently, keep the files that already downloaded and fetch only the missing shard manually. The downloader skips existing non-empty files.
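To see which shards still need fetching, a small helper can list the expected parts that are absent or empty. This is a sketch: the name missing_parts is hypothetical, and the shard count (part-00 to part-05) is taken from the expected output on this page, so it is an assumption for other bundle versions.

```shell
# vm$
# Hypothetical helper: list expected shards that are absent or empty.
# The shard range part-00..part-05 mirrors the expected output on this page.
missing_parts() {
  for i in 00 01 02 03 04 05; do
    f="$1/data.tar.gz.part-$i"
    [ -s "$f" ] || echo "data.tar.gz.part-$i"
  done
}

# Example: missing_parts "$CBICALL_DATA"
```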

Print the manual URLs on the VM:

# vm$
python3 scripts/download_cbicall_bundle.py \
--outdir "$CBICALL_DATA" \
--catalog resources/cbicall-resource-catalog.json \
--print-manual-download

Download the missing file or files from a browser on your workstation. They will usually land in ~/Downloads. Then copy only the missing parts to the VM resource directory. For example, if data.tar.gz.part-04 and data.tar.gz.part-05 were throttled:

# local$
gcloud compute scp \
~/Downloads/data.tar.gz.part-04 \
~/Downloads/data.tar.gz.part-05 \
"${VM_NAME}:~/cbicall-data/" \
--zone "$ZONE"

If only one part is missing, keep only that file in the gcloud compute scp command.

Resume assembly and validation on the VM without using Google Drive again. This step can take time because it assembles, verifies, and extracts the full resource bundle. On an e2-standard-4 VM with a balanced persistent disk, expect roughly 20-50 minutes once all parts are present; faster disks may shorten this.

# vm$
python3 scripts/download_cbicall_bundle.py \
--outdir "$CBICALL_DATA" \
--catalog resources/cbicall-resource-catalog.json \
--skip-download

Optional: Free Disk Space After Resource Setup

After checksum verification and extraction have completed, the extracted Databases/ and NGSutils/ directories are the files CBIcall needs at runtime. If disk space is tight on the VM, you can remove the downloaded split parts and the assembled compressed archive:

# vm$
du -sh "$CBICALL_DATA"/data.tar.gz.part-* \
"$CBICALL_DATA"/cbicall-germline-resources-v1.tar.gz \
"$CBICALL_DATA"/Databases \
"$CBICALL_DATA"/NGSutils

rm -f "$CBICALL_DATA"/data.tar.gz.part-* \
"$CBICALL_DATA"/data.tar.gz \
"$CBICALL_DATA"/cbicall-germline-resources-v1.tar.gz

Keep Databases/, NGSutils/, cbicall-resource-id.json, cbicall-resource-installation.json, and data.tar.gz.md5.
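A simple guard can reduce the risk of deleting the archives before extraction has produced the runtime files. The sketch below only checks that the directories and manifest named above exist; presence alone does not prove that extraction finished cleanly, and the function name bundle_extracted is hypothetical.

```shell
# vm$
# Hypothetical guard: succeed only if the extracted directories and the
# installation manifest named on this page are present.
bundle_extracted() {
  [ -d "$1/Databases" ] && [ -d "$1/NGSutils" ] &&
    [ -f "$1/cbicall-resource-installation.json" ]
}

# Example: bundle_extracted "$CBICALL_DATA" && echo "extracted files present"
```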

For future runs, the downloader can remove split parts automatically after checksum verification with --remove-parts, but it intentionally keeps the assembled archive unless you delete it yourself.

Point native workflows to the VM resource path:

# vm$
sed -i "s|^DATADIR=.*|DATADIR=${CBICALL_DATA}|" workflows/bash/gatk-3.5/env.sh
sed -i "s|^datadir:.*|datadir: \"${CBICALL_DATA}\"|" workflows/snakemake/gatk-4.6/config.yaml

The GATK 4.6 Bash env.sh is a symlink to the GATK 3.5 Bash env.sh, so one Bash edit is enough. Likewise, the native Nextflow and Cromwell configs are symlinks to the shared GATK 4.6 config.yaml edited above, so that single edit updates the Snakemake, native Nextflow, and Cromwell workflows.
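If you want to see the effect of the Bash edit without touching the file, the same substitution can be run without -i. This is a sketch; the function name preview_datadir is hypothetical.

```shell
# vm$
# Hypothetical dry run: print the DATADIR line that the sed edit would
# produce, without modifying the file (no -i).
# Arguments: env file, new data directory.
preview_datadir() {
  sed "s|^DATADIR=.*|DATADIR=$2|" "$1" | grep '^DATADIR='
}

# Example: preview_datadir workflows/bash/gatk-3.5/env.sh "$CBICALL_DATA"
```

Note that GNU sed -i replaces a symlink with a regular file unless --follow-symlinks is given, which is one reason to apply the in-place edit to the link target, as the commands above do.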

Run the checks and keep STDOUT as evidence:

# vm$
mkdir -p cloud-evidence

{
date -u
git rev-parse HEAD
bin/cbicall validate-resources
bin/cbicall validate-parameters -p examples/input/param.yaml
bin/cbicall test --wes-bash -t 1
} 2>&1 | tee cloud-evidence/cbicall-google-cloud-wes-stdout.txt

The final command should print the run directory, workflow log, run-report.json, run-report.html, launcher log, and the output VCF hash used for comparison. The tee command keeps the same output in cloud-evidence/cbicall-google-cloud-wes-stdout.txt.
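To pull hash-like values out of the captured log for a quick comparison, a generic pattern match is often enough. The sketch below does not assume the exact wording of the CBIcall output; it simply lists distinct 64-character lowercase hex strings, and the function name sha256_like is hypothetical.

```shell
# vm$
# Hypothetical helper: list distinct 64-character lowercase hex strings
# (SHA-256-like values) found in a captured log file.
sha256_like() {
  grep -Eo '[0-9a-f]{64}' "$1" | sort -u
}

# Example: sha256_like cloud-evidence/cbicall-google-cloud-wes-stdout.txt
```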

3. Optional: Run One Public WES Sample

The bundled integration test is the reproducibility check because it has a small known output contract. For reviewer evidence, it can be useful to also run one real public exome sample, for example a 1000 Genomes WES sample. CBIcall does not ship those FASTQs; download a public sample yourself from the data provider, or copy FASTQs that you already have on your workstation.

Input naming

For native single-sample WES, put paired FASTQs in one sample directory. FASTQ names must contain matching _R1_ and _R2_ tokens, for example SRR1596639_ex_S14_L001_R1_001.fastq.gz and SRR1596639_ex_S14_L001_R2_001.fastq.gz. CBIcall writes the run directory under that sample directory.

Create a sample directory on the VM:

# vm$
cd ~/cbicall
mkdir -p examples/input/1000g/HG00103/SRR1596639

If the FASTQs are already on your workstation, copy them to the VM. This example uses the 1000 Genomes sample HG00103, run SRR1596639, stored in a directory next to your local checkout:

# local$
gcloud compute scp \
../1000g/HG00103/SRR1596639/SRR1596639_ex_S14_L001_R1_001.fastq.gz \
../1000g/HG00103/SRR1596639/SRR1596639_ex_S14_L001_R2_001.fastq.gz \
"${VM_NAME}:~/cbicall/examples/input/1000g/HG00103/SRR1596639/" \
--zone "$ZONE"

If your browser saved the FASTQs under ~/Downloads, use those local paths instead. The important point is that the two files land in the same sample directory on the VM, preserving the 1000g/HG00103/SRR1596639 hierarchy.

Alternatively, download the public FASTQs directly on the VM into the same directory using the URLs provided by the data source.
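Once the files are on the VM, the naming rule can be checked before launching the run. The helper below is a sketch: the name check_fastq_pairs is hypothetical, and it assumes the .fastq.gz extension used in the example file names.

```shell
# vm$
# Hypothetical check: for every *_R1_*.fastq.gz in a directory, confirm that
# the file with _R1_ replaced by _R2_ also exists.
# Returns non-zero if no _R1_ file is found or any mate is missing.
check_fastq_pairs() {
  status=0
  for r1 in "$1"/*_R1_*.fastq.gz; do
    [ -e "$r1" ] || { echo "no _R1_ FASTQ in $1" >&2; return 1; }
    # Swap the token in the file name only, not in parent directories.
    r2=$(printf '%s\n' "$r1" | sed 's/_R1_\([^/]*\)$/_R2_\1/')
    [ -e "$r2" ] || { echo "missing mate for $r1" >&2; status=1; }
  done
  return "$status"
}

# Example: check_fastq_pairs examples/input/1000g/HG00103/SRR1596639
```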

Create a parameters YAML for the real-data run:

# vm$
cat > cloud-wes-real.yaml <<YAML
mode: single
pipeline: wes
workflow_backend: bash
software_stack: gatk-4.6
genome: b37
resource: cbicall-germline-resources-v1
cleanup_bam: true
input_dir: ${HOME}/cbicall/examples/input/1000g/HG00103/SRR1596639
YAML

cleanup_bam: true deletes intermediate BAM/BAI files after a successful single-sample WES run, which helps avoid filling the VM disk.

Validate the YAML contract first:

# vm$
bin/cbicall validate-parameters -p cloud-wes-real.yaml

Run the real-data WES job in the background so it keeps running if the SSH connection drops. STDOUT and STDERR are written to the evidence file:

# vm$
cd ~/cbicall
mkdir -p cloud-evidence

nohup bash -lc '
date -u
git rev-parse HEAD
bin/cbicall validate-parameters -p cloud-wes-real.yaml
bin/cbicall -p cloud-wes-real.yaml -t 4
' > cloud-evidence/cbicall-google-cloud-public-wes-stdout.txt 2>&1 &

Follow the run without attaching the job to the terminal:

# vm$
tail -f cloud-evidence/cbicall-google-cloud-public-wes-stdout.txt

Check whether CBIcall is still running:

# vm$
ps -ef | grep '[c]bicall'
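An alternative to scanning the process list is to test a saved process ID. The sketch below assumes you stored the value of $! in a file of your choice immediately after launching the background job; that file and the function name is_running are hypothetical.

```shell
# vm$
# Hypothetical helper: succeed if a process with the given PID exists and
# can be signalled by the current user (signal 0 sends nothing).
is_running() {
  kill -0 "$1" 2>/dev/null
}

# Example, assuming the PID was saved right after the nohup command:
#   is_running "$(cat cloud-evidence/public-wes.pid)" && echo "still running"
```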

This real-data run is not compared against the bundled integration-test fixture. It should be reported as execution evidence: CBIcall validated the YAML contract, resolved the workflow and resource bundle, completed the WES pipeline, and wrote run-report.json, run-report.html, logs, and VCF fingerprints for later audit or compare-runs checks.

4. Keep the Evidence

If you did not already use the captured check block in step 2, capture STDOUT while running the checks. This keeps the exact command output, including the normalized VCF hash and the paths to the generated run artifacts:

# vm$
mkdir -p cloud-evidence

{
date -u
git rev-parse HEAD
bin/cbicall validate-resources
bin/cbicall validate-parameters -p examples/input/param.yaml
bin/cbicall test --wes-bash -t 1
} 2>&1 | tee cloud-evidence/cbicall-google-cloud-wes-stdout.txt

The test --wes-bash output prints the run directory, workflow log, run-report.json, run-report.html, launcher log, and contract fixture. Archive STDOUT plus the CBIcall run directories created under examples/input:

# vm$
{
printf '%s\n' cloud-evidence examples/input/param.yaml
test -f cloud-wes-real.yaml && printf '%s\n' cloud-wes-real.yaml
find examples/input -type d -name 'cbicall_*' -prune -print
} | sort -u > cloud-evidence/cbicall-evidence-files.txt

tar -czf cbicall-google-cloud-wes-evidence.tar.gz \
--files-from cloud-evidence/cbicall-evidence-files.txt

The find command captures run directories such as examples/input/CNAG999_exome/CNAG99901P_ex/cbicall_* and, if you ran the public WES sample, examples/input/1000g/HG00103/SRR1596639/cbicall_*.

Copy the evidence archive back to your local machine from a local terminal:

# local$
gcloud compute scp \
"${VM_NAME}:~/cbicall/cbicall-google-cloud-wes-evidence.tar.gz" \
. \
--zone "$ZONE"
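Because the next step deletes the VM, it is worth confirming that the local copy matches the archive on the VM. A minimal sketch is to print the SHA-256 digest on both sides and compare the values by eye. The function name file_sha256 is hypothetical, and it assumes sha256sum is available on both machines (on macOS, shasum -a 256 is the usual substitute).

```shell
# Hypothetical helper: print only the SHA-256 digest of a file, so the value
# computed on the VM can be compared with the value computed locally.
file_sha256() {
  sha256sum "$1" | awk '{print $1}'
}

# vm$    file_sha256 ~/cbicall/cbicall-google-cloud-wes-evidence.tar.gz
# local$ file_sha256 ./cbicall-google-cloud-wes-evidence.tar.gz
```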

5. Clean Up

Deleting the VM removes the instance and its boot disk contents, including the CBIcall checkout, downloaded resources, FASTQs, run directories, and evidence files. Copy any evidence archive you want to keep back to your local machine before running this command.

From your local machine:

# local$
gcloud compute instances delete "$VM_NAME" --zone "$ZONE"