Quick Start `bff-tools`

This guide provides a one-page cheat sheet for common commands:

# Display help for the tool
bin/bff-tools --help

# Display the full documentation (man page)
bin/bff-tools --man

Genetic Data Interpretation Disclaimer

This tool provides research-based annotations of genetic data from SNP microarray formats (e.g., 23andMe) and VCF (Variant Call Format) files. It is intended for research use only and is not a medical device. It does not provide medical or clinical advice.

🩺 Do not use results for medical decisions. Always consult a qualified healthcare professional.
😰 Results may cause emotional or psychological distress. You may learn about increased risks for serious health conditions.
🔬 Genetic data and interpretations have limitations. Not all variants are covered, and scientific understanding continues to evolve.
🔐 You are responsible for safeguarding your genetic data. Use caution when storing or sharing results; privacy or legal implications may apply.
⚡ Use at your own risk. The authors assume no responsibility for how the results are interpreted or used.

By using this tool, you confirm that you understand and accept these terms.

# Convert VCF to BFF
bin/bff-tools vcf -i test/vcf/test_1000G.vcf.gz -p test/vcf/param.yaml

# ✨ Convert TSV (SNP microarray) to BFF
bin/bff-tools tsv -i test/tsv/input.txt.gz -p test/tsv/param.yaml

# Validate metadata and convert to BFF
mkdir bff_out
bin/bff-tools validate -i utils/bff_validator/Beacon-v2-Models_template.xlsx --out-dir bff_out

Usage

Full Usage | Notes on bff-tools validate

NAME

bff-tools: A unified command-line toolkit for working with Beacon v2 Models data. It allows users to annotate and convert VCF/TSV files into the genomicVariations entity using the Beacon-Friendly Format (BFF), validate metadata files (XLSX or JSON) against Beacon v2 schema definitions and load BFF-formatted data into a MongoDB instance.

This tool is part of the beacon2-cbi-tools repository and is designed to support Beacon v2 data ingestion pipelines, metadata validation workflows, and federated data sharing initiatives.

SYNOPSIS

bff-tools <mode> [-arguments] [-options]

Mode:
* vcf
     -i | --input <file.vcf>        Requires a VCF file (gzipped or uncompressed)
                                    (May also use a parameters file)

* tsv
     -i | --input <file.tsv>        Requires a SNP microarray TSV file (e.g., from 23andMe)
                                    (May also use a parameters file)

* load
                                    (Requires a parameters file specifying BFF files)

* full (vcf + load)
           or
       (tsv + load)
     -i | --input <file>            Requires a VCF or TSV file
                                    (May also use a parameters file)

   Options [vcf|tsv|load|full]
     -c | --config <file>           Requires a configuration file
     -p | --param <file>            Parameters file (optional; required for load)
     -projectdir-override <path>    Custom project directory path (overrides config)
     -t | --threads <number>        Number of threads (optional, mainly for VCF)

* validate
   Options [validate]
     -i | --input <file(s)>         One or more XLSX/JSON metadata files
     -s | --schema-dir <directory>  Directory containing JSON schemas
     -o | --out-dir <directory>     Output directory for validated data
     -gv                            Set this option if you want to process <genomicVariations> entity
     -ignore-validation             Writes JSON collection regardless of results from validation against JSON schemas (AYOR!)

    Experimental:
     -gv-vcf                        Set this option to read <genomicVariations.json> from <beacon vcf> (with one document per line)

   Generic Options:
     -h                             Brief help message
     -man                           Full documentation
     -v                             Display version information
     -debug <level>                 Print debugging information (1 to 5)
     -verbose                       Enable verbosity
     -nc | --no-color               Do not print colors to STDOUT
     -ne | --no-emoji               Do not print emojis to STDOUT

DESCRIPTION

`bff-tools`

bff-tools is a command-line toolkit with five operational modes (subcommands) for working with Beacon v2 data.

HOW TO RUN `bff-tools`

This script supports five modes: vcf, tsv, load, full, and validate.

* Mode vcf

Annotates a gzipped (or uncompressed) VCF file and serializes it into the Beacon-Friendly Format (BFF) as genomicVariationsVcf.json.gz.

* Mode tsv

Annotates a gzipped (or uncompressed) SNP microarray text file and serializes it into the Beacon-Friendly Format (BFF) as genomicVariationsVcf.json.gz.

* Mode load

Loads BFF-formatted JSON files - including metadata and genomic variations - into a MongoDB instance.

* Mode full

Combines vcf (or tsv) and load: it processes a VCF or TSV file and ingests the resulting data into MongoDB.

* Mode validate

Validates metadata files (XLSX or JSON) against the Beacon v2 schema definitions and serializes them into BFF JSON collections.
Note: This mode uses a separate internal script and does not require a parameters or configuration file.

To perform these tasks, you may need:

* A VCF file or a TSV file (for modes: vcf, tsv and full)

* A parameters file (optional)

YAML file with job-specific values and metadata file references. Recommended for structured processing.

* BFF JSON files (required for modes: load and full)

See Beacon-Friendly Format (BFF) for a detailed explanation.

* Metadata files (XLSX or JSON) (for mode: validate)

You can start with the provided Excel template and use the --gv or --ignore-validation flags if needed.

* Threads (only for vcf, tsv and full modes)

You can set the number of threads using -t. However, since SnpEff doesn't parallelize efficiently, it's best to use -t 1 and distribute the work (e.g., by chromosome) using GNU parallel or the included queue system.

bff-tools will create an independent project directory (projectdir) and store all the information it needs there. Thus, many concurrent calculations are supported. Note that bff-tools will treat your data as read-only (i.e., it will not modify your original files).
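
The advice on threads (use -t 1 and split the work by chromosome) could be scripted roughly as follows. This is a minimal Python sketch and not part of bff-tools: the chr<N>.vcf.gz file naming, the log file names and the helper names are assumptions; only the vcf mode and the -t and -i arguments come from this page.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumption: one VCF per chromosome, named chr<N>.vcf.gz, in the current directory.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y"]


def build_command(chrom, tool="bin/bff-tools"):
    """Return the argument list for one single-threaded vcf-mode job."""
    return [tool, "vcf", "-t", "1", "-i", f"chr{chrom}.vcf.gz"]


def run_job(chrom):
    """Run one job, sending STDOUT and STDERR to chr<N>.log; return its exit code."""
    with open(f"chr{chrom}.log", "w") as log:
        result = subprocess.run(build_command(chrom), stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


def run_all(max_jobs=4):
    """Run every chromosome, at most max_jobs at a time; return exit codes by chromosome."""
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return dict(zip(CHROMOSOMES, pool.map(run_job, CHROMOSOMES)))
```

Calling run_all() from the repository root would start the jobs. Because each job gets its own project directory, the runs should not interfere with one another.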

Annex: Parameters file (YAML)

Example for vcf mode:

---
genome: hs37 # default hg19
annotate: true # default true
bff2html: true # default false

Example for tsv mode:

---
genome: b37 # default hg19
annotate: true # default true
bff2html: true # default false
sampleid: my_sample_id_01 # default '23andme_1'

Example for load mode:

---
bff:
  metadatadir: .
  analyses: analyses.json
  biosamples: biosamples.json
  cohorts: cohorts.json
  datasets: datasets.json
  individuals: individuals.json
  runs: runs.json
  # Note that genomicVariationsVcf is not affected by <metadatadir>
  genomicVariationsVcf: beacon_XXXX/vcf/genomicVariationsVcf.json.gz
projectdir: my_project
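
To make the note about metadatadir concrete, here is a minimal Python sketch of how the entries of the bff block could be turned into file paths before a load. It is an illustration only, not the tool's own code: the function names are hypothetical, the input is assumed to be the already-parsed bff mapping, and the "." fallback for metadatadir is an assumption.

```python
import os

METADATA_KEYS = ("analyses", "biosamples", "cohorts", "datasets", "individuals", "runs")


def resolve_bff_paths(bff):
    """Map each BFF entry to the path it refers to.

    Metadata files are looked up under 'metadatadir'; genomicVariationsVcf is
    taken as written, matching the note in the example above.
    """
    metadatadir = bff.get("metadatadir", ".")  # assumption: default to the current dir
    paths = {}
    for key in METADATA_KEYS:
        if key in bff:
            paths[key] = os.path.join(metadatadir, bff[key])
    if "genomicVariationsVcf" in bff:
        paths["genomicVariationsVcf"] = bff["genomicVariationsVcf"]
    return paths


def missing_files(bff):
    """Return the names of the entries whose files do not exist on disk."""
    return sorted(key for key, path in resolve_bff_paths(bff).items()
                  if not os.path.exists(path))
```

Running missing_files on the parsed bff block before calling load mode gives a quick list of entries to fix.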

Example for full mode:

---
genome: hs37 # default hg19
annotate: true # default true
bff:
  metadatadir: .
  analyses: analyses.json
  runs: runs.json
projectdir: my_project

Please find below a detailed description of all parameters (alphabetical order):

annotate

When the annotate parameter is set to true (default), the tool will perform annotation on the provided VCF file. This process involves running snpEff to enrich the VCF with annotation data by leveraging databases such as dbNSFP, ClinVar, and COSMIC. In this mode, the tool will generate and populate the ANN fields based on the analysis.

If the annotate parameter is set to false, the tool assumes that the VCF file has already been annotated (i.e., it already contains the ANN fields). In this case, it will skip the annotation step and directly parse the existing ANN fields. If you choose this route, please make sure to edit the file lib/internal/complete/config.yaml, which lists the database versions, and replace them with your own values.

One way to use annotate: false is to run bff2html without having to re-annotate the VCF with SnpEff.

bff

Location of the Beacon Friendly Format JSON files.

bff2html

Set bff2html to true to create HTML for the BFF Genomic Variations Browser.

center

Experimental feature. Not used for now.

datasetid

A unique identifier for the dataset present in the input VCF. Default value is 'id_1'.

genome

Your reference genome.

Accepted values: hg19, hg38, hs37, and b37 (b37 will be interpreted as hs37).

If you used GATK's GRCh37, set it to hg19.

Not supported: NCBI36/hg18, NCBI35/hg17, NCBI34/hg16, hg15 and older.

organism

Experimental feature. Not used for now.

projectdir

The prefix for the project directory name (e.g., 'cancer_sample_001'). Note that it can also contain a path (e.g., /workdir/cancer_sample_001). The script will automatically add a unique identifier to each job.

sampleid

To be used in tsv mode. A string to name your sample, which will be used as the sample ID in the VCF.

technology

Experimental feature. Not used for now.
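
As an illustration of the rules for the genome parameter described above (default hg19, b37 read as hs37, older builds rejected), a pre-flight check could look like the following Python sketch. The function name is hypothetical and the code is not taken from bff-tools.

```python
ACCEPTED_GENOMES = {"hg19", "hg38", "hs37", "b37"}


def normalize_genome(value=None):
    """Return the genome name to use: default hg19, b37 interpreted as hs37."""
    if value is None:
        return "hg19"
    if value not in ACCEPTED_GENOMES:
        raise ValueError(
            f"unsupported genome {value!r}; expected one of {sorted(ACCEPTED_GENOMES)}"
        )
    return "hs37" if value == "b37" else value
```

Checking the value before a long annotation run avoids discovering a typo in the parameters file only after the job has started.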

Examples:

$ bin/bff-tools vcf -i input.vcf.gz

$ bin/bff-tools vcf -i input.vcf.gz -p param.yaml -projectdir-override beacon_exome_id_123456789

$ bin/bff-tools load -p param_file  # MongoDB load only

$ bin/bff-tools full -t 1 -i input.vcf.gz -p param_file > log 2>&1

$ bin/bff-tools full -t 1 -i input.vcf.gz -p param_file -c config_file > log 2>&1

$ bin/bff-tools validate -i my_data.xlsx -o outdir

$ nohup $path_to_beacon/bin/bff-tools full -i input.vcf.gz -verbose

$ parallel "bin/bff-tools vcf -t 1 -i chr{}.vcf.gz  > chr{}.log 2>&1" ::: {1..22} X Y

NB: If you don't want colors in the output, use the flag --no-color. If you did not use the flag and want to get rid of the colors in your printed log file, use this command to strip the ANSI color codes:

perl -pe 's/\x1b\[[0-9;]*[mG]//g'
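
If Perl is not at hand when post-processing a log, the same substitution can be written in Python. This is a small sketch using the pattern of the one-liner above; the function name is an assumption.

```python
import re

# Same pattern as the Perl one-liner: ESC, '[', digits or semicolons, then 'm' or 'G'.
ANSI_PATTERN = re.compile(r"\x1b\[[0-9;]*[mG]")


def strip_ansi(text):
    """Remove ANSI color sequences from a string."""
    return ANSI_PATTERN.sub("", text)
```

For a whole log file, apply strip_ansi to each line while copying it to a new file.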

Note: The script creates log files for all the processes. For instance, when running in vcf mode you can check via tail -f command:

$ tail -f <your_job_id>/vcf/run_vcf2bff.log

WHAT IS THE BEACON FRIENDLY FORMAT (BFF)

The Beacon Friendly Format is a data exchange format consisting of up to 7 JSON files (JSON arrays) that match the 7 schemas from Beacon v2 Models.

Six files correspond to metadata (analyses.json, biosamples.json, cohorts.json, datasets.json, individuals.json, runs.json) and one corresponds to variations (genomicVariations.json).

Normally, the bff-tools script is used to create the genomicVariations JSON file. The other 6 files are created with this utility (part of the distribution). See instructions here.

Once we have all seven files, then we can proceed to load the data into MongoDB.
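
Before loading, it can help to confirm which of the seven files are present and that each one really is a JSON array. The Python sketch below is one way to do that under stated assumptions: the files are uncompressed and named as listed above, and each file is small enough to read into memory (a large genomicVariations file may not be). The function name is hypothetical.

```python
import json
import os

BFF_FILES = (
    "analyses.json", "biosamples.json", "cohorts.json", "datasets.json",
    "individuals.json", "runs.json", "genomicVariations.json",
)


def check_bff_dir(directory):
    """Return {filename: status} for the seven BFF files.

    Status is 'missing', 'not a JSON array', or the number of documents found.
    """
    report = {}
    for name in BFF_FILES:
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            report[name] = "missing"
            continue
        with open(path, encoding="utf-8") as handle:
            try:
                data = json.load(handle)
            except json.JSONDecodeError:
                data = None  # unparsable content is reported like a non-array
        report[name] = len(data) if isinstance(data, list) else "not a JSON array"
    return report
```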

COMMON ERRORS: SYMPTOMS AND TREATMENT

* Perl: 
        * Execution errors:
          - Error with YAML::XS
            Solution: Make sure the YAML (config.yaml or parameters file) is well formatted (e.g., space after param:' ').

* Bash: 
        (Possible errors that can happen when the embedded Bash scripts are executed)
        * bcftools errors: bcftools is nit-picky about VCF fields and nomenclature of contigs/chromosomes in reference genome
               => Failed to execute: beacon_161855926405757/run_vcf2bff.sh
                  Please check this file beacon_161855926405757/run_vcf2bff.log
          - Error: 
                 # Running bcftools
                 [E::faidx_adjust_position] The sequence "22" was not found
            Solution: Make sure you have set the correct genome (e.g., hg19, hg38 or hs37) in parameters_file.
                      In this case bcftools was expecting to find 22 in the <*.fa.gz> file from reference genome, but found 'chr22' instead.
                Tips:
                     - hg{19,38} use 'chr' in chromosome naming (e.g., chr1)
                     - hs37 does not use 'chr' in chromosome naming (e.g., 1)

           - Error
                # Running bcftools
                INFO field IDREP only contains 1 field, expecting 2
             Solution: Please fix the VCF INFO field manually (or get rid of problematic fields with bcftools)
                       e.g., bcftools annotate -x INFO/IDREP input.vcf.gz | gzip > output.vcf.gz
                             bcftools annotate -x INFO/MLEAC,INFO/MLEAF,FMT/AD,FMT/PL input.vcf.gz  | gzip > output.vcf.gz

  NB: The bash scripts can be executed "manually" in the beacon_XXX dir. You must provide the 
      input vcf as an argument. This is a good option for debugging.
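
The contig-naming error above can often be anticipated by looking at the CHROM column of the input before starting a job. The following Python sketch does that; it is not part of bff-tools, the function names are assumptions, and it only tells 'chr'-style naming (hg19, hg38) from plain naming (hs37). It cannot tell hg19 from hg38.

```python
import gzip


def open_vcf(path):
    """Open a VCF as text, whether gzipped or not (decided by the gzip magic bytes)."""
    with open(path, "rb") as handle:
        gzipped = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if gzipped else open(path, "rt")


def contig_style(path, max_records=1000):
    """Return 'chr', 'no-chr', 'mixed' or 'empty' based on the first records."""
    styles = set()
    seen = 0
    with open_vcf(path) as vcf:
        for line in vcf:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            chrom = line.split("\t", 1)[0]
            styles.add("chr" if chrom.startswith("chr") else "no-chr")
            seen += 1
            if seen >= max_records:
                break
    if not styles:
        return "empty"
    return styles.pop() if len(styles) == 1 else "mixed"
```

A result of 'no-chr' suggests hs37 in the parameters file, while 'chr' suggests hg19 or hg38.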

KNOWN ISSUES

* Some Linux distributions do not include perldoc and thus Perl's library Pod::Usage will complain.
  Please, install it (sudo apt install perl-doc) if needed.

CITATION

The author requests that any published work that utilizes B2RI cites the following reference:

Rueda, M, Ariosa R. "Beacon v2 Reference Implementation: a toolkit to enable federated sharing of genomic and phenotypic data". Bioinformatics, btac568, doi.org/10.1093/bioinformatics/btac568

AUTHOR

Written by Manuel Rueda, PhD. Info about CNAG can be found at https://www.cnag.eu

Credits:

* Sabela De La Torre (SDLT) created a Bash script for Beacon v1 to parse VCF files (https://github.com/ga4gh-beacon/beacon-elixir).
* Toshiaki Katayama re-implemented the Beacon v1 script in Ruby.
* Later, Dietmar Fernandez-Orth (DFO) modified the Ruby script for Beacon v2 (https://github.com/ktym/vcftobeacon) and added post-processing with R, from which I borrowed ideas to implement vcf2bff.pl.
* DFO for usability suggestions and for creating bcftools/snpEff commands.
* Roberto Ariosa for help with MongoDB implementation.
* Mauricio Moldes helped with the containerization.

COPYRIGHT and LICENSE

This Perl file is copyrighted. See the LICENSE file included in this distribution.

Notes on bff-tools validate

To run bff-validator you will need:

Input file:

You have two options:

A) An XLSX file consisting of multiple sheets. A template version of this file is provided with this installation.

Currently, the file consists of 7 sheets that match the Beacon v2 Models.

Please use the flag --gv should you want to validate the data in the sheet <genomicVariations>.

NB: If you have multiple CSV files instead of an XLSX file, you can use the included utility csv2xlsx, which joins all CSVs into a single XLSX.
```
$ ./csv2xlsx *csv -o out.xlsx
```
B) A set of JSON (array) files that follow the Beacon Friendly Format. The files MUST be uncompressed and named <analyses.json>, <biosamples.json>, etc.

Beacon v2 Models (with JSON pointers dereferenced):

You should have them in the deref_schemas directory.

Examples:

 $ ./bff-validator -i file.xlsx

 $ $path/bff-validator -i file.xlsx -o my_bff_outdir

 $ $path/bff-validator -i my_bff_in_dir/*json -s deref_schemas -o my_bff_out_dir

 $ $path/bff-validator -i file.xlsx --gv --schema-dir deref_schemas --out-dir my_bff_out_dir

TIPS ON FILLING OUT THE EXCEL TEMPLATE

* Please, before filling in any field, check out the provided template for ../../CINECA_synthetic_cohort_EUROPE_UK1/*xlsx
* The script accepts Unicode characters (encoded with utf-8)
* Header fields: 
   - Dots ('.') represent objects: 
       Examples (values):
         1 - foo
         2 - NCIT:C20197
         3 - true # booleans
         4 - ["foo","bar","baz"] # arrays are also allowed
   - Underscores ('_') represent arrays: 
       * Up to 1D (e.g., individuals->measures_assayCode.id) the values are comma separated
          Examples (values):
           1 - measures_assayCode.id
               LOINC:35925-4,LOINC:3141-9,LOINC:8308-9
              measures_assayCode.label
               BMI,Weight,Height-standing

       * Others - Values for array fields start with '[' and end with ']'
          Examples (values): 
           1 - [{"foo":{"bar": "baz"}}]
           2 - ["foo","bar","baz"]
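
To illustrate the header conventions above, the following Python sketch expands one spreadsheet row into a nested record. It is an interpretation of the rules on this page, not bff-validator's own code: it covers dotted headers and the 1D comma-separated array case only, and the headers "sex.id" and "tags" in the sample row are illustrative rather than taken from the template.

```python
import json


def parse_value(cell):
    """Decode cells that look like JSON arrays or booleans; keep the rest as strings."""
    text = cell.strip()
    if text in ("true", "false") or (text.startswith("[") and text.endswith("]")):
        return json.loads(text)
    return text


def set_path(target, dotted, value):
    """Store value under a dotted path, creating nested objects as needed."""
    keys = dotted.split(".")
    for key in keys[:-1]:
        target = target.setdefault(key, {})
    target[keys[-1]] = value


def expand_row(row):
    """Turn {header: cell} into a nested record (1D comma-separated arrays only)."""
    record = {}
    for header, cell in row.items():
        if "_" in header:
            array_name, sub_path = header.split("_", 1)
            items = record.setdefault(array_name, [])
            for index, value in enumerate(cell.split(",")):
                if index == len(items):
                    items.append({})
                set_path(items[index], sub_path, value.strip())
        else:
            set_path(record, header, parse_value(cell))
    return record


# Illustrative row; "sex.id" and "tags" are example headers (assumptions).
row = {
    "sex.id": "NCIT:C20197",
    "tags": '["foo","bar","baz"]',
    "measures_assayCode.id": "LOINC:35925-4,LOINC:3141-9,LOINC:8308-9",
    "measures_assayCode.label": "BMI,Weight,Height-standing",
}
print(json.dumps(expand_row(row), indent=2))
```

In this reading, the two measures_ columns are zipped position by position, so every comma-separated list under the same array name should have the same number of items.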

COMMON ERRORS AND SOLUTIONS

* Error message: , or } expected while parsing object/hash, at character offset 574 (before "]")
  Solution: Make sure every opening brace/bracket has a matching closing one.

NB: You can use the flag --ignore-validation and check the temporary files at -o directory.
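
A cell that triggers the parsing error above can be checked on its own before re-running the validator. The Python sketch below uses the standard json module for this; the function name is an assumption, and the wording of the message differs from the one printed by the Perl tool.

```python
import json


def check_json_cell(cell):
    """Return None if the cell parses as JSON, else a short description of the error."""
    try:
        json.loads(cell)
    except json.JSONDecodeError as err:
        return f"{err.msg} at character offset {err.pos}"
    return None
```

Pasting the suspicious cell value into check_json_cell points to the offset where a brace or bracket is unbalanced.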
