Frequently Asked Questions
Are Beacon v2 genomicVariations.variation.location.interval.{start,end} coordinates 0-based or 1-based?
They are 0-based.
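As a minimal sketch of what this means in practice, the Python snippet below converts a 1-based VCF POS and REF allele into a 0-based interval. It assumes half-open semantics (start inclusive, end exclusive), and the function name `vcf_to_interval` is hypothetical; check the Beacon v2 schema for the exact convention used by your deployment.

```python
def vcf_to_interval(pos: int, ref: str) -> dict:
    """Convert a 1-based VCF POS/REF pair into a 0-based interval.

    Assumption: the interval is half-open, so end - start == len(ref).
    """
    if pos < 1:
        raise ValueError("VCF POS is 1-based and must be >= 1")
    if not ref:
        raise ValueError("REF must be non-empty")
    start = pos - 1          # shift from 1-based to 0-based
    end = start + len(ref)   # exclusive end
    return {"start": start, "end": end}


# A SNV reported at VCF POS 100 maps to start=99, end=100 under this assumption.
snv = vcf_to_interval(100, "A")
print(snv)
```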
last change 2025-03-23 by Manuel Rueda
I have an error when attempting to use bff-tools vcf, what should I do?
- In 9 out of 10 cases, the error comes from BCFtools and is related to the reference genome specified in the parameters file. The options are typically hg19, hg38 (which use chr prefixes), and hs37, b37 (which do not). Ensure that your VCF’s contigs match the FASTA file or modify your config.yaml accordingly.
- Additionally, BCFtools may complain about the number of fields (for example, in the INFO field). In such cases, you can try fixing the VCF manually or use:
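For the first point, a quick way to spot a contig-naming mismatch is to compare the contig names declared in the VCF header with the sequence names in the FASTA. The Python sketch below is only an illustration and is not part of bff-tools: the function names are hypothetical, and it assumes the VCF header carries `##contig` lines, which not every VCF does.

```python
import re

CONTIG_RE = re.compile(r"^##contig=<ID=([^,>]+)")


def vcf_contigs(header_lines):
    """Return contig IDs declared in ##contig=<ID=...> header lines."""
    names = []
    for line in header_lines:
        match = CONTIG_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


def fasta_contigs(fasta_lines):
    """Return sequence names taken from FASTA '>' header lines."""
    names = []
    for line in fasta_lines:
        if line.startswith(">") and line[1:].strip():
            names.append(line[1:].split()[0])
    return names


def missing_contigs(header_lines, fasta_lines):
    """List VCF contigs that do not appear in the FASTA."""
    known = set(fasta_contigs(fasta_lines))
    return [name for name in vcf_contigs(header_lines) if name not in known]


# Toy input: the VCF uses chr prefixes, the FASTA does not.
vcf_header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=1000>",
    "##contig=<ID=chr2>",
]
fasta = [">1 toy sequence", "ACGT", ">2", "ACGT"]
mismatches = missing_contigs(vcf_header, fasta)
print(mismatches)  # both contigs are reported because of the chr prefix
```

An empty list means every declared contig was found in the FASTA; a full list like the one above usually points to a prefix mismatch.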
last change 2025-03-23 by Manuel Rueda
Can I use SINGLE-SAMPLE and MULTI-SAMPLE VCFs?
Yes, you can use both. MongoDB allows for incremental loads, so single-sample VCFs are acceptable (you don’t need to merge them into a multisample VCF). The connection between samples and variants is maintained in the datasets collection (or cohorts).
last change 2025-03-23 by Manuel Rueda
Can I use genomic VCF (gVCF)?
Yes, but first you will need to convert them to a standard VCF. For example, you can use:
We are interested only in positions with ALT alleles. A “quick and dirty” solution with common Linux tools is:
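The idea of keeping only positions with ALT alleles can also be sketched in Python. This is an independent illustration, not the command used by the tools: the function names are hypothetical, and it assumes that reference-only gVCF records carry `.`, `<NON_REF>` (GATK style) or `<*>` (BCFtools style) as their only ALT value.

```python
def has_alt_allele(record_line: str) -> bool:
    """Return True if a VCF record carries at least one real ALT allele."""
    fields = record_line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return False  # blank or malformed line
    alts = fields[4].split(",")
    return any(alt not in (".", "<NON_REF>", "<*>") for alt in alts)


def keep_variant_lines(lines):
    """Yield header lines plus the records that have a real ALT allele."""
    for line in lines:
        if line.startswith("#") or has_alt_allele(line):
            yield line


# Toy gVCF: one reference block and one true variant.
gvcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr1\t100\t.\tA\t<NON_REF>\n",
    "chr1\t200\t.\tG\tT,<NON_REF>\n",
]
kept = list(keep_variant_lines(gvcf))
print("".join(kept), end="")
```

This sketch only filters records. It does not remove the symbolic allele from the ALT column or adjust per-sample fields, so a dedicated conversion tool remains the safer route.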
last change 2025-03-23 by Manuel Rueda
Can I use SNP microarray data files such as those from 23andMe?
Yes, starting from version 2.0.10 you can use a TSV/TXT file. If you use the parameter:
Example with test data:
last change 2025-05-12 by Manuel Rueda
In bff-tools vcf mode, why are we re-annotating VCFs? Can I use my own annotations?
The goal of re-annotation is to ensure consistency across the community. To create the genomicVariationsVcf.json.gz BFF, we parse an annotated VCF, which guarantees that the essential fields are present. Any previous annotations will be discarded. This approach has been instrumental in over 1,000 deployments for testing Beacon v2 API implementations.
That said, if you know what you're doing and your VCF already contains the essential ANN fields, you can disable annotation by setting:
in the parameters file. Do it at your own risk.
If you have internal annotations of value, you can add alternative genomic variations by completing the corresponding tab in the provided XLSX. The resulting file (genomicVariations.json), together with genomicVariationsVcf.json.gz, will be loaded into the MongoDB collection genomicVariations. See this tutorial for more details.
last change 2025-03-23 by Manuel Rueda
Is there an alternative to the Excel file for generating metadata/phenotypic data?
Yes. You can use CSV or JSON files directly as input for the bff-tools validate (a.k.a., bff-validator) utility. For detailed instructions, refer to the bff-validator manual.
Alternatively, if your clinical data is in REDCap, OMOP CDM, Phenopackets v2, or raw CSV format, consider using the Convert-Pheno tool.
last change 2025-03-23 by Manuel Rueda
bff-tools validate (a.k.a., bff-validator) specification mismatches
By default, bff-validator validates your data against the schemas bundled with your beacon2-cbi-tools version. If you encounter warnings (e.g., objects matching multiple possibilities in oneOf keywords), simply use the flag --ignore-validation when generating your .json files.
last change 2025-03-23 by Manuel Rueda
Do you load all variations present in a VCF file?
Yes, we do not apply filters (e.g., based on FILTER or QUAL fields) when loading variations, although we store those values in case they are needed later.
last change 2025-03-23 by Manuel Rueda
Do you have any recommendations on how to speed up the data ingestion process?
Metadata/phenoclinic data ingestion is typically fast (processing thousands to tens of thousands of values in seconds or minutes). However, VCF processing (especially for WGS data with >100M variants) can be slower. Consider the following:
- Split your VCF by chromosome:
  - Using community tools:
  - Alternatively:
  - Or with Linux tools:
- Use parallel processing to submit jobs.
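To make the per-chromosome split concrete, here is a minimal Python sketch that groups VCF records by their CHROM column and repeats the header in each group. It is an illustration only: the function name `split_by_chromosome` is hypothetical, and the example holds all lines in memory, which is fine for a demonstration but not for WGS-sized files.

```python
from collections import defaultdict


def split_by_chromosome(lines):
    """Group VCF records by CHROM, repeating the header in every group."""
    header = []
    records = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            header.append(line)          # meta-information and column header
        elif line.strip():
            chrom = line.split("\t", 1)[0]
            records[chrom].append(line)
    # Each group is a valid stand-alone VCF body: header first, then records.
    return {chrom: header + recs for chrom, recs in records.items()}


# Toy VCF with records on two chromosomes.
vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr1\t100\t.\tA\tC\n",
    "chr2\t50\t.\tG\tT\n",
    "chr1\t300\t.\tT\tG\n",
]
parts = split_by_chromosome(vcf)
for chrom, chunk in parts.items():
    print(chrom, len(chunk))
```

Each resulting chunk can then be written to its own file and processed as a separate job. For real data, a streaming approach or an index-aware tool would be preferable to loading every record at once.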
last change 2025-03-23 by Manuel Rueda
Can I use parallel jobs to perform data ingestion into MongoDB?
Yes, you can use parallel jobs; however, note that it may slightly slow down the ingestion process.
last change 2025-03-23 by Manuel Rueda
When performing incremental uploads, do I need to re-index MongoDB?
No. Indexes are created during the first data load and are updated automatically with each insert operation. Subsequent re-indexing attempts are discarded (the operation is idempotent).
last change 2025-03-23 by Manuel Rueda
Where can I get the full WGS VCF for the CINECA synthetic cohort EUROPE UK1?
For full WGS data (≈20 GB for 2,504 synthetic individuals), request access and download from the EGA. See this document for details.
last change 2025-03-23 by Manuel Rueda
Are beacon2-cbi-tools free?
Yes, it is free and open source. The data ingestion tools are released under the GNU General Public License v3.0, and the included CINECA_synthetic_cohort_EUROPE_UK1 dataset is under a CC-BY license.
last change 2025-03-23 by Manuel Rueda
Should I update to the latest version?
Yes. We recommend checking our GitHub repository (beacon2-cbi-tools) for the latest updates.
last change 2025-03-23 by Manuel Rueda
Do you send any personal information to your servers?
No. All files are created locally, and we do not send usage information (or anything else) over the internet.
But please, if you use this software, cite it.
last change 2025-05-29 by Manuel Rueda
How do I cite beacon2-cbi-tools?
You can cite the Beacon v2 Reference Implementation paper. Thx!
Citation
Rueda M, Ariosa R. "Beacon v2 Reference Implementation: a toolkit to enable federated sharing of genomic and phenotypic data". Bioinformatics, btac568, DOI.