Frequently Asked Questions

General

What does Convert-Pheno do?

Convert-Pheno facilitates the conversion of clinical and phenotypic data between commonly used formats, such as the GA4GH standards Beacon v2 Models and Phenopackets v2, to enable secure data sharing and discovery through semantic interoperability.
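
For reference, a minimal command-line invocation looks like the following (illustrative only; flags such as -ipxf and -obff are taken from the Convert-Pheno documentation, but check convert-pheno --help for your installed version):

convert-pheno -ipxf phenopackets.json -obff individuals.json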

last change 2023-01-05 by Manuel Rueda
Is Convert-Pheno free?

Yes. See the license.

last change 2023-01-04 by Manuel Rueda
Can I use Convert-Pheno in production software?

It's still in beta, so expect some bumps ahead.

last change 2023-06-27 by Manuel Rueda
If I use Convert-Pheno to convert my data to Beacon v2 Models, does this mean I have a Beacon v2?

I am afraid not. Beacon v2 is an API specification, and the Beacon v2 Models are merely a component of it. In order to light a Beacon v2, it is necessary to load the JSON files into a database and add an API on top. Currently, it is advisable to use the Beacon v2 Reference Implementation, which includes the database, the Beacon v2 API, and other necessary components.

See below an example of how to integrate an OMOP CDM export from SQL with Beacon v2.

Figure: Beacon v2 RI integration

last change 2023-06-20 by Manuel Rueda
What is the difference between Beacon v2 Models and Beacon v2?

Beacon v2 is a specification for building an API. The Beacon v2 Models define the format of the API's responses to queries about biological data. With Convert-Pheno, data exchange text files (BFF) that align with this response format can be generated. The BFF files can then be loaded into a NoSQL database such as MongoDB without the API having to perform any additional data transformations internally.
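
For instance, a minimal sketch of that loading step, assuming a local MongoDB instance, the pymongo package, and hypothetical database/collection names:

import json
from pymongo import MongoClient

# BFF files are JSON arrays, so they can be inserted as-is
client = MongoClient("mongodb://localhost:27017")
collection = client["beacon"]["individuals"]  # hypothetical names

with open("individuals.json") as fh:
    collection.insert_many(json.load(fh))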

last change 2023-02-13 by Manuel Rueda
Why are there so many clinical data standards?

The healthcare industry uses various data standards to meet diverse needs for data exchange, storage, and analysis, tailored for specific purposes like real-time clinical use or research. The abundance of standards also stems from a lack of communication and coordination among different organizations and stakeholders.

Overview of Key Healthcare Data Standards and Models

| Standard/Model | Purpose | Data Persistence | Live Data Use (Clinical Settings) | Secondary Data Use (Research Settings) |
|---|---|---|---|---|
| Beacon v2 | Facilitates the discovery and sharing of genomic data, enabling researchers to find relevant genomic datasets across different repositories. | Not designed for long-term storage; focuses on data discovery. | No | Yes |
| CDISC-ODM | Manages and archives clinical trial data, providing a standardized format for the exchange and submission of clinical research data. | Strong support for long-term data archiving and regulatory submissions. | No | Yes |
| HL7/CDA | Standardizes the structure and semantics of clinical documents (such as discharge summaries and progress notes) for exchange. | Ensures structured document storage; persistence depends on implementation. | Yes | Yes |
| HL7/FHIR | Facilitates the exchange of healthcare information electronically, supporting interoperability across different health IT systems. | Provides guidelines for data exchange; persistence depends on implementation. | Yes | Yes |
| OMOP CDM | Standardizes and harmonizes health data for research and secondary use, focusing on observational health data analysis. | Supports data persistence for research purposes, not real-time use. | No | Yes |
| openEHR | Offers a comprehensive standard for electronic health records, focusing on accurate, long-term clinical data storage and real-time use. | Designed for robust, long-term clinical data persistence. | Yes | Yes |
| Phenopackets v2 | Standardizes the exchange of detailed phenotypic data, particularly for genetic and rare disease research. | Not designed for long-term storage; focuses on data exchange. | No | Yes |
| REDCap | Provides a secure, web-based application for building and managing online surveys and databases, primarily used in research settings. | Supports data persistence for research projects and surveys. | No | Yes |
last change 2024-07-12 by Manuel Rueda
Are you planning on supporting other clinical data formats?

Affirmative. Please check our roadmap for more information.

last change 2023-01-04 by Manuel Rueda
Are longitudinal data supported?

Although Beacon v2 and Phenopackets v2 allow for storing time information in some properties, there is currently no way to associate medical visits with properties. To address this:

  • omop2bff - we added an ad hoc property (_visit) to store medical visit information for longitudinal events in variables that have it (e.g., measures, observations, etc.); see the sketch below.

  • redcap2bff - in REDCap, visit/event information is not stored at the record level, so we added this information inside the info property.

We raised this issue with the respective communities in the hope of a more permanent solution.
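
For example, a short Python sketch that lists which measures carry visit information (it assumes an omop2bff-produced individuals.json; the exact contents of _visit depend on the source OMOP data):

import json

# Print each individual's id together with the ad hoc _visit
# property attached to their measures, when present
with open("individuals.json") as fh:
    individuals = json.load(fh)

for individual in individuals:
    for measure in individual.get("measures", []):
        visit = measure.get("_visit")
        if visit is not None:
            print(individual["id"], visit)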

last change 2023-03-24 by Manuel Rueda
What is an "ontology" in the Beacon v2 and Phenopackets v2 context?

In the context of Phenopackets and Beacon v2, the terms ontologyClass and ontologyTerm denote standardized identifiers derived from ontologies such as HPO or NCIt, and terminologies like LOINC or RxNorm. The use of "ontology" here is broad, covering both actual ontologies—with their complex semantic relationships and inference abilities—and classifications like LOINC and RxNorm, which, despite not fitting the strict definition of an ontology, serve similar purposes in data standardization.

last change 2024-04-01 by Manuel Rueda
I have a collection of PXF files encoded using HPO and ICD-10 terms, and I need to convert them to BFF format, but encoded in OMIM and SNOMED-CT terminologies. Can you assist me with this?

Neither Phenopackets v2 nor Beacon v2 prescribe the use of a specific ontology; they simply provide recommendations on their websites. Therefore, Convert-Pheno does not change the source ontologies.

Now, IMHO, it's generally easier to inter-convert ontology terms (it's just a mapping exercise) than to inter-convert data schemas... so there is that 😄.
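
To illustrate the mapping exercise, here is a minimal Python sketch; the single ICD-10 to SNOMED CT pair below is just an example, and in practice you would use a curated cross-walk (e.g., the OMOP vocabulary tables):

# Hypothetical one-entry mapping table: ICD-10 asthma -> SNOMED CT asthma
ICD10_TO_SNOMED = {
    "ICD10:J45": "SNOMED:195967001",
}

def remap_term(term: dict) -> dict:
    """Replace a term's id when a mapping exists; leave it untouched otherwise."""
    return {**term, "id": ICD10_TO_SNOMED.get(term["id"], term["id"])}

print(remap_term({"id": "ICD10:J45", "label": "Asthma"}))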

Nota Bene:

A standard that does enforce the use of a standardized vocabulary is OMOP CDM; you may want to check it out.

last change 2024-01-16 by Manuel Rueda
Error Handling for CSV_XS ERROR: 2023 - EIQ - QUO character not allowed @ rec 1 pos 21 field 1

This indicates a problem with the character used to separate data fields in your file. Our script automatically detects the separator based on the file extension (e.g., it expects commas for .csv files). However, discrepancies can arise if the actual data separator doesn't match the expected one based on the file extension.

Solutions

  • Ensure Consistent Separator Use: If you're using REDCap for input, verify that both the --iredcap and --rcd files use the same separator. This consistency is crucial for correct data processing.

  • Specify the Separator Manually in the Command Line: In cases where the default separator detection fails, you can manually specify the correct separator. For example, to use a tab as your separator, use the following syntax in the CLI:

--sep $'\t'
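
If you are unsure which separator your file actually uses, a quick check with Python's csv.Sniffer (a heuristic; export.csv is a placeholder for your file name) can help:

import csv

# Guess the delimiter from the first few KB of the file
with open("export.csv", newline="") as fh:
    dialect = csv.Sniffer().sniff(fh.read(4096), delimiters=",;\t|")

print(repr(dialect.delimiter))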
last change 2024-02-06 by Manuel Rueda
Should I export my REDCap project as raw data or as labels for use with Convert-Pheno?

For use with Convert-Pheno, we recommend that you export your REDCap project as CSV / Microsoft Excel (raw data). It's important to include the corresponding dictionary file with your export. For detailed instructions on how to prepare your export correctly, refer to the Convert-Pheno tutorial.

Figure: Example of REDCap export settings. Source: CDC

Additionally, when configuring your export settings, ensure that in the Additional report options, the option "Combine checkbox options into single column of only the checked-off options" is not selected.

Figure: REDCap checkbox export settings

If your data has been exported as labels, you can follow the CSV input route.

last change 2024-05-18 by Manuel Rueda

Analytics

How can I obtain statistics from the individuals.json file if I'm not familiar with JSON format? Any suggestions?

My first recommendation is to use jq, which is like grep for JSON.

Let's begin by generating a TSV (Tab-Separated Values) file where each row represents an individual, and the columns correspond to the array variables:

jq -r '["id", "diseases", "exposures", "interventionsOrProcedures", "measures", "phenotypicFeatures", "treatments"],
       (.[] | [.id, (.diseases | length), (.exposures | length), (.interventionsOrProcedures | length),
               (.measures | length), (.phenotypicFeatures | length), (.treatments | length)])
       | @tsv' < individuals.json > results.tsv

Another valid option to accomplish the same task is to resort to a scripting language such as Python or Perl:

Python code
import json
import pandas as pd

# Load the JSON data from individuals.json
with open('individuals.json', 'r') as json_file:
    data = json.load(json_file)

# Define the keys you want to extract
keys = [ "diseases", "exposures", "interventionsOrProcedures", "measures", "phenotypicFeatures", "treatments"]

# Create a list of dictionaries with the extracted values
result_data = [
    {
        "id": item["id"],
        **{key: len(item.get(key, [])) for key in keys}
    }
    for item in data
]

# Create a DataFrame from the list of dictionaries
df = pd.DataFrame(result_data)

# Save the DataFrame to results.tsv with tab as the separator
df.to_csv('results.tsv', sep='\t', index=False)
Perl code
use strict;
use warnings;
use autodie;
use JSON::XS;
use Text::CSV_XS qw(csv);

# Open the JSON file and read the data
open my $json_file, '<', 'individuals.json';
my $json_text = do { local $/; <$json_file> };
my $data = decode_json($json_text);
close $json_file;

# Define the keys you want to extract
my @keys = ("diseases", "exposures", "interventionsOrProcedures", "measures", "phenotypicFeatures", "treatments");

# Initialize the data array with the header row
my $aoa = [["id", @keys]];

# Process the data
foreach my $item (@$data) {
    my @row = ($item->{"id"});
    foreach my $key (@keys) {
        push @row, scalar @{$item->{$key} // []};
    }
    push @$aoa, \@row;
}

# Write array of arrays as csv file
csv(in => $aoa, out => "results.tsv", sep_char => "\t", eol => "\n");
See result

When you run this on, for example, this file, you'll obtain a text file in the following format:

id diseases exposures interventionsOrProcedures measures phenotypicFeatures treatments
HG00096 0 0 1 3 0 0
HG00097 0 0 1 3 0 0
HG00099 0 0 1 3 0 0
HG00100 0 0 1 3 0 0
HG00101 0 0 1 3 0 0
HG00102 0 0 1 3 0 0
HG00103 1 0 1 3 0 0
HG00105 3 0 1 3 0 0
...

Once you have the data in that format, you can process it however you prefer. Below, you'll find an example:

Example: Basic stats
import pandas as pd

# Load TSV file
df = pd.read_csv('results.tsv', sep='\t')

# Exclude the first column (assuming it's 'id')
df = df.iloc[:, 1:]

# Initialize a dictionary to hold the statistics
stats = {
    'Statistic': ['Mean', 'Median', 'Max', 'Min', '25th Percentile', '75th Percentile', 'IQR', 'Standard Deviation']
}

# Calculate statistics for each column and add to the dictionary
for column in df.columns:
    percentile_25 = df[column].quantile(0.25)
    percentile_75 = df[column].quantile(0.75)

    stats[column] = [
        df[column].mean(),
        df[column].median(),
        df[column].max(),
        df[column].min(),
        percentile_25,
        percentile_75,
        percentile_75 - percentile_25,
        df[column].std()
    ]

# Create a new DataFrame from the stats dictionary
stats_df = pd.DataFrame(stats)

# Save the statistics DataFrame to a CSV file
stats_df.to_csv('column_statistics.csv', index=False)
Statistic diseases exposures interventionsOrProcedures measures phenotypicFeatures treatments
Mean 1.02 0.0 1.0 3.0 0.0 0.0
Median 1.0 0.0 1.0 3.0 0.0 0.0
Max 5.0 0.0 1.0 3.0 0.0 0.0
Min 0.0 0.0 1.0 3.0 0.0 0.0
25th Percentile 0.0 0.0 1.0 3.0 0.0 0.0
75th Percentile 2.0 0.0 1.0 3.0 0.0 0.0
IQR 2.0 0.0 0.0 0.0 0.0 0.0
Standard Deviation 0.92 0.0 0.0 0.0 0.0 0.0

A similar approach but in R:

# Load TSV file
df <- read.csv("results.tsv", sep = "\t")

# Exclude the first column (assuming it's 'id')
df <- df[-1]

# Calculate summary statistics for each numeric column
summary_stats <- summary(df)

# Save the summary statistics to a CSV file
write.csv(summary_stats, file = 'column_statistics.csv')
diseases exposures interventionsOrProcedures measures phenotypicFeatures treatments
Min. :0.000 Min. :0 Min. :1 Min. :3 Min. :0 Min. :0
1st Qu.:0.000 1st Qu.:0 1st Qu.:1 1st Qu.:3 1st Qu.:0 1st Qu.:0
Median :1.000 Median :0 Median :1 Median :3 Median :0 Median :0
Mean :1.023 Mean :0 Mean :1 Mean :3 Mean :0 Mean :0
3rd Qu.:2.000 3rd Qu.:0 3rd Qu.:1 3rd Qu.:3 3rd Qu.:0 3rd Qu.:0
Max. :5.000 Max. :0 Max. :1 Max. :3 Max. :0 Max. :0
Example: Plots

For plotting, we recommend using one of Pheno-Ranker's utilities.

last change 2024-01-17 by Manuel Rueda
How can I compare all individuals in one or multiple cohorts?

We recommend using Pheno-Ranker in cohort mode.

last change 2024-01-17 by Manuel Rueda
How can I match patients similar to mine in one or more cohorts?

We recommend using Pheno-Ranker in patient mode.

last change 2024-01-17 by Manuel Rueda
How can I create synthetic data in BFF or PXF data exchange formats?

We recommend using one of Pheno-Ranker's utilities.

last change 2024-01-17 by Manuel Rueda
How can I convert my BFF/PXF data into Machine Learning features?

We recommend using Pheno-Ranker, which performs one-hot encoding while preserving the hierarchical relationships of the JSON data.
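
As a generic illustration of the idea (not Pheno-Ranker's actual implementation), nested JSON can be flattened into path-like feature names whose presence then becomes a binary column:

# Flatten nested JSON into "path=value" features; each feature then
# becomes a binary (one-hot) column in a patient-by-feature matrix
def flatten(obj, prefix=""):
    feats = set()
    if isinstance(obj, dict):
        for k, v in obj.items():
            feats |= flatten(v, f"{prefix}{k}.")
    elif isinstance(obj, list):
        for v in obj:
            feats |= flatten(v, prefix)
    else:
        feats.add(f"{prefix[:-1]}={obj}")
    return feats

patient = {"diseases": [{"diseaseCode": {"id": "ICD10:J45"}}]}
print(sorted(flatten(patient)))
# ['diseases.diseaseCode.id=ICD10:J45']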

last change 2024-01-17 by Manuel Rueda

Installation

I am installing Convert-Pheno from source (non-containerized version) but I can't make it work. Any suggestions?

Problems with Python / PyPerler

About PyPerler installation

Apart from PyPerler itself, you may need to install cython3 and libperl-dev to make it work.

sudo apt-get install cython3 libperl-dev
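
Once installed, a minimal smoke test (assuming the Interpreter API shown in the PyPerler README):

import pyperler

# Evaluate a trivial Perl statement from Python
perl = pyperler.Interpreter()
perl('print "Hello from Perl\n";')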

last change 2023-01-04 by Manuel Rueda