Tutorial
Google Colab version
We created a Google Colab version of the tutorial. Users can view notebooks shared publicly without signing in, but you need a Google account to execute code.
We also have a local copy of the notebook that can be downloaded from the repo.
This page provides brief tutorials on how to perform data conversion using the `Convert-Pheno` command-line interface.
Note on installation
Before proceeding, ensure that the software is properly installed. In the following instructions, it will be assumed that you have downloaded and installed Convert-Pheno.
How to convert:¶
This section provides a summary of the steps to convert a REDCap project to Phenopackets v2.
- The starting point is to log in to your REDCap system and export the data to CSV / Microsoft Excel (raw data) format. If you need more information on REDCap, we recommend consulting the comprehensive documentation provided by the Cincinnati Children's Hospital Medical Center.
Can I export a CSV / Microsoft Excel (labels) file?
Yes, you can export a CSV or Microsoft Excel file with labels. However, you need to use the `--icsv` flag instead of the `--iredcap` flag as the input format. While we recommend exporting raw data along with the dictionary for better accuracy, we understand that this might not always be possible.
For more detailed information and other common questions, please refer to the FAQ.
- After exporting the data, you must also download the REDCap dictionary in CSV format. This can be done within REDCap by navigating to `Project Setup/Data Dictionary/Download the current`.
- Since REDCap projects are "free-format," a mapping file is necessary to connect REDCap project variables (i.e., fields) to something meaningful for `Convert-Pheno`. This mapping file will be used in the conversion process.
What is a `Convert-Pheno` mapping file?
A mapping file is a text file in YAML format (JSON is also accepted) that connects a set of variables to a format that is understood by `Convert-Pheno`. This file maps your variables to the required terms of the individuals entity from the Beacon v2 Models, which serves as the center model.
Creating a mapping file¶
To create a mapping file, start by reviewing the example mapping file provided with the installation. The goal is to replace the contents of that file with those from your REDCap project. The mapping file contains the following types of data:
| Type | Required (Optional) | Required properties | Optional properties |
|---|---|---|---|
| Internal | `project` | `id`, `source`, `ontology`, `version` | `description`, `baselineFieldsToPropagate` |
| Beacon v2 terms | `id`, `sex` (`diseases`, `exposures`, `info`, `interventionsOrProcedures`, `measures`, `phenotypicFeatures`, `treatments`) | `fields` | `age`, `ageOfOnset`, `assignTermIdFromHeader`, `bodySite`, `dateOfProcedure`, `dictionary`, `drugDose`, `drugUnit`, `duration`, `durationUnit`, `familyHistory`, `fields`, `mapping`, `procedureCodeLabel`, `selector`, `terminology`, `unit`, `visitId` |
These are the properties needed to map your data to the entity `individuals` in the Beacon v2 Models (a minimal example follows the list):
- `baselineFieldsToPropagate`, an `array` of columns containing measurements that were taken only at the initial time point (time = 0). Use this if you wish to duplicate these columns across subsequent rows for the same patient ID. It is important to ensure that the row containing baseline information appears first in the CSV.
- `age`, a `string` representing the column that points to the age of the patient.
- `ageOfOnset`, an `object` representing the column that points to the age at which the patient first experienced symptoms or was diagnosed with a condition.
- `assignTermIdFromHeader`, an `array` of columns for which the ontology-term IDs have to be assigned from the header.
- `bodySite`, an `object` representing the column that points to the part of the body affected by a condition or where a procedure was performed.
- `dateOfProcedure`, an `object` representing the column that points to when a procedure took place.
- `dictionary`, an `object` in the form of `key: value`. The `key` represents the original variable name in REDCap and the `value` represents the "phrase" that will be used to query a database to find an ontology candidate. For instance, you may have a variable named `cigarettes_days`, but you know that in NCIt the label is `Average Number Cigarettes Smoked a Day`. In this case, you will use `cigarettes_days: Average Number Cigarettes Smoked a Day`.
- `drugDose`, an `object` representing the column that points to the dose column for each treatment.
- `drugUnit`, an `object` representing the column that points to the unit column for each treatment.
- `duration`, an `object` representing the column that points to the duration column for each treatment.
- `durationUnit`, an `object` representing the column that points to the duration unit column for each treatment.
- `familyHistory`, an `object` representing the column that points to the family medical history relevant to the patient's condition.
- `fields`, either a `string` or an `array` consisting of the name(s) of the REDCap variables that map to that Beacon v2 term.
- `mapping`, an `object` in the form of `key: value` that we use to map our Beacon v2 objects to REDCap variables.
- `procedureCodeLabel`, a nested `object` with specific mappings for `interventionsOrProcedures`.
- `ontology`, a `string` to define the ontology more granularly for this particular Beacon v2 term. If not present, the script will use the one from `project.ontology`.
- `routeOfAdministration`, a nested `object` with specific mappings for `treatments`.
- `selector`, a nested `object` value with specific mappings.
- `terminology`, a nested `object` value with user-defined ontology terms.
- `unit`, an `object` representing the column that points to the unit of measurement for a given value or treatment.
- `visitId`, the column with the visit occurrence ID.
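Below is a minimal, hypothetical sketch of what such a mapping file can look like. The project values and REDCap variable names (`sex`, `cigarettes_days`) are illustrative only; use the example mapping file shipped with the installation as the authoritative template.

```yaml
# Illustrative sketch only, not a complete or validated mapping file.
# Replace every variable name with the fields of your own REDCap project.
project:
  id: my_redcap_project        # free-text identifier for your project
  source: redcap               # data source
  ontology: ncit               # default ontology for the whole project
  version: 1.0

sex:
  fields: sex                  # REDCap variable holding the participant's sex

exposures:
  fields:
    - cigarettes_days          # REDCap variable(s) mapped to this term
  dictionary:
    # variable name -> phrase used to look up an ontology label
    cigarettes_days: Average Number Cigarettes Smoked a Day
```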
Defining the values in the property `dictionary`
Before assigning values to `dictionary`, it's important to think about which ontologies/terminologies you want to use. The field `project.ontology` defines the ontology for the whole project, but you can also specify another ontology at the Beacon v2 term level. Once you know which ontologies to use, try searching for each term to get an accurate label for it. For example, if you have chosen `ncit`, you can search for the values within NCIt at EBI Search. `Convert-Pheno` will use these values to retrieve the actual ontology terms from its internal databases.
About text similarity in database searches
`Convert-Pheno` comes with several pre-configured ontology/terminology databases. It supports three types of label-based search strategies:
1. `exact` (default)¶
Returns only exact matches for the given label string. If the label is not found exactly, no results are returned.
2. `mixed` (use `--search mixed`)¶
Hybrid search: First tries to find an exact label match. If none is found, it performs a token-based similarity search and returns the closest matching concept based on the highest similarity score.
3. `fuzzy` (use `--search fuzzy`)¶
Hybrid search with fuzzy ranking:
Like `mixed`, it starts with an exact match attempt. If that fails, it performs a weighted similarity search, where:
- 90% of the score comes from token-based similarity (e.g., cosine or Dice coefficient),
- 10% comes from the normalized Levenshtein similarity.
The concept with the highest composite score is returned.
Note: The normalized Levenshtein similarity is computed on top of the candidate results produced by the full text search. In this approach, an initial full text search (using token-based methods) returns a set of potential matches. The fuzzy search then refines these results by applying the normalized Levenshtein distance to better handle minor typographical differences, ensuring that the final composite score reflects both overall token similarity and fine-grained character-level differences.
Example Search Behavior¶
Query: `Exercise pain management`
- With `--search exact`: Match found → `Exercise Pain Management`

Query: `Brain Hemorrhage`
- With `--search mixed`:
  - No exact match
  - Closest match by similarity: `Intraventricular Brain Hemorrhage`
Similarity Threshold¶
The `--min-text-similarity-score` option sets the minimum threshold for `mixed` and `fuzzy` searches.
- Default: `0.8` (conservative)
- Lowering the threshold may increase recall but may introduce irrelevant matches.
Performance Note¶
Both `mixed` and `fuzzy` modes are more computationally intensive and can produce unexpected or less interpretable matches. Use them with care, especially on large datasets.
Example Results Table¶
Below is an example showing how the query `Sudden Death Syndrome` performs using different search modes against the NCIt ontology:
| Query | Search | NCIt match (label) | NCIt code | Cosine | Dice | Levenshtein (Normalized) | Composite |
|---|---|---|---|---|---|---|---|
| Sudden Death Syndrome | exact | NA | NA | NA | NA | NA | NA |
| | mixed | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.60 | NA | NA |
| | | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.60 | NA | NA |
| | | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.60 | NA | NA |
| | | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 | NA | NA |
| | fuzzy | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.60 | 0.43 | 0.63 |
| | | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.60 | 0.43 | 0.63 |
| | | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.60 | 0.46 | 0.63 |
| | | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 | 0.75 | 0.85 |
Interpretation:
- With `exact`, there are no matches.
- With `mixed`, the best match will be `Sudden Infant Death Syndrome`.
- With `fuzzy`, the composite score (90% token-based + 10% Levenshtein similarity) is used to rank results. The highest match is `Sudden Infant Death Syndrome`, with a composite score of 0.85.
Now we introduce a typo into the query: `Sudden Infant Deth Syndrome`:
| Query | Mode | Candidate Label | Code | Cosine | Dice | Levenshtein (Normalized) | Composite |
|---|---|---|---|---|---|---|---|
| Sudden Infant Deth Syndrome | fuzzy | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.38 | 0.36 | 0.33 | 0.37 |
| | | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.38 | 0.36 | 0.43 | 0.38 |
| | | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.57 | 0.55 | 0.59 | 0.57 |
| | | Sudden Infant Death Syndrome | NCIT:C85173 | 0.75 | 0.75 | 0.96 | 0.77 |
To capture the best match we would need to lower the threshold to `--min-text-similarity-score 0.75`.
It is possible to change the weight of the Levenshtein similarity via `--levenshtein-weight <floating 0.0 - 1.0>`.
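For instance, a run along the following lines would accept the typo-containing query above. This is only a sketch: the `...` stands for whatever input/output options you already use for your conversion.

```bash
# Sketch: fuzzy search with a lowered threshold; the default Levenshtein
# weight (0.1) is made explicit here. "..." = your usual input/output flags.
convert-pheno ... --search fuzzy --min-text-similarity-score 0.75 --levenshtein-weight 0.1
```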
Composite Similarity Score
The composite similarity score is computed as a weighted sum of two measures: the token-based similarity and the normalized Levenshtein similarity.
1. Token-Based Similarity¶
This is calculated using methods like cosine or Dice similarity to measure how similar the tokens (words) of two strings are.
2. Normalized Levenshtein Similarity¶
The normalized Levenshtein similarity is defined as:

\[ \text{lev}_{\text{norm}}(s_1, s_2) = 1 - \frac{\text{lev}(s_1, s_2)}{\max(|s_1|, |s_2|)} \]

Where:
- \(\text{lev}(s_1, s_2)\) is the Levenshtein edit distance, i.e., the minimum number of insertions, deletions, or substitutions required to change \(s_1\) into \(s_2\).
- \(|s_1|\) and \(|s_2|\) are the lengths of the strings \(s_1\) and \(s_2\), respectively.
This formula produces a score between 0 and 1, with 1.0 meaning identical strings and 0.0 meaning completely different strings.
3. Composite Score Formula¶
The final composite similarity score \(C\) is a weighted combination of the two metrics:

\[ C = \alpha \cdot \text{sim}_{\text{token}} + \beta \cdot \text{lev}_{\text{norm}} \]

Where:
- \(\alpha\) (or `token_weight`) is the weight assigned to the token-based similarity \(\text{sim}_{\text{token}}\).
- \(\beta\) (or `lev_weight`) is the weight assigned to the normalized Levenshtein similarity \(\text{lev}_{\text{norm}}\).
A common default is to set \(\alpha = 0.9\) and \(\beta = 0.1\), emphasizing the token-based similarity. However, for short strings (4โ5 words), you might consider adjusting the balance (for example, \(\alpha = 0.95\) and \(\beta = 0.05\)) if small typographical differences are less critical.
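To make the arithmetic concrete, here is a small, self-contained Python sketch of this scoring scheme. It is not Convert-Pheno's internal code: the token-based similarity is illustrated with a Dice coefficient over whitespace tokens, and the function names are made up for this example.

```python
# Illustrative sketch of the composite score (token similarity approximated
# with a Dice coefficient over lower-cased whitespace tokens).

def dice_similarity(a: str, b: str) -> float:
    """Token-based (Dice) similarity between two labels."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))

def levenshtein(s1: str, s2: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,                # deletion
                current[j - 1] + 1,             # insertion
                previous[j - 1] + (c1 != c2),   # substitution
            ))
        previous = current
    return previous[-1]

def composite_score(query: str, label: str,
                    token_weight: float = 0.9, lev_weight: float = 0.1) -> float:
    """Weighted sum of token-based and normalized Levenshtein similarity."""
    token_sim = dice_similarity(query, label)
    lev_sim = 1 - levenshtein(query, label) / max(len(query), len(label), 1)
    return token_weight * token_sim + lev_weight * lev_sim

print(round(composite_score("Sudden Infant Deth Syndrome",
                            "Sudden Infant Death Syndrome"), 2))  # 0.77
```

Run on the typo example above, this prints 0.77, which matches (up to rounding) the composite score reported for `Sudden Infant Death Syndrome` in the table.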
Running `Convert-Pheno`¶
Now you can proceed to run `convert-pheno` with the command-line interface. Please see how here.
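For orientation, an invocation might look roughly like the sketch below. Only `--iredcap` and the search-related options are named in this tutorial; the dictionary, mapping-file, and output flags shown here are assumptions to verify against `convert-pheno --help`.

```bash
# Hypothetical sketch; flag names other than --iredcap are assumptions,
# so double-check them against `convert-pheno --help`.
convert-pheno \
  --iredcap redcap_export.csv \
  --redcap-dictionary redcap_dictionary.csv \
  --mapping-file mapping_file.yaml \
  -opxf phenopackets.json
```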
This section provides a summary of the steps to convert an OMOP CDM export to the Beacon v2 Models. The starting point is a PostgreSQL export in the form of either `.sql` or `.csv` files. The process is the same for both.
Two possibilities may arise:
- Full export of records.
- Partial export of records.
Full export¶
In a full export, all standardized terms are included in the `CONCEPT` table, thus `Convert-Pheno` does not need to search any additional databases for terminology (with a few exceptions).
Partial export¶
In a partial export, many standardized terms may be missing from the `CONCEPT` table; as a result, `Convert-Pheno` will perform a search on the included ATHENA-OHDSI database. To enable this search you should use the flag `--ohdsi-db`.
Running `Convert-Pheno`¶
Now you can proceed to run `convert-pheno` with the command-line interface. Please see how here.
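A sketch of a possible invocation for a partial export follows. `--ohdsi-db` is the flag described above, while the input and output flags (`--iomop`, `-obff`) are assumptions to confirm against the CLI help.

```bash
# Hypothetical sketch; confirm flag names with `convert-pheno --help`.
convert-pheno \
  --iomop omop_cdm_export.sql \
  --ohdsi-db \
  -obff individuals.json
```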
This section provides a summary of the steps to convert a CSV file with raw clinical data to Phenopackets v2.
- Since CSV files are "free-format," a mapping file is necessary to connect variables (i.e., fields) to something meaningful for `Convert-Pheno`. This mapping file will be used in the conversion process.
What is a `Convert-Pheno` mapping file?
A mapping file is a text file in YAML format (JSON is also accepted) that connects a set of variables to a format that is understood by `Convert-Pheno`. This file maps your variables to the required terms of the individuals entity from the Beacon v2 Models, which serves as the center model.
Creating a mapping file¶
To create a mapping file, start by reviewing the example mapping file provided with the installation. The goal is to replace the contents of that file with those from your own data. The mapping file contains the following types of data:
| Type | Required (Optional) | Required properties | Optional properties |
|---|---|---|---|
| Internal | `project` | `id`, `source`, `ontology`, `version` | `description`, `baselineFieldsToPropagate` |
| Beacon v2 terms | `id`, `sex` (`diseases`, `exposures`, `info`, `interventionsOrProcedures`, `measures`, `phenotypicFeatures`, `treatments`) | `fields` | `age`, `ageOfOnset`, `assignTermIdFromHeader`, `bodySite`, `dateOfProcedure`, `dictionary`, `drugDose`, `drugUnit`, `duration`, `durationUnit`, `familyHistory`, `fields`, `mapping`, `procedureCodeLabel`, `selector`, `terminology`, `unit`, `visitId` |
These are the properties needed to map your data to the entity `individuals` in the Beacon v2 Models (a minimal example follows the list):
- `baselineFieldsToPropagate`, an `array` of columns containing measurements that were taken only at the initial time point (time = 0). Use this if you wish to duplicate these columns across subsequent rows for the same patient ID. It is important to ensure that the row containing baseline information appears first in the CSV.
- `age`, a `string` representing the column that points to the age of the patient.
- `ageOfOnset`, an `object` representing the column that points to the age at which the patient first experienced symptoms or was diagnosed with a condition.
- `assignTermIdFromHeader`, an `array` of columns for which the ontology-term IDs have to be assigned from the header.
- `bodySite`, an `object` representing the column that points to the part of the body affected by a condition or where a procedure was performed.
- `dateOfProcedure`, an `object` representing the column that points to when a procedure took place.
- `dictionary`, an `object` in the form of `key: value`. The `key` represents the original variable name in your CSV and the `value` represents the "phrase" that will be used to query a database to find an ontology candidate. For instance, you may have a variable named `cigarettes_days`, but you know that in NCIt the label is `Average Number Cigarettes Smoked a Day`. In this case, you will use `cigarettes_days: Average Number Cigarettes Smoked a Day`.
- `drugDose`, an `object` representing the column that points to the dose column for each treatment.
- `drugUnit`, an `object` representing the column that points to the unit column for each treatment.
- `duration`, an `object` representing the column that points to the duration column for each treatment.
- `durationUnit`, an `object` representing the column that points to the duration unit column for each treatment.
- `familyHistory`, an `object` representing the column that points to the family medical history relevant to the patient's condition.
- `fields`, either a `string` or an `array` consisting of the name(s) of the CSV columns that map to that Beacon v2 term.
- `mapping`, an `object` in the form of `key: value` that we use to map our Beacon v2 objects to CSV variables.
- `procedureCodeLabel`, a nested `object` with specific mappings for `interventionsOrProcedures`.
- `ontology`, a `string` to define the ontology more granularly for this particular Beacon v2 term. If not present, the script will use the one from `project.ontology`.
- `routeOfAdministration`, a nested `object` with specific mappings for `treatments`.
- `selector`, a nested `object` value with specific mappings.
- `terminology`, a nested `object` value with user-defined ontology terms.
- `unit`, an `object` representing the column that points to the unit of measurement for a given value or treatment.
- `visitId`, the column with the visit occurrence ID.
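As in the REDCap tutorial, here is a minimal, hypothetical sketch of a mapping file. The column names (`gender`, `smoking_status`) are invented for illustration; the example mapping file bundled with the installation remains the authoritative template.

```yaml
# Illustrative sketch only; adapt the column names to your own CSV header.
project:
  id: my_csv_study
  source: csv
  ontology: ncit
  version: 1.0

sex:
  fields: gender               # CSV column holding the participant's sex

phenotypicFeatures:
  fields:
    - smoking_status           # CSV column(s) mapped to this term
  dictionary:
    # column name -> phrase used to look up an ontology label
    smoking_status: Smoking Status
```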
Defining the values in the property `dictionary`
Before assigning values to `dictionary`, it's important to think about which ontologies/terminologies you want to use. The field `project.ontology` defines the ontology for the whole project, but you can also specify another ontology at the Beacon v2 term level. Once you know which ontologies to use, try searching for each term to get an accurate label for it. For example, if you have chosen `ncit`, you can search for the values within NCIt at EBI Search. `Convert-Pheno` will use these values to retrieve the actual ontology terms from its internal databases.
Running `Convert-Pheno`¶
Now you can proceed to run `convert-pheno` with the command-line interface. Please see how here.
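A rough sketch of such a run follows. The `--icsv` flag appears earlier in this tutorial; the mapping-file and output flags are assumptions to verify against the CLI help.

```bash
# Hypothetical sketch; verify flag names with `convert-pheno --help`.
convert-pheno \
  --icsv clinical_data.csv \
  --mapping-file mapping_file.yaml \
  -opxf phenopackets.json
```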
More questions?
Please take a look at our Frequently Asked Questions.