Tutorial
Google Colab version
We created a Google Colab version of the tutorial. Users can view notebooks shared publicly without signing in, but you need a Google account to execute code.
We also have a local copy of the notebook that can be downloaded from the repo.
This page provides brief tutorials on how to perform data conversion using the `Convert-Pheno` command-line interface.
Note on installation
Before proceeding, ensure that the software is properly installed. In the following instructions, it will be assumed that you have downloaded and installed Convert-Pheno.
How to convert:¶
This section provides a summary of the steps to convert a REDCap project to Phenopackets v2.
- The starting point is to log in to your REDCap system and export the data to CSV / Microsoft Excel (raw data) format. If you need more information on REDCap, we recommend consulting the comprehensive documentation provided by the Cincinnati Children's Hospital Medical Center.
Can I export a CSV / Microsoft Excel (labels) file?
Yes, you can export a CSV or Microsoft Excel file with labels. However, you need to use the `--icsv` flag instead of the `--iredcap` flag as the input format. While we recommend exporting raw data along with the dictionary for better accuracy, we understand that this might not always be possible.
For more detailed information and other common questions, please refer to the FAQ.
- After exporting the data, you must also download the REDCap dictionary in CSV format. This can be done within REDCap by navigating to `Project Setup/Data Dictionary/Download the current`.
- Since REDCap projects are "free-format," a mapping file is necessary to connect REDCap project variables (i.e., fields) to something meaningful for `Convert-Pheno`. This mapping file will be used in the conversion process.
What is a `Convert-Pheno` mapping file?
A mapping file is a text file in YAML format (JSON is also accepted) that connects a set of variables to a format that is understood by `Convert-Pheno`. This file maps your variables to the required terms of the individuals entity from the Beacon v2 Models, which serves as the center model.
Creating a mapping file¶
To create a mapping file, start by reviewing the example mapping file provided with the installation. The goal is to replace the contents of that file with those from your REDCap project. The mapping file contains the following types of data:
| Type | Required (Optional) | Required properties | Optional properties |
|---|---|---|---|
| Internal | `project` | `id`, `source`, `ontology`, `version` | `description`, `baselineFieldsToPropagate` |
| Beacon v2 terms | `id`, `sex` (`diseases`, `exposures`, `info`, `interventionsOrProcedures`, `measures`, `phenotypicFeatures`, `treatments`) | `fields` | `age`, `ageOfOnset`, `assignTermIdFromHeader`, `bodySite`, `dateOfProcedure`, `dictionary`, `drugDose`, `drugUnit`, `duration`, `durationUnit`, `familyHistory`, `fields`, `mapping`, `procedureCodeLabel`, `selector`, `terminology`, `unit`, `visitId` |
These are the properties needed to map your data to the entity `individuals` in the Beacon v2 Models (a minimal example follows the list):
- `baselineFieldsToPropagate`, an `array` of columns containing measurements that were taken only at the initial time point (time = 0). Use this if you wish to duplicate these columns across subsequent rows for the same patient ID. It is important to ensure that the row containing baseline information appears first in the CSV.
- `age`, a `string` representing the column that points to the age of the patient.
- `ageOfOnset`, an `object` representing the column that points to the age at which the patient first experienced symptoms or was diagnosed with a condition.
- `assignTermIdFromHeader`, an `array` of columns for which the ontology-term IDs have to be assigned from the header.
- `bodySite`, an `object` representing the column that points to the part of the body affected by a condition or where a procedure was performed.
- `dateOfProcedure`, an `object` representing the column that points to when a procedure took place.
- `dictionary`, an `object` in the form of `key: value`. The `key` represents the original variable name in REDCap and the `value` represents the "phrase" that will be used to query a database to find an ontology candidate. For instance, you may have a variable named `cigarettes_days`, but you know that in NCIt the label is `Average Number Cigarettes Smoked a Day`. In this case, you will use `cigarettes_days: Average Number Cigarettes Smoked a Day`.
- `drugDose`, an `object` representing the column that points to the dose column for each treatment.
- `drugUnit`, an `object` representing the column that points to the unit column for each treatment.
- `duration`, an `object` representing the column that points to the duration column for each treatment.
- `durationUnit`, an `object` representing the column that points to the duration unit column for each treatment.
- `familyHistory`, an `object` representing the column that points to the family medical history relevant to the patient's condition.
- `fields`, either a `string` or an `array` consisting of the name(s) of the REDCap variables that map to that Beacon v2 term.
- `mapping`, an `object` in the form of `key: value` that we use to map our Beacon v2 objects to REDCap variables.
- `procedureCodeLabel`, a nested `object` with specific mappings for `interventionsOrProcedures`.
- `ontology`, a `string` to define the ontology more granularly for this particular Beacon v2 term. If not present, the script will use the one from `project.ontology`.
- `routeOfAdministration`, a nested `object` with specific mappings for `treatments`.
- `selector`, a nested `object` value with specific mappings.
- `terminology`, a nested `object` value with user-defined ontology terms.
- `unit`, an `object` representing the column that points to the unit of measurement for a given value or treatment.
- `visitId`, the column with the visit occurrence ID.
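Below is a minimal, hypothetical sketch of what such a mapping file can look like. The project values and REDCap variable names (`sex`, `cigarettes_days`) are illustrative only; use the example mapping file shipped with the installation as the authoritative template.

```yaml
# Illustrative sketch only, not a complete or validated mapping file.
# Replace every variable name with the fields of your own REDCap project.
project:
  id: my_redcap_project        # free-text identifier for your project
  source: redcap               # data source
  ontology: ncit               # default ontology for the whole project
  version: 1.0

sex:
  fields: sex                  # REDCap variable holding the participant's sex

exposures:
  fields:
    - cigarettes_days          # REDCap variable(s) mapped to this term
  dictionary:
    # variable name -> phrase used to look up an ontology label
    cigarettes_days: Average Number Cigarettes Smoked a Day
```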
Defining the values in the property `dictionary`
Before assigning values to `dictionary`, it's important to think about which ontologies/terminologies you want to use. The field `project.ontology` defines the ontology for the whole project, but you can also specify another ontology at the Beacon v2 term level. Once you know which ontologies to use, try searching for each term to get an accurate label for it. For example, if you have chosen `ncit`, you can search for the values within NCIt at EBI Search. `Convert-Pheno` will use these values to retrieve the actual ontology terms from its internal databases.
About text similarity in database searches
`Convert-Pheno` comes with several pre-configured ontology/terminology databases. It supports three types of label-based search strategies:
1. `exact` (default)¶
Returns only exact matches for the given label string. If the label is not found exactly, no results are returned.
2. `mixed` (use `--search mixed`)¶
Hybrid search: First tries to find an exact label match. If none is found, it performs a token-based similarity search and returns the closest matching concept based on the highest similarity score.
3. `fuzzy` (use `--search fuzzy`)¶
Hybrid search with fuzzy ranking:
Like `mixed`, it starts with an exact match attempt. If that fails, it performs a weighted similarity search, where:
- 90% of the score comes from token-based similarity (e.g., cosine or Dice coefficient),
- 10% comes from the normalized Levenshtein similarity.
The concept with the highest composite score is returned.
Note: The normalized Levenshtein similarity is computed on top of the candidate results produced by the full text search. In this approach, an initial full text search (using token-based methods) returns a set of potential matches. The fuzzy search then refines these results by applying the normalized Levenshtein distance to better handle minor typographical differences, ensuring that the final composite score reflects both overall token similarity and fine-grained character-level differences.
Example Search Behavior¶
Query: `Exercise pain management`
- With `--search exact`: Match found → `Exercise Pain Management`

Query: `Brain Hemorrhage`
- With `--search mixed`:
  - No exact match
  - Closest match by similarity: `Intraventricular Brain Hemorrhage`
Similarity Threshold¶
The `--min-text-similarity-score` option sets the minimum threshold for `mixed` and `fuzzy` searches.
- Default: `0.8` (conservative)
- Lowering the threshold may increase recall but may introduce irrelevant matches.
Performance Note¶
Both `mixed` and `fuzzy` modes are more computationally intensive and can produce unexpected or less interpretable matches. Use them with care, especially on large datasets.
Example Results Table¶
Below is an example showing how the query `Sudden Death Syndrome` performs using different search modes against the NCIt ontology:
| Query | Search | NCIt match (label) | NCIt code | Cosine | Dice | Levenshtein (Normalized) | Composite |
|---|---|---|---|---|---|---|---|
| Sudden Death Syndrome | exact | NA | NA | NA | NA | NA | NA |
| | mixed | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.60 | NA | NA |
| | | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.60 | NA | NA |
| | | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.60 | NA | NA |
| | | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 | NA | NA |
| | fuzzy | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.60 | 0.43 | 0.63 |
| | | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.60 | 0.43 | 0.63 |
| | | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.60 | 0.46 | 0.63 |
| | | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 | 0.75 | 0.85 |
Interpretation:
- With `exact`, there are no matches.
- With `mixed`, the best match will be `Sudden Infant Death Syndrome`.
- With `fuzzy`, the composite score (90% token-based + 10% Levenshtein similarity) is used to rank results. The highest match is `Sudden Infant Death Syndrome`, with a composite score of 0.85.
Now we introduce a typo into the query: `Sudden Infant Deth Syndrome`:
| Query | Mode | Candidate Label | Code | Cosine | Dice | Levenshtein (Normalized) | Composite |
|---|---|---|---|---|---|---|---|
| Sudden Infant Deth Syndrome | fuzzy | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.38 | 0.36 | 0.33 | 0.37 |
| | | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.38 | 0.36 | 0.43 | 0.38 |
| | | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.57 | 0.55 | 0.59 | 0.57 |
| | | Sudden Infant Death Syndrome | NCIT:C85173 | 0.75 | 0.75 | 0.96 | 0.77 |
To capture the best match we would need to lower the threshold to `--min-text-similarity-score 0.75`.
It is possible to change the weight of the Levenshtein similarity via `--levenshtein-weight <floating 0.0 - 1.0>`.
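For instance, a run along the following lines would accept the typo-containing query above. This is only a sketch: the `...` stands for whatever input/output options you already use for your conversion.

```bash
# Sketch: fuzzy search with a lowered threshold; the default Levenshtein
# weight (0.1) is made explicit here. "..." = your usual input/output flags.
convert-pheno ... --search fuzzy --min-text-similarity-score 0.75 --levenshtein-weight 0.1
```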
Composite Similarity Score
The composite similarity score is computed as a weighted sum of two measures: the token-based similarity and the normalized Levenshtein similarity.
1. Token-Based Similarity¶
This is calculated using methods like cosine or Dice similarity to measure how similar the tokens (words) of two strings are.
2. Normalized Levenshtein Similarity¶
The normalized Levenshtein similarity is defined as:

\[ \text{lev}_{\text{norm}}(s_1, s_2) = 1 - \frac{\text{lev}(s_1, s_2)}{\max(|s_1|, |s_2|)} \]

Where:
- \(\text{lev}(s_1, s_2)\) is the Levenshtein edit distance, i.e., the minimum number of insertions, deletions, or substitutions required to change \(s_1\) into \(s_2\).
- \(|s_1|\) and \(|s_2|\) are the lengths of the strings \(s_1\) and \(s_2\), respectively.
This formula produces a score between 0 and 1, with 1.0 meaning identical strings and 0.0 meaning completely different strings.
3. Composite Score Formula¶
The final composite similarity score \(C\) is a weighted combination of the two metrics:

\[ C = \alpha \cdot \text{sim}_{\text{token}} + \beta \cdot \text{lev}_{\text{norm}} \]

Where:
- \(\alpha\) (or `token_weight`) is the weight assigned to the token-based similarity \(\text{sim}_{\text{token}}\).
- \(\beta\) (or `lev_weight`) is the weight assigned to the normalized Levenshtein similarity \(\text{lev}_{\text{norm}}\).
A common default is to set \(\alpha = 0.9\) and \(\beta = 0.1\), emphasizing the token-based similarity. However, for short strings (4โ5 words), you might consider adjusting the balance (for example, \(\alpha = 0.95\) and \(\beta = 0.05\)) if small typographical differences are less critical.
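To make the arithmetic concrete, here is a small, self-contained Python sketch of this scoring scheme. It is not Convert-Pheno's internal code: the token-based similarity is illustrated with a Dice coefficient over whitespace tokens, and the function names are made up for this example.

```python
# Illustrative sketch of the composite score (token similarity approximated
# with a Dice coefficient over lower-cased whitespace tokens).

def dice_similarity(a: str, b: str) -> float:
    """Token-based (Dice) similarity between two labels."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))

def levenshtein(s1: str, s2: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,                # deletion
                current[j - 1] + 1,             # insertion
                previous[j - 1] + (c1 != c2),   # substitution
            ))
        previous = current
    return previous[-1]

def composite_score(query: str, label: str,
                    token_weight: float = 0.9, lev_weight: float = 0.1) -> float:
    """Weighted sum of token-based and normalized Levenshtein similarity."""
    token_sim = dice_similarity(query, label)
    lev_sim = 1 - levenshtein(query, label) / max(len(query), len(label), 1)
    return token_weight * token_sim + lev_weight * lev_sim

print(round(composite_score("Sudden Infant Deth Syndrome",
                            "Sudden Infant Death Syndrome"), 2))  # 0.77
```

Run on the typo example above, this prints 0.77, which matches (up to rounding) the composite score reported for `Sudden Infant Death Syndrome` in the table.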
Running `Convert-Pheno`¶
Now you can proceed to run `convert-pheno` with the command-line interface. Please see how here.
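For orientation, an invocation might look roughly like the sketch below. Only `--iredcap` and the search-related options are named in this tutorial; the dictionary, mapping-file, and output flags shown here are assumptions to verify against `convert-pheno --help`.

```bash
# Hypothetical sketch; flag names other than --iredcap are assumptions,
# so double-check them against `convert-pheno --help`.
convert-pheno \
  --iredcap redcap_export.csv \
  --redcap-dictionary redcap_dictionary.csv \
  --mapping-file mapping_file.yaml \
  -opxf phenopackets.json
```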
This section provides a summary of the steps to convert an OMOP CDM export to the Beacon v2 Models. The starting point is a PostgreSQL export in the form of either `.sql` or `.csv` files. The process is the same for both.
Two possibilities may arise:
- Full export of records.
- Partial export of records.
Full export¶
In a full export, all standardized terms are included in the `CONCEPT` table, thus `Convert-Pheno` does not need to search any additional databases for terminology (with a few exceptions).
Partial export¶
In a partial export, many standardized terms may be missing from the `CONCEPT` table; as a result, `Convert-Pheno` will perform a search on the included ATHENA-OHDSI database. To enable this search you should use the flag `--ohdsi-db`.
Running `Convert-Pheno`¶
Now you can proceed to run `convert-pheno` with the command-line interface. Please see how here.
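A sketch of a possible invocation for a partial export follows. `--ohdsi-db` is the flag described above, while the input and output flags (`--iomop`, `-obff`) are assumptions to confirm against the CLI help.

```bash
# Hypothetical sketch; confirm flag names with `convert-pheno --help`.
convert-pheno \
  --iomop omop_cdm_export.sql \
  --ohdsi-db \
  -obff individuals.json
```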
This section provides a summary of the steps to convert a CSV file with raw clinical data to Phenopackets v2.
- Since CSV files are "free-format," a mapping file is necessary to connect variables (i.e., fields) to something meaningful for `Convert-Pheno`. This mapping file will be used in the conversion process.
What is a `Convert-Pheno` mapping file?
A mapping file is a text file in YAML format (JSON is also accepted) that connects a set of variables to a format that is understood by `Convert-Pheno`. This file maps your variables to the required terms of the individuals entity from the Beacon v2 Models, which serves as the center model.
Creating a mapping file¶
To create a mapping file, start by reviewing the example mapping file provided with the installation. The goal is to replace the contents of that file with those from your own data. The mapping file contains the following types of data:
| Type | Required (Optional) | Required properties | Optional properties |
|---|---|---|---|
| Internal | `project` | `id`, `source`, `ontology`, `version` | `description`, `baselineFieldsToPropagate` |
| Beacon v2 terms | `id`, `sex` (`diseases`, `exposures`, `info`, `interventionsOrProcedures`, `measures`, `phenotypicFeatures`, `treatments`) | `fields` | `age`, `ageOfOnset`, `assignTermIdFromHeader`, `bodySite`, `dateOfProcedure`, `dictionary`, `drugDose`, `drugUnit`, `duration`, `durationUnit`, `familyHistory`, `fields`, `mapping`, `procedureCodeLabel`, `selector`, `terminology`, `unit`, `visitId` |
These are the properties needed to map your data to the entity `individuals` in the Beacon v2 Models (a minimal example follows the list):
- `baselineFieldsToPropagate`, an `array` of columns containing measurements that were taken only at the initial time point (time = 0). Use this if you wish to duplicate these columns across subsequent rows for the same patient ID. It is important to ensure that the row containing baseline information appears first in the CSV.
- `age`, a `string` representing the column that points to the age of the patient.
- `ageOfOnset`, an `object` representing the column that points to the age at which the patient first experienced symptoms or was diagnosed with a condition.
- `assignTermIdFromHeader`, an `array` of columns for which the ontology-term IDs have to be assigned from the header.
- `bodySite`, an `object` representing the column that points to the part of the body affected by a condition or where a procedure was performed.
- `dateOfProcedure`, an `object` representing the column that points to when a procedure took place.
- `dictionary`, an `object` in the form of `key: value`. The `key` represents the original variable name in your CSV and the `value` represents the "phrase" that will be used to query a database to find an ontology candidate. For instance, you may have a variable named `cigarettes_days`, but you know that in NCIt the label is `Average Number Cigarettes Smoked a Day`. In this case, you will use `cigarettes_days: Average Number Cigarettes Smoked a Day`.
- `drugDose`, an `object` representing the column that points to the dose column for each treatment.
- `drugUnit`, an `object` representing the column that points to the unit column for each treatment.
- `duration`, an `object` representing the column that points to the duration column for each treatment.
- `durationUnit`, an `object` representing the column that points to the duration unit column for each treatment.
- `familyHistory`, an `object` representing the column that points to the family medical history relevant to the patient's condition.
- `fields`, either a `string` or an `array` consisting of the name(s) of the CSV columns that map to that Beacon v2 term.
- `mapping`, an `object` in the form of `key: value` that we use to map our Beacon v2 objects to CSV variables.
- `procedureCodeLabel`, a nested `object` with specific mappings for `interventionsOrProcedures`.
- `ontology`, a `string` to define the ontology more granularly for this particular Beacon v2 term. If not present, the script will use the one from `project.ontology`.
- `routeOfAdministration`, a nested `object` with specific mappings for `treatments`.
- `selector`, a nested `object` value with specific mappings.
- `terminology`, a nested `object` value with user-defined ontology terms.
- `unit`, an `object` representing the column that points to the unit of measurement for a given value or treatment.
- `visitId`, the column with the visit occurrence ID.
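As in the REDCap tutorial, here is a minimal, hypothetical sketch of a mapping file. The column names (`gender`, `smoking_status`) are invented for illustration; the example mapping file bundled with the installation remains the authoritative template.

```yaml
# Illustrative sketch only; adapt the column names to your own CSV header.
project:
  id: my_csv_study
  source: csv
  ontology: ncit
  version: 1.0

sex:
  fields: gender               # CSV column holding the participant's sex

phenotypicFeatures:
  fields:
    - smoking_status           # CSV column(s) mapped to this term
  dictionary:
    # column name -> phrase used to look up an ontology label
    smoking_status: Smoking Status
```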
Defining the values in the property `dictionary`
Before assigning values to `dictionary`, it's important to think about which ontologies/terminologies you want to use. The field `project.ontology` defines the ontology for the whole project, but you can also specify another ontology at the Beacon v2 term level. Once you know which ontologies to use, try searching for each term to get an accurate label for it. For example, if you have chosen `ncit`, you can search for the values within NCIt at EBI Search. `Convert-Pheno` will use these values to retrieve the actual ontology terms from its internal databases.
Running `Convert-Pheno`¶
Now you can proceed to run `convert-pheno` with the command-line interface. Please see how here.
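A rough sketch of such a run follows. The `--icsv` flag appears earlier in this tutorial; the mapping-file and output flags are assumptions to verify against the CLI help.

```bash
# Hypothetical sketch; verify flag names with `convert-pheno --help`.
convert-pheno \
  --icsv clinical_data.csv \
  --mapping-file mapping_file.yaml \
  -opxf phenopackets.json
```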
More questions?
Please take a look at our Frequently Asked Questions.