# Tutorial
This page gives short, practical walkthroughs for three common convert-pheno workflows.
Google Colab version
A runnable notebook version is available in Google Colab. A local copy is also available in the repo.
Before you start
These examples assume that Convert-Pheno is already installed. If not, start with Download & Installation.
## REDCap to PXF
This is a good route when you have a REDCap export and want to produce Phenopackets.
You will usually need three files:
- REDCap data export in CSV format
- REDCap data dictionary in CSV format
- Mapping file in YAML or JSON format
Because REDCap projects are free-form, the mapping file is what tells Convert-Pheno how your project variables should be interpreted.
What is a Convert-Pheno mapping file?
A mapping file is a text file in YAML format (JSON is also accepted) that connects a set of variables to a format that is understood by Convert-Pheno.
In v0.30, the layout is entity-aware:
- `project` holds project-level metadata.
- `beacon` groups Beacon entities at the same level.
- `beacon.individuals` holds the semantic mapping rules to the `individuals` entity from the Beacon v2 Models, which remains the central normalized model for mapping-file-based conversions.
- `beacon.datasets`, `beacon.cohorts`, and `beacon.biosamples` hold optional metadata/defaults for emitted Beacon entities.
- These metadata overrides are currently consumed only by the conversion routes that use a mapping file: `csv2bff`, `redcap2bff`, and `cdisc2bff`.
The `beacon.individuals` wrapper is mandatory in v0.30.
### Mental model
A mapping section inside `beacon.individuals`, such as `diseases`, `exposures`, or `treatments`, usually answers four questions:

- Which source columns participate? Use `fields`.
- If the source field name is not the ontology label, what field-level term should be searched? Use `fieldTermLabels`.
- If the recorded value is not the ontology label, what value-level term should be searched? Use `valueTermLabels`.
- If extra target-side attributes are needed, where do they come from? Use `targetFields` for simple target attributes and `fieldRules` for per-field nested rules.

In practice:

- `fieldTermLabels` describes the meaning of the column/header itself.
- `valueTermLabels` describes the meaning of the recorded cell value.
- `targetFields` points to source columns used to populate target-side attributes such as `primaryKey`, `age`, or `date`.
- `fieldRules` holds field-specific nested configuration, for example value-to-term rules or auxiliary pointers such as `ageAtExposure`.
### Creating a mapping file
To create a mapping file, start by reviewing the example mapping file provided with the installation. The goal is to replace the contents of that file with those from your REDCap project. The mapping file contains the following types of data:
Minimal mapping skeleton

```yaml
project:
  id: my_project
  source: redcap
  ontology: ncit
  version: 0.1
beacon:
  datasets:
    id: my-project-dataset
    name: My Project Dataset
  cohorts:
    id: my-project-cohort
    name: My Project Cohort
    cohortType: study-defined
  individuals:
    id:
      fields: [record_id, visit_name]
      targetFields:
        primaryKey: record_id
    sex:
      fields: sex
      valueTermLabels:
        Male: Male
        Female: Female
    diseases:
      fields: [diagnosis]
      valueTermLabels:
        UC: Ulcerative Colitis
        CD: Crohn Disease
    exposures:
      fields: [smoking]
      fieldTermLabels:
        smoking: Smoking
      valueTermLabels:
        Current smoker: Current Smoker
        Ex-smoker: Former Smoker
        Never smoked: Never Smoker
    info:
      fields: [record_id, age, visit_name]
      targetFields:
        age: age
```
This skeleton shows the minimum structure most users need first:

- `project` defines the source and default ontology.
- `beacon` groups all Beacon entity sections in one place.
- `beacon.individuals` contains the semantic mapping for the Beacon `individuals` entity.
- `beacon.individuals.id.targetFields.primaryKey` points to the source column used as the main individual identifier.
- `fieldTermLabels` maps the meaning of a column/header.
- `valueTermLabels` maps the meaning of a recorded cell value.
- `targetFields` maps extra target-side attributes such as `age`.
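Since mapping files may also be written in JSON, the skeleton's structure can be sanity-checked programmatically before running a conversion. The sketch below is unofficial and only mirrors the required properties listed on this page; Convert-Pheno performs its own, stricter validation:

```python
import json

# A JSON rendering of (part of) the minimal skeleton above.
mapping = json.loads("""
{
  "project": {"id": "my_project", "source": "redcap",
              "ontology": "ncit", "version": "0.1"},
  "beacon": {
    "individuals": {
      "id":  {"fields": ["record_id", "visit_name"],
              "targetFields": {"primaryKey": "record_id"}},
      "sex": {"fields": "sex",
              "valueTermLabels": {"Male": "Male", "Female": "Female"}}
    }
  }
}
""")

# project requires id, source, ontology, and version.
missing = {"id", "source", "ontology", "version"} - mapping["project"].keys()
assert not missing, f"project is missing: {missing}"

# The beacon.individuals wrapper is mandatory in v0.30.
individuals = mapping["beacon"]["individuals"]

# id and sex are the required entity-mapping properties.
for prop in ("id", "sex"):
    assert prop in individuals, f"beacon.individuals is missing: {prop}"

print("mapping structure looks OK")
```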
| Type | Property | Required properties | Optional properties |
|---|---|---|---|
| Internal | `project` | `id`, `source`, `ontology`, `version` | `description`, `baselineFieldsToPropagate` |
| Beacon entities | `beacon` | `individuals` | `datasets`, `cohorts`, `biosamples` |
| Entity mapping | `beacon.individuals` | `id`, `sex` | `diseases`, `exposures`, `info`, `interventionsOrProcedures`, `measures`, `phenotypicFeatures`, `treatments`, `ethnicity`, `geographicOrigin`, `karyotypicSex`, `pedigrees` |
These are the properties needed to map your data to the entity `individuals` in the Beacon v2 Models:

- `beacon.individuals`, an `object` containing the semantic mapping rules for the Beacon `individuals` entity.
- `beacon`, a top-level `object` with the entity sections. Use `beacon.datasets` and `beacon.cohorts` to override synthesized metadata such as `id`, `name`, `description`, `externalUrl`, `cohortType`, or `cohortDataTypes`. These values are merged with the tool-generated defaults. This augmentation currently applies only to `csv2bff`, `redcap2bff`, and `cdisc2bff`.
- `baselineFieldsToPropagate`, an `array` of columns containing measurements that were taken only at the initial time point (time = 0). Use this if you wish to duplicate these columns across subsequent rows for the same patient ID. Make sure that the row containing baseline information appears first in the CSV.
- `age`, a `string` naming the column with the age of the patient.
- `ageAtProcedure`, an `object` pointing to the column with the age at which a procedure took place.
- `ageOfOnset`, an `object` pointing to the column with the age at which the patient first experienced symptoms or was diagnosed with a condition.
- `bodySite`, an `object` pointing to the column with the part of the body affected by a condition or where a procedure was performed.
- `dateOfProcedure`, an `object` pointing to the column with the date when a procedure took place.
- `drugDose`, an `object` pointing to the dose column for each treatment.
- `drugUnit`, an `object` pointing to the unit column for each treatment.
- `duration`, an `object` pointing to the duration column for each treatment.
- `durationUnit`, an `object` pointing to the duration unit column for each treatment.
- `familyHistory`, an `object` pointing to the column with the family medical history relevant to the patient's condition.
- `fieldRules`, a nested `object` with per-field rules such as value-to-term mappings or auxiliary field configuration like `ageAtExposure`. Use this when a single Beacon term needs field-specific behavior rather than one global rule.
- `fieldTermLabels`, an `object` in the form of `key: value`. The `key` is the original variable or header name and the `value` is the ontology query phrase used for the field itself. Use this when the column name carries the term meaning. For instance, you may have a variable named `cigarettes_days`, but you know that in NCIt the label is `Average Number Cigarettes Smoked a Day`. In this case, you will use `cigarettes_days: Average Number Cigarettes Smoked a Day`.
- `fields`, either a `string` or an `array` with the names of the source variables that map to that Beacon v2 term.
- `ontology`, a `string` that defines more granularly the ontology for this particular Beacon v2 term. If not present, the script will use the one from `project.ontology`.
- `procedureCodeLabel`, a nested `object` with specific mappings for `interventionsOrProcedures`.
- `routeOfAdministration`, a nested `object` with specific mappings for `treatments`.
- `targetFields`, an `object` in the form of `key: value` that maps target-side attributes such as `primaryKey`, `age`, `date`, or `duration` to source columns. Use this when the target model expects a named attribute that is not itself an ontology lookup.
- `terminology`, a nested `object` with user-defined ontology terms. Use this when you already know the exact ontology object and want to bypass database lookup for that term.
- `useHeaderAsTermLabel`, an `array` of columns for which the ontology-term labels must be taken from the header instead of the recorded value. This is common for checkbox-like columns where the header names the term and the cell only says whether it is present.
- `unit`, an `object` pointing to the column with the unit of measurement for a given value or treatment.
- `valueTermLabels`, an `object` in the form of `key: value` where the `key` is the original recorded value and the `value` is the ontology query phrase used to map that value. Use this when the cell value carries the term meaning, for example `Current smoker` → `Current Smoker`.
- `visitId`, the column with the visit occurrence id.
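To make `baselineFieldsToPropagate` from the list above concrete, here is a small sketch of the propagation idea with made-up column names (illustrative only, not Convert-Pheno's implementation): baseline-only values recorded in the first row of each patient are copied forward to that patient's later rows.

```python
import csv
import io

# Hypothetical longitudinal CSV: 'smoking_at_baseline' was recorded
# only at the first visit (time = 0) of each patient.
raw = """record_id,visit_name,smoking_at_baseline,crp
P1,baseline,Current smoker,5.1
P1,month_6,,3.2
P2,baseline,Never smoked,1.0
P2,month_6,,0.8
"""

# Columns that would be listed in baselineFieldsToPropagate.
baseline_fields = ["smoking_at_baseline"]

rows = list(csv.DictReader(io.StringIO(raw)))
first_seen = {}  # record_id -> baseline values
for row in rows:
    rid = row["record_id"]
    if rid not in first_seen:
        # The baseline row must appear first for this to work,
        # which is exactly what the documentation requires.
        first_seen[rid] = {f: row[f] for f in baseline_fields}
    else:
        for f in baseline_fields:
            row[f] = row[f] or first_seen[rid][f]

print(rows[1]["smoking_at_baseline"])  # → Current smoker
```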
Field vs value mapping

```yaml
exposures:
  fields: [smoking, cigarettes_days]
  fieldTermLabels:
    smoking: Smoking
    cigarettes_days: Average Number Cigarettes Smoked a Day
  valueTermLabels:
    Current smoker: Current Smoker
    Ex-smoker: Former Smoker
    Never smoked: Never Smoker
```

In this example:

- `smoking → Smoking` comes from the field/header, so it belongs in `fieldTermLabels`.
- `Current smoker → Current Smoker` comes from the recorded value, so it belongs in `valueTermLabels`.
Defining the values in `fieldTermLabels` and `valueTermLabels`
Before assigning values to `fieldTermLabels` or `valueTermLabels`, think about which ontologies or terminologies you want to use. The field `project.ontology` defines the ontology for the whole project, but you can also specify another ontology at the Beacon v2 term level. Once you know which ontologies to use, search for accurate labels first. For example, if you have chosen `ncit`, you can search for the values within NCIt at EBI Search. Convert-Pheno will use these values to retrieve the actual ontology term from its internal databases.
For mapping-file-based conversions, Convert-Pheno can also use similarity-based lookup to help connect source fields to target terms:
About text similarity in database searches
Convert-Pheno comes with several pre-configured ontology/terminology databases. It supports three types of label-based search strategies:
#### 1. `exact` (default)
Returns only exact matches for the given label string. If the label is not found exactly, no results are returned.
#### 2. `mixed` (use `--search mixed`)
Hybrid search: First tries to find an exact label match. If none is found, it performs a token-based similarity search and returns the closest matching concept based on the highest similarity score.
#### 3. ✨ `fuzzy` (use `--search fuzzy`)
Hybrid search with fuzzy ranking:
Like mixed, it starts with an exact match attempt. If that fails, it performs a weighted similarity search, where:
- 90% of the score comes from token-based similarity (e.g., cosine or Dice coefficient),
- 10% comes from the normalized Levenshtein similarity.
The concept with the highest composite score is returned.
Note: The normalized Levenshtein similarity is computed on top of the candidate results produced by the full text search. In this approach, an initial full text search (using token-based methods) returns a set of potential matches. The fuzzy search then refines these results by applying the normalized Levenshtein distance to better handle minor typographical differences, ensuring that the final composite score reflects both overall token similarity and fine-grained character-level differences.
#### 🔍 Example Search Behavior

Query: `Exercise pain management`

- With `--search exact`: ✅ Match found — Exercise Pain Management

Query: `Brain Hemorrhage`

- With `--search mixed`:
  - ❌ No exact match
  - ✅ Closest match by similarity: Intraventricular Brain Hemorrhage
#### 💡 Similarity Threshold

The `--min-text-similarity-score` option sets the minimum threshold for `mixed` and `fuzzy` searches.

- Default: `0.8` (conservative)
- Lowering the threshold may increase recall but can introduce irrelevant matches.
#### ⚠️ Performance Note
Both mixed and fuzzy modes are more computationally intensive and can produce unexpected or less interpretable matches. Use them with care, especially on large datasets.
#### 🧪 Example Results Table

Below is an example showing how the query `Sudden Death Syndrome` performs using different search modes against the NCIt ontology:
| Query | Search | NCIt match (label) | NCIt code | Cosine | Dice | Levenshtein (normalized) | Composite |
|---|---|---|---|---|---|---|---|
| Sudden Death Syndrome | exact | NA | NA | NA | NA | NA | NA |
| Sudden Death Syndrome | mixed | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.60 | NA | NA |
| Sudden Death Syndrome | mixed | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.60 | NA | NA |
| Sudden Death Syndrome | mixed | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.60 | NA | NA |
| Sudden Death Syndrome | mixed | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 | NA | NA |
| Sudden Death Syndrome | ✨ fuzzy | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.60 | 0.43 | 0.63 |
| Sudden Death Syndrome | ✨ fuzzy | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.60 | 0.43 | 0.63 |
| Sudden Death Syndrome | ✨ fuzzy | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.60 | 0.46 | 0.63 |
| Sudden Death Syndrome | ✨ fuzzy | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 | 0.75 | 0.85 |
Interpretation:

- With `exact`, there are no matches.
- With `mixed`, the best match is `Sudden Infant Death Syndrome`.
- With `fuzzy`, the composite score (90% token-based + 10% Levenshtein similarity) is used to rank results. The highest match is `Sudden Infant Death Syndrome`, with a composite score of 0.85.
✨ Now we introduce a typo in the query: `Sudden Infant Deth Syndrome`:
| Query | Mode | Candidate label | Code | Cosine | Dice | Levenshtein (normalized) | Composite |
|---|---|---|---|---|---|---|---|
| Sudden Infant Deth Syndrome | fuzzy | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.38 | 0.36 | 0.33 | 0.37 |
| Sudden Infant Deth Syndrome | fuzzy | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.38 | 0.36 | 0.43 | 0.38 |
| Sudden Infant Deth Syndrome | fuzzy | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.57 | 0.55 | 0.59 | 0.57 |
| Sudden Infant Deth Syndrome | fuzzy | Sudden Infant Death Syndrome | NCIT:C85173 | 0.75 | 0.75 | 0.96 | 0.77 |
To capture the best match we would need to lower the threshold with `--min-text-similarity-score 0.75`.

It is possible to change the weight of the Levenshtein similarity via `--levenshtein-weight <float between 0.0 and 1.0>`.
Composite Similarity Score
The composite similarity score is computed as a weighted sum of two measures: the token-based similarity and the normalized Levenshtein similarity.
#### 1. Token-Based Similarity
This is calculated using methods like cosine or Dice similarity to measure how similar the tokens (words) of two strings are.
#### 2. Normalized Levenshtein Similarity
The normalized Levenshtein similarity is defined as:

$$ \text{lev}_{\text{norm}}(s_1, s_2) = 1 - \frac{\text{lev}(s_1, s_2)}{\max(|s_1|, |s_2|)} $$

Where:

- \(\text{lev}(s_1, s_2)\) is the Levenshtein edit distance—the minimum number of insertions, deletions, or substitutions required to change \(s_1\) into \(s_2\).
- \(|s_1|\) and \(|s_2|\) are the lengths of the strings \(s_1\) and \(s_2\), respectively.
This formula produces a score between 0 and 1, with 1.0 meaning identical strings and 0.0 meaning completely different strings.
#### 3. Composite Score Formula

The final composite similarity score \(C\) is a weighted combination of the two metrics:

$$ C = \alpha \cdot \text{sim}_{\text{token}}(s_1, s_2) + \beta \cdot \text{lev}_{\text{norm}}(s_1, s_2) $$

Where:
- \(\alpha\) (or token_weight) is the weight assigned to the token-based similarity.
- \(\beta\) (or lev_weight) is the weight assigned to the normalized Levenshtein similarity.
A common default is to set \(\alpha = 0.9\) and \(\beta = 0.1\), emphasizing the token-based similarity. However, for short strings (4–5 words), you might consider adjusting the balance (for example, \(\alpha = 0.95\) and \(\beta = 0.05\)) if small typographical differences are less critical.
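The scoring above can be sketched in a few lines. This is an illustrative re-implementation, not Convert-Pheno's internal code; it uses the Dice coefficient as the token-based measure. On the typo example above, this sketch reproduces the table's values closely (Dice 0.75, normalized Levenshtein ≈ 0.96, composite ≈ 0.77 for the best candidate).

```python
def dice_similarity(a: str, b: str) -> float:
    """Token-based Dice coefficient over lowercased word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))

def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    m = max(len(a), len(b))
    return 1.0 if m == 0 else 1.0 - levenshtein(a, b) / m

def composite(a: str, b: str, alpha: float = 0.9, beta: float = 0.1) -> float:
    """Weighted sum: alpha * token similarity + beta * normalized Levenshtein."""
    return alpha * dice_similarity(a, b) + beta * normalized_levenshtein(a, b)

# Rank the candidates from the typo example above.
query = "Sudden Infant Deth Syndrome"
candidates = [
    "CDISC SDTM Sudden Death Syndrome Type Terminology",
    "Family History of Sudden Infant Death Syndrome",
    "Sudden Infant Death Syndrome",
]
best = max(candidates, key=lambda c: composite(query, c))
print(best)  # → Sudden Infant Death Syndrome
```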
Run the conversion:
```bash
convert-pheno -iredcap redcap.csv \
              --redcap-dictionary dictionary.csv \
              --mapping-file mapping.yaml \
              -opxf phenopackets.json
```
If you need more detail about REDCap-specific behavior, see REDCap.
## OMOP CDM to BFF
This route is meant for OMOP exports in SQL or CSV form.
Two situations are common:

- Full export: the `CONCEPT` table already contains the standardized terms needed for conversion.
- Partial export: some terms are missing, so Convert-Pheno needs the bundled ATHENA-OHDSI lookup database and the `--ohdsi-db` flag.
For smaller inputs, a one-shot conversion is enough; for larger inputs, prefer the streaming mode.
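As an illustrative sketch (file names are placeholders; adjust to your export), the two regimes might look like this:

```bash
# Smaller inputs: load everything in memory and convert in one pass.
convert-pheno -iomop omop_dump.sql -obff individuals.json

# Larger inputs: stream the conversion, and use the bundled ATHENA-OHDSI
# database for concepts missing from a partial export.
convert-pheno -iomop omop_dump.sql -obff individuals.json --stream --ohdsi-db
```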
If you are working with OMOP regularly, see OMOP-CDM for the fuller explanation of SQL, CSV, CONCEPT, and streaming behavior.
If you want entity-aware BFF output instead of the individuals-only `individuals.json` path, request the entities explicitly:

```bash
convert-pheno -iomop PERSON.csv CONCEPT.csv DRUG_EXPOSURE.csv \
              -obff \
              --entities individuals datasets cohorts \
              --out-dir out/
```
In mapping-file workflows, the top-level `beacon` section can override synthesized `datasets` and `cohorts` metadata. This currently applies to `csv2bff`, `redcap2bff`, and `cdisc2bff`, which are the routes that use a mapping file.
## CSV to BFF
This route is intended for raw clinical CSV data that does not already follow one of the supported data models.
As with REDCap, the key requirement is a mapping file that connects your CSV fields to terms understood by Convert-Pheno.
The mapping file format, the available `beacon.individuals` properties, and the similarity-based search options (`--search`, `--min-text-similarity-score`, `--levenshtein-weight`) are identical to those described in the REDCap to PXF section above; see that section for the full walkthrough.
Run the conversion:
```bash
convert-pheno -icsv clinical_data.csv \
              --mapping-file clinical_data_mapping.yaml \
              -obff individuals.json
```
If your separator is not the default one expected by the tool, add `--sep`.
If you want datasets and cohorts as well, switch to entity mode:
```bash
convert-pheno -icsv clinical_data.csv \
              --mapping-file clinical_data_mapping.yaml \
              -obff \
              --entities individuals datasets cohorts \
              --out-dir out/
```