Mapping Steps
Step 1: Conversion to the target model
Internally, all models are mapped to the Beacon v2 Models.
Why use Beacon v2 as the target model?
- JSON Schema Utilization: Beacon v2 employs JSON Schema for model content definition, facilitating transparency and accessibility in a collaborative environment compared to Phenopackets' Protobuf usage.
- Accommodation of Additional Properties: The Beacon v2 Models schema permits additional properties, enhancing adaptability and enabling near-lossless conversion, especially when using JSON in non-relational databases.
- Beacon v2 API Compatibility: The Beacon Friendly Format (BFF) is directly compatible with the Beacon v2 API ecosystem, a feature not available in Phenopackets without additional mapping.
- Expansion Possibility: Because the project is based at CNAG, a genomics institution, the potential to extend Convert-Pheno's mapping to other Beacon v2 entities was a significant consideration.
- Overlap with Phenopackets v2: Despite minor differences in nomenclature or hierarchy, many essential terms remain identical, encouraging interoperability.
Schema mapping
When starting a new conversion between two data models, the first step is to map variables between the two data schemas. At the time of writing (Sep-2023), the mapping of variables is still performed manually by human brains.
Mapping strategy: External or hardcoded?
In the early stages of development, we explored the possibility of employing configuration files to guide the mapping process as an alternative to hardcoded solutions. However, JSON data structures' complexity, mainly due to nesting, made this approach impractical for most scenarios, except for REDCap and CDISC-ODM data, which are mapped to Beacon v2 Models via configuration files.
In the Mapping tables section (accessible via the 'Technical Details' tab on the left navigation bar), we outline the equivalencies between different schemas. These tables fulfill several purposes:
- They're a quick way to help out the Health Data community.
- Experts can check them out and suggest changes without digging into all the code.
- If you want to chip in and create a new conversion, you can start by making a mapping table.
Notice
Please note that accurate mapping, even between two standards, is a substantial undertaking. While we possess expertise in certain areas, we certainly don't claim mastery in all. We sincerely welcome any suggestions or feedback.
From table mappings to code
The tables function as a reference for implementing the source code of Convert-Pheno. For each format conversion, there is a dedicated Perl submodule.
Contributing
While creating the code for a new format can be challenging, modifying properties in an existing one is much easier. Feel free to reach out to us should you plan to contribute.
Lossless or lossy conversion?
When converting data from one data standard to another, it is important to consider the possibility of losing information due to differences in schema and field mapping. To mitigate this, we aimed for a lossless conversion by incorporating non-mappable variables as additionalProperties within the Beacon v2 Models schema. This allows users to access the original variables and their values through database queries, especially when using non-relational databases like MongoDB.
During the conversion process, handling variables that cannot be directly mapped can result in one of two scenarios:
Often, the input data model has variables that do not directly map to the target but are still useful to retain in the output format. If the target format allows for extra properties in a given term (as BFF does), these original variables are stored under the _info property (or _ + "property name"). This commonly happens in conversions from OMOP CDM to BFF.
Example extracted from omop2bff conversion:
"interventionsOrProcedures" : [
  {
    "_info" : {
      "PROCEDURE_OCCURRENCE" : {
        "OMOP_columns" : {
          "modifier_concept_id" : 0,
          "modifier_source_value" : null,
          "person_id" : 2,
          "procedure_concept_id" : 4163872,
          "procedure_date" : "1955-10-22",
          "procedure_datetime" : "1955-10-22 00:00:00",
          "procedure_occurrence_id" : 6,
          "procedure_source_concept_id" : 4163872,
          "procedure_source_value" : 399208008,
          "procedure_type_concept_id" : 38000275,
          "provider_id" : "\\N",
          "quantity" : "\\N",
          "visit_detail_id" : 0,
          "visit_occurrence_id" : 103
        }
      }
    },
    "ageAtProcedure" : {
      "age" : {
        "iso8601duration" : "35Y"
      }
    },
    "dateOfProcedure" : "1955-10-22",
    "procedureCode" : {
      "id" : "SNOMED:399208008",
      "label" : "Plain chest X-ray"
    }
  }
]
Example extracted from redcap2bff conversion:
"treatments" : [
  {
    "_info" : {
      "dose" : null,
      "drug" : "budesonide",
      "drug_name" : "budesonide",
      "duration" : null,
      "field" : "budesonide_oral_status",
      "route" : "oral",
      "start" : null,
      "status" : "never treated",
      "value" : 1
    },
    "doseIntervals" : [],
    "routeOfAdministration" : {
      "id" : "NCIT:C38288",
      "label" : "Oral Route of Administration"
    },
    "treatmentCode" : {
      "id" : "NCIT:C1027",
      "label" : "Budesonide"
    }
  }
]
Example of longitudinal data stored under _visit in an omop2bff conversion:
  "_visit" : {
    "_info" : {
      "VISIT_OCCURRENCE" : {
        "OMOP_columns" : {
          "admitting_source_concept_id" : 0,
          "admitting_source_value" : null,
          "care_site_id" : "\\N",
          "discharge_to_concept_id" : 0,
          "discharge_to_source_value" : null,
          "person_id" : 3,
          "preceding_visit_occurrence_id" : 347,
          "provider_id" : "\\N",
          "visit_concept_id" : 9201,
          "visit_end_date" : "1972-12-21",
          "visit_end_datetime" : "1972-12-21 00:00:00",
          "visit_occurrence_id" : 312,
          "visit_source_concept_id" : 0,
          "visit_source_value" : "5d035dd1-30d9-4389-b94c-64947bf1f18c",
          "visit_start_date" : "1972-12-20",
          "visit_start_datetime" : "1972-12-20 00:00:00",
          "visit_type_concept_id" : 44818517
        }
      }
    },
    "concept" : {
      "id" : "Visit:IP",
      "label" : "Inpatient Visit"
    },
    "end_date" : "1972-12-21T00:00:00Z",
    "id" : "312",
    "occurrence_id" : 312,
    "start_date" : "1972-12-20T00:00:00Z",
    "type" : {
      "id" : "Visit Type:OMOP4822465",
      "label" : "Visit derived from encounter on claim"
    }
  },
  "featureType" : {
    "id" : "SNOMED:428251008",
    "label" : "History of appendectomy"
  },
  "onset" : {
    "iso8601duration" : "56Y"
  }
}
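The first scenario (keeping unmapped source variables under _info) can be sketched in a few lines of Python. This is only an illustration of the idea, not Convert-Pheno's actual Perl implementation: the FIELD_MAP dictionary and the convert_record function are hypothetical names, and the nesting under the table name and OMOP_columns simply mirrors the omop2bff examples above.

```python
import json

# Hypothetical mapping from source column names to target field names.
# Illustrative only; these are not Convert-Pheno's real mapping tables.
FIELD_MAP = {
    "procedure_date": "dateOfProcedure",
}


def convert_record(source_row: dict, table_name: str) -> dict:
    """Map the known columns and keep the full original row under `_info`."""
    target = {
        target_key: source_row[source_key]
        for source_key, target_key in FIELD_MAP.items()
        if source_key in source_row
    }
    # Store a copy of every original column so nothing is lost in the conversion.
    target["_info"] = {table_name: {"OMOP_columns": dict(source_row)}}
    return target


row = {
    "person_id": 2,
    "procedure_concept_id": 4163872,
    "procedure_date": "1955-10-22",
    "quantity": "\\N",
}
converted = convert_record(row, "PROCEDURE_OCCURRENCE")
print(json.dumps(converted, indent=2, sort_keys=True))
```

Because the whole source row is copied, columns without a counterpart in the target (such as quantity here) remain available to downstream queries.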
When a variable corresponds to other entities in Beacon v2 Models, it is stored within the info term of the individuals entity. For instance, a PXF file may contain the biosamples property, which has no direct match in the individuals entity because it corresponds to the biosamples entity in Beacon v2 Models. To ensure the retention of this information, we place it under info.phenopacket.biosamples.
Example extracted from pxf2bff conversion:
"info" : {
  "phenopacket" : {
    "biosamples" : [
      {
        "id" : "biosample.1",
        "phenotypicFeatures" : [
          {
            "excluded" : false,
            "type" : {
              "id" : "HP:0003798",
              "label" : "Nemaline bodies"
            }
          }
        ],
        "procedure" : {
          "bodySite" : {
            "id" : "UBERON:0002378",
            "label" : "muscle of abdomen"
          },
          "code" : {
            "id" : "NCIT:C51895",
            "label" : "Muscle Biopsy"
          },
          "performed" : {
            "age" : {
              "iso8601duration" : "P1D"
            }
          }
        },
        "sampledTissue" : {
          "id" : "UBERON:0002378",
          "label" : "muscle of abdomen"
        }
      }
    ]
  }
}
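The second scenario can be illustrated with a deliberately simplified Python sketch. The INDIVIDUALS_KEYS set and the to_individual function are hypothetical, and the input is a toy dictionary rather than a complete Phenopacket; the only point is to show top-level properties without a home in the individuals entity being moved under info.phenopacket.

```python
# Hypothetical set of top-level keys treated as part of the `individuals`
# entity; the real schema has more terms than listed here.
INDIVIDUALS_KEYS = {"id", "sex", "phenotypicFeatures"}


def to_individual(pxf: dict) -> dict:
    """Keep known terms at the top level; move everything else under info.phenopacket."""
    individual = {key: value for key, value in pxf.items() if key in INDIVIDUALS_KEYS}
    unmatched = {key: value for key, value in pxf.items() if key not in INDIVIDUALS_KEYS}
    if unmatched:
        individual["info"] = {"phenopacket": unmatched}
    return individual


pxf = {
    "id": "patient.1",
    "biosamples": [{"id": "biosample.1"}],
}
individual = to_individual(pxf)
# The biosample is still reachable after the conversion.
print(individual["info"]["phenopacket"]["biosamples"][0]["id"])
```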
Preservation and augmentation of ontologies
One of the advantages of Beacon/Phenopackets v2 is that they do not prescribe the use of specific ontologies, thus allowing us to retain the original ontologies, except to fill in missing terms in required fields.
Which ontologies/terminologies are supported?
If the input files contain ontology terms, the ontologies will be preserved and remain intact after the conversion process, except for:
- Beacon v2 Models and Phenopackets v2: the property sex is converted to NCI Thesaurus via database search.
- OMOP CDM: the properties sex, ethnicity, and geographicOrigin are converted to NCI Thesaurus via database search.
| | CSV | REDCap | CDISC-ODM | OMOP-CDM | Phenopackets v2 | Beacon v2 Models |
|---|---|---|---|---|---|---|
| Data mapping | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Add ontologies | ✓ | ✓ | ✓ | --ohdsi-db | | |
Database Search Feature
For input types that do not contain ontologies, such as CSV, REDCap, and CDISC-ODM, we perform a database search to fetch ontologies from a variety of trusted databases. Supported databases include:
- Athena-OHDSI standardized vocabulary, which includes multiple terminologies, such as SNOMED, RxNorm or LOINC
- NCI Thesaurus
- ICD-10 terminology
- CDISC (Study Data Tabulation Model Terminology)
- OMIM Online Mendelian Inheritance in Man
- HPO Human Phenotype Ontology (Note that prefixes are HP:, without the O)
About text similarity in database searches
Convert-Pheno comes with several pre-configured ontology/terminology databases. It supports three types of label-based search strategies:
1. exact (default)
Returns only exact matches for the given label string. If the label is not found exactly, no results are returned.
2. mixed (use --search mixed)
Hybrid search: it first tries to find an exact label match. If none is found, it performs a token-based similarity search and returns the closest matching concept based on the highest similarity score.
3. ✨ fuzzy (use --search fuzzy)
Hybrid search with fuzzy ranking: like mixed, it starts with an exact match attempt. If that fails, it performs a weighted similarity search, where:
- 90% of the score comes from token-based similarity (e.g., cosine or Dice coefficient),
- 10% comes from the normalized Levenshtein similarity.
The concept with the highest composite score is returned.
Note: The normalized Levenshtein similarity is computed on top of the candidate results produced by the full text search. In this approach, an initial full text search (using token-based methods) returns a set of potential matches. The fuzzy search then refines these results by applying the normalized Levenshtein distance to better handle minor typographical differences, ensuring that the final composite score reflects both overall token similarity and fine-grained character-level differences.
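The three strategies can be sketched in Python as follows. This is a simplified illustration, not the tool's implementation: the search function and its parameter names are hypothetical, cosine similarity over lowercase word sets stands in for the token-based score, exact matching is assumed to be case-insensitive, and every label is scored directly instead of first narrowing candidates with a full text search.

```python
import math


def tokens(label: str) -> set:
    """Lowercase word set used for the token-based score."""
    return set(label.lower().split())


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the binary word vectors of two labels."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def norm_lev(a: str, b: str) -> float:
    """Levenshtein distance rescaled to a similarity between 0 and 1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def search(query, labels, mode="exact", min_score=0.8, lev_weight=0.1):
    """Return (label, score) for the best match, or None if nothing qualifies."""
    # All modes start with an exact (here: case-insensitive) match attempt.
    for label in labels:
        if label.lower() == query.lower():
            return label, 1.0
    if mode == "exact":
        return None

    def score(label):
        token_score = cosine(query, label)
        if mode == "mixed":
            return token_score
        # fuzzy: weighted sum of token-based and character-level similarity.
        lev_score = norm_lev(query.lower(), label.lower())
        return (1.0 - lev_weight) * token_score + lev_weight * lev_score

    best = max(labels, key=score)
    best_score = score(best)
    return (best, best_score) if best_score >= min_score else None


LABELS = [
    "CDISC SDTM Sudden Death Syndrome Type Terminology",
    "Family History of Sudden Infant Death Syndrome",
    "Sudden Infant Death Syndrome",
]
for mode in ("exact", "mixed", "fuzzy"):
    print(mode, search("Sudden Death Syndrome", LABELS, mode=mode))
```

With this toy label list, exact returns nothing, while mixed and fuzzy both select Sudden Infant Death Syndrome, since its score is the only one above the 0.8 threshold.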
Example Search Behavior
Query: Exercise pain management
- With --search exact: ✅ Match found → Exercise Pain Management
Query: Brain Hemorrhage
- With --search mixed:
  - ❌ No exact match
  - ✅ Closest match by similarity: Intraventricular Brain Hemorrhage
💡 Similarity Threshold
The --min-text-similarity-score option sets the minimum threshold for mixed and fuzzy searches.
- Default: 0.8 (conservative)
- Lowering the threshold may increase recall but may introduce irrelevant matches.
⚠️ Performance Note
Both mixed and fuzzy modes are more computationally intensive and can produce unexpected or less interpretable matches. Use them with care, especially on large datasets.
🧪 Example Results Table
Below is an example showing how the query Sudden Death Syndrome performs using different search modes against the NCIt ontology:
Query | Search | NCIt match (label) | NCIt code | Cosine | Dice | Levenshtein (Normalized) | Composite
---|---|---|---|---|---|---|---
Sudden Death Syndrome | exact | NA | NA | NA | NA | NA | NA
Sudden Death Syndrome | mixed | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.60 | NA | NA
Sudden Death Syndrome | mixed | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.60 | NA | NA
Sudden Death Syndrome | mixed | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.60 | NA | NA
Sudden Death Syndrome | mixed | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 | NA | NA
Sudden Death Syndrome | ✨ fuzzy | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.60 | 0.43 | 0.63
Sudden Death Syndrome | ✨ fuzzy | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.60 | 0.43 | 0.63
Sudden Death Syndrome | ✨ fuzzy | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.60 | 0.46 | 0.63
Sudden Death Syndrome | ✨ fuzzy | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 | 0.75 | 0.85
Interpretation:
- With exact, there are no matches.
- With mixed, the best match will be Sudden Infant Death Syndrome.
- With fuzzy, the composite score (90% token-based + 10% Levenshtein similarity) is used to rank results. The highest match is Sudden Infant Death Syndrome, with a composite score of 0.85.
✨ Now we introduce a typo in the query, Sudden Infant Deth Syndrome:
Query | Mode | Candidate Label | Code | Cosine | Dice | Levenshtein (Normalized) | Composite
---|---|---|---|---|---|---|---
Sudden Infant Deth Syndrome | fuzzy | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.38 | 0.36 | 0.33 | 0.37
Sudden Infant Deth Syndrome | fuzzy | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.38 | 0.36 | 0.43 | 0.38
Sudden Infant Deth Syndrome | fuzzy | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.57 | 0.55 | 0.59 | 0.57
Sudden Infant Deth Syndrome | fuzzy | Sudden Infant Death Syndrome | NCIT:C85173 | 0.75 | 0.75 | 0.96 | 0.77
To capture the best match, we would need to lower the threshold to --min-text-similarity-score 0.75.
It is possible to change the weight of the Levenshtein similarity via --levenshtein-weight <floating 0.0 - 1.0>.
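To get a feel for how the Levenshtein weight interacts with the threshold, the short Python sketch below recomputes the composite score for the Sudden Infant Death Syndrome row using the rounded values from the table (cosine 0.75, normalized Levenshtein 0.96). The composite function is a hypothetical helper, and it assumes that the token weight is simply one minus the Levenshtein weight, which the text does not state explicitly.

```python
def composite(token_score: float, lev_score: float, lev_weight: float) -> float:
    """Weighted sum; assumes token weight = 1 - lev_weight (an assumption)."""
    return (1.0 - lev_weight) * token_score + lev_weight * lev_score


# Rounded values for the best candidate in the typo example.
token_score, lev_score = 0.75, 0.96

for lev_weight in (0.1, 0.3):
    score = composite(token_score, lev_score, lev_weight)
    print(f"lev_weight={lev_weight}: composite={score:.2f}, passes 0.8: {score >= 0.8}")
```

Under these assumptions the composite moves from about 0.77 at a weight of 0.1 to about 0.81 at a weight of 0.3, so giving more weight to character-level similarity is an alternative to lowering the threshold when typos are expected.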
Composite Similarity Score
The composite similarity score is computed as a weighted sum of two measures: the token-based similarity and the normalized Levenshtein similarity.
1. Token-Based Similarity
This is calculated using methods like cosine or Dice similarity to measure how similar the tokens (words) of two strings are.
2. Normalized Levenshtein Similarity
The normalized Levenshtein similarity is defined as:
\[
\text{NormalizedLevenshtein}(s_1, s_2) = 1 - \frac{\text{lev}(s_1, s_2)}{\max(|s_1|, |s_2|)}
\]
Where:
- \(\text{lev}(s_1, s_2)\) is the Levenshtein edit distance, that is, the minimum number of insertions, deletions, or substitutions required to change \(s_1\) into \(s_2\).
- \(|s_1|\) and \(|s_2|\) are the lengths of the strings \(s_1\) and \(s_2\), respectively.
This formula produces a score between 0 and 1, with 1.0 meaning identical strings and 0.0 meaning completely different strings.
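As a worked example, take the typo query from the results table and its best candidate. The calculation below assumes the distance is normalized by the length of the longer string, which reproduces the 0.96 reported in the table.

```latex
\begin{align*}
s_1 &= \text{``Sudden Infant Deth Syndrome''} \quad (|s_1| = 27) \\
s_2 &= \text{``Sudden Infant Death Syndrome''} \quad (|s_2| = 28) \\
\text{lev}(s_1, s_2) &= 1 \quad \text{(one inserted ``a'')} \\
1 - \frac{\text{lev}(s_1, s_2)}{\max(|s_1|, |s_2|)} &= 1 - \frac{1}{28} \approx 0.96
\end{align*}
```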
3. Composite Score Formula
The final composite similarity score \(C\) is a weighted combination of the two metrics:
\[
C = \alpha \cdot \text{TokenSimilarity}(s_1, s_2) + \beta \cdot \text{NormalizedLevenshtein}(s_1, s_2)
\]
Where:
- \(\alpha\) (or token_weight) is the weight assigned to the token-based similarity.
- \(\beta\) (or lev_weight) is the weight assigned to the normalized Levenshtein similarity.
A common default is to set \(\alpha = 0.9\) and \(\beta = 0.1\), emphasizing the token-based similarity. However, for short strings (4-5 words), you might consider adjusting the balance (for example, \(\alpha = 0.95\) and \(\beta = 0.05\)) if small typographical differences are less critical.
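For a concrete check, the rounded values reported for Sudden Infant Death Syndrome in the first results table (token-based similarity 0.86, normalized Levenshtein 0.75) can be plugged into the weighted sum. The second line uses the alternative weights mentioned above purely as an illustration; because the inputs are rounded, the results are approximate.

```latex
\begin{align*}
C_{\alpha = 0.9,\ \beta = 0.1} &= 0.9 \times 0.86 + 0.1 \times 0.75 = 0.774 + 0.075 = 0.849 \approx 0.85 \\
C_{\alpha = 0.95,\ \beta = 0.05} &= 0.95 \times 0.86 + 0.05 \times 0.75 = 0.817 + 0.0375 = 0.8545 \approx 0.85
\end{align*}
```

The first line matches the composite score of 0.85 shown in the table. For this pair the two weightings give nearly the same result, because the token-based score dominates in both.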
Step 2: Conversion to the final model
To Phenopackets
If the output is set to Phenopackets v2, then a second step (bff2pxf) is performed (see diagram above).
BFF and PXF community alignment
At present, we have prioritized mapping the terms that we deem most critical in facilitating basic semantic interoperability. We anticipate that Beacon v2 Models will become more aligned with Phenopackets v2, which will simplify the conversion process in future updates. We aim to refine the mappings in future iterations, with the community providing a wider range of case studies.
To OMOP CDM
If the output is set to OMOP CDM, then a second step (bff2omop) is performed (see diagram above).