🔍 Mapping Steps

Step 1: Conversion to the target model

Internally, all models are mapped to the Beacon v2 Models.

%%{init: {'theme':'neutral'}}%%
graph LR
  subgraph "Step 1:Conversion to BFF"
    B[Phenopackets v2] -->|pxf2bff| A
    C[REDCap] -->|redcap2bff| A[Beacon v2 Models]
    D[OMOP-CDM] -->|omop2bff| A
    E[CDISC-ODM] -->|cdisc2bff| A
    G[CSV] -->|csv2bff| A
  end
  subgraph "Step 2:BFF to Final"
    A --> |bff2pxf | F[Phenopackets v2]
    A --> |bff2omop| H[OMOP-CDM]
  end
  style A fill: #6495ED, stroke: #6495ED
  style B fill: #FF7F50, stroke: #FF7F50
  style C fill: #FF6965, stroke: #FF6965
  style D fill: #3CB371, stroke: #3CB371
  style E fill: #DDA0DD, stroke: #DDA0DD
  style F fill: #FF7F50, stroke: #FF7F50
  style G fill: #FFFF00, stroke: #FFFF00
  style H fill: #3CB371, stroke: #3CB371

Convert-Pheno internal mapping steps

Why use Beacon v2 as the target model?

- JSON Schema Utilization: Beacon v2 employs JSON Schema for model content definition, which facilitates transparency and accessibility in a collaborative environment compared with Phenopackets' use of Protobuf.
- Accommodation of Additional Properties: The Beacon v2 Models schema permits additional properties, enhancing adaptability and enabling near-lossless conversion, especially when using JSON in non-relational databases.
- Beacon v2 API Compatibility: The BFF is directly compatible with the Beacon v2 API ecosystem, a feature not available in Phenopackets without additional mapping.
- Expansion Possibility: Because we are based at CNAG, a genomics institution, the potential to extend Convert-Pheno's mapping to other Beacon v2 entities was a significant consideration.
- Overlap with Phenopackets v2: Despite minor differences in nomenclature or hierarchy, many essential terms remain identical, encouraging interoperability.

Schema mapping

When starting a new conversion between two data models, the first step is to map variables between the two data schemas. At the time of writing (Sep-2023), the mapping of variables is still performed manually by human brains.

Mapping strategy: External or hardcoded?

In the early stages of development, we explored the possibility of employing configuration files to guide the mapping process as an alternative to hardcoded solutions. However, JSON data structures' complexity, mainly due to nesting, made this approach impractical for most scenarios, except for REDCap and CDISC-ODM data, which are mapped to Beacon v2 Models via configuration files.

In the Mapping tables section (accessible via the 'Technical Details' tab on the left navigation bar), we outline the equivalencies between different schemas. These tables fulfill several purposes:

- It's a quick way to help out the Health Data community.
- Experts can check it out and suggest changes without digging into all the code.
- If you want to chip in and create a new conversion, you can start by making a mapping table.

Notice

Please note that accurate mapping, even between just two standards, is a substantial undertaking. While we possess expertise in certain areas, we certainly don't claim mastery in all of them. We sincerely welcome any suggestions or feedback.

From table mappings to code

The tables function as a reference for implementing the source code of Convert-Pheno. For each format conversion, there is a dedicated Perl submodule.

Contributing

While creating the code for a new format can be challenging, modifying properties in an existing one is much easier. Feel free to reach out to us if you plan to contribute.

Lossless or lossy conversion?

When converting data from one data standard to another, it is important to consider the possibility of losing information due to differences in schema and field mapping. To mitigate this, we aimed for a lossless conversion by incorporating non-mappable variables as additionalProperties within the Beacon v2 Models schema. This allows users to access the original variables and their values through database queries, especially when using non-relational databases like MongoDB.

During the conversion process, handling variables that cannot be directly mapped can result in one of two scenarios:

- Unmappable variables
- Match to a different entity

Often, the input data model has variables that do not directly map to the target but are still useful to retain in the output format. If the target format allows for extra properties in a given term (as BFF does), these original variables are stored under the _info property (or _ + ‘property name’). This commonly happens in conversions from OMOP CDM to BFF.
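The retention strategy described above can be sketched in a few lines. This is a minimal Python illustration only, not Convert-Pheno's Perl implementation: the function `map_with_info` and the one-entry `FIELD_MAP` are hypothetical names invented for this example, and the nesting under `OMOP_columns` simply mirrors the omop2bff output shown in this section.

```python
# Hypothetical mapping from source column names to target property names.
FIELD_MAP = {
    "procedure_date": "dateOfProcedure",
}


def map_with_info(source_row, field_map, source_table):
    """Map the known columns and keep the whole original row under `_info`."""
    target = {}
    for column, value in source_row.items():
        if column in field_map:
            target[field_map[column]] = value
    # Retain every original column so that no source value is lost.
    target["_info"] = {source_table: {"OMOP_columns": dict(source_row)}}
    return target


row = {"person_id": 2, "procedure_date": "1955-10-22", "quantity": "\\N"}
converted = map_with_info(row, FIELD_MAP, "PROCEDURE_OCCURRENCE")
print(converted["dateOfProcedure"])
print(converted["_info"]["PROCEDURE_OCCURRENCE"]["OMOP_columns"]["quantity"])
```

In this sketch `quantity` has no counterpart in the target, yet it stays reachable through `_info`, which is what makes the conversion near-lossless.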

Example extracted from omop2bff conversion:

"interventionsOrProcedures" : [
       {
          "_info" : {
             "PROCEDURE_OCCURRENCE" : {
                "OMOP_columns" : {
                   "modifier_concept_id" : 0,
                   "modifier_source_value" : null,
                   "person_id" : 2,
                   "procedure_concept_id" : 4163872,
                   "procedure_date" : "1955-10-22",
                   "procedure_datetime" : "1955-10-22 00:00:00",
                   "procedure_occurrence_id" : 6,
                   "procedure_source_concept_id" : 4163872,
                   "procedure_source_value" : 399208008,
                   "procedure_type_concept_id" : 38000275,
                   "provider_id" : "\\N",
                   "quantity" : "\\N", 
                   "visit_detail_id" : 0,
                   "visit_occurrence_id" : 103
                }
             }
          },
          "ageAtProcedure" : {
             "age" : {
                "iso8601duration" : "35Y"
             }
          },
          "dateOfProcedure" : "1955-10-22",
          "procedureCode" : {
             "id" : "SNOMED:399208008",
             "label" : "Plain chest X-ray"
          }
       }
 ]

Example extracted from redcap2bff conversion:

"treatments" : [
       {
          "_info" : {
             "dose" : null,
             "drug" : "budesonide",
             "drug_name" : "budesonide",
             "duration" : null,
             "field" : "budesonide_oral_status",
             "route" : "oral",
             "start" : null,
             "status" : "never treated",
             "value" : 1
          },
          "doseIntervals" : [],
          "routeOfAdministration" : {
             "id" : "NCIT:C38288",
             "label" : "Oral Route of Administration"
          },
          "treatmentCode" : {
             "id" : "NCIT:C1027",
             "label" : "Budesonide"
          }
       }
]

Example of longitudinal data stored under _visit in an omop2bff conversion:

{
     "_visit" : {
        "_info" : {
           "VISIT_OCCURRENCE" : {
              "OMOP_columns" : {
                 "admitting_source_concept_id" : 0,
                 "admitting_source_value" : null,
                 "care_site_id" : "\\N",
                 "discharge_to_concept_id" : 0,
                 "discharge_to_source_value" : null,
                 "person_id" : 3,
                 "preceding_visit_occurrence_id" : 347,
                 "provider_id" : "\\N",
                 "visit_concept_id" : 9201,
                 "visit_end_date" : "1972-12-21",
                 "visit_end_datetime" : "1972-12-21 00:00:00",
                 "visit_occurrence_id" : 312,
                 "visit_source_concept_id" : 0,
                 "visit_source_value" : "5d035dd1-30d9-4389-b94c-64947bf1f18c",
                 "visit_start_date" : "1972-12-20",
                 "visit_start_datetime" : "1972-12-20 00:00:00",
                 "visit_type_concept_id" : 44818517
              }
           }
        },
        "concept" : {
           "id" : "Visit:IP",
           "label" : "Inpatient Visit"
        },
        "end_date" : "1972-12-21T00:00:00Z",
        "id" : "312",
        "occurrence_id" : 312,
        "start_date" : "1972-12-20T00:00:00Z",
        "type" : {
           "id" : "Visit Type:OMOP4822465",
           "label" : "Visit derived from encounter on claim"
        }
     },
     "featureType" : {
        "id" : "SNOMED:428251008",
        "label" : "History of appendectomy"
     },
     "onset" : {
        "iso8601duration" : "56Y"
     }
}

When a variable corresponds to other entities in Beacon v2 Models, it is stored within the info term of the individuals entity. For instance, a PXF file may contain the biosamples property, which doesn't find a direct match in the individuals entity as it corresponds to the biosamples entity in Beacon v2 Models. To ensure the retention of this information, we place it under info.phenopacket.biosamples.

Example extracted from pxf2bff conversion:

"info" : {
          "phenopacket" : {
             "biosamples" : [
                {
                   "id" : "biosample.1",
                   "phenotypicFeatures" : [
                      {
                         "excluded" : false,
                         "type" : {
                            "id" : "HP:0003798",
                            "label" : "Nemaline bodies"
                         }
                      }
                   ],
                   "procedure" : {
                      "bodySite" : {
                         "id" : "UBERON:0002378",
                         "label" : "muscle of abdomen"
                      },
                      "code" : {
                         "id" : "NCIT:C51895",
                         "label" : "Muscle Biopsy"
                      },
                      "performed" : {
                         "age" : {
                            "iso8601duration" : "P1D"
                         }
                      }
                   },
                   "sampledTissue" : {
                      "id" : "UBERON:0002378",
                      "label" : "muscle of abdomen"
                   }
                }
             ]
       }
}
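The second scenario can be sketched in the same spirit. The snippet below is a hedged Python illustration rather than the actual pxf2bff code: `stash_other_entities` and `OTHER_ENTITY_KEYS` are hypothetical names, and the set contains only the `biosamples` property mentioned in the text.

```python
# Top-level PXF properties assumed (for this sketch only) to belong to
# Beacon v2 entities other than `individuals`.
OTHER_ENTITY_KEYS = {"biosamples"}


def stash_other_entities(pxf, individual):
    """Copy properties that belong to other entities under info.phenopacket."""
    for key in pxf.keys() & OTHER_ENTITY_KEYS:
        info = individual.setdefault("info", {})
        info.setdefault("phenopacket", {})[key] = pxf[key]
    return individual


pxf = {"id": "P0001", "biosamples": [{"id": "biosample.1"}]}
individual = stash_other_entities(pxf, {"id": "P0001"})
print(individual["info"]["phenopacket"]["biosamples"][0]["id"])
```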

Preservation and augmentation of ontologies

One of the advantages of Beacon/Phenopackets v2 is that they do not prescribe the use of specific ontologies, thus allowing us to retain the original ontologies, except to fill in missing terms in required fields.

Which ontologies/terminologies are supported?

If the input files contain ontology terms, the ontologies will be preserved and remain intact after the conversion process, except for:

Beacon v2 Models and Phenopackets v2: the property sex is converted to NCI Thesaurus via database search.
OMOP CDM: the properties sex, ethnicity, and geographicOrigin are converted to NCI Thesaurus via database search.

| | CSV | REDCap | CDISC-ODM | OMOP-CDM | Phenopackets v2 | Beacon v2 Models |
|---|---|---|---|---|---|---|
| Data mapping | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Add ontologies | ✓ | ✓ | ✓ | `--ohdsi-db` | | |

Database Search Feature

For input types that do not contain ontologies, such as CSV, REDCap, and CDISC-ODM, we perform a database search to fetch ontologies from a variety of trusted databases. Supported databases include:

Athena-OHDSI standardized vocabulary, which includes multiple terminologies, such as SNOMED, RxNorm or LOINC
NCI Thesaurus
ICD-10 terminology
CDISC (Study Data Tabulation Model Terminology)
OMIM Online Mendelian Inheritance in Man
HPO Human Phenotype Ontology (Note that prefixes are HP:, without the O)

About text similarity in database searches

Convert-Pheno comes with several pre-configured ontology/terminology databases. It supports three types of label-based search strategies:

1. `exact` (default)

Returns only exact matches for the given label string. If the label is not found exactly, no results are returned.
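As a rough illustration of an exact label lookup, the sketch below queries a tiny in-memory SQLite table. The table name `concept`, its two columns and the function `exact_search` are assumptions made for this example and do not describe the layout of the databases shipped with Convert-Pheno. Case-insensitive comparison is also an assumption of the sketch.

```python
import sqlite3

# In-memory stand-in for an ontology database; the table layout is assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concept (code TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO concept VALUES (?, ?)",
    [("NCIT:C20197", "Male"), ("NCIT:C16576", "Female")],
)


def exact_search(connection, label):
    """Return (code, label) for an exact, case-insensitive label match."""
    cursor = connection.execute(
        "SELECT code, label FROM concept WHERE label = ? COLLATE NOCASE",
        (label,),
    )
    return cursor.fetchone()  # None when the label is not found


print(exact_search(conn, "female"))  # a hit despite the different case
print(exact_search(conn, "femal"))   # no exact match, so nothing is returned
```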

2. `mixed` (use `--search mixed`)

Hybrid search: First tries to find an exact label match. If none is found, it performs a token-based similarity search and returns the closest matching concept based on the highest similarity score.

3. ✨ `fuzzy` (use `--search fuzzy`)

Hybrid search with fuzzy ranking:
Like mixed, it starts with an exact match attempt. If that fails, it performs a weighted similarity search, where:
- 90% of the score comes from token-based similarity (e.g., cosine or Dice coefficient),
- 10% comes from the normalized Levenshtein similarity.

The concept with the highest composite score is returned.

Note: The normalized Levenshtein similarity is computed on top of the candidate results produced by the full text search. In this approach, an initial full text search (using token-based methods) returns a set of potential matches. The fuzzy search then refines these results by applying the normalized Levenshtein distance to better handle minor typographical differences, ensuring that the final composite score reflects both overall token similarity and fine-grained character-level differences.

🔍 Example Search Behavior

Query: Exercise pain management
- With --search exact: ✅ Match found — Exercise Pain Management

Query: Brain Hemorrhage
- With --search mixed:
  - ❌ No exact match
  - ✅ Closest match by similarity: Intraventricular Brain Hemorrhage

💡 Similarity Threshold

The --min-text-similarity-score option sets the minimum threshold for mixed and fuzzy searches.
- Default: 0.8 (conservative)
- Lowering the threshold may increase recall but may introduce irrelevant matches.

⚠️ Performance Note

Both mixed and fuzzy modes are more computationally intensive and can produce unexpected or less interpretable matches. Use them with care, especially on large datasets.

🧪 Example Results Table

Below is an example showing how the query Sudden Death Syndrome performs using different search modes against the NCIt ontology:

| Query | Search | NCIt match (label) | NCIt code | Cosine | Dice | Levenshtein (Normalized) | Composite |
|---|---|---|---|---|---|---|---|
| Sudden Death Syndrome | exact | NA | NA | NA | NA | NA | NA |
| | mixed | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.60 | NA | NA |
| | | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.60 | NA | NA |
| | | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.60 | NA | NA |
| | | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 | NA | NA |
| | ✨ fuzzy | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.60 | 0.43 | 0.63 |
| | | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.60 | 0.43 | 0.63 |
| | | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.60 | 0.46 | 0.63 |
| | | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 | 0.75 | 0.85 |

Interpretation:

With exact, there are no matches.
With mixed, the best match will be Sudden Infant Death Syndrome.
With fuzzy, the composite score (90% token-based + 10% Levenshtein similarity) is used to rank results.
The highest match is Sudden Infant Death Syndrome, with a composite score of 0.85.

✨ Now we introduce a typo in the query, Sudden Infant Deth Syndrome:

| Query | Mode | Candidate Label | Code | Cosine | Dice | Levenshtein (Normalized) | Composite |
|---|---|---|---|---|---|---|---|
| Sudden Infant Deth Syndrome | fuzzy | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.38 | 0.36 | 0.33 | 0.37 |
| | | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.38 | 0.36 | 0.43 | 0.38 |
| | | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.57 | 0.55 | 0.59 | 0.57 |
| | | Sudden Infant Death Syndrome | NCIT:C85173 | 0.75 | 0.75 | 0.96 | 0.77 |

To capture the best match, we would need to lower the threshold below its 0.8 default, for example with --min-text-similarity-score 0.75.
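The effect of the threshold on the typo example can be checked with a short sketch. The function `best_above_threshold` is a hypothetical helper written for this illustration, and treating a score equal to the threshold as a hit is an assumption; the composite scores are copied from the table above.

```python
def best_above_threshold(scored, min_score=0.8):
    """Return the top (label, score) pair, or None if it is below min_score."""
    label, score = max(scored.items(), key=lambda item: item[1])
    return (label, score) if score >= min_score else None


# Composite scores for the query "Sudden Infant Deth Syndrome".
scores = {
    "CDISC SDTM Sudden Death Syndrome Type Terminology": 0.37,
    "Family History of Sudden Arrythmia Death Syndrome": 0.38,
    "Family History of Sudden Infant Death Syndrome": 0.57,
    "Sudden Infant Death Syndrome": 0.77,
}

print(best_above_threshold(scores))        # default 0.8: nothing qualifies
print(best_above_threshold(scores, 0.75))  # lowered threshold: best match kept
```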

It is possible to change the weight of Levenshtein similarity via --levenshtein-weight <floating 0.0 - 1.0>.

Composite Similarity Score

The composite similarity score is computed as a weighted sum of two measures: the token-based similarity and the normalized Levenshtein similarity.

1. Token-Based Similarity

This is calculated using methods like cosine or Dice similarity to measure how similar the tokens (words) of two strings are.
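A minimal sketch of both measures follows. It assumes each label is lower-cased and split on whitespace into a set of words; the tokenization used inside Convert-Pheno is not described here and may differ. The helper names `tokens`, `cosine` and `dice` are chosen for this example.

```python
import math


def tokens(text):
    """Lower-case a label and split it on whitespace into a set of words."""
    return set(text.lower().split())


def cosine(a, b):
    """Cosine similarity between the binary word vectors of two labels."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))


def dice(a, b):
    """Dice coefficient: twice the shared words over the total word count."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


query = "Sudden Death Syndrome"
candidate = "Sudden Infant Death Syndrome"
print(cosine(query, candidate), dice(query, candidate))
```

Under these assumptions the pair shares three of its words, giving a cosine of about 0.87 and a Dice coefficient of about 0.86, close to the values reported for this pair in the example results table.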

2. Normalized Levenshtein Similarity

The normalized Levenshtein similarity is defined as:

\[ \text{NormalizedLevenshtein}(s_1, s_2) = 1 - \frac{\text{lev}(s_1, s_2)}{\max(|s_1|, |s_2|)} \]

Where:
- \(\text{lev}(s_1, s_2)\) is the Levenshtein edit distance: the minimum number of insertions, deletions, or substitutions required to change \(s_1\) into \(s_2\).
- \(|s_1|\) and \(|s_2|\) are the lengths of the strings \(s_1\) and \(s_2\), respectively.

This formula produces a score between 0 and 1, with 1.0 meaning identical strings and 0.0 meaning completely different strings.
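The formula can be implemented directly with the standard dynamic-programming edit distance. The sketch below is plain Python written for illustration; the function names are chosen for this example, and returning 1.0 for two empty strings is a convention assumed here.

```python
def levenshtein(s1, s2):
    """Edit distance computed row by row, keeping only two rows in memory."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(s1, s2):
    """1 - lev(s1, s2) / max(|s1|, |s2|), as in the formula above."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - levenshtein(s1, s2) / longest


print(normalized_levenshtein("Sudden Death Syndrome",
                             "Sudden Infant Death Syndrome"))
```

For this pair the edit distance is 7 (the inserted text "Infant ") and the longer string has 28 characters, so the score is 1 - 7/28 = 0.75, matching the example results table.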

3. Composite Score Formula

The final composite similarity score \(C\) is a weighted combination of the two metrics:

\[ C(s_1, s_2) = \alpha \cdot \text{TokenSimilarity}(s_1, s_2) + \beta \cdot \text{NormalizedLevenshtein}(s_1, s_2) \]

Where:
- \(\alpha\) (or token_weight) is the weight assigned to the token-based similarity.
- \(\beta\) (or lev_weight) is the weight assigned to the normalized Levenshtein similarity.

A common default is to set \(\alpha = 0.9\) and \(\beta = 0.1\), emphasizing the token-based similarity. However, for short strings (4–5 words), you might consider adjusting the balance (for example, \(\alpha = 0.95\) and \(\beta = 0.05\)) if small typographical differences are less critical.
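As a worked check of the formula, the sketch below recomputes two composite scores from the example tables, using the cosine value as the token-based similarity. It assumes that the two weights sum to one, so that \(\alpha = 1 - \beta\); the function name `composite_score` is chosen for this illustration.

```python
def composite_score(token_similarity, levenshtein_similarity, lev_weight=0.1):
    """Weighted sum with alpha = 1 - beta (an assumption of this sketch)."""
    if not 0.0 <= lev_weight <= 1.0:
        raise ValueError("lev_weight must be between 0.0 and 1.0")
    token_weight = 1.0 - lev_weight
    return token_weight * token_similarity + lev_weight * levenshtein_similarity


# Query "Sudden Death Syndrome" vs "Sudden Infant Death Syndrome".
print(round(composite_score(0.86, 0.75), 2))
# Query "Sudden Infant Deth Syndrome" vs "Sudden Infant Death Syndrome".
print(round(composite_score(0.75, 0.96), 2))
```

The two results, 0.85 and 0.77, agree with the composite column of the tables: 0.9 × 0.86 + 0.1 × 0.75 = 0.849 and 0.9 × 0.75 + 0.1 × 0.96 = 0.771.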

Step 2: Conversion to the final model

To Phenopackets

If the output is set to Phenopackets v2 then a second step (bff2pxf) is performed (see diagram above).

BFF and PXF community alignment

At present, we have prioritized mapping the terms that we deem most critical in facilitating basic semantic interoperability. We anticipate that Beacon v2 Models will become more aligned with Phenopackets v2, which will simplify the conversion process in future updates. We aim to refine the mappings in future iterations, with the community providing a wider range of case studies.

To OMOP CDM

If the output is set to OMOP CDM then a second step (bff2omop) is performed (see diagram above).
