Mapping Steps

Step 1: Conversion to the target model

Internally, all input models are first mapped to the Beacon v2 Models (BFF, Beacon Friendly Format).

%%{init: {'theme':'neutral'}}%%
graph LR
  subgraph "Step 1: Conversion to BFF"
    B[Phenopackets v2] -->|pxf2bff| A
    C[REDCap] -->|redcap2bff| A[Beacon v2 Models]
    D[OMOP-CDM] -->|omop2bff| A
    E[CDISC-ODM] -->|cdisc2bff| A
    G[CSV] -->|csv2bff| A
  end
  subgraph "Step 2: BFF to Final"
    A -->|bff2pxf| F[Phenopackets v2]
    A -->|bff2omop| H[OMOP-CDM]
  end
  style A fill:#6495ED,stroke:#6495ED
  style B fill:#FF7F50,stroke:#FF7F50
  style C fill:#FF6965,stroke:#FF6965
  style D fill:#3CB371,stroke:#3CB371
  style E fill:#DDA0DD,stroke:#DDA0DD
  style F fill:#FF7F50,stroke:#FF7F50
  style G fill:#FFFF00,stroke:#FFFF00
  style H fill:#3CB371,stroke:#3CB371
Convert-Pheno internal mapping steps
Why use Beacon v2 as the target model?
  • JSON Schema Utilization: Beacon v2 employs JSON Schema for model content definition, facilitating transparency and accessibility in a collaborative environment compared to Phenopackets' Protobuf usage.
  • Accommodation of Additional Properties: The Beacon v2 Models schema permits additional properties, enhancing adaptability and enabling near-lossless conversion, especially when using JSON in non-relational databases.
  • Beacon v2 API Compatibility: The BFF is directly compatible with the Beacon v2 API ecosystem, a feature not available in Phenopackets without additional mapping.
  • Expansion Possibility: Because Convert-Pheno is developed at CNAG, a genomics institution, the potential to extend its mapping to encompass other Beacon v2 entities was a significant consideration.
  • Overlap with Phenopackets v2: Despite minor differences in nomenclature or hierarchy, many essential terms remain identical, encouraging interoperability.

Schema mapping

When starting a new conversion between two data models, the first step is to map variables between the two data schemas. At the time of writing (Sep-2023), the mapping of variables is still performed manually by human brains 😰.

Mapping strategy: External or hardcoded?

In the early stages of development, we explored the possibility of employing configuration files to guide the mapping process as an alternative to hardcoded solutions. However, the complexity of nested JSON data structures made this approach impractical in most scenarios; the exceptions are REDCap and CDISC-ODM data, which are mapped to the Beacon v2 Models via configuration files.

In the Mapping tables section (accessible via the 'Technical Details' tab on the left navigation bar), we outline the equivalencies between different schemas. These tables fulfill several purposes:

  1. They offer a quick way to help the Health Data community.
  2. Experts can review them and suggest changes without digging into the code.
  3. If you want to chip in and create a new conversion, you can start by making a mapping table.

Notice

Please note that accurate mapping, even between two standards, is a substantial undertaking. While we possess expertise in certain areas, we certainly don't claim mastery in all 🙏. We sincerely welcome any suggestions or feedback.

From table mappings to code

The tables function as a reference for implementing the source code of Convert-Pheno. For each format conversion, there is a dedicated Perl submodule.
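
The sketch below illustrates, in Python, how a mapping table can drive such a conversion. It is only a conceptual illustration (the actual implementation lives in the Perl submodules), and all field names, transforms, and helpers in it are hypothetical.

# Conceptual sketch of a mapping-table-driven conversion (illustrative only;
# Convert-Pheno implements this logic in dedicated Perl submodules).
# Source columns, target terms, and transforms below are hypothetical.
MAPPING_TABLE = {
    # source column -> (target BFF term, transform)
    "procedure_date": ("dateOfProcedure", lambda v: v),
    "procedure_source_value": (
        "procedureCode",
        lambda v: {"id": f"SNOMED:{v}"},  # label lookup omitted for brevity
    ),
}

def map_record(source_row: dict) -> dict:
    """Map a flat source record to a (partial) BFF term."""
    target, unmapped = {}, {}
    for column, value in source_row.items():
        if column in MAPPING_TABLE:
            term, transform = MAPPING_TABLE[column]
            target[term] = transform(value)
        else:
            unmapped[column] = value  # keep for lossless conversion
    if unmapped:
        target["_info"] = unmapped  # see 'Lossless or lossy conversion?'
    return target

print(map_record({"procedure_date": "1955-10-22",
                  "procedure_source_value": 399208008,
                  "quantity": None}))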

Contributing

While creating the code for a new format can be challenging, modifying properties in an existing one is much easier. Feel free to reach out to us should you plan to contribute.

Lossless or lossy conversion?

When converting data from one data standard to another, it is important to consider the possibility of losing information due to differences in schema and field mapping. To mitigate this, we aimed for a lossless conversion by incorporating non-mappable variables as additionalProperties within the Beacon v2 Models schema. This allows users to access the original variables and their values through database queries, especially when using non-relational databases like MongoDB.
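
As an illustration, once the converted documents are loaded into MongoDB, the retained variables remain queryable. Below is a minimal pymongo sketch; the connection URI and the database and collection names (beacon, individuals) are assumptions for this example, not part of Convert-Pheno.

# Minimal pymongo sketch: querying original OMOP columns retained under _info.
# Assumes a local MongoDB with BFF documents loaded into beacon.individuals
# (connection URI, database, and collection names are hypothetical).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["beacon"]["individuals"]

# Find individuals whose retained OMOP procedure data matches a concept id
query = {
    "interventionsOrProcedures._info.PROCEDURE_OCCURRENCE"
    ".OMOP_columns.procedure_concept_id": 4163872
}
for doc in collection.find(query, {"interventionsOrProcedures": 1}):
    print(doc)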

During the conversion process, handling variables that cannot be directly mapped can result in one of two scenarios:

Often, the input data model has variables that do not directly map to the target but are still useful to retain in the output format. If the target format allows for extra properties in a given term (as BFF does), these original variables are stored under the _info property (or _ + 'property name'). This commonly happens in conversions from OMOP CDM to BFF.

Example extracted from omop2bff conversion:

"interventionsOrProcedures" : [
       {
          "_info" : {
             "PROCEDURE_OCCURRENCE" : {
                "OMOP_columns" : {
                   "modifier_concept_id" : 0,
                   "modifier_source_value" : null,
                   "person_id" : 2,
                   "procedure_concept_id" : 4163872,
                   "procedure_date" : "1955-10-22",
                   "procedure_datetime" : "1955-10-22 00:00:00",
                   "procedure_occurrence_id" : 6,
                   "procedure_source_concept_id" : 4163872,
                   "procedure_source_value" : 399208008,
                   "procedure_type_concept_id" : 38000275,
                   "provider_id" : "\\N",
                   "quantity" : "\\N", 
                   "visit_detail_id" : 0,
                   "visit_occurrence_id" : 103
                }
             }
          },
          "ageAtProcedure" : {
             "age" : {
                "iso8601duration" : "35Y"
             }
          },
          "dateOfProcedure" : "1955-10-22",
          "procedureCode" : {
             "id" : "SNOMED:399208008",
             "label" : "Plain chest X-ray"
          }
       }
 ]

Example extracted from redcap2bff conversion:

"treatments" : [
       {
          "_info" : {
             "dose" : null,
             "drug" : "budesonide",
             "drug_name" : "budesonide",
             "duration" : null,
             "field" : "budesonide_oral_status",
             "route" : "oral",
             "start" : null,
             "status" : "never treated",
             "value" : 1
          },
          "doseIntervals" : [],
          "routeOfAdministration" : {
             "id" : "NCIT:C38288",
             "label" : "Oral Route of Administration"
          },
          "treatmentCode" : {
             "id" : "NCIT:C1027",
             "label" : "Budesonide"
          }
       }
]

Example of longitudinal data stored under _visit in an omop2bff conversion:

"_visit" : {
        "_info" : {
           "VISIT_OCCURRENCE" : {
              "OMOP_columns" : {
                 "admitting_source_concept_id" : 0,
                 "admitting_source_value" : null,
                 "care_site_id" : "\\N",
                 "discharge_to_concept_id" : 0,
                 "discharge_to_source_value" : null,
                 "person_id" : 3,
                 "preceding_visit_occurrence_id" : 347,
                 "provider_id" : "\\N",
                 "visit_concept_id" : 9201,
                 "visit_end_date" : "1972-12-21",
                 "visit_end_datetime" : "1972-12-21 00:00:00",
                 "visit_occurrence_id" : 312,
                 "visit_source_concept_id" : 0,
                 "visit_source_value" : "5d035dd1-30d9-4389-b94c-64947bf1f18c",
                 "visit_start_date" : "1972-12-20",
                 "visit_start_datetime" : "1972-12-20 00:00:00",
                 "visit_type_concept_id" : 44818517
              }
           }
        },
        "concept" : {
           "id" : "Visit:IP",
           "label" : "Inpatient Visit"
        },
        "end_date" : "1972-12-21T00:00:00Z",
        "id" : "312",
        "occurrence_id" : 312,
        "start_date" : "1972-12-20T00:00:00Z",
        "type" : {
           "id" : "Visit Type:OMOP4822465",
           "label" : "Visit derived from encounter on claim"
        }
     },
     "featureType" : {
        "id" : "SNOMED:428251008",
        "label" : "History of appendectomy"
     },
     "onset" : {
        "iso8601duration" : "56Y"
     }
}

When a variable corresponds to other entities in Beacon v2 Models, it is stored within the info term of the individuals entity. For instance, a PXF file may contain the biosamples property, which doesn't find a direct match in the individuals entity as it corresponds to the biosamples entity in Beacon v2 Models. To ensure the retention of this information, we place it under info.phenopacket.biosamples.

Example extracted from pxf2bff conversion:

"info" : {
          "phenopacket" : {
             "biosamples" : [
                {
                   "id" : "biosample.1",
                   "phenotypicFeatures" : [
                      {
                         "excluded" : false,
                         "type" : {
                            "id" : "HP:0003798",
                            "label" : "Nemaline bodies"
                         }
                      }
                   ],
                   "procedure" : {
                      "bodySite" : {
                         "id" : "UBERON:0002378",
                         "label" : "muscle of abdomen"
                      },
                      "code" : {
                         "id" : "NCIT:C51895",
                         "label" : "Muscle Biopsy"
                      },
                      "performed" : {
                         "age" : {
                            "iso8601duration" : "P1D"
                         }
                      }
                   },
                   "sampledTissue" : {
                      "id" : "UBERON:0002378",
                      "label" : "muscle of abdomen"
                   }
                }
             ]
       }
}

Preservation and augmentation of ontologies

One of the advantages of Beacon/Phenopackets v2 is that they do not prescribe the use of specific ontologies, which allows us to retain the original ontologies, except when filling in missing terms in required fields.

Which ontologies/terminologies are supported?

If the input files contain ontology terms, those ontologies are preserved and remain intact after the conversion process, except for:

  • Beacon v2 Models and Phenopackets v2: the property sex is converted to NCI Thesaurus via database search.
  • OMOP CDM: the properties sex, ethnicity, and geographicOrigin are converted to NCI Thesaurus via database search.
|                | CSV | REDCap | CDISC-ODM | OMOP-CDM   | Phenopackets v2 | Beacon v2 Models |
|----------------|-----|--------|-----------|------------|-----------------|------------------|
| Data mapping   | ✓   | ✓      | ✓         | ✓          | ✓               | ✓                |
| Add ontologies | ✓   | ✓      | ✓         | --ohdsi-db |                 |                  |
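
For example, a plain sex value in the input (e.g., male) is converted to an NCI Thesaurus term in the output. A minimal illustration (label casing may vary by conversion):

"sex" : {
   "id" : "NCIT:C20197",
   "label" : "male"
}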

Database Search Feature

For input types that do not contain ontologies, such as CSV, REDCap, and CDISC-ODM, we perform a database search to fetch ontologies from a variety of trusted databases. Supported databases include:

  • Athena-OHDSI standardized vocabulary, which includes multiple terminologies such as SNOMED, RxNorm, or LOINC
  • NCI Thesaurus
  • ICD-10 terminology
  • CDISC (Study Data Tabulation Model Terminology)
  • OMIM (Online Mendelian Inheritance in Man)
  • HPO (Human Phenotype Ontology; note that prefixes are HP:, without the O)

About text similarity in database searches

Convert-Pheno comes with several pre-configured ontology/terminology databases. It supports three types of label-based search strategies:


1. exact (default)

Returns only exact matches for the given label string. If the label is not found exactly, no results are returned.


2. mixed (use --search mixed)

Hybrid search: First tries to find an exact label match. If none is found, it performs a token-based similarity search and returns the closest matching concept based on the highest similarity score.


3. ✨ fuzzy (use --search fuzzy)

Hybrid search with fuzzy ranking: like mixed, it starts with an exact match attempt. If that fails, it performs a weighted similarity search, where:

  • 90% of the score comes from token-based similarity (e.g., cosine or Dice coefficient),
  • 10% comes from the normalized Levenshtein similarity.

The concept with the highest composite score is returned.

Note: The normalized Levenshtein similarity is computed on top of the candidate results produced by the full-text search. In this approach, an initial full-text search (using token-based methods) returns a set of potential matches; the fuzzy search then refines these results by applying the normalized Levenshtein distance to better handle minor typographical differences, ensuring that the final composite score reflects both overall token similarity and fine-grained character-level differences.
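
The sketch below illustrates this exact-then-similarity fallback in Python, using a toy dictionary built from the labels and codes shown in the results table further down. It is a conceptual illustration only, not the actual Perl implementation.

# Conceptual sketch of the exact -> similarity fallback (illustrative only).
# The toy dictionary reuses labels/codes from the example table below.
ONTOLOGY = {
    "CDISC SDTM Sudden Death Syndrome Type Terminology": "NCIT:C101852",
    "Family History of Sudden Arrythmia Death Syndrome": "NCIT:C168019",
    "Family History of Sudden Infant Death Syndrome": "NCIT:C168209",
    "Sudden Infant Death Syndrome": "NCIT:C85173",
}

def dice(a: str, b: str) -> float:
    """Token-based Dice coefficient over lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 2 * len(ta & tb) / (len(ta) + len(tb)) if (ta or tb) else 0.0

def search(label: str, min_score: float = 0.8):
    """Exact match first; otherwise return the closest candidate by score."""
    if label in ONTOLOGY:  # 'exact' behavior
        return label, ONTOLOGY[label]
    best = max(ONTOLOGY, key=lambda cand: dice(label, cand))  # 'mixed' fallback
    return (best, ONTOLOGY[best]) if dice(label, best) >= min_score else None

# No exact match, so the closest candidate wins (Dice ~ 0.86):
print(search("Sudden Death Syndrome"))
# -> ('Sudden Infant Death Syndrome', 'NCIT:C85173')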


๐Ÿ” Example Search Behaviorยถ

Query: Exercise pain management

  • With --search exact: ✅ Match found: Exercise Pain Management

Query: Brain Hemorrhage

  • With --search mixed:
    • ❌ No exact match
    • ✅ Closest match by similarity: Intraventricular Brain Hemorrhage


💡 Similarity Threshold

The --min-text-similarity-score option sets the minimum threshold for mixed and fuzzy searches.

  • Default: 0.8 (conservative)
  • Lowering the threshold may increase recall but may introduce irrelevant matches.


⚠️ Performance Note

Both mixed and fuzzy modes are more computationally intensive and can produce unexpected or less interpretable matches. Use them with care, especially on large datasets.


🧪 Example Results Table

Below is an example showing how the query Sudden Death Syndrome performs using different search modes against the NCIt ontology:

| Query | Search | NCIt match (label) | NCIt code | Cosine | Dice | Levenshtein (Normalized) | Composite |
|-------|--------|--------------------|-----------|--------|------|--------------------------|-----------|
| Sudden Death Syndrome | exact | NA | NA | NA | NA | NA | NA |
| | mixed | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.60 | NA | NA |
| | | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.60 | NA | NA |
| | | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.60 | NA | NA |
| | | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 | NA | NA |
| | ✨ fuzzy | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.60 | 0.43 | 0.63 |
| | | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.60 | 0.43 | 0.63 |
| | | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.60 | 0.46 | 0.63 |
| | | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 | 0.75 | 0.85 |

Interpretation:

  • With exact, there are no matches.

  • With mixed, the best match is Sudden Infant Death Syndrome.

  • With fuzzy, the composite score (90% token-based + 10% Levenshtein similarity) is used to rank results.
    The highest match is Sudden Infant Death Syndrome, with a composite score of 0.85.


✨ Now we introduce a typo into the query: Sudden Infant Deth Syndrome.

| Query | Mode | Candidate Label | Code | Cosine | Dice | Levenshtein (Normalized) | Composite |
|-------|------|-----------------|------|--------|------|--------------------------|-----------|
| Sudden Infant Deth Syndrome | fuzzy | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.38 | 0.36 | 0.33 | 0.37 |
| | | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.38 | 0.36 | 0.43 | 0.38 |
| | | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.57 | 0.55 | 0.59 | 0.57 |
| | | Sudden Infant Death Syndrome | NCIT:C85173 | 0.75 | 0.75 | 0.96 | 0.77 |

To capture the best match (composite score 0.77, which falls below the default threshold of 0.8), we would need to lower the threshold, e.g., --min-text-similarity-score 0.75.

It is possible to change the weight of the Levenshtein similarity via --levenshtein-weight <float between 0.0 and 1.0>.

Composite Similarity Score

The composite similarity score is computed as a weighted sum of two measures: the token-based similarity and the normalized Levenshtein similarity.

1. Token-Based Similarity

This is calculated using methods like cosine or Dice similarity to measure how similar the tokens (words) of two strings are.
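
A self-contained Python sketch of both measures over lowercased whitespace tokens follows; the actual tokenization used by Convert-Pheno may differ.

import math
from collections import Counter

def tokenize(s: str) -> Counter:
    """Lowercased whitespace tokens with counts."""
    return Counter(s.lower().split())

def cosine_similarity(s1: str, s2: str) -> float:
    """Cosine similarity between token-count vectors."""
    v1, v2 = tokenize(s1), tokenize(s2)
    dot = sum(v1[t] * v2[t] for t in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def dice_similarity(s1: str, s2: str) -> float:
    """Dice coefficient between token sets."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    return 2 * len(t1 & t2) / (len(t1) + len(t2)) if (t1 or t2) else 0.0

q, c = "Sudden Death Syndrome", "Sudden Infant Death Syndrome"
print(round(cosine_similarity(q, c), 2))  # ~0.87 (the table above reports 0.86; tokenization details may differ)
print(round(dice_similarity(q, c), 2))    # 0.86, matching the table above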

2. Normalized Levenshtein Similarity

The normalized Levenshtein similarity is defined as:

\[ \text{NormalizedLevenshtein}(s_1, s_2) = 1 - \frac{\text{lev}(s_1, s_2)}{\max(|s_1|, |s_2|)} \]

Where:

  • \(\text{lev}(s_1, s_2)\) is the Levenshtein edit distance, i.e., the minimum number of insertions, deletions, or substitutions required to change \(s_1\) into \(s_2\).
  • \(|s_1|\) and \(|s_2|\) are the lengths of the strings \(s_1\) and \(s_2\), respectively.

This formula produces a score between 0 and 1, with 1.0 meaning identical strings and 0.0 meaning completely different strings.
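
A direct Python implementation of the formula, using the classic dynamic-programming edit distance, reproduces the 0.96 reported for the typo example above:

def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, or substitutions."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # ensure s2 is the shorter string
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (c1 != c2)))  # substitution
        previous = current
    return previous[-1]

def normalized_levenshtein(s1: str, s2: str) -> float:
    """1 - lev(s1, s2) / max(|s1|, |s2|); 1.0 means identical strings."""
    if not s1 and not s2:
        return 1.0
    return 1 - levenshtein(s1, s2) / max(len(s1), len(s2))

print(round(normalized_levenshtein("Sudden Infant Deth Syndrome",
                                   "Sudden Infant Death Syndrome"), 2))  # 0.96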

3. Composite Score Formula

The final composite similarity score \(C\) is a weighted combination of the two metrics:

\[ C(s_1, s_2) = \alpha \cdot \text{TokenSimilarity}(s_1, s_2) + \beta \cdot \text{NormalizedLevenshtein}(s_1, s_2) \]

Where:

  • \(\alpha\) (or token_weight) is the weight assigned to the token-based similarity.
  • \(\beta\) (or lev_weight) is the weight assigned to the normalized Levenshtein similarity.

A common default is to set \(\alpha = 0.9\) and \(\beta = 0.1\), emphasizing the token-based similarity. However, for short strings (4-5 words), you might consider adjusting the balance (for example, \(\alpha = 0.95\) and \(\beta = 0.05\)) if small typographical differences are less critical.
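
As a quick check, the formula reproduces the composite reported for Sudden Infant Death Syndrome in the typo example above: 0.9 × 0.75 + 0.1 × 0.96 ≈ 0.77. A minimal sketch:

def composite_score(token_sim: float, lev_sim: float,
                    token_weight: float = 0.9, lev_weight: float = 0.1) -> float:
    """C = alpha * TokenSimilarity + beta * NormalizedLevenshtein."""
    assert abs(token_weight + lev_weight - 1.0) < 1e-9, "weights must sum to 1"
    return token_weight * token_sim + lev_weight * lev_sim

# Cosine 0.75 and normalized Levenshtein 0.96, from the typo example above:
print(round(composite_score(0.75, 0.96), 2))              # 0.77
# Rebalanced weights for short strings (alpha=0.95, beta=0.05),
# cf. the --levenshtein-weight option:
print(round(composite_score(0.75, 0.96, 0.95, 0.05), 2))  # 0.76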

Step 2: Conversion to the final model

To Phenopackets

If the output is set to Phenopackets v2, a second step (bff2pxf) is performed (see the diagram above).

BFF and PXF community alignment

At present, we have prioritized mapping the terms that we deem most critical in facilitating basic semantic interoperability. We anticipate that Beacon v2 Models will become more aligned with Phenopackets v2, which will simplify the conversion process in future updates. We aim to refine the mappings in future iterations, with the community providing a wider range of case studies.

To OMOP CDM

If the output is set to OMOP CDM, a second step (bff2omop) is performed (see the diagram above).