
Tutorial

This page gives short walkthroughs for three common Convert-Pheno conversions.

Google Colab version

A runnable notebook version is available in Google Colab. A local copy is also available in the repo.

Before you start

These examples assume that Convert-Pheno is already installed. If not, start with Download & Installation.

Choose a Walkthrough

Goal | Start Here
I have a REDCap export and want Phenopackets | REDCap to PXF
I have OMOP-CDM data and want BFF | OMOP CDM to BFF
I have a plain CSV and a mapping file | CSV to BFF

For short copy-paste commands without tutorial context, use Conversion Recipes.

REDCap to PXF

This is a good route when you have a REDCap export and want to produce Phenopackets.

You will usually need three files:

  1. REDCap data export in CSV format
  2. REDCap data dictionary in CSV format
  3. Mapping file in YAML or JSON format

Because REDCap projects are free-form, the mapping file is what tells Convert-Pheno how your project variables should be interpreted.

What is a Convert-Pheno mapping file?

A mapping file is a text file in YAML format (JSON is also accepted) that connects a set of variables to a format that is understood by Convert-Pheno.

In v0.30, the layout is entity-aware:

  • project holds project-level metadata.
  • beacon groups Beacon entities at the same level.
  • beacon.individuals holds the semantic mapping rules to the individuals entity from the Beacon v2 models, which remains the central normalized model for mapping-file based conversions.
  • beacon.datasets, beacon.cohorts, and beacon.biosamples hold optional metadata/defaults for emitted Beacon entities.
  • These metadata overrides are currently consumed only by the conversion routes that use a mapping file: csv2bff, redcap2bff, and cdisc2bff.

The beacon.individuals wrapper is mandatory in v0.30.

Mental model

A mapping section inside beacon.individuals, such as diseases, exposures, or treatments, usually answers four different questions:

  1. Which source columns participate? Use fields.
  2. If the source field name is not the ontology label, what field-level term should be searched? Use fieldTermLabels.
  3. If the recorded value is not the ontology label, what value-level term should be searched? Use valueTermLabels.
  4. If extra target-side attributes are needed, where do they come from? Use targetFields for simple target attributes and fieldRules for per-field nested rules.

In practice:

  • fieldTermLabels describes the meaning of the column/header itself.
  • valueTermLabels describes the meaning of the recorded cell value.
  • targetFields points to source columns used to populate target-side attributes such as primaryKey, age, or date.
  • fieldRules holds field-specific nested configuration, for example value-to-term rules or auxiliary pointers such as ageAtExposure.

Creating a mapping file

To create a mapping file, start by reviewing the example mapping file provided with the installation. The goal is to replace the contents of that file with those from your REDCap project. The skeleton below shows a minimal layout, and the table after it lists the types of data a mapping file contains.

Minimal mapping skeleton
project:
  id: my_project
  source: redcap
  ontology: ncit
  version: 0.1

beacon:
  datasets:
    id: my-project-dataset
    name: My Project Dataset
  cohorts:
    id: my-project-cohort
    name: My Project Cohort
    cohortType: study-defined
  individuals:
    id:
      fields: [record_id, visit_name]
      targetFields:
        primaryKey: record_id

    sex:
      fields: sex
      valueTermLabels:
        Male: Male
        Female: Female

    diseases:
      fields: [diagnosis]
      valueTermLabels:
        UC: Ulcerative Colitis
        CD: Crohn Disease

    exposures:
      fields: [smoking]
      fieldTermLabels:
        smoking: Smoking
      valueTermLabels:
        Current smoker: Current Smoker
        Ex-smoker: Former Smoker
        Never smoked: Never Smoker

    info:
      fields: [record_id, age, visit_name]
      targetFields:
        age: age

This skeleton shows the minimum structure most users need first:

  • project defines the source and default ontology.
  • beacon groups all Beacon entity sections in one place.
  • beacon.individuals contains the semantic mapping for the Beacon individuals entity.
  • beacon.individuals.id.targetFields.primaryKey points to the source column used as the main individual identifier.
  • fieldTermLabels maps the meaning of a column/header.
  • valueTermLabels maps the meaning of a recorded cell value.
  • targetFields maps extra target-side attributes such as age.

Type | Required (Optional) | Required properties | Optional properties
Internal | project | id, source, ontology, version | description, baselineFieldsToPropagate
Beacon entities | beacon | individuals | datasets, cohorts, biosamples
Entity mapping | beacon.individuals | id, sex | diseases, exposures, info, interventionsOrProcedures, measures, phenotypicFeatures, treatments, ethnicity, geographicOrigin, karyotypicSex, pedigrees
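
As a quick sanity check before running a conversion, the required properties listed in the table can be verified programmatically. The sketch below is not part of Convert-Pheno: the names REQUIRED and check_required are hypothetical, and it assumes the mapping has already been parsed into a dictionary (for example, a JSON mapping file loaded with json.load).

```python
# Hypothetical pre-flight check; not part of Convert-Pheno itself.
# The required properties are copied from the table above.
REQUIRED = {
    ("project",): ["id", "source", "ontology", "version"],
    ("beacon",): ["individuals"],
    ("beacon", "individuals"): ["id", "sex"],
}


def check_required(mapping):
    """Return dotted paths of required properties missing from a parsed mapping."""
    missing = []
    for path, keys in REQUIRED.items():
        node = mapping
        for part in path:
            # Walk down the nested dict; a missing parent becomes an empty dict
            node = node.get(part, {}) if isinstance(node, dict) else {}
        if not isinstance(node, dict):
            node = {}
        for key in keys:
            if key not in node:
                missing.append(".".join(path + (key,)))
    return missing


mapping = {
    "project": {"id": "my_project", "source": "redcap", "ontology": "ncit"},
    "beacon": {"individuals": {"id": {"fields": ["record_id"]}}},
}
print(check_required(mapping))
# ['project.version', 'beacon.individuals.sex']
```

An empty list means every required property in the table is present; it does not check that the values themselves are valid.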

These are the properties needed to map your data to the entity individuals in the Beacon v2 Models:

  • beacon.individuals, an object containing the semantic mapping rules for the Beacon individuals entity.

  • beacon, a top-level object with the entity sections. Use beacon.datasets and beacon.cohorts to override synthesized metadata such as id, name, description, externalUrl, cohortType, or cohortDataTypes. These values are merged with the tool-generated defaults. This augmentation currently applies only to csv2bff, redcap2bff, and cdisc2bff.

  • baselineFieldsToPropagate, an array of columns containing measurements that were taken only at the initial time point (time = 0). Use this if you wish to duplicate these columns across subsequent rows for the same patient ID. It is important to ensure that the row containing baseline information appears first in the CSV.

  • Mapping-file conversions preserve a raw source-row snapshot in the generated BFF info object by default, such as CSV_columns or REDCap_columns. This helps users audit mapped values against the input file. Use --no-source-info to omit these copied source payloads.

  • age, a string representing the column that points to the age of the patient.

  • ageAtProcedure, an object representing the column that points to the age when a procedure took place.

  • ageOfOnset, an object representing the column that points to the age at which the patient first experienced symptoms or was diagnosed with a condition.

  • bodySite, an object representing the column that points to the part of the body affected by a condition or where a procedure was performed.

  • dateOfProcedure, an object representing the column that points to when a procedure took place.

  • drugDose, an object representing the column that points to the dose column for each treatment.

  • drugUnit, an object representing the column that points to the unit column for each treatment.

  • duration, an object representing the column that points to the duration column for each treatment.

  • durationUnit, an object representing the column that points to the duration unit column for each treatment.

  • familyHistory, an object representing the column that points to the family medical history relevant to the patient's condition.

  • fieldRules, a nested object with per-field rules such as value-to-term mappings or auxiliary field configuration like ageAtExposure. Use this when a single Beacon term needs field-specific behavior rather than one global rule.

  • fieldTermLabels, an object in the form of key: value. The key is the original variable or header name and the value is the ontology query phrase used for the field itself. Use this when the column name carries the term meaning. For instance, you may have a variable named cigarettes_days whose NCIt label is Average Number Cigarettes Smoked a Day. In that case, use cigarettes_days: Average Number Cigarettes Smoked a Day.

  • fields, either a string or an array with the names of the source variables that map to that Beacon v2 term.

  • procedureCodeLabel, a nested object with specific mappings for interventionsOrProcedures.

  • ontology, a string that sets the ontology for this particular Beacon v2 term. If not present, the script uses the value of project.ontology.

  • routeOfAdministration, a nested object with specific mappings for treatments.

  • targetFields, an object in the form of key: value that maps target-side attributes such as primaryKey, age, date, or duration to source columns. Use this when the target model expects a named attribute that is not itself an ontology lookup.

  • terminology, a nested object value with user-defined ontology terms. Use this when you already know the exact ontology object and want to bypass database lookup for that term.

  • useHeaderAsTermLabel, an array for columns on which the ontology-term labels have to be assigned from the header instead of the recorded value. This is common for checkbox-like columns where the header says the term and the cell only says whether it is present.

    Terminology example
    terminology:
      My fav term:
        id: FOO:12345678
        label: Label for my fav term

  • unit, an object representing the column that points to the unit of measurement for a given value or treatment.

  • valueTermLabels, an object in the form of key: value where the key is the original recorded value and the value is the ontology query phrase used to map that value. Use this when the cell value carries the term meaning, for example Current smoker -> Current Smoker.

  • visitId, the column with visit occurrence id.

    Field vs value mapping
    exposures:
      fields: [smoking, cigarettes_days]
      fieldTermLabels:
        smoking: Smoking
        cigarettes_days: Average Number Cigarettes Smoked a Day
      valueTermLabels:
        Current smoker: Current Smoker
        Ex-smoker: Former Smoker
        Never smoked: Never Smoker

    In this example:

    • smoking -> Smoking comes from the field/header, so it belongs in fieldTermLabels.
    • Current smoker -> Current Smoker comes from the recorded value, so it belongs in valueTermLabels.

    Defining the values in fieldTermLabels and valueTermLabels

    Before assigning values to fieldTermLabels or valueTermLabels it's important that you think about which ontologies or terminologies you want to use. The field project.ontology defines the ontology for the whole project, but you can also specify another ontology at the Beacon v2 term level. Once you know which ontologies to use, search for accurate labels first. For example, if you have chosen ncit, you can search for the values within NCIt at EBI Search. Convert-Pheno will use these values to retrieve the actual ontology term from its internal databases.
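
To make the field/value distinction concrete, the sketch below shows one way the ontology query phrases could be chosen for a single cell. It illustrates the rules described above and is not Convert-Pheno's actual code. The function name query_phrases is hypothetical, and falling back to the raw header or raw value when no label is configured is an assumption.

```python
def query_phrases(section, field, value):
    """Return (field_phrase, value_phrase) to search for in the ontology.

    Hypothetical helper: `section` is one parsed mapping section such as
    beacon.individuals.exposures.
    """
    field_labels = section.get("fieldTermLabels", {})
    value_labels = section.get("valueTermLabels", {})
    # Assumption: when no override is configured, the raw header name or
    # raw cell value is used as the query phrase.
    return field_labels.get(field, field), value_labels.get(value, value)


exposures = {
    "fields": ["smoking", "cigarettes_days"],
    "fieldTermLabels": {
        "smoking": "Smoking",
        "cigarettes_days": "Average Number Cigarettes Smoked a Day",
    },
    "valueTermLabels": {
        "Current smoker": "Current Smoker",
        "Ex-smoker": "Former Smoker",
        "Never smoked": "Never Smoker",
    },
}

print(query_phrases(exposures, "smoking", "Ex-smoker"))
# ('Smoking', 'Former Smoker')
print(query_phrases(exposures, "cigarettes_days", "10"))
# ('Average Number Cigarettes Smoked a Day', '10')
```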

For mapping-file-based conversions, Convert-Pheno can also use similarity-based lookup to help connect source fields to target terms:

About text similarity in database searches

Convert-Pheno comes with several pre-configured ontology/terminology databases. It supports three types of label-based search strategies:


1. exact (default)

Returns only exact matches for the given label string. If the label is not found exactly, no results are returned.


2. mixed (use --search mixed)

Hybrid search: First tries to find an exact label match. If none is found, it performs a token-based similarity search and returns the closest matching concept based on the highest similarity score.


3. ✨ fuzzy (use --search fuzzy)

Hybrid search with fuzzy ranking:
Like mixed, it starts with an exact match attempt. If that fails, it performs a weighted similarity search, where:

  • 90% of the score comes from token-based similarity (e.g., cosine or Dice coefficient),
  • 10% comes from the normalized Levenshtein similarity.

The concept with the highest composite score is returned.

Note: The normalized Levenshtein similarity is computed on top of the candidate results produced by the full text search. In this approach, an initial full text search (using token-based methods) returns a set of potential matches. The fuzzy search then refines these results by applying the normalized Levenshtein distance to better handle minor typographical differences, ensuring that the final composite score reflects both overall token similarity and fine-grained character-level differences.


πŸ” Example Search Behavior​

Query: Exercise pain management

  • With --search exact: βœ… Match found β€” Exercise Pain Management

Query: Brain Hemorrhage

  • With --search mixed:
    • ❌ No exact match
    • βœ… Closest match by similarity: Intraventricular Brain Hemorrhage

Similarity threshold

The --min-text-similarity-score option sets the minimum threshold for mixed and fuzzy searches.

  • Default: 0.8 (conservative)
  • Lowering the threshold may increase recall but may introduce irrelevant matches.
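
The interplay between the search mode and the threshold can be sketched as follows. This is a simplified illustration, not the tool's implementation: the function names are hypothetical, the label list is a tiny stand-in for an ontology database, exact matching is assumed to be case-insensitive (as the first example above suggests), and cosine similarity over lowercase word sets is assumed as the token-based score. The fuzzy mode is left out; it would differ only in the score used for ranking.

```python
import math


def token_cosine(a, b):
    """Cosine similarity between the lowercase word sets of two labels."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))


def search(query, labels, mode="exact", min_score=0.8):
    """Return the matching label, or None when nothing qualifies."""
    # Step 1: every mode starts with an exact attempt
    # (case-insensitive here, which is an assumption).
    for label in labels:
        if label.lower() == query.lower():
            return label
    if mode == "exact":
        return None
    # Step 2 (any non-exact mode is treated as mixed in this sketch):
    # keep the closest label if it reaches the threshold.
    best = max(labels, key=lambda label: token_cosine(query, label), default=None)
    if best is not None and token_cosine(query, best) >= min_score:
        return best
    return None


labels = [
    "Exercise Pain Management",
    "Intraventricular Brain Hemorrhage",
    "Sudden Infant Death Syndrome",
]
print(search("Exercise pain management", labels))        # Exercise Pain Management
print(search("Brain Hemorrhage", labels))                # None
print(search("Brain Hemorrhage", labels, mode="mixed"))  # Intraventricular Brain Hemorrhage
```

In this sketch, Brain Hemorrhage reaches a score of about 0.82 against Intraventricular Brain Hemorrhage, just above the default threshold of 0.8.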

Performance note

Both mixed and fuzzy modes are more computationally intensive and can produce unexpected or less interpretable matches. Use them with care, especially on large datasets.


Example results table

Below is an example showing how the query Sudden Death Syndrome performs using different search modes against the NCIt ontology:

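Query | Search | NCIt match (label) | NCIt code | Cosine | Dice | Levenshtein (Normalized) | Composite
Sudden Death Syndrome | exact | NA | NA | NA | NA | NA | NA
Sudden Death Syndrome | mixed | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.60 | NA | NA
Sudden Death Syndrome | mixed | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.60 | NA | NA
Sudden Death Syndrome | mixed | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.60 | NA | NA
Sudden Death Syndrome | mixed | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 | NA | NA
Sudden Death Syndrome | ✨ fuzzy | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.60 | 0.43 | 0.63
Sudden Death Syndrome | ✨ fuzzy | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.60 | 0.43 | 0.63
Sudden Death Syndrome | ✨ fuzzy | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.60 | 0.46 | 0.63
Sudden Death Syndrome | ✨ fuzzy | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 | 0.75 | 0.85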

Interpretation:

  • With exact, there are no matches.

  • With mixed, the best match will be Sudden Infant Death Syndrome.

  • With fuzzy, the composite score (90% token-based + 10% Levenshtein similarity) is used to rank results.
    The highest match is Sudden Infant Death Syndrome, with a composite score of 0.85.


✨ Now we introduce a typo in the query, Sudden Infant Deth Syndrome:

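Query | Mode | Candidate Label | Code | Cosine | Dice | Levenshtein (Normalized) | Composite
Sudden Infant Deth Syndrome | fuzzy | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.38 | 0.36 | 0.33 | 0.37
Sudden Infant Deth Syndrome | fuzzy | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.38 | 0.36 | 0.43 | 0.38
Sudden Infant Deth Syndrome | fuzzy | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.57 | 0.55 | 0.59 | 0.57
Sudden Infant Deth Syndrome | fuzzy | Sudden Infant Death Syndrome | NCIT:C85173 | 0.75 | 0.75 | 0.96 | 0.77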

The best composite score (0.77) falls below the default threshold of 0.8, so capturing this match requires lowering the threshold, for example with --min-text-similarity-score 0.75.

It is possible to change the weight of Levenshtein similarity via --levenshtein-weight <floating 0.0 - 1.0>.

Composite Similarity Score

The composite similarity score is computed as a weighted sum of two measures: the token-based similarity and the normalized Levenshtein similarity.

1. Token-Based Similarity

This is calculated using methods like cosine or Dice similarity to measure how similar the tokens (words) of two strings are.
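
A minimal sketch of both token-based measures is shown below. It assumes each label is split on whitespace into a set of lowercase words; Convert-Pheno's own tokenization and weighting may differ, and the function names are hypothetical.

```python
import math


def tokens(text):
    """Split a label into a set of lowercase words (assumed tokenization)."""
    return set(text.lower().split())


def cosine_similarity(a, b):
    """Shared tokens divided by the geometric mean of the two set sizes."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))


def dice_similarity(a, b):
    """Twice the shared tokens divided by the sum of the two set sizes."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


query = "Sudden Death Syndrome"
for label in (
    "Family History of Sudden Infant Death Syndrome",
    "Sudden Infant Death Syndrome",
):
    print(label, round(cosine_similarity(query, label), 3),
          round(dice_similarity(query, label), 3))
```

Under these assumptions the two labels score about 0.655/0.600 and 0.866/0.857 (cosine/Dice), which is close to the 0.65/0.60 and 0.86/0.86 reported in the example results table.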

2. Normalized Levenshtein Similarity

The normalized Levenshtein similarity is defined as:

\text{NormalizedLevenshtein}(s_1, s_2) = 1 - \frac{\text{lev}(s_1, s_2)}{\max(|s_1|, |s_2|)}

Where:

  • \text{lev}(s_1, s_2) is the Levenshtein edit distance, that is, the minimum number of insertions, deletions, or substitutions required to change s_1 into s_2.
  • |s_1| and |s_2| are the lengths of the strings s_1 and s_2, respectively.

This formula produces a score between 0 and 1, with 1.0 meaning identical strings and 0.0 meaning completely different strings.
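
The formula can be checked with a short, dependency-free sketch. The function names are hypothetical, and the comparison is case-sensitive, which is an assumption about how the tool treats labels.

```python
def levenshtein(s1, s2):
    """Edit distance computed with the classic two-row dynamic programme."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(s1, s2):
    """1 - lev(s1, s2) / max(|s1|, |s2|), as defined above."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1 - levenshtein(s1, s2) / longest


print(round(normalized_levenshtein(
    "Sudden Infant Deth Syndrome", "Sudden Infant Death Syndrome"), 2))  # 0.96
print(round(normalized_levenshtein(
    "Sudden Death Syndrome", "Sudden Infant Death Syndrome"), 2))        # 0.75
```

Both values agree with the Levenshtein column for Sudden Infant Death Syndrome in the two example tables: one missing letter in a 28-character label gives 1 - 1/28, and seven missing characters give 1 - 7/28.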

3. Composite Score Formula

The final composite similarity score C is a weighted combination of the two metrics:

C(s_1, s_2) = \alpha \cdot \text{TokenSimilarity}(s_1, s_2) + \beta \cdot \text{NormalizedLevenshtein}(s_1, s_2)

Where:

  • \alpha (or token_weight) is the weight assigned to the token-based similarity.
  • \beta (or lev_weight) is the weight assigned to the normalized Levenshtein similarity.

A common default is to set \alpha = 0.9 and \beta = 0.1, emphasizing the token-based similarity. However, for short strings (4–5 words), you might consider adjusting the balance (for example, \alpha = 0.95 and \beta = 0.05) if small typographical differences are less critical.
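
As a worked check, the Sudden Infant Death Syndrome row of the typo example (cosine 0.75, normalized Levenshtein 0.96) can be plugged into the formula with the default weights. The only assumption is that cosine is the token-based measure entering the composite, which is consistent with the values in that table.

```latex
\begin{aligned}
C &= \alpha \cdot \text{TokenSimilarity} + \beta \cdot \text{NormalizedLevenshtein} \\
  &= 0.9 \times 0.75 + 0.1 \times 0.96 \\
  &= 0.675 + 0.096 \\
  &= 0.771 \approx 0.77
\end{aligned}
```

With \alpha = 0.95 and \beta = 0.05 the same row gives 0.7125 + 0.048 = 0.7605, so shifting weight away from the Levenshtein term slightly lowers the score of a candidate that differs from the query only by a typo.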

Run the conversion:

convert-pheno -iredcap redcap.csv \
--redcap-dictionary dictionary.csv \
--mapping-file mapping.yaml \
-opxf phenopackets.json

If you need more detail about REDCap-specific behavior, see REDCap.

OMOP CDM to BFF

This route is meant for OMOP exports in SQL or CSV form.

Two situations are common:

  1. Full export: the CONCEPT table already contains the standardized terms needed for conversion
  2. Partial export: some terms are missing, so Convert-Pheno needs the bundled ATHENA-OHDSI lookup database and the --ohdsi-db flag

For smaller inputs:

convert-pheno -iomop omop.sql -obff individuals.json

For larger inputs:

convert-pheno -iomop omop.sql.gz -obff individuals.json.gz --stream --ohdsi-db

If you are working with OMOP regularly, see OMOP-CDM for the fuller explanation of SQL, CSV, CONCEPT, and streaming behavior.

If you want entity-aware BFF output instead of the individuals-only individuals.json path, request the entities explicitly:

convert-pheno -iomop PERSON.csv CONCEPT.csv DRUG_EXPOSURE.csv \
-obff \
--entities individuals datasets cohorts \
--out-dir out/

In mapping-file conversions, the top-level beacon section can override synthesized datasets and cohorts metadata. This currently applies to csv2bff, redcap2bff, and cdisc2bff, which are the routes that use a mapping file.

CSV to BFF

This route is intended for raw clinical CSV data that does not already follow one of the supported data models.

As with REDCap, the key requirement is a mapping file. The mapping-file structure is the same as the one shown in REDCap to PXF: source columns are described under beacon.individuals, while optional beacon.datasets, beacon.cohorts, and beacon.biosamples sections can provide entity metadata for routes that support it.

For the complete mapping-file model, see Mapping Steps. For ontology lookup modes and thresholds, see DB Search.

Run the conversion:

convert-pheno -icsv clinical_data.csv \
--mapping-file clinical_data_mapping.yaml \
-obff individuals.json

If your separator is not the default one expected by the tool, add --sep.
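
If you are unsure which separator a file uses, you can inspect a sample before deciding whether --sep is needed. The sketch below relies on csv.Sniffer from the Python standard library and is independent of Convert-Pheno; the helper name guess_separator and the candidate list are assumptions, and sniffing is a heuristic that can fail on irregular files.

```python
import csv


def guess_separator(sample, candidates=",;\t|"):
    """Guess the column separator of a text sample with csv.Sniffer."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


# In practice, read the first few kilobytes of your own file instead
sample = "record_id;age;sex\n1;34;Female\n2;51;Male\n"
print(repr(guess_separator(sample)))
```

For this semicolon-separated sample the helper reports ';', which is the character to compare against the tool's default separator.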

If you want datasets and cohorts as well, switch to entity mode:

convert-pheno -icsv clinical_data.csv \
--mapping-file clinical_data_mapping.yaml \
-obff \
--entities individuals datasets cohorts \
--out-dir out/

Need more detail?

  • Usage for more command examples
  • CSV for raw CSV input
  • REDCap for REDCap exports
  • OMOP-CDM for OMOP-specific options and caveats
  • FAQ for common questions