Tutorial
Google Colab version
We created a Google Colab version of the tutorial. Anyone can view publicly shared notebooks without signing in, but you need a Google account to execute code.
We also have a local copy of the notebook that can be downloaded from the repo.
This page provides brief tutorials on how to perform data conversion using the Convert-Pheno command-line interface.
Note on installation
Before proceeding, ensure that the software is properly installed. In the following instructions, it will be assumed that you have downloaded and installed Convert-Pheno.
How to convert:
This section provides a summary of the steps to convert a REDCap project to Phenopackets v2.
- The starting point is to log in to your REDCap system and export the data in CSV / Microsoft Excel (raw data) format. If you need more information on REDCap, we recommend consulting the comprehensive documentation provided by the Cincinnati Children's Hospital Medical Center.

Can I export a CSV / Microsoft Excel (labels) file?
Yes, you can export a CSV or Microsoft Excel file with labels. However, you need to use the --icsv flag instead of the --iredcap flag as the input format. While we recommend exporting raw data along with the dictionary for better accuracy, we understand that this might not always be possible.
For more detailed information and other common questions, please refer to the FAQ.

- After exporting the data, you must also download the REDCap dictionary in CSV format. This can be done within REDCap by navigating to Project Setup/Data Dictionary/Download the current.
- Since REDCap projects are "free-format," a mapping file is necessary to connect REDCap project variables (i.e. fields) to something meaningful for Convert-Pheno. This mapping file will be used in the conversion process.
What is a Convert-Pheno mapping file?
A mapping file is a text file in YAML format (JSON is also accepted) that connects a set of variables to a format that Convert-Pheno understands. This file maps your variables to the required terms of the individuals entity from the Beacon v2 Models, which serves as the central model.
Creating a mapping file
To create a mapping file, start by reviewing the example mapping file provided with the installation. The goal is to replace the contents of that file with those from your REDCap project. The mapping file contains the following types of data:
Type | Required (Optional) | Required properties | Optional properties |
---|---|---|---|
Internal | project | id, source, ontology, version | description, baselineFieldsToPropagate |
Beacon v2 terms | id, sex (diseases, exposures, info, interventionsOrProcedures, measures, phenotypicFeatures, treatments) | fields | age, ageOfOnset, assignTermIdFromHeader, bodySite, dateOfProcedure, dictionary, drugDose, drugUnit, duration, durationUnit, familyHistory, fields, mapping, procedureCodeLabel, selector, terminology, unit |
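The required properties in the table can be checked mechanically before running a conversion. The following is a minimal sketch, not part of Convert-Pheno: the function name check_mapping and the example mapping content (project id, variable names, search phrase) are hypothetical, and the mapping is written as JSON only because the standard library can parse it.

```python
import json

# Required properties, taken from the table above.
REQUIRED_PROJECT_KEYS = {"id", "source", "ontology", "version"}
REQUIRED_TERMS = {"id", "sex"}


def check_mapping(mapping):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    project = mapping.get("project")
    if not isinstance(project, dict):
        problems.append("missing 'project' section")
    else:
        for key in sorted(REQUIRED_PROJECT_KEYS - set(project)):
            problems.append(f"project: missing '{key}'")
    for term in sorted(REQUIRED_TERMS):
        if term not in mapping:
            problems.append(f"missing required term '{term}'")
    # Every Beacon v2 term that is present needs a 'fields' property.
    for term, body in mapping.items():
        if term == "project":
            continue
        if not isinstance(body, dict) or "fields" not in body:
            problems.append(f"{term}: missing 'fields'")
    return problems


# Hypothetical mapping content (all names and values are made up).
example = json.loads("""
{
  "project": {"id": "my_project", "source": "redcap", "ontology": "ncit"},
  "id": {"fields": "study_id"},
  "sex": {"fields": "sex"},
  "diseases": {"dictionary": {"cd": "Crohn Disease"}}
}
""")

for problem in check_mapping(example):
    print(problem)
```

Here the check reports that project.version is absent and that the diseases term has no fields property. Convert-Pheno performs its own validation of the mapping file, which may be stricter than this sketch.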
These are the properties needed to map your data to the individuals entity in the Beacon v2 Models:
- baselineFieldsToPropagate, an array of columns containing measurements that were taken only at the initial time point (time = 0). Use this if you wish to duplicate these columns across subsequent rows for the same patient ID. The row containing the baseline information must appear first in the CSV.
- age, a string representing the column that points to the age of the patient.
- ageOfOnset, an object representing the column that points to the age at which the patient first experienced symptoms or was diagnosed with a condition.
- assignTermIdFromHeader, an array of columns for which the ontology-term ids have to be assigned from the header.
- bodySite, an object representing the column that points to the part of the body affected by a condition or where a procedure was performed.
- dateOfProcedure, an object representing the column that points to when a procedure took place.
- dictionary, an object in the form of key: value. The key represents the original variable name in REDCap and the value represents the "phrase" that will be used to query a database to find an ontology candidate. For instance, you may have a variable named cigarettes_days, but you know that in NCIt the label is Average Number Cigarettes Smoked a Day. In this case, you will use cigarettes_days: Average Number Cigarettes Smoked a Day.
- drugDose, an object representing the column that points to the dose column for each treatment.
- drugUnit, an object representing the column that points to the unit column for each treatment.
- duration, an object representing the column that points to the duration column for each treatment.
- durationUnit, an object representing the column that points to the duration unit column for each treatment.
- familyHistory, an object representing the column that points to the family medical history relevant to the patient's condition.
- fields, either a string or an array consisting of the names of the REDCap variables that map to that Beacon v2 term.
- mapping, an object in the form of key: value that we use to map our Beacon v2 objects to REDCap variables.
- procedureCodeLabel, a nested object with specific mappings for interventionsOrProcedures.
- ontology, a string that sets the ontology for this particular Beacon v2 term. If not present, the script will use the one from project.ontology.
- routeOfAdministration, a nested object with specific mappings for treatments.
- selector, a nested object value with specific mappings.
- terminology, a nested object value with user-defined ontology terms.
- unit, an object representing the column that points to the unit of measurement for a given value or treatment.
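To make the idea behind baselineFieldsToPropagate concrete, here is a small sketch of copying baseline-only columns into later rows of the same patient. It illustrates the concept only and is not Convert-Pheno's implementation: the function propagate_baseline, the column names, and the choice to fill only empty cells are assumptions.

```python
import csv
import io


def propagate_baseline(rows, id_column, baseline_columns):
    """Copy baseline values from each patient's first row into later rows.

    Only empty cells are filled, so later measurements are never overwritten.
    """
    baseline = {}  # patient id -> values seen on that patient's first row
    result = []
    for row in rows:
        row = dict(row)
        patient = row[id_column]
        if patient not in baseline:
            # First row for this patient: remember its baseline values.
            baseline[patient] = {col: row[col] for col in baseline_columns}
        else:
            for col in baseline_columns:
                if row[col] == "":
                    row[col] = baseline[patient][col]
        result.append(row)
    return result


# Hypothetical export: 'height' was recorded only at the first visit.
data = io.StringIO(
    "study_id,visit,height,weight\n"
    "1,baseline,180,80\n"
    "1,month_6,,82\n"
    "2,baseline,165,60\n"
    "2,month_6,,61\n"
)
rows = propagate_baseline(csv.DictReader(data), "study_id", ["height"])
for row in rows:
    print(row["study_id"], row["visit"], row["height"], row["weight"])
```

The sketch depends on the baseline row coming first for each patient, which is why the order of rows in the CSV matters.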
Defining the values in the property dictionary
Before assigning values to dictionary, think about which ontologies/terminologies you want to use. The field project.ontology defines the ontology for the whole project, but you can also specify another ontology at the Beacon v2 term level. Once you know which ontologies to use, search for each term to get an accurate label for it. For example, if you have chosen ncit, you can search for the values within NCIt at EBI Search. Convert-Pheno will use these values to retrieve the actual ontology term from its internal databases.
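The role of dictionary can be pictured as a two-step lookup: variable name to search phrase, then search phrase to ontology term. The sketch below assumes a tiny in-memory stand-in for the ontology database; the variable name sids, the function resolve, and the fallback to the variable's own name are hypothetical, while the NCIt label and code are the ones shown in the similarity example on this page.

```python
# Hypothetical 'dictionary' property: variable name -> search phrase.
dictionary = {"sids": "Sudden Infant Death Syndrome"}

# Stand-in for the local ontology database: label -> ontology term id.
ncit_labels = {"Sudden Infant Death Syndrome": "NCIT:C85173"}


def resolve(variable):
    """Map a variable to an ontology term via its search phrase.

    Assumption: variables without a dictionary entry are searched
    under their own name.
    """
    phrase = dictionary.get(variable, variable)
    term_id = ncit_labels.get(phrase)
    if term_id is None:
        return None
    return {"id": term_id, "label": phrase}


print(resolve("sids"))
print(resolve("unknown_variable"))
```

A variable whose phrase has no matching label resolves to nothing, so it pays to copy the label exactly as it appears in the ontology.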
About text similarity in database searches
Convert-Pheno comes with a few pre-configured databases and it will search for ontologies/terminologies there. Two types of searches can be performed:
- exact (default): Retrieves only exact matches for a specified 'label'.
- mixed (needs --search mixed): The script first attempts an exact match for 'label'. If that fails, it searches by string (phrase) similarity and selects the ontology term with the highest score.
Example (NCIt ontology):
Search phrase: Exercise pain management with exact search.
- exact match: Exercise Pain Management
Search phrase: Brain Hemorrhage with mixed search.
- exact match: NA
- similarity match: Intraventricular Brain Hemorrhage
--min-text-similarity-score sets the minimum value for the Cosine / Sorensen-Dice coefficient. The default value (0.8) is very conservative.
Note that mixed search requires more computational time and its results can be unpredictable. Please use it with caution.
Example:
Find below an example of the results for the query Sudden Death Syndrome on the local NCIt database.
Query | Search method | NCIt match (label) | NCIt code (id) | Cosine | Dice |
---|---|---|---|---|---|
Sudden Death Syndrome | exact | NA | NA | NA | NA |
Sudden Death Syndrome | mixed | CDISC SDTM Sudden Death Syndrome Type Terminology | NCIT:C101852 | 0.65 | 0.6 |
Sudden Death Syndrome | mixed | Family History of Sudden Arrythmia Death Syndrome | NCIT:C168019 | 0.65 | 0.6 |
Sudden Death Syndrome | mixed | Family History of Sudden Infant Death Syndrome | NCIT:C168209 | 0.65 | 0.6 |
Sudden Death Syndrome | mixed | Sudden Infant Death Syndrome | NCIT:C85173 | 0.86 | 0.86 |
Here, using the default --search method (exact) will yield no matches. However, with --search mixed we would identify Sudden Infant Death Syndrome, as it has the highest cosine score. If we had set --min-text-similarity-score to 0.9, we would not have found any matches.
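The behaviour described above can be sketched in a few lines. This is an illustration under stated assumptions, not Convert-Pheno's code: it assumes phrases are compared as sets of lower-cased words, and the function names (tokens, cosine, dice, search) are hypothetical. With that assumption the scores come out close to the values in the table.

```python
import math


def tokens(text):
    """Split a phrase into a set of lower-cased words."""
    return set(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two phrases treated as binary word vectors."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))


def dice(a, b):
    """Sorensen-Dice coefficient between the word sets of two phrases."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def search(query, labels, mode="exact", min_score=0.8):
    """Try an exact (case-insensitive) match, then optionally fall back to similarity."""
    for label in labels:
        if label.lower() == query.lower():
            return label
    if mode != "mixed" or not labels:
        return None
    best = max(labels, key=lambda label: cosine(query, label))
    return best if cosine(query, best) >= min_score else None


labels = [
    "CDISC SDTM Sudden Death Syndrome Type Terminology",
    "Family History of Sudden Arrythmia Death Syndrome",
    "Family History of Sudden Infant Death Syndrome",
    "Sudden Infant Death Syndrome",
]
query = "Sudden Death Syndrome"
print(search(query, labels))                                 # exact search
print(search(query, labels, mode="mixed"))                   # default threshold 0.8
print(search(query, labels, mode="mixed", min_score=0.9))    # stricter threshold
```

Only the last label scores above the default threshold, and raising the threshold to 0.9 removes it as well, which mirrors the outcome described in the text.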
Running Convert-Pheno
Now you can proceed to run convert-pheno with the command-line interface. Please see how here.
This section provides a summary of the steps to convert an OMOP CDM export to Beacon v2 Models. The starting point is a PostgreSQL export in the form of either .sql or .csv files. The process is the same for both.
Two possibilities may arise:
- Full export of records.
- Partial export of records.
Full export
In a full export, all standardized terms are included in the CONCEPT table, thus Convert-Pheno does not need to search any additional databases for terminology (with a few exceptions).
Partial export
In a partial export, many standardized terms may be missing from the CONCEPT table. As a result, Convert-Pheno will perform a search on the included ATHENA-OHDSI database. To enable this search you should use the flag --ohdsi-db.
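If you are unsure whether an export is full or partial, one way to find out is to compare the concept ids referenced by the clinical tables with those present in CONCEPT. The sketch below assumes the tables have already been read into lists of dictionaries; concept_id and condition_concept_id are standard OMOP CDM column names, while the function name missing_concepts and the sample values are hypothetical.

```python
def missing_concepts(concept_rows, clinical_rows, concept_columns):
    """Return the concept ids used in clinical rows but absent from CONCEPT."""
    known = {row["concept_id"] for row in concept_rows}
    referenced = {
        row[col]
        for row in clinical_rows
        for col in concept_columns
        if row.get(col)
    }
    return referenced - known


# Hypothetical sample rows (ids and names are made up).
concept = [{"concept_id": "1001", "concept_name": "hypothetical concept"}]
condition_occurrence = [
    {"person_id": "1", "condition_concept_id": "1001"},
    {"person_id": "2", "condition_concept_id": "1002"},
]

missing = missing_concepts(concept, condition_occurrence, ["condition_concept_id"])
use_ohdsi_db = bool(missing)  # True suggests a partial export
print(sorted(missing), use_ohdsi_db)
```

When the set of missing ids is not empty, the export behaves like a partial one and the --ohdsi-db flag is likely needed.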
Running Convert-Pheno
Now you can proceed to run convert-pheno with the command-line interface. Please see how here.
This section provides a summary of the steps to convert a CSV file with raw clinical data to Phenopackets v2.
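- Since CSV files are "free-format," a mapping file is necessary to connect variables (i.e. fields) to something meaningful for Convert-Pheno. This mapping file will be used in the conversion process. The mapping file format, its properties, and the ontology search options are the same as those described for REDCap above.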
Running Convert-Pheno
Now you can proceed to run convert-pheno with the command-line interface. Please see how here.
More questions?
Please take a look at our Frequently Asked Questions.