OMOP-CDM
OMOP CDM stands for Observational Medical Outcomes Partnership Common Data Model. For details, see the OMOP CDM documentation.
--stream for large OMOP-to-BFF
The OMOP CDM is designed to be database-agnostic, which means it can be implemented using different relational database management systems, with PostgreSQL being a popular choice.
Convert-Pheno is capable of performing both file-based conversions (from PostgreSQL exports in .sql or from any other SQL database via .csv files) and real-time conversions (e.g., from SQL queries) as long as the data has been converted to the accepted JSON format.
Convert-Pheno supports OMOP-CDM in both directions:
- as input, from .sql, .csv, .sql.gz, or .csv.gz exports
- as output, as OMOP CSV tables generated from BFF input
Quick OMOP Commands
| Goal | Command |
|---|---|
| OMOP CSV tables to BFF individuals | convert-pheno -iomop PERSON.csv CONCEPT.csv CONDITION_OCCURRENCE.csv -obff individuals.json |
| OMOP SPECIMEN to BFF biosamples | convert-pheno -iomop PERSON.csv CONCEPT.csv SPECIMEN.csv -obff --entities biosamples --out-dir bff_out/ |
| Large OMOP SQL dump to streamed BFF | convert-pheno -iomop dump.sql.gz -obff individuals.json.gz --stream --ohdsi-db |
| BFF individuals to OMOP CSV tables | convert-pheno -ibff individuals.json -oomop --out-dir omop_export/ |
For more examples, see Conversion Recipes.
About OMOP CDM longitudinal data
OMOP CDM stores visit_occurrence_id for each person_id in the VISIT_OCCURRENCE table. However, Beacon v2 Models currently lack a way to store longitudinal data. To address this, we added a property named _visit to each record, which stores visit information. This property will be serialized only if the VISIT_OCCURRENCE table is provided.
Currently, Convert-Pheno supports versions 5.3 and 5.4 of OMOP CDM, and it is prepared to support v6 once the code can be tested with v6 projects.
OMOP As Output
OMOP-CDM can also be emitted as output from BFF input. In that direction, Convert-Pheno writes OMOP CSV tables rather than Beacon-style JSON.
Example:
convert-pheno -ibff individuals.json -oomop --out-dir omop_export/
This writes table files such as omop_export/PERSON.csv, omop_export/CONDITION_OCCURRENCE.csv, and related OMOP outputs.
To rename one of the emitted tables:
convert-pheno -ibff individuals.json -oomop --out-dir omop_export/ --out-name PERSON=patients.csv
OMOP As Input
- Command-line
- Module
- API
When using the convert-pheno command-line interface, simply ensure the correct syntax is provided.
Most examples below use the individuals-only -obff FILE path, which still emits Beacon individuals as one file. In non-stream BFF conversions, you can also request synthesized datasets and cohorts with --entities ... --out-dir out/. biosamples can now be emitted from OMOP SPECIMEN, but this OMOP-to-BFF biosample path is still experimental and pending validation with external collaborators.
Does Convert-Pheno accept gz files?
Yes, both input and output files can be gzipped to save space. However, it's important to note that the gzip layer introduces an overhead.
This overhead can be substantial, potentially doubling the processing time in --stream mode when handling PostgreSQL dumps as input.
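To illustrate where the gzip overhead comes from, here is a minimal Python sketch of reading a plain or gzipped export through the same interface. It is not Convert-Pheno's own code (the tool is written in Perl); the helper names open_text and count_lines are hypothetical. Every line read from a .gz file passes through a decompression step first, which is the extra cost mentioned above.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        # Each read goes through the gzip decompression layer.
        return gzip.open(path, mode="rt", encoding="utf-8")
    return open(path, mode="rt", encoding="utf-8")


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    with open_text(path) as handle:
        return sum(1 for _ in handle)
```

Timing count_lines on the same export with and without compression gives a rough idea of the overhead on your own hardware.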
About --max-lines-sql default value
For PostgreSQL dumps, the default is --max-lines-sql=500, which is suitable for testing purposes. For real data, increase this limit to match the size of your largest table. This flag does not apply when your input files are CSV.
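One way to choose a value for --max-lines-sql is to count the data rows per table in the dump beforehand. The sketch below assumes a plain-format PostgreSQL dump in which each table's data sits between a COPY ... FROM stdin; line and a terminating \. line. The function name rows_per_table is hypothetical and this is not part of Convert-Pheno.

```python
import re

# Matches e.g. "COPY public.person (person_id, ...) FROM stdin;"
COPY_START = re.compile(
    r'^COPY\s+(?:[\w"]+\.)?"?(\w+)"?\s*(?:\(.*\))?\s*FROM\s+stdin;',
    re.IGNORECASE,
)


def rows_per_table(lines):
    """Count data rows in each COPY block of a plain-format dump."""
    counts = {}
    current = None  # table whose COPY block we are currently inside
    for line in lines:
        line = line.rstrip("\n")
        if current is None:
            match = COPY_START.match(line)
            if match:
                current = match.group(1).upper()
                counts[current] = 0
        elif line == "\\.":
            current = None  # end-of-data marker closes the block
        else:
            counts[current] += 1
    return counts
```

Passing an open file handle as lines and taking max(counts.values()) gives the row count of the largest table, which can then be used as the limit.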
Small to medium-sized files (<1M rows)
All tables at once
Usage:
convert-pheno -iomop omop_dump.sql -obff individuals.json
or when gzipped...
convert-pheno -iomop omop_dump.sql.gz -obff individuals.json.gz
with multiple CSVs (one CSV per table)...
convert-pheno -iomop *csv -obff individuals.json.gz
Independent table files
You can also provide independent table files explicitly, one file per OMOP table. This is useful when your export is already split by table, or when you only want to work with a reduced set of tables.
For example:
convert-pheno -iomop PERSON.csv CONCEPT.csv DRUG_EXPOSURE.csv -obff individuals.json
To emit entity-aware BFF output instead:
convert-pheno -iomop PERSON.csv CONCEPT.csv DRUG_EXPOSURE.csv -obff --entities individuals datasets cohorts --out-dir out/
To emit Beacon biosamples from OMOP SPECIMEN without synthesized datasets or cohorts:
convert-pheno -iomop PERSON.csv CONCEPT.csv SPECIMEN.csv -obff --entities biosamples --out-dir out/
When SPECIMEN.quantity is present, Convert-Pheno also emits it as a
sample-level biosamples.measurements entry. The value comes from
SPECIMEN.quantity, the unit is resolved from unit_concept_id when
available, and unit_source_value is used as a fallback label. Because
OMOP SPECIMEN has no measurement_concept_id equivalent for this
field, the Beacon assayCode uses the valid local CURIE
OMOP:SPECIMEN.quantity with label Specimen quantity.
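The mapping just described could look roughly like the following Python sketch. It is an illustration only: the function name, the unit_concepts lookup, and the exact nesting under measurementValue are assumptions based on the Beacon v2 measurement layout, not a copy of Convert-Pheno's output.

```python
def specimen_quantity_measurement(specimen_row, unit_concepts):
    """Build a Beacon-style measurement from an OMOP SPECIMEN row.

    unit_concepts is a hypothetical dict: concept_id -> CONCEPT row.
    Returns None when SPECIMEN.quantity is absent.
    """
    quantity = specimen_row.get("quantity")
    if quantity is None:
        return None

    concept = unit_concepts.get(specimen_row.get("unit_concept_id"))
    if concept is not None:
        # Unit resolved from unit_concept_id.
        unit = {
            "id": f"{concept['vocabulary_id']}:{concept['concept_code']}",
            "label": concept["concept_name"],
        }
    else:
        # Fallback: only the source label is known (id left out here).
        unit = {"label": specimen_row.get("unit_source_value")}

    return {
        "assayCode": {
            "id": "OMOP:SPECIMEN.quantity",
            "label": "Specimen quantity",
        },
        "measurementValue": {"quantity": {"unit": unit, "value": quantity}},
    }
```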
When Convert-Pheno builds ontology ids from OMOP
CONCEPT.vocabulary_id and CONCEPT.concept_code, whitespace in
the vocabulary prefix is replaced with underscores. For example,
Type Concept becomes Type_Concept, producing ids such as
Type_Concept:OMOP4976929.
Beacon schema validation is permissive for CURIE-like values, but whitespace in identifier prefixes can make downstream APIs, indexes, and query layers harder to handle consistently.
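As a minimal sketch of the prefix normalization described above (a hypothetical helper, not Convert-Pheno's actual code), assuming each whitespace character in the vocabulary prefix becomes one underscore:

```python
import re


def build_ontology_id(vocabulary_id, concept_code):
    """Join vocabulary and code as prefix:code, without whitespace in the prefix."""
    prefix = re.sub(r"\s", "_", vocabulary_id)
    return f"{prefix}:{concept_code}"


print(build_ontology_id("Type Concept", "OMOP4976929"))  # Type_Concept:OMOP4976929
```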
By default, the original OMOP rows are also preserved under fields such
as info.PERSON.OMOP_columns or
biosamples.info.SPECIMEN.OMOP_columns. This is intentional: it helps
users audit the mapping and, when desired, query original OMOP values
through Beacon-oriented APIs. Use --no-source-info to omit these raw
OMOP payloads from the generated BFF.
In this mode, Convert-Pheno infers the OMOP table name from each filename. At minimum, practical conversions usually require:
- PERSON
- CONCEPT or --ohdsi-db
- one or more clinical tables such as DRUG_EXPOSURE, MEASUREMENT, OBSERVATION, or CONDITION_OCCURRENCE
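The filename-based inference can be pictured with the short Python sketch below. The exact rules Convert-Pheno applies are not spelled out here, so this assumes the table name is the file's base name without the .csv or .csv.gz extension, compared case-insensitively. The function name is hypothetical.

```python
from pathlib import Path


def infer_table_name(filename):
    """Derive an OMOP table name from a per-table CSV filename."""
    name = Path(filename).name
    # Strip the compression suffix first, then the CSV extension.
    for suffix in (".gz", ".csv"):
        if name.lower().endswith(suffix):
            name = name[: -len(suffix)]
    return name.upper()


print(infer_table_name("export/drug_exposure.csv.gz"))  # DRUG_EXPOSURE
```

Under this assumption, files should keep the standard OMOP table names so that each one can be matched to its table.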
The same approach also works with gzipped table files:
convert-pheno -iomop PERSON.csv.gz CONCEPT.csv.gz DRUG_EXPOSURE.csv.gz -obff individuals.json.gz
Selected table(s)
You can also convert selected tables only. For instance, to convert only the DRUG_EXPOSURE table, use the option --omop-tables. The option accepts a space-separated list of table names (case insensitive):
CONCEPT and PERSON
Tables CONCEPT and PERSON are always loaded as they're needed for the conversion. You don't need to specify them.
convert-pheno -iomop omop_dump.sql -obff individuals.json --omop-tables DRUG_EXPOSURE
Using this approach you will be able to submit multiple jobs in parallel.
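A per-table parallel run could be scripted as in the sketch below. It only uses the flags shown above (-iomop, -obff, --omop-tables). The output naming scheme and both function names are hypothetical, and the sketch assumes convert-pheno is available on your PATH.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(dump_path, tables):
    """Return one convert-pheno argument list per OMOP table."""
    commands = []
    for table in tables:
        # Hypothetical naming scheme: one output file per table.
        output = f"individuals_{table.lower()}.json"
        commands.append(
            ["convert-pheno", "-iomop", dump_path,
             "-obff", output, "--omop-tables", table]
        )
    return commands


def run_all(commands, max_workers=2):
    """Run the commands concurrently and return their exit codes in order."""
    def run(command):
        return subprocess.run(command, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))
```

For example, run_all(build_commands("omop_dump.sql", ["DRUG_EXPOSURE", "MEASUREMENT"])) starts two jobs, each writing its own JSON file.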
What if my CONCEPT table does not contain all standard concept_id(s)?
In this case, you can use the flag --ohdsi-db, which enables a lookup in an internal database whenever a concept_id cannot be found in your CONCEPT table.
convert-pheno -iomop omop_dump.sql -obff individuals_measurement.json --omop-tables DRUG_EXPOSURE --ohdsi-db
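Conceptually, --ohdsi-db adds a second lookup step behind your own CONCEPT table. The Python sketch below shows that fallback pattern with the standard sqlite3 module. The table and column layout of the fallback database (a table named concept with OMOP-style columns) is an assumption for illustration, as is the function name.

```python
import sqlite3


def resolve_concept(concept_id, concept_table, fallback_db=None):
    """Look up a concept locally first, then in an optional SQLite database.

    concept_table: dict mapping concept_id -> CONCEPT row (user-provided).
    fallback_db:   sqlite3 connection, or None when no fallback is enabled.
    """
    row = concept_table.get(concept_id)
    if row is not None:
        return row
    if fallback_db is None:
        return None
    hit = fallback_db.execute(
        "SELECT concept_name, vocabulary_id, concept_code "
        "FROM concept WHERE concept_id = ?",
        (concept_id,),
    ).fetchone()
    if hit is None:
        return None
    return {
        "concept_id": concept_id,
        "concept_name": hit[0],
        "vocabulary_id": hit[1],
        "concept_code": hit[2],
    }
```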
RAM usage in --no-stream mode (default)
When working with -iomop and --no-stream, Convert-Pheno consolidates all the values that belong to a given person_id under the same object. To do this, all data must be held in RAM, because the rows for one person are not adjacent (they are not pre-sorted by person_id) and can originate from distinct tables.
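The consolidation step can be sketched in Python as follows. This is a simplified illustration rather than Convert-Pheno's implementation: it assumes each table is a list of row dictionaries with a person_id key, and the function name is hypothetical. Because the rows for one person can appear anywhere in any table, the whole structure has to stay in memory until every table has been read.

```python
from collections import defaultdict


def consolidate(tables):
    """Group rows from several OMOP tables under their person_id.

    tables: dict mapping table name -> list of row dicts.
    Returns: dict person_id -> {table name -> list of rows}.
    """
    individuals = defaultdict(lambda: defaultdict(list))
    for table_name, rows in tables.items():
        for row in rows:
            individuals[row["person_id"]][table_name].append(row)
    # Convert to plain dicts for easier inspection and serialization.
    return {pid: dict(per_table) for pid, per_table in individuals.items()}
```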
| Number of rows | Estimated RAM memory | Estimated time |
|---|---|---|
| 100K | <1GB | 5s |
| 500K | 1GB | 15s |
| 1M | 2GB | 30s |
| 2M | 4GB | 1m |
1 x Intel(R) Xeon(R) W-1350P @ 4.00GHz - 32GB RAM - SSD
If your computer only has 4GB-8GB of RAM and you plan to convert large files, we recommend using the flag --stream, which processes your tables incrementally (i.e., line-by-line) instead of loading them into memory.
Large files (>1M rows)
For large files, Convert-Pheno allows for a different approach. The files can be parsed incrementally (i.e., line-by-line).
To choose incremental data processing we'll be using the flag --stream:
--stream mode supported output
We currently support only the individuals-only BFF path (-obff FILE) in --stream mode.
All tables at once
convert-pheno -iomop omop_dump.sql.gz -obff individuals.json.gz --stream
You can also stream independent OMOP table files directly:
convert-pheno -iomop PERSON.csv.gz CONCEPT.csv.gz DRUG_EXPOSURE.csv.gz -obff individuals.json.gz --stream --ohdsi-db
Tables CONCEPT and PERSON are always loaded in RAM.
VISIT_OCCURRENCE will also be loaded if present, and this can consume a lot of RAM depending on its size. You can simply skip this table when exporting OMOP CDM data, because its information is only used for the additional property _visit, which is not part of the Beacon v2 or Phenopackets schemas.
Selected table(s)
You can narrow down the selection to a set of table(s).
About tables CONCEPT and PERSON
Tables CONCEPT and PERSON are always loaded as they're needed for the conversion. You don't need to specify them.
convert-pheno -iomop omop_dump.sql.gz -obff individuals_measurement.json.gz --omop-tables DRUG_EXPOSURE --stream
Running multiple jobs in --stream mode creates several JSON files instead of one. This is fine, because the files being created are intermediate files.
Pros and Cons of incremental data load (--stream mode)
Incremental data load facilitates the processing of huge files. The only substantive difference compared to the --no-stream mode is that the data will not be consolidated at the patient or individual level, which is merely a cosmetic concern. Ultimately, the data will be loaded into a database, such as MongoDB, where the linking of data through keys can be managed. In most cases, the implementation of a pre-built API, such as the one described in the B2RI documentation, will be added to further enhance the functionality.
| Number of rows | Estimated RAM memory | Estimated time |
|---|---|---|
| 100K | 500MB | 7s |
| 500K | 500MB | 18s |
| 1M | 500MB | 35s |
| 2M | 500MB | 1m5s |
1 x Intel(R) Xeon(R) W-1350P @ 4.00GHz - 32GB RAM - SSD
Note that the output JSON files generated in --stream mode will always include information from the PERSON and CONCEPT tables. Therefore, both tables must be loaded into RAM (along with VISIT_OCCURRENCE if present). The size of these tables will obviously impact RAM usage. Although having this information is not a mandatory requirement for MongoDB, it helps in validating the data against Beacon v2 JSON schemas. According to JSON Schema terminology, these files contain required properties for BFF and PXF. For more details on validation, refer to the BFF Validator.
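The trade-off can be pictured with the Python sketch below. It is a simplified illustration, not Convert-Pheno's code: PERSON is loaded into a dictionary that stays in RAM, while clinical rows are consumed one at a time from any iterable and each produces its own record. The record layout and function name are hypothetical and do not reproduce the BFF format.

```python
def stream_rows(person_rows, clinical_rows, table_name):
    """Yield one output record per clinical row, without consolidation.

    person_rows:   iterable of PERSON row dicts (kept in memory).
    clinical_rows: iterable of row dicts, read lazily (e.g. a CSV reader).
    """
    persons = {row["person_id"]: row for row in person_rows}
    for row in clinical_rows:
        # Each record carries the PERSON data it needs to stand alone.
        yield {
            "person": persons.get(row["person_id"]),
            "table": table_name,
            "row": row,
        }
```

Two rows for the same person_id yield two separate records. Linking them by key is left to the database that later ingests the files.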
About parallelization and speed
Runtime depends on input size, CPU, disk speed, compression, and ontology lookup requirements. Small synthetic test files usually complete in seconds, while large OMOP exports and conversions that require database-backed ontology resolution take longer.
The conversion is I/O-bound, and using internal threads did not speed it up. Another valid option is to run simultaneous jobs with external tools such as GNU Parallel, but keep in mind that the SQLite database may complain when several jobs access it at the same time.
As a final consideration, pheno-clinical data conversions are often run to produce intermediate files that are later loaded into a database. For large datasets, the database load can be a substantial part of the total runtime.
For developers who wish to retrieve data in real time, the module can also receive OMOP tables directly as in-memory data structures. The module interface uses one flat payload. Unlike the API, the arguments are not split into input, output, and options sections.
Definitions are stored in table CONCEPT. If you do not pass the relevant CONCEPT rows yourself, set ohdsi_db => 1 (or True in Python) so the converter can resolve terms from the Athena-OHDSI SQLite database.
Perl module payload example
Perl
use Convert::Pheno;

my $payload = {
    method   => 'omop2bff',
    ohdsi_db => 0,
    test     => 1,
    data     => {
        PERSON => [
            {
                person_id              => 974,
                gender_concept_id      => 8532,
                gender_source_value    => 'F',
                year_of_birth          => 1963,
                ethnicity_source_value => 'west_indian',
            }
        ],
        CONCEPT => [
            {
                concept_id    => 8532,
                concept_name  => 'FEMALE',
                vocabulary_id => 'Gender',
            }
        ],
    },
};

my $convert = Convert::Pheno->new($payload);
my $bff     = $convert->omop2bff;
Python bridge payload example
Python
from convertpheno import PythonBinding

payload = {
    "method": "omop2bff",
    "ohdsi_db": False,
    "test": 1,
    "data": {
        "PERSON": [
            {
                "person_id": 974,
                "gender_concept_id": 8532,
                "gender_source_value": "F",
                "year_of_birth": 1963,
                "ethnicity_source_value": "west_indian",
            }
        ],
        "CONCEPT": [
            {
                "concept_id": 8532,
                "concept_name": "FEMALE",
                "vocabulary_id": "Gender",
            }
        ],
    },
}

bff = PythonBinding(payload).convert_pheno()
Everything said about the module also applies to the API.
The API request payload now uses explicit conversion, input, output, and options sections.
Small OMOP API payload example
{
  "conversion": "omop2bff",
  "input": {
    "data": {
      "PERSON": [
        {
          "person_id": 974,
          "gender_concept_id": 8532,
          "year_of_birth": 1963
        }
      ],
      "CONCEPT": [
        {
          "concept_id": 8532,
          "concept_name": "FEMALE",
          "vocabulary_id": "Gender"
        }
      ]
    }
  },
  "output": {
    "entities": ["individuals"]
  },
  "options": {
    "ohdsi_db": true
  }
}
See a larger example payload here.