Using Pheno-Ranker
with OMIM Data¶
This guide explains how to use Pheno-Ranker
with data from the OMIM Database.
Data Download¶
Starting from version 1.02, Pheno-Ranker
includes disease-based cohorts: one for the OMIM Database and another for the ORPHAnet Database.
Which data exchange formats
We provide two data exchange formats, PXF
(Phenotype Exchange Format) and BFF
(Beacon Friendly Format).
For more details on how these cohorts were created, refer to this repository. If you use this data in your research, please check the citation section.
If you installed the code via GitHub or Docker, the disease-based cohorts should already be available. Otherwise, follow these steps to download them. Ensure that wget
and jq
are installed:
We assume that you have R
installed and can install its required packages.
wget https://raw.githubusercontent.com/CNAG-Biomedical-Informatics/pheno-ranker/refs/heads/main/share/diseases/hpo/omim.pxf.json.gz
Here, we will use the PXF
version of the disease-based cohort.
Analytics¶
The cohort consists of 6,471 diseases. Each disease is represented as a PXF
containing a list of phenotypicTerms
.
From this point onward, we assume that the pheno-ranker
directory is located at ../
. Please adjust the path accordingly.
To analyze the distribution of terms across patients, we will use a Perl
script.
See Perl code
#!/usr/bin/perl
use strict;
use warnings;
use JSON::XS;
# Read from STDIN only
if (-t STDIN) { # Check if STDIN is empty (no piped input)
print STDERR "Usage: zcat input.json.gz | $0\n";
print STDERR " cat input.json | $0\n";
exit 1;
}
# Read the entire JSON content from STDIN
my $json_text = do {
local $/;
<STDIN>;
};
# Decode JSON
my $json = JSON::XS->new->utf8->decode($json_text);
# Ensure the decoded JSON is an array reference
die "Input JSON is not an array.\n" unless ref($json) eq 'ARRAY';
# Print CSV header
print "key,count\n";
# Iterate through each object in the array
foreach my $obj (@$json) {
# Ensure the current element is an object with an 'id' field
next unless ref($obj) eq 'HASH' && exists $obj->{id};
my $id = $obj->{id};
my $count = 0;
# Check if 'phenotypicFeatures' exists and is an array
if (exists $obj->{phenotypicFeatures} && ref($obj->{phenotypicFeatures}) eq 'ARRAY') {
$count = scalar @{ $obj->{phenotypicFeatures} };
}
# Print the CSV line
print "\"$id\",$count\n";
}
Now, we will use R
to plot a histogram.
See R code
# Load required packages
library(ggplot2)
# Read the CSV file into a data frame
df <- read.csv("counts.csv")
# Calculate the average and median counts
average_count <- mean(df$count)
median_count <- median(df$count)
# Generate the histogram and add vertical lines for the average and median
plot <- ggplot(df, aes(x = count)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "black") +
geom_vline(aes(xintercept = average_count), color = "black", linetype = "dashed", linewidth = 1) +
geom_vline(aes(xintercept = median_count), color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Distribution of Element Counts per Object",
x = "Number of Elements",
y = "Frequency") +
theme_minimal() +
annotate("text", x = average_count + 0.5, y = Inf, label = paste("Mean:", round(average_count, 2)), vjust = 2) +
annotate("text", x = median_count - 0.5, y = Inf, label = paste("Median:", round(median_count, 2)), vjust = 2, color = "red")
# Save the plot as a PNG file
ggsave("histogram_with_mean_median.png", plot = plot, width = 8, height = 6, dpi = 300)
# Print confirmation
cat("Histogram saved as histogram_with_mean_median.png\n")
Display plot
Based on the median, the Completeness of the Phenopackets Corpus is approximately 71% (12/17).
Cohort Mode¶
We begin with a simple calculation in cohort mode. In the run, we will also export intermediate files using --e omim
for later use. From now on, we will focus on phenotypicFeatures
terms, which we aim to use for patient classification.
Ensure that the pheno-ranker
directory is set correctly.
time ../pheno-ranker/bin/pheno-ranker -r omim.pxf.json.gz --include-terms phenotypicFeatures --max-matrix-items-in-ram 10000 -e omim
This calculation takes approximately 1 min 40 sec (1 core @ Apple M2 Pro). The --max-matrix-items-in-ram 10000
flag increases efficiency by using RAM, making the process 2x faster.
Since a 6,471 ร 6,471 matrix is too large for a heatmap, we will use multidimensional scaling (mds.R
script).
This computation takes about 5 minutes (1 core @ Apple M2 Pro).
See R code
library(ggplot2)
library(ggrepel)
# Read in the input file as a matrix
data <- as.matrix(read.table("matrix.txt", header = TRUE, row.names = 1, check.names = FALSE))
# Calculate distance matrix
#d <- dist(data)
#d <- 1 - data # J-similarity to J-distance
# Perform multidimensional scaling
#fit <- cmdscale(d, eig=TRUE, k=2)
fit <- cmdscale(data, eig=TRUE, k=2)
# Extract (x, y) coordinates of multidimensional scaling
x <- fit$points[,1]
y <- fit$points[,2]
# Create data frame
df <- data.frame(x, y, label=row.names(data))
# Save image
png(filename = "mds.png", width = 1000, height = 1000,
units = "px", pointsize = 12, bg = "white", res = NA)
# Create scatter plot
ggplot(df, aes(x, y, label = label)) +
geom_point() +
geom_text_repel(size = 5, # Adjust the size of the text
box.padding = 0.2, # Adjust the padding around the text
max.overlaps = 10) + # Change the maximum number of overlaps
labs(title = "Multidimensional Scaling Results",
x = "Hamming Distance MDS Coordinate 1",
y = "Hamming Distance MDS Coordinate 2") + # Add title and axis labels
theme(
plot.title = element_text(size = 30, face = "bold", hjust = 0.5),
axis.title = element_text(size = 25),
axis.text = element_text(size = 15))
#dev.off()
Display plot
Graph Representation¶
To generate a graph visualization, we use a subset of 50 diseases for faster execution.
zcat omim.pxf.json.gz | jq -c '.[]' | shuf -n 50 | jq -s '.' > omim_small.json
../pheno-ranker/bin/pheno-ranker -r omim_small.json -include-terms phenotypicFeatures --cytoscape-json omim_cytoscape.json
Display plot
Patient Mode¶
We will attempt to match patient PMID_35344616_A2
from the Phenopackets Corpus to its disease (OMIM:268310) in OMIM. See the Phenopackets Corpus tutorial for instructions on obtaining a PXF
file for the patient.
Sorting by Jaccard index is recommended, as data completeness is below 30%. From the top three, selecting by INTERSECT-RATE(%) ranks the match at 1.
../pheno-ranker/bin/pheno-ranker -r omim.pxf.json.gz -t PMID_35344616_A2.json -include-terms phenotypicFeatures -sort-by jaccard -max-out 5
See Results
RANK | REFERENCE(ID) | TARGET(ID) | FORMAT | LENGTH | WEIGHTED | HAMMING-DISTANCE | DISTANCE-Z-SCORE | DISTANCE-P-VALUE | DISTANCE-Z-SCORE(RAND) | JACCARD-INDEX | JACCARD-Z-SCORE | JACCARD-P-VALUE | REFERENCE-VARS | TARGET-VARS | INTERSECT | INTERSECT-RATE(%) | COMPLETENESS(%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | OMIM:180700 | PMID_35344616_A2 | PXF | 77 | False | 52 | 0.111 | 0.5440913 | 3.0769 | 0.325 | 10.784 | 0.0000000 | 71 | 31 | 25 | 80.65 | 35.21 |
2 | OMIM:616894 | PMID_35344616_A2 | PXF | 64 | False | 44 | -0.397 | 0.3457285 | 3.0000 | 0.312 | 10.359 | 0.0000000 | 53 | 31 | 20 | 64.52 | 37.74 |
3 | OMIM:268310 | PMID_35344616_A2 | PXF | 107 | False | 76 | 1.634 | 0.9488308 | 4.3503 | 0.290 | 9.563 | 0.0000000 | 107 | 31 | 31 | 100.00 | 28.97 |
4 | OMIM:618529 | PMID_35344616_A2 | PXF | 53 | False | 39 | -0.714 | 0.2375690 | 3.4340 | 0.264 | 8.670 | 0.0000000 | 36 | 31 | 14 | 45.16 | 38.89 |
5 | OMIM:616331 | PMID_35344616_A2 | PXF | 84 | False | 64 | 0.872 | 0.8084460 | 4.8008 | 0.238 | 7.759 | 0.0000000 | 73 | 31 | 20 | 64.52 | 27.40 |
This process takes ~10 seconds. For a much faster approach using precomputed data, run:
../pheno-ranker/bin/pheno-ranker -prp omim -t PMID_35344616_A2.json -include-terms phenotypicFeatures -sort-by jaccard -max-out 5
With precomputed data, the calculation takes only 0.5 seconds while yielding identical results.
Citation¶
If you use this information in your research, please cite the following: