📊 Cohort mode

Cohort mode performs a cross-comparison of all individuals in one or more cohorts, using either the Hamming distance or the Jaccard index as the metric. The resulting matrix can be further analyzed (e.g., with R) using unsupervised learning techniques such as cluster characterization, dimensionality reduction, or graph-based analytics.
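To make the two metrics concrete, here is a minimal Python sketch. It assumes each individual has already been encoded as a binary vector of equal length; the vectors `ind_a` and `ind_b` are invented for illustration, and returning 0.0 for two all-zero vectors is a convention chosen here, not necessarily what Pheno-Ranker does.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def jaccard_index(a, b):
    """Shared 1s divided by positions where at least one vector has a 1."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # Assumption: two all-zero vectors get an index of 0.0
    return both / either if either else 0.0


# Hypothetical binary encodings of two individuals (illustrative only)
ind_a = [1, 0, 1, 1, 0, 0]
ind_b = [1, 1, 0, 1, 0, 0]

print(hamming_distance(ind_a, ind_b))  # 2
print(jaccard_index(ind_a, ind_b))     # 0.5
```

Note that the Hamming distance grows as individuals become less alike, whereas the Jaccard index grows as they become more alike.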

Generic JSON tutorial

We created a tutorial that deliberately uses generic JSON data (i.e., movies) to illustrate the capabilities of Pheno-Ranker, as starting with familiar examples can help you better grasp its usage.

Once you are comfortable with the concepts using movie data, you will find it easier to apply Pheno-Ranker to real GA4GH standards. For specific examples, please refer to the cohort and patient pages in this documentation.


When using the Pheno-Ranker command-line interface, make sure the command follows the syntax shown below.

For this example, we'll use individuals.json, which contains a JSON array of 36 patients. We will conduct a comprehensive cross-comparison among all individuals within this file.

First, we will download the file:

And now we run Pheno-Ranker:

pheno-ranker -r individuals.json 
More input examples

You can find more input examples here.

This process generates a matrix.txt file, containing the results of 36 x 36 pairwise comparisons, calculated using the Hamming distance metric.

See matrix.txt
107:week_0_arm_1 107:week_2_arm_1 107:week_14_arm_1 125:week_0_arm_1 125:week_2_arm_1 125:week_14_arm_1 125:week_26_arm_1 125:week_52_arm_1 125:week_78_arm_1 215:week_0_arm_1 215:week_2_arm_1 215:week_14_arm_1 215:week_26_arm_1 215:week_52_arm_1 215:week_78_arm_1 257:week_0_arm_1 257:week_2_arm_1 257:week_14_arm_1 257:week_26_arm_1 275:week_0_arm_1 275:week_2_arm_1 275:week_14_arm_1 275:week_52_arm_1 305:week_0_arm_1 305:week_26_arm_1 305:week_52_arm_1 365:week_0_arm_1 365:week_2_arm_1 365:week_14_arm_1 365:week_26_arm_1 365:week_52_arm_1 527:week_0_arm_1 527:week_2_arm_1 527:week_14_arm_1 527:week_26_arm_1 527:week_52_arm_1
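Before analyzing the matrix, it can help to confirm that it is square, symmetric, and has a zero diagonal, as expected for Hamming distances. The Python sketch below assumes a whitespace- or tab-delimited layout with one header line of column labels and one labelled row per individual. The three labels come from the header above, but the numeric values are invented for illustration.

```python
def parse_matrix(text):
    """Parse a labelled, whitespace-delimited matrix into labels and rows."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    row_labels, rows = [], []
    for ln in lines[1:]:
        fields = ln.split()
        row_labels.append(fields[0])
        rows.append([float(v) for v in fields[1:]])
    col_labels = lines[0].split()
    # Drop a corner label if the header has one more field than the data
    if rows and len(col_labels) == len(rows[0]) + 1:
        col_labels = col_labels[1:]
    return col_labels, row_labels, rows


def is_valid_distance_matrix(col_labels, row_labels, rows):
    """Check for matching labels, a square shape, zero diagonal, and symmetry."""
    n = len(rows)
    if col_labels != row_labels or any(len(r) != n for r in rows):
        return False
    zero_diag = all(rows[i][i] == 0 for i in range(n))
    symmetric = all(rows[i][j] == rows[j][i]
                    for i in range(n) for j in range(i))
    return zero_diag and symmetric


# Illustrative 3 x 3 excerpt; the distances are made up
sample = """
107:week_0_arm_1 107:week_2_arm_1 107:week_14_arm_1
107:week_0_arm_1 0 3 5
107:week_2_arm_1 3 0 4
107:week_14_arm_1 5 4 0
"""

cols, ids, values = parse_matrix(sample)
print(len(ids), is_valid_distance_matrix(cols, ids, values))  # 3 True
```

For the real file you could pass `open("matrix.txt").read()` to `parse_matrix` and expect 36 labels.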
Defining the similarity metric

Use the flag --similarity-metric-cohort. The default value is hamming. The alternative value is jaccard.

Exporting intermediate files

It is possible to export all intermediate files, as well as a file indicating coverage with the flag --e. Examples:

pheno-ranker -r individuals.json --e 
pheno-ranker -r individuals.json --e my_fav_id # for choosing a prefix

The intermediate files can be used for further processing (e.g., import to a database; see FAQs) or to make informed decisions. For instance, the file export.coverage_stats.json has stats on the coverage of each term (1D-key) in the cohort. It is possible to go more granular with a tool like jq that parses JSON. For instance:

jq -r 'to_entries | map(.key + ": " + (.value | length | tostring))[]' < export.ref_hash.json

This command will print how many variables per individual were actually used to perform the comparison. You can post-process the output to check for unbalanced data.
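As one possible way to post-process that output, the Python sketch below flags individuals whose variable count falls well below the cohort median. It assumes the `id: count` line format printed by the jq command above; the function name `flag_low_coverage`, the 0.5 cutoff, and the counts in the example are all hypothetical.

```python
from statistics import median


def flag_low_coverage(lines, fraction=0.5):
    """Return IDs whose variable count is below `fraction` of the median."""
    counts = {}
    for ln in lines:
        if not ln.strip():
            continue
        # Split on the last ": " because IDs may themselves contain colons
        ident, _, n = ln.rpartition(": ")
        counts[ident] = int(n)
    cutoff = fraction * median(counts.values())
    return sorted(i for i, n in counts.items() if n < cutoff)


# Invented counts in the format printed by the jq command
jq_output = [
    "107:week_0_arm_1: 40",
    "107:week_2_arm_1: 38",
    "125:week_0_arm_1: 12",
]

print(flag_low_coverage(jq_output))  # ['125:week_0_arm_1']
```

Individuals flagged this way contribute fewer variables to each comparison, so they may deserve a closer look before you interpret the matrix.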

Included R scripts

The link below has a few examples for performing clustering and multidimensional scaling with your data:

R scripts at GitHub.


The matrix can be processed to obtain a heatmap:

R code
# Load library
#library("heatmaply") # could not install

# Read in the input file as a matrix
data <- as.matrix(read.table("matrix.txt", header = TRUE, row.names = 1, check.names = FALSE))

# Save image
png(filename = "heatmap.png", width = 1000, height = 1000,
    units = "px", pointsize = 12, bg = "white", res = NA)

# Create the heatmap with row and column labels
heatmap(data, Rowv = FALSE, Colv = FALSE, labRow = rownames(data), labCol = colnames(data))

# Close the graphics device so that heatmap.png is written
dev.off()

Heatmap of an intra-cohort pairwise comparison

Dimensionality reduction

The same matrix can be processed with multidimensional scaling to reduce the dimensionality.

R code

# Load libraries
library(ggplot2)
library(ggrepel)

# Read in the input file as a matrix
data <- as.matrix(read.table("matrix.txt", header = TRUE, row.names = 1, check.names = FALSE))

# Calculate distance matrix
#d <- dist(data)
#d <- 1 - data  # J-similarity to J-distance

# Perform multidimensional scaling
#fit <- cmdscale(d, eig=TRUE, k=2)
fit <- cmdscale(data, eig=TRUE, k=2)

# Extract (x, y) coordinates of multidimensional scaling
x <- fit$points[,1]
y <- fit$points[,2]

# Create data frame
df <- data.frame(x, y, label=row.names(data))

# Save image
png(filename = "mds.png", width = 1000, height = 1000,
    units = "px", pointsize = 12, bg = "white", res = NA)

# Create scatter plot
ggplot(df, aes(x, y, label = label)) +
  geom_point() +
  geom_text_repel(size = 5, # Adjust the size of the text
                  box.padding = 0.2, # Adjust the padding around the text
                  max.overlaps = 10) + # Change the maximum number of overlaps
  labs(title = "Multidimensional Scaling Results",
       x = "Hamming Distance MDS Coordinate 1",
       y = "Hamming Distance MDS Coordinate 2") + # Add title and axis labels
  theme(
        plot.title = element_text(size = 30, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 25),
        axis.text = element_text(size = 15))

# Close the graphics device so that mds.png is written
dev.off()

Multidimensional scaling of an intra-cohort pairwise comparison
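The R code above passes the matrix straight to multidimensional scaling, which suits the default Hamming distance. If the matrix was produced with the jaccard metric, the values are similarities, and the commented-out line `d <- 1 - data` hints at converting them to distances first. The Python sketch below shows that conversion on a small invented similarity matrix; the function name `similarity_to_distance` is hypothetical.

```python
def similarity_to_distance(sim):
    """Convert similarities in [0, 1] to distances by taking 1 - s."""
    return [[1.0 - s for s in row] for row in sim]


# Invented Jaccard similarities for three individuals
jaccard_sim = [
    [1.0, 0.5, 0.25],
    [0.5, 1.0, 0.75],
    [0.25, 0.75, 1.0],
]

jaccard_dist = similarity_to_distance(jaccard_sim)
print(jaccard_dist[0])  # [0.0, 0.5, 0.75]
```

After the conversion, identical individuals sit at distance 0, which is what a distance-based method such as multidimensional scaling expects.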

Graph analytics

Pheno-Ranker has an option for creating a graph in JSON format, compatible with the Cytoscape ecosystem.

Bash code for Cytoscape-compatible graph/network

pheno-ranker -r individuals.json --cytoscape-json
This command generates a graph.json file, as well as a matrix.txt file.

To produce summary statistics, use:

pheno-ranker -r individuals.json --cytoscape-json --graph-stats
This command will produce a file called graph_stats.txt. For additional information, see the generic JSON tutorial.
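To give a rough idea of how a distance matrix relates to a graph, the Python sketch below links every pair of individuals whose distance is at most a chosen threshold and reports a few basic statistics. This is an illustration only: it does not describe how Pheno-Ranker builds graph.json or what graph_stats.txt contains. The threshold `max_distance`, the `weight` attribute, the generic Cytoscape.js-style `elements` layout, and the distances are all assumptions.

```python
import json


def matrix_to_elements(labels, dist, max_distance):
    """Build nodes and edges, linking pairs within max_distance."""
    nodes = [{"data": {"id": lab}} for lab in labels]
    edges = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if dist[i][j] <= max_distance:
                edges.append({"data": {"source": labels[i],
                                       "target": labels[j],
                                       "weight": dist[i][j]}})
    return {"elements": {"nodes": nodes, "edges": edges}}


def graph_summary(graph):
    """Count nodes and edges and compute the density of an undirected graph."""
    n = len(graph["elements"]["nodes"])
    m = len(graph["elements"]["edges"])
    possible = n * (n - 1) / 2
    return {"nodes": n, "edges": m,
            "density": m / possible if possible else 0.0}


# Invented distances for three individuals
labels = ["107:week_0_arm_1", "107:week_2_arm_1", "107:week_14_arm_1"]
dist = [[0, 3, 5],
        [3, 0, 4],
        [5, 4, 0]]

graph = matrix_to_elements(labels, dist, max_distance=4)
summary = graph_summary(graph)
print(json.dumps(summary))
```

With a threshold of 4, the pair at distance 5 is left unconnected, so the example graph has three nodes and two edges.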

We'll be using individuals.json again, which includes data for 36 patients. This time, however, we'll use it twice to simulate having two cohorts. The software will add a CX_ prefix to the primary_key values to help us keep track of which patient comes from which usage of the file.

pheno-ranker -r individuals.json individuals.json

Is it possible to have a cohort with just one individual?

Absolutely, a cohort can indeed be composed of a single individual. This allows for an analysis involving both a cohort and specific patient(s) simultaneously.

Heatmap of an inter-cohort pairwise comparison

The prefixes can be changed with the flag --append-prefixes:

pheno-ranker -r individuals.json individuals.json --append-prefixes REF TAR
This will create a matrix.txt file of (36+36) x (36+36) cells. Again, this matrix can be processed with R:

Heatmap of an inter-cohort pairwise comparison
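When two cohorts share one matrix, you may only be interested in the cross-cohort block, that is, the rows from one cohort against the columns from the other. The Python sketch below selects that block by label prefix. It assumes the prefixes are joined to the IDs with an underscore (e.g., `REF_` and `TAR_`); the function name `cross_cohort_block` and the distances are hypothetical.

```python
def cross_cohort_block(labels, matrix, row_prefix, col_prefix):
    """Extract the sub-matrix of row_prefix individuals vs col_prefix ones."""
    rows = [i for i, lab in enumerate(labels) if lab.startswith(row_prefix)]
    cols = [j for j, lab in enumerate(labels) if lab.startswith(col_prefix)]
    block = [[matrix[i][j] for j in cols] for i in rows]
    return [labels[i] for i in rows], [labels[j] for j in cols], block


# Invented 4 x 4 matrix: the same two individuals loaded as REF and TAR
labels = [
    "REF_107:week_0_arm_1", "REF_125:week_0_arm_1",
    "TAR_107:week_0_arm_1", "TAR_125:week_0_arm_1",
]
matrix = [
    [0, 3, 0, 3],
    [3, 0, 3, 0],
    [0, 3, 0, 3],
    [3, 0, 3, 0],
]

ref_ids, tar_ids, block = cross_cohort_block(labels, matrix, "REF_", "TAR_")
print(block)  # [[0, 3], [3, 0]]
```

Because the same file was used for both cohorts in this example, each individual is at distance 0 from its own copy in the other cohort.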