
Cohort mode

Cohort mode performs a cross-comparison of all individuals in one or more cohorts, using either the Hamming distance or the Jaccard index as the metric. The resulting matrix can be further analyzed (e.g., with R) using unsupervised learning techniques such as cluster characterization, dimensionality reduction, or graph-based analytics.
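To make the two metrics concrete, here is a minimal Python sketch. It assumes each individual has already been encoded as a binary vector of equal length; that encoding, the function names, and the toy vectors are illustrative assumptions and are not taken from Pheno-Ranker's source code.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def jaccard_index(a, b):
    """Shared 1s divided by positions where at least one vector has a 1."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Convention assumed here: two all-zero vectors count as identical.
    return both / either if either else 1.0


# Toy binary encodings (hypothetical, not taken from individuals.json)
patient_a = [1, 0, 1, 1, 0, 0]
patient_b = [1, 1, 0, 1, 0, 0]

print(hamming_distance(patient_a, patient_b))  # 2
print(jaccard_index(patient_a, patient_b))     # 0.5
```

Note the opposite directions: the Hamming distance grows as individuals become less alike, whereas the Jaccard index grows as they become more alike.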

Generic JSON tutorial

We created a tutorial that deliberately uses generic JSON data (i.e., movies) to illustrate the capabilities of Pheno-Ranker, as starting with familiar examples can help you better grasp its usage.

Once you are comfortable with the concepts using movie data, you will find it easier to apply Pheno-Ranker to real GA4GH standards. For specific examples, please refer to the cohort and patient pages in this documentation.

Usage

When using the Pheno-Ranker command-line interface, make sure the command follows the syntax shown in the example below.

For this example, we'll use individuals.json, which contains a JSON array of 36 patients. We will conduct a comprehensive cross-comparison among all individuals within this file.

First, we will download the file:

wget https://raw.githubusercontent.com/CNAG-Biomedical-Informatics/pheno-ranker/refs/heads/main/t/individuals.json
And now we run Pheno-Ranker:

pheno-ranker -r individuals.json 
More input examples

You can find more input examples here.

This process generates a matrix.txt file, containing the results of 36 x 36 pairwise comparisons, calculated using the Hamming distance metric.

See matrix.txt
107:week_0_arm_1 107:week_2_arm_1 107:week_14_arm_1 125:week_0_arm_1 125:week_2_arm_1 125:week_14_arm_1 125:week_26_arm_1 125:week_52_arm_1 125:week_78_arm_1 215:week_0_arm_1 215:week_2_arm_1 215:week_14_arm_1 215:week_26_arm_1 215:week_52_arm_1 215:week_78_arm_1 257:week_0_arm_1 257:week_2_arm_1 257:week_14_arm_1 257:week_26_arm_1 275:week_0_arm_1 275:week_2_arm_1 275:week_14_arm_1 275:week_52_arm_1 305:week_0_arm_1 305:week_26_arm_1 305:week_52_arm_1 365:week_0_arm_1 365:week_2_arm_1 365:week_14_arm_1 365:week_26_arm_1 365:week_52_arm_1 527:week_0_arm_1 527:week_2_arm_1 527:week_14_arm_1 527:week_26_arm_1 527:week_52_arm_1
107:week_0_arm_1 0 24 23 6 23 23 24 43 40 16 27 29 27 49 32 29 45 45 50 14 25 30 51 18 26 45 20 25 26 30 45 24 24 23 32 43
107:week_2_arm_1 24 0 3 22 3 3 2 23 18 30 7 9 7 29 10 47 25 25 28 30 5 10 31 32 4 25 34 5 6 8 25 42 2 3 10 23
107:week_14_arm_1 23 3 0 21 2 2 3 22 19 29 6 8 6 28 11 46 24 24 29 29 4 9 30 31 5 24 33 4 5 9 24 41 3 2 11 22
125:week_0_arm_1 6 22 21 0 21 21 22 41 38 14 25 27 25 47 30 29 43 43 48 12 23 28 49 16 24 43 18 23 24 28 43 24 22 21 30 41
125:week_2_arm_1 23 3 2 21 0 2 3 22 19 29 6 8 6 28 11 46 24 24 29 29 4 9 30 31 5 24 33 4 5 9 24 41 3 2 11 22
125:week_14_arm_1 23 3 2 21 2 0 3 22 19 29 6 8 6 28 11 46 24 24 29 29 4 9 30 31 5 24 33 4 5 9 24 41 3 2 11 22
125:week_26_arm_1 24 2 3 22 3 3 0 23 18 30 7 9 7 29 10 47 25 25 28 30 5 10 31 32 4 25 34 5 6 8 25 42 2 3 10 23
125:week_52_arm_1 43 23 22 41 22 22 23 0 7 49 26 28 26 8 15 26 4 4 9 49 24 29 10 51 25 4 53 24 25 29 4 21 23 22 15 2
125:week_78_arm_1 40 18 19 38 19 19 18 7 0 46 23 25 23 13 10 31 9 9 12 46 21 26 15 48 20 9 50 21 22 24 9 26 18 19 10 7
215:week_0_arm_1 16 30 29 14 29 29 30 49 46 0 33 27 33 43 38 37 51 51 56 12 31 34 45 22 32 51 18 31 30 36 51 34 30 29 38 49
215:week_2_arm_1 27 7 6 25 6 6 7 26 23 33 0 12 2 32 15 50 28 28 33 33 8 13 34 35 9 28 29 8 9 13 28 45 7 6 15 26
215:week_14_arm_1 29 9 8 27 8 8 9 28 25 27 12 0 12 26 17 50 30 30 35 23 10 13 28 37 11 30 31 10 9 15 30 47 9 8 17 28
215:week_26_arm_1 27 7 6 25 6 6 7 26 23 33 2 12 0 32 15 50 28 28 33 33 8 13 34 35 9 28 29 8 9 13 28 45 7 6 15 26
215:week_52_arm_1 49 29 28 47 28 28 29 8 13 43 32 26 32 0 21 30 10 10 15 47 30 33 4 57 31 10 55 30 29 35 10 27 29 28 21 8
215:week_78_arm_1 32 10 11 30 11 11 10 15 10 38 15 17 15 21 0 39 17 17 20 38 13 18 23 40 12 17 42 13 14 16 17 34 10 11 2 15
257:week_0_arm_1 29 47 46 29 46 46 47 26 31 37 50 50 50 30 39 0 24 24 29 31 44 47 28 37 45 24 37 44 43 49 24 7 47 46 39 26
257:week_2_arm_1 45 25 24 43 24 24 25 4 9 51 28 30 28 10 17 24 0 2 7 47 22 27 8 49 23 2 51 22 23 27 2 23 25 24 17 4
257:week_14_arm_1 45 25 24 43 24 24 25 4 9 51 28 30 28 10 17 24 2 0 7 47 22 27 8 49 23 2 51 22 23 27 2 23 25 24 17 4
257:week_26_arm_1 50 28 29 48 29 29 28 9 12 56 33 35 33 15 20 29 7 7 0 52 27 32 13 46 26 7 56 27 28 22 7 28 28 29 20 9
275:week_0_arm_1 14 30 29 12 29 29 30 49 46 12 33 23 33 47 38 31 47 47 52 0 27 30 45 18 28 47 12 27 26 32 47 32 30 29 38 49
275:week_2_arm_1 25 5 4 23 4 4 5 24 21 31 8 10 8 30 13 44 22 22 27 27 0 7 28 29 3 22 31 2 3 7 22 43 5 4 13 24
275:week_14_arm_1 30 10 9 28 9 9 10 29 26 34 13 13 13 33 18 47 27 27 32 30 7 0 31 34 8 27 34 7 6 12 27 48 10 9 18 29
275:week_52_arm_1 51 31 30 49 30 30 31 10 15 45 34 28 34 4 23 28 8 8 13 45 28 31 0 55 29 8 53 28 27 33 8 29 31 30 23 10
305:week_0_arm_1 18 32 31 16 31 31 32 51 48 22 35 37 35 57 40 37 49 49 46 18 29 34 55 0 30 49 22 29 30 26 49 36 32 31 40 51
305:week_26_arm_1 26 4 5 24 5 5 4 25 20 32 9 11 9 31 12 45 23 23 26 28 3 8 29 30 0 23 32 3 4 6 23 44 4 5 12 25
305:week_52_arm_1 45 25 24 43 24 24 25 4 9 51 28 30 28 10 17 24 2 2 7 47 22 27 8 49 23 0 51 22 23 27 2 23 25 24 17 4
365:week_0_arm_1 20 34 33 18 33 33 34 53 50 18 29 31 29 55 42 37 51 51 56 12 31 34 53 22 32 51 0 31 30 36 51 38 34 33 42 53
365:week_2_arm_1 25 5 4 23 4 4 5 24 21 31 8 10 8 30 13 44 22 22 27 27 2 7 28 29 3 22 31 0 3 7 22 43 5 4 13 24
365:week_14_arm_1 26 6 5 24 5 5 6 25 22 30 9 9 9 29 14 43 23 23 28 26 3 6 27 30 4 23 30 3 0 8 23 44 6 5 14 25
365:week_26_arm_1 30 8 9 28 9 9 8 29 24 36 13 15 13 35 16 49 27 27 22 32 7 12 33 26 6 27 36 7 8 0 27 48 8 9 16 29
365:week_52_arm_1 45 25 24 43 24 24 25 4 9 51 28 30 28 10 17 24 2 2 7 47 22 27 8 49 23 2 51 22 23 27 0 23 25 24 17 4
527:week_0_arm_1 24 42 41 24 41 41 42 21 26 34 45 47 45 27 34 7 23 23 28 32 43 48 29 36 44 23 38 43 44 48 23 0 42 41 34 21
527:week_2_arm_1 24 2 3 22 3 3 2 23 18 30 7 9 7 29 10 47 25 25 28 30 5 10 31 32 4 25 34 5 6 8 25 42 0 3 10 23
527:week_14_arm_1 23 3 2 21 2 2 3 22 19 29 6 8 6 28 11 46 24 24 29 29 4 9 30 31 5 24 33 4 5 9 24 41 3 0 11 22
527:week_26_arm_1 32 10 11 30 11 11 10 15 10 38 15 17 15 21 2 39 17 17 20 38 13 18 23 40 12 17 42 13 14 16 17 34 10 11 0 15
527:week_52_arm_1 43 23 22 41 22 22 23 2 7 49 26 28 26 8 15 26 4 4 9 49 24 29 10 51 25 4 53 24 25 29 4 21 23 22 15 0
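As a quick sanity check on the matrix, the following Python sketch parses the whitespace-separated layout shown above and lists the most similar pairs. The excerpt is the top-left 4 x 4 corner of the matrix above; the function names are hypothetical, and the parser assumes that labels contain no whitespace.

```python
def parse_matrix(text):
    """Parse a labelled square matrix: header row, then one labelled row per individual."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    labels = lines[0].split()
    rows = {}
    for ln in lines[1:]:
        fields = ln.split()
        rows[fields[0]] = [float(v) for v in fields[1:]]
    return labels, rows


def closest_pairs(labels, rows, top=3):
    """Return the `top` pairs with the smallest distance (upper triangle only)."""
    pairs = []
    for i, a in enumerate(labels):
        for j in range(i + 1, len(labels)):
            pairs.append((rows[a][j], a, labels[j]))
    return sorted(pairs)[:top]


# Top-left corner of the matrix shown above
excerpt = """
107:week_0_arm_1 107:week_2_arm_1 107:week_14_arm_1 125:week_0_arm_1
107:week_0_arm_1 0 24 23 6
107:week_2_arm_1 24 0 3 22
107:week_14_arm_1 23 3 0 21
125:week_0_arm_1 6 22 21 0
"""

labels, rows = parse_matrix(excerpt)
# For the full file: labels, rows = parse_matrix(open("matrix.txt").read())
for distance, a, b in closest_pairs(labels, rows, top=2):
    print(a, b, distance)
```

With the Hamming metric, smaller values mean more similar individuals, so the first pairs printed are the closest ones.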
Defining the similarity metric

Use the flag --similarity-metric-cohort to choose the metric. The default value is hamming; the alternative is jaccard.

Exporting intermediate files

Use the flag --e to export all intermediate files, along with a file reporting coverage. Examples:

pheno-ranker -r individuals.json --e 
pheno-ranker -r individuals.json --e my_fav_id # for choosing a prefix

The intermediate files can be used for further processing (e.g., import to a database; see FAQs) or to make informed decisions. For instance, the file export.coverage_stats.json contains statistics on the coverage of each term (1D-key) in the cohort. For finer detail, you can use a JSON parser such as jq. For instance:

jq -r 'to_entries | map(.key + ": " + (.value | length | tostring))[]' < export.ref_hash.json

This command will print how many variables per individual were actually used to perform the comparison. You can post-process the output to check for unbalanced data.
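The same count can be post-processed in Python. The sketch below assumes, as the jq command implies, that export.ref_hash.json is a JSON object keyed by individual id whose values hold one entry per variable. The function names, the toy data, and the cutoff of half the median are hypothetical choices for illustration.

```python
import json
from statistics import median


def variables_per_individual(ref_hash):
    """Count how many variables each individual contributes."""
    return {pid: len(variables) for pid, variables in ref_hash.items()}


def flag_sparse(counts, ratio=0.5):
    """Return ids whose variable count falls below ratio * median (arbitrary cutoff)."""
    cutoff = ratio * median(counts.values())
    return sorted(pid for pid, n in counts.items() if n < cutoff)


# Toy stand-in for export.ref_hash.json (ids and keys are made up)
toy = json.loads(
    '{"id_1": {"k1": 1, "k2": 1, "k3": 1, "k4": 1},'
    ' "id_2": {"k1": 1, "k2": 1, "k3": 1},'
    ' "id_3": {"k1": 1}}'
)

counts = variables_per_individual(toy)
print(counts)               # {'id_1': 4, 'id_2': 3, 'id_3': 1}
print(flag_sparse(counts))  # ['id_3']
```

For real data, replace the toy object with json.load(open("export.ref_hash.json")) and adjust the cutoff to your cohort.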

Included R scripts

The link below has a few examples for performing clustering and multidimensional scaling with your data:

R scripts at GitHub.

Clustering

The matrix can be processed to obtain a heatmap:

R code
# Load library
library("pheatmap")

# Read in the input file as a matrix
data <- as.matrix(read.table("matrix.txt", header = TRUE, row.names = 1, check.names = FALSE))

# Save image
png(filename = "heatmap.png", width = 1000, height = 1000,
    units = "px", pointsize = 12, bg = "white", res = NA)

# Create the heatmap with row and column labels
pheatmap(data)

# Close the PNG device so that heatmap.png is written to disk
dev.off()

Heatmap
Heatmap of an intra-cohort pairwise comparison

Dimensionality reduction

The same matrix can be processed with multidimensional scaling to reduce the dimensionality.

R code
library(ggplot2)
library(ggrepel)

# Read in the input file as a matrix 
data <- as.matrix(read.table("matrix.txt", header = TRUE, row.names = 1, check.names = FALSE))

# Perform classical multidimensional scaling in two dimensions
fit <- cmdscale(data, eig=TRUE, k=2)

# Extract the (x, y) coordinates of each individual
x <- fit$points[,1]
y <- fit$points[,2]

# Build a data frame with the coordinates and the row labels
df <- data.frame(x, y, label=row.names(data))

# Save image
png(filename = "mds.png", width = 1000, height = 1000,
    units = "px", pointsize = 12, bg = "white", res = NA)

# Create scatter plot
ggplot(df, aes(x, y, label = label)) +
  geom_point() +
  geom_text_repel(size = 5, # Adjust the size of the text
                  box.padding = 0.2, # Adjust the padding around the text
                  max.overlaps = 10) + # Change the maximum number of overlaps
  labs(title = "Multidimensional Scaling Results",
       x = "Hamming Distance MDS Coordinate 1",
       y = "Hamming Distance MDS Coordinate 2") + # Add title and axis labels
  theme(
        plot.title = element_text(size = 30, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 25),
        axis.text = element_text(size = 15))

# Close the PNG device so that mds.png is written to disk
dev.off()

MDS
Multidimensional scaling of an intra-cohort pairwise comparison

Graph analytics

Pheno-Ranker has an option for creating a graph in JSON format, compatible with the Cytoscape ecosystem.

Bash code for Cytoscape-compatible graph/network

pheno-ranker -r individuals.json --cytoscape-json
This command generates a graph.json file, as well as a matrix.txt file.

To produce summary statistics, use:

pheno-ranker -r individuals.json --cytoscape-json --graph-stats
This command will produce a file called graph_stats.txt. For additional information, see the generic JSON tutorial.

Cohort mode also accepts several cohorts at once. We'll be using individuals.json again, which includes data for 36 patients. This time, however, we'll pass it twice to simulate having two cohorts. The software will add a CX_ prefix to the primary_key values to help us keep track of which patient comes from which usage of the file.

pheno-ranker -r individuals.json individuals.json

Is it possible to have a cohort with just one individual?

Absolutely, a cohort can indeed be composed of a single individual. This allows for an analysis involving both a cohort and specific patient(s) simultaneously.

Heatmap
Heatmap of an inter-cohort pairwise comparison

The prefixes can be changed with the flag --append-prefixes:

pheno-ranker -r individuals.json individuals.json --append-prefixes REF TAR
This will create a matrix.txt file of (36+36) x (36+36) cells. Again, this matrix can be processed with R:

Heatmap
Heatmap of an inter-cohort pairwise comparison
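When the matrix combines two cohorts, it is often the cross-cohort block that matters. The Python sketch below selects the rows of one cohort and the columns of the other by label prefix. It assumes the labels start with REF_ and TAR_ (the exact separator is an assumption), and the toy matrix is built by hand from two values in the matrix shown earlier, mimicking the same file used twice. The function names are hypothetical.

```python
def inter_cohort_block(labels, matrix, row_prefix, col_prefix):
    """Keep rows whose label starts with row_prefix and columns with col_prefix."""
    row_idx = [i for i, lab in enumerate(labels) if lab.startswith(row_prefix)]
    col_idx = [j for j, lab in enumerate(labels) if lab.startswith(col_prefix)]
    block = [[matrix[i][j] for j in col_idx] for i in row_idx]
    return [labels[i] for i in row_idx], [labels[j] for j in col_idx], block


def nearest_match(row_labels, col_labels, block):
    """For each row individual, find the column individual with the smallest distance."""
    result = {}
    for label, row in zip(row_labels, block):
        best = min(range(len(row)), key=row.__getitem__)
        result[label] = (col_labels[best], row[best])
    return result


# Toy 4 x 4 matrix: two individuals, each present in both cohorts
labels = ["REF_107:week_0_arm_1", "REF_107:week_2_arm_1",
          "TAR_107:week_0_arm_1", "TAR_107:week_2_arm_1"]
matrix = [
    [0, 24, 0, 24],
    [24, 0, 24, 0],
    [0, 24, 0, 24],
    [24, 0, 24, 0],
]

ref, tar, block = inter_cohort_block(labels, matrix, "REF_", "TAR_")
print(block)  # [[0, 24], [24, 0]]
print(nearest_match(ref, tar, block))
```

Because both cohorts come from the same file here, every REF individual is matched to its own TAR copy at distance 0; with two different cohorts the block shows how each reference individual relates to the target ones.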