Cohort Mode

Cohort mode performs an all-vs-all comparison of records in one or more cohorts. Each record is flattened, encoded as a binary vector, and compared with either Hamming distance or the Jaccard index.
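
To make the flatten, encode, and compare steps concrete, here is a minimal Python sketch. The flattening scheme, the helper names (`flatten`, `encode`), and the toy records are illustrative assumptions and do not reproduce Pheno-Ranker's internal implementation; only the two metrics follow their standard definitions.

```python
from itertools import chain


def flatten(record, prefix=""):
    """Flatten nested dicts/lists into a set of 'path=value' strings (assumed scheme)."""
    if isinstance(record, dict):
        items = ((f"{prefix}.{k}" if prefix else str(k), v) for k, v in record.items())
    elif isinstance(record, list):
        # List positions are dropped so that element order does not matter.
        items = ((prefix, v) for v in record)
    else:
        return {f"{prefix}={record}"}
    return set(chain.from_iterable(flatten(v, p) for p, v in items))


def encode(flat_records):
    """One-hot encode each flattened record over the shared, sorted vocabulary."""
    vocabulary = sorted(set().union(*flat_records))
    return [[1 if term in flat else 0 for term in vocabulary] for flat in flat_records]


def hamming(a, b):
    """Number of positions where two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard(a, b):
    """Shared 1-bits divided by the positions where either vector has a 1."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    # Treating two all-zero vectors as identical is an assumption of this sketch.
    return both / either if either else 1.0


# Hypothetical toy records, not taken from individuals.json
records = [
    {"sex": "female", "diseases": ["asthma", "diabetes"]},
    {"sex": "female", "diseases": ["asthma"]},
]
vectors = encode([flatten(r) for r in records])
print(hamming(*vectors), jaccard(*vectors))
```

Hamming distance counts differing positions, so lower values mean more similar records, whereas the Jaccard index is a similarity between 0 and 1, so higher values mean more similar records.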

Use cohort mode when you want to explore the structure of a cohort, compare multiple cohorts, identify clusters, run dimensionality reduction, or export a graph for network analysis.

  • Compares: Reference records against each other
  • Basic command: pheno-ranker -r cohort.json
  • Main output: matrix.txt
  • Best for: Clustering, MDS, UMAP, graph export

When to Use It

Explore

Cohort structure

Generate pairwise distances or similarities between all records in a cohort.

Compare

Multiple cohorts

Pass several reference files and keep source cohorts traceable with prefixes.

Scale

Large outputs

Use sparse Matrix Market output or graph edge filters when dense outputs are too large.

Inspect

Intermediate vectors

Export hashes, binary vectors, and coverage statistics for debugging or downstream use.

What You Get

  • matrix.txt: the default dense pairwise comparison matrix.
  • graph.json: an optional Cytoscape-compatible graph when --cytoscape-json is used.
  • graph_stats.txt: optional graph summary statistics when --graph-stats is used.
  • export.*.json: optional intermediate hashes, vectors, and coverage statistics when --export is used.
  • matrix.mtx: optional sparse Matrix Market output for large matrix workflows.
Cohort mode vs patient mode

Use cohort mode when every record should be compared with every other record. Use patient mode when you have a target patient or object and want a ranked list of closest matches.

Generic JSON tutorial

We created a tutorial that deliberately uses generic JSON data (i.e., movies) to illustrate the capabilities of Pheno-Ranker, as starting with familiar examples can help you better grasp its usage.

Once you are comfortable with the concepts using movie data, you will find it easier to apply Pheno-Ranker to real GA4GH standards. For specific examples, please refer to the cohort and patient pages in this documentation.

Usage

The examples below show common cohort-mode command-line patterns. For the complete CLI reference, see Usage.

For this example, we use individuals.json, a JSON array with 36 patients. The goal is to compare every patient against every other patient in the file.

First, we will download the file:

wget https://raw.githubusercontent.com/CNAG-Biomedical-Informatics/pheno-ranker/refs/heads/main/t/data/individuals.json

Now run Pheno-Ranker:

pheno-ranker -r individuals.json
More input examples

You can find more input examples here.

This process generates a matrix.txt file, containing the results of 36 x 36 pairwise comparisons, calculated using the Hamming distance metric.

See matrix.txt
107:week_0_arm_1 107:week_2_arm_1 107:week_14_arm_1 125:week_0_arm_1 125:week_2_arm_1 125:week_14_arm_1 125:week_26_arm_1 125:week_52_arm_1 125:week_78_arm_1 215:week_0_arm_1 215:week_2_arm_1 215:week_14_arm_1 215:week_26_arm_1 215:week_52_arm_1 215:week_78_arm_1 257:week_0_arm_1 257:week_2_arm_1 257:week_14_arm_1 257:week_26_arm_1 275:week_0_arm_1 275:week_2_arm_1 275:week_14_arm_1 275:week_52_arm_1 305:week_0_arm_1 305:week_26_arm_1 305:week_52_arm_1 365:week_0_arm_1 365:week_2_arm_1 365:week_14_arm_1 365:week_26_arm_1 365:week_52_arm_1 527:week_0_arm_1 527:week_2_arm_1 527:week_14_arm_1 527:week_26_arm_1 527:week_52_arm_1
107:week_0_arm_1 0 24 23 6 23 23 24 43 40 16 27 29 27 49 32 29 45 45 50 14 25 30 51 18 26 45 20 25 26 30 45 24 24 23 32 43
107:week_2_arm_1 24 0 3 22 3 3 2 23 18 30 7 9 7 29 10 47 25 25 28 30 5 10 31 32 4 25 34 5 6 8 25 42 2 3 10 23
107:week_14_arm_1 23 3 0 21 2 2 3 22 19 29 6 8 6 28 11 46 24 24 29 29 4 9 30 31 5 24 33 4 5 9 24 41 3 2 11 22
125:week_0_arm_1 6 22 21 0 21 21 22 41 38 14 25 27 25 47 30 29 43 43 48 12 23 28 49 16 24 43 18 23 24 28 43 24 22 21 30 41
125:week_2_arm_1 23 3 2 21 0 2 3 22 19 29 6 8 6 28 11 46 24 24 29 29 4 9 30 31 5 24 33 4 5 9 24 41 3 2 11 22
125:week_14_arm_1 23 3 2 21 2 0 3 22 19 29 6 8 6 28 11 46 24 24 29 29 4 9 30 31 5 24 33 4 5 9 24 41 3 2 11 22
125:week_26_arm_1 24 2 3 22 3 3 0 23 18 30 7 9 7 29 10 47 25 25 28 30 5 10 31 32 4 25 34 5 6 8 25 42 2 3 10 23
125:week_52_arm_1 43 23 22 41 22 22 23 0 7 49 26 28 26 8 15 26 4 4 9 49 24 29 10 51 25 4 53 24 25 29 4 21 23 22 15 2
125:week_78_arm_1 40 18 19 38 19 19 18 7 0 46 23 25 23 13 10 31 9 9 12 46 21 26 15 48 20 9 50 21 22 24 9 26 18 19 10 7
215:week_0_arm_1 16 30 29 14 29 29 30 49 46 0 33 27 33 43 38 37 51 51 56 12 31 34 45 22 32 51 18 31 30 36 51 34 30 29 38 49
215:week_2_arm_1 27 7 6 25 6 6 7 26 23 33 0 12 2 32 15 50 28 28 33 33 8 13 34 35 9 28 29 8 9 13 28 45 7 6 15 26
215:week_14_arm_1 29 9 8 27 8 8 9 28 25 27 12 0 12 26 17 50 30 30 35 23 10 13 28 37 11 30 31 10 9 15 30 47 9 8 17 28
215:week_26_arm_1 27 7 6 25 6 6 7 26 23 33 2 12 0 32 15 50 28 28 33 33 8 13 34 35 9 28 29 8 9 13 28 45 7 6 15 26
215:week_52_arm_1 49 29 28 47 28 28 29 8 13 43 32 26 32 0 21 30 10 10 15 47 30 33 4 57 31 10 55 30 29 35 10 27 29 28 21 8
215:week_78_arm_1 32 10 11 30 11 11 10 15 10 38 15 17 15 21 0 39 17 17 20 38 13 18 23 40 12 17 42 13 14 16 17 34 10 11 2 15
257:week_0_arm_1 29 47 46 29 46 46 47 26 31 37 50 50 50 30 39 0 24 24 29 31 44 47 28 37 45 24 37 44 43 49 24 7 47 46 39 26
257:week_2_arm_1 45 25 24 43 24 24 25 4 9 51 28 30 28 10 17 24 0 2 7 47 22 27 8 49 23 2 51 22 23 27 2 23 25 24 17 4
257:week_14_arm_1 45 25 24 43 24 24 25 4 9 51 28 30 28 10 17 24 2 0 7 47 22 27 8 49 23 2 51 22 23 27 2 23 25 24 17 4
257:week_26_arm_1 50 28 29 48 29 29 28 9 12 56 33 35 33 15 20 29 7 7 0 52 27 32 13 46 26 7 56 27 28 22 7 28 28 29 20 9
275:week_0_arm_1 14 30 29 12 29 29 30 49 46 12 33 23 33 47 38 31 47 47 52 0 27 30 45 18 28 47 12 27 26 32 47 32 30 29 38 49
275:week_2_arm_1 25 5 4 23 4 4 5 24 21 31 8 10 8 30 13 44 22 22 27 27 0 7 28 29 3 22 31 2 3 7 22 43 5 4 13 24
275:week_14_arm_1 30 10 9 28 9 9 10 29 26 34 13 13 13 33 18 47 27 27 32 30 7 0 31 34 8 27 34 7 6 12 27 48 10 9 18 29
275:week_52_arm_1 51 31 30 49 30 30 31 10 15 45 34 28 34 4 23 28 8 8 13 45 28 31 0 55 29 8 53 28 27 33 8 29 31 30 23 10
305:week_0_arm_1 18 32 31 16 31 31 32 51 48 22 35 37 35 57 40 37 49 49 46 18 29 34 55 0 30 49 22 29 30 26 49 36 32 31 40 51
305:week_26_arm_1 26 4 5 24 5 5 4 25 20 32 9 11 9 31 12 45 23 23 26 28 3 8 29 30 0 23 32 3 4 6 23 44 4 5 12 25
305:week_52_arm_1 45 25 24 43 24 24 25 4 9 51 28 30 28 10 17 24 2 2 7 47 22 27 8 49 23 0 51 22 23 27 2 23 25 24 17 4
365:week_0_arm_1 20 34 33 18 33 33 34 53 50 18 29 31 29 55 42 37 51 51 56 12 31 34 53 22 32 51 0 31 30 36 51 38 34 33 42 53
365:week_2_arm_1 25 5 4 23 4 4 5 24 21 31 8 10 8 30 13 44 22 22 27 27 2 7 28 29 3 22 31 0 3 7 22 43 5 4 13 24
365:week_14_arm_1 26 6 5 24 5 5 6 25 22 30 9 9 9 29 14 43 23 23 28 26 3 6 27 30 4 23 30 3 0 8 23 44 6 5 14 25
365:week_26_arm_1 30 8 9 28 9 9 8 29 24 36 13 15 13 35 16 49 27 27 22 32 7 12 33 26 6 27 36 7 8 0 27 48 8 9 16 29
365:week_52_arm_1 45 25 24 43 24 24 25 4 9 51 28 30 28 10 17 24 2 2 7 47 22 27 8 49 23 2 51 22 23 27 0 23 25 24 17 4
527:week_0_arm_1 24 42 41 24 41 41 42 21 26 34 45 47 45 27 34 7 23 23 28 32 43 48 29 36 44 23 38 43 44 48 23 0 42 41 34 21
527:week_2_arm_1 24 2 3 22 3 3 2 23 18 30 7 9 7 29 10 47 25 25 28 30 5 10 31 32 4 25 34 5 6 8 25 42 0 3 10 23
527:week_14_arm_1 23 3 2 21 2 2 3 22 19 29 6 8 6 28 11 46 24 24 29 29 4 9 30 31 5 24 33 4 5 9 24 41 3 0 11 22
527:week_26_arm_1 32 10 11 30 11 11 10 15 10 38 15 17 15 21 2 39 17 17 20 38 13 18 23 40 12 17 42 13 14 16 17 34 10 11 0 15
527:week_52_arm_1 43 23 22 41 22 22 23 2 7 49 26 28 26 8 15 26 4 4 9 49 24 29 10 51 25 4 53 24 25 29 4 21 23 22 15 0
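
If you prefer to inspect the matrix programmatically, the following Python sketch may help. It assumes a whitespace- or tab-separated file with a header row of IDs and one labelled row per record; the helper names `read_matrix` and `closest_pair` are hypothetical. The sample reuses the first three records of the matrix above.

```python
import io


def read_matrix(handle):
    """Parse a square matrix with a header row of IDs and a label on each row."""
    header = handle.readline().split()
    matrix = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        row_id, values = fields[0], [float(v) for v in fields[1:]]
        matrix[row_id] = dict(zip(header, values))
    return matrix


def closest_pair(matrix):
    """Return (distance, id_a, id_b) for the smallest off-diagonal value."""
    ids = list(matrix)
    return min(
        (matrix[a][b], a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
    )


# Excerpt: the first three records of the matrix shown above
sample = io.StringIO(
    "\t107:week_0_arm_1\t107:week_2_arm_1\t107:week_14_arm_1\n"
    "107:week_0_arm_1\t0\t24\t23\n"
    "107:week_2_arm_1\t24\t0\t3\n"
    "107:week_14_arm_1\t23\t3\t0\n"
)
result = closest_pair(read_matrix(sample))
print(result)
```

For a real run, replace the `io.StringIO` sample with `open("matrix.txt")`. With Hamming output the minimum is the most similar pair; with Jaccard output you would look for the maximum instead.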
Defining the similarity metric

Use --similarity-metric-cohort to choose the cohort metric. The default value is hamming; the alternative is jaccard.

pheno-ranker -r individuals.json --similarity-metric-cohort jaccard
Sparse Matrix Market output

By default, cohort mode writes a dense tab-separated matrix (matrix.txt). For large cohorts, you can instead write a sparse Matrix Market coordinate file:

pheno-ranker -r individuals.json --matrix-format mtx -o matrix.mtx

The mtx format stores one triangle of the symmetric matrix and writes only non-zero values. It is always RAM-light and does not use the dense in-memory matrix cache controlled by --max-matrix-records-in-ram.

The Matrix Market file includes comment lines mapping 1-based matrix indexes back to individual IDs:

% id 1 107:week_0_arm_1
% id 2 107:week_2_arm_1
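
As a rough illustration of how such a file could be consumed, the Python sketch below reads a coordinate Matrix Market stream and rebuilds the index-to-ID map from the `% id` comment lines. The function names `read_mtx` and `lookup` are hypothetical, and the banner line and the stored triangle in the sample are assumptions based on the general Matrix Market convention, not confirmed Pheno-Ranker output.

```python
import io


def read_mtx(handle):
    """Read a coordinate Matrix Market stream into (ids, {(i, j): value})."""
    ids, entries = {}, {}
    size_line_seen = False
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("%"):
            parts = line.split(maxsplit=3)
            # Comment lines of the form '% id <index> <label>' carry the ID map.
            if len(parts) == 4 and parts[:2] == ["%", "id"]:
                ids[int(parts[2])] = parts[3]
            continue
        if not size_line_seen:
            size_line_seen = True  # first data line is 'rows cols nonzeros'
            continue
        i, j, value = line.split()
        entries[(int(i), int(j))] = float(value)
    return ids, entries


def lookup(ids, entries, a, b):
    """Value for labels a and b; entries missing from a sparse file are zeros."""
    index = {label: i for i, label in ids.items()}
    i, j = index[a], index[b]
    return entries.get((i, j), entries.get((j, i), 0.0))


# Assumed layout; the value 24 is the Hamming distance between these two records
sample = io.StringIO(
    "%%MatrixMarket matrix coordinate integer symmetric\n"
    "% id 1 107:week_0_arm_1\n"
    "% id 2 107:week_2_arm_1\n"
    "2 2 1\n"
    "2 1 24\n"
)
ids, entries = read_mtx(sample)
print(lookup(ids, entries, "107:week_0_arm_1", "107:week_2_arm_1"))
```

Because only one triangle is stored, `lookup` checks both index orders before falling back to zero.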

Matrix output and Cytoscape graph output are generated independently. This means --matrix-format mtx can be combined with --cytoscape-json.

Exporting intermediate files

It is possible to export all intermediate files, as well as a file indicating coverage, with --export (--e). Examples:

pheno-ranker -r individuals.json --export
pheno-ranker -r individuals.json --export my_fav_id # choose a prefix

The intermediate files can be used for further processing (e.g., import to a database; see FAQs) or to make informed decisions. For instance, the file export.coverage_stats.json has stats on the coverage of each term (1D-key) in the cohort. It is possible to go more granular with a tool like jq that parses JSON. For instance:

jq -r 'to_entries | map(.key + ": " + (.value | length | tostring))[]' < export.ref_hash.json

This command will print how many variables per individual were actually used to perform the comparison. You can post-process the output to check for unbalanced data.
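
The same count can be post-processed in Python. The sketch below assumes, as the jq command implies, that export.ref_hash.json is a JSON object keyed by individual ID whose values hold that individual's variables. The function names, the toy data, and the 0.5 tolerance are hypothetical choices for illustration.

```python
import json
from statistics import median


def variables_per_individual(ref_hash):
    """Map each individual ID to the number of variables it carries."""
    return {individual: len(variables) for individual, variables in ref_hash.items()}


def flag_unbalanced(counts, tolerance=0.5):
    """Return IDs whose count falls below `tolerance` times the cohort median."""
    threshold = tolerance * median(counts.values())
    return sorted(i for i, n in counts.items() if n < threshold)


# Hypothetical stand-in for json.load(open("export.ref_hash.json"))
ref_hash = json.loads(
    '{"P1": {"a": 1, "b": 1, "c": 1, "d": 1},'
    ' "P2": {"a": 1, "b": 1, "c": 1},'
    ' "P3": {"a": 1}}'
)
counts = variables_per_individual(ref_hash)
print(counts)
print(flag_unbalanced(counts))
```

Individuals flagged this way contribute far fewer variables than their peers, which can distort distance-based comparisons.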

Included R scripts

The link below provides a few example scripts for clustering and multidimensional scaling with your data:

R scripts at GitHub.

Clustering

The matrix can be processed to obtain a heatmap:

R code
# Load library
library("pheatmap")

# Read in the input file as a matrix
data <- as.matrix(read.table("matrix.txt", header = TRUE, row.names = 1, check.names = FALSE))

# Open a PNG device for the image
png(filename = "heatmap.png", width = 1000, height = 1000,
    units = "px", pointsize = 12, bg = "white", res = NA)

# Create the clustered heatmap with row and column labels
# Base R alternative:
#heatmap(data, Rowv = FALSE, Colv = FALSE, labRow = rownames(data), labCol = colnames(data))
pheatmap(data)

# Close the device so that heatmap.png is written to disk
dev.off()
Heatmap of an intra-cohort pairwise comparison

Dimensionality reduction

The same matrix can be processed with multidimensional scaling to reduce the dimensionality.

R code
library(ggplot2)
library(ggrepel)

# Read in the input file as a matrix
data <- as.matrix(read.table("matrix.txt", header = TRUE, row.names = 1, check.names = FALSE))

# Optional conversions before scaling
#d <- dist(data)
#d <- 1 - data # J-similarity to J-distance

# Perform multidimensional scaling
#fit <- cmdscale(d, eig=TRUE, k=2)
fit <- cmdscale(data, eig=TRUE, k=2)

# Extract (x, y) coordinates of multidimensional scaling
x <- fit$points[,1]
y <- fit$points[,2]

# Create data frame
df <- data.frame(x, y, label=row.names(data))

# Open a PNG device for the image
png(filename = "mds.png", width = 1000, height = 1000,
    units = "px", pointsize = 12, bg = "white", res = NA)

# Create scatter plot
ggplot(df, aes(x, y, label = label)) +
  geom_point() +
  geom_text_repel(size = 5,            # Adjust the size of the text
                  box.padding = 0.2,   # Adjust the padding around the text
                  max.overlaps = 10) + # Change the maximum number of overlaps
  labs(title = "Multidimensional Scaling Results",
       x = "Hamming Distance MDS Coordinate 1",
       y = "Hamming Distance MDS Coordinate 2") + # Add title and axis labels
  theme(
    plot.title = element_text(size = 30, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 25),
    axis.text = element_text(size = 15))

# Close the device so that mds.png is written to disk
dev.off()
Multidimensional scaling of an intra-cohort pairwise comparison

Or the dimensionality can be reduced with UMAP:

R code
# -- Install uwot on the fly if needed
if (!requireNamespace("uwot", quietly = TRUE)) {
  install.packages("uwot", repos="https://cloud.r-project.org")
}

# -- Load libraries
library(uwot)
library(ggplot2)
library(ggrepel)

# -- Read in the input file as a full distance matrix
data <- as.matrix(
  read.table("matrix.txt",
             header=TRUE,
             row.names=1,
             check.names=FALSE)
)

# -- Convert to a 'dist' object so uwot knows these are distances
d <- as.dist(data)

# -- Set seed for reproducibility
set.seed(42)

# -- Run UMAP directly on the distances
# Passing a 'dist' object lets uwot build the k-NN graph from your distances
umap_res <- umap(
  d,
  n_neighbors=30,
  min_dist=0.3,
  n_components=2
)

# -- Extract UMAP coordinates
x <- umap_res[,1]
y <- umap_res[,2]

# -- Build a data frame for plotting
df <- data.frame(
  x=x,
  y=y,
  label=rownames(data)
)

# -- Open PNG device
png(filename="umap.png",
    width=1000,
    height=1000,
    units="px",
    pointsize=12,
    bg="white",
    res=NA)

# -- Create scatter plot with labels
ggplot(df, aes(x=x, y=y, label=label)) +
  geom_point() +
  geom_text_repel(
    size=5,
    box.padding=0.2,
    max.overlaps=10
  ) +
  labs(
    title="UMAP Embedding of Hamming Distance Matrix",
    x="UMAP Coordinate 1",
    y="UMAP Coordinate 2"
  ) +
  theme(
    plot.title=element_text(size=30, face="bold", hjust=0.5),
    axis.title=element_text(size=25),
    axis.text=element_text(size=15)
  )

# -- Close the device
dev.off()
UMAP of an intra-cohort pairwise comparison

Graph analytics

Pheno-Ranker has an option for creating a graph in JSON format, compatible with the Cytoscape ecosystem.

Bash code for Cytoscape-compatible graph/network
pheno-ranker -r individuals.json --cytoscape-json

This command generates a graph.json file, as well as a matrix.txt file. The graph is generated directly from the binary comparison hashes, not by parsing the matrix file, so it can also be combined with Matrix Market output:

pheno-ranker -r individuals.json --matrix-format mtx -o matrix.mtx --cytoscape-json graph.json

Large graphs can be filtered by edge weight:

# Hamming distance: keep close pairs
pheno-ranker -r individuals.json --cytoscape-json --graph-max-weight 10

# Jaccard similarity: keep highly similar pairs
pheno-ranker -r individuals.json --similarity-metric-cohort jaccard --cytoscape-json --graph-min-weight 0.7
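
To illustrate what the two filters mean conceptually, here is a small Python sketch that keeps edges below a maximum weight (distances) or above a minimum weight (similarities). It works on an already computed matrix, which is not how Pheno-Ranker builds its graph; the function name `filter_edges` and the inclusive bounds are assumptions. The sample values are the first three records of the Hamming matrix shown earlier.

```python
def filter_edges(ids, matrix, max_weight=None, min_weight=None):
    """Return (source, target, weight) tuples from the upper triangle within the bounds."""
    edges = []
    for i, source in enumerate(ids):
        for j in range(i + 1, len(ids)):
            weight = matrix[i][j]
            # Bounds are treated as inclusive here (an assumption of this sketch).
            if max_weight is not None and weight > max_weight:
                continue
            if min_weight is not None and weight < min_weight:
                continue
            edges.append((source, ids[j], weight))
    return edges


ids = ["107:week_0_arm_1", "107:week_2_arm_1", "107:week_14_arm_1"]
hamming_matrix = [
    [0, 24, 23],
    [24, 0, 3],
    [23, 3, 0],
]

# Hamming distance: keep close pairs only
close_pairs = filter_edges(ids, hamming_matrix, max_weight=10)
print(close_pairs)
```

With a Jaccard matrix you would pass `min_weight` instead, since larger values indicate more similar records.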

To produce summary statistics, use:

pheno-ranker -r individuals.json --cytoscape-json --graph-stats

This command will produce a file called graph_stats.txt. For additional information, see the generic JSON tutorial.