Patient Mode

Patient mode ranks records in a reference cohort against a target patient or object. It uses the same flattened variables and binary-vector representation as cohort mode, but the output is a ranked table instead of an all-vs-all matrix.

Use patient mode when you want to find the closest matches to a patient profile, inspect which variables overlap, or assess match significance with Z-scores and p-values.

Compares: Target against reference records

Basic command: pheno-ranker -r cohort.json -t patient.json

Main output: rank.txt

Best for: Closest matches and alignment review

When to Use It

Match

Find similar records

Rank every reference record against one target patient or object.

Compare

Multiple cohorts

Use several reference files and keep each match traceable to its source cohort.

Interpret

Read match statistics

Use Hamming distance, Jaccard similarity, Z-scores, p-values, and overlap percentages.

Audit

Inspect alignments

Use --align to see which variables match or differ between target and reference.

What You Get

rank.txt: ranked matches between the target and the reference cohort.
alignment*: optional variable-level alignment files when --align is used.
export.*.json: optional intermediate hashes, vectors, and coverage statistics when --export is used.
Hamming distance, Jaccard similarity, Z-scores, p-values, and overlap statistics for each match.

Patient mode vs cohort mode

Use patient mode when one target should be ranked against a reference cohort. Use cohort mode when every record should be compared with every other record.


Usage

The examples below show the common patient-mode command-line patterns. For the complete CLI reference, see Usage.

Against one cohort
Against multiple cohorts

Example:

pheno-ranker -r individuals.json -t patient.json

How do I extract one or many patients from a cohort file?

pheno-ranker -r t/data/individuals.json --patients-of-interest 107:week_0_arm_1 125:week_0_arm_1

This command performs a dry run and creates the files 107:week_0_arm_1.json and 125:week_0_arm_1.json. On Windows, characters that are invalid in filenames are percent-encoded, so 107:week_0_arm_1 is written as 107%3Aweek_0_arm_1.json. For the example above, 107:week_0_arm_1.json was renamed to patient.json:

mv 107:week_0_arm_1.json patient.json
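The Windows naming rule described above can be illustrated with a short Python sketch. The helper name safe_filename and the exact set of characters treated as invalid are assumptions made for illustration; they are not taken from Pheno-Ranker itself.

```python
# Hypothetical helper: percent-encode characters that Windows rejects in filenames.
# The character set below is an assumption, not Pheno-Ranker's actual list.
WINDOWS_INVALID = '<>:"/\\|?*'


def safe_filename(primary_key: str, extension: str = ".json") -> str:
    """Return a filename in which each invalid character becomes %XX (hex code)."""
    encoded = "".join(
        f"%{ord(ch):02X}" if ch in WINDOWS_INVALID else ch
        for ch in primary_key
    )
    return encoded + extension


print(safe_filename("107:week_0_arm_1"))  # 107%3Aweek_0_arm_1.json
```

Only the colon is rewritten here; every other character of the identifier is kept as-is.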

Running pheno-ranker -r individuals.json -t patient.json creates the output text file rank.txt.

The first rows in rank.txt are the best matches according to the selected sorting metric. By default, patient mode sorts by Hamming distance; use --sort-by jaccard to sort by Jaccard similarity instead.

How to read rank.txt

For most analyses, start with these columns:

RANK: Match order; 1 is the best match under the selected sorting metric.
REFERENCE(ID): The matched individual in the reference cohort.
HAMMING-DISTANCE: Lower values indicate more similar binary profiles.
JACCARD-INDEX: Higher values indicate more similar binary profiles.
DISTANCE-P-VALUE / JACCARD-P-VALUE: Significance of the match within the distribution of comparisons in the run.
INTERSECT-RATE(%): How much of the target profile is covered by the reference match.
COMPLETENESS(%): How much of the reference profile is covered by the target.

Use Hamming distance when you want a distance-like ranking. Use Jaccard similarity when sparse overlap or missingness is important.
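Because rank.txt is tab-separated, the columns listed above can be loaded with a few lines of Python for downstream filtering. The sketch below parses two rows copied from the example output on this page, trimmed to a handful of columns; the function name load_rank and the 0.9 threshold are illustrative choices, not part of Pheno-Ranker.

```python
import csv
import io

# Two rows copied from the rank.txt example on this page, trimmed to a few columns.
SAMPLE = (
    "RANK\tREFERENCE(ID)\tHAMMING-DISTANCE\tJACCARD-INDEX\tINTERSECT-RATE(%)\n"
    "1\t107:week_0_arm_1\t0\t1.000\t100.00\n"
    "2\t125:week_0_arm_1\t6\t0.924\t94.81\n"
)


def load_rank(handle):
    """Parse a tab-separated rank table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = load_rank(io.StringIO(SAMPLE))

# Keep matches whose Jaccard index is at least 0.9 (threshold chosen arbitrarily).
close = [r["REFERENCE(ID)"] for r in rows if float(r["JACCARD-INDEX"]) >= 0.9]
print(close)
```

With a real run, the same function can be given a file handle, for example open("rank.txt", newline="").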

Full rank.txt column reference

Identifiers and run metadata

RANK: Match order. A rank of 1 is the best match.
REFERENCE(ID): The unique identifier (primary_key) for the reference individual.
TARGET(ID): The unique identifier (primary_key) for the target individual passed with --target.
FORMAT: Input format used by the configuration, such as BFF, PXF, or CSV.
WEIGHTED: Whether the calculation used variable weights with --weights.

Alignment size

LENGTH: Count of variables that have a 1 in either the reference or the target. In other words, this is the size of the comparison space for that pair.

LENGTH example

REF: 0001001
TAR: 1000001

In this case, LENGTH is 3 because three positions have a 1 in at least one vector.
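The LENGTH definition can be checked with a minimal Python sketch. The function name alignment_length is hypothetical; the code only restates the definition above and is not the tool's implementation.

```python
def alignment_length(ref: str, tar: str) -> int:
    """Count positions where at least one of the two binary strings has a 1."""
    if len(ref) != len(tar):
        raise ValueError("vectors must have equal length")
    return sum(1 for r, t in zip(ref, tar) if r == "1" or t == "1")


print(alignment_length("0001001", "1000001"))  # 3
```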

Similarity and distance metrics

HAMMING-DISTANCE: Count of positions where the reference and target binary vectors differ. Lower values indicate more similar profiles.
JACCARD-INDEX: Similarity between the reference and target vectors, calculated as the intersection divided by the union. Higher values indicate more similar profiles.

Metric definitions

Hamming distance counts mismatches between two binary strings of equal length.

Jaccard similarity focuses on shared 1 values:

\text{Jaccard} = \frac{\text{Intersection}}{\text{Union}}
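Both definitions can be written in a few lines of Python. This is a sketch of the formulas above applied to binary strings; the function names are hypothetical, and returning 0.0 for two all-zero vectors is an assumption made only to avoid division by zero.

```python
def hamming_distance(ref: str, tar: str) -> int:
    """Count positions where the two equal-length binary strings differ."""
    if len(ref) != len(tar):
        raise ValueError("vectors must have equal length")
    return sum(r != t for r, t in zip(ref, tar))


def jaccard_index(ref: str, tar: str) -> float:
    """Shared 1s divided by positions with a 1 in either vector."""
    intersection = sum(r == "1" and t == "1" for r, t in zip(ref, tar))
    union = sum(r == "1" or t == "1" for r, t in zip(ref, tar))
    return intersection / union if union else 0.0  # assumption for empty vectors


ref, tar = "0001001", "1000001"
print(hamming_distance(ref, tar))  # 2
print(jaccard_index(ref, tar))
```

For the REF and TAR vectors used in the LENGTH example, two positions differ and one of the three occupied positions is shared, so the Jaccard index is 1/3.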

Significance statistics

DISTANCE-Z-SCORE: Empirical Z-score for the observed Hamming distance compared with all target-reference comparisons in the run.
DISTANCE-P-VALUE: Statistical significance associated with DISTANCE-Z-SCORE.
DISTANCE-Z-SCORE(RAND): Estimated Z-score for two random binary vectors, assuming the alignment size is equal to LENGTH.
JACCARD-Z-SCORE: Empirical Z-score for the observed Jaccard index compared with all target-reference comparisons in the run.
JACCARD-P-VALUE: Statistical significance associated with JACCARD-Z-SCORE.
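The empirical distance statistics can be approximated from the HAMMING-DISTANCE column of a run. The sketch below uses the 36 distances from the rank.txt example on this page. Using the sample standard deviation and the lower tail of a standard normal distribution is an assumption inferred from that output, not taken from the implementation; JACCARD-P-VALUE is not covered here.

```python
from statistics import NormalDist, mean, stdev

# HAMMING-DISTANCE column of the 36-row rank.txt example on this page.
distances = [0, 6, 14, 16, 18, 20, 23, 23, 23, 23, 24, 24, 24, 24, 25, 25,
             26, 26, 27, 27, 29, 29, 30, 30, 32, 32, 40, 43, 43, 45, 45, 45,
             45, 49, 50, 51]


def empirical_z(value: float, population: list) -> float:
    """Z-score of one value against the mean and sample SD of all comparisons."""
    return (value - mean(population)) / stdev(population)


z = empirical_z(distances[0], distances)
p = NormalDist().cdf(z)  # assumed: lower-tail probability of a standard normal
print(f"{z:.3f} {p:.7f}")
```

For the best match (distance 0) this gives a Z-score of about -2.419, in line with the first row of the example table.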

DISTANCE-Z-SCORE(RAND) calculation

This value comes from the estimated mean and standard deviation of the Hamming distance for binary strings. It assumes that each position has a 50% chance of being a mismatch, independently of other positions.

The expected mean is:

\text{Estimated Average} = \text{Length} \times \text{Probability of Mismatch}

where the probability of mismatch is set to 0.5.

The standard deviation is:

\text{Estimated Standard Deviation} = \sqrt{\text{Length} \times \text{Probability of Mismatch} \times (1 - \text{Probability of Mismatch})}

Finally, the formula for the Z-score is:

$Z = \frac{(X - \mu)}{\sigma}$

where $X$ is the observed value, $\mu$ is the estimated mean, and $\sigma$ is the estimated standard deviation.
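The three formulas above combine into a short Python sketch. The function name random_z_score is hypothetical; the inputs are the HAMMING-DISTANCE and LENGTH values from the first two rows of the rank.txt example on this page.

```python
import math


def random_z_score(hamming: int, length: int, p_mismatch: float = 0.5) -> float:
    """Z-score of an observed Hamming distance under the random-mismatch model."""
    mu = length * p_mismatch                                    # estimated mean
    sigma = math.sqrt(length * p_mismatch * (1 - p_mismatch))   # estimated SD
    return (hamming - mu) / sigma


print(round(random_z_score(0, 77), 4))  # -8.775
print(round(random_z_score(6, 79), 4))  # -7.5381
```

Both results agree with the DISTANCE-Z-SCORE(RAND) column for those rows.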

Variable overlap

REFERENCE-VARS: Total number of variables present in the reference.
TARGET-VARS: Total number of variables present in the target.
INTERSECT: Number of variables shared by the reference and target.
INTERSECT-RATE(%): Percentage of target variables also present in the reference.
COMPLETENESS(%): Percentage of reference variables also present in the target.

INTERSECT-RATE(%) calculation

INTERSECT-RATE(%) measures how much of the target profile is covered by the reference:

\text{INTERSECT-RATE(\%)} = \frac{\text{Intersection Count}}{\text{Number of Variables in Target}} \times 100

COMPLETENESS(%) calculation

COMPLETENESS(%) measures how much of the reference profile is covered by the target:

\text{COMPLETENESS(\%)} = \frac{\text{Intersection Count}}{\text{Number of Variables in Reference}} \times 100
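The two percentages can be reproduced from the INTERSECT, REFERENCE-VARS, and TARGET-VARS columns. The sketch below uses the values from the second row of the rank.txt example on this page (73, 75, and 77); the function name overlap_rates is hypothetical.

```python
def overlap_rates(intersect: int, reference_vars: int, target_vars: int):
    """Return (INTERSECT-RATE(%), COMPLETENESS(%)) rounded to two decimals."""
    intersect_rate = 100 * intersect / target_vars    # share of target covered
    completeness = 100 * intersect / reference_vars   # share of reference covered
    return round(intersect_rate, 2), round(completeness, 2)


print(overlap_rates(73, 75, 77))  # (94.81, 97.33)
```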

See results from rank.txt

RANK	REFERENCE(ID)	TARGET(ID)	FORMAT	LENGTH	WEIGHTED	HAMMING-DISTANCE	DISTANCE-Z-SCORE	DISTANCE-P-VALUE	DISTANCE-Z-SCORE(RAND)	JACCARD-INDEX	JACCARD-Z-SCORE	JACCARD-P-VALUE	REFERENCE-VARS	TARGET-VARS	INTERSECT	INTERSECT-RATE(%)	COMPLETENESS(%)
1	107:week_0_arm_1	107:week_0_arm_1	BFF	77	False	0	-2.419	0.0077787	-8.7750	1.000	2.949	0.0256500	77	77	77	100.00	100.00
2	125:week_0_arm_1	107:week_0_arm_1	BFF	79	False	6	-1.924	0.0271576	-7.5381	0.924	2.269	0.1022693	75	77	73	94.81	97.33
3	275:week_0_arm_1	107:week_0_arm_1	BFF	86	False	14	-1.265	0.1030165	-6.2543	0.837	1.491	0.3117348	81	77	72	93.51	88.89
4	215:week_0_arm_1	107:week_0_arm_1	BFF	88	False	16	-1.100	0.1357515	-5.9696	0.818	1.321	0.3742868	83	77	72	93.51	86.75
5	305:week_0_arm_1	107:week_0_arm_1	BFF	89	False	18	-0.935	0.1749800	-5.6180	0.798	1.138	0.4452980	83	77	71	92.21	85.54
6	365:week_0_arm_1	107:week_0_arm_1	BFF	87	False	20	-0.770	0.2207314	-5.0389	0.770	0.890	0.5437899	77	77	67	87.01	87.01
7	125:week_14_arm_1	107:week_0_arm_1	BFF	78	False	23	-0.522	0.3007259	-3.6233	0.705	0.308	0.7555423	56	77	55	71.43	98.21
8	527:week_14_arm_1	107:week_0_arm_1	BFF	78	False	23	-0.522	0.3007259	-3.6233	0.705	0.308	0.7555423	56	77	55	71.43	98.21
9	107:week_14_arm_1	107:week_0_arm_1	BFF	78	False	23	-0.522	0.3007259	-3.6233	0.705	0.308	0.7555423	56	77	55	71.43	98.21
10	125:week_2_arm_1	107:week_0_arm_1	BFF	78	False	23	-0.522	0.3007259	-3.6233	0.705	0.308	0.7555423	56	77	55	71.43	98.21
11	107:week_2_arm_1	107:week_0_arm_1	BFF	78	False	24	-0.440	0.3300253	-3.3968	0.692	0.193	0.7901267	55	77	54	70.13	98.18
12	125:week_26_arm_1	107:week_0_arm_1	BFF	78	False	24	-0.440	0.3300253	-3.3968	0.692	0.193	0.7901267	55	77	54	70.13	98.18
13	527:week_2_arm_1	107:week_0_arm_1	BFF	78	False	24	-0.440	0.3300253	-3.3968	0.692	0.193	0.7901267	55	77	54	70.13	98.18
14	527:week_0_arm_1	107:week_0_arm_1	BFF	98	False	24	-0.440	0.3300253	-5.0508	0.755	0.756	0.5965581	95	77	74	96.10	77.89
15	365:week_2_arm_1	107:week_0_arm_1	BFF	79	False	25	-0.357	0.3604065	-3.2628	0.684	0.115	0.8120159	56	77	54	70.13	96.43
16	275:week_2_arm_1	107:week_0_arm_1	BFF	79	False	25	-0.357	0.3604065	-3.2628	0.684	0.115	0.8120159	56	77	54	70.13	96.43
17	305:week_26_arm_1	107:week_0_arm_1	BFF	79	False	26	-0.275	0.3916958	-3.0377	0.671	0.001	0.8410353	55	77	53	68.83	96.36
18	365:week_14_arm_1	107:week_0_arm_1	BFF	80	False	26	-0.275	0.3916958	-3.1305	0.675	0.038	0.8319440	57	77	54	70.13	94.74
19	215:week_2_arm_1	107:week_0_arm_1	BFF	78	False	27	-0.192	0.4237022	-2.7175	0.654	-0.151	0.8752035	52	77	51	66.23	98.08
20	215:week_26_arm_1	107:week_0_arm_1	BFF	78	False	27	-0.192	0.4237022	-2.7175	0.654	-0.151	0.8752035	52	77	51	66.23	98.08
21	257:week_0_arm_1	107:week_0_arm_1	BFF	102	False	29	-0.027	0.4890344	-4.3566	0.716	0.403	0.7249040	98	77	73	94.81	74.49
22	215:week_14_arm_1	107:week_0_arm_1	BFF	84	False	29	-0.027	0.4890344	-2.8368	0.655	-0.143	0.8735091	62	77	55	71.43	88.71
23	275:week_14_arm_1	107:week_0_arm_1	BFF	80	False	30	0.055	0.5219230	-2.2361	0.625	-0.410	0.9206854	53	77	50	64.94	94.34
24	365:week_26_arm_1	107:week_0_arm_1	BFF	83	False	30	0.055	0.5219230	-2.5246	0.639	-0.288	0.9011791	59	77	53	68.83	89.83
25	215:week_78_arm_1	107:week_0_arm_1	BFF	86	False	32	0.220	0.5870339	-2.3723	0.628	-0.384	0.9167688	63	77	54	70.13	85.71
26	527:week_26_arm_1	107:week_0_arm_1	BFF	86	False	32	0.220	0.5870339	-2.3723	0.628	-0.384	0.9167688	63	77	54	70.13	85.71
27	125:week_78_arm_1	107:week_0_arm_1	BFF	94	False	40	0.880	0.8104854	-1.4440	0.574	-0.862	0.9687183	71	77	54	70.13	76.06
28	527:week_52_arm_1	107:week_0_arm_1	BFF	98	False	43	1.127	0.8701495	-1.2122	0.561	-0.981	0.9761986	76	77	55	71.43	72.37
29	125:week_52_arm_1	107:week_0_arm_1	BFF	98	False	43	1.127	0.8701495	-1.2122	0.561	-0.981	0.9761986	76	77	55	71.43	72.37
30	365:week_52_arm_1	107:week_0_arm_1	BFF	99	False	45	1.292	0.9018282	-0.9045	0.545	-1.122	0.9830870	76	77	54	70.13	71.05
31	257:week_14_arm_1	107:week_0_arm_1	BFF	99	False	45	1.292	0.9018282	-0.9045	0.545	-1.122	0.9830870	76	77	54	70.13	71.05
32	305:week_52_arm_1	107:week_0_arm_1	BFF	99	False	45	1.292	0.9018282	-0.9045	0.545	-1.122	0.9830870	76	77	54	70.13	71.05
33	257:week_2_arm_1	107:week_0_arm_1	BFF	99	False	45	1.292	0.9018282	-0.9045	0.545	-1.122	0.9830870	76	77	54	70.13	71.05
34	215:week_52_arm_1	107:week_0_arm_1	BFF	104	False	49	1.622	0.9475899	-0.5883	0.529	-1.271	0.9884232	82	77	55	71.43	67.07
35	257:week_26_arm_1	107:week_0_arm_1	BFF	103	False	50	1.704	0.9558461	-0.2956	0.515	-1.399	0.9917759	79	77	53	68.83	67.09
36	275:week_52_arm_1	107:week_0_arm_1	BFF	105	False	51	1.787	0.9630202	-0.2928	0.514	-1.401	0.9918315	82	77	54	70.13	65.85

With multiple reference cohorts, the process mirrors handling a single cohort; the main difference is that each reference cohort gets a prefix in its primary_key, making it possible to trace the origin of every individual.

We reuse individuals.json to simulate more than one cohort.

Example:

pheno-ranker -r individuals.json individuals.json individuals.json -t patient.json --max-out 10 -o rank_multiple.txt

This will create the text file rank_multiple.txt.

See results from rank_multiple.txt

RANK	REFERENCE(ID)	TARGET(ID)	FORMAT	LENGTH	WEIGHTED	HAMMING-DISTANCE	DISTANCE-Z-SCORE	DISTANCE-P-VALUE	DISTANCE-Z-SCORE(RAND)	JACCARD-INDEX	JACCARD-Z-SCORE	JACCARD-P-VALUE	REFERENCE-VARS	TARGET-VARS	INTERSECT	INTERSECT-RATE(%)	COMPLETENESS(%)
1	C2_107:week_0_arm_1	107:week_0_arm_1	BFF	77	False	1	-2.306	0.0105624	-8.5470	0.987	2.804	0.0356370	77	77	76	98.70	98.70
2	C3_107:week_0_arm_1	107:week_0_arm_1	BFF	77	False	1	-2.306	0.0105624	-8.5470	0.987	2.804	0.0356370	77	77	76	98.70	98.70
3	C1_107:week_0_arm_1	107:week_0_arm_1	BFF	77	False	1	-2.306	0.0105624	-8.5470	0.987	2.804	0.0356370	77	77	76	98.70	98.70
4	C1_125:week_0_arm_1	107:week_0_arm_1	BFF	78	False	5	-1.969	0.0244763	-7.6995	0.936	2.340	0.0901212	75	77	73	94.81	97.33
5	C3_125:week_0_arm_1	107:week_0_arm_1	BFF	78	False	5	-1.969	0.0244763	-7.6995	0.936	2.340	0.0901212	75	77	73	94.81	97.33
6	C2_125:week_0_arm_1	107:week_0_arm_1	BFF	78	False	5	-1.969	0.0244763	-7.6995	0.936	2.340	0.0901212	75	77	73	94.81	97.33
7	C2_275:week_0_arm_1	107:week_0_arm_1	BFF	85	False	13	-1.296	0.0975704	-6.3994	0.847	1.534	0.2966472	81	77	72	93.51	88.89
8	C1_275:week_0_arm_1	107:week_0_arm_1	BFF	85	False	13	-1.296	0.0975704	-6.3994	0.847	1.534	0.2966472	81	77	72	93.51	88.89
9	C3_275:week_0_arm_1	107:week_0_arm_1	BFF	85	False	13	-1.296	0.0975704	-6.3994	0.847	1.534	0.2966472	81	77	72	93.51	88.89
10	C1_215:week_0_arm_1	107:week_0_arm_1	BFF	87	False	15	-1.127	0.1298396	-6.1110	0.828	1.357	0.3603912	83	77	72	93.51	86.75
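To trace each match back to its source cohort, the prefix can be split from the REFERENCE(ID) value. The sketch below assumes the prefix has the form C followed by a number and an underscore, as seen in the table above; that pattern is inferred from this output only, and split_cohort_prefix is a hypothetical helper.

```python
import re

# Assumed prefix pattern, inferred from the example output: C<number>_<original id>.
PREFIX = re.compile(r"^(C\d+)_(.+)$")


def split_cohort_prefix(reference_id: str):
    """Return (cohort_prefix, original_id); the prefix is None if absent."""
    match = PREFIX.match(reference_id)
    if match is None:
        return None, reference_id
    return match.group(1), match.group(2)


print(split_cohort_prefix("C2_107:week_0_arm_1"))  # ('C2', '107:week_0_arm_1')
```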

Why is the distance for 107:week_0_arm_1 not 0 if the three cohorts are identical?

In patient mode, the global vector is built only from the variables in the reference cohort(s), not from those of the target patient. The primary_key (id in this context) is included automatically, and because the reference id carries a cohort prefix while the target id does not, the id field mismatches and the distance is 1.

To visualize the differences across all variables (that is, the union of reference(s) and target), add the target as another cohort with -r. The patient's variables are then included in the reference vector.

Note that you can exclude id by adding --exclude-terms id.
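The effect described above can be reproduced with a toy example. The variable names and the encode helper below are invented for illustration and do not reflect the tool's internal data structures; the sketch only shows how a reference-only vocabulary produces a distance of 1 when the two id values differ.

```python
def encode(variables: set, vocabulary: list) -> str:
    """Binary string with a 1 for each vocabulary entry present in `variables`."""
    return "".join("1" if v in variables else "0" for v in vocabulary)


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


# Toy records: identical apart from the prefixed id in the reference.
reference = {"sex.male", "ethnicity.caucasian", "id.C1_107:week_0_arm_1"}
target = {"sex.male", "ethnicity.caucasian", "id.107:week_0_arm_1"}

# Vocabulary built from the reference only: the target's own id has no position.
ref_only = sorted(reference)
print(hamming(encode(reference, ref_only), encode(target, ref_only)))  # 1

# Vocabulary built from the union: both id variables now count as mismatches.
union = sorted(reference | target)
print(hamming(encode(reference, union), encode(target, union)))  # 2

# Dropping id variables, as an exclusion of id would, removes the mismatch.
no_id = [v for v in ref_only if not v.startswith("id.")]
print(hamming(encode(reference, no_id), encode(target, no_id)))  # 0
```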

Obtaining additional information on the alignments

You can create several files related to the reference-target alignment by adding --align. By default, this creates alignment* files in the current directory, but you can specify a </path/basename>. Example:

pheno-ranker -r individuals.json individuals.json -t patient.json --align

Or using a path + basename:

pheno-ranker -r individuals.json individuals.json -t patient.json --align /my/fav/dir/jobid-001-align

Below is an extract of the alignment (C1_107:week_0_arm_1 --- 107:week_0_arm_1) taken from alignment.txt:

REF -- TAR
----- 1 | (w:  1|d:  0|cd:  0|) diseases.NCIT:C3138.diseaseCode.id.NCIT:C3138 (Inflammatory Bowel Disease)
----- 1 | (w:  1|d:  0|cd:  0|) ethnicity.id.NCIT:C41261 (Caucasian)
----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.exposureCode.id.NCIT:C154329 (Smoking)
----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.unit.id.NCIT:C65108 (Never Smoker)
     0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.unit.id.NCIT:C67147 (Current Smoker)
     0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.unit.id.NCIT:C67148 (Former Smoker)
----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.exposureCode.id.NCIT:C2190 (Alcohol)
     0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.unit.id.NCIT:C126379 (Non-Drinker)
     0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.unit.id.NCIT:C156821 (Alcohol Consumption More than 2 Drinks per Day for Men and More than 1 Drink per Day for Women)
----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.unit.id.NCIT:C17998 (Unknown)
     0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C73993.exposureCode.id.NCIT:C73993 (Pack Year)
     0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C73993.unit.id.NCIT:C73993 (Pack Year)
xxx-- 0 | (w:  1|d:  1|cd:  1|) id.C1_107:week_0_arm_1 (id.C1_107:week_0_arm_1)
     0 | (w:  1|d:  0|cd:  1|) id.C1_107:week_14_arm_1 (id.C1_107:week_14_arm_1)
     0 | (w:  1|d:  0|cd:  1|) id.C1_107:week_2_arm_1 (id.C1_107:week_2_arm_1)
     0 | (w:  1|d:  0|cd:  1|) id.C1_125:week_0_arm_1 (id.C1_125:week_0_arm_1)
     0 | (w:  1|d:  0|cd:  1|) id.C1_125:week_14_arm_1 (id.C1_125:week_14_arm_1)
...
