👤 Patient mode

Patient mode aims to determine which individuals in the cohort are the closest to our patient by ranking them using (dis)similarity metrics.

Usage

When using the Pheno-ranker command-line interface, pass the reference cohort(s) with -r and the patient (target) file with -t.

Against one cohort

Example:

pheno-ranker -r individuals.json -t patient.json

How do I extract one or many patients from a cohort file?

pheno-ranker -r t/individuals.json --patients-of-interest 107:week_0_arm_1 125:week_0_arm_1

This command will carry out a dry-run, creating 107:week_0_arm_1.json and 125:week_0_arm_1.json files. In the example above, I renamed 107:week_0_arm_1.json to patient.json by typing this:

mv 107:week_0_arm_1.json patient.json

Running the example command (pheno-ranker -r individuals.json -t patient.json) creates the output text file rank.txt.

rank.txt column names and meaning

RANK: This indicates the similarity match's order. A rank of 1 signifies the best match.
REFERENCE(ID): The unique identifier (primary key) for the reference individual.
TARGET(ID): The unique identifier (primary key) for the target individual. This is set using the -t parameter.
FORMAT: Specifies the format of the input data, which can be one of the following: BFF, PXF, or CSV. This is configured in the settings file.
LENGTH: This refers to the length of the "alignment", meaning the count of variables that have a 1 in either the reference or the target. For example:

LENGTH example

REF: 0001001
TAR: 1000001

In this case, the LENGTH is 3.
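As a minimal sketch of the LENGTH definition (not Pheno-ranker's own implementation), the count can be reproduced from the two binary strings in the example. The function name alignment_length is illustrative only.

```python
def alignment_length(ref: str, tar: str) -> int:
    """Count positions where the reference or the target (or both) has a 1."""
    if len(ref) != len(tar):
        raise ValueError("REF and TAR must have the same number of positions")
    return sum(1 for r, t in zip(ref, tar) if r == "1" or t == "1")


# Vectors from the LENGTH example above
print(alignment_length("0001001", "1000001"))  # 3
```

Positions where both vectors hold a 0 do not contribute, which is why LENGTH (3) is smaller than the number of positions in the vector (7).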

WEIGHTED: Indicates if the calculation used weights (specified with --w). Possible values are True or False.
HAMMING-DISTANCE: The Hamming distance between the reference and target individuals' vectors. The Hamming distance between two strings of equal length is the count of positions at which the corresponding symbols are different. In the context of binary strings, it's the number of bit positions where the two strings differ.
DISTANCE-Z-SCORE: The empirical Z-score from all comparisons between the patient and the reference cohort.
DISTANCE-P-VALUE: The statistical significance of the observed DISTANCE-Z-SCORE.
DISTANCE-Z-SCORE(RAND): The estimated Z-score for two random vectors, assuming the alignment size is equal to LENGTH.

DISTANCE-Z-SCORE(RAND) calculation

The value comes from the estimated mean and standard deviation of the Hamming distance for binary strings. It assumes that each position in the strings has a 50% chance of being a mismatch, independently of the other positions, so the number of mismatches follows a binomial distribution.

The mean is calculated under the assumption of a 50% probability of mismatch at each position.

\[ \text{Estimated Average} = \text{Length} \times \text{Probability of Mismatch} \]

where Probability of Mismatch is set at 0.5.

The standard deviation measures the variability or spread of the Hamming distance around the mean. This calculation assumes a binomial distribution of mismatches, given the binary nature of the data (match or mismatch).

\[ \text{Estimated Standard Deviation} = \sqrt{\text{Length} \times \text{Probability of Mismatch} \times (1 - \text{Probability of Mismatch})} \]

Finally, the formula for the Z-score is:

\[ Z = \frac{X - \mu}{\sigma} \]

where \( X \) is the value of interest (here, the observed Hamming distance), \( \mu \) is the estimated average, and \( \sigma \) is the estimated standard deviation.

This method is applicable for estimating the Hamming distance in randomly generated binary strings where each position is independently set.
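The calculation above can be sketched in a few lines of Python. This is an illustration of the formulas, not Pheno-ranker's source code; the function names hamming_distance and random_z_score are assumptions made for this example. The inputs are taken from the first two rows of the sample rank.txt shown on this page.

```python
import math


def hamming_distance(ref: str, tar: str) -> int:
    """Number of positions at which two equal-length binary strings differ."""
    if len(ref) != len(tar):
        raise ValueError("REF and TAR must have the same number of positions")
    return sum(r != t for r, t in zip(ref, tar))


def random_z_score(distance: float, length: int, p_mismatch: float = 0.5) -> float:
    """Z-score of a Hamming distance under the binomial model described above."""
    mean = length * p_mismatch
    sd = math.sqrt(length * p_mismatch * (1 - p_mismatch))
    return (distance - mean) / sd


print(hamming_distance("0001001", "1000001"))  # 2

# Row 1 of rank.txt: HAMMING-DISTANCE = 0, LENGTH = 77
print(f"{random_z_score(0, 77):.4f}")  # -8.7750
# Row 2 of rank.txt: HAMMING-DISTANCE = 6, LENGTH = 79
print(f"{random_z_score(6, 79):.4f}")  # -7.5381
```

A strongly negative value means the two individuals differ at far fewer positions than two random vectors of the same LENGTH would.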

JACCARD-INDEX: The Jaccard similarity coefficient between the reference and target individuals' vectors. The Jaccard Index for binary digit strings is a measure that calculates the similarity between two strings by dividing the number of positions where both have a 1 by the number of positions where at least one has a 1.
JACCARD-Z-SCORE: The Z-score calculated from all comparisons between patients and the reference cohort.
JACCARD-P-VALUE: The statistical significance of the observed JACCARD-Z-SCORE.
REFERENCE-VARS: The total number of variables for the reference.
TARGET-VARS: The total number of variables for the target.
INTERSECT: The intersection of variables between reference and target.
INTERSECT-RATE(%): The percentage of intersected variables with respect to the total number of variables in the target.
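The JACCARD-INDEX definition given above can be sketched as follows. This is a hedged illustration rather than the tool's implementation; the function name jaccard_index and the choice to return 0.0 when neither vector contains a 1 are assumptions of this example.

```python
def jaccard_index(ref: str, tar: str) -> float:
    """Positions where both have a 1, divided by positions where at least one has a 1."""
    if len(ref) != len(tar):
        raise ValueError("REF and TAR must have the same number of positions")
    both = sum(1 for r, t in zip(ref, tar) if r == "1" and t == "1")
    either = sum(1 for r, t in zip(ref, tar) if r == "1" or t == "1")
    # Assumption for this sketch: two all-zero vectors get an index of 0.0
    return both / either if either else 0.0


print(f"{jaccard_index('0001001', '1000001'):.3f}")  # 0.333
```

In the sample rank.txt on this page, the same ratio can be read off the INTERSECT and LENGTH columns: for the second row, 73 / 79 gives 0.924, which matches the reported JACCARD-INDEX.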

INTERSECT-RATE(%) calculation

The INTERSECT-RATE measures the overlap of variables between reference and target by calculating the proportion of shared variables relative to the total number of variables in the target.

Intersection Count: The number of variables that exist in both the reference and target sets.

\[ \text{INTERSECT-RATE(\%)} = \frac{\text{Intersection Count}}{\text{Number of Variables in Target}} \times 100 \]

This metric expresses the overlap as a percentage, where 0% means no overlap and 100% means complete overlap with the target variables.

COMPLETENESS(%): The percentage of intersected variables with respect to the total number of variables in the reference.

COMPLETENESS(%) calculation

The COMPLETENESS measures the overlap of variables between reference and target by calculating the proportion of shared variables relative to the total number of variables in the reference.

Intersection Count: The number of variables that exist in both the reference and target sets.

\[ \text{COMPLETENESS(\%)} = \frac{\text{Intersection Count}}{\text{Number of Variables in Reference}} \times 100 \]

This metric expresses the overlap as a percentage, where 0% means no overlap and 100% means complete overlap with the reference variables.
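Both percentages use the same intersection count and differ only in the denominator. A minimal sketch, with an illustrative function name (overlap_percentages) and inputs taken from the second row of the sample rank.txt on this page (REFERENCE-VARS = 75, TARGET-VARS = 77, INTERSECT = 73):

```python
from typing import Tuple


def overlap_percentages(reference_vars: int, target_vars: int, intersect: int) -> Tuple[float, float]:
    """Return (INTERSECT-RATE(%), COMPLETENESS(%)) for one reference/target pair."""
    intersect_rate = 100 * intersect / target_vars    # relative to the target
    completeness = 100 * intersect / reference_vars   # relative to the reference
    return intersect_rate, completeness


rate, completeness = overlap_percentages(reference_vars=75, target_vars=77, intersect=73)
print(f"{rate:.2f} {completeness:.2f}")  # 94.81 97.33
```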

See results from rank.txt

RANK	REFERENCE(ID)	TARGET(ID)	FORMAT	LENGTH	WEIGHTED	HAMMING-DISTANCE	DISTANCE-Z-SCORE	DISTANCE-P-VALUE	DISTANCE-Z-SCORE(RAND)	JACCARD-INDEX	JACCARD-Z-SCORE	JACCARD-P-VALUE	REFERENCE-VARS	TARGET-VARS	INTERSECT	INTERSECT-RATE(%)	COMPLETENESS(%)
1	107:week_0_arm_1	107:week_0_arm_1	BFF	77	False	0	-2.419	0.0077787	-8.7750	1.000	2.949	0.0256500	77	77	77	100.00	100.00
2	125:week_0_arm_1	107:week_0_arm_1	BFF	79	False	6	-1.924	0.0271576	-7.5381	0.924	2.269	0.1022693	75	77	73	94.81	97.33
3	275:week_0_arm_1	107:week_0_arm_1	BFF	86	False	14	-1.265	0.1030165	-6.2543	0.837	1.491	0.3117348	81	77	72	93.51	88.89
4	215:week_0_arm_1	107:week_0_arm_1	BFF	88	False	16	-1.100	0.1357515	-5.9696	0.818	1.321	0.3742868	83	77	72	93.51	86.75
5	305:week_0_arm_1	107:week_0_arm_1	BFF	89	False	18	-0.935	0.1749800	-5.6180	0.798	1.138	0.4452980	83	77	71	92.21	85.54
6	365:week_0_arm_1	107:week_0_arm_1	BFF	87	False	20	-0.770	0.2207314	-5.0389	0.770	0.890	0.5437899	77	77	67	87.01	87.01
7	125:week_14_arm_1	107:week_0_arm_1	BFF	78	False	23	-0.522	0.3007259	-3.6233	0.705	0.308	0.7555423	56	77	55	71.43	98.21
8	527:week_14_arm_1	107:week_0_arm_1	BFF	78	False	23	-0.522	0.3007259	-3.6233	0.705	0.308	0.7555423	56	77	55	71.43	98.21
9	107:week_14_arm_1	107:week_0_arm_1	BFF	78	False	23	-0.522	0.3007259	-3.6233	0.705	0.308	0.7555423	56	77	55	71.43	98.21
10	125:week_2_arm_1	107:week_0_arm_1	BFF	78	False	23	-0.522	0.3007259	-3.6233	0.705	0.308	0.7555423	56	77	55	71.43	98.21
11	107:week_2_arm_1	107:week_0_arm_1	BFF	78	False	24	-0.440	0.3300253	-3.3968	0.692	0.193	0.7901267	55	77	54	70.13	98.18
12	125:week_26_arm_1	107:week_0_arm_1	BFF	78	False	24	-0.440	0.3300253	-3.3968	0.692	0.193	0.7901267	55	77	54	70.13	98.18
13	527:week_2_arm_1	107:week_0_arm_1	BFF	78	False	24	-0.440	0.3300253	-3.3968	0.692	0.193	0.7901267	55	77	54	70.13	98.18
14	527:week_0_arm_1	107:week_0_arm_1	BFF	98	False	24	-0.440	0.3300253	-5.0508	0.755	0.756	0.5965581	95	77	74	96.10	77.89
15	365:week_2_arm_1	107:week_0_arm_1	BFF	79	False	25	-0.357	0.3604065	-3.2628	0.684	0.115	0.8120159	56	77	54	70.13	96.43
16	275:week_2_arm_1	107:week_0_arm_1	BFF	79	False	25	-0.357	0.3604065	-3.2628	0.684	0.115	0.8120159	56	77	54	70.13	96.43
17	305:week_26_arm_1	107:week_0_arm_1	BFF	79	False	26	-0.275	0.3916958	-3.0377	0.671	0.001	0.8410353	55	77	53	68.83	96.36
18	365:week_14_arm_1	107:week_0_arm_1	BFF	80	False	26	-0.275	0.3916958	-3.1305	0.675	0.038	0.8319440	57	77	54	70.13	94.74
19	215:week_2_arm_1	107:week_0_arm_1	BFF	78	False	27	-0.192	0.4237022	-2.7175	0.654	-0.151	0.8752035	52	77	51	66.23	98.08
20	215:week_26_arm_1	107:week_0_arm_1	BFF	78	False	27	-0.192	0.4237022	-2.7175	0.654	-0.151	0.8752035	52	77	51	66.23	98.08
21	257:week_0_arm_1	107:week_0_arm_1	BFF	102	False	29	-0.027	0.4890344	-4.3566	0.716	0.403	0.7249040	98	77	73	94.81	74.49
22	215:week_14_arm_1	107:week_0_arm_1	BFF	84	False	29	-0.027	0.4890344	-2.8368	0.655	-0.143	0.8735091	62	77	55	71.43	88.71
23	275:week_14_arm_1	107:week_0_arm_1	BFF	80	False	30	0.055	0.5219230	-2.2361	0.625	-0.410	0.9206854	53	77	50	64.94	94.34
24	365:week_26_arm_1	107:week_0_arm_1	BFF	83	False	30	0.055	0.5219230	-2.5246	0.639	-0.288	0.9011791	59	77	53	68.83	89.83
25	215:week_78_arm_1	107:week_0_arm_1	BFF	86	False	32	0.220	0.5870339	-2.3723	0.628	-0.384	0.9167688	63	77	54	70.13	85.71
26	527:week_26_arm_1	107:week_0_arm_1	BFF	86	False	32	0.220	0.5870339	-2.3723	0.628	-0.384	0.9167688	63	77	54	70.13	85.71
27	125:week_78_arm_1	107:week_0_arm_1	BFF	94	False	40	0.880	0.8104854	-1.4440	0.574	-0.862	0.9687183	71	77	54	70.13	76.06
28	527:week_52_arm_1	107:week_0_arm_1	BFF	98	False	43	1.127	0.8701495	-1.2122	0.561	-0.981	0.9761986	76	77	55	71.43	72.37
29	125:week_52_arm_1	107:week_0_arm_1	BFF	98	False	43	1.127	0.8701495	-1.2122	0.561	-0.981	0.9761986	76	77	55	71.43	72.37
30	365:week_52_arm_1	107:week_0_arm_1	BFF	99	False	45	1.292	0.9018282	-0.9045	0.545	-1.122	0.9830870	76	77	54	70.13	71.05
31	257:week_14_arm_1	107:week_0_arm_1	BFF	99	False	45	1.292	0.9018282	-0.9045	0.545	-1.122	0.9830870	76	77	54	70.13	71.05
32	305:week_52_arm_1	107:week_0_arm_1	BFF	99	False	45	1.292	0.9018282	-0.9045	0.545	-1.122	0.9830870	76	77	54	70.13	71.05
33	257:week_2_arm_1	107:week_0_arm_1	BFF	99	False	45	1.292	0.9018282	-0.9045	0.545	-1.122	0.9830870	76	77	54	70.13	71.05
34	215:week_52_arm_1	107:week_0_arm_1	BFF	104	False	49	1.622	0.9475899	-0.5883	0.529	-1.271	0.9884232	82	77	55	71.43	67.07
35	257:week_26_arm_1	107:week_0_arm_1	BFF	103	False	50	1.704	0.9558461	-0.2956	0.515	-1.399	0.9917759	79	77	53	68.83	67.09
36	275:week_52_arm_1	107:week_0_arm_1	BFF	105	False	51	1.787	0.9630202	-0.2928	0.514	-1.401	0.9918315	82	77	54	70.13	65.85
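The DISTANCE-Z-SCORE column can be reproduced from the HAMMING-DISTANCE column of the table above. The sketch below standardizes each distance against the mean and standard deviation of all 36 comparisons. Using the sample standard deviation (n - 1) is an assumption that happens to match the values shown here; it is not taken from the Pheno-ranker source.

```python
import statistics

# HAMMING-DISTANCE column of the rank.txt shown above (ranks 1 to 36)
hamming = [0, 6, 14, 16, 18, 20, 23, 23, 23, 23, 24, 24, 24, 24, 25, 25, 26, 26,
           27, 27, 29, 29, 30, 30, 32, 32, 40, 43, 43, 45, 45, 45, 45, 49, 50, 51]

mean = statistics.mean(hamming)
sd = statistics.stdev(hamming)  # sample standard deviation (assumption)


def empirical_z(distance: float) -> float:
    """Z-score of one distance relative to all comparisons against the cohort."""
    return (distance - mean) / sd


print(f"{empirical_z(0):.3f}")   # -2.419 (rank 1)
print(f"{empirical_z(6):.3f}")   # -1.924 (rank 2)
print(f"{empirical_z(51):.3f}")  # 1.787 (rank 36)
```

Because this score depends on the whole cohort, the same Hamming distance can produce a different DISTANCE-Z-SCORE when the patient is compared against a different set of references.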

Against multiple cohorts

The process mirrors handling a single cohort; the sole distinction is that a prefix (C1_, C2_, C3_ in the output below) is added to each primary_key, enabling us to trace the origin of every individual.

Let's reuse individuals.json to simulate having more than one cohort.

Example:

pheno-ranker -r individuals.json individuals.json individuals.json -t patient.json --max-out 10 -o rank_multiple.txt

This will create the text file rank_multiple.txt.

See results from rank_multiple.txt

RANK	REFERENCE(ID)	TARGET(ID)	FORMAT	LENGTH	WEIGHTED	HAMMING-DISTANCE	DISTANCE-Z-SCORE	DISTANCE-P-VALUE	DISTANCE-Z-SCORE(RAND)	JACCARD-INDEX	JACCARD-Z-SCORE	JACCARD-P-VALUE	REFERENCE-VARS	TARGET-VARS	INTERSECT	INTERSECT-RATE(%)	COMPLETENESS(%)
1	C2_107:week_0_arm_1	107:week_0_arm_1	BFF	77	False	1	-2.306	0.0105624	-8.5470	0.987	2.804	0.0356370	77	77	76	98.70	98.70
2	C3_107:week_0_arm_1	107:week_0_arm_1	BFF	77	False	1	-2.306	0.0105624	-8.5470	0.987	2.804	0.0356370	77	77	76	98.70	98.70
3	C1_107:week_0_arm_1	107:week_0_arm_1	BFF	77	False	1	-2.306	0.0105624	-8.5470	0.987	2.804	0.0356370	77	77	76	98.70	98.70
4	C1_125:week_0_arm_1	107:week_0_arm_1	BFF	78	False	5	-1.969	0.0244763	-7.6995	0.936	2.340	0.0901212	75	77	73	94.81	97.33
5	C3_125:week_0_arm_1	107:week_0_arm_1	BFF	78	False	5	-1.969	0.0244763	-7.6995	0.936	2.340	0.0901212	75	77	73	94.81	97.33
6	C2_125:week_0_arm_1	107:week_0_arm_1	BFF	78	False	5	-1.969	0.0244763	-7.6995	0.936	2.340	0.0901212	75	77	73	94.81	97.33
7	C2_275:week_0_arm_1	107:week_0_arm_1	BFF	85	False	13	-1.296	0.0975704	-6.3994	0.847	1.534	0.2966472	81	77	72	93.51	88.89
8	C1_275:week_0_arm_1	107:week_0_arm_1	BFF	85	False	13	-1.296	0.0975704	-6.3994	0.847	1.534	0.2966472	81	77	72	93.51	88.89
9	C3_275:week_0_arm_1	107:week_0_arm_1	BFF	85	False	13	-1.296	0.0975704	-6.3994	0.847	1.534	0.2966472	81	77	72	93.51	88.89
10	C1_215:week_0_arm_1	107:week_0_arm_1	BFF	87	False	15	-1.127	0.1298396	-6.1110	0.828	1.357	0.3603912	83	77	72	93.51	86.75

Why is the distance for 107:week_0_arm_1 not 0 if the three cohorts are identical?

In Patient mode, the global vector is formed using variables solely from the reference cohort(s), not the patient's. The primary_key (id in this context) is automatically included, leading to a distance of 1 due to the mismatch in the individual's id field.

If you want to visualize the differences in all variables (i.e., the union of reference(s) and target), simply add the target as another cohort in -r. This way, the variables from the patient will be included in the reference vector.

Note that you can exclude id by adding --exclude-terms id.
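The explanation above can be illustrated with a toy model. This is a simplified sketch, not Pheno-ranker's implementation: the helper names (encode, hamming) are made up for the example, and only two shared variables from the alignment extract on this page are used. The global vector is built from the reference variables only, so the prefixed reference id has no counterpart in the target.

```python
def encode(variables, global_keys):
    """One-hot encode a set of variables against the global vector keys."""
    return [1 if key in variables else 0 for key in global_keys]


def hamming(a, b):
    """Number of positions where two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


shared = {
    "diseases.NCIT:C3138.diseaseCode.id.NCIT:C3138",
    "ethnicity.id.NCIT:C41261",
}
reference = shared | {"id.C1_107:week_0_arm_1"}  # id carries the cohort prefix
target = shared | {"id.107:week_0_arm_1"}        # patient file has no prefix

# Global vector formed from the reference cohort only
global_keys = sorted(reference)
print(hamming(encode(reference, global_keys), encode(target, global_keys)))  # 1

# Dropping the id terms, similar in spirit to --exclude-terms id
keys_without_id = [k for k in global_keys if not k.startswith("id.")]
print(hamming(encode(reference, keys_without_id), encode(target, keys_without_id)))  # 0
```

The single mismatch comes from the id position, which is the same pattern as the `1 xxx-- 0` line for id.C1_107:week_0_arm_1 in the alignment extract.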

Obtaining additional information on the alignments

You can create several files related to the reference --- target alignment by adding --align. By default, it creates files (alignment*) in the current directory, but you can specify a </path/basename> instead. Example:

pheno-ranker -r individuals.json individuals.json -t patient.json --align

Or using a path + basename:

pheno-ranker -r individuals.json individuals.json -t patient.json --align /my/fav/dir/jobid-001-align

Below is an extract of the alignment (C1_107:week_0_arm_1 --- 107:week_0_arm_1) from alignment.txt:

REF -- TAR
1 ----- 1 | (w:  1|d:  0|cd:  0|) diseases.NCIT:C3138.diseaseCode.id.NCIT:C3138 (Inflammatory Bowel Disease)
1 ----- 1 | (w:  1|d:  0|cd:  0|) ethnicity.id.NCIT:C41261 (Caucasian)
1 ----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.exposureCode.id.NCIT:C154329 (Smoking)
1 ----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.unit.id.NCIT:C65108 (Never Smoker)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.unit.id.NCIT:C67147 (Current Smoker)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.unit.id.NCIT:C67148 (Former Smoker)
1 ----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.exposureCode.id.NCIT:C2190 (Alcohol)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.unit.id.NCIT:C126379 (Non-Drinker)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.unit.id.NCIT:C156821 (Alcohol Consumption More than 2 Drinks per Day for Men and More than 1 Drink per Day for Women)
1 ----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.unit.id.NCIT:C17998 (Unknown)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C73993.exposureCode.id.NCIT:C73993 (Pack Year)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C73993.unit.id.NCIT:C73993 (Pack Year)
1 xxx-- 0 | (w:  1|d:  1|cd:  1|) id.C1_107:week_0_arm_1 (id.C1_107:week_0_arm_1)
0       0 | (w:  1|d:  0|cd:  1|) id.C1_107:week_14_arm_1 (id.C1_107:week_14_arm_1)
0       0 | (w:  1|d:  0|cd:  1|) id.C1_107:week_2_arm_1 (id.C1_107:week_2_arm_1)
0       0 | (w:  1|d:  0|cd:  1|) id.C1_125:week_0_arm_1 (id.C1_125:week_0_arm_1)
0       0 | (w:  1|d:  0|cd:  1|) id.C1_125:week_14_arm_1 (id.C1_125:week_14_arm_1)
...