
Patient mode

Patient mode aims to determine which individuals in the cohort are the closest to our patient by ranking them using (dis)similarity metrics.

Usage

When using the Pheno-ranker command-line interface, provide the reference cohort file with -r and the patient (target) file with -t.

Example:

pheno-ranker -r individuals.json -t patient.json
How do I extract one or many patients from a cohort file?
pheno-ranker -r t/individuals.json --patients-of-interest 107:week_0_arm_1 125:week_0_arm_1

This command will carry out a dry-run, creating 107:week_0_arm_1.json and 125:week_0_arm_1.json files. In the example above, I renamed 107:week_0_arm_1.json to patient.json by typing this:

mv 107:week_0_arm_1.json patient.json

Running pheno-ranker with -r individuals.json -t patient.json, as in the example above, will create the output text file rank.txt.

rank.txt column names and meaning
  • RANK: This indicates the similarity match's order. A rank of 1 signifies the best match.
  • REFERENCE(ID): The unique identifier (primary key) for the reference individual.
  • TARGET(ID): The unique identifier (primary key) for the target individual. This is set using the -t parameter.
  • FORMAT: Specifies the format of the input data, which can be one of the following: BFF, PXF, or CSV. This is configured in the settings file.
  • LENGTH: This refers to the length of the "alignment", meaning the count of variables that have a 1 in either the reference or the target. For example:
LENGTH example

REF: 0001001
TAR: 1000001
In this case, the LENGTH is 3.

  • WEIGHTED: Indicates if the calculation used weights (specified with --w). Possible values are True or False.
  • HAMMING-DISTANCE: The Hamming distance between the reference and target individuals' vectors. The Hamming distance between two strings of equal length is the count of positions at which the corresponding symbols are different. In the context of binary strings, it's the number of bit positions where the two strings differ.
  • DISTANCE-Z-SCORE: The empirical Z-score from all comparisons between the patient and the reference cohort.
  • DISTANCE-P-VALUE: The statistical significance of the observed DISTANCE-Z-SCORE.
  • DISTANCE-Z-SCORE(RAND): The estimated Z-score for two random vectors, assuming the alignment size is equal to LENGTH.
DISTANCE-Z-SCORE(RAND) calculation

The value comes from the estimated mean and standard deviation of the Hamming distance for binary strings. It assumes that each position in the strings has a 50% chance of being a mismatch (independent of other positions). The method is grounded in the principles of binomial distribution.

  • The mean is calculated under the assumption of a 50% probability of mismatch at each position.
\[ \text{Estimated Average} = \text{Length} \times \text{Probability of Mismatch} \]

where Probability of Mismatch is set at 0.5.

  • The standard deviation measures the variability or spread of the Hamming distance around the mean. This calculation assumes a binomial distribution of mismatches, given the binary nature of the data (match or mismatch).
\[ \text{Estimated Standard Deviation} = \sqrt{\text{Length} \times \text{Probability of Mismatch} \times (1 - \text{Probability of Mismatch})} \]

Finally, the formula for the Z-score is:

$$ Z = \frac{X - \mu}{\sigma} $$

Where: \( X \) is the value of interest (here, the observed Hamming distance), \( \mu \) is the estimated average, and \( \sigma \) is the estimated standard deviation.

This method is applicable for estimating the Hamming distance in randomly generated binary strings where each position is independently set.
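As a worked check of the formulas above, here is a minimal Python sketch that recomputes DISTANCE-Z-SCORE(RAND) from LENGTH and HAMMING-DISTANCE. The function name random_z_score is made up for illustration and is not part of Pheno-ranker; the two input pairs are the rank 1 and rank 2 rows of the rank.txt example on this page.

```python
import math


def random_z_score(hamming_distance: int, length: int, p_mismatch: float = 0.5) -> float:
    """Z-score of an observed Hamming distance against the binomial expectation."""
    mean = length * p_mismatch  # estimated average
    sd = math.sqrt(length * p_mismatch * (1 - p_mismatch))  # estimated standard deviation
    return (hamming_distance - mean) / sd


# (HAMMING-DISTANCE, LENGTH) taken from ranks 1 and 2 of the example rank.txt
z_rank1 = random_z_score(0, 77)
z_rank2 = random_z_score(6, 79)

print(f"{z_rank1:.4f}")  # -8.7750
print(f"{z_rank2:.4f}")  # -7.5381
```

Both values agree with the DISTANCE-Z-SCORE(RAND) column of those rows.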

  • JACCARD-INDEX: The Jaccard similarity coefficient between the reference and target individuals' vectors. The Jaccard Index for binary digit strings is a measure that calculates the similarity between two strings by dividing the number of positions where both have a 1 by the number of positions where at least one has a 1.
  • JACCARD-Z-SCORE: The Z-score calculated from all comparisons between patients and the reference cohort.
  • JACCARD-P-VALUE: The statistical significance of the observed JACCARD-Z-SCORE.
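The definitions of LENGTH, HAMMING-DISTANCE and JACCARD-INDEX given above can be sketched in a few lines of Python. This is an illustration on binary strings, not Pheno-ranker's own code; the function names are made up for the example, and returning 0.0 for two all-zero strings is an arbitrary convention chosen here.

```python
def alignment_length(ref: str, tar: str) -> int:
    """Count positions where the reference or the target has a 1."""
    return sum(1 for r, t in zip(ref, tar) if r == "1" or t == "1")


def hamming_distance(ref: str, tar: str) -> int:
    """Count positions where the two equal-length strings differ."""
    if len(ref) != len(tar):
        raise ValueError("strings must have the same length")
    return sum(1 for r, t in zip(ref, tar) if r != t)


def jaccard_index(ref: str, tar: str) -> float:
    """Positions where both have a 1, divided by positions where at least one has a 1."""
    both = sum(1 for r, t in zip(ref, tar) if r == "1" and t == "1")
    either = alignment_length(ref, tar)
    return both / either if either else 0.0  # convention for all-zero input (assumption)


ref, tar = "0001001", "1000001"  # vectors from the LENGTH example
length = alignment_length(ref, tar)
distance = hamming_distance(ref, tar)
jaccard = jaccard_index(ref, tar)

print(length)             # 3
print(distance)           # 2
print(f"{jaccard:.3f}")   # 0.333
```

For unweighted binary vectors, the number of shared 1s equals LENGTH minus HAMMING-DISTANCE, so the Jaccard index can also be read off as (LENGTH - HAMMING-DISTANCE) / LENGTH. With LENGTH 79 and HAMMING-DISTANCE 6, for instance, this gives 73 / 79 ≈ 0.924.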
See results from rank.txt
RANK REFERENCE(ID) TARGET(ID) FORMAT LENGTH WEIGHTED HAMMING-DISTANCE DISTANCE-Z-SCORE DISTANCE-P-VALUE DISTANCE-Z-SCORE(RAND) JACCARD-INDEX JACCARD-Z-SCORE JACCARD-P-VALUE
1 107:week_0_arm_1 107:week_0_arm_1 BFF 77 False 0 -2.419 0.0077787 -8.7750 1.000 2.949 0.0256500
2 125:week_0_arm_1 107:week_0_arm_1 BFF 79 False 6 -1.924 0.0271576 -7.5381 0.924 2.269 0.1022693
3 275:week_0_arm_1 107:week_0_arm_1 BFF 86 False 14 -1.265 0.1030165 -6.2543 0.837 1.491 0.3117348
4 215:week_0_arm_1 107:week_0_arm_1 BFF 88 False 16 -1.100 0.1357515 -5.9696 0.818 1.321 0.3742868
5 305:week_0_arm_1 107:week_0_arm_1 BFF 89 False 18 -0.935 0.1749800 -5.6180 0.798 1.138 0.4452980
6 365:week_0_arm_1 107:week_0_arm_1 BFF 87 False 20 -0.770 0.2207314 -5.0389 0.770 0.890 0.5437899
7 125:week_2_arm_1 107:week_0_arm_1 BFF 78 False 23 -0.522 0.3007259 -3.6233 0.705 0.308 0.7555423
8 125:week_14_arm_1 107:week_0_arm_1 BFF 78 False 23 -0.522 0.3007259 -3.6233 0.705 0.308 0.7555423
9 107:week_14_arm_1 107:week_0_arm_1 BFF 78 False 23 -0.522 0.3007259 -3.6233 0.705 0.308 0.7555423
10 527:week_14_arm_1 107:week_0_arm_1 BFF 78 False 23 -0.522 0.3007259 -3.6233 0.705 0.308 0.7555423
11 107:week_2_arm_1 107:week_0_arm_1 BFF 78 False 24 -0.440 0.3300253 -3.3968 0.692 0.193 0.7901267
12 527:week_2_arm_1 107:week_0_arm_1 BFF 78 False 24 -0.440 0.3300253 -3.3968 0.692 0.193 0.7901267
13 527:week_0_arm_1 107:week_0_arm_1 BFF 98 False 24 -0.440 0.3300253 -5.0508 0.755 0.756 0.5965581
14 125:week_26_arm_1 107:week_0_arm_1 BFF 78 False 24 -0.440 0.3300253 -3.3968 0.692 0.193 0.7901267
15 275:week_2_arm_1 107:week_0_arm_1 BFF 79 False 25 -0.357 0.3604065 -3.2628 0.684 0.115 0.8120159
16 365:week_2_arm_1 107:week_0_arm_1 BFF 79 False 25 -0.357 0.3604065 -3.2628 0.684 0.115 0.8120159
17 305:week_26_arm_1 107:week_0_arm_1 BFF 79 False 26 -0.275 0.3916958 -3.0377 0.671 0.001 0.8410353
18 365:week_14_arm_1 107:week_0_arm_1 BFF 80 False 26 -0.275 0.3916958 -3.1305 0.675 0.038 0.8319440
19 215:week_26_arm_1 107:week_0_arm_1 BFF 78 False 27 -0.192 0.4237022 -2.7175 0.654 -0.151 0.8752035
20 215:week_2_arm_1 107:week_0_arm_1 BFF 78 False 27 -0.192 0.4237022 -2.7175 0.654 -0.151 0.8752035
21 215:week_14_arm_1 107:week_0_arm_1 BFF 84 False 29 -0.027 0.4890344 -2.8368 0.655 -0.143 0.8735091
22 257:week_0_arm_1 107:week_0_arm_1 BFF 102 False 29 -0.027 0.4890344 -4.3566 0.716 0.403 0.7249040
23 365:week_26_arm_1 107:week_0_arm_1 BFF 83 False 30 0.055 0.5219230 -2.5246 0.639 -0.288 0.9011791
24 275:week_14_arm_1 107:week_0_arm_1 BFF 80 False 30 0.055 0.5219230 -2.2361 0.625 -0.410 0.9206854
25 215:week_78_arm_1 107:week_0_arm_1 BFF 86 False 32 0.220 0.5870339 -2.3723 0.628 -0.384 0.9167688
26 527:week_26_arm_1 107:week_0_arm_1 BFF 86 False 32 0.220 0.5870339 -2.3723 0.628 -0.384 0.9167688
27 125:week_78_arm_1 107:week_0_arm_1 BFF 94 False 40 0.880 0.8104854 -1.4440 0.574 -0.862 0.9687183
28 527:week_52_arm_1 107:week_0_arm_1 BFF 98 False 43 1.127 0.8701495 -1.2122 0.561 -0.981 0.9761986
29 125:week_52_arm_1 107:week_0_arm_1 BFF 98 False 43 1.127 0.8701495 -1.2122 0.561 -0.981 0.9761986
30 365:week_52_arm_1 107:week_0_arm_1 BFF 99 False 45 1.292 0.9018282 -0.9045 0.545 -1.122 0.9830870
31 257:week_2_arm_1 107:week_0_arm_1 BFF 99 False 45 1.292 0.9018282 -0.9045 0.545 -1.122 0.9830870
32 257:week_14_arm_1 107:week_0_arm_1 BFF 99 False 45 1.292 0.9018282 -0.9045 0.545 -1.122 0.9830870
33 305:week_52_arm_1 107:week_0_arm_1 BFF 99 False 45 1.292 0.9018282 -0.9045 0.545 -1.122 0.9830870
34 215:week_52_arm_1 107:week_0_arm_1 BFF 104 False 49 1.622 0.9475899 -0.5883 0.529 -1.271 0.9884232
35 257:week_26_arm_1 107:week_0_arm_1 BFF 103 False 50 1.704 0.9558461 -0.2956 0.515 -1.399 0.9917759
36 275:week_52_arm_1 107:week_0_arm_1 BFF 105 False 51 1.787 0.9630202 -0.2928 0.514 -1.401 0.9918315

When ranking a patient against multiple cohorts, the process mirrors handling a single cohort; the sole distinction is the addition of a prefix (C1_, C2_, and so on) to each primary_key, enabling us to trace the origin of every individual.

Let's reuse individuals.json several times to simulate having more than one cohort.

Example:

pheno-ranker -r individuals.json individuals.json individuals.json -t patient.json --max-out 10 -o rank_multiple.txt

This will create the text file rank_multiple.txt.

See results from rank_multiple.txt
RANK REFERENCE(ID) TARGET(ID) FORMAT LENGTH WEIGHTED HAMMING-DISTANCE DISTANCE-Z-SCORE DISTANCE-P-VALUE DISTANCE-Z-SCORE(RAND) JACCARD-INDEX JACCARD-Z-SCORE JACCARD-P-VALUE
1 C1_107:week_0_arm_1 107:week_0_arm_1 BFF 77 False 1 -2.306 0.0105624 -8.5470 0.987 2.804 0.0356370
2 C2_107:week_0_arm_1 107:week_0_arm_1 BFF 77 False 1 -2.306 0.0105624 -8.5470 0.987 2.804 0.0356370
3 C3_107:week_0_arm_1 107:week_0_arm_1 BFF 77 False 1 -2.306 0.0105624 -8.5470 0.987 2.804 0.0356370
4 C2_125:week_0_arm_1 107:week_0_arm_1 BFF 78 False 5 -1.969 0.0244763 -7.6995 0.936 2.340 0.0901212
5 C1_125:week_0_arm_1 107:week_0_arm_1 BFF 78 False 5 -1.969 0.0244763 -7.6995 0.936 2.340 0.0901212
6 C3_125:week_0_arm_1 107:week_0_arm_1 BFF 78 False 5 -1.969 0.0244763 -7.6995 0.936 2.340 0.0901212
7 C2_275:week_0_arm_1 107:week_0_arm_1 BFF 85 False 13 -1.296 0.0975704 -6.3994 0.847 1.534 0.2966472
8 C1_275:week_0_arm_1 107:week_0_arm_1 BFF 85 False 13 -1.296 0.0975704 -6.3994 0.847 1.534 0.2966472
9 C3_275:week_0_arm_1 107:week_0_arm_1 BFF 85 False 13 -1.296 0.0975704 -6.3994 0.847 1.534 0.2966472
10 C3_215:week_0_arm_1 107:week_0_arm_1 BFF 87 False 15 -1.127 0.1298396 -6.1110 0.828 1.357 0.3603912
Why is the distance for 107:week_0_arm_1 not 0 if the three cohorts are identical?

In Patient mode, the global vector is formed using variables solely from the reference cohort(s), not the patient's. The primary_key (id in this context) is automatically included, leading to a distance of 1 due to the mismatch in the individual's id field.

Note that you can exclude id by adding --exclude-terms id.
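The effect can be reproduced with a toy model in Python. This is a simplified sketch, not Pheno-ranker's implementation: individuals are reduced to sets of flattened variable names (a few borrowed from the alignment output, the patient's id variable name being hypothetical), and the exclude_prefixes argument is a stand-in that only mimics what --exclude-terms id achieves.

```python
# Toy data: each individual is a set of flattened variable names.
reference_cohort = {
    "C1_107:week_0_arm_1": {
        "id.C1_107:week_0_arm_1",
        "ethnicity.id.NCIT:C41261",
        "exposures.NCIT:C154329.unit.id.NCIT:C65108",
    },
    "C1_125:week_0_arm_1": {
        "id.C1_125:week_0_arm_1",
        "ethnicity.id.NCIT:C41261",
        "exposures.NCIT:C154329.unit.id.NCIT:C67148",
    },
}
patient = {
    "id.107:week_0_arm_1",  # hypothetical name; never enters the global vector
    "ethnicity.id.NCIT:C41261",
    "exposures.NCIT:C154329.unit.id.NCIT:C65108",
}


def build_global_vector(cohort, exclude_prefixes=()):
    """Collect variables from the reference cohort only, optionally dropping some."""
    variables = set().union(*cohort.values())
    return sorted(v for v in variables if not v.startswith(tuple(exclude_prefixes)))


def encode(individual, global_vector):
    """One-hot encode an individual against the global vector."""
    return [1 if v in individual else 0 for v in global_vector]


def hamming(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)


reference = reference_cohort["C1_107:week_0_arm_1"]

with_id = build_global_vector(reference_cohort)
distance_with_id = hamming(encode(reference, with_id), encode(patient, with_id))

without_id = build_global_vector(reference_cohort, exclude_prefixes=("id.",))
distance_without_id = hamming(encode(reference, without_id), encode(patient, without_id))

print(distance_with_id)     # 1
print(distance_without_id)  # 0
```

The single mismatch comes from the reference's own id variable, which is set to 1 for the reference and 0 for the patient.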

Obtaining additional information on the alignments

You can create several files describing the reference --- target alignment by adding --align. By default, it creates files (alignment*) in the current directory, but you can specify a </path/basename> instead. Example:

pheno-ranker -r individuals.json individuals.json -t patient.json --align

Or using a path + basename:

pheno-ranker -r individuals.json individuals.json -t patient.json --align /my/fav/dir/jobid-001-align

Below is an extract of the alignment (C1_107:week_0_arm_1 --- 107:week_0_arm_1) taken from alignment.txt:

REF -- TAR
1 ----- 1 | (w:  1|d:  0|cd:  0|) diseases.NCIT:C3138.diseaseCode.id.NCIT:C3138 (Inflammatory Bowel Disease)
1 ----- 1 | (w:  1|d:  0|cd:  0|) ethnicity.id.NCIT:C41261 (Caucasian)
1 ----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.exposureCode.id.NCIT:C154329 (Smoking)
1 ----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.unit.id.NCIT:C65108 (Never Smoker)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.unit.id.NCIT:C67147 (Current Smoker)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.unit.id.NCIT:C67148 (Former Smoker)
1 ----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.exposureCode.id.NCIT:C2190 (Alcohol)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.unit.id.NCIT:C126379 (Non-Drinker)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.unit.id.NCIT:C156821 (Alcohol Consumption More than 2 Drinks per Day for Men and More than 1 Drink per Day for Women)
1 ----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.unit.id.NCIT:C17998 (Unknown)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C73993.exposureCode.id.NCIT:C73993 (Pack Year)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C73993.unit.id.NCIT:C73993 (Pack Year)
1 xxx-- 0 | (w:  1|d:  1|cd:  1|) id.C1_107:week_0_arm_1 (id.C1_107:week_0_arm_1)
0       0 | (w:  1|d:  0|cd:  1|) id.C1_107:week_14_arm_1 (id.C1_107:week_14_arm_1)
0       0 | (w:  1|d:  0|cd:  1|) id.C1_107:week_2_arm_1 (id.C1_107:week_2_arm_1)
0       0 | (w:  1|d:  0|cd:  1|) id.C1_125:week_0_arm_1 (id.C1_125:week_0_arm_1)
0       0 | (w:  1|d:  0|cd:  1|) id.C1_125:week_14_arm_1 (id.C1_125:week_14_arm_1)
...
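To list the mismatching variables programmatically, lines like the ones above can be parsed with a regular expression. The following Python sketch assumes the layout shown in this extract and nothing more; the regular expression and function name are made up for illustration, and the w, d and cd fields are kept under the names used in the file without further interpretation.

```python
import re

# Pattern inferred from the extract above (an assumption, not a documented format).
ALIGNMENT_LINE = re.compile(
    r"^([01])\s+\S*\s*([01])\s+\|\s+"
    r"\(w:\s*(\S+?)\|d:\s*(\S+?)\|cd:\s*(\S+?)\|\)\s+"
    r"(\S+)\s+\((.*)\)$"
)


def parse_alignment_line(line):
    """Return a dict for an alignment row, or None for headers and other lines."""
    match = ALIGNMENT_LINE.match(line.strip())
    if match is None:
        return None
    ref, tar, w, d, cd, term, label = match.groups()
    return {
        "ref": int(ref),
        "tar": int(tar),
        "w": float(w),
        "d": float(d),
        "cd": float(cd),
        "term": term,
        "label": label,
    }


sample = """REF -- TAR
1 ----- 1 | (w:  1|d:  0|cd:  0|) ethnicity.id.NCIT:C41261 (Caucasian)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C73993.unit.id.NCIT:C73993 (Pack Year)
1 xxx-- 0 | (w:  1|d:  1|cd:  1|) id.C1_107:week_0_arm_1 (id.C1_107:week_0_arm_1)
"""

rows = [row for row in map(parse_alignment_line, sample.splitlines()) if row]
mismatches = [row["term"] for row in rows if row["ref"] != row["tar"]]

print(len(rows))    # 3
print(mismatches)   # ['id.C1_107:week_0_arm_1']
```

To process a real file, replace sample.splitlines() with the lines read from alignment.txt.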