Patient mode

Patient mode aims to determine which individuals in the cohort are the closest to our patient by ranking them using (dis)similarity metrics.

Usage

With the Pheno-ranker command-line interface, pass the reference cohort file with -r and the patient (target) file with -t.

Example:

pheno-ranker -r individuals.json -t patient.json
How do I extract one or many patients from a cohort file?
pheno-ranker -r t/individuals.json --patients-of-interest 107:week_0_arm_1 125:week_0_arm_1

This command performs a dry run and creates the files 107:week_0_arm_1.json and 125:week_0_arm_1.json. For the example above, 107:week_0_arm_1.json was renamed to patient.json:

mv 107:week_0_arm_1.json patient.json

The pheno-ranker -r individuals.json -t patient.json command shown in the example creates the output text file rank.txt.

rank.txt column names and meaning
  • RANK: The position of the match in the similarity ordering. A rank of 1 signifies the best match.
  • REFERENCE(ID): The unique identifier (primary key) for the reference individual.
  • TARGET(ID): The unique identifier (primary key) for the target individual. This is set using the --t parameter.
  • FORMAT: Specifies the format of the input data, which can be one of the following: BFF, PXF, or CSV. This is configured in the settings file.
  • LENGTH: This refers to the length of the "alignment", meaning the count of variables that have a 1 in either the reference or the target. For example:
LENGTH example

REF: 0001001
TAR: 1000001
In this case, the LENGTH is 3.

  • WEIGHTED: Indicates if the calculation used weights (specified with --w). Possible values are True or False.
  • HAMMING-DISTANCE: The Hamming distance between the reference and target individuals' vectors. The Hamming distance between two strings of equal length is the count of positions at which the corresponding symbols are different. In the context of binary strings, it's the number of bit positions where the two strings differ.
  • DISTANCE-Z-SCORE: The empirical Z-score from all comparisons between the patient and the reference cohort.
  • DISTANCE-P-VALUE: The statistical significance of the observed DISTANCE-Z-SCORE.
  • DISTANCE-Z-SCORE(RAND): The estimated Z-score for two random vectors, assuming the alignment size is equal to LENGTH.
DISTANCE-Z-SCORE(RAND) calculation

The value comes from the estimated mean and standard deviation of the Hamming distance for binary strings. It assumes that each position in the strings has a 50% chance of being a mismatch, independently of the other positions. The method is grounded in the binomial distribution.

  • The mean is calculated under the assumption of a 50% probability of mismatch at each position.
\[ \text{Estimated Average} = \text{Length} \times \text{Probability of Mismatch} \]

where Probability of Mismatch is set at 0.5.

  • The standard deviation measures the variability or spread of the Hamming distance around the mean. This calculation assumes a binomial distribution of mismatches, given the binary nature of the data (match or mismatch).
\[ \text{Estimated Standard Deviation} = \sqrt{\text{Length} \times \text{Probability of Mismatch} \times (1 - \text{Probability of Mismatch})} \]

Finally, the formula for the Z-score is:

\[ Z = \frac{X - \mu}{\sigma} \]

where \( X \) is the observed Hamming distance, \( \mu \) is the estimated average, and \( \sigma \) is the estimated standard deviation.

This method is applicable for estimating the Hamming distance in randomly generated binary strings where each position is independently set.
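
As a quick check of the formulas above, here is a minimal Python sketch (not Pheno-ranker's own code; the function name `z_score_rand` is hypothetical). It plugs LENGTH and HAMMING-DISTANCE into the binomial mean and standard deviation with a mismatch probability of 0.5.

```python
import math


def z_score_rand(hamming_distance: int, length: int, p_mismatch: float = 0.5) -> float:
    """Z-score of an observed Hamming distance versus two random binary vectors.

    Hypothetical helper: follows the binomial mean and standard deviation
    described above. LENGTH must be greater than 0.
    """
    mean = length * p_mismatch
    std_dev = math.sqrt(length * p_mismatch * (1 - p_mismatch))
    return (hamming_distance - mean) / std_dev


# Values taken from the first row of the sample rank.txt:
# LENGTH = 77, HAMMING-DISTANCE = 0
print(f"{z_score_rand(0, 77):.4f}")  # -8.7750
```

The result agrees with the DISTANCE-Z-SCORE(RAND) value reported for that row in the sample rank.txt.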

  • JACCARD-INDEX: The Jaccard similarity coefficient between the reference and target individuals' vectors. The Jaccard Index for binary digit strings is a measure that calculates the similarity between two strings by dividing the number of positions where both have a 1 by the number of positions where at least one has a 1.
  • JACCARD-Z-SCORE: The Z-score calculated from all comparisons between patients and the reference cohort.
  • JACCARD-P-VALUE: The statistical significance of the observed JACCARD-Z-SCORE.
  • REFERENCE-VARS: The total number of variables for the reference.
  • TARGET-VARS: The total number of variables for the target.
  • INTERSECT: The intersection of variables between reference and target.
  • INTERSECT-RATE(%): The percentage of intersected variables with respect to the total number of variables in the target.
INTERSECT-RATE(%) calculation

The INTERSECT-RATE measures the overlap of variables between reference and target by calculating the proportion of shared variables relative to the total number of variables in the target.

  • Intersection Count: The number of variables that exist in both the reference and target sets.
\[ \text{INTERSECT-RATE(\%)} = \frac{\text{Intersection Count}}{\text{Number of Variables in Target}} \times 100 \]

This metric expresses the overlap as a percentage, where 0% means no overlap and 100% means complete overlap with the target variables.

  • COMPLETENESS(%): The percentage of intersected variables with respect to the total number of variables in the reference.
COMPLETENESS(%) calculation

The COMPLETENESS measures the overlap of variables between reference and target by calculating the proportion of shared variables relative to the total number of variables in the reference.

  • Intersection Count: The number of variables that exist in both the reference and target sets.
\[ \text{COMPLETENESS(\%)} = \frac{\text{Intersection Count}}{\text{Number of Variables in Reference}} \times 100 \]

This metric expresses the overlap as a percentage, where 0% means no overlap and 100% means complete overlap with the reference variables.
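
The column definitions above can be illustrated with a small Python sketch for the unweighted case. This is not Pheno-ranker's implementation: the function name `compare_vectors` is hypothetical, and it assumes that REFERENCE-VARS and TARGET-VARS correspond to the number of 1s in each binary vector (which is consistent with the sample rank.txt rows). Returning 0 when a denominator is 0 is also an assumption made here.

```python
def compare_vectors(ref: str, tar: str) -> dict:
    """Compute rank.txt-style metrics for two equal-length binary strings.

    Hypothetical helper for illustration; unweighted comparison only.
    """
    if len(ref) != len(tar):
        raise ValueError("vectors must have equal length")
    pairs = list(zip(ref, tar))
    ref_vars = ref.count("1")  # variables present in the reference
    tar_vars = tar.count("1")  # variables present in the target
    intersect = sum(r == "1" and t == "1" for r, t in pairs)
    length = sum(r == "1" or t == "1" for r, t in pairs)
    hamming = sum(r != t for r, t in pairs)
    return {
        "LENGTH": length,
        "HAMMING-DISTANCE": hamming,
        # Assumption: report 0.0 when a denominator would be zero
        "JACCARD-INDEX": intersect / length if length else 0.0,
        "REFERENCE-VARS": ref_vars,
        "TARGET-VARS": tar_vars,
        "INTERSECT": intersect,
        "INTERSECT-RATE(%)": 100 * intersect / tar_vars if tar_vars else 0.0,
        "COMPLETENESS(%)": 100 * intersect / ref_vars if ref_vars else 0.0,
    }


# The vectors from the LENGTH example
for name, value in compare_vectors("0001001", "1000001").items():
    print(name, value)
```

For the LENGTH example vectors this gives LENGTH 3, a Hamming distance of 2, one intersecting variable, and 50% for both INTERSECT-RATE(%) and COMPLETENESS(%).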

See results from rank.txt
RANK REFERENCE(ID) TARGET(ID) FORMAT LENGTH WEIGHTED HAMMING-DISTANCE DISTANCE-Z-SCORE DISTANCE-P-VALUE DISTANCE-Z-SCORE(RAND) JACCARD-INDEX JACCARD-Z-SCORE JACCARD-P-VALUE REFERENCE-VARS TARGET-VARS INTERSECT INTERSECT-RATE(%) COMPLETENESS(%)
1 107:week_0_arm_1 107:week_0_arm_1 BFF 77 False 0 -2.419 0.0077787 -8.7750 1.000 2.949 0.0256500 77 77 77 100.00 100.00
2 125:week_0_arm_1 107:week_0_arm_1 BFF 79 False 6 -1.924 0.0271576 -7.5381 0.924 2.269 0.1022693 75 77 73 94.81 97.33
3 275:week_0_arm_1 107:week_0_arm_1 BFF 86 False 14 -1.265 0.1030165 -6.2543 0.837 1.491 0.3117348 81 77 72 93.51 88.89
4 215:week_0_arm_1 107:week_0_arm_1 BFF 88 False 16 -1.100 0.1357515 -5.9696 0.818 1.321 0.3742868 83 77 72 93.51 86.75
5 305:week_0_arm_1 107:week_0_arm_1 BFF 89 False 18 -0.935 0.1749800 -5.6180 0.798 1.138 0.4452980 83 77 71 92.21 85.54
6 365:week_0_arm_1 107:week_0_arm_1 BFF 87 False 20 -0.770 0.2207314 -5.0389 0.770 0.890 0.5437899 77 77 67 87.01 87.01
7 125:week_14_arm_1 107:week_0_arm_1 BFF 78 False 23 -0.522 0.3007259 -3.6233 0.705 0.308 0.7555423 56 77 55 71.43 98.21
8 527:week_14_arm_1 107:week_0_arm_1 BFF 78 False 23 -0.522 0.3007259 -3.6233 0.705 0.308 0.7555423 56 77 55 71.43 98.21
9 107:week_14_arm_1 107:week_0_arm_1 BFF 78 False 23 -0.522 0.3007259 -3.6233 0.705 0.308 0.7555423 56 77 55 71.43 98.21
10 125:week_2_arm_1 107:week_0_arm_1 BFF 78 False 23 -0.522 0.3007259 -3.6233 0.705 0.308 0.7555423 56 77 55 71.43 98.21
11 107:week_2_arm_1 107:week_0_arm_1 BFF 78 False 24 -0.440 0.3300253 -3.3968 0.692 0.193 0.7901267 55 77 54 70.13 98.18
12 125:week_26_arm_1 107:week_0_arm_1 BFF 78 False 24 -0.440 0.3300253 -3.3968 0.692 0.193 0.7901267 55 77 54 70.13 98.18
13 527:week_2_arm_1 107:week_0_arm_1 BFF 78 False 24 -0.440 0.3300253 -3.3968 0.692 0.193 0.7901267 55 77 54 70.13 98.18
14 527:week_0_arm_1 107:week_0_arm_1 BFF 98 False 24 -0.440 0.3300253 -5.0508 0.755 0.756 0.5965581 95 77 74 96.10 77.89
15 365:week_2_arm_1 107:week_0_arm_1 BFF 79 False 25 -0.357 0.3604065 -3.2628 0.684 0.115 0.8120159 56 77 54 70.13 96.43
16 275:week_2_arm_1 107:week_0_arm_1 BFF 79 False 25 -0.357 0.3604065 -3.2628 0.684 0.115 0.8120159 56 77 54 70.13 96.43
17 305:week_26_arm_1 107:week_0_arm_1 BFF 79 False 26 -0.275 0.3916958 -3.0377 0.671 0.001 0.8410353 55 77 53 68.83 96.36
18 365:week_14_arm_1 107:week_0_arm_1 BFF 80 False 26 -0.275 0.3916958 -3.1305 0.675 0.038 0.8319440 57 77 54 70.13 94.74
19 215:week_2_arm_1 107:week_0_arm_1 BFF 78 False 27 -0.192 0.4237022 -2.7175 0.654 -0.151 0.8752035 52 77 51 66.23 98.08
20 215:week_26_arm_1 107:week_0_arm_1 BFF 78 False 27 -0.192 0.4237022 -2.7175 0.654 -0.151 0.8752035 52 77 51 66.23 98.08
21 257:week_0_arm_1 107:week_0_arm_1 BFF 102 False 29 -0.027 0.4890344 -4.3566 0.716 0.403 0.7249040 98 77 73 94.81 74.49
22 215:week_14_arm_1 107:week_0_arm_1 BFF 84 False 29 -0.027 0.4890344 -2.8368 0.655 -0.143 0.8735091 62 77 55 71.43 88.71
23 275:week_14_arm_1 107:week_0_arm_1 BFF 80 False 30 0.055 0.5219230 -2.2361 0.625 -0.410 0.9206854 53 77 50 64.94 94.34
24 365:week_26_arm_1 107:week_0_arm_1 BFF 83 False 30 0.055 0.5219230 -2.5246 0.639 -0.288 0.9011791 59 77 53 68.83 89.83
25 215:week_78_arm_1 107:week_0_arm_1 BFF 86 False 32 0.220 0.5870339 -2.3723 0.628 -0.384 0.9167688 63 77 54 70.13 85.71
26 527:week_26_arm_1 107:week_0_arm_1 BFF 86 False 32 0.220 0.5870339 -2.3723 0.628 -0.384 0.9167688 63 77 54 70.13 85.71
27 125:week_78_arm_1 107:week_0_arm_1 BFF 94 False 40 0.880 0.8104854 -1.4440 0.574 -0.862 0.9687183 71 77 54 70.13 76.06
28 527:week_52_arm_1 107:week_0_arm_1 BFF 98 False 43 1.127 0.8701495 -1.2122 0.561 -0.981 0.9761986 76 77 55 71.43 72.37
29 125:week_52_arm_1 107:week_0_arm_1 BFF 98 False 43 1.127 0.8701495 -1.2122 0.561 -0.981 0.9761986 76 77 55 71.43 72.37
30 365:week_52_arm_1 107:week_0_arm_1 BFF 99 False 45 1.292 0.9018282 -0.9045 0.545 -1.122 0.9830870 76 77 54 70.13 71.05
31 257:week_14_arm_1 107:week_0_arm_1 BFF 99 False 45 1.292 0.9018282 -0.9045 0.545 -1.122 0.9830870 76 77 54 70.13 71.05
32 305:week_52_arm_1 107:week_0_arm_1 BFF 99 False 45 1.292 0.9018282 -0.9045 0.545 -1.122 0.9830870 76 77 54 70.13 71.05
33 257:week_2_arm_1 107:week_0_arm_1 BFF 99 False 45 1.292 0.9018282 -0.9045 0.545 -1.122 0.9830870 76 77 54 70.13 71.05
34 215:week_52_arm_1 107:week_0_arm_1 BFF 104 False 49 1.622 0.9475899 -0.5883 0.529 -1.271 0.9884232 82 77 55 71.43 67.07
35 257:week_26_arm_1 107:week_0_arm_1 BFF 103 False 50 1.704 0.9558461 -0.2956 0.515 -1.399 0.9917759 79 77 53 68.83 67.09
36 275:week_52_arm_1 107:week_0_arm_1 BFF 105 False 51 1.787 0.9630202 -0.2928 0.514 -1.401 0.9918315 82 77 54 70.13 65.85
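
If you want to post-process these results, a minimal Python sketch such as the following may help. It assumes that rank.txt has one header line, that fields are separated by whitespace (tabs or spaces), and that no field contains whitespace; the function name `parse_rank` is hypothetical. Since the patient here is also a member of the cohort, the example looks up the best match other than the patient itself.

```python
def parse_rank(text: str) -> list[dict]:
    """Parse rank.txt-style text into one dict per row (values kept as strings).

    Assumes a single header line and whitespace-separated fields.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


# First two rows copied from the sample rank.txt above
sample = """\
RANK REFERENCE(ID) TARGET(ID) FORMAT LENGTH WEIGHTED HAMMING-DISTANCE DISTANCE-Z-SCORE DISTANCE-P-VALUE DISTANCE-Z-SCORE(RAND) JACCARD-INDEX JACCARD-Z-SCORE JACCARD-P-VALUE REFERENCE-VARS TARGET-VARS INTERSECT INTERSECT-RATE(%) COMPLETENESS(%)
1 107:week_0_arm_1 107:week_0_arm_1 BFF 77 False 0 -2.419 0.0077787 -8.7750 1.000 2.949 0.0256500 77 77 77 100.00 100.00
2 125:week_0_arm_1 107:week_0_arm_1 BFF 79 False 6 -1.924 0.0271576 -7.5381 0.924 2.269 0.1022693 75 77 73 94.81 97.33
"""

rows = parse_rank(sample)
# Best-ranked reference that is not the patient itself
best_other = next(r for r in rows if r["REFERENCE(ID)"] != r["TARGET(ID)"])
print(best_other["REFERENCE(ID)"], best_other["HAMMING-DISTANCE"])
```

To read the real file, pass its contents instead of `sample`, for example `parse_rank(open("rank.txt").read())`.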

With multiple reference cohorts, the process mirrors handling a single cohort; the sole distinction is that a prefix (C1_, C2_, and so on in the output below) is added to each primary_key, enabling us to trace the origin of every individual.

Let's reuse individuals.json three times to simulate having more than one cohort.

Example:

pheno-ranker -r individuals.json individuals.json individuals.json -t patient.json --max-out 10 -o rank_multiple.txt

This will create the text file rank_multiple.txt.

See results from rank_multiple.txt
RANK REFERENCE(ID) TARGET(ID) FORMAT LENGTH WEIGHTED HAMMING-DISTANCE DISTANCE-Z-SCORE DISTANCE-P-VALUE DISTANCE-Z-SCORE(RAND) JACCARD-INDEX JACCARD-Z-SCORE JACCARD-P-VALUE REFERENCE-VARS TARGET-VARS INTERSECT INTERSECT-RATE(%) COMPLETENESS(%)
1 C2_107:week_0_arm_1 107:week_0_arm_1 BFF 77 False 1 -2.306 0.0105624 -8.5470 0.987 2.804 0.0356370 77 77 76 98.70 98.70
2 C3_107:week_0_arm_1 107:week_0_arm_1 BFF 77 False 1 -2.306 0.0105624 -8.5470 0.987 2.804 0.0356370 77 77 76 98.70 98.70
3 C1_107:week_0_arm_1 107:week_0_arm_1 BFF 77 False 1 -2.306 0.0105624 -8.5470 0.987 2.804 0.0356370 77 77 76 98.70 98.70
4 C1_125:week_0_arm_1 107:week_0_arm_1 BFF 78 False 5 -1.969 0.0244763 -7.6995 0.936 2.340 0.0901212 75 77 73 94.81 97.33
5 C3_125:week_0_arm_1 107:week_0_arm_1 BFF 78 False 5 -1.969 0.0244763 -7.6995 0.936 2.340 0.0901212 75 77 73 94.81 97.33
6 C2_125:week_0_arm_1 107:week_0_arm_1 BFF 78 False 5 -1.969 0.0244763 -7.6995 0.936 2.340 0.0901212 75 77 73 94.81 97.33
7 C2_275:week_0_arm_1 107:week_0_arm_1 BFF 85 False 13 -1.296 0.0975704 -6.3994 0.847 1.534 0.2966472 81 77 72 93.51 88.89
8 C1_275:week_0_arm_1 107:week_0_arm_1 BFF 85 False 13 -1.296 0.0975704 -6.3994 0.847 1.534 0.2966472 81 77 72 93.51 88.89
9 C3_275:week_0_arm_1 107:week_0_arm_1 BFF 85 False 13 -1.296 0.0975704 -6.3994 0.847 1.534 0.2966472 81 77 72 93.51 88.89
10 C1_215:week_0_arm_1 107:week_0_arm_1 BFF 87 False 15 -1.127 0.1298396 -6.1110 0.828 1.357 0.3603912 83 77 72 93.51 86.75
Why is the distance for 107:week_0_arm_1 not 0 if the three cohorts are identical?

In Patient mode, the global vector is formed using variables solely from the reference cohort(s), not the patient's. The primary_key (id in this context) is automatically included, leading to a distance of 1 due to the mismatch in the individual's id field.

If you want to visualize the differences in all variables (i.e., the union of reference(s) and target), simply add the target as another cohort in --r. This way, the variables from the patient will be included in the reference vector.

Note that you can exclude id by adding --exclude-terms id.
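
The effect described in this answer can be reproduced with a toy Python sketch. It is not Pheno-ranker's code: the variable sets are heavily shortened, the helper names `binary_vector` and `hamming` are hypothetical, and the id filter only mimics what --exclude-terms id is described to do.

```python
def binary_vector(variables: set, global_vars: list) -> list:
    """Encode a set of variables as 0/1 values over the global variable list."""
    return [1 if v in variables else 0 for v in global_vars]


def hamming(a: list, b: list) -> int:
    """Number of positions where the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Shortened, illustrative variable sets; the reference id carries the cohort prefix
reference = {
    "id.C1_107:week_0_arm_1",
    "ethnicity.id.NCIT:C41261",
    "exposures.NCIT:C154329.unit.id.NCIT:C65108",
}
target = {
    "id.107:week_0_arm_1",
    "ethnicity.id.NCIT:C41261",
    "exposures.NCIT:C154329.unit.id.NCIT:C65108",
}

# The global vector is built from the reference variables only
global_vars = sorted(reference)
distance = hamming(binary_vector(reference, global_vars),
                   binary_vector(target, global_vars))
print(distance)  # 1: the prefixed id is absent from the target

# Dropping the id variables removes the mismatch
no_id = [v for v in global_vars if not v.startswith("id.")]
distance_no_id = hamming(binary_vector(reference, no_id),
                         binary_vector(target, no_id))
print(distance_no_id)  # 0
```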

Obtaining additional information on the alignments

You can create several files related to the reference --- target alignment by adding --align. By default, the files (alignment*) are created in the current directory, but you can specify a </path/basename> instead. Example:

pheno-ranker -r individuals.json individuals.json -t patient.json --align

Or using a path + basename:

pheno-ranker -r individuals.json individuals.json -t patient.json --align /my/fav/dir/jobid-001-align

Below is an extract of the alignment (C1_107:week_0_arm_1 --- 107:week_0_arm_1) from alignment.txt:

REF -- TAR
1 ----- 1 | (w:  1|d:  0|cd:  0|) diseases.NCIT:C3138.diseaseCode.id.NCIT:C3138 (Inflammatory Bowel Disease)
1 ----- 1 | (w:  1|d:  0|cd:  0|) ethnicity.id.NCIT:C41261 (Caucasian)
1 ----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.exposureCode.id.NCIT:C154329 (Smoking)
1 ----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.unit.id.NCIT:C65108 (Never Smoker)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.unit.id.NCIT:C67147 (Current Smoker)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C154329.unit.id.NCIT:C67148 (Former Smoker)
1 ----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.exposureCode.id.NCIT:C2190 (Alcohol)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.unit.id.NCIT:C126379 (Non-Drinker)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.unit.id.NCIT:C156821 (Alcohol Consumption More than 2 Drinks per Day for Men and More than 1 Drink per Day for Women)
1 ----- 1 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C2190.unit.id.NCIT:C17998 (Unknown)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C73993.exposureCode.id.NCIT:C73993 (Pack Year)
0       0 | (w:  1|d:  0|cd:  0|) exposures.NCIT:C73993.unit.id.NCIT:C73993 (Pack Year)
1 xxx-- 0 | (w:  1|d:  1|cd:  1|) id.C1_107:week_0_arm_1 (id.C1_107:week_0_arm_1)
0       0 | (w:  1|d:  0|cd:  1|) id.C1_107:week_14_arm_1 (id.C1_107:week_14_arm_1)
0       0 | (w:  1|d:  0|cd:  1|) id.C1_107:week_2_arm_1 (id.C1_107:week_2_arm_1)
0       0 | (w:  1|d:  0|cd:  1|) id.C1_125:week_0_arm_1 (id.C1_125:week_0_arm_1)
0       0 | (w:  1|d:  0|cd:  1|) id.C1_125:week_14_arm_1 (id.C1_125:week_14_arm_1)
...