Patient mode
Patient mode aims to determine which individuals in the cohort are the closest to our patient by ranking them using (dis)similarity metrics.
Usage
When using the Pheno-ranker command-line interface, simply ensure the correct syntax is provided.
Example:
How do I extract one or many patients from a cohort file?
This command will carry out a dry-run, creating the files 107:week_0_arm_1.json and 125:week_0_arm_1.json.
In the example above, I renamed 107:week_0_arm_1.json to patient.json by typing this:
This will create the output text file rank.txt.
rank.txt column names and meaning
RANK: The order of the similarity match. A rank of 1 signifies the best match.
REFERENCE(ID): The unique identifier (primary key) of the reference individual.
TARGET(ID): The unique identifier (primary key) of the target individual. This is set using the --t parameter.
FORMAT: The format of the input data, which can be one of the following: BFF, PXF, or CSV. This is configured in the settings file.
LENGTH: The length of the "alignment", meaning the count of variables that have a 1 in either the reference or the target. For example, the vectors 1100 and 0110 have a LENGTH of 3.
WEIGHTED: Indicates whether the calculation used weights (specified with --w). Possible values are True or False.
HAMMING-DISTANCE: The Hamming distance between the reference and target individuals' vectors. The Hamming distance between two strings of equal length is the count of positions at which the corresponding symbols differ. For binary strings, it is the number of bit positions where the two strings differ.
DISTANCE-Z-SCORE: The empirical Z-score from all comparisons between the patient and the reference cohort.
DISTANCE-P-VALUE: The statistical significance of the observed DISTANCE-Z-SCORE.
DISTANCE-Z-SCORE(RAND): The estimated Z-score for two random vectors, assuming the alignment size is equal to LENGTH.
DISTANCE-Z-SCORE(RAND) calculation
The value comes from the estimated mean and standard deviation of the Hamming distance between two random binary strings. It assumes that each position has a 50% chance of being a mismatch, independently of the other positions, so the number of mismatches follows a binomial distribution.
- The mean is calculated under the assumption of a 50% probability of mismatch at each position:
$$ \mu = \text{LENGTH} \times \text{Probability of Mismatch} $$
where Probability of Mismatch is set at 0.5.
- The standard deviation measures the spread of the Hamming distance around the mean. This calculation assumes a binomial distribution of mismatches, given the binary nature of the data (match or mismatch):
$$ \sigma = \sqrt{\text{LENGTH} \times \text{Probability of Mismatch} \times (1 - \text{Probability of Mismatch})} $$
Finally, the formula for the Z-score is:
$$ Z = \frac{(X - \mu)}{\sigma} $$
Where:
\( X \) is the value of interest (here, the observed HAMMING-DISTANCE).
\( \mu \) is the estimated mean.
\( \sigma \) is the estimated standard deviation.
This estimate applies to randomly generated binary strings in which each position is set independently.
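The calculation above can be sketched in a few lines of Python. This is only an illustration of the binomial formulas, not Pheno-ranker's own code; the function name `z_score_random` is hypothetical. The inputs are the LENGTH and HAMMING-DISTANCE values from the first two rows of the rank.txt example shown further down.

```python
import math


def z_score_random(hamming_distance: int, length: int, p_mismatch: float = 0.5) -> float:
    """Z-score of an observed Hamming distance under the random-vector model."""
    mean = length * p_mismatch                                # binomial mean
    sd = math.sqrt(length * p_mismatch * (1.0 - p_mismatch))  # binomial standard deviation
    return (hamming_distance - mean) / sd


# (HAMMING-DISTANCE, LENGTH) pairs taken from the rank.txt example
print(f"{z_score_random(0, 77):.4f}")  # -8.7750
print(f"{z_score_random(6, 79):.4f}")  # -7.5381
```

Both values agree with the DISTANCE-Z-SCORE(RAND) column of those rows.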
JACCARD-INDEX: The Jaccard similarity coefficient between the reference and target individuals' vectors. For binary strings, the Jaccard index measures similarity by dividing the number of positions where both strings have a 1 by the number of positions where at least one has a 1.
JACCARD-Z-SCORE: The Z-score calculated from all comparisons between patients and the reference cohort.
JACCARD-P-VALUE: The statistical significance of the observed JACCARD-Z-SCORE.
REFERENCE-VARS: The total number of variables for the reference.
TARGET-VARS: The total number of variables for the target.
INTERSECT: The intersection of variables between reference and target.
INTERSECT-RATE(%): The percentage of intersected variables with respect to the total number of variables in the target.
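As an illustration of the vector-based columns described above (LENGTH, HAMMING-DISTANCE and JACCARD-INDEX), here is a minimal Python sketch on two toy binary strings. The function name and the toy vectors are hypothetical, and returning 0.0 when neither vector contains a 1 is a convention chosen for this sketch only.

```python
def compare_binary_vectors(reference: str, target: str) -> dict:
    """Compute LENGTH, HAMMING-DISTANCE and JACCARD-INDEX for two binary strings."""
    if len(reference) != len(target):
        raise ValueError("vectors must have equal length")
    pairs = list(zip(reference, target))
    both = sum(r == "1" and t == "1" for r, t in pairs)   # 1 in both vectors
    either = sum(r == "1" or t == "1" for r, t in pairs)  # 1 in at least one vector
    hamming = sum(r != t for r, t in pairs)               # positions that differ
    jaccard = both / either if either else 0.0            # convention for empty vectors
    return {"LENGTH": either, "HAMMING-DISTANCE": hamming, "JACCARD-INDEX": jaccard}


# Toy vectors (not taken from the example data)
metrics = compare_binary_vectors("1101001", "1100101")
print(metrics)  # {'LENGTH': 5, 'HAMMING-DISTANCE': 2, 'JACCARD-INDEX': 0.6}
```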
INTERSECT-RATE(%) calculation
The INTERSECT-RATE measures the overlap of variables between reference and target by calculating the proportion of shared variables relative to the total number of variables in the target.
- Intersection Count: The number of variables that exist in both the reference and target sets.
$$ \text{INTERSECT-RATE(\%)} = \frac{\text{Intersection Count}}{\text{TARGET-VARS}} \times 100 $$
This metric expresses the overlap as a percentage, where 0% means no overlap and 100% means complete overlap with the target variables.
COMPLETENESS(%): The percentage of intersected variables with respect to the total number of variables in the reference.
COMPLETENESS(%) calculation
The COMPLETENESS measures the overlap of variables between reference and target by calculating the proportion of shared variables relative to the total number of variables in the reference.
- Intersection Count: The number of variables that exist in both the reference and target sets.
$$ \text{COMPLETENESS(\%)} = \frac{\text{Intersection Count}}{\text{REFERENCE-VARS}} \times 100 $$
This metric expresses the overlap as a percentage, where 0% means no overlap and 100% means complete overlap with the reference variables.
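The two percentages differ only in their denominator, which the following Python sketch makes explicit. The function name `overlap_rates` is hypothetical; the counts are the INTERSECT, REFERENCE-VARS and TARGET-VARS values of the second row (125:week_0_arm_1) in the rank.txt example below.

```python
def overlap_rates(intersect: int, reference_vars: int, target_vars: int) -> tuple:
    """Return (INTERSECT-RATE(%), COMPLETENESS(%)) from the three counts."""
    intersect_rate = 100.0 * intersect / target_vars   # relative to the target
    completeness = 100.0 * intersect / reference_vars  # relative to the reference
    return intersect_rate, completeness


rate, completeness = overlap_rates(intersect=73, reference_vars=75, target_vars=77)
print(f"{rate:.2f} {completeness:.2f}")  # 94.81 97.33
```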
See results from rank.txt
RANK | REFERENCE(ID) | TARGET(ID) | FORMAT | LENGTH | WEIGHTED | HAMMING-DISTANCE | DISTANCE-Z-SCORE | DISTANCE-P-VALUE | DISTANCE-Z-SCORE(RAND) | JACCARD-INDEX | JACCARD-Z-SCORE | JACCARD-P-VALUE | REFERENCE-VARS | TARGET-VARS | INTERSECT | INTERSECT-RATE(%) | COMPLETENESS(%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 107:week_0_arm_1 | 107:week_0_arm_1 | BFF | 77 | False | 0 | -2.419 | 0.0077787 | -8.7750 | 1.000 | 2.949 | 0.0256500 | 77 | 77 | 77 | 100.00 | 100.00 |
2 | 125:week_0_arm_1 | 107:week_0_arm_1 | BFF | 79 | False | 6 | -1.924 | 0.0271576 | -7.5381 | 0.924 | 2.269 | 0.1022693 | 75 | 77 | 73 | 94.81 | 97.33 |
3 | 275:week_0_arm_1 | 107:week_0_arm_1 | BFF | 86 | False | 14 | -1.265 | 0.1030165 | -6.2543 | 0.837 | 1.491 | 0.3117348 | 81 | 77 | 72 | 93.51 | 88.89 |
4 | 215:week_0_arm_1 | 107:week_0_arm_1 | BFF | 88 | False | 16 | -1.100 | 0.1357515 | -5.9696 | 0.818 | 1.321 | 0.3742868 | 83 | 77 | 72 | 93.51 | 86.75 |
5 | 305:week_0_arm_1 | 107:week_0_arm_1 | BFF | 89 | False | 18 | -0.935 | 0.1749800 | -5.6180 | 0.798 | 1.138 | 0.4452980 | 83 | 77 | 71 | 92.21 | 85.54 |
6 | 365:week_0_arm_1 | 107:week_0_arm_1 | BFF | 87 | False | 20 | -0.770 | 0.2207314 | -5.0389 | 0.770 | 0.890 | 0.5437899 | 77 | 77 | 67 | 87.01 | 87.01 |
7 | 125:week_14_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 23 | -0.522 | 0.3007259 | -3.6233 | 0.705 | 0.308 | 0.7555423 | 56 | 77 | 55 | 71.43 | 98.21 |
8 | 527:week_14_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 23 | -0.522 | 0.3007259 | -3.6233 | 0.705 | 0.308 | 0.7555423 | 56 | 77 | 55 | 71.43 | 98.21 |
9 | 107:week_14_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 23 | -0.522 | 0.3007259 | -3.6233 | 0.705 | 0.308 | 0.7555423 | 56 | 77 | 55 | 71.43 | 98.21 |
10 | 125:week_2_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 23 | -0.522 | 0.3007259 | -3.6233 | 0.705 | 0.308 | 0.7555423 | 56 | 77 | 55 | 71.43 | 98.21 |
11 | 107:week_2_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 24 | -0.440 | 0.3300253 | -3.3968 | 0.692 | 0.193 | 0.7901267 | 55 | 77 | 54 | 70.13 | 98.18 |
12 | 125:week_26_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 24 | -0.440 | 0.3300253 | -3.3968 | 0.692 | 0.193 | 0.7901267 | 55 | 77 | 54 | 70.13 | 98.18 |
13 | 527:week_2_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 24 | -0.440 | 0.3300253 | -3.3968 | 0.692 | 0.193 | 0.7901267 | 55 | 77 | 54 | 70.13 | 98.18 |
14 | 527:week_0_arm_1 | 107:week_0_arm_1 | BFF | 98 | False | 24 | -0.440 | 0.3300253 | -5.0508 | 0.755 | 0.756 | 0.5965581 | 95 | 77 | 74 | 96.10 | 77.89 |
15 | 365:week_2_arm_1 | 107:week_0_arm_1 | BFF | 79 | False | 25 | -0.357 | 0.3604065 | -3.2628 | 0.684 | 0.115 | 0.8120159 | 56 | 77 | 54 | 70.13 | 96.43 |
16 | 275:week_2_arm_1 | 107:week_0_arm_1 | BFF | 79 | False | 25 | -0.357 | 0.3604065 | -3.2628 | 0.684 | 0.115 | 0.8120159 | 56 | 77 | 54 | 70.13 | 96.43 |
17 | 305:week_26_arm_1 | 107:week_0_arm_1 | BFF | 79 | False | 26 | -0.275 | 0.3916958 | -3.0377 | 0.671 | 0.001 | 0.8410353 | 55 | 77 | 53 | 68.83 | 96.36 |
18 | 365:week_14_arm_1 | 107:week_0_arm_1 | BFF | 80 | False | 26 | -0.275 | 0.3916958 | -3.1305 | 0.675 | 0.038 | 0.8319440 | 57 | 77 | 54 | 70.13 | 94.74 |
19 | 215:week_2_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 27 | -0.192 | 0.4237022 | -2.7175 | 0.654 | -0.151 | 0.8752035 | 52 | 77 | 51 | 66.23 | 98.08 |
20 | 215:week_26_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 27 | -0.192 | 0.4237022 | -2.7175 | 0.654 | -0.151 | 0.8752035 | 52 | 77 | 51 | 66.23 | 98.08 |
21 | 257:week_0_arm_1 | 107:week_0_arm_1 | BFF | 102 | False | 29 | -0.027 | 0.4890344 | -4.3566 | 0.716 | 0.403 | 0.7249040 | 98 | 77 | 73 | 94.81 | 74.49 |
22 | 215:week_14_arm_1 | 107:week_0_arm_1 | BFF | 84 | False | 29 | -0.027 | 0.4890344 | -2.8368 | 0.655 | -0.143 | 0.8735091 | 62 | 77 | 55 | 71.43 | 88.71 |
23 | 275:week_14_arm_1 | 107:week_0_arm_1 | BFF | 80 | False | 30 | 0.055 | 0.5219230 | -2.2361 | 0.625 | -0.410 | 0.9206854 | 53 | 77 | 50 | 64.94 | 94.34 |
24 | 365:week_26_arm_1 | 107:week_0_arm_1 | BFF | 83 | False | 30 | 0.055 | 0.5219230 | -2.5246 | 0.639 | -0.288 | 0.9011791 | 59 | 77 | 53 | 68.83 | 89.83 |
25 | 215:week_78_arm_1 | 107:week_0_arm_1 | BFF | 86 | False | 32 | 0.220 | 0.5870339 | -2.3723 | 0.628 | -0.384 | 0.9167688 | 63 | 77 | 54 | 70.13 | 85.71 |
26 | 527:week_26_arm_1 | 107:week_0_arm_1 | BFF | 86 | False | 32 | 0.220 | 0.5870339 | -2.3723 | 0.628 | -0.384 | 0.9167688 | 63 | 77 | 54 | 70.13 | 85.71 |
27 | 125:week_78_arm_1 | 107:week_0_arm_1 | BFF | 94 | False | 40 | 0.880 | 0.8104854 | -1.4440 | 0.574 | -0.862 | 0.9687183 | 71 | 77 | 54 | 70.13 | 76.06 |
28 | 527:week_52_arm_1 | 107:week_0_arm_1 | BFF | 98 | False | 43 | 1.127 | 0.8701495 | -1.2122 | 0.561 | -0.981 | 0.9761986 | 76 | 77 | 55 | 71.43 | 72.37 |
29 | 125:week_52_arm_1 | 107:week_0_arm_1 | BFF | 98 | False | 43 | 1.127 | 0.8701495 | -1.2122 | 0.561 | -0.981 | 0.9761986 | 76 | 77 | 55 | 71.43 | 72.37 |
30 | 365:week_52_arm_1 | 107:week_0_arm_1 | BFF | 99 | False | 45 | 1.292 | 0.9018282 | -0.9045 | 0.545 | -1.122 | 0.9830870 | 76 | 77 | 54 | 70.13 | 71.05 |
31 | 257:week_14_arm_1 | 107:week_0_arm_1 | BFF | 99 | False | 45 | 1.292 | 0.9018282 | -0.9045 | 0.545 | -1.122 | 0.9830870 | 76 | 77 | 54 | 70.13 | 71.05 |
32 | 305:week_52_arm_1 | 107:week_0_arm_1 | BFF | 99 | False | 45 | 1.292 | 0.9018282 | -0.9045 | 0.545 | -1.122 | 0.9830870 | 76 | 77 | 54 | 70.13 | 71.05 |
33 | 257:week_2_arm_1 | 107:week_0_arm_1 | BFF | 99 | False | 45 | 1.292 | 0.9018282 | -0.9045 | 0.545 | -1.122 | 0.9830870 | 76 | 77 | 54 | 70.13 | 71.05 |
34 | 215:week_52_arm_1 | 107:week_0_arm_1 | BFF | 104 | False | 49 | 1.622 | 0.9475899 | -0.5883 | 0.529 | -1.271 | 0.9884232 | 82 | 77 | 55 | 71.43 | 67.07 |
35 | 257:week_26_arm_1 | 107:week_0_arm_1 | BFF | 103 | False | 50 | 1.704 | 0.9558461 | -0.2956 | 0.515 | -1.399 | 0.9917759 | 79 | 77 | 53 | 68.83 | 67.09 |
36 | 275:week_52_arm_1 | 107:week_0_arm_1 | BFF | 105 | False | 51 | 1.787 | 0.9630202 | -0.2928 | 0.514 | -1.401 | 0.9918315 | 82 | 77 | 54 | 70.13 | 65.85 |
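The DISTANCE-Z-SCORE and DISTANCE-P-VALUE columns can be cross-checked from the HAMMING-DISTANCE column alone. The sketch below is an inference from the numbers in the table, not a description of Pheno-ranker's implementation: it assumes the Z-score uses the sample standard deviation of all 36 distances and that the p-value is the lower-tail probability of a standard normal distribution.

```python
from statistics import NormalDist, mean, stdev

# HAMMING-DISTANCE column of the rank.txt table above (36 comparisons)
distances = [0, 6, 14, 16, 18, 20, 23, 23, 23, 23, 24, 24, 24, 24, 25, 25,
             26, 26, 27, 27, 29, 29, 30, 30, 32, 32, 40, 43, 43, 45, 45, 45,
             45, 49, 50, 51]

mu = mean(distances)      # empirical mean of all comparisons
sigma = stdev(distances)  # sample standard deviation (assumption)


def empirical_z(distance: float) -> float:
    """Z-score of one distance relative to all observed distances."""
    return (distance - mu) / sigma


def lower_tail_p(z: float) -> float:
    """Probability of a standard normal value at or below z (assumption)."""
    return NormalDist().cdf(z)


z = empirical_z(0)
print(f"{z:.3f}")                # -2.419, as in the first row
print(f"{lower_tail_p(z):.7f}")  # close to 0.0077787 in the first row
```

Under these assumptions the first rows of the table are reproduced to the displayed precision.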
Using several reference cohorts mirrors handling a single cohort; the sole distinction is the addition of a prefix to each primary_key, enabling us to trace the origin of every individual.
Let's reuse individuals.json several times to simulate having more than one cohort.
Example:
pheno-ranker -r individuals.json individuals.json individuals.json -t patient.json --max-out 10 -o rank_multiple.txt
This will create the text file rank_multiple.txt.
See results from rank_multiple.txt
RANK | REFERENCE(ID) | TARGET(ID) | FORMAT | LENGTH | WEIGHTED | HAMMING-DISTANCE | DISTANCE-Z-SCORE | DISTANCE-P-VALUE | DISTANCE-Z-SCORE(RAND) | JACCARD-INDEX | JACCARD-Z-SCORE | JACCARD-P-VALUE | REFERENCE-VARS | TARGET-VARS | INTERSECT | INTERSECT-RATE(%) | COMPLETENESS(%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | C2_107:week_0_arm_1 | 107:week_0_arm_1 | BFF | 77 | False | 1 | -2.306 | 0.0105624 | -8.5470 | 0.987 | 2.804 | 0.0356370 | 77 | 77 | 76 | 98.70 | 98.70 |
2 | C3_107:week_0_arm_1 | 107:week_0_arm_1 | BFF | 77 | False | 1 | -2.306 | 0.0105624 | -8.5470 | 0.987 | 2.804 | 0.0356370 | 77 | 77 | 76 | 98.70 | 98.70 |
3 | C1_107:week_0_arm_1 | 107:week_0_arm_1 | BFF | 77 | False | 1 | -2.306 | 0.0105624 | -8.5470 | 0.987 | 2.804 | 0.0356370 | 77 | 77 | 76 | 98.70 | 98.70 |
4 | C1_125:week_0_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 5 | -1.969 | 0.0244763 | -7.6995 | 0.936 | 2.340 | 0.0901212 | 75 | 77 | 73 | 94.81 | 97.33 |
5 | C3_125:week_0_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 5 | -1.969 | 0.0244763 | -7.6995 | 0.936 | 2.340 | 0.0901212 | 75 | 77 | 73 | 94.81 | 97.33 |
6 | C2_125:week_0_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 5 | -1.969 | 0.0244763 | -7.6995 | 0.936 | 2.340 | 0.0901212 | 75 | 77 | 73 | 94.81 | 97.33 |
7 | C2_275:week_0_arm_1 | 107:week_0_arm_1 | BFF | 85 | False | 13 | -1.296 | 0.0975704 | -6.3994 | 0.847 | 1.534 | 0.2966472 | 81 | 77 | 72 | 93.51 | 88.89 |
8 | C1_275:week_0_arm_1 | 107:week_0_arm_1 | BFF | 85 | False | 13 | -1.296 | 0.0975704 | -6.3994 | 0.847 | 1.534 | 0.2966472 | 81 | 77 | 72 | 93.51 | 88.89 |
9 | C3_275:week_0_arm_1 | 107:week_0_arm_1 | BFF | 85 | False | 13 | -1.296 | 0.0975704 | -6.3994 | 0.847 | 1.534 | 0.2966472 | 81 | 77 | 72 | 93.51 | 88.89 |
10 | C1_215:week_0_arm_1 | 107:week_0_arm_1 | BFF | 87 | False | 15 | -1.127 | 0.1298396 | -6.1110 | 0.828 | 1.357 | 0.3603912 | 83 | 77 | 72 | 93.51 | 86.75 |
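The C1_, C2_ and C3_ prefixes in the table above follow the order in which the cohorts were passed. As a rough illustration of the idea only (the function and the toy records are hypothetical, and this is not Pheno-ranker's code), prefixing could look like this in Python:

```python
def prefix_primary_keys(cohorts: list, key: str = "id") -> list:
    """Return copies of all individuals with a cohort prefix (C1_, C2_, ...) on their key."""
    prefixed = []
    for number, cohort in enumerate(cohorts, start=1):
        for individual in cohort:
            # Copy the record so the input cohorts are left untouched
            prefixed.append({**individual, key: f"C{number}_{individual[key]}"})
    return prefixed


# The same toy cohort passed three times, as in the example command
cohort = [{"id": "107:week_0_arm_1"}, {"id": "125:week_0_arm_1"}]
merged = prefix_primary_keys([cohort, cohort, cohort])
print([individual["id"] for individual in merged][:3])
# ['C1_107:week_0_arm_1', 'C1_125:week_0_arm_1', 'C2_107:week_0_arm_1']
```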
Why is the distance for 107:week_0_arm_1 not 0 if the three cohorts are identical?
In Patient mode, the global vector is formed using variables solely from the reference cohort(s), not from the patient. The primary_key (id in this context) is automatically included, leading to a distance of 1 due to the mismatch in the individual's id field.
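A toy Python sketch of this effect follows. It is a simplification under stated assumptions: the variable names are shortened from the alignment extract further down, the target's id variable name is hypothetical, and the encoding is reduced to set membership.

```python
def encode(variables: set, vocabulary: list) -> list:
    """One-hot encode a set of variables against a fixed vocabulary."""
    return [1 if name in variables else 0 for name in vocabulary]


def hamming(a: list, b: list) -> int:
    """Count the positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


reference = {"id.C1_107:week_0_arm_1", "ethnicity.id.NCIT:C41261",
             "exposures.NCIT:C154329.unit.id.NCIT:C65108"}
target = {"id.107:week_0_arm_1", "ethnicity.id.NCIT:C41261",  # hypothetical id variable
          "exposures.NCIT:C154329.unit.id.NCIT:C65108"}

# The global vector is built from the reference variables only,
# so the target's own id variable has no position in it.
vocabulary = sorted(reference)
distance_with_id = hamming(encode(reference, vocabulary), encode(target, vocabulary))

# Dropping the id variables mimics the effect of excluding the id term.
no_id = [name for name in vocabulary if not name.startswith("id.")]
distance_without_id = hamming(encode(reference, no_id), encode(target, no_id))

print(distance_with_id, distance_without_id)  # 1 0
```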
If you want to visualize the differences in all variables (i.e., the union of reference(s) and target), simply add the target as another cohort in --r. This way, the variables from the patient will be included in the reference vector.
Note that you can exclude id by adding --exclude-terms id.
Obtaining additional information on the alignments
You can create several files related to the reference --- target alignment by adding --align. By default, it will create files (alignment*) in the current directory, but you can specify a </path/basename> instead. Example:
Or using a path + basename:
pheno-ranker -r individuals.json individuals.json -t patient.json --align /my/fav/dir/jobid-001-align
Find below an extract of the alignment (C1_107:week_0_arm_1 --- 107:week_0_arm_1) from alignment.txt:
REF -- TAR
1 ----- 1 | (w: 1|d: 0|cd: 0|) diseases.NCIT:C3138.diseaseCode.id.NCIT:C3138 (Inflammatory Bowel Disease)
1 ----- 1 | (w: 1|d: 0|cd: 0|) ethnicity.id.NCIT:C41261 (Caucasian)
1 ----- 1 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C154329.exposureCode.id.NCIT:C154329 (Smoking)
1 ----- 1 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C154329.unit.id.NCIT:C65108 (Never Smoker)
0 0 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C154329.unit.id.NCIT:C67147 (Current Smoker)
0 0 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C154329.unit.id.NCIT:C67148 (Former Smoker)
1 ----- 1 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C2190.exposureCode.id.NCIT:C2190 (Alcohol)
0 0 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C2190.unit.id.NCIT:C126379 (Non-Drinker)
0 0 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C2190.unit.id.NCIT:C156821 (Alcohol Consumption More than 2 Drinks per Day for Men and More than 1 Drink per Day for Women)
1 ----- 1 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C2190.unit.id.NCIT:C17998 (Unknown)
0 0 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C73993.exposureCode.id.NCIT:C73993 (Pack Year)
0 0 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C73993.unit.id.NCIT:C73993 (Pack Year)
1 xxx-- 0 | (w: 1|d: 1|cd: 1|) id.C1_107:week_0_arm_1 (id.C1_107:week_0_arm_1)
0 0 | (w: 1|d: 0|cd: 1|) id.C1_107:week_14_arm_1 (id.C1_107:week_14_arm_1)
0 0 | (w: 1|d: 0|cd: 1|) id.C1_107:week_2_arm_1 (id.C1_107:week_2_arm_1)
0 0 | (w: 1|d: 0|cd: 1|) id.C1_125:week_0_arm_1 (id.C1_125:week_0_arm_1)
0 0 | (w: 1|d: 0|cd: 1|) id.C1_125:week_14_arm_1 (id.C1_125:week_14_arm_1)
...
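If you need to post-process such an extract, the lines can be parsed with a regular expression. The Python sketch below is based only on the layout shown above; reading `w` as the weight, `d` as the per-variable distance and `cd` as the cumulative distance is an assumption, and `parse_alignment_line` is a hypothetical helper rather than part of Pheno-ranker.

```python
import re

# REF bit, optional match/mismatch marker, TAR bit, then (w|d|cd) and the variable label
LINE = re.compile(
    r"^\s*(?P<ref>[01])\s+(?:[-x]+\s+)?(?P<tar>[01])\s*\|\s*"
    r"\(w:\s*(?P<w>[^|]+)\|d:\s*(?P<d>[^|]+)\|cd:\s*(?P<cd>[^|]+)\|\)\s*(?P<label>.*)$"
)


def parse_alignment_line(line: str):
    """Return a dict for an alignment row, or None for headers and other lines."""
    match = LINE.match(line)
    if match is None:
        return None
    return {
        "ref": int(match["ref"]),
        "tar": int(match["tar"]),
        "w": float(match["w"]),
        "d": float(match["d"]),
        "cd": float(match["cd"]),
        "label": match["label"],
    }


lines = [
    "REF -- TAR",
    "1 ----- 1 | (w: 1|d: 0|cd: 0|) ethnicity.id.NCIT:C41261 (Caucasian)",
    "0 0 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C73993.unit.id.NCIT:C73993 (Pack Year)",
    "1 xxx-- 0 | (w: 1|d: 1|cd: 1|) id.C1_107:week_0_arm_1 (id.C1_107:week_0_arm_1)",
    "...",
]
rows = [row for row in map(parse_alignment_line, lines) if row is not None]
mismatches = [row["label"] for row in rows if row["d"] > 0]
print(len(rows), mismatches)
# 3 ['id.C1_107:week_0_arm_1 (id.C1_107:week_0_arm_1)']
```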