Patient Mode
Patient mode ranks records in a reference cohort against a target patient or object. It uses the same flattened variables and binary-vector representation as cohort mode, but the output is a ranked table instead of an all-vs-all matrix.
Use patient mode when you want to find the closest matches to a patient profile, inspect which variables overlap, or assess match significance with Z-scores and p-values.
pheno-ranker -r cohort.json -t patient.json
This command writes the ranked matches to rank.txt.
When to Use It
- Match: Find similar records. Rank every reference record against one target patient or object.
- Compare: Multiple cohorts. Use several reference files and keep each match traceable to its source cohort.
- Interpret: Read match statistics. Use Hamming distance, Jaccard similarity, Z-scores, p-values, and overlap percentages.
- Audit: Inspect alignments. Use --align to see which variables match or differ between target and reference.
What You Get
- rank.txt: ranked matches between the target and the reference cohort.
- alignment*: optional variable-level alignment files when --align is used.
- export.*.json: optional intermediate hashes, vectors, and coverage statistics when --export is used.
- Hamming distance, Jaccard similarity, Z-scores, p-values, and overlap statistics for each match.
Use patient mode when one target should be ranked against a reference cohort. Use cohort mode when every record should be compared with every other record.
Usage
The examples below show the common patient-mode command-line patterns. For the complete CLI reference, see Usage.
Against one cohort
Example:
pheno-ranker -r individuals.json -t patient.json
How do I extract one or many patients from a cohort file?
pheno-ranker -r t/data/individuals.json --patients-of-interest 107:week_0_arm_1 125:week_0_arm_1
This command will carry out a dry-run, creating 107:week_0_arm_1.json and 125:week_0_arm_1.json files.
On Windows, characters that are invalid in filenames are percent-encoded, so 107:week_0_arm_1 is written as 107%3Aweek_0_arm_1.json.
In the example above, I renamed 107:week_0_arm_1.json to patient.json by typing this:
mv 107:week_0_arm_1.json patient.json
The pheno-ranker -r individuals.json -t patient.json command shown earlier creates the output text file rank.txt.
The first rows in rank.txt are the best matches according to the selected sorting metric. By default, patient mode sorts by Hamming distance; use --sort-by jaccard to sort by Jaccard similarity instead.
For most analyses, start with these rank.txt columns:
- RANK: Match order; 1 is the best match under the selected sorting metric.
- REFERENCE(ID): The matched individual in the reference cohort.
- HAMMING-DISTANCE: Lower values indicate more similar binary profiles.
- JACCARD-INDEX: Higher values indicate more similar binary profiles.
- DISTANCE-P-VALUE / JACCARD-P-VALUE: Significance of the match within the distribution of comparisons in the run.
- INTERSECT-RATE(%): How much of the target profile is covered by the reference match.
- COMPLETENESS(%): How much of the reference profile is covered by the target.
Use Hamming distance when you want a distance-like ranking. Use Jaccard similarity when sparse overlap or missingness is important.
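To re-rank an existing rank.txt without rerunning the tool, a small Python sketch like the following may help. It assumes rank.txt is tab-separated with a header row, which this page does not state. The sample rows are two rows from the results table on this page trimmed to four columns, and the helper names are hypothetical.

```python
import csv
import io


def load_rank_table(text: str) -> list[dict[str, str]]:
    """Parse rank.txt content, assuming tab-separated columns with a header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def sort_by_jaccard(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Order rows from highest to lowest Jaccard index."""
    return sorted(rows, key=lambda row: float(row["JACCARD-INDEX"]), reverse=True)


# Two rows with the same Hamming distance but different Jaccard indices.
sample = (
    "RANK\tREFERENCE(ID)\tHAMMING-DISTANCE\tJACCARD-INDEX\n"
    "13\t527:week_2_arm_1\t24\t0.692\n"
    "14\t527:week_0_arm_1\t24\t0.755\n"
)

for row in sort_by_jaccard(load_rank_table(sample)):
    print(row["REFERENCE(ID)"], row["JACCARD-INDEX"])
```

With real output, replace the sample string with the file contents, for example open("rank.txt").read().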
Full rank.txt column reference
Identifiers and run metadata
- RANK: Match order. A rank of 1 is the best match.
- REFERENCE(ID): The unique identifier (primary_key) for the reference individual.
- TARGET(ID): The unique identifier (primary_key) for the target individual passed with --target.
- FORMAT: Input format used by the configuration, such as BFF, PXF, or CSV.
- WEIGHTED: Whether the calculation used variable weights with --weights.
Alignment size
LENGTH: Count of variables that have a1in either the reference or the target. In other words, this is the size of the comparison space for that pair.
LENGTH example
REF: 0001001
TAR: 1000001
In this case, LENGTH is 3 because three positions have a 1 in at least one vector.
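The same count can be sketched in Python, treating each vector as a string of 0 and 1 characters. The helper name alignment_length is illustrative and not part of Pheno-Ranker.

```python
def alignment_length(ref: str, tar: str) -> int:
    """Count positions where at least one of the two vectors has a 1."""
    if len(ref) != len(tar):
        raise ValueError("vectors must have the same length")
    return sum(1 for r, t in zip(ref, tar) if r == "1" or t == "1")


print(alignment_length("0001001", "1000001"))  # 3
```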
Similarity and distance metrics
- HAMMING-DISTANCE: Count of positions where the reference and target binary vectors differ. Lower values indicate more similar profiles.
- JACCARD-INDEX: Similarity between the reference and target vectors, calculated as the intersection divided by the union. Higher values indicate more similar profiles.
Metric definitions
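Hamming distance counts mismatches between two binary strings of equal length. For vectors \( A \) and \( B \) with \( n \) positions:

\[ d_H(A, B) = \sum_{i=1}^{n} \lvert a_i - b_i \rvert \]

Jaccard similarity focuses on shared 1 values. It divides the number of positions set to 1 in both vectors by the number of positions set to 1 in at least one of them:

\[ J(A, B) = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert} \]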
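Both metrics can be sketched in a few lines of Python for unweighted binary strings. The function names are illustrative, and returning 0.0 when neither vector contains a 1 is an assumption made here, not documented Pheno-Ranker behavior.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length binary strings differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def jaccard_index(a: str, b: str) -> float:
    """Shared 1 positions divided by positions with a 1 in either vector."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(x == "1" and y == "1" for x, y in zip(a, b))
    either = sum(x == "1" or y == "1" for x, y in zip(a, b))
    # Assumption: two all-zero vectors are given a similarity of 0.0.
    return both / either if either else 0.0


ref, tar = "0001001", "1000001"
print(hamming_distance(ref, tar))         # 2
print(round(jaccard_index(ref, tar), 3))  # 0.333
```

For these vectors the two positions that differ give a Hamming distance of 2, while the single shared 1 out of three occupied positions gives a Jaccard index of 1/3.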
Significance statistics
- DISTANCE-Z-SCORE: Empirical Z-score for the observed Hamming distance compared with all target-reference comparisons in the run.
- DISTANCE-P-VALUE: Statistical significance associated with DISTANCE-Z-SCORE.
- DISTANCE-Z-SCORE(RAND): Estimated Z-score for two random binary vectors, assuming the alignment size is equal to LENGTH.
- JACCARD-Z-SCORE: Empirical Z-score for the observed Jaccard index compared with all target-reference comparisons in the run.
- JACCARD-P-VALUE: Statistical significance associated with JACCARD-Z-SCORE.
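The following Python sketch illustrates one way an empirical Z-score and its p-value can be derived from the distances in a run. It uses the HAMMING-DISTANCE column of the 36-row single-cohort rank.txt example on this page. Two choices are assumptions rather than documented behavior: the sample standard deviation (n - 1 denominator) and a one-sided, lower-tail normal p-value. They are used here because they reproduce the DISTANCE-Z-SCORE and DISTANCE-P-VALUE values in that table. The Jaccard p-value is not covered.

```python
from statistics import NormalDist, mean, stdev

# HAMMING-DISTANCE values copied from the single-cohort rank.txt example.
distances = [
    0, 6, 14, 16, 18, 20, 23, 23, 23, 23, 24, 24, 24, 24, 25, 25, 26, 26,
    27, 27, 29, 29, 30, 30, 32, 32, 40, 43, 43, 45, 45, 45, 45, 49, 50, 51,
]


def empirical_z(value: float, values: list[float]) -> float:
    """Z-score of one value relative to all comparisons in the run."""
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return (value - mean(values)) / stdev(values)


def lower_tail_p(z: float) -> float:
    """Assumed one-sided p-value: probability of a Z-score this low or lower."""
    return NormalDist().cdf(z)


z_best = empirical_z(0, distances)
print(round(z_best, 3))                # -2.419
print(round(lower_tail_p(z_best), 4))  # 0.0078
```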
DISTANCE-Z-SCORE(RAND) calculation
This value comes from the estimated mean and standard deviation of the Hamming distance for binary strings. It assumes that each position has a 50% chance of being a mismatch, independently of other positions.
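The expected mean is:

\[ \mu = n \cdot p \]

where \( n \) is the alignment size (LENGTH) and the probability of mismatch \( p \) is set to 0.5.

The standard deviation is:

\[ \sigma = \sqrt{n \cdot p \cdot (1 - p)} \]

Finally, the formula for the Z-score is:

\[ Z = \frac{X - \mu}{\sigma} \]

where \( X \) is the observed value, \( \mu \) is the estimated mean, and \( \sigma \) is the estimated standard deviation.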
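As a quick check, the calculation can be sketched in Python. It assumes the binomial mean n·p and standard deviation sqrt(n·p·(1 - p)) with n equal to LENGTH and p = 0.5. The function name is illustrative, and the inputs are LENGTH and HAMMING-DISTANCE from the first two rows of the single-cohort rank.txt example.

```python
from math import sqrt


def random_z_score(hamming: float, length: int, p: float = 0.5) -> float:
    """Z-score of an observed Hamming distance against random binary vectors."""
    if length <= 0:
        raise ValueError("length must be positive")
    mu = length * p                     # expected number of mismatches
    sigma = sqrt(length * p * (1 - p))  # binomial standard deviation
    return (hamming - mu) / sigma


print(f"{random_z_score(0, 77):.4f}")  # -8.7750
print(f"{random_z_score(6, 79):.4f}")  # -7.5381
```

Both results agree with the DISTANCE-Z-SCORE(RAND) column for ranks 1 and 2.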
Variable overlap
- REFERENCE-VARS: Total number of variables present in the reference.
- TARGET-VARS: Total number of variables present in the target.
- INTERSECT: Number of variables shared by the reference and target.
- INTERSECT-RATE(%): Percentage of target variables also present in the reference.
- COMPLETENESS(%): Percentage of reference variables also present in the target.
INTERSECT-RATE(%) calculation
INTERSECT-RATE(%) measures how much of the target profile is covered by the reference:

\[ \text{INTERSECT-RATE(\%)} = \frac{\text{INTERSECT}}{\text{TARGET-VARS}} \times 100 \]
COMPLETENESS(%) calculation
COMPLETENESS(%) measures how much of the reference profile is covered by the target:

\[ \text{COMPLETENESS(\%)} = \frac{\text{INTERSECT}}{\text{REFERENCE-VARS}} \times 100 \]
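A short Python sketch of both percentages, assuming they are INTERSECT divided by TARGET-VARS and by REFERENCE-VARS respectively. The function name is illustrative, and the inputs come from rank 2 of the single-cohort rank.txt example (INTERSECT 73, TARGET-VARS 77, REFERENCE-VARS 75).

```python
def overlap_rates(intersect: int, target_vars: int, reference_vars: int) -> tuple[float, float]:
    """Return (INTERSECT-RATE, COMPLETENESS) as percentages."""
    intersect_rate = 100 * intersect / target_vars   # share of the target covered
    completeness = 100 * intersect / reference_vars  # share of the reference covered
    return intersect_rate, completeness


rate, completeness = overlap_rates(intersect=73, target_vars=77, reference_vars=75)
print(f"{rate:.2f} {completeness:.2f}")  # 94.81 97.33
```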
See results from rank.txt
| RANK | REFERENCE(ID) | TARGET(ID) | FORMAT | LENGTH | WEIGHTED | HAMMING-DISTANCE | DISTANCE-Z-SCORE | DISTANCE-P-VALUE | DISTANCE-Z-SCORE(RAND) | JACCARD-INDEX | JACCARD-Z-SCORE | JACCARD-P-VALUE | REFERENCE-VARS | TARGET-VARS | INTERSECT | INTERSECT-RATE(%) | COMPLETENESS(%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 107:week_0_arm_1 | 107:week_0_arm_1 | BFF | 77 | False | 0 | -2.419 | 0.0077787 | -8.7750 | 1.000 | 2.949 | 0.0256500 | 77 | 77 | 77 | 100.00 | 100.00 |
| 2 | 125:week_0_arm_1 | 107:week_0_arm_1 | BFF | 79 | False | 6 | -1.924 | 0.0271576 | -7.5381 | 0.924 | 2.269 | 0.1022693 | 75 | 77 | 73 | 94.81 | 97.33 |
| 3 | 275:week_0_arm_1 | 107:week_0_arm_1 | BFF | 86 | False | 14 | -1.265 | 0.1030165 | -6.2543 | 0.837 | 1.491 | 0.3117348 | 81 | 77 | 72 | 93.51 | 88.89 |
| 4 | 215:week_0_arm_1 | 107:week_0_arm_1 | BFF | 88 | False | 16 | -1.100 | 0.1357515 | -5.9696 | 0.818 | 1.321 | 0.3742868 | 83 | 77 | 72 | 93.51 | 86.75 |
| 5 | 305:week_0_arm_1 | 107:week_0_arm_1 | BFF | 89 | False | 18 | -0.935 | 0.1749800 | -5.6180 | 0.798 | 1.138 | 0.4452980 | 83 | 77 | 71 | 92.21 | 85.54 |
| 6 | 365:week_0_arm_1 | 107:week_0_arm_1 | BFF | 87 | False | 20 | -0.770 | 0.2207314 | -5.0389 | 0.770 | 0.890 | 0.5437899 | 77 | 77 | 67 | 87.01 | 87.01 |
| 7 | 125:week_14_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 23 | -0.522 | 0.3007259 | -3.6233 | 0.705 | 0.308 | 0.7555423 | 56 | 77 | 55 | 71.43 | 98.21 |
| 8 | 527:week_14_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 23 | -0.522 | 0.3007259 | -3.6233 | 0.705 | 0.308 | 0.7555423 | 56 | 77 | 55 | 71.43 | 98.21 |
| 9 | 107:week_14_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 23 | -0.522 | 0.3007259 | -3.6233 | 0.705 | 0.308 | 0.7555423 | 56 | 77 | 55 | 71.43 | 98.21 |
| 10 | 125:week_2_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 23 | -0.522 | 0.3007259 | -3.6233 | 0.705 | 0.308 | 0.7555423 | 56 | 77 | 55 | 71.43 | 98.21 |
| 11 | 107:week_2_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 24 | -0.440 | 0.3300253 | -3.3968 | 0.692 | 0.193 | 0.7901267 | 55 | 77 | 54 | 70.13 | 98.18 |
| 12 | 125:week_26_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 24 | -0.440 | 0.3300253 | -3.3968 | 0.692 | 0.193 | 0.7901267 | 55 | 77 | 54 | 70.13 | 98.18 |
| 13 | 527:week_2_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 24 | -0.440 | 0.3300253 | -3.3968 | 0.692 | 0.193 | 0.7901267 | 55 | 77 | 54 | 70.13 | 98.18 |
| 14 | 527:week_0_arm_1 | 107:week_0_arm_1 | BFF | 98 | False | 24 | -0.440 | 0.3300253 | -5.0508 | 0.755 | 0.756 | 0.5965581 | 95 | 77 | 74 | 96.10 | 77.89 |
| 15 | 365:week_2_arm_1 | 107:week_0_arm_1 | BFF | 79 | False | 25 | -0.357 | 0.3604065 | -3.2628 | 0.684 | 0.115 | 0.8120159 | 56 | 77 | 54 | 70.13 | 96.43 |
| 16 | 275:week_2_arm_1 | 107:week_0_arm_1 | BFF | 79 | False | 25 | -0.357 | 0.3604065 | -3.2628 | 0.684 | 0.115 | 0.8120159 | 56 | 77 | 54 | 70.13 | 96.43 |
| 17 | 305:week_26_arm_1 | 107:week_0_arm_1 | BFF | 79 | False | 26 | -0.275 | 0.3916958 | -3.0377 | 0.671 | 0.001 | 0.8410353 | 55 | 77 | 53 | 68.83 | 96.36 |
| 18 | 365:week_14_arm_1 | 107:week_0_arm_1 | BFF | 80 | False | 26 | -0.275 | 0.3916958 | -3.1305 | 0.675 | 0.038 | 0.8319440 | 57 | 77 | 54 | 70.13 | 94.74 |
| 19 | 215:week_2_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 27 | -0.192 | 0.4237022 | -2.7175 | 0.654 | -0.151 | 0.8752035 | 52 | 77 | 51 | 66.23 | 98.08 |
| 20 | 215:week_26_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 27 | -0.192 | 0.4237022 | -2.7175 | 0.654 | -0.151 | 0.8752035 | 52 | 77 | 51 | 66.23 | 98.08 |
| 21 | 257:week_0_arm_1 | 107:week_0_arm_1 | BFF | 102 | False | 29 | -0.027 | 0.4890344 | -4.3566 | 0.716 | 0.403 | 0.7249040 | 98 | 77 | 73 | 94.81 | 74.49 |
| 22 | 215:week_14_arm_1 | 107:week_0_arm_1 | BFF | 84 | False | 29 | -0.027 | 0.4890344 | -2.8368 | 0.655 | -0.143 | 0.8735091 | 62 | 77 | 55 | 71.43 | 88.71 |
| 23 | 275:week_14_arm_1 | 107:week_0_arm_1 | BFF | 80 | False | 30 | 0.055 | 0.5219230 | -2.2361 | 0.625 | -0.410 | 0.9206854 | 53 | 77 | 50 | 64.94 | 94.34 |
| 24 | 365:week_26_arm_1 | 107:week_0_arm_1 | BFF | 83 | False | 30 | 0.055 | 0.5219230 | -2.5246 | 0.639 | -0.288 | 0.9011791 | 59 | 77 | 53 | 68.83 | 89.83 |
| 25 | 215:week_78_arm_1 | 107:week_0_arm_1 | BFF | 86 | False | 32 | 0.220 | 0.5870339 | -2.3723 | 0.628 | -0.384 | 0.9167688 | 63 | 77 | 54 | 70.13 | 85.71 |
| 26 | 527:week_26_arm_1 | 107:week_0_arm_1 | BFF | 86 | False | 32 | 0.220 | 0.5870339 | -2.3723 | 0.628 | -0.384 | 0.9167688 | 63 | 77 | 54 | 70.13 | 85.71 |
| 27 | 125:week_78_arm_1 | 107:week_0_arm_1 | BFF | 94 | False | 40 | 0.880 | 0.8104854 | -1.4440 | 0.574 | -0.862 | 0.9687183 | 71 | 77 | 54 | 70.13 | 76.06 |
| 28 | 527:week_52_arm_1 | 107:week_0_arm_1 | BFF | 98 | False | 43 | 1.127 | 0.8701495 | -1.2122 | 0.561 | -0.981 | 0.9761986 | 76 | 77 | 55 | 71.43 | 72.37 |
| 29 | 125:week_52_arm_1 | 107:week_0_arm_1 | BFF | 98 | False | 43 | 1.127 | 0.8701495 | -1.2122 | 0.561 | -0.981 | 0.9761986 | 76 | 77 | 55 | 71.43 | 72.37 |
| 30 | 365:week_52_arm_1 | 107:week_0_arm_1 | BFF | 99 | False | 45 | 1.292 | 0.9018282 | -0.9045 | 0.545 | -1.122 | 0.9830870 | 76 | 77 | 54 | 70.13 | 71.05 |
| 31 | 257:week_14_arm_1 | 107:week_0_arm_1 | BFF | 99 | False | 45 | 1.292 | 0.9018282 | -0.9045 | 0.545 | -1.122 | 0.9830870 | 76 | 77 | 54 | 70.13 | 71.05 |
| 32 | 305:week_52_arm_1 | 107:week_0_arm_1 | BFF | 99 | False | 45 | 1.292 | 0.9018282 | -0.9045 | 0.545 | -1.122 | 0.9830870 | 76 | 77 | 54 | 70.13 | 71.05 |
| 33 | 257:week_2_arm_1 | 107:week_0_arm_1 | BFF | 99 | False | 45 | 1.292 | 0.9018282 | -0.9045 | 0.545 | -1.122 | 0.9830870 | 76 | 77 | 54 | 70.13 | 71.05 |
| 34 | 215:week_52_arm_1 | 107:week_0_arm_1 | BFF | 104 | False | 49 | 1.622 | 0.9475899 | -0.5883 | 0.529 | -1.271 | 0.9884232 | 82 | 77 | 55 | 71.43 | 67.07 |
| 35 | 257:week_26_arm_1 | 107:week_0_arm_1 | BFF | 103 | False | 50 | 1.704 | 0.9558461 | -0.2956 | 0.515 | -1.399 | 0.9917759 | 79 | 77 | 53 | 68.83 | 67.09 |
| 36 | 275:week_52_arm_1 | 107:week_0_arm_1 | BFF | 105 | False | 51 | 1.787 | 0.9630202 | -0.2928 | 0.514 | -1.401 | 0.9918315 | 82 | 77 | 54 | 70.13 | 65.85 |
Against multiple cohorts
The process mirrors handling a single cohort; the main difference is that each reference cohort gets a prefix in its primary_key, making it possible to trace the origin of every individual.
We reuse individuals.json to simulate more than one cohort.
Example:
pheno-ranker -r individuals.json individuals.json individuals.json -t patient.json --max-out 10 -o rank_multiple.txt
This will create the text file rank_multiple.txt.
See results from rank_multiple.txt
| RANK | REFERENCE(ID) | TARGET(ID) | FORMAT | LENGTH | WEIGHTED | HAMMING-DISTANCE | DISTANCE-Z-SCORE | DISTANCE-P-VALUE | DISTANCE-Z-SCORE(RAND) | JACCARD-INDEX | JACCARD-Z-SCORE | JACCARD-P-VALUE | REFERENCE-VARS | TARGET-VARS | INTERSECT | INTERSECT-RATE(%) | COMPLETENESS(%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C2_107:week_0_arm_1 | 107:week_0_arm_1 | BFF | 77 | False | 1 | -2.306 | 0.0105624 | -8.5470 | 0.987 | 2.804 | 0.0356370 | 77 | 77 | 76 | 98.70 | 98.70 |
| 2 | C3_107:week_0_arm_1 | 107:week_0_arm_1 | BFF | 77 | False | 1 | -2.306 | 0.0105624 | -8.5470 | 0.987 | 2.804 | 0.0356370 | 77 | 77 | 76 | 98.70 | 98.70 |
| 3 | C1_107:week_0_arm_1 | 107:week_0_arm_1 | BFF | 77 | False | 1 | -2.306 | 0.0105624 | -8.5470 | 0.987 | 2.804 | 0.0356370 | 77 | 77 | 76 | 98.70 | 98.70 |
| 4 | C1_125:week_0_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 5 | -1.969 | 0.0244763 | -7.6995 | 0.936 | 2.340 | 0.0901212 | 75 | 77 | 73 | 94.81 | 97.33 |
| 5 | C3_125:week_0_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 5 | -1.969 | 0.0244763 | -7.6995 | 0.936 | 2.340 | 0.0901212 | 75 | 77 | 73 | 94.81 | 97.33 |
| 6 | C2_125:week_0_arm_1 | 107:week_0_arm_1 | BFF | 78 | False | 5 | -1.969 | 0.0244763 | -7.6995 | 0.936 | 2.340 | 0.0901212 | 75 | 77 | 73 | 94.81 | 97.33 |
| 7 | C2_275:week_0_arm_1 | 107:week_0_arm_1 | BFF | 85 | False | 13 | -1.296 | 0.0975704 | -6.3994 | 0.847 | 1.534 | 0.2966472 | 81 | 77 | 72 | 93.51 | 88.89 |
| 8 | C1_275:week_0_arm_1 | 107:week_0_arm_1 | BFF | 85 | False | 13 | -1.296 | 0.0975704 | -6.3994 | 0.847 | 1.534 | 0.2966472 | 81 | 77 | 72 | 93.51 | 88.89 |
| 9 | C3_275:week_0_arm_1 | 107:week_0_arm_1 | BFF | 85 | False | 13 | -1.296 | 0.0975704 | -6.3994 | 0.847 | 1.534 | 0.2966472 | 81 | 77 | 72 | 93.51 | 88.89 |
| 10 | C1_215:week_0_arm_1 | 107:week_0_arm_1 | BFF | 87 | False | 15 | -1.127 | 0.1298396 | -6.1110 | 0.828 | 1.357 | 0.3603912 | 83 | 77 | 72 | 93.51 | 86.75 |
Why is the distance for 107:week_0_arm_1 not 0 if the three cohorts are identical?
In patient mode, the global vector is built solely from the variables in the reference cohort(s), not from the target patient. The primary_key (id in this context) is included automatically, and with multiple cohorts each reference id carries a cohort prefix (for example, C1_107:week_0_arm_1). The prefixed id has no counterpart in the target, which produces a distance of 1.
To see the differences across all variables (that is, the union of the reference(s) and the target), add the target as another cohort in -r. The patient's variables will then be included in the global vector.
Note that you can exclude id by adding --exclude-terms id.
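The effect can be illustrated with a toy Python sketch. It is not Pheno-Ranker's implementation: each record is reduced to a set of two variable names, the target key id.107:week_0_arm_1 is a hypothetical label for the unprefixed id, and the helper names are invented for the example.

```python
def encode(variables: set[str], vocabulary: list[str]) -> list[int]:
    """One-hot encode a record against a fixed list of variables."""
    return [1 if name in variables else 0 for name in vocabulary]


def hamming(a: list[int], b: list[int]) -> int:
    """Count positions where the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


reference = {"id.C1_107:week_0_arm_1", "ethnicity.id.NCIT:C41261"}
target = {"id.107:week_0_arm_1", "ethnicity.id.NCIT:C41261"}  # hypothetical key

# Vocabulary from the reference only: the target's own id is not represented.
ref_vocab = sorted(reference)
print(hamming(encode(reference, ref_vocab), encode(target, ref_vocab)))  # 1

# Vocabulary from the union, as when the target is added as another cohort.
union_vocab = sorted(reference | target)
print(hamming(encode(reference, union_vocab), encode(target, union_vocab)))  # 2

# Vocabulary without id variables, similar in spirit to excluding the id term.
no_id_vocab = [name for name in ref_vocab if not name.startswith("id.")]
print(hamming(encode(reference, no_id_vocab), encode(target, no_id_vocab)))  # 0
```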
You can create several files related to the reference-target alignment by adding --align. By default, this creates alignment* files in the current directory, but you can specify a </path/basename>. Example:
pheno-ranker -r individuals.json individuals.json -t patient.json --align
Or using a path + basename:
pheno-ranker -r individuals.json individuals.json -t patient.json --align /my/fav/dir/jobid-001-align
Below is an extract of the alignment (C1_107:week_0_arm_1 --- 107:week_0_arm_1) taken from alignment.txt:
REF -- TAR
1 ----- 1 | (w: 1|d: 0|cd: 0|) diseases.NCIT:C3138.diseaseCode.id.NCIT:C3138 (Inflammatory Bowel Disease)
1 ----- 1 | (w: 1|d: 0|cd: 0|) ethnicity.id.NCIT:C41261 (Caucasian)
1 ----- 1 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C154329.exposureCode.id.NCIT:C154329 (Smoking)
1 ----- 1 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C154329.unit.id.NCIT:C65108 (Never Smoker)
0 0 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C154329.unit.id.NCIT:C67147 (Current Smoker)
0 0 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C154329.unit.id.NCIT:C67148 (Former Smoker)
1 ----- 1 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C2190.exposureCode.id.NCIT:C2190 (Alcohol)
0 0 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C2190.unit.id.NCIT:C126379 (Non-Drinker)
0 0 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C2190.unit.id.NCIT:C156821 (Alcohol Consumption More than 2 Drinks per Day for Men and More than 1 Drink per Day for Women)
1 ----- 1 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C2190.unit.id.NCIT:C17998 (Unknown)
0 0 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C73993.exposureCode.id.NCIT:C73993 (Pack Year)
0 0 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C73993.unit.id.NCIT:C73993 (Pack Year)
1 xxx-- 0 | (w: 1|d: 1|cd: 1|) id.C1_107:week_0_arm_1 (id.C1_107:week_0_arm_1)
0 0 | (w: 1|d: 0|cd: 1|) id.C1_107:week_14_arm_1 (id.C1_107:week_14_arm_1)
0 0 | (w: 1|d: 0|cd: 1|) id.C1_107:week_2_arm_1 (id.C1_107:week_2_arm_1)
0 0 | (w: 1|d: 0|cd: 1|) id.C1_125:week_0_arm_1 (id.C1_125:week_0_arm_1)
0 0 | (w: 1|d: 0|cd: 1|) id.C1_125:week_14_arm_1 (id.C1_125:week_14_arm_1)
...
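To list only the variables that differ, the alignment lines can be scanned with a small Python sketch such as the one below. It assumes the line layout matches the extract above (REF flag, optional marker, TAR flag, a parenthesized statistics group, then the variable name). The regular expression and function name are illustrative and have not been checked against every alignment file the tool can produce.

```python
import re

# REF flag, optional marker (dashes or x characters), TAR flag, stats, variable.
LINE_RE = re.compile(
    r"^\s*(?P<ref>[01])\s+(?:[-x]+\s+)?(?P<tar>[01])\s+\|\s+"
    r"\([^)]*\)\s+(?P<variable>\S+)"
)


def mismatched_variables(lines: list[str]) -> list[str]:
    """Return variable names whose REF and TAR flags differ."""
    found = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match["ref"] != match["tar"]:
            found.append(match["variable"])
    return found


extract = [
    "REF -- TAR",
    "1 ----- 1 | (w: 1|d: 0|cd: 0|) ethnicity.id.NCIT:C41261 (Caucasian)",
    "0 0 | (w: 1|d: 0|cd: 0|) exposures.NCIT:C154329.unit.id.NCIT:C67147 (Current Smoker)",
    "1 xxx-- 0 | (w: 1|d: 1|cd: 1|) id.C1_107:week_0_arm_1 (id.C1_107:week_0_arm_1)",
]
print(mismatched_variables(extract))  # ['id.C1_107:week_0_arm_1']
```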