
Phenopackets v2

PXF stands for Phenotype eXchange Format. For details, see the Phenopackets v2 documentation.

Role: Native input
Accepted input: Phenopacket JSON/YAML
Configuration: Built in
Best for: Phenopackets v2 records
Figure: Phenopackets v2 (extracted from www.ga4gh.org)

Phenopackets organize information using top-level elements. Our software, Pheno-Ranker, specifically processes data from the Phenopacket element, serialized in PXF format.

Browsing PXF JSON data

You can browse a public Phenopackets v2 file with one of the following JSON viewers:

PXF as input

The examples below show the minimal command-line patterns. For the complete CLI reference, see Usage.

What happens with deeply nested arrays such as interpretations.diagnosis.genomicInterpretations?

The genomicInterpretations property is peculiar for several reasons. It can contain multiple nested levels or arrays, and both the key "id" and the key subjectOrBiosampleId may refer to a given patient. Users are typically interested in the variants, but if patient ids end up in the flattened keys, those keys will never match another patient.

Pheno-Ranker handles this for you for the term interpretations. This is a dedicated PXF-specific transformation, because genomic interpretation records can otherwise include patient-specific identifiers in the flattened keys. The approach taken is to convert array data structures into objects.

Imagine you have PXF data that looks like this:

{
  "id": "Sample_1",
  "interpretations": [
    {
      "id": "Interpretation_1",
      "progressStatus": "SOLVED",
      "diagnosis": {
        "disease": {
          "id": "OMIM:148600",
          "label": "Disease 1"
        },
        "genomicInterpretations": [
          {
            "subjectOrBiosampleId": "Subject_1",
            "interpretationStatus": "CAUSATIVE",
            "variantInterpretation": {
              "variationDescriptor": {
                "geneContext": {
                  "valueId": "HGNC:25662",
                  "symbol": "AAGAB"
                }
              }
            }
          }
        ]
      }
    }
  ],
  "subject": {
    "id": "Subject_1"
  }
}

The processed JSON will look like this:

{
  "id": "Sample_1",
  "interpretations": {
    "OMIM:148600": {
      "genomicInterpretations": {
        "HGNC:25662": {
          "interpretationStatus": "CAUSATIVE",
          "variantInterpretation": {
            "variationDescriptor": {
              "geneContext": {
                "symbol": "AAGAB",
                "valueId": "HGNC:25662"
              }
            }
          }
        }
      },
      "progressStatus": "SOLVED"
    }
  },
  "subject": {
    "id": "Subject_1"
  }
}
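
The array-to-object step can be illustrated with a short Python sketch. This is not Pheno-Ranker's own code (the tool performs the transformation internally); the function name interpretations_to_object is hypothetical, and the sketch assumes that every interpretation has a diagnosis.disease.id and that every genomic interpretation carries a geneContext.valueId, as in the example record.

```python
import json


def interpretations_to_object(pxf):
    """Re-key `interpretations` by disease id and gene id (illustrative only)."""
    by_disease = {}
    for interpretation in pxf.get("interpretations", []):
        diagnosis = interpretation["diagnosis"]
        by_gene = {}
        for gi in diagnosis.get("genomicInterpretations", []):
            descriptor = gi["variantInterpretation"]["variationDescriptor"]
            gene_id = descriptor["geneContext"]["valueId"]
            # Drop the patient-specific identifier so it never reaches the keys.
            by_gene[gene_id] = {
                k: v for k, v in gi.items() if k != "subjectOrBiosampleId"
            }
        # Keep the remaining fields, minus the interpretation's own "id".
        entry = {
            k: v for k, v in interpretation.items() if k not in ("id", "diagnosis")
        }
        entry["genomicInterpretations"] = by_gene
        by_disease[diagnosis["disease"]["id"]] = entry
    return {**pxf, "interpretations": by_disease}


record = {
    "id": "Sample_1",
    "interpretations": [{
        "id": "Interpretation_1",
        "progressStatus": "SOLVED",
        "diagnosis": {
            "disease": {"id": "OMIM:148600", "label": "Disease 1"},
            "genomicInterpretations": [{
                "subjectOrBiosampleId": "Subject_1",
                "interpretationStatus": "CAUSATIVE",
                "variantInterpretation": {"variationDescriptor": {
                    "geneContext": {"valueId": "HGNC:25662", "symbol": "AAGAB"}
                }},
            }],
        },
    }],
    "subject": {"id": "Subject_1"},
}

processed = interpretations_to_object(record)
print(json.dumps(processed, indent=2, sort_keys=True))
```

Because the ontology identifiers replace the array positions and the patient-specific ids are dropped, two patients with the same disease and gene end up with identical paths.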

Now you can run Pheno-Ranker as usual. The flattened keys will look like this:

"interpretations.OMIM:148600.genomicInterpretations.HGNC:25662.interpretationStatus.CAUSATIVE" : 1,
"interpretations.OMIM:148600.genomicInterpretations.HGNC:25662.variantInterpretation.variationDescriptor.geneContext.symbol.AAGAB" : 1,
"interpretations.OMIM:148600.progressStatus.SOLVED" : 1,
Other examples of PXF nested array properties

From v1.08 onward, users do not need to transpose or manually rewrite nested arrays for comparison. Pheno-Ranker canonicalizes other nested arrays automatically from their meaningful content. This avoids differences caused only by array order in complex PXF properties such as:

"biosamples.diagnosticMarkers",
"biosamples.pathologicalTnmFinding",
"biosamples.phenotypicFeatures",
"diseases.clinicalTnmFinding",
"diseases.diseaseStage",
"measurements.complexValue.typedQuantities",
"medicalActions.treatment.doseIntervals"

If a nested object has no usable content after filtering, Pheno-Ranker keeps its numeric position instead of guessing an identity. You can still filter out noisy variables with the configuration file when they are not useful for similarity.
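
The idea of order-independent keys can be sketched in Python as follows. This is one possible approach under stated assumptions, not the algorithm Pheno-Ranker uses: the helper names content_key and canonicalize are hypothetical, the key is simply the element's JSON text with sorted keys, and the sample values are made up for illustration.

```python
import json


def content_key(item):
    """Deterministic text for an element, independent of key order."""
    return json.dumps(item, sort_keys=True, separators=(",", ":"))


def canonicalize(items):
    """Replace array positions with content-derived keys.

    Elements with no usable content (empty after filtering) keep their
    numeric position instead of receiving a guessed identity.
    """
    keyed = {}
    for position, item in enumerate(items):
        key = content_key(item) if item else str(position)
        keyed[key] = item
    return keyed


# Hypothetical values, for illustration only.
stage_a = {"id": "EX:0001", "label": "Stage A"}
stage_b = {"id": "EX:0002", "label": "Stage B"}

same_content_1 = canonicalize([stage_a, stage_b])
same_content_2 = canonicalize([stage_b, stage_a])
print(same_content_1 == same_content_2)

with_empty = canonicalize([{}, stage_a])
print(sorted(with_empty))
```

With keys derived from content, two records that list the same elements in a different order produce the same structure, which is the property described above.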

Basic run:

pheno-ranker -r pxf.json

The default output is named matrix.txt. It is an N x N matrix with pairwise comparisons for all individuals.

Test dataset

We are going to use data from the phenopacket-store repository:

wget https://github.com/monarch-initiative/phenopacket-store/releases/latest/download/all_phenopackets.zip
unzip all_phenopackets.zip

Instead of using all of the more than 5,000 examples, we will work with a random subset of 50, consolidated into a single JSON array:

# sudo apt install jq
jq -s '.' $(ls -1 */*json | shuf -n 50) > combined.json
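
If jq or shuf is not available, the same subsetting step can be approximated in Python. This is a sketch rather than part of Pheno-Ranker: the function name combine_random_subset and its parameters are hypothetical, and it assumes the unzipped files sit one directory below the current one, matching the */*json pattern used above.

```python
import json
import random
from pathlib import Path


def combine_random_subset(root, n, out_path, seed=None):
    """Pick up to `n` random JSON files under `root` and write them as one array."""
    files = sorted(Path(root).glob("*/*json"))
    rng = random.Random(seed)  # pass a seed for a reproducible subset
    chosen = rng.sample(files, min(n, len(files)))
    records = [json.loads(path.read_text(encoding="utf-8")) for path in chosen]
    Path(out_path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return len(records)


# Example call, equivalent in spirit to the jq command:
# combine_random_subset(".", 50, "combined.json")
```

Passing a fixed seed makes the subset reproducible, which the shuf-based command does not guarantee.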

And now we perform the calculation:

pheno-ranker -r combined.json -include-terms interpretations

For more information, visit the cohort mode page.