Analysis

This page collects a few practical ways to work with individuals.json output once the conversion is done.

Get quick counts from `individuals.json`

One simple option is to use jq to count the main arrays for each individual and write the result as tabular text:

jq -r '["id", "diseases", "exposures", "interventionsOrProcedures", "measures", "phenotypicFeatures", "treatments"], (.[] | [.id, (.diseases | length), (.exposures | length), (.interventionsOrProcedures | length), (.measures | length), (.phenotypicFeatures | length), (.treatments | length)]) | @tsv' < individuals.json > results.tsv

This produces one row per individual and one column per array count.

Do the same in Python

import json
import pandas as pd

with open('individuals.json', 'r') as json_file:
    data = json.load(json_file)

keys = [
    "diseases",
    "exposures",
    "interventionsOrProcedures",
    "measures",
    "phenotypicFeatures",
    "treatments",
]

result_data = [
    {
        "id": item["id"],
        **{key: len(item.get(key, [])) for key in keys},
    }
    for item in data
]

df = pd.DataFrame(result_data)
df.to_csv('results.tsv', sep='\t', index=False)

Do the same in Perl

use strict;
use warnings;
use autodie;
use JSON::XS;
use Text::CSV_XS qw(csv);

open my $json_file, '<', 'individuals.json';
my $json_text = do { local $/; <$json_file> };
my $data = decode_json($json_text);
close $json_file;

my @keys = (
    "diseases",
    "exposures",
    "interventionsOrProcedures",
    "measures",
    "phenotypicFeatures",
    "treatments"
);

my $aoa = [["id", @keys]];

foreach my $item (@$data) {
    my @row = ($item->{"id"});
    foreach my $key (@keys) {
        push @row, scalar @{$item->{$key} // []};
    }
    push @$aoa, \@row;
}

csv(in => $aoa, out => "results.tsv", sep_char => "\t", eol => "\n");

Example downstream statistics

Once you have results.tsv, you can calculate summary statistics with your preferred tooling.

Python example

import pandas as pd

df = pd.read_csv('results.tsv', sep='\t')
df = df.iloc[:, 1:]

stats = {
    'Statistic': [
        'Mean',
        'Median',
        'Max',
        'Min',
        '25th Percentile',
        '75th Percentile',
        'IQR',
        'Standard Deviation',
    ]
}

for column in df.columns:
    percentile_25 = df[column].quantile(0.25)
    percentile_75 = df[column].quantile(0.75)

    stats[column] = [
        df[column].mean(),
        df[column].median(),
        df[column].max(),
        df[column].min(),
        percentile_25,
        percentile_75,
        percentile_75 - percentile_25,
        df[column].std()
    ]

stats_df = pd.DataFrame(stats)
stats_df.to_csv('column_statistics.csv', index=False)

R example

df <- read.csv("results.tsv", sep = "\t")
df <- df[-1]
summary_stats <- summary(df)
write.csv(summary_stats, file = 'column_statistics.csv')

For comparison, patient matching, synthetic data, plotting, or feature extraction, see Pheno-Ranker.

Useful entry points:

Cohort comparison: Pheno-Ranker cohort mode
Patient matching: Pheno-Ranker patient mode
Synthetic BFF/PXF data: Pheno-Ranker simulator
Feature extraction / one-hot encoding: Pheno-Ranker

Get quick counts from individuals.json​

Do the same in Python​

Do the same in Perl​

Example downstream statistics​

Python example​

R example​

Related tools​