
Analysis

This page collects a few practical ways to work with individuals.json output once the conversion is done.

Get quick counts from individuals.json

One simple option is to use jq to count the main arrays for each individual and write the result as tabular text:

jq -r '["id", "diseases", "exposures", "interventionsOrProcedures", "measures", "phenotypicFeatures", "treatments"], (.[] | [.id, (.diseases | length), (.exposures | length), (.interventionsOrProcedures | length), (.measures | length), (.phenotypicFeatures | length), (.treatments | length)]) | @tsv' < individuals.json > results.tsv

This produces a header row followed by one row per individual: the id in the first column, then one count column per array.

Do the same in Python

import json
import pandas as pd

# Load the full array of individuals
with open('individuals.json', 'r') as json_file:
    data = json.load(json_file)

# Array-valued terms to count for each individual
keys = [
    "diseases",
    "exposures",
    "interventionsOrProcedures",
    "measures",
    "phenotypicFeatures",
    "treatments",
]

# One dict per individual; a missing or null array counts as 0
result_data = [
    {
        "id": item["id"],
        **{key: len(item.get(key) or []) for key in keys},
    }
    for item in data
]

df = pd.DataFrame(result_data)
df.to_csv('results.tsv', sep='\t', index=False)
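
If pandas is not available, the same counting can be sketched with the standard library alone. The snippet below is a minimal illustration, not part of the tool: the helper names `count_rows` and `to_tsv` and the two-record `sample` input are made up for this example, and a real run would load `individuals.json` with `json.load` instead.

```python
import csv
import io

# Array-valued terms to count, in output column order.
KEYS = [
    "diseases",
    "exposures",
    "interventionsOrProcedures",
    "measures",
    "phenotypicFeatures",
    "treatments",
]


def count_rows(individuals):
    """Return a header row followed by one row of array lengths per individual."""
    rows = [["id", *KEYS]]
    for item in individuals:
        # `or []` treats both a missing key and an explicit null as an empty array.
        rows.append([item["id"], *(len(item.get(key) or []) for key in KEYS)])
    return rows


def to_tsv(rows):
    """Serialise rows as tab-separated text."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical input, for illustration only.
sample = [
    {"id": "ind_1", "diseases": [{}, {}], "measures": [{}]},
    {"id": "ind_2", "treatments": [{}], "exposures": None},
]

rows = count_rows(sample)
tsv_text = to_tsv(rows)
print(tsv_text, end="")
```

Writing `tsv_text` to `results.tsv` gives a file with the same layout as the pandas version.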

Do the same in Perl

use strict;
use warnings;
use autodie;
use JSON::XS;
use Text::CSV_XS qw(csv);

open my $json_file, '<', 'individuals.json';
my $json_text = do { local $/; <$json_file> };
my $data = decode_json($json_text);
close $json_file;

my @keys = (
    "diseases",
    "exposures",
    "interventionsOrProcedures",
    "measures",
    "phenotypicFeatures",
    "treatments"
);

my $aoa = [["id", @keys]];

foreach my $item (@$data) {
    my @row = ($item->{"id"});
    foreach my $key (@keys) {
        push @row, scalar @{$item->{$key} // []};
    }
    push @$aoa, \@row;
}

csv(in => $aoa, out => "results.tsv", sep_char => "\t", eol => "\n");

Example downstream statistics

Once you have results.tsv, you can calculate summary statistics with your preferred tooling.

Python example

import pandas as pd

df = pd.read_csv('results.tsv', sep='\t')
# Drop the first column (id) so only the count columns remain
df = df.iloc[:, 1:]

# Row labels for the output table; each count column is added below
stats = {
    'Statistic': [
        'Mean',
        'Median',
        'Max',
        'Min',
        '25th Percentile',
        '75th Percentile',
        'IQR',
        'Standard Deviation',
    ]
}

for column in df.columns:
    percentile_25 = df[column].quantile(0.25)
    percentile_75 = df[column].quantile(0.75)

    # Values must follow the same order as the 'Statistic' labels
    stats[column] = [
        df[column].mean(),
        df[column].median(),
        df[column].max(),
        df[column].min(),
        percentile_25,
        percentile_75,
        percentile_75 - percentile_25,
        df[column].std(),
    ]

stats_df = pd.DataFrame(stats)
stats_df.to_csv('column_statistics.csv', index=False)
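
For a quick cross-check without pandas, the same statistics can be sketched with Python's `statistics` module. This is an illustrative sketch under stated assumptions: the helper `column_stats` and the inline `sample_tsv` text are hypothetical, and `method="inclusive"` is chosen on the assumption that it matches the linear interpolation pandas uses by default in `quantile`. Each column needs at least two values for the quartiles and the sample standard deviation.

```python
import csv
import io
import statistics


def column_stats(values):
    """Summary statistics for one count column (needs at least two values)."""
    # Quartile cut points with linear interpolation between data points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Mean": statistics.mean(values),
        "Median": statistics.median(values),
        "Max": max(values),
        "Min": min(values),
        "25th Percentile": q1,
        "75th Percentile": q3,
        "IQR": q3 - q1,
        # Sample standard deviation (n - 1 in the denominator).
        "Standard Deviation": statistics.stdev(values),
    }


# Hypothetical results.tsv content, for illustration only.
sample_tsv = (
    "id\tdiseases\tmeasures\n"
    "ind_1\t0\t2\n"
    "ind_2\t1\t4\n"
    "ind_3\t2\t6\n"
    "ind_4\t3\t8\n"
    "ind_5\t4\t10\n"
)

reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
records = list(reader)

# Skip the id column; every other column holds integer counts.
columns = [name for name in reader.fieldnames if name != "id"]
stats = {
    name: column_stats([int(record[name]) for record in records])
    for name in columns
}

for name, values in stats.items():
    print(name, values)
```

To run this against real output, replace `io.StringIO(sample_tsv)` with an open handle on `results.tsv`.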

R example

df <- read.csv("results.tsv", sep = "\t")
df <- df[-1]
summary_stats <- summary(df)
write.csv(summary_stats, file = 'column_statistics.csv')

For comparison, patient matching, synthetic data, plotting, or feature extraction, see Pheno-Ranker.

Useful entry points: