Use Case I: Subject-Level Encoding of GDC Case Data

Data Download

On June 1, 2025, we downloaded clinical metadata from the Genomic Data Commons (GDC) portal as part of the archive:

clinical.cohort.2025-06-02.tar.gz

All files were extracted to the nb/data/subject directory:

cd nb/data/subject
tar -xvf clinical.cohort.2025-06-02.tar.gz

The archive included the following four TSV files:

clinical.tsv
family_history.tsv
follow_up.tsv
pathology_detail.tsv
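The archive contents can also be checked without extracting anything. The following is a minimal sketch using Python's standard tarfile module; the function name list_tsv_members is hypothetical and not part of the ClarID utilities.

```python
import tarfile


def list_tsv_members(archive_path):
    """Return the sorted names of regular .tsv files inside a tar archive."""
    # "r:*" lets tarfile detect the compression (gzip here) automatically.
    with tarfile.open(archive_path, "r:*") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(".tsv")
        )
```

Calling list_tsv_members("clinical.cohort.2025-06-02.tar.gz") from nb/data/subject should return the member names as stored in the archive, which may carry a directory prefix.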

We focused our analysis on clinical.tsv, which contains:

210 columns
113,760 rows
Data from 86 studies, including 33 from TCGA

Note that multiple rows can share the same case UUID.
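The figures above can be recomputed with a short script. This is a sketch under stated assumptions: the column names cases.case_id and project.project_id are assumed to identify the case UUID and the study, and the helper names are hypothetical. The row count it reports excludes the header line.

```python
import csv
import gzip
from collections import Counter


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="", encoding="utf-8")
    return open(path, "r", newline="", encoding="utf-8")


def summarize_tsv(handle, id_column="cases.case_id",
                  study_column="project.project_id"):
    """Count columns, data rows, distinct studies, and UUIDs seen on >1 row."""
    # QUOTE_NONE treats quote characters as ordinary data in the TSV.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    ids = Counter()
    studies = set()
    n_rows = 0
    for row in reader:
        n_rows += 1
        ids[row[id_column]] += 1
        studies.add(row[study_column])
    return {
        "columns": len(reader.fieldnames or []),
        "rows": n_rows,  # header line not counted
        "studies": len(studies),
        "ids_on_multiple_rows": sum(1 for n in ids.values() if n > 1),
    }
```

A typical call would be `with open_text("clinical.tsv.gz") as fh: print(summarize_tsv(fh))`, which works on both the plain and the gzip-compressed file.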

We kept the files compressed to save disk space:

gzip *.tsv

Data Pre-processing

We pre-processed the data using the script ../../../utils/csv/csv2_clarid_in.py, along with the column mapping file utils/csv/gdc_subject_mapping.yaml.

TSV UUID order

The script assumes that the input is sorted by UUID (cases.case_id in this case). If it is not, sort it first while keeping the header line in place, adjusting -k to the position of the UUID column if needed:

(head -n 1 raw.tsv && tail -n +2 raw.tsv | sort -t$'\t' -k1,1) > raw.sorted.tsv

Run the pre-processing with:

../../../utils/csv/csv2_clarid_in.py \
    --entity subject \
    -i clinical.tsv.gz \
    -o clinical_in.csv.gz \
    --mapping ../../../utils/csv/gdc_subject_mapping.yaml
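The UUID ordering assumption described in this section can be checked before running the script. This is a minimal sketch with hypothetical helper names. It compares values as plain Python strings, which may order rows differently from a locale-aware sort. Whether the script needs strict ordering or only contiguous UUIDs is not stated here, so treat the check as conservative.

```python
import csv


def first_unsorted_row(lines, column="cases.case_id", delimiter="\t"):
    """Return the 1-based data-row number where the order breaks, or None."""
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    index = header.index(column)  # raises ValueError if the column is missing
    previous = None
    for row_number, row in enumerate(reader, start=1):
        value = row[index]
        if previous is not None and value < previous:
            return row_number
        previous = value
    return None


def sort_rows(lines, column="cases.case_id", delimiter="\t"):
    """Return the header followed by the data rows sorted by the column."""
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    index = header.index(column)
    # sorted() is stable, so rows sharing a UUID keep their relative order.
    body = sorted(reader, key=lambda row: row[index])
    return [header] + body
```

Both functions accept any iterable of text lines, such as a handle returned by gzip.open(path, "rt"). Because sort_rows loads all rows into memory, very large files may be better served by the shell sort.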

ClarID encoding

Human-Readable Format

../../../bin/clarid-tools code \
    --entity subject \
    --format human \
    --action encode \
    --infile clinical_in.csv.gz \
    --sep "," \
    --outfile clarid_human.csv.gz

Stub Format

../../../bin/clarid-tools code \
    --entity subject \
    --format stub \
    --action encode \
    --infile clinical_in.csv.gz \
    --sep "," \
    --outfile clarid_stub.csv.gz

ClarID decoding

Human-Readable Format

../../../bin/clarid-tools code \
    --entity subject \
    --format human \
    --action decode \
    --infile clarid_human.csv.gz \
    --sep "," \
    --outfile clarid_decoded_human.csv.gz

Duplicated columns?

Note: Decoded columns are appended to the right of the existing ones, so some columns may appear duplicated.
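One way to spot such columns is to look for header names that occur more than once. This sketch uses hypothetical helper names and detects only identical names. Columns that repeat the same content under different names would need a value-by-value comparison.

```python
import csv
from collections import Counter


def read_header(handle, delimiter=","):
    """Return the first row of a delimited text stream as a list of names."""
    return next(csv.reader(handle, delimiter=delimiter))


def duplicated_columns(header):
    """Return names that occur more than once, in order of first appearance."""
    counts = Counter(header)
    repeated = []
    for name in header:
        if counts[name] > 1 and name not in repeated:
            repeated.append(name)
    return repeated
```

For the decoded output, the header could be read with `gzip.open("clarid_decoded_human.csv.gz", "rt", newline="")` and passed to duplicated_columns.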

Stub Format

../../../bin/clarid-tools code \
    --entity subject \
    --format stub \
    --action decode \
    --infile clarid_stub.csv.gz \
    --sep "," \
    --outfile clarid_decoded_stub.csv.gz
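The four calls above differ only in --action, --format, and the file names, so they can be generated from a small table. The sketch below uses only the flags and file names shown in this section. The function names are hypothetical, and nothing runs until run_all() is called from nb/data/subject.

```python
import subprocess

CLARID_TOOLS = "../../../bin/clarid-tools"  # path as used in the text


def clarid_command(action, fmt, infile, outfile):
    """Build one clarid-tools call using only the flags shown in the text."""
    return [
        CLARID_TOOLS, "code",
        "--entity", "subject",
        "--format", fmt,
        "--action", action,
        "--infile", infile,
        "--sep", ",",
        "--outfile", outfile,
    ]


def planned_commands():
    """Return the two encode and two decode commands, in run order."""
    commands = []
    for fmt in ("human", "stub"):
        commands.append(clarid_command(
            "encode", fmt, "clinical_in.csv.gz", f"clarid_{fmt}.csv.gz"))
    for fmt in ("human", "stub"):
        commands.append(clarid_command(
            "decode", fmt, f"clarid_{fmt}.csv.gz",
            f"clarid_decoded_{fmt}.csv.gz"))
    return commands


def run_all(runner=subprocess.run):
    """Run each command in turn; check=True stops at the first failure."""
    for command in planned_commands():
        runner(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and the runner parameter lets the command list be inspected or logged without executing anything.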

Results Table

The browsable table shows the first 10,000 human-format encodings. The full file is available at nb/data/subject/clarid_human.csv.gz.
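The full file can also be inspected locally without decompressing it to disk. This is a minimal sketch with a hypothetical helper name. It makes no assumption about the column layout of the output and returns the header plus the first few rows.

```python
import csv
import gzip
from itertools import islice


def preview_csv_gz(path, limit=5, delimiter=","):
    """Return (header, first `limit` data rows) from a gzip-compressed CSV."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)  # None if the file is empty
        rows = list(islice(reader, limit))
    return header, rows
```

For example, preview_csv_gz("clarid_human.csv.gz", limit=10) reads only the start of the file, so it stays fast even for large outputs.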
