Use Case I: Subject-Level Encoding of GDC Case Data
Data Download
On June 1, 2025, we downloaded clinical metadata from the Genomic Data Commons (GDC) portal as part of the archive:
clinical.cohort.2025-06-02.tar.gz
All files were extracted to the nb/data/subject directory:
cd nb/data/subject
tar -xvf clinical.cohort.2025-06-02.tar.gz
The archive included the following four TSV files:
- clinical.tsv
- family_history.tsv
- follow_up.tsv
- pathology_detail.tsv
We focused our analysis on clinical.tsv, which contains:
- 210 columns
- 113,760 rows
- Data from 86 studies, including 33 from TCGA
Note that multiple rows can share the same case UUID (cases.case_id).
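The figures above can be re-derived with standard command-line tools. The following is a minimal sketch run against a tiny stand-in file, not the real download; the column names other than cases.case_id and all values are hypothetical, and it assumes the UUID sits in the first column and the first line is a header.

```shell
set -eu

# Hypothetical stand-in for clinical.tsv: a header plus three data rows,
# two of which carry the same UUID in the first column.
sample="$(mktemp)"
printf 'cases.case_id\tproject.project_id\tdemographic.gender\n' > "$sample"
printf 'uuid-0001\tTCGA-BRCA\tfemale\n' >> "$sample"
printf 'uuid-0001\tTCGA-BRCA\tfemale\n' >> "$sample"
printf 'uuid-0002\tTCGA-LUAD\tmale\n' >> "$sample"

# Number of tab-separated columns in the header line.
n_cols=$(head -n 1 "$sample" | awk -F'\t' '{ print NF }')

# Number of data rows (every line except the header).
n_rows=$(tail -n +2 "$sample" | wc -l | tr -d ' ')

# Number of distinct values in the first column (assumed to hold the UUID).
n_ids=$(tail -n +2 "$sample" | cut -f1 | sort -u | wc -l | tr -d ' ')

echo "columns=$n_cols rows=$n_rows unique_ids=$n_ids"
rm -f "$sample"
```

For the real data, the same pipelines can read from the output of gzip -cd clinical.tsv.gz instead of the sample file.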
We keep the files compressed to save disk space:
gzip *.tsv
Data Pre-processing
We pre-processed the data using the script ../../../utils/csv/csv2_clarid_in.py, along with the column mapping file ../../../utils/csv/gdc_subject_mapping.yaml.
TSV UUID order
The script assumes that the input is sorted by UUID (cases.case_id in this case). If it is not, sort it first, keeping the header row on top, with a command like:
(head -n 1 raw.tsv && tail -n +2 raw.tsv | sort -t$'\t' -k1,1) > raw.sorted.tsv
Run the pre-processing with:
../../../utils/csv/csv2_clarid_in.py \
--entity subject \
-i clinical.tsv.gz \
-o clinical_in.csv.gz \
--mapping ../../../utils/csv/gdc_subject_mapping.yaml
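The sort-order assumption can be checked before running the script. The sketch below is one possible check, not part of csv2_clarid_in.py: the helper name is_sorted_by_first_column is hypothetical, and it assumes a header line, the UUID in the first tab-separated column, and the ordering of sort in the current locale. The sample files and their values are made up for illustration.

```shell
set -eu

# is_sorted_by_first_column FILE
# Succeeds when the first column of the data rows (header skipped) is in
# sort order. Only the key column is compared, so rows that share a UUID
# may appear in any order relative to each other.
is_sorted_by_first_column() {
  tail -n +2 "$1" | cut -f1 | sort -c 2>/dev/null
}

# Two hypothetical sample files: one ordered by UUID, one not.
workdir="$(mktemp -d)"
printf 'cases.case_id\tvalue\ncase-a\t2\ncase-a\t1\ncase-b\t3\n' > "$workdir/sorted.tsv"
printf 'cases.case_id\tvalue\ncase-b\t3\ncase-a\t1\ncase-a\t2\n' > "$workdir/unsorted.tsv"

if is_sorted_by_first_column "$workdir/sorted.tsv"; then
  sorted_status=sorted
else
  sorted_status=unsorted
fi

if is_sorted_by_first_column "$workdir/unsorted.tsv"; then
  unsorted_status=sorted
else
  unsorted_status=unsorted
fi

echo "sorted.tsv: $sorted_status"
echo "unsorted.tsv: $unsorted_status"
rm -rf "$workdir"
```

A compressed input can be checked the same way by decompressing it to a temporary file first.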
ClarID encoding
Human-Readable Format
../../../bin/clarid-tools code \
--entity subject \
--format human \
--action encode \
--infile clinical_in.csv.gz \
--sep "," \
--outfile clarid_human.csv.gz
Stub Format
../../../bin/clarid-tools code \
--entity subject \
--format stub \
--action encode \
--infile clinical_in.csv.gz \
--sep "," \
--outfile clarid_stub.csv.gz
ClarID decoding
Human-Readable Format
../../../bin/clarid-tools code \
--entity subject \
--format human \
--action decode \
--infile clarid_human.csv.gz \
--sep "," \
--outfile clarid_decoded_human.csv.gz
Note: The decoded columns are appended to the right of the existing ones, so some columns may appear duplicated.
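To see which column names end up duplicated after decoding, the header line can be inspected directly. This is a minimal sketch: the helper name duplicated_columns is hypothetical, the sample header and values are invented, and it assumes a comma separator with no quoted commas in the header.

```shell
set -eu

# duplicated_columns FILE.csv.gz
# Prints every header name that occurs more than once, one per line.
duplicated_columns() {
  gzip -cd "$1" | head -n 1 | tr ',' '\n' | sort | uniq -d
}

# Hypothetical decoded file in which the "sex" column appears twice.
workdir="$(mktemp -d)"
printf 'clar_id,sex,age_group,sex\nX,male,A1,male\n' | gzip > "$workdir/decoded.csv.gz"

dups="$(duplicated_columns "$workdir/decoded.csv.gz")"
echo "duplicated: $dups"
rm -rf "$workdir"
```

Run against clarid_decoded_human.csv.gz, the function would list the names to rename or drop before further analysis.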
Stub Format
../../../bin/clarid-tools code \
--entity subject \
--format stub \
--action decode \
--infile clarid_stub.csv.gz \
--sep "," \
--outfile clarid_decoded_stub.csv.gz
Results Table
Below is a browsable table of the first 10,000 human-format encodings. The full file is available at:
nb/data/subject/clarid_human.csv.gz.
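The same kind of subset can be pulled from the compressed file on the command line. The sketch below assumes the file starts with a header line; the helper name preview_rows and the sample content are hypothetical.

```shell
set -eu

# preview_rows FILE.csv.gz N
# Prints the header line followed by the first N data rows.
preview_rows() {
  gzip -cd "$1" | head -n "$(( $2 + 1 ))"
}

# Hypothetical stand-in for clarid_human.csv.gz with four data rows.
workdir="$(mktemp -d)"
printf 'id,clar_id\n1,A\n2,B\n3,C\n4,D\n' | gzip > "$workdir/sample.csv.gz"

preview="$(preview_rows "$workdir/sample.csv.gz" 2)"
n_lines=$(printf '%s\n' "$preview" | wc -l | tr -d ' ')

printf '%s\n' "$preview"
rm -rf "$workdir"
```

With the real file, preview_rows clarid_human.csv.gz 10000 would give the header plus the first 10,000 rows.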