Use Case II: Biosample-Level Encoding of GDC Data
Data Download
On June 6, 2025, we downloaded biospecimen metadata for the project TARGET-AML from the Genomic Data Commons (GDC) portal as part of the archive:
biospecimen.project-target-aml.2025-06-29.tar.gz
All files were extracted to the nb/data/biosample directory:
cd nb/data/biosample
tar -xvf biospecimen.project-target-aml.2025-06-29.tar.gz
The archive included the following five TSV files:
- aliquot.tsv
- analyte.tsv
- portion.tsv
- sample.tsv
- slide.tsv
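The archive contents can also be listed without extracting anything. The following is a minimal sketch using Python's standard tarfile module; the helper name list_tsv_members is ours and is not part of the ClarID tooling.

```python
import tarfile


def list_tsv_members(archive_path):
    """Return the sorted names of regular .tsv files inside a tar archive."""
    # "r:*" lets tarfile detect the compression (gzip here) on its own.
    with tarfile.open(archive_path, "r:*") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(".tsv")
        )
```

Calling list_tsv_members("biospecimen.project-target-aml.2025-06-29.tar.gz") from nb/data/biosample should return the TSV names listed above, assuming the archive stores them at its top level.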
We focused our analysis on sample.tsv, which contains:
- 39 columns
- 4255 rows
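These counts can be re-checked with a short script. This is a minimal sketch, assuming the file is strictly tab-separated with a single header line and no quoted fields; the helper name tsv_shape is ours. The text above does not say whether the row count includes the header line, so the sketch reports data rows only.

```python
import csv
import gzip
from pathlib import Path


def tsv_shape(path):
    """Return (n_columns, n_data_rows) for a plain or gzip-compressed TSV file."""
    path = Path(path)
    # Choose the opener from the extension so the check works on .tsv and .tsv.gz.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        # Assumption: fields are never quoted, so quotes are treated as plain text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        if header is None:
            return 0, 0
        # Blank lines come back as empty lists and are not counted.
        n_rows = sum(1 for row in reader if row)
    return len(header), n_rows
```

For example, tsv_shape("sample.tsv") returns a tuple whose first element is the number of columns and whose second is the number of rows after the header.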
We then compressed the files to save disk space:
gzip *.tsv
Data Pre-processing
We pre-processed the data using the script ../../../utils/csv/csv2_clarid_in.py, along with the column mapping file ../../../utils/csv/gdc_biosample_mapping.yaml.
TSV UUID order
The script assumes that the input is sorted by UUID (here, cases.case_id). If it is not, sort it first, keeping the header line in place, with a command like:
(head -n 1 raw.tsv && tail -n +2 raw.tsv | sort -t$'\t' -k1,1) > raw.sorted.tsv
Run the pre-processing with:
../../../utils/csv/csv2_clarid_in.py \
--entity biosample \
-i sample.tsv.gz \
-o sample_in.csv.gz \
--mapping ../../../utils/csv/gdc_biosample_mapping.yaml
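Before running the script, the sort-order assumption noted above can be checked, and a sorted copy written if needed. This is a minimal sketch, assuming a strictly tab-separated file with one header line that contains a cases.case_id column; the helper names are ours. Keys are compared by Unicode code point, which for lowercase hexadecimal UUIDs matches byte order; the exact ordering the script expects is not stated here.

```python
import gzip
from pathlib import Path


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed text file, chosen by extension."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    return opener(path, mode, encoding="utf-8", newline="")


def read_tsv(path):
    """Return (header, rows); each row is a list of tab-separated fields."""
    with open_text(path) as handle:
        lines = [line.rstrip("\r\n") for line in handle]
    table = [line.split("\t") for line in lines if line]
    if not table:
        raise ValueError(f"{path} is empty")
    return table[0], table[1:]


def is_sorted_by(path, column="cases.case_id"):
    """Return True if the data rows are in non-decreasing order of `column`."""
    header, rows = read_tsv(path)
    index = header.index(column)  # raises ValueError if the column is missing
    keys = [row[index] for row in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def write_sorted(src, dst, column="cases.case_id"):
    """Write a copy of `src` to `dst` sorted by `column`, header kept first."""
    header, rows = read_tsv(src)
    index = header.index(column)
    rows.sort(key=lambda row: row[index])  # stable: ties keep their order
    with open_text(dst, "wt") as handle:
        for fields in [header, *rows]:
            handle.write("\t".join(fields) + "\n")
```

A typical use would be to call is_sorted_by("sample.tsv.gz") and, only if it returns False, write_sorted("sample.tsv.gz", "sample.sorted.tsv.gz") and pass the sorted file to the script with -i.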
ClarID Encoding
Human-Readable Format
../../../bin/clarid-tools code \
--entity biosample \
--format human \
--action encode \
--infile sample_in.csv.gz \
--sep "," \
--outfile clarid_human.csv.gz
Stub Format
../../../bin/clarid-tools code \
--entity biosample \
--format stub \
--action encode \
--infile sample_in.csv.gz \
--sep "," \
--outfile clarid_stub.csv.gz
ClarID Decoding
Human-Readable Format
../../../bin/clarid-tools code \
--entity biosample \
--format human \
--action decode \
--infile clarid_human.csv.gz \
--sep "," \
--outfile clarid_decoded_human.csv.gz
Note: The decoded columns are appended to the right of the existing ones, so some columns may appear duplicated.
Stub Format
../../../bin/clarid-tools code \
--entity biosample \
--format stub \
--action decode \
--infile clarid_stub.csv.gz \
--sep "," \
--outfile clarid_decoded_stub.csv.gz
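Because decoding appends columns to the existing ones, it can help to see which header names occur more than once in a decoded file. This is a minimal sketch with a helper name of our own. It only detects names repeated verbatim; whether clarid-tools repeats names exactly or renames them is not stated here, so an empty result does not rule out overlapping content.

```python
import csv
import gzip
from collections import Counter


def duplicated_columns(path, sep=","):
    """Return header names that occur more than once, in first-seen order."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        # Only the first line (the header) is read.
        header = next(csv.reader(handle, delimiter=sep), [])
    counts = Counter(header)  # Counter keeps insertion order
    return [name for name in counts if counts[name] > 1]
```

For example, duplicated_columns("clarid_decoded_human.csv.gz") returns a list of repeated header names, or an empty list if every name is unique.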
Results Table
Below is a browsable table of the human-readable encodings. The full file is available at
nb/data/biosample/clarid_human.csv.gz.
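The file can also be inspected locally without decompressing it to disk. This is a minimal sketch, assuming a comma-separated file with one header line; the helper name preview_csv is ours.

```python
import csv
import gzip
from itertools import islice


def preview_csv(path, n=5, sep=","):
    """Return (header, first n data rows) from a plain or gzip-compressed CSV."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter=sep)
        header = next(reader, [])
        # islice stops after n rows, so large files are not read in full.
        rows = list(islice(reader, n))
    return header, rows
```

For example, header, rows = preview_csv("nb/data/biosample/clarid_human.csv.gz", n=10) reads only the header and the first ten records.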