
Use Case II: Biosample-Level Encoding of GDC Data

Data Download

On June 6, 2025, we downloaded biospecimen metadata for the project TARGET-AML from the Genomic Data Commons (GDC) portal as part of the archive:

biospecimen.project-target-aml.2025-06-29.tar.gz

All files were extracted to the nb/data/biosample directory:

cd nb/data/biosample
tar -xvf biospecimen.project-target-aml.2025-06-29.tar.gz

The archive included the following five TSV files:

  • aliquot.tsv
  • analyte.tsv
  • portion.tsv
  • sample.tsv
  • slide.tsv

We focused our analysis on sample.tsv, which contains:

  • 39 columns
  • 4255 rows
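The column and row counts above can be re-checked with a small helper. This is only a sketch: `tsv_shape` is a hypothetical name, not part of the ClarID tooling, and it assumes the first line is a header, so the reported row count excludes it. It also assumes a gzip that passes uncompressed input through when called with `-cdf` (GNU gzip does), so it works before and after compression.

```shell
# Print the number of columns (taken from the header) and data rows of a TSV file.
# Hypothetical helper, not part of clarid-tools.
# Accepts plain or gzip-compressed input: `gzip -cdf` copies uncompressed data unchanged.
tsv_shape() {
  gzip -cdf -- "$1" | awk -F'\t' '
    NR == 1 { cols = NF }   # column count comes from the header line
    END     { printf "columns=%d rows=%d\n", cols, (NR > 0 ? NR - 1 : 0) }  # rows exclude the header
  '
}
```

For example, `tsv_shape sample.tsv` prints a single line of the form `columns=N rows=M`.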

We then compressed the files to save space:

gzip *.tsv

Data Pre-processing

We pre-processed the data using the script ../../../utils/csv/csv2_clarid_in.py, along with the column mapping file ../../../utils/csv/gdc_biosample_mapping.yaml.

TSV UUID order

The script assumes that the input is sorted by UUID (cases.case_id in this case). If it is not, sort it first with a command like:

sort -t$'\t' -k1,1 raw.tsv > raw.sorted.tsv

This command assumes the UUID is in the first column. It sorts every line, so a header row, if present, is sorted along with the data.

Run the pre-processing with:

../../../utils/csv/csv2_clarid_in.py \
--entity biosample \
-i sample.tsv.gz \
-o sample_in.csv.gz \
--mapping ../../../utils/csv/gdc_biosample_mapping.yaml
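The plain sort command in the note on UUID order treats a header line like any other row, and it expects uncompressed input. If the TSV has a header and is already gzipped, a header-preserving sort could look like the sketch below. `sort_tsv_by_column` is a hypothetical helper, not part of the ClarID utilities, and the column number of cases.case_id has to be supplied by the caller.

```shell
# Sort a (possibly gzipped) TSV by one column while keeping the header on top.
# Hypothetical helper. Usage: sort_tsv_by_column INPUT COLUMN_NUMBER > OUTPUT
sort_tsv_by_column() {
  gzip -cdf -- "$1" | {
    IFS= read -r header                            # consume the header line only
    printf '%s\n' "$header"                        # emit it unchanged
    LC_ALL=C sort -t "$(printf '\t')" -k "$2,$2"   # sort the remaining rows by the chosen column
  }
}
```

Assuming the UUID is in column 1, as in the sort command above, `sort_tsv_by_column sample.tsv.gz 1 | gzip > sample.sorted.tsv.gz` would write a sorted, compressed copy (the output file name is only an example).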

ClarID Encoding

Human-Readable Format

../../../bin/clarid-tools code \
--entity biosample \
--format human \
--action encode \
--infile sample_in.csv.gz \
--sep "," \
--outfile clarid_human.csv.gz

Stub Format

../../../bin/clarid-tools code \
--entity biosample \
--format stub \
--action encode \
--infile sample_in.csv.gz \
--sep "," \
--outfile clarid_stub.csv.gz

ClarID Decoding

Human-Readable Format

../../../bin/clarid-tools code \
--entity biosample \
--format human \
--action decode \
--infile clarid_human.csv.gz \
--sep "," \
--outfile clarid_decoded_human.csv.gz

Duplicated columns?

Note: The decoded columns are appended to the right of the existing ones, so some columns may appear duplicated.
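To see which columns are affected, one option is to list header names that occur more than once. The helper below is a minimal sketch: `duplicated_columns` is a hypothetical name, and it assumes the duplicated columns carry identical names and that the header contains no quoted commas.

```shell
# List header names that occur more than once in a (possibly gzipped) CSV.
# Hypothetical helper. Assumes a simple comma-separated header without quoted commas.
duplicated_columns() {
  gzip -cdf -- "$1" | head -n 1 | tr ',' '\n' | sort | uniq -d
}
```

Running `duplicated_columns clarid_decoded_human.csv.gz` prints one repeated column name per line, or nothing if every name is unique.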

Stub Format

../../../bin/clarid-tools code \
--entity biosample \
--format stub \
--action decode \
--infile clarid_stub.csv.gz \
--sep "," \
--outfile clarid_decoded_stub.csv.gz

Results Table

Below is a browsable table of the human format encodings. The full file is available at nb/data/biosample/clarid_human.csv.gz.
