csv2_clarid_in.py 🧪📄

Overview

csv2_clarid_in.py is a script to convert raw TSV or CSV files into a standard CSV format for biosample or subject entities. It uses a YAML file to define how each field is handled, with support for field transformations and automatic subject ID generation.

🧬 Included in the containerized version of ClarID-Tools; no installation is needed if you use the Docker image.

Note: These helper scripts are intended to work with the ClarID-Tools release they are shipped with.


From raw table to ClarID input

csv2_clarid_in.py is a preparation step. It does not create ClarID identifiers by itself; it converts a local tabular file into the CSV shape expected by clarid-tools code --infile.

The workflow has four steps:

  1. Start with a raw TSV or CSV file from a local source.
  2. Describe how source columns map to ClarID fields in a YAML file.
  3. Run csv2_clarid_in.py to create a ClarID-ready CSV.
  4. Pass that CSV to clarid-tools code --infile to generate identifiers.

For example, a project may start with a table whose column names and values come from a local LIMS, spreadsheet, or public repository:

sample_barcode  patient  organism      tissue_site  diagnosis     days_from_baseline
S-001           P001     Homo sapiens  liver        Liver cancer  0
S-002           P002     Mus musculus  brain        Brain cancer  49

The mapping YAML tells the script which source columns to read, how to normalize values, and which ClarID fields to emit. The output should contain the fields needed by the selected ClarID entity.
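Conceptually, the per-row conversion works like this. The sketch below is a simplified, hypothetical re-implementation (the `OPS` table and `convert_row` are illustrative names, not the script's actual internals, and the real script supports many more operations):

```python
# Simplified sketch of the per-row conversion (illustrative only).
OPS = {
    "trim": str.strip,
    "normalize_sex": lambda v: {"m": "Male", "f": "Female"}.get(v.lower(), v),
}

def convert_row(row, mapping):
    out = {}
    for field, spec in mapping["fields"].items():
        value = row[spec["source"]]
        for op in spec.get("operations", []):
            value = OPS[op](value)          # apply each operation in order
        out[field] = value
    for field, default in mapping.get("static_fields", {}).items():
        if out.get(field) in (None, ""):    # static_fields fill blanks/None
            out[field] = default
    return [out.get(h, "") for h in mapping["output_headers"]]
```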

Typical biosample output columns are:

unique_id,subject_id,project,species,tissue,sample_type,assay,condition,timepoint,duration,batch,replicate

Typical subject output columns are:

unique_id,study,subject_id,type,condition,sex,age_group

Once the CSV is prepared, pass it to ClarID-Tools:

clarid-tools code \
--entity biosample \
--format human \
--action encode \
--infile clarid_ready_biosample.csv \
--sep "," \
--outfile clarid_encoded_biosample.csv

The final output from clarid-tools code will contain the generated clar_id or stub_id column.


🚀 Usage

./csv2_clarid_in.py --entity {biosample|subject} -i input.tsv[.gz] -o output.csv[.gz] -m mapping.yaml [-d delimiter]

Arguments

  • --entity (required): biosample or subject
  • -i / --input: input TSV or CSV (gzip supported)
  • -o / --output: output CSV (gzip supported)
  • -m / --mapping: YAML file with the field configuration
  • -d: input delimiter (default: tab; use , for CSV)

πŸ—‚οΈ Mapping YAML (minimal example)​

output_headers:
  - subject_id
  - age_group
  - sex

fields:
  subject_id:
    source: sample_id
    operations:
      - trim

  age_group:
    source: age
    operations:
      - bucketize_age:
          - {name: 'Adult', min: 20, max: 64}

  sex:
    source: gender
    operations:
      - normalize_sex

static_fields:
  sex: Unknown

Duration binning: days to ISO 8601 PnU

Many pipelines need a normalized "duration/timepoint" field. This script supports an operation that converts a day count into a strict 3-character ISO-like bin:

  • Format: P[0-9][DWMY] (examples: P0D, P7D, P3M, P2Y)
  • Allowed units: days (D), weeks (W), months (M ≈ 30 days), years (Y ≈ 365 days)
  • Allowed digits: 0-9 (values >9 in a unit escalate to a larger unit; years are clamped to P9Y)

Note: The 3-character limit (PnU) is intentional, designed to keep timepoint duration compact and consistent, rather than attempting to represent every possible duration in full detail.

Assumptions

  • The input column is in days (integer or numeric string). This keeps the logic predictable and easy to prepare (you can convert anything to days in Excel/R/Python beforehand).
  • Blank cells and sentinels can be mapped to null (~ in YAML) and then filled by static_fields (e.g., P0D).

Note: P0D is the valid zero-duration value. If you used P0N before, replace it with P0D.

YAML snippet (duration)

output_headers:
  - duration

fields:
  duration:
    source: samples.days_to_collection
    operations:
      - strip_quotes
      - trim
      - map_values:
          "--": ~
          "NA": ~
          "": ~
      - days_to_iso8601_bin:
          rounding: floor       # one of: floor | round | ceil (default: floor)
          units: [D, W, M, Y]   # allowed output units, in this priority
          on_error: ~           # parse error or negative value -> None, so static_fields can apply

static_fields:
  duration: P0D

How it bins

  • 0..9 days → P0D..P9D
  • >=10 days → try weeks (days/7, rounded as configured) → P1W..P9W
  • If weeks would be >=10 → try months (days/30) → P1M..P9M
  • If months would be >=10 → years (days/365), clamped to P1Y..P9Y

Rounding affects W/M/Y only (not D). Default is floor.

Examples

Input values (days) → output:

  • 0 → P0D
  • 7 → P7D
  • 10 → P1W
  • 63 → P9W
  • 70 → P2M (70/30 = 2 with floor)
  • 300 → P1Y
  • 4000 → P9Y (clamped)
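The binning rules above can be sketched in a few lines of Python. This is a hypothetical re-implementation of days_to_iso8601_bin that matches the documented behavior, not the script's actual code (the real operation can also map errors to None via on_error):

```python
import math

def days_to_iso8601_bin(days, rounding="floor"):
    """Bin a day count into a strict 3-char P[0-9][DWMY] code."""
    round_fn = {"floor": math.floor, "round": round, "ceil": math.ceil}[rounding]
    days = int(days)
    if days < 0:
        raise ValueError("negative day counts are not supported")
    if days <= 9:
        return f"P{days}D"                  # days are never rounded
    weeks = round_fn(days / 7)
    if 1 <= weeks <= 9:
        return f"P{weeks}W"
    months = round_fn(days / 30)            # months approximated as 30 days
    if 1 <= months <= 9:
        return f"P{months}M"
    years = round_fn(days / 365)            # years approximated as 365 days
    return f"P{min(max(years, 1), 9)}Y"     # clamp to P1Y..P9Y
```

With the default floor rounding this reproduces the example table above (for example, 70 days gives P2M because floor(70/7) = 10 overflows weeks and floor(70/30) = 2).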

Preparing "days" in spreadsheets

If your raw data are dates or timestamps:

  • Excel/LibreOffice: if A2 is a collection date and A1 is baseline, =A2 - A1 yields day count (format as General/Number).
  • If strings: convert to proper date types first, then subtract.
  • If weeks/months/years: multiply by 7, 30, or 365 to approximate days.
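The same date-to-days conversion can be done in Python with the standard library alone. A small sketch, assuming ISO-formatted date strings (the function name is illustrative):

```python
from datetime import date

def days_from_baseline(baseline_iso, collection_iso):
    """Day count between two ISO dates (YYYY-MM-DD), ready for binning."""
    baseline = date.fromisoformat(baseline_iso)
    collection = date.fromisoformat(collection_iso)
    return (collection - baseline).days
```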

Skipped empty rows

The parser skips completely empty rows. To emit a row that falls back to static_fields, provide a sentinel (e.g., NA or --) and map it to ~ via map_values. A truly blank line at the end of the file won't produce an output row.


🔀 Handling multi-value fields (e.g., condition)

Some columns (like condition) may contain multiple inline values in a single cell, separated by typical CSV/TSV delimiters (comma, semicolon, pipe, or slash). Use the normalize_multivalue operation to split, clean, map, and rejoin the values with ;, the separator the downstream software expects.

YAML snippet

fields:
  condition:
    source: samples.tumor_code
    operations:
      - strip_quotes
      - trim
      - normalize_multivalue:
          delimiters: [",", ";", "|", "/"]   # what to split on
          join_with: ";"                     # how to rejoin after mapping
          dedupe: false                      # set true to collapse repeats
          map_values:                        # per-token mapping
            "Acute myeloid leukemia (AML)": C92.0
            "Induction Failure AML (AML-IF)": C92.0
            "--": Z00.00

What it does

  1. Split on delimiters.
  2. Trim and unquote each token.
  3. Map via map_values (unmapped tokens pass through).
  4. Drop empties (drop_empty: true by default).
  5. Optionally de-duplicate (dedupe: true).
  6. Rejoin with join_with (default ;).
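The six steps above can be sketched as follows. This is a hypothetical re-implementation of normalize_multivalue for illustration, not the script's own code:

```python
import re

def normalize_multivalue(cell, delimiters=(",", ";", "|", "/"),
                         join_with=";", drop_empty=True, dedupe=False,
                         map_values=None):
    """Split, clean, map, and rejoin a multi-value cell."""
    map_values = map_values or {}
    pattern = "|".join(re.escape(d) for d in delimiters)
    tokens = [t.strip().strip('"\'') for t in re.split(pattern, cell)]
    tokens = [map_values.get(t, t) for t in tokens]   # unmapped tokens pass through
    if drop_empty:
        tokens = [t for t in tokens if t]
    if dedupe:
        seen, unique = set(), []
        for t in tokens:
            if t not in seen:
                seen.add(t)
                unique.append(t)
        tokens = unique
    return join_with.join(tokens)
```

For example, a cell like `"Acute myeloid leukemia (AML)", --` with the mapping from the YAML snippet above rejoins as C92.0;Z00.00.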

🧩 Other behaviors & tips

  • Column validation: the script checks that every source: column exists in the input header; missing columns abort with an error.
  • Output order: strictly follows output_headers.
  • subject_id assignment:
    • If fields.subject_id.source is present: "group mode"; rows that share the same transformed source value receive the same sequential ID (grouped in read order).
    • If absent: "pure counter"; every row receives a new ID.
  • Gzip: .gz extensions are detected automatically for input and output.
  • Empty rows: completely empty rows (including overflow fields) are skipped.
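The two subject_id modes can be sketched like this (a hypothetical helper; the SUBJ prefix and zero-padding are illustrative, not the script's actual ID format):

```python
from itertools import count

def assign_subject_ids(rows, source=None, prefix="SUBJ"):
    """Assign sequential subject IDs in group mode or pure-counter mode."""
    counter = count(1)
    if source is None:                        # pure counter: new ID per row
        return [f"{prefix}{next(counter):04d}" for _ in rows]
    seen = {}                                 # group mode: same value -> same ID
    out = []
    for row in rows:
        key = row[source]
        if key not in seen:                   # first time this value is seen
            seen[key] = f"{prefix}{next(counter):04d}"
        out.append(seen[key])
    return out
```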

✅ Defaults for normalize_multivalue

  • delimiters: [",", ";", "|", "/"]
  • join_with: ";"
  • drop_empty: true
  • dedupe: false

📜 License

Artistic License 2.0
(C) 2025-2026 Manuel Rueda - CNAG