csv2_clarid_in.py 🧪
Overview
csv2_clarid_in.py is a script to convert raw TSV or CSV files into a standard CSV format for biosample or subject entities. It uses a YAML file to define how each field is handled, with support for field transformations and automatic subject ID generation.
𧬠Included in the containerized version of ClarID-Tools β no installation needed if using the Docker image.
Usage
./csv2_clarid_in.py --entity {biosample|subject} -i input.tsv[.gz] -o output.csv[.gz] -m mapping.yaml [-d delimiter]
Arguments
- --entity: required; biosample or subject
- -i / --input: input TSV or CSV (gzip supported)
- -o / --output: output CSV (gzip supported)
- -m / --mapping: YAML file with the field configuration
- -d: input delimiter (default: tab; use ',' for CSV)
Mapping YAML (minimal example)
output_headers:
  - subject_id
  - age_group
  - sex
fields:
  subject_id:
    source: sample_id
    operations:
      - trim
  age_group:
    source: age
    operations:
      - bucketize_age:
          - {name: 'Adult', min: 20, max: 64}
  sex:
    source: gender
    operations:
      - normalize_sex
static_fields:
  sex: Unknown
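
To make the field pipeline concrete, here is a minimal Python sketch of how one output field could be derived from such a mapping entry (source column, then the listed operations in order, then a static_fields fallback when the result is empty). The operation implementations shown (trim, normalize_sex) are simplified assumptions for illustration, not the script's actual code.

def apply_field(row, field_cfg, static_default=None):
    # Simplified sketch: look up the source column, apply the listed
    # operations in order, and fall back to the static_fields value when
    # the result is empty. Operation behaviors below are assumptions.
    ops = {
        "trim": lambda v: v.strip() if isinstance(v, str) else v,
        "normalize_sex": lambda v: {"m": "Male", "f": "Female"}.get(
            str(v).strip().lower()[:1], v),
    }
    value = row.get(field_cfg["source"])
    for op in field_cfg.get("operations", []):
        if isinstance(op, str) and op in ops and value is not None:
            value = ops[op](value)
    return static_default if value in (None, "") else value

row = {"sample_id": " S001 ", "age": "34", "gender": "f"}
print(apply_field(row, {"source": "gender", "operations": ["normalize_sex"]},
                  static_default="Unknown"))   # -> Female
print(apply_field(row, {"source": "sample_id", "operations": ["trim"]}))  # -> S001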
⏱️ Duration binning (days → ISO8601 PnU)
Many pipelines need a normalized "duration/timepoint" field. This script supports an op that converts a day count into a strict 3-char ISO-like bin:
- Format: P[0-9][DWMY] (examples: P0D, P7D, P3M, P2Y)
- Allowed units: Days (D), Weeks (W), Months (M ≈ 30 days), Years (Y ≈ 365 days)
- Allowed digits: 0-9 (values > 9 in a unit are escalated to a larger unit; years are clamped to P9Y)
Note: The 3-character limit (PnU) is intentional, designed to keep timepoint duration compact and consistent, rather than attempting to represent every possible duration in full detail.
Assumptions
- The input column is in days (integer or numeric string). This keeps the logic predictable and easy to prepare (you can convert anything to days in Excel/R/Python beforehand).
- Blank cells and sentinels can be mapped to null (~ in YAML) and then filled by static_fields (e.g., P0D).
Note: P0D is the valid zero-duration value. If you used P0N before, replace it with P0D.
YAML snippet (duration)
output_headers:
  - duration
fields:
  duration:
    source: samples.days_to_collection
    operations:
      - strip_quotes
      - trim
      - map_values:
          "--": ~
          "NA": ~
          "": ~
      - days_to_iso8601_bin:
          rounding: floor      # one of: floor | round | ceil (default: floor)
          units: [D, W, M, Y]  # allowed output units in this priority
          on_error: ~          # if parse/negative/error -> None, so static_fields can apply
static_fields:
  duration: P0D
How it bins
- 0..9 days → P0D..P9D
- >= 10 days → try weeks (days/7, rounded as configured) → P1W..P9W
- If weeks would be >= 10 → try months (days/30) → P1M..P9M
- If months would be >= 10 → years (days/365), clamped to P1Y..P9Y
Rounding affects W/M/Y only (not D). Default is floor.
Examples
Input values (days) → Output:
- 0 → P0D
- 7 → P7D
- 10 → P1W
- 63 → P9W
- 70 → P2M (70/30 → 2 with floor)
- 300 → P1Y
- 4000 → P9Y (clamped)
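
For reference, the following is a minimal Python sketch of the binning rules described above (unit escalation, rounding, and clamping). It reproduces the documented examples but is illustrative only, not the script's actual implementation.

import math

def days_to_iso8601_bin(days, rounding="floor"):
    # Illustrative sketch of the documented binning rules; the real script may differ.
    round_fn = {"floor": math.floor, "round": round, "ceil": math.ceil}[rounding]
    days = int(float(days))              # accept integers or numeric strings
    if days < 0:
        return None                      # mirrors on_error: ~ in the YAML above
    if days <= 9:
        return f"P{days}D"               # 0..9 days stay in days
    weeks = round_fn(days / 7)
    if weeks <= 9:
        return f"P{max(weeks, 1)}W"      # 10+ days -> weeks
    months = round_fn(days / 30)
    if months <= 9:
        return f"P{max(months, 1)}M"     # 10+ weeks -> months (M ~ 30 days)
    years = round_fn(days / 365)
    return f"P{min(max(years, 1), 9)}Y"  # clamp to P1Y..P9Y

for d in (0, 7, 10, 63, 70, 300, 4000):
    print(d, days_to_iso8601_bin(d))     # matches the examples listed above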
Preparing "days" in spreadsheets
If your raw data are dates or timestamps:
- Excel/LibreOffice: if A2 is a collection date and A1 is the baseline, =A2 - A1 yields the day count (format the cell as General/Number).
- If dates are stored as strings: convert them to proper date types first, then subtract.
- If values are given in weeks/months/years: multiply by 7, 30, or 365 to approximate days.
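
If you prefer Python to a spreadsheet, a short pandas sketch such as the one below can derive the day counts before running csv2_clarid_in.py. The file and column names here are hypothetical; adapt them to your data.

import pandas as pd

# Hypothetical file and column names; adapt to your data.
df = pd.read_csv("raw_samples.tsv", sep="\t")

baseline = pd.to_datetime(df["baseline_date"])
collection = pd.to_datetime(df["collection_date"])

# Integer day count between baseline and collection
df["days_to_collection"] = (collection - baseline).dt.days

df.to_csv("raw_samples_with_days.tsv", sep="\t", index=False)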
Skipped empty rows
The parser skips completely empty rows. To emit a row that falls back to static_fields, provide a sentinel (e.g., NA or --) and map it to ~ via map_values. A truly blank line at file end won't produce an output row.
Handling multi-value fields (e.g., condition)
Some columns (like condition) may contain multiple inline values in a single cell, separated by typical CSV/TSV delimiters (comma, semicolon, pipe, or slash). Use the normalize_multivalue operation to split, clean, map, and rejoin the values with ;, which is the separator the downstream software expects.
YAML snippet
fields:
  condition:
    source: samples.tumor_code
    operations:
      - strip_quotes
      - trim
      - normalize_multivalue:
          delimiters: [",", ";", "|", "/"]  # what to split on
          join_with: ";"                    # how to rejoin after mapping
          dedupe: false                     # set true to collapse repeats
          map_values:                       # per-token mapping
            "Acute myeloid leukemia (AML)": C92.0
            "Induction Failure AML (AML-IF)": C92.0
            "--": Z00.00
What it does
- Split on delimiters.
- Trim and unquote each token.
- Map via map_values (unmapped tokens pass through).
- Drop empties (drop_empty: true by default).
- Optionally de-duplicate (dedupe: true).
- Rejoin with join_with (default ;).
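
As a rough illustration of those steps (an assumed reimplementation, not the script's code), a multi-value cell could be processed along these lines:

import re

def normalize_multivalue(cell, delimiters=(",", ";", "|", "/"), join_with=";",
                         drop_empty=True, dedupe=False, map_values=None):
    # Sketch of the documented split -> trim/unquote -> map -> rejoin behavior.
    map_values = map_values or {}
    pattern = "|".join(re.escape(d) for d in delimiters)
    tokens = re.split(pattern, cell or "")
    tokens = [t.strip().strip('"').strip("'") for t in tokens]  # trim and unquote
    tokens = [map_values.get(t, t) for t in tokens]             # unmapped tokens pass through
    if drop_empty:
        tokens = [t for t in tokens if t]
    if dedupe:
        tokens = list(dict.fromkeys(tokens))                    # keep first occurrence
    return join_with.join(tokens)

print(normalize_multivalue(
    "Acute myeloid leukemia (AML); --",
    map_values={"Acute myeloid leukemia (AML)": "C92.0", "--": "Z00.00"},
))  # -> C92.0;Z00.00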
🧩 Other behaviors & tips
- Column validation: the script checks that every source: column exists in the input header; missing ones abort with an error.
- Output order: strictly follows output_headers.
- subject_id assignment:
  - If fields.subject_id.source is present: "group mode", where rows that share the same transformed source value receive the same sequential ID (grouped in read order). See the sketch after this list.
  - If absent: "pure counter", where every row receives a new ID.
- Gzip: .gz extensions are detected automatically for input and output.
- Empty rows: completely empty rows (including overflow fields) are skipped.
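
To illustrate "group mode" (the ID prefix and zero-padding below are assumptions for the example, not the script's actual ID format):

def assign_subject_ids(transformed_values, prefix="SUBJ"):
    # Group-mode sketch: rows sharing the same transformed source value get the
    # same sequential ID, assigned in read order. Prefix/padding are assumed.
    seen = {}
    ids = []
    for value in transformed_values:
        if value not in seen:
            seen[value] = f"{prefix}{len(seen) + 1:04d}"
        ids.append(seen[value])
    return ids

print(assign_subject_ids(["S1", "S1", "S2", "S1", "S3"]))
# -> ['SUBJ0001', 'SUBJ0001', 'SUBJ0002', 'SUBJ0001', 'SUBJ0003']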
 
Defaults for normalize_multivalue
- delimiters: [",", ";", "|", "/"]
- join_with: ";"
- drop_empty: true
- dedupe: false
License
Artistic License 2.0
(C) 2025 Manuel Rueda - CNAG