
csv2_clarid_in.py 🧪📄

Overview

csv2_clarid_in.py converts raw TSV or CSV files into a standardized CSV for biosample or subject entities. A YAML mapping file defines how each field is handled, with support for field transformations and automatic subject ID generation.

🧬 Included in the containerized version of ClarID-Tools; no installation needed if using the Docker image.


🚀 Usage

./csv2_clarid_in.py --entity {biosample|subject} -i input.tsv[.gz] -o output.csv[.gz] -m mapping.yaml [-d delimiter]

Arguments

  • --entity (required): biosample or subject
  • -i / --input: input TSV or CSV (gzip supported)
  • -o / --output: output CSV (gzip supported)
  • -m / --mapping: YAML file with field config
  • -d: input delimiter (default: tab; use , for CSV)

πŸ—‚οΈ Mapping YAML (minimal example)

output_headers:
  - subject_id
  - age_group
  - sex

fields:
  subject_id:
    source: sample_id
    operations:
      - trim

  age_group:
    source: age
    operations:
      - bucketize_age:
          - {name: 'Adult', min: 20, max: 64}

  sex:
    source: gender
    operations:
      - normalize_sex

static_fields:
  sex: Unknown
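As a sketch of how an operation like bucketize_age might behave, here is a hypothetical Python version (the function name mirrors the YAML key, but the internals are illustrative, not the script's actual code):

```python
# Hypothetical sketch of the bucketize_age operation configured in the
# YAML above; names and error handling are illustrative assumptions.

def bucketize_age(value, buckets):
    """Return the name of the first bucket whose [min, max] range
    contains the integer age, or None if unparsable or unmatched."""
    try:
        age = int(str(value).strip())
    except (TypeError, ValueError):
        return None
    for bucket in buckets:
        if bucket["min"] <= age <= bucket["max"]:
            return bucket["name"]
    return None
```

With the single bucket from the example above, an input of 35 maps to Adult, while 70 falls outside the range and returns None, leaving static_fields to supply a fallback.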

⏱️ Duration binning (days → ISO 8601 PnU)

Many pipelines need a normalized “duration/timepoint” field. This script supports an operation that converts a day count into a strict three-character, ISO 8601-style bin:

  • Format: P[0-9][DWMY] (examples: P0D, P7D, P3M, P2Y)
  • Allowed units: Days (D), Weeks (W), Months (M ≈ 30 days), Years (Y ≈ 365 days)
  • Allowed digits: 0–9 (values >9 in a unit are escalated to a larger unit; years are clamped to P9Y)

Assumptions

  • The input column is in days (integer or numeric string). This keeps the logic predictable and easy to prepare (you can convert anything to days in Excel/R/Python beforehand).
  • Blank cells and sentinels can be mapped to null (~ in YAML) and then filled by static_fields (e.g., P0D).

Note: P0D is the valid zero-duration value. If you used P0N before, replace it with P0D.

YAML snippet (duration)

output_headers:
  - duration

fields:
  duration:
    source: samples.days_to_collection
    operations:
      - strip_quotes
      - trim
      - map_values:
          "--": ~
          "NA": ~
          "": ~
      - days_to_iso8601_bin:
          rounding: floor      # one of: floor | round | ceil (default: floor)
          units: [D, W, M, Y]  # allowed output units in this priority
          on_error: ~          # if parse/negative/error -> None, so static_fields can apply

static_fields:
  duration: P0D

How it bins

  • 0..9 days → P0D..P9D
  • >=10 days → try weeks (days/7, rounded as configured) → P1W..P9W
  • If weeks would be >=10 → try months (days/30) → P1M..P9M
  • If months would be >=10 → years (days/365), clamped to P1Y..P9Y

Rounding affects W/M/Y only (not D). Default is floor.
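The binning rules above can be sketched in Python. This is an illustrative reimplementation under the stated 7/30/365-day approximations, not the script's actual code:

```python
import math

def days_to_iso8601_bin(days, rounding="floor"):
    """Bin a day count into P<digit><unit>, escalating D -> W -> M -> Y.

    Illustrative sketch of the rules described above; the real
    operation's naming and error handling may differ.
    """
    round_fn = {"floor": math.floor, "round": round, "ceil": math.ceil}[rounding]
    try:
        d = int(float(days))
    except (TypeError, ValueError):
        return None          # on_error: parse failures become None
    if d < 0:
        return None          # negative durations are rejected
    if d <= 9:
        return f"P{d}D"      # days are never rounded
    weeks = round_fn(d / 7)
    if weeks <= 9:
        return f"P{weeks}W"
    months = round_fn(d / 30)
    if months <= 9:
        return f"P{months}M"
    years = round_fn(d / 365)
    return f"P{min(max(years, 1), 9)}Y"   # clamp to P1Y..P9Y
```

Feeding it the example inputs below reproduces the documented outputs (e.g., 70 days escalates past weeks, since 10 weeks exceeds one digit, and lands on P2M).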

Examples

Input values (days) → Output:

  • 0 → P0D
  • 7 → P7D
  • 10 → P1W
  • 63 → P9W
  • 70 → P2M (≈ 70/30 → 2 with floor)
  • 300 → P1Y
  • 4000 → P9Y (clamped)

Preparing “days” in spreadsheets

If your raw data are dates or timestamps:

  • Excel/LibreOffice: if A2 is a collection date and A1 is baseline, =A2 - A1 yields day count (format as General/Number).
  • If strings: convert to proper date types first, then subtract.
  • If weeks/months/years: multiply by 7, 30, or 365 to approximate days.
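The same day count can be derived in Python instead of a spreadsheet. A minimal sketch; the dates and variable names are placeholders, not part of the script:

```python
from datetime import date

# Illustrative only: derive a day count from two ISO date strings
# before feeding the result to csv2_clarid_in.py.
baseline = date.fromisoformat("2024-01-01")     # e.g., enrollment date
collection = date.fromisoformat("2024-03-11")   # e.g., sample collection
days_to_collection = (collection - baseline).days
```

Subtracting two date objects yields a timedelta, whose .days attribute is the integer day count the binning operation expects.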

Skipped empty rows

The parser skips completely empty rows. To emit a row that falls back to static_fields, provide a sentinel (e.g., NA or --) and map it to ~ via map_values. A truly blank line at file end won’t produce an output row.


🔀 Handling multi-value fields (e.g., condition)

Some columns (like condition) may contain multiple inline values in a single cell, separated by common delimiters (comma, semicolon, pipe, or slash). Use the normalize_multivalue operation to split, clean, map, and rejoin the values with ;, which the downstream software expects.

YAML snippet

fields:
  condition:
    source: samples.tumor_code
    operations:
      - strip_quotes
      - trim
      - normalize_multivalue:
          delimiters: [",", ";", "|", "/"]  # what to split on
          join_with: ";"                    # how to rejoin after mapping
          dedupe: false                     # set true to collapse repeats
          map_values:                       # per-token mapping
            "Acute myeloid leukemia (AML)": C92.0
            "Induction Failure AML (AML-IF)": C92.0
            "--": Z00.00

What it does

  1. Split on delimiters.
  2. Trim and unquote each token.
  3. Map via map_values (unmapped tokens pass through).
  4. Drop empties (drop_empty: true by default).
  5. Optionally de-duplicate (dedupe: true).
  6. Rejoin with join_with (default ;).
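The six steps can be sketched as a standalone Python function (illustrative only; the real operation's internals may differ):

```python
import re

def normalize_multivalue(value, delimiters=(",", ";", "|", "/"),
                         join_with=";", drop_empty=True, dedupe=False,
                         map_values=None):
    """Illustrative sketch of the six normalize_multivalue steps."""
    # 1. split on any configured delimiter
    pattern = "|".join(re.escape(d) for d in delimiters)
    tokens = re.split(pattern, str(value))
    # 2. trim and unquote each token
    tokens = [t.strip().strip('"').strip("'") for t in tokens]
    # 3. map known tokens; unmapped tokens pass through
    if map_values:
        tokens = [map_values.get(t, t) for t in tokens]
    # 4. drop empties
    if drop_empty:
        tokens = [t for t in tokens if t]
    # 5. optional order-preserving de-duplication
    if dedupe:
        seen = set()
        tokens = [t for t in tokens if not (t in seen or seen.add(t))]
    # 6. rejoin
    return join_with.join(tokens)
```

For example, a cell like "Acute myeloid leukemia (AML), --" with the mapping above rejoins as C92.0;Z00.00.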

🧩 Other behaviors & tips

  • Column validation: the script checks that every source: column exists in the input header; missing ones abort with an error.
  • Output order: strictly follows output_headers.
  • subject_id assignment:
      • If fields.subject_id.source is present (“group mode”): rows that share the same transformed source value receive the same sequential ID (grouped in read order).
      • If absent (“pure counter”): every row receives a new ID.
  • Gzip: .gz extensions are detected automatically for input and output.
  • Empty rows: completely empty rows (including overflow fields) are skipped.
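The two subject_id assignment modes can be sketched with a hypothetical helper; the SUBJ0001 format is an assumption for illustration, not the script's actual ID scheme:

```python
def assign_subject_ids(values, prefix="SUBJ"):
    """Sketch of the two assignment modes described above.

    values: the transformed source column ("group mode"), or None for
    every row ("pure counter"). The prefix/zero-padding is illustrative.
    """
    ids, seen, counter = [], {}, 0
    for v in values:
        if v is None:                 # pure counter: new ID per row
            counter += 1
            ids.append(f"{prefix}{counter:04d}")
        elif v in seen:               # group mode: reuse the group's ID
            ids.append(seen[v])
        else:                         # group mode: first sighting
            counter += 1
            seen[v] = f"{prefix}{counter:04d}"
            ids.append(seen[v])
    return ids
```

Rows are processed in read order, so groups are numbered by first appearance of their source value.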

✅ Defaults for normalize_multivalue

  • delimiters: [",", ";", "|", "/"]
  • join_with: ";"
  • drop_empty: true
  • dedupe: false

📜 License

Artistic License 2.0
(C) 2025 Manuel Rueda - CNAG