csv2_clarid_in.py π§ͺπ¶
Overview¶
csv2_clarid_in.py
is a script to convert raw TSV or CSV files into a standard CSV format for biosample
or subject
entities. It uses a YAML file to define how each field is handled, with support for field transformations and automatic subject ID generation.
𧬠Included in the containerized version of ClarID-Tools β no installation needed if using the Docker image.
π Usage¶
./csv2_clarid_in.py --entity {biosample|subject} -i input.tsv[.gz] -o output.csv[.gz] -m mapping.yaml [-d delimiter]
Arguments¶
--entity
β required:biosample
orsubject
-i
/--input
β input TSV or CSV (gzip supported)-o
/--output
β output CSV (gzip supported)-m
/--mapping
β YAML file with field config-d
β input delimiter (default: tab, use,
for CSV)
ποΈ Mapping YAML (minimal example)¶
output_headers:
- subject_id
- age_group
- sex
fields:
subject_id:
source: sample_id
operations:
- trim
age_group:
source: age
operations:
- bucketize_age:
- {name: 'Adult', min: 20, max: 64}
sex:
source: gender
operations:
- normalize_sex
static_fields:
sex: Unknown
β±οΈ Duration binning (days β ISO8601 PnU)¶
Many pipelines need a normalized βduration/timepointβ field. This script supports an op that converts a day count into a strict 3-char ISO-like bin:
- Format:
P[0-9][DWMY]
(examples:P0D
,P7D
,P3M
,P2Y
) - Allowed units: Days (
D
), Weeks (W
), Months (M
β 30 days), Years (Y
β 365 days) - Allowed digits:
0β9
(values >9 in a unit are escalated to a larger unit; years are clamped toP9Y
)
Assumptions¶
- The input column is in days (integer or numeric string). This keeps the logic predictable and easy to prepare (you can convert anything to days in Excel/R/Python beforehand).
- Blank cells and sentinels can be mapped to
null
(~
in YAML) and then filled bystatic_fields
(e.g.,P0D
).
Note:
P0D
is the valid zero-duration value. If you usedP0N
before, replace it withP0D
.
YAML snippet (duration)¶
output_headers:
- duration
fields:
duration:
source: samples.days_to_collection
operations:
- strip_quotes
- trim
- map_values:
"--": ~
"NA": ~
"": ~
- days_to_iso8601_bin:
rounding: floor # one of: floor | round | ceil (default: floor)
units: [D, W, M, Y] # allowed output units in this priority
on_error: ~ # if parse/negative/error -> None, so static_fields can apply
static_fields:
duration: P0D
How it bins¶
0..9
days βP0D
..P9D
>=10
days β try weeks (days/7
, rounded as configured) βP1W
..P9W
- If weeks would be
>=10
β try months (days/30
) βP1M
..P9M
- If months would be
>=10
β years (days/365
), clamped toP1Y
..P9Y
Rounding affects W/M/Y only (not D). Default is floor
.
Examples¶
Input values (days) β Output:
0
βP0D
7
βP7D
10
βP1W
63
βP9W
70
βP2M
(β 70/30 β 2 with floor)300
βP1Y
4000
βP9Y
(clamped)
Preparing βdaysβ in spreadsheets¶
If your raw data are dates or timestamps:
- Excel/LibreOffice: if A2 is a collection date and A1 is baseline,
=A2 - A1
yields day count (format as General/Number). - If strings: convert to proper date types first, then subtract.
- If weeks/months/years: multiply by
7
,30
, or365
to approximate days.
Skipped empty rows¶
The parser skips completely empty rows. To emit a row that falls back to static_fields
, provide a sentinel (e.g., NA
or --
) and map it to ~
via map_values
. A truly blank line at file end wonβt produce an output row.
π Handling multi-value fields (e.g., condition
)¶
Some columns (like condition
) may contain multiple inline values in a single cell, separated by typical CSV/TSV delimiters (,
, ;
, |
, /
). Use the normalize_multivalue
operation to split, clean, map, and rejoin the values using ;
, which the downstream software expects.
YAML snippet¶
fields:
condition:
source: samples.tumor_code
operations:
- strip_quotes
- trim
- normalize_multivalue:
delimiters: [",", ";", "|", "/"] # what to split on
join_with: ";" # how to rejoin after mapping
dedupe: false # set true to collapse repeats
map_values: # per-token mapping
"Acute myeloid leukemia (AML)": C92.0
"Induction Failure AML (AML-IF)": C92.0
"--": Z00.00
What it does¶
- Split on
delimiters
. - Trim and unquote each token.
- Map via
map_values
(unmapped tokens pass through). - Drop empties (
drop_empty: true
by default). - Optionally de-duplicate (
dedupe: true
). - Rejoin with
join_with
(default;
).
π§© Other behaviors & tips¶
- Column validation: the script checks that every
source:
column exists in the input header; missing ones abort with an error. - Output order: strictly follows
output_headers
. subject_id
assignment:- If
fields.subject_id.source
is present: βgroup modeββrows that share the same transformed source value receive the same sequential ID (grouped in read order). - If absent: βpure counterββevery row receives a new ID.
- Gzip:
.gz
extensions are detected automatically for input and output. - Empty rows: completely empty rows (including overflow fields) are skipped.
β
Defaults for normalize_multivalue
¶
delimiters
:[",", ";", "|", "/"]
join_with
:";"
drop_empty
:true
dedupe
:false
π License¶
Artistic License 2.0
(C) 2025 Manuel Rueda - CNAG