# csv2_clarid_in.py

## Overview
`csv2_clarid_in.py` is a script that converts raw TSV or CSV files into a standard CSV format for biosample or subject entities. It uses a YAML file to define how each field is handled, with support for field transformations and automatic subject ID generation.
Included in the containerized version of ClarID-Tools; no installation is needed if you use the Docker image.
Note: These helper scripts are intended to work with the ClarID-Tools release they are shipped with.
## From raw table to ClarID input
`csv2_clarid_in.py` is a preparation step. It does not create ClarID identifiers by itself; it converts a local tabular file into the CSV shape expected by `clarid-tools code --infile`.
The workflow has four steps:

1. Start with a raw TSV or CSV file from a local source.
2. Describe how source columns map to ClarID fields in a YAML file.
3. Run `csv2_clarid_in.py` to create a ClarID-ready CSV.
4. Pass that CSV to `clarid-tools code --infile` to generate identifiers.
For example, a project may start with a table whose column names and values come from a local LIMS, spreadsheet, or public repository:
| sample_barcode | patient | organism | tissue_site | diagnosis | days_from_baseline |
|---|---|---|---|---|---|
| S-001 | P001 | Homo sapiens | liver | Liver cancer | 0 |
| S-002 | P002 | Mus musculus | brain | Brain cancer | 49 |
The mapping YAML tells the script which source columns to read, how to normalize values, and which ClarID fields to emit. The output should contain the fields needed by the selected ClarID entity.
Typical biosample output columns are:

```text
unique_id,subject_id,project,species,tissue,sample_type,assay,condition,timepoint,duration,batch,replicate
```

Typical subject output columns are:

```text
unique_id,study,subject_id,type,condition,sex,age_group
```
Once the CSV is prepared, pass it to ClarID-Tools:
```shell
clarid-tools code \
  --entity biosample \
  --format human \
  --action encode \
  --infile clarid_ready_biosample.csv \
  --sep "," \
  --outfile clarid_encoded_biosample.csv
```
The final output from `clarid-tools code` will contain the generated `clar_id` or `stub_id` column.
## Usage

```shell
./csv2_clarid_in.py --entity {biosample|subject} -i input.tsv[.gz] -o output.csv[.gz] -m mapping.yaml [-d delimiter]
```

### Arguments

- `--entity`: required; `biosample` or `subject`
- `-i/--input`: input TSV or CSV (gzip supported)
- `-o/--output`: output CSV (gzip supported)
- `-m/--mapping`: YAML file with the field configuration
- `-d`: input delimiter (default: tab; use `,` for CSV)
## Mapping YAML (minimal example)
```yaml
output_headers:
  - subject_id
  - age_group
  - sex

fields:
  subject_id:
    source: sample_id
    operations:
      - trim
  age_group:
    source: age
    operations:
      - bucketize_age:
          - {name: 'Adult', min: 20, max: 64}
  sex:
    source: gender
    operations:
      - normalize_sex

static_fields:
  sex: Unknown
```
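To make the mapping semantics concrete, here is a minimal Python sketch of how such a configuration is applied to one input row. The operation implementations below are illustrative stand-ins, not the script's actual code, and the normalization table inside `normalize_sex` is an assumption.

```python
# Illustrative sketch of applying the minimal mapping above to one row.
# The operation names mirror the YAML; the bodies are assumptions, not the
# script's real implementations.

def trim(v):
    return v.strip() if v is not None else None

def normalize_sex(v):
    # Hypothetical normalization table; the real script may differ.
    table = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
    return table.get(v.strip().lower()) if v else None

def bucketize_age(v, buckets):
    try:
        age = int(float(v))
    except (TypeError, ValueError):
        return None
    for b in buckets:
        if b["min"] <= age <= b["max"]:
            return b["name"]
    return None

row = {"sample_id": " S-001 ", "age": "42", "gender": "F"}
out = {
    "subject_id": trim(row["sample_id"]),
    "age_group": bucketize_age(row["age"], [{"name": "Adult", "min": 20, "max": 64}]),
    "sex": normalize_sex(row["gender"]) or "Unknown",  # static_fields fallback
}
print(out)  # {'subject_id': 'S-001', 'age_group': 'Adult', 'sex': 'Female'}
```

Note how `static_fields` only takes effect when the transformed value is empty: a recognizable `gender` value wins over the `Unknown` fallback.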
## Duration binning: days to ISO 8601 PnU

Many pipelines need a normalized "duration/timepoint" field. The script supports a `days_to_iso8601_bin` operation that converts a day count into a strict three-character ISO 8601-like bin:
- Format: `P[0-9][DWMY]` (examples: `P0D`, `P7D`, `P3M`, `P2Y`)
- Allowed units: days (`D`), weeks (`W`), months (`M` ≈ 30 days), years (`Y` ≈ 365 days)
- Allowed digits: `0`-`9` (values > 9 in a unit are escalated to a larger unit; years are clamped to `P9Y`)
Note: The 3-character limit (PnU) is intentional; it keeps timepoint durations compact and consistent rather than attempting to represent every possible duration in full detail.
### Assumptions
- The input column is in days (an integer or a numeric string). This keeps the logic predictable and easy to prepare: you can convert anything to days in Excel/R/Python beforehand.
- Blank cells and sentinels can be mapped to `null` (`~` in YAML) and then filled in by `static_fields` (e.g., `P0D`).
Note: `P0D` is the valid zero-duration value. If you used `P0N` before, replace it with `P0D`.
### YAML snippet (duration)
```yaml
output_headers:
  - duration

fields:
  duration:
    source: samples.days_to_collection
    operations:
      - strip_quotes
      - trim
      - map_values:
          "--": ~
          "NA": ~
          "": ~
      - days_to_iso8601_bin:
          rounding: floor        # one of: floor | round | ceil (default: floor)
          units: [D, W, M, Y]    # allowed output units in this priority
          on_error: ~            # if parse/negative/error -> None, so static_fields can apply

static_fields:
  duration: P0D
```
### How it bins

- `0..9` days → `P0D`..`P9D`
- `>=10` days → try weeks (`days/7`, rounded as configured) → `P1W`..`P9W`
- If weeks would be `>=10` → try months (`days/30`) → `P1M`..`P9M`
- If months would be `>=10` → years (`days/365`), clamped to `P1Y`..`P9Y`
Rounding affects W/M/Y only (not D). Default is floor.
### Examples

Input values (days) → output:

- `0` → `P0D`
- `7` → `P7D`
- `10` → `P1W`
- `63` → `P9W`
- `70` → `P2M` (70/30 ≈ 2 with floor)
- `300` → `P1Y`
- `4000` → `P9Y` (clamped)
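The binning rules can be sketched in a few lines of Python. This is an illustrative re-implementation of the behavior described above (floor rounding by default, escalation across units, year clamping), not the script's actual source.

```python
import math

def days_to_iso8601_bin(days, rounding="floor"):
    """Illustrative sketch of the PnU binning rules; the shipped script may differ."""
    round_fn = {"floor": math.floor, "round": round, "ceil": math.ceil}[rounding]
    d = int(days)
    if d < 0:
        return None                          # on_error path: let static_fields apply
    if d <= 9:
        return f"P{d}D"                      # single-digit day counts stay in days
    w = round_fn(d / 7)
    if 1 <= w <= 9:
        return f"P{w}W"                      # escalate to weeks
    m = round_fn(d / 30)
    if 1 <= m <= 9:
        return f"P{m}M"                      # escalate to months (~30 days)
    y = min(max(round_fn(d / 365), 1), 9)    # years (~365 days), clamped to P1Y..P9Y
    return f"P{y}Y"

for d in (0, 7, 10, 63, 70, 300, 4000):
    print(d, days_to_iso8601_bin(d))
```

Running this reproduces the example table: `70` lands in months because `70/7 = 10` weeks overflows the single digit, and `300` is clamped up to `P1Y` because `floor(300/365)` would be zero.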
### Preparing "days" in spreadsheets
If your raw data are dates or timestamps:
- Excel/LibreOffice: if A2 is a collection date and A1 is the baseline, `=A2-A1` yields the day count (format the cell as General/Number).
- If the dates are strings: convert them to proper date types first, then subtract.
- If values are in weeks/months/years: multiply by `7`, `30`, or `365` to approximate days.
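The same day-count preparation can be done in Python with the standard library, assuming ISO-formatted date strings (column and function names here are illustrative):

```python
from datetime import date

def days_between(baseline, collected):
    """Day count between two ISO date strings, e.g. for a days_to_collection column."""
    return (date.fromisoformat(collected) - date.fromisoformat(baseline)).days

print(days_between("2024-01-01", "2024-03-11"))  # 70
```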
### Skipped empty rows
The parser skips completely empty rows. To emit a row that falls back to `static_fields`, provide a sentinel (e.g., `NA` or `--`) and map it to `~` via `map_values`. A truly blank line at the end of the file won't produce an output row.
## Handling multi-value fields (e.g., condition)

Some columns (like `condition`) may contain multiple inline values in a single cell, separated by typical CSV/TSV delimiters (`,`, `;`, `|`, `/`). Use the `normalize_multivalue` operation to split, clean, map, and rejoin the values with `;`, which the downstream software expects.
### YAML snippet
```yaml
fields:
  condition:
    source: samples.tumor_code
    operations:
      - strip_quotes
      - trim
      - normalize_multivalue:
          delimiters: [",", ";", "|", "/"]   # what to split on
          join_with: ";"                     # how to rejoin after mapping
          dedupe: false                      # set true to collapse repeats
          map_values:                        # per-token mapping
            "Acute myeloid leukemia (AML)": C92.0
            "Induction Failure AML (AML-IF)": C92.0
            "--": Z00.00
```
### What it does
1. Split on `delimiters`.
2. Trim and unquote each token.
3. Map via `map_values` (unmapped tokens pass through).
4. Drop empties (`drop_empty: true` by default).
5. Optionally de-duplicate (`dedupe: true`).
6. Rejoin with `join_with` (default `;`).
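The six steps above can be sketched as a small Python function. This is a hedged approximation of the documented behavior, not the script's actual implementation:

```python
import re

def normalize_multivalue(value, delimiters=(",", ";", "|", "/"), join_with=";",
                         drop_empty=True, dedupe=False, map_values=None):
    """Illustrative sketch of the split/clean/map/rejoin behavior described above."""
    pattern = "|".join(re.escape(d) for d in delimiters)
    # 1-2: split, then trim and unquote each token
    tokens = [t.strip().strip('"\'') for t in re.split(pattern, value or "")]
    # 3: map known tokens; unmapped tokens pass through
    if map_values:
        tokens = [map_values.get(t, t) for t in tokens]
    # 4: drop empty tokens
    if drop_empty:
        tokens = [t for t in tokens if t]
    # 5: order-preserving de-duplication
    if dedupe:
        tokens = list(dict.fromkeys(tokens))
    # 6: rejoin with the configured separator
    return join_with.join(tokens)

print(normalize_multivalue("Acute myeloid leukemia (AML), --",
                           map_values={"Acute myeloid leukemia (AML)": "C92.0",
                                       "--": "Z00.00"}))
# C92.0;Z00.00
```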
## Other behaviors & tips
- Column validation: the script checks that every `source:` column exists in the input header; missing ones abort with an error.
- Output order: strictly follows `output_headers`.
- `subject_id` assignment:
  - If `fields.subject_id.source` is present: "group mode". Rows that share the same transformed source value receive the same sequential ID (grouped in read order).
  - If absent: "pure counter". Every row receives a new ID.
- Gzip: `.gz` extensions are detected automatically for input and output.
- Empty rows: completely empty rows (including overflow fields) are skipped.
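The two `subject_id` assignment modes can be illustrated with a short sketch. The `S` prefix and function name are assumptions for the example; the script's real ID format may differ.

```python
# Illustrative sketch of the two subject_id assignment modes described above.

def assign_subject_ids(values=None, n_rows=0, prefix="S"):
    """Group mode when `values` is given; pure counter otherwise."""
    if values is None:
        # Pure counter: every row receives a new sequential ID.
        return [f"{prefix}{i + 1}" for i in range(n_rows)]
    # Group mode: rows sharing the same transformed source value get the
    # same ID, numbered in the order groups are first seen.
    seen = {}
    out = []
    for v in values:
        if v not in seen:
            seen[v] = f"{prefix}{len(seen) + 1}"
        out.append(seen[v])
    return out

print(assign_subject_ids(values=["P001", "P002", "P001"]))  # ['S1', 'S2', 'S1']
print(assign_subject_ids(n_rows=3))                          # ['S1', 'S2', 'S3']
```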
### Defaults for `normalize_multivalue`

- `delimiters`: `[",", ";", "|", "/"]`
- `join_with`: `";"`
- `drop_empty`: `true`
- `dedupe`: `false`
## License
Artistic License 2.0
(C) 2025-2026 Manuel Rueda - CNAG