Implementation Details

Flowchart

flowchart TD
  subgraph Z["ClarID Formats"]
    direction TB
    D1["Human<br/>CNAG_Test-HomSap-00001-LIV-TUM-RNA-C22.0-TRT-P1W-B01-R05<br/>(Human-friendly)"]
    D2["Stub<br/>CT01001LTR0N401T1WB01R05<br/>(Machine-friendly)"]
  end
  R["Raw Data<br/>(CSV)"] -. "Pre-processing" .-> A["Input Metadata File<br/>(CSV)"]
  A -.-> B["ClarID-Tools<br/>(CLI Reference Implementation)"]
  B -. "clarid-tools validate" .-> C["Codebook<br/>(YAML)<br/>(Controlled Vocabulary)"]
  C -.-> B
  B -. "clarid-tools code" .-> D1
  B -. "clarid-tools code" .-> D2
  D1 -.-> E["Readable Display"] & F["Filenames / Labeling / Pipelines / Databases"]
  D1 <-. "clarid-tools qrcode" .-> G["QR-Code Generation"]
  D2 -.-> F
  D2 <-. "clarid-tools qrcode" .-> G
  style R fill:#f5f5f5,stroke:#777,stroke-width:1px
  style A fill:#fdf6e3,stroke:#333,stroke-width:1px
  style B fill:#e7f3fe,stroke:#1e90ff,stroke-width:1px
  style C fill:#ffe7e7,stroke:#cc0000,stroke-width:1px
  style D1 fill:#e0ffe0,stroke:#228b22,stroke-width:1px
  style D2 fill:#e0ffe0,stroke:#228b22,stroke-width:1px
  style E fill:#fff8dc,stroke:#b8860b,stroke-width:1px
  style F fill:#fff8dc,stroke:#b8860b,stroke-width:1px
  style G fill:#fff8dc,stroke:#b8860b,stroke-width:1px

Architecture

  • Language & framework: Perl 5, Moo and MooX::Options.
  • Parsing / validation: YAML::XS, Text::CSV_XS, JSON::Validator (codebook validated by JSON Schema).
  • QR codes: qrencode (Linux).
  • Config: YAML codebook (controlled vocabulary + optional aliases).

Design choices (short)

  • Fully externalizing the identifier spec into JSON Schema was tried, but the schema became complex (nested regexes and transforms).
  • Hybrid approach: core structural rules are implemented in code for clarity; domain vocabularies (species, tissues, assays, aliases) live in the YAML codebook and are schema-validated.
  • This keeps parsing deterministic and easier to maintain while retaining configurability.

Encoding / decoding

project / study

  • Labels like TCGA_AML remain literal unless an alias is declared in the YAML codebook. Add aliases when you need short representations.
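
The alias rule above can be pictured as a plain dictionary lookup with a literal fallback. The following Python sketch is illustrative only: the `ALIASES` table, the `CNAG_Test` → `CT` pairing, and the helper names are assumptions made for this example, not the actual codebook layout or the Perl implementation.

```python
# Hypothetical alias table; in ClarID-Tools this information lives in the
# YAML codebook. The pairing below is assumed for illustration only.
ALIASES = {"CNAG_Test": "CT"}


def encode_project(label, aliases=ALIASES):
    """Return the declared alias, or the literal label if no alias exists."""
    return aliases.get(label, label)


def decode_project(token, aliases=ALIASES):
    """Map an alias back to its label; unknown tokens are returned unchanged."""
    reverse = {alias: label for label, alias in aliases.items()}
    return reverse.get(token, token)
```

With this table, `encode_project("TCGA_AML")` returns the label unchanged because no alias is declared for it.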

subject_id — Base62, fixed width

  • Numeric subject_id → Base62 (0-9A-Za-z) with fixed width (default: 3) to simplify parsing.
  • Options:
      • --subject_id_pad_length: numeric padding for the human format (if used).
      • --subject_id_base62_width: width of the Base62 field in the stub.
  • Example: subject_id = 999 → Base62 G7 (999 = 16 × 62 + 7, and digit 16 is G) → left-padded to 0G7.
  • Capacity: the largest encodable subject_id is 62^width - 1 (238,327 at the default width of 3).
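
A minimal Python sketch of fixed-width Base62 encoding, assuming the alphabet order given above (0-9, then A-Z, then a-z) and left-padding with `0`. The function names are hypothetical and this is not the Perl code used by ClarID-Tools.

```python
import string

# Alphabet order as stated in the text: digits, uppercase, lowercase.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def base62_encode(n, width=3):
    """Encode a non-negative integer as a fixed-width Base62 string."""
    if n < 0 or n >= 62 ** width:
        raise ValueError(f"{n} does not fit in {width} Base62 characters")
    digits = []
    while n:
        n, remainder = divmod(n, 62)
        digits.append(ALPHABET[remainder])
    # Most significant digit first, then pad on the left with '0'.
    return "".join(reversed(digits)).rjust(width, "0")


def base62_decode(text):
    """Decode a Base62 string (leading zeros allowed) back to an integer."""
    value = 0
    for ch in text:
        value = value * 62 + ALPHABET.index(ch)
    return value


print(base62_encode(999))    # 0G7
print(base62_decode("0G7"))  # 999
```

Because the field width is fixed, a decoder can slice the field out of the stub by position before calling `base62_decode`.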

condition (disease)

  • ICD-10 codes → internal numeric index → Base62 (fixed length, default 3).
  • Human form: multiple conditions separated by +.
  • Stub form: condition codes concatenated (no separator); decoding uses reverse mapping.
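
The sketch below illustrates how concatenated fixed-length condition codes could be split and reversed. It assumes the decoder cuts the run into equal-width chunks, which the fixed length makes possible. The `ICD10_INDEX` table and its index values are invented for the example; the real internal index is not described here.

```python
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
WIDTH = 3  # fixed length per condition code (default stated in the text)

# Hypothetical ICD-10 -> numeric index table, for illustration only.
ICD10_INDEX = {"C22.0": 17, "E11.9": 4242}
INDEX_ICD10 = {index: code for code, index in ICD10_INDEX.items()}


def _to_base62(n, width=WIDTH):
    out = ""
    while n:
        n, remainder = divmod(n, 62)
        out = ALPHABET[remainder] + out
    if len(out) > width:
        raise ValueError("index does not fit in the fixed width")
    return out.rjust(width, "0")


def _from_base62(text):
    value = 0
    for ch in text:
        value = value * 62 + ALPHABET.index(ch)
    return value


def encode_conditions(human):
    """'C22.0+E11.9' -> concatenated fixed-width codes, no separator."""
    return "".join(_to_base62(ICD10_INDEX[code]) for code in human.split("+"))


def decode_conditions(stub):
    """Split into WIDTH-sized chunks, then apply the reverse mapping."""
    if len(stub) % WIDTH:
        raise ValueError("stub length is not a multiple of the code width")
    chunks = [stub[i:i + WIDTH] for i in range(0, len(stub), WIDTH)]
    return "+".join(INDEX_ICD10[_from_base62(chunk)] for chunk in chunks)
```

With the invented table, `encode_conditions("C22.0+E11.9")` yields a six-character string that `decode_conditions` turns back into `C22.0+E11.9`.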

species

  • stub_code declared in the YAML codebook as a 2-character code (Base62 alphabet).
  • One code (e.g., 00) reserved for unknown.
  • Optional tax_code is kept for traceability (not used in stubs).
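
As a rough illustration of the constraints above, this Python sketch checks that every species stub code is exactly two Base62 characters and that no code is used twice. The input mapping, the species names, and the function name are hypothetical; in ClarID-Tools the codebook itself is validated with JSON Schema.

```python
import re

# Two characters from the Base62 alphabet, as described in the text.
STUB_RE = re.compile(r"[0-9A-Za-z]{2}")


def check_species_codes(species_to_stub):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    seen = {}
    for species, stub in species_to_stub.items():
        if not STUB_RE.fullmatch(stub):
            problems.append(f"{species}: {stub!r} is not 2 Base62 characters")
        if stub in seen:
            problems.append(f"{species}: {stub!r} already used by {seen[stub]}")
        seen.setdefault(stub, species)
    return problems


# Hypothetical entries; 00 is kept for 'unknown' as in the example above.
print(check_species_codes({"Unknown": "00", "HomSap": "01", "MusMus": "02"}))
```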

tissue, sample_type, assay

  • Use predefined stub_codes from the codebook (recommended 2–5 chars).
  • Decoding strategy: parse the fixed-width fields first, then run a greedy reverse lookup over the remaining stub codes, trying longer codes first to avoid prefix collisions (e.g., with codes PB, T, and HI, the string PBTHI parses as PB + T + HI).
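
The longest-first lookup can be sketched in a few lines of Python. This is a simplified illustration under the assumption that all candidate codes sit in one flat list; the function name is hypothetical, and the actual tool may keep separate code tables per field.

```python
def split_stub_codes(remainder, codes):
    """Greedily split `remainder` into known codes, trying longer codes first."""
    ordered = sorted(codes, key=len, reverse=True)
    tokens = []
    pos = 0
    while pos < len(remainder):
        for code in ordered:
            if remainder.startswith(code, pos):
                tokens.append(code)
                pos += len(code)
                break
        else:
            # No known code starts at this position.
            raise ValueError(f"no stub code matches {remainder[pos:]!r}")
    return tokens


print(split_stub_codes("PBTHI", ["T", "PB", "HI"]))  # ['PB', 'T', 'HI']
```

Note that a purely greedy scan does not backtrack, so it relies on the codebook being free of combinations that only parse when a shorter code is chosen first.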

Extensibility & pragmatic workarounds

  • Targets subject and biosample entities; extensible to cohorts, datasets, experiments with minor code changes.
  • Temporary workaround: repurpose an unused codebook field (e.g., use tissue for geographic location). Identifiers remain functional as long as the overall structure is preserved.

Implementation notes & tips

  • Keep stub codes short (2–5 chars) and unique to avoid parsing ambiguity. ✅
  • Increase --subject_id_base62_width before cohort size exceeds 62^width - 1.
  • Use YAML codebook aliases for stable short labels.
  • Maintain the JSON Schema when editing the codebook.
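
To plan the width ahead of time, the capacity rule above can be turned into a small helper. The Python sketch below computes the smallest width whose largest value, 62^width - 1, still covers the highest expected subject_id; the function name is hypothetical.

```python
def min_base62_width(max_subject_id):
    """Smallest width such that 62**width - 1 >= max_subject_id."""
    if max_subject_id < 0:
        raise ValueError("subject_id must be non-negative")
    width = 1
    while 62 ** width - 1 < max_subject_id:
        width += 1
    return width


print(min_base62_width(238_327))  # 3 (the largest ID that fits the default width)
print(min_base62_width(238_328))  # 4
```

The result is the value one would pass to --subject_id_base62_width.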