Implementation Details
Flowchart
flowchart TD
    subgraph Z["ClarID Formats"]
        direction TB
        D1["Human<br/>CNAG_Test-HomSap-00001-LIV-TUM-RNA-C22.0-TRT-P1W-B01-R05<br/>(Human-friendly)"]
        D2["Stub<br/>CT01001LTR0N401T1WB01R05<br/>(Machine-friendly)"]
    end
    R["Raw Data<br/>(CSV)"] -. "Pre-processing" .-> A["Input Metadata File<br/>(CSV)"]
    A -.-> B["ClarID-Tools<br/>(CLI Reference Implementation)"]
    B -. "clarid-tools validate" .-> C["Codebook<br/>(YAML)<br/>(Controlled Vocabulary)"]
    C -.-> B
    B -. "clarid-tools code" .-> D1
    B -. "clarid-tools code" .-> D2
    D1 -.-> E["Readable Display"] & F["Filenames / Labeling / Pipelines / Databases"]
    D1 <-. "clarid-tools qrcode" .-> G["QR-Code Generation"]
    D2 -.-> F
    D2 <-. "clarid-tools qrcode" .-> G
    style R fill:#f5f5f5,stroke:#777,stroke-width:1px
    style A fill:#fdf6e3,stroke:#333,stroke-width:1px
    style B fill:#e7f3fe,stroke:#1e90ff,stroke-width:1px
    style C fill:#ffe7e7,stroke:#cc0000,stroke-width:1px
    style D1 fill:#e0ffe0,stroke:#228b22,stroke-width:1px
    style D2 fill:#e0ffe0,stroke:#228b22,stroke-width:1px
    style E fill:#fff8dc,stroke:#b8860b,stroke-width:1px
    style F fill:#fff8dc,stroke:#b8860b,stroke-width:1px
    style G fill:#fff8dc,stroke:#b8860b,stroke-width:1px
Architecture
- Language & framework: Perl 5, `Moo` and `MooX::Options`.
- Parsing / validation: `YAML::XS`, `Text::CSV_XS`, `JSON::Validator` (codebook validated by JSON Schema).
- QR codes: `qrencode` (Linux).
- Config: YAML codebook (controlled vocabulary + optional aliases).
Design choices (short)
- Full externalization of the identifier spec into JSON Schema was tried but became complex (nested regexes and transforms).
- Hybrid approach: core structural rules are implemented in code for clarity; domain vocabularies (species, tissues, assays, aliases) live in the YAML codebook and are schema-validated.
- This keeps parsing deterministic and easier to maintain while retaining configurability.
Encoding / decoding
`project` / `study`
- Labels like `TCGA_AML` remain literal unless an alias is declared in the YAML codebook. Add aliases when you need short representations.
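
The alias rule above amounts to a lookup with a literal fallback. The sketch below illustrates it in Python rather than the tool's Perl; the `PROJECT_ALIASES` table, its single entry, and the `project_stub` helper are hypothetical and do not reflect the actual codebook layout.

```python
# Hypothetical alias table, as it might look after loading the YAML codebook.
PROJECT_ALIASES = {"CNAG_Test": "CT"}


def project_stub(label: str, aliases: dict = PROJECT_ALIASES) -> str:
    """Return the declared alias, or the label itself when none is declared."""
    return aliases.get(label, label)


print(project_stub("CNAG_Test"))  # alias declared, short form is used
print(project_stub("TCGA_AML"))   # no alias declared, label stays literal
```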
`subject_id`: Base62, fixed width
- Numeric `subject_id` → Base62 (`0-9A-Za-z`) with fixed width (default: 3) to simplify parsing.
- Options:
  - `--subject_id_pad_length`: numeric padding for human format (if used).
  - `--subject_id_base62_width`: width for stub Base62 field.
- Example: `subject_id = 999` → Base62 `G7` → padded to `0G7`.
- Capacity: `62^width - 1` unique IDs (238,327 at the default width of 3).
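
The fixed-width encoding can be sketched as follows. This is an illustrative Python version, not the Perl reference implementation; the function names are assumptions, and only the alphabet order (`0-9A-Za-z`), the default width of 3, and the `999` example come from the text above.

```python
import string

# Alphabet order as stated above: digits, then uppercase, then lowercase.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def base62_encode(number: int, width: int = 3) -> str:
    """Encode a non-negative integer as a zero-padded, fixed-width Base62 string."""
    if number < 0:
        raise ValueError("subject_id must be non-negative")
    if number >= 62 ** width:
        raise ValueError(f"{number} does not fit in {width} Base62 digits")
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    # "0" is the first symbol of the alphabet, so it doubles as the pad character.
    return "".join(reversed(digits)).rjust(width, "0")


def base62_decode(text: str) -> int:
    """Inverse of base62_encode; leading zeros do not change the value."""
    value = 0
    for char in text:
        value = value * 62 + ALPHABET.index(char)
    return value


print(base62_encode(999))    # 0G7
print(base62_decode("0G7"))  # 999
```

Because every subject field has the same length, a decoder can slice it out of the stub by position without needing a separator.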
`condition` (disease)
- ICD-10 codes → internal numeric index → Base62 (fixed length, default 3).
- Human form: multiple conditions separated by `+`.
- Stub form: condition codes concatenated (no separator); decoding uses reverse mapping.
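
A rough Python sketch of the condition round trip is shown below. The `ICD10_INDEX` table and its index values are made up for illustration (a real table would cover the full ICD-10 list), and all function names are assumptions. The point is that fixed-length codes can be concatenated without a separator and still be split unambiguously.

```python
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
WIDTH = 3  # fixed length per condition code (the documented default)

# Hypothetical index table; the numbers are invented for this example.
ICD10_INDEX = {"C22.0": 1430, "E11.9": 2750}
INDEX_ICD10 = {index: code for code, index in ICD10_INDEX.items()}


def to_base62(number: int, width: int = WIDTH) -> str:
    digits = ""
    while number:
        number, remainder = divmod(number, 62)
        digits = ALPHABET[remainder] + digits
    return digits.rjust(width, "0")


def from_base62(text: str) -> int:
    value = 0
    for char in text:
        value = value * 62 + ALPHABET.index(char)
    return value


def encode_conditions(human: str) -> str:
    """Human form 'A+B' -> concatenated fixed-width codes, no separator."""
    return "".join(to_base62(ICD10_INDEX[code]) for code in human.split("+"))


def decode_conditions(stub: str) -> str:
    """Split the stub into WIDTH-sized chunks and map each back to ICD-10."""
    if len(stub) % WIDTH:
        raise ValueError("stub length must be a multiple of WIDTH")
    chunks = [stub[i:i + WIDTH] for i in range(0, len(stub), WIDTH)]
    return "+".join(INDEX_ICD10[from_base62(chunk)] for chunk in chunks)


stub = encode_conditions("C22.0+E11.9")
print(stub, "->", decode_conditions(stub))
```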
`species`
- `stub_code` declared in the YAML codebook as a 2-character code (Base62 alphabet).
- One code (e.g., `00`) reserved for unknown.
- Optional `tax_code` is kept for traceability (not used in stubs).
`tissue`, `sample_type`, `assay`
- Use predefined stub_codes from the codebook (recommended 2–5 chars).
- Decoding strategy: parse fixed-width fields first, then greedy reverse lookup on remaining stub codes sorted by descending length to avoid prefix collisions (e.g., `PB`, `T`, `HI` parse `PBTHI` correctly).
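
The longest-match-first lookup can be sketched like this. It is a simplified Python illustration, not the tool's actual parser: the `CODEBOOK` entries, their labels, and the assumption that the variable-width part holds exactly one code per field in a fixed order are all hypothetical.

```python
# Hypothetical codebook excerpts: stub_code -> label, one table per field.
CODEBOOK = {
    "tissue": {"P": "Plasma", "PB": "Peripheral blood", "L": "Liver"},
    "sample_type": {"T": "Tumor", "N": "Normal"},
    "assay": {"HI": "Hi-C", "R": "RNA-seq"},
}
FIELD_ORDER = ["tissue", "sample_type", "assay"]


def decode_variable_fields(stub: str) -> dict:
    """Consume one code per field, always trying the longest codes first."""
    decoded = {}
    rest = stub
    for field in FIELD_ORDER:
        # Descending length means "PB" is tried before its prefix "P".
        codes = sorted(CODEBOOK[field], key=len, reverse=True)
        match = next((code for code in codes if rest.startswith(code)), None)
        if match is None:
            raise ValueError(f"no {field} code matches {rest!r}")
        decoded[field] = match
        rest = rest[len(match):]
    if rest:
        raise ValueError(f"unparsed trailing characters: {rest!r}")
    return decoded


print(decode_variable_fields("PBTHI"))
```

A purely greedy match like this does not backtrack, which is one reason to keep stub codes unique and free of ambiguous overlaps.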
Extensibility & pragmatic workarounds
- Targets `subject` and `biosample` entities; extensible to cohorts, datasets, experiments with minor code changes.
- Temporary workaround: repurpose unused codebook fields (e.g., use `tissue` for geographic location). This keeps identifiers functional if overall structure is preserved.
Implementation notes & tips
- Keep stub codes short (2–5 chars) and unique to avoid parsing ambiguity. ✅
- Increase `--subject_id_base62_width` before cohort size exceeds `62^width - 1`.
. - Use YAML codebook aliases for stable short labels.
- Maintain the JSON Schema when editing the codebook.
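
To judge when the width needs to grow, a small helper can compute the smallest width that still covers a given maximum subject ID. This is a Python sketch with a hypothetical function name; it relies only on the `62^width - 1` bound quoted above.

```python
def min_base62_width(max_subject_id: int) -> int:
    """Smallest width whose largest value (62**width - 1) covers max_subject_id."""
    if max_subject_id < 0:
        raise ValueError("max_subject_id must be non-negative")
    width = 1
    while 62 ** width - 1 < max_subject_id:
        width += 1
    return width


print(min_base62_width(238_327))  # 3: still fits the default width
print(min_base62_width(238_328))  # 4: one past the 3-character limit
```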