ClarID Codebook Documentation

Overview

This codebook defines a standardized encoding system for species, biosample metadata, assay types, conditions, and other identifiers in the ClarID project. It provides a mapping between human-readable labels and compact codes or stub codes for use in structured identifiers and databases.

See Codebook:

The bundled reference codebook is available in the repository at share/clarid-codebook.yaml. Version-pinned copies are kept under share/versions/.


Metadata

Defines global codebook info:

metadata:
  version: "0.03"                  # official ClarID specification version
  local_version: "CNAG-GDC-v1"     # project-specific codebook revision (optional)
  author: "M. Rueda"               # author
  center: "CNAG"                   # institution
  date: "2026-04-01"               # YYYY-MM-DD
  description: "ClarID codebook"   # summary
  repository: "https://github.com/cnag-biomedical-informatics/clarid-tools" # repo URL

About version

version identifies the official ClarID specification release targeted by the codebook and schema, for example 0.03. ClarID-Tools compatibility is defined at this release level.

About local_version

local_version is optional and can be used for ad hoc or project-specific codebook variants without changing the official ClarID specification version. This is useful when different projects need their own controlled vocabularies, aliases, or dictionary updates while still conforming to the same ClarID release.
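As a rough illustration of how the two fields could be consumed, the sketch below checks a metadata block against a target specification release. The metadata is written as a plain Python dict standing in for the parsed YAML. The names SUPPORTED_VERSION and check_metadata, and the "version+local_version" display tag, are hypothetical and not part of ClarID-Tools.

```python
from datetime import datetime

# Hypothetical: the specification release this sketch is written against.
SUPPORTED_VERSION = "0.03"


def check_metadata(metadata: dict) -> str:
    """Return a display tag such as '0.03' or '0.03+CNAG-GDC-v1'.

    Raises ValueError when the official version does not match or the
    date cannot be parsed as a year-month-day value.
    """
    version = metadata.get("version")
    if version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported ClarID version: {version!r}")

    # The date is validated only when present.
    if "date" in metadata:
        datetime.strptime(metadata["date"], "%Y-%m-%d")

    # local_version is optional and does not affect compatibility here.
    local = metadata.get("local_version")
    return f"{version}+{local}" if local else version


metadata = {
    "version": "0.03",
    "local_version": "CNAG-GDC-v1",
    "date": "2026-04-01",
}
print(check_metadata(metadata))  # 0.03+CNAG-GDC-v1
```

Compatibility in this sketch depends only on version, so two codebooks with different local_version values pass the same check.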


Entities

All vocabularies are defined under the top-level entities: key.

_defaults

Fallback entries used when a value has no match:

entities:
  _defaults:
    Unknown:
      code: UNK
      stub_code: U
      label: "Unknown"
      id: "NCIT:C17998"
    "Not Available":
      code: NAV
      stub_code: n
      label: "Not Available"
      id: "NCIT:C126101"
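One way such a fallback could behave is sketched below, with the entries copied into plain Python dicts. The lookup helper is hypothetical, and the rule that any unmatched key resolves to Unknown is an assumption of this sketch, not documented ClarID-Tools behaviour.

```python
# Plain dicts standing in for the parsed YAML codebook.
DEFAULTS = {
    "Unknown": {"code": "UNK", "stub_code": "U",
                "label": "Unknown", "id": "NCIT:C17998"},
    "Not Available": {"code": "NAV", "stub_code": "n",
                      "label": "Not Available", "id": "NCIT:C126101"},
}
TISSUE = {
    "Liver": {"code": "LIV", "stub_code": "L",
              "label": "Liver", "id": "UBERON:0002107"},
}


def lookup(vocabulary: dict, key: str, defaults: dict = DEFAULTS) -> dict:
    """Return the entry for `key`, falling back to the shared defaults."""
    if key in vocabulary:
        return vocabulary[key]
    # A default name such as "Not Available" resolves to its own entry;
    # anything else is treated as Unknown (an assumption of this sketch).
    return defaults.get(key, defaults["Unknown"])


print(lookup(TISSUE, "Liver")["code"])          # LIV
print(lookup(TISSUE, "Not Available")["code"])  # NAV
print(lookup(TISSUE, "Spleen")["code"])         # UNK
```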

biosample

project

entities:
  biosample:
    project: &all_projects
      "TCGA-AML":
        code: TCGA_AML
        stub_code: AML
        label: "TCGA Acute Myeloid Leukemia"
        id: "NCIT:C17998" # Unknown

About species:

Reference: Based on Schrade et al., Animals 2024, Table 2.

Component I: Species Information

Each species entry is defined by two key elements, plus a compact stub_code:

  • Element 1 (positions 1–3): tax_code, a 3-letter taxonomic classification code:

    • 1st letter: Class
    • 2nd letter: Order
    • 3rd letter: Family
    • Example: MPC = Mammalia | Primates | Cercopithecidae
  • Element 2 (positions 5–10): code, a 6-letter binomial acronym formed by:

    • 3 letters from the genus name
    • 3 letters from the species name
    • Example: MacMul = Macaca mulatta
  • stub_code: A 2-character Base-62 encoded unique species identifier (e.g. "01" for Homo sapiens, "0E" for Macaca mulatta)

Note: tax_code is provided as metadata and is not used in encode/decode logic.
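The construction of code and stub_code described above can be sketched in a few lines. Two points are assumptions of this sketch rather than statements from the codebook: the acronym takes the first three letters of each name, and the Base-62 alphabet is ordered digits, uppercase, lowercase. The function names are hypothetical.

```python
import string

# Assumed Base-62 alphabet: digits, then uppercase, then lowercase.
# The codebook text does not state the ordering.
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase


def binomial_code(label: str) -> str:
    """Build a 6-letter acronym from the first 3 letters of genus and species."""
    genus, species = label.split()[:2]
    return genus[:3].capitalize() + species[:3].capitalize()


def stub_from_index(index: int) -> str:
    """Encode an integer in 0..3843 as a 2-character Base-62 string."""
    if not 0 <= index < 62 * 62:
        raise ValueError("index does not fit in two Base-62 characters")
    high, low = divmod(index, 62)
    return BASE62[high] + BASE62[low]


def index_from_stub(stub: str) -> int:
    """Decode a 2-character Base-62 string back to its integer."""
    return BASE62.index(stub[0]) * 62 + BASE62.index(stub[1])


print(binomial_code("Macaca mulatta"))  # MacMul
print(binomial_code("Homo sapiens"))    # HomSap
print(stub_from_index(1))               # 01
print(index_from_stub("0E"))            # 14
```

Two Base-62 characters give 62 × 62 = 3844 distinct stubs, which is the capacity of the species index under this scheme.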

species

species:
  Human:
    code: HomSap           # binomial acronym
    stub_code: "01"        # index
    label: "Homo sapiens"  # name
    id: "NCBITaxon:9606"   # taxonomy
    tax_code: MPH          # class|order|family

tissue

tissue:
  Liver:
    code: LIV
    stub_code: L
    label: "Liver"
    id: "UBERON:0002107"

sample_type

sample_type:
  Tumor:
    code: TUM
    stub_code: T
    label: "Tumor"
    id: "NCIT:C4872"

assay

assay:
  RNA_seq:
    code: RNA
    stub_code: R
    label: "RNA-seq"
    id: "EFO:0008896"

timepoint

timepoint:
  Baseline:
    code: BSL
    stub_code: "B"
    label: "Baseline"
    id: "NCIT:C25213"
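The vocabularies above all share the same shape, so a lookup in either direction can be sketched generically. The following is a minimal sketch with a few entries copied into plain Python dicts; encode and decode are hypothetical helper names, not ClarID-Tools functions.

```python
# A small excerpt of the vocabularies above as plain dicts.
CODEBOOK = {
    "tissue": {"Liver": {"code": "LIV", "stub_code": "L"}},
    "sample_type": {"Tumor": {"code": "TUM", "stub_code": "T"}},
    "assay": {"RNA_seq": {"code": "RNA", "stub_code": "R"}},
    "timepoint": {"Baseline": {"code": "BSL", "stub_code": "B"}},
}


def encode(field: str, key: str, form: str = "code") -> str:
    """Map a vocabulary key to its code ('code') or stub ('stub_code')."""
    return CODEBOOK[field][key][form]


def decode(field: str, value: str, form: str = "code") -> str:
    """Map a code or stub back to its vocabulary key."""
    reverse = {entry[form]: key for key, entry in CODEBOOK[field].items()}
    return reverse[value]


print(encode("assay", "RNA_seq"))               # RNA
print(encode("assay", "RNA_seq", "stub_code"))  # R
print(decode("tissue", "L", "stub_code"))       # Liver
```

The reverse map is only well defined if code and stub_code values are unique within each vocabulary, which this sketch takes for granted.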

Patterns

Regex-based formats:

condition_pattern:
  regex: '^([A-Z]\d{2}(?:\.\d+)?)$' # one uppercase letter, two digits, optional decimal part
  code_format: '%s'
  stub_format: '%s'
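To see what the pattern accepts, the sketch below applies the same regex in Python and substitutes the captured group into the format string. Treating code_format and stub_format as printf-style templates filled with the first capture group is an assumption of this sketch; format_condition is a hypothetical helper.

```python
import re

# Values copied from the condition_pattern entry above.
CONDITION_PATTERN = {
    "regex": r"^([A-Z]\d{2}(?:\.\d+)?)$",
    "code_format": "%s",
    "stub_format": "%s",
}


def format_condition(value: str, form: str = "code_format"):
    """Return the formatted condition, or None if the value does not match."""
    match = re.match(CONDITION_PATTERN["regex"], value)
    if match is None:
        return None
    # The captured group is substituted into the printf-style format.
    return CONDITION_PATTERN[form] % match.group(1)


print(format_condition("C22.0"))  # C22.0
print(format_condition("E11"))    # E11
print(format_condition("c22"))    # None
print(format_condition("C2"))     # None
```

Because both formats are '%s', the code and stub forms of a matching condition are identical to the input value.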

Subject

study

Reuses the biosample.project vocabulary through the YAML alias *all_projects:

subject:
  study: *all_projects

type, sex, age_group

type:
  Case:
    code: Case
    stub_code: C
    label: "Case Study"
    id: "NCIT:C15362"
sex:
  Male:
    code: Male
    stub_code: M
    label: "Male"
    id: "PATO:0000384"
age_group:
  Age20to29:
    code: A20_29
    stub_code: A2
    label: "Age 20-29"
    id: "APOLLO:SV_00000241" # age range category

Naming conventions

Vocabulary keys under biosample and subject (e.g. RhesusMacaque, PeripheralBlood) use CamelCase; attributes (e.g. code, stub_code, tax_code) use snake_case.
Rationale: CamelCase keeps multi-word names compact and avoids confusion with attributes.
Exceptions: some keys (e.g. "Not Available", RNA_seq) keep original style for clarity or compatibility.
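
A convention like this can be checked mechanically. The sketch below is one possible linter for a single vocabulary, written against plain Python dicts; the regular expressions, the naming_issues helper, and the exception set (limited to the two keys named above) are assumptions of this sketch.

```python
import re

CAMEL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(?:_[a-z0-9]+)*$")

# Exceptions named in the text above; a real list may be longer.
KEY_EXCEPTIONS = {"Not Available", "RNA_seq"}


def naming_issues(vocabulary: dict) -> list:
    """List keys and attributes that break the naming conventions."""
    issues = []
    for key, entry in vocabulary.items():
        if key not in KEY_EXCEPTIONS and not CAMEL_CASE.match(key):
            issues.append(f"key {key!r} is not CamelCase")
        for attribute in entry:
            if not SNAKE_CASE.match(attribute):
                issues.append(
                    f"attribute {attribute!r} of {key!r} is not snake_case"
                )
    return issues


example = {
    "RhesusMacaque": {"code": "MacMul", "stub_code": "0E"},
    "RNA_seq": {"code": "RNA", "stub_code": "R"},
    "peripheral_blood": {"stubCode": "x"},  # deliberately wrong
}
for issue in naming_issues(example):
    print(issue)
```

Keys such as "TCGA-AML" would also be flagged by this check, so the exception set would need extending for project names.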