ClarID Codebook Documentation
Overview
This codebook defines a standardized encoding system for species, biosample metadata, assay types, conditions, and other identifiers in the ClarID project. It provides a mapping between human-readable labels and compact codes or stub codes for use in structured identifiers and databases.
See Codebook:
The bundled reference codebook is available in the repository at share/clarid-codebook.yaml.
Version-pinned copies are kept under share/versions/.
Metadata
Defines global codebook info:
metadata:
  version: "0.03" # official ClarID specification version
  local_version: "CNAG-GDC-v1" # project-specific codebook revision (optional)
  author: "M. Rueda" # author
  center: "CNAG" # institution
  date: "2026-04-01" # YYYY-MM-DD
  description: "ClarID codebook" # summary
  repository: "https://github.com/cnag-biomedical-informatics/clarid-tools" # repo URL
About version
version identifies the official ClarID specification release targeted by the
codebook and schema, for example 0.03. ClarID-Tools compatibility is defined
at this release level.
About local_version
local_version is optional and can be used for ad hoc or project-specific
codebook variants without changing the official ClarID specification version.
This is useful when different projects need their own controlled vocabularies,
aliases, or dictionary updates while still conforming to the same ClarID
release.
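As a rough illustration of how the two fields relate, a consumer could compare version against the specification release it supports and treat local_version as an optional qualifier. The sketch below is an assumption, not ClarID-Tools code: check_metadata, SUPPORTED_SPEC_VERSION, and the "version+local_version" display string are hypothetical names and formats chosen for this example.

```python
# Hypothetical sketch: not part of ClarID-Tools.
SUPPORTED_SPEC_VERSION = "0.03"  # assumption: the release this consumer targets


def check_metadata(metadata, supported=SUPPORTED_SPEC_VERSION):
    """Return a display tag for the codebook, or raise if the spec version differs."""
    version = metadata.get("version")
    if version != supported:
        raise ValueError(
            f"codebook targets ClarID {version!r}, expected {supported!r}"
        )
    # local_version is optional and plays no part in the compatibility check.
    local = metadata.get("local_version")
    return f"{version}+{local}" if local else version


metadata = {"version": "0.03", "local_version": "CNAG-GDC-v1"}
print(check_metadata(metadata))             # 0.03+CNAG-GDC-v1
print(check_metadata({"version": "0.03"}))  # 0.03
```

Two projects with different local_version values pass the same check as long as both target the same version.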
Entities
All vocabularies are defined under entities:.
_defaults
Fallback entries used when a value has no match:
entities:
  _defaults:
    Unknown:
      code: UNK
      stub_code: U
      label: "Unknown"
      id: "NCIT:C17998"
    "Not Available":
      code: NAV
      stub_code: n
      label: "Not Available"
      id: "NCIT:C126101"
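The fallback behaviour can be sketched as a dictionary lookup that returns a _defaults entry when a key is absent. This is a minimal sketch under stated assumptions: the lookup function is hypothetical, and the rule used here (Unknown for an unmatched value, Not Available for a missing value) is one plausible reading of "fallback when no match", not confirmed ClarID-Tools behaviour.

```python
# Hypothetical sketch: a tiny in-memory excerpt of the codebook's entities.
ENTITIES = {
    "_defaults": {
        "Unknown": {"code": "UNK", "stub_code": "U", "label": "Unknown"},
        "Not Available": {"code": "NAV", "stub_code": "n", "label": "Not Available"},
    },
    "biosample": {
        "tissue": {
            "Liver": {"code": "LIV", "stub_code": "L", "label": "Liver"},
        },
    },
}


def lookup(entities, domain, field, key):
    """Return the entry for key, or a _defaults entry if there is no match."""
    defaults = entities["_defaults"]
    if key is None:
        # Assumption: a missing value maps to "Not Available".
        return defaults["Not Available"]
    vocabulary = entities.get(domain, {}).get(field, {})
    # Assumption: a value that is present but unmatched maps to "Unknown".
    return vocabulary.get(key, defaults["Unknown"])


print(lookup(ENTITIES, "biosample", "tissue", "Liver")["code"])   # LIV
print(lookup(ENTITIES, "biosample", "tissue", "Spleen")["code"])  # UNK
print(lookup(ENTITIES, "biosample", "tissue", None)["code"])      # NAV
```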
biosample
project
entities:
  biosample:
    project: &all_projects
      "TCGA-AML":
        code: TCGA_AML
        stub_code: AML
        label: "TCGA Acute Myeloid Leukemia"
        id: "NCIT:C17998" # Unknown
About species:
Reference: Based on Schrade et al., Animals 2024, Table 2.
Component I: Species Information
Each species entry is defined by two key elements:
- Element 1 (positions 1–3): tax_code
  A 3-letter taxonomic classification code:
  - 1st letter: Class
  - 2nd letter: Order
  - 3rd letter: Family
  - Example: MPC = Mammalia | Primates | Cercopithecidae
- Element 2 (positions 5–10): code
  A 6-letter binomial acronym formed by:
  - 3 letters from the genus name
  - 3 letters from the species name
  - Example: MacMul = Macaca mulatta
- stub_code: A 2-character Base-62 encoded unique species identifier (e.g. "01" for Homo sapiens, "0E" for Macaca mulatta)
Note: tax_code is provided as metadata and is not used in encode/decode logic.
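The two species encodings described above can be sketched as follows. Several details are assumptions made for illustration: the Base-62 alphabet order (digits, then uppercase, then lowercase), the use of the first three letters of each name for the acronym, and the function names themselves. The codebook text states only the widths and the examples.

```python
import string

# Assumption: alphabet order is 0-9, A-Z, a-z. The codebook does not state it.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
BASE = len(ALPHABET)  # 62


def base62_encode(index, width=2):
    """Encode a non-negative integer as a fixed-width Base-62 string."""
    if not 0 <= index < BASE ** width:
        raise ValueError(f"index must be in [0, {BASE ** width})")
    digits = []
    for _ in range(width):
        index, remainder = divmod(index, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def base62_decode(stub):
    """Decode a Base-62 string back to its integer index."""
    value = 0
    for char in stub:
        value = value * BASE + ALPHABET.index(char)
    return value


def binomial_code(scientific_name):
    """Build a 6-letter acronym from the first 3 letters of genus and species."""
    genus, species = scientific_name.split()[:2]
    return genus[:3].capitalize() + species[:3].capitalize()


print(base62_encode(1))                 # 01
print(base62_decode("0E"))              # 14 under the assumed alphabet
print(binomial_code("Macaca mulatta"))  # MacMul
print(binomial_code("Homo sapiens"))    # HomSap
```

Two Base-62 characters give 62 × 62 = 3844 distinct stub codes, which bounds the number of species a 2-character stub_code can index.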
species
species:
  Human:
    code: HomSap # binomial acronym
    stub_code: "01" # index
    label: "Homo sapiens" # name
    id: "NCBITaxon:9606" # taxonomy
    tax_code: MPH # class|order|family
tissue
tissue:
  Liver:
    code: LIV
    stub_code: L
    label: "Liver"
    id: "UBERON:0002107"
sample_type
sample_type:
  Tumor:
    code: TUM
    stub_code: T
    label: "Tumor"
    id: "NCIT:C4872"
assay
assay:
  RNA_seq:
    code: RNA
    stub_code: R
    label: "RNA-seq"
    id: "EFO:0008896"
timepoint
timepoint:
  Baseline:
    code: BSL
    stub_code: "B"
    label: "Baseline"
    id: "NCIT:C25213"
Patterns
Regex-based formats:
condition_pattern:
  regex: '^([A-Z]\d{2}(?:\.\d+)?)$' # Letter+digits
  code_format: '%s'
  stub_format: '%s'
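To show what condition_pattern accepts, the sketch below applies the same regex and printf-style formats in Python. How ClarID-Tools actually applies code_format and stub_format is not described here, so feeding the first capture group into the format string is an assumption, and encode_condition is a hypothetical helper. The sample values are illustrative only.

```python
import re

# Values copied from the condition_pattern entry above.
CONDITION_PATTERN = {
    "regex": r"^([A-Z]\d{2}(?:\.\d+)?)$",
    "code_format": "%s",
    "stub_format": "%s",
}


def encode_condition(value, pattern=CONDITION_PATTERN, stub=False):
    """Validate value against the regex, then apply the chosen format string."""
    match = re.match(pattern["regex"], value)
    if match is None:
        raise ValueError(f"{value!r} does not match {pattern['regex']}")
    fmt = pattern["stub_format"] if stub else pattern["code_format"]
    # Assumption: the format receives the first capture group.
    return fmt % match.group(1)


print(encode_condition("C92.0"))           # C92.0
print(encode_condition("E11", stub=True))  # E11

for bad in ("c92.0", "C9", "C92."):
    try:
        encode_condition(bad)
    except ValueError:
        print("rejected:", bad)
```

The pattern requires one uppercase letter and exactly two digits, optionally followed by a dot and one or more digits, so lowercase letters, a single digit, and a trailing dot are all rejected.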
Subject
study
Reuses biosample.project:
subject:
  study: *all_projects
type, sex, age_group
type:
  Case:
    code: Case
    stub_code: C
    label: "Case Study"
    id: "NCIT:C15362"
sex:
  Male:
    code: Male
    stub_code: M
    label: "Male"
    id: "PATO:0000384"
age_group:
  Age20to29:
    code: A20_29
    stub_code: A2
    label: "Age 20-29"
    id: "APOLLO:SV_00000241" # age range category
Naming conventions
Vocabulary keys under biosample and subject (e.g. RhesusMacaque, PeripheralBlood) use CamelCase; attributes (e.g. code, stub_code, tax_code) use snake_case.
Rationale: CamelCase keeps multi-word names compact and avoids confusion with attributes.
Exceptions: some keys (e.g. "Not Available", RNA_seq) keep original style for clarity or compatibility.
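These conventions could be checked mechanically when a project adds its own vocabulary. The sketch below is hypothetical and not part of ClarID-Tools: check_names, the two regular expressions, and the EXCEPTIONS set (seeded only with the two exceptions named above) are assumptions for illustration.

```python
import re

CAMEL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")   # e.g. RhesusMacaque, Age20to29
SNAKE_CASE = re.compile(r"^[a-z]+(?:_[a-z]+)*$")  # e.g. stub_code, tax_code
EXCEPTIONS = {"Not Available", "RNA_seq"}         # exceptions named in the text


def check_names(vocabulary):
    """Return (name, problem) pairs for one vocabulary mapping."""
    problems = []
    for key, attributes in vocabulary.items():
        if key not in EXCEPTIONS and not CAMEL_CASE.match(key):
            problems.append((key, "key is not CamelCase"))
        for attribute in attributes:
            if not SNAKE_CASE.match(attribute):
                problems.append((attribute, "attribute is not snake_case"))
    return problems


tissue = {
    "Liver": {"code": "LIV", "stub_code": "L"},
    "peripheral_blood": {"Code": "PB"},
}
print(check_names(tissue))
# [('peripheral_blood', 'key is not CamelCase'), ('Code', 'attribute is not snake_case')]
```

Keys that intentionally keep their original style, such as hyphenated project names, would need to be added to the exception set.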