ClarID Codebook Documentation
Overview
This codebook defines a standardized encoding system for species, biosample metadata, assay types, conditions, and other identifiers in the ClarID project. It provides a mapping between human-readable labels and compact codes or stub codes for use in structured identifiers and databases.
See Codebook:
# -----------------------------------------------------------------------------
# ClarID Codebook
# -----------------------------------------------------------------------------
# Note: Species Information taken from (Schrade et al., Animals 2024, Table 2):
# COMPONENT I
# * Element 1 (positions 1-3): 3-letter taxonomic code
# (Class | Order | Family, e.g. "MPC" for Mammalia | Primates | Cercopithecidae)
# * Element 2 (positions 5-10): 6-letter binomial acronym
# (first 3 letters of Genus + first 3 letters of species, e.g. "MacMul" for Macaca mulatta)
#
# In this file:
# - code = Element 2 (binomial acronym)
# - tax_code = Element 1 (taxonomic classification)
# included here only as metadata, NOT used in current encode/decode routines
# - stub_code = Base-62 species index (unique identifier)
# The reference codebook uses width 2. Other widths are supported as long as
# all species stub_code values in a given codebook use the same length.
#
# Naming conventions:
# - Vocabulary keys under "biosample" and "subject" use CamelCase
# (e.g. RhesusMacaque, PeripheralBlood).
# - Attributes use snake_case (e.g. code, stub_code, tax_code).
# Rationale: CamelCase keeps multi-word names compact and distinct from fields.
# Exceptions: some keys (e.g. "Not Available", RNA_seq) keep original style.
# -----------------------------------------------------------------------------
metadata:
  version: "0.03" # Publication release
  local_version: "CNAG-2025.04.01" # internal/project revision (optional)
  author: "Manuel Rueda <manuel.rueda@cnag.eu>"
  center: "CNAG"
  date: "2025-04-01"
  description: "CNAG's ClarID codebook."
  repository: "https://github.com/cnag-biomedical-informatics/clarid-tools"
entities:
  # Define "global" fall-through entries
  # Does not work out-of-the-box with YAML::XS (added during BUILD)
  _defaults: &defaults
    "Unknown":
      code: UNK
      stub_code: U
      label: "Unknown"
      id: "NCIT:C17998"
    "Not Available":
      code: NAV
      stub_code: n
      label: "Not Available"
      id: "NCIT:C126101"
  biosample:
    project: &all_projects
      "TCGA-AML":
        code: TCGA_AML
        stub_code: AML
        label: "TCGA Acute Myeloid Leukemia"
        id: "NCIT:C17998" # Unknown
      "TARGET-AML":
        code: TARGET_AML
        stub_code: TAML
        label: "TARGET's Study of Acute Myeloid Leukemia"
        id: "NCIT:C17998" # Unknown
      "CNAG-Test":
        code: CNAG_Test
        stub_code: CT
        label: "CNAG test project for ClarID-Tools"
        id: "NCIT:C17998" # Unknown
    species:
      Unknown:
        code: UnkNow
        stub_code: "00"
        label: "Unknown"
        id: "NCIT:C17998" # Unknown
        tax_code: UNK
      Human:
        code: HomSap
        stub_code: "01"
        label: "Homo sapiens"
        id: "NCBITaxon:9606"
        tax_code: MPH # Mammalia | Primates | Hominidae
      Mouse:
        code: MusMus
        stub_code: "02"
        label: "Mus musculus"
        id: "NCBITaxon:10090"
        tax_code: MRM # Mammalia | Rodentia | Muridae
      Rat:
        code: RatNor
        stub_code: "03"
        label: "Rattus norvegicus"
        id: "NCBITaxon:10116"
        tax_code: MRM # Mammalia | Rodentia | Muridae
      Zebrafish:
        code: DanRer
        stub_code: "04"
        label: "Danio rerio"
        id: "NCBITaxon:7955"
        tax_code: ACC # Actinopterygii | Cypriniformes | Cyprinidae
      Fruitfly:
        code: DroMel
        stub_code: "05"
        label: "Drosophila melanogaster"
        id: "NCBITaxon:7227"
        tax_code: IDD # Insecta | Diptera | Drosophilidae
      Worm:
        code: CaeEle
        stub_code: "06"
        label: "Caenorhabditis elegans"
        id: "NCBITaxon:6239"
        tax_code: CRR # Chromadorea | Rhabditida | Rhabditidae
      Yeast:
        code: SacCer
        stub_code: "07"
        label: "Saccharomyces cerevisiae"
        id: "NCBITaxon:4932"
        tax_code: SSS # Saccharomycetes | Saccharomycetales | Saccharomycetaceae
      Ecoli:
        code: EscCol
        stub_code: "08"
        label: "Escherichia coli"
        id: "NCBITaxon:562"
        tax_code: GEE # Gammaproteobacteria | Enterobacterales | Enterobacteriaceae
      Dog:
        code: CanLup
        stub_code: "09"
        label: "Canis lupus familiaris"
        id: "NCBITaxon:9615"
        tax_code: MCC # Mammalia | Carnivora | Canidae
      Pig:
        code: SusScr
        stub_code: "0A"
        label: "Sus scrofa"
        id: "NCBITaxon:9823"
        tax_code: MAS # Mammalia | Artiodactyla | Suidae
      Cow:
        code: BosTau
        stub_code: "0B"
        label: "Bos taurus"
        id: "NCBITaxon:9913"
        tax_code: MAB # Mammalia | Artiodactyla | Bovidae
      Chicken:
        code: GalGal
        stub_code: "0C"
        label: "Gallus gallus"
        id: "NCBITaxon:9031"
        tax_code: AGP # Aves | Galliformes | Phasianidae
      Rabbit:
        code: OryCun
        stub_code: "0D"
        label: "Oryctolagus cuniculus"
        id: "NCBITaxon:9986"
        tax_code: MLL # Mammalia | Lagomorpha | Leporidae
      RhesusMacaque:
        code: MacMul
        stub_code: "0E"
        label: "Macaca mulatta"
        id: "NCBITaxon:9544"
        tax_code: MPC # Mammalia | Primates | Cercopithecidae
      CynomolgusMacaque:
        code: MacFas
        stub_code: "0F"
        label: "Macaca fascicularis"
        id: "NCBITaxon:9543"
        tax_code: MPC # Mammalia | Primates | Cercopithecidae
      CommonMarmoset:
        code: CalJac
        stub_code: "0G"
        label: "Callithrix jacchus"
        id: "NCBITaxon:9483"
        tax_code: MCP # Mammalia | Primates | Callitrichidae
      GuineaPig:
        code: CavPor
        stub_code: "0H"
        label: "Cavia porcellus"
        id: "NCBITaxon:10141"
        tax_code: MRC # Mammalia | Rodentia | Caviidae
      GoldenHamster:
        code: MesAur
        stub_code: "0I"
        label: "Mesocricetus auratus"
        id: "NCBITaxon:10036"
        tax_code: MRC # Mammalia | Rodentia | Cricetidae
      AfricanClawedFrog:
        code: XenLae
        stub_code: "0J"
        label: "Xenopus laevis"
        id: "NCBITaxon:8355"
        tax_code: AAP # Amphibia | Anura | Pipidae
      Ferret:
        code: MusPut
        stub_code: "0K"
        label: "Mustela putorius furo"
        id: "NCBITaxon:9612"
        tax_code: MCM # Mammalia | Carnivora | Mustelidae
      NakedMoleRat:
        code: HetGla
        stub_code: "0L"
        label: "Heterocephalus glaber"
        id: "NCBITaxon:314479"
        tax_code: MRB # Mammalia | Rodentia | Bathyergidae
      Opossum:
        code: MonDom
        stub_code: "0M"
        label: "Monodelphis domestica"
        id: "NCBITaxon:13710"
        tax_code: MDD # Mammalia | Didelphimorphia | Didelphidae
    tissue:
      Liver:
        code: LIV
        stub_code: L
        label: "Liver"
        id: "UBERON:0002107"
      Lung:
        code: LUN
        stub_code: LU
        label: "Lung"
        id: "UBERON:0002048"
      Kidney:
        code: KID
        stub_code: K
        label: "Kidney"
        id: "UBERON:0002113"
      Blood:
        code: BLO
        stub_code: B
        label: "Blood"
        id: "UBERON:0000178"
      PeripheralBlood:
        code: PBLO
        stub_code: PB
        label: "Peripheral Blood"
        id: "BTO:0000553"
      Tumor:
        code: TUM
        stub_code: T
        label: "Neoplasm"
        id: "NCIT:C3262"
      Brain:
        code: BRN
        stub_code: N
        label: "Brain"
        id: "UBERON:0000955"
      Heart:
        code: HRT
        stub_code: H
        label: "Heart"
        id: "UBERON:0000948"
      Spleen:
        code: SPL
        stub_code: S
        label: "Spleen"
        id: "UBERON:0002106"
      Skin:
        code: SKN
        stub_code: I
        label: "Skin"
        id: "UBERON:0002097"
      Pancreas:
        code: PNC
        stub_code: P
        label: "Pancreas"
        id: "UBERON:0001264"
      Colon:
        code: CLN
        stub_code: C
        label: "Colon"
        id: "UBERON:0000059"
      Stomach:
        code: STM
        stub_code: M
        label: "Stomach"
        id: "UBERON:0000945"
      Muscle:
        code: MSC
        stub_code: V
        label: "Muscle"
        id: "BTO:0000887"
      Intestine:
        code: INT
        stub_code: E
        label: "Intestine"
        id: "UBERON:0000160"
      Bone:
        code: BNE
        stub_code: O
        label: "Bone"
        id: "BTO:0000140"
      AdiposeTissue:
        code: ADT
        stub_code: A
        label: "Adipose tissue"
        id: "UBERON:0001013"
      BoneMarrow:
        code: BMR
        stub_code: R
        label: "Bone marrow"
        id: "UBERON:0002371"
      DerivedCellLine:
        code: DCL
        stub_code: DC
        label: "Derived Cell Line"
        id: "NCIT:C156445"
    sample_type:
      Tumor:
        code: TUM
        stub_code: T
        label: "Neoplasm"
        id: "NCIT:C3262"
      Normal:
        code: NOR
        stub_code: N
        label: "Normal"
        id: "PATO:0000461"
      Primary:
        code: PRI
        stub_code: P
        label: "Primary Tumor Site Indicator"
        id: "NCIT:C172602"
      Recurrence:
        code: REC
        stub_code: R
        label: "Recurrent Neoplasm"
        id: "NCIT:C4798"
    assay:
      RNA_seq:
        code: RNA
        stub_code: R
        label: "RNA-seq"
        id: "EFO:0008896"
      WES:
        code: WES
        stub_code: E
        label: "Exome sequencing"
        id: "EFO:0005396"
      ChIP_seq:
        code: CHI
        stub_code: C
        label: "ChIP-seq"
        id: "EFO:0002692"
      IHC:
        code: IHC
        stub_code: I
        label: "Immunohistochemistry"
        id: "EFO:0022943"
      LC_MS:
        code: LCMS
        stub_code: S
        label: "Liquid Chromatography Mass Spectrometry"
        id: "NCIT:C18475"
      ATAC_seq:
        code: ATAC
        stub_code: A
        label: "ATAC-seq"
        id: "EFO:0007045"
      scRNA_seq:
        code: SCR
        stub_code: N
        label: "Single-cell RNA sequencing"
        id: "EFO:0008913"
      scATAC_seq:
        code: SCAT
        stub_code: T
        label: "ScATAC-seq"
        id: "EFO:0010891"
      HiC:
        code: HIC
        stub_code: HI
        label: "Hi-C"
        id: "EFO:0007693"
      WGBS:
        code: WGB
        stub_code: G
        label: "WGBS" # Whole-genome bisulfite sequencing
        id: "EFO:0008985"
      FlowCytometry:
        code: FCM
        stub_code: F
        label: "Flow Cytometry"
        id: "NCIT:C16585"
      Proteomics:
        code: PRO
        stub_code: P
        label: "Proteomics"
        id: "NCIT:C20085"
      Metabolomics:
        code: METB
        stub_code: B
        label: "Metabolomics"
        id: "NCIT:C49019"
      Microarray:
        code: ARR
        stub_code: X
        label: "Microarray"
        id: "NCIT:C44282"
      ELISA:
        code: ELS
        stub_code: L
        label: "Enzyme Immunoassay"
        id: "NCIT:C17455"
      qPCR:
        code: QPCR
        stub_code: Q
        label: "quantitative polymerase chain reaction"
        id: "AFP:0003769"
      WesternBlot:
        code: WBL
        stub_code: W
        label: "western blot assay"
        id: "OBI:0000854"
      WGS:
        code: WGS
        stub_code: Z
        label: "Whole Genome Sequencing"
        id: "NCIT:C101294"
      ddPCR:
        code: DDP
        stub_code: D
        label: "Droplet Digital PCR"
        id: "NCIT:C166064"
    timepoint:
      Baseline:
        code: BSL
        stub_code: "B"
        label: "Baseline"
        id: "NCIT:C25213"
      Treatment:
        code: TRT
        stub_code: "T"
        label: "Treatment Ongoing"
        id: "NCIT:C165209"
      Surgery:
        code: SUR
        stub_code: "S"
        label: "Surgery"
        id: "NCIT:C17998"
      Challenge:
        code: CHL
        stub_code: "C"
        label: "Challenge"
        id: "NCIT:C78166"
      Collection:
        code: COL
        stub_code: CT
        label: "Collection Time"
        id: "NCIT:C81287"
    condition_pattern: &condition_pattern
      regex: '^([A-Z]\d{2}(?:\.\d+)?)$'
      code_format: '%s'
      stub_format: '%s'
    duration_pattern:
      # accept [P]<digits><D|W|M|Y> OR [P]0N (the leading P is optional)
      regex: '^(?:P?(\d+)([DWMY])|P?(0)(N))$'
      code_format: 'P%d%s' # uses captures
      stub_format: '%d%s'
    batch_pattern:
      # capture up to two digits
      regex: '^(\d{1,2})$'
      code_format: 'B%02d' # B01, B02, ...
      stub_format: 'B%02d' # idem
    replicate_pattern:
      # 1-99
      regex: '^(\d{1,2})$'
      code_format: 'R%02d' # R01, R02, ...
      stub_format: 'R%02d' # idem
  subject:
    study: *all_projects
    type:
      Case:
        code: Case
        stub_code: C
        label: "Case Study"
        id: "NCIT:C15362"
      Control:
        code: Control
        stub_code: N
        label: "Study Control"
        id: "NCIT:C142703"
    condition_pattern: *condition_pattern
    sex:
      Male:
        code: Male
        stub_code: M
        label: "Male"
        id: "PATO:0000384"
      Female:
        code: Female
        stub_code: F
        label: "Female"
        id: "PATO:0000383"
      "Not reported":
        code: NotR
        stub_code: N
        label: "Not Reported"
        id: "NCIT:C43234"
      Unspecified:
        code: "Uns"
        stub_code: S
        label: "Unspecified"
        id: "NCIT:C38046"
    age_group:
      Age0to9:
        code: A0_9
        stub_code: A0
        label: "Age 0-9"
        id: "APOLLO:SV_00000241" # age range category
      Age10to19:
        code: A10_19
        stub_code: A1
        label: "Age 10-19"
        id: "APOLLO:SV_00000241" # age range category
      Age20to29:
        code: A20_29
        stub_code: A2
        label: "Age 20-29"
        id: "APOLLO:SV_00000241" # age range category
      Age30to39:
        code: A30_39
        stub_code: A3
        label: "Age 30-39"
        id: "APOLLO:SV_00000241" # age range category
      Age40to49:
        code: A40_49
        stub_code: A4
        label: "Age 40-49"
        id: "APOLLO:SV_00000241" # age range category
      Age50to59:
        code: A50_59
        stub_code: A5
        label: "Age 50-59"
        id: "APOLLO:SV_00000241" # age range category
      Age60to69:
        code: A60_69
        stub_code: A6
        label: "Age 60-69"
        id: "APOLLO:SV_00000241" # age range category
      Age70to79:
        code: A70_79
        stub_code: A7
        label: "Age 70-79"
        id: "APOLLO:SV_00000241" # age range category
      Age80to89:
        code: A80_89
        stub_code: A8
        label: "Age 80-89"
        id: "APOLLO:SV_00000241" # age range category
      Age90to99:
        code: A90_99
        stub_code: A9
        label: "Age 90-99"
        id: "APOLLO:SV_00000241" # age range category
      Unknown:
        code: UNK
        stub_code: UN # 2 char mandatory - not using global
        label: "Unknown"
        id: "NCIT:C17998"
      'Not Available':
        code: NAV
        stub_code: NA # 2 char mandatory - not using global
        label: "Not Available"
        id: "NCIT:C126101"
Metadata
Defines global codebook info:
metadata:
  version: "0.03" # official ClarID specification version
  local_version: "CNAG-GDC-v1" # project-specific codebook revision (optional)
  author: "M. Rueda" # author
  center: "CNAG" # institution
  date: "2026-04-01" # YYYY-MM-DD
  description: "ClarID codebook" # summary
  repository: "https://github.com/cnag-biomedical-informatics/clarid-tools" # repo URL
About version
version identifies the official ClarID specification release targeted by the
codebook and schema, for example 0.03. ClarID-Tools compatibility is defined
at this release level.
About local_version
local_version is optional and can be used for ad hoc or project-specific
codebook variants without changing the official ClarID specification version.
This is useful when different projects need their own controlled vocabularies,
aliases, or dictionary updates while still conforming to the same ClarID
release.
Entities
All vocabularies are defined under the top-level entities: key.
_defaults
Fall-through entries that apply when no other entry matches:
entities:
  _defaults:
    Unknown:
      code: UNK
      stub_code: U
      label: "Unknown"
      id: "NCIT:C17998"
    "Not Available":
      code: NAV
      stub_code: n
      label: "Not Available"
      id: "NCIT:C126101"
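The fall-through behaviour can be pictured as a merge in which a vocabulary's own entries take precedence over the global defaults. The sketch below is only an illustration of that idea, not the ClarID-Tools implementation: the dictionaries are hand-copied fragments of the codebook, and the helper names `with_defaults` and `lookup` are hypothetical.

```python
# Hand-copied fragments of the codebook (only the fields needed here).
DEFAULTS = {
    "Unknown": {"code": "UNK", "stub_code": "U"},
    "Not Available": {"code": "NAV", "stub_code": "n"},
}
SEX = {
    "Male": {"code": "Male", "stub_code": "M"},
    "Female": {"code": "Female", "stub_code": "F"},
}
AGE_GROUP = {
    "Age20to29": {"code": "A20_29", "stub_code": "A2"},
    # age_group overrides the global default because its stubs must be 2 chars
    "Unknown": {"code": "UNK", "stub_code": "UN"},
}


def with_defaults(vocabulary, defaults=DEFAULTS):
    """Return a copy of the vocabulary with missing default keys filled in."""
    merged = dict(defaults)
    merged.update(vocabulary)  # the vocabulary's own entries win (assumption)
    return merged


def lookup(vocabulary, key, field="code"):
    """Look up one field of one entry, falling through to the defaults."""
    return with_defaults(vocabulary)[key][field]


print(lookup(SEX, "Unknown", "stub_code"))        # taken from _defaults
print(lookup(AGE_GROUP, "Unknown", "stub_code"))  # local override
```

Under this assumption, a vocabulary such as age_group can keep its own two-character stubs while every other vocabulary inherits the single-character ones.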
biosample
project
entities:
  biosample:
    project: &all_projects
      "TCGA-AML":
        code: TCGA_AML
        stub_code: AML
        label: "TCGA Acute Myeloid Leukemia"
        id: "NCIT:C17998" # Unknown
About species:
Reference: Based on Schrade et al., Animals 2024, Table 2.
Component I: Species Information
Each species entry is defined by two key elements, plus a stub_code:
- Element 1 (positions 1-3): tax_code
  A 3-letter taxonomic classification code:
  - 1st letter: Class
  - 2nd letter: Order
  - 3rd letter: Family
  Example: MPC = Mammalia | Primates | Cercopithecidae
- Element 2 (positions 5-10): code
  A 6-letter binomial acronym formed by:
  - the first 3 letters of the genus name
  - the first 3 letters of the species name
  Example: MacMul = Macaca mulatta
- stub_code: A 2-character Base-62 encoded unique species identifier (e.g. "01" for Homo sapiens, "0E" for Macaca mulatta)
Note: tax_code is provided as metadata and is not used in encode/decode logic.
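As a rough illustration of how the binomial acronym and the Base-62 stub relate to a species label and its index, consider the sketch below. The helper names are hypothetical, and the alphabet order (digits, then uppercase, then lowercase) is an assumption inferred from the stub codes in the codebook ("09" is followed by "0A"); ClarID-Tools may derive these values differently.

```python
import string

# Assumed Base-62 alphabet: 0-9, then A-Z, then a-z.
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase


def binomial_acronym(label):
    """Build the 6-letter code from the first two words of a binomial name."""
    genus, species = label.split()[:2]
    return genus[:3].capitalize() + species[:3].capitalize()


def to_base62(index, width=2):
    """Encode a non-negative species index as a fixed-width Base-62 string."""
    if not 0 <= index < 62 ** width:
        raise ValueError(f"index {index} does not fit in {width} characters")
    chars = []
    for _ in range(width):
        index, remainder = divmod(index, 62)
        chars.append(BASE62[remainder])
    return "".join(reversed(chars))


print(binomial_acronym("Macaca mulatta"))  # MacMul
print(to_base62(14))                       # 0E
```

With a width of 2, this scheme leaves room for 62 * 62 = 3844 distinct species indices.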
species
species:
  Human:
    code: HomSap # binomial acronym
    stub_code: "01" # index
    label: "Homo sapiens" # name
    id: "NCBITaxon:9606" # taxonomy
    tax_code: MPH # class|order|family
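Because each entry carries both a code and a stub_code, a consumer can build forward and reverse lookup tables from a vocabulary. The following is a minimal sketch with hand-copied species entries and hypothetical helper names; it also checks the rule stated in the codebook header that all species stub_code values share one length.

```python
# Three species entries copied from the codebook (code and stub_code only).
SPECIES = {
    "Human": {"code": "HomSap", "stub_code": "01"},
    "Mouse": {"code": "MusMus", "stub_code": "02"},
    "RhesusMacaque": {"code": "MacMul", "stub_code": "0E"},
}


def build_maps(vocabulary):
    """Return (code -> stub_code, stub_code -> code) dictionaries."""
    code_to_stub, stub_to_code = {}, {}
    for key, entry in vocabulary.items():
        code, stub = entry["code"], entry["stub_code"]
        if stub in stub_to_code:
            raise ValueError(f"duplicate stub_code {stub!r} at {key}")
        code_to_stub[code] = stub
        stub_to_code[stub] = code
    return code_to_stub, stub_to_code


def has_uniform_stub_width(vocabulary):
    """True when every stub_code in the vocabulary has the same length."""
    return len({len(e["stub_code"]) for e in vocabulary.values()}) == 1


code_to_stub, stub_to_code = build_maps(SPECIES)
print(code_to_stub["MacMul"])  # 0E
print(stub_to_code["01"])      # HomSap
```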
tissue
sample_type
assay
timepoint
Patterns
Regex-based formats:
condition_pattern:
  regex: '^([A-Z]\d{2}(?:\.\d+)?)$' # Letter+digits
  code_format: '%s'
  stub_format: '%s'
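To show how a pattern's regex captures could feed its code_format and stub_format, here is a small sketch using Python's `re` module and printf-style formatting. The regexes and formats are copied from the codebook; the `apply_pattern` helper and the rule of converting all-digit captures to integers for `%d` are assumptions made for this example.

```python
import re

# Regexes and formats copied from the codebook's *_pattern entries.
PATTERNS = {
    "condition": {
        "regex": r"^([A-Z]\d{2}(?:\.\d+)?)$",
        "code_format": "%s",
        "stub_format": "%s",
    },
    "duration": {
        "regex": r"^(?:P?(\d+)([DWMY])|P?(0)(N))$",
        "code_format": "P%d%s",
        "stub_format": "%d%s",
    },
    "batch": {
        "regex": r"^(\d{1,2})$",
        "code_format": "B%02d",
        "stub_format": "B%02d",
    },
}


def apply_pattern(name, value, stub=False):
    """Validate value against a pattern and render its code or stub form."""
    spec = PATTERNS[name]
    match = re.match(spec["regex"], value)
    if match is None:
        raise ValueError(f"{value!r} does not match the {name} pattern")
    # Keep only the captures of the alternative that actually matched.
    captures = [g for g in match.groups() if g is not None]
    # %d needs an integer, so all-digit captures are converted (assumption).
    args = tuple(int(c) if c.isdigit() else c for c in captures)
    return spec["stub_format" if stub else "code_format"] % args


print(apply_pattern("condition", "C71.9"))
print(apply_pattern("duration", "P7D", stub=True))
print(apply_pattern("batch", "3"))
```

A value that fails the regex raises `ValueError` rather than being passed through, which keeps malformed conditions or durations out of generated identifiers.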
Subject
study
Reuses biosample.project through the YAML alias *all_projects (study: *all_projects).
type, sex, age_group
type:
  Case:
    code: Case
    stub_code: C
    label: "Case Study"
    id: "NCIT:C15362"
sex:
  Male:
    code: Male
    stub_code: M
    label: "Male"
    id: "PATO:0000384"
age_group:
  Age20to29:
    code: A20_29
    stub_code: A2
    label: "Age 20-29"
    id: "APOLLO:SV_00000241" # age range category
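The age_group keys follow a regular decade scheme, so a numeric age can be mapped to its key, code, and stub_code mechanically. The sketch below assumes integer ages in years, treats a missing age as "Unknown", and rejects ages outside 0-99 because the codebook defines no bucket for them; the function name is hypothetical.

```python
def age_group_entry(age):
    """Map an integer age in years to (key, code, stub_code)."""
    if age is None:
        # age_group defines its own 2-character stub for Unknown
        return "Unknown", "UNK", "UN"
    if not 0 <= age <= 99:
        raise ValueError(f"no age_group bucket for age {age}")
    low = (age // 10) * 10  # lower bound of the decade
    high = low + 9          # upper bound of the decade
    return f"Age{low}to{high}", f"A{low}_{high}", f"A{low // 10}"


print(age_group_entry(25))    # ('Age20to29', 'A20_29', 'A2')
print(age_group_entry(None))  # ('Unknown', 'UNK', 'UN')
```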
Naming conventions
Vocabulary keys under biosample and subject (e.g. RhesusMacaque, PeripheralBlood) use CamelCase; attributes (e.g. code, stub_code, tax_code) use snake_case.
Rationale: CamelCase keeps multi-word names compact and avoids confusion with attributes.
Exceptions: some keys (e.g. "Not Available", RNA_seq) keep original style for clarity or compatibility.