ClarID Codebook Documentation
Overview
This codebook defines a standardized encoding system for species, biosample metadata, assay types, conditions, and other identifiers in the ClarID project. It provides a mapping between human-readable labels and compact codes or stub codes for use in structured identifiers and databases.
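As a rough illustration of the label-to-code mapping described above, the sketch below copies three tissue entries from the codebook into a Python dict and looks them up in both directions. The helper names `encode` and `decode` are hypothetical and are not taken from ClarID-Tools.

```python
# Three tissue entries copied from the codebook; the dict layout mirrors the YAML.
TISSUE = {
    "Liver": {"code": "LIV", "stub_code": "L", "label": "Liver", "id": "UBERON:0002107"},
    "Lung": {"code": "LUN", "stub_code": "LU", "label": "Lung", "id": "UBERON:0002048"},
    "Kidney": {"code": "KID", "stub_code": "K", "label": "Kidney", "id": "UBERON:0002113"},
}


def encode(vocabulary, key, field="code"):
    """Return the code (or stub_code) stored for a vocabulary key."""
    return vocabulary[key][field]


def decode(vocabulary, value, field="code"):
    """Return the vocabulary key whose code (or stub_code) equals value."""
    for key, entry in vocabulary.items():
        if entry[field] == value:
            return key
    raise KeyError(f"{value!r} not found in field {field!r}")
```

For example, `encode(TISSUE, "Lung", "stub_code")` gives the stub code stored for Lung, and `decode(TISSUE, "KID")` maps the code back to its key.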
See Codebook:
# -----------------------------------------------------------------------------
# ClarID Codebook
# -----------------------------------------------------------------------------
# Note: Species Information taken from (Schrade et al., Animals 2024, Table 2):
#   COMPONENT I
#   * Element 1 (positions 1-3): 3-letter taxonomic code
#     (Class | Order | Family, e.g. "MPC" for Mammalia | Primates | Cercopithecidae)
#   * Element 2 (positions 5-10): 6-letter binomial acronym
#     (first 3 letters of Genus + first 3 letters of species, e.g. "MacMul" for Macaca mulatta)
#
# In this file:
#   - code      = Element 2 (binomial acronym)
#   - tax_code  = Element 1 (taxonomic classification)
#                 included here only as metadata, NOT used in current encode/decode routines
#   - stub_code = 2-character Base-62 species index (unique identifier)
#
# Naming conventions:
#   - Vocabulary keys under "biosample" and "subject" use CamelCase
#     (e.g. RhesusMacaque, PeripheralBlood).
#   - Attributes use snake_case (e.g. code, stub_code, tax_code).
#   Rationale: CamelCase keeps multi-word names compact and distinct from fields.
#   Exceptions: some keys (e.g. "Not Available", RNA_seq) keep original style.
# -----------------------------------------------------------------------------
metadata:
  version: "0.02" # official ClarID release
  local_version: "CNAG-2025.09.05" # internal/project revision (optional)
  author: "Manuel Rueda <manuel.rueda@cnag.eu>"
  center: "CNAG"
  date: "2025-09-05"
  description: "CNAG's ClarID codebook."
  repository: "https://github.com/cnag-biomedical-informatics/clarid-tools"
entities:
  # Define "global" fall-through entries
  # Does not work out-of-the-box with YAML::XS (added during BUILD)
  _defaults: &defaults
    "Unknown":
      code: UNK
      stub_code: U
      label: "Unknown"
      id: "NCIT:C17998"
    "Not Available":
      code: NAV
      stub_code: n
      label: "Not Available"
      id: "NCIT:C126101"
  biosample:
    project: &all_projects
      "TCGA-AML":
        code: TCGA_AML
        stub_code: AML
        label: "TCGA Acute Myeloid Leukemia"
        id: "NCIT:C17998" # Unknown
      "TARGET-AML":
        code: TARGET_AML
        stub_code: TAML
        label: "TARGET's Study of Acute Myeloid Leukemia"
        id: "NCIT:C17998" # Unknown
      "CNAG-Test":
        code: CNAG_Test
        stub_code: CT
        label: "CNAG test project for ClarID-Tools"
        id: "NCIT:C17998" # Unknown
    species:
      Unknown:
        code: UnkNow
        stub_code: "00"
        label: "Unknown"
        id: "NCIT:C17998" # Unknown
        tax_code: UNK
      Human:
        code: HomSap
        stub_code: "01"
        label: "Homo sapiens"
        id: "NCBITaxon:9606"
        tax_code: MPH # Mammalia | Primates | Hominidae
      Mouse:
        code: MusMus
        stub_code: "02"
        label: "Mus musculus"
        id: "NCBITaxon:10090"
        tax_code: MRM # Mammalia | Rodentia | Muridae
      Rat:
        code: RatNor
        stub_code: "03"
        label: "Rattus norvegicus"
        id: "NCBITaxon:10116"
        tax_code: MRM # Mammalia | Rodentia | Muridae
      Zebrafish:
        code: DanRer
        stub_code: "04"
        label: "Danio rerio"
        id: "NCBITaxon:7955"
        tax_code: ACC # Actinopterygii | Cypriniformes | Cyprinidae
      Fruitfly:
        code: DroMel
        stub_code: "05"
        label: "Drosophila melanogaster"
        id: "NCBITaxon:7227"
        tax_code: IDD # Insecta | Diptera | Drosophilidae
      Worm:
        code: CaeEle
        stub_code: "06"
        label: "Caenorhabditis elegans"
        id: "NCBITaxon:6239"
        tax_code: CRR # Chromadorea | Rhabditida | Rhabditidae
      Yeast:
        code: SacCer
        stub_code: "07"
        label: "Saccharomyces cerevisiae"
        id: "NCBITaxon:4932"
        tax_code: SSS # Saccharomycetes | Saccharomycetales | Saccharomycetaceae
      Ecoli:
        code: EscCol
        stub_code: "08"
        label: "Escherichia coli"
        id: "NCBITaxon:562"
        tax_code: GEE # Gammaproteobacteria | Enterobacterales | Enterobacteriaceae
      Dog:
        code: CanLup
        stub_code: "09"
        label: "Canis lupus familiaris"
        id: "NCBITaxon:9615"
        tax_code: MCC # Mammalia | Carnivora | Canidae
      Pig:
        code: SusScr
        stub_code: "0A"
        label: "Sus scrofa"
        id: "NCBITaxon:9823"
        tax_code: MAS # Mammalia | Artiodactyla | Suidae
      Cow:
        code: BosTau
        stub_code: "0B"
        label: "Bos taurus"
        id: "NCBITaxon:9913"
        tax_code: MAB # Mammalia | Artiodactyla | Bovidae
      Chicken:
        code: GalGal
        stub_code: "0C"
        label: "Gallus gallus"
        id: "NCBITaxon:9031"
        tax_code: AGP # Aves | Galliformes | Phasianidae
      Rabbit:
        code: OryCun
        stub_code: "0D"
        label: "Oryctolagus cuniculus"
        id: "NCBITaxon:9986"
        tax_code: MLL # Mammalia | Lagomorpha | Leporidae
      RhesusMacaque:
        code: MacMul
        stub_code: "0E"
        label: "Macaca mulatta"
        id: "NCBITaxon:9544"
        tax_code: MPC # Mammalia | Primates | Cercopithecidae
      CynomolgusMacaque:
        code: MacFas
        stub_code: "0F"
        label: "Macaca fascicularis"
        id: "NCBITaxon:9543"
        tax_code: MPC # Mammalia | Primates | Cercopithecidae
      CommonMarmoset:
        code: CalJac
        stub_code: "0G"
        label: "Callithrix jacchus"
        id: "NCBITaxon:9483"
        tax_code: MCP # Mammalia | Primates | Callitrichidae
      GuineaPig:
        code: CavPor
        stub_code: "0H"
        label: "Cavia porcellus"
        id: "NCBITaxon:10141"
        tax_code: MRC # Mammalia | Rodentia | Caviidae
      GoldenHamster:
        code: MesAur
        stub_code: "0I"
        label: "Mesocricetus auratus"
        id: "NCBITaxon:10036"
        tax_code: MRC # Mammalia | Rodentia | Cricetidae
      AfricanClawedFrog:
        code: XenLae
        stub_code: "0J"
        label: "Xenopus laevis"
        id: "NCBITaxon:8355"
        tax_code: AAP # Amphibia | Anura | Pipidae
      Ferret:
        code: MusPut
        stub_code: "0K"
        label: "Mustela putorius furo"
        id: "NCBITaxon:9612"
        tax_code: MCM # Mammalia | Carnivora | Mustelidae
      NakedMoleRat:
        code: HetGla
        stub_code: "0L"
        label: "Heterocephalus glaber"
        id: "NCBITaxon:314479"
        tax_code: MRB # Mammalia | Rodentia | Bathyergidae
      Opossum:
        code: MonDom
        stub_code: "0M"
        label: "Monodelphis domestica"
        id: "NCBITaxon:13710"
        tax_code: MDD # Mammalia | Didelphimorphia | Didelphidae
    tissue:
      Liver:
        code: LIV
        stub_code: L
        label: "Liver"
        id: "UBERON:0002107"
      Lung:
        code: LUN
        stub_code: LU
        label: "Lung"
        id: "UBERON:0002048"
      Kidney:
        code: KID
        stub_code: K
        label: "Kidney"
        id: "UBERON:0002113"
      Blood:
        code: BLO
        stub_code: B
        label: "Blood"
        id: "UBERON:0000178"
      PeripheralBlood:
        code: PBLO
        stub_code: PB
        label: "Peripheral Blood"
        id: "BTO:0000553"
      Tumor:
        code: TUM
        stub_code: T
        label: "Neoplasm"
        id: "NCIT:C3262"
      Brain:
        code: BRN
        stub_code: N
        label: "Brain"
        id: "UBERON:0000955"
      Heart:
        code: HRT
        stub_code: H
        label: "Heart"
        id: "UBERON:0000948"
      Spleen:
        code: SPL
        stub_code: S
        label: "Spleen"
        id: "UBERON:0002106"
      Skin:
        code: SKN
        stub_code: I
        label: "Skin"
        id: "UBERON:0002097"
      Pancreas:
        code: PNC
        stub_code: P
        label: "Pancreas"
        id: "UBERON:0001264"
      Colon:
        code: CLN
        stub_code: C
        label: "Colon"
        id: "UBERON:0000059"
      Stomach:
        code: STM
        stub_code: M
        label: "Stomach"
        id: "UBERON:0000945"
      Muscle:
        code: MSC
        stub_code: V
        label: "Muscle"
        id: "BTO:0000887"
      Intestine:
        code: INT
        stub_code: E
        label: "Intestine"
        id: "UBERON:0000160"
      Bone:
        code: BNE
        stub_code: O
        label: "Bone"
        id: "BTO:0000140"
      AdiposeTissue:
        code: ADT
        stub_code: A
        label: "Adipose tissue"
        id: "UBERON:0001013"
      BoneMarrow:
        code: BMR
        stub_code: R
        label: "Bone marrow"
        id: "UBERON:0002371"
      DerivedCellLine:
        code: DCL
        stub_code: DC
        label: "Derived Cell Line"
        id: "NCIT:C156445"
    sample_type:
      Tumor:
        code: TUM
        stub_code: T
        label: "Neoplasm"
        id: "NCIT:C3262"
      Normal:
        code: NOR
        stub_code: N
        label: "Normal"
        id: "PATO:0000461"
      Primary:
        code: PRI
        stub_code: P
        label: "Primary Tumor Site Indicator"
        id: "NCIT:C172602"
      Recurrence:
        code: REC
        stub_code: R
        label: "Recurrent Neoplasm"
        id: "NCIT:C4798"
    assay:
      RNA_seq:
        code: RNA
        stub_code: R
        label: "RNA-seq"
        id: "EFO:0008896"
      WES:
        code: WES
        stub_code: E
        label: "Exome sequencing"
        id: "EFO:0005396"
      ChIP_seq:
        code: CHI
        stub_code: C
        label: "ChIP-seq"
        id: "EFO:0002692"
      IHC:
        code: IHC
        stub_code: I
        label: "Immunohistochemistry"
        id: "EFO:0022943"
      LC_MS:
        code: LCMS
        stub_code: S
        label: "Liquid Chromatography Mass Spectrometry"
        id: "NCIT:C18475"
      ATAC_seq:
        code: ATAC
        stub_code: A
        label: "ATAC-seq"
        id: "EFO:0007045"
      scRNA_seq:
        code: SCR
        stub_code: N
        label: "Single-cell RNA sequencing"
        id: "EFO:0008913"
      scATAC_seq:
        code: SCAT
        stub_code: T
        label: "ScATAC-seq"
        id: "EFO:0010891"
      HiC:
        code: HIC
        stub_code: HI
        label: "Hi-C"
        id: "EFO:0007693"
      WGBS:
        code: WGB
        stub_code: G
        label: "WGBS" # Whole-genome bisulfite sequencing
        id: "EFO:0008985"
      FlowCytometry:
        code: FCM
        stub_code: F
        label: "Flow Cytometry"
        id: "NCIT:C16585"
      Proteomics:
        code: PRO
        stub_code: P
        label: "Proteomics"
        id: "NCIT:C20085"
      Metabolomics:
        code: METB
        stub_code: B
        label: "Metabolomics"
        id: "NCIT:C49019"
      Microarray:
        code: ARR
        stub_code: X
        label: "Microarray"
        id: "NCIT:C44282"
      ELISA:
        code: ELS
        stub_code: L
        label: "Enzyme Immunoassay"
        id: "NCIT:C17455"
      qPCR:
        code: QPCR
        stub_code: Q
        label: "quantitative polymerase chain reaction"
        id: "AFP:0003769"
      WesternBlot:
        code: WBL
        stub_code: W
        label: "western blot assay"
        id: "OBI:0000854"
      WGS:
        code: WGS
        stub_code: Z
        label: "Whole Genome Sequencing"
        id: "NCIT:C101294"
      ddPCR:
        code: DDP
        stub_code: D
        label: "Droplet Digital PCR"
        id: "NCIT:C166064"
    timepoint:
      Baseline:
        code: BSL
        stub_code: "B"
        label: "Baseline"
        id: "NCIT:C25213"
      Treatment:
        code: TRT
        stub_code: "T"
        label: "Treatment Ongoing"
        id: "NCIT:C165209"
      Surgery:
        code: SUR
        stub_code: "S"
        label: "Surgery"
        id: "NCIT:C17998"
      Challenge:
        code: CHL
        stub_code: "C"
        label: "Challenge"
        id: "NCIT:C78166"
      Collection:
        code: COL
        stub_code: CT
        label: "Collection Time"
        id: "NCIT:C81287"
    condition_pattern: &condition_pattern
      regex: '^([A-Z]\d{2}(?:\.\d+)?)$'
      code_format: '%s'
      stub_format: '%s'
    duration_pattern:
      # accept P<digits><D|W|M|Y> OR exactly P0N
      regex: '^(?:P?(\d+)([DWMY])|P?(0)(N))$'
      code_format: 'P%d%s' # uses captures
      stub_format: '%d%s'
    batch_pattern:
      # capture up to two digits
      regex: '^(\d{1,2})$'
      code_format: 'B%02d' # B01, B02, ...
      stub_format: 'B%02d' # idem
    replicate_pattern:
      # 1-99
      regex: '^(\d{1,2})$'
      code_format: 'R%02d' # R01, R02, ...
      stub_format: 'R%02d' # idem
  subject:
    study: *all_projects
    type:
      Case:
        code: Case
        stub_code: C
        label: "Case Study"
        id: "NCIT:C15362"
      Control:
        code: Control
        stub_code: N
        label: "Study Control"
        id: "NCIT:C142703"
    condition_pattern: *condition_pattern
    sex:
      Male:
        code: Male
        stub_code: M
        label: "Male"
        id: "PATO:0000384"
      Female:
        code: Female
        stub_code: F
        label: "Female"
        id: "PATO:0000383"
      "Not reported":
        code: NotR
        stub_code: N
        label: "Not Reported"
        id: "NCIT:C43234"
      Unspecified:
        code: "Uns"
        stub_code: S
        label: "Unspecified"
        id: "NCIT:C38046"
    age_group:
      Age0to9:
        code: A0_9
        stub_code: A0
        label: "Age 0-9"
        id: "APOLLO:SV_00000241" # age range category
      Age10to19:
        code: A10_19
        stub_code: A1
        label: "Age 10-19"
        id: "APOLLO:SV_00000241" # age range category
      Age20to29:
        code: A20_29
        stub_code: A2
        label: "Age 20-29"
        id: "APOLLO:SV_00000241" # age range category
      Age30to39:
        code: A30_39
        stub_code: A3
        label: "Age 30-39"
        id: "APOLLO:SV_00000241" # age range category
      Age40to49:
        code: A40_49
        stub_code: A4
        label: "Age 40-49"
        id: "APOLLO:SV_00000241" # age range category
      Age50to59:
        code: A50_59
        stub_code: A5
        label: "Age 50-59"
        id: "APOLLO:SV_00000241" # age range category
      Age60to69:
        code: A60_69
        stub_code: A6
        label: "Age 60-69"
        id: "APOLLO:SV_00000241" # age range category
      Age70to79:
        code: A70_79
        stub_code: A7
        label: "Age 70-79"
        id: "APOLLO:SV_00000241" # age range category
      Age80to89:
        code: A80_89
        stub_code: A8
        label: "Age 80-89"
        id: "APOLLO:SV_00000241" # age range category
      Age90to99:
        code: A90_99
        stub_code: A9
        label: "Age 90-99"
        id: "APOLLO:SV_00000241" # age range category
      Unknown:
        code: UNK
        stub_code: UN # 2 char mandatory - not using global
        label: "Unknown"
        id: "NCIT:C17998"
      'Not Available':
        code: NAV
        stub_code: NA # 2 char mandatory - not using global
        label: "Not Available"
        id: "NCIT:C126101"
Metadata
Defines global codebook info:
metadata:
  version: "0.02"                    # version
  local_version: "CNAG-2025.09.05"   # internal/project revision (optional)
  author: "M. Rueda"                 # author
  center: "CNAG"                     # institution
  date: "2025-09-05"                 # YYYY-MM-DD
  description: "ClarID codebook"     # summary
  repository: "https://github.com/cnag-biomedical-informatics/clarid-tools"  # repo URL
About Codebook Version
The codebook and the ClarID-Tools software are versioned in sync.
Each software release is expected to ship with the matching version of the codebook.
Entities
All vocabularies sit under the top-level entities: key.
_defaults
Fallback entries used when a value has no match:
entities:
  _defaults:
    Unknown:
      code: UNK
      stub_code: U
      label: "Unknown"
      id: "NCIT:C17998"
    "Not Available":
      code: NAV
      stub_code: n
      label: "Not Available"
      id: "NCIT:C126101"
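The codebook comments say the defaults are added to the vocabularies at build time rather than through YAML itself. How ClarID-Tools does this is not shown here, so the following is only a minimal sketch of one way such a merge could work, assuming that an entry defined locally (for example the 2-character Unknown under age_group) takes precedence over the global one. The function name `with_defaults` is hypothetical.

```python
import copy

# Global fall-through entries, copied from _defaults.
DEFAULTS = {
    "Unknown": {"code": "UNK", "stub_code": "U", "label": "Unknown", "id": "NCIT:C17998"},
    "Not Available": {"code": "NAV", "stub_code": "n", "label": "Not Available", "id": "NCIT:C126101"},
}


def with_defaults(vocabulary, defaults=DEFAULTS):
    """Return a copy of vocabulary with default entries added for missing keys."""
    merged = copy.deepcopy(defaults)
    merged.update(copy.deepcopy(vocabulary))  # assumption: local entries win
    return merged


# Abbreviated vocabularies, copied from the codebook.
sample_type = {
    "Normal": {"code": "NOR", "stub_code": "N", "label": "Normal", "id": "PATO:0000461"},
}
age_group = {
    "Unknown": {"code": "UNK", "stub_code": "UN", "label": "Unknown", "id": "NCIT:C17998"},
}
```

With this sketch, `with_defaults(sample_type)` gains both fallback entries, while `with_defaults(age_group)` keeps its own Unknown entry and only gains "Not Available" from the globals.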
biosample
project
entities:
  biosample:
    project: &all_projects
      "TCGA-AML":
        code: TCGA_AML
        stub_code: AML
        label: "TCGA Acute Myeloid Leukemia"
        id: "NCIT:C17998" # Unknown
About species:
Reference: Based on Schrade et al., Animals 2024, Table 2.
Component I: Species Information
Each species entry is defined by two key elements, plus a stub code:
- Element 1 (positions 1-3): tax_code
  A 3-letter taxonomic classification code:
  - 1st letter: Class
  - 2nd letter: Order
  - 3rd letter: Family
  Example: MPC = Mammalia | Primates | Cercopithecidae
- Element 2 (positions 5-10): code
  A 6-letter binomial acronym formed by:
  - the first 3 letters of the genus name
  - the first 3 letters of the species name
  Example: MacMul = Macaca mulatta
- stub_code: A 2-character Base-62 encoded unique species identifier (e.g. "01" for Homo sapiens, "0E" for Macaca mulatta)
Note: tax_code is provided as metadata and is not used in encode/decode logic.
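The two naming rules above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the values stored in the codebook remain authoritative, since some entries are assigned by hand (for example Unknown uses UnkNow).

```python
def binomial_acronym(scientific_name):
    """First 3 letters of the genus + first 3 letters of the species epithet."""
    parts = scientific_name.split()
    if len(parts) < 2:
        raise ValueError("expected at least 'Genus species'")
    genus, species = parts[0], parts[1]  # any subspecies epithet is ignored
    return genus[:3].capitalize() + species[:3].capitalize()


def tax_code(tax_class, order, family):
    """Initial letters of Class, Order and Family, upper-cased."""
    return (tax_class[0] + order[0] + family[0]).upper()
```

For instance, `binomial_acronym("Canis lupus familiaris")` drops the subspecies and keeps only the genus and species parts, which matches the CanLup entry in the codebook.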
species
species:
  Human:
    code: HomSap           # binomial acronym
    stub_code: "01"        # index
    label: "Homo sapiens"  # name
    id: "NCBITaxon:9606"   # taxonomy
    tax_code: MPH          # class|order|family
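The species stub codes in the codebook run "00", "01", ... "09", "0A", ... "0M", which is consistent with a Base-62 counter whose digits are ordered 0-9, then A-Z. The sketch below assumes that a-z follow as the remaining digits; that ordering, and the helper names, are assumptions rather than something stated in the codebook.

```python
import string

# Assumed digit order: 0-9, A-Z, a-z (the lowercase position is an assumption).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
BASE = len(ALPHABET)  # 62


def to_stub(index, width=2):
    """Encode a non-negative species index as a fixed-width Base-62 string."""
    if not 0 <= index < BASE ** width:
        raise ValueError(f"index must be in [0, {BASE ** width - 1}]")
    chars = []
    for _ in range(width):
        index, remainder = divmod(index, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))  # most significant digit first


def from_stub(stub):
    """Decode a Base-62 stub back to its integer index."""
    value = 0
    for char in stub:
        value = value * BASE + ALPHABET.index(char)
    return value
```

Two characters give 62 * 62 = 3844 distinct species indices under this scheme.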
tissue
sample_type
assay
timepoint
Patterns
Regex-based formats:
condition_pattern:
  regex: '^([A-Z]\d{2}(?:\.\d+)?)$'  # letter + 2 digits, optional .digits
  code_format: '%s'
  stub_format: '%s'
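To show how a regex and its format strings could work together, here is a minimal Python sketch that uses condition_pattern plus two other patterns from the full codebook (duration_pattern and batch_pattern). The regexes and format strings are copied from the codebook; the `apply_pattern` helper, and the choice to drop unmatched captures and convert all-digit captures to integers, are assumptions about how the formats are applied, not a description of ClarID-Tools.

```python
import re

# Regexes and printf-style formats copied from the codebook.
PATTERNS = {
    "condition": {
        "regex": r"^([A-Z]\d{2}(?:\.\d+)?)$",
        "code_format": "%s",
        "stub_format": "%s",
    },
    "duration": {
        "regex": r"^(?:P?(\d+)([DWMY])|P?(0)(N))$",
        "code_format": "P%d%s",
        "stub_format": "%d%s",
    },
    "batch": {
        "regex": r"^(\d{1,2})$",
        "code_format": "B%02d",
        "stub_format": "B%02d",
    },
}


def apply_pattern(name, value, fmt_key="code_format"):
    """Validate value against a pattern and format its captures, or return None."""
    spec = PATTERNS[name]
    match = re.match(spec["regex"], value)
    if match is None:
        return None
    # Keep only the captures of the branch that matched; digits become ints for %d.
    captures = tuple(
        int(group) if group.isdigit() else group
        for group in match.groups()
        if group is not None
    )
    return spec[fmt_key] % captures
```

Under these assumptions, a duration given as "7D" is normalised to the code "P7D", while the stub format drops the leading P.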
subject
study
Reuses biosample.project through the YAML alias (study: *all_projects).
type, sex, age_group
type:
  Case:
    code: Case
    stub_code: C
    label: "Case Study"
    id: "NCIT:C15362"
sex:
  Male:
    code: Male
    stub_code: M
    label: "Male"
    id: "PATO:0000384"
age_group:
  Age20to29:
    code: A20_29
    stub_code: A2
    label: "Age 20-29"
    id: "APOLLO:SV_00000241" # age range category
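The age_group entries follow a regular 10-year binning, so the key, code and stub code can be derived from an age in years. The sketch below is illustrative only: the function name is hypothetical, and mapping missing or out-of-range ages (below 0 or above 99) to the Unknown entry is an assumption.

```python
def age_group(age):
    """Map an age in years to (key, code, stub_code) using 10-year bins."""
    if age is None or not 0 <= age <= 99:
        # Assumption: anything outside the defined bins falls back to Unknown.
        return ("Unknown", "UNK", "UN")
    low = (int(age) // 10) * 10
    high = low + 9
    return (f"Age{low}to{high}", f"A{low}_{high}", f"A{low // 10}")
```

For example, an age of 25 falls in the Age20to29 bin, with code A20_29 and stub code A2.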
Naming conventions
Vocabulary keys under biosample and subject (e.g. RhesusMacaque, PeripheralBlood) use CamelCase; attributes (e.g. code, stub_code, tax_code) use snake_case.
Rationale: CamelCase keeps multi-word names compact and avoids confusion with attributes.
Exceptions: some keys (e.g. "Not Available", RNA_seq) keep their original style for clarity or compatibility.