📘 ClarID Codebook Documentation

Overview

This codebook defines a standardized encoding system for species, biosample metadata, assay types, conditions, and other identifiers in the ClarID project. It provides a mapping between human-readable labels and compact codes or stub codes for use in structured identifiers and databases.
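
The label-to-code mapping can be illustrated with a minimal Python sketch. The two entries are copied from the tissue vocabulary of the codebook; the helper names `encode` and `decode` are hypothetical and are not part of the ClarID-Tools API.

```python
# Two entries copied from the tissue vocabulary of the codebook.
TISSUE = {
    "Liver": {"code": "LIV", "stub_code": "L"},
    "Kidney": {"code": "KID", "stub_code": "K"},
}


def encode(vocabulary, key, field="code"):
    """Return the compact code (or stub code) for a human-readable key."""
    return vocabulary[key][field]


def decode(vocabulary, value, field="code"):
    """Return the human-readable key whose code (or stub code) equals value."""
    reverse = {entry[field]: key for key, entry in vocabulary.items()}
    return reverse[value]


print(encode(TISSUE, "Liver"))               # LIV
print(encode(TISSUE, "Liver", "stub_code"))  # L
print(decode(TISSUE, "K", "stub_code"))      # Kidney
```

Decoding relies on each code being unique within its vocabulary, which is why the reverse dictionary is built per vocabulary rather than globally.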

See Codebook:
# -----------------------------------------------------------------------------
# ClarID Codebook
# -----------------------------------------------------------------------------
# Note: Species information taken from Schrade et al., Animals 2024, Table 2:
# COMPONENT I 
#   * Element 1 (positions 1-3): 3-letter taxonomic code
#       (Class | Order | Family, e.g. "MPC" for Mammalia | Primates | Cercopithecidae)
#   * Element 2 (positions 5-10): 6-letter binomial acronym
#       (first 3 letters of Genus + first 3 letters of species, e.g. "MacMul" for Macaca mulatta)
#   
# In this file:
#   - code      = Element 2 (binomial acronym)
#   - tax_code  = Element 1 (taxonomic classification)
#                 included here only as metadata, NOT used in current encode/decode routines
#   - stub_code = Base-62 species index (unique identifier)
#       The reference codebook uses width 2. Other widths are supported as long as
#       all species stub_code values in a given codebook use the same length.
#
# Naming conventions:
# - Vocabulary keys under "biosample" and "subject" use CamelCase
#   (e.g. RhesusMacaque, PeripheralBlood).
# - Attributes use snake_case (e.g. code, stub_code, tax_code).
# Rationale: CamelCase keeps multi-word names compact and distinct from fields.
# Exceptions: some keys (e.g. "Not Available", RNA_seq) keep original style.
# -----------------------------------------------------------------------------

metadata:
  version:       "0.03"                # Publication release
  local_version: "CNAG-2025.04.01"     # internal/project revision (optional)
  author:        "Manuel Rueda <manuel.rueda@cnag.eu>"
  center:        "CNAG"
  date:          "2025-04-01"
  description:   "CNAG's ClarID codebook."
  repository:    "https://github.com/cnag-biomedical-informatics/clarid-tools"

entities:

  # Define "global" fall-through entries
  # Does not work out-of-the-box with YAML::XS (added during BUILD)
  _defaults: &defaults
    "Unknown":
      code:       UNK
      stub_code:  U
      label:      "Unknown"
      id:         "NCIT:C17998"
    "Not Available":
      code:       NAV
      stub_code:  n
      label:      "Not Available"
      id:         "NCIT:C126101"

  biosample:
    project: &all_projects
      "TCGA-AML": 
        code:      TCGA_AML
        stub_code: AML
        label:     "TCGA Acute Myeloid Leukemia"
        id:         "NCIT:C17998" # Unknown
      "TARGET-AML":
        code:      TARGET_AML
        stub_code: TAML
        label:     "TARGET’s Study of Acute Myeloid Leukemia"
        id:         "NCIT:C17998" # Unknown
      "CNAG-Test":
        code:      CNAG_Test
        stub_code: CT
        label:     "CNAG test project for ClarID-Tools"
        id:         "NCIT:C17998" # Unknown


    species:
      Unknown:
        code:       UnkNow
        stub_code:  "00"
        label:      "Unknown"
        id:         "NCIT:C17998" # Unknown
        tax_code:   UNK
      Human:
        code:       HomSap
        stub_code:  "01"
        label:      "Homo sapiens"
        id:         "NCBITaxon:9606"
        tax_code:   MPH       # Mammalia | Primates         | Hominidae
      Mouse:
        code:       MusMus
        stub_code:  "02"
        label:      "Mus musculus"
        id:         "NCBITaxon:10090"
        tax_code:   MRM       # Mammalia | Rodentia         | Muridae
      Rat:
        code:       RatNor
        stub_code:  "03"
        label:      "Rattus norvegicus"
        id:         "NCBITaxon:10116"
        tax_code:   MRM       # Mammalia | Rodentia         | Muridae
      Zebrafish:
        code:       DanRer
        stub_code:  "04"
        label:      "Danio rerio"
        id:         "NCBITaxon:7955"
        tax_code:   ACC       # Actinopterygii | Cypriniformes   | Cyprinidae
      Fruitfly:
        code:       DroMel
        stub_code:  "05"
        label:      "Drosophila melanogaster"
        id:         "NCBITaxon:7227"
        tax_code:   IDD       # Insecta      | Diptera          | Drosophilidae
      Worm:
        code:       CaeEle
        stub_code:  "06"
        label:      "Caenorhabditis elegans"
        id:         "NCBITaxon:6239"
        tax_code:   CRR       # Chromadorea | Rhabditida       | Rhabditidae
      Yeast:
        code:       SacCer
        stub_code:  "07"
        label:      "Saccharomyces cerevisiae"
        id:         "NCBITaxon:4932"
        tax_code:   SSS       # Saccharomycetes | Saccharomycetales | Saccharomycetaceae
      Ecoli:
        code:       EscCol
        stub_code:  "08"
        label:      "Escherichia coli"
        id:         "NCBITaxon:562"
        tax_code:   GEE       # Gammaproteobacteria | Enterobacterales  | Enterobacteriaceae
      Dog:
        code:       CanLup
        stub_code:  "09"
        label:      "Canis lupus familiaris"
        id:         "NCBITaxon:9615"
        tax_code:   MCC       # Mammalia | Carnivora        | Canidae
      Pig:
        code:       SusScr
        stub_code:  "0A"
        label:      "Sus scrofa"
        id:         "NCBITaxon:9823"
        tax_code:   MAS       # Mammalia | Artiodactyla     | Suidae
      Cow:
        code:       BosTau
        stub_code:  "0B"
        label:      "Bos taurus"
        id:         "NCBITaxon:9913"
        tax_code:   MAB       # Mammalia | Artiodactyla     | Bovidae
      Chicken:
        code:       GalGal
        stub_code:  "0C"
        label:      "Gallus gallus"
        id:         "NCBITaxon:9031"
        tax_code:   AGP       # Aves      | Galliformes      | Phasianidae
      Rabbit:
        code:       OryCun
        stub_code:  "0D"
        label:      "Oryctolagus cuniculus"
        id:         "NCBITaxon:9986"
        tax_code:   MLL       # Mammalia | Lagomorpha       | Leporidae
      RhesusMacaque:
        code:       MacMul
        stub_code:  "0E"
        label:      "Macaca mulatta"
        id:         "NCBITaxon:9544"
        tax_code:   MPC       # Mammalia | Primates         | Cercopithecidae
      CynomolgusMacaque:
        code:       MacFas
        stub_code:  "0F"
        label:      "Macaca fascicularis"
        id:         "NCBITaxon:9541"
        tax_code:   MPC       # Mammalia | Primates         | Cercopithecidae
      CommonMarmoset:
        code:       CalJac
        stub_code:  "0G"
        label:      "Callithrix jacchus"
        id:         "NCBITaxon:9483"
        tax_code:   MPC       # Mammalia | Primates         | Callitrichidae
      GuineaPig:
        code:       CavPor
        stub_code:  "0H"
        label:      "Cavia porcellus"
        id:         "NCBITaxon:10141"
        tax_code:   MRC       # Mammalia | Rodentia         | Caviidae
      GoldenHamster:
        code:       MesAur
        stub_code:  "0I"
        label:      "Mesocricetus auratus"
        id:         "NCBITaxon:10036"
        tax_code:   MRC       # Mammalia | Rodentia         | Cricetidae
      AfricanClawedFrog:
        code:       XenLae
        stub_code:  "0J"
        label:      "Xenopus laevis"
        id:         "NCBITaxon:8355"
        tax_code:   AAP       # Amphibia   | Anura            | Pipidae
      Ferret:
        code:       MusPut
        stub_code:  "0K"
        label:      "Mustela putorius furo"
        id:         "NCBITaxon:9669"
        tax_code:   MCM       # Mammalia   | Carnivora        | Mustelidae
      NakedMoleRat:
        code:       HetGla
        stub_code:  "0L"
        label:      "Heterocephalus glaber"
        id:         "NCBITaxon:10181"
        tax_code:   MRB       # Mammalia   | Rodentia         | Bathyergidae
      Opossum:
        code:       MonDom
        stub_code:  "0M"
        label:      "Monodelphis domestica"
        id:         "NCBITaxon:13616"
        tax_code:   MDD       # Mammalia   | Didelphimorphia  | Didelphidae

    tissue:
      Liver:
        code:       LIV
        stub_code:  L
        label:      "Liver"
        id:         "UBERON:0002107"
      Lung:
        code:       LUN
        stub_code:  LU
        label:      "Lung"
        id:         "UBERON:0002048"
      Kidney:
        code:       KID
        stub_code:  K
        label:      "Kidney"
        id:         "UBERON:0002113"
      Blood:
        code:       BLO
        stub_code:  B
        label:      "Blood"
        id:         "UBERON:0000178"
      PeripheralBlood:
        code:       PBLO
        stub_code:  PB
        label:      "Peripheral Blood"
        id:         "BTO:0000553"
      Tumor:
        code:       TUM
        stub_code:  T
        label:      "Neoplasm"
        id:         "NCIT:C3262"
      Brain:
        code:       BRN
        stub_code:  N
        label:      "Brain"
        id:         "UBERON:0000955"
      Heart:
        code:       HRT
        stub_code:  H
        label:      "Heart"
        id:         "UBERON:0000948"
      Spleen:
        code:       SPL
        stub_code:  S
        label:      "Spleen"
        id:         "UBERON:0002106"
      Skin:
        code:       SKN
        stub_code:  I
        label:      "Skin"
        id:         "UBERON:0002097"
      Pancreas:
        code:       PNC
        stub_code:  P
        label:      "Pancreas"
        id:         "UBERON:0001264"
      Colon:
        code:       CLN
        stub_code:  C
        label:      "Colon"
        id:         "UBERON:0000059"
      Stomach:
        code:       STM
        stub_code:  M
        label:      "Stomach"
        id:         "UBERON:0000945"
      Muscle:
        code:       MSC
        stub_code:  V
        label:      "Muscle"
        id:         "BTO:0000887"
      Intestine:
        code:       INT
        stub_code:  E
        label:      "Intestine"
        id:         "UBERON:0000160"
      Bone:
        code:       BNE
        stub_code:  O
        label:      "Bone"
        id:         "BTO:0000140"
      AdiposeTissue:
        code:       ADT
        stub_code:  A
        label:      "Adipose tissue"
        id:         "UBERON:0001013"
      BoneMarrow:
        code:       BMR
        stub_code:  R
        label:      "Bone marrow"
        id:         "UBERON:0002371"
      DerivedCellLine:
        code:       DCL
        stub_code:  DC
        label:      "Derived Cell Line"
        id:         "NCIT:C156445" 

    sample_type:
      Tumor:
        code:       TUM
        stub_code:  T
        label:      "Neoplasm"
        id:         "NCIT:C3262"
      Normal:
        code:       NOR
        stub_code:  N
        label:      "Normal"
        id:         "PATO:0000461"
      Primary:
        code:       PRI
        stub_code:  P
        label:      "Primary Tumor Site Indicator"
        id:         "NCIT:C172602"
      Recurrence:
        code:       REC
        stub_code:  R
        label:      "Recurrent Neoplasm"
        id:         "NCIT:C4798"
    assay:
      RNA_seq:
        code:       RNA
        stub_code:  R
        label:      "RNA-seq"
        id:         "EFO:0008896"
      WES:
        code:       WES
        stub_code:  E
        label:      "Exome sequencing"
        id:         "EFO:0005396"
      ChIP_seq:
        code:       CHI
        stub_code:  C
        label:      "ChIP-seq"
        id:         "EFO:0002692"
      IHC:
        code:       IHC
        stub_code:  I
        label:      "Immunohistochemistry"
        id:         "EFO:0022943"
      LC_MS:
        code:       LCMS
        stub_code:  S
        label:      "Liquid Chromatography Mass Spectrometry"
        id:         "NCIT:C18475"
      ATAC_seq:
        code:       ATAC
        stub_code:  A
        label:      "ATAC-seq"
        id:         "EFO:0007045"
      scRNA_seq:
        code:       SCR
        stub_code:  N
        label:      "Single-cell RNA sequencing"
        id:         "EFO:0008913"
      scATAC_seq:
        code:       SCAT
        stub_code:  T
        label:      "ScATAC-seq"
        id:         "EFO:0010891"
      HiC:
        code:       HIC
        stub_code:  HI
        label:      "Hi-C"
        id:         "EFO:0007693"
      WGBS:
        code:       WGB
        stub_code:  G
        label:      "WGBS" # Whole-genome bisulfite sequencing
        id:         "EFO:0008985"
      FlowCytometry:
        code:       FCM
        stub_code:  F
        label:      "Flow Cytometry"
        id:         "NCIT:C16585"
      Proteomics:
        code:       PRO
        stub_code:  P
        label:      "Proteomics"
        id:         "NCIT:C20085"
      Metabolomics:
        code:       METB
        stub_code:  B
        label:      "Metabolomics"
        id:         "NCIT:C49019"
      Microarray:
        code:       ARR
        stub_code:  X
        label:      "Microarray"
        id:         "NCIT:C44282"
      ELISA:
        code:       ELS
        stub_code:  L
        label:      "Enzyme Immunoassay"
        id:         "NCIT:C17455"
      qPCR:
        code:       QPCR
        stub_code:  Q
        label:      "quantitative polymerase chain reaction"
        id:         "AFP:0003769"
      WesternBlot:
        code:       WBL
        stub_code:  W
        label:      "western blot assay"
        id:         "OBI:0000854"
      WGS:
        code:       WGS
        stub_code:  Z
        label:      "Whole Genome Sequencing"
        id:         "NCIT:C101294"
      ddPCR:
        code:       DDP
        stub_code:  D
        label:      "Droplet Digital PCR"
        id:         "NCIT:C166064"

    timepoint:
      Baseline:
        code:       BSL
        stub_code:  "B"
        label:      "Baseline"
        id:         "NCIT:C25213"
      Treatment:
        code:       TRT
        stub_code:  "T"
        label:      "Treatment Ongoing"
        id:         "NCIT:C165209"
      Surgery:
        code:       SUR
        stub_code:  "S"
        label:      "Surgery"
        id:         "NCIT:C17998"
      Challenge:
        code:       CHL
        stub_code:  "C"
        label:      "Challenge"
        id:         "NCIT:C78166"
      Collection:
        code: COL
        stub_code: CT
        label: "Collection Time"
        id: "NCIT:C81287"

    condition_pattern: &condition_pattern
      regex: '^([A-Z]\d{2}(?:\.\d+)?)$'
      code_format: '%s'
      stub_format: '%s'

    duration_pattern:
      # accept <digits><D|W|M|Y> or 0N, each with an optional leading P
      regex: '^(?:P?(\d+)([DWMY])|P?(0)(N))$'
      code_format: 'P%d%s'   # uses captures
      stub_format: '%d%s'

    batch_pattern:
      # capture up to two digits
      regex: '^(\d{1,2})$'
      code_format: 'B%02d'   # B01, B02, ...
      stub_format: 'B%02d'   # idem

    replicate_pattern:
      # 1–99
      regex: '^(\d{1,2})$'
      code_format: 'R%02d'   # R01, R02, ...
      stub_format: 'R%02d'   # idem

  subject:
    study: *all_projects

    type:
      Case:
        code:       Case
        stub_code:  C
        label:      "Case Study"
        id:         "NCIT:C15362"
      Control:
        code:       Control
        stub_code:  N
        label:      "Study Control"
        id:         "NCIT:C142703"

    condition_pattern: *condition_pattern

    sex:
      Male:
        code:       Male
        stub_code:  M
        label:      "Male"
        id:         "PATO:0000384"
      Female:
        code:       Female
        stub_code:  F
        label:      "Female"
        id:         "PATO:0000383"
      "Not reported":
        code:       NotR
        stub_code:  N
        label:      "Not Reported"
        id:         "NCIT:C43234"
      Unspecified:
        code:       "Uns"
        stub_code:  S
        label:      "Unspecified"
        id:         "NCIT:C38046"

    age_group:
      Age0to9:
        code:       A0_9
        stub_code:  A0
        label:      "Age 0-9"
        id:         "APOLLO:SV_00000241" # age range category
      Age10to19:
        code:       A10_19
        stub_code:  A1
        label:      "Age 10-19"
        id:         "APOLLO:SV_00000241" # age range category
      Age20to29:
        code:       A20_29
        stub_code:  A2
        label:      "Age 20-29"
        id:         "APOLLO:SV_00000241" # age range category
      Age30to39:
        code:       A30_39
        stub_code:  A3
        label:      "Age 30-39"
        id:         "APOLLO:SV_00000241" # age range category
      Age40to49:
        code:       A40_49
        stub_code:  A4
        label:      "Age 40-49"
        id:         "APOLLO:SV_00000241" # age range category
      Age50to59:
        code:       A50_59
        stub_code:  A5
        label:      "Age 50-59"
        id:         "APOLLO:SV_00000241" # age range category
      Age60to69:
        code:       A60_69
        stub_code:  A6
        label:      "Age 60-69"
        id:         "APOLLO:SV_00000241" # age range category
      Age70to79:
        code:       A70_79
        stub_code:  A7
        label:      "Age 70-79"
        id:         "APOLLO:SV_00000241" # age range category
      Age80to89:
        code:       A80_89
        stub_code:  A8
        label:      "Age 80-89"
        id:         "APOLLO:SV_00000241" # age range category
      Age90to99:
        code:       A90_99
        stub_code:  A9
        label:      "Age 90-99"
        id:         "APOLLO:SV_00000241" # age range category
      Unknown:
        code: UNK
        stub_code: UN # 2 char mandatory - not using global
        label: "Unknown"
        id:    "NCIT:C17998"
      'Not Available':
        code: NAV
        stub_code: NA # 2 char mandatory - not using global
        label:      "Not Available"
        id:         "NCIT:C126101"

πŸ“ Metadata

Defines global codebook info:

metadata:
  version: "0.03"         # 🏷️ official ClarID specification version
  local_version: "CNAG-GDC-v1"  # 🏷️ project-specific codebook revision (optional)
  author: "M. Rueda"      # 👤 author
  center: "CNAG"          # 🏢 institution
  date: "2026-04-01"      # 📅 YYYY-MM-DD
  description: "ClarID codebook"  # 📝 summary
  repository: "https://github.com/cnag-biomedical-informatics/clarid-tools"  # 🔗 repo URL

About version

version identifies the official ClarID specification release targeted by the codebook and schema, for example 0.03. ClarID-Tools compatibility is defined at this release level.

About local_version

local_version is optional and can be used for ad hoc or project-specific codebook variants without changing the official ClarID specification version. This is useful when different projects need their own controlled vocabularies, aliases, or dictionary updates while still conforming to the same ClarID release.


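🌐 Entities

All vocabularies are defined under the top-level entities: key.

🔄 _defaults

Global fall-through entries that serve as the fallback when no vocabulary entry matches: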

entities:
  _defaults:
    Unknown:
      code:       UNK
      stub_code:  U
      label:      "Unknown"
      id:         "NCIT:C17998"
    "Not Available":
      code:       NAV
      stub_code:  n
      label:      "Not Available"
      id:         "NCIT:C126101"
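
One way to picture the fall-through behaviour is the Python sketch below. It assumes that the default entries are merged into each vocabulary and that entries a vocabulary defines itself take precedence, as the codebook comments suggest ("added during BUILD", "not using global" under age_group). The function name `with_defaults` is hypothetical, and the actual ClarID-Tools logic may differ.

```python
# Default entries copied from entities._defaults (label and id omitted for brevity).
DEFAULTS = {
    "Unknown": {"code": "UNK", "stub_code": "U"},
    "Not Available": {"code": "NAV", "stub_code": "n"},
}


def with_defaults(vocabulary, defaults=DEFAULTS):
    """Return a copy of vocabulary with missing default entries filled in."""
    merged = dict(defaults)
    merged.update(vocabulary)  # entries defined by the vocabulary itself win
    return merged


tissue = with_defaults({"Liver": {"code": "LIV", "stub_code": "L"}})
age_group = with_defaults({"Unknown": {"code": "UNK", "stub_code": "UN"}})

print(tissue["Unknown"]["stub_code"])     # U  (taken from _defaults)
print(age_group["Unknown"]["stub_code"])  # UN (the vocabulary's own entry is kept)
```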

πŸ› οΈ biosample

πŸ“ project

entities:
  biosample:
    project: &all_projects
      "TCGA-AML":
        code:      TCGA_AML
        stub_code: AML
        label:     "TCGA Acute Myeloid Leukemia"
        id:         "NCIT:C17998" # Unknown

About species

Reference: based on Schrade et al., Animals 2024, Table 2.

🧬 Component I: Species Information

Each species entry is defined by two key elements, plus a stub code:

  • Element 1 (positions 1–3): tax_code, a 3-letter taxonomic classification code:
      • 1st letter: Class
      • 2nd letter: Order
      • 3rd letter: Family
      • Example: MPC = Mammalia | Primates | Cercopithecidae

  • Element 2 (positions 5–10): code, a 6-letter binomial acronym formed by:
      • the first 3 letters of the genus name
      • the first 3 letters of the species name
      • Example: MacMul = Macaca mulatta

  • stub_code: a 2-character Base-62 encoded unique species identifier (e.g. "01" for Homo sapiens, "0E" for Macaca mulatta)

Note: tax_code is provided as metadata and is not used in encode/decode logic.
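
The two derived values can be sketched in Python as follows. The function names are hypothetical, and the Base-62 alphabet order (digits, then uppercase, then lowercase) is an assumption: it reproduces the stub codes listed in the codebook ("01", "0A", "0E", "0M") but is not confirmed by the source.

```python
import string

# Assumed alphabet order: digits, then uppercase, then lowercase letters.
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase


def binomial_acronym(scientific_name):
    """Build the 6-letter code from the first 3 letters of genus and species."""
    genus, species = scientific_name.split()[:2]
    return genus[:3].capitalize() + species[:3].capitalize()


def base62_stub(index, width=2):
    """Encode a species index as a fixed-width Base-62 string."""
    if not 0 <= index < 62 ** width:
        raise ValueError(f"index {index} does not fit in {width} Base-62 characters")
    chars = []
    for _ in range(width):
        index, remainder = divmod(index, 62)
        chars.append(BASE62[remainder])
    return "".join(reversed(chars))


print(binomial_acronym("Macaca mulatta"))          # MacMul
print(binomial_acronym("Canis lupus familiaris"))  # CanLup
print(base62_stub(1))                              # 01
print(base62_stub(14))                             # 0E
```

With width 2 this scheme leaves room for 62 × 62 = 3844 species indices.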

🧬 species

    species:
      Human:
        code: HomSap        # 🆔 binomial acronym
        stub_code: "01"     # 🔢 index
        label: "Homo sapiens"  # 📖 name
        id: "NCBITaxon:9606"  # 🔗 taxonomy
        tax_code: MPH       # 🏷️ class|order|family

πŸ₯ tissue

    tissue:
      Liver:
        code: LIV
        stub_code: L
        label: "Liver"
        id: "UBERON:0002107"

🧪 sample_type

    sample_type:
      Tumor:
        code: TUM
        stub_code: T
        label: "Neoplasm"
        id: "NCIT:C3262"

🔬 assay

    assay:
      RNA_seq:
        code:       RNA
        stub_code:  R
        label:      "RNA-seq"
        id:         "EFO:0008896"

⏰ timepoint

    timepoint:
      Baseline:
        code:       BSL
        stub_code:  "B"
        label:      "Baseline"
        id:         "NCIT:C25213"

πŸ” Patterns

Regex-based formats:

    condition_pattern:
      regex: '^([A-Z]\d{2}(?:\.\d+)?)$'  # ✅ one letter + two digits, optional ".digits" suffix
      code_format: '%s'
      stub_format: '%s'
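
The sketch below shows how such a pattern entry might be applied in Python: the regex validates the input and its captures are fed into the printf-style format. The regexes and formats are copied from the condition_pattern, duration_pattern and batch_pattern entries of the codebook; the function `apply_pattern` and the way captures are passed to the format are assumptions, not the ClarID-Tools implementation.

```python
import re

# Regexes and formats copied from the codebook pattern entries.
PATTERNS = {
    "condition": {
        "regex": r"^([A-Z]\d{2}(?:\.\d+)?)$",
        "code_format": "%s",
        "stub_format": "%s",
    },
    "duration": {
        "regex": r"^(?:P?(\d+)([DWMY])|P?(0)(N))$",
        "code_format": "P%d%s",
        "stub_format": "%d%s",
    },
    "batch": {
        "regex": r"^(\d{1,2})$",
        "code_format": "B%02d",
        "stub_format": "B%02d",
    },
}


def apply_pattern(name, value, stub=False):
    """Validate value against the named pattern and format its captures."""
    spec = PATTERNS[name]
    match = re.fullmatch(spec["regex"], value, re.ASCII)
    if match is None:
        raise ValueError(f"{value!r} does not match the {name} pattern")
    # Keep only the groups that took part in the match; digit strings
    # become integers so that %d and %02d can format them.
    captures = tuple(
        int(group) if group.isdigit() else group
        for group in match.groups()
        if group is not None
    )
    fmt = spec["stub_format"] if stub else spec["code_format"]
    return fmt % captures


print(apply_pattern("condition", "C22.0"))       # C22.0
print(apply_pattern("duration", "P7D"))          # P7D
print(apply_pattern("duration", "7D"))           # P7D
print(apply_pattern("duration", "P0N", stub=True))  # 0N
print(apply_pattern("batch", "3"))               # B03
```

Because the duration regex makes the leading P optional, the same pattern accepts both the full form (P7D) and the stub form (7D).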

👥 Subject

🔄 study

Reuses biosample.project:

  subject:
    study: *all_projects

πŸ§‘β€πŸ€β€πŸ§‘ type, sex, age_group

      Case:
        code:       Case
        stub_code:  C
        label:      "Case Study"
        id:         "NCIT:C15362"
      sex:
      Male:
        code:       Male
        stub_code:  M
        label:      "Male"
        id:         "PATO:0000384"
      age_group:
        Age20to29:
          code:       A20_29
          stub_code:  A2
          label:      "Age 20-29"
          id:         "APOLLO:SV_00000241" # age range category
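
The age_group keys, codes and stub codes follow a regular decade scheme, which the Python sketch below reproduces. The function `age_group` is hypothetical, and mapping an integer age to a group this way is an assumption; the codebook itself only lists the groups. Ages outside 0–99 raise an error here rather than guessing at a fallback policy.

```python
def age_group(age):
    """Return the codebook key, code and stub code for an age in years."""
    if not 0 <= age <= 99:
        raise ValueError(f"age {age} is outside the 0-99 range covered by the codebook")
    decade = age // 10
    low, high = decade * 10, decade * 10 + 9
    return {
        "key": f"Age{low}to{high}",
        "code": f"A{low}_{high}",
        "stub_code": f"A{decade}",
    }


print(age_group(25))  # {'key': 'Age20to29', 'code': 'A20_29', 'stub_code': 'A2'}
print(age_group(5))   # {'key': 'Age0to9', 'code': 'A0_9', 'stub_code': 'A0'}
```
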
Naming conventions

Vocabulary keys under biosample and subject (e.g. RhesusMacaque, PeripheralBlood) use CamelCase; attributes (e.g. code, stub_code, tax_code) use snake_case.
Rationale: CamelCase keeps multi-word names compact and avoids confusion with attributes.
Exceptions: some keys (e.g. "Not Available", RNA_seq) keep original style for clarity or compatibility.