ClarID Codebook Documentation

Overview

This codebook defines a standardized encoding system for species, biosample metadata, assay types, conditions, and other identifiers in the ClarID project. It provides a mapping between human-readable labels and compact codes or stub codes for use in structured identifiers and databases.
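
As a rough illustration of that mapping, the sketch below builds forward (key to code) and reverse (code to key) lookups over a small hand-copied subset of the tissue vocabulary. The helper name build_lookup and the in-memory dictionary are assumptions made for this example; they are not part of ClarID-Tools.

```python
# Hypothetical sketch: forward/reverse lookups over one codebook vocabulary.
# TISSUE is a hand-copied subset of entities.biosample.tissue.
TISSUE = {
    "Liver": {"code": "LIV", "stub_code": "L"},
    "Lung": {"code": "LUN", "stub_code": "LU"},
    "PeripheralBlood": {"code": "PBLO", "stub_code": "PB"},
}


def build_lookup(vocab, field):
    """Return (forward, reverse) dicts for one attribute ('code' or 'stub_code')."""
    forward = {key: entry[field] for key, entry in vocab.items()}
    reverse = {}
    for key, value in forward.items():
        # A reverse lookup is only well defined if the values are unique.
        if value in reverse:
            raise ValueError(f"duplicate {field} {value!r}: {reverse[value]} / {key}")
        reverse[value] = key
    return forward, reverse


code_of, key_of_code = build_lookup(TISSUE, "code")
stub_of, key_of_stub = build_lookup(TISSUE, "stub_code")

print(code_of["PeripheralBlood"])  # PBLO
print(key_of_stub["LU"])           # Lung
```

The duplicate check matters because decoding an identifier relies on each code and stub code being unique within its vocabulary.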

See Codebook:
# -----------------------------------------------------------------------------
# ClarID Codebook
# -----------------------------------------------------------------------------
# Note: Species information taken from Schrade et al., Animals 2024, Table 2:
# COMPONENT I
#   * Element 1 (positions 1-3): 3-letter taxonomic code
#       (Class | Order | Family, e.g. "MPC" for Mammalia | Primates | Cercopithecidae)
#   * Element 2 (positions 5-10): 6-letter binomial acronym
#       (first 3 letters of Genus + first 3 letters of species, e.g. "MacMul" for Macaca mulatta)
#   
# In this file:
#   - code      = Element 2 (binomial acronym)
#   - tax_code  = Element 1 (taxonomic classification)
#                 included here only as metadata, NOT used in current encode/decode routines
#   - stub_code = 2-character Base-62 species index (unique identifier)
#
# Naming conventions:
# - Vocabulary keys under "biosample" and "subject" use CamelCase
#   (e.g. RhesusMacaque, PeripheralBlood).
# - Attributes use snake_case (e.g. code, stub_code, tax_code).
# Rationale: CamelCase keeps multi-word names compact and distinct from fields.
# Exceptions: some keys (e.g. "Not Available", RNA_seq) keep original style.
# -----------------------------------------------------------------------------

metadata:
  version:       "0.02"                # official ClarID release
  local_version: "CNAG-2025.09.05"     # internal/project revision (optional)
  author:        "Manuel Rueda <manuel.rueda@cnag.eu>"
  center:        "CNAG"
  date:          "2025-09-05"
  description:   "CNAG's ClarID codebook."
  repository:    "https://github.com/cnag-biomedical-informatics/clarid-tools"

entities:

  # Define "global" fall-through entries
  # Does not work out-of-the-box with YAML::XS (added during BUILD)
  _defaults: &defaults
    "Unknown":
      code:       UNK
      stub_code:  U
      label:      "Unknown"
      id:         "NCIT:C17998"
    "Not Available":
      code:       NAV
      stub_code:  n
      label:      "Not Available"
      id:         "NCIT:C126101"

  biosample:
    project: &all_projects
      "TCGA-AML": 
        code:      TCGA_AML
        stub_code: AML
        label:     "TCGA Acute Myeloid Leukemia"
        id:         "NCIT:C17998" # Unknown
      "TARGET-AML":
        code:      TARGET_AML
        stub_code: TAML
        label:     "TARGET's Study of Acute Myeloid Leukemia"
        id:         "NCIT:C17998" # Unknown
      "CNAG-Test":
        code:      CNAG_Test
        stub_code: CT
        label:     "CNAG test project for ClarID-Tools"
        id:         "NCIT:C17998" # Unknown


    species:
      Unknown:
        code:       UnkNow
        stub_code:  "00"
        label:      "Unknown"
        id:         "NCIT:C17998" # Unknown
        tax_code:   UNK
      Human:
        code:       HomSap
        stub_code:  "01"
        label:      "Homo sapiens"
        id:         "NCBITaxon:9606"
        tax_code:   MPH       # Mammalia | Primates         | Hominidae
      Mouse:
        code:       MusMus
        stub_code:  "02"
        label:      "Mus musculus"
        id:         "NCBITaxon:10090"
        tax_code:   MRM       # Mammalia | Rodentia         | Muridae
      Rat:
        code:       RatNor
        stub_code:  "03"
        label:      "Rattus norvegicus"
        id:         "NCBITaxon:10116"
        tax_code:   MRM       # Mammalia | Rodentia         | Muridae
      Zebrafish:
        code:       DanRer
        stub_code:  "04"
        label:      "Danio rerio"
        id:         "NCBITaxon:7955"
        tax_code:   ACC       # Actinopterygii | Cypriniformes   | Cyprinidae
      Fruitfly:
        code:       DroMel
        stub_code:  "05"
        label:      "Drosophila melanogaster"
        id:         "NCBITaxon:7227"
        tax_code:   IDD       # Insecta      | Diptera          | Drosophilidae
      Worm:
        code:       CaeEle
        stub_code:  "06"
        label:      "Caenorhabditis elegans"
        id:         "NCBITaxon:6239"
        tax_code:   CRR       # Chromadorea | Rhabditida       | Rhabditidae
      Yeast:
        code:       SacCer
        stub_code:  "07"
        label:      "Saccharomyces cerevisiae"
        id:         "NCBITaxon:4932"
        tax_code:   SSS       # Saccharomycetes | Saccharomycetales | Saccharomycetaceae
      Ecoli:
        code:       EscCol
        stub_code:  "08"
        label:      "Escherichia coli"
        id:         "NCBITaxon:562"
        tax_code:   GEE       # Gammaproteobacteria | Enterobacterales  | Enterobacteriaceae
      Dog:
        code:       CanLup
        stub_code:  "09"
        label:      "Canis lupus familiaris"
        id:         "NCBITaxon:9615"
        tax_code:   MCC       # Mammalia | Carnivora        | Canidae
      Pig:
        code:       SusScr
        stub_code:  "0A"
        label:      "Sus scrofa"
        id:         "NCBITaxon:9823"
        tax_code:   MAS       # Mammalia | Artiodactyla     | Suidae
      Cow:
        code:       BosTau
        stub_code:  "0B"
        label:      "Bos taurus"
        id:         "NCBITaxon:9913"
        tax_code:   MAB       # Mammalia | Artiodactyla     | Bovidae
      Chicken:
        code:       GalGal
        stub_code:  "0C"
        label:      "Gallus gallus"
        id:         "NCBITaxon:9031"
        tax_code:   AGP       # Aves      | Galliformes      | Phasianidae
      Rabbit:
        code:       OryCun
        stub_code:  "0D"
        label:      "Oryctolagus cuniculus"
        id:         "NCBITaxon:9986"
        tax_code:   MLL       # Mammalia | Lagomorpha       | Leporidae
      RhesusMacaque:
        code:       MacMul
        stub_code:  "0E"
        label:      "Macaca mulatta"
        id:         "NCBITaxon:9544"
        tax_code:   MPC       # Mammalia | Primates         | Cercopithecidae
      CynomolgusMacaque:
        code:       MacFas
        stub_code:  "0F"
        label:      "Macaca fascicularis"
        id:         "NCBITaxon:9543"
        tax_code:   MPC       # Mammalia | Primates         | Cercopithecidae
      CommonMarmoset:
        code:       CalJac
        stub_code:  "0G"
        label:      "Callithrix jacchus"
        id:         "NCBITaxon:9483"
        tax_code:   MPC       # Mammalia | Primates         | Callitrichidae
      GuineaPig:
        code:       CavPor
        stub_code:  "0H"
        label:      "Cavia porcellus"
        id:         "NCBITaxon:10141"
        tax_code:   MRC       # Mammalia | Rodentia         | Caviidae
      GoldenHamster:
        code:       MesAur
        stub_code:  "0I"
        label:      "Mesocricetus auratus"
        id:         "NCBITaxon:10036"
        tax_code:   MRC       # Mammalia | Rodentia         | Cricetidae
      AfricanClawedFrog:
        code:       XenLae
        stub_code:  "0J"
        label:      "Xenopus laevis"
        id:         "NCBITaxon:8355"
        tax_code:   AAP       # Amphibia   | Anura            | Pipidae
      Ferret:
        code:       MusPut
        stub_code:  "0K"
        label:      "Mustela putorius furo"
        id:         "NCBITaxon:9612"
        tax_code:   MCM       # Mammalia   | Carnivora        | Mustelidae
      NakedMoleRat:
        code:       HetGla
        stub_code:  "0L"
        label:      "Heterocephalus glaber"
        id:         "NCBITaxon:314479"
        tax_code:   MRB       # Mammalia   | Rodentia         | Bathyergidae
      Opossum:
        code:       MonDom
        stub_code:  "0M"
        label:      "Monodelphis domestica"
        id:         "NCBITaxon:13710"
        tax_code:   MDD       # Mammalia   | Didelphimorphia  | Didelphidae

    tissue:
      Liver:
        code:       LIV
        stub_code:  L
        label:      "Liver"
        id:         "UBERON:0002107"
      Lung:
        code:       LUN
        stub_code:  LU
        label:      "Lung"
        id:         "UBERON:0002048"
      Kidney:
        code:       KID
        stub_code:  K
        label:      "Kidney"
        id:         "UBERON:0002113"
      Blood:
        code:       BLO
        stub_code:  B
        label:      "Blood"
        id:         "UBERON:0000178"
      PeripheralBlood:
        code:       PBLO
        stub_code:  PB
        label:      "Peripheral Blood"
        id:         "BTO:0000553"
      Tumor:
        code:       TUM
        stub_code:  T
        label:      "Neoplasm"
        id:         "NCIT:C3262"
      Brain:
        code:       BRN
        stub_code:  N
        label:      "Brain"
        id:         "UBERON:0000955"
      Heart:
        code:       HRT
        stub_code:  H
        label:      "Heart"
        id:         "UBERON:0000948"
      Spleen:
        code:       SPL
        stub_code:  S
        label:      "Spleen"
        id:         "UBERON:0002106"
      Skin:
        code:       SKN
        stub_code:  I
        label:      "Skin"
        id:         "UBERON:0002097"
      Pancreas:
        code:       PNC
        stub_code:  P
        label:      "Pancreas"
        id:         "UBERON:0001264"
      Colon:
        code:       CLN
        stub_code:  C
        label:      "Colon"
        id:         "UBERON:0000059"
      Stomach:
        code:       STM
        stub_code:  M
        label:      "Stomach"
        id:         "UBERON:0000945"
      Muscle:
        code:       MSC
        stub_code:  V
        label:      "Muscle"
        id:         "BTO:0000887"
      Intestine:
        code:       INT
        stub_code:  E
        label:      "Intestine"
        id:         "UBERON:0000160"
      Bone:
        code:       BNE
        stub_code:  O
        label:      "Bone"
        id:         "BTO:0000140"
      AdiposeTissue:
        code:       ADT
        stub_code:  A
        label:      "Adipose tissue"
        id:         "UBERON:0001013"
      BoneMarrow:
        code:       BMR
        stub_code:  R
        label:      "Bone marrow"
        id:         "UBERON:0002371"
      DerivedCellLine:
        code:       DCL
        stub_code:  DC
        label:      "Derived Cell Line"
        id:         "NCIT:C156445" 

    sample_type:
      Tumor:
        code:       TUM
        stub_code:  T
        label:      "Neoplasm"
        id:         "NCIT:C3262"
      Normal:
        code:       NOR
        stub_code:  N
        label:      "Normal"
        id:         "PATO:0000461"
      Primary:
        code:       PRI
        stub_code:  P
        label:      "Primary Tumor Site Indicator"
        id:         "NCIT:C172602"
      Recurrence:
        code:       REC
        stub_code:  R
        label:      "Recurrent Neoplasm"
        id:         "NCIT:C4798"
    assay:
      RNA_seq:
        code:       RNA
        stub_code:  R
        label:      "RNA-seq"
        id:         "EFO:0008896"
      WES:
        code:       WES
        stub_code:  E
        label:      "Exome sequencing"
        id:         "EFO:0005396"
      ChIP_seq:
        code:       CHI
        stub_code:  C
        label:      "ChIP-seq"
        id:         "EFO:0002692"
      IHC:
        code:       IHC
        stub_code:  I
        label:      "Immunohistochemistry"
        id:         "EFO:0022943"
      LC_MS:
        code:       LCMS
        stub_code:  S
        label:      "Liquid Chromatography Mass Spectrometry"
        id:         "NCIT:C18475"
      ATAC_seq:
        code:       ATAC
        stub_code:  A
        label:      "ATAC-seq"
        id:         "EFO:0007045"
      scRNA_seq:
        code:       SCR
        stub_code:  N
        label:      "Single-cell RNA sequencing"
        id:         "EFO:0008913"
      scATAC_seq:
        code:       SCAT
        stub_code:  T
        label:      "ScATAC-seq"
        id:         "EFO:0010891"
      HiC:
        code:       HIC
        stub_code:  HI
        label:      "Hi-C"
        id:         "EFO:0007693"
      WGBS:
        code:       WGB
        stub_code:  G
        label:      "WGBS" # Whole-genome bisulfite sequencing
        id:         "EFO:0008985"
      FlowCytometry:
        code:       FCM
        stub_code:  F
        label:      "Flow Cytometry"
        id:         "NCIT:C16585"
      Proteomics:
        code:       PRO
        stub_code:  P
        label:      "Proteomics"
        id:         "NCIT:C20085"
      Metabolomics:
        code:       METB
        stub_code:  B
        label:      "Metabolomics"
        id:         "NCIT:C49019"
      Microarray:
        code:       ARR
        stub_code:  X
        label:      "Microarray"
        id:         "NCIT:C44282"
      ELISA:
        code:       ELS
        stub_code:  L
        label:      "Enzyme Immunoassay"
        id:         "NCIT:C17455"
      qPCR:
        code:       QPCR
        stub_code:  Q
        label:      "quantitative polymerase chain reaction"
        id:         "AFP:0003769"
      WesternBlot:
        code:       WBL
        stub_code:  W
        label:      "western blot assay"
        id:         "OBI:0000854"
      WGS:
        code:       WGS
        stub_code:  Z
        label:      "Whole Genome Sequencing"
        id:         "NCIT:C101294"
      ddPCR:
        code:       DDP
        stub_code:  D
        label:      "Droplet Digital PCR"
        id:         "NCIT:C166064"

    timepoint:
      Baseline:
        code:       BSL
        stub_code:  "B"
        label:      "Baseline"
        id:         "NCIT:C25213"
      Treatment:
        code:       TRT
        stub_code:  "T"
        label:      "Treatment Ongoing"
        id:         "NCIT:C165209"
      Surgery:
        code:       SUR
        stub_code:  "S"
        label:      "Surgery"
        id:         "NCIT:C17998"
      Challenge:
        code:       CHL
        stub_code:  "C"
        label:      "Challenge"
        id:         "NCIT:C78166"
      Collection:
        code: COL
        stub_code: CT
        label: "Collection Time"
        id: "NCIT:C81287"

    condition_pattern: &condition_pattern
      regex: '^([A-Z]\d{2}(?:\.\d+)?)$'
      code_format: '%s'
      stub_format: '%s'

    duration_pattern:
      # accept [P]<digits><D|W|M|Y> OR [P]0N (the leading P is optional)
      regex: '^(?:P?(\d+)([DWMY])|P?(0)(N))$'
      code_format: 'P%d%s'   # uses captures
      stub_format: '%d%s'

    batch_pattern:
      # capture up to two digits
      regex: '^(\d{1,2})$'
      code_format: 'B%02d'   # B01, B02, ...
      stub_format: 'B%02d'   # idem

    replicate_pattern:
      # 1–99
      regex: '^(\d{1,2})$'
      code_format: 'R%02d'   # R01, R02, ...
      stub_format: 'R%02d'   # idem

  subject:
    study: *all_projects

    type:
      Case:
        code:       Case
        stub_code:  C
        label:      "Case Study"
        id:         "NCIT:C15362"
      Control:
        code:       Control
        stub_code:  N
        label:      "Study Control"
        id:         "NCIT:C142703"

    condition_pattern: *condition_pattern

    sex:
      Male:
        code:       Male
        stub_code:  M
        label:      "Male"
        id:         "PATO:0000384"
      Female:
        code:       Female
        stub_code:  F
        label:      "Female"
        id:         "PATO:0000383"
      "Not reported":
        code:       NotR
        stub_code:  N
        label:      "Not Reported"
        id:         "NCIT:C43234"
      Unspecified:
        code:       "Uns"
        stub_code:  S
        label:      "Unspecified"
        id:         "NCIT:C38046"

    age_group:
      Age0to9:
        code:       A0_9
        stub_code:  A0
        label:      "Age 0-9"
        id:         "APOLLO:SV_00000241" # age range category
      Age10to19:
        code:       A10_19
        stub_code:  A1
        label:      "Age 10-19"
        id:         "APOLLO:SV_00000241" # age range category
      Age20to29:
        code:       A20_29
        stub_code:  A2
        label:      "Age 20-29"
        id:         "APOLLO:SV_00000241" # age range category
      Age30to39:
        code:       A30_39
        stub_code:  A3
        label:      "Age 30-39"
        id:         "APOLLO:SV_00000241" # age range category
      Age40to49:
        code:       A40_49
        stub_code:  A4
        label:      "Age 40-49"
        id:         "APOLLO:SV_00000241" # age range category
      Age50to59:
        code:       A50_59
        stub_code:  A5
        label:      "Age 50-59"
        id:         "APOLLO:SV_00000241" # age range category
      Age60to69:
        code:       A60_69
        stub_code:  A6
        label:      "Age 60-69"
        id:         "APOLLO:SV_00000241" # age range category
      Age70to79:
        code:       A70_79
        stub_code:  A7
        label:      "Age 70-79"
        id:         "APOLLO:SV_00000241" # age range category
      Age80to89:
        code:       A80_89
        stub_code:  A8
        label:      "Age 80-89"
        id:         "APOLLO:SV_00000241" # age range category
      Age90to99:
        code:       A90_99
        stub_code:  A9
        label:      "Age 90-99"
        id:         "APOLLO:SV_00000241" # age range category
      Unknown:
        code: UNK
        stub_code: UN # 2 char mandatory - not using global
        label: "Unknown"
        id:    "NCIT:C17998"
      'Not Available':
        code: NAV
        stub_code: NA # 2 char mandatory - not using global
        label:      "Not Available"
        id:         "NCIT:C126101"

Metadata

Defines global codebook info:

metadata:
  version: "0.02"         # version
  local_version: "CNAG-2025.09.05"  # internal/project revision (optional)
  author: "M. Rueda"      # author
  center: "CNAG"          # institution
  date: "2025-09-05"      # YYYY-MM-DD
  description: "ClarID codebook"  # summary
  repository: "https://github.com/cnag-biomedical-informatics/clarid-tools"  # repo URL

About Codebook Version

The codebook and the ClarID-Tools software are versioned in sync.
Each software release is expected to ship with the corresponding version of the codebook.


🌐 Entities

All under entities:.

πŸ”„ _defaults

Fallback when no match:

entities:
  _defaults:
    Unknown:
      code:       UNK
      stub_code:  U
      label:      "Unknown"
      id:         "NCIT:C17998"
    "Not Available":
      code:       NAV
      stub_code:  n
      label:      "Not Available"
      id:         "NCIT:C126101"
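
The codebook comments note that these entries do not work out of the box with YAML::XS and are added during BUILD. How ClarID-Tools does this internally is not shown here, so the following is only a minimal sketch of one plausible behaviour: the defaults are copied into every vocabulary, and a vocabulary that defines its own Unknown or "Not Available" entry (as age_group does, with 2-character stub codes) keeps its local version. The function name with_defaults is hypothetical.

```python
import copy

# Hand-copied from entities._defaults (label and id omitted for brevity).
DEFAULTS = {
    "Unknown": {"code": "UNK", "stub_code": "U"},
    "Not Available": {"code": "NAV", "stub_code": "n"},
}


def with_defaults(vocab, defaults=DEFAULTS):
    """Return a new vocabulary with the defaults filled in.

    Assumption: entries defined locally take precedence over the defaults.
    """
    merged = copy.deepcopy(defaults)
    merged.update(copy.deepcopy(vocab))  # local entries win
    return merged


tissue = with_defaults({"Liver": {"code": "LIV", "stub_code": "L"}})
age_group = with_defaults({"Unknown": {"code": "UNK", "stub_code": "UN"}})

print(tissue["Unknown"]["stub_code"])     # U  (inherited default)
print(age_group["Unknown"]["stub_code"])  # UN (local override kept)
```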

biosample

project

entities:
  biosample:
    project: &all_projects
      "TCGA-AML":
        code:      TCGA_AML
        stub_code: AML
        label:     "TCGA Acute Myeloid Leukemia"
        id:         "NCIT:C17998" # Unknown

About species:

Reference: Based on Schrade et al., Animals 2024, Table 2.

🧬 Component I: Species Information

Each species entry is defined by two key elements:

  • Element 1 (positions 1–3): tax_code, a 3-letter taxonomic classification code:
      • 1st letter: Class
      • 2nd letter: Order
      • 3rd letter: Family
      • Example: MPC = Mammalia | Primates | Cercopithecidae

  • Element 2 (positions 5–10): code, a 6-letter binomial acronym formed by:
      • the first 3 letters of the genus name
      • the first 3 letters of the species name
      • Example: MacMul = Macaca mulatta

  • stub_code: a 2-character Base-62 encoded unique species identifier (e.g. "01" for Homo sapiens, "0E" for Macaca mulatta)

Note: tax_code is provided as metadata and is not used in encode/decode logic.
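
The two derived fields can be sketched in a few lines. The example below assumes the Base-62 alphabet is ordered digits, then uppercase, then lowercase; that ordering is consistent with the stub codes in the codebook ("09" is followed by "0A"), but the position of the lowercase letters is an assumption. The function names are illustrative and not part of ClarID-Tools.

```python
import string

# Assumed alphabet order: 0-9, A-Z, a-z (62 symbols).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def base62_stub(index, width=2):
    """Encode a non-negative species index as a fixed-width Base-62 string."""
    if not 0 <= index < 62 ** width:
        raise ValueError(f"index {index} does not fit in {width} Base-62 digits")
    chars = []
    for _ in range(width):
        index, remainder = divmod(index, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def binomial_acronym(label):
    """First 3 letters of the genus + first 3 letters of the species epithet."""
    genus, species = label.split()[:2]  # any subspecies name is ignored
    return genus[:3].capitalize() + species[:3].capitalize()


print(base62_stub(1))                      # 01
print(base62_stub(14))                     # 0E
print(binomial_acronym("Macaca mulatta"))  # MacMul
```

Special entries such as Unknown (code UnkNow) do not follow the binomial rule and would need to be handled separately.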

🧬 species

    species:
      Human:
        code: HomSap        # πŸ†” binomial acronym
        stub_code: "01"     # πŸ”’ index
        label: "Homo sapiens"  # πŸ“– name
        id: "NCBITaxon:9606"  # πŸ”— taxonomy
        tax_code: MPH       # 🏷️ class|order|family

tissue

    tissue:
      Liver:
        code: LIV
        stub_code: L
        label: "Liver"
        id: "UBERON:0002107"

sample_type

    sample_type:
      Tumor:
        code: TUM
        stub_code: T
        label: "Neoplasm"
        id: "NCIT:C3262"

assay

    assay:
      RNA_seq:
        code:       RNA
        stub_code:  R
        label:      "RNA-seq"
        id:         "EFO:0008896"

timepoint

    timepoint:
      Baseline:
        code:       BSL
        stub_code:  "B"
        label:      "Baseline"
        id:         "NCIT:C25213"

Patterns

Regex-based formats:

    condition_pattern:
      regex: '^([A-Z]\d{2}(?:\.\d+)?)$'  # uppercase letter + 2 digits, optional decimal part
      code_format: '%s'
      stub_format: '%s'
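
To show how a regex and its printf-style format might work together, here is a minimal sketch that validates a value and renders its code. The three patterns are copied from the codebook; the apply_pattern helper and the rule of converting all-digit captures to integers are assumptions for this example, not a description of the ClarID-Tools implementation.

```python
import re

# Regexes and code formats copied from the codebook's *_pattern entries.
PATTERNS = {
    "condition": {"regex": r"^([A-Z]\d{2}(?:\.\d+)?)$", "code_format": "%s"},
    "duration": {"regex": r"^(?:P?(\d+)([DWMY])|P?(0)(N))$", "code_format": "P%d%s"},
    "batch": {"regex": r"^(\d{1,2})$", "code_format": "B%02d"},
}


def apply_pattern(name, value):
    """Validate value against the named pattern and format its captures."""
    spec = PATTERNS[name]
    match = re.fullmatch(spec["regex"], value)
    if match is None:
        raise ValueError(f"{value!r} does not match {name}_pattern")
    # Keep only the captures of the alternative that matched.
    captures = [group for group in match.groups() if group is not None]
    # Assumption: all-digit captures are passed as integers so %d works.
    args = tuple(int(c) if c.isdigit() else c for c in captures)
    return spec["code_format"] % args


print(apply_pattern("condition", "C34.1"))  # C34.1
print(apply_pattern("duration", "7D"))      # P7D
print(apply_pattern("duration", "P0N"))     # P0N
print(apply_pattern("batch", "3"))          # B03
```

Note that the duration regex makes the leading P optional, so both 7D and P7D yield the code P7D.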

Subject

study

Reuses biosample.project:

  subject:
    study: *all_projects

type, sex, age_group

    type:
      Case:
        code:       Case
        stub_code:  C
        label:      "Case Study"
        id:         "NCIT:C15362"
    sex:
      Male:
        code:       Male
        stub_code:  M
        label:      "Male"
        id:         "PATO:0000384"
    age_group:
      Age20to29:
        code:       A20_29
        stub_code:  A2
        label:      "Age 20-29"
        id:         "APOLLO:SV_00000241" # age range category
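
The age_group keys follow a regular decade scheme, so a numeric age can be mapped to its bin mechanically. The sketch below is one way to do that; the function name age_group_for, the use of Unknown for a missing age, and raising an error for ages of 100 or more (which the codebook does not cover) are all assumptions.

```python
def age_group_for(age):
    """Map an age in years to (key, code, stub_code) of the codebook bin."""
    if age is None:
        # age_group defines its own 2-character Unknown stub.
        return ("Unknown", "UNK", "UN")
    if not 0 <= age < 100:
        raise ValueError("the codebook defines bins only for ages 0-99")
    decade = int(age) // 10
    low, high = decade * 10, decade * 10 + 9
    return (f"Age{low}to{high}", f"A{low}_{high}", f"A{decade}")


print(age_group_for(25))    # ('Age20to29', 'A20_29', 'A2')
print(age_group_for(None))  # ('Unknown', 'UNK', 'UN')
```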

Naming conventions

Vocabulary keys under biosample and subject (e.g. RhesusMacaque, PeripheralBlood) use CamelCase; attributes (e.g. code, stub_code, tax_code) use snake_case.
Rationale: CamelCase keeps multi-word names compact and avoids confusion with attributes.
Exceptions: some keys (e.g. "Not Available", RNA_seq) keep original style for clarity or compatibility.