C4: Scientific Context Readable by Scientists and Machines
Context is readable by scientists and machines — not trapped in PDF reports, Electronic Laboratory Notebook (ELN) narratives, or spreadsheet column headers.
What does this mean?
Scientific context in pharmaceutical R&D is overwhelmingly stored in human-readable, machine-opaque formats. Method descriptions live in PDF Standard Operating Procedures. Sample provenance is recorded in ELN narratives. Experiment parameters are captured in spreadsheet column headers that mean something to the author and nothing to a machine. Study relationships are documented in regulatory submissions that are structured for human review, not machine traversal.
C4 requires that the scientific context established by C1 (linking), C2 (reconciliation), and C3 (lineage) is represented in formats readable by both scientists and machines — structured data models that computational systems can query, traverse, and reason over without human interpretation.
The machine-readability spectrum
Context representation falls on a spectrum from fully opaque to fully machine-readable:
- Opaque: PDF reports, scanned documents, unstructured ELN text. No machine access to the content without natural language processing (NLP), which is probabilistic and error-prone for regulatory-grade data.
- Semi-structured: Spreadsheets with named columns, XML files with custom schemas, JSON with undocumented field names. Machines can parse the format but cannot interpret the meaning without domain-specific mappings.
- Structured: Data stored in formal schemas with defined vocabularies. Allotrope is the strongest example: each technique is defined as a composite of JSON schemas (the Allotrope Simple Model, ASM) that describe the data structure and RDF ontologies (the Allotrope Foundation Ontologies, AFO) that define the scientific vocabulary. A machine reading Allotrope-compliant data can both parse the values (from the schema) and reason about what they mean scientifically (from the ontology). Other examples include ISA-Tab for study metadata and controlled terminologies from CDISC or the NCI Thesaurus. At this level, machines can parse, interpret, and reason over the content.
C4 requires that scientific context reaches the third level of this spectrum: structured, schema-defined, and vocabulary-controlled. This does not mean that PDFs and ELN narratives cease to exist. It means that the governed system extracts their content and represents it in machine-readable form alongside the human-readable originals.
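The difference between the semi-structured and structured levels can be illustrated with a minimal sketch. It assumes a hypothetical schema (`RECORD_SCHEMA`) and a hypothetical controlled vocabulary (`TECHNIQUE_VOCAB`); neither reproduces actual Allotrope, ISA-Tab, or CDISC definitions, and the identifiers are placeholders.

```python
# Hypothetical controlled vocabulary: preferred term -> placeholder identifier.
# These are NOT real AFO, CDISC, or NCI Thesaurus identifiers.
TECHNIQUE_VOCAB = {
    "HPLC": "EX:0001",
    "Karl Fischer titration": "EX:0002",
}

# Hypothetical schema: required field name -> expected Python type.
RECORD_SCHEMA = {
    "compound_id": str,
    "technique": str,
    "storage_condition": str,
    "assay_percent": float,
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    # Schema check: every required field is present and has the expected type.
    for field, expected_type in RECORD_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    # Vocabulary check: the technique must be a controlled term.
    technique = record.get("technique")
    if isinstance(technique, str) and technique not in TECHNIQUE_VOCAB:
        problems.append(f"uncontrolled term: {technique}")
    return problems


# A structured record: schema-conformant and vocabulary-controlled.
structured = {
    "compound_id": "CMP-A",
    "technique": "HPLC",
    "storage_condition": "25C/60%RH",
    "assay_percent": 99.2,
}

# A semi-structured record: parseable, but the field names and terms
# only mean something to the spreadsheet's author.
semi_structured = {
    "Cmpd": "CMP-A",
    "technique": "hplc (site B method)",
    "assay_percent": "99.2",
}
```

In this sketch `validate_record(structured)` returns an empty list, while the semi-structured record fails on missing fields, a string where a number is expected, and a technique name outside the vocabulary. A machine can parse both records, but it can only interpret the first.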
ELNs, even those with APIs, typically provide browse-only access: scientists can follow links and open individual records, but cannot search across experiments, programs, or therapeutic areas. ELN data that is only browse-accessible is operationally trapped: visible to individual scientists but invisible to cross-program queries, trend analyses, and AI. Extracting ELN content into a structured, indexed representation makes it searchable without replacing the ELN as the authoritative record.
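One way to picture such an indexed representation is a small inverted index over records already extracted from ELN entries. This is a sketch only: the record fields, entry identifiers, and the `build_index` and `search` helpers are hypothetical and do not correspond to any particular ELN product or API.

```python
from collections import defaultdict

# Hypothetical records extracted from ELN entries. Each keeps the entry id,
# so the ELN remains the authoritative record behind every search hit.
ELN_RECORDS = [
    {"entry_id": "ELN-001", "program": "P1", "technique": "HPLC", "compound_id": "CMP-A"},
    {"entry_id": "ELN-002", "program": "P2", "technique": "HPLC", "compound_id": "CMP-B"},
    {"entry_id": "ELN-003", "program": "P2", "technique": "NMR", "compound_id": "CMP-A"},
]


def build_index(records, fields):
    """Map each (field, value) pair to the set of entry ids that carry it."""
    index = defaultdict(set)
    for record in records:
        for field in fields:
            index[(field, record[field])].add(record["entry_id"])
    return index


def search(index, **criteria):
    """Return the sorted entry ids that match every field=value criterion."""
    id_sets = [index.get((field, value), set()) for field, value in criteria.items()]
    if not id_sets:
        return []
    return sorted(set.intersection(*id_sets))


INDEX = build_index(ELN_RECORDS, ["program", "technique", "compound_id"])

# Cross-program query: every HPLC experiment, regardless of program.
hplc_entries = search(INDEX, technique="HPLC")
# Combined query: HPLC experiments on one compound.
cmp_a_hplc = search(INDEX, technique="HPLC", compound_id="CMP-A")
```

The search returns entry identifiers rather than copies of the content, so a scientist who needs the full narrative still opens the original ELN record.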
Established ontological frameworks extend beyond analytical data formats. BFO (Basic Formal Ontology) provides the upper-level backbone used by over 500 biomedical domain ontologies. CMCP-O (CMC Process Ontology) provides vocabulary for linking analytical measurements to manufacturing process context. IDMP-O, an ontology based on the ISO IDMP standards, formalizes the identification of substances, products, and organizations, which is directly relevant to master data reconciliation (C2). These frameworks demonstrate that machine-readable context has mature, adopted standards across analytical, manufacturing, and regulatory domains.
Can a computational system — without any human intervention — answer: "Show me all stability data for this compound, at this storage condition, measured by HPLC, across all sites, for the last 24 months"? If answering this query requires a human to interpret method names, open PDF protocols, or decode spreadsheet headers, C4 is not satisfied.
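When context is structured, the query above reduces to a filter over typed fields. The following is a minimal sketch, assuming a hypothetical `StabilityResult` record and illustrative values; the field names, identifiers, and data are not drawn from any real system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class StabilityResult:
    """Hypothetical structured stability record (illustrative fields only)."""
    compound_id: str
    storage_condition: str
    technique: str
    site: str
    measured_on: date
    assay_percent: float


def within_window(measured: date, as_of: date, months: int) -> bool:
    """True if `measured` falls in the `months` calendar months before `as_of`."""
    cutoff_index = as_of.year * 12 + (as_of.month - 1) - months
    cutoff_year, cutoff_month0 = divmod(cutoff_index, 12)
    # Compare as tuples to avoid building an invalid date such as 30 February.
    cutoff = (cutoff_year, cutoff_month0 + 1, as_of.day)
    return cutoff <= (measured.year, measured.month, measured.day) and measured <= as_of


def query_stability(results, compound_id, storage_condition, technique, as_of, months=24):
    """Return matching results across all sites, with no human interpretation."""
    return [
        r for r in results
        if r.compound_id == compound_id
        and r.storage_condition == storage_condition
        and r.technique == technique
        and within_window(r.measured_on, as_of, months)
    ]


# Illustrative data only.
RESULTS = [
    StabilityResult("CMP-A", "25C/60%RH", "HPLC", "Site-1", date(2023, 1, 10), 99.2),
    StabilityResult("CMP-A", "25C/60%RH", "HPLC", "Site-2", date(2024, 3, 5), 98.7),
    StabilityResult("CMP-A", "40C/75%RH", "HPLC", "Site-1", date(2024, 1, 10), 97.9),
    StabilityResult("CMP-A", "25C/60%RH", "HPLC", "Site-1", date(2021, 5, 1), 99.8),
    StabilityResult("CMP-B", "25C/60%RH", "HPLC", "Site-2", date(2024, 2, 1), 99.0),
]

hits = query_stability(RESULTS, "CMP-A", "25C/60%RH", "HPLC", as_of=date(2024, 6, 15))
```

The filter only works because compound, storage condition, technique, and site are controlled values in named fields. If the technique were buried in a free-text method name, the same query would need a human or an NLP step to interpret it.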
Why this is the prerequisite for AI
Machine learning models, large language models (LLMs), and agentic AI systems cannot operate on context they cannot read. An AI system that receives a chromatographic peak table without knowing the method, the sample, the study, or the regulatory context cannot generate a meaningful stability trend, cannot identify an out-of-specification result in its regulatory context, and cannot draft a CMC section that references the correct data.
C4 is the bridge between Contextualize and Analyze. Without machine-readable context, analysis (A) operates on numbers without meaning. With machine-readable context, analysis operates on scientifically interpretable evidence.
An HPLC method for impurity profiling specifies: gradient elution from 5% to 95% acetonitrile over 30 minutes, C18 column (150 × 4.6 mm, 3.5 µm), UV detection at 220 nm, injection volume 10 µL, column temperature 30°C. In a PDF SOP, this information is human-readable but machine-opaque. In a structured method definition (e.g., an Allotrope method description), every parameter is a named, typed, searchable field. A computational system can compare this method's parameters with another site's method to assess equivalence — without a human reading two PDFs side by side.
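A minimal sketch of such a comparison follows. The `HplcMethod` class and its field names are hypothetical and are not the Allotrope method model; the first instance carries the parameters from the example above, and the second site's values are invented for illustration.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class HplcMethod:
    """Hypothetical structured method definition: every parameter is named and typed."""
    organic_start_percent: float
    organic_end_percent: float
    gradient_minutes: float
    column_chemistry: str
    column_length_mm: float
    column_inner_diameter_mm: float
    particle_size_um: float
    detection_wavelength_nm: float
    injection_volume_ul: float
    column_temperature_c: float


def method_differences(a: HplcMethod, b: HplcMethod) -> dict:
    """Return {parameter: (value_in_a, value_in_b)} for every parameter that differs."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if getattr(a, f.name) != getattr(b, f.name)
    }


# Parameters from the impurity-profiling example in the text.
SITE_A = HplcMethod(
    organic_start_percent=5.0,
    organic_end_percent=95.0,
    gradient_minutes=30.0,
    column_chemistry="C18",
    column_length_mm=150.0,
    column_inner_diameter_mm=4.6,
    particle_size_um=3.5,
    detection_wavelength_nm=220.0,
    injection_volume_ul=10.0,
    column_temperature_c=30.0,
)

# Invented second-site method that differs in two parameters.
SITE_B = replace(SITE_A, injection_volume_ul=5.0, column_temperature_c=35.0)

differences = method_differences(SITE_A, SITE_B)
```

The sketch reports which parameters differ; it does not decide whether the methods are equivalent. That judgment still belongs to a scientist or to a governed equivalence rule, but it now starts from a machine-generated difference list rather than two PDFs.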
Relationship to other principles
C4 is the culmination of the Contextualize sequence. C1 establishes the links, C2 reconciles identities, C3 traces lineage, and C4 ensures all of this context is computationally accessible. C4 is the direct prerequisite for A1 — analysis that operates on contextualized data requires that context to be readable by scientists and machines, not just human-documented.