C1: Data is linked to its scientific context
Data is linked to its scientific context — method, sample, experiment, study, program, and regulatory submission.
What does this mean?
Instrument data — even when captured at the point of creation (I1), preserved in its native format (I2), and converted to a vendor-neutral representation through industrialized integration (I3) — is scientifically inert without context. A chromatogram is a collection of peaks and retention times. It becomes scientific evidence only when linked to: which method was used, which sample was injected, which experiment this injection belongs to, which study that experiment serves, which program the study supports, and which regulatory submission the program targets.
C1 requires that these linkages are established explicitly and programmatically — not through file naming conventions, folder structures, or informal knowledge that exists only in a scientist's memory.
The scientific context hierarchy
In analytical R&D and regulatory submissions, scientific data typically follows a hierarchy of context:
- Measurement: A single instrument acquisition (e.g., one HPLC injection, one NMR scan)
- Method: The analytical procedure applied (e.g., a USP <621>-compliant gradient method for impurity profiling)
- Sample: The material measured (compound identity, batch, formulation, stability time point)
- Experiment: The designed investigation (method validation, forced degradation study, in-process control)
- Study: The regulatory-defined scope (e.g., ICH Q1A stability study, bioequivalence study)
- Program: The drug development program (compound portfolio, therapeutic area, development phase)
- Submission: The regulatory filing that references the data (Investigational New Drug application, New Drug Application, Marketing Authorization Application)
Each level in the hierarchy also connects to a scientific purpose — the question the experiment is designed to answer (e.g., does formulation B improve bioavailability vs. formulation A?). Purpose motivates the experiment, shapes the study design, and ultimately determines the regulatory strategy. It is not a single layer in the hierarchy — it is the rationale that runs through it.
C1 requires that each of these links is navigable by scientists and machines. Given a measurement, a scientist or system can traverse upward to the submission. Given a submission, the system can traverse downward to every measurement that supports it.
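The two traversals can be sketched with a small in-memory graph. This is only an illustration under simplifying assumptions: the class names, identifiers (`SUB-1`, `MEAS-1`, and so on), and the single-parent structure are hypothetical, and method and sample are modeled as attributes of a measurement rather than as levels of their own. A real context store would likely allow many-to-many links, for example one study supporting several submissions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContextNode:
    node_id: str
    level: str                        # "measurement", "experiment", "study", ...
    parent_id: Optional[str] = None   # explicit link to the next level up
    method_id: Optional[str] = None   # set on measurements only
    sample_id: Optional[str] = None   # set on measurements only


class ContextGraph:
    """In-memory stand-in for a governed context store (hypothetical)."""

    def __init__(self):
        self._nodes = {}
        self._children = {}

    def add(self, node):
        self._nodes[node.node_id] = node
        if node.parent_id is not None:
            self._children.setdefault(node.parent_id, []).append(node.node_id)

    def upward(self, node_id):
        """Walk from any node to the top of the hierarchy."""
        chain = []
        current_id = node_id
        while current_id is not None:
            node = self._nodes[current_id]  # a KeyError exposes a broken link
            chain.append(node)
            current_id = node.parent_id
        return chain

    def measurements_under(self, node_id):
        """Collect every measurement reachable below a node."""
        found, stack = [], [node_id]
        while stack:
            current_id = stack.pop()
            if self._nodes[current_id].level == "measurement":
                found.append(current_id)
            stack.extend(self._children.get(current_id, []))
        return sorted(found)


graph = ContextGraph()
for node in [
    ContextNode("SUB-1", "submission"),
    ContextNode("PRG-1", "program", parent_id="SUB-1"),
    ContextNode("STD-1", "study", parent_id="PRG-1"),
    ContextNode("EXP-1", "experiment", parent_id="STD-1"),
    ContextNode("MEAS-1", "measurement", "EXP-1", "METH-7", "SMP-42"),
    ContextNode("MEAS-2", "measurement", "EXP-1", "METH-7", "SMP-43"),
]:
    graph.add(node)

print([n.level for n in graph.upward("MEAS-1")])   # measurement up to submission
print(graph.measurements_under("SUB-1"))           # submission down to measurements
```

Because every link is stored as data rather than implied by a path or a file name, the same structure answers both questions without any string parsing.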
Other operational domains have their own context hierarchies. Process manufacturing uses ISA-88 (batch control) and ISA-95 (enterprise-site-area) models. Quality management organizes context around investigations, root causes, and corrective actions. C1 does not prescribe a specific hierarchy — it requires that whatever context model the organization uses, the links between data and context are explicit, navigable by scientists and machines, and governed.
The transformation from raw measurement to derived result — curve fitting, normalization to controls, statistical reduction, outlier exclusion — is itself scientific context. The calculation parameters, reduction model, and control assignments define how a raw reading became the reported value. C1 requires these reduction steps to be captured and linked, not buried in report templates that vary by lab, assay, and user.
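As a rough sketch of what "captured and linked" could look like, the example below stores the reduction step alongside the value it produced. The record types, field names, and the percent-of-control calculation are assumptions chosen for illustration, and the readings are made-up numbers, not data from any study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReductionStep:
    name: str
    parameters: dict      # calculation parameters, stored with the result
    input_ids: tuple      # raw measurements that were used
    excluded_ids: tuple   # raw measurements that were left out


@dataclass(frozen=True)
class DerivedResult:
    value: float
    unit: str
    steps: tuple          # ordered ReductionStep records


def percent_of_control(raw, sample_ids, control_ids, excluded_ids=()):
    """Normalize the mean sample reading to the mean control reading."""
    samples = [i for i in sample_ids if i not in excluded_ids]
    controls = [i for i in control_ids if i not in excluded_ids]
    sample_mean = sum(raw[i] for i in samples) / len(samples)
    control_mean = sum(raw[i] for i in controls) / len(controls)
    step = ReductionStep(
        name="percent_of_control",
        parameters={"control_ids": tuple(controls), "statistic": "mean"},
        input_ids=tuple(samples + controls),
        excluded_ids=tuple(excluded_ids),
    )
    return DerivedResult(100.0 * sample_mean / control_mean, "% of control", (step,))


# Illustrative readings only; S3 is treated as an excluded outlier.
raw = {"S1": 40.0, "S2": 60.0, "S3": 500.0, "C1": 100.0, "C2": 100.0}
result = percent_of_control(raw, ["S1", "S2", "S3"], ["C1", "C2"], excluded_ids=("S3",))
print(result.value, result.steps[0].excluded_ids)
```

The reported value never travels without the record of which controls were assigned and which readings were excluded, so a reviewer can reconstruct how the raw readings became the result.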
R&D data also exists in orthogonal context dimensions that overlay the scientific hierarchy: business domain (therapeutic area, portfolio classification), organizational (legal entity, site, team ownership), decision domain (which milestone or gate this data feeds), and provenance beyond lineage (who commissioned this work, under what authority). In pharma R&D — where datasets are sparse, expensive, and each data point carries disproportionate weight — these additional context axes are especially critical. Regulatory context — GxP classification, applicable jurisdiction, and the guidance version in effect at time of creation — is another axis that overlays the scientific hierarchy. Data created under one regulatory framework may be reviewed under a later one; the governed system must record both.
Pick any data file in the governed system. Can you, in three clicks or one API call, identify: the method, the sample, the study, and the regulatory submission it supports? If not, C1 is not satisfied.
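The same test can be run across a whole catalog rather than one file at a time. The sketch below assumes a hypothetical catalog that maps each file identifier to its resolved context. The field names and identifiers are illustrative and do not come from any particular system.

```python
REQUIRED_LINKS = ("method", "sample", "study", "submission")


def missing_links(context):
    """Return the required links that are absent or empty for one file."""
    return [name for name in REQUIRED_LINKS if not context.get(name)]


def audit(catalog):
    """Map each file that fails the check to the links it lacks."""
    gaps = {}
    for file_id, context in catalog.items():
        missing = missing_links(context)
        if missing:
            gaps[file_id] = missing
    return gaps


# Hypothetical catalog entries: file-002 has no submission link.
catalog = {
    "file-001": {"method": "METH-7", "sample": "SMP-42",
                 "study": "STD-1", "submission": "SUB-1"},
    "file-002": {"method": "METH-7", "sample": "SMP-43", "study": "STD-1"},
}
print(audit(catalog))
```

An empty result from `audit` would mean every file in the catalog passes the check. Any entry in the result names the exact links that still need to be established.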
Why folder structures and naming conventions fail
Most organizations encode context through file naming conventions (PROJ-ABC_STB-3M_HPLC-IMP_001.dat) or folder hierarchies (Program/Study/Method/Sample/). This fails for three reasons:
- Conventions drift across sites. The naming convention at Site A diverges from Site B within months of deployment. There is no enforcement mechanism.
- Folder structures are rigid. A method transfer study involves data from two sites — it does not fit in either site's folder hierarchy. A cross-program comparison requires data from multiple programs — it cannot live in one program's folder.
- Machine readability requires formal schemas. A human can parse STB-3M as "stability, 3-month time point." A machine cannot, unless a formal schema defines the mapping.
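To make the third point concrete, the sketch below shows the minimum a machine needs before it can read STB-3M: an explicit pattern plus lookup tables for each code. The tables are assumptions for illustration. Only the STB and M meanings come from the example above, and the W entry is hypothetical. The sketch shows what a schema has to state, and is not a suggestion that context should be carried in file names.

```python
import re

# Explicit vocabulary: a code means something only if it is listed here.
STUDY_TYPES = {"STB": "stability"}
TIME_UNITS = {"M": "month", "W": "week"}   # "W" is a hypothetical extra entry

TOKEN = re.compile(r"^(?P<study_type>[A-Z]+)-(?P<amount>\d+)(?P<unit>[A-Z])$")


def interpret(token):
    """Translate a token such as 'STB-3M' using the schema, or fail loudly."""
    match = TOKEN.match(token)
    if match is None:
        raise ValueError(f"token does not follow the schema: {token!r}")
    study_type = STUDY_TYPES.get(match["study_type"])
    unit = TIME_UNITS.get(match["unit"])
    if study_type is None or unit is None:
        raise ValueError(f"token uses a code the schema does not define: {token!r}")
    return {
        "study_type": study_type,
        "time_point": int(match["amount"]),
        "time_unit": unit,
    }


print(interpret("STB-3M"))
```

A variant such as `STAB_3mo`, which a scientist would read without difficulty, raises `ValueError` here. That is the drift problem in miniature: without enforcement, the schema and the names in use part ways.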
A 36-month ICH Q1A stability study for a monoclonal antibody produces ~2,400 individual measurements across HPLC (purity, impurities), CE-SDS (fragmentation), icIEF (charge variants), and SEC (aggregation). C1 requires that every measurement links to: the stability protocol, the storage condition (25°C/60% RH or 40°C/75% RH), the time point, the batch number, the analytical method version, and the originating instrument. When a Chemistry, Manufacturing, and Controls (CMC) section in the Biologics License Application (BLA) references "purity trend at 25°C/60% RH," the system can retrieve every HPLC measurement that supports that claim — across sites, across instruments, across the full 36-month timeline.
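A retrieval of this kind might look like the sketch below, assuming each measurement record already carries the links listed above. The record type, field names, condition codes, and identifiers are all hypothetical, and the four records are placeholders rather than study data.

```python
from dataclasses import dataclass

# Hypothetical condition codes standing in for the storage conditions.
LONG_TERM = "25C/60RH"      # 25 degC / 60% RH
ACCELERATED = "40C/75RH"    # 40 degC / 75% RH


@dataclass(frozen=True)
class StabilityMeasurement:
    measurement_id: str
    protocol_id: str
    condition: str
    time_point_months: int
    batch: str
    technique: str
    method_version: str
    instrument_id: str
    site: str


def supporting_measurements(records, technique, condition):
    """Return matching measurements in time-point order, across sites and instruments."""
    hits = [r for r in records if r.technique == technique and r.condition == condition]
    return sorted(hits, key=lambda r: (r.time_point_months, r.measurement_id))


records = [
    StabilityMeasurement("M-003", "PROT-1", LONG_TERM, 3, "B-01", "HPLC", "v2", "HPLC-07", "Site B"),
    StabilityMeasurement("M-001", "PROT-1", LONG_TERM, 0, "B-01", "HPLC", "v2", "HPLC-02", "Site A"),
    StabilityMeasurement("M-002", "PROT-1", ACCELERATED, 3, "B-01", "HPLC", "v2", "HPLC-02", "Site A"),
    StabilityMeasurement("M-004", "PROT-1", LONG_TERM, 3, "B-01", "SEC", "v1", "SEC-01", "Site A"),
]

trend = supporting_measurements(records, "HPLC", LONG_TERM)
print([(m.measurement_id, m.time_point_months, m.site) for m in trend])
```

The filter works on structured fields, so measurements from different sites and instruments land in the same result without anyone reconciling folder paths or file names.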
Relationship to other principles
C1 transforms the output of Integrate (I1–I4) from raw instrument data into scientifically meaningful records. Without C1, the governed pipeline contains data — but data without context is not evidence. C1 is the prerequisite for C2 (master data reconciliation), because reconciliation requires knowing what each data point represents before reconciling identity across systems.
C1 also supports end-to-end lineage (C3), which has a direct bearing on intellectual property protection. Patent claims, trade secret documentation, and freedom-to-operate assessments all require tracing from a legal assertion back to the original experimental evidence. When lineage is complete and governed, from the ELN entry where the hypothesis was recorded, through every instrument measurement, to the analytical conclusion, the organization can demonstrate inventorship, establish priority, and defend its intellectual property with auditable provenance rather than reconstructed narratives.