C3: Lineage traces every data point from instrument through transformation to decision
Lineage traces every data point from instrument through transformation to decision, with no gaps.
What does this mean?
Data lineage is the complete, verifiable chain of custody from the original instrument measurement, through every transformation, aggregation, and derivation, to the final value used in a decision, whether that decision is a release specification, a regulatory claim, or an AI recommendation. C3 requires that this chain has no gaps: every step is recorded, every transformation is traceable, and the entire lineage is navigable by both scientists and machines. Scientists navigate lineage visually during investigations and audits. Machines navigate it programmatically for automated traceability checks, AI provenance, and regulatory report generation.
Why lineage is a regulatory requirement
21 CFR Part 11 requires secure, computer-generated, time-stamped audit trails that record the creation, modification, and deletion of electronic records. EU GMP Annex 11 extends this to require traceability of data through processing steps. ICH Q10 describes a pharmaceutical quality system that spans the product lifecycle, and that system depends on data whose integrity can be demonstrated at every stage.
In practice, this means a regulatory auditor should be able to point at any number in a Certificate of Analysis (CoA), a stability report, or an IND submission and trace it backward through every calculation, aggregation, and transformation to the original instrument measurement. The auditor should equally be able to start from any instrument measurement and trace forward to every derived value and decision that used it.
Select a value in a regulatory submission — for example, an assay result in the CMC section of an IND. Trace it backward to the original instrument file. Count the steps where lineage is maintained by the system versus the steps where a human must explain the connection verbally or through documentation. If any step requires verbal explanation, C3 has a gap at that point.
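The counting exercise above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `TraceStep` type, the `find_gaps` helper, and the four example steps are all hypothetical, and the only idea taken from the text is that a step explained by a human rather than recorded by the system counts as a gap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceStep:
    description: str
    recorded_by_system: bool  # False: a human had to explain this link


def find_gaps(steps):
    """Return the steps whose link is not maintained by the system."""
    return [s for s in steps if not s.recorded_by_system]


# Hypothetical backward trace of one assay value (illustrative data only).
trace = [
    TraceStep("assay value in CMC section -> result record", True),
    TraceStep("result record -> mean of injections", True),
    TraceStep("mean of injections -> peak areas (explained verbally)", False),
    TraceStep("peak areas -> raw instrument file", True),
]

gaps = find_gaps(trace)
print(f"{len(trace) - len(gaps)} of {len(trace)} steps are system-maintained")
for gap in gaps:
    print("C3 gap:", gap.description)
```

Any non-empty `gaps` list means C3 is not yet satisfied for that value.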
The lineage graph
C3 lineage is not a flat audit trail. It is a directed acyclic graph (DAG) connecting:
- Source nodes: raw instrument files (from I1/I2), the leaves of the graph when it is traced backward from a decision
- Transformation nodes: parsing, calculation, aggregation, statistical analysis — each recording the algorithm, parameters, and software version used
- Derivation nodes: cross-referencing data from multiple sources (e.g., an impurity profile that combines HPLC purity, LC-MS identification, and Karl Fischer moisture data)
- Decision nodes: the points where data supports a decision — CoA release, stability conclusion, regulatory claim, AI recommendation
Every edge in the graph records what transformed what, when, using which rules, and by which system or operator. This is what "no gaps" means: every edge exists, and every edge is attributable.
This graph structure is a logical requirement, not a storage prescription — it can be implemented in graph databases, relational databases with recursive queries, document stores, or linked data formats. C3 requires that the structure exists and is navigable, not that it uses a specific technology.
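As one illustration of the "relational database with recursive queries" option, the sketch below stores lineage edges in a single table and walks them backward with a recursive common table expression. It uses Python's standard `sqlite3` module with an in-memory database. The table name `edge`, its columns, and the node names are assumptions made for this example, not a required schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (upstream TEXT, downstream TEXT, rule TEXT)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [
        # Hypothetical lineage for a single reported value.
        ("raw_file", "peak_area", "peak integration"),
        ("peak_area", "assay_mean", "arithmetic mean"),
        ("assay_mean", "coa", "reporting"),
    ],
)

# Recursive CTE: start from the direct inputs of the target node, then
# repeatedly add the inputs of every node already found.
rows = conn.execute(
    """
    WITH RECURSIVE ancestors(node) AS (
        SELECT upstream FROM edge WHERE downstream = ?
        UNION
        SELECT e.upstream
        FROM edge AS e
        JOIN ancestors AS a ON e.downstream = a.node
    )
    SELECT node FROM ancestors
    """,
    ("coa",),
).fetchall()

upstream_nodes = {row[0] for row in rows}
print(sorted(upstream_nodes))
```

The same traversal could be written against a graph database or a document store. The point is only that the logical structure can be queried whichever store holds it.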
A Certificate of Analysis reports an assay value of 99.4% for an API batch. The lineage graph traces backward: 99.4% is the mean of triplicate injections → each injection produced a chromatographic peak area → each peak area was calculated by integration of the raw signal using specified parameters → each raw signal was acquired by a specific HPLC instrument under specific conditions → the reference standard used for quantitation traces to a specific lot with a specific assigned value. Every node and edge in this graph is recorded in the governed system. No human explanation is required to traverse it.
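The CoA example can be modelled as a small in-memory graph. The sketch below is one possible shape under stated assumptions: the `Node`, `Edge`, and `LineageGraph` names, the actor names `cds` and `lims`, and the placeholder timestamp are all hypothetical. Each edge carries the attributes the text requires (rule, actor, time), and `trace_back` walks from the decision to every upstream node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str
    kind: str   # "source", "transformation", "derivation", or "decision"
    label: str


@dataclass(frozen=True)
class Edge:
    upstream: str    # node_id of the input
    downstream: str  # node_id of the output
    rule: str        # which rule or algorithm was applied
    actor: str       # which system or operator applied it
    timestamp: str   # when it was applied (placeholder values below)


class LineageGraph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, edge):
        # An edge may only connect nodes that are already recorded.
        for end in (edge.upstream, edge.downstream):
            if end not in self.nodes:
                raise KeyError(f"unknown node: {end}")
        self.edges.append(edge)

    def trace_back(self, node_id):
        """Return the ids of every upstream ancestor of node_id."""
        seen, stack, order = set(), [node_id], []
        while stack:
            current = stack.pop()
            for e in self.edges:
                if e.downstream == current and e.upstream not in seen:
                    seen.add(e.upstream)
                    order.append(e.upstream)
                    stack.append(e.upstream)
        return order


T = "2024-01-01T00:00:00Z"  # placeholder timestamp, not real data
g = LineageGraph()
g.add_node(Node("coa", "decision", "CoA assay value 99.4%"))
g.add_node(Node("mean", "transformation", "mean of triplicate injections"))
g.add_node(Node("ref_std", "source", "reference standard lot"))
g.add_edge(Edge("mean", "coa", "report on CoA", "lims", T))
g.add_edge(Edge("ref_std", "mean", "quantitation against assigned value", "cds", T))
for i in (1, 2, 3):
    g.add_node(Node(f"raw{i}", "source", f"raw HPLC signal, injection {i}"))
    g.add_node(Node(f"area{i}", "transformation", f"peak area, injection {i}"))
    g.add_edge(Edge(f"raw{i}", f"area{i}", "peak integration", "cds", T))
    g.add_edge(Edge(f"area{i}", "mean", "arithmetic mean", "cds", T))

ancestors = g.trace_back("coa")
sources = sorted(n for n in ancestors if g.nodes[n].kind == "source")
print(sources)
```

Because `add_edge` rejects edges to unrecorded nodes, a missing node surfaces when the edge is written rather than during an audit.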
Forward lineage (impact analysis)
Lineage is bidirectional. Backward lineage answers "where did this number come from?" Forward lineage answers "what depends on this data?" Both are essential:
- A reference standard lot is recalled — which assay results are affected? Which CoAs reference those results? Which batches were released using those CoAs?
- An instrument calibration is found to be out of specification — which measurements during the affected period must be investigated? Which studies used those measurements?
- A method is updated — which historical results were generated with the prior version? Do any need to be re-evaluated?
Without forward lineage, impact analysis requires manual search across systems, sites, and filing cabinets. With forward lineage, each of these questions becomes a single traversal of the graph that the system can answer in seconds.
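The reference standard recall scenario can be sketched as a breadth-first walk over downstream edges. The adjacency map, node names, and `impacted` function below are hypothetical. A governed system would read the edges from its lineage store rather than from a literal dictionary.

```python
from collections import deque

# Hypothetical downstream adjacency: node -> nodes that directly consume it.
downstream = {
    "ref_std_lot_A": ["assay_001", "assay_002"],
    "assay_001": ["coa_batch_10"],
    "assay_002": ["coa_batch_11"],
    "coa_batch_10": ["release_batch_10"],
    "coa_batch_11": ["release_batch_11"],
    "ref_std_lot_B": ["assay_003"],
    "assay_003": ["coa_batch_12"],
}


def impacted(start, downstream):
    """Return every node reachable from start, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


affected = impacted("ref_std_lot_A", downstream)
print(affected)
```

The walk returns the affected assay results first, then the CoAs, then the batch releases, which matches the order of the questions in the first bullet. Results that trace only to lot B are not reported.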
Lineage does not break at tool boundaries. When data exits the governed system for statistical analysis, the governed system maintains the outbound edge — what was exported, when, by whom, to which tool. When the result returns, the governed system records the inbound edge — what conclusion, derived from what source data, via what analytical method. The external tool’s internal processing is an opaque segment in the lineage graph, governed at its input and output boundaries.
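One way to record the two boundary edges is sketched below. The field names follow the attributes listed in the paragraph above. The use of a SHA-256 content hash to tie the returned result to the exact exported bytes is an assumption of this example, not something C3 prescribes. All identifiers, names, and timestamps are placeholders.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class OutboundEdge:
    dataset_id: str
    content_sha256: str  # fingerprint of exactly what was exported
    exported_by: str
    exported_at: str
    tool: str            # the external tool that received the data


@dataclass(frozen=True)
class InboundEdge:
    result_id: str
    conclusion: str
    source_sha256: str   # fingerprint of the data the result claims to use
    method: str          # analytical method applied in the external tool


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def boundary_is_closed(outbound: OutboundEdge, inbound: InboundEdge) -> bool:
    """True when the returned result refers to the bytes that were exported."""
    return outbound.content_sha256 == inbound.source_sha256


exported = b"sample_id,peak_area\nS1,1000\n"  # illustrative export payload
out = OutboundEdge(
    dataset_id="ds-1",
    content_sha256=sha256_of(exported),
    exported_by="analyst-1",
    exported_at="2024-01-01T00:00:00Z",  # placeholder timestamp
    tool="external-stats-tool",
)
back = InboundEdge(
    result_id="res-1",
    conclusion="no significant trend",
    source_sha256=sha256_of(exported),
    method="linear regression",
)

print(boundary_is_closed(out, back))
```

The external tool's internal steps stay opaque. The check only establishes that the segment is anchored to known data on the way out and on the way back in.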
Relationship to other principles
C3 builds on I2 (the native file as the authoritative starting point), C1 (the scientific context that gives lineage meaning), and C2 (reconciled identities — without which lineage chains for the same entity under different names remain disconnected). C3 enables A3 (traceable models) — you cannot trace a statistical model to its training data if the training data itself has no lineage. C3 also enables D2 (traceable AI recommendations) by extending the lineage graph from analysis through to automated decision.