A3: Statistical models are traceable to their training data with full provenance
Statistical models are traceable to their training data with full provenance — no black-box analytics.
What does this mean?
When a statistical model, a machine learning algorithm, or any computational method generates a result — a prediction, a classification, a trend estimate — A3 requires that the result can be traced backward to: (1) the specific dataset used for model fitting or training, (2) the algorithm and its parameterization, (3) the software version and computational environment, and (4) the governance rules that selected the training data (from A2).
This is not a documentation requirement. It is an infrastructure requirement. The provenance of a model and its outputs must be machine-navigable and version-controlled, extending the lineage graph (C3) from data through analysis to analytical result.
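As a minimal sketch of what "machine-navigable" could mean in practice, the snippet below stores lineage as a small graph and walks it backward from a result to everything upstream. All node identifiers and the `LINEAGE` structure are hypothetical placeholders, not part of any particular system; a real implementation would presumably query the C3 lineage store rather than an in-memory dictionary.

```python
from collections import deque

# Hypothetical lineage edges: each node maps to the nodes it was derived from.
LINEAGE = {
    "result:shelf-life-estimate": ["model:stability-v3"],
    "model:stability-v3": [
        "dataset:train-snapshot-17",   # (1) dataset used for fitting
        "algorithm:arrhenius-linear",  # (2) algorithm and parameterization
        "environment:image-42",        # (3) software version and environment
        "rule:a2-selection-9",         # (4) governance rule that selected the data
    ],
    "dataset:train-snapshot-17": ["instrument:hplc-04"],
}


def trace_back(node, lineage):
    """Return every upstream node reachable from `node`, breadth-first."""
    seen = []
    queue = deque(lineage.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


ancestors = trace_back("result:shelf-life-estimate", LINEAGE)
for item in ancestors:
    print(item)
```

The point of the sketch is that the four traceability targets are reachable by traversal, not by reading a document.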
Why traceability is non-negotiable in pharma
In pharmaceutical R&D, analytical models supporting regulatory decisions are subject to validation requirements. ICH Q2(R2) defines validation requirements for analytical procedures. ICH Q8 through Q12 cover pharmaceutical development, quality risk management, the pharmaceutical quality system, drug substance development, and product lifecycle management, including process understanding and control strategies. The FDA/EMA Guiding Principles of Good AI Practice in Drug Development (January 2026) call for life cycle management, data governance and documentation, risk-based performance assessment, and clear context of use. The FDA’s draft guidance on AI to Support Regulatory Decision-Making (January 2025) defines a formal “Context of Use” (COU) concept: the specific role and scope of a model in addressing a question of interest, against which the credibility of its output is assessed. COU maps directly to A3’s “operational scope.”
A model that predicts shelf life, identifies impurities, or recommends process parameters — and cannot demonstrate what data it was trained on, under what governance rules that data was selected, and how the model was validated — will not withstand regulatory scrutiny.
For any model output in the governed system: (1) Can you retrieve the exact training dataset, including its C3 lineage? (2) Can you identify the algorithm, its hyperparameters, and the software version? (3) Can you identify the governance rules (A2) that selected the training data? (4) Can you re-run the model and reproduce the output? If any of these four questions cannot be answered from system records, A3 is not satisfied.
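The four questions above can be sketched as an automated check. The `OutputRecord` fields and the `a3_gaps` function are illustrative assumptions about what system records might hold; the reproduction test here simply compares a stored output with a re-run output within a tolerance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutputRecord:
    """Hypothetical system record behind one model output."""
    training_dataset_id: Optional[str]
    training_lineage_id: Optional[str]   # pointer into C3 lineage
    algorithm: Optional[str]
    hyperparameters: Optional[dict]
    software_version: Optional[str]
    governance_rule_ids: tuple           # A2 rules that selected the data
    recorded_output: Optional[float]
    rerun_output: Optional[float]


def a3_gaps(rec, tolerance=1e-9):
    """Return the list of unanswered questions; an empty list means A3 holds."""
    gaps = []
    if not (rec.training_dataset_id and rec.training_lineage_id):
        gaps.append("Q1: training dataset or its C3 lineage not retrievable")
    if not (rec.algorithm and rec.hyperparameters is not None
            and rec.software_version):
        gaps.append("Q2: algorithm, hyperparameters, or version missing")
    if not rec.governance_rule_ids:
        gaps.append("Q3: governance rules (A2) not identified")
    if (rec.recorded_output is None or rec.rerun_output is None
            or abs(rec.recorded_output - rec.rerun_output) > tolerance):
        gaps.append("Q4: output not reproduced on re-run")
    return gaps


complete = OutputRecord("ds-17", "lin-17", "arrhenius-linear", {"order": 0},
                        "1.4.2", ("rule-9",), 97.3, 97.3)
missing_rules = OutputRecord("ds-17", "lin-17", "arrhenius-linear", {"order": 0},
                             "1.4.2", (), 97.3, 97.3)
print(a3_gaps(complete))
print(a3_gaps(missing_rules))
```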
The model registry
A3 implies the existence of a model registry — a governed catalog of analytical models, each versioned and linked to:
- Training data: a frozen snapshot of the dataset used, with full C3 lineage to source instruments
- Validation data: the independent dataset used to assess performance, also with full lineage
- Algorithm specification: the mathematical method (linear regression, random forest, neural network, etc.), its parameters, and the objective function
- Performance metrics: accuracy, precision, recall, calibration error, or domain-specific metrics (e.g., prediction interval coverage for stability models)
- Operational scope: the conditions under which the model is valid (compound class, technique, measurement range)
- Retirement criteria: conditions under which the model should be retrained or replaced (new data distribution, performance degradation, method change)
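One way to picture a registry entry is as an immutable, versioned record with one field per linked item in the list above. The class and field names below are assumptions made for illustration, and the in-memory `ModelRegistry` stands in for whatever governed store an organization actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    """Hypothetical registry entry; one field per linked item."""
    name: str
    version: int
    training_data_ref: str      # frozen snapshot with C3 lineage
    validation_data_ref: str    # independent dataset with lineage
    algorithm_spec: dict        # method, parameters, objective function
    performance_metrics: dict
    operational_scope: dict     # compound class, technique, range
    retirement_criteria: dict


class ModelRegistry:
    """In-memory stand-in for a governed, append-only catalog."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        key = (entry.name, entry.version)
        if key in self._entries:
            # Existing versions are never overwritten; a change needs a new version.
            raise ValueError(f"{entry.name} v{entry.version} is already registered")
        self._entries[key] = entry

    def get(self, name, version):
        return self._entries[(name, version)]

    def latest(self, name):
        versions = [v for (n, v) in self._entries if n == name]
        if not versions:
            raise KeyError(name)
        return self._entries[(name, max(versions))]


registry = ModelRegistry()
for version in (1, 2):
    registry.register(RegistryEntry(
        name="stability-model",
        version=version,
        training_data_ref=f"snapshot-train-{version}",
        validation_data_ref=f"snapshot-valid-{version}",
        algorithm_spec={"method": "linear regression"},
        performance_metrics={},
        operational_scope={"technique": "HPLC"},
        retirement_criteria={"min_interval_coverage": 0.90},
    ))
print(registry.latest("stability-model").version)
```

Refusing to overwrite an existing version is the design choice that keeps earlier outputs traceable to the exact model that produced them.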
The technical model registry operates within an organizational AI management framework, such as ISO/IEC 42001, that governs model development, deployment, monitoring, and retirement across the organization.
Consider a model that predicts 36-month purity for a monoclonal antibody from 6-month accelerated and real-time stability data. The model registry entry records: training data (426 measurements from 8 batches, all from governed HPLC purity methods with lineage to source instruments), algorithm (Arrhenius-based kinetic model with linear degradation assumption), validation (independent 3-batch dataset held out from training), performance (mean absolute error: 0.3% purity, prediction interval coverage: 96.2%), scope (applicable to IgG1 molecules in aqueous formulations at 2–8°C), and retirement trigger (retrain when prediction interval coverage drops below 90% on new batch data).
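The retirement trigger in this example can be sketched as a simple coverage check on new batch data. The function names are hypothetical, and the purity values and interval bounds below are made-up numbers used only to exercise the code; they are not results from the example model.

```python
def interval_coverage(observed, lower, upper):
    """Fraction of observations that fall inside their prediction interval."""
    if not observed or not (len(observed) == len(lower) == len(upper)):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(lo <= y <= hi for y, lo, hi in zip(observed, lower, upper))
    return hits / len(observed)


def needs_retraining(coverage, threshold=0.90):
    """Apply the retirement trigger: retrain when coverage drops below threshold."""
    return coverage < threshold


# Illustrative values only (% purity): observed results on new batches and
# the model's predicted interval bounds for the same time points.
observed = [98.1, 97.9, 97.4, 96.8]
lower = [97.8, 97.5, 97.5, 96.5]
upper = [98.4, 98.1, 98.1, 97.1]

coverage = interval_coverage(observed, lower, upper)
print(coverage, needs_retraining(coverage))
```

In this toy data set three of four observations fall inside their intervals, so the check flags the model for retraining. A governed system would also record the check itself, so that the decision to retrain is traceable.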
A3 traceability extends to external analytical tools. When a scientist exports governed data to a statistical package and produces a result, the governed system must record: what data entered the tool, what model or script was applied, what parameters were used, and what result was produced. The tool's project file alone is not sufficient traceability — the governed system is the authoritative provenance record, even for analyses performed outside it.
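A sketch of such a record, assuming the governed system can capture the exported data and the script as bytes: content hashes identify exactly what entered the tool and what was applied, while parameters and results are stored directly. The function name, field names, and sample values are all hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Content hash used to identify an exact input or script."""
    return hashlib.sha256(data).hexdigest()


def record_external_analysis(input_data: bytes, script: bytes,
                             parameters: dict, result: dict,
                             tool: str, tool_version: str) -> dict:
    """Build the provenance record the governed system would store."""
    return {
        "tool": tool,
        "tool_version": tool_version,
        "input_sha256": sha256_hex(input_data),    # what data entered the tool
        "script_sha256": sha256_hex(script),       # what model or script was applied
        "parameters": dict(parameters),            # what parameters were used
        "result": dict(result),                    # what result was produced
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Placeholder export and script contents, for illustration only.
exported = b"batch,month,purity\nB1,0,99.1\nB1,6,98.7\n"
script = b"fit <- lm(purity ~ month, data = d)\n"

record = record_external_analysis(
    exported, script,
    parameters={"confidence_level": 0.95},
    result={"slope_per_month": -0.067},
    tool="example-stats-package",
    tool_version="0.0.0",
)
print(record["input_sha256"])
```

Hashing the inputs means a later reviewer can confirm that a retained export or script is byte-for-byte the one used, independent of the tool's own project file.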
Relationship to other principles
A3 extends C3 (lineage) into the analytical domain. Where C3 traces data from instrument to derived value, A3 traces the analytical model itself — its training data, its algorithm, and its outputs. A3 enables D2 (traceable AI recommendations) by ensuring that the analytical foundation underlying any AI decision is itself fully traceable.