D1: AI operates only on data that has passed through I→C→A
AI operates only on data that has passed through I→C→A — never on ungoverned, decontextualized inputs.
What does this mean?
D1 is the gate between Analyze and Decide. It states the hardest prerequisite in the ICAD sequence: no AI system — whether a recommendation engine, an agentic workflow, a predictive model, or a large language model — should operate on data that has not passed through the full Integration (I), Contextualization (C), and Analysis (A) sequence.
This is not an abstract principle. It is the operational response to the most common failure mode in pharmaceutical AI: organizations deploy AI/ML on data that is ungoverned, decontextualized, or unanalyzed, and the AI produces outputs that are technically sophisticated but scientifically unreliable.
Why most pharma AI initiatives fail
Industry surveys consistently report that 70–85% of AI initiatives in pharma do not reach production (Pistoia Alliance, 2024; Deloitte AI in Pharma, 2024). The failure mode is not algorithmic — the models work. The failure mode is data infrastructure:
- Training data is extracted from multiple systems without master data reconciliation (no C2) — the model learns from data that conflates different entities
- Input features lack scientific context (no C1) — the model cannot distinguish stability data from method validation data
- Data lineage is unavailable (no C3) — the model's outputs cannot be traced to governed sources for regulatory review
- Cross-program comparisons in the training set are ungoverned (no A2) — the model learns artifacts of data heterogeneity, not scientific patterns
D1 prevents these failures by design. If the data has not passed through I→C→A, it does not enter the decision layer.
For any AI system operating on the governed platform, trace every input to its source. Does every input originate from the contextualized, analyzed data layer? Or does any input bypass the sequence by pulling raw data from a CSV export, a direct database query, or an unstructured file share? If any input bypasses I→C→A, D1 is violated.
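The input audit described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed implementation: the `InputRef` type, the `GOVERNED_LAYER` label, and the example input and source names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical label for the contextualized, analyzed data layer.
GOVERNED_LAYER = "governed_analyzed_layer"


@dataclass(frozen=True)
class InputRef:
    """One input consumed by an AI system (illustrative structure)."""
    name: str
    source: str  # where the AI system actually reads this input from


def find_d1_bypasses(inputs):
    """Return every input that does not originate from the governed layer."""
    return [ref for ref in inputs if ref.source != GOVERNED_LAYER]


# Example audit of three inputs; two of them bypass the sequence.
inputs = [
    InputRef("assay_results", GOVERNED_LAYER),
    InputRef("stability_summary", "csv_export"),
    InputRef("batch_records", "direct_db_query"),
]

bypasses = find_d1_bypasses(inputs)
d1_satisfied = not bypasses  # a single bypass is enough to violate D1

for ref in bypasses:
    print(f"D1 violation: {ref.name} is read from {ref.source}")
```

The check is deliberately all-or-nothing: one bypassing input is enough to mark the whole AI system as violating D1.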
The data quality contract
D1 establishes a contract between the AI layer and the data infrastructure. The AI layer's requirements are:
- Identity certainty: every entity in the input dataset has a reconciled identity (C2). The AI never receives two different names for the same compound.
- Context completeness: every data point has its full scientific context (C1). The AI knows the method, the sample, the study, and the regulatory context for every value.
- Lineage integrity: every data point has complete lineage (C3). The AI's outputs are traceable to governed sources.
- Analytical validity: every comparison or aggregation in the input dataset was governed (A2) and reproducible (A4). The AI does not learn from artifacts.
If the data infrastructure cannot fulfill this contract, the AI layer should not operate. This is how D1 is enforced: the system itself enforces the gate, rather than leaving the decision to the data scientist's judgment.
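One way to picture the contract is as a gate function that either passes a dataset through unchanged or refuses it as a whole. The sketch below is an assumption-laden illustration: the `DataPoint` fields, the `REQUIRED_CONTEXT` keys, and the `D1GateError` exception are hypothetical names chosen to mirror the four clauses above, not part of any real platform API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataPoint:
    """Illustrative record; every field name here is an assumption."""
    value: float
    entity_id: Optional[str]   # reconciled identity (C2)
    context: Optional[dict]    # method, sample, study, regulatory context (C1)
    lineage: Optional[list]    # chain back to governed sources (C3)
    governed_analysis: bool    # comparison governed (A2) and reproducible (A4)


# Hypothetical context keys, taken from the "context completeness" clause.
REQUIRED_CONTEXT = {"method", "sample", "study", "regulatory_context"}


def contract_violations(dp):
    """List the contract clauses that a single data point fails."""
    failures = []
    if not dp.entity_id:
        failures.append("identity certainty (C2)")
    if not dp.context or not REQUIRED_CONTEXT.issubset(dp.context):
        failures.append("context completeness (C1)")
    if not dp.lineage:
        failures.append("lineage integrity (C3)")
    if not dp.governed_analysis:
        failures.append("analytical validity (A2, A4)")
    return failures


class D1GateError(Exception):
    """Raised when a dataset does not fulfill the data quality contract."""


def d1_gate(dataset):
    """Return the dataset unchanged, or refuse it if any point fails."""
    for index, dp in enumerate(dataset):
        failures = contract_violations(dp)
        if failures:
            raise D1GateError(f"data point {index} fails: {', '.join(failures)}")
    return dataset


good = DataPoint(
    98.7,
    "CMPD-001",
    {"method": "HPLC", "sample": "S-1", "study": "STAB-01",
     "regulatory_context": "IND"},
    ["lims:run-1"],
    True,
)
bad = DataPoint(98.7, None, {"method": "HPLC"}, [], True)

try:
    d1_gate([good, bad])
except D1GateError as err:
    print(err)
```

Raising an exception instead of returning a warning reflects the point made above: the gate is enforced by the system, so a caller cannot proceed with a dataset that fails the contract.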
The I→C→A sequence exists precisely because system heterogeneity is permanent. D1 does not require data from a single system — it requires that data from every system has been integrated (I), contextualized (C), and analyzed (A) under governed rules before AI operates on it. The FDA/EMA Guiding Principles of Good AI Practice in Drug Development explicitly require data governance and documentation as a precondition for credible AI — D1 is the architectural enforcement of that regulatory expectation.
For example, an AI-assisted Investigational New Drug (IND) compilation system assembles the Chemistry, Manufacturing, and Controls (CMC) section from governed data. D1 requires that every data point referenced in the compilation: (1) was integrated via the governed pipeline (I1–I4), (2) has full scientific context (C1–C4) linking it to the relevant study, method, and regulatory protocol, and (3) was analyzed under governed comparison rules (A1–A4). The AI system queries the contextualized data layer; it never scrapes PDFs, reads spreadsheets from file shares, or imports data that bypassed the pipeline.
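The three conditions in this example can be sketched as a stage-completion check applied before compilation. The principle identifiers (I1–I4, C1–C4, A1–A4) come from the text; the data point identifiers and the idea of storing completed stages as a set per data point are assumptions made purely for illustration.

```python
# All twelve principle checks a data point must have passed, in sequence order.
REQUIRED_STAGES = tuple(f"{phase}{n}" for phase in "ICA" for n in range(1, 5))


def missing_stages(completed):
    """Return the required stages absent from `completed`, in sequence order."""
    return [stage for stage in REQUIRED_STAGES if stage not in completed]


def select_for_compilation(data_points):
    """Split data points into eligible IDs and rejected IDs with their gaps.

    `data_points` maps a (hypothetical) data point ID to the set of
    stages it has completed.
    """
    eligible, rejected = [], {}
    for point_id, completed in data_points.items():
        gaps = missing_stages(completed)
        if gaps:
            rejected[point_id] = gaps
        else:
            eligible.append(point_id)
    return eligible, rejected


# Hypothetical data points referenced by a CMC compilation.
data_points = {
    "DP-001": set(REQUIRED_STAGES),
    "DP-002": set(REQUIRED_STAGES) - {"C3", "A2"},
}

eligible, rejected = select_for_compilation(data_points)
print("eligible:", eligible)
print("rejected:", rejected)
```

Reporting which stages are missing, not only that a data point was rejected, gives the data owner a concrete gap to close in the pipeline.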
Relationship to other principles
D1 is the enforcement of ICAD's sequential property at its most critical point. The entire I→C→A sequence exists to produce data that D1 can trust. Without D1, the compounding investment in integration, contextualization, and analysis is undercut by an AI layer that accepts ungoverned inputs alongside governed ones — contaminating the decision space.