I2

I2: Raw data is preserved in its native format with full provenance

Raw data is preserved in its native format with full provenance to source system, timestamp, and operator.

What does this mean?

When instrument data is captured (I1), the ingestion pipeline must store the original file exactly as the source system produced it, in its native format. Transformation into standardized formats (Allotrope, AnIML, mzML, …) occurs as a secondary step, but the native file is always retained as the authoritative record.

Full provenance means the system records, at minimum: the source instrument (make, model, serial number, and firmware version), the data system that produced the output (CDS, ELN, LIMS, MES, or ERP, with its version), the operator identity (where available from the source), the acquisition timestamp (from the instrument, not the ingestion system), and a cryptographic hash of the original file for integrity verification.

Why transformation at ingestion is destructive

Many integration approaches convert instrument data at the point of ingestion — parsing a chromatography result file into a relational database schema, extracting peak tables from a spectrum, or converting a native format into CSV. This is lossy in most cases. A CSV export of a chromatogram discards the raw signal trace, baseline parameters, integration events, audit trails, and system suitability test (SST) metadata embedded in the native file.

When a regulatory auditor asks to see the original data, the organization must produce the instrument's native output. If the governed system stored only a converted copy, the original exists only on the instrument's local storage — which may have been overwritten, archived to tape, or decommissioned. The provenance chain is broken.

The dual-representation model

I2 does not prohibit transformation. It requires that transformation be additive: a conversion is stored alongside the native file and never replaces it. The governed system maintains:

  1. The native file — immutable, hash-verified, stored with its original filename and directory structure
  2. The converted representation — extracted into a queryable, interoperable format (e.g., Allotrope) with a bidirectional link to the native file

This dual-representation model satisfies both regulatory requirements (original records for audit) and operational requirements (structured data for analysis).
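
The two representations and the link between them can be sketched as a small registry. This is a minimal illustration rather than a prescribed design: the class names (NativeFile, ConvertedRepresentation, DualStore), the in-memory dictionaries, and the use of the SHA-256 hash as the native record key are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class NativeFile:
    sha256: str          # hash of the original bytes, used here as the record key
    original_path: str   # original filename and directory, kept verbatim


@dataclass(frozen=True)
class ConvertedRepresentation:
    conversion_id: str
    target_format: str   # e.g. "Allotrope"
    native_sha256: str   # forward link: conversion -> native file


class DualStore:
    """Hypothetical in-memory registry; a real system would use durable storage."""

    def __init__(self) -> None:
        self._native: Dict[str, NativeFile] = {}
        self._converted: Dict[str, ConvertedRepresentation] = {}
        self._by_native: Dict[str, List[str]] = {}  # reverse link: native -> conversions

    def add_native(self, native: NativeFile) -> None:
        # Additive only: an existing native record is never overwritten.
        if native.sha256 in self._native:
            raise ValueError("native file already registered")
        self._native[native.sha256] = native
        self._by_native[native.sha256] = []

    def add_conversion(self, conv: ConvertedRepresentation) -> None:
        # A conversion is accepted only if its native file is already retained.
        if conv.native_sha256 not in self._native:
            raise KeyError("conversion has no retained native file")
        if conv.conversion_id in self._converted:
            raise ValueError("conversion already registered")
        self._converted[conv.conversion_id] = conv
        self._by_native[conv.native_sha256].append(conv.conversion_id)

    def native_for(self, conversion_id: str) -> NativeFile:
        """Follow the link from a converted representation to its native file."""
        return self._native[self._converted[conversion_id].native_sha256]

    def conversions_for(self, sha256: str) -> List[str]:
        """Follow the link from a native file to every conversion derived from it."""
        return list(self._by_native[sha256])
```

Rejecting a conversion whose native file is not registered is one way to make "additive, not replacing" a property of the store itself rather than a convention the pipeline has to remember.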

Source of truth governance

Integration does not replace source systems. The ELN, LIMS, or instrument software remains the authoritative record for operations. The ICAD index is a governed, provenance-linked copy — every indexed record traces back to its originating entry in the source system. This is not duplication; it is governed extraction. The source system answers "what is the record?" The index answers "what does this record mean in context, and how does it connect to everything else?"

Long-term preservation

The dual-representation model establishes true-copy eligibility at ingestion time, not at decommissioning time. When a source system is eventually retired, the governed copy — native file plus validated conversion with complete lineage — meets the regulatory definition of a true copy: "verified to have the same information, including data that describe the context, content, and structure" (MHRA GxP Data Integrity Guidance). Validated conversion (I3) plus complete lineage (C3) means the organization does not face a last-minute migration crisis at decommissioning. The ICAD pipeline has been producing true-copy-eligible records from day one.

Operational test

Given any data point in the governed system, can you retrieve the original instrument file, verify its integrity via cryptographic hash, and confirm it has not been modified since acquisition? If yes, I2 is satisfied.
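
The operational test can be expressed as a short check. The sketch below assumes a hypothetical index that maps a data point identifier to the stored native file path and the SHA-256 recorded for it; the function names and the index shape are illustrative, not part of any ICAD interface.

```python
import hashlib
from pathlib import Path
from typing import Dict, Tuple


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large instrument files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_data_point(data_point_id: str, index: Dict[str, Tuple[Path, str]]) -> bool:
    """Return True only if the native file is retrievable and its hash still matches."""
    entry = index.get(data_point_id)
    if entry is None:
        return False  # no link from the data point to a native file
    native_path, recorded_sha256 = entry
    if not native_path.is_file():
        return False  # the original cannot be retrieved
    return sha256_of_file(native_path) == recorded_sha256.lower()
```

A match shows that the file is unchanged since the hash was recorded. Because the hash is computed at ingestion, the check covers the period from ingestion onward; capture at the point of generation (I1) is what closes the gap back to acquisition.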

Provenance metadata requirements

The provenance record for each ingested file must include:

  • Source instrument: make, model, serial number, firmware version
  • Data system: name, version
  • Operator: user ID from the source system (not the integration service account)
  • Acquisition timestamp: from the instrument clock, not the ingestion system clock
  • File hash: SHA-256 of the native file, computed at point of ingestion
  • Ingestion metadata: pipeline version, ingestion timestamp, parsing rules applied
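
As an illustration, the fields above could be carried in a single immutable record. The field names, the ISO 8601 string timestamps, and the build_record helper are assumptions made for this sketch, not a defined schema.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    instrument_make: str
    instrument_model: str
    instrument_serial: str
    instrument_firmware: str
    data_system_name: str
    data_system_version: str
    operator_id: Optional[str]     # user ID from the source system; None if unavailable
    acquired_at: str               # instrument clock, stored exactly as reported
    file_sha256: str               # SHA-256 of the native file
    pipeline_version: str
    ingested_at: str               # ingestion system clock (UTC)
    parsing_rules: Tuple[str, ...]


def build_record(native_bytes: bytes, source: dict, pipeline_version: str,
                 parsing_rules: Tuple[str, ...]) -> ProvenanceRecord:
    """Combine source-supplied fields with values computed at ingestion.

    `source` is assumed to hold the first eight fields, read from the
    source system rather than from the integration service.
    """
    return ProvenanceRecord(
        **source,
        file_sha256=hashlib.sha256(native_bytes).hexdigest(),
        pipeline_version=pipeline_version,
        ingested_at=datetime.now(timezone.utc).isoformat(),
        parsing_rules=parsing_rules,
    )
```

Keeping acquired_at and ingested_at as separate fields makes it hard to substitute the ingestion clock for the instrument clock by accident.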

Example: NMR data

An NMR spectrometer produces an FID (free induction decay) dataset as a directory of binary files. The integration pipeline ingests the entire directory structure as an immutable archive, computes a SHA-256 hash of the archive, and stores it alongside extracted metadata (nucleus, frequency, pulse sequence, solvent, temperature). The converted representation links to both the processed spectrum and the raw FID, enabling re-processing with different parameters without re-acquisition.
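
One way to ingest a dataset directory as a single hashed archive is sketched below. The choice of an uncompressed tar file and the function name archive_dataset are assumptions for illustration; the text does not specify an archive format.

```python
import hashlib
import tarfile
from pathlib import Path


def archive_dataset(dataset_dir: Path, archive_path: Path) -> str:
    """Pack the whole dataset directory into one tar file and return its SHA-256.

    `archive_path` is assumed to lie outside `dataset_dir`.
    """
    # Mode "x" refuses to overwrite an existing archive.
    with tarfile.open(str(archive_path), "x") as tar:
        # Sorted order keeps the member listing stable and easy to review.
        for path in sorted(dataset_dir.rglob("*")):
            # Store names relative to the parent so the dataset folder name is kept.
            arcname = path.relative_to(dataset_dir.parent).as_posix()
            tar.add(str(path), arcname=arcname, recursive=False)

    # Hash the finished archive in chunks.
    digest = hashlib.sha256()
    with archive_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned hash identifies this particular archive file. Re-packing the same directory later can yield a different hash because tar headers carry file metadata such as modification times, so the stored archive, not a re-created one, is what later integrity checks should be run against.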

Relationship to other principles

I2 depends on I1 — you can only preserve a native format if you captured the data at the point of generation rather than from a downstream export. I2 enables C3 (end-to-end lineage) by providing the authoritative starting point of the provenance chain. Without the native file, lineage begins at a transformation artifact, not at the original data.