Architecture

Status: Draft · Part 4 of the solution specification · June 2026

This part covers the how: it resolves the implementation decisions deferred from the requirements and specifies the mechanisms behind the outcomes.

How decisions are framed: the load-bearing choices are presented as criteria → options → recommendation. Transform & rules language is now decided — TypeScript (§3). Persistence, hosting and identity & access remain DECISION PENDING — leanings for discussion, not locks; the final calls are the author’s and are flagged for the discussion pack.

1. Shape (technology-agnostic)

Preserve the layered shape the original proved, but reopen the technology choice. The read path is layered, not parallel (solution definition §8):

flowchart TB
    subgraph WRITE["Write path"]
        direction LR
        SRC["Sources"]:::source --> ET["Extract &amp; transform"]:::engine --> REC["Reconcile<br/>(merge/split)"]:::engine
    end
    REC --> STORE[("Raw reconciled store<br/>values · conflicts · provenance")]:::store
    STORE --> SKIN["Interpretation tier<br/>(skin applied at query time)"]:::skin
    SKIN --> QR["Query &amp; reporting tier"]:::query
    QR --> C["Reports · quality metrics · API · exports"]:::out
    classDef source fill:#dbeafe,stroke:#2563eb,color:#1e3a5f;
    classDef engine fill:#e0e7ff,stroke:#4f46e5,color:#312e81;
    classDef store fill:#e2e8f0,stroke:#475569,color:#1e293b;
    classDef skin fill:#fef3c7,stroke:#d97706,color:#7c2d12;
    classDef query fill:#dcfce7,stroke:#16a34a,color:#14532d;
    classDef out fill:#ccfbf1,stroke:#0d9488,color:#134e4a;

The tiers are layered, not parallel: query/reporting sits on top of interpretation, which sits on top of the raw store — so reports run on reshaped data, never the raw consolidation alone.

Extract & transform — owned for structured/semi-structured sources; third-party extractors for unstructured sources feed in.
Reconciliation — reversible identity merge/split; produces the raw reconciled store (entities + every source contribution + conflicts + provenance).
Interpretation tier — applies a skin at query time; sits on top of the raw store.
Query & reporting tier — runs on the interpreted view; reports, quality metrics, API, exports.
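The separation between the raw store and the interpretation tier can be illustrated with a small TypeScript sketch. It assumes, purely for illustration, that a skin reduces to a source-priority order; the specification does not define what a skin contains. The names `TaggedValue`, `RawEntity`, `Skin`, `rank` and `interpret` are hypothetical and not taken from the original engine. The only point shown is that the raw entity keeps every source-tagged value, while the view a report sees is derived at query time.

```typescript
// Hypothetical shapes; nothing here is the engine's actual schema.
interface TaggedValue {
  source: string; // contributing source, kept for provenance
  value: string;
}

// Raw reconciled entity: each attribute keeps every source's value (conflicts retained).
type RawEntity = Record<string, TaggedValue[]>;

// Assumption: a skin is just an ordered list of sources, most trusted first.
interface Skin {
  label: string;
  sourcePriority: string[];
}

// Position of a source in the skin; sources the skin does not list rank last.
function rank(skin: Skin, source: string): number {
  const i = skin.sourcePriority.indexOf(source);
  return i === -1 ? skin.sourcePriority.length : i;
}

// Interpretation happens at query time and never mutates the raw entity.
function interpret(entity: RawEntity, skin: Skin): Record<string, string> {
  const view: Record<string, string> = {};
  for (const attr of Object.keys(entity)) {
    let best: TaggedValue | undefined;
    for (const candidate of entity[attr] ?? []) {
      if (best === undefined || rank(skin, candidate.source) < rank(skin, best.source)) {
        best = candidate;
      }
    }
    if (best !== undefined) view[attr] = best.value;
  }
  return view;
}

const pump: RawEntity = {
  material: [
    { source: "gis", value: "cast iron" },
    { source: "eam", value: "ductile iron" },
  ],
  installed: [{ source: "eam", value: "1998" }],
};

// Two skins over the same raw entity give two different interpreted views.
const engineeringView = interpret(pump, { label: "engineering", sourcePriority: ["eam", "gis"] });
const spatialView = interpret(pump, { label: "spatial", sourcePriority: ["gis", "eam"] });
```

Under these assumptions the two views disagree on `material` while the raw entity still holds both values, which is the behaviour the layering is meant to guarantee.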

Propagation between modules is by subscription and changeset/delta, so only entities whose merged state actually changed flow downstream.

2. The core mechanism — reversible identity merge/split

The “how” behind requirements §2.4. Identity is modelled as a graph: sources and their scoped identifiers are nodes/edges, and each connected component is one entity. Merge is the connected-components computation; split is the same computation re-run when an edge (a source’s contribution) is removed. Because every attribute and identifier edge is tagged with its contributing source, withdrawing a source deletes its edges and the component re-partitions — entities un-merge automatically. Conflicts are retained as multiple source-tagged values on the entity; nothing is collapsed to a single winner at this layer. Corrections enter as a top-priority source contribution under the same identity.

This is the subtlest behaviour in the system. The original engine already exercised it with a focused persistence test suite (merge, multi-scope merge, un-merge, re-merge, split, plus bug regressions); §7 sets out how the rebuild makes that coverage first-class and more rigorous than example cases.
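The merge/split mechanism described above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the engine's design: it assumes scoped identifiers are the nodes and that each source contribution links the identifiers it asserts belong together. `Contribution`, `resolveEntities` and `withdrawSource` are hypothetical names, and the sketch recomputes components from scratch with a plain union-find rather than updating them incrementally.

```typescript
// Hypothetical model: scoped identifiers are nodes, and each source contribution
// links the identifiers it asserts belong to one real-world thing.
interface Contribution {
  source: string;
  ids: string[]; // scoped identifiers, written here as "scope:identifier"
}

// Merge = connected components over all current contributions (plain union-find).
function resolveEntities(contributions: Contribution[]): string[][] {
  const parent = new Map<string, string>();

  // Walk to the representative of x, registering x on first sight.
  const find = (x: string): string => {
    let cur = x;
    for (;;) {
      const p = parent.get(cur);
      if (p === undefined) {
        parent.set(cur, cur);
        return cur;
      }
      if (p === cur) return cur;
      cur = p;
    }
  };

  for (const c of contributions) {
    let prev: string | undefined;
    for (const id of c.ids) {
      const root = find(id);
      if (prev !== undefined) parent.set(root, find(prev)); // union with the previous id
      prev = id;
    }
  }

  const groups = new Map<string, string[]>();
  for (const id of Array.from(parent.keys())) {
    const root = find(id);
    const members = groups.get(root);
    if (members === undefined) groups.set(root, [id]);
    else members.push(id);
  }
  // Sort so the result does not depend on the order contributions arrived in.
  return Array.from(groups.values())
    .map((g) => g.sort())
    .sort((a, b) => (a.join("|") < b.join("|") ? -1 : 1));
}

// Split = drop the withdrawn source's contributions and recompute.
function withdrawSource(contributions: Contribution[], source: string): Contribution[] {
  return contributions.filter((c) => c.source !== source);
}

const contributions: Contribution[] = [
  { source: "gis", ids: ["gis:P-100", "tag:PUMP-7"] },
  { source: "eam", ids: ["eam:A-9", "serial:S1"] },
  { source: "survey", ids: ["tag:PUMP-7", "serial:S1"] }, // the bridging contribution
];

const merged = resolveEntities(contributions); // one entity holding all four identifiers
const afterWithdrawal = resolveEntities(withdrawSource(contributions, "survey")); // two entities again
```

Withdrawing the bridging `survey` source removes the only link between the GIS and EAM records, so the single merged entity re-partitions into two without any explicit un-merge logic.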

flowchart LR
    CH["Source change<br/>file · folder · upstream module"]:::source --> SUB{"Subscription"}:::engine
    SUB --> INT["Extract / interpret"]:::engine --> TR["Transform<br/>(TypeScript)"]:::engine --> DEL["Delta analysis"]:::engine --> AP["Apply to store"]:::store
    AP --> PROP["Propagate delta<br/>to subscribers"]:::query
    PROP -. "recursive" .-> SUB
    subgraph LEGEND["Legend"]
        direction LR
        li["Input"]:::source
        le["Pipeline step"]:::engine
        lst["Store"]:::store
        lp["Propagate"]:::query
    end
    classDef source fill:#dbeafe,stroke:#2563eb,color:#1e3a5f;
    classDef engine fill:#e0e7ff,stroke:#4f46e5,color:#312e81;
    classDef store fill:#e2e8f0,stroke:#475569,color:#1e293b;
    classDef query fill:#dcfce7,stroke:#16a34a,color:#14532d;

The changeset pipeline. Propagation is recursive along subscriptions, and only entities whose merged state actually changed flow downstream.
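A minimal sketch of that pipeline follows, assuming each module keeps its own store and compares incoming state against it before propagating. `PipelineModule`, `Changeset`, `EntityState` and `canonical` are hypothetical names chosen for the example; the engine's actual delta analysis is not specified here and may differ.

```typescript
// Hypothetical names throughout; a sketch of delta propagation, not the engine's API.
type EntityState = Record<string, string>;
type Changeset = Map<string, EntityState>; // entity id -> new merged state

// Stable serialisation so two states can be compared for equality.
function canonical(state: EntityState): string {
  return JSON.stringify(Object.keys(state).sort().map((k) => [k, state[k]]));
}

class PipelineModule {
  readonly label: string;
  readonly received: string[][] = []; // ids of each incoming changeset, for inspection
  private readonly transform: (s: EntityState) => EntityState;
  private readonly store = new Map<string, string>();
  private readonly subscribers: PipelineModule[] = [];

  constructor(label: string, transform: (s: EntityState) => EntityState = (s) => s) {
    this.label = label;
    this.transform = transform;
  }

  subscribe(downstream: PipelineModule): void {
    this.subscribers.push(downstream);
  }

  // Transform, then delta analysis, then apply to store, then propagate.
  apply(incoming: Changeset): void {
    this.received.push(Array.from(incoming.keys()));
    const delta: Changeset = new Map();
    incoming.forEach((state, id) => {
      const next = this.transform(state);
      const key = canonical(next);
      if (this.store.get(id) !== key) {
        this.store.set(id, key);
        delta.set(id, next);
      }
    });
    if (delta.size === 0) return; // nothing actually changed, so nothing flows on
    for (const s of this.subscribers) s.apply(delta); // recursive along subscriptions
  }
}

const raw = new PipelineModule("raw");
const names = new PipelineModule("names", (s) => ({ name: s["name"] ?? "" })); // keeps only the name
const report = new PipelineModule("report");
raw.subscribe(names);
names.subscribe(report);

raw.apply(new Map<string, EntityState>([
  ["p1", { name: "Pump 7", status: "ok" }],
  ["p2", { name: "Valve 3", status: "ok" }],
]));
// Only p2's status changes; the names module sees p2 but its own state is unchanged.
raw.apply(new Map<string, EntityState>([
  ["p1", { name: "Pump 7", status: "ok" }],
  ["p2", { name: "Valve 3", status: "fault" }],
]));
raw.apply(new Map<string, EntityState>([["p1", { name: "Pump 7A", status: "ok" }]]));
```

In this run the `report` module receives two changesets, the initial load and the later rename of `p1`; the status change on `p2` stops at `names` because it does not alter that module's state.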

3. Decision — transform & rules language (DECIDED: TypeScript)

Decision (June 2026): no custom DSL. Data-flow customisation and transform/validation rules are authored in TypeScript, a standard, widely used language, against a typed STREAM SDK that provides the domain primitives (entities, scoped identifiers, unit-aware values, regex-extraction helpers). This retires the original Fluid DSL; Fluid’s domain ergonomics (sigils, unit-aware arithmetic, regex-first matching) become a typed library rather than bespoke syntax.

Why:

Talent and tooling — a very large existing developer pool, and first-class editors, linting, testing and refactoring out of the box; nothing STREAM-specific to learn before being productive.
No language to build or maintain — no grammar, parser, docs or toolchain to carry, the recurring cost a bespoke DSL imposes.
Type safety — a typed SDK catches mapping and rule errors at author time, where data transforms are most error-prone.
One language across the platform — customisation matches a modern TypeScript-based stack, so authors, integrators and core developers share a language.

To resolve during the build (mechanism, not the choice): the safe-execution approach for running author-supplied TypeScript (sandbox/isolation), and the shape of the typed SDK — carried to §8.
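To make the type-safety argument concrete, here is a hedged sketch of what an author-written rule might look like. The SDK's real shape is still open, so every name below (`Quantity`, `millimetres`, `toMetres`, `ScopedId`, `extract`, `transformPipe`) is a hypothetical stand-in defined inline, not an actual STREAM SDK export.

```typescript
// Hypothetical stand-ins for SDK primitives; the real SDK shape is an open question.
type Unit = "mm" | "m";
interface Quantity<U extends Unit> {
  value: number;
  unit: U;
}
const millimetres = (value: number): Quantity<"mm"> => ({ value, unit: "mm" });
const toMetres = (q: Quantity<"mm">): Quantity<"m"> => ({ value: q.value / 1000, unit: "m" });

interface ScopedId {
  scope: string;
  id: string;
}

// Regex-extraction helper: the captured groups, or undefined when the text does not match.
function extract(pattern: RegExp, text: string): string[] | undefined {
  const match = pattern.exec(text);
  return match === null ? undefined : match.slice(1);
}

// Shape of one incoming source row (assumed for the example).
interface SourceRow {
  assetRef: string; // e.g. "GIS-1042"
  diameter: string; // e.g. "150 mm"
}

interface Pipe {
  id: ScopedId;
  diameter: Quantity<"m">;
}

// An author-written rule: map one source row to a typed entity, or reject it.
function transformPipe(row: SourceRow): Pipe | undefined {
  const ref = extract(/^([A-Z]+)-(\d+)$/, row.assetRef);
  const dia = extract(/^(\d+(?:\.\d+)?)\s*mm$/, row.diameter);
  if (ref === undefined || dia === undefined) return undefined; // unparseable rows are rejected
  const [scope, id] = ref;
  const [amount] = dia;
  // Assigning millimetres(...) directly to `diameter` would not compile: "mm" is not "m".
  return { id: { scope, id }, diameter: toMetres(millimetres(Number(amount))) };
}

const parsedPipe = transformPipe({ assetRef: "GIS-1042", diameter: "150 mm" });
const rejectedPipe = transformPipe({ assetRef: "not a reference", diameter: "150 mm" });
```

The unit is carried in the type, so a mapping that forgets the conversion fails at author time rather than producing a pipe 1000 times too wide at run time.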

4. Decision — persistence

Criteria: must support connected-components merge and reversible split; per-value provenance (source-tagging) and conflict retention (multiple values per attribute); change tracking for delta propagation; auditability; beachhead scale (tens of thousands to low millions of assets, with headroom); interpretation config stored separately from the data.

Options:

A. Relational. Mature, transactional, strong for provenance/audit; but identity-as-a-network and multi-hop merge are awkward (recursive joins), and connected-components is not native.
B. Native graph database. Identity is a network; connected components and traversal are native; the identity graph becomes a living asset. Caveats exist on entity-resolution at very large scale and on operational maturity.
C. Hybrid. Model the source·identity graph as a graph (native graph DB, or a graph compute layer over the store), keep the entity/attribute/provenance record in a store chosen for write-throughput and audit, and hold interpretation config separately.

Recommendation — DECISION PENDING: lean C (hybrid). The merge/split is intrinsically a graph problem — “identity is not an attribute, it’s a network” — so the identity layer should be modelled as a graph, while the heavy attribute/provenance record need not be. Prove the identity-graph-at-scale assumption with a spike (§8) before committing.

5. Decision — hosting & deployment

Criteria: data sovereignty for quasi-public Australian water buyers; procurement acceptability; tenant isolation; operational burden on low-capacity utilities; and the self-host requirement some utilities will impose.

Options:

A. Multi-tenant SaaS. Lowest operational burden for the buyer and fastest to ship; but multi-tenant isolation is a recurring finding in IRAP assessments against the ISM, and shared tenancy raises sovereignty and procurement friction for public-sector water.
B. Single-tenant, Australian-data-resident cloud. Clean sovereignty and isolation story, aligned to the Hosting Certification Framework and an IRAP-assessable posture; higher per-deployment cost and ops than multi-tenant.
C. Self-host / on-prem. For utilities that mandate it; highest burden on the buyer.

Recommendation — DECISION PENDING: lean B as the default — single-tenant, AU-resident, with C (self-host) supported for utilities that require it. Treat pure multi-tenant SaaS (A) cautiously for this buyer given sovereignty and isolation. Whether full IRAP assessment is needed for the beachhead (council water businesses may require data residency without full IRAP) is a discovery question. (Hosting Certification Framework, IRAP for SaaS vendors)

6. Decision — identity & access management

Identity and access is a key decision in its own right, not a detail. The original’s Windows/AD-only model does not fit cross-organisation or quasi-public deployment — but the direction of the replacement is shaped by a strong fact: the majority of Australian government agencies and water utilities run Microsoft Entra ID (formerly Azure AD) and Microsoft 365.

Criteria: integrate with the identity provider buyers already run; the requirements’ per-module and per-skin access, staged exposure (WIP vs published), and full audit; minimal setup for low-capacity utilities; a fallback for the rare utility without a cloud IdP.

Options / direction:

A. Federated identity, Entra / M365 first (recommended). SSO via Microsoft Entra ID (OIDC/SAML) as the primary integration, with Entra / M365 groups usable for role mapping; role-based access enforced per module and per skin; staged exposure via access scopes; audit built on the provenance the engine already records. Other OIDC/SAML identity providers supported as secondary.
B. Built-in local accounts — fallback only, for self-host without a cloud IdP.

Recommendation — DECISION PENDING: A, prioritising Entra / M365 because it is what the beachhead buyers already operate — turning identity from an integration cost into a near-zero-setup advantage (SSO and group-based access on day one). The open detail is the role-model granularity (how per-module/per-skin roles map to Entra groups). This also nudges hosting (§5): utilities already standardised on M365 / Azure make an Azure Australian-region single-tenant deployment a natural candidate.
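One possible shape for the role-to-group mapping is sketched below, as a starting point for that discussion rather than a proposal. It assumes access is configured as a map from identity-provider group to grants, each scoped to a module, a skin and a set of stages. The group names, `Grant`, `GroupGrants`, `AccessRequest` and `canRead` are all hypothetical.

```typescript
// Hypothetical role model; the real granularity is the open detail noted above.
type Stage = "wip" | "published";

interface Grant {
  module: string;
  skin: string;
  stages: Stage[]; // staged exposure: which stages this grant can see
}

// Assumption: access is configured as a map from IdP group to grants.
type GroupGrants = Record<string, Grant[]>;

interface AccessRequest {
  groups: string[]; // groups asserted by the identity provider at sign-in
  module: string;
  skin: string;
  stage: Stage;
}

// A request is allowed when any of the caller's groups holds a matching grant.
function canRead(config: GroupGrants, req: AccessRequest): boolean {
  return req.groups.some((group) =>
    (config[group] ?? []).some(
      (g) => g.module === req.module && g.skin === req.skin && g.stages.indexOf(req.stage) !== -1,
    ),
  );
}

const accessConfig: GroupGrants = {
  "asset-planners": [{ module: "reconciliation", skin: "engineering", stages: ["wip", "published"] }],
  "report-readers": [{ module: "reporting", skin: "engineering", stages: ["published"] }],
};

const readerSeesPublished = canRead(accessConfig, {
  groups: ["report-readers"],
  module: "reporting",
  skin: "engineering",
  stage: "published",
});
const readerSeesWip = canRead(accessConfig, {
  groups: ["report-readers"],
  module: "reporting",
  skin: "engineering",
  stage: "wip",
});
```

Keeping the mapping as data means a utility can adjust it by editing group membership in its own directory, with no accounts managed inside the product; how fine-grained the grants should be remains the open question.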

7. How correctness is proven

The requirement (§2.4) that the reconciliation be demonstrable is met here. The original engine already had a focused persistence test suite over merge/split/un-merge (PersistenceTests.cs — example-based link, multi-scope link, unlink, re-link, split and bug regressions), confirming the behaviour was understood as critical; the rebuild’s job is to make that coverage first-class and more rigorous than example cases:

A written specification of merge/split semantics (transitive merge, conflict retention, reversible split on withdrawal, determinism, delta correctness).
An automated test suite, ideally property-based, asserting those semantics — built before anything depends on the engine.
Golden datasets drawn from real (anonymised) beachhead data once discovery provides it, to test against the messiness the engine actually has to survive.
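The property-based approach can be sketched without any dependencies, as below. This is a hand-rolled harness under stated assumptions: `resolve` is a from-scratch stand-in for the engine under test, the generator and the three properties are illustrative, and all names are hypothetical. A real suite would more likely use a property-testing library such as fast-check, which adds generators and shrinking of failing cases.

```typescript
interface Contribution {
  source: string;
  ids: string[];
}

// Stand-in for the engine under test: connected components recomputed from scratch.
function resolve(contributions: Contribution[]): string[][] {
  const parent = new Map<string, string>();
  const find = (x: string): string => {
    let cur = x;
    for (;;) {
      const p = parent.get(cur);
      if (p === undefined) {
        parent.set(cur, cur);
        return cur;
      }
      if (p === cur) return cur;
      cur = p;
    }
  };
  for (const c of contributions) {
    let prev: string | undefined;
    for (const id of c.ids) {
      const root = find(id);
      if (prev !== undefined) parent.set(root, find(prev));
      prev = id;
    }
  }
  const groups = new Map<string, string[]>();
  for (const id of Array.from(parent.keys())) {
    const root = find(id);
    const members = groups.get(root);
    if (members === undefined) groups.set(root, [id]);
    else members.push(id);
  }
  return Array.from(groups.values())
    .map((g) => g.sort())
    .sort((a, b) => (a.join(",") < b.join(",") ? -1 : 1));
}

// Canonical text form of a partition, so two results can be compared directly.
const canonical = (cs: Contribution[]): string =>
  resolve(cs).map((g) => g.join(",")).join(" | ");

// Small seeded linear congruential generator so failures are reproducible.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Up to six sources, each asserting one to three identifiers from a pool of eight.
function randomContributions(rand: () => number): Contribution[] {
  const out: Contribution[] = [];
  const count = 1 + Math.floor(rand() * 6);
  for (let i = 0; i < count; i++) {
    const ids: string[] = [];
    const size = 1 + Math.floor(rand() * 3);
    for (let j = 0; j < size; j++) ids.push("id" + Math.floor(rand() * 8));
    out.push({ source: "s" + i, ids });
  }
  return out;
}

function shuffled<T>(items: T[], rand: () => number): T[] {
  const copy = items.slice();
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy;
}

// Returns the number of cases checked; throws on the first violated property.
function checkMergeSplitProperties(runs: number, seed: number): number {
  const rand = seededRandom(seed);
  for (let run = 0; run < runs; run++) {
    const cs = randomContributions(rand);
    const baseline = canonical(cs);
    // Determinism: the partition must not depend on arrival order.
    if (canonical(shuffled(cs, rand)) !== baseline) throw new Error("order-dependent, run " + run);
    // Transitive merge: all ids from one contribution sit in exactly one entity.
    const entities = resolve(cs);
    for (const c of cs) {
      const homes = entities.filter((e) => c.ids.some((id) => e.indexOf(id) !== -1));
      if (homes.length !== 1) throw new Error("contribution split across entities, run " + run);
    }
    // Reversible split: a bridging source merges its ids, and withdrawing it restores the baseline.
    const withExtra = cs.concat([{ source: "extra", ids: ["id0", "id7"] }]);
    const bridged = resolve(withExtra).filter((e) => e.indexOf("id0") !== -1 || e.indexOf("id7") !== -1);
    if (bridged.length !== 1) throw new Error("bridge did not merge, run " + run);
    if (canonical(withExtra.filter((c) => c.source !== "extra")) !== baseline) {
      throw new Error("split not reversible, run " + run);
    }
  }
  return runs;
}

const casesChecked = checkMergeSplitProperties(200, 42);
```

With the from-scratch stand-in, the reversibility check passes trivially, because withdrawal simply recomputes from the remaining contributions. It becomes a meaningful test once `resolve` is replaced by an incremental engine that has to undo a merge in place, with the recomputation kept as the reference model.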

8. Open questions and spikes

To resolve before or early in the MVP build:

Identity-graph-at-scale spike — validate the persistence lean (§4) against beachhead volumes and the merge/split + delta workload.
TypeScript safe-execution — the sandbox/isolation approach for running author-supplied transform and rule code, and the shape of the typed SDK (§3).
Connector strategy — build vs adapt for the foundational sources (GIS, EAM/CMMS, etc.), pending the discovery in requirements §4.
Sovereignty path — confirm the actual residency/IRAP bar the SEQ beachhead utilities impose (§5).

9. Decisions for the discussion pack

Transform & rules language is settled (TypeScript, §3). The remaining open choices most worth reviewing before locking:

Decision	Recommended leaning	Why it needs discussion
Persistence (§4)	Hybrid, graph identity layer	Hinges on an unproven scale assumption; spike first.
Hosting (§5)	Single-tenant AU-resident + self-host option	Sovereignty/procurement bar varies by utility.
Identity & access (§6)	Federated identity, Entra / M365 first + per-module/per-skin RBAC	Buyers are M365/Entra shops; role-to-group mapping granularity is open.