System architecture
This is the complete architecture of Venturi for customer architects: where the system runs, how it is partitioned into planes, how attribution is built, where your data lives, and — most importantly — where it does not. It is the reference to hand to whoever reviews Venturi before it goes into your environment.
If you only need to connect and start seeing attribution, start with the Quickstart. If you want the security-reviewer summary, read Trust & security. This page is the architecture underneath both.
The one-sentence model
Venturi is the attribution layer for enterprise AI: the enterprise system of record for AI consumption. It deploys as a dedicated data plane inside your cloud trust boundary, reads cost/usage/identity signals, links every AI invocation to the team, service, identity, and budget responsible, never blocks your AI traffic, and never stores prompt or completion content.
Deployment model: inside your VPC
Venturi runs as a dedicated data plane inside your own cloud trust boundary (your VPC / project / subscription). Your operational data — invocation events, the attribution graph, the materialized index — stays in your environment. The Venturi control plane is outbound-only: it manages releases, configuration, and licensing, and it never holds your tenant data.
graph LR
subgraph YOUR["Your cloud trust boundary (VPC)"]
GW[Gateway / interceptor plane]
PR[Attribution processor plane]
DB[Dashboard / API plane]
IDX[(Materialized index)]
GRAPH[(Attribution graph)]
STREAM[(Event stream)]
GW --> STREAM --> PR --> GRAPH --> IDX --> DB
end
subgraph VENTURI["Venturi control plane (outbound-only)"]
REL[Release & config delivery]
end
REL -. signed artifacts, config .-> PR
USERS[Your engineers / finance] --> DB
Three product invariants follow directly from this shape:
- Your operational data stays in your trust boundary. Venturi's own dev, staging, and production environments exist for development, certification, and release rehearsal — not for holding your data.
- Integrations are read-only. Venturi reads cost, usage, and identity. It never writes back into your environment. This is enforced, not promised — see Trust & security.
- The decision-time interceptor fails open. It never depends on an external control plane to let your traffic through.
Outbound-only control plane
Venturi runs inside your boundary, and the data plane reaches out to the control plane for signed release artifacts and configuration. There is no inbound path from Venturi's control plane into your data. Server-initiated outbound calls to customer-named endpoints (billing hosts, connectors, webhooks) pass through a single hardened fetch wrapper with private/link-local/metadata-IP denial and DNS-rebind protection, so a connector can never be coerced to reach your instance-metadata service or internal hosts.
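The address checks that such a fetch wrapper performs can be sketched as follows. This is a minimal illustration using Python's standard `ipaddress` module, not Venturi's implementation: the names `is_denied_address`, `pin_destination`, and `METADATA_ADDRESSES` are hypothetical, and the deny list is an assumption about what "private/link-local/metadata-IP denial" covers.

```python
import ipaddress

# Hypothetical deny entry: the widely used cloud instance-metadata address.
METADATA_ADDRESSES = {ipaddress.ip_address("169.254.169.254")}


def is_denied_address(raw: str) -> bool:
    """Return True if a resolved address must not be contacted."""
    ip = ipaddress.ip_address(raw)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot bypass the checks.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip in METADATA_ADDRESSES
        or ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def pin_destination(resolved: list[str]) -> str:
    """Validate every resolved address once and return the one to connect to.

    Connecting to the returned IP instead of re-resolving the hostname is
    what blunts DNS rebinding: the address that was checked is the address
    that is used.
    """
    if not resolved:
        raise ValueError("hostname did not resolve")
    for addr in resolved:
        if is_denied_address(addr):
            raise PermissionError(f"destination {addr} is not allowed")
    return resolved[0]
```

A caller would resolve the hostname once, pass the results to `pin_destination`, and open the connection to the returned IP. Rejecting the whole set when any one address is denied is a deliberately strict choice in this sketch.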
The three planes
Venturi partitions cleanly into three planes with different latency profiles, different failure semantics, and different responsibilities. Keeping them separate is what lets the synchronous path stay fast and fail open while the analytics path stays correct and durable.
| Plane | Job | Path | Failure mode |
|---|---|---|---|
| Gateway / interceptor | Observe live AI invocations at decision time | Synchronous hot path | Fails open — always forwards your traffic |
| Attribution processor | Build the attribution graph and confidence | Asynchronous, off the hot path | Reconciles later; never on the request path |
| Dashboard / API | Serve attribution to people and systems | Synchronous read | Fails closed — denies on auth/tenant error |
Gateway / interceptor plane (synchronous, fail-open)
The gateway plane optionally sits in the decision-time path of an AI request. Its only job there is to observe — record that an invocation happened and what identity/service made it — and forward the request unmodified. It performs a fast index lookup, not inline model inference, and emits an event onto the stream for the processor plane to attribute asynchronously.
This plane runs on a hard 50 ms P99 end-to-end latency budget, enforced with a wall-clock timeout. Internal calls behind the gateway (index lookup, policy evaluation) fit inside a 15 ms internal budget. Its only runtime dependencies are the fast in-memory index and the event stream — no graph database call ever enters the hot path.
Fail-open is absolute on this plane
No code path on the gateway hot path may block your AI traffic. If anything is slow, degraded, or down, the request is forwarded unmodified and attribution is reconciled later from the event. This is enforced by the hard wall-clock timeout on the hot path, not by case-by-case application logic. See Fail-open vs fail-closed.
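The fail-open rule can be illustrated with a small wrapper that gives the observation step a wall-clock budget and forwards the request whatever happens. This is a sketch under stated assumptions, not the gateway's code: `handle`, `observe`, and `forward` are hypothetical names, and a thread pool stands in for whatever mechanism enforces the timeout in production.

```python
import time
from concurrent.futures import ThreadPoolExecutor

HOT_PATH_BUDGET_S = 0.050  # the 50 ms budget described in the text
_pool = ThreadPoolExecutor(max_workers=4)


def handle(request, observe, forward, budget_s=HOT_PATH_BUDGET_S):
    """Try to observe within the budget, then forward the request unmodified."""
    future = _pool.submit(observe, request)
    try:
        future.result(timeout=budget_s)
    except Exception:
        # Timeout or observer failure: swallow it and keep traffic moving.
        # A timed-out observer keeps running in its worker thread; only the
        # request path stops waiting for it.
        pass
    return forward(request)
```

The important property is that `forward(request)` sits outside the `try` block and receives the original request object, so no observer outcome can prevent or alter the forward.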
The gateway is not the only way Venturi sees traffic. Venturi classifies every AI
pathway into a 20-class taxonomy (PathwayCategory) — direct API, AI
gateway, model router, AWS Bedrock, Azure OpenAI, Vertex AI, orchestration
frameworks, agentic AI, SaaS-embedded (per-seat and consumption), developer
tools, self-hosted, batch API, RPA workflows, and more — and captures each
through whichever of six instrumentation layers is feasible:
| Layer | Source | Example |
|---|---|---|
| L1 | Network proxy / gateway | Live interceptor, AI gateway |
| L2 | SDK / framework | Orchestration-framework instrumentation |
| L3 | Billing / control-plane | CUR, BigQuery billing export, Cost Management |
| L4 | Observability pipelines | OpenTelemetry, metrics |
| L5 | Source-of-record systems | HRIS, IdP, repository ownership |
| L6 | Vendor admin APIs | Provider usage/cost and audit-log APIs |
Each event carries a capture-feasibility class so the system is honest about what is fully capturable versus indirectly capturable versus uncapturable for a given pathway — it never silently treats missing visibility as zero.
Attribution processor plane (asynchronous)
The processor plane consumes the event stream and does the real work of attribution: running the HRE pipeline, building the attribution graph, computing calibrated confidence, and materializing results into the index that the dashboard reads. This is where 95%+ of attribution volume is handled, off the synchronous path, on a generous 100 ms inference design budget measured at the processor seam. Because it is asynchronous and event-sourced, it can take the time to be correct without ever touching your live traffic.
Dashboard / API plane (synchronous reads, fail-closed)
The dashboard and developer API serve attribution to your engineers, finance team, and systems. Reads are served from the materialized index at low latency. Every result carries its interpretation metadata (see Evidence on every result), so a number is never shown with more authority than its evidence supports. This plane is the fail-closed half of the system: authentication, authorization, tenant isolation, exports, and billing all deny on error.
The six-layer attribution graph
Venturi's defensibility is the attribution graph: it links every AI inference signal across six layers into one graph, with no manual tagging required.
graph TD
R[Invocation] --> S[Service]
S --> C[Code / Project]
C --> I[Identity]
I --> O[Organization]
O --> B[Budget responsibility]
The product "six-layer" framing maps to an engineering canon of five node types plus a budget edge:
| Product layer | Graph node | Answers |
|---|---|---|
| Invocation | Invocation | Which AI call happened? |
| Service | Service | Which service or workload made it? |
| Code / Project | Project | Which codebase / project owns that service? |
| Identity | Identity | Which person or service account is responsible? |
| Organization | Organization | Which team / org unit do they roll up to? |
| Budget | (edge on Organization) | Which cost center / budget is billed? |
Budget responsibility is an attribute of the organization expressed through
billed_to / budgeted_under edges — not a separate node. The graph uses a
frozen eight-edge wire taxonomy (owns, member_of, deployed_in, called_by,
produced_by, billed_to, owned_by_org, budgeted_under) so that every
relationship a result depends on is explicit and auditable.
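A frozen edge taxonomy is easy to picture as an enumeration that the graph refuses to step outside of. The sketch below is illustrative only: the eight edge names come from the text, but the `AttributionGraph` class, the node identifiers, the edge directions, and the one-target-per-edge simplification are all assumptions made for the example.

```python
from enum import Enum


class EdgeType(str, Enum):
    """The eight edge names listed above; nothing else is accepted."""
    OWNS = "owns"
    MEMBER_OF = "member_of"
    DEPLOYED_IN = "deployed_in"
    CALLED_BY = "called_by"
    PRODUCED_BY = "produced_by"
    BILLED_TO = "billed_to"
    OWNED_BY_ORG = "owned_by_org"
    BUDGETED_UNDER = "budgeted_under"


class AttributionGraph:
    def __init__(self):
        # Simplification for the sketch: one target per (source, edge type).
        self._edges = {}

    def add_edge(self, source, edge_type, target):
        # EdgeType(...) raises ValueError for any name outside the taxonomy.
        self._edges[(source, EdgeType(edge_type))] = target

    def follow(self, start, path):
        """Walk a sequence of edge types; return None if any hop is missing."""
        node = start
        for edge_type in path:
            node = self._edges.get((node, EdgeType(edge_type)))
            if node is None:
                return None  # an honest unknown, never a default owner
        return node


# Hypothetical identifiers and directions, for illustration only.
graph = AttributionGraph()
graph.add_edge("inv:1", "called_by", "svc:checkout")
graph.add_edge("svc:checkout", "owned_by_org", "org:payments")
graph.add_edge("org:payments", "budgeted_under", "budget:cc-42")
budget = graph.follow("inv:1", ["called_by", "owned_by_org", "budgeted_under"])
```

Because budget responsibility is reached through an edge on the organization, the walk ends on a budget identifier without the graph needing a sixth node type.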
The result answers the question finance and engineering both ask — which team, which service, which person, which budget is responsible for a given slice of AI spend — and it does so without you tagging anything by hand.
The HRE three-stage pipeline
The HRE (Heuristic Reconciliation Engine) is the runtime pipeline that turns raw invocation signals into attributed, confidence-scored records. It runs in three stages.
graph LR
EV[InvocationEvent] --> A
subgraph A["Stage A — deterministic"]
A1[R1 direct key match]
A2[R2 temporal proximity]
A3[R3 naming correlation]
A4[R4 historical patterns]
A5[R5 service-account trace]
end
A -->|unresolved edges| B
subgraph B["Stage B — RAIL inference"]
RAIL[Trained RAIL model]
HB[HeuristicBaseline fallback]
end
B --> C
subgraph C["Stage C — allocation"]
R6[R6 fractional allocation]
end
C --> AR[AttributionRecord]
Stage A — deterministic resolution (R1–R5)
Stage A resolves what can be known for certain, using five deterministic reconciliation methods:
| Method | Reconciliation signal |
|---|---|
| R1 | Direct key match (API key / service-account identifier) |
| R2 | Temporal proximity |
| R3 | Naming correlation |
| R4 | Historical patterns |
| R5 | Service-account trace |
Stage A is bit-reproducible: the same input always yields the same
AttributionRecord fields. Anything Stage A resolves carries stage_origin =
stage_a and the strongest confidence the evidence allows. This is the backbone
of chargeback-grade attribution.
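As an illustration of what bit-reproducible means in practice, R1 can be pictured as a pure lookup with no clock, randomness, or external call. This is a sketch, not the HRE code: `resolve_r1`, `StageAResult`, the `api_key_id` field, and the registry shape are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageAResult:
    owner: str
    method: str
    stage_origin: str = "stage_a"


def resolve_r1(event: dict, key_registry: dict) -> Optional[StageAResult]:
    """R1 direct key match: same event and registry always give the same result."""
    owner = key_registry.get(event.get("api_key_id"))
    if owner is None:
        return None  # leave the edge unresolved for the later stages
    return StageAResult(owner=owner, method="R1")
```

Returning `None` on a miss, instead of a guessed owner, is what lets the unresolved edge flow on to Stage B.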
Stage B — trained RAIL edge-existence inference
For edges Stage A cannot resolve deterministically, Stage B asks a narrower
question — does this edge exist? — and answers it with the trained RAIL
model (RailStageB). RAIL is the attribution-intelligence sidecar described in
the next section. It emits a posterior per edge, which is calibrated and
materialized into customer-facing confidence.
Stage B is wrapped in fail-open behavior at every level. If the RAIL model
times out (its share of the latency budget is a hard 20 ms wall-clock), is
missing its artifact, returns an invalid output, or is disabled, the pipeline
falls back to the HeuristicBaseline — a permanent safety feature that keeps
attribution flowing. RAIL never blocks production traffic and never makes a
result fail; it only ever improves the attribution that the deterministic path
already guarantees.
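The four fallback triggers named above (timeout, missing artifact, invalid output, disabled flag) can be sketched in one function. This is a minimal illustration, not `RailStageB` itself: `infer_edge`, the callable model interface, and the `"rail"` / `"heuristic_baseline"` labels are assumptions, and a thread pool stands in for the real timeout mechanism.

```python
import time
from concurrent.futures import ThreadPoolExecutor

RAIL_BUDGET_S = 0.020  # the 20 ms wall-clock share described in the text
_pool = ThreadPoolExecutor(max_workers=2)


def infer_edge(features, rail_model, baseline, enabled=True, budget_s=RAIL_BUDGET_S):
    """Return (posterior, source); fall back to the baseline on any problem."""
    # A disabled flag or a missing artifact (modelled here as None) skips RAIL.
    if enabled and rail_model is not None:
        try:
            posterior = _pool.submit(rail_model, features).result(timeout=budget_s)
            # Invalid output: anything that is not a float in [0, 1].
            # NaN fails this comparison, so it is rejected as well.
            if isinstance(posterior, float) and 0.0 <= posterior <= 1.0:
                return posterior, "rail"
        except Exception:
            pass  # timeout or model error: fall through to the baseline
    return baseline(features), "heuristic_baseline"
```

Every exit from the RAIL branch that is not a valid posterior lands on the same final line, so the fallback cannot be skipped by a new failure mode.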
Stage C — fractional cost allocation
When an invocation genuinely cannot be pinned to a single owner, Stage C allocates its cost fractionally across the candidate owners using the R6 allocation prior. R6 carries explicit allocation semantics and a deliberate 0.50 confidence ceiling — it is never treated as a Stage B inference feature, and the system is transparent that an allocation is an allocation, not a resolution.
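A proportional split with a confidence ceiling might look like the sketch below. Only the 0.50 ceiling comes from the text. The `allocate` function, the use of each owner's share as its confidence, the weight inputs, and the `stage_c` label are assumptions made for illustration, and the real R6 prior may be computed quite differently.

```python
R6_CONFIDENCE_CEILING = 0.50  # the ceiling stated in the text


def allocate(cost, weights):
    """Split a cost across candidate owners in proportion to their weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {
        owner: {
            "allocated_cost": cost * weight / total,
            # Assumption: confidence tracks the share, capped at the ceiling.
            "confidence": min(weight / total, R6_CONFIDENCE_CEILING),
            "stage_origin": "stage_c",  # hypothetical label
        }
        for owner, weight in weights.items()
    }


shares = allocate(10.0, {"team-a": 3, "team-b": 1})
```

Here `team-a` receives 7.5 of the 10.0 cost but its confidence stays at 0.50, which keeps an allocation from ever reading as a resolution.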
The output of the pipeline is the AttributionRecord: the durable, auditable
unit that carries the resolved edge, its output_state, its calibrated
confidence (c_oper), the originating stage, and the evidence behind it.
Six output states
Every result lands in exactly one of six honest output states — Venturi never fabricates a default when it does not know:
| output_state | Meaning |
|---|---|
| deterministically_resolved | Resolved with certainty by Stage A |
| strongly_inferred | High-confidence Stage B inference |
| bounded | Narrowed to a set; not a single owner |
| ambiguous | Multiple plausible owners remain |
| unknown | Insufficient evidence to attribute |
| not_identifiable | Cannot be attributed even in principle |
RAIL: the attribution-intelligence sidecar
RAIL (Reconciliation Attribution Intelligence Layer) is the tenant-scoped
intelligence service that serves Stage B. It runs inside your boundary as a
sidecar to the processor plane. It loads a verified model artifact, performs
feature extraction and edge-existence inference, emits confidence and evidence
semantics, and fails open to HeuristicBaseline on timeout, missing artifact,
invalid output, or a disabled flag.
RAIL is deliberately narrow. It is not a customer-facing app, a training UI, or an authorization boundary. It supplies evidence-backed predictions through one seam and owns nothing else: it does not own dashboards, billing, admin mutation, security controls, or exports.
What confidence means and how to act on it
Every attribution carries a single customer-facing confidence value, c_oper,
on a 0–1 scale. Two numbers govern how you use it:
| Threshold | Value | What it means |
|---|---|---|
| Confidence cap | 0.95 | The highest confidence Venturi ever asserts. This is a deliberate, conservative policy ceiling — Venturi never claims certainty on an inferred attribution. |
| Chargeback floor | 0.80 | An attribution must be at or above this to be eligible for chargeback and to count toward a savings-share billing base. |
So in practice: anything at 0.80 or above is solid enough to bill and charge back; below it, treat the attribution as advisory and investigate before acting. Venturi caps inferred confidence at 0.95 by policy — when you see a number that high, it means "as confident as we will ever assert," not "certain."
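The two thresholds translate into two small checks. The sketch below uses the values from the table. The function names are hypothetical, and how Venturi calibrates a raw score before capping it is not shown.

```python
CONFIDENCE_CAP = 0.95    # highest confidence ever asserted
CHARGEBACK_FLOOR = 0.80  # minimum for chargeback eligibility


def cap_confidence(raw: float) -> float:
    """Clamp a calibrated score into [0, CONFIDENCE_CAP]."""
    return max(0.0, min(raw, CONFIDENCE_CAP))


def is_chargeback_eligible(c_oper: float) -> bool:
    """At or above the floor is billable; below it is advisory only."""
    return c_oper >= CHARGEBACK_FLOOR
```

Note that the floor is inclusive: a value of exactly 0.80 is eligible, matching "at or above" in the table.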
How the model is improved — and who decides
The production RAIL model is selected and promoted by a human Venturi decision through a formal gate, never automatically. Promotion requires per-edge precision and calibration bars to clear, a non-waivable cap on high-confidence errors, and a canary rollout that auto-rolls-back without ever touching customer traffic if calibration drifts. The most explainable model wins ties. You consume the result of that discipline; you never have to operate it.
Evidence on every result
No attribution is presented as a bare number. Every customer-visible result carries interpretation metadata — an evidence card — so you can always see why and how strongly Venturi believes it:
- stage_origin: Stage A deterministic, Stage B inference (RAIL or HeuristicBaseline), or Stage C allocation.
- output_state: one of the six states above.
- c_oper + confidence band: the calibrated confidence and where it sits.
- Evidence basis: which reconciliation methods (R1–R5) and which input evidence produced the result.
- Model version: the RAIL artifact version, where applicable.
- Degradation state: whether the system was running fully or in a fallback/degraded mode when the result was produced.
- Freshness: when the underlying data was last reconciled.
The product surfaces these as five trust dimensions on every result — Status, Sources, Permissions, Automation, and Consequences — and the absence of any one is rendered explicitly as "unknown / unavailable," never as a blank or a fabricated default. This is the honest-unknown discipline: Venturi would rather tell you it does not know than guess.
Control plane vs data plane
| | Data plane (in your VPC) | Control plane (Venturi, outbound-only) |
|---|---|---|
| Holds | Your events, graph, index, attribution records | Releases, configuration, licensing |
| Holds your data? | Yes — and only here | No |
| Direction | Internal to your boundary | Outbound from your boundary only |
| Customer data egress | None | None |
Your data plane is the only place your operational data lives. The data plane fetches signed, verified release artifacts and configuration from the control plane over connections it initiates, and the control plane never reaches your data. There is no path by which Venturi's control plane reads your invocation events, your graph, or your index.
Fail-open vs fail-closed: the boundary
The single most important rule in the architecture is the fail-open boundary, and it is drawn precisely:
Fail-open applies to exactly one path
Fail-open applies only to customer AI traffic on the gateway hot path. If Venturi is degraded, your AI request is forwarded unmodified and attribution is reconciled later. Venturi cannot take your AI traffic down.
Everything security-relevant fails closed
Fail-closed applies to every security-relevant decision: authentication, authorization (RBAC), tenant isolation, admin mutation, export creation, billing mutation, data-residency routing, legal-gated adoption/workforce views, and support break-glass access. On error, timeout, ambiguity, or missing input, these deny — with zero data egress and an audit entry.
This split is exhaustive and frozen. There is no feature flag that converts a fail-closed path to fail-open. The gateway forwards your traffic when in doubt; the security boundary denies access when in doubt. Both behaviors are tested directly.
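The fail-closed half is the mirror image of the gateway behavior: anything other than an explicit allow becomes a deny, and every decision leaves an audit entry. A minimal sketch, assuming a hypothetical `authorize` helper, a callable policy check, and a list standing in for the audit log:

```python
def authorize(check, audit_log, principal, action):
    """Return True only on an explicit allow; everything else denies."""
    try:
        # Ambiguous answers (None, "maybe", 1) are not an explicit True.
        allowed = check(principal, action) is True
        reason = "allowed" if allowed else "denied"
    except Exception as exc:
        # Errors and timeouts deny instead of letting the request through.
        allowed, reason = False, f"error:{type(exc).__name__}"
    audit_log.append((principal, action, reason))
    return allowed
```

Comparing with `is True` instead of relying on truthiness is the design choice that makes ambiguity deny by default.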
Where customer data lives — and where it does not
| Data | Where it lives | Where it never goes |
|---|---|---|
| Invocation events, attribution graph, materialized index | In your VPC, in your data plane | Venturi's control plane or environments |
| Cost / usage / identity signals | Read into your data plane | Written back to your environment (read-only) |
| Prompt and completion content | Nowhere — never captured or stored | The pipeline has no field for it |
| Provider admin keys | KMS-encrypted in your tenant boundary | Venturi control plane never sees plaintext |
No content capture, by design
The core pipeline never stores prompt or completion text. The canonical
InvocationEvent schema has no content field at all. Attribution is
built entirely from metadata — model, tokens, cost, identity, timing. This
is structural, not configurable: there is no setting that turns content
capture on.
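One way to picture a schema with no content field is a closed record type whose parser has nowhere to put prompt text. The sketch below is illustrative: the field names are hypothetical stand-ins for "model, tokens, cost, identity, timing", and rejecting unknown fields (as opposed to silently dropping them) is an assumption of this example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class InvocationEvent:
    # Metadata only. There is deliberately no prompt or completion field.
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    identity: str
    timestamp: str


def parse_event(payload: dict) -> InvocationEvent:
    """Build an event, refusing any key the schema does not define."""
    allowed = {f.name for f in fields(InvocationEvent)}
    unknown = set(payload) - allowed
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return InvocationEvent(**payload)
```

With this shape, content capture is not a flag that happens to be off: a payload carrying `prompt` cannot be turned into an event at all.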
Data that does live in your plane is retained for 13 months operationally, encrypted at rest with a customer-managed key (one key per tenant) and in transit with TLS. Per-subject erasure is supported via crypto-shred within a 30-day SLA. See Trust & security for the full data-handling model.
Energy and carbon attribution
Because the attribution graph already knows which model served each invocation and how many tokens it consumed, Venturi projects the same graph onto energy and carbon accountability. Each invocation is attributed energy (Wh/kWh) and carbon (gCO₂e) from a model-and-region catalog, rolled up the same six layers so a team, service, or budget can see its AI energy and carbon alongside its cost.
Null is not zero
When a model is not in the energy/carbon catalog, Venturi reports the value as null (unknown coverage), never as zero. It will not understate impact by treating missing catalog coverage as no impact, and energy multipliers are never restated as cost figures.
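The null-is-not-zero rule can be sketched with `None` as the unknown value. Everything here is illustrative: the catalog shape, the per-1,000-token factor, the numbers, and the function names are assumptions, and reporting a partial sum alongside a coverage count is one possible rollup policy, not a documented one.

```python
def invocation_energy_wh(catalog, model, region, tokens):
    """Return energy in Wh, or None when the catalog has no entry."""
    wh_per_1k_tokens = catalog.get((model, region))
    if wh_per_1k_tokens is None:
        return None  # unknown coverage, not zero impact
    return wh_per_1k_tokens * tokens / 1000


def rollup_energy_wh(values):
    """Sum the known values and report how many were unknown."""
    known = [v for v in values if v is not None]
    return {
        "energy_wh": sum(known) if known else None,
        "covered": len(known),
        "uncovered": len(values) - len(known),
    }


# Made-up catalog entry, for illustration only.
catalog = {("model-a", "eu"): 2.0}
```

A genuine measured zero and a missing catalog entry stay distinguishable all the way up the rollup, because the first is counted as covered and the second is not.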
How it fits together
- To connect a cloud and start producing attribution, see Quickstart and the onboarding guides.
- To send request-level events for per-call resolution, see Ingestion.
- For the security-reviewer view of read-only enforcement, no-content-capture, and fail-open, see Trust & security.
- For a shorter conceptual overview, see How Venturi works.