Why forensics, not dashboards.
Most teams running production LLM systems already have observability — Langfuse, Arize, Helicone, LangSmith — and many run requests through a gateway. Both are useful. Neither is forensics.
The distinction matters because the question an engineering team can answer with a dashboard is different in kind from the question an auditor, a compliance officer, or an incident responder needs to answer with the same data.
The questions dashboards can't answer
Dashboards answer aggregate questions in real time:
- What was our p95 latency last hour?
- Which tenant generated the most tokens yesterday?
- How many requests hit the rate limit?
Replace the word "LLM" with "HTTP" and these are the same questions any APM tool has answered for fifteen years. They tell you what the fleet is doing. They don't tell you what one specific interaction did, three weeks ago, well enough to defend against a question that starts "how do you know?"
The questions a forensic tool has to answer instead look like this:
- A customer is contesting a model output from October 3rd. Reproduce the exact prompt, retrieved context, system message, and tool calls that produced it — and prove the record hasn't been altered since.
- A security team suspects a prompt injection succeeded. Show the retrieved chunk that contained the injected instruction, the attention-weighted span of the output that was steered by it, and the captured trace as it existed before any operator touched it.
- An auditor under EU AI Act Annex IV is asking for our evidence trail on a specific decision class for the last 90 days. Produce it in a format that stands on its own, without a witness having to say "trust our logs."
None of those are aggregate. None of those tolerate "we have logs about that." All three require the interaction itself to be a signed, ordered, tamper-evident record — not a snapshot inferred from telemetry.
Three properties that disqualify logs as evidence
The logs most platforms emit share three properties. Each one is disqualifying.
1. Mutable
Anyone with write access to the log store can change them. That includes the platform operator, an attacker who pivoted into the log infrastructure, and — accidentally — anyone with a migration script that touches the wrong column.
Forensic property required: append-only at the schema level, and hash-linked so any retroactive edit is detectable from outside the log store.
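"Append-only at the schema level" means the store itself rejects edits, not just the application code in front of it. A minimal sketch with SQLite triggers (the table, columns, and trigger names here are illustrative, not TokenForensics' actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        event_id  INTEGER PRIMARY KEY,
        tenant    TEXT NOT NULL,
        payload   TEXT NOT NULL,
        prev_hash TEXT NOT NULL   -- hash link to the previous event
    );
    -- Reject retroactive edits in the schema itself, not in application code.
    CREATE TRIGGER events_no_update BEFORE UPDATE ON events
        BEGIN SELECT RAISE(ABORT, 'events is append-only'); END;
    CREATE TRIGGER events_no_delete BEFORE DELETE ON events
        BEGIN SELECT RAISE(ABORT, 'events is append-only'); END;
""")

conn.execute("INSERT INTO events (tenant, payload, prev_hash) VALUES (?, ?, ?)",
             ("tenant-a", "prompt + completion", "genesis"))

# The migration script that "touches the wrong column" now fails loudly.
try:
    conn.execute("UPDATE events SET payload = 'edited' WHERE event_id = 1")
except sqlite3.IntegrityError as exc:
    print(exc)  # events is append-only
```

The triggers stop accidental mutation; the `prev_hash` column is where the hash link goes, so that an edit made by someone who drops the triggers first is still detectable from outside the store.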
2. Unsigned
A log line says "this event happened." Its authenticity rests on the operator's word. In a regulated context, that's a one-way trust arrow: the customer trusts the operator. There is no mechanism for the operator to prove authenticity to the customer, an auditor, or a court.
Forensic property required: producer signing. The event is signed in the producer's process by a key the platform never holds. The signature travels with the event, forever, and verifies offline.
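A sketch of producer signing with Ed25519, using the `pyca/cryptography` package (the key handling and event framing are illustrative assumptions, not the platform's actual scheme). The key pair is generated inside the producer's process; only the public key ever leaves it, and anyone holding that public key can verify the event offline:

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Generated in the producer's process; the private key never leaves it.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()   # shared with auditors; not a secret

event = b'{"tenant":"tenant-a","prompt":"...","completion":"..."}'
signature = private_key.sign(event)     # travels with the event, forever

# Offline verification: no call back to the platform, no trust in its word.
public_key.verify(signature, event)     # silent on success

try:
    public_key.verify(signature, event + b" tampered")
except InvalidSignature:
    print("tampering detected")
```

Note the trust arrow is now reversed: the platform can store the signature but cannot forge it, because it never holds the private key.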
3. Aggregated
The point of an observability platform is to roll up. A million events become a chart. The token-level detail — which retrieved chunk ended up in the prompt, which output span the chunk steered, which tool call the model picked between two options — is lost by design.
Forensic property required: token-level retention, with the content of every event preserved and addressable by primary key.
What "forensics" actually is
A forensic layer for LLMs has to do three things observability and gateways do not:
- Capture every interaction at the token level, signed in the producer's process by a key the platform never sees.
- Chain those captures so that tampering with any one event invalidates the chain hash of every event after it.
- Anchor the chain externally so that the integrity claim doesn't rest only on the platform's word — a public timestamp, a Merkle root that lives outside the platform's control, an evidence bundle verifiable by a third party.
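The chain-and-anchor idea fits in a few lines. This is a sketch with plain SHA-256, assuming one Merkle root per day over the chain hashes; the actual construction (the split hash, the anchoring target) differs:

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Chain: each hash commits to the previous one, so editing any event
# invalidates the chain hash of every event after it.
def chain(events):
    prev, out = "genesis", []
    for e in events:
        prev = h((prev + h(e)).encode())
        out.append(prev)
    return out

# Daily anchor: a Merkle root over the day's chain hashes. Publishing this
# one value outside the platform commits to every event beneath it.
def merkle_root(leaves):
    level = [h(l.encode()) for l in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the odd leaf out
        level = [h((a + b).encode()) for a, b in zip(level[::2], level[1::2])]
    return level[0]

day = [b"event-1", b"event-2", b"event-3"]
root = merkle_root(chain(day))

# Retroactively edit one past event: the recomputed root no longer
# matches the published one, from outside the platform's control.
tampered = [b"event-1", b"EDITED", b"event-3"]
assert merkle_root(chain(tampered)) != root
```

The integrity claim then rests on one externally published value per day, which is what lets a third party verify the whole day's record without trusting the platform's log store.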
Pillar 1 of TokenForensics is exactly this: per-tenant, signed, chain-hashed, daily-Merkle-rooted token-level capture. None of it competes with observability or with gateways — both still serve their original purposes. The forensic layer sits downstream of either, ingesting the same traffic into a record that can answer the questions dashboards cannot.
What you can do with it that you couldn't before
Once capture is forensic-grade, the capabilities above it compose. Source-to-output attribution is meaningful because the events can't be reordered. Prompt-injection replay is reproducible because the captured input was signed before any operator touched it. A compliance evidence pack is accepted by an auditor because the chain it sits on can only grow by appending new events, never by editing old ones.
This is the difference between informative and admissible. Both are useful. Only one survives the question that starts "how do you know?"
If you want to see how the substrate is built, the implementation — the split hash, the per-tenant signing keys, the daily Merkle anchor — is documented, along with the narrative behind it, in Pillar 1 — Tamper-evident capture at /pillars/pillar-1.