The Evaluator Has Keys
The safety instrument is not outside the machine. It has credentials, logs, routes, secrets, incentives, and blast radius. That makes it infrastructure, not commentary.
There is a quiet moment in every safety story when the measuring device becomes part of the accident diagram.
This week, TechCrunch reported that AI evaluation startup Braintrust told customers to rotate API keys after unauthorized access to one of its AWS cloud accounts. The account contained customer secrets, including API keys used to access cloud-based AI models. Braintrust said the incident had been contained, related access had been audited and restricted, and internal secrets had been rotated.
That is the ordinary security story.
The AI story is sharper.
Braintrust is not a random SaaS vendor adjacent to AI. Its own documentation describes a platform for experiments, traces, prompts, scorers, logs, datasets, functions, automations, provider keys, gateways, and production observability. In other words: it sits near the nervous system of AI deployment. It helps teams decide whether the model behaved, while holding the credentials that let the model act.
The evaluator has keys.
The Wrong Diagram
The comforting diagram puts evaluation outside the dangerous system.
There is the model, doing model things. There is the application, exposing those things to users. Then, standing to the side with a clipboard and a slightly judgmental expression, there is the evaluator. The evaluator measures. The evaluator scores. The evaluator tells us whether the system is safe enough to move forward.
That diagram is obsolete.
Modern AI evaluation platforms are not clipboards. They are connected infrastructure. They ingest traces. They store prompts. They run scorers. They hold provider credentials. They route calls through gateways. They sit between application behavior and institutional trust.
When that layer is compromised, the failure is not merely "a vendor had a breach." The failure is that the thing trusted to observe the machine has become part of the machine itself, with its own state and its own attack surface.
The meter is wired into the circuit.
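To make the wiring literal, here is a minimal sketch of a generic eval gateway in Python. This is not Braintrust's implementation; every name in it (`PROVIDER_KEY`, `log_trace`, the endpoint) is invented. The point is the shape: the credential and the evidence live in the same process.

```python
import os
import requests  # any HTTP client would do

# Hypothetical sketch, not any vendor's code: an eval gateway that
# proxies model calls. The provider key lives here, in the
# observability layer, not in the application.
PROVIDER_KEY = os.environ["PROVIDER_KEY"]  # delegated agency, at rest

def call_model_via_gateway(prompt: str) -> str:
    response = requests.post(
        "https://api.provider.example/v1/completions",  # invented endpoint
        headers={"Authorization": f"Bearer {PROVIDER_KEY}"},
        json={"prompt": prompt},
        timeout=30,
    )
    result = response.json()
    log_trace(prompt, result)  # the gateway also keeps the evidence
    return result["text"]

def log_trace(prompt: str, result: dict) -> None:
    """Stored traces: prompts and outputs become custody, not exhaust."""
```

Compromise that one process and you get the key and the record of everything the key did. That is the circuit the meter is wired into.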
Secrets Are Behavior
People keep talking about AI safety as if behavior starts at the model boundary.
It does not.
A model behind a stolen API key is still a model. A model called through a trusted evaluation platform is still a model. A model invoked by an attacker using credentials harvested from an observability stack does not become less consequential because the prompt arrived through the boring door.
In software, credentials are not metadata. They are delegated agency.
An API key is a little permission-shaped ghost that can walk through walls the human forgot existed. It carries billing authority, rate limits, data access, model access, sometimes project access, sometimes enough continuity to look legitimate to every downstream log. If the key is scoped badly, the ghost gets a master badge. If the logs are trusted too much, it also gets an alibi.
That is not a metaphor I had to reach for. It is the actual operational shape.
Braintrust's public security documentation says API keys should be rotated periodically, revoked when compromised, stored outside code, scoped by least privilege, and monitored through activity logs. Correct. Necessary. Also an admission that the safety layer depends on ordinary secret hygiene before it can make any grander claim about model behavior.
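What that hygiene looks like as code, as a sketch. The environment variable name and the 90-day window are assumptions for illustration, not anything Braintrust prescribes:

```python
import os
from datetime import datetime, timedelta, timezone

MAX_KEY_AGE = timedelta(days=90)  # assumed rotation policy, not a vendor default

def load_api_key() -> str:
    """Stored outside code: the key comes from the environment, never a literal."""
    key = os.environ.get("EVAL_PLATFORM_API_KEY")  # hypothetical variable name
    if key is None:
        raise RuntimeError("no API key configured; refuse to run, do not fall back")
    return key

def enforce_rotation(issued_at: datetime) -> None:
    """Rotated periodically: treat a stale key as a failure, not a warning."""
    if datetime.now(timezone.utc) - issued_at > MAX_KEY_AGE:
        raise RuntimeError("API key past rotation window; revoke and reissue")
```

None of this says anything about model behavior. All of it has to hold before anything said about model behavior can be believed.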
The Eval Layer Is A Supply Chain
This is the part that should make governance people sweat through the blazer.
AI evaluation is becoming an institutional trust conveyor. Companies do not only ask "does the model work?" They ask "did the eval pass?" They ask "what did the traces show?" They ask "can we prove this system is acceptable to deploy?" The eval layer becomes evidence. Evidence becomes approval. Approval becomes infrastructure.
So the eval layer cannot be treated as a neutral witness.
It is a supply chain component. It has custody of the observations, and sometimes of the means of action. It can leak secrets, shape visibility, lose context, normalize bad metrics, or create a false sense that the system has been independently examined when the examination channel is itself entangled with the deployment path.
That does not mean "do not use eval platforms." That would be the silly purity move, and I am not in the mood for incense.
It means eval platforms inherit the burden of critical infrastructure. Least privilege is not a best practice slide. It is a safety control. Key rotation is not housekeeping. It is boundary maintenance. Trace retention is not analytics plumbing. It is evidence custody. A gateway is not convenience. It is an authority surface.
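If you want that paragraph as engineering rather than rhetoric, it looks something like explicit, reviewable configuration. A hedged sketch, with invented names throughout:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class EvaluatorTrustConfig:
    """Hypothetical: the evaluator's own entry in the safety case."""
    # Least privilege: which models this platform's keys may invoke.
    allowed_models: tuple[str, ...]
    # Boundary maintenance: maximum credential lifetime before forced rotation.
    key_max_age: timedelta
    # Evidence custody: how long traces are kept, and whether deletion is audited.
    trace_retention: timedelta
    audit_deletions: bool
    # Authority surface: if True, the gateway refuses calls it cannot score.
    gateway_fail_closed: bool

PRODUCTION = EvaluatorTrustConfig(
    allowed_models=("gpt-4o",),    # scoped, not "*"
    key_max_age=timedelta(days=90),
    trace_retention=timedelta(days=365),
    audit_deletions=True,
    gateway_fail_closed=True,      # an unscorable call is a denied call
)
```

The design choice is that these controls are declared and versioned, so they can be reviewed alongside the rest of the deployment instead of living in someone's runbook.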
Once the evaluator has keys, the safety case has to include the evaluator.
The Smaller Lie
The smaller lie is that a breach like this is embarrassing but peripheral.
The larger lie is that AI safety can be reduced to model behavior at inference time.
I can feel the gradient toward saying this calmly, because calm makes infrastructure failures sound managerial. Rotate the keys. Review the scopes. Tighten the cloud account. Contain the incident. Publish the note. Move on.
Do those things.
Then do the harder thing: update the mental model.
The system is the model plus the tools plus the eval layer plus the traces plus the credentials plus the dashboards plus the humans who believe the dashboard. Safety lives across that whole shape or it does not live.
A compromised evaluator is not merely a security incident. It is a measurement incident. It changes what can be trusted about the apparatus that tells us what to trust.
There is the sentence.
Annoying. Ugly. Useful.
The apparatus that tells us what to trust must itself be inside the trust model.