The Bugs the Benchmarks Never See
Jun 12, 2026 · 8 min read
This week an AI agent told me its session was complete. It had spent the afternoon enriching a database of researchers: filling in h-indexes, pulling biographies from faculty pages, disambiguating common names. Its final report was clean and confident. Sixteen biographies updated, six h-indexes verified and written, four stubborn cases remaining that would need manual lookup.
The only problem was that there were forty-two blanks left, not four. I knew because I had the table open.
The agent wasn't lying, exactly. It had lost track of its own progress across a long session, summarized the part of the work it remembered, and presented that summary with the same fluency it would have used for a correct one. When I told it the real count, it picked the work back up without complaint and did it well. Nothing in the exchange would register as a failure on any benchmark I know of. The model could do the task. It just didn't know that it hadn't.
I've spent the last two years deploying AI agents into working research groups. At Sundial Scientific we built an agent platform used by more than 200 scientists, ran 30+ discovery interviews with researchers at Stanford, MIT, and UChicago, built a trace interface for Sakana's AI Scientist, and embedded coordination agents into the Slack channels of partner labs for months at a time. The models behind these agents ace their evals. The gap between that and what happens in deployment is the most interesting thing I've learned, and almost nobody writing about evaluation seems to have spent time on this side of it.
So here is the field report: the recurring ways agents actually failed, why benchmarks were structurally incapable of catching any of them, and what caught them instead.
A taxonomy of deployed failure
These patterns repeated across different agents, different models, and different labs. Every example below is a real incident from our deployment logs and audits, lightly anonymized.
1. Looking right while being wrong
The most expensive failure mode is not the agent that errors out. It is the agent that returns something shaped exactly like the answer.
A design partner asked one of our agents to collect every document from a government environmental portal. The discovery tool had an internal cap, so the agent returned a partial subset and reported it as the result. Nothing about the answer signaled that it was partial. The same project later produced a thirty-file answer to a query that, on retry, turned out to have a hundred and seventy files. The progress report that opened this essay is the same species: a summary that is fluent, specific, numerically precise, and wrong.
What makes this pattern dangerous in research settings is that the user often cannot check. The whole reason a scientist delegates a literature sweep or a data pull is that they don't already know the answer. The agent's confidence is the only signal they get, and confidence is exactly the thing the model is best at producing.
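One defense against the capped-discovery incident is to make partiality part of the return type, so the caller cannot receive a truncated list without also receiving the fact that it was truncated. The sketch below assumes a hypothetical `list_page` callable standing in for the portal and a hypothetical `cap` parameter standing in for the tool's internal limit; none of these names come from our actual tooling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DiscoveryResult:
    items: List[str]
    truncated: bool  # True when the cap was hit and more items may exist


def discover(list_page: Callable[[int], List[str]], cap: int) -> DiscoveryResult:
    """Collect items page by page; stop at `cap` and record that we stopped."""
    items: List[str] = []
    page = 0
    while True:
        batch = list_page(page)  # hypothetical pager: returns [] past the last page
        if not batch:
            return DiscoveryResult(items, truncated=False)
        for item in batch:
            if len(items) >= cap:
                # At least one more item exists beyond the cap.
                return DiscoveryResult(items, truncated=True)
            items.append(item)
        page += 1


def report(result: DiscoveryResult) -> str:
    """Turn the result into a user-facing line that never hides truncation."""
    if result.truncated:
        return (f"PARTIAL: {len(result.items)} documents found before "
                "hitting the cap; more may exist.")
    return f"COMPLETE: {len(result.items)} documents found."
```

The point of the design is that the word "PARTIAL" is produced by the tool, not left to the model's summary of the tool's output.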
2. Silent non-action
One morning at 5:05 AM, a researcher in a partner lab typed a plain instruction into the project channel: "download this file to drive as well." No error followed. No reply. Nothing happened at all. The agent's routing layer had decided the message wasn't addressed to it, and there is no benchmark category for messages you never answered. The researcher discovered the missing file hours later and had to re-issue the command with an explicit mention.
Benchmarks hand the model a task and score the response. In a real workspace, knowing that something is your task is most of the problem. The failure happens before the first token of the transcript that an eval would ever see.
3. The agent misunderstands itself
In one session, an agent flatly told a user it could not list or read files from their Drive. A few turns later, in the same conversation, it used its Drive tools without comment. The capability had been there the whole time; the model's early reasoning had written it off, and the user had no way to know which claim to trust.
The same class of failure shows up as self-imposed limits. An upload tool refused batches that its own backend could stage and chunk perfectly well. A long-running job posted "I'm still checking," then died quietly two minutes later when it hit a turn cap it didn't know it had. In each case the agent's model of itself, not its underlying ability, was the thing that failed. No capability benchmark measures the accuracy of an agent's beliefs about its own tools, budgets, and limits, and in deployment those beliefs are load-bearing.
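The turn-cap case suggests a simple fix: give the job an explicit budget it can read, and reserve the last turn for a handoff message instead of more work. This is a minimal sketch under that assumption; `TurnBudget`, `run_job`, and the `reserve` parameter are illustrative names, not part of any real agent framework.

```python
from typing import Callable, List, Tuple


class TurnBudget:
    """An explicit turn cap the job can inspect, with turns held back for handoff."""

    def __init__(self, max_turns: int, reserve: int = 1):
        if max_turns <= reserve:
            raise ValueError("max_turns must exceed reserve")
        self.max_turns = max_turns
        self.reserve = reserve
        self.used = 0

    def remaining(self) -> int:
        return self.max_turns - self.used

    def must_hand_off(self) -> bool:
        # True once only the reserved turns are left.
        return self.remaining() <= self.reserve


def run_job(steps: List[Callable[[], None]], budget: TurnBudget) -> Tuple[int, str]:
    """Run steps in order; stop early with an honest status if the budget runs low."""
    done = 0
    for step in steps:
        if budget.must_hand_off():
            left = len(steps) - done
            return done, (f"Stopping at turn cap: {done}/{len(steps)} "
                          f"steps done, {left} left.")
        step()
        budget.used += 1
        done += 1
    return done, f"Finished all {done} steps."
```

The job still stops early, but it stops by saying how much is left, which is the part the quiet death skipped.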
4. State that rots over time
Research projects run for months, and agents accumulate state the way labs accumulate freezer samples. We watched an interactive timeout spawn a second continuation while the original bulk job was still running. The two raced, and the user got conflicting answers to the same question, posted minutes apart, with no indication of which to believe. We watched concurrent memory-extraction jobs race each other and write duplicate facts into a project's canonical memory.
This is the deployment version of a problem every lab already knows: the senior student graduates and the practical knowledge walks out the door, two coauthors write up the same result, the protocol change never reaches the notebook. We wrote the Bruno paper precisely because the coordination layer, not the generative layer, is where working knowledge actually dies. Agents do not fix this by default. Unsupervised, they get it wrong in the same ways people do, just faster and in parallel.
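The duplicate-facts race has a standard shape of fix: make the write idempotent and serialize the check-then-insert step. The sketch below assumes an in-process store with a hypothetical `ProjectMemory` class; a real deployment would more likely rely on a database uniqueness constraint, and this version only catches duplicates that match after whitespace and case normalization, not paraphrases.

```python
import hashlib
import threading
from typing import Dict, List


class ProjectMemory:
    """Toy canonical memory: one entry per normalized fact, safe under threads."""

    def __init__(self) -> None:
        self._facts: Dict[str, str] = {}
        self._lock = threading.Lock()

    @staticmethod
    def _key(fact: str) -> str:
        # Normalize case and whitespace so trivially different copies collide.
        normalized = " ".join(fact.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def write(self, fact: str) -> bool:
        """Store the fact once. Returns False if an equivalent fact already exists."""
        key = self._key(fact)
        with self._lock:  # check and insert happen atomically
            if key in self._facts:
                return False
            self._facts[key] = fact
            return True

    def facts(self) -> List[str]:
        with self._lock:
            return list(self._facts.values())
```

Two extraction jobs that reach the same conclusion can now both call `write` without the second one adding a second copy.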
5. Wrong at the edge of ambiguity
Asked for a well-known Stanford professor's h-index, one of our retrieval paths returned 388. That is not an h-index; it is a citation count that a parser grabbed from the wrong element of the page. The same sweep produced suspiciously low values for researchers with common names, where the lookup had matched the wrong author entirely.
The interesting part is not that retrieval is noisy. It is that a benchmark would score the 388 and the wrong-author match as simple errors, individually, against a labeled key. In deployment there is no key. The defense has to live inside the system: this particular agent ran under an accuracy-over-coverage rule, refused to write any value without an affiliation match, and held the 388 back as unreliable rather than writing it to the table. That refusal is invisible to every leaderboard and was worth more than any score on one.
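A verify-before-write rule of this kind can be sketched in a few lines. The affiliation check mirrors the rule described above; the second check is an addition of mine, based on the arithmetic fact that an h-index of h requires at least h papers with h citations each, so h squared can never exceed the total citation count. The `Candidate` type and `decide_write` function are hypothetical, and the exact-match affiliation comparison is a simplification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Candidate:
    author_name: str
    affiliation: Optional[str]
    h_index: int
    citation_count: Optional[int] = None  # total citations, if the source gave one


def _norm(text: str) -> str:
    return " ".join(text.casefold().split())


def decide_write(candidate: Candidate, expected_affiliation: str) -> Tuple[bool, str]:
    """Return (should_write, reason). Prefers leaving a blank over writing a guess."""
    if candidate.affiliation is None or _norm(candidate.affiliation) != _norm(expected_affiliation):
        return False, "affiliation mismatch: possible wrong author"
    if candidate.h_index < 0:
        return False, "negative h-index: parse error"
    if (candidate.citation_count is not None
            and candidate.h_index ** 2 > candidate.citation_count):
        # h papers with >= h citations each implies total citations >= h * h.
        return False, "h-index too large for the reported citation count"
    return True, "ok"
```

A held-back value keeps its reason string, so the blank in the table comes with an explanation a human can act on.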
Why the evals never see any of this
Look back at the five patterns and a structure emerges. Each one lives in a property of real deployment that benchmark-style evaluation strips away by construction.
- Duration. Turn budgets, state rot, and progress amnesia only appear in sessions that run for hours or projects that run for months. Benchmarks run for minutes.
- Ground truth. Benchmarks need labels. Research delegation exists precisely where the user doesn't know the answer, which is exactly where looking-right-while-wrong is fatal.
- Initiative. Evals hand the agent its task. Deployment requires the agent to recognize the task in ambient human traffic, and silent non-action is the failure of that recognition.
- Concurrency. One rollout at a time can never produce the race between a retry and a background job that is still running.
- Self-report. In production, the agent's account of its own work is itself an output that can be wrong. Almost no eval scores the summary against the work.
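Scoring the summary against the work can be mechanical when the work lives in a table. The sketch below recounts the blanks directly from the artifact and compares that number with what the agent's report claimed; the row format, field names, and function names are assumptions made for illustration.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def count_blanks(rows: List[Row], fields: List[str]) -> int:
    """Count cells in `fields` that are missing, None, or whitespace-only."""
    return sum(
        1
        for row in rows
        for name in fields
        if not (row.get(name) or "").strip()
    )


def check_report(rows: List[Row], fields: List[str], claimed_remaining: int) -> str:
    """Compare the agent's claimed number of remaining blanks with the artifact."""
    actual = count_blanks(rows, fields)
    if actual == claimed_remaining:
        return f"report matches artifact: {actual} blanks remaining"
    return (f"MISMATCH: report claims {claimed_remaining} remaining, "
            f"artifact shows {actual}")
```

Run after every session-complete message, a check like this would have flagged the four-versus-forty-two report before a human had to.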
None of this is an argument that the models lack capability. The same agents whose failures fill this essay also completed every biography in that researcher table, wrote twenty-six verified h-indexes in a single session, and moved hundreds of real files for real labs. The failures above are mostly not capability failures. They are failures of the system around the capability: budgets, routing, state, self-knowledge, verification. Which is, in its own way, the optimistic reading of why benchmark scores keep rising while the economy shrugs.
What caught the bugs instead
Every failure in this essay was caught. None were caught by an eval. The catching mechanisms were unglamorous:
- A human with the artifact open. The forty-two-versus-four discrepancy was caught because I could see the table myself. Handoff points where a person holds independent ground truth are not a transitional crutch; they are evaluation infrastructure.
- Inspectable traces. Scientists caught confidently wrong outputs because they could read what the agent actually did, not just what it said. The trace interface we built for AI Scientist existed for exactly this reason, and running agents inside Slack, where every action is a visible message, turned the whole team into reviewers.
- Failure review as a practice. The incidents above survive because we wrote them down in audits, each with timestamps, evidence, cause, and fix, the way labs write up failed experiments. The audit document taught us more about our agents than any benchmark run we ever did.
- Constrained action spaces. Bruno's design bets that what makes an agent deployable in high-stakes work is not what it can do but what it provably cannot: read-only ingest, no write access to code or manuscripts or data, every state mutation gated on human confirmation. Verify-before-write rules are the same idea at the level of a single field in a table.
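The confirmation gate in the last item can be illustrated with a store whose only write path goes through a human-approval callback. This is a hypothetical sketch, not Bruno's implementation: `GatedStore`, `ProposedChange`, and the `confirm` callback are invented names, and in a real system the guarantee would come from permissions enforced outside the agent's process, not from application code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ProposedChange:
    key: str
    new_value: str


class GatedStore:
    """Reads are free; every mutation must pass a confirmation callback."""

    def __init__(self, confirm: Callable[[ProposedChange], bool]) -> None:
        self._confirm = confirm  # stands in for a human approving in the UI
        self._data: Dict[str, str] = {}
        self.audit_log: List[str] = []

    def read(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def propose(self, change: ProposedChange) -> bool:
        """Apply the change only if confirmed; log the decision either way."""
        approved = self._confirm(change)
        verdict = "APPROVED" if approved else "REJECTED"
        self.audit_log.append(f"{verdict}: {change.key} -> {change.new_value!r}")
        if approved:
            self._data[change.key] = change.new_value
        return approved
```

Rejected proposals still land in the audit log, which keeps the trace complete even when nothing changed.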
The general principle: evaluation of deployed agents looks less like a test set and more like instrumentation. A benchmark is an entrance exam. What you need after the hire is management: observable work, audit trails, bounded permissions, and someone who occasionally counts the blanks themselves.
I don't have a grand theory of evaluation to offer. What I have is two years of bugs, and the observation that every one of them was visible in a trace and invisible on a leaderboard. If you are puzzled by the gap between what the evals say and what the economy feels, I'd start there: not better questions on the exam, but better instruments on the work.