University Labs Need AI PMs

Jun 12, 2026 · 7 min read

Every university lab already has a product manager. The problem is that the job is usually split across a professor with no time, a senior PhD student with too much context in their head, a postdoc who knows the real protocol, three Slack channels, four Overleaf comments, a GitHub issue nobody has updated, and one spreadsheet named final_final_v3.

Nobody calls this product management because labs do not think they are building products. They are doing science. But the work still has all the product problems: ambiguous goals, changing requirements, brittle timelines, users with different incentives, artifacts scattered across tools, and a constant fight to keep the team aligned on what is true now.

That is why university labs need AI PMs.

Not agents that invent hypotheses. Not agents that write papers while everyone sleeps. A narrower thing is more useful: an AI project manager that keeps the research machine legible to the people inside it.

The lab is a coordination problem

The fantasy version of AI for science starts with discovery. Give the model a problem, let it search the literature, design an experiment, run the analysis, and draft the manuscript.

Sometimes that work is real. But after spending time with working research groups, the failure mode I keep seeing is much less glamorous. The lab does not lose only because it lacks ideas. It loses because the idea, the experiment, the result, the decision, and the artifact stop pointing at each other.

A senior student graduates and the practical knowledge walks out with them. Two people write up the same result. A failed sweep lives in a Slack thread but never reaches the manuscript. Someone changes a protocol after a meeting, but the task tracker still says the old thing. The professor asks why a baseline was dropped, and the answer is technically somewhere, if someone has three hours to reconstruct the archaeology.

This is the coordination layer. It is boring until it breaks. Then it becomes the reason nobody trusts the result.

What an AI PM actually manages

An AI PM for a lab should track four kinds of state.

Tasks. What are we trying to do, who owns it, what blocks it, and what deadline or dependency makes it matter?
Decisions. What did we decide, why did we decide it, who was in the room, and what evidence would make us revisit it?
Artifacts. Which repo, branch, Overleaf section, W&B run, dataset, figure, notebook, or uploaded file is the current source of truth?
Failures. What did not work, why did it fail, and where should a future teammate look before repeating it?
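
To make those four kinds of state concrete, here is a minimal sketch of how they might be modeled. Every class and field name below is my own illustration, chosen to mirror the questions above; none of it is taken from a real schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Task:
    goal: str                                   # what are we trying to do
    owner: str                                  # who owns it
    blocked_by: list[str] = field(default_factory=list)  # what blocks it
    due: Optional[date] = None                  # deadline, if any


@dataclass
class Decision:
    summary: str             # what did we decide
    rationale: str           # why did we decide it
    participants: list[str]  # who was in the room
    revisit_if: str          # evidence that would reopen the decision


@dataclass
class Artifact:
    kind: str        # e.g. "repo", "overleaf_section", "wandb_run"
    location: str    # wherever the owning tool says it lives
    is_current: bool = True  # is this the current source of truth


@dataclass
class Failure:
    what: str       # what did not work
    why: str        # why it failed
    look_here: str  # pointer for a future teammate before repeating it


@dataclass
class ProjectState:
    """Hypothetical container for one project's coordination state."""
    tasks: list[Task] = field(default_factory=list)
    decisions: list[Decision] = field(default_factory=list)
    artifacts: list[Artifact] = field(default_factory=list)
    failures: list[Failure] = field(default_factory=list)

    def blocked_tasks(self) -> list[Task]:
        # A task counts as blocked if it lists at least one blocker.
        return [t for t in self.tasks if t.blocked_by]
```

The point of the sketch is that failures get their own first-class record, with the same weight as tasks and decisions, rather than living as a comment on something else.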

That last one matters most. Labs are strangely good at preserving successful results and strangely bad at preserving negative knowledge. The failed sweep, the weird reagent behavior, the prompt setting that looked promising and then collapsed, the reason a baseline got killed: these are exactly the things a human team needs in order to move faster, and exactly the things that vanish first.

The AI PM is not there to be clever. It is there to make the working memory of the lab durable.

Why this is AI-shaped

Normal project management software assumes the work will enter through the front door. Someone creates a ticket. Someone updates the status. Someone writes the decision in the right field.

Labs do not work like that. The real state arrives sideways. A Slack reply changes the plan. A GitHub PR closes a task without saying so. An Overleaf edit invalidates yesterday's figure. A W&B run fails in a way that should update the next meeting agenda. A calendar event becomes the only record of a go or no-go decision.

This is a good job for AI because the inputs are messy, cross-tool, and linguistic. The agent has to read the ambient exhaust of work and turn it into structured coordination state. It has to notice that "let's drop this baseline" is not just conversation; it is a decision. It has to notice that "rerunning with the old tokenizer" is not just a status update; it is a fork in the artifact history.
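
Here is a toy sketch of the shape of that step: free text goes in, a typed candidate event comes out. It uses keyword patterns as a deliberately crude stand-in for the language model a real system would need, and the patterns, class, and function names are all assumptions made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Crude stand-ins for what a model would infer from context.
DECISION_PATTERNS = [
    r"\blet'?s (drop|kill|keep|switch|use)\b",
    r"\bwe(?:'re| are) (dropping|going with|switching to)\b",
]
FORK_PATTERNS = [
    r"\brerunning with\b",
    r"\bre-?ran with\b",
]


@dataclass
class CandidateEvent:
    kind: str    # "decision" or "artifact_fork"
    source: str  # which tool the message came from
    text: str    # the original message, kept for human review


def classify(message: str, source: str) -> Optional[CandidateEvent]:
    """Return a candidate event, or None if the message looks like plain chat."""
    lowered = message.lower()
    rules = (("decision", DECISION_PATTERNS), ("artifact_fork", FORK_PATTERNS))
    for kind, patterns in rules:
        if any(re.search(p, lowered) for p in patterns):
            return CandidateEvent(kind=kind, source=source, text=message)
    return None
```

Keyword matching would miss most real decisions, which is exactly why this is a language problem. What carries over is the output: a candidate, not a committed fact, with the original text attached so a human can check it.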

The product trick is constraint. The agent should read widely and write narrowly.

Merton and Bruno

Merton is my version of this bet: an AI project manager for scientists that tracks tasks, decisions, and artifacts across Slack, GitHub, and Overleaf.

Bruno is the paper we wrote from that work, accepted at the ICML 2026 AI for Science Workshop. The core claim is simple. Scientific teams do not first need an agent that can touch every part of the scientific workflow. They need an agent that covers the coordination layer that spans it: the tasks, decisions, artifacts, and failures.

Bruno ingests state from tools labs already use: Slack, GitHub, Overleaf, W&B, calendars, transcripts, email, and uploads. It maintains a project-scoped model of task graphs, decision logs, artifact indexes, and failure logs. It surfaces that state through Slack and dashboards.

The important part is what Bruno cannot do. It has no write access to code, manuscripts, datasets, or instruments. Its write paths are messages, dashboards, and human-confirmed state mutations.
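
One way to picture "human-confirmed state mutations" is a store where the agent can only propose changes and post messages, and nothing changes until a named person confirms. This is a minimal sketch under that assumption; the class and method names are hypothetical and do not describe Bruno's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProposedMutation:
    key: str
    new_value: str
    description: str
    confirmed_by: Optional[str] = None  # set only when a human confirms


class CoordinationStore:
    """Hypothetical store: the agent proposes, a human commits."""

    def __init__(self) -> None:
        self.state: dict[str, str] = {}             # committed coordination state
        self.pending: list[ProposedMutation] = []   # awaiting confirmation
        self.outbox: list[str] = []                 # messages to surface, e.g. in Slack

    def propose(self, key: str, new_value: str, description: str) -> ProposedMutation:
        # The agent's only write path: queue a proposal and ask a human.
        mutation = ProposedMutation(key, new_value, description)
        self.pending.append(mutation)
        self.outbox.append(f"Proposed: {description}. Confirm?")
        return mutation

    def confirm(self, mutation: ProposedMutation, person: str) -> None:
        # State changes only here, and only with a person's name attached.
        if mutation not in self.pending:
            raise ValueError("mutation is not pending")
        self.pending.remove(mutation)
        mutation.confirmed_by = person
        self.state[mutation.key] = mutation.new_value
```

Note what is missing: there is no method that touches code, manuscripts, datasets, or instruments. The narrowness is structural, not a matter of the agent behaving well.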

That constraint is not a lack of ambition. It is the thing that makes the system deployable. University labs are high-trust, high-context, low-process environments. The quickest way to get an agent kicked out is to let it mutate the scientific record before the team trusts its judgment.

The AI PM should be a little annoying

A good AI PM should occasionally bother people. Not constantly, and not with fake productivity theater. But it should interrupt when the plan and the work diverge.

It should say: this task is blocked by a decision nobody has made.

It should say: the manuscript still claims we use the old baseline, but the repo and Slack thread suggest we dropped it.

It should say: this failed run is about to disappear into W&B history, and it looks like the reason is worth writing down.

It should say: two people appear to be doing the same analysis.

That is the difference between a chatbot and a PM. A chatbot waits to be asked. A PM holds state on behalf of the group and speaks when the group is about to lose it.

The evaluation should change too

If the agent is a PM, the eval cannot only ask whether it produced a good answer. The unit of value is the team's coordination over time.

Did fewer tasks go stale? Did decisions become easier to reconstruct? Did handoffs get better when a student missed a week? Did failed experiments become visible before they were repeated? Did the professor spend less time asking "where is that thing?" Did the new student ramp faster because the lab's memory had an interface?

This is why Bruno's evaluation roadmap is longitudinal. The right test is not a one-shot benchmark. It is a quarter of real project work, with measures for transactive memory, shared mental models, perceived coordination effectiveness, task completion, and failure-mode analysis.
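
As one small illustration of what a longitudinal measure could look like, here is a sketch of a stale-task fraction that could be sampled each week over a quarter. The function, the 14-day threshold, and the input shape are all my assumptions; this is a hypothetical proxy, not a measure from Bruno's roadmap.

```python
from datetime import date, timedelta


def stale_fraction(
    last_updated: dict[str, date],
    today: date,
    max_age_days: int = 14,  # assumed threshold; a lab would tune this
) -> float:
    """Fraction of tasks whose last update is older than max_age_days."""
    if not last_updated:
        return 0.0
    cutoff = today - timedelta(days=max_age_days)
    # A task updated exactly on the cutoff date is not counted as stale.
    stale = sum(1 for updated in last_updated.values() if updated < cutoff)
    return stale / len(last_updated)
```

A single reading says little. The interesting signal is whether the number trends down across the quarter, alongside the survey-style measures above.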

The question is not whether the model sounds smart. The question is whether the lab becomes less forgetful.

Why universities first

University labs are the perfect place to build this because the pain is extreme and the bureaucracy is light enough for experiments to happen. The same team is often doing discovery, engineering, writing, fundraising, mentoring, and operations at once. Every lab is a tiny research startup with worse tooling and better ideas.

They also create the right ethical boundary. In a lab, attribution matters. Provenance matters. Negative results matter. Human judgment matters. A good AI PM has to respect all of that or it will fail socially before it fails technically.

This is the broader lesson. AI in science does not have to begin by replacing the scientist. It can begin by becoming the teammate who remembers what everyone decided, where the evidence lives, what failed, what changed, and who needs to know.

That sounds modest compared with autonomous discovery. It is also the layer every real lab is already missing.

Before the AI scientist, build the AI PM.