A teammate wired up a coding agent to do the obvious thing: watch the issue tracker, pick up tickets assigned to it, write the code, and open a pull request. It worked. In fact, it worked so well that within a few days our review queue looked like a denial-of-service attack. PRs piled up faster than anyone could read them. The bot wasn’t broken. We were.
That mess turned out to be the most useful thing that happened to us all quarter, because it forced a realization I haven’t been able to shake. AI didn’t make my team faster. It moved the bottleneck.
The constraint just relocated
We run with work-in-progress limits, and we mostly respect them. So why the pile-up? Because our WIP limits governed only work that had been started: the “in progress” column. They said nothing about work waiting to be reviewed. As long as humans were both writing and reviewing, those two rates were roughly matched, and the review column never overflowed.
AI breaks that symmetry. It authors changes far faster than people can meaningfully review them. The work doesn’t disappear; it stacks up at the next station, code review, which is exactly where you least want a backlog, because review is where understanding and quality live. We’d accelerated the easy half of the pipeline and quietly overloaded the hard half.
The fix was simple once we saw it: put a WIP limit on the review column too. Not more reviewers, not faster reviewing, just a cap. We started with one or two open AI-assisted PRs per person and tuned from there. If too many are waiting, the pipeline stops producing more until a human catches up. Reviewer capacity is the real constraint, so we made it the thing we manage. The flood stopped being possible at all, rather than being a mess we cleaned up after the fact.
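A minimal sketch of that review cap in Python, to make the rule concrete. Everything named here is hypothetical: the `PullRequest` fields, the `may_start_ai_work` check, and the default limit of 2 (taken from the "one or two" starting point above) are illustrative assumptions, not a description of any particular tracker or bot.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed starting cap; tune it per team.
REVIEW_WIP_LIMIT = 2


@dataclass
class PullRequest:
    owner: str             # the human accountable for the PR
    ai_assisted: bool      # True if an AI authored or co-authored the change
    awaiting_review: bool  # True while the PR sits in the review column


def open_review_load(prs: list[PullRequest], owner: str) -> int:
    """Count this owner's AI-assisted PRs that are still waiting on review."""
    return sum(
        1
        for pr in prs
        if pr.owner == owner and pr.ai_assisted and pr.awaiting_review
    )


def may_start_ai_work(
    prs: list[PullRequest], owner: str, limit: int = REVIEW_WIP_LIMIT
) -> bool:
    """Allow new AI-assisted work only while the owner is under the review cap."""
    return open_review_load(prs, owner) < limit


# Illustrative queue: "dana" is at the cap, "lee" has nothing waiting.
queue = [
    PullRequest("dana", ai_assisted=True, awaiting_review=True),
    PullRequest("dana", ai_assisted=True, awaiting_review=True),
    PullRequest("lee", ai_assisted=True, awaiting_review=False),
]
print(may_start_ai_work(queue, "dana"))  # False
print(may_start_ai_work(queue, "lee"))   # True
```

The point of the design is that the check runs before work starts, so the queue can't overflow in the first place.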
From autonomous doer to bounded amplifier
The deeper lesson was about what role we’d actually given the AI. We’d made it an autonomous doer. It owned the ticket, made the decisions, and produced the result. The human showed up at the end as a reviewer of a finished thing they hadn’t designed.
We flipped that. Now the AI is a bounded amplifier. A person pulls the work, frames it, sets the constraints, and stays accountable for the outcome. The AI does the heavy lifting inside those bounds. It sounds like a small distinction, but it’s the whole game. A reviewer rubber-stamping a finished black box is not the same as an engineer who designed the approach and used AI to execute it, even if the diff looks identical.
Autonomy is earned per task, not granted per tool
The most expensive assumption I see teams make is treating “how autonomous should the AI be” as a single dial you set once. It isn’t. Autonomy should be granted per type of work, and earned with evidence.
Our bot’s first mistake wasn’t “too much autonomy” in the abstract. It was applying maximum autonomy to the highest-risk work (arbitrary tickets headed for production) on day one, with no measurement. Meanwhile there’s plenty of low-risk, easily reversible work, like generating tests, updating docs, or scaffolding, where more autonomy is cheap to grant and easy to claw back.
So we built a ladder. Early on, a human triggers every run. Once a category of work shows clean results over time, we let it fire on events instead. Later, for a curated list of narrow, reversible changes, the system can pick up work on its own within those review caps. Each rung is earned by the previous rung’s track record, for that specific kind of work. Production-shaping changes may never reach the top, and that’s fine.
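One way to picture the ladder is as a small promotion rule per category of work. The sketch below is an assumption-heavy illustration, not our actual system: the rung names, the `TrackRecord` fields, the thresholds (`MIN_MERGED`, `MAX_REVERT_RATE`), and the per-category ceilings in `CEILING` are all made up for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Rung(IntEnum):
    HUMAN_TRIGGERED = 1  # a human starts every run
    EVENT_TRIGGERED = 2  # runs fire on events
    SELF_SERVE = 3       # picks up work on its own, within the review caps


@dataclass
class TrackRecord:
    merged: int = 0    # changes merged while on the current rung
    reverted: int = 0  # merged changes that later had to be rolled back


# Hypothetical evidence thresholds for "clean results over time".
MIN_MERGED = 20
MAX_REVERT_RATE = 0.05

# Hypothetical ceilings: how high each category is ever allowed to climb.
CEILING = {
    "tests": Rung.SELF_SERVE,
    "docs": Rung.SELF_SERVE,
    "scaffolding": Rung.SELF_SERVE,
    "production": Rung.HUMAN_TRIGGERED,
}


def next_rung(category: str, current: Rung, record: TrackRecord) -> Rung:
    """Promote one rung at a time, and only on that category's own record."""
    ceiling = CEILING.get(category, Rung.HUMAN_TRIGGERED)  # unknown work stays low
    if current >= ceiling:
        return current
    if record.merged < MIN_MERGED:
        return current  # not enough evidence yet
    if record.reverted / record.merged > MAX_REVERT_RATE:
        return current  # record isn't clean enough
    return Rung(current + 1)


print(next_rung("tests", Rung.HUMAN_TRIGGERED, TrackRecord(merged=25)).name)
# EVENT_TRIGGERED
print(next_rung("production", Rung.HUMAN_TRIGGERED, TrackRecord(merged=100)).name)
# HUMAN_TRIGGERED
```

Note that the record is passed in per category, so a clean streak on docs buys nothing for production changes.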
Which leads to the principle the whole thing now rests on. You don’t remove the human from production by trusting the AI harder. You remove them by making changes reversible and verification automatic, then supervising by exception.
Feature flags, canaries, automatic rollback, real test coverage: that infrastructure is what earns you the right to step back, not a growing sense of confidence in the model. Until it exists, the human approval gate stays exactly where it is.
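As a rough sketch, that gate can be written as a check on infrastructure rather than on confidence. The `Safeguards` type and its field names are hypothetical, and requiring all four safeguards at once is an assumption made for this example; a real team might weigh them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Safeguards:
    feature_flag: bool     # change ships behind a flag
    canary: bool           # change rolls out to a small slice first
    auto_rollback: bool    # a bad rollout reverts without a human
    automated_tests: bool  # real test coverage exercises the change


def needs_human_approval(s: Safeguards) -> bool:
    """Keep the approval gate unless every safeguard is in place."""
    return not (
        s.feature_flag and s.canary and s.auto_rollback and s.automated_tests
    )


# Missing automatic rollback, so the human gate stays.
print(needs_human_approval(Safeguards(True, True, False, True)))  # True
```

Nothing in the check asks how good the model is. That omission is deliberate.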
Don’t re-estimate work just because AI helps
One more, because it surprises people. We deliberately did not lower our story-point estimates for work the AI would help with. A point measures the complexity of the work, not who or what performs it. The moment you start re-pointing because “AI will make this easy,” you’ve corrupted the one unit that lets you compare across time, and you’ve tied your estimates to a tool that changes monthly.
Let the AI effect show up where it’s actually measurable, in throughput and cycle time, not by quietly redefining the ruler. If the work got genuinely faster, your flow metrics will say so. You don’t need to lie to your backlog to find out.
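For anyone who hasn't computed these before, here is a small Python sketch of the two flow metrics. The function names and the sample dates are illustrative assumptions, not real data from our board; the only inputs are when an item was started and when it was finished.

```python
from datetime import datetime, timedelta
from statistics import median


def cycle_times_days(items):
    """Return start-to-finish duration in days for each (started, finished) pair."""
    return [(done - start).total_seconds() / 86400 for start, done in items]


def throughput_per_week(items, window_start, window_end):
    """Return items finished per week within [window_start, window_end)."""
    finished = [done for _, done in items if window_start <= done < window_end]
    weeks = (window_end - window_start) / timedelta(weeks=1)
    return len(finished) / weeks


# Made-up sample: three items started and finished in a two-week window.
t0 = datetime(2024, 1, 1)
items = [
    (t0, t0 + timedelta(days=2)),
    (t0 + timedelta(days=1), t0 + timedelta(days=5)),
    (t0 + timedelta(days=3), t0 + timedelta(days=9)),
]

print(median(cycle_times_days(items)))                            # 4.0
print(throughput_per_week(items, t0, t0 + timedelta(weeks=2)))    # 1.5
```

Story points never enter the calculation. Compare these numbers before and after the AI arrived, with the estimates left alone, and any real speed-up shows up on its own.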
What I’d tell past me
We didn’t need a grand AI strategy. We needed to notice that a fast new producer in a pipeline doesn’t make the pipeline fast. It just exposes wherever the real constraint was hiding. For us that was review and human judgment, which, conveniently, is also the part you least want to automate away.
Treat AI as an amplifier with limits. Cap the queue where the humans are. Earn autonomy one reversible step at a time. And don’t move the goalposts on estimation to flatter the tool.
The bot’s still running, by the way. It just doesn’t get to flood the queue anymore, and the humans are still the ones who decide what good looks like.