The “AI agent” label is doing a lot of heavy lifting right now. It covers a chatbot that drafts an email, a model that ships code on its own, and an autonomous system that operates a browser on your behalf. Same word, very different stakes.
These tools come in roughly four shapes: chat, copilot, coding agent, autonomous agent. The lines blur, but the autonomy levels don’t. Here’s what each is good at, where each breaks, and how to decide which one to actually reach for.
What an Agent Actually Is
Strip the marketing and an agent is three things stacked together:
- A model that reasons over input and produces structured output
- Tools it can invoke — read a file, run a command, hit an API, kick off a sub-task
- A loop that lets it act, observe results, and decide what to do next
Chat is that loop with humans as the only “tool.” A coding agent has a filesystem and a shell. An autonomous agent has a browser, a calendar, and a task queue. The further down that list you go, the more autonomy the agent has — and the less forgiving its failure modes get.
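That three-part anatomy fits in a few lines of Python. A minimal sketch, not any vendor’s API: `call_model` is a hypothetical stand-in for whatever model you use, and the only tool wired up is a shell command.

```python
# Minimal agent loop: the model proposes an action, the loop executes
# it, feeds the observation back, and repeats until the model says done.

import subprocess

def call_model(history: list[dict]) -> dict:
    """Hypothetical model call. Expected to return either
    {"action": "run", "command": [...]} or {"action": "done", "answer": "..."}."""
    raise NotImplementedError("plug in a real model API here")

def run_command(command: list[str]) -> str:
    """The one tool in this sketch: run a shell command, capture output."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=60)
    return result.stdout + result.stderr

def agent_loop(task: str, max_steps: int = 20) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):          # hard step budget: the loop can't run forever
        step = call_model(history)      # reason
        if step["action"] == "done":
            return step["answer"]
        observation = run_command(step["command"])                  # act
        history.append({"role": "tool", "content": observation})   # observe
    raise RuntimeError("step budget exhausted before the task finished")
```

Everything that follows is a variation on that loop: which tools it can call, and how many iterations it runs unsupervised.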
Where Agents Earn Their Keep
Compression of grunt work. The win isn’t “AI does my job.” It’s “AI does the parts I’d avoid.” Refactoring boilerplate, drafting follow-ups, turning a vague spec into ten concrete questions. Things I could do but won’t bother with on a Tuesday.
Breadth on demand. A model has read more vendor docs, RFCs, and CVE writeups than any single engineer. When I’m in unfamiliar territory, an agent gets me 80% to a working mental model in minutes. I still verify, but I’m verifying instead of guessing where to start.
Iteration speed. With a coding agent in the loop, the cost of trying a refactor approaches zero. “What if we restructured this module?” used to be a half-day commitment. Now it’s a 90-second experiment with a diff I can throw away.
Parallelisation. Three agents working on three independent problems is a real productivity gain. Not because each one is faster than me, but because I’m only the bottleneck on one of them at a time.
Where They Fall Apart
Confidence ≠ correctness. This is still the biggest one. Agents are excellent at producing answers that look right. Wrong answers come in the same tone as right ones. The cost of trusting the wrong output scales with what the agent can touch — a bad chat reply is annoying, a bad code commit is a bug, a bad autonomous action is a Tuesday afternoon you’ll never get back.
Context decay. Long-running agents lose the plot. They forget constraints from earlier in the session, redo work, contradict their own earlier decisions. The longer the loop, the more this hurts.
No skin in the game. An agent doesn’t know what’s load-bearing. It’ll cheerfully delete a file with a TODO that was actually a critical workaround. It can’t feel the weight of “we cannot break this in production.”
Failure modes scale with autonomy. A chatbot’s worst day is wrong information. A coding agent’s worst day is a broken build. An autonomous agent’s worst day is a real production outage or a compliance incident. The agents that promise the most value are also the ones that need the most guardrails.
Skill atrophy. Use an agent for something long enough and you forget how to do it yourself. Sometimes that’s fine — I don’t need to remember regex syntax. Sometimes it isn’t — losing the muscle for system design hurts.
The Four Shapes
Chat — the simplest case
The conversational interface. Single session, no persistent state, no tools beyond what you paste in. The lowest-stakes agent there is.
Who makes it: ChatGPT (OpenAI), Claude (Anthropic), Gemini (Google), Copilot (Microsoft), and a growing list of open-weight models you can run locally.
Best at: thinking out loud, drafting, explaining, exploring an idea before you commit to it. Most “what’s the right way to approach X” thinking starts here.
Falls down at: anything multi-step, anything that needs the actual files in front of it, anything where context matters across days.
Copilot — the collaborator
AI embedded in the tools you already use. Persistent context, ambient presence, lower friction than switching to a separate chat window.
Who makes it: Microsoft 365 Copilot (Teams, Word, Outlook), Gemini in Google Workspace (Docs, Gmail, Meet), Notion AI, Linear AI, GitHub Copilot for inline suggestions.
Best at: ongoing work where the value is staying in the loop. Drafting follow-ups. Summarising long threads. Catching what you missed in a doc review. The kind of work a thoughtful colleague would do if they had unlimited time.
Falls down at: deep technical execution. These are collaborators, not builders. Treating them like coding agents is a category error.
Coding agent — the developer
Reads and writes files, runs commands, executes test suites, ships actual code. This is where “agent” stops being metaphor.
Who makes it: Claude Code (Anthropic), GitHub Copilot Workspace (Microsoft), Cursor Agent, Devin (Cognition), Gemini Code Assist with agent mode.
Best at: real engineering loops. “Add this feature, run the tests, fix what breaks.” These agents can hold a multi-step plan, execute it, and recover from their own mistakes more often than you’d expect. This tier produces the most measurable productivity gain.
Falls down at: anything where the agent can’t see the whole picture. Cross-repo refactors. Hidden production constraints. Long sessions where it loses the thread of what it was doing three hours ago.
The trust calibration here is the real skill. Letting one run on a side project is one thing. Letting it run on infrastructure is another. Read every diff before it lands.
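To make the loop concrete, here’s a sketch of the “run the tests, fix what breaks” cycle, assuming a pytest project and a hypothetical `propose_patch` model call that returns a unified diff. It isn’t how any particular product works internally, but the shape is the same: patch, test, feed failures back, repeat.

```python
# Sketch of a coding-agent loop: propose a patch, apply it, run the
# tests, and feed any failures back to the model. propose_patch() is
# a hypothetical model call, not a real product's internals.

import subprocess

def propose_patch(goal: str, test_output: str) -> str:
    """Hypothetical model call: returns a unified diff toward the goal."""
    raise NotImplementedError("plug in a real model API here")

def run_tests() -> subprocess.CompletedProcess:
    return subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)

def build_until_green(goal: str, max_attempts: int = 5) -> bool:
    test_output = ""
    for _ in range(max_attempts):
        diff = propose_patch(goal, test_output)
        subprocess.run(["git", "apply", "-"], input=diff, text=True, check=True)
        result = run_tests()
        if result.returncode == 0:
            return True                              # green; now a human reads the diff
        test_output = result.stdout + result.stderr  # failures become next attempt's context
    return False                                     # budget spent: escalate to a human
```

Note the two things the sketch bakes in: a bounded attempt count, and a human at the end of the green path.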
Autonomous agent — the highest stakes
Operates without a human in the loop on each step. Can control a browser, run extended multi-hour tasks, orchestrate sub-agents, and take actions in the real world.
Who makes it: OpenAI Operator, Claude computer use (Anthropic), Gemini with Google Workspace actions, and an expanding range of research and enterprise-specific agents from all three labs.
Best at: breadth and volume. Long sequences of mechanical steps — process every document, cross-reference every data source, work a task queue while you sleep. The kind of work that’s tedious but not cognitively hard.
Falls down at: judgment. These agents don’t know what’s load-bearing. They don’t know when to stop. They don’t know what’s in scope. The failure modes scale directly with their reach — an autonomous agent touching production, finance, or infrastructure needs hard guardrails and a kill switch before you turn it on.
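What “hard guardrails and a kill switch” can look like in practice: a sketch that checks a filesystem kill switch and an action allowlist before every single step. The file path and the allowlist entries are illustrative assumptions, not a standard.

```python
# Sketch of hard guardrails for an autonomous loop: every step first
# checks a kill switch, then an action allowlist. Paths and action
# names here are illustrative assumptions.

import os

KILL_SWITCH = "/tmp/agent.stop"  # touch this file and the agent halts on its next step
ALLOWED = {"read_doc", "search_corpus", "draft_email"}  # nothing touching prod or money

def guard(action: str) -> None:
    if os.path.exists(KILL_SWITCH):
        raise SystemExit("kill switch engaged; halting")
    if action not in ALLOWED:
        raise PermissionError(f"{action!r} is outside the allowlist; refusing")

def execute(action: str, payload: dict) -> None:
    """Hypothetical dispatcher to the agent's real tools."""
    raise NotImplementedError

def run_autonomous(task_queue: list[tuple[str, dict]]) -> None:
    for action, payload in task_queue:
        guard(action)                # checked on every step, not once at startup
        execute(action, payload)
```

The important property is that the checks run inside the loop. A guardrail the agent only passes once at startup is a formality, not a control.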
When to Reach for Each
| If the work is… | Reach for |
|---|---|
| Thinking, drafting, exploring | Chat (ChatGPT, Claude, Gemini) |
| Ongoing collaboration in your tools | Copilot (M365, Gemini Workspace) |
| Building or shipping code | Coding agent (Claude Code, Cursor, Devin) |
| Long-running, high-volume automation | Autonomous agent (Operator, computer use) |
The mistake to avoid: using a more powerful agent than the task needs. A coding agent for what should have been a chat is overkill and slows you down. An autonomous agent for what should have been a script is asking for trouble. Match the agent’s autonomy to the actual blast radius of the work.
The Honest Bottom Line
Agents are not a replacement for engineers, security folks, or thinkers. They’re a multiplier — and multipliers cut both ways. A good engineer with a good agent is meaningfully faster than they were. A bad engineer with a good agent ships more bad code, faster.
The interesting question isn’t “are AI agents good?” It’s “what work should be agent-driven, and what work needs to stay human?” That answer changes month by month as the tools get better and the failure modes get sharper.
What I do know: once you’ve had three agents working in parallel on three different problems while you make coffee, the old way feels slow. But read every diff. Challenge every output. Never let an autonomous loop touch production without a kill switch.
The agents are good. They’re just not good enough to trust without watching.