Why Your AI Code Agents Keep Veering Off Track (And How to Fix It)

A 2000-line PR and a cold sweat

Last Tuesday I let an AI agent take a chunky piece of the feature I was trying to merge. Half-baked prompt, hit enter, watched the magic. Ten seconds later: a 2,000-line pull request. It compiled. The linter wasn’t even mad. And as I scrolled through the diff, the cold sweat hit (the kind that arrives when your “shortcut” has just doubled your weekend). The agent had completely ignored our service architecture. It had reinvented a generic caching layer instead of using the custom Redis wrapper my team spent months on. It had missed how our auth flow handles tokens. Entirely.

If you’ve used Claude Code, Cursor, or any of the popular agents for more than a sprint, you’ve been here. The first time it happens you blame the agent. The second time you blame the model. By the fifth time, you start to notice the pattern is on your side of the screen, not theirs. One practitioner study reports that Claude Code’s first-attempt success rate on small-to-medium PRs without detailed guidance is roughly one in three. Which, when you think about it, is exactly what you’d expect from any contractor handed a vague brief and let loose on someone else’s house.

Here’s the part I’ve come around to (and it took me an embarrassingly long time): the 2000-line PR isn’t an agent failure. It’s an upstream failure. The agent did exactly what I asked. Blaming the agent lets me skip the harder work — figuring out why the prompt was the way it was, and what the agent didn’t have that it should have.

This isn’t a new problem

Before agents, every senior engineer learned the same lesson on their first big project: bad requirements plus an unfamiliar codebase equals a bad system. The fix wasn’t more typing. It was a design doc, a sprint planning meeting, a half-day pairing session with someone who knew where the bodies were buried.

Agents inherited that exact dependency. The difference is that what used to take a confused junior two weeks now takes a confused agent ten seconds. The agent era didn’t invent context dependence. It just made it loud.

So: two fixes, not one.

The first is a process layer. Specs, plans, phase gates. The intent of the change has to live somewhere outside the chat transcript, ideally in a file the agent reads back into context every time the work resumes.

The second is a context layer. The “ambient knowledge” of your codebase needs to live in a place the agent can find: your architecture choices, your wrappers, your idioms. Anthropic’s own best-practices doc treats these as two slots:

CLAUDE.md for the persistent stuff loaded every session.
SPEC.md for the per-task contract.

They warn that bloating one degrades the other; both are competing for the same attention budget.

Most people writing about this fix one and skip the other. Skip the spec and the agent invents its own scope. Skip the context and the agent invents its own architecture. Either way, you get the 2000-line PR.

Fix the process: spec-driven development

Figure: a key unlocking a door that opens onto a structured, geometric interior. The spec is the key: it tells the agent what “good” looks like before the door opens.

Spec-driven development is a fancy name for an old idea: write down what you want before you build it. The strange thing about the agent era is that this old idea suddenly matters again, in a way it hadn’t for years of “move fast” startup engineering. GitHub’s spec-kit phrases the inversion bluntly:

“Specifications don’t serve code — code serves specifications.”

What I find genuinely interesting is the convergence.

GitHub spec-kit, shipping across 30+ agents.
AWS Kiro, a three-phase workflow.
Anthropic’s own four-phase Explore → Plan → Implement → Commit.
OpenSpec with serious community traction.
Jesse Vincent’s Superpowers plugin.

Three vendors and two well-loved community projects, independently arriving at roughly the same shape: intent, then plan, then code, with the spec as the durable artifact. (When that many people independently land on the same answer, it’s usually because the constraints of the problem are forcing it.)

Take Superpowers as one concrete instance, just to make this less abstract. Its brainstorming skill enforces a HARD-GATE: no implementation, no scaffolding, no code, until the user has approved a design. The plan, when it arrives, has a “No Placeholders” rule. No TBD, no TODO, no “add appropriate error handling later.” Subagents pick up tasks one at a time with no inherited context, get reviewed against the spec by a different subagent, and only then is the human pulled back in.
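
To make the “No Placeholders” rule concrete, here is a minimal sketch of a check you could run over a plan before handing it to an agent. This is not how Superpowers implements it; the marker list (TBD, TODO, FIXME, “later”) and the function names are my own assumptions, and a real gate would need something smarter than a regex.

```python
import re

# Hypothetical marker list: an assumption for illustration, not the plugin's actual rule.
PLACEHOLDER = re.compile(r"\b(TBD|TODO|FIXME|later)\b", re.IGNORECASE)


def find_placeholders(plan_text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for every plan line that defers a decision."""
    hits = []
    for number, line in enumerate(plan_text.splitlines(), start=1):
        if PLACEHOLDER.search(line):
            hits.append((number, line.strip()))
    return hits


def plan_is_ready(plan_text: str) -> bool:
    """A plan passes the gate only when no placeholder markers remain."""
    return not find_placeholders(plan_text)
```

The check is deliberately crude: it will flag an innocent “later” in prose. For a gate, I'd take that false positive over a missed one.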

What’s worth noticing isn’t that you should run out and install Superpowers. It’s the placement of the work. The spec/plan becomes the load-bearing artifact. The human reviews at phase gates instead of in line-by-line code review. The agent runs autonomously between gates, sometimes for hours. (Vincent reports it’s “not uncommon for Claude to be able to work autonomously for a couple hours at a time without deviating from the plan.” Which, if you’ve watched a chat transcript drift after twenty minutes, is wild.) That’s the structural change worth internalizing: where the work is, not which tool you used.

The skeptics’ complaint is real, by the way. One practitioner groans that “changing some CSS now takes forever” under spec-driven workflows. Anthropic agrees:

“If you could describe the diff in one sentence, skip the plan.”

The point isn’t to ceremonialize every keystroke. It’s that the bigger the change, the more the spec earns its keep.

Fix the context: Karpathy’s LLM wiki

The process layer answers what you should build. The context layer answers what the agent needs to know about you before it starts. Different problems. The second one is, I think, the underrated half.

Karpathy’s “LLM Wiki” gist is the cleanest articulation I’ve seen. Three layers. Raw sources are immutable and never modified by the LLM. The wiki itself is LLM-generated and -maintained markdown: entity pages, concept pages, an index.md, a chronological log.md. The schema is a CLAUDE.md or AGENTS.md that tells the LLM how the wiki is structured. Three operations sit on top: ingest, where a new source touches 10–15 wiki pages; query, where good answers get filed back as new pages; and lint, where you periodically ask the LLM to health-check the wiki for contradictions and stale claims. The line that stuck with me: “Obsidian is the IDE; the LLM is the programmer; the wiki is the codebase.”
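
The lint operation is the easiest one to picture in code. What follows is only a sketch of the mechanical half (broken links and pages missing from the index); spotting contradictions and stale claims still needs the LLM. I'm assuming a flat directory of .md pages and Obsidian-style [[wikilinks]], neither of which the gist mandates, and lint_wiki is a name I made up.

```python
import re
from pathlib import Path

# Assumption: pages link to each other with Obsidian-style [[page]] or [[page|label]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def lint_wiki(wiki_dir: Path) -> dict[str, list[str]]:
    """Report links to missing pages and pages that index.md never mentions."""
    pages = {p.stem: p.read_text(encoding="utf-8") for p in wiki_dir.glob("*.md")}
    index_links = {link.strip() for link in WIKILINK.findall(pages.get("index", ""))}

    broken = sorted(
        f"{name} -> {target.strip()}"
        for name, text in pages.items()
        for target in WIKILINK.findall(text)
        if target.strip() not in pages
    )
    unindexed = sorted(
        name
        for name in pages
        if name not in {"index", "log"} and name not in index_links
    )
    return {"broken_links": broken, "unindexed_pages": unindexed}
```

A report like this is what I'd hand back to the LLM as the starting point for a lint pass, rather than asking it to rediscover the structure from scratch.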

https://twitter.com/karpathy/status/2039805659525644595

The reason this beats the human-maintained wiki my last three teams abandoned is the line I keep repeating to myself: humans abandon wikis because the maintenance burden grows faster than the value. LLMs don’t get bored. That’s the asymmetry that finally makes a “second brain” load-bearing. For the first time, the bookkeeping is cheap!

There’s a runtime version of this that already works. Aider’s repo-map auto-builds a graph-ranked, token-budgeted symbol map of your repo and sends it with every request. No human curation. Tree-sitter pulls the symbols, PageRank scores file importance, a binary search fits the result into the configured token budget. It’s an existence proof that there is a good answer to “how does the agent know what’s in the repo right now.”
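
Here is a toy version of that rank-then-fit idea, to show the shape rather than Aider's actual code. The assumptions are all mine: the reference graph is hand-written instead of extracted with Tree-sitter, PageRank is a plain power iteration, and tokens are estimated as characters divided by four.

```python
def pagerank(edges: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Power-iteration PageRank over a 'file references file' graph."""
    nodes = sorted(set(edges) | {t for targets in edges.values() for t in targets})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source in nodes:
            targets = edges.get(source, [])
            # Files that reference nothing spread their rank evenly over all files.
            receivers = targets if targets else nodes
            share = damping * rank[source] / len(receivers)
            for target in receivers:
                new[target] += share
        rank = new
    return rank


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def fit_to_budget(ranked_entries: list[str], token_budget: int) -> list[str]:
    """Binary-search the longest prefix of ranked entries that fits the budget."""
    low, high = 0, len(ranked_entries)
    while low < high:
        mid = (low + high + 1) // 2
        if estimate_tokens("\n".join(ranked_entries[:mid])) <= token_budget:
            low = mid
        else:
            high = mid - 1
    return ranked_entries[:low]


def build_repo_map(edges: dict[str, list[str]], symbols: dict[str, list[str]],
                   token_budget: int) -> str:
    """Render the most-referenced files first, trimmed to the token budget."""
    rank = pagerank(edges)
    ordered = sorted(rank, key=lambda name: (-rank[name], name))
    entries = [f"{name}: {', '.join(symbols.get(name, []))}" for name in ordered]
    return "\n".join(fit_to_budget(entries, token_budget))
```

The binary search works because the size of a prefix only grows as you add entries, so "fits the budget" flips from true to false exactly once.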

The artifact-maintenance side is younger and more interesting. Within sixty days of Karpathy’s gist, the community shipped dozens of plugins implementing the pattern. The strongest Claude-Code-specific case study I found ships the three named operations as /wiki-ingest, /wiki-query, and /wiki-lint slash commands, plus a /wiki-graph command. Early. Pointing the right way!

The one caveat I’d flag, because it cuts against the pattern: Chroma Research’s “Context Rot” study, after testing 18 models, found that “models do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows.” So more wiki ≠ better. Navigability matters more than completeness. Karpathy’s pattern handles this gracefully: granular pages, an index, periodic lint. But only if you actually run the lint. (A wiki that grows without pruning becomes another way to drown the agent.)

Tuesday revisited

Figure: a paper-craft diptych of a person reading quietly in a library on the left and a busy workshop building things on the right, connected by a chain of pages. The work doesn’t disappear; it moves upstream, from the workshop into the reading room.

Imagine Tuesday again. Same merge, same agent, same half-baked prompt. Different priors.

This time, before the agent starts, three things are true. There’s a CLAUDE.md at the root that names the custom Redis wrapper, the auth-flow shape, and the things the codebase will not tolerate (no new caching layers; tokens flow through AuthSession, not raw headers). There’s a committed spec under docs/specs/ describing the feature in behavior-contract form, written from a brief conversation that took fifteen minutes. And there’s a wiki under wiki/ that the LLM has been quietly maintaining for the last month, with a page on “caching” that already cross-references the Redis wrapper’s docstring.

The prompt the agent sees isn’t “add a feature.” It’s “implement the spec at docs/specs/2026-06-09-foo.md, using existing infrastructure as documented in CLAUDE.md and wiki/caching.md.” The pull request that comes back is ~250 lines, not 2,000. It uses the Redis wrapper. The diff is small enough that I read every hunk in five minutes!
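
One way to keep yourself honest here is to generate that prompt instead of typing it, and to refuse to start when an upstream artifact is missing. A minimal sketch, assuming a helper of my own invention (build_task_prompt) and the file layout from the paragraph above:

```python
from pathlib import Path


def build_task_prompt(repo_root: Path, spec: str, context_docs: list[str]) -> str:
    """Build the agent prompt, failing fast if the spec or a context doc is absent."""
    missing = [p for p in [spec, *context_docs] if not (repo_root / p).is_file()]
    if missing:
        raise FileNotFoundError(f"missing upstream artifacts: {', '.join(missing)}")
    docs = " and ".join(context_docs)
    return (
        f"Implement the spec at {spec}, "
        f"using existing infrastructure as documented in {docs}."
    )
```

The point of the check is the failure mode: with no committed spec, the half-baked prompt never reaches the agent at all.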

The work didn’t disappear. It moved upstream: into the spec, into the wiki, into the schema file at the root. What I gave up is ten seconds of “magic.” What I got is a PR I trust enough to merge before lunch.

What to internalize

Three shifts in mental model, and one sentence to take home.

Agents amplify your context discipline; they don’t substitute for it. Whatever sloppy thinking your team carries into a sprint, the agent will turn into a 2000-line monument to it. Whatever clear thinking is in your spec, the agent will execute against it faster than a human would. Either way, the agent magnifies what you brought.

Specs are contracts; wikis are priors. They aren’t the same artifact and they don’t go in the same file. A spec is per-task and time-bound: it lives in docs/specs/<date>-<topic>.md and gets archived after the work ships. A wiki is persistent and ambient: it lives in wiki/ and gets quietly updated as the codebase evolves. Both compete for the same attention budget, so both have to stay short. Anthropic’s own warning is the one to remember: “if your CLAUDE.md is too long, Claude ignores half of it because important rules get lost in the noise.”
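
Both conventions are small enough to encode. The sketch below names a spec file in the docs/specs/<date>-<topic>.md pattern and flags a context file that has outgrown a line budget. The function names are hypothetical, and the budget is whatever number your team picks; I'm not aware of an official limit.

```python
import re
from datetime import date


def spec_path(day: date, topic: str) -> str:
    """Build a docs/specs/<date>-<topic>.md path from a date and a free-form topic."""
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return f"docs/specs/{day.isoformat()}-{slug}.md"


def over_budget(text: str, max_lines: int) -> bool:
    """True when a context file has more lines than the team's chosen budget."""
    return len(text.splitlines()) > max_lines
```

Something like over_budget fits naturally in a pre-commit hook on CLAUDE.md, so the file gets pruned before it starts burying its own rules.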

The human’s job moved from typing to specifying and curating. This isn’t an upgrade if you loved typing (and reasonable people don’t, but I’ll admit I sometimes did). It is an upgrade if you cared about the system being right. Specifying and curating are the two things you couldn’t have outsourced before. Now they’re the only two things you can’t.

The 2000-line bad PR is the price of asking an agent to do work before telling it what “good” looks like in your codebase. Pay the price upstream (in a spec, in a CLAUDE.md, in a quietly-maintained wiki) and you stop paying it downstream in cold-sweat reviews on a Tuesday.

I’d love to hear how others are running this. Where does your CLAUDE.md fall apart? When did your team start writing specs the agent could read? Any wiki implementation you actually trust?