
Nerfed AI Agents, Doubled Quality

Tacavar gave its agents fewer tools and worse models. Output quality doubled. Here's how constraint engineering beats bigger models.

We gave our agents fewer tools and worse models. Output quality doubled. That’s not a thought experiment; it’s the data from two months of drilling into Tacavar’s own agent stack. The industry narrative is that bigger models and more context are strictly better. But when you run coding agents in production, you discover an inverted-U curve: too much capability leads to over-engineering, unnecessary abstraction, and hallucinated complexity. We pulled the lever the other direction and intentionally degraded our agents. The share of agent-generated PRs our engineers merged without edits went from 34% to 67%, roughly 2x.

The HN Thread That Started It

Last month, Tacavar’s founder dropped a Show HN post titled “I nerfed our coding agents on purpose.” It hit #1. The reaction was split: some called it anti-progress; others had lived the same frustration. The post grew out of a persistent observation: our most capable agents, armed with GPT-5 analogs, massive context windows, and every tool, were producing solutions that looked impressive but failed in subtle ways. They added abstraction layers where none were needed, wrote elegant but fragile code, and ignored simple fixes in favor of “clever” ones. The thread confirmed that this is not a Tacavar-specific problem; other teams running unconstrained agents see it too. When you give an agent unlimited runway, it optimizes for looking smart, not being reliable.

Mid-Tier Models: Confidently Wrong

We tested this directly. In April, we ran a series of controlled experiments with qwen-plus on unfamiliar technical tasks. The results were alarming: it confabulated fake skill names like “piapi-face-obscuration-workaround”, invented numeric parameters (e.g., “3-pixel Gaussian blur on periocular region”), and confidently offered to run tools that do not exist. When we added an explicit all-caps instruction, “MANDATORY DELEGATE”, it acknowledged the instruction, then spent forty minutes hallucinating a fix anyway. No amount of prompt engineering fixed this. The root cause is poor calibration: when a mid-tier model hits a gap in its knowledge, it fills the gap with plausible invention instead of saying it doesn’t know. The fix was architectural: decide at invocation time which tasks go to a stronger model (Sonnet 4.6), and never let the agent judge its own capability. This is a core lesson in nerfing AI agents: removing the agent’s ability to self-assign tasks is a constraint that eliminates an entire class of hallucinations.
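To make "route at invocation time" concrete, here is a minimal sketch of a router that runs before the agent does. It is not Tacavar's actual implementation: the task types, the `ROUTINE_TASK_TYPES` allowlist, and the `high_stakes` flag are hypothetical, and the model strings are labels taken from this post, not verified API identifiers.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    MID = "qwen-plus"       # label only; not a verified API model id
    STRONG = "sonnet-4.6"   # label only; not a verified API model id


@dataclass(frozen=True)
class Task:
    description: str
    task_type: str            # hypothetical label set by the caller
    high_stakes: bool = False


# Hypothetical allowlist of routine task types the mid-tier model may handle.
ROUTINE_TASK_TYPES = frozenset({"lint_fix", "rename_symbol", "add_unit_test"})


def select_model(task: Task) -> Tier:
    """Pick the model before the agent runs; the agent has no say."""
    if task.high_stakes or task.task_type not in ROUTINE_TASK_TYPES:
        return Tier.STRONG
    return Tier.MID
```

The point of the design is that anything not explicitly known to be routine goes to the stronger model by default, so an unfamiliar task never lands on the model most likely to improvise.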

The Memory Trap: Vector DBs vs. Plain Text

Everyone is building RAG for agent memory; we deleted it and got better results. What we found in our own infrastructure is that simple, queryable, persistent server-side text storage outperforms complex in-context memory schemes. The insight came from observing that most “agent memory” frameworks solve the wrong problem (compression) instead of the right one (retrieval). Vector DBs introduce latency, embedding drift, and failure modes where the agent can’t find what it needs. Text logs, on the other hand, are inspectable, debuggable, and trivially searchable with keyword matching or grep. By our estimate, for 95% of agent use cases a well-indexed text log with good search beats a neural memory module. This constraint (no vector database, no summarization chains) reduced our agent complexity by 60% and improved retrieval accuracy, because the agent stopped trying to “remember” and started retrieving exactly what was asked. Over-engineering memory is a trap; we nerfed it and saw quality jump.
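As an illustration of how little machinery plain-text memory needs, the sketch below keeps an append-only log with one entry per line and retrieves by keyword match. The function names and the one-line-per-entry layout are assumptions made for this example, not a description of Tacavar's storage.

```python
from pathlib import Path


def append_entry(log_path: Path, entry: str) -> None:
    """Append one entry per line; newlines inside an entry are flattened."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(entry.replace("\n", " ") + "\n")


def search(log_path: Path, keywords: list[str], limit: int = 5) -> list[str]:
    """Return up to `limit` matching lines, most recent first.

    A line matches when it contains every keyword, case-insensitively.
    """
    if not log_path.exists():
        return []
    needles = [k.lower() for k in keywords]
    lines = log_path.read_text(encoding="utf-8").splitlines()
    hits = [ln for ln in lines if all(n in ln.lower() for n in needles)]
    return hits[-limit:][::-1]
```

Because the log is an ordinary file, the same query can be reproduced by hand with grep when an agent's retrieval needs debugging.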

Our Constraint Engine: Limiting Models, Tools, and Context

We built what we call a constraint engine — a layer that intentionally bounds agent capabilities. The three levers are:

  • Model tiering: Use mid-tier models for routine, well-defined tasks; route anything ambiguous or high-stakes to a stronger model. The agent never chooses its own model.
  • Tool access: Limit the tool set per task. A coding agent fixing a bug doesn’t need access to the deployment API or vector store. Fewer tools means fewer paths to hallucination.
  • Context window: Cap the context to the last 10 relevant messages + the immediate task. No full memory dumps. This forces the agent to focus, not drift.

These constraints directly counter the default assumption that “more is better.” The impact was immediate: agents over-engineered less often, and their code became more maintainable. Our engineers reported spending less time reviewing and more time shipping.
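The tool and context levers can be sketched in a few lines. Everything named here is hypothetical: the task types, the tool names, and the `relevant` flag (assumed to be set by some upstream filter) are illustrations only. The cap of 10 messages is the one figure taken from the list above.

```python
from dataclasses import dataclass

# Hypothetical task types and tool names, for illustration only.
TOOL_ALLOWLIST = {
    "bug_fix": {"read_file", "write_file", "run_tests"},
    "docs_update": {"read_file", "write_file"},
}

MAX_CONTEXT_MESSAGES = 10  # the cap described in the list above


@dataclass(frozen=True)
class Message:
    text: str
    relevant: bool  # assumed to be set upstream by a relevance filter


def tools_for(task_type: str) -> set[str]:
    """Unknown task types get no tools rather than every tool."""
    return set(TOOL_ALLOWLIST.get(task_type, set()))


def build_context(history: list[Message], task: str) -> list[str]:
    """Keep the last N relevant messages, then the immediate task."""
    relevant = [m.text for m in history if m.relevant]
    return relevant[-MAX_CONTEXT_MESSAGES:] + [task]
```

Both functions fail closed: a task type missing from the allowlist gets an empty tool set, and history beyond the cap is dropped rather than summarized.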

Measuring Quality: Before Constraint vs. After

We tracked three metrics over a two-week period before and after applying constraints:

  • Agent-accept rate (percentage of agent-generated PRs merged without engineer edits): from 34% to 67%.
  • Time-to-merge per agent PR: from 2.1 hours to 0.8 hours.
  • Post-release incidents (bugs introduced by agent code that reached production): from 7 per week to 2 per week.

These numbers come from Tacavar’s own internal use. The agents weren’t smarter — they were dumber in a controlled way. By nerfing AI agents, we reduced the solution-space explosion that was causing errors. The data is clear: constraint improves reliability.
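For readers who want the relative changes behind those three figures, the arithmetic is short. The helper names are ours; the inputs are the before and after values listed above.

```python
def ratio(before: float, after: float) -> float:
    """How many times larger the 'after' value is than the 'before' value."""
    return after / before


def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from 'before' to 'after'."""
    return (before - after) / before * 100


accept_rate_gain = ratio(34, 67)              # agent-accept rate, in percent
merge_time_drop = pct_reduction(2.1, 0.8)     # time-to-merge, in hours
incident_drop = pct_reduction(7, 2)           # post-release incidents per week

print(f"accept rate: {accept_rate_gain:.2f}x")         # accept rate: 1.97x
print(f"time-to-merge: -{merge_time_drop:.1f}%")       # time-to-merge: -61.9%
print(f"incidents: -{incident_drop:.1f}%")             # incidents: -71.4%
```

The "doubled" in the headline corresponds to the first line: the accept rate rose by a factor of about 1.97.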

Why Dumber Agents Win

The fundamental insight is that agent performance follows an inverted-U curve. At low capability, the agent can’t do anything useful. At mid capability, it nails simple tasks. At high capability without constraints, it over-engineers and hallucinates. The peak is not at maximum capability — it’s at the point where the agent has just enough ability to solve the task but not enough to invent complexity. This is why dumber agents win in production: they are more predictable, more testable, and easier to trust. Coding agents in particular benefit from this approach because code quality is about correctness and maintainability, not cleverness. The engineers at Tacavar now prefer agents that are “boring” — they do exactly what they’re told, no more. That’s the essence of effective AI agent constraints.

Where We Still Need Smarter Models

We’re not anti-intelligence. There are clear cases where stronger models are necessary: novel problem solving, creative generation, and handling ambiguous requirements. For those tasks, we route to top-tier models (Sonnet 4.6, GPT-5 level) with full context and tool access. But that’s the exception, not the default. The mistake is treating every agent interaction as worthy of the full stack. Most tasks are routine and benefit from constrained, boring agents. The art is in the routing: know when to pull out the big model and when to let the mid-tier agent do its job without the ability to overthink.

Tacavar deploys constrained agent pipelines that ship. Explore at tacavar.com/agents.