The Beef and the Boilerplate
What actually works, what doesn't, and where to start
Before you begin
If you want to see governed AI development at the smallest possible scale, my game is the example. One developer, one agent, no infrastructure beyond a text editor and an API key. Everything in this chapter started there and grew outward.
In case you made it all the way here, you already understand why accountability matters more than speed, why specifications narrow the solution space, and why gates enforce discipline.
Now let's talk about how to actually do it.
A typical starting point looks something like this. New project, new team, half the people have never used AI tools beyond ChatGPT, and the other half have used something different, individually, on a different stack. Everybody's eager. Licences are in, agents are running, and the first week produces an impressive amount of code. Looks great in the demo. Then week two hits and you notice: three different error handling patterns, two competing folder structures, tests that mock things in ways nobody agreed on, and a merge conflict rate that makes you question your life choices. The tools worked fine. The problem was that nobody had defined what "fine" means for this project. On top of that, the inherent randomness of outputs means that even with the same instructions, you get different results every time, and the sheer mass of generated code to verify is just too much too quickly. Getting to factory-level consistency takes weeks at minimum, often more.
What I'm going to cover next is:
- The distinction between work that requires your judgment and work where agents perform
- Calibration for how much specification effort different work actually needs
- Failure modes that destroy value, presented as concrete anti-patterns
- A phased approach for building the capability incrementally
- Metrics that tell you whether the process is working
- A checklist you can use as a working template, starting next week.
A typical starting setup you're likely to encounter:
- The parent corporation has purchased entry-level licences for some AI development tool
- The team has had a few hours of training but no real practice beyond basic prompting and trials
- You might have an existing codebase and documentation, but no idea how to bring them in; typically they are not in good order
- There's no tradition whatsoever of managing information or specifying work in detail sufficient for agents to succeed
- Initial trials with default agents, such as planning and build agents, have been unsatisfactory
The beef and the boilerplate
In order to figure out the correct level of abstraction for your specifications, you need to understand the fundamental distinction between two categories of work in AI-assisted delivery: the "beef" that requires real craft, and the "boilerplate" where agents earn their keep.
Next, I'll explain why this matters.
The beef: Work that requires your craft
The beef is the activities where your judgment is irreplaceable. This is where you say what to build.
For the beef to be actually digestible, let's say medium, we need to ensure that the following are in place:
| Principle | Guidance |
|---|---|
| Ambiguity | AI can't infer the unstated assumptions or the implicit intent. Say what you mean. If you want a "user-friendly error message," specify what that message should say and when it should appear. If you want "good performance," specify the latency requirements and load conditions. |
| Clarity | Keep to the *what* part. Focus on the outcome, not the implementation. "The list of items should be sorted by creation date by default" is clearer than "the API should use a sorted data structure." The former leaves room for the agent to choose the best implementation; the latter dictates implementation details. |
| Decomposition | Engineering 101: make a bigger problem solvable by breaking it into digestible steps with clean boundaries. This also allows for detecting dependencies and ordering tasks correctly. A story that will result in touching six services, three APIs, and requires a database migration is not a thing to be one-shotted by default, or at least this should be considered carefully; if it's more like adding a new field to UI then it might be OK. Depends. Same applies for features: "build a fully-featured CRM system" is not a single feature, but a sequence of features that build on each other, and you need to feed them in carefully planned order with guardrails and intermediate check points in between. |
| Architecture | Specify project-wide patterns, interfaces, constraints, and boundaries. Litter your documentation with examples. Your agents will eagerly generate thousands of lines of seemingly correct code that is simply different from everything else in the repo. Not only does this become a maintenance nightmare, but future agent sessions will struggle to detect the patterns if they haven't been applied systematically in the existing codebase. |
I've found it useful to refer to existing modules/code files/pages/components and whatnot as examples to compare to when planning a feature with our planning agent. This establishes a clear pattern for the agent to follow and reduces the chances of it inventing its own style that doesn't fit with the rest of the codebase. And if these discoveries are persisted in your plans already during planning, the engineering agent can follow them instead of redoing the codebase-wide greps and getting it wrong.
Acceptance criteria. The Pareto principle is very much alive in the agentic world as well. Much of the error budget goes to edge cases, however improbable they might seem. Vagueness does not help here either: spend the time, e.g. by using the Gherkin format, to define the paths that need to be covered, and ask the AI to discover more. "It should work correctly" is not a criterion.
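For illustration, here's what that path-by-path precision can look like in Gherkin; the feature and the 24-hour expiry are invented for the example, not taken from any real spec:

```gherkin
Feature: Password reset link
  Scenario: Link has expired
    Given a reset link older than 24 hours
    When the user opens the link
    Then the "link expired" page is shown
    And the password is not changed

  Scenario: Link is reused after a successful reset
    Given a reset link that has already been used
    When the user opens the link again
    Then the "link expired" page is shown
```

Each scenario doubles as a review checklist item and, later, as a test case the agent can implement.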
Review judgment is about knowing when output is correct versus when it merely compiles and passes tests. Agents can produce code that is syntactically perfect, test-passing, and architecturally or functionally plain wrong. Only a person who understands the system's intent can catch these.
Other talking points you might have come across are that AI generates technical debt at scale and introduces subtle bugs that are hard to detect. Both of these, among others, are valid concerns and should be taken seriously: it's a whole other ballgame to spend weeks refactoring in the middle of a project, or starting over, especially after you've already shipped something, with AI or not. Don't assume you can just "make things right later". You really can't.
If you delegate the beef, you don't save time. You move it to the review gate where it costs more, or worse, to production where it costs the most. Shift left, folks!
Skip the boilerplate: Where agents earn their keep
The boilerplate is the work that should be rather straightforward: the activities where agents reliably produce acceptable output. 'Boilerplate' here basically means making the boring part automatic.
Examples of these kinds of activities include:
| Task type | Guidance |
|---|---|
| Scaffolding | Project structure, boilerplate files, configuration. Given a convention file that says "we use this folder structure, these naming patterns, this test framework, in this kind of project", agents produce consistent scaffolding reliably. |
| Pattern application | Give an example page, control, or module, and ask the agent to follow that pattern for the new thing you need. Pattern can be anything ranging from naming, code file organization or an architectural pattern. |
| Test generation from specs | If you've done your homework in the 'beef' part well, this is relatively easy. You've already defined the acceptance criteria, so the agent's job is to translate those into test code. This is a perfect example of where agents can save time: they can generate thorough test suites that cover all the specified scenarios, including edge cases you might have missed. |
| Glue code | REST model to display model, data transformation, API integration. This is the "connect the dots" work that is tedious but straightforward. If you specify the inputs, outputs, and transformation logic, agents can reliably produce this code. Pass in a sample result, or better yet, the OpenAPI spec of your service, and perhaps a sample output, and let the AI do the translation work for you. |
| Documentation from code | Consider generating and maintaining the documentation per feature or per concern (like patterns) carefully. Something that's just basically slop and easily regenerated is not often worth it and will not be kept up-to-date, but as a feedback loop stage when you've for instance changed your API, data structures or introduced a new pattern or condition is a logical and necessary stage to review and refine. |
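To make the test-generation row concrete, here's a minimal Python sketch. The acceptance criterion is the earlier "sorted by creation date by default" example from this chapter; `sort_items` and the test names are hypothetical:

```python
from datetime import date

# Hypothetical implementation under test, matching the acceptance criterion
# "the list of items should be sorted by creation date by default".
def sort_items(items):
    """Default listing order: newest creation date first."""
    return sorted(items, key=lambda i: i["created"], reverse=True)

# Tests translated directly from the acceptance criteria,
# including the empty-list edge case the spec should have named.
def test_default_order_is_newest_first():
    items = [{"id": 1, "created": date(2024, 1, 1)},
             {"id": 2, "created": date(2024, 6, 1)}]
    assert [i["id"] for i in sort_items(items)] == [2, 1]

def test_empty_list_is_handled():
    assert sort_items([]) == []
```

If the criteria are already written down, generating this translation is exactly the kind of boilerplate an agent handles well.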
The guardrails for delegated work
An essential safety mechanism for agents to perform well is to erect boundaries and hard guardrails.
At the top level, these fences are baked-in state tracking, which requires certain tasks, states, and viewpoints to have been considered, and isolating the unit of work from the already verified content with branching and PRs.
Once you have a good agreement and tooling to support that, you should consider the feature-level guidance to keep things on track. Consider the following:
- Name the pattern. Instead of a vague "follow existing conventions," use "follow the 'Response Handling' pattern in api.md." However good your context engineering skills are, referring to tools and documents explicitly by exact name is sometimes the best way to go.
- Constrain the scope. Some say that even more important than specifying what needs to be done is specifying what doesn't. So, state explicitly what's excluded: "don't add pagination", "don't change the database schema", "don't create new API endpoints beyond", "don't add new controls or pages", and so forth.
- Define the review criteria. For example, "Check that the code compiles" or "Verify that changes comply with Coding Conventions". This is something to offload to your stage-specific review instructions and/or agent, and it should be performed in a clean context or spawned subagent, preferably with a different model than the one that wrote the plan, document, or code files waiting for commit.
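Put together, a feature-level guardrail section in a plan might look like this sketch; the pattern name, scope lines, and review items are illustrative, not prescriptive:

```markdown
## Guardrails: filter endpoint
- Pattern: follow the "Response Handling" pattern in api.md
- Scope: the filter endpoint only
  - Do NOT add sorting or pagination
  - Do NOT change the database schema
- Review criteria (run in a clean context, different model):
  - Code compiles and existing tests pass
  - Changes comply with Coding Conventions
```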
How precise does the spec need to be?
The current generation of developers is not used to specifying work in detail, and to be frank, it was not much better pre-Agile either; or at least it was often different people who did the planning.
So I've thought a lot about the level of detail that is both realistic to produce and detailed enough for the agents. A sanity check is also needed: a quick bug fix, a small added detail, or a minor layout adjustment might not need anything on paper.
Chapter 13 made the case for why specification matters and introduced three maturity levels (spec-first, spec-anchored, spec-as-source). Here's the practical side: how much specification effort different work actually needs. Generally speaking, the calibration principle applied to agentic software engineering says: specification effort should be proportional to the cost of getting it wrong. I'll postulate the following:
| Story Type | Spec Precision | What to Include | Expected Error Classes | Review Focus |
|---|---|---|---|---|
| Bug fix (isolated, low risk) | Minimal: 2–3 sentences + test case | What's broken, expected behavior, reproduction steps | Wrong fix location, incomplete regression test | Does the fix match the symptom? Any side effects? |
| Small feature (single component) | Moderate: structured spec with acceptance criteria | Functional requirements, component boundaries, test scenarios | Missing edge cases, implicit scope assumptions | Does it do what was asked and nothing more? |
| Cross-cutting feature (multiple services) | Full: structured spec + architecture notes + task breakdown | API contracts, data flow, error handling, deployment order | Interface mismatches, ordering bugs, partial failures | Do the pieces fit together? Is the integration tested? |
| Architecture change | Full: ADR + multi-step plan + rollback strategy | Decision rationale, migration path, compatibility constraints, success criteria | Regression in existing behavior, missed dependencies, performance impact | Is the migration safe? Is rollback possible? |
I encourage discussing these things with your team; it's nothing new and something that should be agreed on even without AI. Also consider measuring these things as I suggest in Knowing It Works, to get some data on the tasks that consistently come back for revision due to bugs, missing features, or architectural issues.
Look out for the ones that look small but aren't. Like a "simple UI change" that touches shared state, a "quick API update" that affects three consumers, a "minor refactor" that shifts module boundaries. Use AI (and your own judgment) to estimate the blast radius of changes and adjust your spec accordingly.
Antipatterns and bad, bad practices you should avoid
As I mentioned in the "focus on DONT's over DO's" principle in Chapter 9, it's often more instructive to look at what not to do than what to do.
This applies also on pattern level. I collected some well-known antipatterns below from the literature and some I've discovered in my work.
Agentic Delivery Antipatterns
27 failure modes across five categories:
- Vague inputs produce unpredictable outputs
- Wrong boundaries produce wrong results
- Broken workflows amplify every other problem
- The last gate is only as good as its criteria
- Speed without guardrails is a liability
Many of the DONT's and DO's here are essentially the art of prompt engineering applied to the entire delivery process. The change is that we now feed the prompts in through planning artifacts, from files or similar, instead of writing them ourselves. In the end, however, anything you write or synthesize with AI is a prompt for the next step.
Specification anti-patterns
Here's what to say, and what not to say, when writing specifications for features or for feature-independent general project documentation. You'll get the idea: the entire point is to be specific and explicit. Once you've set up your Software Factory and all the conventions and patterns are properly in place, you can loosen the noose.
| Don't | Do | What Goes Wrong |
|---|---|---|
| "It should handle errors gracefully" | "On 4xx, return error schema X. On 5xx, log to Y and return generic message. On timeout, retry once with backoff Z." | Agents often invent practices for error handling. Review catches inconsistency too late. |
| "Implement authentication" | "Add JWT session auth using middleware X, storing tokens in httpOnly cookies, with 24h expiry, following the pattern in auth-service.ts" | Multiple approaches across runs. Agent picks OAuth one time, sessions the next. |
| "Just use the existing pattern" | "Follow the repository pattern in user-repository.ts, including the error handling in lines 30–45 and the logging convention in lines 50–55" | "Existing pattern" means different things to different runs. Three implementations, three styles. |
| One story for a full feature | Break into: API endpoint, frontend component, integration test, migration. Each with its own spec and gate | Agent loses coherence on large tasks. Middle sections get less attention than beginning and end. |
| Omitting exclusions | "Scope: only the filter endpoint. Do NOT add sorting, pagination, or modify the existing list endpoint." | Agent adds features you didn't ask for. Creative completion is a feature of LLMs. You need to constrain it. |
Agent design anti-patterns
Another group of anti-patterns I've encountered concerns the agents themselves. Remember, agents are still LLM calls with a 'recipe', i.e. the intent, the tasks the agent is supposed to perform in your execution pipeline, and the context it should operate on (files, category, module). The last one may already be specified during planning, but the agent should be free enough to find more when needed.
To recap, all these 'agents' really are is: Agent = Recipe + Context + Tools
This is basically just an .md file with thin YAML frontmatter (the part between the `---` delimiters): a name, a tool list, and a model to use, followed by what is essentially a prompt to the LLM. All agents should be general enough to perform any task your planning practice produces in the provided context.
As you learn more, go ahead and create specialists for different kinds of tasks, like separate backend, frontend, integration, or DBA agents. The same bad practices will haunt them too, unless you pay attention.
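As a sketch of such a specialist file, assuming a Claude-style subagent format; the exact frontmatter fields and model identifiers vary by platform:

```markdown
---
name: planning-agent
description: Produces implementation plans from groomed user stories
model: opus                 # illustrative; use your platform's model id
tools: [read_file, grep]    # read-only on purpose: planners don't edit code
---
You are the planning agent. Read the referenced feature spec and the
relevant documentation tier, then produce a task-by-task plan.
Entry criteria: a groomed story with acceptance criteria.
Exit criteria: tasks with files to touch, ordering, and gate checks.
```

The body below the frontmatter is just a prompt, which is the point: keep it general enough to handle any task your planning practice produces.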
| Don't | Do | What Goes Wrong |
|---|---|---|
| Feed the entire codebase as context | Point agents at specific files, modules, or documentation tiers relevant to the current task | Token waste. Attention dilution. Agent fixates on irrelevant code. |
| Let agents make architecture decisions | Provide architecture decisions as input constraints, not as questions for the agent to answer | Agent picks a reasonable-looking pattern that conflicts with your system's trajectory. Expensive to undo. |
| No iteration limits | Set explicit retry bounds (e.g., "attempt implementation max 2 times, then stop and report") | Runaway loops. Agents that "keep trying" consume tokens and produce increasingly divergent output. |
| One agent does everything | Separate planning, implementation, testing, and review into distinct agents with distinct context | Role confusion. Planning considerations leak into implementation. Test quality drops when the same agent wrote the code. |
| Skip the planning agent | Always run a planning pass before implementation, even for "simple" tasks | Implementation without a plan is a spec-free zone. The agent guesses scope, structure, and boundaries. |
Process anti-patterns
If you're serious about automation, which is entirely possible to achieve with modern AI tech, you need to be serious about the process as well. Here are the pitfalls I see most often.
| Don't | Do | What Goes Wrong |
|---|---|---|
| Set up everything at once | Build capability incrementally: start with project context, then docs, then one agent, then gates | Teams overwhelm themselves before seeing any value. Complexity without calibration. |
| Auto-approve gates | Run gates manually first. Automate only after the team knows what "good" looks like at each stage | "All tests pass" becomes the definition of done. Architecturally wrong code ships because nobody actually looked. Tests may be wrong or insufficient. Subtle bugs slip through. |
| Let conventions erode silently | Track plan rejection and rework rate. Investigate template and convention drift | Specs that worked last month stop working. Same as accepting always-failing automated tests. |
| Use private setups | Provide AI environments with visibility into tools, MCP servers, and context used. Share all the setups (skills, instructions, practices). That's the "next level" of AI-coding! | Miss the learning opportunities. You'll become the 'AI support guy' who is supposed to fix the issues while others enjoy the ride. |
| "Vibe code" through features | Pause after each agent session to review coherence, not just correctness. Stay in the loop. | "Building, building, building" without stepping back. Result: duplicate logic, mismatched names, no coherent architecture. |
The anti-pattern that cost me the most time was one-shotting: letting the agent attempt an entire feature in a single pass. Despite explicit instructions to work task by task, the agent would regularly try to implement everything at once, touching files it shouldn't, creating dependencies between components that should have been independent, and generally making a mess that took longer to untangle than it would have taken to do it properly. The only fix that stuck was limiting scope to a single task per invocation. Not "please do one task at a time" in the instructions. Literally one task, one run.
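That "literally one task, one run" rule can be enforced from outside the agent rather than inside the prompt. A minimal Python sketch, where `run_agent`, the statuses, and the task shape are all hypothetical stand-ins for whatever CLI or SDK you actually drive:

```python
def run_agent(task: dict) -> str:
    """Placeholder: invoke your coding agent on exactly one task,
    in a fresh context, and return its reported outcome."""
    return f"done: {task['id']}"

def next_open_task(tasks):
    """First task still awaiting implementation, or None."""
    return next((t for t in tasks if t["status"] == "open"), None)

def drive(tasks, max_attempts=2):
    """One task per invocation, with an explicit retry bound per task."""
    log = []
    while (task := next_open_task(tasks)) is not None:
        for attempt in range(1, max_attempts + 1):
            if run_agent(task).startswith("done"):
                task["status"] = "done"
                log.append((task["id"], attempt))
                break
        else:
            # Retry bound exhausted: stop and report instead of looping.
            task["status"] = "blocked"
            log.append((task["id"], max_attempts))
    return log
```

The loop, not the instructions, is what guarantees the scope limit: the agent physically cannot see the next task until the current one is done.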
Security & governance anti-patterns
This category of don'ts is not only about agentic development; these apply to all usage of generative AI systems. The more you offload to AI without proper oversight, the more risk surface you expose to having the prod connection strings injected into your codebase or into various tool memories.
| Don't | Do | What Goes Wrong |
|---|---|---|
| Paste sensitive data into LLMs | Define clear data handling policies; provide sanitized test data for agent context | Credentials, PII, and proprietary code leak outside your security boundary. |
| Skip security review of AI output | Secure software development was hard before AI and certainly remains so. Involve experts early for review. Integrate automated security scanning. Treat all generated code as untrusted input | The code looks right but isn't. A hard security audit fails and prevents rollout. Sensitive customer or company data is exposed, or you get hacked. |
| Ignore code or tool origins | Review licensing implications. Never allow npm install or similar without checking. | Malicious packages. License violations. Unintended dependencies that become attack vectors. |
| Let AI mask skill gaps | Require developers to explain generated code before it merges. Invest in training alongside AI adoption | Nobody understands the generated code. When it breaks, nobody can fix it. The team has a dependency, not a capability. Everybody ends up in the dumb zone. |
| "The AI wrote it" as excuse | Treat AI output exactly like any other output for review and accountability purposes | Accountability evaporates. Quality drops. Defect ownership becomes a blame game. |
| "Let an agent replace your cloud engineer" | You still need people who know the technology and its sustainable usage | The IAM/IDP/secret management and other security configurations are not something you can just "ask the agent to do". You need to have the people who understand the security implications and how to properly set up the guardrails. And if you're doing IAC, you'll also risk getting an eyewatering bill from Jeff or Bill next month. |
Building your software factory
So how do you get started with the factory? The following sections suggest a plan for building it gradually, applying the same incremental approach as Chapter 8's Adoption Ladder. The point here is not that exactly these seven layers are the gold standard (obviously), but to illustrate a logical progression of how I've tried to do this. It's best to build on a strong foundation (solid project context, good documentation structure, task tracking) before jumping to parallel agent runs coding the new major ERP (would be good riddance btw) in days.
Building the Machine
Seven phases of incremental capability, each building on the previous:
1. Custom instructions: CLAUDE.md or equivalent with conventions, stack, constraints
2. Hierarchical documentation: guidelines, architecture decisions, patterns, domain context
3. Task tracking and gating design: task granularity, status transitions, gate criteria
4. First agents: planning agent first, then testing, review, deployment
5. Manual gating and review: human reviews every gate to learn what "good" looks like
6. Automated gating and guardrails: static analysis, architecture rules, convention checks
7. Skills and tooling: scaffolding, transformations, validations, continuously refined
All this is obviously going to be iterative. The tools evolve, models and people change, your codebase and documentation will evolve, and your understanding of what belongs in the project context and how to specify work will improve. So in parallel with this regular process you have for your factory, you need to have the maintenance process for the factory.
So what kind of things to look out for and learn about? I tried to summarize this in the next image.
Anatomy of the Software Factory
Building blocks and their relationships, aligned with the phased approach:
- Agent = Model + Specification + Tools + Context
- Custom instructions: CLAUDE.md, copilot-instructions.md; project-level conventions, stack, constraints
- Documents: architecture decisions, patterns, conventions, domain context; hierarchical and navigable
- Work items: task specifications, user stories, acceptance criteria; the specific work to be done
- Model: the engine (Opus, Sonnet, GPT, Gemini), matched to the agent's task type
- Agent specification: the recipe; role, workflow, entry/exit criteria; an .md file with thin YAML frontmatter
- Tools: the hands; file ops, search, build, test, deploy; extended by reusable Skills
- Gating: human-in-the-loop at every gate; calibrate before you automate
- Guardrails: static analysis, architecture rules, convention checks; encode what you learned
Let's do a quick recap of the concepts here before going into the phases. The core building blocks of your software factory are:
| Element | Purpose | Example |
|---|---|---|
| Custom instructions | Project ground truth always in context | CLAUDE.md with conventions, stack, and constraints |
| Documents | Structured guidelines and patterns | API design guidelines, architecture decision records, coding standards |
| Prompts | The command you or another agent gives to an agent to start or guide the work | "Create an implementation plan for the next unimplemented user story from feature 34331" |
| Model | The LLM that generates the output | GPT-4, Gemini, Sonnet, Opus |
| Agent specification | The recipe for an agent's behavior, including its role, tasks, and context | A YAML/md file defining a "planning agent" that uses Opus and has access to project documentation to analyze and create a clear, detailed plan for a given feature |
| Tools | The toolset provided by the IDE/Code platform or offered via MCP | Edit file, run agent, find facts from the code, seek documents from web |
| Gating | The mechanisms to control and monitor agent actions | Approval workflows, automated checks, and validation steps |
| Skills and reusable tooling | Another way to expose capabilities to models; an economical alternative to MCP | Custom scripts, reusable functions, and shared libraries that agents can call to perform common tasks, like agent-browser, reading a PDF, or doing a semantic search on the codebase |
Phase 1: Custom instructions
Start with the "always true" things about your project. A CLAUDE.md, copilot-instructions.md, or equivalent file that captures the conventions, technology choices, and constraints that every agent interaction should respect. Keep this compact. If it grows beyond what fits comfortably in an agent's context window, it's too large. Compare it to your corporate guidebooks: you might have some workplace safety rules, stock market rules, GDPR and whatever else the European Union has (AI Act, anyone?), but certainly the coding agent won't need to read all of them.
Begin with: brief project intro, perhaps a short vocabulary, technical things like languages, tech stack, documentation locations and such and something generic about the process and architecture. Hard stop things and must-not-do's like "never add any secrets like API keys, connection strings to code" belong here, too.
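A minimal sketch of what such a file might contain; every value here is a placeholder for your own project, not a recommendation:

```markdown
# Project context (CLAUDE.md)
Orders service: order intake and fulfilment for the web shop.

- Stack: TypeScript, Node 20, PostgreSQL
- Layout: src/<domain>/<feature>; tests mirror src/
- Docs index: docs/INDEX.md (read before planning)
- Never add secrets (API keys, connection strings) to code
- Never change the database schema outside an approved migration task
```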
The first version will be wrong. That's fine. The value of Phase 1 is learning what belongs in project-level context and what doesn't. You'll rewrite this file three times before it stabilizes.
Phase 2: Hierarchical documentation structure
As the project matures, the single instruction file isn't enough. Or it will be so big that it eats most of your context window; remember that even the basic system prompts, agent recipes, tool specifications, and other preliminaries easily amount to tens of thousands of tokens alone.
So, design a documentation hierarchy as suggested in this book: divide guidelines, technology decisions, architectural patterns, domain-specific context, and things like standard UI patterns into individual files. Also describe the process and gating once you're ready with that.
Your document library will grow over time, especially now that you have (over?)eager AI assistants to add more content, so a good index is essential for your agents to find the right context without reading everything. Structure your library so that agents can be pointed at the relevant subset rather than ingesting the whole tree.
See Chapter 17 for the three-tier documentation model (project-level, task-specific, tracking) and practical techniques for organizing context.
Phase 3: Task tracking and gating design
Introduce structured task tracking that agents can consume and update. I suggest having this in place early on, not for the sake of a 100% complete paper trail, but because this is really the key piece to get right in order for anything else to work properly.
File-based approaches work well for revision control and collaboration. Key design decisions include what granularity of tasks, what status transitions, what gates between phases you want. Consider what integration with existing tools (Jira, Azure DevOps) is required versus what can live in the repository. Also check what ready-made tools there are: I've mentioned GSD, and also Beads as candidates for this. Or just let the AI generate a tool for you which matches your exact needs.
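As a file-based sketch, a single task entry might look like this; the statuses, gates, and fields are ones I'd pick, not a standard:

```markdown
## Story 34331 / Task 3: filter endpoint
- status: planned        (planned -> in-progress -> review -> done)
- gate: plan approved by a human reviewer before implementation
- depends-on: Task 2 (database migration)
- acceptance: spec.md, scenarios 1-4
```

Because it lives in the repository, agents can read and update it, and the revision history doubles as the paper trail.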
Phase 4: First agents for selected phases
Don't try to automate every phase at once.
I'd start with planning instead of the coding agent, even though I know for sure the coding agent is the one most people find tempting. My point here all along has been that you'll need this spec-driven thing whether you like it or not.
Use models best suited for each agent type (reasoning models for planning, faster or code-oriented models for implementation, UI-focused models for interface design). I've preferred the ever-enthusiastic Sonnet as the orchestrator, GPT Codex for code generation, Gemini for UI-related work, and Opus for planning.
Next, I'd introduce a decent test planner. Start TDD-driven planning early. As with hand-crafted code, it's very hard to get good, representative coverage after the fact, especially without proper functional specifications.
Phase 5: Manual gating and review
What this means is that you are in the loop between the major stages: planning ready, coding ready, testing ready, review ready. Once you learn the correct bite size, your codebase is structured and documented, and you are past the project bootstrapping phase, you can ease off.
Phase 6: Automated gating and guardrails
Once you understand the failure modes and the points where things usually go sideways, you may introduce automated checks. We've done static analysis, architecture rule enforcement, and convention validation, backed by deterministic tools and partially by AI capabilities.
Define what manual intervention points are genuinely necessary versus what can be safely automated. I'd keep the requirement to review and approve the plan before implementation.
Phase 7: Skills and tooling
Develop reusable skills and command-line tools for operations that should "always work": basic scaffolding, common transformations, standard validations, compilations, better-than-grep semantic search tools, and so on. Chances are you'll find excellent ones online (check, for instance, skills.sh, awesome-skills lists on GitHub, etc.).
These are important building blocks of the agentic software factory which offload repetitive work from both humans and agents and reduce the surface area where things can go wrong. As a summary, check out the diagram below.
Trust and verification
Governed delivery is not a faith-based initiative. If the process is working, specific numbers improve. If they don't, something is wrong: either with the specs, the gates, the agent configuration, or the team's understanding of the boundary between beef and boilerplate.
Four metrics worth tracking:
| Metric | What It Measures | What "Good" Looks Like | What Bad Numbers Tell You |
|---|---|---|---|
| Plan rejection rate | How often specs come back for revision at the review gate | Below 20% after the team has calibrated | Specs are too vague, or the team hasn't internalized what "good enough" means |
| Defect escapes | Issues found in production that should have been caught by gates | Trending downward; no repeat categories | Gates are checking the wrong things, or acceptance criteria are incomplete |
| Rework time | Time spent revising agent output before it's acceptable | Decreasing over time as specs improve | Spec precision is too low for the story type, or conventions aren't being followed |
| Reviewer fatigue | Subjective rating from reviewers (survey or standup check-in) | Stable or improving; reviewers feel the output is getting easier to assess | Specs are degrading, agent output is inconsistent, or review criteria are unclear |
Plan rejection rate is your leading indicator. It tells you whether the team is getting better at the beef, the upfront specification work that determines everything downstream. Track it weekly. When it drops, the process is calibrating. When it creeps back up, investigate: new team members? Changed requirements patterns? Specification template drift?
I don't have hard measurements, but from the early stages to now the chances of getting most things right on the first pass have at least doubled. The specs got better, the context got tighter, and the agent output got more predictable. That said, most of the improvement comes from learning what to specify, not from the tools getting smarter.
Defect escapes is your lagging indicator. It tells you whether the gates are actually catching what matters. If the same category of defect escapes repeatedly (say, integration issues between services), that's a signal to add a gate check, not to blame the agent.
Don't over-instrument. These four metrics can be tracked with a spreadsheet and a weekly standup question. The goal is signal, not surveillance. If tracking the metrics costs more attention than it saves, simplify.
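The arithmetic behind the leading indicator is spreadsheet-simple. A Python sketch of two checks I'd run weekly; the function names are mine, and the 20% threshold comes from the table above:

```python
def plan_rejection_rate(review_outcomes):
    """Share of specs that came back for revision at the review gate."""
    if not review_outcomes:
        return 0.0
    rejected = sum(1 for outcome in review_outcomes if outcome == "rejected")
    return rejected / len(review_outcomes)

def is_calibrating(weekly_rates, threshold=0.2):
    """Leading-indicator check: latest weekly rate is under the threshold
    and no worse than where the tracked period started."""
    return weekly_rates[-1] < threshold and weekly_rates[-1] <= weekly_rates[0]
```

If `is_calibrating` flips to false, that's the cue to investigate new team members, changed requirements patterns, or template drift.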
Summary
When all is set up, you'll have covered most of what's needed to get the most out of your agents and avoid the common pitfalls. It'll look something like this (but is not limited to it).
From Model to Factory
Start small, add layers as you learn:
- Model: Sonnet, Opus, GPT, Gemini
- Prompt: user story, bug report, task specification
- Custom instructions: CLAUDE.md with conventions, stack, constraints, patterns
- Agent specification: agent.md with purpose, steps, entry/exit criteria, scope
- Tools: file ops, search, build, test, deploy, API calls
- Skills: SKILL.md with scaffolding, validations, transformations
- Orchestration: handoffs, gates, state tracking, multi-agent coordination
- The "beef" is the hard, creative work that only you can do. The "boilerplate" is the repetitive, pattern-based work that agents can handle. Don't delegate the beef; it just moves the problem downstream.
- Spec precision should be calibrated to the risk and complexity of the task. Don't under-specify cross-cutting features or architecture changes, and don't over-specify simple bug fixes.
- Avoid anti-patterns in both specification and agent design. Clear, explicit specs and well-defined agent roles are essential for success.
- Build your software factory incrementally. Start with project context and documentation, then add agents and gates one phase at a time. Don't try to automate everything at once.