16

The Beef and the Boilerplate

What actually works, what doesn't, and where to start


Before you begin

If you want to see governed AI development at the smallest possible scale, my game is the example. One developer, one agent, no infrastructure beyond a text editor and an API key. Everything in this chapter started there and grew outward.

If you made it all the way here, you understand why accountability matters more than speed, why specifications narrow the solution space, and why gates enforce discipline.

Now let's talk about how to actually do it.

A typical starting point looks something like this. New project, new team, half the people have never used AI tools beyond ChatGPT, and the other half have used something different, individually, on a different stack. Everybody's eager. Licences are in, agents are running, and the first week produces an impressive amount of code. Looks great in the demo. Then week two hits and you notice: three different error handling patterns, two competing folder structures, tests that mock things in ways nobody agreed on, and a merge conflict rate that makes you question your life choices. The tools worked fine. The problem was that nobody had defined what "fine" means for this project. On top of that, the inherent randomness of outputs means that even with the same instructions, you get different results every time, and the sheer mass of generated code to verify is just too much too quickly. Getting to factory-level consistency takes weeks at minimum, often more.

What I'm going to cover next is:

  • The distinction between work that requires your judgment and work where agents perform
  • Calibration for how much specification effort different work actually needs
  • Failure modes that destroy value, presented as concrete anti-patterns
  • A phased approach for building the capability incrementally
  • Metrics that tell you whether the process is working
  • A checklist you can use as a working template, starting next week.

A typical starting situation you're likely to encounter:

  • â˜‘ī¸The parent corporation has purchased the entry-level licences for some AI development tool
  • â˜‘ī¸The team has had some hours of training but no real practice beyond basic prompting and trials
  • â˜‘ī¸You might have some existing codebase and documentation, but no idea on how to bring them in. Typically they are not in good order.
  • â˜‘ī¸There's no tradition whatsoever to actually manage information or specify work in detail sufficient for agents to succeed
  • â˜‘ī¸Initial trials with default agents, such as planning and build agents have been unsatisfactory

The beef and the boilerplate

In order to figure out the correct level of abstraction for your specifications, you need to understand the fundamental distinction between two categories of work in AI-assisted delivery: the "beef" that requires real craft, and the "boilerplate" where agents earn their keep.

Next, I'll explain why this matters.

The beef: Work that requires your craft

The beef is the set of activities where your judgment is irreplaceable. This is where you say what to build.

For the beef to be actually digestible (let's say medium), we need to ensure that the following are in place:

| Principle | Guidance |
| --- | --- |
| Ambiguity | AI can't infer unstated assumptions or implicit intent. Say what you mean. If you want a "user-friendly error message," specify what that message should say and when it should appear. If you want "good performance," specify the latency requirements and load conditions. |
| Clarity | Keep to the *what*. Focus on the outcome, not the implementation. "The list of items should be sorted by creation date by default" is clearer than "the API should use a sorted data structure." The former leaves room for the agent to choose the best implementation; the latter prescribes implementation details. |
| Decomposition | Engineering 101: make a bigger problem solvable by breaking it into digestible steps with clean boundaries. This also lets you detect dependencies and order tasks correctly. A story that touches six services and three APIs and requires a database migration is not something to one-shot by default, or at least should be considered carefully; if it's more like adding a new field to the UI, it might be fine. It depends. The same applies to features: "build a fully-featured CRM system" is not a single feature but a sequence of features that build on each other, and you need to feed them in carefully planned order with guardrails and intermediate checkpoints in between. |
| Architecture | Specify project-wide patterns, interfaces, constraints, and boundaries. Litter your documentation with examples. Your agents will eagerly generate thousands of lines of seemingly correct code that is simply different from everything else in the repo. Not only does this become a maintenance nightmare, but future agent sessions will struggle to detect the patterns if they haven't been applied systematically in the existing codebase. |

I've found it useful to refer to existing modules/code files/pages/components and whatnot as examples to compare to when planning a feature with our planning agent. This establishes a clear pattern for the agent to follow and reduces the chances of it inventing its own style that doesn't fit with the rest of the codebase. And if these discoveries are persisted in your plans already during planning, the engineering agent can follow them instead of redoing the codebase-wide greps and getting it wrong.

Acceptance criteria. The Pareto principle is very much alive in the agentic world as well. Much of the error budget goes to edge cases, however improbable they might be. Vagueness does not help here either: spend time, for example using the Gherkin format, to define the paths that need to be covered, and ask AI to discover more. "It should work correctly" is not a criterion.
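As a sketch of what that effort looks like, here is a Gherkin-style fragment for a hypothetical list filter; the feature, messages, and limits are all illustrative, not from any real project:

```gherkin
Feature: Item list filtering

  Scenario: Filter returns no matches
    Given 3 items exist and none match the filter "archived"
    When the user applies the filter "archived"
    Then the list shows an empty state with the message "No items match your filter"
    And the filter control remains active

  Scenario: Filter input exceeds maximum length
    Given the filter input is 256 characters long
    When the user applies the filter
    Then the request is rejected client-side with a validation message
```

Even two scenarios like these give the agent concrete paths to cover, and give you something to ask it to extend.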

Review judgment is about knowing when output is correct versus when it merely compiles and passes tests. Agents can produce code that is syntactically perfect, test-passing, and architecturally or functionally plain wrong. Only a person who understands the system's intent can catch these.

Other talking points you might have come across are that AI generates technical debt at scale and introduces subtle bugs that are hard to detect. Both of these, among others, are valid concerns to be taken seriously: it's a whole other ballgame to spend weeks refactoring in the middle of a project, or to start over, especially after you've already shipped something, with AI or not. Don't trust that you can just "make things right later." You really can't.

If you delegate the beef, you don't save time. You move it to the review gate where it costs more, or worse, to production where it costs the most. Shift left, folks!

Skip the boilerplate: Where agents earn their keep

The boilerplate consists of activities where agents reliably produce acceptable output, the things that should be rather straightforward. In short, "boilerplate" means making the boring part automatic.

Examples of these kinds of activities include:

| Task type | Guidance |
| --- | --- |
| Scaffolding | Project structure, boilerplate files, configuration. Given a convention file that says "we use this folder structure, these naming patterns, this test framework, in this kind of project," an agent can set up new projects and modules consistently. |
| Pattern application | Give an example page, control, or module, and ask the agent to follow that pattern for the new thing you need. The pattern can be anything from naming to code file organization to an architectural pattern. |
| Test generation from specs | If you've done your homework in the "beef" part well, this is relatively easy. You've already defined the acceptance criteria, so the agent's job is to translate those into test code. This is a perfect example of where agents save time: they can generate thorough test suites that cover all the specified scenarios, including edge cases you might have missed. |
| Glue code | REST model to display model, data transformation, API integration. This is the "connect the dots" work that is tedious but straightforward. If you specify the inputs, outputs, and transformation logic, agents can reliably produce this code. Pass in a sample result, or better yet the OpenAPI spec of your service, and perhaps a sample output, and let the AI do the translation work for you. |
| Documentation from code | Consider carefully what documentation to generate and maintain per feature or per concern (like patterns). Slop that is easily regenerated is rarely worth committing and will not be kept up to date. But as a feedback-loop stage, for instance after you've changed your API or data structures or introduced a new pattern or condition, regenerating documentation is a logical and necessary point to review and refine. |
Two caveats. First, unless carefully isolated, generated tests may be entirely rubbish: working at the wrong abstraction level or, in the worst case, written to fit the code rather than the requirements. Proper E2E tests in particular need a black-box approach that tests for outcomes, and often require explicit instrumentation (like data-id attributes). AI is also often overenthusiastic, generating far too many tests and turning test maintenance into a nightmare.
Second, generated documentation can become a maintenance nightmare of its own. You might have noticed your codebase being polluted by dozens of GUIDE.md, IMPORTANT.md, NOTES.md, QUICKSTART.md, and REFERENCE.md files in each folder. (Yes, Anthropic, it's you, hello!) In the end you'll have dozens of them with no idea whether they are relevant or up to date. Be careful what you commit.
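To make the glue-code category concrete, here's a minimal sketch of a REST-to-display-model transformation. The `ApiUser` and `UserRow` shapes are hypothetical stand-ins, not from any real API:

```typescript
// Hypothetical wire format returned by a REST endpoint.
interface ApiUser {
  id: string;
  first_name: string;
  last_name: string;
  created_at: string; // ISO 8601 timestamp
}

// Hypothetical display model consumed by the UI layer.
interface UserRow {
  id: string;
  fullName: string;
  createdAt: Date;
}

// The "connect the dots" transformation: trivial, tedious, and exactly
// the kind of code an agent can produce reliably from a precise spec.
function toUserRow(api: ApiUser): UserRow {
  return {
    id: api.id,
    fullName: `${api.first_name} ${api.last_name}`.trim(),
    createdAt: new Date(api.created_at),
  };
}
```

Specify the input shape, the output shape, and the mapping rules, and this is exactly the kind of output you can expect back unchanged.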

The guardrails for delegated work

An essential safety mechanism for delegated agent work is to erect boundaries and hard guardrails.

At the top level, these fences are baked-in state tracking, which requires certain tasks, states, and viewpoints to have been considered, and isolation of the unit of work from already-verified content through branching and PRs.

Once you have a good agreement and tooling to support that, you should consider the feature-level guidance to keep things on track. Consider the following:

  1. Name the pattern. Instead of vague "follow existing conventions," use "follow the 'Response Handling' pattern in api.md." Yeah, however good your context engineering skills are, referring to tools and documents explicitly by exact names is sometimes the best way to go.
  2. Constrain the scope. Some say that even more important than specifying what needs to be done is specifying what doesn't. So, state explicitly what's excluded: "don't add pagination", "don't change the database schema", "don't create new API endpoints beyond those specified", "don't add new controls or pages" and so forth.
  3. Define the review criteria. For example, "Check that code compiles" or "Verify that changes comply with Coding Conventions". This should be offloaded to your stage-specific review instructions and/or agent and performed in a clean context or spawned subagent, preferably with a different model than the one that wrote the plan, document, or code files waiting for commit.
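Put together, a feature-level guardrail block in a task spec might look like this sketch; the pattern and file names are placeholders to show the shape, not a prescribed format:

```markdown
## Guardrails

- Pattern: follow the "Response Handling" pattern in docs/api.md
- Scope: only the filter endpoint
- Excluded: pagination, sorting, schema changes, new API endpoints
- Review: code compiles; changes comply with docs/coding-conventions.md;
  review runs in a clean context with a different model than the author
```

A block like this travels with the task, so the implementation agent, the review agent, and the human all check against the same fence.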

How precise does the spec need to be?

The current generation of developers is not used to specifying work in detail, and to be frank, it was not much better pre-Agile either; or at least it was often different people who did the planning.

So I've been thinking a lot about the suitable level of detail: realistic to produce, yet detailed enough for the agents. A sanity check is also needed: a quick bug fix, a small added detail, or a minor layout adjustment might not need anything on paper.

Chapter 13 made the case for why specification matters and introduced three maturity levels (spec-first, spec-anchored, spec-as-source). Here's the practical side: how much specification effort different work actually needs. Generally speaking, the Calibration Principle applied to agentic software engineering says that specification effort should be proportional to the cost of getting it wrong. I'll postulate the following:

| Story Type | Spec Precision | What to Include | Expected Error Classes | Review Focus |
| --- | --- | --- | --- | --- |
| Bug fix (isolated, low risk) | Minimal: 2–3 sentences + test case | What's broken, expected behavior, reproduction steps | Wrong fix location, incomplete regression test | Does the fix match the symptom? Any side effects? |
| Small feature (single component) | Moderate: structured spec with acceptance criteria | Functional requirements, component boundaries, test scenarios | Missing edge cases, implicit scope assumptions | Does it do what was asked and nothing more? |
| Cross-cutting feature (multiple services) | Full: structured spec + architecture notes + task breakdown | API contracts, data flow, error handling, deployment order | Interface mismatches, ordering bugs, partial failures | Do the pieces fit together? Is the integration tested? |
| Architecture change | Full: ADR + multi-step plan + rollback strategy | Decision rationale, migration path, compatibility constraints, success criteria | Regression in existing behavior, missed dependencies, performance impact | Is the migration safe? Is rollback possible? |

I encourage discussing these things with your team; it's nothing new and something that should be agreed even without AI. Also consider measuring them as I suggest in Knowing It Works, to get data on the tasks that consistently come back for revision due to bugs, missing features, or architectural issues.

Look out for the ones that look small but aren't: a "simple UI change" that touches shared state, a "quick API update" that affects three consumers, a "minor refactor" that shifts module boundaries. Use AI (and your own judgment) to estimate the blast radius of changes and adjust your spec accordingly.

Antipatterns and bad, bad practices you should avoid

As I mentioned in the "focus on DONT's over DO's" principle in Chapter 9, it's often more instructive to look at what not to do than what to do.

This applies at the pattern level too. Below I've collected some well-known anti-patterns from the literature, plus some I've discovered in my own work.

Agentic Delivery Antipatterns

27 failure modes across five categories:

  • Specification (6): vague inputs produce unpredictable outputs
  • Agent Design (6): wrong boundaries produce wrong results
  • Process (5): broken workflows amplify every other problem
  • Review (5): the last gate is only as good as its criteria
  • Security & Governance (5): speed without guardrails is a liability

Many of the DON'Ts and DO's here are essentially prompt engineering applied to the entire delivery process. The change is that we feed the prompts through planning artifacts, from files or similar, instead of writing them ourselves. In the end, however, anything you write or synthesize with AI is a prompt for the next step.

Specification anti-patterns

Here's what to say, and what not to say, when writing specifications for features or for feature-independent general project documentation. You'll get the idea; the entire point is to be specific and explicit. Once you've set up your Software Factory and all the conventions and patterns are properly in place, you can loosen the noose.

| Don't | Do | What Goes Wrong |
| --- | --- | --- |
| "It should handle errors gracefully" | "On 4xx, return error schema X. On 5xx, log to Y and return generic message. On timeout, retry once with backoff Z." | Agents often invent practices for error handling. Review catches inconsistency too late. |
| "Implement authentication" | "Add JWT session auth using middleware X, storing tokens in httpOnly cookies, with 24h expiry, following the pattern in auth-service.ts" | Multiple approaches across runs. Agent picks OAuth one time, sessions the next. |
| "Just use the existing pattern" | "Follow the repository pattern in user-repository.ts, including the error handling in lines 30–45 and the logging convention in lines 50–55" | "Existing pattern" means different things to different runs. Three implementations, three styles. |
| One story for a full feature | Break into: API endpoint, frontend component, integration test, migration. Each with its own spec and gate | Agent loses coherence on large tasks. Middle sections get less attention than beginning and end. |
| Omitting exclusions | "Scope: only the filter endpoint. Do NOT add sorting, pagination, or modify the existing list endpoint." | Agent adds features you didn't ask for. Creative completion is a feature of LLMs. You need to constrain it. |
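To show what a precise error-handling spec like "On 4xx, return error schema X. On 5xx, log to Y and return generic message" buys you, here's a hedged TypeScript sketch of the code it pins down. The schema, logger, and messages are stand-ins for whatever your spec actually names:

```typescript
// Hypothetical error schema and log target; stand-ins for "schema X"
// and "log to Y" in the spec.
interface ApiError {
  code: string;
  message: string;
  retryable: boolean;
}

const log = (msg: string): void => {
  console.error(msg);
};

// Deterministic mapping from HTTP status to error behavior, exactly as
// the precise spec dictates. No room left for the agent to invent.
function classifyError(status: number, body: string): ApiError {
  if (status >= 400 && status < 500) {
    // 4xx: surface the structured error to the caller.
    return { code: `CLIENT_${status}`, message: body, retryable: false };
  }
  if (status >= 500) {
    // 5xx: log the details, return a generic message, allow a retry.
    log(`upstream failure ${status}: ${body}`);
    return {
      code: "UPSTREAM",
      message: "Service unavailable, please retry.",
      retryable: true,
    };
  }
  return { code: "NONE", message: "", retryable: false };
}
```

With the vague version of the spec, two agent runs could plausibly produce two incompatible versions of this function; with the precise version, every run converges on the same behavior.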

Agent design anti-patterns

Another group of anti-patterns I've encountered concerns the agents themselves. Remember, an agent is still an LLM call with a "recipe": the intent and the tasks the agent is supposed to perform in your execution pipeline, plus the context it should operate with (files, category, module). The context may already be specified during planning, but the agent should be free enough to find more when needed.

To recap, all these "agents" really are is: Agent = Recipe + Context + Tools.

In practice, this is just an .md file with thin YAML frontmatter. The "standard" (note the quotes) YAML is just a name, a tool list, and the model to use, followed by what is basically a prompt to the LLM. All agents should be general enough to perform any task your planning practice produces in the provided context.
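For illustration, a minimal agent specification file in this style might look like the sketch below; the field names, tools, and model are placeholders meant to show the shape, not any particular vendor's exact schema:

```markdown
---
name: planning-agent
model: opus
tools: [read_file, grep, list_dir]
---

You are a planning agent. Given a user story, produce an implementation
plan: affected files, task breakdown, referenced patterns, and explicit
exclusions. Do not write code. Stop and ask if the story is ambiguous.
```

Everything below the frontmatter is simply the recipe: a prompt that defines the role, the workflow, and the stopping conditions.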

As you learn more, go ahead and create specialists for different kinds of tasks, like separate backend, frontend, integration, and DBA agents. The same bad practices will haunt them too, unless you pay attention.

| Don't | Do | What Goes Wrong |
| --- | --- | --- |
| Feed the entire codebase as context | Point agents at specific files, modules, or documentation tiers relevant to the current task | Token waste. Attention dilution. Agent fixates on irrelevant code. |
| Let agents make architecture decisions | Provide architecture decisions as input constraints, not as questions for the agent to answer | Agent picks a reasonable-looking pattern that conflicts with your system's trajectory. Expensive to undo. |
| No iteration limits | Set explicit retry bounds (e.g., "attempt implementation max 2 times, then stop and report") | Runaway loops. Agents that "keep trying" consume tokens and produce increasingly divergent output. |
| One agent does everything | Separate planning, implementation, testing, and review into distinct agents with distinct context | Role confusion. Planning considerations leak into implementation. Test quality drops when the same agent wrote the code. |
| Skip the planning agent | Always run a planning pass before implementation, even for "simple" tasks | Implementation without a plan is a spec-free zone. The agent guesses scope, structure, and boundaries. |
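The "no iteration limits" anti-pattern has a simple remedy that can be sketched as a small driver loop; `runWithRetryBound` and the `AgentFn` signature are illustrative stand-ins for whatever actually invokes your implementation agent:

```typescript
interface AgentResult {
  success: boolean;
  output: string;
}

// Stand-in signature for a real agent invocation; in practice this
// would call your agent runtime with the task spec as the prompt.
type AgentFn = (task: string, attempt: number) => AgentResult;

// Bounded retry: attempt at most `maxAttempts` times, then stop and
// report, instead of looping until the token budget is gone.
function runWithRetryBound(
  agent: AgentFn,
  task: string,
  maxAttempts = 2
): AgentResult {
  let last: AgentResult = { success: false, output: "not attempted" };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = agent(task, attempt);
    if (last.success) return last;
  }
  return {
    success: false,
    output: `gave up after ${maxAttempts} attempts: ${last.output}`,
  };
}
```

The key design choice is that the bound lives in the harness, not in the prompt: "please stop after two tries" in the instructions is a suggestion, while a loop counter is a guarantee.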

Process anti-patterns

If you're serious about automation, which is entirely achievable with modern AI tech, you need to be equally serious about the process. Here are the pitfalls I see most often.

| Don't | Do | What Goes Wrong |
| --- | --- | --- |
| Set up everything at once | Build capability incrementally: start with project context, then docs, then one agent, then gates | Teams overwhelm themselves before seeing any value. Complexity without calibration. |
| Auto-approve gates | Run gates manually first. Automate only after the team knows what "good" looks like at each stage | "All tests pass" becomes the definition of done. Architecturally wrong code ships because nobody actually looked. Tests may be wrong or insufficient. Subtle bugs slip through. |
| Let conventions erode silently | Track plan rejection and rework rate. Investigate template and convention drift | Specs that worked last month stop working. Same as accepting always-failing automated tests. |
| Use private setups | Provide AI environments with visibility into tools, MCP servers, and context used. Share all the setups (skills, instructions, practices). That's the "next level" of AI coding! | Missed learning opportunities. You become the "AI support guy" who is supposed to fix the issues while others enjoy the ride. |
| "Vibe code" through features | Pause after each agent session to review coherence, not just correctness. Stay in the loop. | "Building, building, building" without stepping back. Result: duplicate logic, mismatched names, no coherent architecture. |

The anti-pattern that cost me the most time was one-shotting: letting the agent attempt an entire feature in a single pass. Despite explicit instructions to work task by task, the agent would regularly try to implement everything at once, touching files it shouldn't, creating dependencies between components that should have been independent, and generally making a mess that took longer to untangle than it would have taken to do it properly. The only fix that stuck was limiting scope to a single task per invocation. Not "please do one task at a time" in the instructions. Literally one task, one run.

Security & governance anti-patterns

This category of don'ts is not only about agentic development; these apply to all use of generative AI systems. The more you offload to AI without proper oversight, the more risk surface you expose, from prod connection strings injected into your codebase to leaks via various tool memories.

| Don't | Do | What Goes Wrong |
| --- | --- | --- |
| Paste sensitive data into LLMs | Define clear data handling policies; provide sanitized test data for agent context | Credentials, PII, and proprietary code leak outside your security boundary, whether pasted directly or pulled in via context. |
| Skip security review of AI output | Secure software development was hard before AI and certainly remains so. Bring in experts early for review. Integrate automated security scanning. Treat all generated code as untrusted input | The code looks right but isn't. A hard security audit fails and blocks rollout. Sensitive customer or company data is exposed, or you get hacked. |
| Ignore code or tool origins | Review licensing implications. Never allow npm install or similar without checking. | Malicious packages. License violations. Unintended dependencies that become attack vectors. |
| Let AI mask skill gaps | Require developers to explain generated code before it merges. Invest in training alongside AI adoption | Nobody understands the generated code. When it breaks, nobody can fix it. The team has a dependency, not a capability. Everybody ends up in the dumb zone. |
| "The AI wrote it" as excuse | Treat AI output exactly like any other output for review and accountability purposes | Accountability evaporates. Quality drops. Defect ownership becomes a blame game. |
| "Let an agent replace your cloud engineer" | You still need people who know the technology and its sustainable usage | IAM/IDP/secret management and other security configurations are not something you can just "ask the agent to do". You need people who understand the security implications and how to properly set up the guardrails. And if you're doing IaC, you also risk an eyewatering bill from Jeff or Bill next month. |

Building your software factory

So how to get started with the factory? The following sections suggest a plan for building it gradually, applying the same incremental approach as Chapter 8's Adoption Ladder. The point is not that exactly these seven layers are the gold standard (obviously), but to illustrate the logical progression of how I've tried to do this. It's best to build on a strong foundation (solid project context, good documentation structure, task tracking) before jumping to parallel agent runs coding the new major ERP (would be good riddance, btw) in days.

Building the Machine

Seven phases of incremental capability, each building on the previous:

Foundation
  1. Custom Instructions: CLAUDE.md or equivalent with conventions, stack, constraints
  2. Documentation Hierarchy: guidelines, architecture decisions, patterns, domain context

Structure
  3. Task Tracking & Gating: task granularity, status transitions, gate criteria
  4. First Agents: planning agent first, then testing, review, deployment

Calibration
  5. Manual Gating & Review: a human reviews every gate to learn what "good" looks like
  6. Automated Gating: static analysis, architecture rules, convention checks

Maturity
  7. Skills & Tooling: scaffolding, transformations, validations, continuously refined

Seven phases of building governed agentic delivery: from project context to reusable tooling.

All this is obviously going to be iterative. The tools evolve, models and people change, your codebase and documentation will evolve, and your understanding of what belongs in the project context and how to specify work will improve. So alongside the regular process your factory runs, you need a maintenance process for the factory itself.

So what kinds of things should you look out for and learn about? I've tried to summarize this in the next image.

Anatomy of the Software Factory

Building blocks and their relationships, aligned with the phased approach:

Agent = Model + Specification + Tools + Context

Context (loaded into the agent)
  • Custom Instructions (P1): CLAUDE.md, copilot-instructions.md; project-level conventions, stack, constraints
  • Documents (P2): architecture decisions, patterns, conventions, domain context; hierarchical and navigable
  • Prompts (P3): task specifications, user stories, acceptance criteria; the specific work to be done

Agent (P4)
  • Model: the engine (Opus, Sonnet, GPT, Gemini), matched to the agent's task type
  • Agent Specification: the recipe (role, workflow, entry/exit criteria); an .md file with thin YAML frontmatter
  • Tools: the hands (file ops, search, build, test, deploy), extended by reusable Skills

Gating (the agent's output is reviewed by)
  • Manual Review (P5): human-in-the-loop at every gate; calibrate before you automate
  • Automated Gates (P6): static analysis, architecture rules, convention checks; encode what you learned

Accelerated by Skills & Reusable Tooling (P7)

The building blocks of a software factory and how they relate.

Let's do a quick recap of the concepts here before going into the phases. The core building blocks of your software factory are:

| Element | Purpose | Example |
| --- | --- | --- |
| Custom instructions | Project ground truth, always in context | CLAUDE.md with conventions, stack, and constraints |
| Documents | Structured guidelines and patterns | API design guidelines, architecture decision records, coding standards |
| Prompts | The command you or another agent give to an agent to start or guide the work | "Create an implementation plan for the next unimplemented user story from feature 34331" |
| Model | The LLM that generates the output | GPT-4, Gemini, Sonnet, Opus |
| Agent specification | The recipe for an agent's behavior, including its role, tasks, and context | A YAML/md file defining a "planning agent" that uses Opus and has access to project documentation to analyze and create a clear, detailed plan for a given feature |
| Tools | The toolset provided by the IDE/code platform or offered via MCP | Edit file, run agent, find facts from the code, seek documents from the web |
| Gating | The mechanisms to control and monitor agent actions | Approval workflows, automated checks, and validation steps |
| Skills and reusable tooling | Another way to expose capabilities to models; an economical alternative to MCP | Custom scripts, reusable functions, and shared libraries that agents can call to perform common tasks, like driving a browser, reading a PDF, or doing a semantic search on the codebase |

Phase 1: Custom instructions

Start with the "always true" things about your project: a CLAUDE.md, copilot-instructions.md, or equivalent file that captures the conventions, technology choices, and constraints that every agent interaction should respect. Keep this compact. If it grows beyond what fits comfortably in an agent's context window, it's too large. Compare it to your corporate guidebooks: you might have workplace safety rules, stock market rules, GDPR and whatever else the European Union has produced (AI Act, anyone?), but the coding agent certainly won't need to read all of them.

Begin with a brief project intro, perhaps a short vocabulary, technical facts like languages, tech stack, and documentation locations, and something generic about the process and architecture. Hard stops and must-not-do's, like "never add any secrets such as API keys or connection strings to code," belong here too.
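As a sketch, a compact starting CLAUDE.md might look like the following; every name and detail here is a placeholder to show the shape, not a prescription:

```markdown
# Project: Order Portal

Internal portal for managing customer orders. "Order" and "shipment"
are distinct concepts; see docs/vocabulary.md.

## Stack
TypeScript, React, Node 20, PostgreSQL. Tests with Vitest.

## Conventions
- Follow the patterns in docs/patterns/; when unsure, ask before inventing.
- Feature documentation lives in docs/features/.

## Hard stops
- Never add secrets (API keys, connection strings) to code or config.
- Never change the database schema without an approved migration plan.
```

Note how much it leaves out: anything that is only sometimes true belongs in the documentation hierarchy, not here.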

The first version will be wrong. That's fine. The value of Phase 1 is learning what belongs in project-level context and what doesn't. You'll rewrite this file three times before it stabilizes.

Phase 2: Hierarchical documentation structure

As the project matures, the single instruction file isn't enough, or it becomes so big that it eats most of your context window. Remember that even basic system prompts, agent recipes, tool specifications, and the rest easily add up to tens of thousands of tokens on their own.

So, design a documentation hierarchy as suggested in this book: divide guidelines, technology decisions, architectural patterns, domain-specific context, and things like standard UI patterns into individual files. Also describe the process and gating once you're ready with that.

Your document library will grow over time, especially now that you have (over?)eager AI assistants adding content, so a good index is essential for your agents to find the right context without reading everything. Structure your library so that agents can be pointed at the relevant subset rather than ingesting the whole tree.
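A minimal index file sketch that lets agents navigate without reading everything; all paths are illustrative:

```markdown
# Documentation index

- docs/architecture/decisions/: ADRs, one file per decision
- docs/patterns/api.md: response handling, error schemas
- docs/patterns/ui.md: standard pages, controls, naming
- docs/domain/vocabulary.md: domain terms and their meanings
- docs/process/gating.md: stages, gates, review criteria

Agents: load only the files relevant to the current task.
```

One short file like this, referenced from the custom instructions, is usually enough to route an agent to the right tier.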

See Chapter 17 for the three-tier documentation model (project-level, task-specific, tracking) and practical techniques for organizing context.

Phase 3: Task tracking and gating design

Introduce structured task tracking that agents can consume and update. I suggest having this in place early on, not for the sake of a 100% complete paper trail, but because this is the key piece to get right for anything else to work properly.

File-based approaches work well for revision control and collaboration. Key design decisions include task granularity, status transitions, and the gates you want between phases. Consider what integration with existing tools (Jira, Azure DevOps) is required versus what can live in the repository. Also check the ready-made tools: I've mentioned GSD, and also Beads, as candidates. Or just let the AI generate a tool that matches your exact needs.
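A file-based task record might look like this sketch, with status transitions an agent can read and update; the identifiers, states, and gates are illustrative:

```markdown
# Task 34331-03: Add filter endpoint

Status: in-review   <!-- planned -> in-progress -> in-review -> done -->
Feature: 34331
Branch: feature/34331-filter-endpoint

## Gates
- [x] Plan approved (human)
- [x] Implementation complete, tests green
- [ ] Review agent passed
- [ ] Human sign-off
```

Because it's a plain file in the repo, it rides along in the same branch and PR as the work it tracks, and every state change is version-controlled.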

Phase 4: First agents for selected phases

Don't try to automate every phase at once.

I'd start with the planning agent instead of the coding agent. I know most people find the coding agent the most tempting. But my point all along has been that you'll need this spec-driven approach whether you like it or not.

Use models best suited for each agent type (reasoning models for planning, faster or code-oriented models for implementation, UI-focused models for interface design). I've preferred the ever-enthusiastic Sonnet as the orchestrator, GPT Codex for code generation, Gemini for UI-related work, and Opus for planning.

Next, I'd introduce a decent test planner and start TDD-driven planning early. As with hand-crafted code, it's very hard to get good, representative coverage after the fact, especially without proper functional specifications.

Phase 5: Manual gating and review

What this means is that you are in the loop between the major stages: planning ready, coding ready, testing ready, review ready. Once you learn the correct bite size, your codebase is structured and documented, and you are past the project bootstrapping phase, you can ease up.

Phase 6: Automated gating and guardrails

Once you understand the failure modes and the points where things usually go sideways, you can introduce automated checks. We've implemented architecture rule enforcement and convention validation, backed by deterministic tools (like static code analysis) and partly by AI capabilities.

Define what manual intervention points are genuinely necessary versus what can be safely automated. I'd keep the requirement to review and approve the plan before implementation.

I've used a special orchestrator agent for a "YOLO" or "autopilot" mode that coordinates the agents for easier features, but only after a few iterations, once the groundwork for the project was properly laid. The workflow begins with issuing a planned task, which is delegated to a planning agent; if everything checks out, it's taken from there, and the manual review may be left optional or cursory.

Phase 7: Skills and tooling

Develop reusable skills and command-line tools for operations that should "always work": basic scaffolding, common transformations, standard validations, compilations, better-than-grep semantic search tools, and so on. Chances are you'll find excellent ones online (see, for instance, skills.sh or the awesome skills lists on GitHub).
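A "standard validation" skill can be a tiny script an agent (or a human) runs before handing a spec to the planner. This is a sketch under assumptions: the section names and file layout are invented for illustration, not a standard.

```python
# Illustrative "always works" skill: verify a user-story file contains the
# sections a planning agent expects. Section names are assumptions, not
# a standard; adapt them to your own specification template.
import sys

REQUIRED_SECTIONS = ("## Context", "## Acceptance criteria", "## Out of scope")

def check_story(text: str) -> list[str]:
    """Return the required sections missing from a story file."""
    return [s for s in REQUIRED_SECTIONS if s not in text]

if __name__ == "__main__" and len(sys.argv) > 1:
    missing = check_story(open(sys.argv[1]).read())
    if missing:
        print("missing sections:", ", ".join(missing))
        sys.exit(1)
```

Because the check is deterministic, it costs nothing to run on every story, and it removes a whole class of "agent guessed because the spec was incomplete" failures.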

These are important building blocks of the agentic software factory which offload repetitive work from both humans and agents and reduce the surface area where things can go wrong. As a summary, check out the diagram below.

Trust and verification

Governed delivery is not a faith-based initiative. If the process is working, specific numbers improve. If they don't, something is wrong: either with the specs, the gates, the agent configuration, or the team's understanding of the boundary between beef and boilerplate.

Four metrics worth tracking:

| Metric | What It Measures | What "Good" Looks Like | What Bad Numbers Tell You |
| --- | --- | --- | --- |
| Plan rejection rate | How often specs come back for revision at the review gate | Below 20% after the team has calibrated | Specs are too vague, or the team hasn't internalized what "good enough" means |
| Defect escapes | Issues found in production that should have been caught by gates | Trending downward; no repeat categories | Gates are checking the wrong things, or acceptance criteria are incomplete |
| Rework time | Time spent revising agent output before it's acceptable | Decreasing over time as specs improve | Spec precision is too low for the story type, or conventions aren't being followed |
| Reviewer fatigue | Subjective rating from reviewers (survey or standup check-in) | Stable or improving; reviewers feel the output is getting easier to assess | Specs are degrading, agent output is inconsistent, or review criteria are unclear |

Plan rejection rate is your leading indicator. It tells you whether the team is getting better at the beef, the upfront specification work that determines everything downstream. Track it weekly. When it drops, the process is calibrating. When it creeps back up, investigate: new team members? Changed requirements patterns? Specification template drift?
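Weekly tracking really can live in a spreadsheet, but as a sketch, the computation is one line. The data shape below (one boolean per reviewed plan, `True` meaning it was sent back) is illustrative.

```python
# Plan rejection rate from a simple review log. Each entry records whether
# a spec was rejected at the review gate (True = sent back for revision).
# The data shape is an assumption for illustration.
def rejection_rate(reviews: list[bool]) -> float:
    """Share of specs sent back for revision in the given period."""
    return sum(reviews) / len(reviews) if reviews else 0.0

week = [False, True, False, False, False]  # one of five plans rejected
assert rejection_rate(week) == 0.2         # right at the 20% threshold
```

Compute it per week and watch the trend, not individual data points; a single bad week means less than three rising weeks in a row.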

I don't have hard measurements, but from the early stages to now the chances of getting most things right on the first pass have at least doubled. The specs got better, the context got tighter, and the agent output got more predictable. That said, most of the improvement comes from learning what to specify, not from the tools getting smarter.

Defect escapes is your lagging indicator. It tells you whether the gates are actually catching what matters. If the same category of defect escapes repeatedly (say, integration issues between services), that's a signal to add a gate check, not to blame the agent.

Don't over-instrument. These four metrics can be tracked with a spreadsheet and a weekly standup question. The goal is signal, not surveillance. If tracking the metrics costs more attention than it saves, simplify.

Summary

Once everything is in place, you'll have covered most of what's needed to get the most out of your agents while avoiding the common pitfalls. The result will look something like this (though it's not limited to it).

From Model to Factory

Start small, add layers as you learn:

  • LLM Model (Sonnet, Opus, GPT, Gemini)
  • receives a Prompt (user story, bug report, task specification)
  • enriched by Project Context (CLAUDE.md: conventions, stack, constraints, patterns)
  • structured by an Agent Recipe (agent.md: purpose, steps, entry/exit criteria, scope)
  • equipped with Tools via MCP (file ops, search, build, test, deploy, API calls)
  • packaged as Skills (SKILL.md: scaffolding, validations, transformations)
  • orchestrated into Workflows (handoffs, gates, state tracking, multi-agent coordination)

Everyone starts at the top; build up as you learn.

The building blocks of agentic delivery, from a bare model to orchestrated workflows.
  • The "beef" is the hard, creative work that only you can do. The "boilerplate" is the repetitive, pattern-based work that agents can handle. Don't delegate the beef; it just moves the problem downstream.
  • Spec precision should be calibrated to the risk and complexity of the task. Don't under-specify cross-cutting features or architecture changes, and don't over-specify simple bug fixes.
  • Avoid anti-patterns in both specification and agent design. Clear, explicit specs and well-defined agent roles are essential for success.
  • Build your software factory incrementally. Start with project context and documentation, then add agents and gates one phase at a time. Don't try to automate everything at once.