16

The Beef and the Boilerplate

What actually works, what doesn't, and where to start


Before you begin

If you want to see governed AI development at the smallest possible scale, my game is the example. One developer, one agent, no infrastructure beyond a text editor and an API key. Everything in this chapter started there and grew outward.

If you made it all the way here, you understand why accountability matters more than speed, why specifications narrow the solution space, and why gates enforce discipline.

Now let's talk about how to actually do it.

A typical starting point looks something like this. New project, new team, half the people have never used AI tools beyond ChatGPT, and the other half have used something different, individually, on a different stack. Everybody's eager. Licences are in, agents are running, and the first week produces an impressive amount of code. Looks great in the demo. Then week two hits and you notice: three different error handling patterns, two competing folder structures, tests that mock things in ways nobody agreed on, and a merge conflict rate that makes you question your life choices. The tools worked fine. The problem was that nobody had defined what "fine" means for this project. On top of that, the inherent randomness of outputs means that even with the same instructions, you get different results every time, and the sheer mass of generated code to verify is just too much too quickly. Getting to factory-level consistency takes weeks at minimum, often more.

What I'm going to cover next is:

  • The distinction between work that requires your judgment and work where agents perform
  • Calibration for how much specification effort different work actually needs
  • Failure modes that destroy value, presented as concrete anti-patterns
  • A phased approach for building the capability incrementally
  • Metrics that tell you whether the process is working
  • A checklist you can use as a working template, starting next week.

A typical starting situation you're likely to encounter:

  • â˜‘ī¸The parent corporation has purchased the entry-level licences for some AI development tool
  • â˜‘ī¸The team has had some hours of training but no real practice beyond basic prompting and trials
  • â˜‘ī¸You might have some existing codebase and documentation, but no idea on how to bring them in. Typically they are not in good order.
  • â˜‘ī¸There's no tradition whatsoever to actually manage information or specify work in detail sufficient for agents to succeed
  • â˜‘ī¸Initial trials with default agents, such as planning and build agents have been unsatisfactory

The beef and the boilerplate

In order to figure out the correct level of abstraction for your specifications, you need to understand the fundamental distinction between two categories of work in AI-assisted delivery: the "beef" that requires real craft, and the "boilerplate" where agents earn their keep.

Next, I'll explain why this matters.

The beef: Work that requires your craft

The beef is the set of activities where your judgment is irreplaceable. This is where you say what to build.

For the beef to be actually digestible (let's say medium), we need to ensure that the following are in place:

| Principle | Guidance |
| --- | --- |
| Ambiguity | AI can't infer unstated assumptions or implicit intent. Say what you mean. If you want a "user-friendly error message," specify what that message should say and when it should appear. If you want "good performance," specify the latency requirements and load conditions. |
| Clarity | Keep to the *what*. Focus on the outcome, not the implementation. "The list of items should be sorted by creation date by default" is clearer than "the API should use a sorted data structure." The former leaves room for the agent to choose the best implementation; the latter prescribes implementation details. |
| Decomposition | Engineering 101: make a bigger problem solvable by breaking it into digestible steps with clean boundaries. This also lets you detect dependencies and order tasks correctly. A story that touches six services and three APIs and requires a database migration is not something to one-shot by default, or at least should be considered carefully; if it's more like adding a new field to the UI, it might be fine. It depends. The same applies to features: "build a fully-featured CRM system" is not a single feature but a sequence of features that build on each other, and you need to feed them in carefully planned order with guardrails and intermediate checkpoints in between. |
| Architecture | Specify project-wide patterns, interfaces, constraints, and boundaries. Litter your documentation with examples. Your agents will eagerly generate thousands of lines of seemingly correct code that is simply different from everything else in the repo. Not only does this become a maintenance nightmare, but future agent sessions will struggle to detect the patterns if they haven't been applied systematically in the existing codebase. |

I've found it useful to refer to existing modules/code files/pages/components and whatnot as examples to compare to when planning a feature with our planning agent. This establishes a clear pattern for the agent to follow and reduces the chances of it inventing its own style that doesn't fit with the rest of the codebase. And if these discoveries are persisted in your plans already during planning, the engineering agent can follow them instead of redoing the codebase-wide greps and getting it wrong.

Acceptance criteria. The Pareto principle is very much alive in the agentic world as well. Much of the error budget goes to edge cases, however improbable they might be. Vagueness does not help here either: spend time, for example using the Gherkin format, to define the paths that need to be covered, and ask AI to discover more. "It should work correctly" is not a criterion.
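As a sketch of what that effort looks like, here is a Gherkin-style fragment for a hypothetical list filter; the feature, messages, and limits are all illustrative, not from any real project:

```gherkin
Feature: Item list filtering

  Scenario: Filter returns no matches
    Given 3 items exist and none match the filter "archived"
    When the user applies the filter "archived"
    Then the list shows an empty state with the message "No items match your filter"
    And the filter control remains active

  Scenario: Filter input exceeds maximum length
    Given the filter input is 256 characters long
    When the user applies the filter
    Then the request is rejected client-side with a validation message
```

Even two scenarios like these give the agent concrete paths to cover, and give you something to ask it to extend.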

Review judgment is about knowing when output is correct versus when it merely compiles and passes tests. Agents can produce code that is syntactically perfect, test-passing, and architecturally or functionally plain wrong. Only a person who understands the system's intent can catch these.

Other talking points you might have come across are that AI generates technical debt at scale and introduces subtle bugs that are hard to detect. Both of these, among others, are valid concerns to be taken seriously: it's a whole other ballgame to spend weeks refactoring in the middle of a project, or to start over, especially after you've already shipped something, with AI or not. Don't trust that you can just "make things right later." You really can't.

If you delegate the beef, you don't save time. You move it to the review gate where it costs more, or worse, to production where it costs the most. Shift left, folks!

Skip the boilerplate: Where agents earn their keep

The boilerplate consists of activities where agents reliably produce acceptable output, the things that should be rather straightforward. In short, "boilerplate" means making the boring part automatic.

Examples of these kinds of activities include:

| Task type | Guidance |
| --- | --- |
| Scaffolding | Project structure, boilerplate files, configuration. Given a convention file that says "we use this folder structure, these naming patterns, this test framework, in this kind of project," an agent can set up new projects and modules consistently. |
| Pattern application | Give an example page, control, or module, and ask the agent to follow that pattern for the new thing you need. The pattern can be anything from naming to code file organization to an architectural pattern. |
| Test generation from specs | If you've done your homework in the "beef" part well, this is relatively easy. You've already defined the acceptance criteria, so the agent's job is to translate those into test code. This is a perfect example of where agents save time: they can generate thorough test suites that cover all the specified scenarios, including edge cases you might have missed. |
| Glue code | REST model to display model, data transformation, API integration. This is the "connect the dots" work that is tedious but straightforward. If you specify the inputs, outputs, and transformation logic, agents can reliably produce this code. Pass in a sample result, or better yet the OpenAPI spec of your service, and perhaps a sample output, and let the AI do the translation work for you. |
| Documentation from code | Consider carefully what documentation to generate and maintain per feature or per concern (like patterns). Slop that is easily regenerated is rarely worth committing and will not be kept up to date. But as a feedback-loop stage, for instance after you've changed your API or data structures or introduced a new pattern or condition, regenerating documentation is a logical and necessary point to review and refine. |
Two caveats. First, unless carefully isolated, generated tests may be entirely rubbish: working at the wrong abstraction level or, in the worst case, written to fit the code rather than the requirements. Proper E2E tests in particular need a black-box approach that tests for outcomes, and often require explicit instrumentation (like data-id attributes). AI is also often overenthusiastic, generating far too many tests and turning test maintenance into a nightmare.
Second, generated documentation can become a maintenance nightmare of its own. You might have noticed your codebase being polluted by dozens of GUIDE.md, IMPORTANT.md, NOTES.md, QUICKSTART.md, and REFERENCE.md files in each folder. (Yes, Anthropic, it's you, hello!) In the end you'll have dozens of them with no idea whether they are relevant or up to date. Be careful what you commit.
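To make the glue-code category concrete, here's a minimal sketch of a REST-to-display-model transformation. The `ApiUser` and `UserRow` shapes are hypothetical stand-ins, not from any real API:

```typescript
// Hypothetical wire format returned by a REST endpoint.
interface ApiUser {
  id: string;
  first_name: string;
  last_name: string;
  created_at: string; // ISO 8601 timestamp
}

// Hypothetical display model consumed by the UI layer.
interface UserRow {
  id: string;
  fullName: string;
  createdAt: Date;
}

// The "connect the dots" transformation: trivial, tedious, and exactly
// the kind of code an agent can produce reliably from a precise spec.
function toUserRow(api: ApiUser): UserRow {
  return {
    id: api.id,
    fullName: `${api.first_name} ${api.last_name}`.trim(),
    createdAt: new Date(api.created_at),
  };
}
```

Specify the input shape, the output shape, and the mapping rules, and this is exactly the kind of output you can expect back unchanged.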

The guardrails for delegated work

An essential safety mechanism for delegated agent work is to erect boundaries and hard guardrails.

At the top level, these fences are baked-in state tracking, which requires certain tasks, states, and viewpoints to have been considered, and isolation of the unit of work from already-verified content through branching and PRs.

Once you have a good agreement and tooling to support that, you should consider the feature-level guidance to keep things on track. Consider the following:

  1. Name the pattern. Instead of vague "follow existing conventions," use "follow the 'Response Handling' pattern in api.md." Yeah, however good your context engineering skills are, referring to tools and documents explicitly by exact names is sometimes the best way to go.
  2. Constrain the scope. Some say that even more important than specifying what needs to be done is specifying what doesn't. So, state explicitly what's excluded: "don't add pagination", "don't change the database schema", "don't create new API endpoints beyond those specified", "don't add new controls or pages" and so forth.
  3. Define the review criteria. For example, "Check that code compiles" or "Verify that changes comply with Coding Conventions". This should be offloaded to your stage-specific review instructions and/or agent and performed in a clean context or spawned subagent, preferably with a different model than the one that wrote the plan, document, or code files waiting for commit.
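Put together, a feature-level guardrail block in a task spec might look like this sketch; the pattern and file names are placeholders to show the shape, not a prescribed format:

```markdown
## Guardrails

- Pattern: follow the "Response Handling" pattern in docs/api.md
- Scope: only the filter endpoint
- Excluded: pagination, sorting, schema changes, new API endpoints
- Review: code compiles; changes comply with docs/coding-conventions.md;
  review runs in a clean context with a different model than the author
```

A block like this travels with the task, so the implementation agent, the review agent, and the human all check against the same fence.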

How precise does the spec need to be?

The current generation of developers is not used to specifying work in detail, and to be frank, it was not much better pre-Agile either; or at least it was often different people who did the planning.

So I've been thinking a lot about the suitable level of detail: realistic to produce, yet detailed enough for the agents. A sanity check is also needed: a quick bug fix, a small added detail, or a minor layout adjustment might not need anything on paper.

Chapter 13 made the case for why specification matters and introduced three maturity levels (spec-first, spec-anchored, spec-as-source). Here's the practical side: how much specification effort different work actually needs. Generally speaking, the Calibration Principle applied to agentic software engineering says that specification effort should be proportional to the cost of getting it wrong. I'll postulate the following:

| Story Type | Spec Precision | What to Include | Expected Error Classes | Review Focus |
| --- | --- | --- | --- | --- |
| Bug fix (isolated, low risk) | Minimal: 2–3 sentences + test case | What's broken, expected behavior, reproduction steps | Wrong fix location, incomplete regression test | Does the fix match the symptom? Any side effects? |
| Small feature (single component) | Moderate: structured spec with acceptance criteria | Functional requirements, component boundaries, test scenarios | Missing edge cases, implicit scope assumptions | Does it do what was asked and nothing more? |
| Cross-cutting feature (multiple services) | Full: structured spec + architecture notes + task breakdown | API contracts, data flow, error handling, deployment order | Interface mismatches, ordering bugs, partial failures | Do the pieces fit together? Is the integration tested? |
| Architecture change | Full: ADR + multi-step plan + rollback strategy | Decision rationale, migration path, compatibility constraints, success criteria | Regression in existing behavior, missed dependencies, performance impact | Is the migration safe? Is rollback possible? |

I encourage discussing these things with your team; it's nothing new and something that should be agreed even without AI. Also consider measuring them as I suggest in Knowing It Works, to get data on the tasks that consistently come back for revision due to bugs, missing features, or architectural issues.

Look out for the ones that look small but aren't: a "simple UI change" that touches shared state, a "quick API update" that affects three consumers, a "minor refactor" that shifts module boundaries. Use AI (and your own judgment) to estimate the blast radius of changes and adjust your spec accordingly.

Antipatterns and bad, bad practices you should avoid

As I mentioned in the "focus on DONT's over DO's" principle in Chapter 9, it's often more instructive to look at what not to do than what to do.

This applies at the pattern level too. Below I've collected some well-known anti-patterns from the literature, plus some I've discovered in my own work.

Agentic Delivery Antipatterns

27 failure modes across five categories:

  • Specification (6): vague inputs produce unpredictable outputs
  • Agent Design (6): wrong boundaries produce wrong results
  • Process (5): broken workflows amplify every other problem
  • Review (5): the last gate is only as good as its criteria
  • Security & Governance (5): speed without guardrails is a liability

Many of the DON'Ts and DO's here are essentially prompt engineering applied to the entire delivery process. The change is that we feed the prompts through planning artifacts, from files or similar, instead of writing them ourselves. In the end, however, anything you write or synthesize with AI is a prompt for the next step.

Specification anti-patterns

Here's what to say, and what not to say, when writing specifications for features or for feature-independent general project documentation. You'll get the idea; the entire point is to be specific and explicit. Once you've set up your Software Factory and all the conventions and patterns are properly in place, you can loosen the noose.

| Don't | Do | What Goes Wrong |
| --- | --- | --- |
| "It should handle errors gracefully" | "On 4xx, return error schema X. On 5xx, log to Y and return generic message. On timeout, retry once with backoff Z." | Agents often invent practices for error handling. Review catches inconsistency too late. |
| "Implement authentication" | "Add JWT session auth using middleware X, storing tokens in httpOnly cookies, with 24h expiry, following the pattern in auth-service.ts" | Multiple approaches across runs. Agent picks OAuth one time, sessions the next. |
| "Just use the existing pattern" | "Follow the repository pattern in user-repository.ts, including the error handling in lines 30–45 and the logging convention in lines 50–55" | "Existing pattern" means different things to different runs. Three implementations, three styles. |
| One story for a full feature | Break into: API endpoint, frontend component, integration test, migration. Each with its own spec and gate | Agent loses coherence on large tasks. Middle sections get less attention than beginning and end. |
| Omitting exclusions | "Scope: only the filter endpoint. Do NOT add sorting, pagination, or modify the existing list endpoint." | Agent adds features you didn't ask for. Creative completion is a feature of LLMs. You need to constrain it. |
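To show what a precise error-handling spec like "On 4xx, return error schema X. On 5xx, log to Y and return generic message" buys you, here's a hedged TypeScript sketch of the code it pins down. The schema, logger, and messages are stand-ins for whatever your spec actually names:

```typescript
// Hypothetical error schema and log target; stand-ins for "schema X"
// and "log to Y" in the spec.
interface ApiError {
  code: string;
  message: string;
  retryable: boolean;
}

const log = (msg: string): void => {
  console.error(msg);
};

// Deterministic mapping from HTTP status to error behavior, exactly as
// the precise spec dictates. No room left for the agent to invent.
function classifyError(status: number, body: string): ApiError {
  if (status >= 400 && status < 500) {
    // 4xx: surface the structured error to the caller.
    return { code: `CLIENT_${status}`, message: body, retryable: false };
  }
  if (status >= 500) {
    // 5xx: log the details, return a generic message, allow a retry.
    log(`upstream failure ${status}: ${body}`);
    return {
      code: "UPSTREAM",
      message: "Service unavailable, please retry.",
      retryable: true,
    };
  }
  return { code: "NONE", message: "", retryable: false };
}
```

With the vague version of the spec, two agent runs could plausibly produce two incompatible versions of this function; with the precise version, every run converges on the same behavior.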

Agent design anti-patterns

Another group of anti-patterns I've encountered concerns the agents themselves. Remember, an agent is still an LLM call with a "recipe": the intent and the tasks the agent is supposed to perform in your execution pipeline, plus the context it should operate with (files, category, module). The context may already be specified during planning, but the agent should be free enough to find more when needed.

To recap, all these "agents" really are is: Agent = Recipe + Context + Tools.

In practice, this is just an .md file with thin YAML frontmatter. The "standard" (note the quotes) YAML is just a name, a tool list, and the model to use, followed by what is basically a prompt to the LLM. All agents should be general enough to perform any task your planning practice produces in the provided context.
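For illustration, a minimal agent specification file in this style might look like the sketch below; the field names, tools, and model are placeholders meant to show the shape, not any particular vendor's exact schema:

```markdown
---
name: planning-agent
model: opus
tools: [read_file, grep, list_dir]
---

You are a planning agent. Given a user story, produce an implementation
plan: affected files, task breakdown, referenced patterns, and explicit
exclusions. Do not write code. Stop and ask if the story is ambiguous.
```

Everything below the frontmatter is simply the recipe: a prompt that defines the role, the workflow, and the stopping conditions.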

As you learn more, go ahead and create specialists for different kinds of tasks, like separate backend, frontend, integration, and DBA agents. The same bad practices will haunt them too, unless you pay attention.

| Don't | Do | What Goes Wrong |
| --- | --- | --- |
| Feed the entire codebase as context | Point agents at specific files, modules, or documentation tiers relevant to the current task | Token waste. Attention dilution. Agent fixates on irrelevant code. |
| Let agents make architecture decisions | Provide architecture decisions as input constraints, not as questions for the agent to answer | Agent picks a reasonable-looking pattern that conflicts with your system's trajectory. Expensive to undo. |
| No iteration limits | Set explicit retry bounds (e.g., "attempt implementation max 2 times, then stop and report") | Runaway loops. Agents that "keep trying" consume tokens and produce increasingly divergent output. |
| One agent does everything | Separate planning, implementation, testing, and review into distinct agents with distinct context | Role confusion. Planning considerations leak into implementation. Test quality drops when the same agent wrote the code. |
| Skip the planning agent | Always run a planning pass before implementation, even for "simple" tasks | Implementation without a plan is a spec-free zone. The agent guesses scope, structure, and boundaries. |
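The "no iteration limits" anti-pattern has a simple remedy that can be sketched as a small driver loop; `runWithRetryBound` and the `AgentFn` signature are illustrative stand-ins for whatever actually invokes your implementation agent:

```typescript
interface AgentResult {
  success: boolean;
  output: string;
}

// Stand-in signature for a real agent invocation; in practice this
// would call your agent runtime with the task spec as the prompt.
type AgentFn = (task: string, attempt: number) => AgentResult;

// Bounded retry: attempt at most `maxAttempts` times, then stop and
// report, instead of looping until the token budget is gone.
function runWithRetryBound(
  agent: AgentFn,
  task: string,
  maxAttempts = 2
): AgentResult {
  let last: AgentResult = { success: false, output: "not attempted" };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = agent(task, attempt);
    if (last.success) return last;
  }
  return {
    success: false,
    output: `gave up after ${maxAttempts} attempts: ${last.output}`,
  };
}
```

The key design choice is that the bound lives in the harness, not in the prompt: "please stop after two tries" in the instructions is a suggestion, while a loop counter is a guarantee.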

Process anti-patterns

If you're serious about automation, which is entirely achievable with modern AI tech, you need to be equally serious about the process. Here are the pitfalls I see most often.

| Don't | Do | What Goes Wrong |
| --- | --- | --- |
| Set up everything at once | Build capability incrementally: start with project context, then docs, then one agent, then gates | Teams overwhelm themselves before seeing any value. Complexity without calibration. |
| Auto-approve gates | Run gates manually first. Automate only after the team knows what "good" looks like at each stage | "All tests pass" becomes the definition of done. Architecturally wrong code ships because nobody actually looked. Tests may be wrong or insufficient. Subtle bugs slip through. |
| Let conventions erode silently | Track plan rejection and rework rate. Investigate template and convention drift | Specs that worked last month stop working. Same as accepting always-failing automated tests. |
| Use private setups | Provide AI environments with visibility into tools, MCP servers, and context used. Share all the setups (skills, instructions, practices). That's the "next level" of AI coding! | Missed learning opportunities. You become the "AI support guy" who is supposed to fix the issues while others enjoy the ride. |
| "Vibe code" through features | Pause after each agent session to review coherence, not just correctness. Stay in the loop. | "Building, building, building" without stepping back. Result: duplicate logic, mismatched names, no coherent architecture. |

The anti-pattern that cost me the most time was one-shotting: letting the agent attempt an entire feature in a single pass. Despite explicit instructions to work task by task, the agent would regularly try to implement everything at once, touching files it shouldn't, creating dependencies between components that should have been independent, and generally making a mess that took longer to untangle than it would have taken to do it properly. The only fix that stuck was limiting scope to a single task per invocation. Not "please do one task at a time" in the instructions. Literally one task, one run.

Security & governance anti-patterns

This category of don'ts is not only about agentic development; these apply to all use of generative AI systems. The more you offload to AI without proper oversight, the more risk surface you expose, from prod connection strings injected into your codebase to leaks via various tool memories.

| Don't | Do | What Goes Wrong |
| --- | --- | --- |
| Paste sensitive data into LLMs | Define clear data handling policies; provide sanitized test data for agent context | Credentials, PII, and proprietary code leak outside your security boundary, whether pasted directly or pulled in via context. |
| Skip security review of AI output | Secure software development was hard before AI and certainly remains so. Bring in experts early for review. Integrate automated security scanning. Treat all generated code as untrusted input | The code looks right but isn't. A hard security audit fails and blocks rollout. Sensitive customer or company data is exposed, or you get hacked. |
| Ignore code or tool origins | Review licensing implications. Never allow npm install or similar without checking. | Malicious packages. License violations. Unintended dependencies that become attack vectors. |
| Let AI mask skill gaps | Require developers to explain generated code before it merges. Invest in training alongside AI adoption | Nobody understands the generated code. When it breaks, nobody can fix it. The team has a dependency, not a capability. Everybody ends up in the dumb zone. |
| "The AI wrote it" as excuse | Treat AI output exactly like any other output for review and accountability purposes | Accountability evaporates. Quality drops. Defect ownership becomes a blame game. |
| "Let an agent replace your cloud engineer" | You still need people who know the technology and its sustainable usage | IAM/IDP/secret management and other security configurations are not something you can just "ask the agent to do". You need people who understand the security implications and how to properly set up the guardrails. And if you're doing IaC, you also risk an eyewatering bill from Jeff or Bill next month. |

Building your software factory

So how to get started with the factory? The following sections suggest a plan for building it gradually, applying the same incremental approach as Chapter 8's Adoption Ladder. The point is not that exactly these seven layers are the gold standard (obviously), but to illustrate the logical progression of how I've tried to do this. It's best to build on a strong foundation (solid project context, good documentation structure, task tracking) before jumping to parallel agent runs coding the new major ERP (would be good riddance, btw) in days.

Building the Machine

Seven phases of incremental capability, each building on the previous:

Foundation
  1. Custom Instructions: CLAUDE.md or equivalent with conventions, stack, constraints
  2. Documentation Hierarchy: guidelines, architecture decisions, patterns, domain context

Structure
  3. Task Tracking & Gating: task granularity, status transitions, gate criteria
  4. First Agents: planning agent first, then testing, review, deployment

Calibration
  5. Manual Gating & Review: a human reviews every gate to learn what "good" looks like
  6. Automated Gating: static analysis, architecture rules, convention checks

Maturity
  7. Skills & Tooling: scaffolding, transformations, validations, continuously refined

Seven phases of building governed agentic delivery: from project context to reusable tooling.

All this is obviously going to be iterative. The tools evolve, models and people change, your codebase and documentation will evolve, and your understanding of what belongs in the project context and how to specify work will improve. So alongside the regular process your factory runs, you need a maintenance process for the factory itself.

So what kinds of things should you look out for and learn about? I've tried to summarize this in the next image.

Anatomy of the Software Factory

Building blocks and their relationships, aligned with the phased approach:

Agent = Model + Specification + Tools + Context

Context (loaded into the agent)
  • Custom Instructions (P1): CLAUDE.md, copilot-instructions.md; project-level conventions, stack, constraints
  • Documents (P2): architecture decisions, patterns, conventions, domain context; hierarchical and navigable
  • Prompts (P3): task specifications, user stories, acceptance criteria; the specific work to be done

Agent (P4)
  • Model: the engine (Opus, Sonnet, GPT, Gemini), matched to the agent's task type
  • Agent Specification: the recipe (role, workflow, entry/exit criteria); an .md file with thin YAML frontmatter
  • Tools: the hands (file ops, search, build, test, deploy), extended by reusable Skills

Gating (the agent's output is reviewed by)
  • Manual Review (P5): human-in-the-loop at every gate; calibrate before you automate
  • Automated Gates (P6): static analysis, architecture rules, convention checks; encode what you learned

Accelerated by Skills & Reusable Tooling (P7)

The building blocks of a software factory and how they relate.

Let's do a quick recap of the concepts here before going into the phases. The core building blocks of your software factory are:

| Element | Purpose | Example |
| --- | --- | --- |
| Custom instructions | Project ground truth, always in context | CLAUDE.md with conventions, stack, and constraints |
| Documents | Structured guidelines and patterns | API design guidelines, architecture decision records, coding standards |
| Prompts | The command you or another agent give to an agent to start or guide the work | "Create an implementation plan for the next unimplemented user story from feature 34331" |
| Model | The LLM that generates the output | GPT-4, Gemini, Sonnet, Opus |
| Agent specification | The recipe for an agent's behavior, including its role, tasks, and context | A YAML/md file defining a "planning agent" that uses Opus and has access to project documentation to analyze and create a clear, detailed plan for a given feature |
| Tools | The toolset provided by the IDE/code platform or offered via MCP | Edit file, run agent, find facts from the code, seek documents from the web |
| Gating | The mechanisms to control and monitor agent actions | Approval workflows, automated checks, and validation steps |
| Skills and reusable tooling | Another way to expose capabilities to models; an economical alternative to MCP | Custom scripts, reusable functions, and shared libraries that agents can call to perform common tasks, like driving a browser, reading a PDF, or doing a semantic search on the codebase |

Phase 1: Custom instructions

Start with the "always true" things about your project: a CLAUDE.md, copilot-instructions.md, or equivalent file that captures the conventions, technology choices, and constraints that every agent interaction should respect. Keep this compact. If it grows beyond what fits comfortably in an agent's context window, it's too large. Compare it to your corporate guidebooks: you might have workplace safety rules, stock market rules, GDPR and whatever else the European Union has produced (AI Act, anyone?), but the coding agent certainly won't need to read all of them.

Begin with a brief project intro, perhaps a short vocabulary, technical facts like languages, tech stack, and documentation locations, and something generic about the process and architecture. Hard stops and must-not-do's, like "never add any secrets such as API keys or connection strings to code," belong here too.
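As a sketch, a compact starting CLAUDE.md might look like the following; every name and detail here is a placeholder to show the shape, not a prescription:

```markdown
# Project: Order Portal

Internal portal for managing customer orders. "Order" and "shipment"
are distinct concepts; see docs/vocabulary.md.

## Stack
TypeScript, React, Node 20, PostgreSQL. Tests with Vitest.

## Conventions
- Follow the patterns in docs/patterns/; when unsure, ask before inventing.
- Feature documentation lives in docs/features/.

## Hard stops
- Never add secrets (API keys, connection strings) to code or config.
- Never change the database schema without an approved migration plan.
```

Note how much it leaves out: anything that is only sometimes true belongs in the documentation hierarchy, not here.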

The first version will be wrong. That's fine. The value of Phase 1 is learning what belongs in project-level context and what doesn't. You'll rewrite this file three times before it stabilizes.

Phase 2: Hierarchical documentation structure

As the project matures, the single instruction file isn't enough, or it becomes so big that it eats most of your context window. Remember that even basic system prompts, agent recipes, tool specifications, and the rest easily add up to tens of thousands of tokens on their own.

So, design a documentation hierarchy as suggested in this book: divide guidelines, technology decisions, architectural patterns, domain-specific context, and things like standard UI patterns into individual files. Also describe the process and gating once you're ready with that.

Your document library will grow over time, especially now that you have (over?)eager AI assistants adding content, so a good index is essential for your agents to find the right context without reading everything. Structure your library so that agents can be pointed at the relevant subset rather than ingesting the whole tree.
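A minimal index file sketch that lets agents navigate without reading everything; all paths are illustrative:

```markdown
# Documentation index

- docs/architecture/decisions/: ADRs, one file per decision
- docs/patterns/api.md: response handling, error schemas
- docs/patterns/ui.md: standard pages, controls, naming
- docs/domain/vocabulary.md: domain terms and their meanings
- docs/process/gating.md: stages, gates, review criteria

Agents: load only the files relevant to the current task.
```

One short file like this, referenced from the custom instructions, is usually enough to route an agent to the right tier.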

See Chapter 17 for the three-tier documentation model (project-level, task-specific, tracking) and practical techniques for organizing context.

Phase 3: Task tracking and gating design

Introduce structured task tracking that agents can consume and update. I suggest having this in place early on, not for the sake of a 100% complete paper trail, but because this is the key piece to get right for anything else to work properly.

File-based approaches work well for revision control and collaboration. Key design decisions include task granularity, status transitions, and the gates you want between phases. Consider what integration with existing tools (Jira, Azure DevOps) is required versus what can live in the repository. Also check the ready-made tools: I've mentioned GSD, and also Beads, as candidates. Or just let the AI generate a tool that matches your exact needs.
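A file-based task record might look like this sketch, with status transitions an agent can read and update; the identifiers, states, and gates are illustrative:

```markdown
# Task 34331-03: Add filter endpoint

Status: in-review   <!-- planned -> in-progress -> in-review -> done -->
Feature: 34331
Branch: feature/34331-filter-endpoint

## Gates
- [x] Plan approved (human)
- [x] Implementation complete, tests green
- [ ] Review agent passed
- [ ] Human sign-off
```

Because it's a plain file in the repo, it rides along in the same branch and PR as the work it tracks, and every state change is version-controlled.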

Phase 4: First agents for selected phases

Don't try to automate every phase at once.

I'd start with the planning agent instead of the coding agent. I know most people find the coding agent the most tempting. But my point all along has been that you'll need this spec-driven approach whether you like it or not.

Use models best suited for each agent type (reasoning models for planning, faster or code-oriented models for implementation, UI-focused models for interface design). I've preferred the ever-enthusiastic Sonnet as the orchestrator, GPT Codex for code generation, Gemini for UI-related work, and Opus for planning.

Next, I'd introduce a decent test planner and start TDD-driven planning early. As with hand-crafted code, it's very hard to get good, representative coverage after the fact, especially without proper functional specifications.

Phase 5: Manual gating and review

What this means is that you are in the loop between the major stages: planning ready, coding ready, testing ready, review ready. Once you learn the correct bite size, your codebase is structured and documented, and you are past the project bootstrapping phase, you can ease up.

Phase 6: Automated gating and guardrails

Once you understand the failure modes and the points where things usually go sideways, you can introduce automated checks. We've implemented architecture rule enforcement and convention validation, backed by deterministic tools (like static code analysis) and partly by AI capabilities.

Define what manual intervention points are genuinely necessary versus what can be safely automated. I'd keep the requirement to review and approve the plan before implementation.

I've used a special orchestrator agent for a "YOLO" or "autopilot" mode that coordinates the agents for easier features, but only after a few iterations, once the groundwork for the project was properly laid. The workflow begins with issuing a planned task, which is delegated to a planning agent; if everything checks out, it's taken from there, and the manual review may be left optional or cursory.

Phase 7: Skills and tooling

Develop reusable skills and command-line tools for operations that should "always work": basic scaffolding, common transformations, standard validations, compilations, better-than-grep semantic search tools, and so on. Chances are you'll find excellent ones online (see, for instance, skills.sh or the awesome skills lists on GitHub).
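A "standard validation" skill can be a tiny script an agent (or a human) runs before handing a spec to the planner. This is a sketch under assumptions: the section names and file layout are invented for illustration, not a standard.

```python
# Illustrative "always works" skill: verify a user-story file contains the
# sections a planning agent expects. Section names are assumptions, not
# a standard; adapt them to your own specification template.
import sys

REQUIRED_SECTIONS = ("## Context", "## Acceptance criteria", "## Out of scope")

def check_story(text: str) -> list[str]:
    """Return the required sections missing from a story file."""
    return [s for s in REQUIRED_SECTIONS if s not in text]

if __name__ == "__main__" and len(sys.argv) > 1:
    missing = check_story(open(sys.argv[1]).read())
    if missing:
        print("missing sections:", ", ".join(missing))
        sys.exit(1)
```

Because the check is deterministic, it costs nothing to run on every story, and it removes a whole class of "agent guessed because the spec was incomplete" failures.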

These are important building blocks of the agentic software factory which offload repetitive work from both humans and agents and reduce the surface area where things can go wrong. As a summary, check out the diagram below.

Trust and verification

Governed delivery is not a faith-based initiative. If the process is working, specific numbers improve. If they don't, something is wrong: either with the specs, the gates, the agent configuration, or the team's understanding of the boundary between beef and boilerplate.

Four metrics worth tracking:

| Metric | What It Measures | What "Good" Looks Like | What Bad Numbers Tell You |
| --- | --- | --- | --- |
| Plan rejection rate | How often specs come back for revision at the review gate | Below 20% after the team has calibrated | Specs are too vague, or the team hasn't internalized what "good enough" means |
| Defect escapes | Issues found in production that should have been caught by gates | Trending downward; no repeat categories | Gates are checking the wrong things, or acceptance criteria are incomplete |
| Rework time | Time spent revising agent output before it's acceptable | Decreasing over time as specs improve | Spec precision is too low for the story type, or conventions aren't being followed |
| Reviewer fatigue | Subjective rating from reviewers (survey or standup check-in) | Stable or improving; reviewers feel the output is getting easier to assess | Specs are degrading, agent output is inconsistent, or review criteria are unclear |

Plan rejection rate is your leading indicator. It tells you whether the team is getting better at the beef, the upfront specification work that determines everything downstream. Track it weekly. When it drops, the process is calibrating. When it creeps back up, investigate: new team members? Changed requirements patterns? Specification template drift?
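Weekly tracking really can live in a spreadsheet, but as a sketch, the computation is one line. The data shape below (one boolean per reviewed plan, `True` meaning it was sent back) is illustrative.

```python
# Plan rejection rate from a simple review log. Each entry records whether
# a spec was rejected at the review gate (True = sent back for revision).
# The data shape is an assumption for illustration.
def rejection_rate(reviews: list[bool]) -> float:
    """Share of specs sent back for revision in the given period."""
    return sum(reviews) / len(reviews) if reviews else 0.0

week = [False, True, False, False, False]  # one of five plans rejected
assert rejection_rate(week) == 0.2         # right at the 20% threshold
```

Compute it per week and watch the trend, not individual data points; a single bad week means less than three rising weeks in a row.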

I don't have hard measurements, but from the early stages to now the chances of getting most things right on the first pass have at least doubled. The specs got better, the context got tighter, and the agent output got more predictable. That said, most of the improvement comes from learning what to specify, not from the tools getting smarter.

Defect escapes is your lagging indicator. It tells you whether the gates are actually catching what matters. If the same category of defect escapes repeatedly (say, integration issues between services), that's a signal to add a gate check, not to blame the agent.

Don't over-instrument. These four metrics can be tracked with a spreadsheet and a weekly standup question. The goal is signal, not surveillance. If tracking the metrics costs more attention than it saves, simplify.

Summary

Once everything is in place, you'll have covered most of what's needed to get the most out of your agents while avoiding the common pitfalls. The result will look something like this (though it's not limited to it).

From Model to Factory

Start small, add layers as you learn:

  • LLM Model (Sonnet, Opus, GPT, Gemini)
  • receives a Prompt (user story, bug report, task specification)
  • enriched by Project Context (CLAUDE.md: conventions, stack, constraints, patterns)
  • structured by an Agent Recipe (agent.md: purpose, steps, entry/exit criteria, scope)
  • equipped with Tools via MCP (file ops, search, build, test, deploy, API calls)
  • packaged as Skills (SKILL.md: scaffolding, validations, transformations)
  • orchestrated into Workflows (handoffs, gates, state tracking, multi-agent coordination)

Everyone starts at the top; build up as you learn.

The building blocks of agentic delivery, from a bare model to orchestrated workflows.
  • The "beef" is the hard, creative work that only you can do. The "boilerplate" is the repetitive, pattern-based work that agents can handle. Don't delegate the beef; it just moves the problem downstream.
  • Spec precision should be calibrated to the risk and complexity of the task. Don't under-specify cross-cutting features or architecture changes, and don't over-specify simple bug fixes.
  • Avoid anti-patterns in both specification and agent design. Clear, explicit specs and well-defined agent roles are essential for success.
  • Build your software factory incrementally. Start with project context and documentation, then add agents and gates one phase at a time. Don't try to automate everything at once.