14

Developer Intervention Required

User checkpoints, testing as confidence, and the regeneration option


Shift left, verify right

Where should we engineers, testers, and other specialists in the software business shift our focus to make the most impact? One thing I've noticed, partly due to the immaturity of tools and processes, is that new roles will emerge. People need to rethink what their daily tasks will actually be. I see no return to handcrafting code line by line outside very specialist areas.

One dominant theme is the shift left, as already discussed in this book. More work needs to be done up front (left) and at the end (right), letting the factory run the boring part in the middle. The explorative aspect of coding (or design) remains, but the tools operate at a much higher level. In a way, you can experiment and compare a hell of a lot faster than just a year or two ago.

We must also think about checkpoints, even hard stops, where human judgment is still required. What would these be?

I'd approach this puzzle by summoning the good ol' 5W+1H problem-framing technique:

Solution = What, Why, When, Who, Where, and How.

5W+1H assumes, appealingly for those with an engineering mindset (that's the tunnel vision a member of my family keeps talking about, right?), that to solve something, all of these questions need decent answers. We engineers, and especially software engineers, have always been obsessed with the H (How) and somewhat less with the Ws. This remains a problem to this day.

Now, engineers with their TLAs and IDEs who are told to take on an advising role about the "How" might finally give attention to "What, Why, When, Who, and Where". Perhaps this could've been a good idea all along. Anyway, in more practical terms, this new RACI could look something like this:

| 5W+1H | Checkpoint | Validation | What It Catches |
|---|---|---|---|
| What & Why | Requirements approval | AI understood the requirement correctly, scope is bounded, problem is worth solving | Building the wrong thing before any code is generated |
| When & Who | Plan approval | Phasing makes sense, task breakdown is realistic, dependencies are identified | Planning failures, unrealistic scope, missing sequencing |
| What & How | Code review | Implementation matches the plan, tests cover acceptance criteria, quality standards are met | Quality drift before code enters the main branch |
| How | Architecture compliance | Solution follows architectural boundaries, conventions, and patterns | Architectural erosion, structural decisions that compound over time |
| How | Tooling evaluation | Current models, tools, and agent setup are still the best fit | Stale infrastructure, falling behind on capabilities |
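To make the checkpoint idea concrete, here is a minimal sketch of a gate as code: each checkpoint is a named list of human questions, and nothing advances until a person has explicitly answered every one of them. The names and structure are my own illustration, not a real framework.

```python
# Sketch: checkpoints as explicit question lists; a gate passes only on
# an explicit human "yes" to every question. Illustrative names throughout.

CHECKPOINTS = {
    "requirements_approval": [   # What & Why
        "Did the AI understand the requirement correctly?",
        "Is the scope bounded?",
        "Is the problem worth solving?",
    ],
    "plan_approval": [           # When & Who
        "Does the phasing make sense?",
        "Is the task breakdown realistic?",
        "Are dependencies identified?",
    ],
    "code_review": [             # What & How
        "Does the implementation match the plan?",
        "Do tests cover the acceptance criteria?",
    ],
}

def gate(name, answers):
    """Pass only if every question got an explicit 'yes' from a human."""
    return all(answers.get(q) == "yes" for q in CHECKPOINTS[name])

# A gate with one unanswered question must block:
print(gate("code_review", {"Does the implementation match the plan?": "yes"}))
# → False
```

The point of the sketch: an unanswered question is treated the same as a "no," so silence never advances the pipeline.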

Obstacles to adoption

Getting people on board with these new job descriptions is going to be hard.

The specification discipline is new for many developers who, despite working in engineering, have been used to working like artists or craftspeople for quite some time. Luckily we are in the software industry, not in aeronautics or bridge building: we are used to making decisions on the fly and adjusting as we go.

Another challenge is that in practice you'll end up verifying the same things over and over, and the artifacts under review are often wordy, repetitive, and not that fun.

Sooner or later you'll find yourself pressing "pass," "yes," "agreed," "do it" and moving on without reading or checking anything. That's not always a problem, since you can always revisit, adjust, and continue, but the risk of becoming too disconnected from the actual work is real.

Hence:

The checkpoints should be kept small, intuitive, and focused on the most critical things.

AI fatigue, and a kind of laziness, will get you too. It's just so tempting to throw smaller and smaller things at the fancy model at your fingertips, and essentially to stop really thinking. Don't do that: first, it's not good for you; second, it's a waste of resources; and third, it leads to more mistakes and rework.

A picture is worth a thousand words

To keep the review business viable, try to make it graphical.

In the example below, I've provided a simple ASCII graphic. This is what my AI UX Agent gave me to review. Not impressed? There were more details, down to individual controls, but this layout was what I thought about most and found most valuable.

Any intermediate planning artifacts for real persons must be visual, structured, and easy to review.

As rudimentary and boring as my example probably is, it represents a sanitized real-world example of a dashboard generated from a short description: "give me a page layout and implementation plan for this kind of data with this kind of hierarchy, with search and action list options."

With the rough idea, I could throw in (via project instructions) a basic design system, some generic code principles, and a similar page already implemented as an example.

Here's what I got:

+------------------------------------------------------------------------+
|  PageHeader: "ProcessControl"                                          |
|  [Search: Seek with order...] [Date: DD.MM.YYYY]             [Actions] |
+------------------------------------------------------------------------+
|  ProcessList (Master View)                                       Card  |
|  Tabs: [Active] [History]                                              |
|  +------------------------------------------------------------------+  |
|  | Id     | Unit   | Inf1 | Inf2 | Quality | Notes  | Amount | ...  |  |
|  |--------|--------|------|------|---------|--------|--------|------|  |
|  | 12542  | A12    |  X   |      | prem    | ...    | 10032k | [X]  |  |
|  | 54344  | B05    |      |  Y   | 2nd     | !      | 1233   | [-]  |  |
|  +------------------------------------------------------------------+  |
|  (flex: 0 0 40%, overflow: hidden)                                     |
+------------------------------------------------------------------------+
|  ProcessDetails (Detail View)                            Card outlined |
|  Tabs: [Events] [KPIs] [BOM]                                           |
|  +------------------------------------------------------------------+  |
|  | [+ Add]                                                          |  |
|  | Date | Event  | Notes   | Start      | End    | .. |             |  |
|  |------|--------|---------|------------|--------|----|-------------|  |
|  | 1202 | Bling  | BP      | 14:30:15   | 14:32  | 1m |             |  |
|  +------------------------------------------------------------------+  |
|  (flex: 1 1 auto, overflow: hidden)                                    |
+------------------------------------------------------------------------+

Drawers (rendered at page level, not in the content area):
  - RecordDrawer (right side)
  - QualityDrawer (bottom, placement="bottom", size="large")

Compare this to reading hundreds of lines of text and trying to figure out what I'm going to get. Day in, day out, and then waiting an hour or two to discover the result was nothing I wanted. In case you wonder: no, your visual drafts don't have to be the 80x25 ASCII art I so proudly showcase here (motivated mostly by nostalgia, I reckon). Go ahead and spin up an HTML version that looks very close to the final product, and review that instead.

Which of the above would you rather read? Use AI to generate summaries and graphical representations of the tasks, designs, and plans: a Gantt chart of the tasks, a diagram of the solution components and blast radius, a draft screenshot of the UI. All of this is available at your fingertips.
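Generating those visuals can itself be trivially scripted. As one illustration, here is a small sketch that turns an agent's task breakdown into a Mermaid Gantt chart you can paste into any Mermaid-aware viewer; the task fields (id, name, days, after) and the plan itself are invented for the example.

```python
# Sketch: render a task breakdown as a Mermaid 'gantt' block for human review.
# Task fields (id, name, days, after) are illustrative, not from any real tool.

def tasks_to_mermaid(tasks):
    """Build a Mermaid Gantt chart; 'after X' chains a task to its dependency,
    while tasks without a dependency get a fixed start date."""
    lines = ["gantt", "    dateFormat YYYY-MM-DD", "    title Story plan"]
    for t in tasks:
        timing = f"after {t['after']}" if t.get("after") else "2025-01-06"
        lines.append(f"    {t['name']} :{t['id']}, {timing}, {t['days']}d")
    return "\n".join(lines)

plan = [
    {"id": "t1", "name": "API contract", "days": 1},
    {"id": "t2", "name": "Backend endpoint", "days": 2, "after": "t1"},
    {"id": "t3", "name": "Dashboard page", "days": 2, "after": "t1"},
]
print(tasks_to_mermaid(plan))
```

Ten lines of glue, and the reviewer sees a timeline instead of a wall of task descriptions.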

While you can certainly still vibecode your way through all this, or "Lovablise" it, it might not give you anything solid to build on. The point of having somebody with a real brain in the loop is to keep the train on track, keeping things manageable and visual for the humans.

Your new role

You, in a governed loop, are not a programmer. You are an architect, a lead, a quality authority. The role shifts from producing code to governing delivery.

This is a genuine cultural shift. Most development organizations are structured around production: developers produce code, testers produce test cases, leads produce architecture documents. In a governed AI loop, the AI produces most of this. Your value is in the decisions: is this plan correct? Does this implementation meet our standards? Should this be shipped?

This requires different skills, different judgment, and different ways of measuring contribution. Organizations that try to run governed delivery with the old mental model—where "value" means "lines of code written"—will find the framework frustrating. The value is in the governance, not the generation.

Organizations that embrace the shift often find their senior engineers are happier. The tedious parts of coding—boilerplate, routine transformations, repetitive patterns—are delegated to AI. The interesting parts—architecture, design decisions, quality judgment—become the focus of your attention.

Testing as the confidence layer

If AI output is probabilistic, testing converts probability into confidence.

| Testing Layer | What It Catches | Why It Matters for AI Code |
|---|---|---|
| Unit tests | Logic errors, incorrect return values | Verifies the agent got the core behavior right |
| Contract tests | API mismatches, schema violations | Catches when agents invent or misread API contracts |
| E2E tests | User flow breakage, integration failures | Validates the full story works, not just individual pieces |
| Static analysis | Convention violations, architecture drift | Enforces patterns agents should follow but sometimes don't |
| Policy-as-code | Gate criteria violations, compliance gaps | Automates governance checks that otherwise require someone's judgment |
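The contract-test row deserves a concrete example, since inventing or misreading API fields is a classic agent failure. Here is a minimal sketch, assuming a hypothetical response payload with `id`, `unit`, and `amount` fields; none of this comes from a real API.

```python
# Sketch of a minimal contract check: verify the response shape the
# consumer depends on, so an agent that renames a field or changes a
# type fails fast. Field names and types are invented for the example.

REQUIRED_FIELDS = {"id": int, "unit": str, "amount": int}

def check_contract(payload):
    """Return a list of violations; an empty list means the contract holds."""
    problems = []
    for fld, ftype in REQUIRED_FIELDS.items():
        if fld not in payload:
            problems.append(f"missing field: {fld}")
        elif not isinstance(payload[fld], ftype):
            problems.append(f"wrong type for {fld}: {type(payload[fld]).__name__}")
    return problems

# An agent that returned 'amount' as a string would be caught:
print(check_contract({"id": 12542, "unit": "A12", "amount": "10032k"}))
# → ['wrong type for amount: str']
```

Real projects would lean on a schema library or generated client types, but the principle is the same: the contract is written down once, and every generation run is checked against it.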

In the governed pipeline, testing is a first-class stage with its own agent and gate. Tests must map to acceptance criteria before the gate advances. This is not an afterthought or a nice-to-have—it is the mechanism that makes probabilistic generation reliable enough for production.

The scope extends progressively: unit tests for logic, contract tests for APIs, E2E tests for user flows, static analysis for conventions and architecture rules, and eventually policy-as-code for gate criteria themselves.

This changes the economics of AI-generated code. When validation is automated and thorough, you can let AI generate aggressively and catch failures cheaply. The tradeoff shifts from "is this code perfect?" to "does this code pass all the checks?"—which is exactly the tradeoff that CI/CD pipelines already manage for code people write.

The more you can automate quality verification, the more safely you can delegate generation to AI agents.

The regeneration option

Here is perhaps the most counterintuitive benefit of a spec-anchored approach: if your specifications are good enough and your traceability chain is complete, you can—in theory—regenerate the entire implementation.

Think about what governed delivery produces for every story: a structured plan (architecture decisions, component mapping, scope boundaries), an ordered task breakdown with dependencies and deliverables, tests that verify acceptance criteria, and a complete traceability chain from story ID through branch, commits, and PR back to the backlog item.
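That traceability chain is simple enough to model directly. Here is a sketch of it as a record type; the field names and the completeness rule are my own illustration of the idea, not a real schema.

```python
# Sketch: the story-to-PR traceability chain as a record, so every artifact
# can be walked back to its backlog item. Field names are illustrative.

from dataclasses import dataclass, field

@dataclass
class TraceRecord:
    story_id: str
    branch: str = ""
    commits: list = field(default_factory=list)
    pr: str = ""

    def is_complete(self):
        """A regenerable story needs the whole chain, end to end."""
        return bool(self.branch and self.commits and self.pr)

rec = TraceRecord(story_id="ST-142", branch="feature/ST-142-dashboard")
rec.commits.append("a1b2c3d")
rec.pr = "PR-77"
print(rec.is_complete())  # → True
```

The useful part is the `is_complete` check: it makes "the chain is broken" a queryable fact rather than something discovered during an audit.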

If the AI models improve—and they will—you could take the same specifications and re-run the Engineering Agent with a better model. If your architecture changes, you could update the plan and regenerate. If a new framework emerges, the tasks could be re-scoped while the requirements stay stable. The specification becomes the durable asset; the code becomes (partially) disposable.

This reframes the relationship between specifications and code. In traditional development, the code is the valuable artifact—specifications are documentation that quickly drifts from reality. In a spec-anchored world, the specification is the valuable artifact—code is a (verifiable) derivation of it.

I never needed to regenerate any part of the game. But the specifications GSD produced for each feature are still there, the acceptance criteria still valid. I'd be curious how close a fresh run from those same specs would come to the original result. That's the real test of whether specifications are truly the durable artifact.

The regeneration option is best understood as a theoretical endgame that motivates investment in specification quality—a direction of travel rather than a current capability. Every improvement in specification precision, every addition to test coverage, moves you closer to this future. But teams should not plan on regeneration working reliably in the near term.

Feedback loop design

The governance boundary is not just about control—it is also where learning happens. Each gate rejection is information: the plan was unclear, the implementation diverged from intent, the tests missed an edge case. Without systematic feedback loops, this information is lost.

Effective feedback loops answer:

What information from review flows into future specifications? When plan approval rejects a specification because the scope was wrong or the architecture would not work, that insight should inform how similar stories are specified in the future.

How does the team systematically learn from gate rejections? Patterns in rejections indicate systemic problems: unclear story descriptions, missing architectural context, inadequate test coverage. Tracking rejection reasons reveals where the process needs improvement.
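Tracking rejection reasons needs nothing fancier than a tally. A sketch, with invented gate names and reason strings:

```python
# Sketch: tally gate rejections so systemic patterns surface.
# Gate names and reason strings are invented examples.

from collections import Counter

rejections = [
    ("plan_approval", "scope too broad"),
    ("code_review", "tests missing for AC-3"),
    ("plan_approval", "scope too broad"),
    ("plan_approval", "missing dependency"),
]

by_reason = Counter(reason for _, reason in rejections)
print(by_reason.most_common(1))  # → [('scope too broad', 2)]
```

Even this crude count answers the question above: if "scope too broad" keeps winning, the fix belongs in how stories are specified, not in the coding stage.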

How do specifications evolve based on implementation experience? When coding reveals that a planned approach will not work, the specification should be updated to reflect the new understanding, not just the code changed while the spec drifts.

The governed delivery artifacts make these feedback loops possible. Because specifications are structured and versioned, you can analyze them. Because the traceability chain is complete, you can connect rejections at later gates back to specification quality at earlier ones.

The governance boundary is not just a control mechanism; it is also a measurement instrument for process improvement. Organizations that treat gate rejections as learning opportunities improve faster than those that treat them as annoyances.