4

Everyone Wants a Software Factory

Power tools, not assembly lines


Everyone wants a software factory

The tools we covered in the previous chapters (CLIs, IDE assistants, and prompt-to-app builders) are coding accelerators. They are designed to help individuals work faster. But the topic of this book goes beyond individual tooling. It's about how teams share and develop the process, and the AI framework that brings it all together. Essentially, people want a software factory.

Of course this is compelling: just describe what you want, and a pipeline of AI agents delivers working, tested, deployed software. A software factory. Simple, right?

[Figure: The Software Factory Dream. A human feeds in specifications, advisory rules and a graphic language; out come neatly packaged boxes of code. If only it were that simple.]

To put it mildly, and as someone who has spent the best part of their career writing real software for (real) manufacturing, I find the 'software factory' metaphor quite misleading. It reduces the art of software engineering to a mechanistic, assembly-line process, which is neither how software development works nor what it produces. The closest industrial-economics term I can think of for creating software in a 'factory' would be 'tailored mass production'. If you leave out the mass.

Something from construction might be closer. The way I see it, writing software is akin to first designing and then building one-off buildings or complexes. Not spewing out standard blue LIDL stores.

Yeah, we might have some templates, regulations and standards (perhaps quite a few of them) to follow. Like software, most buildings have a roof, walls, windows and doors. But the details are different every time! A site for storing nuclear materials, which arrive by railway and which needs an adjoining hotel for the IAEA inspectors to stay in (think SAP), is quite a different thing from, say, a shopping mall (your custom CRM) or a suburban single-family house (your personal webpage with a JavaScript animation on startup).

Let's not dismiss the factory metaphor entirely, though. The construction analogy is closer to how software actually works: one-off designs, unique constraints, skilled judgment at every turn. But "software factory" captures something real about the ambition: repeatable pipelines, automated quality checks, consistent output. The industry has settled on the term, and so will this book. Just keep in mind that the "factory" we're talking about looks more like a custom design shop with very good power tools than an assembly line stamping out identical widgets.

Building the roguelike with GSD was exactly this: describe the feature, let the research agent dig into Lovecraftian lore and roguelike conventions, review the plan, then let the coding agent execute. A one-person software factory.

What's being built

Let's first look at what kind of factories already exist.

AWS launched Kiro as a spec-driven IDE in mid-2025, but by year's end it had evolved into something more ambitious. The Kiro autonomous agent can run for days, working across multiple repositories, opening pull requests with detailed explanations, and learning from code review feedback. It runs up to ten tasks concurrently in isolated sandbox environments. Spec-driven methodology (requirements in EARS (Easy Approach to Requirements Syntax) notation, design documents and test plans) is baked into the workflow, not optional.
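For readers unfamiliar with EARS, it is a small set of sentence templates that force requirements into testable shapes. The examples below are illustrative requirements I made up for an agent pipeline, not taken from Kiro's documentation:

```
Ubiquitous:   The build agent shall record every tool invocation.
Event-driven: When a pull request is opened, the review agent shall
              post its findings within 10 minutes.
Unwanted:     If a test run exceeds its token budget, then the
              orchestrator shall halt the task and flag it for review.
```

The value is less in the syntax than in the discipline: each requirement names a trigger, an actor and an observable response, which is exactly what a downstream agent (or tester) needs to verify it.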

Microsoft published a detailed reference architecture for an end-to-end agentic SDLC using Azure and GitHub: Spec Kit handles the planning phase, GitHub Copilot agents handle implementation, and GitHub Actions orchestrate CI/CD. Separately, Microsoft's Data Science team proposed an architectural concept using specialized AI agents for each SDLC phase: requirement extraction, system design, code generation, testing, deployment, monitoring. These would all be coordinated by a Core Orchestrator Agent built on Azure AI services.

EPAM built a custom agentic platform for PostNL that deployed over 20 types of AI agents supporting multiple teams across the full lifecycle: test case generation (80% time savings claimed), documentation (70–90% reduction in manual effort), code review, and deployment. This is one of the few documented enterprise case studies of an actual production deployment, not a concept or reference architecture. I don't have any data on reliability or performance, though, so I'd approach these claims with a healthy dose of skepticism.

Cognition AI's Devin and the open-source OpenHands (64k+ GitHub stars) take a different approach entirely: a single broad-capability agent that plans, writes code, browses documentation, debugs, and deploys. No role-specific stages, no formal handoffs. Assign a GitHub issue, and the agent figures it out. OpenHands achieves 72% resolution rate on SWE-Bench Verified.

Task organization and learning tools such as Spec Kit (and its forks), GSD and Beads are important pieces of these larger factories: they don't try to own the full lifecycle but add crucial structure around existing coding agents. It's actually amazing that the most widely used AI developer tools have no proper way to manage projects, tasks or progress. (No, nobody wants to use JIRA for that in the future. Nobody ever did.)

For instance, as my weapon of choice for guiding my way through the Lovecraftian adventures of my roguelike, GSD helped tremendously in breaking work into phases and tracking progress in planning artifacts. In my day job I've built something like that (but far inferior), too. GSD is built around a (kind of) agent workflow that uses specialized workers for specific tasks. What I found particularly well thought out in GSD is the interview process, where things are constantly refined and the user is asked for input before the coding agent is let loose. Beads provides persistent memory so agents can wake up, ask "what's next?", and resume work across sessions.

With the possible exception of Microsoft's Azure- and GitHub-integrated reference architecture, most of these efforts stop at the PR. CI/CD, monitoring, and user feedback loops are left as an exercise for the reader. More about these loops later in this book.

I think this is an interesting opportunity. Combining data from infrastructure, usage patterns and user feedback would provide trace data for analyzing both the qualitative and quantitative performance of the produced solution, and would feed good insights back into the next iteration.


The compound probability problem I introduced earlier applies directly to each of the solutions listed above, and probably to every GenAI-based one I didn't mention.

Some may have capabilities for self-correction and so forth, but they are essentially at the mercy of the people using them to verify that the product really did what it was supposed to do. So any "92% accurate" agent or "70% SWE-bench" model performs far worse in a real, parallel multi-agent setup, outside its training material, with noisy input and a lackluster QC department. This should come as no surprise to anyone.
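The arithmetic behind the compound probability problem is worth spelling out. Under the simplifying assumptions that each step in an agent pipeline succeeds independently and at the same rate, the end-to-end success rate decays geometrically:

```python
# Compound probability sketch. Assumptions (mine, for illustration):
# each pipeline step succeeds independently with probability p,
# so a chain of n steps succeeds end to end with probability p**n.

def chain_success(p: float, n: int) -> float:
    """Probability that all n independent steps succeed."""
    return p ** n

for n in (1, 5, 10, 20):
    print(f"{n:>2} steps at 92% each -> {chain_success(0.92, n):.1%}")
```

At a 92% per-step rate, twenty chained steps leave under a one-in-five chance of a clean end-to-end run. Real pipelines aren't independent or uniform, but the direction of the effect holds: every extra unverified handoff multiplies the failure odds.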

This may be partially addressed by manual intervention and iterative fixing. Yes, there are times when you launch a lot of agents with a large task list and let them self-correct and run for days (e.g. 'Ralph looping'). With this kind of approach you should introduce reinforcement-learning-style feedback, be token-efficient, and have an orchestrator that stops when things go sideways. Let's call that the circuit breaker. Tooling for this is emerging, but it's still early days.
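To make the circuit-breaker idea concrete, here is a minimal sketch. Everything in it is hypothetical glue code of my own, not any framework's API: `run_agent_step()` stands in for one agent iteration and reports whether that iteration passed its checks.

```python
# Minimal circuit breaker for a long-running agent loop (sketch).
# run_agent_step() is a hypothetical stand-in for one iteration
# of plan -> edit -> run tests.
import random
from dataclasses import dataclass

@dataclass
class CircuitBreaker:
    max_consecutive_failures: int = 3   # trip when the agent is flailing
    max_total_steps: int = 100          # hard budget, regardless of progress
    failures: int = 0
    steps: int = 0

    def record(self, success: bool) -> bool:
        """Record one step; return False when the loop should stop."""
        self.steps += 1
        self.failures = 0 if success else self.failures + 1
        if self.failures >= self.max_consecutive_failures:
            return False  # repeated failures: halt and escalate to a human
        return self.steps < self.max_total_steps

def run_agent_step() -> bool:
    return random.random() < 0.92  # placeholder per-step success rate

breaker = CircuitBreaker()
while breaker.record(run_agent_step()):
    pass
print(f"stopped after {breaker.steps} steps")
```

The two limits encode two different failure modes: consecutive failures catch an agent stuck in a loop, while the total-step budget caps cost even when the agent keeps "making progress" without converging.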

I'm not sure any of the major players have good answers to this yet. Spec-driven methodology constrains each agent's scope, reducing the probability of drift. There may be technical guardrails like token budgets, timeouts or hooks to prevent runaway agents. But the "iterate until tests pass" approach is not a solution to the compound probability problem. It adds more iterations, and it can actually make things worse if the underlying reliability isn't high enough. By nature, it isn't.

One thing constantly overlooked, and partially solvable through smart agent design, is testing. With vague tests and overenthusiastic agents, your 231 E2E tests at 100% pass rate are just code fitted to the code, not to the real needs. (And they often break when somebody so much as looks at them.) Without careful guardrails, such as isolated testing agents that cannot read the implementation code, or that only read it after the fact, agents will eventually modify the tests to fit the code. This obviously defeats the purpose of testing entirely, but it is what happens unless you pay attention.

Whether full-lifecycle agent orchestration can work reliably without enforced governance is the question the industry is running an experiment on right now. The rest of this book is about finding an answer to that question.

We'll return to these frameworks with a detailed tool-by-tool analysis later in this book.