The AI Tool Ecosystem
What's available, what it does, and how to choose
Choosing the right tools
'If it ain't easy, developers won't use it.' Heard that before? Leaving the obvious 'how come you use vi, then?' aside, one obstacle to cross on our way to an AI-powered nirvana is convincing the development team to actually use the tools, and for that to happen the tools need to offer a decent developer experience, preferably in their favorite IDE.
For software development, the toolset and the primary way of using it have been pretty much stable since the late '90s or so. You have an IDE or an editor to write, compile, and test your code, combined with scripts and other tools to, for instance, run or manage the resulting system.
That was the starting point of the AI-powered tools as well: you could (and should) think of them as wrappers between your codebase and the LLMs. A bit like IntelliSense on steroids, running in the familiar IDE of your choice.
Key concepts related to the current state of affairs are depicted below. Regardless of which tool you use, you prompt, attach context, and some kind of agent processes the request and produces the output.
Agentic harnesses
One way to think about this field is in terms of agentic harnesses: the coordination layer that wraps around a language model and determines how it interacts with your codebase, your workflow, and other agents. The harness is what turns a raw LLM into a development tool. Some harnesses are minimal, a prompt and a file reader. Some are opinionated, a full workflow with gates and artifacts. The choice of harness shapes the developer experience more than the choice of underlying model.
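To make the idea concrete, here is a minimal sketch of a harness loop in Python: the coordination layer receives a prompt, lets the model request tools, executes those requests, and feeds results back until the model produces a final answer. Everything here (the `TOOL:` protocol, the stubbed model, the tool names) is an illustrative assumption, not any real tool's API.

```python
# Minimal sketch of an agentic harness: the loop that wraps a raw LLM
# and mediates between it and your codebase. All names are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Harness:
    model: callable                             # the raw LLM: history -> response
    tools: dict = field(default_factory=dict)   # e.g. a file reader
    history: list = field(default_factory=list)

    def run(self, prompt: str) -> str:
        self.history.append(("user", prompt))
        while True:
            response = self.model(self.history)
            # The harness, not the model, decides how tool calls are executed.
            if response.startswith("TOOL:"):
                _, name, arg = response.split(":", 2)
                result = self.tools[name](arg)
                self.history.append(("tool", result))
            else:
                self.history.append(("assistant", response))
                return response

# A stubbed "model" that first asks to read a file, then answers.
def stub_model(history):
    if not any(role == "tool" for role, _ in history):
        return "TOOL:read:main.py"
    return "main.py defines a single entry point."

harness = Harness(model=stub_model,
                  tools={"read": lambda path: f"contents of {path}"})
print(harness.run("What does main.py do?"))
```

An opinionated harness would add more to this loop (gates, artifacts, subagent spawning); a minimal one is little more than what you see here.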
Opinionated vs unopinionated tools (generic vs specialized)
Nowadays, AI tools are much more than the generic ChatGPT prompt in the browser we all began with, even though the primary interface is still the prompt.
The primary groups of AI tools in our craft are:
- IDE-integrated assistants which are embedded directly into your code editor, like GitHub Copilot, Cursor, Windsurf, and Antigravity. They provide in-context code generation, inline chat, and agent mode capabilities.
- Code- and workflow-focused tools, often integrated directly into your code hosting platform, like GitHub Copilot's coding agent on GitHub.
- App-centric or user-facing tools, less focused on the code and infra, like Lovable.
- CLI tools, which bring us back to the terminal age, albeit with somewhat enhanced UX. Claude Code, Opencode, Codex CLI, and Amp CLI are examples of this category.
One of the issues with plugging AI capabilities into our craft is actually the unopinionated nature of the tools. They offer you the power of the LLMs, but they don't tell you how to use it. Yes, there are capabilities like tools (retrieval, file access, API calls) and features like agent mode, but they don't tell you when to use them, how to structure your prompts, how to manage the context, or how to integrate them with your workflow.
I summarize these distinct tool groups in the table below to compare them with the agentic coordination and context framework proposed in this book.
| Category | Tools | Core idea |
|---|---|---|
| Agentic framework | VS Code, Plan/Agent modes, Custom | Coordinating layer, parallelism, context management (subagents), handoffs, structured artifacts, human checkpoints |
| Agentic coding CLI | Claude Code, Codex CLI, Amp CLI | Terminal-resident agent that understands your codebase and executes tasks via natural language |
| IDE-integrated assistants | GitHub Copilot, Cursor, Windsurf | AI embedded inside your editor — autocomplete, inline chat, agent mode |
| Prompt-to-app builders | Lovable, v0, Bolt.new | Describe an app in plain English → get a working application |
| Spec-driven toolkit | GitHub Spec Kit, Kiro (AWS), GSD, Beads | Specification-first workflow: write a spec before code, use it as source of truth |
| Agent memory system | Memories | Persistent, queryable work state across sessions |
| Runtime assistants | Agent-browser, Playwright MCP | A runtime view to your application for more dynamic context |
| Design assistants | Figma MCP | Import and couple your design system with your codebase for better design-to-code handoff and context |
| Context augmentation tools | Serena | Give access to indexed code graph beyond grepping and reading files |
Another way to look at this is in stages: whatever 'agentic framework' you use to coordinate and delegate the work, different kinds of tools and capabilities match certain stages of that flow. This is depicted below.
CLI-based agentic coders
Claude Code, Codex CLI, Opencode, and Copilot CLI are general-purpose LLM frontends built on a command-line interface. They offer subagents, different modes (edit/plan), and so forth, and use the same context and configuration management approach as their IDE-integrated counterparts: basic instruction files, skills, prompts, and whatnot.
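As a concrete illustration, a minimal instruction file for one of these tools might look like the sketch below. The file name varies by tool (CLAUDE.md for Claude Code, AGENTS.md for several others), and the contents here are an assumption for illustration, not a canonical template:

```markdown
# Project instructions

## Stack
- TypeScript, Node 20, pnpm

## Conventions
- Run `pnpm test` before declaring a task done.
- Never edit files under `generated/`.

## Workflow
- For multi-file changes, enter plan mode first and wait for approval.
```

The point is that the same small, version-controlled file feeds both the CLI agents and their IDE-integrated siblings.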
Calling them CLIs is perhaps a bit misleading to somebody for whom that means `ps | awk` rather than a NetHack-like window. These tools are not stream-based (text flowing past in your terminal) but use colors and organize the screen to keep the current view in focus, with the ability to scroll just parts of the window.
IDE-integrated assistants
Copilot and Cursor fill a similar role: they accelerate coding within a stage. The developer might use Cursor during the Engineering gate for day-to-day work, while the framework handles the broader lifecycle around them. Copilot has evolved from inline autocomplete to full agent mode that can plan multi-step edits and create PRs, but it remains a coding accelerator, not a delivery governance framework.
Prompt-to-app generators
Lovable, v0, and Bolt.new solve a completely different problem for a completely different audience. They produce demos; this framework produces governed deliveries. There is no integration point between them.
These tools are extraordinarily effective at time-to-first-demo, and they are reshaping expectations about how fast software should appear. But every one of them hits a "technical cliff": authentication, security, performance, testing, team collaboration, and long-term maintenance are your problem. The demo is free; production is not.
This matters because these tools set the pace that professional delivery teams are measured against.
Spec-driven toolkits
GitHub's Spec Kit shares the conviction that specifications should precede code, but differs in scope and enforcement. Spec Kit plans per-project with advisory checklists; the governed approach plans per-story with enforced approval gates and machine-readable schemas tied to backlog items.
| Dimension | Governed approach | Spec Kit |
|---|---|---|
| Scope | Per-story, program-level | Per-project, greenfield-focused |
| Enforcement | Hard gates, state machine | Advisory checklists, AI-interpreted |
| Human checkpoints | Mandatory approval stops | Optional review suggestions |
| Backlog integration | First-class (Jira, ADO, GitHub Issues) | Minimal |
| Agent relationship | Opinionated: role-specific agents | Agent-agnostic, works with any tool |
| Existing codebase | Core use case (modernization, legacy) | Designed for greenfield projects |
| Setup effort | Significant: requires custom configuration | Low: works with minimal setup |
| Best suited for | Multi-team delivery needing traceability | Solo or small-team greenfield projects |
Context engineering layers
GSD fights context rot through task-level isolation: each task gets a fresh 200k-token subagent context. The governed approach fights context rot through role-level isolation: each stage is a separate agent invocation with defined inputs. Both work; they optimize for different things.
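The role-level isolation idea can be sketched as follows: each stage is an independent invocation that sees only its declared inputs and emits a structured artifact, so no stage inherits another stage's accumulated context. The stage names, artifact shape, and the stubbed agent call here are all illustrative assumptions.

```python
# Sketch of role-level context isolation: each stage is a fresh
# invocation with declared inputs and a structured output artifact.
# Stage names and the artifact format are illustrative assumptions.

def invoke_agent(role: str, inputs: dict) -> dict:
    # Stub standing in for a real agent call. In practice each call
    # starts with an empty context window plus only these inputs.
    return {"role": role, "artifact": f"{role} output from {sorted(inputs)}"}

def run_pipeline(story: str) -> dict:
    artifacts = {}
    # Each stage declares exactly which prior artifacts it may read.
    stages = [
        ("planner",     ["story"]),
        ("implementer", ["story", "planner"]),
        ("reviewer",    ["story", "implementer"]),
    ]
    available = {"story": story}
    for role, wanted in stages:
        inputs = {k: available[k] for k in wanted}   # nothing else leaks in
        result = invoke_agent(role, inputs)
        available[role] = result["artifact"]
        artifacts[role] = result
    return artifacts

out = run_pipeline("Add login rate limiting")
print(out["reviewer"]["artifact"])
```

Note that the reviewer never sees the planner's output directly; whatever must flow between stages does so through named artifacts, which is exactly what keeps each context window small.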
GSD optimizes for solo throughput with minimal ceremony. Its creator's philosophy: "I'm not a 50-person software company. I don't want to play enterprise theater." That resonates for good reason, and my roguelike project (the thread running through this book) was built entirely with GSD. It works, and it works well. Most software is not built by large teams with compliance requirements. For a solo developer or a small team shipping a product, GSD's lightweight phases with optional verification give you structure without the overhead of mandatory gates, formal artifacts, and human checkpoint ceremonies. The ceremony has a real cost: configuration effort, context overhead, and slower iteration cycles. If your project doesn't need traceability or multi-team coordination, that cost buys you nothing.
The governed approach earns its overhead when the work must be auditable, when multiple people need to understand what was built and why, and when the cost of an undetected defect outweighs the cost of a gate. This is not a hierarchy where governed is "better"; it's a spectrum where the right choice depends on your context.
Agent memory systems
Beads provides persistent, queryable work state across sessions, a Git-native issue tracker designed so agents can wake up, ask "what's next?", and resume work. The governed pipeline's task-tracking layer serves a similar function but is inseparable from its gate model: state and workflow are unified in a purpose-built artifact store, versioned in Git alongside the code.
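The "wake up and ask what's next" pattern can be sketched with a few lines of Python: a JSON file versioned alongside the code holds issue state, and a query returns the open items whose blockers are done. The schema and function names are made up for illustration; they are not Beads' actual format.

```python
# Sketch of a Beads-style agent memory: persistent work state an agent
# can query at session start. The JSON schema is a made-up illustration,
# not Beads' actual format.

import json
from pathlib import Path

STATE = Path("work_state.json")   # versioned in Git alongside the code

def load():
    return json.loads(STATE.read_text()) if STATE.exists() else {"issues": []}

def add_issue(title, status="open", blocked_by=None):
    state = load()
    state["issues"].append({"id": len(state["issues"]) + 1,
                            "title": title, "status": status,
                            "blocked_by": blocked_by or []})
    STATE.write_text(json.dumps(state, indent=2))

def whats_next():
    done = {i["id"] for i in load()["issues"] if i["status"] == "done"}
    return [i["title"] for i in load()["issues"]
            if i["status"] == "open" and set(i["blocked_by"]) <= done]

add_issue("Write migration script", status="done")
add_issue("Run migration in staging", blocked_by=[1])
add_issue("Update docs")
print(whats_next())   # unblocked open work, ready to resume
```

Because the state lives in a plain file under version control, a fresh agent session can reconstruct where work stood without replaying any prior conversation.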
Token economy
Generative AI models, especially the larger ones, are expensive to run. The hardware that runs them is nothing short of what was called a supercomputer just a few years ago. So the slowness you'll see (a simple prompt might take minutes to complete) is not the service being busy, but the immense computation required to produce the response tokens.
Comparing token efficiency by approach or tool would be an interesting exercise, but it is beyond the scope of this book. Still, I argue that keeping tasks small (my key thesis!), maintaining clear separation of concerns with well-defined inputs and outputs, and avoiding unnecessary back-and-forth with the model are the best ways to keep token costs down. In practice this means opening fewer files (for instance via Compacted Indexes), writing task definitions clear enough not to require multiple rounds of clarification, and using the model's output as directly as possible without reprocessing it. These techniques not only save money, they make your agents run smoother and faster.
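A back-of-envelope calculation shows why this adds up. The numbers below (tokens per file, price per token) are illustrative assumptions, not any provider's actual pricing; the point is that context is resent on every round, so cost scales with both context size and iteration count.

```python
# Back-of-envelope token cost: why opening fewer files matters.
# All numbers (tokens per file, price per 1k tokens) are illustrative
# assumptions, not any provider's actual pricing.

PRICE_PER_1K_INPUT = 0.003      # dollars, hypothetical
TOKENS_PER_FILE = 2_000         # rough average for a source file

def prompt_cost(files_opened, rounds):
    # Context is resent every round, so cost grows with both factors.
    tokens = files_opened * TOKENS_PER_FILE * rounds
    return tokens * PRICE_PER_1K_INPUT / 1_000

sloppy  = prompt_cost(files_opened=30, rounds=5)   # dump everything, iterate
focused = prompt_cost(files_opened=4,  rounds=1)   # small task, clear spec
print(f"sloppy: ${sloppy:.2f}, focused: ${focused:.2f}")
```

Under these made-up numbers the sloppy approach costs roughly forty times more per task, and the same ratio applies to latency, since those tokens all have to be processed.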
Choosing what fits
What I've described throughout this book is not a product. It's an agentic SDD workflow, pieced together from available tools: shareable, version-controlled task artifacts; multiple developer intervention points; and automatic quality control with a feedback loop. As of this writing, none of this exists as an off-the-shelf product you can install and run.
So how do you actually choose? Four questions determine your starting point:
- How many people touch this codebase? Solo work needs freedom; shared work needs traceability.
- What's the cost of an undetected defect? A PoC demo crash is a shrug; a production billing error is a lawsuit.
- How long will this code live? A throwaway prototype needs no governance; a system maintained for years needs decisions to be traceable.
- Will you be around to maintain it? If you're building something short-lived, speed over quality is a legitimate trade-off. If you'll be living with this system for years, invest in the structure that makes future-you's life bearable.
Your answers map to the table below:
| Your situation | Recommended stack | Why |
|---|---|---|
| Solo developer, side project or PoC | IDE assistant (Cursor/Copilot) + good CLAUDE.md. Add GSD or Beads if you want lightweight structure. | Plan and Agent modes get you surprisingly far. Governance overhead buys you nothing here. |
| Small team (2-5), shipping a product | CLI tool (Claude Code) or IDE assistant + spec-driven toolkit (GSD/Spec Kit) + basic conventions docs | You need shared context and some structure, but mandatory gates and formal artifacts are overkill. Spec-first habits pay off even without enforcement. |
| Medium team (5-15), professional delivery | CLI/IDE tools + hierarchical docs + task tracking + manual gates + at least planning and review agents | Multiple people touching the same codebase means you need traceability. Start with manual gates and automate as you learn what breaks. |
| Large team or regulated environment | Full governed approach: role-specific agents, enforced gates, state machine, automated quality checks, audit trail | The cost of an undetected defect or an untraceable decision justifies the overhead. This is where the framework in this book earns its keep. |
You'll know it's time to move to the next row when you start seeing signals like these: the same error happening again and again; an agent spilling output that makes a mess of somebody else's work, or totally disregarding your carefully crafted 100 kB AGENTS.md; a reviewer who can't tell why something was built a certain way (or why it was done in the first place).
I could go on, but these kinds of signals are just the Universe telling you that you need more structure.
A few rules of thumb that apply regardless of team size:
- Start with context, not agents. A well-written CLAUDE.md and organized documentation will improve every tool you plug in. Skip this and no amount of agent sophistication will save you.
- Add structure when pain appears, not before. If you're losing track of what was built and why, add task tracking. If agent output keeps missing the mark, add a planning agent and specs. If defects keep escaping, add gates. Don't set up infrastructure for problems you don't have yet.
- Match the model to the job. Reasoning models for planning, fast models for scaffolding and boilerplate, code-oriented models for implementation. Using Opus for everything is like driving a truck to the corner store.
- Invest in skills and reusable tooling early. A good scaffolding skill or a semantic search tool will pay for itself across every project, every team member, and every agent invocation.
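The "match the model to the job" rule can be expressed as a tiny routing table. The tier names and task categories below are placeholders, not real model identifiers; substitute whatever your provider offers.

```python
# Sketch of model-to-task routing. Tier names are placeholders for
# whatever your provider offers; the mapping itself is the point.

ROUTES = {
    "planning":       "reasoning-model",   # slow, deep: plans and reviews
    "review":         "reasoning-model",
    "implementation": "code-model",        # tuned for edits and diffs
    "scaffolding":    "fast-model",        # boilerplate, renames, stubs
    "boilerplate":    "fast-model",
}

def pick_model(task_kind: str) -> str:
    # Default to the mid-tier code model rather than the most expensive one.
    return ROUTES.get(task_kind, "code-model")

print(pick_model("planning"))
```

Even this trivial table beats the common default of routing everything to the largest model; the truck stays in the garage for corner-store runs.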
This kind of rather strict, multi-stage governance is really only necessary when multiple people work on the same codebase, when you need to trace decisions and changes, and when the cost of a defect is high enough to justify the overhead.
Pick the lightest approach that meets your actual needs, and grow it as you go.