I spent a recent stretch refactoring my AI agent system. I didn’t touch a single prompt.

Seven phases. Seven commits. And it taught me more about building reliable AI systems than any model upgrade ever has.

The System

I run a multi-agent AI system — a team of specialized agents that handle different parts of my daily workflow. There’s a chief of staff agent that triages my communications and surfaces what needs attention. A research agent that searches across email, Teams, and documents. An architect agent that makes design decisions. Engineers that write code. A scribe that turns meeting transcripts into structured recaps.

They coordinate through a shared knowledge graph (Myelin, which I wrote about previously), and they run on cron schedules — monitoring channels, generating reports, consolidating memory while I sleep.

When it works, it’s genuinely transformative. I sit down in the morning and my chief of staff has already triaged overnight messages, flagged aging work items, and prepped me for my first meeting. I don’t re-brief anyone. The system remembers.

But here’s the thing nobody tells you about multi-agent AI systems: the hard part isn’t the agents. It’s everything around them.

The Mess

The system grew fast — and organically. It accumulated the kind of technical debt that would make any engineer wince:

  • Config duplication. The same channel IDs appeared in 4 different files. Agent port assignments were hardcoded in scripts instead of referenced from a config. Change one, miss three — guaranteed drift.
  • Identity scatter. One agent’s identity information was spread across 6 sources — its definition file, a rules document, a memory file, a config file, inline in cron prompts, and baked into startup scripts. Which one was the source of truth? All of them. None of them.
  • State file sprawl. Six state files scattered across a directory with no naming convention. Some prefixed with the agent name, some not. No way to tell at a glance which agent owned what.
  • Cron indirection. A cron job called a PowerShell script, which called the CLI, which loaded a prompt from a variable defined in the script itself. Three layers of indirection to do one thing. Debugging meant tracing through all three.
  • Dead code. Retired agents still had entries in config files, aliases in shell profiles, and references in documentation. Ghost infrastructure that confused every audit.

None of these are AI problems. They’re software engineering problems. But in an AI system, they compound in ways that are uniquely painful — because when an agent loads stale config and makes a wrong decision, you don’t get a stack trace. You get a plausible-sounding answer that happens to be wrong.

The Cleanup

I did what any engineer would do when the foundation is shaky: I stopped building features and cleaned the house.

Phase 1 — Eliminate dead code. Kill retired agents, remove stale aliases, delete orphaned state files. If it’s not running, it shouldn’t exist in the codebase. This is table stakes, but in a system that grew organically, there’s always more dead code than you think.

Phase 2 — Single source of truth for config. One JSON file for all agent definitions, channel mappings, port assignments, and monitoring configuration. Every script reads from it. Nothing is hardcoded. Change the config, and everything downstream picks it up.
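A minimal sketch of what that pattern looks like, assuming a hypothetical agents.json layout (the field names here are illustrative, not the actual schema):

```python
import json
from pathlib import Path

def load_config(path: Path) -> dict:
    """Read the single source of truth. Every script calls this
    instead of hardcoding IDs or ports inline."""
    return json.loads(path.read_text())

def channel_for(config: dict, agent: str) -> str:
    """Channel IDs live in exactly one place: a lookup, never a literal."""
    return config["agents"][agent]["channel_id"]

def port_for(config: dict, agent: str) -> int:
    """Same for port assignments."""
    return config["agents"][agent]["port"]
```

Change a channel ID in the JSON and every script downstream picks it up on its next run; there is nothing left to drift.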

Phase 3 — Organize state. All state files into one directory with a consistent naming convention: {agent}-{purpose}.json. Immediately obvious who owns what, and trivially scriptable for cleanup and rotation.
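With that convention in place, ownership becomes scriptable. A small sketch (the directory layout is assumed for illustration):

```python
from collections import defaultdict
from pathlib import Path

def state_by_agent(state_dir: Path) -> dict[str, list[str]]:
    """Group state files by owning agent, relying on the
    {agent}-{purpose}.json naming convention."""
    owners: dict[str, list[str]] = defaultdict(list)
    for f in sorted(state_dir.glob("*-*.json")):
        # Split on the first dash: everything before it is the agent name.
        agent, _, purpose = f.stem.partition("-")
        owners[agent].append(purpose)
    return dict(owners)
```

One glance at the output tells you who owns what, and the same function doubles as the entry point for cleanup and rotation scripts.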

Phase 4 — Standardize agent definitions. Every agent follows the same boot sequence: load rules, load graph context, load recent logs. No special cases. No agent-specific startup logic scattered across shell scripts. The boot contract is the same for all of them.
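The uniform boot contract can be sketched as a single function every agent goes through; the file layout and names below are illustrative assumptions:

```python
from pathlib import Path

def boot_context(agent: str, root: Path) -> dict:
    """Standard boot sequence for every agent: load rules, load graph
    context, load recent logs -- in that order, with no special cases."""
    def read(p: Path) -> str:
        return p.read_text() if p.exists() else ""
    return {
        "rules": read(root / "rules" / f"{agent}.md"),
        "graph": read(root / "graph" / f"{agent}-context.json"),
        "logs": read(root / "logs" / f"{agent}-recent.log"),
    }
```

Because the contract is identical for every agent, a missing file is a data problem, not a special-cased code path buried in a startup script.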

Phase 5 — Runtime management. Start and stop commands for the whole system. Shell aliases that set the right working directory and flags. One command to bring up the fleet, one command to bring it down.
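Driven by the shared config from Phase 2, the fleet commands reduce to a loop. A sketch with a placeholder CLI name ("agent-cli" is a stand-in, not the real tool):

```python
def fleet_commands(config: dict, action: str) -> list[list[str]]:
    """Expand one fleet-level action ('start' or 'stop') into one CLI
    invocation per agent, with ports taken from config, never hardcoded."""
    return [
        ["agent-cli", action, agent, "--port", str(spec["port"])]
        for agent, spec in sorted(config["agents"].items())
    ]
```

The point is the shape: one command fans out over the config, so adding an agent to the fleet means adding a config entry, not editing scripts.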

Phase 6 — Externalize prompts. Cron prompts moved out of scripts and into versioned files. The scripts become thin wrappers — read the prompt, invoke the CLI, done. The prompts are now diffable, reviewable, and independently testable.
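The thin-wrapper shape can be sketched like this; the CLI is injected as an argument so the example stays runnable and makes no claims about the real tool:

```python
import subprocess
from pathlib import Path

def run_prompt(prompt_path: Path, cli: list[str]) -> subprocess.CompletedProcess:
    """Read a versioned prompt file and hand it to the CLI. The wrapper
    holds no prompt text itself, so prompts stay diffable and reviewable."""
    prompt = prompt_path.read_text().strip()
    return subprocess.run([*cli, prompt], capture_output=True, text=True)
```

Three layers of indirection collapse to one: the cron job calls the wrapper, the wrapper reads the file, done.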

Phase 7 — Documentation sweep. Every reference to retired agents, old config file names, and stale channel IDs — found and fixed. If the docs don’t match the system, the docs are a liability.

Seven commits. Every phase built on the previous one. And when it was done, the system was the same from the outside — same agents, same schedules, same behavior — but fundamentally more maintainable on the inside.

The Principle

In the middle of this cleanup, a colleague shared a design principle that crystallized everything I was learning:

“Deterministic in plumbing, agentic in judgment.”

It’s deceptively simple, and it changed how I think about building AI systems.

The idea: if something can be deterministic — a config lookup, a cron schedule, an API call with known parameters — make it deterministic. Don’t ask your agent to figure out which channel to post to. Don’t let it guess the API endpoint. Don’t make it parse a config file every time it runs. Bake that into the infrastructure.

Save the AI for the parts that genuinely need judgment: triaging a message, deciding whether something is urgent, synthesizing information from multiple sources, choosing the right tone for a response.
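The boundary is easy to see in code. A toy sketch, with a routing table and classifier that are purely illustrative:

```python
# Plumbing: a plain lookup table. No model ever decides where to post.
ROUTES = {
    "incident": "#ops-alerts",
    "standup-recap": "#team-standups",
}

def triage(message: str, classify) -> str:
    """Only the classification -- the judgment call -- goes to the model,
    injected here as `classify` so the boundary stays explicit and testable."""
    kind = classify(message)   # agentic: requires understanding the message
    return ROUTES[kind]        # deterministic: a lookup cannot hallucinate
```

Swap in a real model for `classify` and the deterministic half of the function never changes.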

Every decision you move from the agent into the infrastructure is:

  • One less thing that can go wrong — deterministic code doesn’t hallucinate
  • One less token spent — the agent’s context stays focused on judgment, not lookup
  • One less thing to debug — when something breaks, you know whether it’s a plumbing problem or a judgment problem

This maps directly to the phases of cleanup I just did. Config consolidation? Deterministic plumbing. Standardized boot sequences? Deterministic plumbing. Externalized prompts with exact API calls baked in? Deterministic plumbing.

The agent’s job is to read the message, understand the context, and decide what to do about it. Everything else is infrastructure.

What I’d Tell Someone Building Their First Multi-Agent System

Start with the plumbing. Before you build your second agent, build the infrastructure that will let them coexist cleanly — shared config, consistent state management, standardized communication patterns. It’s not exciting, but it’s the difference between a system that scales and one that collapses under its own complexity.

Audit early, audit often. I let my system grow unchecked before auditing it. By the time I looked, I had 11 agents, 16 cron jobs, 29 skills, and enough config duplication to guarantee drift. Regular audits — even quick ones — would have caught most of this before it snowballed.

Treat your agent system like a software system. Because it is one. Version your prompts. Review your configs. Test your cron jobs. Write docs that stay current. The AI is the flashy part, but the engineering discipline around it is what determines whether your system works reliably at 2 AM when nobody’s watching.

The model is the least interesting part. I know that’s a hot take. But the model is a commodity that improves every few months whether you do anything or not. The system you build around it — the memory, the orchestration, the monitoring, the state management — that’s your actual product. That’s what compounds.


The hardest part of building AI agents isn’t the AI. It’s the engineering discipline to treat the system around it with the same rigor you’d give any production software.

The prompts are easy. The plumbing is the work.