Notes from putting agents into government infrastructure
A demo runs in conditions you control. Government infrastructure runs in conditions that control you. Moving agentic automation from one to the other is less about model quality than about everything around the model — and that gap is where the real work lives. I'll keep this at the level of what generalises: no client specifics, no architecture anyone could reconstruct.
The network changes first. The assumptions baked into most agent tooling — call a hosted model, reach an external API, pull a dependency at runtime — quietly evaporate on-prem. The model, the orchestration, the storage, and the observability all have to exist in a form you can deploy inside the boundary, with no line to the outside. A surprising amount of "production-ready" agent tooling assumes the public internet. It isn't ready for this.
Then identity and authorisation. In a demo the agent acts as you. In regulated infrastructure it acts as a named, scoped service principal whose every permission is justified and logged. The design question stops being "what can the agent do" and becomes "what is it allowed to do, who decided that, and where is the record." Half the architecture ends up being the parts that say no.
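To make "the parts that say no" concrete, here is a minimal sketch of a deny-by-default check for a scoped service principal. Everything named here (`Grant`, `ServicePrincipal`, `svc-case-triage`, the `records:read` action string) is hypothetical and for illustration only; a real deployment would lean on the institution's own identity provider and log store rather than in-memory structures.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Grant:
    """One permission, with the record of who allowed it and why (hypothetical shape)."""
    action: str          # e.g. "records:read"
    approved_by: str     # who decided the agent may do this
    justification: str   # why the permission exists


@dataclass
class ServicePrincipal:
    """A named identity the agent acts as. Anything not granted is denied."""
    name: str
    grants: dict[str, Grant] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)

    def authorise(self, action: str) -> bool:
        grant = self.grants.get(action)
        allowed = grant is not None  # deny by default
        # Log every check, including refusals, so the record answers
        # "what was it allowed to do, and who decided that".
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "principal": self.name,
            "action": action,
            "allowed": allowed,
            "approved_by": grant.approved_by if grant else None,
            "justification": grant.justification if grant else None,
        })
        return allowed


# Illustrative principal with a single justified permission.
agent = ServicePrincipal(
    name="svc-case-triage",
    grants={
        "records:read": Grant(
            action="records:read",
            approved_by="security-board",
            justification="needs case data to triage",
        )
    },
)
```

In this sketch `agent.authorise("records:read")` is allowed and `agent.authorise("records:delete")` is refused, and both outcomes leave an entry in `audit_log`. Logging the refusals matters as much as logging the approvals.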
Then the approval surface. Fully autonomous agents are an easy sell on a slide and a hard sell in a building where a wrong action has consequences. What holds up is human-in-the-loop placed deliberately: the agent does the work, assembles the evidence, and pauses at the decisions that carry weight. The humans aren't there to babysit it; they're there because accountability has to land on a person. The system's job is to make those decisions fast and well-informed, not to remove them.
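One way to picture a deliberately placed checkpoint is the sketch below. It assumes the agent packages its work as a proposal with evidence attached, and that some policy has already marked which proposals carry weight. The names (`Proposal`, `Outcome`, `run_with_checkpoint`, the `ask_human` callback) are hypothetical, and how a reviewer is actually reached is left out entirely.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Decision(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass(frozen=True)
class Proposal:
    action: str                # what the agent wants to do
    evidence: tuple[str, ...]  # what it assembled to support the decision
    high_impact: bool          # set by policy, not by the agent itself


@dataclass(frozen=True)
class Outcome:
    executed: bool
    decided_by: str            # a named person, or the policy that waved it through


def run_with_checkpoint(
    proposal: Proposal,
    execute: Callable[[Proposal], None],
    ask_human: Callable[[Proposal], tuple[str, Decision]],
) -> Outcome:
    """Run low-impact work directly; pause high-impact work for a named reviewer."""
    if not proposal.high_impact:
        execute(proposal)
        return Outcome(executed=True, decided_by="policy:low-impact")

    # The reviewer sees the proposal with its evidence and returns
    # their identity along with the decision, so accountability is recorded.
    reviewer, decision = ask_human(proposal)
    if decision is Decision.APPROVED:
        execute(proposal)
        return Outcome(executed=True, decided_by=reviewer)
    return Outcome(executed=False, decided_by=reviewer)
```

The point of the shape is that the outcome always names who decided. A rejection is a normal result here, not an error.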
What quietly doesn't hold up is anything that depends on the model being right the first time. In a demo you re-run the prompt. In production you design for the model being wrong some of the time and make that survivable — bounded retries, idempotent actions, a clear non-retryable path, and a trace you can read in an incident review months later.
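Those four properties fit in a small amount of code. The following is a sketch under assumptions, not a recommended implementation: the class and exception names are hypothetical, the idempotency store is an in-memory dict where a real system would need durable storage, and backoff between attempts is omitted for brevity.

```python
from dataclasses import dataclass, field


class TransientError(Exception):
    """Failure worth retrying, e.g. a timeout (hypothetical classification)."""


class NonRetryableError(Exception):
    """Failure that retrying cannot fix, e.g. a permission denial."""


@dataclass
class IdempotentExecutor:
    max_attempts: int = 3
    completed: dict = field(default_factory=dict)  # idempotency key -> result
    trace: list = field(default_factory=list)      # readable record of each step

    def run(self, key, action):
        # Idempotency: an operation that already succeeded is never re-applied.
        if key in self.completed:
            self.trace.append((key, "skipped: already applied"))
            return self.completed[key]

        # Bounded retries: at most max_attempts calls to the action.
        for attempt in range(1, self.max_attempts + 1):
            try:
                result = action()
            except NonRetryableError as exc:
                # Clear non-retryable path: record it and stop immediately.
                self.trace.append((key, f"attempt {attempt}: non-retryable: {exc}"))
                raise
            except TransientError as exc:
                self.trace.append((key, f"attempt {attempt}: transient: {exc}"))
                continue
            self.completed[key] = result
            self.trace.append((key, f"attempt {attempt}: ok"))
            return result

        # Out of budget: escalate instead of looping forever.
        self.trace.append((key, "gave up: retry budget exhausted"))
        raise NonRetryableError(f"{key}: exhausted {self.max_attempts} attempts")
```

A caller would pass a stable key per logical operation, for example `executor.run("op-1", do_the_thing)`, where `do_the_thing` is whatever callable performs the action. The trace is deliberately plain text, so it can still be read in an incident review long after the code has changed.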
The other thing that doesn't survive contact is undocumented behaviour. "It just works" is not a sentence you can say to an auditor. Every decision the system makes has to be explainable after the fact, in terms a non-engineer can follow. That requirement, more than any performance target, shapes how you build.
If there's a single lesson, it's that the model is the easy part. The engineering around it — deployment inside a boundary, scoped identity, deliberate human checkpoints, observability you can defend — is what decides whether agentic automation belongs anywhere near critical infrastructure. That work is unglamorous, and it's most of the job. It's also the part I find most interesting: making a clever capability dependable enough that a serious institution will run it, on their own hardware, against work that matters. That's a control-systems problem wearing an AI hat — and it's where the field is actually headed.