Applied AI

Why eight agents, and not one chatbot.

The single-chatbot pattern fails the specific tests an accounting policy team puts a junior member through. Eight agents pass them.

By Bamidele Aly · 7 min read · May 2026

The shape of policy work

Policy advisory work, in a Big 4 firm or an in-house accounting team, is not question-and-answer. It is adversarial reasoning under uncertainty. Two readings of the same paragraph in IAS 36 can both be defensible. The question is which one the auditor will accept and the regulator will not contest. That is not a fact-retrieval problem.

A real policy team plays this out structurally. The preparer drafts the position. A reviewer challenges it from the perspective of a sceptical auditor. A senior partner sits above both and decides which framing is robust. The output is not the answer the preparer gave first; it is the answer that survived the exchange.

A single chatbot has none of this. It generates one plausible response and stops. Plausibility is the failure mode. The reader cannot tell, from the output alone, which of three or four equally plausible alternative readings the model considered and rejected. The reader then has to repeat the prompt with variations to surface what should have been visible the first time.

What each of the eight does

Ile Owo names the roles explicitly and routes work between them. The Orchestrator holds the working context and decides which agent runs next. The Filter narrows the enquiry against the corpus — UK Endorsement Board endorsements, BIS publications, Big 4 technical libraries, internal policy memos — and discards out-of-scope material. The Summaryan extracts the load-bearing claims from each source.

The middle stage is where the adversarial structure lives. The Historian situates the question against prior standard-setting decisions: when was this paragraph last interpreted, by whom, with what subsequent amendments? The Insider argues the preparer's position — where the standard gives discretion, where it gives a hard rule. The Outsider argues the auditor's or regulator's position, surfacing the challenges the Insider would prefer to avoid.

Synthesis is handled by two more agents. The Auditor adjudicates the Insider/Outsider exchange against the standard text and the Historian's analogues, and writes the resolved position. The Scribe produces the final advisory note in the house style: citations, working, summary.
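
The routing described above can be sketched in a few lines of Python. This is a minimal illustration, not Ile Owo's implementation: the role names come from the article, but the strictly linear stage order, the `Orchestrator` class, the `stub` helper and the shared `context` dictionary are all assumptions made for the example. Real agents would call a language model; here each one is a placeholder function. The sketch also passes the whole context to every agent, so it does not model the context isolation discussed later in the article.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical: an agent is any function from the working context to text.
Agent = Callable[[Dict[str, str]], str]

# Role names follow the article; the linear order is an assumption.
STAGES: List[str] = [
    "Filter", "Summaryan", "Historian",
    "Insider", "Outsider", "Auditor", "Scribe",
]


@dataclass
class Orchestrator:
    """Holds the working context and decides which agent runs next."""
    agents: Dict[str, Agent]
    context: Dict[str, str] = field(default_factory=dict)

    def next_agent(self) -> Optional[str]:
        # The first stage with no recorded output is the next to run.
        for name in STAGES:
            if name not in self.context:
                return name
        return None

    def run(self, enquiry: str) -> Dict[str, str]:
        self.context = {"enquiry": enquiry}
        name = self.next_agent()
        while name is not None:
            self.context[name] = self.agents[name](self.context)
            name = self.next_agent()
        return self.context


def stub(role: str) -> Agent:
    """Placeholder agent (hypothetical); a real one would call a model."""
    def agent(context: Dict[str, str]) -> str:
        return f"{role} output for: {context['enquiry']}"
    return agent


orchestrator = Orchestrator(agents={name: stub(name) for name in STAGES})
result = orchestrator.run("example enquiry")
print(list(result))
```

Each stage's output stays in the context under the role's name, so the Scribe's note is `result["Scribe"]` and every intermediate output remains available for inspection.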

Why this beats one large model

Three reasons, in order of importance. Specialisation first: each agent's prompt is tuned for its role. The Outsider's prompt rewards finding the position the Insider missed; the Insider's prompt rewards the cleanest defensible drafting. A single chatbot averaged across both roles writes neither well.

Context isolation second: each agent runs with its own working memory. The Outsider does not see the Insider's draft until it has formed its own challenges. That removes the politeness compression that happens when a single model writes the response and the critique in one pass: the model tends to moderate the critique to match the draft.
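
One way to picture context isolation is as a visibility rule applied before each agent is called. The sketch below is an assumption about how such a rule could look, not a description of Ile Owo's internals: the `VISIBLE` table, `run_isolated` and the `report_inputs` stub are hypothetical names, and the exact inputs granted to each role are illustrative.

```python
from typing import Callable, Dict, List

Agent = Callable[[Dict[str, str]], str]

# Hypothetical visibility rules: the record keys each role may read.
# The Outsider's list omits "Insider", so it cannot see the draft.
VISIBLE: Dict[str, List[str]] = {
    "Insider": ["enquiry", "Summaryan", "Historian"],
    "Outsider": ["enquiry", "Summaryan", "Historian"],
    "Auditor": ["enquiry", "Historian", "Insider", "Outsider"],
}


def run_isolated(role: str, agent: Agent, record: Dict[str, str]) -> str:
    """Call an agent with a copy of only the entries it may read."""
    view = {key: record[key] for key in VISIBLE[role] if key in record}
    return agent(view)


def report_inputs(view: Dict[str, str]) -> str:
    """Stub agent: returns the sorted names of the inputs it received."""
    return ",".join(sorted(view))


record = {"enquiry": "q", "Summaryan": "claims", "Historian": "precedents"}
record["Insider"] = run_isolated("Insider", report_inputs, record)
record["Outsider"] = run_isolated("Outsider", report_inputs, record)
record["Auditor"] = run_isolated("Auditor", report_inputs, record)

print(record["Outsider"])  # Historian,Summaryan,enquiry
print(record["Auditor"])   # Historian,Insider,Outsider,enquiry
```

The Outsider runs after the Insider here, yet its view contains no Insider entry; only the Auditor, which adjudicates the exchange, receives both sides.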

Audit trail third, and this is the one the regulator cares about. Each agent's output is logged with timestamps, source citations and intermediate reasoning. When a supervisor asks "how did this position arise?", the answer is not a chat transcript: it is a structured record of which sources were considered, which positions were tested, and which review surfaced which counter-argument. That is a reviewable artefact. A single chatbot produces only the final answer and the prompt; everything in between is implicit.
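
The structured record could take a shape like the following. This is a hedged sketch under stated assumptions: the article says only that timestamps, source citations and intermediate reasoning are logged, so the field names, the `TrailEntry` and `AuditTrail` classes and the placeholder source labels are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class TrailEntry:
    """One agent's logged output (field names are assumptions)."""
    agent: str
    output: str
    citations: List[str]
    reasoning: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class AuditTrail:
    entries: List[TrailEntry] = field(default_factory=list)

    def log(self, agent: str, output: str,
            citations: List[str], reasoning: str) -> TrailEntry:
        entry = TrailEntry(agent, output, list(citations), reasoning)
        self.entries.append(entry)
        return entry

    def by_agent(self, agent: str) -> List[TrailEntry]:
        """Answer 'what did this role contribute?' for a reviewer."""
        return [e for e in self.entries if e.agent == agent]

    def to_json(self) -> str:
        """Serialise the whole trail as a reviewable artefact."""
        return json.dumps([asdict(e) for e in self.entries], indent=2)


trail = AuditTrail()
# Source labels below are placeholders, not real citations.
trail.log("Insider", "Draft position", ["source-A"], "Discretion applies.")
trail.log("Outsider", "Challenge", ["source-A", "source-B"],
          "A sceptical auditor would read the paragraph as a hard rule.")

print(len(trail.entries))  # 2
print(trail.by_agent("Outsider")[0].citations)  # ['source-A', 'source-B']
```

Serialising with `trail.to_json()` gives a supervisor something to query by agent, source or time, which a flat chat transcript does not.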

What this does not solve

Ile Owo does not certify completeness. The Filter might have missed a relevant source; the Historian might have missed a precedent; the Auditor might have under-weighted a regulator's known concern. The system is a draft accelerator and a structured first reviewer, not a decision authority.

It also does not replace the human judgement call on genuinely novel questions — the ones where the answer is "we should ask the standard-setter directly". For those cases the most useful output of the agents is not their conclusion; it is the inventory of considerations they raised, which becomes the briefing pack for the conversation that follows.

Treat it as a senior associate's first draft, written in under twenty minutes instead of three days. Then read it as you would any first draft from someone you do not yet fully trust: looking for what is missing rather than what is there.