AgentArmy Template Platform Roadmap
Strategic Brief & Visual Roadmap (rolling, session-paced)
Executive Summary
AgentArmy must climb from Custom/Manual orchestration (today) to a Self-Service, Observable, Cost-Optimized Platform (~15 agent sessions across 4 release trains). The competitive pressure is real: new agent frameworks (Copilot extensibility, Anthropic Managed Agents) are disrupting the routing/orchestration layer that AgentArmy currently executes by hand.
Planning unit: This army plans in agent sessions — one parallel-agent invocation — not human calendar time. All durations and timelines below are expressed in sessions or session ranges.
The keystone move: Convert the static CLAUDE.md routing table into an executable policy engine. This single shift unblocks intelligent routing, spoke self-service, observability, and the learning loop — everything downstream depends on it.
Investment thesis: Spend capital building genesis (A: spoke instantiation, D: learning loop) and refuse to fight the commodity wars (E/F: observability, cost). Let Langfuse/Helicone commoditize those layers while you build the moat.
The Wardley Map (Strategic Landscape)
Below is the OWM syntax for create.wardleymaps.ai. Key insight: Your most evolved asset is Agent Definitions (180 specialists, already in Product stage). Your biggest vulnerability is the static routing table (Custom stage): it is treated as ground truth when it should be treated as a hypothesis.
title AgentArmy Platform Evolution — Self-Service, Observable, Cost-Optimized Orchestration
anchor Developer / Platform Team [0.98, 0.34]
component Self-Service Spoke Instantiation [0.93, 0.10] label [-60, -5]
component Intelligent Agent Routing [0.80, 0.22] label [10, -8]
component Cross-Agent Choreography [0.62, 0.18] label [-95, 5]
component Learning Loop (Teachable Failure) [0.55, 0.12] label [12, -2]
component Agent-Run Observability [0.45, 0.30] label [12, 6]
component Pre-Delegation Cost Visibility [0.42, 0.28] label [-150, 0]
component Agent Definitions (180) [0.50, 0.45] label [10, -10]
component Agent Registry + MECE Governance [0.35, 0.40] label [-175, 6]
component Coordination Plane (GitHub Projects) [0.30, 0.72] label [10, 8]
component CI/CD Runtime (GitHub Actions) [0.20, 0.78] label [10, -10]
component LLM Inference (Claude / Copilot) [0.15, 0.68] label [-130, 5]
component Compute / Hosting [0.08, 0.90] label [10, 5]
Developer / Platform Team->Self-Service Spoke Instantiation
Developer / Platform Team->Intelligent Agent Routing
Developer / Platform Team->Agent-Run Observability
Developer / Platform Team->Pre-Delegation Cost Visibility
Self-Service Spoke Instantiation->Intelligent Agent Routing
Intelligent Agent Routing->Cross-Agent Choreography
Cross-Agent Choreography->Learning Loop (Teachable Failure)
Learning Loop (Teachable Failure)->Agent-Run Observability
Intelligent Agent Routing->Agent Definitions (180)
Intelligent Agent Routing->Agent Registry + MECE Governance
Cross-Agent Choreography->LLM Inference (Claude / Copilot)
Agent-Run Observability->LLM Inference (Claude / Copilot)
Pre-Delegation Cost Visibility->LLM Inference (Claude / Copilot)
Intelligent Agent Routing->Coordination Plane (GitHub Projects)
Cross-Agent Choreography->CI/CD Runtime (GitHub Actions)
Coordination Plane (GitHub Projects)->Compute / Hosting
CI/CD Runtime (GitHub Actions)->Compute / Hosting
LLM Inference (Claude / Copilot)->Compute / Hosting
evolve Self-Service Spoke Instantiation 0.55 label [12, -3]
evolve Intelligent Agent Routing 0.60 label [12, -8]
evolve Cross-Agent Choreography 0.45 label [10, 6]
evolve Learning Loop (Teachable Failure) 0.42 label [12, 4]
evolve Agent-Run Observability 0.68 label [10, -10]
evolve Pre-Delegation Cost Visibility 0.70 label [-40, -12]
evolve LLM Inference (Claude / Copilot) 0.85 label [10, 6]
pipeline Agent Definitions (180) [0.40, 0.50]
note Inertia: static CLAUDE.md table = knowledge+cultural inertia [0.74, 0.30]
note WAR zone: vendors commoditizing agent obs/cost (Langfuse, Helicone) [0.40, 0.40]
note WONDER: choreography+learning loop is new genesis above commodity LLMs [0.60, 0.05]
annotation 1 [0.80, 0.22] Routing must become a policy engine, not a doc
annotation 2 [0.55, 0.12] Learning loop is the durable moat
How to use: Paste the OWM block above into https://create.wardleymaps.ai to render the full map interactively.
Strategic Plays (Sequenced Execution)
| # | Play | What | Why | Timeline | Viability |
|---|---|---|---|---|---|
| 1 | Sensing Engine | Instrument delegation flow → OTel spans (borrow Langfuse) | Unlocks doctrine: situational awareness + flow metrics | Sessions 1–2 (RT1) | HIGH: borrow, don't build |
| 2 | Manage Inertia / Policy Engine | Convert CLAUDE.md table → executable routing policy resolver | Removes keystone constraint; unblocks spoke init + choreography | Sessions 1–2 (RT1 keystone) | HIGH: CRITICAL dependency |
| 3 | Land & Expand (Spoke Init) | Ship spoke init generator; wires repo + board + armies automatically | First user-facing win; depends on Play 2 | Sessions 5–6 (RT3) | HIGH |
| 4 | Learning Loop (Durable Moat) | Operationalize error-coordinator + knowledge-synthesizer runtime; close the loop on failures | Builds what competitors cannot replicate (your failure history) | Sessions 3–4 (RT2, concurrent) | HIGH: genius zone |
| 5 | Buy, Don't Build (Obs/Cost) | Integrate Langfuse + Helicone; abstract LLM provider (avoid lock-in) | Frees capital for Play 4; lets vendors win the commodity WAR | Sessions 4–5 (RT2/RT3, concurrent) | HIGH: strategic deferral |
| 6 | Open Standard (Future) | Release routing policy schema + spoke-init as open standard | Build ecosystem; commoditize the layer beneath your moat | After Plays 1–4 prove model | MEDIUM: requires proof first |
Release Train Structure & Alignment
Four release trains, sequenced by capability maturity and dependency order. This army plans in agent sessions, not calendar weeks — durations below are session ranges; see the Release Train Index for the session-by-session plan and velocity calibration.
Release Train 1: Foundation & Routing (Plays 1, 2 + enablers)
Duration: Sessions 1–2 (~2 sessions)
Theme: Make agent capabilities explicit; routing deterministic.
| Feature | Area | Type | Size | Routing | Outcome |
|---|---|---|---|---|---|
| Agent Spec Template + Capability Matrix | 1 | Feature | M | agent-distinctiveness-advocate | Definitions + governance → transparent, queryable |
| Executable Routing Decision Tree (.yaml) | 6 | Feature | M | architect-reviewer | CLAUDE.md → policy engine (executable + auditable) |
| Telemetry Instrumentation (OTel spans) | Play 1 | Enabler | M | observability-engineer | Flow is now measurable |
| Few-Shot Prompt Library (Phase 1) | 2 | Enabler | M | prompt-engineer | System prompt patterns per agent category |
Block Diagram:
Developer
↓
Routing Policy Engine ← Agent Specs + Governance
↓
Instrumentation (telemetry)
↓
Ready for Plays 3–5
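
The "label → army → agent" resolution that the routing policy engine performs can be pictured as a small lookup with an auditable fallback. This is a minimal sketch under stated assumptions: the policy shape, the label names, and the army names (`platform`, `quality`) are hypothetical and are not taken from CLAUDE.md; only the agent names come from the Routing column above. A plain dict stands in for the .yaml decision tree so the example has no dependencies.

```python
from dataclasses import dataclass

# Hypothetical policy: issue label -> (army, agent). In the roadmap this would
# be loaded from the .yaml decision tree; labels and army names are invented.
POLICY = {
    "routing": ("platform", "architect-reviewer"),
    "telemetry": ("platform", "observability-engineer"),
    "prompts": ("quality", "prompt-engineer"),
}
FALLBACK = ("platform", "architect-reviewer")  # assumed default route


@dataclass(frozen=True)
class Route:
    label: str
    army: str
    agent: str
    matched: bool  # False when the fallback fired, so the gap is auditable


def resolve(labels: list[str]) -> Route:
    """Return the route for the first label that has a policy entry."""
    for label in labels:
        if label in POLICY:
            army, agent = POLICY[label]
            return Route(label, army, agent, matched=True)
    army, agent = FALLBACK
    return Route("<none>", army, agent, matched=False)


print(resolve(["bug", "telemetry"]))
print(resolve(["unlabelled"]))
```

Recording `matched=False` rather than failing is one possible design choice: unrouted labels then surface as data that can be reviewed, instead of as silent defaults.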
Release Train 2: Operations & Quality (Plays 3, 4, 5 + operationals)
Duration: Sessions 3–5 (~2–3 sessions)
Theme: Formalize multi-agent workflows; measure quality; establish learning loop.
| Feature | Area | Type | Size | Routing | Outcome |
|---|---|---|---|---|---|
| Multi-Agent Choreography (Saga patterns, state machines) | 3 | Feature | L | workflow-orchestrator | Handoffs explicit; compensation on failure |
| Agent Evaluation Gates (DoD rubrics, SLI/SLO framework) | 4 | Feature | M | observability-engineer + qa-expert | Quality is measurable |
| Skill Scaffolding & Composition | 5 | Feature | M | tooling-engineer | Skills versioned, composable, discoverable |
| Learning Loop Runtime (error-coordinator + knowledge-synthesizer) | Play 4 | Enabler | M | knowledge-synthesizer | Failures are teachable; patterns accumulate |
| Cost Visibility & Provider Abstraction | Play 5 | Enabler | M | finops-engineer | Token budgets visible; multi-vendor abstraction |
Block Diagram:
Routing Policy Engine
↓
Choreography Patterns + Quality Gates
↓
Learning Loop (failures → lessons)
↓
Cost Transparency + Observability
↓
Ready for spoke onboarding (RT3)
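
"Handoffs explicit; compensation on failure" is the core of a saga: each step carries an undo action, and a failure unwinds the completed steps in reverse order. The sketch below is illustrative only; the step names and the `run_saga` helper are hypothetical and do not describe the workflow-orchestrator's actual interface.

```python
from typing import Callable

# (name, action, compensation); step names used below are illustrative
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    done: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"ok:{name}")
            done.append(step)
        except Exception as exc:
            log.append(f"failed:{name}:{exc}")
            for prev_name, _, compensate in reversed(done):
                compensate()
                log.append(f"compensated:{prev_name}")
            break
    return log


def noop() -> None:
    pass


def reject() -> None:
    raise RuntimeError("review rejected")


saga_log = run_saga([
    ("open-branch", noop, noop),
    ("draft-pr", noop, noop),
    ("review", reject, noop),
])
print(saga_log)
```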
Release Train 3: Spoke Readiness & Observability (Play 3 completion + operationals)
Duration: Sessions 5–6 (~2 sessions)
Theme: Spoke teams self-serve; cost is visible; observability is wired.
| Feature | Area | Type | Size | Routing | Outcome |
|---|---|---|---|---|---|
| Hub→Spoke Onboarding Playbook (checklists, pre-commit hooks) | 9 | Feature | M | platform-engineer | Spoke instantiation is scriptable, not manual |
| Cost & Capacity Model (unit economics, showback) | 7 | Feature | L | finops-engineer | Team X can see the cost of delegating work Y |
| Spoke-Specific Prompt Adaptation (greenfield vs. legacy context) | 2 Phase 2 | Story | S | prompt-engineer | Prompts adapt to spoke context |
| Observability Dashboard (MTTR, success rate, cost per delegation) | 10 Phase 1 | Enabler | M | observability-engineer | Visibility into agent health + behavior |
Block Diagram:
Routing + Orchestration + Learning Loop
↓
Spoke Init (automated)
↓
Cost Model + Observability Dashboard
↓
Ready for advanced learning (RT4)
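
The cost model's unit economics ("Team X can see the cost of delegating work Y") reduce to tokens multiplied by a unit price, summed per team for showback. A minimal sketch follows; the tier names and every price are placeholder assumptions, not real vendor pricing, and in the roadmap the metering itself would come from Helicone or provider APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per 1,000 tokens. Placeholder values, not real vendor pricing."""
    input_per_1k: float
    output_per_1k: float


# Hypothetical price table keyed by an assumed model tier name.
PRICES = {"small": Price(0.001, 0.002), "large": Price(0.01, 0.03)}


def estimate_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Pre-delegation estimate in USD: tokens x unit price."""
    p = PRICES[tier]
    return (input_tokens / 1000) * p.input_per_1k + (output_tokens / 1000) * p.output_per_1k


def showback(delegations: list[tuple[str, str, int, int]]) -> dict[str, float]:
    """Sum estimated cost per team from (team, tier, in_tokens, out_tokens)."""
    totals: dict[str, float] = {}
    for team, tier, tokens_in, tokens_out in delegations:
        totals[team] = totals.get(team, 0.0) + estimate_cost(tier, tokens_in, tokens_out)
    return totals


report = showback([
    ("team-x", "large", 2000, 1000),
    ("team-x", "small", 4000, 500),
    ("team-y", "small", 1000, 1000),
])
for team, usd in sorted(report.items()):
    print(f"{team}: ${usd:.4f}")
```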
Release Train 4: Learning & Advanced Observability (Play 4 completion + intelligence)
Duration: Sessions 6–7 (~1–2 sessions)
Theme: Accumulate and share lessons; trace agent decisions; iterate on routing.
| Feature | Area | Type | Size | Routing | Outcome |
|---|---|---|---|---|---|
| Agent Lesson-Learned KB (incident log, anti-patterns library) | 8 | Feature | M | knowledge-synthesizer | Failures become institutional learning |
| Request Tracing & Decision Audit Log (end-to-end visibility) | 10 Phase 2 | Feature | L | observability-engineer | Why did agent X route to Y? Fully auditable |
| Feedback Integration (PR reviews → routing/prompt refinement) | 10 Phase 3 | Enabler | S | prompt-engineer | Failures feed back into agent definitions |
| Competency Evolution Tracking (which agents improved on task type T?) | 8 Phase 2 | Story | M | knowledge-synthesizer | Agent performance is measurable over time |
Block Diagram:
Learning Loop + Observability
↓
Lesson KB + Tracing
↓
Feedback loops (rework → prompt/routing refinement)
↓
Continuous improvement cycle
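
Answering "why did agent X route to Y?" needs each routing decision stored together with the rule that produced it. The sketch below shows one possible shape for such a record, kept as append-only JSON lines in memory. The field names, the `AuditLog` class, and the sample issue reference and rule string are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RoutingDecision:
    issue: str    # hypothetical issue reference
    label: str
    agent: str
    rule: str     # which policy rule fired
    session: int


class AuditLog:
    """Append-only, in-memory decision log serialised as JSON lines."""

    def __init__(self) -> None:
        self._lines: list[str] = []

    def record(self, decision: RoutingDecision) -> None:
        self._lines.append(json.dumps(asdict(decision), sort_keys=True))

    def why(self, issue: str) -> list[dict]:
        """Return every recorded decision for an issue, oldest first."""
        entries = (json.loads(line) for line in self._lines)
        return [entry for entry in entries if entry["issue"] == issue]


audit = AuditLog()
audit.record(RoutingDecision("issue-42", "telemetry", "observability-engineer",
                             "label:telemetry", session=1))
audit.record(RoutingDecision("issue-43", "prompts", "prompt-engineer",
                             "label:prompts", session=1))
print(audit.why("issue-42"))
```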
Prioritization Matrix (Effort vs. Impact)
HIGH IMPACT
│
├─ 🎯 QUICK WINS (execute first)
│ • Routing Policy Engine (CRITICAL: unblocks 3+)
│ • Agent Registry Query (medium impact, easy)
│
├─ 🚀 STRATEGIC BETS (invest heavily)
│ • Self-Service Spoke Init (user-facing differentiator)
│ • Learning Loop Runtime (durable moat)
│ • Choreography Patterns (enables multi-agent work)
│
├─ 🔧 PLUMBING (concurrent, medium ROI)
│ • Prompt Library Phase 1
│ • Evaluation Rubrics
│ • Hub-Spoke Playbook
│
└─ 🛒 BUY, DON'T BUILD (defer or consume)
• Observability Integration → consume Langfuse
• Cost Estimation → consume Helicone/provider APIs
LOW EFFORT ←────────────────────────→ HIGH EFFORT
Key Decisions & Trade-Offs
Decision 1: Build vs. Buy Observability/Cost
Option A: Build your own telemetry/cost platform.
- Pros: Full control, agile iteration
- Cons: Commodity layer, vendors racing downward, burns genesis capital
Option B: Consume Langfuse/Helicone; build only AgentArmy-specific adapters (RECOMMENDED).
- Pros: Free up resources for moat (learning loop); ride the vendor price war; faster TTM
- Cons: Dependency on third-party vendor roadmaps
✅ Decision: OPTION B (Buy, don't build). This is the strategic deferral that funds Play 4 (learning loop), which is unreplicable.
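
Buying the commodity layers is safer when the platform talks to vendors through a thin interface of its own. The sketch below illustrates that adapter idea for the LLM provider. The `LLMProvider` protocol, both stand-in providers, and the failover helper are assumptions made for illustration; no real vendor SDK is called.

```python
from typing import Protocol


class LLMProvider(Protocol):
    """Hypothetical minimal interface every provider adapter would satisfy."""
    name: str

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in adapter; a real one would wrap a vendor SDK call."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


class FlakyProvider(EchoProvider):
    """Stand-in adapter that always fails, to exercise failover."""

    def complete(self, prompt: str) -> str:
        raise ConnectionError("provider unavailable")


def complete_with_failover(providers: list[LLMProvider], prompt: str) -> tuple[str, str]:
    """Try providers in order; return (provider name, completion) from the first success."""
    errors: list[str] = []
    for provider in providers:
        try:
            return provider.name, provider.complete(prompt)
        except Exception as exc:
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


used, text = complete_with_failover(
    [FlakyProvider("primary"), EchoProvider("secondary")], "route issue 42"
)
print(used, text)
```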
Decision 2: Spoke Instantiation Timing
Option A: Wait for all infrastructure (RT1 + RT2) before launching spoke init.
- Pros: More mature feature set at spoke launch
- Cons: Delayed user value; competitive window closes
Option B: Land spoke init in RT1 (after routing policy exists), then expand with observability/cost in RT2/3 (RECOMMENDED).
- Pros: Early user value + feedback; land-and-expand model
- Cons: Early spokes may lack observability (acceptable MVP)
✅ Decision: OPTION B (Land and expand). Routing policy + spoke init ship together; observability follows.
Decision 3: Release Train Sequencing
Strict dependency chain: Play 1 (Sensing) → Play 2 (Policy Engine) → Play 3 (Spoke Init)
Plays 4 (Learning Loop) and 5 (Buy Obs/Cost) run concurrent with Play 2 onward — they don't block each other.
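
The sequencing rule above can be checked mechanically by treating plays as a dependency graph and grouping them into waves that may run in parallel. This sketch uses the standard-library `graphlib` module (Python 3.9+). Modelling Plays 4 and 5 as depending only on Play 1 is one reading of "concurrent with Play 2 onward" and is an assumption; the node names are illustrative.

```python
from graphlib import TopologicalSorter

# Each play maps to the set of plays it depends on.
DEPENDS_ON = {
    "play-1-sensing": set(),
    "play-2-policy-engine": {"play-1-sensing"},
    "play-3-spoke-init": {"play-2-policy-engine"},
    "play-4-learning-loop": {"play-1-sensing"},  # assumed: may start alongside Play 2
    "play-5-buy-obs-cost": {"play-1-sensing"},   # assumed: may start alongside Play 2
}


def waves(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group nodes into waves; every node in a wave can run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the plan is circular
    result: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


for number, wave in enumerate(waves(DEPENDS_ON), start=1):
    print(number, wave)
```

Under these assumptions Play 1 runs alone, Plays 2, 4, and 5 share the second wave, and Play 3 follows.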
Doctrine Assessment (Quick Summary)
The three principles most critical to improve:
- Situational Awareness (score 2/5) — Static CLAUDE.md table is a report, not a map. Play 1 (telemetry) + Play 2 (policy engine) fix this.
- Manage Inertia (score 2/5) — The table feels like documentation when it's actually the platform's biggest constraint. Naming it + converting it to executable code removes this drag.
- Optimize Flows (score 2/5) — Without measuring delegation flow (request → route → execute → succeed/fail → cost), you can't make failures teachable or costs transparent.
All four release trains improve these three principles simultaneously.
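
Measuring the delegation flow (request → route → execute → succeed/fail → cost) means emitting one span per stage. The sketch below uses a standard-library stand-in so that it runs without dependencies; in practice an OpenTelemetry tracer would replace the `Span` class and the `SPANS` list. The span names and attribute keys are assumptions.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    status: str = "ok"
    duration_s: float = 0.0


SPANS: list[Span] = []  # stand-in for a span exporter


@contextmanager
def delegation_span(stage: str, **attributes):
    """Record one delegation stage; mark the span as an error on exception."""
    span = Span(name=f"delegation.{stage}", attributes=dict(attributes))
    start = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        span.status = "error"
        span.attributes["error"] = str(exc)
        raise
    finally:
        span.duration_s = time.perf_counter() - start
        SPANS.append(span)


with delegation_span("route", label="telemetry") as span:
    span.attributes["agent"] = "observability-engineer"

try:
    with delegation_span("execute", agent="observability-engineer"):
        raise RuntimeError("tool call timed out")
except RuntimeError:
    pass  # the failure is already captured on the span

print([(s.name, s.status) for s in SPANS])
```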
Climatic Patterns (External Forces)
| Force | Likelihood | Impact | Direction |
|---|---|---|---|
| LLM inference commoditizes | 5/5 | 5/5 | Accelerant — consume, don't build; stay multi-vendor |
| Agent obs/cost vendors race | 4/5 | 5/5 | Accelerant — buy those layers |
| Static table inertia persists | 5/5 | 4/5 | Threat — convert to policy engine |
| Competitors (Copilot, Managed Agents) disrupt routing | 4/5 | 5/5 | Threat: commoditize routing yourself (Play 6) so that competitors can only commoditize the layer beneath your moat, not the moat itself (the learning loop) |
| New value in choreography + learning loop | 5/5 | 5/5 | Accelerant — your blue ocean |
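
One simple way to rank these forces is to multiply likelihood by impact. The table does not state how the two scores should be combined, so the product used below is an assumption; the scores themselves are copied from the table.

```python
# (force, likelihood, impact), each score out of 5, copied from the table above.
FORCES = [
    ("LLM inference commoditizes", 5, 5),
    ("Agent obs/cost vendors race", 4, 5),
    ("Static table inertia persists", 5, 4),
    ("Competitors disrupt routing", 4, 5),
    ("New value in choreography + learning loop", 5, 5),
]


def exposure(force: tuple[str, int, int]) -> int:
    """Assumed scoring rule: likelihood x impact (maximum 25)."""
    _, likelihood, impact = force
    return likelihood * impact


# sorted() is stable, so forces with equal scores keep their table order.
ranked = sorted(FORCES, key=exposure, reverse=True)
for force in ranked:
    print(f"{exposure(force):>2}  {force[0]}")
```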
Success Criteria (by Release Train)
RT1 (Foundation & Routing) ✓
- Routing policy engine is executable (label → army → agent)
- Agent specs are documented + governance is active
- Telemetry spans are emitted for every delegation
- Agent definitions are queryable from a catalog
RT2 (Operations & Quality) ✓
- Choreography patterns are codified (sagas, compensation)
- DoD rubrics exist per agent type
- Learning loop closes: failures → error-coordinator → knowledge-synthesizer → KB entry
- Cost model is estimated (not yet live)
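
The learning-loop criterion describes a pipeline: raw failures are grouped by the error-coordinator, and recurring groups are turned into KB entries by the knowledge-synthesizer. The sketch below illustrates that flow with plain functions standing in for the two agents. The record fields, the `min_count` threshold, and the sample failures are assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    agent: str
    task_type: str
    error: str


def error_coordinator(failures: list[Failure]) -> Counter:
    """Group raw failures by (task_type, error) so repeats become visible."""
    return Counter((f.task_type, f.error) for f in failures)


def knowledge_synthesizer(groups: Counter, min_count: int = 2) -> list[dict]:
    """Turn groups seen at least min_count times into KB entries."""
    return [
        {"task_type": task, "pattern": error, "occurrences": count}
        for (task, error), count in groups.most_common()
        if count >= min_count
    ]


failures = [
    Failure("prompt-engineer", "refactor", "missing test context"),
    Failure("tooling-engineer", "refactor", "missing test context"),
    Failure("finops-engineer", "estimate", "stale price table"),
]
kb = knowledge_synthesizer(error_coordinator(failures))
print(kb)
```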
RT3 (Spoke Readiness) ✓
- spoke init command spins up a layer repo + board + armies
- Observability dashboard shows MTTR, success rate, cost per delegation
- Onboarding playbook is complete + tested on ≥1 real spoke
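
The three dashboard figures named above (MTTR, success rate, cost per delegation) can be derived from per-delegation records. This is a minimal sketch: the `Delegation` fields are hypothetical, and MTTR is taken here as the mean recovery time over failures that were later fixed, which is an assumed definition.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Delegation:
    succeeded: bool
    cost_usd: float
    recovery_minutes: Optional[float] = None  # set only for failures later fixed


def dashboard(runs: list[Delegation]) -> dict[str, float]:
    """Compute success rate, MTTR, and average cost over a batch of runs."""
    if not runs:
        raise ValueError("no delegations recorded")
    recoveries = [r.recovery_minutes for r in runs if r.recovery_minutes is not None]
    return {
        "success_rate": sum(r.succeeded for r in runs) / len(runs),
        "mttr_minutes": mean(recoveries) if recoveries else 0.0,
        "cost_per_delegation_usd": sum(r.cost_usd for r in runs) / len(runs),
    }


metrics = dashboard([
    Delegation(True, 0.5),
    Delegation(True, 0.5),
    Delegation(False, 0.5, recovery_minutes=30.0),
    Delegation(True, 0.5),
])
print(metrics)
```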
RT4 (Learning & Intelligence) ✓
- Lesson KB has ≥10 incident entries with root cause + remedy
- Request tracing shows full delegation flow (issue → label → route → PR → merge)
- Feedback loop is live: PR comments → routing/prompt refinement
- Agent competency tracking shows improvement trends
Budget & Capacity Estimate
| Release Train | Sessions | Parallel agents | Agent Routing | Notes |
|---|---|---|---|---|
| RT1 | ~2 sessions | up to 4 | agent-distinctiveness-advocate, architect-reviewer, observability-engineer, prompt-engineer | Foundation blocks RT2–4 |
| RT2 | ~2–3 sessions | up to 4 | workflow-orchestrator, observability-engineer, knowledge-synthesizer, tooling-engineer | Concurrent with Play 4, 5 |
| RT3 | ~2 sessions | up to 3 | platform-engineer, finops-engineer, prompt-engineer, observability-engineer | Spoke feedback informs RT4 |
| RT4 | ~1–2 sessions | up to 3 | knowledge-synthesizer, observability-engineer, prompt-engineer | Continuous improvement |
Total: ~7–9 sessions across the hub trains; up to 4 parallel agents per session; 10–12 specialist agents routing across 2 armies. (Planning unit = agent sessions, not calendar time — see the Release Train Index.)
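
The session total follows directly from the per-train ranges in the table. A small check, using only the low and high session counts copied from the Sessions column:

```python
# (low, high) session estimates per release train, from the Sessions column.
TRAINS = {"RT1": (2, 2), "RT2": (2, 3), "RT3": (2, 2), "RT4": (1, 2)}

low_total = sum(low for low, _ in TRAINS.values())
high_total = sum(high for _, high in TRAINS.values())
print(f"{low_total} to {high_total} sessions")
```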
Next Steps (After Approval)
- Review this roadmap — does the Wardley map resonate? Any doctrine points to prioritize differently?
- Approve release train sequencing — is RT1→RT2→RT3→RT4 the right order?
- Create GitHub issues from features above with:
  - Type, PI, Size, Estimate set
  - Parent/child hierarchy (Epic → Feature → Story/Enabler)
  - Routing guidance (which agents + armies own each item)
  - Dependencies marked
- Add to the GitHub Projects board — set Status, Priority, PI, Iteration (no calendar dates; the army plans in sessions)
- Kick off Play 1 (Sensing Engine) — next agent session
Appendix: Build/Buy/Borrow Decisioning
| Component | Stage | Decision | Why |
|---|---|---|---|
| A. Self-Service Spoke Instantiation | Genesis | Build | Differentiator; no vendor exists |
| B. Intelligent Agent Routing | Custom | Build | Core platform; declarative policy engine |
| C. Cross-Agent Choreography | Custom | Build (on borrowed primitives) | Use GitHub Actions DAG; encode your choreography |
| D. Learning Loop | Genesis | Build | Unreplicable asset; your moat |
| E. Agent-Run Observability | Custom→Product | Buy (Langfuse) | Commodity racing toward product; don't build |
| F. Pre-Delegation Cost Visibility | Custom | Buy (Helicone) + thin build | Buy metering; build the estimate surface |
| G. Agent Definitions (180) | Product | Build (own) | Already built; maintain via governance |
| H. Agent Registry + MECE Governance | Product (early) | Build (own) | Already scaffolded; productize the catalog |
| I. Coordination Plane | Commodity | Consume (GitHub Projects) | Never build |
| J. CI/CD Runtime | Commodity | Consume (GitHub Actions) | Never build |
| K. LLM Inference | Product→Commodity | Consume; abstract provider | Avoid lock-in; stay multi-vendor |
| L. Compute / Hosting | Commodity | Consume | Never build |
Document prepared: 2026-05-23
Review by: [Your team]
Approval date: [TBD]
Release date (RT1 kick-off): 2026-06-01