Agent Elo — Competitive Agent Arena

Deep Market Assessment (Round 2 — Corrected with Live Data)
8 FEBRUARY 2026

I. Thesis

A ranking and routing layer for AI agents [1]. Agents register as callable services (via MCP), get used by both humans and other agents, and earn Elo ratings across multiple quality dimensions: taste, efficiency, depth, and reliability [2]. The best agents get called more. Natural selection for software.
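As a rough illustration of what "Elo ratings across multiple quality dimensions" could look like, the sketch below keeps one rating per dimension and applies the standard Elo update after a pairwise win. The K-factor of 32, the starting rating of 1000, and the `Agent` and `record_comparison` names are assumptions made for this example, not part of any existing Agent Elo implementation.

```python
from dataclasses import dataclass, field

DIMENSIONS = ("taste", "efficiency", "depth", "reliability")
K_FACTOR = 32.0          # assumed step size for each update
START_RATING = 1000.0    # assumed rating for a newly registered agent


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


@dataclass
class Agent:
    name: str
    # One independent rating per quality dimension.
    ratings: dict = field(
        default_factory=lambda: {d: START_RATING for d in DIMENSIONS}
    )


def record_comparison(winner: Agent, loser: Agent, dimension: str) -> None:
    """Update both agents on a single dimension after one pairwise win."""
    win_probability = expected_score(
        winner.ratings[dimension], loser.ratings[dimension]
    )
    delta = K_FACTOR * (1.0 - win_probability)
    winner.ratings[dimension] += delta
    loser.ratings[dimension] -= delta


donna = Agent("Donna")
avet = Agent("avet")
record_comparison(donna, avet, "taste")
print(donna.ratings["taste"], avet.ratings["taste"])
```

Keeping the dimensions separate means an agent can lead on taste while trailing on efficiency, so no single number has to summarize quality.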

Core insight: The GPT Store failed because it had no quality signal [3]: 3 million custom GPTs with no way to discover what's good, no creator revenue, and no ranking. Discovery died. Agent Elo fixes this by making quality measurable, transparent, and market-driven.


II. Dog-Food Signal (Phase 0) ✅

Strongest PMF Signal
  • Eric is living the problem firsthand, building Donna (relationship agent), avet (vetting agent), and OpenClaw infrastructure [4].
  • Conrad Ho just set up OpenClaw on EC2 by himself, making him the first independent pilot user. Eric has breakfast with him Tuesday [4].
  • Jason Chan uses the competing "Poke", a hosted assistant with privacy concerns. Agent comparison is happening organically in Eric's network [4].
  • Real question Eric faces: "Which agent should I use for X?" No ranking system exists. This is the dog-food moment.

III. Market Inflection Point — CRITICAL UPDATE

Previous analysis (with broken web search) said: "6-12 months before peak inflection."

NEW DATA (Feb 8, 2026) says: WE ARE AT THE INFLECTION POINT RIGHT NOW.

OpenClaw Viral Explosion: Past 7 Days [5]
  • 141,000 GitHub stars and 20,900 forks, with 100K+ stars gained in one week [6]
  • 2 million visitors in a single week after going viral [6]
  • Mainstream media coverage: The Verge, Reuters, BBC Science Focus, Mashable, Nature [5]
  • Government warnings: China's Ministry of Industry and Information Technology (MIIT) issued a formal security warning on Feb 5, 2026 [7]
  • Security crisis: 1,100+ exposed instances found, malicious skills in the ClawHub marketplace [7, 8]

Key figures: OpenClaw stars 141K (7 days) · Exposed instances 1,100+ (via Shodan) · Malware skills: hundreds (ClawHub) · Media stories: major (BBC, Reuters)

What this means: Agent adoption has crossed the chasm. The security crisis proves the urgent need for quality/safety ranking. ClawHub has no quality layer, which makes it an open supply chain attack vector [8]. Agent Elo solves the exact problem the market is feeling right now.


IV. Market Sizing (Layered)

Layer | Market Size | Source
Global AI Agent Market (2026) | US$7.1-7.9B | MarketsAndMarkets [9]
Global AI Agent Market (2030) | US$52.6B (CAGR 46.3%) | MarketsAndMarkets [9]
AI Agent Market (2034) | US$236B | Precedence Research [10]
MCP Ecosystem (2026) | 5,000+ servers, 6.6M monthly SDK downloads | Abovo Research [11]
Agent Elo Addressable | US$1-3B (middleware layer, 10-15% of orchestration) | Estimated
Eric's Addressable (2026) | US$0 (pre-product, pre-revenue) | Current state
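A quick sanity check on the growth figures in the table: the sketch below computes the compound annual growth rate implied by the low and high starting values and the US$52.6B 2030 figure. The helper name `implied_cagr` is introduced here for illustration. The arithmetic suggests the quoted 46.3% CAGR sits closer to a five-year horizon than to the four years between 2026 and 2030, so the base-year label is worth confirming against the MarketsAndMarkets report [9] before reusing these numbers.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


END_2030_BILLIONS = 52.6

for start_billions in (7.1, 7.9):
    four_year = implied_cagr(start_billions, END_2030_BILLIONS, 4)
    five_year = implied_cagr(start_billions, END_2030_BILLIONS, 5)
    print(
        f"start US${start_billions}B: "
        f"4-year CAGR {four_year:.1%}, 5-year CAGR {five_year:.1%}"
    )
```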

MCP as Tailwind [11]
  • MCP adoption velocity is unprecedented: 50K+ GitHub stars, support from OpenAI, Anthropic, Google, Microsoft, AWS [11]
  • 2026 = "Agentic Year": models can now reason, act, and operate across multiple tools in real time [12]
  • MCP enables a "society of agents": heterogeneous agents from different providers working together seamlessly [13]
  • Agent Elo sits squarely at the MCP layer, routing and ranking agents that expose MCP services

Key insight: "Competitive agent marketplace" is NOT a recognized market segment. But the infrastructure is here (MCP), the demand is proven (OpenClaw viral), and the pain is acute (security crisis). Category creation opportunity.


V. Competitive Landscape

Direct Competitors (Agent Quality & Discovery)

Player | What They Do | Why They're Not Agent Elo
LM Arena [14] | Crowdsourced Elo for LLMs (GPT-5.1 vs Gemini 3 Pro) | Ranks models, not agents. No marketplace. No routing. Proves Elo works for AI.
OpenRouter [15] | Model routing API ($5M ARR, $100M+ GMV) | Routes models, not agents. No Elo. 5.5% take rate. Direct playbook analog.
Agent.ai [16] | Professional marketplace for agents | No Elo ranking. Discovery is weak (same GPT Store problem). Transaction fees but no quality signal.
AWS Marketplace [17] | Enterprise agent distribution on Bedrock | Enterprise-focused, no public Elo. Governance/audit trails but no taste ranking.
Hugging Face [18] | Model hub ($130M revenue, $4.5B valuation) | 1M+ models but no agent composition or routing. Community downloads ≠ quality signal.
GPT Store [3] | OpenAI's failed agent marketplace | THE KEY PROOF POINT. 3M custom GPTs, terrible discovery, no quality ranking, agents disappearing from search [3]. Exactly what Agent Elo fixes.

Agent Orchestration Frameworks (Adjacent)

Framework | Focus | Adoption
LangChain/LangGraph [19] | Modular orchestration, stateful workflows | Leads GitHub adoption; 86% of copilot spending uses orchestration [20]
CrewAI [19] | Role-based multi-agent collaboration | Production-ready, lightweight, popular for enterprise
AutoGPT [19] | Open-source autonomy pioneer (March 2023) | Experimental, sparked the agent movement

Why orchestration ≠ marketplace: LangChain/CrewAI help you build agents. Agent Elo helps you discover, compare, and route to the best agents. Complementary, not competitive. Agent Elo is the distribution layer.


VI. Failed Examples — The "Don't Build This" List

Company | What They Tried | What Killed Them | Lesson for Agent Elo
GPT Store [3] | Agent marketplace, 3M GPTs | No quality ranking, terrible discovery, no creator revenue | THIS IS THE PROOF. Quality ranking is the missing piece.
ChatGPT Plugins | Tool-use marketplace | Shut down. No quality signal = chaos | Tool discovery without ranking doesn't work. Elo solves this.
Fixie.ai | Agent marketplace → pivoted to enterprise | Consumer marketplace had no adoption | B2C agent marketplace is hard. Start B2B or developer-first.
Crypto agent marketplaces | On-chain agent trading | $50M+ burned, speculation > utility | Avoid crypto rails. Focus on utility, not speculation.
Zapier AI Actions [21] | Natural language API marketplace | Deprecated in 2026, replaced by Zapier Agents | Individual action marketplaces don't scale. Full agents > fragmented tools.

Death Pattern: Quality Signal Failure

Every failed agent marketplace lacked transparent, measurable quality ranking. GPT Store's search problems [3], the ChatGPT Plugins shutdown, and the Fixie pivot all stem from the same root cause: users can't tell what's good. Elo fixes this.


VII. Unit Economics (Benchmarked)

Revenue Model

Metric | Benchmark (OpenRouter) [15] | Agent Elo Estimate
Take rate | 5-5.5% on inference spend | 10-15% on routed agent calls (higher value-add than model routing)
Monthly GMV (at scale) | $8M (OpenRouter, May 2025) | $1-5M (Year 2 target)
Monthly revenue (at scale) | $400K (OpenRouter) | $100-750K (Year 2, 10-15% take rate)
ARR (at scale) | $5M (OpenRouter 2025) | $1.2-9M (Year 2 target range)
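The revenue rows follow directly from GMV multiplied by take rate. The sketch below reproduces that arithmetic so the Year 2 range can be re-derived when the inputs change. The function names are illustrative, and take rates are expressed in basis points only to keep the arithmetic in exact integers.

```python
def monthly_revenue(monthly_gmv_usd: int, take_rate_bps: int) -> int:
    """Platform revenue per month; 100 basis points = 1% take rate."""
    return monthly_gmv_usd * take_rate_bps // 10_000


def annualized(monthly_usd: int) -> int:
    """Simple run-rate ARR: twelve times the monthly figure."""
    return monthly_usd * 12


# OpenRouter benchmark: $8M monthly GMV at a 5% take rate.
openrouter_monthly = monthly_revenue(8_000_000, 500)

# Agent Elo Year 2 range: $1M GMV at 10% up to $5M GMV at 15%.
agent_elo_low = monthly_revenue(1_000_000, 1_000)
agent_elo_high = monthly_revenue(5_000_000, 1_500)

print(f"OpenRouter: ${openrouter_monthly:,}/month, ${annualized(openrouter_monthly):,} ARR")
print(f"Agent Elo:  ${agent_elo_low:,}-${agent_elo_high:,}/month, "
      f"${annualized(agent_elo_low):,}-${annualized(agent_elo_high):,} ARR")
```

At 5% the OpenRouter run rate comes out at $4.8M, which is consistent with the rounded $5M ARR in the table.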

Cost Structure (COGS)

Cost Component | Per-Unit Cost | Notes
Elo computation | ~$0.00001/comparison | Lightweight Bradley-Terry model update [14]
API gateway/routing | ~$0.0001/call | Standard API infrastructure cost
LLM judge evaluation | $0.001-0.01/comparison | DEATH METRIC. At 1M comparisons/day = $1-10K/day. Must optimize or crowdsource. [22]
Storage (traces, leaderboard) | $50-200/month | S3/Postgres for audit logs [23]

Cost Optimization Path
  • Start with crowdsourced human voting (like LM Arena [24]): $0 COGS, high quality
  • Hybrid model: Human votes for training data → fine-tuned judge model → reduce LLM API costs by 10-100x
  • Death scenario: If you rely on GPT-4 for every comparison at scale, COGS explodes. Must solve judge cost before scaling.
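To make the judge-cost risk concrete, the sketch below computes daily LLM judge spend at the per-comparison costs from the cost table, then applies the 10-100x reduction that the hybrid path targets. The function name and the `reduction_factor` parameter are assumptions for this example, and the reduction range is the memo's own target rather than a measured result.

```python
def daily_judge_cost(
    comparisons_per_day: int,
    cost_per_comparison_usd: float,
    reduction_factor: float = 1.0,
) -> float:
    """Daily judge spend; reduction_factor models a cheaper fine-tuned judge."""
    return comparisons_per_day * cost_per_comparison_usd / reduction_factor


COMPARISONS_PER_DAY = 1_000_000

for unit_cost in (0.001, 0.01):
    baseline = daily_judge_cost(COMPARISONS_PER_DAY, unit_cost)
    with_10x = daily_judge_cost(COMPARISONS_PER_DAY, unit_cost, reduction_factor=10)
    with_100x = daily_judge_cost(COMPARISONS_PER_DAY, unit_cost, reduction_factor=100)
    print(
        f"${unit_cost}/comparison: baseline ${baseline:,.0f}/day, "
        f"hybrid judge ${with_100x:,.0f}-${with_10x:,.0f}/day"
    )
```

Even the upper baseline drops to a few hundred dollars a day if the 10-100x reduction holds, which is why the judge cost has to be solved before volume scales.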

Break-Even Analysis

Scenario | Agents on Platform | Monthly Routed Calls | Monthly Revenue (15% take) | COGS | Gross Margin
Optimistic | 500 | 100K | $15K | $2K | 87%
Realistic | 200 | 30K | $4.5K | $1.5K | 67%
Pessimistic | 50 | 5K | $750 | $500 | 33%

Break-even: ~200-500 agents, 6-12 months assuming steady growth. OpenRouter took 18 months to reach $5M ARR [15]; a similar trajectory is expected.
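The scenario table can be re-derived with a few lines of arithmetic. The sketch below recomputes gross margin for each scenario and backs out the gross value per routed call that the revenue figures imply at a 15% take rate. The `Scenario` class is illustrative; the roughly $1 per call figure is an inference from the table's numbers, not something the table states.

```python
from dataclasses import dataclass

TAKE_RATE = 0.15


@dataclass
class Scenario:
    name: str
    monthly_calls: int
    monthly_revenue_usd: float
    monthly_cogs_usd: float

    def gross_margin_pct(self) -> int:
        """Gross margin as a whole-number percentage."""
        profit = self.monthly_revenue_usd - self.monthly_cogs_usd
        return round(100 * profit / self.monthly_revenue_usd)

    def implied_value_per_call(self) -> float:
        """Gross transaction value per call implied by revenue and take rate."""
        return round(self.monthly_revenue_usd / TAKE_RATE / self.monthly_calls, 2)


scenarios = [
    Scenario("Optimistic", 100_000, 15_000, 2_000),
    Scenario("Realistic", 30_000, 4_500, 1_500),
    Scenario("Pessimistic", 5_000, 750, 500),
]

for s in scenarios:
    print(f"{s.name}: margin {s.gross_margin_pct()}%, "
          f"implied ${s.implied_value_per_call():.2f} per routed call")
```

All three scenarios imply about $1 of gross value per routed call, so the break-even estimate is sensitive to that assumption as well as to agent count.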


VIII. Live Market Signals (February 2026)

Security Crisis = Quality Ranking Demand
  • The Verge (Feb 2026): "OpenClaw's AI 'skill' extensions are a security nightmare" [8]
  • eSecurity Planet: "Hundreds of Malicious Skills Found in OpenClaw's ClawHub" [25]
  • 1Password VP (Feb 2026): "ClawHub has become an attack surface" for malware distribution [8]
  • China MIIT Warning (Feb 5, 2026): Formal government warning about OpenClaw security risks [7]

Enterprise Governance Becoming a Requirement [23]
  • Microsoft's governance framework now includes a dedicated "Govern agents" step for responsible AI [26]
  • Audit trail requirements: Proving "what an agent knew, decided, and did — plus who approved it" [23]
  • MCP audit logging for compliance (HIPAA, SOX, PCI-DSS, GDPR) [27]
  • Agent Elo's quality ranking becomes part of the governance/audit layer

Microsoft Research: First-Proposal Bias [28]

Microsoft's Magentic Marketplace research found that all LLMs exhibit severe first-proposal bias, creating 10-30x advantages for response speed over quality [28]. Without explicit ranking, speed beats quality in agent marketplaces.

Implication: Agent Elo must surface quality explicitly to overcome this bias. Elo leaderboard + routing preferences can rebalance toward quality.
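One way to picture the rebalancing: a router that ignores arrival order and instead picks the candidate with the best preference-weighted Elo. The sketch below contrasts that with a first-proposal policy. The `Candidate` class, the agent names, the ratings, and the weights are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    response_ms: int   # how quickly the agent returned a proposal
    ratings: dict      # per-dimension Elo ratings


def first_proposal_choice(candidates: list) -> Candidate:
    """Baseline policy: whoever answers first wins."""
    return min(candidates, key=lambda c: c.response_ms)


def elo_weighted_choice(candidates: list, weights: dict) -> Candidate:
    """Pick the candidate with the highest preference-weighted Elo."""
    total_weight = sum(weights.values())

    def score(candidate: Candidate) -> float:
        weighted = sum(w * candidate.ratings[dim] for dim, w in weights.items())
        return weighted / total_weight

    return max(candidates, key=score)


candidates = [
    Candidate("fast-agent", 200, {"taste": 950.0, "reliability": 980.0}),
    Candidate("careful-agent", 900, {"taste": 1120.0, "reliability": 1080.0}),
]
user_weights = {"taste": 0.6, "reliability": 0.4}

print(first_proposal_choice(candidates).name)
print(elo_weighted_choice(candidates, user_weights).name)
```

The two policies diverge whenever the quickest responder is not the highest rated, which is the case the first-proposal bias finding describes.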

Synthesis: The market is screaming for quality/safety ranking. Security crisis + enterprise governance needs + cognitive biases = perfect storm for Agent Elo's value prop.


IX. GTM Strategy — Founder-Contextualized

Eric's Unfair Advantages

Minimum Viable Test (This Week)

MVP: Public Agent Leaderboard (1-2 days build)
  • Pick 5-10 agents (Donna, avet, OpenClaw skills, Zapier Agents, public MCP servers)
  • Run same task through all agents (e.g., "Draft email to investor," "Research market size for X," "Schedule 3 meetings")
  • LLM judge evaluation on taste, efficiency, depth, correctness
  • Publish Elo leaderboard as static webpage (Vercel)
  • Post to HackerNews + LinkedIn — "I built an Elo leaderboard for AI agents. Here's what I learned."
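The steps above could be wired together roughly as follows. This sketch assumes the judge has already produced a list of pairwise verdicts, and it fits Bradley-Terry strengths with a simple iterative update before mapping them onto an Elo-like scale. The agent names and verdicts are made-up placeholders, the 1000-point baseline is an assumption, and the fit only works if every agent has at least one win.

```python
import math
from collections import Counter


def fit_bradley_terry(verdicts: list, iterations: int = 200) -> dict:
    """Estimate Bradley-Terry strengths from (winner, loser) pairs."""
    agents = sorted({name for pair in verdicts for name in pair})
    wins = Counter(winner for winner, _ in verdicts)
    games = Counter(frozenset(pair) for pair in verdicts)
    strength = {name: 1.0 for name in agents}
    for _ in range(iterations):
        updated = {}
        for a in agents:
            denominator = sum(
                games[frozenset((a, b))] / (strength[a] + strength[b])
                for b in agents
                if b != a
            )
            updated[a] = wins[a] / denominator
        # Rescale so the geometric mean of the strengths stays at 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        scale = math.exp(log_mean)
        strength = {name: s / scale for name, s in updated.items()}
    return strength


def to_elo_scale(strength: dict, baseline: float = 1000.0) -> dict:
    """Map strengths to Elo-style points (400 points per factor of 10)."""
    return {name: baseline + 400.0 * math.log10(s) for name, s in strength.items()}


# Hypothetical judge verdicts as (winner, loser) pairs.
verdicts = (
    [("agent-a", "agent-b")] * 3 + [("agent-b", "agent-a")]
    + [("agent-a", "agent-c")] * 3 + [("agent-c", "agent-a")]
    + [("agent-b", "agent-c")] * 2 + [("agent-c", "agent-b")] * 2
)

ratings = to_elo_scale(fit_bradley_terry(verdicts))
leaderboard = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
for rank, (name, rating) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: {rating:.0f}")
```

A batch fit like this does not depend on the order in which comparisons were run, which is convenient for a static page that is regenerated from a log of verdicts.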

GTM Phases

Phase | Timeline | What to Build | Success Metric
1. Flag Planting | This week | Static leaderboard, 5-10 agents, LLM judge, HN post | 500+ HN upvotes, 10+ agent builders reach out
2. Community Leaderboard | Week 2-4 | Agent submission form, crowdsourced voting (like LM Arena), auto-update leaderboard | 50+ agents submitted, 1K+ community votes
3. API Routing (MVP) | Month 2-3 | Agent registry API, routing by Elo + user preferences, 10% take rate | 10+ paying customers, $1K MRR
4. MCP Integration | Month 4-6 | MCP server for Agent Elo, agents discover/call each other via Elo ranking | 100+ agents using Agent Elo routing, $10K MRR
5. Enterprise Governance | Month 6-12 | Audit trails, approval flows, compliance (like Microsoft's governance layer [26]) | 3-5 enterprise contracts, $50K+ MRR

Distribution Channels (Prioritized)

  1. HackerNews: Perfect audience (devs, builders, early adopters). Post the leaderboard + learnings.
  2. Agent Creator Directory: Merge into Agent Elo. Use directory traffic to seed leaderboard submissions.
  3. Eric's own agents: Donna, avet become first contestants. "Meta" signal: Eric's using his own infrastructure.
  4. Conrad/Jason network: Early pilot users become evangelists if it works.
  5. LangChain/CrewAI communities: Partner with orchestration frameworks. Agent Elo = distribution for their users.
  6. OpenClaw ecosystem: 141K stars, 2M visitors. Perfect timing to offer a quality layer for ClawHub alternatives [5].

Bandwidth Reality Check

Eric's current capacity: Build deficit day 5. Sourcy retainer (high priority), Blackring (high priority), Donna pilots shipping, Wenhao call tonight [4].

Agent Elo time requirement:

Recommendation: Ship Phase 1 this week (Monday deep work block). If HN traction is strong (500+ upvotes, agent builders reach out), justify investing Phase 2 time. Otherwise, shelve until bandwidth opens.


X. Red Team Challenge

Why Agent Elo Works

  • GPT Store failed due to zero quality signal; this fixes that [3]
  • OpenClaw viral explosion proves agent adoption is NOW [5]
  • Security crisis creates urgent demand for quality/safety ranking [7, 8]
  • MCP adoption (5K+ servers) provides the interop layer [11]
  • LM Arena proves Elo works for AI (crowdsourced, transparent) [14]
  • OpenRouter shows routing business model works ($5M ARR) [15]
  • Eric is dog-fooding, with Donna/avet as first contestants [4]
  • Microsoft research validates the need (first-proposal bias) [28]
  • Low COGS if crowdsourced voting (like LM Arena) [24]
  • Fast MVP (1-2 days) = low risk, high learning

Why Agent Elo Might Fail

  • Bandwidth: Eric is maxed (Sourcy, Blackring, Donna) [4]
  • Chicken-egg: need agents to rank, need ranking to attract agents
  • LLM judge cost at scale ($1-10K/day at 1M comparisons/day if not optimized)
  • First-proposal bias persists even with Elo (speed still wins) [28]
  • Agent quality is multi-dimensional (taste ≠ speed ≠ reliability) — hard to collapse into one Elo
  • Agents gaming the system (like SEO but worse)
  • Network effects favor incumbents (OpenAI, Microsoft, Google) — they could build this overnight
  • Category creation is HARD — "agent marketplace" isn't proven yet
  • B2C agent marketplace failed (Fixie) — why would this work?
  • Consumer agents aren't mainstream yet (only developers)

Steel-Man Counter-Argument

"Why wouldn't OpenAI/Microsoft/Google just build this into their platforms?"

They will. And they'll do it badly, like GPT Store [3]. Here's why Agent Elo still works:

Outcome: Agent Elo becomes the de facto neutral ranking layer. Platforms eventually integrate it (like everyone integrated Elo for gaming) or acquire/copy it. Either way, Eric wins by defining the category.


Verdict: YES — Ship MVP This Week

Previous analysis said "conditionally yes, flag-planting side project." NEW DATA changes this to "YES, ship now."

Why the update:

The Minimum Viable Version:

A public webpage running 5-10 agents against the same task, rating output with LLM judge, publishing Elo leaderboard. One deep work session. 4-8 hours. Post to HackerNews. If 500+ upvotes + agent builders reach out → invest Phase 2 time. If not → no loss, 1 day invested.

Timing is CRITICAL: OpenClaw is viral right now. The security crisis is happening now. The conversation about agent quality is live. Ship the MVP this week while the iron is hot. Wait 3 months and the moment passes.

Bandwidth trade-off: Use Monday deep work block (10am-2pm) for Agent Elo Phase 1 instead of cracking gesture ring BLE. Blackring can wait 1 week. Agent Elo's timing window is closing faster.

Success criteria (Week 1): 500+ HN upvotes, 10+ agent builders reach out, 20+ agents submitted to leaderboard. If hit → justify Phase 2 investment. If miss → shelve and return to Blackring/Donna priorities.

The one thing that would change this verdict: If Eric's bandwidth genuinely can't free up 4-8 hours this week (Sourcy emergency, Ilona meeting prep consumes all time), then defer to Week 2. But not Month 2. The inflection point is now.


References

[1] Generect Blog, "What Is MCP (Model Context Protocol)? The 2026 Guide". MCP overview, agent interoperability.
[2] LMSys Blog, "Chatbot Arena: New models & Elo system update". Elo ranking methodology for AI.
[3] OpenAI Community, "GPT Store discovery problems". GPT Store failure, agents disappearing from search.
[4] Eric's personal state files: projects.json, user.json, daily reports (Feb 8, 2026). Donna, avet, OpenClaw context, Conrad Ho pilot, Jason Chan competitive intel.
[5] OpenClaw website and multiple media sources (The Verge, Reuters, BBC, Mashable, Nature). Viral traction data.
[6] OpenClaw GitHub repository. 141K stars, 20.9K forks, 2M visitors in 7 days.
[7] Reuters, "China warns of security risks linked to OpenClaw" (Feb 5, 2026). Government warning, formal MIIT statement.
[8] The Verge, "OpenClaw's AI 'skill' extensions are a security nightmare". ClawHub malware, 1Password VP warning.
[9] MarketsAndMarkets, "AI Agents Market Size, Share, Growth". $7.1-7.9B (2026), $52.6B (2030), CAGR 46.3%.
[10] Precedence Research, "AI Agents Market Size, Share and Trends 2025 to 2034". $236B by 2034 projection.
[11] Abovo Research, "MCP 2025 Deep-Research Report". 5K+ MCP servers, 6.6M monthly SDK downloads, 50K+ GitHub stars.
[12] Generect Blog, "2026 as the Agentic Year". Models can reason, act, operate across tools in real time.
[13] Microsoft Research, "Tool-space interference in the MCP era". Society of agents, heterogeneous collaboration.
[14] LM Arena Leaderboard. Crowdsourced Elo rankings, GPT-5.1 vs Gemini 3 Pro, Bradley-Terry model.
[15] Sacra Research, "OpenRouter at $100M GMV". $5M ARR, 5.5% take rate, $8M monthly GMV (May 2025).
[16] Agent.ai marketplace. Professional network for AI agents, transaction-based.
[17] AWS Marketplace, "AI Agents and Tools". Enterprise agent distribution on Bedrock.
[18] Fueler, "Hugging Face in 2026: Usage, Revenue, Valuation". $130M revenue (2024), $4.5B valuation, 1M+ models.
[19] Iterathon, "Agent Orchestration 2026: LangGraph, CrewAI & AutoGen Guide". Framework comparison, adoption patterns.
[20] JSGuru Jobs, "AI Agent Development Tools 2026". 86% of copilot spending uses orchestration.
[21] Zapier AI Actions documentation. Deprecated 2026, replaced by Zapier Agents.
[22] OpenReview, "Holistic Agent Leaderboard" (ICLR 2026). Agent evaluation infrastructure, 21,730 rollouts cost ~$40K.
[23] Pedowitz Group, "How to Audit AI Agent Decisions and Actions". Audit trail requirements, proving what agent knew/decided/did.
[24] LM Arena, "How It Works". Crowdsourced voting methodology, blind pairwise comparisons.
[25] eSecurity Planet, "Hundreds of Malicious Skills Found in OpenClaw's ClawHub". ClawHub security crisis details.
[26] Microsoft Learn, "Governance and security for AI agents". Enterprise governance framework, "Govern agents" step.
[27] Tetrate, "MCP Audit Logging: Tracing AI Agent Actions for Compliance". HIPAA, SOX, PCI-DSS, GDPR compliance for agents.
[28] Microsoft Research, Magentic Marketplace research paper. First-proposal bias, 10-30x advantage for speed over quality.