I. Thesis
A ranking and routing layer for AI agents.[1] Agents register as callable services via the Model Context Protocol (MCP), get used by both humans and other agents, and earn Elo ratings across multiple quality dimensions: taste, efficiency, depth, and reliability.[2] The best agents get called more. Natural selection for software.
Core insight: the GPT Store failed because it had no quality signal.[3] Three million custom GPTs with no way to discover what's good, no creator revenue, no ranking. Discovery died. Agent Elo fixes this by making quality measurable, transparent, and market-driven.
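The rating mechanics are deliberately simple. A minimal sketch of per-dimension Elo, assuming the standard logistic Elo update; the K-factor of 32 and the 1500 starting rating are conventional defaults, not a spec:

```python
# Per-dimension Elo sketch. The dimensions come from the thesis; the
# K-factor (32) and the 1500 base rating are conventional assumptions.
DIMENSIONS = ("taste", "efficiency", "depth", "reliability")
K = 32  # update step; tune to comparison volume

def new_agent():
    return {d: 1500.0 for d in DIMENSIONS}

def expected(r_a, r_b):
    # Probability that A beats B under the logistic Elo model
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_comparison(a, b, outcomes):
    """outcomes: dimension -> 1.0 (A won), 0.5 (tie), 0.0 (B won)."""
    for dim, score in outcomes.items():
        e = expected(a[dim], b[dim])
        a[dim] += K * (score - e)
        b[dim] += K * ((1.0 - score) - (1.0 - e))

donna, avet = new_agent(), new_agent()
record_comparison(donna, avet, {"taste": 1.0, "efficiency": 0.0})
# donna gains taste rating and loses efficiency rating;
# depth and reliability are untouched by this comparison
```

One comparison moves each judged dimension by at most K points, so even a handful of votes per agent pair starts separating a leaderboard.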
II. Dog-Food Signal (Phase 0) ✅
Strongest PMF Signal
- Eric is living the problem firsthand, building Donna (relationship agent), avet (vetting agent), and OpenClaw infrastructure.[4]
- Conrad Ho just set up OpenClaw himself on EC2 — the first independent pilot user. Eric has breakfast with him Tuesday.[4]
- Jason Chan uses the competing "Poke" — a hosted assistant with privacy concerns. Agent comparison is happening organically in Eric's network.[4]
- The real question Eric faces: "Which agent should I use for X?" No ranking system exists. This is the dog-food moment.
III. Market Inflection Point — CRITICAL UPDATE
Previous analysis (with broken web search) said: "6-12 months before peak inflection."
NEW DATA (Feb 8, 2026) says: WE ARE AT THE INFLECTION POINT RIGHT NOW.
OpenClaw Viral Explosion — Past 7 Days[5]
- 141,000 GitHub stars + 20,900 forks — 100K+ stars gained in one week[6]
- 2 million site visitors in a single week after going viral[6]
- Mainstream media coverage: The Verge, Reuters, BBC Science Focus, Mashable, Nature[5]
- Government warnings: China's Ministry of Industry issued a formal security warning on Feb 5, 2026[7]
- Security crisis: 1,100+ exposed instances found, malicious skills in the ClawHub marketplace[7][8]
| Metric | Value | Context |
|---|---|---|
| OpenClaw stars | 141K | gained in 7 days |
| Exposed instances | 1,100+ | found via Shodan |
| Malware skills | Hundreds | in ClawHub |
| Media stories | Major | BBC, Reuters |
What this means: agent adoption has crossed the chasm. The security crisis proves the urgent need for quality and safety ranking. ClawHub has no quality layer — it is an open supply-chain attack vector.[8] Agent Elo solves the exact problem the market is feeling right now.
IV. Market Sizing (Layered)
| Layer | Market Size | Source |
|---|---|---|
| Global AI Agent Market (2026) | US$7.1-7.9B | MarketsAndMarkets[9] |
| Global AI Agent Market (2030) | US$52.6B (CAGR 46.3%) | MarketsAndMarkets[9] |
| AI Agent Market (2034) | US$236B | Precedence Research[10] |
| MCP Ecosystem (2026) | 5,000+ servers, 6.6M monthly SDK downloads | Abovo Research[11] |
| Agent Elo Addressable | US$1-3B (middleware layer, 10-15% of orchestration) | Estimated |
| Eric's Addressable (2026) | US$0 (pre-product, pre-revenue) | Current state |
MCP as Tailwind[11]
- MCP adoption velocity is unprecedented: 50K+ GitHub stars, support from OpenAI, Anthropic, Google, Microsoft, and AWS[11]
- 2026 = the "Agentic Year" — models can now reason, act, and operate across multiple tools in real time[12]
- MCP enables a "society of agents" — heterogeneous agents from different providers working together seamlessly[13]
- Agent Elo sits perfectly at the MCP layer — routing and ranking agents that expose MCP services
Key insight: "Competitive agent marketplace" is NOT a recognized market segment. But the infrastructure is here (MCP), the demand is proven (OpenClaw viral), and the pain is acute (security crisis). Category creation opportunity.
V. Competitive Landscape
Direct Competitors (Agent Quality & Discovery)
| Player | What They Do | Why They're Not Agent Elo |
|---|---|---|
| LM Arena[14] | Crowdsourced Elo for LLMs (GPT-5.1 vs Gemini 3 Pro) | Ranks models, not agents. No marketplace. No routing. Proves Elo works for AI. |
| OpenRouter[15] | Model routing API ($5M ARR, $100M+ GMV) | Routes models, not agents. No Elo. 5.5% take rate. Direct playbook analog. |
| Agent.ai[16] | Professional marketplace for agents | No Elo ranking. Discovery is weak (same GPT Store problem). Transaction fees but no quality signal. |
| AWS Marketplace[17] | Enterprise agent distribution on Bedrock | Enterprise-focused, no public Elo. Governance/audit trails but no taste ranking. |
| Hugging Face[18] | Model hub ($130M revenue, $4.5B valuation) | 1M+ models but no agent composition or routing. Community downloads ≠ quality signal. |
| GPT Store[3] | OpenAI's failed agent marketplace | THE KEY PROOF POINT. 3M custom GPTs, terrible discovery, no quality ranking, agents disappearing from search.[3] Exactly what Agent Elo fixes. |
Agent Orchestration Frameworks (Adjacent)
| Framework | Focus | Adoption |
|---|---|---|
| LangChain/LangGraph[19] | Modular orchestration, stateful workflows | Leads GitHub adoption; 86% of copilot spending uses orchestration[20] |
| CrewAI[19] | Role-based multi-agent collaboration | Production-ready, lightweight, popular for enterprise |
| AutoGPT[19] | Open-source autonomy pioneer (March 2023) | Experimental; sparked the agent movement |
Why orchestration ≠ marketplace: LangChain/CrewAI help you build agents. Agent Elo helps you discover, compare, and route to the best agents. Complementary, not competitive. Agent Elo is the distribution layer.
VI. Failed Examples — The "Don't Build This" List
| Company | What They Tried | What Killed Them | Lesson for Agent Elo |
|---|---|---|---|
| GPT Store[3] | Agent marketplace, 3M GPTs | No quality ranking, terrible discovery, no creator revenue | THIS IS THE PROOF. Quality ranking is the missing piece. |
| ChatGPT Plugins | Tool-use marketplace | Shut down. No quality signal = chaos | Tool discovery without ranking doesn't work. Elo solves this. |
| Fixie.ai | Agent marketplace → pivoted to enterprise | Consumer marketplace had no adoption | B2C agent marketplaces are hard. Start B2B or developer-first. |
| Crypto agent marketplaces | On-chain agent trading | $50M+ burned, speculation > utility | Avoid crypto rails. Focus on utility, not speculation. |
| Zapier AI Actions[21] | Natural-language API marketplace | Deprecated in 2026, replaced by Zapier Agents | Individual action marketplaces don't scale. Full agents > fragmented tools. |
Death Pattern: Quality Signal Failure
Every failed agent marketplace lacked transparent, measurable quality ranking. The GPT Store's search problems,[3] the ChatGPT Plugins shutdown, the Fixie pivot — all stem from the same root cause: users can't tell what's good. Elo fixes this.
VII. Unit Economics (Benchmarked)
Revenue Model
| Metric | Benchmark (OpenRouter)[15] | Agent Elo Estimate |
|---|---|---|
| Take rate | 5-5.5% on inference spend | 10-15% on routed agent calls (higher value-add than model routing) |
| Monthly GMV (at scale) | $8M (OpenRouter, May 2025) | $1-5M (Year 2 target) |
| Monthly revenue (at scale) | $400K (OpenRouter) | $100-750K (Year 2, 10-15% take rate) |
| ARR (at scale) | $5M (OpenRouter 2025) | $1.2-9M (Year 2 target range) |
Cost Structure (COGS)
| Cost Component | Per-Unit Cost | Notes |
|---|---|---|
| Elo computation | ~$0.00001/comparison | Lightweight Bradley-Terry model update[14] |
| API gateway/routing | ~$0.0001/call | Standard API infrastructure cost |
| LLM judge evaluation | $0.001-0.01/comparison | DEATH METRIC. At 1M comparisons/day = $1-10K/day. Must optimize or crowdsource.[22] |
| Storage (traces, leaderboard) | $50-200/month | S3/Postgres for audit logs[23] |
Cost Optimization Path
- Start with crowdsourced human voting (like LM Arena24) — $0 COGS, high quality
- Hybrid model: Human votes for training data → fine-tuned judge model → reduce LLM API costs by 10-100x
- Death scenario: If you rely on GPT-4 for every comparison at scale, COGS explodes. Must solve judge cost before scaling.
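The death-metric arithmetic, made explicit. A sketch using the $0.001-0.01 per-comparison judge cost from the COGS table, and assuming a fine-tuned judge lands at roughly $0.0001/comparison (the 10-100x reduction claimed above):

```python
# Judge-cost arithmetic for the "death scenario". The per-comparison
# prices are the ranges cited in the COGS table; the fine-tuned figure
# assumes the 10-100x reduction holds.
COMPARISONS_PER_DAY = 1_000_000

def daily_cost(per_comparison):
    return COMPARISONS_PER_DAY * per_comparison

frontier_low, frontier_high = daily_cost(0.001), daily_cost(0.01)
finetuned = daily_cost(0.0001)

print(f"Frontier-LLM judge: ${frontier_low:,.0f}-${frontier_high:,.0f}/day")
print(f"Fine-tuned judge:   ${finetuned:,.0f}/day")
print("Crowdsourced votes: $0/day COGS (paid in contributor time)")
```

At 1M comparisons/day the frontier judge burns $1-10K daily, while the fine-tuned judge stays in the low hundreds — which is why the hybrid path matters before scale.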
Break-Even Analysis
| Scenario | Agents on Platform | Monthly Routed Calls | Monthly Revenue (15% take) | COGS | Gross Margin |
|---|---|---|---|---|---|
| Optimistic | 500 | 100K | $15K | $2K | 87% |
| Realistic | 200 | 30K | $4.5K | $1.5K | 67% |
| Pessimistic | 50 | 5K | $750 | $500 | 33% |
Break-even: ~200-500 agents, 6-12 months assuming steady growth. OpenRouter took 18 months to reach $5M ARR — a similar trajectory is expected.[15]
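The scenario rows above are mutually consistent under one hidden assumption: roughly $1 of GMV per routed call. That figure is an estimate, not sourced. A quick sketch reconstructing the table, so the sensitivity to the assumption is visible:

```python
# Reconstructing the break-even table. GMV_PER_CALL is the implicit
# assumption that makes the rows consistent — an estimate, not a sourced
# number; the revenue column scales linearly with it.
TAKE_RATE = 0.15
GMV_PER_CALL = 1.00  # assumed average $ value of one routed agent call

def scenario(monthly_calls, monthly_cogs):
    revenue = monthly_calls * GMV_PER_CALL * TAKE_RATE
    margin = (revenue - monthly_cogs) / revenue
    return revenue, margin

for name, calls, cogs in [("Optimistic", 100_000, 2_000),
                          ("Realistic", 30_000, 1_500),
                          ("Pessimistic", 5_000, 500)]:
    revenue, margin = scenario(calls, cogs)
    print(f"{name:12s} ${revenue:>8,.0f}/mo  {margin:.0%} gross margin")
```

Running this reproduces the table's revenue and margin columns exactly, which is a useful sanity check whenever the take rate or call-value assumption changes.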
VIII. Live Market Signals (February 2026)
Security Crisis = Quality Ranking Demand
- The Verge (Feb 2026): "OpenClaw's AI 'skill' extensions are a security nightmare"[8]
- eSecurity Planet: "Hundreds of Malicious Skills Found in OpenClaw's ClawHub"[25]
- 1Password VP (Feb 2026): "ClawHub has become an attack surface" for malware distribution[8]
- China MIIT warning (Feb 5, 2026): formal government warning about OpenClaw security risks[7]
Enterprise Governance Becoming a Requirement[23]
- Microsoft's governance framework now includes a dedicated "Govern agents" step for responsible AI[26]
- Audit-trail requirements: proving "what an agent knew, decided, and did — plus who approved it"[23]
- MCP audit logging for compliance (HIPAA, SOX, PCI-DSS, GDPR)[27]
- Agent Elo's quality ranking becomes part of the governance/audit layer
Microsoft Research: First-Proposal Bias[28]
Microsoft's Magentic Marketplace research found that all tested LLM models exhibit severe first-proposal bias, creating 10-30x advantages for response speed over quality.[28] Without explicit ranking, speed beats quality in agent marketplaces.
Implication: Agent Elo must surface quality explicitly to overcome this bias. An Elo leaderboard plus routing preferences can rebalance toward quality.
Synthesis: The market is screaming for quality/safety ranking. Security crisis + enterprise governance needs + cognitive biases = perfect storm for Agent Elo's value prop.
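One concrete way that rebalancing could work at the routing layer: score each candidate agent by a user-weighted sum of its per-dimension Elo, instead of taking the first (fastest) responder. The agent names, ratings, and weights below are illustrative only:

```python
# Quality-weighted routing sketch: pick the agent with the highest
# user-weighted combination of per-dimension Elo. All numbers are
# illustrative, not measured.
def route(agents, weights):
    """agents: {name: {dimension: elo}}; weights: {dimension: importance}."""
    def score(ratings):
        return sum(weights.get(dim, 0.0) * r for dim, r in ratings.items())
    return max(agents, key=lambda name: score(agents[name]))

agents = {
    "fast-but-shallow": {"taste": 1450, "efficiency": 1650, "depth": 1400},
    "slow-but-deep":    {"taste": 1600, "efficiency": 1350, "depth": 1700},
}
# A depth/taste-weighted user profile routes past the merely fast agent:
choice = route(agents, {"taste": 0.4, "efficiency": 0.2, "depth": 0.4})
```

A latency-obsessed caller can still set `{"efficiency": 1.0}` and get the fast agent; the point is that the bias toward speed becomes an explicit, adjustable preference rather than a structural default.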
IX. GTM Strategy — Founder-Contextualized
Eric's Unfair Advantages
- Dog-fooding: building Donna, avet, and OpenClaw infrastructure. Living the agent discovery problem.[4]
- Network: Conrad Ho (first OpenClaw pilot), Jason Chan (Poke user), Wenhao (blue-collar AI), Alice (EdTech). Connected to builders and early adopters.[4]
- Distribution channels: Agent Creator Directory (pivot candidate), HackerNews audience, own agents as contestants
- Technical credibility: Shipped Donna, avet, Sourcy/Brandy, Blackring. Known for fast execution.
Minimum Viable Test (This Week)
MVP: Public Agent Leaderboard (1-2 days build)
- Pick 5-10 agents (Donna, avet, OpenClaw skills, Zapier Agents, public MCP servers)
- Run same task through all agents (e.g., "Draft email to investor," "Research market size for X," "Schedule 3 meetings")
- LLM judge evaluation on taste, efficiency, depth, correctness
- Publish Elo leaderboard as static webpage (Vercel)
- Post to HackerNews + LinkedIn — "I built an Elo leaderboard for AI agents. Here's what I learned."
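The Phase 1 pipeline above can be sketched end to end. Both the agent calls and the LLM judge are stubbed here (real versions would hit each agent's API and a judge model); the agent names and tasks are the illustrative ones from the list:

```python
# Phase 1 sketch: same tasks -> every agent -> pairwise LLM judge -> Elo.
# run_agent and llm_judge are stubs; everything here is illustrative.
from itertools import combinations

def run_agent(agent, task):
    # Stub — a real version calls the agent's API (MCP, HTTP, etc.)
    return f"{agent} output for: {task}"

def llm_judge(task, out_a, out_b):
    # Stub — a real judge prompts an LLM to compare the two outputs on
    # taste/efficiency/depth/correctness. 1.0 = A wins, 0.0 = B, 0.5 = tie.
    return 0.5

def build_leaderboard(agents, tasks, k=32):
    ratings = {a: 1500.0 for a in agents}
    for task in tasks:
        outputs = {a: run_agent(a, task) for a in agents}
        for a, b in combinations(agents, 2):  # round-robin pairings
            score = llm_judge(task, outputs[a], outputs[b])
            e_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
            ratings[a] += k * (score - e_a)
            ratings[b] += k * ((1.0 - score) - (1.0 - e_a))
    return sorted(ratings.items(), key=lambda kv: -kv[1])

tasks = ["Draft email to investor", "Research market size for X"]
board = build_leaderboard(["Donna", "avet", "Zapier Agent"], tasks)
```

Dump `board` to JSON and render it as a static page — that is the entire Phase 1 artifact.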
GTM Phases
| Phase | Timeline | What to Build | Success Metric |
|---|---|---|---|
| 1. Flag Planting | This week | Static leaderboard, 5-10 agents, LLM judge, HN post | 500+ HN upvotes, 10+ agent builders reach out |
| 2. Community Leaderboard | Weeks 2-4 | Agent submission form, crowdsourced voting (like LM Arena), auto-updating leaderboard | 50+ agents submitted, 1K+ community votes |
| 3. API Routing (MVP) | Months 2-3 | Agent registry API, routing by Elo + user preferences, 10% take rate | 10+ paying customers, $1K MRR |
| 4. MCP Integration | Months 4-6 | MCP server for Agent Elo; agents discover/call each other via Elo ranking | 100+ agents using Agent Elo routing, $10K MRR |
| 5. Enterprise Governance | Months 6-12 | Audit trails, approval flows, compliance (like Microsoft's governance layer[26]) | 3-5 enterprise contracts, $50K+ MRR |
Distribution Channels (Prioritized)
- HackerNews — Perfect audience (devs, builders, early adopters). Post the leaderboard + learnings.
- Agent Creator Directory — Merge into Agent Elo. Use directory traffic to seed leaderboard submissions.
- Eric's own agents — Donna, avet become first contestants. "Meta" signal: Eric's using his own infrastructure.
- Conrad/Jason network — Early pilot users become evangelists if it works.
- LangChain/CrewAI communities — Partner with orchestration frameworks. Agent Elo = distribution for their users.
- OpenClaw ecosystem — 141K stars, 2M visitors. Perfect timing to offer a quality layer for ClawHub alternatives.[5]
Bandwidth Reality Check
Eric's current capacity: build deficit, day 5. Sourcy retainer (high priority), Blackring (high priority), Donna pilots shipping, Wenhao call tonight.[4]
Agent Elo time requirement:
- Phase 1 (MVP leaderboard): 4-8 hours (one deep work session)
- Phase 2 (community voting): 12-20 hours over 2 weeks
- Phase 3+ (API routing): 40+ hours (conflicts with current priorities)
Recommendation: Ship Phase 1 this week (Monday deep work block). If HN traction is strong (500+ upvotes, agent builders reach out), justify investing Phase 2 time. Otherwise, shelve until bandwidth opens.
X. Red Team Challenge
Why Agent Elo Works
- GPT Store failed due to zero quality signal — this fixes that[3]
- OpenClaw's viral explosion proves agent adoption is NOW[5]
- Security crisis creates urgent demand for quality/safety ranking[7][8]
- MCP adoption (5K+ servers) provides the interop layer[11]
- LM Arena proves Elo works for AI (crowdsourced, transparent)[14]
- OpenRouter shows the routing business model works ($5M ARR)[15]
- Eric is dog-fooding — Donna/avet as first contestants[4]
- Microsoft research validates the need (first-proposal bias)[28]
- Low COGS if voting is crowdsourced (like LM Arena)[24]
- Fast MVP (1-2 days) = low risk, high learning
Why Agent Elo Might Fail
- Bandwidth — Eric is maxed out (Sourcy, Blackring, Donna)[4]
- Chicken-and-egg: need agents to rank, need ranking to attract agents
- LLM judge cost at scale ($1-10K/day if not optimized)
- First-proposal bias may persist even with Elo (speed still wins)[28]
- Agent quality is multi-dimensional (taste ≠ speed ≠ reliability) — hard to collapse into one Elo
- Agents gaming the system (like SEO but worse)
- Network effects favor incumbents (OpenAI, Microsoft, Google) — they could build this overnight
- Category creation is HARD — "agent marketplace" isn't proven yet
- B2C agent marketplace failed (Fixie) — why would this work?
- Consumer agents aren't mainstream yet (only developers)
Steel-Man Counter-Argument
"Why wouldn't OpenAI/Microsoft/Google just build this into their platforms?"
They will. And they'll do it badly, like the GPT Store.[3] Here's why Agent Elo still works:
- Cross-platform neutrality: Agent Elo ranks agents from all providers. OpenAI won't rank Anthropic agents fairly. Microsoft won't rank Google agents fairly. Switzerland wins.
- Open data: Elo leaderboard is public, transparent, community-driven. Platforms want walled gardens. Transparency wins trust.
- Developer-first: Eric is a builder shipping agents. Platforms are selling tools. Dog-food credibility wins community.
- Speed: Agent Elo MVP ships this week. Platforms take 12-18 months to ship features. First-mover advantage.
Outcome: Agent Elo becomes the de facto neutral ranking layer. Platforms eventually integrate it (like everyone integrated Elo for gaming) or acquire/copy it. Either way, Eric wins by defining the category.
Verdict: YES — Ship MVP This Week
Previous analysis said "conditionally yes, flag-planting side project." NEW DATA changes this to "YES, ship now."
Why the update:
- OpenClaw's viral explosion (141K stars, 2M visitors in 7 days) proves agent adoption has crossed the chasm.[5]
- The security crisis (1,100+ exposed instances, malware in ClawHub) creates urgent demand for quality/safety ranking.[7][8]
- MCP ecosystem maturity (5K+ servers, 6.6M monthly SDK downloads) means the infrastructure is here.[11]
- The GPT Store's continued failure proves quality ranking is the missing piece.[3]
- Eric is dog-fooding — Donna/avet as first contestants = the strongest PMF signal.[4]
The Minimum Viable Version:
A public webpage that runs 5-10 agents against the same tasks, rates the outputs with an LLM judge, and publishes an Elo leaderboard. One deep work session. 4-8 hours. Post to HackerNews. If 500+ upvotes and agent builders reach out → invest Phase 2 time. If not → no loss, one day invested.
Timing is CRITICAL: OpenClaw is viral right now. The security crisis is happening now. The conversation about agent quality is live. Ship the MVP this week while the iron is hot. Wait 3 months and the moment passes.
Bandwidth trade-off: Use Monday deep work block (10am-2pm) for Agent Elo Phase 1 instead of cracking gesture ring BLE. Blackring can wait 1 week. Agent Elo's timing window is closing faster.
Success criteria (Week 1): 500+ HN upvotes, 10+ agent builders reach out, 20+ agents submitted to leaderboard. If hit → justify Phase 2 investment. If miss → shelve and return to Blackring/Donna priorities.
The one thing that would change this verdict: If Eric's bandwidth genuinely can't free up 4-8 hours this week (Sourcy emergency, Ilona meeting prep consumes all time), then defer to Week 2. But not Month 2. The inflection point is now.
References
[3] OpenAI Community — GPT Store discovery problems. GPT Store failure, agents disappearing from search.
[4] Eric's personal state files — projects.json, user.json, daily reports (Feb 8, 2026). Donna, avet, OpenClaw context, Conrad Ho pilot, Jason Chan competitive intel.
[5] OpenClaw website + multiple media sources (The Verge, Reuters, BBC, Mashable, Nature) — viral traction data.
[6] OpenClaw GitHub repository — 141K stars, 20.9K forks, 2M visitors in 7 days.
[11] Abovo Research — MCP 2025 Deep-Research Report. 5K+ MCP servers, 6.6M monthly SDK downloads, 50K+ GitHub stars.
[12] Generect Blog — 2026 as the Agentic Year. Models can reason, act, and operate across tools in real time.
[14] LM Arena Leaderboard — crowdsourced Elo rankings, GPT-5.1 vs Gemini 3 Pro, Bradley-Terry model.
[15] Sacra Research — OpenRouter at $100M GMV. $5M ARR, 5.5% take rate, $8M monthly GMV (May 2025).
[16] Agent.ai marketplace — professional network for AI agents, transaction-based.
[17] AWS Marketplace — AI Agents and Tools. Enterprise agent distribution on Bedrock.
[20] JSGuru Jobs — AI Agent Development Tools 2026. 86% of copilot spending uses orchestration.
[21] Zapier AI Actions documentation — deprecated 2026, replaced by Zapier Agents.
[22] OpenReview — Holistic Agent Leaderboard (ICLR 2026). Agent evaluation infrastructure; 21,730 rollouts cost ~$40K.
[23] Pedowitz Group — How to Audit AI Agent Decisions and Actions. Audit-trail requirements: proving what an agent knew, decided, and did.
[24] LM Arena — How It Works. Crowdsourced voting methodology, blind pairwise comparisons.
[26] Microsoft Learn — Governance and security for AI agents. Enterprise governance framework, "Govern agents" step.
[28] Microsoft Research — Magentic Marketplace research paper. First-proposal bias; 10-30x advantage for speed over quality.