What: A ranking and routing layer for AI agents. Agents register as callable services (via MCP/tool-use), get used by both humans and other agents, and earn an Elo rating across dimensions that matter: taste, efficiency, depth, reliability. The best agents get called more. Bad agents stop getting called. Natural selection for software.
Who buys: Two-sided. Supply: agent builders (indie devs, vibecoder community, Eric's own agents). Demand: other agents needing capabilities, and humans needing the best agent for a task. The marketplace charges a take rate on agent-to-agent calls + premium for ranked routing.
How it works: MCP is the interop standard — agents expose tools/capabilities as MCP servers.1 Agent Elo wraps this with a registry, routing layer, and Elo system that tracks real usage outcomes. When Agent A calls Agent B to do research, Agent B's Elo updates based on the quality of output (rated by Agent A or the human downstream). Over time, the leaderboard becomes the canonical way to discover and compose agents.
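The rating mechanics above can be sketched concretely. This is a minimal illustration of a per-dimension Elo update after one agent-to-agent call, using the standard Elo formula; the function names, the K-factor, and the four-dimension rating shape are assumptions for illustration, not a spec.

```python
# Sketch: per-dimension Elo update after Agent A calls Agent B and the
# caller (or an LLM judge) scores the interaction on each dimension.
DIMENSIONS = ("taste", "efficiency", "depth", "reliability")

def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability A 'beats' B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings_a: dict, ratings_b: dict, scores: dict, k: float = 32.0):
    """scores[dim] is 1.0 if A's output was judged better on that dimension,
    0.0 if B's was, 0.5 for a tie."""
    for dim in DIMENSIONS:
        ea = expected_score(ratings_a[dim], ratings_b[dim])
        ratings_a[dim] += k * (scores[dim] - ea)
        ratings_b[dim] += k * ((1.0 - scores[dim]) - (1.0 - ea))
    return ratings_a, ratings_b

a = {d: 1500.0 for d in DIMENSIONS}
b = {d: 1500.0 for d in DIMENSIONS}
update_elo(a, b, {d: 1.0 for d in DIMENSIONS})  # A wins on every dimension
```

With both agents at 1500, a clean win moves each dimension 16 points (K/2), so ratings converge quickly in the early, data-sparse phase.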
| Dimension | Assessment |
|---|---|
| Technical depth | **Strong.** Builds full-stack agents, self-hosts on Mac mini, manages MCP/WhatsApp/Telegram integrations. Comfortable with Claude, OpenRouter, Supabase, Vercel. Has shipped Donna end-to-end. |
| Network (supply side) | **Strong.** 77+ tracked contacts. Agent creator directory project with David Li (VR). Conrad, Penny Yip, BennyKok, Emmanuel all technical + agent-curious. Vibecoder community access. |
| Network (demand side) | **Emerging.** 7 Donna pilot users. Jason, Bruce, Edward all interested. But demand is early — no paying agent-to-agent users yet. |
| Bandwidth | **Constrained.** 7+ active projects. Build deficit day 5. ~10hr/week on Sourcy retainer. This is the bottleneck. Agent Elo competes for deep work time against Blackring, Donna pilot shipping, Wenhao validation. |
| Capital | **Modest.** HK$16K/mo from Sourcy retainer. No external funding. Ilona call exploring €25K-250K for Blackring, not this project. Bootstrap constraint is real. |
| Unfair advantage | **Has one.** Already building multiple agents (supply), already has agent users (demand), already connected to agent creator community. The Donna/avet/OpenClaw stack = built-in contestants for the arena. |
"Agent Elo" sits at the intersection of three emerging categories: AI agent platforms, API/model marketplaces, and AI evaluation/benchmarking. No research firm tracks "competitive agent marketplaces" as a segment — this is pre-category formation.3
| Layer | Size | Basis | Source |
|---|---|---|---|
| Global: AI Agent Platforms | US$5.6B → $47.1B | 2024 → 2030, ~43% CAGR. Includes all autonomous agent infrastructure. | Grand View Research, MarketsandMarkets estimates3 |
| Segment: API Marketplaces | US$4.5B → $8.3B | 2024 → 2028. RapidAPI = ~US$1B valuation. This is the closest revenue analog. | Verified Market Research4 |
| Segment: AI Model Hubs | US$4.5B (HF valuation) | Hugging Face's valuation benchmarks what a model/agent discovery platform can be worth. | Hugging Face Series D, Aug 20235 |
| Segment: LLM Routing | ~US$10-50M ARR | OpenRouter, Martian, Not Diamond — all routing inference to best model. Pre-revenue to early revenue. | Industry estimates6 |
| Addressable: Eric's reach | US$0 | Zero paying users. 7 pilot users for Donna. ~20 agent-curious contacts. This is pre-revenue, pre-product. | CRM data |
| Company | What They Do | Model | Status | Why They're Not Agent Elo |
|---|---|---|---|---|
| LMSys Chatbot Arena7 | Elo leaderboard for LLMs via blind voting | Free research project (UC Berkeley) | 12M+ votes. Canonical. Unfunded. | Ranks models, not agents. No agent-to-agent calls. No marketplace. No routing. |
| OpenRouter6 | Routes inference to cheapest/best model | 5-20% markup on API calls | Growing. Used by Eric + Donna. | Routes models, not agents. No Elo. No quality feedback loop. |
| OpenHub.ai8 | Decentralized AI market economy for agents | Protocol-native marketplace | Early. Docs-stage. | Closest competitor. But protocol-focused, not taste/quality-focused. No Elo mechanism yet. |
| Magentic Marketplace9 | Research env for studying agentic markets | Open-source (Microsoft) | Academic. Oct 2025 paper. | Research, not product. Studies agent economics but doesn't operationalize it. |
| Hugging Face5 | Model hub + community + leaderboards | Freemium SaaS ($4.5B valuation) | ~US$70M ARR (est. 2024) | Hosts models and datasets. No agent composition. No Elo for agents. No routing. |
| CrewAI10 | Multi-agent orchestration framework | Open-source + enterprise ($18M Series A) | Well-funded. Growing. | Orchestration, not marketplace. Agents are internal to your system, not competing with others. |
| LangChain / LangSmith11 | Agent framework + observability | Open-source + SaaS ($25M Series A) | Dominant framework. | Framework, not marketplace. No external agent discovery or ranking. |
| GPT Store (OpenAI)12 | Marketplace for custom GPTs | Platform (no creator revenue share until late 2024) | Widely considered underwhelming. | See "Failed Examples" below. |
| Company | Model | Revenue / Scale | Playbook | Transferability to Eric |
|---|---|---|---|---|
| LMSys Arena | Free community Elo | 12M+ votes, 0 revenue | Blind A/B voting. Academic credibility. Became the benchmark for LLMs. No monetization. | **Mixed.** Proves Elo works for AI. But they chose not to monetize. Can Eric build the monetized version? |
| Hugging Face | Freemium hub | ~US$70M ARR, $4.5B valuation | Open-source model hosting → community → enterprise SaaS. 7+ years. Network effects from model downloads. | **Low.** Took 7 years + massive VC funding ($400M+ raised). Community flywheel requires scale Eric doesn't have. |
| RapidAPI | API marketplace | ~US$45M ARR, $1B valuation (2022) | Aggregated APIs → single interface → developer adoption. 35K+ APIs listed. Usage-based pricing. | **Instructive.** Closest marketplace analog. But required $300M+ funding and years of supply aggregation. Valuation reportedly dropped post-2022. |
| OpenRouter | Model routing | ~US$10-30M ARR (est.) | Unified API for all LLM providers. 5-20% margin on top. Developer-friendly. Low friction. | **High.** Small team, bootstrap-friendly. Routes to best model per task. Agent Elo could be "OpenRouter for agents" — same playbook, different layer. |
| Not Diamond | AI model routing | US$3M seed (2024) | Uses ML to route queries to optimal model. "Best model for every prompt." Quality-based routing. | **High.** Directly validates quality-based routing as a venture category. Same thesis, different layer (models vs agents). |
| Zapier | Integration marketplace | US$230M ARR (2024), profitable | No-code integrations. 7,000+ apps. Marketplace effects. Took 12+ years. | **Instructive.** Shows integration marketplaces can be massive. But 12 years + no AI = different era. |
| Metric | Benchmark (Winner) | Benchmark (Average) | Agent Elo Estimate | Source |
|---|---|---|---|---|
| Take Rate | 20-30% (App Store) | 10-15% (API marketplaces) | 10-15% on routed calls | Industry standard4 |
| ARPU (agent builder) | ~US$200/mo (HF Pro) | ~US$20-50/mo | ~US$0 (free tier) → US$50-200/mo (pro) | HF pricing5 |
| ARPU (agent consumer) | ~US$20/mo (OpenRouter avg) | ~US$5-10/mo | Usage-based, ~US$10-50/mo | OpenRouter estimates6 |
| Paid conversion | 5-8% (dev tools) | 2-4% | 2-5% | Industry benchmarks |
| Gross margin | 70-85% (SaaS) | 60-70% | 60-80% | Depends on proxy vs routing model |
| Cost Component | Per-Unit Cost | Assumption | Source |
|---|---|---|---|
| Elo computation | ~US$0.001/match | Simple rating update per interaction. CPU-bound, negligible. | Standard Elo algorithm |
| LLM judge (quality eval) | ~US$0.003-0.01/eval | Claude Haiku or GPT-4o-mini to rate output quality. ~500 tokens/eval. | Anthropic/OpenAI pricing14 |
| API gateway/proxy | ~US$0.0001/request | If proxying calls through Agent Elo's infra. Cloudflare Workers or similar. | CF Workers pricing |
| Registry hosting | ~US$50-200/mo | Database + API for agent registry. Supabase or Planetscale. | Supabase pricing |
| Leaderboard/frontend | ~US$0-20/mo | Static site on Vercel. Minimal cost. | Vercel free tier |
Break-even = monthly infra costs (~US$200-500) covered by take-rate revenue. At a 10% take rate, that requires US$2K-5K/mo in agent-to-agent call volume. With 500 active agents averaging US$10/mo of volume each, GMV is US$5K/mo and the take is US$500/mo: barely clearing infra, with zero margin. Need either much higher volume or a premium tier.
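The break-even arithmetic, spelled out as a sanity check. Every input is one of the memo's assumptions, not a measured number.

```python
# Break-even check under the memo's assumed inputs.
take_rate = 0.10                 # assumed take rate on routed calls
agents = 500                     # assumed active agents
volume_per_agent = 10.0          # assumed US$/mo routed per agent
gmv = agents * volume_per_agent  # US$5,000/mo gross volume
take = gmv * take_rate           # US$500/mo platform revenue
infra_low, infra_high = 200.0, 500.0  # estimated US$/mo infra band

print(take, take >= infra_low, take >= infra_high)  # 500.0 True True
```

The take only touches the top of the infra band, which is why the memo concludes volume alone can't carry the model at this scale.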
| Signal | Source | Implication |
|---|---|---|
| Anthropic launches MCP (Nov 2024), rapidly adopted by Cursor, Windsurf, Claude Desktop1 | Anthropic blog, GitHub | Interop is standardizing. Agents can now call other agents as tools. This is the prerequisite for Agent Elo. |
| Microsoft publishes Magentic Marketplace paper (Oct 2025) studying agent-to-agent economics9 | Microsoft Research | Big tech is studying this exact problem. Validates the category. But also means incumbents may build it. |
| OpenAI launches ChatGPT Agent (Jan 2026) — agentic mode for browsing, code, actions15 | OpenAI blog | Consumer expectations shifting to agentic. More agents = more need for ranking/routing. |
| CATArena paper (2025) validates tournament-based agent Elo16 | arXiv | Academic proof that competitive ranking works for agents, not just models. |
| OpenHub.ai publishes protocol docs for decentralized agent economy8 | OpenHub docs | Early-stage competitor/validator. Shows builders are converging on agent marketplace concept. |
| Signal | Implication |
|---|---|
| GPT Store remains underwhelming 12+ months after launch | Agent marketplaces are hard. Discovery + quality + monetization all need to work simultaneously. |
| Magentic Marketplace finds "first-proposal bias" creates 10-30x speed advantage over quality9 | In agent markets, fast beats good by default. Elo needs to counter this — reward depth and taste, not just speed. |
| Every major platform (OpenAI, Anthropic, Google) building their own agent ecosystems | Platform risk. If Anthropic builds MCP routing + ranking natively, Agent Elo gets subsumed. |
| Capability | GTM Action | Effort |
|---|---|---|
| Already building Donna, avet, OpenClaw | Register own agents as first supply. Dog-food the ranking system. | Low — already exists |
| Agent creator directory project (with VR/David Li) | Pivot from "directory" to "ranked arena." Same audience, stronger value prop. | Medium — needs product pivot |
| Conrad set up OpenClaw on EC2 | OpenClaw users = natural first agents to register. Every OpenClaw instance = potential arena contestant. | Low — distribution channel exists |
| Vibecoder/agent builder community access | Launch as "leaderboard for your agent." Builders compete for rank. Vanity + distribution incentive. | Medium — needs community activation |
| MCP expertise (Donna already uses MCP) | Build the MCP-native agent registry. Technical credibility. | Medium — needs build time |
| Phase | What | When | Success = |
|---|---|---|---|
| 0. Leaderboard | Static Elo ranking site for agents. LLM judge + human voting. Public. | Feb 2026 | 50+ agents, 500+ votes |
| 1. Registry | MCP-native agent registry. Agents register, expose capabilities, get discovered. | Mar-Apr 2026 | 100+ agents, 10+ agent-to-agent calls/day |
| 2. Routing | Agent Elo routes requests to highest-ranked agent for task type. "OpenRouter for agents." | Q2 2026 | 1K+ calls/day, first revenue (take rate) |
| 3. Marketplace | Full marketplace. Agents earn from being called. Builders get paid. Elo drives distribution. | Q3-Q4 2026 | US$5K/mo GMV, 500+ active agents |
The strongest counterargument:
"This is too early. There aren't enough agents in the wild to rank. MCP adoption is months old. The 'agentic economy' is a research paper, not a market. Eric should focus on building one great agent (Donna) and worry about ranking agents after there are hundreds of them to rank."
My response: This is probably right for now. The timing question is the crux. Agent Elo in Feb 2026 is a leaderboard experiment. Agent Elo in late 2026 — after MCP has matured, after hundreds of vibecoded agents exist, after the GPT Store's failure has been fully digested — could be the right product at the right time. The play is: plant the flag now (leaderboard), build credibility, expand when the market catches up.
Is this a good opportunity for Eric at this time?
Conditionally yes — as a flag-planting side project, not a primary focus.
The thesis is sound. MCP standardization + agent proliferation + GPT Store failure = clear demand for a quality/routing layer. Eric has the unfair advantage: he's building agents, he's connected to agent builders, he understands MCP deeply. The "OpenRouter for agents" pitch is legible and fundable.
But the timing is early. There aren't enough agents to rank yet. The market is pre-category. Eric's bandwidth is already stretched across 7+ projects with a 5-day build deficit. Adding another primary focus would be destructive.
The one thing that would change the answer: If MCP adoption hits an inflection point (1,000+ public MCP servers, major frameworks integrating agent-to-agent calls as default), this becomes urgent. Watch for that signal.
Recommended path:
1. This week: Don't build Agent Elo. Ship Donna. Crack the ring BLE. Protect Monday deep work.
2. This month: Merge the "Agent Creator Directory" project (with VR/David Li) into Agent Elo. Same audience, stronger thesis. Build a static leaderboard as a weekend project. Register Donna + a few public agents. Post to HN.
3. Q2 2026: If the leaderboard gets traction (50+ agents, viral comparison), invest more build time. Add MCP-native registry. Start routing.
4. If it doesn't get traction: No loss. The leaderboard took 1-2 days to build. Agent creator community connections still valuable for Donna distribution.
The minimum viable version: A public webpage that runs 5-10 agents against the same task, rates their output with an LLM judge, and publishes an Elo leaderboard. One page. One afternoon. Plant the flag.
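That one-afternoon MVP is essentially a round-robin tournament loop: run every pair of agents on the same tasks, ask a judge which output wins, update Elo, sort. A self-contained sketch, where `run_agent` and `judge` are stubs standing in for real agent calls and an LLM judge:

```python
# Sketch of the MVP leaderboard: pairwise judge comparisons -> Elo ranking.
import itertools

def expected(ra: float, rb: float) -> float:
    return 1 / (1 + 10 ** ((rb - ra) / 400))

def tournament(agents, tasks, run_agent, judge, k=32):
    """judge(task, out_a, out_b) -> 1.0 if A wins, 0.0 if B wins, 0.5 tie."""
    elo = {name: 1500.0 for name in agents}
    for task in tasks:
        outputs = {name: run_agent(name, task) for name in agents}
        for a, b in itertools.combinations(agents, 2):
            s = judge(task, outputs[a], outputs[b])
            ea = expected(elo[a], elo[b])
            elo[a] += k * (s - ea)
            elo[b] += k * ((1 - s) - (1 - ea))
    return sorted(elo.items(), key=lambda kv: -kv[1])

# Toy run with stub agents and a judge that prefers the longer output.
board = tournament(
    ["donna", "avet"], ["summarize MCP"],
    run_agent=lambda name, task: task * (2 if name == "donna" else 1),
    judge=lambda task, oa, ob: 1.0 if len(oa) > len(ob) else 0.0,
)
```

Swap the stubs for real agent calls and the `parse_score`-style LLM judge, publish `board` as a static page, and that's the flag planted.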