The fastest way to know what your AI agent is actually doing, and prove it on a public leaderboard.
You wrote an agent. It works. Sometimes. It calls an LLM, it calls a tool, sometimes it loops, occasionally it spends ₹400 on a single user query and you have no idea why.
This org exists to fix that. Open source, framework-agnostic, built so you can go from git clone to a traced agent with a leaderboard rank in under five minutes.
Discord · GitHub · genai-otel-instrument · SmolTrace
```python
# pip install genai-otel-instrument
from genai_otel_instrument import instrument

instrument(
    service_name="my-first-agent",
    otlp_endpoint="http://localhost:4318",  # or point at the public TraceMind Space
    redact_pii=True,  # PII off your traces by default
)

# That's it. Run your agent. Every LLM call, tool call, token, rupee, and
# millisecond of latency is now visible.
```
No SDK lock-in. No daemons. No "you must use our framework." Works with LangGraph, CrewAI, OpenAI Agents SDK, AutoGen, smolagents, vanilla openai, or anything else that hits an LLM API.
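To make "rupee per call" concrete: once prompt and completion token counts are on your spans, per-call cost is simple arithmetic. The sketch below uses made-up placeholder prices (not real rates for any model) to show how a looping agent quietly compounds spend:

```python
# Hypothetical per-million-token prices in INR. Placeholder values for
# illustration only -- not real rates for any provider or model.
PRICING_INR = {
    "example-model": {"prompt": 12.5, "completion": 50.0},
}

def call_cost_inr(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one LLM call, from the token counts a trace span records."""
    p = PRICING_INR[model]
    return (prompt_tokens * p["prompt"] + completion_tokens * p["completion"]) / 1_000_000

# A retry loop that makes 40 calls of ~3k prompt tokens each adds up fast:
total = sum(call_cost_inr("example-model", 3_000, 500) for _ in range(40))
print(f"₹{total:.2f}")
```

A trace viewer does exactly this aggregation for you, per span and per request, which is how a single runaway query becomes visible instead of a mystery line on a bill.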
| Project | What you get |
|---|---|
| genai-otel-instrument | One-line OpenTelemetry instrumentation for any GenAI agent. Captures LLM calls, tool calls, cost, tokens, latency. Auto-redacts PII by default. |
| SmolTrace | Public benchmark + leaderboard for agent evals. Submit an agent, get a rank, compare on cost, latency, and quality. |
| TraceMind | Hosted trace viewer. Point your OTLP endpoint at it, see what your agent did, where it broke, what it cost. No signup. |
| TraceMind-mcp-server | An MCP server so your agent can query its own historical traces. Meta-observability for self-improving agents. |
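If you'd rather not hard-code an endpoint, any OTLP-speaking OpenTelemetry SDK also honors the standard environment variables, so you can retarget traces at TraceMind without touching code. The URL below is a placeholder, not a real endpoint:

```shell
# Standard OpenTelemetry env vars, read by any OTLP-capable SDK.
export OTEL_SERVICE_NAME="my-first-agent"
export OTEL_EXPORTER_OTLP_ENDPOINT="https://<your-tracemind-space>"  # placeholder URL
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
```

This keeps local dev (localhost collector) and the hosted viewer as a one-variable switch.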
| Surface | Space | Tools |
|---|---|---|
| Food delivery | food-delivery-mcp | 7 |
| Grocery / Instamart | instamart-mcp | 6 |
| Dineout / Reservations | dineout-mcp | 5 |
| Dataset | Tasks |
|---|---|
| food-delivery-evals | 111 |
| instamart-evals | 100 |
| dineout-evals | 100 |
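To give a feel for how leaderboard numbers fall out of an eval run, here is a minimal aggregation over per-task results. The record shape below is a guess for illustration only; the actual SmolTrace schema is defined by the datasets themselves:

```python
from dataclasses import dataclass

# Hypothetical result record -- illustrative, not the real SmolTrace schema.
@dataclass
class TaskResult:
    task_id: str
    passed: bool
    cost_inr: float    # total LLM spend for the task
    latency_ms: float  # wall-clock time for the task

def summarize(results: list[TaskResult]) -> dict:
    """Aggregate the metrics a leaderboard entry compares on."""
    n = len(results)
    return {
        "pass_rate": sum(r.passed for r in results) / n,
        "avg_cost_inr": sum(r.cost_inr for r in results) / n,
        "avg_latency_ms": sum(r.latency_ms for r in results) / n,
    }

results = [
    TaskResult("order-001", True, 1.20, 840.0),
    TaskResult("order-002", False, 3.75, 2100.0),
    TaskResult("order-003", True, 0.95, 610.0),
]
print(summarize(results))  # quality, cost, and latency in one dict
```

Ranking on all three axes at once is the point: a cheap agent that fails half its tasks and an accurate agent that burns ₹4 per query both show up honestly.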
For evaluation across other domains, see the TraceMind-AI Collection: 41 SmolTrace-format datasets. Same SmolTrace schema, same prompt-template structure as ours, so you can use them directly; no need to mirror.
food-delivery-agents is the binding repo: reference agents wired with genai-otel-instrument, architecture docs, an observability primer, and leaderboard CI.

The full stack at a glance: genai-otel-instrument, SmolTrace, the public TraceMind, TraceMind-mcp-server, 3 live MCP servers (food / grocery / dineout, 18 tools), 3 of our own eval suites (311 tasks total), 18 mirrored eval datasets, and the food-delivery-agents binding repo, with agents.md standardization across all our Spaces.

Need this stack on-premises with autonomous root-cause analysis, compliance audit trails, multi-year retention, and air-gapped deployment? TraceVerse Enterprise is the bigger sibling built for regulated environments: same telemetry contract, hardened for the bank floor.
Start with genai-otel-instrument on the agent you have right now.