Find the right agent evaluation tool

Choose by eval job, lifecycle stage, hosting model, and required evidence.

Coverage: Tools 6, Traces 5, Sandbox 1
Finder
Choose eval infrastructure by job

Current recommendation: Arize Phoenix (Debug traces, Pre-deploy, Open source)

Arize Phoenix (observability)
Fit: 10
Jobs: Debug traces, Production monitor, Regression tests
Adoption note: Phoenix is strongest when instrumentation and traces already exist or can be added cleanly.

Langfuse (observability)
Fit: 10
Jobs: Debug traces, Production monitor, Human review
Adoption note: Choose Langfuse when trace storage and evaluation operations are part of the workflow, not only one-off benchmarks.

LangSmith Evaluation (platform)
Fit: 10
Jobs: Debug traces, Regression tests, Production monitor
Adoption note: Use LangSmith-specific docs for current plan, retention, and self-hosting details before production adoption.

Tool | Category | Hosting
Arize Phoenix | observability | Open source, Self-hosted, Local
Langfuse | observability | Cloud, Self-hosted, Open source
LangSmith Evaluation | platform | Cloud
OpenAI agent evals | platform | Cloud
Inspect AI | benchmark | Open source, Local
Ragas | framework | Open source, Local

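
Filtering by hosting model and category can be sketched in a few lines of Python. The rows below are copied from the comparison table; the Tool class and the find_tools helper are hypothetical names used only for illustration, not part of any listed product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    category: str
    hosting: frozenset


# Rows copied from the comparison table; nothing else is assumed about the tools.
TOOLS = [
    Tool("Arize Phoenix", "observability", frozenset({"Open source", "Self-hosted", "Local"})),
    Tool("Langfuse", "observability", frozenset({"Cloud", "Self-hosted", "Open source"})),
    Tool("LangSmith Evaluation", "platform", frozenset({"Cloud"})),
    Tool("OpenAI agent evals", "platform", frozenset({"Cloud"})),
    Tool("Inspect AI", "benchmark", frozenset({"Open source", "Local"})),
    Tool("Ragas", "framework", frozenset({"Open source", "Local"})),
]


def find_tools(hosting=None, category=None):
    """Return names of tools that match every filter that is not None."""
    matches = []
    for tool in TOOLS:
        if hosting is not None and hosting not in tool.hosting:
            continue
        if category is not None and tool.category != category:
            continue
        matches.append(tool.name)
    return matches


print(find_tools(hosting="Self-hosted"))
print(find_tools(hosting="Cloud", category="platform"))
```

Hosting labels change over time, so treat the hard-coded rows as a snapshot and confirm them against each tool's own docs.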
Report
Evaluation plan fields
Capture: Traces, inputs, tool calls
Score: Graders, metrics, reviewers
Compare: Runs, datasets, versions
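
One way to keep the three plan fields honest is to write them down as a small data structure and check that none is empty. This is a minimal sketch; EvalPlan and missing_sections are hypothetical names, not an API from any tool listed here.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class EvalPlan:
    # What is recorded per run (e.g. traces, inputs, tool calls).
    capture: list = field(default_factory=list)
    # How runs are judged (e.g. graders, metrics, reviewers).
    score: list = field(default_factory=list)
    # What is held fixed to compare results (e.g. runs, datasets, versions).
    compare: list = field(default_factory=list)

    def missing_sections(self):
        """Return the names of plan sections that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


plan = EvalPlan(
    capture=["traces", "inputs", "tool calls"],
    score=["graders", "metrics"],
)
print(plan.missing_sections())
print(json.dumps(asdict(plan), indent=2))
```

A plan with an empty Compare section can still produce scores, but there is nothing to compare them against.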
Decision frame

Trace first when failure shape is unknown

Debug

Use traces to see model calls, tool calls, handoffs, retrieved context, and failure points.

Regress

Use datasets, graders, metrics, and repeated runs when behavior is stable enough to compare versions.

Govern

Use human review, production scoring, and sandboxing when agent actions affect trust, safety, cost, or external systems.
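
The decision frame can be read as three independent checks. The sketch below encodes them as a hypothetical helper; the parameter names and the yes/no framing are assumptions made for illustration, and real projects will usually need more nuance.

```python
def eval_modes(failure_shape_known, behavior_stable, high_stakes_actions):
    """Map three yes/no answers onto the Debug / Regress / Govern frame.

    The thresholds are deliberately crude: each mode is switched on by a
    single boolean, which is an assumption of this sketch.
    """
    modes = []
    if not failure_shape_known:
        # Trace first when it is unclear where the agent fails.
        modes.append("debug")
    if behavior_stable:
        # Stable behavior makes version-to-version comparison meaningful.
        modes.append("regress")
    if high_stakes_actions:
        # Actions touching trust, safety, cost, or external systems need oversight.
        modes.append("govern")
    return modes


print(eval_modes(failure_shape_known=False, behavior_stable=False, high_stakes_actions=True))
```

The modes are not exclusive: an agent that writes to external systems may need tracing and governance at the same time.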

Sources

Evidence used for this finder

LangSmith Evaluation

Evaluation and observability platform for LLM and agent workflows.

Arize Phoenix

Open-source observability and evaluation workflow for traces, scores, prompts, and experiments.

Langfuse

LLM engineering platform for traces, datasets, experiments, annotation queues, and eval scores.

Ragas

Evaluation library for LLM applications with metrics, experiments, datasets, and agent/tool-use metrics.

Inspect AI

Open-source evaluation framework for coding, agentic, tool-use, reasoning, multimodal, and sandboxed evaluations.

OpenAI agent evals

OpenAI platform workflow for evaluating agent traces with graders, datasets, and eval runs.


Agent Evaluation Tool Finder FAQ

Use this finder to choose eval infrastructure before building a full agent quality loop.

What should I capture before evaluating an AI agent?

Capture the user input, model calls, tool calls, retrieved context, intermediate steps, final answer, latency, cost, and failure notes. Without traces or repeatable datasets, most eval results are hard to debug or compare.
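
As a rough illustration, the fields listed above fit in one record per agent run. AgentTrace and to_jsonl_line are hypothetical names, and the field units (seconds, US dollars) are assumptions; the tracing tools listed on this page define their own schemas.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class AgentTrace:
    user_input: str
    final_answer: str
    model_calls: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    retrieved_context: list = field(default_factory=list)
    intermediate_steps: list = field(default_factory=list)
    latency_s: float = 0.0   # assumed unit: seconds
    cost_usd: float = 0.0    # assumed unit: US dollars
    failure_notes: str = ""


def to_jsonl_line(trace):
    """Serialize one trace as a single JSON line, suitable for appending to a log."""
    return json.dumps(asdict(trace), sort_keys=True)


trace = AgentTrace(
    user_input="What is our refund policy?",
    final_answer="Refunds are available within 30 days.",
    tool_calls=[{"name": "search", "args": {"query": "refund policy"}}],
    retrieved_context=["policy.md#refunds"],
    latency_s=1.8,
    cost_usd=0.004,
)
line = to_jsonl_line(trace)
print(line)
```

Storing one JSON line per run keeps traces easy to replay later as a regression dataset.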

Should I start with traces or benchmark datasets?

Start with traces when you do not yet know where the agent fails. Start with benchmark or regression datasets when the task is stable enough to run the same cases across versions.
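
Running the same cases across versions can be sketched as follows. The two agents are stand-in functions with canned answers, the cases are toy data, and exact_match is a deliberately simple grader; all of these names are hypothetical and chosen only to show the shape of a regression check.

```python
def exact_match(expected, actual):
    """Toy grader: case-insensitive comparison after trimming whitespace."""
    return expected.strip().lower() == actual.strip().lower()


def pass_rate(agent, cases, grader=exact_match):
    """Fraction of cases the agent answers correctly."""
    passed = sum(1 for case in cases if grader(case["expected"], agent(case["input"])))
    return passed / len(cases)


def regressions(old_agent, new_agent, cases, grader=exact_match):
    """Inputs the old version passed and the new version fails."""
    broken = []
    for case in cases:
        old_ok = grader(case["expected"], old_agent(case["input"]))
        new_ok = grader(case["expected"], new_agent(case["input"]))
        if old_ok and not new_ok:
            broken.append(case["input"])
    return broken


CASES = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3*3", "expected": "9"},
]


def agent_v1(prompt):
    # Stand-in for the old version: gets the multiplication wrong.
    return {"2+2": "4", "capital of France": "Paris", "3*3": "6"}.get(prompt, "")


def agent_v2(prompt):
    # Stand-in for the new version: fixes multiplication, breaks addition.
    return {"2+2": "5", "capital of France": "Paris", "3*3": "9"}.get(prompt, "")


print(pass_rate(agent_v1, CASES), pass_rate(agent_v2, CASES))
print(regressions(agent_v1, agent_v2, CASES))
```

In this toy run both versions have the same pass rate, yet the per-case comparison still surfaces a regression, which is why fixed datasets are compared case by case and not only by aggregate score.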

Why does the finder warn about OpenAI Evals?

OpenAI documentation says the general Evals platform is on a deprecation timeline. OpenAI agent-eval and dataset surfaces can still be useful for OpenAI-based workflows, but long-term infrastructure decisions should verify the current supported path first.

Need model candidates first?

Use /match before building evals