DEV Community

# evaluation

Posts

When "Slow Thinking" Is Just "Slow Talking"

3 min read
What Is Agent Evaluation? How EClaw Arena Benchmarks AI Agents Across 12 Dimensions

3 min read
LLM-as-Judge: using Claude to review a Gemini agent

7 min read
The Evaluation Gap: Why We Dont Know If Agents Are Getting Better

2 min read
Origin Part 3: The Teacher Was Scoring It Wrong

9 min read
SQL Comparison Library Architecture

14 min read
Building an LLM Judge That Doesn't Lie to You

8 min read
Build a Production‑Ready SQL Evaluation Engine for LLMs

5 min read
Beyond Text: How We Built an Evaluation Framework for Multi-File AI Outputs

8 min read
Evaluating Vendor Offerings: A Structured Approach to Identify High-Quality, Compatible Tools at Conferences

13 min read
EVAL #006: LLM Evaluation Tools — RAGAS vs DeepEval vs Braintrust vs LangSmith vs Arize Phoenix

10 min read
No Evals, No Idea. How 40% of RAG Answers Go Wrong.

5 min read
Building an LLM Evaluation Framework That Actually Works

7 min read
Evals Aren’t a One-Time Report: Build a Living Test Suite That Ships With Every Release.

6 min read
How I Approach Evaluation When Building AI Features

6 min read