Ask HN: How are people doing AI evals these days?

alexhans · 2026-03-10T09:05:15 1773133515

Very, very heterogenous and fast moving space.

Depending on how they're made up, different teams do vastly different things.

No evals at all, integration tests with no tooling, some use mixed observability tools like LangFuse in their CI/CD. Some other tools like arize phoenix, deepeval, braintrust, promptfoo, pydanticai throughout their development.

It's definitely an afterthought for most teams although we are starting to see increased interest.

My hope is that we can start thinking about evals as a common language for "product" across role families so I'm trying some advocacy [1] trying to keep it very simple including wrapping coding agents like Claude. Sandboxing and observability "for the masses" is still quite a hard concept but UX getting better with time.

What are you doing for yourself/teams? If not much yet, i'd recommend to just start and figure out where the friction/value is for you.

- [1] https://ai-evals.io/ (practical examples https://github.com/Alexhans/eval-ception)

yelmahallawy · 2026-03-12T06:55:49 1773298549

What I've noticed is that it's hard to measure outputs that aren't binary right or wrong, and that's where most human intervention is needed. The biggest examples of this are chatbots and coding agents – basically any output where you can say "hmm well that's a good response, but there is a better response" and that's what still _feels_ like an unsolved problem, benchmarking those kinds of responses.

On top of that, there are combinations of models+prompts that give different results. For example a prompt could yield a great response from Claude, but the same prompt could yield a mediocre response from Gemini. Not just that but different models have different capabilities (example of this is that composite function calling doesn't work the same way for all models).

I'm asking because I'm generally curious on how teams are solving this today – and it _seems_ like there is no gold standard for evals yet although it's gaining interest.

How I do evals today is by testing an output across different dimensions (and it can vary based on use-case): relevance, instruction following, clarity, hallucination rate, etc. which sucks a lot of time (and can never be fully accurate because how do you fully measure something like "clarity"?), and I feel like there's a better way out there.

raw_anon_1111 · 2026-03-11T06:05:53 1773209153

One of my specialties is AWS Connect based call centers

https://hackernews.hn/item?id=47241412

I use LLMs to determine what a caller’s “intent” is. I do my best with my initial prompt and then I have the “business” test it and I log phrases that they use.

I then make those phrases my scripted test suite. Any changes in prompts or models get put through the same test suite. In my case, I give my customers a website they can use to test new prompts and takes care of versioning.

I also log phrases that didn’t trigger an intent and modify the prompt and put it back through the suite.

yelmahallawy · 2026-03-12T07:09:43 1773299383

Do you play with the temperature/top k parameters at all?

raw_anon_1111 · 2026-03-12T07:39:28 1773301168

No. I also use the least sophisticated but fastest model that Amazon hosts - and it hosts all of them except OpenAI models - Nova Lite

Going from free text to tool call with parameters in the grand scheme of things is one of the easiest things to do especially when you only have a limited number of tools.

yelmahallawy · 2026-03-13T03:54:51 1773374091

Makes sense, simpler=better. Thanks!

kelseyfrog · 2026-03-11T00:16:33 1773188193

Automated benchmarking.

We were lucky enough to have PMs create a set of questions, we did a round of generation and labeled pass/fail annotations on each response.

From there we bootstrapped AI-as-a judge and approximately replicated the results. Then we plug in new models, change prompts, pipelines while being able to approximate the original feedback signal. It's not an exact match, but it's wildly better than one-off testing and the regressions it brings.

We're able to confidently make changes without accidentally breaking something else. Overall win, but it can get costly if the iteration count is high.

yelmahallawy · 2026-03-12T07:15:56 1773299756

This is interesting approach, thanks for the insight! If I may ask, _approximately_ how long does it take to test a newly-released model with the current strategy?

kelseyfrog · 2026-03-12T16:23:23 1773332603

20mins or so. The bottleneck is rate-limiting. It's amenable to parallelization. Each tests can run in isolation at the same time.

yelmahallawy · 2026-03-13T03:56:14 1773374174

Gotchu. Yeah that's pretty quick, awesome thanks!

fluffet · 2026-03-11T13:56:53 1773237413

It's kind of bespoke for me tbh.

For a co-pilot inside an app that could answer product questions, I looked at ~2000 or so support emails. I asked one LLM to dig out "How would you formulate the users question into a chatbot-like question from this email thread" and "What is the actual answer that should be in the response from this email thread", then just asked our bot that question, and have another LLM rate the answer like SUPERIOR | ACCEPTABLE | UNKNOWN etc. These labels proved out to be a good "finger in the wind"-indicator for altering the chunks, prompt changes or model updates.

For an invoice procesing app processing about 14M invoices/year, it was mostly doing fuzzy accuracy metrics against a pretty ok annotated dataset and iterating the prompt based on diffs for a long time. Once you had that dataset you could alter things and see what broke.

Currently, I work on an app with a pretty sophisicated prompt chain flow. Depending on bugs etc we kind of do tests against _behaviour_, like intent recognition or the correct sql filters. As long as the baseline is working with the correct behaviour, whatever model is powering it is not so important. For the final output, it's humans. But we know immediately if some model or prompt change broke some particular intent.

yelmahallawy · 2026-03-12T07:07:52 1773299272

This makes sense. I am particularly interested in your invoice processing app example because the accuracy of those outputs can be quantitatively measured from 0%-100% accuracy.

I'm curious as to what is _good enough_ and how many iterations it takes to get there. Is 100% the only acceptable threshold? If so, how many iterations does that take? What does that process look like? Okay let's say 100% accuracy is too difficult to reach, then how do you choose your minimum acceptable threshold (is 95% accuracy good enough? is 90%?). Do you have a dedicated set of outputs and documents used for evals? I'd love to hear more about this example (if you worked directly on the evals for this app).

kbdiaz · 2026-03-11T04:52:22 1773204742

The vast majority of AI companies I talk to seem to evaluate models mostly based on vibes.

At my company, we use a mix of offline and online evals. I’m primarily interested in search agents, so I’m fortunate that information retrieval is a well-developed research field with clear metrics, methodology, and benchmarks. For most teams, I recommend shipping early/dogfooding internally, collecting real traces, and then hand-curating a golden dataset from those traces.

Many people run simple ablation experiments where they swap out the model and see which one performs best. That approach is reasonable, but I prefer a more rigorous setup.

If you only swap the model, some models may appear to perform better simply because they happen to work well with your prompt or harness. To avoid that bias, I use GEPA to optimize the prompt for each model/tool/harness combination I’m evaluating.

yelmahallawy · 2026-03-12T07:20:15 1773300015

Ah, interesting – yeah only swapping out the model isn't super insightful since models perform differently given different prompts. I'm going to look into GEPA, thanks!

minikomi · 2026-03-11T04:37:01 1773203821

The more you can afford to build up your understanding of the problem space and define what inputs & outputs look like, the more flexible you can be with evals. Unfortunately, this is a lot of work and requires thinking and discussion with your team and those involved.

https://poyo.co/note/20260217T130137/

I wrote about general ideas I take towards simple single prompt features, but most of it is applicable to more involved agentic approaches too.

yelmahallawy · 2026-03-12T07:24:29 1773300269

Ah good read, thanks for sharing!

mock-possum · 2026-03-11T06:32:52 1773210772

We feed a handful of preset questions through the new AI, we collect the results, we ask another AI to score the answers based on example ‘hood’ answers we’ve written, then we have a guy sit down and use the fallout as a starting point to rank the performance of that AI, compared to all the previous ones.

Seems like it works pretty well. Our prompts and params get tweaked towards better and better results, and we get a sense of what’s worth paying more for.

yelmahallawy · 2026-03-12T07:47:13 1773301633

The guy who reviews all of this, is his role in the company fully dedicated to reviewing these eval pipelines?

mock-possum · 2026-03-17T15:37:46 1773761866

Yeah - and he’s kind of a black box of a contractor, people kept saying his name, which is unusual, and at first I figured it was some software or other company we were using - eventually I realized it was a real guy, who we just feed LLM results to and he ranks them for us. He’s not a full time employee and I’ve never actually seen him or had any contact with him, so now I think it’s entertaining to imagine that he’s a figment of the CEO’s imagination - his alter ego that takes over after hours and obsessively reviews LLM outputs.

satisfice · 2026-03-11T04:28:58 1773203338

It’s called testing. And from the reports and comments, there doesn’t seem to be much of it happening. The reason is: it’s quite expensive to do well.

I find that for every hypothesis I might have to run a thousand prompts to collect enough data for a conclusion. For instance, to discover how reliably different models can extract noun phrases from a text: hours of grinding. Even so that was for a small text. I haven’t yet run the process on a large text.

aszen · 2026-03-11T07:54:58 1773215698

Seems like you are testing llms genric abilities rather than your actual agent logic.

Llms are like vendor code you don't need to test them yourself people already created benchmarks for that.

satisfice · 2026-03-12T02:11:50 1773281510

No they haven’t. The benchmarks suck, because they are cheap knockoffs instead of comprehensive experiments.

LLMs are poorly tested by vendors. They literally can’t afford to test them, so they force us to do it.

yelmahallawy · 2026-03-12T07:27:27 1773300447

Yeah it's a super tedious process and I was hoping that _maybe_ there is a tool out there that can help with this.

tmaly · 2026-03-11T15:10:48 1773241848

I was thinking about something similar the other day. I have seen a repeating pattern of people complaining that a new model comes out, it's amazing for a few weeks, then they nerf it.

Most of these claims are subjective. I was thinking if we had a standardized chain of though representation, and if we could capture each models chain of thought into this standardized format, we could compare these for the same tasks we run.

yelmahallawy · 2026-03-12T07:22:21 1773300141

Yeah that's essentially what I'm looking for. Since now that AI has become such a core part of most businesses, it's pretty critical to use the _best_ models + prompts for whatever your use case is.

bisonbear · 2026-03-10T20:39:04 1773175144

assume you're referencing coding agents - I don't think people are. If they are, it's likely using

- AI to evaluate itself (eg ask claude to test out its own skill) - custom built platform (I see interest in this space)

I've actually been thinking about this problem a lot and am working on making a custom eval runner for your codebase. What would your usecase be for this?

yelmahallawy · 2026-03-12T07:37:36 1773301056

I'd love to hear more about what you're working on (if you're open to sharing!).

I like to play with knowledge base powered chatbots but what's most useful to me (and probably my primary use case) is coding agents since I use CC every day. Recently I just heard about Minimax m2.5 which apparently is a pretty good coding agent (they say it's comparable to opus 4.6) but I haven't tried it yet — plus it'd take a lot of time to figure out whether it's better or not.

mierz00 · 2026-03-11T10:27:55 1773224875

I highly rate Braintrust.

It wouldn’t be too difficult to build something like that for your own usage, but I found it pretty easy to get datasets set up.

Essentially a game changer in understanding if your prompts are working. Especially if you’re doing something which requires high levels of consistency.

In our case we would use LLM for classification which fits in perfectly with evals.

yelmahallawy · 2026-03-12T07:43:18 1773301398

Have some good takeaways / feedback on this? First time I hear about Braintrust (the eval platform) so I'll look into it but I'm curious on your experience with it so far.

mierz00 · 2026-03-17T07:57:10 1773734230

If I am being honest, the value came from doing evals and testing against different models.

Essentially all I needed was a way to upload a data set, run tests against that data set and spit out a percentage of pass fail.

Braintrust makes this pretty easy, but If I was to do it again I would vibecode the same functionality.

maxalbarello · 2026-03-11T00:27:46 1773188866

Also wondering how to evals agentic pipelines. For instance, I generated memories from my chatGPT conversation history, how do I know whether they are accurate or not?

I would like a single number that I would use to optimize the pipeline with but I find it hard to figure out what that number should be measuring.

yelmahallawy · 2026-03-12T07:40:53 1773301253

And I think this is a common problem actually — figuring out what to measure and how to measure it – it's not black and white. What I do is have a few dimensions to measure it against (this may or may not fit your use case): relevance, instruction following, clarity, hallucination rate, etc. but even then, it becomes hard to measure things like 'clarity'.

gabdiax · 2026-03-11T13:28:57 1773235737

I'm also curious about this.

In some cases I've seen teams rely on a mix of automated metrics and human review, especially for production systems where reliability matters a lot.

But evaluation pipelines for AI still seem much less standardized compared to traditional software monitoring.

yelmahallawy · 2026-03-12T07:45:26 1773301526

Yeah, it feels like an unsolved problem still. I've also seen many teams spend hours on human review in eval pipelines (and this accumulates with each new model that gets released).

gabdiax · 2026-03-15T22:21:46 1773613306

I’m building EventSentinel.ai, a predictive AI platform that monitors hardware and network infrastructure to detect early signals of failures and connectivity issues before they cause downtime.

I’m looking for a few early-stage design partners (SRE / DevOps / IT / Network teams) who:

Manage on‑prem or hybrid infrastructure with critical uptime requirements

Are currently using tools like Datadog, PRTG, Zabbix, or similar, but still deal with “surprise” incidents?

Are open to trying an MVP and giving candid feedback in short feedback sessions?

What you’d get:

-Early access to our predictive failure and anomaly detection features

-Direct influence on the roadmap based on your needs

-Free usage during the MVP phase (and preferential terms later)

If this sounds relevant, drop a comment “interested” and I’ll follow up with details or email at gabriele@eventsentinel.ai

celestialcheese · 2026-03-11T00:01:58 1773187318

mix of promptfoo and ad-hoc python scripts, with langfuse observability.

Definitely not happy with it, but everything is moving too fast to feel like it's worth investing in.

yelmahallawy · 2026-03-12T07:29:42 1773300582

Any takeaways on Promptfoo?

moltar · 2026-03-11T08:45:11 1773218711

I use Promptfoo

yelmahallawy · 2026-03-12T07:28:50 1773300530

Any takeaways? Has it been helpful? OpenAI just acquired them so it's probably useful but I was curious to hear more from people who've actually used it.

moltar · 2026-03-13T22:21:58 1773440518

Yes, very useful. Can’t imagine managing a large interconnected prompt collection without it.

dkoy · 2026-03-11T00:31:23 1773189083

Curious who’s used OpenAI Evals

rurban · 2026-03-11T05:57:58 1773208678

Doing tickets and commenting cost and quality in the PR.

Still, the best are outstanding, and the medium ones bare usable. I rank it by IQ. From 140 to utterly stupid. opencode/gpt-oss-120b local got a 90. opencode/opus-4.6 gets 140. codex/gpt-5.4 gets 115. All for C/C++ tasks.

There was one expensive Chinese SWE benchmark posted recently to arxiv. It did confirm my evaluation.

HPSimulator · 2026-03-11T02:41:42 1773196902

[flagged]

kristianp · 2026-03-11T06:00:33 1773208833

What you've said is nothing to do with AI Evals.