Show HN: EvalsHub: Your AI is failing in production and you don't know it

AgentTax · 2026-03-20T21:31:27.000Z 1774042287

The consolidation angle makes sense: the Langfuse + promptfoo + custom scripts stack is genuinely painful. The question I'd ask is whether trading depth for breadth is worth it. Each of those tools goes deep in its own domain. What does EvalsHub sacrifice to cover all three, and where does it still defer to specialists? I'm also curious how you handle the rubric quality problem. LLM-as-a-judge is only as good as its criteria. Do you have tooling that helps teams tell when their rubrics are underspecified?
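
To make "underspecified" concrete, here is a rough sketch of one possible signal: run the judge several times per item and flag any criterion where the verdicts disagree a lot. This is only an illustration, not a description of how EvalsHub or any of the tools above work. The function names, the `runs` data shape, and the 0.2 threshold are all hypothetical.

```python
from collections import Counter


def disagreement_rate(verdicts):
    """Fraction of verdicts that differ from the most common verdict."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    most_common_count = Counter(verdicts).most_common(1)[0][1]
    return 1 - most_common_count / len(verdicts)


def flag_underspecified(runs, threshold=0.2):
    """Return {criterion: mean disagreement} for criteria above the threshold.

    `runs` is a hypothetical shape: {criterion: [verdicts_for_item_1, ...]},
    where each inner list holds repeated judge verdicts for one item.
    """
    flagged = {}
    for criterion, per_item in runs.items():
        rates = [disagreement_rate(verdicts) for verdicts in per_item]
        mean_rate = sum(rates) / len(rates)
        if mean_rate > threshold:
            flagged[criterion] = mean_rate
    return flagged


# Made-up verdicts: five repeated judge runs on each of two items.
runs = {
    "cites_source": [["pass"] * 5, ["fail"] * 5],
    "is_helpful": [
        ["pass", "fail", "pass", "fail", "pass"],
        ["fail", "pass", "pass", "fail", "fail"],
    ],
}
flagged = flag_underspecified(runs)
print(sorted(flagged))
```

High disagreement can also come from sampling temperature or genuinely borderline items, so a flag like this is a prompt to reread the rubric rather than proof that it is broken.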