The evaluation engine behind Helios AI agents
At Helios, we build AI agents for businesses where every customer interaction is high-stakes, with the potential to shape satisfaction, loyalty, and revenue in real time. That means we need to continuously evaluate our models to optimize agent performance and ensure they deliver concierge-quality support while improving key metrics like customer satisfaction (CSAT) and resolution rate.
Our AI agents are powered by a diverse set of models and orchestrated workflows that work together to deliver exceptional performance. We use models from OpenAI, Anthropic, Gemini, and other providers, alongside our own fine-tuned versions of open-source and commercial models. We rigorously assess every component of the agent, from response generation to retrieval to safety.
Inside the Helios Model Evaluation Loop
We evaluate every model and system component through a two-phase framework: offline evaluation, followed by online A/B testing. This structured methodology ensures that only the most effective and reliable configurations are deployed to production to serve 100% of user traffic.
Each of these configurations, known as experiment variants, represents a different approach to agent behavior and performance. An experiment variant might involve anything from a new system prompt to an entirely new language model.
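To make the idea concrete, a variant can be thought of as a small, immutable bundle of configuration. Helios has not published its internal schema, so the field names below are purely illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentVariant:
    # Hypothetical variant record; fields are illustrative, not Helios's actual schema.
    name: str
    model: str            # a provider model or a fine-tuned checkpoint identifier
    system_prompt: str    # instructions prepended to every conversation
    temperature: float = 0.2

# Two variants that differ only in their system prompt:
baseline = ExperimentVariant(
    name="baseline",
    model="provider-model-v1",
    system_prompt="You are a helpful support agent.",
)
candidate = ExperimentVariant(
    name="new-prompt",
    model="provider-model-v1",
    system_prompt="You are a concise, empathetic support agent.",
)
```

Freezing the dataclass keeps a variant's definition stable for the lifetime of an experiment, so offline and online results can always be traced back to one exact configuration.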
Offline evaluation: scalable, continuous, and stratified
Every variant must demonstrate consistent performance offline before it is eligible for online testing. Our offline evaluation process is structured around two core tracks, each providing a distinct but complementary signal.
LLM-as-judge evaluation
Using an LLM-as-judge system, we evaluate structured triplets consisting of a user query, the context provided to the model, and the model's generated response. These triplets are drawn from real-world interactions, ensuring that our evaluation process reflects authentic user needs and scenarios.
Each response is scored against several key criteria:
- Relevance: Does the response directly address the user's question?
- Correctness: Does the response reflect accurate information?
- Naturalness: Does the response sound human and conversational?
- Empathy: Does the response demonstrate understanding and care?
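A judge pass over one triplet can be sketched as follows. The rubric wording and the `call_judge` hook are assumptions for illustration; any function that sends the prompt to a judge model and returns per-criterion scores would plug in:

```python
from dataclasses import dataclass

# The four criteria from the rubric above, scored on an assumed 1-5 scale.
CRITERIA = ["relevance", "correctness", "naturalness", "empathy"]

@dataclass
class Triplet:
    query: str      # the user's question
    context: str    # the context provided to the model
    response: str   # the model's generated response

def judge_prompt(t: Triplet) -> str:
    """Render a triplet into a rubric prompt for the judge model."""
    rubric = "\n".join(f"- {c}: score 1-5" for c in CRITERIA)
    return (
        "Score the RESPONSE on each criterion below.\n"
        f"{rubric}\n\n"
        f"QUERY: {t.query}\nCONTEXT: {t.context}\nRESPONSE: {t.response}"
    )

def score(t: Triplet, call_judge) -> dict:
    """call_judge is any callable that sends the prompt to a judge LLM
    and returns {criterion: score}. It is stubbed out here."""
    return call_judge(judge_prompt(t))
```

Keeping the criteria in one list makes it easy to add or drop a dimension without touching the scoring plumbing.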
We also audit a subset of triplets with human labelers to further validate the LLM-as-judge scoring. These audits provide an extra layer of confidence and help us catch edge cases that automated evaluations might miss.
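One simple way such an audit can be wired up, assuming a shared 1-5 scale for judge and human scores (the sampling fraction and agreement tolerance here are illustrative, not Helios's actual parameters):

```python
import random

def audit_sample(triplet_ids, fraction=0.05, seed=0):
    """Draw a reproducible subset of triplets for human labeling."""
    rng = random.Random(seed)
    k = max(1, int(len(triplet_ids) * fraction))
    return rng.sample(triplet_ids, k)

def agreement_rate(judge_scores, human_scores, tolerance=1):
    """Fraction of audited items where the judge and human scores land
    within `tolerance` points of each other on the same scale."""
    pairs = [(judge_scores[i], human_scores[i]) for i in human_scores]
    agree = sum(abs(j - h) <= tolerance for j, h in pairs)
    return agree / len(pairs)
```

A falling agreement rate is a useful alarm: it can flag either judge drift or an edge case the rubric does not cover.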
Ground truth evaluation
In parallel, we evaluate responses against a ground truth evaluation set: a curated collection of user queries with ideal responses, labeled by human experts. This high-confidence benchmark allows us to test for factuality and intent coverage with precision.
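Helios's exact factuality and intent-coverage checks are not public, but a common stand-in for scoring a response against a human-written ideal answer is token-overlap F1 (the SQuAD-style metric), sketched here for illustration:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated response and a
    human-labeled ideal response. Illustrative only; not Helios's
    actual ground-truth metric."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts shared tokens, respecting duplicates.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

In practice a lexical metric like this would be one signal among several, since a response can be faithful to the reference while sharing few exact tokens.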
The combination of scalable LLM-based judgment and curated ground truth benchmarking gives us a fast, nuanced, and reliable evaluation signal before any model sees production traffic.
Online A/B testing: real-world performance evaluation
Once a variant passes offline thresholds, we move it into controlled online A/B testing with real customers. We manage rollouts by gradually increasing traffic to the variant group as performance improves, allowing us to limit risk while gaining signal.
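A gradual ramp like this is commonly implemented with deterministic hash bucketing, so a given customer's assignment is stable across sessions and raising the ramp fraction only ever adds users to the variant group. A minimal sketch, assuming user IDs and a per-experiment salt (neither drawn from Helios's actual system):

```python
import hashlib

def assigned_to_variant(user_id: str, ramp_fraction: float, salt: str = "exp-42") -> bool:
    """Deterministic traffic split: hash the user into [0, 1) and admit
    them while their bucket falls below the current ramp fraction.
    Increasing ramp_fraction never flips an admitted user back to control."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return bucket < ramp_fraction
```

Because assignment depends only on the hash, rolling back is as simple as lowering the fraction, and no per-user state needs to be stored.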
During this phase, we track impact on business-critical metrics like CSAT and resolution rate through a unified evaluation view. This stage provides the ultimate signal: how well the model performs in live, unpredictable, customer-facing settings.
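How Helios tests whether a metric shift is real rather than noise is not disclosed; a standard baseline for a binary metric like resolution rate is the two-proportion z-test, sketched here:

```python
from math import sqrt

def resolution_rate_z(control_resolved, control_total, variant_resolved, variant_total):
    """Two-proportion z-statistic for resolution rate: a positive value
    means the variant resolved a higher share of conversations than
    control. Illustrative baseline, not Helios's actual stats stack."""
    p1 = control_resolved / control_total
    p2 = variant_resolved / variant_total
    # Pooled rate under the null hypothesis of no difference.
    p = (control_resolved + variant_resolved) / (control_total + variant_total)
    se = sqrt(p * (1 - p) * (1 / control_total + 1 / variant_total))
    return (p2 - p1) / se
```

A |z| above roughly 1.96 corresponds to the conventional 95% confidence threshold, a natural gate before ramping a variant further.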
Continuous feedback and improvement
Evaluation does not stop once a variant ships. Insights from every phase feed back into the experiment loop: improving prompts, upgrading models, and refining overall agent behaviors.
Beyond evaluation, our platform also provides built-in tools that give teams visibility and control over agent performance. Features like always-on QA monitoring and natural language analysis of customer conversations help CX teams monitor and improve outcomes at scale.
If you would like to see how our evaluation loop powers real-world performance, book a demo to learn more about Helios.
Marco Salinas
ML Research Engineer