Helios

Engineering fast, cost-efficient agents through fine-tuning

Diego Morales · Senior ML Engineer · November 20, 2025 · 10 min read

At Helios, we do not rely on a single monolithic model to power our agents. We use a layered approach: large frontier models for complex reasoning, smaller fine-tuned models for high-frequency, well-defined tasks. This architecture lets us optimize for latency, cost, and accuracy simultaneously.

Why fine-tuning matters at production scale

Frontier models like GPT-4o and Claude 3.5 Sonnet are remarkable general-purpose reasoners. But at production scale — millions of conversations per month — the economics of routing every interaction through a frontier model are challenging.

More importantly, many high-frequency tasks in customer service are narrow and well-defined enough that a smaller, fine-tuned model can outperform a larger general model on the specific task while running at a fraction of the latency and cost.
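As an illustration, a routing policy in this spirit might look like the sketch below. The model identifiers, intent set, and confidence threshold are all hypothetical stand-ins, not Helios's production values:

```python
# Hypothetical model identifiers and intent set, for illustration only.
SMALL_MODEL = "ft-7b-customer-service"
FRONTIER_MODEL = "frontier-large"

HIGH_FREQUENCY_INTENTS = {"order_status", "password_reset", "refund_policy"}

def route(intent: str, confidence: float, threshold: float = 0.9) -> str:
    """Route narrow, high-confidence requests to the small fine-tuned
    model; everything else escalates to the frontier model."""
    if intent in HIGH_FREQUENCY_INTENTS and confidence >= threshold:
        return SMALL_MODEL
    return FRONTIER_MODEL
```

The key design choice is that the router fails safe: anything outside the well-defined, high-confidence region goes to the larger model.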

Our fine-tuning pipeline

Our fine-tuning pipeline has four stages:

Data curation. We start from production conversation logs, filtered for quality using our LLM-as-judge evaluation framework. We specifically select interactions where our agents performed well, creating a high-quality demonstration set. We also construct synthetic examples for edge cases and failure modes that are underrepresented in production data.
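A toy version of that curation step, assuming each conversation carries an intent label and the judge is a callable returning a quality score in [0, 1] (the schema and function names here are our own illustration, not Helios's actual pipeline):

```python
from collections import Counter

def curate(logs, judge_score, synthetic, min_score=0.8, min_per_intent=100):
    """Keep production conversations the LLM judge rates highly, then top up
    underrepresented intents with synthetic examples."""
    kept = [c for c in logs if judge_score(c) >= min_score]
    counts = Counter(c["intent"] for c in kept)
    for ex in synthetic:
        if counts[ex["intent"]] < min_per_intent:
            kept.append(ex)
            counts[ex["intent"]] += 1
    return kept
```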

Task decomposition. Before fine-tuning, we decompose agent behavior into discrete subtasks: intent classification, entity extraction, response generation, tool selection, and safety filtering. Each subtask has its own fine-tuning track.
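One way to represent those per-subtask tracks is a plain configuration table. The base-model sizes and evaluation metrics below are assumptions for the sketch (only "naturalness" for response generation is mentioned in this post):

```python
# Hypothetical per-subtask fine-tuning tracks: base model size and the
# metric each track is evaluated on. Values are illustrative assumptions.
TRACKS = {
    "intent_classification": {"base": "7b",  "metric": "accuracy"},
    "entity_extraction":     {"base": "7b",  "metric": "f1"},
    "response_generation":   {"base": "13b", "metric": "naturalness"},
    "tool_selection":        {"base": "7b",  "metric": "accuracy"},
    "safety_filtering":      {"base": "7b",  "metric": "recall"},
}

def track_for(subtask: str) -> dict:
    """Look up a subtask's fine-tuning track, failing loudly on typos."""
    if subtask not in TRACKS:
        raise KeyError(f"no fine-tuning track for subtask {subtask!r}")
    return TRACKS[subtask]
```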

Iterative training and evaluation. We train on curated data, evaluate against our ground truth benchmark, and use the evaluation signal to identify gaps for the next training round. This loop typically runs three to five times before a model is ready for online testing.
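The loop itself is simple to sketch. Here `train`, `evaluate`, `find_gaps`, and `augment` are stand-ins for whatever implements each stage; the target threshold and round cap are illustrative:

```python
def training_loop(train, evaluate, find_gaps, augment, data,
                  max_rounds=5, target=0.95):
    """Train, evaluate against a ground-truth benchmark, then use the
    failures to augment the next round's training data."""
    for rounds in range(1, max_rounds + 1):
        model = train(data)
        score, failures = evaluate(model)
        if score >= target:
            break  # good enough to graduate to online A/B testing
        data = augment(data, find_gaps(failures))
    return model, score, rounds
```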

A/B testing and gradual rollout. Fine-tuned models go through the same online A/B testing infrastructure as any other experiment variant. We gradually increase traffic exposure as performance metrics improve.
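A minimal exposure-ramp policy in that spirit might look like this; the schedule steps and the boolean health signal are hypothetical simplifications of a real experimentation system:

```python
# Hypothetical traffic-exposure schedule for a fine-tuned model variant.
RAMP = [0.01, 0.05, 0.25, 0.50, 1.00]

def next_exposure(current: float, variant_ok: bool) -> float:
    """Advance one step if the variant's metrics held up at the current
    exposure; otherwise roll back to the smallest step."""
    if not variant_ok:
        return RAMP[0]
    idx = RAMP.index(current)
    return RAMP[min(idx + 1, len(RAMP) - 1)]
```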

Results

Fine-tuned models have delivered meaningful improvements across our key metrics. On intent classification tasks, a fine-tuned 7B model matches the accuracy of a 70B model at 12x lower latency. On response generation for well-defined scenarios, fine-tuned models show higher naturalness scores than base models because they have been trained on domain-specific conversation styles.

The biggest win has been in cost efficiency. By routing appropriate traffic to fine-tuned smaller models, we have reduced per-conversation inference costs by over 60% without sacrificing quality on the interactions that matter most.
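As a back-of-the-envelope check on how routing share translates into blended savings (the per-conversation prices and the 70% traffic share below are invented for illustration, not Helios's actual figures):

```python
# Illustrative per-conversation inference costs, in dollars. Hypothetical.
FRONTIER_COST = 0.050
FINE_TUNED_COST = 0.004

def blended_cost(small_model_share: float) -> float:
    """Average cost per conversation given the share routed to the
    fine-tuned model."""
    return (small_model_share * FINE_TUNED_COST
            + (1 - small_model_share) * FRONTIER_COST)

def savings(small_model_share: float) -> float:
    """Fractional cost reduction versus sending everything to the frontier model."""
    return 1 - blended_cost(small_model_share) / FRONTIER_COST
```

With these assumed prices, routing roughly 70% of traffic to the small model already lands in the "over 60%" range.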

