Designing layered guardrails for reliable AI agents

Valeria RiosAI Safety EngineerJuly 29, 20259 min read

Building a reliable AI agent for enterprise customer interactions requires more than a well-crafted system prompt. A single safety layer is insufficient — models have context limits, adversarial inputs can bypass prompt-based instructions, and novel scenarios will always emerge in production that were not anticipated at configuration time.

The architecture we have converged on uses layered guardrails: multiple independent safety mechanisms applied at different stages of the agent pipeline. Each layer catches different failure modes, and the overlap between layers provides resilience when one mechanism is bypassed.

Layer 1: Input classification

Before the primary LLM processes a customer message, a lightweight classifier evaluates the input for policy violations: personal attacks, attempts to manipulate the agent into off-policy behavior, PII that should be masked, and out-of-scope requests that should be immediately escalated.

This classifier runs in under 50ms and handles the most common safety issues without consuming primary LLM capacity or adding visible latency to the conversation.

Layer 2: System prompt and instruction following

The primary LLM receives carefully structured system prompts that define agent identity, scope, escalation criteria, and prohibited behaviors. We use constitutional AI techniques to make these instructions robust to adversarial prompting.

System prompts alone are not sufficient, but they establish the behavioral baseline and handle the majority of edge cases that do not require deeper intervention.

Layer 3: Response evaluation

Before the agent's response is delivered to the customer, a second LLM pass evaluates it against safety criteria: factual accuracy for high-stakes claims, tone appropriateness, absence of hallucinated policies or prices, and consistency with the conversation context.

This layer catches errors that slipped through the primary LLM — fabricated information presented with high confidence, inappropriate tone shifts, or responses that are technically accurate but contextually inappropriate.

Layer 4: Post-hoc monitoring

Not all safety issues are catchable in real time. Some failure modes only become visible in aggregate — a systematic bias in how certain customer segments are handled, or a slow drift in response quality as the underlying model behavior shifts.

We run continuous monitoring on conversation samples using our LLM-as-judge framework, flagging conversations that fall below quality thresholds for human review. This provides a continuous feedback signal into the configuration and fine-tuning pipeline.

The cost of layered safety

Layered guardrails add latency and cost. Our input classifier adds approximately 40ms. Our response evaluation pass adds 80-120ms depending on response length. For voice agents where total round-trip latency must stay under 1.5 seconds, this requires careful optimization of each safety layer.

The tradeoff is worth it. Enterprise customers cannot afford a safety failure that goes viral. A single highly visible incident — an agent that fabricates a refund policy, handles a complaint offensively, or gets manipulated into providing harmful information — can damage customer trust in ways that take months to repair.

Valeria Rios

AI Safety Engineer

Designing layered guardrails for reliable AI agents

Layer 1: Input classification

Layer 2: System prompt and instruction following

Layer 3: Response evaluation

Layer 4: Post-hoc monitoring

The cost of layered safety

The evaluation engine behind Helios AI agents

10 principles of a production-grade voice AI agent

Agentic AI for customer experience: what it actually means

Discover what Helios
can do for your team.

Layer 1: Input classification

Layer 2: System prompt and instruction following

Layer 3: Response evaluation

Layer 4: Post-hoc monitoring

The cost of layered safety

The evaluation engine behind Helios AI agents

10 principles of a production-grade voice AI agent

Agentic AI for customer experience: what it actually means

Discover what Helioscan do for your team.

Discover what Helios
can do for your team.