Designing layered guardrails for reliable AI agents
Building a reliable AI agent for enterprise customer interactions requires more than a well-crafted system prompt. A single safety layer is insufficient — models have context limits, adversarial inputs can bypass prompt-based instructions, and novel scenarios will always emerge in production that were not anticipated at configuration time.
The architecture we have converged on uses layered guardrails: multiple independent safety mechanisms applied at different stages of the agent pipeline. Each layer catches different failure modes, and the overlap between layers provides resilience when one mechanism is bypassed.
Layer 1: Input classification
Before the primary LLM processes a customer message, a lightweight classifier evaluates the input for policy violations: personal attacks, attempts to manipulate the agent into off-policy behavior, PII that should be masked, and out-of-scope requests that should be immediately escalated.
This classifier runs in under 50ms and handles the most common safety issues without consuming primary LLM capacity or adding visible latency to the conversation.
Layer 2: System prompt and instruction following
The primary LLM receives carefully structured system prompts that define agent identity, scope, escalation criteria, and prohibited behaviors. We use constitutional AI techniques to make these instructions robust to adversarial prompting.
System prompts alone are not sufficient, but they establish the behavioral baseline and handle the majority of edge cases that do not require deeper intervention.
Layer 3: Response evaluation
Before the agent's response is delivered to the customer, a second LLM pass evaluates it against safety criteria: factual accuracy for high-stakes claims, tone appropriateness, absence of hallucinated policies or prices, and consistency with the conversation context.
This layer catches errors that slipped through the primary LLM — fabricated information presented with high confidence, inappropriate tone shifts, or responses that are technically accurate but contextually inappropriate.
Layer 4: Post-hoc monitoring
Not all safety issues are catchable in real time. Some failure modes only become visible in aggregate — a systematic bias in how certain customer segments are handled, or a slow drift in response quality as the underlying model behavior shifts.
We run continuous monitoring on conversation samples using our LLM-as-judge framework, flagging conversations that fall below quality thresholds for human review. This provides a continuous feedback signal into the configuration and fine-tuning pipeline.
The cost of layered safety
Layered guardrails add latency and cost. Our input classifier adds approximately 40ms. Our response evaluation pass adds 80-120ms depending on response length. For voice agents where total round-trip latency must stay under 1.5 seconds, this requires careful optimization of each safety layer.
The tradeoff is worth it. Enterprise customers cannot afford a safety failure that goes viral. A single highly visible incident — an agent that fabricates a refund policy, handles a complaint offensively, or gets manipulated into providing harmful information — can damage customer trust in ways that take months to repair.
Valeria Rios
AI Safety Engineer