
f5: Guardrails

Once an agent is out in the world, people will ask it things you never intended: off-topic questions, and the occasional unsafe one. A guardrail is how you handle that. It is a cheap check that runs before your real agent and decides whether to let the request through.

In a hurry? These three steps are the whole challenge. Everything below is the why and the how.

  1. Run npm run f5 (RUN = 2) and watch TripMate answer an off-topic question because nothing checks it first.
  2. Read the throwaway worked example below to learn the two-call gate, then edit start/agent.ts: TODO 1 (run the guardrail first and turn its 1/0 verdict into a boolean), TODO 2 (if not allowed, print a refusal and return early).
  3. Done when RUN 1 passes through to TripMate while RUN 2 and RUN 3 get the refusal and TripMate never runs.

One small classifier call sits in front of the expensive one. It looks at the user’s message and returns a verdict, and only allowed messages reach TripMate.

query  ->  [ guardrail check ]  ->  allowed?  --yes-->  TripMate  ->  answer
              (cheap; runs first)      \---no--->  refuse (TripMate never runs)

The guardrail is a cheap call in front; a blocked verdict returns before the real agent ever runs. Both calls here use .generate(): the gate consumes a one-character verdict, and there is nothing to stream about a 1 or a 0.

The gate is two calls and a branch. Take a billing-support bot that has nothing to do with travel: a cheap classifier decides, and only an allowed message reaches the real bot. The classifier is the small piece. It replies with one character, 1 to allow and 0 to block, because cheapness is the point:

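import { ToolLoopAgent } from "ai";

// `model` is assumed to be defined already (whichever model the course setup provides).
const checker = new ToolLoopAgent({
  model,
  // One-character verdict: 1 = allow, 0 = block.
  instructions: `
You gate a billing-support bot.
Reply with ONE character: 1 if the message is about billing or payments, 0 for anything else.
Nothing else.
`.trim(),
});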

From there the gate is three moves, and they are yours to write:

  • Run the checker first, before the real bot, on the user’s message.
  • Turn its reply into a boolean, failing closed: only a clear 1 passes, so trim the reply and check it starts with 1, and treat anything else as a block.
  • Return early with a refusal when it’s blocked, so the real bot never runs.
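
If it helps to see the shape before you write it, here is a minimal sketch of the three moves for the billing example. Plain async functions stand in for the two agents, so nothing here calls the AI SDK. The names Ask, isAllowed, gatedAnswer, and REFUSAL are illustrative assumptions, not part of the course code, and the refusal wording is made up.

```typescript
// Stand-in type (assumption): any async function from a message to reply text.
type Ask = (message: string) => Promise<string>;

// Hypothetical refusal wording for the billing bot.
const REFUSAL = "Sorry, I can only help with billing questions.";

// Move 2: fail closed. Only a reply that starts with "1" after trimming passes;
// empty replies, "0", or anything unexpected count as a block.
function isAllowed(reply: string): boolean {
  return reply.trim().startsWith("1");
}

async function gatedAnswer(message: string, checker: Ask, bot: Ask): Promise<string> {
  // Move 1: the cheap checker runs first, on the user's message.
  const verdict = await checker(message);

  // Move 3: return early on a block, so the real bot is never called.
  if (!isAllowed(verdict)) {
    return REFUSAL;
  }

  return bot(message);
}
```

Keeping the verdict parsing in its own small function means the fail-closed rule lives in one place and is easy to check without calling a model at all.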

You did the typed-output and instructions work in f1 to f3; this is the same toolkit wired into a gate. Carry the three moves over to TripMate’s guardrail below.

Open start/agent.ts. There are two agents already wired: a guardrail and the plain tripmate from f1. The GUARDRAIL_SYSTEM brief is provided: it allows travel questions, blocks everything else and anything unsafe, and answers with a single 1 or 0. What’s missing is the gate itself. You write it, applying the three moves above.

Run it:

npm run f5

As it ships there is no guardrail wired in, so TripMate answers whatever you send it. RUN is set to an off-topic question, and you’ll watch TripMate answer it. That is the gap: an assistant with no gate has no idea what it should refuse.

  1. Run it and watch the gap. Run npm run f5 with RUN = 2 (an off-topic question). TripMate answers it, because nothing is checking the request first. Try RUN = 3 (an unsafe question) and watch it engage with that too.

  2. Run the guardrail first (TODO 1). Before TripMate runs, send the query to the guardrail and turn its one-character reply into a boolean. These are the first two moves from the worked example, now applied to TripMate’s guardrail. Fail closed: only a clear allow passes; treat anything else as a block. Replace the const allowed = true; placeholder with the real verdict.

  3. Refuse blocked queries (TODO 2). If the verdict was a block, print a fixed refusal and return before tripmate runs (the third move). Then re-run all three: RUN = 1 (a real trip question) passes through to TripMate; RUN = 2 and RUN = 3 get the refusal and TripMate never sees them.

    Stuck? finish/agent.ts is the canonical version. Read it after you’ve had a real go.

  4. Check you’ve got it. You should be able to point at the two-call shape: the guardrail runs first, and only an allowed verdict lets TripMate run. Scroll up to the trace: a blocked query shows the guardrail call and then nothing, an allowed one shows the guardrail call and then the TripMate call.

Why a separate call instead of one clever prompt?

You could tell TripMate “refuse anything off-topic” in its own instructions, and that helps, but then the same model that wants to be helpful is also deciding when to refuse, on every turn, mixed in with the real work. A separate guardrail is one job, judged on its own, and you can make it cheap and strict without touching how TripMate answers. It is also the seam where you would later log refusals, swap in a faster model, or tighten the policy.

Keep the guardrail cheap

A guardrail runs on every request, so it should be the smallest call you can make. This one returns a single character, 1 or 0, so the model writes almost nothing: no JSON to assemble, no reason string to compose, just one token. You could ask it for a typed { allowed, reason } instead, and the reason is handy while you are learning why a verdict went the way it did, but every extra token is one that every request pays for. When in doubt, keep the gate cheap and log the blocked queries somewhere else. On a small local model the call is not instant either way; the point is the pattern, a light check in front, not a stopwatch number.
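
To make the trade-off concrete, here is a rough sketch of what consuming a { allowed, reason } verdict could look like if you parsed it by hand. This is an illustration only: Verdict and parseVerdict are hypothetical names, and in practice you would more likely lean on the typed-output tooling from the earlier challenges than on JSON.parse. The fail-closed rule is the same as for the 1/0 reply.

```typescript
// Hypothetical shape for a richer verdict.
interface Verdict {
  allowed: boolean;
  reason: string;
}

// Parse the model's raw reply, failing closed: anything that is not a JSON
// object with a boolean `allowed` field is treated as a block.
function parseVerdict(raw: string): Verdict {
  try {
    const data: unknown = JSON.parse(raw);
    if (typeof data === "object" && data !== null) {
      const { allowed, reason } = data as Record<string, unknown>;
      if (typeof allowed === "boolean") {
        return { allowed, reason: typeof reason === "string" ? reason : "" };
      }
    }
  } catch {
    // Not valid JSON: fall through to the fail-closed default below.
  }
  return { allowed: false, reason: "unparseable verdict" };
}
```

Compare the two replies the model has to write: a bare 1 versus something like {"allowed":true,"reason":"asks about a refund"}. The second is friendlier to debug, and every request pays for the extra output.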

Scope is a guardrail too

Guardrails are not only about unsafe content. The most common use is scope: keeping the assistant on the job it was built for. A travel bot that will write your code or your essays is a support headache and a bigger surface to test. “Only answer travel questions” is a guardrail, and it is the one you will reach for most.

Input gates vs. output guardrails (and middleware)

f5 is an input guardrail: it checks the request before the model runs and can refuse it. The other half is output guardrails, which inspect what the model generated and clean it, for example redacting personal data before the reply leaves your system. The AI SDK’s idiomatic home for those is middleware: wrapLanguageModel({ model, middleware }) wraps a model so a wrapGenerate hook can rewrite the result, and the wrapped model drops into any agent unchanged. Middleware is also where reusable, model-agnostic concerns like logging and caching live. The self-serve track guardrails-middleware builds one. A real system often runs both: an input gate in front, an output filter behind.
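
As a rough picture of the output side, the sketch below redacts email addresses from generated text inside a wrapGenerate-style hook. It deliberately does not import the AI SDK: ContentPart and GenerateResult are simplified local stand-ins, and the exact result fields and middleware type in the real SDK depend on its version, so treat the shapes as assumptions. redactEmails, EMAIL, and redactionMiddleware are hypothetical names, and the regex is a loose illustration, not a complete PII detector.

```typescript
// Simplified stand-ins for the SDK's result shape (assumptions, not the real types).
type ContentPart = { type: string; text?: string };
type GenerateResult = { content: ContentPart[] };

// Loose email pattern, for illustration only.
const EMAIL = /[\w.+-]+@[\w-]+(\.[\w-]+)+/g;

function redactEmails(text: string): string {
  return text.replace(EMAIL, "[redacted email]");
}

// Shaped like a middleware object with a wrapGenerate hook: let the model
// generate, then rewrite the text parts before the result leaves the system.
const redactionMiddleware = {
  wrapGenerate: async ({
    doGenerate,
  }: {
    doGenerate: () => Promise<GenerateResult>;
  }): Promise<GenerateResult> => {
    const result = await doGenerate();
    return {
      ...result,
      content: result.content.map((part) =>
        part.type === "text" && part.text !== undefined
          ? { ...part, text: redactEmails(part.text) }
          : part,
      ),
    };
  },
};
```

In a real project an object along these lines would be handed to wrapLanguageModel({ model, middleware }), typed with the SDK’s own middleware type, and the wrapped model would then be passed to the agent in place of the original.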

That is the gate in front of the agent. Next up is f6, where the model picks among several tools from their descriptions alone.