Resilience: failures as data

Self-serve track. Not part of the live 90-minute block; do it any time after Foundations. Run with npm run resilience (reference: npm run solution:resilience).

Every tool so far has worked. Real APIs go down, and users type places that do not exist. This challenge is about what the agent does when a tool fails.

In a hurry? These three steps are the whole challenge. Everything below is the why and the how.

  1. Run npm run resilience and watch getWeather throw for Atlantis, leaving the model a bare “Error: …” string and a thin recovery.
  2. Edit start/agent.ts and do the cycle 2 TODOs: wrap the getWeather body in try/catch and return { error: ... } with a recovery hint instead of throwing, then add an instructions clause telling TripMate what to do with an error field.
  3. Done when the run does not crash and the agent names the Atlantis failure plainly and offers a real alternative instead of stalling.

You ask TripMate to plan a weekend in Atlantis, which no tool can find, and you turn that failure into something the model can handle well.

The one idea here: return failures as data, do not throw them. A returned { error: "..." } is a normal tool result the model reads and acts on, and the message is one you wrote, so you can include recovery guidance. A thrown error gives the model a terse, generic string you don’t control.

Forget travel. Say a chargeCard tool can fail. The instinct is to throw:

execute: async ({ amount }) => {
  const ok = await charge(amount);
  if (!ok) throw new Error("charge failed");   // the model sees a bare string you did not word
  return { ok };
},

The AI SDK catches a thrown error inside a tool and reports it back to the model, so the run does not crash. But the model only sees a terse "Error: ..." you did not write, with no hint about what to do next. Return the failure as data instead, with a message and recovery guidance you control:

execute: async ({ amount }) => {
  try {
    const ok = await charge(amount);
    if (!ok) return { error: "<name what failed, and tell the model what to do next>" };
    return { ok };
  } catch {
    return { error: "<same: a message the model can act on>" };
  }
},

Both keep the agent alive; the returned version gives the model something to act on, and the wording is yours. That wording is the whole lesson, so it is yours to write, not to copy. Below you do this to TripMate’s getWeather.
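If several tools need the same treatment, the try/catch can be factored into a small wrapper. What follows is a minimal sketch, not an AI SDK API: errorsAsData, ToolFailure, and the stubbed charge are hypothetical names made up for illustration, and the message stays a placeholder because the wording is still yours to write.

```typescript
// Hypothetical helper (not part of the AI SDK): wraps any execute function so
// that a thrown exception becomes a returned { error } object.
type ToolFailure = { error: string };

function errorsAsData<Args, Result>(
  execute: (args: Args) => Promise<Result>,
  // You write this: it should name what failed and say what to do next.
  describeFailure: (args: Args, cause: unknown) => string,
): (args: Args) => Promise<Result | ToolFailure> {
  return async (args) => {
    try {
      return await execute(args);
    } catch (cause) {
      // Nothing escapes: the exception is converted into data.
      return { error: describeFailure(args, cause) };
    }
  };
}

// Stub standing in for a real payment call (an assumption for this sketch).
async function charge(amount: number): Promise<boolean> {
  if (amount > 100) throw new Error("gateway timeout");
  return true;
}

const chargeCard = errorsAsData(
  async ({ amount }: { amount: number }) => ({ ok: await charge(amount) }),
  ({ amount }) =>
    `<name what failed for amount ${amount}, and tell the model what to do next>`,
);
```

The wrapper only covers the catch branch. An expected failure, such as charge resolving to false, still needs its own return { error } inside the tool.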

Open start/agent.ts. getWeather throws for any city it does not know, and the prompt asks about Atlantis, which no tool can find. Cycle 2 is yours: replace the throw with a try/catch that returns { error }, and add a clause to the agent’s instructions telling TripMate what to do with an error field.

npm run resilience
  1. Run it and read the thin recovery (cycle 1). Run npm run resilience. getWeather throws for Atlantis, and there’s no flight to Atlantis either, so both tools fail. The agent does not crash: the SDK reports the throw to the model, which apologises and moves on. Read how it recovers. It works, but the message the model saw was a bare "Error: ..." you did not write, with no guidance about what to do next.

  2. Return the error as data (cycle 2, TODO). In getWeather, replace the throw with a return { error: ... } whose message names what failed and tells the model what to do next (suggest a real city, or continue without the weather). The exact wording is yours to write, and it is the whole point: the model can only act on what your message says. Wrap the body in a try/catch too, so even an unexpected throw becomes a returned { error } rather than an exception. The TODO comment in start/agent.ts marks the spot.

  3. Tell the agent what to do with an error field (cycle 2, TODO). Add a clause to the agent’s instructions, in your own words, telling TripMate what to do when a tool returns an error field: acknowledge the failure plainly, then ask for a real alternative or continue with what worked. Write the sentence yourself; there is a marked spot in the instructions string.

    Run again. The model now gets a message you wrote, including what to do about it, and the recovery is cleaner: it names the failure (“I could not find weather for Atlantis, it is not a real destination”) and offers real alternatives.

  4. Run the bare-vs-hinted poke (cycle 3). Temporarily change your error message to just "No weather data." Predict how TripMate recovers, then run. With nothing to act on, the recovery goes thin: it apologises and stalls. Put your hinted message back and run again. Same failure, two recoveries. The model can only act on what your message says, which is the whole reason you return the error instead of letting the SDK hand it a bare string.

  5. Verify what you’ve got. npm run resilience finishes without crashing. You wrote the { error } return with a recovery hint (and a try/catch), plus your own recovery clause in the instructions. The agent names the Atlantis failure and offers a real alternative. You should be able to say why returning an error beats throwing one, even though neither crashes.
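For the tool half of step 5, you can check the shape of the result without running the whole agent. The sketch below is an assumption-heavy stand-in: this getWeather, its made-up temperature table, WeatherResult, and isToolFailure are hypothetical, not the code in start/agent.ts, and the error text is left as a placeholder for your own wording.

```typescript
type WeatherResult = { city: string; tempC: number } | { error: string };

// Made-up stub data; the real tool would call a weather service.
const knownCities = new Map<string, number>([
  ["Lisbon", 22],
  ["Oslo", 9],
]);

// Stand-in for a cycle 2 getWeather: every path returns, none throws.
async function getWeather({ city }: { city: string }): Promise<WeatherResult> {
  try {
    const tempC = knownCities.get(city);
    if (tempC === undefined) {
      return {
        error: `<name that weather for ${city} could not be found, and say what to do next>`,
      };
    }
    return { city, tempC };
  } catch {
    return { error: "<same: a message the model can act on>" };
  }
}

// Type guard: true when the tool reported a failure as data.
function isToolFailure(result: WeatherResult): result is { error: string } {
  return "error" in result;
}

// Atlantis is not in the table, so the promise resolves with an error field.
getWeather({ city: "Atlantis" }).then((result) => {
  console.log(isToolFailure(result)); // logs true
});
```

The promise resolves for Atlantis instead of rejecting, which is the property the try/catch is there to guarantee.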

  • You expected cycle 1 to crash. It won’t: the SDK catches tool throws. The reason to return errors as data is control and recovery guidance, not crash prevention.

  • The recovery is thin even in cycle 2. Put more in the error message. The model can only act on what the message says; a bare "error: failed" gives it nothing to work with.

  • A small model still pushes ahead and invents. granite4.1:3b sometimes does, if any tool succeeded and gave it material. Here both tools fail for Atlantis, so there’s nothing to confabulate around, which keeps the recovery honest. A stronger model recovers more reliably in general.

Why does the model recover even from a thrown error here?

AI SDK v6 wraps tool execution: a thrown error becomes a tool-error result that’s fed back to the model, so the model still gets a turn to respond. Older framework versions, and some other stacks, were less forgiving, which is where the “always return, never throw” rule comes from. The rule still holds, for two reasons that survive a forgiving framework: you control the message (so you can add recovery guidance), and a try/catch returning { error } guarantees no exception escapes regardless of the framework’s behaviour.
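To make the difference concrete, here is a conceptual sketch of a runner that catches tool throws. It is an illustration only, not the AI SDK's implementation: runTool, ToolOutcome, and the exact message format are assumptions invented for this example.

```typescript
// Conceptual illustration only, NOT the AI SDK's actual code.
type ToolOutcome =
  | { type: "tool-result"; output: unknown }
  | { type: "tool-error"; message: string };

// Runs a tool and always hands something back for the model to read.
async function runTool(execute: () => Promise<unknown>): Promise<ToolOutcome> {
  try {
    return { type: "tool-result", output: await execute() };
  } catch (cause) {
    // The tool author did not word this; the runner derives it from the exception.
    const message = cause instanceof Error ? cause.message : String(cause);
    return { type: "tool-error", message: `Error: ${message}` };
  }
}

// A tool that throws: the model only gets the runner's terse string.
const throwing = async () => {
  throw new Error("charge failed");
};

// A tool that returns the failure as data: your object passes through untouched.
const returning = async () => ({
  error: "<a message you wrote, with what to do next>",
});

runTool(throwing).then((outcome) => console.log(outcome.type));  // logs "tool-error"
runTool(returning).then((outcome) => console.log(outcome.type)); // logs "tool-result"
```

Neither call rejects, so the run survives both. Only the second delivers text you wrote.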

Why make both tools fail for Atlantis?

If the flight lookup had succeeded for Atlantis, the model would have a real price to build a pitch around and would happily bury the weather failure under a confident itinerary, especially a small model. Making every tool fail for the fake destination removes the material to confabulate with, so the only honest response is to surface the failure. In your own tools, think about what the model will do with a partial success: it will lean on whatever worked.

That’s the resilience lesson: a failure becomes data you control and the model can recover from, not an exception that derails the run. With this, TripMate plans, gets typed output, uses tools you did and did not write, streams, and recovers when a tool fails.

Next, the streaming self-serve track swaps .generate() for .stream(), and the MCP self-serve track adds a tool served by a separate process.