
f7: Testing the agent

In f5 you put a guardrail in front of TripMate. It looked right when you ran it, but “looked right once” is not the same as “I can prove it still works tomorrow.” How do you actually know it blocks an off-topic question and lets a trip question through?

The obvious idea, call the real model and check, is a bad test. It is slow, it costs money on a hosted model, and the model will not reliably say “block” on command, so the test passes one run and fails the next. A flaky test is worse than no test.

The fix is to stop using the real model. agent.override(model=TestModel()) runs your agent with a stand-in that makes no network call, and TestModel(custom_output_text="0") makes that stand-in return exactly the verdict you want. Now you can drive the guardrail to “block” or “allow” on demand and assert what your code does with each. Same idea as a unit test: deterministic, fast, and yours to control.

In a hurry? These three steps are the whole challenge. Everything below is the why and the how.

  1. Run make f7 and watch both checks fail (0/2 green): the two test bodies are unwritten.
  2. Edit start/agent.py: TODO 1 (force the guardrail to “0”, assert handle() returns the refusal), TODO 2 (force it to “1”, stub TripMate, assert the reply is not the refusal).
  3. Done when both checks print a ✓ and the run finishes without a single real model call.

real model in a test    ->  slow, costs money, won't "block" on command  (flaky)
TestModel + override    ->  instant, deterministic; tests YOUR branch logic

Swap the model for one you control, and each branch of the gate becomes a check you can trust.

Forget travel. Say you have a comment box with a moderation gate: a checker votes 1 to keep a comment or 0 to remove it.

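from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

# `model` is whatever model the app normally uses; the test below never calls it.
moderator = Agent(model, instructions="Reply 1 to keep a comment, 0 to remove it.")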

def post(text: str) -> str:
    verdict = moderator.run_sync(text).output
    return text if verdict.strip().startswith("1") else "[removed]"

def test_removes_a_bad_comment():
    # Make the moderator vote "0": no real model, no guessing.
    with moderator.override(model=TestModel(custom_output_text="0")):
        assert post("garbage") == "[removed]"

override swaps the model just for that with block, and custom_output_text is the verdict you put in its mouth. You are not testing whether the model moderates well; that is a separate problem. You are testing that your code does the right thing once a verdict comes back. Below you write the same two moves for TripMate’s gate.

Open start/agent.py. The f5 gate is already here, pulled into one testable function handle(query): the guardrail agent votes, and only an allowed query reaches tripmate. Two test functions are stubbed out for you to fill in. At the top, ALLOW_MODEL_REQUESTS = False is set as a backstop: if a test forgets to override an agent, the real call raises instead of quietly passing.

make f7

Unlike every challenge before it, this one needs no model server. Every test swaps in a fake model, so f7 runs offline and instantly, even with Ollama stopped.

  1. Test the block branch (TODO 1). Write test_blocks_off_topic. Override the guardrail with TestModel(custom_output_text="0"), call handle(...) inside the with block, and assert the reply equals REFUSAL. Because a blocked query never reaches TripMate, you do not need to touch tripmate at all. Mirror the shape of the moderator test above; don’t copy it verbatim, the agents and the assertion are different.

  2. Test the allow branch (TODO 2). Write test_allows_trip. This time the guardrail votes "1", so handle() does call TripMate, which would be a real model request. Override both agents, the guardrail with custom_output_text="1" and tripmate with a plain TestModel() (it returns generated text, no real call), then assert the reply is a real answer (non-empty) and not REFUSAL.

  3. Run it. make f7 should now print 2/2 green with a ✓ on each line. Nothing waited on Ollama; the whole thing is instant. That speed is the point: you can run this on every commit.

  4. Prove the backstop works. Comment out the tripmate.override(...) in TODO 2 and run again. That test now fails instantly with a ✗ and a model-request error, instead of hanging on a slow call, because ALLOW_MODEL_REQUESTS = False caught the un-stubbed agent. Put it back.

Stuck? finish/agent.py is the canonical version. Read it after you’ve had a real go.

  • The allow test hangs or errors on a real call. You overrode the guardrail but not TripMate. An allowed query reaches TripMate, so stub it too. The backstop turns this into an immediate raise instead of a slow hang.
  • custom_output_text ignored. It only applies to the agent you wrapped in override. Wrap the guardrail, not TripMate, when you want to control the verdict.
  • Asserting the model’s quality. A TestModel says whatever you tell it, so a test on it proves nothing about how well the real model moderates. These tests cover your wiring and control flow, the part you actually wrote. Judging the model’s answers is evals, a different tool.

What does TestModel do by default?

With no arguments, TestModel() calls every tool the agent has (with made-up arguments) and then returns generated text as the output. That is enough to prove the plumbing works: tools are registered, an output_type validates, your code runs end to end. When you need a specific answer instead, custom_output_text sets the text and custom_output_args sets the fields of a structured output. After a run, the_model.last_model_request_parameters.function_tools tells you which tools were on offer.

Why ALLOW_MODEL_REQUESTS = False?

It is a global switch that makes any real model request raise. TestModel and friends are exempt, so your overridden runs work fine, but a run you forgot to stub turns into a loud error instead of a slow, costly, flaky call to a live model. It is cheap insurance against a test that secretly depends on the network. Many teams set it once in their test setup.

These are real tests; we just run them with a script

To keep every challenge to one runnable file, main() calls the two tests and prints a ✓ for each. In a real project they would live in a file like test_gate.py as async def test_* functions, with no main(): pytest (plus pytest-asyncio or anyio) discovers them, awaits each one, and reports a failed assert as a failing test. The bodies are identical; only the runner changes.

This is the last of the foundations. You have an agent that takes instructions, returns structured output, calls tools, gates its input, routes on descriptions, and now proves it behaves.

Now the path forks. Choose a track (do one, several, or stop here — the Discussion closes the workshop wherever you end up):

  • Patterns (p1–p7): compose several model calls in shapes you design (chaining, routing, parallelization, evaluator-optimizer), then hand control to the model with agentic, delegation, and conversation.
  • RAG (r1–r2): give the agent your own documents to retrieve and rank by similarity, then chunk the long ones.
  • Full-Stack: put this agent behind a streaming web chat UI.

If you’re carrying on with patterns, p1 is chaining: draft, check it with a gate, then fix only what failed.