
p4: Evaluator-optimizer

A chain (p1) checks once and fixes once. An evaluator-optimizer loops: score the work, improve it, score the new version, and keep going until it is good enough or you hit a safety cap. One agent writes, another grades it against a rubric, and the feedback flows back into the next attempt.

The rubric is what makes the score mean something, so the rubric is what you write here.

In a hurry? These three steps are the whole challenge. Everything below is the why and the how.

  1. Run make p4: the evaluator has no rubric (arbitrary score) and the loop body is empty, so the same score repeats every round until it stops at the cap.
  2. Edit start/agent.py: write the evaluator’s rubric (TODO 1), improve the pitch with the editor (TODO 2), re-score with the evaluator (TODO 3), then report which ending happened (TODO 4).
  3. Done when the score climbs across rounds, the loop ends by either clearing BAR (8) or stopping at MAX_ROUNDS (3), and the output names which ending happened.

draft  ->  score  ->  good enough?  --yes-->  done
             ^             \--no-->  improve  --.
             '----------------------------------'

The loop exits when the score clears BAR or when the round count reaches the cap (MAX_ROUNDS). The cap is what guarantees it always terminates, even if the score never gets high enough.

Forget travel. Say you are grading an answer to a SQL question and improving it until it is good. The evaluator’s prompt is a rubric: the named criteria the score measures against.

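from pydantic import BaseModel, Field
# Assumption: Agent comes from Pydantic AI, which matches the call shape used here.
from pydantic_ai import Agent


class Scorecard(BaseModel):
    # Structured output the evaluator must return: a bounded score plus one suggestion.
    score: int = Field(ge=1, le=10, description="overall quality, 1 to 10")
    feedback: str = Field(description="the single most useful change to raise the score, one sentence")


# `model` is not defined in this snippet; it is assumed to be configured elsewhere.
evaluator = Agent(
    model,
    output_type=Scorecard,
    # The instructions are the rubric: named musts and how to deduct for them.
    instructions=(
        "Score a SQL answer from 1 to 10 against three musts: it returns the right rows, "
        "it uses an index, and it is readable. Deduct for each missing must. Give one "
        "concrete piece of feedback."
    ),
)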

The rubric is the lever. “Make it better” grades nothing, but three named musts give a real score and give the editor a target. A loop then scores, improves with the feedback, and scores again until the number clears a bar or a round cap stops it. Below you write TripMate’s rubric, then wire that loop.
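To see why named musts yield a number that means something, here is a minimal, model-free sketch of the rubric idea. The `Card` class, the `MUSTS` table, and `score_against_musts` are hypothetical stand-ins for the evaluator agent, not part of the course code, and the crude keyword checks only imitate what a real evaluator would judge with a model.

```python
from dataclasses import dataclass


@dataclass
class Card:
    """Stand-in for Scorecard: a 1-10 score plus one piece of feedback."""
    score: int
    feedback: str


# Hypothetical rubric: each must is a name plus a crude check on the answer text.
MUSTS = {
    "filters rows with WHERE": lambda sql: "where" in sql.lower(),
    "avoids SELECT *": lambda sql: "select *" not in sql.lower(),
    "orders the result": lambda sql: "order by" in sql.lower(),
}


def score_against_musts(sql: str, penalty: int = 3) -> Card:
    """Start at 10, deduct `penalty` for each missing must, floor at 1."""
    missing = [name for name, check in MUSTS.items() if not check(sql)]
    score = max(1, 10 - penalty * len(missing))
    feedback = f"Fix this first: {missing[0]}." if missing else "All musts met."
    return Card(score, feedback)


weak = score_against_musts("SELECT * FROM orders")
strong = score_against_musts("SELECT id FROM orders WHERE paid = 1 ORDER BY id")
print(weak)    # all three musts missing, so the score bottoms out at 1
print(strong)  # every must met, so the score stays at 10
```

Because each deduction is tied to a named must, the feedback can point at a specific gap, which is what gives the editor something to act on.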

Open start/agent.py. The two dials (BAR, MAX_ROUNDS), the Scorecard schema, the writer, and the editor are provided. The evaluator’s instructions are blank; that rubric is TODO 1. The while loop is wired but its body is empty; the two operations inside (TODO 2, TODO 3) are what make each round count.

Use it when feedback measurably improves the output and the model can give that feedback: translation, where a critic catches a lost nuance; writing, where a draft is graded and tightened; code, where tests are the evaluator. Skip it when there is no clear criterion to score against, or one pass is already good enough. Every extra round costs two more calls, so the bar should be worth the spend.
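As a rough budget check, the cost of a run can be estimated up front. This sketch assumes one writer call and one initial evaluator call before the loop, then one editor call and one evaluator call per round; the `total_calls` helper is illustrative and not part of the course code.

```python
def total_calls(rounds: int) -> int:
    """Model calls for a run: draft + first score, then edit + re-score per round."""
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    return 2 + 2 * rounds


MAX_ROUNDS = 3
for rounds in range(MAX_ROUNDS + 1):
    print(f"{rounds} round(s): {total_calls(rounds)} calls")
```

Under these assumptions a run that hits a cap of 3 makes 8 calls, four times what a run with no improvement rounds makes.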

  1. Run it and watch the gap. Run make p4. With no rubric the score is arbitrary, and with an empty loop body it repeats every round before stopping at the cap. Those two gaps are what you fill in.
  2. Write the rubric (TODO 1). Fill the evaluator’s instructions: choose 2-3 concrete musts a good Lisbon pitch needs, say to deduct for each missing one, and ask for one piece of feedback. Re-run; the score now reflects real criteria (still static until the loop body is wired).
  3. Improve, then re-score (TODO 2, 3). Inside the loop, send the latest card.feedback and pitch to the editor and set pitch to its new .output; then score the new pitch with the evaluator and reassign card from its .output so the while condition sees the new number.
  4. Report honestly (TODO 4). After the loop, print a line naming which ending happened, cleared BAR or hit MAX_ROUNDS, then the pitch. A card.score >= BAR check tells them apart.
  5. Turn the dials (poke it). Raise BAR to 10 and watch it run to the cap (a perfect score is hard). Lower it to 5 and watch it stop after one round. The bar sets the standard, the cap sets the budget.
  6. Check you’ve got it. Point at the two dials and say what each controls, and how this differs from the single check-and-fix in p1. The trace alternates write, score, improve, score; a run that cleared the bar early has fewer spans than one that hit the cap.
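The control flow in the steps above can be tried without any model calls. The sketch below is not the course solution: `evaluate` and `improve` are fake, deterministic stand-ins for the evaluator and editor agents, and `run` is a hypothetical wrapper that takes the two dials as parameters so they are easy to turn.

```python
from dataclasses import dataclass


@dataclass
class Card:
    score: int
    feedback: str


def evaluate(pitch: str) -> Card:
    """Fake evaluator: 4 points, plus 2 per '+' the fake editor appended, capped at 10."""
    score = min(10, 4 + 2 * pitch.count("+"))
    return Card(score, "add one more concrete detail")


def improve(pitch: str, feedback: str) -> str:
    """Fake editor: appends a '+' to mark one applied improvement."""
    return pitch + "+"


def run(bar: int, max_rounds: int) -> tuple[int, int, str]:
    pitch = "draft"
    card = evaluate(pitch)
    rounds = 0
    while card.score < bar and rounds < max_rounds:
        pitch = improve(pitch, card.feedback)
        card = evaluate(pitch)  # reassign so the while condition sees the new score
        rounds += 1
    # The same comparison that ends the loop also tells the two endings apart.
    ending = "cleared BAR" if card.score >= bar else "hit MAX_ROUNDS"
    return rounds, card.score, ending


print(run(bar=8, max_rounds=3))   # (2, 8, 'cleared BAR')
print(run(bar=10, max_rounds=2))  # (2, 8, 'hit MAX_ROUNDS')
```

Both runs reach the same score after the same number of rounds, yet they end differently, which is why the final report has to check the score against the bar rather than count rounds.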

Stuck? finish/agent.py is one version. Read it after you’ve had a real go; your rubric will read differently.

  • No cap. while card.score < BAR with no round limit can loop forever if the score never clears the bar. The cap is not optional.
  • A bar nothing can reach. If BAR is higher than the model reliably scores, every run burns all rounds and stops short. Set it where improvement is real but reachable.
  • Vague rubric. “Make it better” gives the editor nothing to act on. Named musts (“mention the budget”) are what drive the score up.
  • Trusting one judge. A single evaluator can be wrong. For high-stakes scoring, combine this with the voting idea from p3.
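One possible way to soften the single-judge risk is to ask several evaluators and take the median score. This is a sketch under assumptions: the median rule is an illustration rather than necessarily what p3 does, and the `combined_score` helper and the three lambda judges are hypothetical.

```python
from statistics import median
from typing import Callable, Sequence


def combined_score(pitch: str, judges: Sequence[Callable[[str], int]]) -> float:
    """Ask every judge for a score and return the median, which resists one outlier."""
    if not judges:
        raise ValueError("need at least one judge")
    return median(judge(pitch) for judge in judges)


# Three hypothetical judges; the third is badly miscalibrated.
judges = [lambda p: 7, lambda p: 8, lambda p: 2]
print(combined_score("Lisbon in three days", judges))  # 7
```

The trade-off is cost: each extra judge adds one more call to every scoring step in the loop.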

Evaluator-optimizer vs the chain in p1

p1 writes, checks, and fixes exactly once: draft, gate, edit, done. The evaluator-optimizer repeats that check-and-fix until a measurable bar is met or the round cap stops it. Use the chain when one corrective pass is enough; use the loop when quality is a moving target and each pass demonstrably improves it, holding to a target like a thermostat.

That closes the workflow half of the Patterns track. Next up is p5, where the model takes over the control flow: one agent with several tools sequences them itself.