p4: Evaluator-optimizer
A chain (p1) checks once and fixes once. An evaluator-optimizer loops: score the work, improve it, score the new version, and keep going until it is good enough or you hit a safety cap. One agent writes, another grades it against a rubric, and the feedback flows back into the next attempt.
The rubric is what makes the score mean something, so the rubric is what you write here.
Quick path
In a hurry? These three steps are the whole challenge. Everything below is the why and the how.
- Run make p4: the evaluator has no rubric (arbitrary score) and the loop body is empty, so the same score repeats every round until it stops at the cap.
- Edit start/agent.py: write the evaluator’s rubric (TODO 1), improve the pitch with the editor (TODO 2), re-score with the evaluator (TODO 3), then report which ending happened (TODO 4).
- Done when the score climbs across rounds and the loop ends by either clearing BAR (8) or stopping at MAX_ROUNDS (3), naming which.
Mental model
The loop exits on the BAR (a high enough score) or the cap (MAX_ROUNDS), so it always terminates.
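The exit condition can be sketched in a few lines. This is a minimal illustration, not the contents of start/agent.py: the helper names and the pre-set score lists are made up for the example, and only BAR = 8 and MAX_ROUNDS = 3 come from the challenge.

```python
BAR = 8         # good-enough score (value from the challenge)
MAX_ROUNDS = 3  # safety cap on improvement rounds (value from the challenge)


def should_continue(score: int, rounds_done: int) -> bool:
    """Keep looping only while the score is below the bar AND rounds remain."""
    return score < BAR and rounds_done < MAX_ROUNDS


def rounds_until_exit(scores: list[int]) -> int:
    """Hypothetical helper: scores[0] is the first draft's score and
    scores[i] is the score after round i. Returns how many rounds ran.
    The list needs MAX_ROUNDS + 1 entries."""
    rounds = 0
    while should_continue(scores[rounds], rounds):
        rounds += 1
    return rounds


print(rounds_until_exit([4, 6, 8, 9]))  # the bar ends the loop after 2 rounds
print(rounds_until_exit([4, 4, 4, 4]))  # the cap ends the loop after 3 rounds
```

Because the condition joins both checks with "and", either one failing is enough to stop, which is why a score that never improves still ends at the cap.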
The mechanic, in another domain
Forget travel. Say you are grading an answer to a SQL question and improving it until it is good. The evaluator’s prompt is a rubric: the named criteria the score measures against.
The rubric is the lever. “Make it better” grades nothing, but three named musts give a real score and give the editor a target. A loop then scores, improves with the feedback, and scores again until the number clears a bar or a round cap stops it. Below you write TripMate’s rubric, then wire that loop.
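As a rough illustration of what a rubric with named musts could look like for the SQL case, here is a sketch. The three musts, the 3-point deduction, and the toy_score helper are all assumptions invented for this example; in the real loop the model applies the rubric, not a Python function.

```python
# Hypothetical rubric text for an evaluator grading a SQL answer.
SQL_RUBRIC = """\
Score the SQL answer from 1 to 10.
Start at 10 and deduct 3 for each missing must:
1. Names the columns it selects instead of using SELECT *.
2. Filters rows with a WHERE clause.
3. Explains in one sentence what the query returns.
Give one piece of feedback: the most important must to fix next.
"""


def toy_score(musts_met: list[bool]) -> int:
    """Toy stand-in for the judge: the same deduction rule the rubric states,
    floored at 1 so the score stays on the 1-10 scale."""
    return max(1, 10 - 3 * musts_met.count(False))


print(toy_score([True, True, True]))    # every must present
print(toy_score([True, False, False]))  # two musts missing
```

The point of the deduction rule is that each missing must moves the number by a known amount, so a rising score says something specific about what the editor fixed.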
The setup
Open start/agent.py. The two dials (BAR, MAX_ROUNDS), the Scorecard schema, the writer, and the editor are provided. The evaluator’s instructions are blank; writing that rubric is TODO 1. The while loop is wired but its body is empty; the two operations inside (TODO 2, TODO 3) are what make each round count.
When this fits
Use it when feedback measurably improves the output and the model can give that feedback: translation, where a critic catches a lost nuance; writing, where a draft is graded and tightened; code, where tests are the evaluator. Skip it when there is no clear criterion to score against, or one pass is already good enough. Every extra round costs two more calls, so the bar should be worth the spend.
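The cost of extra rounds is simple arithmetic. The sketch below assumes a baseline of two calls before the loop starts (one writer call and one initial evaluator call); that baseline is an assumption about the setup, while the two calls per round (editor plus evaluator) come from the text above.

```python
def model_calls(rounds: int) -> int:
    """Total model calls for a run, assuming 1 writer call + 1 initial
    evaluator call, then 1 editor call + 1 evaluator call per round."""
    return 2 + 2 * rounds


MAX_ROUNDS = 3  # cap from the challenge
print(model_calls(0))           # draft cleared the bar immediately
print(model_calls(MAX_ROUNDS))  # worst case: the loop ran to the cap
```

Under these assumptions a run that hits the cap costs four times as many calls as one whose first draft clears the bar.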
Build it
- Run it and watch the gap. Run make p4. With no rubric the score is arbitrary, and with an empty loop body it repeats every round before stopping at the cap. The two are what you fill.
- Write the rubric (TODO 1). Fill the evaluator’s instructions: choose 2-3 concrete musts a good Lisbon pitch needs, say to deduct for each missing one, and ask for one piece of feedback. Re-run; the score now reflects real criteria (still static until the loop body is wired).
- Improve, then re-score (TODO 2, 3). Inside the loop, send the latest card.feedback and pitch to the editor and set pitch to its new .output; then score the new pitch with the evaluator and reassign card from its .output so the while condition sees the new number.
- Report honestly (TODO 4). After the loop, print a line naming which ending happened, cleared BAR or hit MAX_ROUNDS, then the pitch. A card.score >= BAR check tells them apart.
- Turn the dials (poke it). Raise BAR to 10 and watch it run to the cap (a perfect score is hard). Lower it to 5 and watch it stop after one round. The bar sets the standard, the cap sets the budget.
- Check you’ve got it. Point at the two dials and say what each controls, and how this differs from the single check-and-fix in p1. The trace alternates write, score, improve, score; a run that cleared the bar early has fewer spans than one that hit the cap.
Stuck? finish/agent.py is one version. Read it after you’ve had a real go; your rubric will read differently.
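For orientation, here is one way the wiring for TODOs 2 to 4 could look, run against stub agents so it executes without a model. Everything except BAR, MAX_ROUNDS, Scorecard, card.score, card.feedback, and .output is an assumption: the stub evaluator and editor, the Result wrapper, the three musts, the call signatures, and the scoring rule are invented for this sketch and are not the real agents' API.

```python
from dataclasses import dataclass

BAR = 8          # score that counts as good enough
MAX_ROUNDS = 3   # safety cap on improvement rounds
MUSTS = ("budget", "dates", "neighborhood")  # hypothetical rubric musts


@dataclass
class Scorecard:
    score: int
    feedback: str


@dataclass
class Result:
    """Stand-in for an agent result that exposes .output."""
    output: object


def evaluator(pitch: str) -> Result:
    """Stub judge: deduct 2 per missing must and never award a perfect 10."""
    missing = [m for m in MUSTS if m not in pitch]
    score = min(9, 10 - 2 * len(missing))
    feedback = f"mention the {missing[0]}" if missing else "polish the wording"
    return Result(Scorecard(score=score, feedback=feedback))


def editor(pitch: str, feedback: str) -> Result:
    """Stub editor: applies the single piece of feedback literally."""
    return Result(f"{pitch} Revised to {feedback}.")


def run(bar: int = BAR, max_rounds: int = MAX_ROUNDS):
    pitch = "Three days in Lisbon."  # stands in for the writer's first draft
    card = evaluator(pitch).output
    rounds = 0
    while card.score < bar and rounds < max_rounds:
        pitch = editor(pitch, card.feedback).output  # TODO 2: improve
        card = evaluator(pitch).output               # TODO 3: re-score
        rounds += 1
    # TODO 4: name which ending happened
    ending = "cleared BAR" if card.score >= bar else "hit MAX_ROUNDS"
    print(f"{ending} after {rounds} round(s), score {card.score}")
    print(pitch)
    return pitch, card, rounds


run()        # default dials
run(bar=10)  # a bar the stub judge cannot reach
run(bar=5)   # a low bar
```

With these stubs the score climbs 4, 6, 8 and the default run clears the bar after two rounds; raising the bar to 10 runs all three rounds and stops at the cap with a score of 9, and lowering it to 5 stops after one round.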
- No cap. while card.score < BAR with no round limit can loop forever if the score never clears the bar. The cap is not optional.
- A bar nothing can reach. If BAR is higher than the model reliably scores, every run burns all rounds and stops short. Set it where improvement is real but reachable.
- Vague rubric. “Make it better” gives the editor nothing to act on. Named musts (“mention the budget”) are what drive the score up.
- Trusting one judge. A single evaluator can be wrong. For high-stakes scoring, combine this with the voting idea from p3.
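The last point can be sketched as a small panel of judges whose scores are aggregated before the bar check. This assumes the voting idea means running several evaluators and combining their scores; the median, the panel_score name, and the sample votes are choices made for this example, not something taken from p3.

```python
from statistics import median

BAR = 8  # bar from the challenge


def panel_score(scores: list[int]) -> float:
    """Combine several judges' scores; the median shrugs off one outlier."""
    return median(scores)


votes = [8, 3, 9]  # hypothetical scores from three evaluator runs
print(panel_score(votes) >= BAR)  # the single low vote does not sink the pitch
```

The trade-off is cost: each extra judge adds one more call to every scoring step in the loop.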
A couple of things worth knowing
Evaluator-optimizer vs the chain in p1
p1 writes, checks, and fixes exactly once: draft, gate, edit, done. This loops the check-and-fix until a measurable bar is met. Use the chain when one corrective pass is enough; use the loop when quality is a moving target and each pass demonstrably improves it, holding to a target like a thermostat.
That closes the workflow half of the Patterns track. Next up is p5, where the model takes over the control flow: one agent with several tools sequences them itself.