Testing & Iteration
Test Models on Real Agent Tasks
Chat benchmarks don't predict agent performance. Tool-calling reliability, instruction following, and structured-output accuracy matter more than raw reasoning scores.
What to test:
- Can it make well-formed tool calls consistently?
- Does it follow multi-step instructions without skipping steps?
- Does it verify results before claiming success?
- Does it handle errors gracefully or spiral?
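The first check above is easy to script. A minimal sketch, assuming a hypothetical tool-call format of JSON objects with `name` and `arguments` fields (adapt the schema to whatever your runtime actually emits):

```python
import json

REQUIRED_KEYS = {"name", "arguments"}  # hypothetical tool-call schema


def is_well_formed(raw: str) -> bool:
    """True if `raw` parses as a tool call with the expected shape."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(call, dict)
        and REQUIRED_KEYS <= call.keys()
        and isinstance(call["name"], str)
        and isinstance(call["arguments"], dict)
    )


def pass_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that are well-formed tool calls."""
    if not outputs:
        return 0.0
    return sum(is_well_formed(o) for o in outputs) / len(outputs)
```

Run the same prompt a few dozen times per model and compare `pass_rate` values; consistency under repetition is the point, not a single lucky success.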
Operators have seen models that benchmark beautifully but hallucinate tool execution — claiming to have run commands and produced files that don't exist. Test on your actual workflows before committing.
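One cheap guard against hallucinated execution is to verify the agent's claims against the filesystem before accepting a run. A sketch, assuming a hypothetical `claimed_paths` list you extract from the agent's response:

```python
from pathlib import Path


def missing_artifacts(claimed_paths: list[str]) -> list[str]:
    """Return the subset of claimed output files that do not exist on disk."""
    return [p for p in claimed_paths if not Path(p).exists()]


def run_passed(claimed_paths: list[str]) -> bool:
    """Treat any missing artifact as a failed run, not a model quirk."""
    return not missing_artifacts(claimed_paths)
```

The same pattern generalizes: if the agent claims it ran a command, check the exit code or log; if it claims it edited a file, diff it.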
For detailed model behavioral profiles, see Model Behavioral Differences.
Expect Iteration
The gap between a first conversation and reliable daily autonomous operation is real. It closes with each release, but it's still measured in weeks, not hours.
Realistic trajectory:
- Week 1: Basic setup, first working conversation, initial workspace files, first "why did it do that?" moment
- Week 2: First cron job, one integration working end-to-end, initial guardrails from mistakes
- Weeks 3-4: Multiple integrations, reliable crons, agent personality solidifying, learning from daily notes
- Month 2+: Autonomous operation within bounded domains, self-improvement loops, the agent starts catching things before you do
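The "first cron job" milestone is usually a single scheduled entry. A hypothetical crontab line, assuming an `agent` CLI that runs a named task (both the binary path and the task name are placeholders, not a real tool):

```shell
# Hypothetical: run a daily-summary task at 07:00, append output to a log
0 7 * * * /usr/local/bin/agent run daily-summary >> /var/log/agent/daily.log 2>&1
```

Redirecting both stdout and stderr to a log matters here: the log is where you find the mistakes that become next week's guardrails.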
Every failure is a rule you didn't write yet. Every repeated mistake is a guardrail gap. The workspace files are a living document — they get better because things go wrong.