Our model out-simulates the frontier.
Apriori predicts what a real user does on a real screen. We published the proof: on screen-conditioned action prediction, a fine-tuned 8B model beats frontier models used zero-shot. The benchmark, method, and limits are all in the open, so you can check every number.
Semantic similarity for the fine-tuned Qwen3-VL-8B, vs 0.459 (Claude Opus 4.7) and 0.482 (GPT-5.5) zero-shot.
Share of held-out rows clearing the 0.7 quality bar, vs 1–2% for the frontier baselines.
PiSAR training tuples from app reviews, demographics, and shopping traces; 661-row held-out test set.
Benchmark
The PiSAR leaderboard — our model, the frontier baselines, and the metric definitions. Run your own model and we'll list it.
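If you want to sanity-check a model before submitting it, the two headline numbers (mean semantic similarity and the share of held-out rows clearing the 0.7 bar) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the official PiSAR scorer: it assumes similarity is the cosine between an embedding of the predicted action and an embedding of the real one, and that "clearing" the bar means a score of at least 0.7. The names `cosine`, `summarize`, and `QUALITY_BAR` are hypothetical; the leaderboard's metric definitions are authoritative.

```python
from math import sqrt

QUALITY_BAR = 0.7  # the quality bar quoted on this page


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero embedding as "no similarity"
    return dot / (norm_a * norm_b)


def summarize(scores, bar=QUALITY_BAR):
    """Return (mean score, share of rows with score >= bar)."""
    if not scores:
        raise ValueError("need at least one score")
    mean = sum(scores) / len(scores)
    share = sum(1 for s in scores if s >= bar) / len(scores)
    return mean, share


# Illustrative scores only, not benchmark results.
mean_score, share_clearing = summarize([0.9, 0.5, 0.7, 0.3])
```

In practice you would compute one score per held-out row by calling `cosine` on the predicted and reference embeddings, then pass the full list to `summarize`.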
Method
In plain English: what semantic similarity measures, why 0.7 is the bar, what held-out means, and why task-specific fine-tuning beats zero-shot personas.
Limits
What simulations can and can't do — stated honestly, including what the paper itself shows doesn't transfer.
Changelog
Model versions and evaluation deltas over time.