Science

Our model out-simulates the frontier.

Apriori predicts what a real user does on a real screen. We published the proof: on screen-conditioned action prediction, a fine-tuned 8B model beats frontier models used zero-shot — and we put the benchmark, method, and limits in the open so you can check every number.

0.783

Semantic similarity for fine-tuned Qwen3-VL-8B, vs 0.459 (Claude Opus 4.7) and 0.482 (GPT-5.5) zero-shot.

79%

Held-out rows clearing the 0.7 quality bar, vs 1–2% for the frontier baselines.

12,929

PiSAR training tuples from app reviews, demographics, and shopping traces; 661-row held-out test set.
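To make the two headline metrics concrete, here is a minimal sketch of how a per-row similarity score turns into an average and a pass rate. It is an illustration under stated assumptions, not the paper's evaluation code: we assume semantic similarity is a cosine similarity between embedding vectors of the predicted and the actual action, and that "clearing" the bar means scoring at or above 0.7. The names `cosine_similarity`, `summarize`, and `QUALITY_BAR` are hypothetical, and the vectors are toy values, not PiSAR data.

```python
import math

# Threshold quoted on this page; treating "clearing" as >= is an assumption.
QUALITY_BAR = 0.7


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def summarize(scores, bar=QUALITY_BAR):
    """Return (mean score, share of rows scoring at or above the bar)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    mean_score = sum(scores) / len(scores)
    pass_rate = sum(1 for s in scores if s >= bar) / len(scores)
    return mean_score, pass_rate


# Toy (predicted, actual) embedding pairs, purely illustrative.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
    ([3.0, 4.0], [4.0, 3.0]),  # close but not identical
]
scores = [cosine_similarity(pred, actual) for pred, actual in pairs]
mean_score, pass_rate = summarize(scores)
print(f"mean similarity: {mean_score:.3f}, pass rate: {pass_rate:.0%}")
```

The two numbers answer different questions: the mean describes typical quality, while the pass rate counts how many individual rows are good enough to use. That is why a baseline can average near 0.46 and still have only 1–2% of rows clear the bar.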

Read the PDF ↗

Paper summary

Benchmark

The PiSAR leaderboard — our model, the frontier baselines, and the metric definitions. Run your own model and we'll list it.

Method

In plain English: what semantic similarity measures, why 0.7 is the bar, what held-out means, and why task-specific fine-tuning beats zero-shot personas.

Limits

What simulations can and can't do — stated honestly, including what the paper itself shows doesn't transfer.

Changelog

Model versions and evaluation deltas over time.