Science

Our model out-simulates the frontier.

Apriori predicts what a real user does on a real screen. We published the proof: on screen-conditioned action prediction, a fine-tuned 8B model beats frontier models used zero-shot. The benchmark, method, and limits are all public, so you can check every number.

0.783

Semantic similarity for fine-tuned Qwen3-VL-8B, vs 0.459 (Claude Opus 4.7) and 0.482 (GPT-5.5) zero-shot.

79%

Share of held-out rows clearing the 0.7 quality bar, vs 1–2% for the frontier baselines.

12,929

PiSAR training tuples from app reviews, demographics, and shopping traces; 661-row held-out test set.
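
To make the two headline metrics concrete, here is a minimal sketch of how an average similarity score and a "rows clearing the bar" rate could be computed from per-row scores. It assumes each held-out row gets one similarity score between 0 and 1, that the headline score is a plain mean, and that "clearing" means scoring at or above 0.7; none of these details are confirmed by the figures above. The function name `summarize` and the example scores are hypothetical, not benchmark code or data.

```python
from statistics import mean

QUALITY_BAR = 0.7  # the quality bar quoted above


def summarize(scores, bar=QUALITY_BAR):
    """Return (mean score, fraction of rows scoring at or above the bar).

    Assumption: one similarity score per held-out row, each in [0, 1].
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    passed = sum(1 for s in scores if s >= bar)
    return mean(scores), passed / len(scores)


# Made-up scores for illustration only; these are not benchmark results.
example_scores = [0.9, 0.8, 0.75, 0.6, 0.3]
avg, pass_rate = summarize(example_scores)
print(f"mean similarity: {avg:.3f}, rows clearing the bar: {pass_rate:.0%}")
```

The two numbers answer different questions: the mean can be pulled up by a few very high scores, while the pass rate counts how many individual predictions are good enough to use.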