The PiSAR benchmark

PiSAR evaluates screen-conditioned action prediction: given a persona and a product screen, predict the action a real user would take. Scores are semantic similarity between the predicted action and the action the real user actually took, on a 661-row held-out test set. Higher is better.

Model | Type | Semantic similarity | % ≥ 0.7
Apriori (fine-tuned Qwen3-VL-8B-Instruct) | Fine-tuned | 0.783 | 79%
GPT-5.5 | Zero-shot | 0.482 | 1–2%
Claude Opus 4.7 | Zero-shot | 0.459 | 1–2%
Gemma-4-26B-A4B-IT (same recipe) | Fine-tuned | 0.441 |

Eval set: 661-row PiSAR held-out split · baselines evaluated zero-shot · “% ≥ 0.7” is the share of rows clearing the quality bar (1–2% is the combined baseline figure). Submitted May 28, 2026.
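
As a rough illustration of how the two reported columns relate to per-row scores, here is a minimal Python sketch. It assumes you already have one similarity score per test row. The page does not say how semantic similarity is computed, so the cosine-over-embeddings helper is only an assumed stand-in, and the function names `cosine_similarity` and `summarize` are hypothetical, not part of any PiSAR tooling.

```python
import math
from statistics import mean

QUALITY_BAR = 0.7  # threshold taken from the "% ≥ 0.7" column


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors.

    Assumption: a stand-in for the benchmark's semantic-similarity
    metric, whose exact definition is not given on the page.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


def summarize(scores, bar=QUALITY_BAR):
    """Return (mean similarity, share of rows scoring at or above the bar)."""
    if not scores:
        raise ValueError("scores must not be empty")
    share = sum(1 for s in scores if s >= bar) / len(scores)
    return mean(scores), share


# Toy scores for illustration only, not benchmark data.
avg, share = summarize([0.9, 0.75, 0.4, 0.7])
print(f"mean similarity: {avg:.3f}, share >= {QUALITY_BAR}: {share:.0%}")
```

On the real split, `scores` would hold 661 values, one per held-out row, and the two returned numbers would correspond to the "Semantic similarity" and "% ≥ 0.7" columns.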

What the table shows

A task-specific fine-tune of an 8B vision-language model beats far larger frontier models used zero-shot, and by a wide margin on both measures: 0.783 average similarity against 0.482 and 0.459, and 79% of predictions clearing the 0.7 quality bar against 1–2%.

The last row is the honest part. The same training recipe applied to a larger reasoning-tuned model (Gemma-4-26B-A4B-IT) reaches only 0.441, below both zero-shot frontier baselines (0.482 and 0.459). We call this a recipe-vs-model mismatch: the bigger reasoning-tuned model resists displacement and would need more data or a stronger method to move. Fine-tuning is not free lift; the architecture and recipe have to match.

Run your model on PiSAR

The benchmark is meant to be contested. If you have a model — frontier, open, or fine-tuned — evaluate it on the PiSAR held-out split and send us the number; we'll add it to this table with a link to your method. Email rahul.bissa@apriori.work.