The PiSAR benchmark
PiSAR evaluates screen-conditioned action prediction: given a persona and a product screen, predict the action a real user would take. Scores are semantic similarity between the predicted action and the action the real user actually took, on a 661-row held-out test set. Higher is better.
| Model | Type | Semantic similarity | % ≥ 0.7 |
|---|---|---|---|
| Apriori — fine-tuned Qwen3-VL-8B-Instruct | Fine-tuned | 0.783 | 79% |
| GPT-5.5 | Zero-shot | 0.482 | 1–2% |
| Claude Opus 4.7 | Zero-shot | 0.459 | 1–2% |
| Gemma-4-26B-A4B-IT (same recipe) | Fine-tuned | 0.441 | — |
Eval set: 661-row PiSAR held-out split · baselines evaluated zero-shot · “% ≥ 0.7” is the share of rows clearing the quality bar (1–2% is the combined baseline figure). Submitted May 28, 2026.
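The two reported columns can be reproduced from per-row scores with a few lines of code. The sketch below is illustrative only: this page does not specify how semantic similarity is computed, so `cosine_similarity` over embedding vectors is an assumption, and the function names and toy scores are hypothetical rather than part of any PiSAR tooling. Only the 0.7 quality bar comes from the table above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors.

    Assumption: the benchmark's similarity metric is not documented here;
    cosine over text embeddings is one common choice, used for illustration.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def summarize(scores, threshold=0.7):
    """Return (mean similarity, share of rows at or above the threshold)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    mean = sum(scores) / len(scores)
    share = sum(1 for s in scores if s >= threshold) / len(scores)
    return mean, share


# Hypothetical per-row scores, not PiSAR data.
toy_scores = [0.9, 0.8, 0.5, 0.7]
mean, share = summarize(toy_scores)
print(f"mean similarity: {mean:.3f}, share >= 0.7: {share:.0%}")
```

On the real benchmark, `scores` would hold one similarity value per row of the 661-row held-out split, computed between each predicted action and the action the real user took.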
What the table shows
A task-specific fine-tune of an 8B vision-language model beats far larger frontier models used zero-shot on both measures: average similarity (0.783 versus 0.482 and 0.459) and the share of predictions that clear the quality bar (79% versus 1–2%).
The last row is the honest part. The same training recipe applied to a larger reasoning-tuned model (Gemma-4-26B-A4B-IT) reaches only 0.441, below both zero-shot frontier baselines (0.482 and 0.459). We call this a recipe-vs-model mismatch: the bigger reasoning-tuned model resists displacement and would need more data or a stronger method to move. Fine-tuning is not free lift; the architecture and recipe have to match.
Run your model on PiSAR
The benchmark is meant to be contested. If you have a model — frontier, open, or fine-tuned — evaluate it on the PiSAR held-out split and send us the number; we'll add it to this table with a link to your method. Email rahul.bissa@apriori.work.