Science

The PiSAR benchmark

PiSAR evaluates screen-conditioned action prediction: given a persona and a product screen, predict the action a real user would take. Each prediction is scored by its semantic similarity to the action the real user actually took, and the table reports the average over a 661-row held-out test set. Higher is better.

Model	Type	Semantic similarity	% ≥ 0.7
Apriori — fine-tuned Qwen3-VL-8B-Instruct	Fine-tuned	0.783	79%
GPT-5.5	Zero-shot	0.482	1–2%
Claude Opus 4.7	Zero-shot	0.459	1–2%
Gemma-4-26B-A4B-IT (same recipe)	Fine-tuned	0.441	—

Eval set: 661-row PiSAR held-out split · baselines evaluated zero-shot · “% ≥ 0.7” is the share of rows clearing the quality bar (1–2% is the combined baseline figure). Submitted May 28, 2026.
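
To make the two reported columns concrete, here is a minimal sketch of how per-row similarity scores could be reduced to an average and a "% ≥ 0.7" share. The function name `summarize` and the toy scores are assumptions made for illustration; they are not part of the PiSAR tooling or data, and we assume here that a row clears the bar when its score is at least 0.7.

```python
from statistics import mean

QUALITY_BAR = 0.7  # threshold behind the "% ≥ 0.7" column


def summarize(scores, bar=QUALITY_BAR):
    """Return (mean similarity, share of rows scoring at or above the bar)."""
    if not scores:
        raise ValueError("need at least one per-row score")
    share = sum(s >= bar for s in scores) / len(scores)
    return mean(scores), share


# Toy per-row scores, invented for illustration (not PiSAR results).
toy_scores = [0.9, 0.8, 0.75, 0.6, 0.7]
avg, share = summarize(toy_scores)
print(f"semantic similarity: {avg:.3f} | share >= {QUALITY_BAR}: {share:.0%}")
```

The two numbers answer different questions: the average can look respectable even when few individual predictions are good enough to use, which is why the table reports both.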

What the table shows

A task-specific fine-tune of an 8B vision-language model beats far larger frontier models used zero-shot, and by a wide margin on both measures: average similarity (0.783 versus 0.482 and 0.459) and the share of predictions that actually clear the quality bar (79% versus 1–2%).

The last row is the honest part. The same training recipe applied to a larger reasoning-tuned model (Gemma-4-26B-A4B-IT) reaches only 0.441, below both zero-shot frontier baselines (0.482 and 0.459). We call this a recipe-vs-model mismatch: the bigger reasoning-tuned model resists displacement and would need more data or a stronger method to move. Fine-tuning is not free lift; the architecture and recipe have to match.

Run your model on PiSAR

The benchmark is meant to be contested. If you have a model — frontier, open, or fine-tuned — evaluate it on the PiSAR held-out split and send us the number; we'll add it to this table with a link to your method. Email rahul.bissa@apriori.work.
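
A rough sketch of what an evaluation loop might look like, under stated assumptions. The `Row` fields, `score_rows`, `toy_predict`, and `toy_similarity` are all hypothetical names introduced here: the actual layout of the held-out split and the exact similarity metric are not described on this page. The token-overlap function in particular is only a placeholder so the example runs; it is not the benchmark's semantic-similarity metric.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Row:
    # Hypothetical row layout; the real split's field names may differ.
    persona: str
    screen: str       # e.g. a path to a screenshot
    gold_action: str  # the action the real user actually took


def score_rows(
    rows: Iterable[Row],
    predict: Callable[[str, str], str],
    similarity: Callable[[str, str], float],
) -> List[float]:
    """Predict an action for each row and score it against the real action."""
    scores = []
    for row in rows:
        predicted = predict(row.persona, row.screen)
        scores.append(similarity(predicted, row.gold_action))
    return scores


def toy_predict(persona: str, screen: str) -> str:
    # Stand-in for your model; always predicts the same action.
    return "tap checkout"


def toy_similarity(a: str, b: str) -> float:
    # Placeholder: Jaccard overlap of lowercase tokens.
    # NOT the benchmark's metric; swap in the real scorer.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0


# Toy rows, invented for illustration (not PiSAR data).
rows = [
    Row("bargain hunter", "cart.png", "tap checkout"),
    Row("new visitor", "home.png", "open search"),
]
scores = score_rows(rows, toy_predict, toy_similarity)
print(scores)
```

Keeping `predict` and `similarity` as plain callables means the same loop works for a frontier API, an open model, or a fine-tune; only the two functions change.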
