How we measure accuracy

A single number like 0.783 means nothing without knowing what it measures. Here is the whole method in plain English — no jargon walls — so you can decide for yourself whether the claim holds.

The task: screen-conditioned action prediction

Every claim on this site comes from one task. We give a model two things: a description of a person (a persona — their demographics, context, and goals) and a single product screen they are looking at. The model has to predict the action that person takes next — tap this, scroll past that, abandon here.

This is the atomic unit of a product simulation. If a model can predict the next action well across thousands of persona-and-screen pairs, it can walk a persona through a whole flow and tell you where real users would hesitate or drop off.
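
A minimal sketch of that idea, assuming a predictor is simply a function from a persona and a screen to a free-text action. The names Persona, Screen, walk_flow and next_screen are hypothetical, introduced here for illustration only; they are not the actual code behind this site.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Persona:
    # Hypothetical fields mirroring "demographics, context, and goals".
    demographics: str
    context: str
    goals: str


@dataclass(frozen=True)
class Screen:
    # Hypothetical: a screen identified by name, plus a text description.
    name: str
    description: str


# A predictor maps (persona, screen) to a free-text next action.
Predictor = Callable[[Persona, Screen], str]


def walk_flow(
    persona: Persona,
    start: Screen,
    predict: Predictor,
    next_screen: Callable[[Screen, str], Optional[Screen]],
    max_steps: int = 20,
) -> List[Tuple[str, str]]:
    """Repeat single-step prediction until the flow ends.

    Returns the (screen name, predicted action) pairs in order. The last
    pair shows where the persona stopped or dropped off.
    """
    trace: List[Tuple[str, str]] = []
    screen: Optional[Screen] = start
    for _ in range(max_steps):
        if screen is None:
            break
        action = predict(persona, screen)
        trace.append((screen.name, action))
        screen = next_screen(screen, action)  # None means the flow is over
    return trace
```

The single-step predictor is the only learned part in this sketch; next_screen stands in for whatever maps an action to the following screen in a given product.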

The data: PiSAR

We built a dataset called PiSAR: 12,929 persona-screen-action tuples drawn from real-world signals (app reviews, demographic data, and shopping traces). From it we hold out a 661-row test set that no model is trained on. Every accuracy number is computed on that held-out set, so a model can't score well just by memorising what it has already seen. “Held-out” is the part that makes the number honest.
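
To make the idea of a held-out set concrete, here is a rough sketch of a split with no overlap between training and test rows. It assumes a seeded random shuffle; how the 661 PiSAR test rows were actually selected is not stated here, and split_held_out is a hypothetical helper.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

TOTAL_ROWS = 12_929  # size of PiSAR, from the text
TEST_ROWS = 661      # size of the held-out test set, from the text


def split_held_out(
    rows: Sequence[T], n_test: int, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Return (train, test) with no row appearing in both.

    Assumption: test rows are picked by a seeded random shuffle. The source
    does not say how its test rows were chosen.
    """
    if not 0 <= n_test <= len(rows):
        raise ValueError("n_test must be between 0 and len(rows)")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# Row indices stand in for the real tuples.
train, test = split_held_out(range(TOTAL_ROWS), TEST_ROWS)
held_out_share = TEST_ROWS / TOTAL_ROWS  # a little over 5% of the data
```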

The metric: semantic similarity, not string match

Real actions are described in words, and there are many right ways to say the same thing — “taps Continue” and “proceeds to the next step” are the same decision. So we don't check for an exact string match. We measure semantic similarity: how close the meaning of the predicted action is to the action the real user actually took, on a 0-to-1 scale. 1.0 is the same decision; 0.0 is unrelated.
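
One common way to score similarity of meaning is cosine similarity between text embeddings. The sketch below assumes that approach; the source does not name the similarity model it uses, so treat this as an illustration rather than the exact metric. It takes embedding vectors as input (producing them is left to whatever embedding model you choose), and clipping negative values to 0 is an assumption made here to match the 0-to-1 scale.

```python
import math
from typing import Sequence


def semantic_similarity(
    pred_vec: Sequence[float], true_vec: Sequence[float]
) -> float:
    """Cosine similarity of two embedding vectors, clipped to [0, 1]."""
    if len(pred_vec) != len(true_vec):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(pred_vec, true_vec))
    norm = math.sqrt(sum(a * a for a in pred_vec)) * math.sqrt(
        sum(b * b for b in true_vec)
    )
    if norm == 0.0:
        return 0.0  # an all-zero vector carries no meaning to compare
    # Raw cosine lies in [-1, 1]; clip so unrelated or opposed vectors score 0.
    return min(1.0, max(0.0, dot / norm))
```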

The bar: why 0.7

Average similarity is one view, but a single average can hide a model that is vaguely right on average and decisively wrong in practice. So we also report a pass rate: the share of predictions that clear a 0.7 similarity bar, the point at which a prediction is close enough to count as the same decision a real user made. This is the stricter, more honest number, because it counts decisions, not vibes.

A fine-tuned Qwen3-VL-8B clears 0.7 on 79% of held-out rows. The frontier zero-shot baselines clear it on 1–2%. That gap — not the raw average alone — is the result.
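
The difference between the average and the pass rate can be sketched in a few lines. The score lists below are made up for illustration and are not PiSAR results. The sketch also assumes that "clearing" the bar means a score at or above 0.7; the text does not say whether exactly 0.7 counts.

```python
from statistics import mean
from typing import Sequence

PASS_BAR = 0.7  # similarity threshold from the text


def pass_rate(scores: Sequence[float], bar: float = PASS_BAR) -> float:
    """Share of predictions whose similarity is at or above the bar.

    Assumption: "clears the bar" means score >= bar.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(s >= bar for s in scores) / len(scores)


# Two made-up score lists with the same mean but different pass rates.
steady = [0.72, 0.71, 0.73, 0.72]  # every prediction is just over the bar
uneven = [0.98, 0.98, 0.46, 0.46]  # half excellent, half clearly wrong

for name, scores in (("steady", steady), ("uneven", uneven)):
    print(name, "mean:", round(mean(scores), 3), "pass rate:", pass_rate(scores))
```

Both lists average the same similarity, yet only half of the second list clears the bar, which is the case a lone average would hide.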

Why fine-tuning beats zero-shot personas

A frontier model asked to role-play a persona “cold” is guessing how a demographic behaves from its general training. A model fine-tuned on PiSAR has learned the actual mapping from persona-and-screen to action from real traces. For this narrow, repeated task, the specialised smaller model decisively beats much larger general models.

But fine-tuning does not guarantee a lift. Applying the same recipe to a larger reasoning-tuned model (Gemma-4-26B-A4B-IT) produces scores below the zero-shot frontier baselines. The recipe and the architecture have to match. We report that result openly on the benchmark, and the limits discuss where the recipe doesn't transfer.

What this number is not

Semantic similarity on next-action prediction is a strong proxy for behavioural fidelity, but it is a proxy, not a guarantee of real-world conversion. We treat simulations as directional evidence that shows where and why users hesitate, quickly and cheaply, before you build. For how far that goes and where it stops, read the limits.
