What simulations can and can't do
The fastest way to lose a product team's trust is to oversell. Here is the honest version: what to use Apriori for, what to read as directional, and what it should not replace.
What it does well
- Finds where and why users hesitate. Across a flow, it surfaces the screens and steps that lose specific persona types, with the reasoning behind each drop-off — the “why” that analytics and session replay can't give you.
- Compares variants before you build. It gives directional A/B reads on copy, layout, pricing, and flow order within hours, against a consistent synthetic audience.
- Segments the result. Which cohorts convert, which bounce, and what drives each — instead of an aggregate that hides the disagreement.
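To make the segmentation point concrete, here is a minimal sketch of how an aggregate rate can hide disagreement between cohorts. The cohort names and results are invented for illustration and are not Apriori output; `conversion_by_cohort` is a hypothetical helper, not part of any product API.

```python
from collections import defaultdict

# Hypothetical simulation results as (cohort, converted) pairs.
# These rows are invented for illustration only.
results = [
    ("price_sensitive", False),
    ("price_sensitive", False),
    ("price_sensitive", True),
    ("price_sensitive", False),
    ("power_user", True),
    ("power_user", True),
    ("power_user", False),
    ("power_user", True),
]


def conversion_by_cohort(rows):
    """Return {cohort: conversion rate} for (cohort, converted) rows."""
    totals = defaultdict(lambda: [0, 0])  # cohort -> [conversions, runs]
    for cohort, converted in rows:
        totals[cohort][0] += int(converted)
        totals[cohort][1] += 1
    return {cohort: conv / runs for cohort, (conv, runs) in totals.items()}


# Aggregate rate across all runs, ignoring cohort.
overall = sum(converted for _, converted in results) / len(results)
by_cohort = conversion_by_cohort(results)

print(f"overall: {overall:.2f}")
for cohort, rate in by_cohort.items():
    print(f"{cohort}: {rate:.2f}")
```

In this toy data the overall rate is 0.50, while the two cohorts sit at 0.25 and 0.75. The aggregate alone would suggest a lukewarm flow rather than one that works for one audience and fails for another.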
Where it is directional, not deterministic
- Predicted, not measured. The accuracy we report is a semantic-similarity score on next-action prediction: a strong proxy for behaviour, not a guarantee of your live conversion rate. Treat outputs as evidence and hypotheses, not a forecast to bank revenue on.
- As good as the persona inputs. Simulations reflect the audience you describe. A misjudged audience produces a confident, wrong answer.
- Sample size matters. A 5-persona run is an early directional signal; quantitative claims need larger runs. We label small samples as directional in the reports themselves.
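As a rough illustration of why a 5-persona run is only directional, the sketch below computes a Wilson score interval for a conversion rate at two sample sizes. It treats each simulated persona as an independent yes/no trial, which is a simplifying assumption, and it is not a description of how Apriori labels its reports. The counts are made up for the example.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Same observed rate (60%) at two illustrative sample sizes.
for converted, total in [(3, 5), (120, 200)]:
    low, high = wilson_interval(converted, total)
    print(f"{converted}/{total} converted: plausible rate {low:.2f} to {high:.2f}")
```

With 3 of 5 personas converting, the interval spans roughly 0.23 to 0.88, which supports "this looks promising" but not a quantitative claim. At 120 of 200 the same 60% rate narrows to roughly 0.53 to 0.67.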
What does not transfer, per our own paper
We benchmark this in the open. The fine-tuning recipe that takes Qwen3-VL-8B to 0.783 on our benchmark does not transfer to every model. Applied to a larger reasoning-tuned model (Gemma-4-26B-A4B-IT), the same recipe reaches only 0.441, below the zero-shot frontier. Gains are tied to the architecture-and-recipe match, not guaranteed by scale. The benchmark is one task on one held-out test set; other tasks and audiences may behave differently.
What it should not replace
- High-stakes, irreversible calls. For decisions where being wrong is expensive — pricing you can't walk back, regulatory or safety-critical flows — use simulation to narrow the options, then confirm with live users.
- The final word on a launch. Simulation tells you where to look and what to fix first. Real-world measurement still settles the question.
Used this way — fast directional evidence that focuses expensive live research — simulation earns its place in the loop. Overused as an oracle, it doesn't. We would rather you trust the numbers you can check.