
Model changelog

A public log of how the model and benchmark evolve. This is the start: we add an entry each time the model or the evaluation changes.

June 2026 · benchmark

PiSAR leaderboard published

Opened the benchmark publicly — model, frontier baselines, metric definitions, and an invitation to submit your own model.

May 28, 2026 · model · v1

PiSAR benchmark + first fine-tune

Submitted the PiSAR paper (arXiv:2605.29400). A fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 semantic similarity on the held-out test set — 79% of rows clearing the 0.7 bar, versus 1–2% for frontier zero-shot baselines.
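
For readers who want to see how the two headline numbers relate, here is a minimal sketch. It assumes each test row gets one semantic-similarity score between 0 and 1, that the headline figure is the mean of those scores, and that "clearing the bar" means a score at or above the threshold. The function name `summarize` and the example scores are hypothetical, for illustration only; they are not the benchmark's code or data.

```python
from statistics import mean


def summarize(scores, threshold=0.7):
    """Return the mean score and the share of rows at or above the threshold.

    Assumption: one similarity score per test row, each in [0, 1].
    Assumption: a row "clears the bar" when score >= threshold.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    passed = sum(1 for s in scores if s >= threshold)
    return {
        "mean_similarity": mean(scores),
        "pass_rate": passed / len(scores),
    }


# Illustrative scores only, not results from the benchmark.
example_scores = [0.9, 0.5, 0.7, 0.8]
summary = summarize(example_scores)
print(f"mean similarity: {summary['mean_similarity']:.3f}")
print(f"rows at or above 0.7: {summary['pass_rate']:.0%}")
```

The two numbers answer different questions: the mean can be pulled up by a few very high rows, while the pass rate counts how many rows are individually good enough, which is why the entry reports both.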