Science
Model changelog
A public log of how the model and benchmark evolve. This is the first version; we add an entry each time the eval changes.
June 2026 · benchmark
PiSAR leaderboard published
Opened the benchmark to the public: the model, frontier baselines, metric definitions, and an invitation to submit your own model.
May 28, 2026 · model · v1
PiSAR benchmark + first fine-tune
Submitted the PiSAR paper (arXiv:2605.29400). A fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 semantic similarity on the held-out test set, with 79% of rows clearing the 0.7 bar, versus 1–2% for frontier zero-shot baselines.
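As a rough illustration of how the two headline numbers relate, the sketch below summarizes a list of per-row similarity scores into an average and a pass rate. It rests on assumptions not stated in this entry: that the reported similarity is a mean over rows, and that "clearing the bar" means a score greater than or equal to 0.7. The function name `summarize` and the sample scores are hypothetical and are not benchmark data; how each per-row score is computed is defined by the benchmark's metric definitions, not here.

```python
from statistics import mean


def summarize(scores, bar=0.7):
    """Summarize per-row similarity scores.

    Assumption: a row "clears the bar" when its score is >= bar.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    passed = sum(1 for s in scores if s >= bar)
    return {
        "mean_similarity": mean(scores),
        "pass_rate": passed / len(scores),
    }


# Made-up scores for illustration only, not results from the benchmark.
example_scores = [0.9, 0.8, 0.75, 0.6]
summary = summarize(example_scores)
print(summary)
```

Reporting the pass rate next to the average is useful because an average alone can hide how many individual rows fall below the bar.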