Why o3-mini-high jumped from 0.8% to 4.8% on Vectara’s benchmark, and what it means for document-length evaluations
Which specific questions about o3-mini-high, Vectara benchmark versions, and document length will I answer, and why do they matter?
A quick list of the questions I’ll answer and why each matters to engineers, evaluation teams, and procurement.