MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
The stark performance gap between current models and humans—especially the fact that even the top proprietary model only hits 40%—highlights how underexplored and underdeveloped multi-image spatial reasoning still is.