VisRes Bench

On Evaluating the Visual Reasoning Capabilities of VLMs

Brigitta Malagurski Törtei, Yasser Dahou, Ngoc Dung Huynh, Wamiq Reyaz Para, Phúc H. Lê Khac, Ankit Singh, Sofian Chaybouti, Sanath Narayan

arXiv: 2512.21194 · PDF · Hugging Face Dataset

Benchmark overview

| Level | Name | Description |
|---|---|---|
| 1 | Perceptual completion & matching | Perceptual completion and global image matching under perturbations: blur, texture changes, occlusion, rotation. |
| 2 | Rule-based inference | Rule-based inference over a single attribute (e.g., color, count, orientation). |
| 3 | Compositional reasoning | Compositional reasoning integrating multiple visual attributes. |
Real samples from each level. Level 1 (top) involves direct visual completion and matching without explicit rule inference (e.g., patch C correctly continues the ceiling texture, whereas patch D does not), while Levels 2 and 3 (bottom) require increasingly complex rule-based reasoning over perceptual attributes. Accurate perception of individual attributes is necessary but not sufficient for solving compositional tasks, on which current VLMs perform poorly. See Section 4.2.

Benchmark statistics

| Level | Tasks | Total examples |
|---|---|---|
| 1 | global_occlusion_50, global_occlusion_70, global_occlusion_80, edges, location_random_sampling, brightness, blur, rotation, rotation_random_sampling, edges_random_sampling, location | 11,000 |
| 2 | uniform_count, count_progression, uniform_orientation, count_2_same_1_diff, orientation_2same_1diff, uniform_color, count_arithmetic, count_minmax, orientation_3_diff, color_2same_1diff, color_3_diff, count_3_diff | 5,956 |
| 3 | spiral_color_orientation, coupled_color_count, independent_color_object_orientation, coupled_color_orientation, Independent_count_object_color | 2,522 |
| **Total** | | **19,478** |
| Config / task | Level | Examples |
|---|---|---|
| level_1_global_occlusion_50 | 1 | 1,000 |
| level_1_global_occlusion_70 | 1 | 1,000 |
| level_1_global_occlusion_80 | 1 | 1,000 |
| level_1_edges | 1 | 1,000 |
| level_1_location_random_sampling | 1 | 1,000 |
| level_1_brightness | 1 | 1,000 |
| level_1_blur | 1 | 1,000 |
| level_1_rotation | 1 | 1,000 |
| level_1_rotation_random_sampling | 1 | 1,000 |
| level_1_edges_random_sampling | 1 | 1,000 |
| level_1_location | 1 | 1,000 |
| level_2_uniform_count | 2 | 500 |
| level_2_count_progression | 2 | 500 |
| level_2_uniform_orientation | 2 | 458 |
| level_2_count_2_same_1_diff | 2 | 500 |
| level_2_orientation_2same_1diff | 2 | 498 |
| level_2_uniform_color | 2 | 500 |
| level_2_count_arithmetic | 2 | 500 |
| level_2_count_minmax | 2 | 500 |
| level_2_orientation_3_diff | 2 | 500 |
| level_2_color_2same_1diff | 2 | 500 |
| level_2_color_3_diff | 2 | 500 |
| level_2_count_3_diff | 2 | 500 |
| level_3_spiral_color_orientation | 3 | 350 |
| level_3_spiral_color_orientation | 3 | 464 |
| level_3_coupled_color_count | 3 | 500 |
| level_3_independent_color_object_orientation | 3 | 355 |
| level_3_coupled_color_orientation | 3 | 374 |
| level_3_Independent_count_object_color | 3 | 479 |
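The per-task counts above are consistent with the per-level and overall totals; a quick sanity check (counts copied from the statistics tables):

```python
# Verify that per-task example counts sum to the stated per-level totals.
level_counts = {
    1: [1000] * 11,  # 11 Level-1 tasks, 1,000 examples each
    2: [500, 500, 458, 500, 498, 500, 500, 500, 500, 500, 500, 500],
    3: [350, 464, 500, 355, 374, 479],
}
totals = {lvl: sum(counts) for lvl, counts in level_counts.items()}
print(totals)                # {1: 11000, 2: 5956, 3: 2522}
print(sum(totals.values()))  # 19478
```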

Main results

Accuracy (%) across levels and subtasks. Random chance = 25%. Results use guided prompting, with thinking mode enabled when available. Source: Hugging Face dataset card.

| Setting | GPT-5 | GPT-4o | Gemini-2.5 | Qwen3-VL-4B | Qwen3-VL-30B | Mimo-VL-7B |
|---|---|---|---|---|---|---|
| **Level-1** | | | | | | |
| Edges | 27.17 | 23.91 | 25.00 | 16.67 | 25.00 | 22.30 |
| Location | 23.71 | 20.62 | 26.00 | 23.16 | 22.40 | 25.77 |
| Rotation | 35.42 | 26.04 | 34.38 | 37.50 | 36.05 | 29.17 |
| Brightness | 25.26 | 27.37 | 27.37 | 31.52 | 29.47 | 27.37 |
| Blur | 31.18 | 25.26 | 26.32 | 24.73 | 24.28 | 26.32 |
| Global@50% | 42.86 | 20.88 | 57.14 | 37.50 | 47.25 | 48.35 |
| Global@80% | 32.61 | 22.83 | 36.96 | 25.88 | 35.87 | 30.43 |
| Level-1 Average | 31.10 | 23.86 | 33.28 | 28.17 | 31.20 | 29.22 |
| **Level-2** | | | | | | |
| Uniform Color | 96.00 | 21.00 | 97.00 | 66.20 | 88.00 | 78.95 |
| Uniform Count | 61.00 | 25.00 | 90.91 | 40.82 | 59.00 | 52.75 |
| Uniform Orientation | 22.22 | 25.25 | 26.53 | 26.00 | 23.00 | 19.19 |
| Count Progression | 50.00 | 13.00 | 77.00 | 37.20 | 48.00 | 36.96 |
| Count Arithmetic | 52.00 | 22.00 | 75.76 | 43.20 | 49.00 | 33.33 |
| Level-2 Average | 49.79 | 24.12 | 62.29 | 37.18 | 46.75 | 39.15 |
| **Level-3** | | | | | | |
| Independent Color-Object-Orientation | 34.00 | 25.25 | 38.00 | 27.39 | 32.60 | 19.00 |
| Independent Count-Object-Color | 34.00 | 24.00 | 44.00 | 29.45 | 36.34 | 29.00 |
| Coupled Color-Orientation | 24.24 | 24.00 | 16.33 | 26.13 | 29.43 | 20.00 |
| Coupled Color-Count | 30.00 | 22.00 | 21.21 | 27.46 | 33.33 | 28.00 |
| Spiral Color-Count-Object | 56.00 | 30.00 | 54.17 | 28.63 | 36.00 | 33.00 |
| Level-3 Average | 34.39 | 23.86 | 33.73 | 26.31 | 31.36 | 25.17 |

Finetuning on Level-1 (Qwen2.5-VL-3B)

| Setting | Original | Finetuned | Human baseline |
|---|---|---|---|
| Location | 24.3 | 42.8 | 94.1 |
| Blur | 23.9 | 37.5 | 84.3 |
| Brightness | 23.7 | 39.8 | 85.6 |
| Rotation | 25.5 | 50.8 | 92.0 |
| Edges | 25.1 | 33.2 | 82.6 |
| Global (50%) | 24.9 | 52.2 | 96.1 |
| Global (80%) | 23.9 | 38.6 | 98.0 |
| Average | 24.5 | 43.7 | 90.4 |

Single-attribute recognition (perceptual grounding)

Accuracy (%) when models report a single attribute (color, orientation, or count) for one grid cell.

| Attribute | GPT-4o | GPT-5 |
|---|---|---|
| Color | 84.6 | 97.6 |
| Orientation | 39.8 | 49.6 |
| Count | 72.4 | 94.2 |

Impact of thinking mode

Accuracy (%) with thinking enabled (✓) vs disabled (✗). Open-source models improve substantially with thinking.

| Level | GPT-5 (high) | GPT-5 (low) | Mimo-VL ✓ | Mimo-VL ✗ | Qwen3-4B ✓ | Qwen3-4B ✗ | Qwen3-30B ✓ | Qwen3-30B ✗ |
|---|---|---|---|---|---|---|---|---|
| Level-1 | 32.61 | 31.43 | 29.22 | 23.91 | 28.17 | 23.16 | 31.20 | 23.60 |
| Level-2 | 49.79 | 47.01 | 39.15 | 26.68 | 37.18 | 24.08 | 46.75 | 28.25 |
| Level-3 | 34.39 | 32.89 | 25.17 | 25.23 | 26.31 | 23.50 | 31.36 | 24.00 |

Impact of image resolution (GPT-5)

Accuracy (%) at different input resolutions. Accuracy generally improves with higher resolution, though Level-2 plateaus beyond 1024×1024.

| Resolution | Level-1 | Level-2 | Level-3 |
|---|---|---|---|
| 512×512 | 45.17 | 42.83 | 31.63 |
| 1024×1024 | 54.01 | 49.61 | 35.48 |
| 2048×2048 | 56.51 | 48.99 | 40.07 |

How to evaluate your models

VisRes Bench is integrated into lmms-eval, the unified evaluation toolkit for multimodal models. Use it to run reproducible evaluations with the same pipeline as in the paper.

Install and run:

```shell
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval && uv pip install -e ".[all]"

# Run evaluation (example: Qwen2.5-VL on VisRes Bench)
python -m lmms_eval \
  --model qwen2_5_vl \
  --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct \
  --tasks visres_bench \
  --batch_size 1
```

See the lmms-eval repository for supported models, task variants (e.g. by level or config), and full documentation.
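If you need to score predictions outside lmms-eval, the benchmark's four-choice format (random chance = 25%) makes per-level accuracy straightforward to compute. A minimal sketch, assuming simple record dicts with `level`, `prediction`, and `answer` fields (these field names are illustrative, not the dataset's actual schema):

```python
from collections import defaultdict

def accuracy_by_level(records):
    """Compute accuracy (%) per level from prediction records.

    records: iterable of dicts with keys 'level' (int),
    'prediction' and 'answer' (choice letters, e.g. 'A'-'D').
    Field names are assumptions for illustration only.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["level"]] += 1
        correct[r["level"]] += r["prediction"] == r["answer"]
    return {lvl: 100.0 * correct[lvl] / total[lvl] for lvl in total}

# Toy usage with three hand-made records
demo = [
    {"level": 1, "prediction": "C", "answer": "C"},
    {"level": 1, "prediction": "A", "answer": "D"},
    {"level": 2, "prediction": "B", "answer": "B"},
]
print(accuracy_by_level(demo))  # {1: 50.0, 2: 100.0}
```

For paper-comparable numbers, prefer the lmms-eval pipeline above, which handles prompting and answer extraction consistently.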

Citation

BibTeX
```bibtex
@article{visres2025,
  title={VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs},
  author={Malagurski T{\"o}rtei, Brigitta and Dahou, Yasser and Huynh, Ngoc Dung and Para, Wamiq Reyaz and L{\^e} Khac, Ph{\'u}c H. and Singh, Ankit and Chaybouti, Sofian and Narayan, Sanath},
  journal={arXiv preprint arXiv:2512.21194},
  year={2025}
}
```