PixelEyes repeatedly proposes a search region, runs SAMTok grounding, computes a crop from the mask, and continues until enough evidence supports an answer.
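The "crop from the mask" step can be illustrated with a minimal sketch. It assumes the grounding mask arrives as a 2D grid of 0/1 values and that the crop is the mask's padded bounding box; the names `mask_to_crop_box`, `crop`, and `pad` are hypothetical and are not taken from the PixelEyes code.

```python
from typing import List, Optional, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def mask_to_crop_box(mask: List[List[int]], pad: int = 0) -> Optional[Box]:
    """Return the padded bounding box of the nonzero mask pixels, or None if the mask is empty."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    # Rows and columns that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(width) if any(row[x] for row in mask)]
    if not rows or not cols:
        return None
    # Expand by `pad` pixels and clamp to the image bounds.
    left = max(cols[0] - pad, 0)
    top = max(rows[0] - pad, 0)
    right = min(cols[-1] + 1 + pad, width)
    bottom = min(rows[-1] + 1 + pad, height)
    return (left, top, right, bottom)


def crop(image: List[list], box: Box) -> List[list]:
    """Cut the box out of a row-major image (a list of rows)."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]
```

A small padding keeps some context around a tiny target, which may help the reasoning model read the crop; the right amount is a tuning choice that the source does not specify.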
PixelEyes separates deciding what to search for from locating where it is. A strong VLM performs the reasoning, while a specialist perception model localizes the visual evidence. This modular design makes visual evidence seeking both more efficient and more accurate in high-resolution settings where the region of interest (ROI) is tiny.
PixelEyes asks for semantic targets and delegates dense grounding to SAMTok, turning vague visual search into verified evidence crops.
The agent explores candidate regions breadth-first to reduce repeated crop loops and preserve broad scene coverage before zooming in.
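As a rough illustration of breadth-first exploration, the sketch below visits every candidate region at one level before zooming into any sub-region, and uses a `seen` set so that no region is cropped twice. The region hierarchy, the `expand` and `is_evidence` callbacks, and the `max_steps` budget are all assumptions made for this example, not the agent's actual interface.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, List, Optional, Tuple


def bfs_search(
    root: Hashable,
    expand: Callable[[Hashable], Iterable[Hashable]],
    is_evidence: Callable[[Hashable], bool],
    max_steps: int = 32,
) -> Tuple[Optional[Hashable], List[Hashable]]:
    """Visit regions level by level; return the first evidence region and the visit order."""
    queue = deque([root])
    seen = {root}              # regions already queued, to avoid repeated crops
    order: List[Hashable] = []
    while queue and len(order) < max_steps:
        region = queue.popleft()
        order.append(region)
        if is_evidence(region):
            return region, order
        # Zooming in is deferred: children wait behind all regions at the current level.
        for child in expand(region):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return None, order
```

For a toy scene with regions `left` and `right`, each containing one smaller region, the visit order is `scene`, `left`, `right`, and only then the smaller regions, which is the coverage-before-zoom behavior described above.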
Reasoning remains independent from localization, allowing the model to search globally, inspect candidate regions, or answer from evidence.
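One way to picture this separation is a control loop in which the reasoning model only chooses among three actions and never computes coordinates itself. Everything in the sketch is hypothetical: the `Action` type, the `reason` and `ground` callables, and the turn budget are stand-ins for the VLM and the grounding model, not their real APIs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]
Evidence = Tuple[str, Box]


@dataclass
class Action:
    kind: str                  # "search", "inspect", or "answer"
    target: str = ""           # semantic target, used by "search"
    box: Optional[Box] = None  # region to look at, used by "inspect"
    answer: str = ""           # final answer, used by "answer"


def run_agent(
    reason: Callable[[List[Evidence]], Action],
    ground: Callable[[str], Optional[Box]],
    max_turns: int = 8,
) -> Tuple[Optional[str], List[Evidence]]:
    """Alternate reasoning and localization until an answer is given or the budget runs out."""
    evidence: List[Evidence] = []
    for _ in range(max_turns):
        action = reason(evidence)            # reasoning: decide what to do next
        if action.kind == "answer":
            return action.answer, evidence
        if action.kind == "search":
            box = ground(action.target)      # localization: find where the target is
            if box is not None:
                evidence.append((action.target, box))
        elif action.kind == "inspect":
            if action.box is not None:
                evidence.append(("inspect", action.box))
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
    return None, evidence
```

Because `ground` is just a callable, the localization component can be swapped without touching the reasoning side.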
These trajectories show why the benchmark is difficult: the target evidence is tiny, ambiguous, or visually easy to misread.
PixelEyes improves both pinpoint search and evidence-grounded answering across visual benchmarks.
| Model | Size | V* | HR-4K | HR-8K | VP-H | VP-M | VP-E | Pinpoint Acc. | TAE | LSR | MME-R-L | Tree-Bench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-source Models | ||||||||||||
| Gemini-3-Flash | - | 84.82 | 89.25 | 85.50 | 47.17 | 50.75 | 67.38 | 42.26 | - | - | 60.34 | 56.54 |
| Open-source Base Models | ||||||||||||
| Qwen-2.5-VL | 7B | 75.50 | 68.20 | 62.70 | 23.90 | 26.00 | 39.10 | 39.03 | - | - | 44.37 | 41.48 |
| Qwen-3-VL | 4B | 80.10 | 78.25 | 72.88 | 34.91 | 40.30 | 56.74 | 46.19 | - | - | 44.55 | 42.71 |
| Qwen-3-VL | 8B | 86.39 | 78.88 | 74.63 | 51.89 | 40.67 | 65.25 | 49.88 | - | - | 49.04 | 46.91 |
| Qwen-3-VL | 235B | 87.96 | 84.50 | 81.62 | - | - | - | - | - | - | - | - |
| Expert Active Agents | ||||||||||||
| Pixel Reasoner | 7B | 86.30 | 74.00 | 66.90 | 28.80 | 29.60 | 58.40 | 29.79 | 15.56 | 46.88 | 54.32 | 40.98 |
| Thyme | 7B | 82.20 | 77.00 | 72.00 | 46.23 | 43.28 | 62.41 | 40.42 | - | - | 50.13 | 39.75 |
| DeepEyes | 7B | 83.30 | 73.20 | 69.50 | 35.10 | 29.80 | 60.10 | 39.72 | 14.89 | 20.79 | 53.53 | 37.28 |
| Mini-o3 | 7B | 85.34 | 71.75 | 67.50 | 45.28 | 48.51 | 63.12 | 44.34 | 8.38 | 78.52 | 42.26 | 40.25 |
| Ours | ||||||||||||
| PixelEyes | 4B | 91.62 (+11.52) | 81.75 (+3.50) | 79.88 (+7.00) | 54.72 (+19.81) | 55.22 (+14.92) | 68.79 (+12.05) | 54.73 (+8.54) | 26.13 | 76.91 | 54.51 (+9.96) | 45.93 (+3.22) |
| PixelEyes | 8B | 94.24 (+7.85) | 85.00 (+6.12) | 83.15 (+8.52) | 59.44 (+7.55) | 55.22 (+14.55) | 71.63 (+6.38) | 55.20 (+5.32) | 26.64 | 74.83 | 59.25 (+10.21) | 48.40 (+1.49) |
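
The signed values next to each PixelEyes score are consistent with gains over the Qwen-3-VL base model of the same size. The short check below recomputes them for the 4B row from the numbers in the table; the dictionary layout and the `gains` helper are only an illustration.

```python
# Scores copied from the results table (4B rows).
qwen3_vl_4b = {
    "V*": 80.10, "HR-4K": 78.25, "HR-8K": 72.88,
    "VP-H": 34.91, "VP-M": 40.30, "VP-E": 56.74,
    "Pinpoint Acc.": 46.19, "MME-R-L": 44.55, "Tree-Bench": 42.71,
}
pixeleyes_4b = {
    "V*": 91.62, "HR-4K": 81.75, "HR-8K": 79.88,
    "VP-H": 54.72, "VP-M": 55.22, "VP-E": 68.79,
    "Pinpoint Acc.": 54.73, "MME-R-L": 54.51, "Tree-Bench": 45.93,
}
# Gains as printed in the table (4B row).
reported_gain_4b = {
    "V*": 11.52, "HR-4K": 3.50, "HR-8K": 7.00,
    "VP-H": 19.81, "VP-M": 14.92, "VP-E": 12.05,
    "Pinpoint Acc.": 8.54, "MME-R-L": 9.96, "Tree-Bench": 3.22,
}


def gains(ours: dict, base: dict) -> dict:
    """Per-benchmark difference, rounded to the table's two decimals."""
    return {name: round(ours[name] - base[name], 2) for name in ours if name in base}


computed_gain_4b = gains(pixeleyes_4b, qwen3_vl_4b)
for name, value in computed_gain_4b.items():
    print(f"{name}: {value:+.2f}")
```

The same subtraction against the Qwen-3-VL 8B row reproduces the gains shown for PixelEyes 8B.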

| Model | VP-H | VP-M | VP-E | Pinpoint |
|---|---|---|---|---|
| Qwen-3-VL | 34.91 | 40.30 | 56.74 | 46.19 |
| w/ Mini-o3 SFT | 24.52 | 33.58 | 38.29 | 29.56 |
| w/ Our SFT | 50.94 | 52.24 | 68.09 | 52.66 |
| w/ Our SFT+RL | 54.72 | 55.22 | 68.79 | 54.73 |

| Model | Acc. | TAE | LSR |
|---|---|---|---|
| Mask+BFS | 52.66 | 25.31 | 75.98 |
| Mask+Free | 50.58 | 24.33 | 66.74 |
| BBox+BFS | 48.73 | 20.43 | 68.13 |
| BBox+DFS | 48.97 | 23.56 | 65.13 |
| BBox+Free | 48.51 | 21.54 | 68.59 |

| Model | HR-4K | HR-8K |
|---|---|---|
| Qwen-3-VL | 78.25 | 72.88 |
| w/o Switchable | 80.00 | 79.50 |
| w/ Switchable | 81.75 | 79.88 |

| Model | Acc. | TAE | LSR | ISR |
|---|---|---|---|---|
| SAMTok | 54.73 | 26.13 | 76.91 | 99.17 |
| Sa2VA | 46.19 | 21.41 | 50.58 | 65.29 |

Data pipeline: Gemini -> Tool Rollout -> Filtering -> PixelEyes-6K. The resulting dataset contains 5,800 mask-guided search trajectories that exhibit semantic BFS behavior.
@article{pixeleyes2026,
title={PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking},
author={Gong, Dengxian and Wu, Yuanzheng and Yuan, Haobo and Hu, Zhengdong and Zhang, Tao and Zhou, Yikang and Chen, Shihao and Niu, Quanzhu and Wang, Haochen and Wang, Kai and Qi, Lu and Li, Jason and Ji, Shunping and Yang, Ming-Hsuan},
journal={arXiv preprint},
year={2026}
}