PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking

Dengxian Gong1,* Yuanzheng Wu1,* Haobo Yuan2 Zhengdong Hu3 Tao Zhang1 Yikang Zhou1 Shihao Chen1 Quanzhu Niu1 Kai Wang4 Jason Li5 Haochen Wang6 Lu Qi1,† Shunping Ji1,† Ming-Hsuan Yang2
1Wuhan University 2UC Merced 3UTS 4NUS 5NTU 6CASIA
*Equal contribution, †Corresponding authors

{gooodx,jishunping}@whu.edu.cn

Search Trajectory Demo

PixelEyes repeatedly proposes a search region, runs SAMTok grounding, computes a crop from the mask, and continues until enough evidence supports an answer.

Example trajectory
Question: What is the line of text below "30.480 KGS"?
Round 1: search, SAMTok contour grounding, localized target crop
Prediction: 67.200 LBS (GT: 67.200 LBS)
PixelEyes teaser figure

Existing visual-search agents suffer from inattentional blindness and inefficient search trajectories. PixelEyes decouples perception and reasoning through mask-guided search and semantic-region BFS.

TL;DR

PixelEyes separates what to search from where it is. A strong VLM performs the reasoning, while a specialist perception model localizes the visual evidence. This modular design improves both the efficiency and the accuracy of visual evidence seeking in high-resolution, tiny-ROI settings.

Method

PixelEyes method pipeline

Mask-guided Visual Search

PixelEyes asks for semantic targets and delegates dense grounding to SAMTok, turning vague visual search into verified evidence crops.
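The step from a grounding mask to an evidence crop can be illustrated with a minimal sketch. It assumes the mask arrives as a 2D grid of 0/1 values and that the crop is the mask's bounding box plus a fixed pixel margin; the function name `mask_to_crop_box` and the `pad` parameter are hypothetical and are not taken from the PixelEyes or SAMTok code.

```python
from typing import List, Optional, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def mask_to_crop_box(mask: List[List[int]], pad: int = 0) -> Optional[Box]:
    """Return the padded bounding box of all nonzero mask pixels.

    Returns None when the mask is empty, i.e. nothing was grounded.
    This is an illustrative assumption about how a crop could be derived,
    not the authors' implementation.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0

    # Rows and columns that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(width) if any(mask[y][x] for y in rows)]

    # Expand by the margin and clamp to the image bounds.
    left = max(min(cols) - pad, 0)
    top = max(min(rows) - pad, 0)
    right = min(max(cols) + 1 + pad, width)
    bottom = min(max(rows) + 1 + pad, height)
    return (left, top, right, bottom)


example_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
tight_box = mask_to_crop_box(example_mask)
padded_box = mask_to_crop_box(example_mask, pad=1)
```

A margin of this kind keeps a little context around the target, which may matter when the evidence is a line of text sitting next to the grounded object.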

Semantic-region BFS

The agent explores candidate regions breadth-first to reduce repeated crop loops and preserve broad scene coverage before zooming in.
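The breadth-first idea can be sketched in a few lines. The example assumes candidate regions form a hierarchy (a scene containing semantic regions, which contain sub-regions) and that some check reports whether a region holds the evidence; the region names, the `children` mapping, and the `has_evidence` callback are all hypothetical stand-ins rather than the paper's actual interface.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple


def bfs_search(
    children: Dict[str, List[str]],
    root: str,
    has_evidence: Callable[[str], bool],
) -> Tuple[Optional[str], List[str]]:
    """Visit regions level by level and stop at the first one with evidence.

    Returns (found_region_or_None, visit_order). The visited set prevents
    the same region from being cropped twice.
    """
    visited = {root}
    queue = deque([root])
    order: List[str] = []

    while queue:
        region = queue.popleft()
        order.append(region)
        if has_evidence(region):
            return region, order
        for child in children.get(region, []):
            if child not in visited:
                visited.add(child)
                queue.append(child)
    return None, order


# Hypothetical region hierarchy for illustration only.
region_tree = {
    "scene": ["shelf", "desk"],
    "shelf": ["label"],
    "desk": ["clock"],
}
found, visit_order = bfs_search(region_tree, "scene", lambda r: r == "clock")
```

Here both top-level regions are examined before either is zoomed into, which is the coverage-before-depth behavior the paragraph above describes.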

Switchable Tool Use

Reasoning remains independent from localization, allowing the model to search globally, inspect candidate regions, or answer from evidence.
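One way to picture this separation is a loop in which the reasoner only chooses an action and separate tools carry it out. The sketch below is an assumption-laden illustration: the action names `search`, `inspect`, and `answer`, the `policy` callback, and the `tools` mapping are hypothetical and do not describe the actual PixelEyes interface.

```python
from typing import Callable, Dict, List, Tuple

# (kind, argument): kind is a tool name or "answer".
Action = Tuple[str, str]


def run_agent(
    policy: Callable[[List[str]], Action],
    tools: Dict[str, Callable[[str], str]],
    max_rounds: int = 8,
) -> str:
    """Alternate between reasoning (policy) and perception (tools).

    The policy sees only the evidence gathered so far and decides whether
    to call a tool or to answer. Returns "unanswered" if the round budget
    runs out.
    """
    evidence: List[str] = []
    for _ in range(max_rounds):
        kind, argument = policy(evidence)
        if kind == "answer":
            return argument
        if kind not in tools:
            raise ValueError(f"unknown tool: {kind}")
        evidence.append(tools[kind](argument))
    return "unanswered"


def scripted_policy(evidence: List[str]) -> Action:
    """Toy stand-in for the VLM: search, then inspect, then answer."""
    if not evidence:
        return ("search", "scale display")
    if len(evidence) == 1:
        return ("inspect", evidence[-1])
    return ("answer", "67.200 LBS")


toy_tools = {
    "search": lambda target: f"region:{target}",
    "inspect": lambda region: f"text near {region}",
}
result = run_agent(scripted_policy, toy_tools)
```

Because the policy never touches pixels directly, the localization tool could be swapped without changing the reasoning loop.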

Pinpoint-Bench

433 samples
5500 x 3516 average resolution
0.07% average ROI area
Zero-Hint protocol
Pinpoint-Bench statistics
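To get a feel for these numbers, the short calculation below converts the average ROI fraction into pixels for an image of the average resolution. It is only a rough illustration: it assumes a square ROI and multiplies two averages together, which need not match any individual sample. The function name `roi_stats` is hypothetical.

```python
import math
from typing import Tuple


def roi_stats(width: int, height: int, roi_percent: float) -> Tuple[int, float, float]:
    """Return (image_pixels, roi_pixels, side_of_equivalent_square_roi)."""
    image_pixels = width * height
    roi_pixels = image_pixels * roi_percent / 100.0
    roi_side = math.sqrt(roi_pixels)
    return image_pixels, roi_pixels, roi_side


image_pixels, roi_pixels, roi_side = roi_stats(5500, 3516, 0.07)
print(f"image: {image_pixels} px, ROI: {roi_pixels:.0f} px, side: {roi_side:.1f} px")
```

Under these assumptions the target occupies on the order of a 116 x 116 pixel patch inside an image of roughly 19 million pixels.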

Pinpoint-Bench Challenging Cases

These trajectories show why the benchmark is difficult: the target evidence is tiny, ambiguous, or visually easy to misread.

Example challenging trajectory
Question: The number the hour hand might be pointing to on the clock.
Round 1: search, SAMTok contour grounding, localized target crop
Prediction: 4 (GT: six)

Results

PixelEyes improves both pinpoint search and evidence-grounded answering across visual benchmarks. Values prefixed with + are gains over the Qwen-3-VL base model of the same size.

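| Model | Size | V* | HR-4K | HR-8K | VP-H | VP-M | VP-E | Pinpoint Acc. | Pinpoint TAE | Pinpoint LSR | MME-R-L | Tree-Bench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-source Models | | | | | | | | | | | | |
| Gemini-3-Flash | - | 84.82 | 89.25 | 85.50 | 47.17 | 50.75 | 67.38 | 42.26 | - | - | 60.34 | 56.54 |
| Open-source Base Models | | | | | | | | | | | | |
| Qwen-2.5-VL | 7B | 75.50 | 68.20 | 62.70 | 23.90 | 26.00 | 39.10 | 39.03 | - | - | 44.37 | 41.48 |
| Qwen-3-VL | 4B | 80.10 | 78.25 | 72.88 | 34.91 | 40.30 | 56.74 | 46.19 | - | - | 44.55 | 42.71 |
| Qwen-3-VL | 8B | 86.39 | 78.88 | 74.63 | 51.89 | 40.67 | 65.25 | 49.88 | - | - | 49.04 | 46.91 |
| Qwen-3-VL | 235B | 87.96 | 84.50 | 81.62 | - | - | - | - | - | - | - | - |
| Expert Active Agents | | | | | | | | | | | | |
| Pixel Reasoner | 7B | 86.30 | 74.00 | 66.90 | 28.80 | 29.60 | 58.40 | 29.79 | 15.56 | 46.88 | 54.32 | 40.98 |
| Thyme | 7B | 82.20 | 77.00 | 72.00 | 46.23 | 43.28 | 62.41 | 40.42 | - | - | 50.13 | 39.75 |
| DeepEyes | 7B | 83.30 | 73.20 | 69.50 | 35.10 | 29.80 | 60.10 | 39.72 | 14.89 | 20.79 | 53.53 | 37.28 |
| Mini-o3 | 7B | 85.34 | 71.75 | 67.50 | 45.28 | 48.51 | 63.12 | 44.34 | 8.38 | 78.52 | 42.26 | 40.25 |
| Ours | | | | | | | | | | | | |
| PixelEyes | 4B | 91.62 (+11.52) | 81.75 (+3.50) | 79.88 (+7.00) | 54.72 (+19.81) | 55.22 (+14.92) | 68.79 (+12.05) | 54.73 (+8.54) | 26.13 | 76.91 | 54.51 (+9.96) | 45.93 (+3.22) |
| PixelEyes | 8B | 94.24 (+7.85) | 85.00 (+6.12) | 83.15 (+8.52) | 59.44 (+7.55) | 55.22 (+14.55) | 71.63 (+6.38) | 55.20 (+5.32) | 26.64 | 74.83 | 59.25 (+10.21) | 48.40 (+1.49) |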

Inattentional Blindness Ablation

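| Model | VP-H | VP-M | VP-E | Pinpoint |
|---|---|---|---|---|
| Qwen-3-VL | 34.91 | 40.30 | 56.74 | 46.19 |
| w/ Mini-o3 SFT | 24.52 | 33.58 | 38.29 | 29.56 |
| w/ Our SFT | 50.94 | 52.24 | 68.09 | 52.66 |
| w/ Our SFT+RL | 54.72 | 55.22 | 68.79 | 54.73 |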

Localization Strategy

| Model | Acc. | TAE | LSR |
|---|---|---|---|
| Mask+BFS | 52.66 | 25.31 | 75.98 |
| Mask+Free | 50.58 | 24.33 | 66.74 |
| BBox+BFS | 48.73 | 20.43 | 68.13 |
| BBox+DFS | 48.97 | 23.56 | 65.13 |
| BBox+Free | 48.51 | 21.54 | 68.59 |

Switchable Tool Use

| Model | HR-4K | HR-8K |
|---|---|---|
| Qwen-3-VL | 78.25 | 72.88 |
| w/o Switchable | 80.00 | 79.50 |
| w/ Switchable | 81.75 | 79.88 |

SAMTok vs Sa2VA

| Model | Acc. | TAE | LSR | ISR |
|---|---|---|---|---|
| SAMTok | 54.73 | 26.13 | 76.91 | 99.17 |
| Sa2VA | 46.19 | 21.41 | 50.58 | 65.29 |

Dataset Creation

PixelEyes dataset creation pipeline

Gemini -> Tool Rollout -> Filtering -> PixelEyes-6K. The data contains 5,800 mask-guided search trajectories with semantic BFS behavior.

BibTeX

@article{pixeleyes2026,
  title={PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking},
  author={Gong, Dengxian and Wu, Yuanzheng and Yuan, Haobo and Hu, Zhengdong and Zhang, Tao and Zhou, Yikang and Chen, Shihao and Niu, Quanzhu and Wang, Kai and Li, Jason and Wang, Haochen and Qi, Lu and Ji, Shunping and Yang, Ming-Hsuan},
  journal={arXiv preprint},
  year={2026}
}