PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking

Gong, Dengxian; Wu, Yuanzheng; Yuan, Haobo; Hu, Zhengdong; Zhang, Tao; Zhou, Yikang; Chen, Shihao; Niu, Quanzhu; Wang, Haochen; Wang, Kai; Qi, Lu; Li, Jason; Ji, Shunping; Yang, Ming-Hsuan

Dengxian Gong¹,* Yuanzheng Wu¹,* Haobo Yuan² Zhengdong Hu³ Tao Zhang¹ Yikang Zhou¹ Shihao Chen¹ Quanzhu Niu¹ Kai Wang⁴ Jason Li⁵ Haochen Wang⁶ Lu Qi¹,† Shunping Ji¹,† Ming-Hsuan Yang²

¹Wuhan University ²UC Merced ³UTS ⁴NUS ⁵NTU ⁶CASIA
*Equal contribution, †Corresponding authors
{gooodx,jishunping}@whu.edu.cn

Search Trajectory Demo

PixelEyes repeatedly proposes a search region, runs SAMTok grounding, computes a crop from the mask, and continues until enough evidence supports an answer.

Question: What is the line of text below "30.480 KGS"?

Round 1: Search

Localized Target: [image: localized target crop]

Prediction: 67.200 LBS (GT: 67.200 LBS)
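The loop described in the Search Trajectory Demo can be sketched as follows. This is a minimal illustration under stated assumptions, not the PixelEyes implementation: `run_search`, `SearchState`, the three callables, and the `max_rounds` cap are hypothetical names, and the stubs at the bottom stand in for the reasoning VLM and for SAMTok grounding.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); hypothetical convention


@dataclass
class SearchState:
    question: str
    evidence: List[Box] = field(default_factory=list)


def run_search(
    question: str,
    propose: Callable[[SearchState], Optional[str]],  # reasoner: next target, or None when done
    ground: Callable[[str], Optional[Box]],           # perception: target phrase -> crop box
    answer: Callable[[SearchState], str],             # reasoner: answer from collected evidence
    max_rounds: int = 8,                              # assumed safety cap, not from the source
) -> Tuple[str, int]:
    """Alternate between proposing a target and grounding it until the reasoner stops."""
    state = SearchState(question)
    rounds = 0
    while rounds < max_rounds:
        target = propose(state)
        if target is None:  # reasoner judges the evidence sufficient
            break
        rounds += 1
        box = ground(target)
        if box is not None:  # keep only crops the grounder actually found
            state.evidence.append(box)
    return answer(state), rounds


# Stub components for illustration only.
def stub_propose(state: SearchState) -> Optional[str]:
    return None if state.evidence else 'text below "30.480 KGS"'


def stub_ground(phrase: str) -> Optional[Box]:
    return (100, 200, 180, 230)


def stub_answer(state: SearchState) -> str:
    return f"{len(state.evidence)} crop(s) collected"


result = run_search("demo question", stub_propose, stub_ground, stub_answer)
```

With these stubs the loop runs one round, stores one crop, and then stops because the reasoner stub returns `None`.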

Existing visual-search agents suffer from inattentional blindness and inefficient search trajectories. PixelEyes decouples perception and reasoning through mask-guided search and semantic-region BFS.

TL;DR

PixelEyes separates what to search from where it is. A strong VLM performs reasoning, while specialist perception localizes visual evidence. This modular design improves both the efficiency and the accuracy of visual evidence seeking in high-resolution, tiny-ROI settings.

Method

Mask-guided Visual Search

PixelEyes asks for semantic targets and delegates dense grounding to SAMTok, turning vague visual search into verified evidence crops.
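One way to turn a grounding mask into an evidence crop is to take the mask's tight bounding box and pad it. The sketch below assumes the mask arrives as a binary grid (nested lists of 0/1), which may differ from what SAMTok actually returns; `mask_to_crop_box` and the `pad` parameter are hypothetical.

```python
from typing import List, Optional, Tuple


def mask_to_crop_box(
    mask: List[List[int]], pad: int = 0
) -> Optional[Tuple[int, int, int, int]]:
    """Return (x0, y0, x1, y1) around the nonzero pixels, with x1/y1 exclusive.

    The box is grown by `pad` pixels on every side and clamped to the image.
    Returns None when the mask is empty, i.e. nothing was grounded.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(width) if any(mask[y][x] for y in range(height))]
    if not rows or not cols:
        return None
    x0 = max(min(cols) - pad, 0)
    y0 = max(min(rows) - pad, 0)
    x1 = min(max(cols) + 1 + pad, width)
    y1 = min(max(rows) + 1 + pad, height)
    return (x0, y0, x1, y1)


example_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
tight_box = mask_to_crop_box(example_mask)
padded_box = mask_to_crop_box(example_mask, pad=1)
```

An empty mask yields `None`, which gives the caller an explicit signal that the target was not found instead of a degenerate crop.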

Semantic-region BFS

The agent explores candidate regions breadth-first to reduce repeated crop loops and preserve broad scene coverage before zooming in.
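The breadth-first idea can be illustrated with a small traversal over a region hierarchy. This is only a sketch: the hierarchy, the region names, and `bfs_region_order` are hypothetical, and the source does not specify how PixelEyes represents candidate regions.

```python
from collections import deque
from typing import Dict, List


def bfs_region_order(children: Dict[str, List[str]], root: str) -> List[str]:
    """Visit regions level by level, never revisiting a region already queued."""
    order: List[str] = []
    seen = {root}
    queue = deque([root])
    while queue:
        region = queue.popleft()
        order.append(region)
        for child in children.get(region, []):
            if child not in seen:  # skip regions reachable through another parent
                seen.add(child)
                queue.append(child)
    return order


# Hypothetical scene: "label" is reachable from both "shelf" and "desk".
regions = {
    "scene": ["shelf", "desk"],
    "shelf": ["label"],
    "desk": ["clock", "label"],
}
visit_order = bfs_region_order(regions, "scene")
```

Both coarse regions (`shelf`, `desk`) are visited before any of their sub-regions, and the `seen` set keeps `label` from being cropped twice.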

Switchable Tool Use

Reasoning remains independent from localization, allowing the model to search globally, inspect candidate regions, or answer from evidence.
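A simple way to picture this separation is a dispatch table: the reasoner emits an action name plus an argument, and the tool behind each name can be swapped without touching the reasoning side. The action names `search_global`, `inspect_region`, and `answer` are invented for this sketch and are not taken from the PixelEyes code.

```python
from typing import Callable, Dict


def make_dispatcher(tools: Dict[str, Callable[[str], str]]) -> Callable[[str, str], str]:
    """Build a function that routes (action, argument) pairs to the registered tool."""

    def dispatch(action: str, argument: str) -> str:
        if action not in tools:
            raise ValueError(f"unknown action: {action}")
        return tools[action](argument)

    return dispatch


# Placeholder tools; a real system would call a grounding model or crop the image here.
tools = {
    "search_global": lambda phrase: f"searched whole image for '{phrase}'",
    "inspect_region": lambda region: f"inspected region '{region}'",
    "answer": lambda text: f"final answer: {text}",
}
dispatch = make_dispatcher(tools)
observation = dispatch("inspect_region", "desk")
```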

Pinpoint-Bench

433 samples

5500 x 3516 average resolution

0.07% average ROI area

Zero-Hint protocol
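To get a feel for these numbers, the following back-of-envelope calculation treats the two averages as if they described a single image. That is an approximation, since the average of per-image ROI ratios need not equal the ratio of the averages.

```python
import math

width, height = 5500, 3516      # average resolution reported above
roi_fraction = 0.07 / 100       # 0.07% average ROI area

image_pixels = width * height
roi_pixels = image_pixels * roi_fraction
roi_side = math.sqrt(roi_pixels)  # side length if the ROI were a square

print(f"image: {image_pixels:,} px, ROI: {roi_pixels:,.1f} px, side: {roi_side:.0f} px")
```

Under this approximation a typical target covers roughly 13.5 thousand pixels, comparable to a square of about 116 pixels per side inside an image of about 19.3 million pixels.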

Pinpoint-Bench Challenging Cases

These trajectories show why the benchmark is difficult: the target evidence is tiny, ambiguous, or visually easy to misread.

[image: Pinpoint-Bench challenging trajectory frame]

Question: The number the hour hand might be pointing to on the clock.

Round 1: Search

Localized Target: [image: localized challenging target crop]

Prediction: 4 (GT: six)

Results

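PixelEyes improves both pinpoint search and evidence-grounded answering across visual benchmarks. Values marked with + in the PixelEyes rows are gains over the Qwen-3-VL base model of the same size.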

Model	Size	V*	HR-4K	HR-8K	VP-H	VP-M	VP-E	Pinpoint Acc.	TAE	LSR	MME-R-L	Tree-Bench
Closed-source Models
Gemini-3-Flash	-	84.82	89.25	85.50	47.17	50.75	67.38	42.26	-	-	60.34	56.54
Open-source Base Models
Qwen-2.5-VL	7B	75.50	68.20	62.70	23.90	26.00	39.10	39.03	-	-	44.37	41.48
Qwen-3-VL	4B	80.10	78.25	72.88	34.91	40.30	56.74	46.19	-	-	44.55	42.71
Qwen-3-VL	8B	86.39	78.88	74.63	51.89	40.67	65.25	49.88	-	-	49.04	46.91
Qwen-3-VL	235B	87.96	84.50	81.62	-	-	-	-	-	-	-	-
Expert Active Agents
Pixel Reasoner	7B	86.30	74.00	66.90	28.80	29.60	58.40	29.79	15.56	46.88	54.32	40.98
Thyme	7B	82.20	77.00	72.00	46.23	43.28	62.41	40.42	-	-	50.13	39.75
DeepEyes	7B	83.30	73.20	69.50	35.10	29.80	60.10	39.72	14.89	20.79	53.53	37.28
Mini-o3	7B	85.34	71.75	67.50	45.28	48.51	63.12	44.34	8.38	78.52	42.26	40.25
Ours
PixelEyes	4B	91.62 +11.52	81.75 +3.50	79.88 +7.00	54.72 +19.81	55.22 +14.92	68.79 +12.05	54.73 +8.54	26.13	76.91	54.51 +9.96	45.93 +3.22
PixelEyes	8B	94.24 +7.85	85.00 +6.12	83.15 +8.52	59.44 +7.55	55.22 +14.55	71.63 +6.38	55.20 +5.32	26.64	74.83	59.25 +10.21	48.40 +1.49
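The + values in the PixelEyes rows can be checked against the same-size Qwen-3-VL rows. The snippet below recomputes them for the 4B models, with the numbers copied from the table; the dictionary layout and the `gains` helper are just for illustration.

```python
qwen3_vl_4b = {
    "V*": 80.10, "HR-4K": 78.25, "HR-8K": 72.88,
    "VP-H": 34.91, "VP-M": 40.30, "VP-E": 56.74,
    "Pinpoint Acc.": 46.19, "MME-R-L": 44.55, "Tree-Bench": 42.71,
}
pixeleyes_4b = {
    "V*": 91.62, "HR-4K": 81.75, "HR-8K": 79.88,
    "VP-H": 54.72, "VP-M": 55.22, "VP-E": 68.79,
    "Pinpoint Acc.": 54.73, "MME-R-L": 54.51, "Tree-Bench": 45.93,
}


def gains(ours: dict, base: dict) -> dict:
    """Per-benchmark improvement of `ours` over `base`, rounded to two decimals."""
    return {name: round(ours[name] - base[name], 2) for name in ours}


gains_4b = gains(pixeleyes_4b, qwen3_vl_4b)
for name, delta in gains_4b.items():
    print(f"{name}: +{delta:.2f}")
```

The recomputed gains agree with the table, for example +11.52 on V* and +19.81 on VP-H.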

Inattentional Blindness Ablation

Model	VP-H	VP-M	VP-E	Pinpoint
Qwen-3-VL	34.91	40.30	56.74	46.19
w/ Mini-o3 SFT	24.52	33.58	38.29	29.56
w/ Our SFT	50.94	52.24	68.09	52.66
w/ Our SFT+RL	54.72	55.22	68.79	54.73

Localization Strategy

Model	Acc.	TAE	LSR
Mask+BFS	52.66	25.31	75.98
Mask+Free	50.58	24.33	66.74
BBox+BFS	48.73	20.43	68.13
BBox+DFS	48.97	23.56	65.13
BBox+Free	48.51	21.54	68.59

Switchable Tool Use

Model	HR-4K	HR-8K
Qwen-3-VL	78.25	72.88
w/o Switchable	80.00	79.50
w/ Switchable	81.75	79.88

SAMTok vs Sa2VA

Model	Acc.	TAE	LSR	ISR
SAMTok	54.73	26.13	76.91	99.17
Sa2VA	46.19	21.41	50.58	65.29

Dataset Creation

PixelEyes SFT Dataset

Pipeline: Gemini -> Tool Rollout -> Filtering -> PixelEyes-6K. The resulting dataset contains 5,800 mask-guided search trajectories that exhibit semantic BFS behavior.
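The page does not state the filtering criteria, so the following is purely an assumed example of what a filtering stage for rollout trajectories could look like: keep a trajectory only if its final prediction matches the ground truth after light normalization and it stays within a round budget. The field names, `filter_trajectories`, and `max_rounds` are all hypothetical.

```python
from typing import Dict, List


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(answer.lower().split())


def filter_trajectories(trajectories: List[Dict], max_rounds: int = 6) -> List[Dict]:
    """Keep trajectories that end in a correct answer within `max_rounds` search rounds.

    Both criteria are assumptions made for this sketch, not the authors' rules.
    """
    kept = []
    for traj in trajectories:
        correct = normalize(traj["prediction"]) == normalize(traj["ground_truth"])
        if correct and 1 <= len(traj["rounds"]) <= max_rounds:
            kept.append(traj)
    return kept


rollouts = [
    {"prediction": "67.200 LBS", "ground_truth": "67.200 lbs", "rounds": ["search"]},
    {"prediction": "4", "ground_truth": "six", "rounds": ["search"]},
    {"prediction": "ok", "ground_truth": "ok", "rounds": []},
]
kept = filter_trajectories(rollouts)
```

Here only the first rollout survives: the second has a wrong answer and the third contains no search round at all.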

BibTeX

@article{pixeleyes2026,
  title={PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking},
  author={Gong, Dengxian and Wu, Yuanzheng and Yuan, Haobo and Hu, Zhengdong and Zhang, Tao and Zhou, Yikang and Chen, Shihao and Niu, Quanzhu and Wang, Haochen and Wang, Kai and Qi, Lu and Li, Jason and Ji, Shunping and Yang, Ming-Hsuan},
  journal={arXiv preprint},
  year={2026}
}