Quick answer
AI Summary: Introduces a purely vision-based agent capable of navigating web and desktop interfaces zero-shot, eliminating the reliance on brittle HTML DOM structures.
Current web agents rely heavily on underlying HTML DOM structures, which makes them brittle to website updates and unable to navigate dynamic, canvas-based, or non-web applications. We propose VisionNav, a purely vision-based autonomous agent that performs zero-shot navigation of user interfaces across web, mobile, and desktop platforms. By combining a multimodal LLM with a specialized spatial grounding module, VisionNav translates high-level natural language intents into precise pixel-level actions (clicks, scrolls, typing) without accessing underlying code. Evaluations on the OmniWeb and DesktopAgent benchmarks show VisionNav achieving state-of-the-art success rates, supporting the viability of the 'Death of the Dashboard' paradigm.
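To make "pixel-level actions" concrete, here is a minimal sketch of one step such a pipeline might include: turning a grounded screen region into a click coordinate. It assumes the grounding module returns a normalized (left, top, right, bottom) box, which the abstract does not state; the `Click` type and the `box_to_click` function are hypothetical names for illustration and are not part of VisionNav.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Click:
    """A pixel-level click action (hypothetical type, not from the paper)."""
    x: int
    y: int


def box_to_click(box, width, height):
    """Map a normalized (left, top, right, bottom) box to a click at its center.

    Assumption: coordinates are fractions of the screenshot size in [0, 1].
    """
    left, top, right, bottom = box
    if not (0.0 <= left <= right <= 1.0 and 0.0 <= top <= bottom <= 1.0):
        raise ValueError(f"box out of range: {box}")
    # Center of the box in pixels, clamped to the last valid pixel index.
    x = min(int((left + right) / 2 * width), width - 1)
    y = min(int((top + bottom) / 2 * height), height - 1)
    return Click(x, y)


if __name__ == "__main__":
    # A box covering the lower-middle of a 1920x1080 screenshot.
    print(box_to_click((0.25, 0.5, 0.75, 1.0), 1920, 1080))
```

Working in normalized coordinates keeps the same predicted box usable across web, mobile, and desktop screenshots of different resolutions; only the final conversion depends on the screen size.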