Quick answer

AI Summary: Introduces a purely vision-based agent capable of navigating web and desktop interfaces zero-shot, eliminating the reliance on brittle HTML DOM structures.

Zero-Shot Cross-Platform UI Navigation for Autonomous Web Agents

Tatsunori Hashimoto · Percy Liang · Xinyi Wang

ABSTRACT

Current web agents rely heavily on underlying HTML DOM structures, which makes them brittle to website updates and unable to navigate dynamic, canvas-based, or non-web applications. We propose VisionNav, a purely vision-based autonomous agent that navigates user interfaces across web, mobile, and desktop platforms zero-shot. By combining a multimodal LLM with a specialized spatial grounding module, VisionNav translates high-level natural language intents into precise pixel-level actions (clicks, scrolls, typing) without accessing the underlying code. Evaluations on the OmniWeb and DesktopAgent benchmarks show VisionNav achieving state-of-the-art success rates, supporting the viability of the 'Death of the Dashboard' paradigm.
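To make "pixel-level actions" concrete, here is a minimal sketch of one possible final step: turning a grounded bounding box into a click coordinate on a screenshot. The abstract does not describe VisionNav's internals, so everything here is an assumption for illustration: the `Click` type, the `box_to_click` helper, and the normalized (left, top, right, bottom) box format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Click:
    """Hypothetical pixel-level click action (not taken from the paper)."""
    x: int
    y: int


def box_to_click(
    box: Tuple[float, float, float, float], width: int, height: int
) -> Click:
    """Map a normalized bounding box to a click at its center.

    Assumes `box` is (left, top, right, bottom) with each value in [0, 1],
    as a grounding module might emit; this format is an assumption.
    """
    left, top, right, bottom = box
    if not (0.0 <= left <= right <= 1.0 and 0.0 <= top <= bottom <= 1.0):
        raise ValueError(f"invalid normalized box: {box}")
    if width <= 0 or height <= 0:
        raise ValueError("screenshot dimensions must be positive")

    # Center of the box, scaled to screenshot pixels.
    center_x = (left + right) / 2 * width
    center_y = (top + bottom) / 2 * height

    # Clamp so a box touching the right or bottom edge stays on screen.
    return Click(min(int(center_x), width - 1), min(int(center_y), height - 1))


if __name__ == "__main__":
    print(box_to_click((0.25, 0.5, 0.75, 1.0), 1920, 1080))
```

Because the action is expressed only in screen coordinates, the same representation would apply to a web page, a canvas-based app, or a desktop window, which is the property the abstract emphasizes.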

Review Snapshot

Rating: 4.4 out of 5 (5 ratings)
5 star: 40%
4 star: 60%
3 star: 0%
2 star: 0%
1 star: 0%

Recommendation: 100% recommend this content.
