Microsoft Phi-Ground-Any: 4B Vision Model Achieves SOTA for AI GUI Grounding
Microsoft has released Phi-Ground-Any on Hugging Face — a 4-billion-parameter vision model that achieves state-of-the-art results on ScreenSpot-Pro and UI-Vision, the two primary benchmarks for GUI grounding (the ability of AI agents to identify and precisely interact with on-screen interface elements). The model lets AI agents click specific buttons, form fields, and other UI elements without requiring programmatic API access — a key capability for computer-use agents operating on general desktop or web interfaces.
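To make the grounding workflow concrete, here is a minimal sketch of the glue code an agent might use around such a model: the model receives a screenshot plus an instruction and replies with a target location, which the agent then converts to pixel coordinates for a click. The `click(0.50, 0.25)` text format and normalized-coordinate convention are assumptions for illustration, not Phi-Ground-Any's documented output format.

```python
import re

def parse_click(model_output: str, width: int, height: int) -> tuple[int, int]:
    """Extract a normalized (x, y) coordinate from a grounding model's text
    reply and scale it to the screenshot's pixel space.

    NOTE: the '(x, y)' normalized-to-[0, 1] reply format is a hypothetical
    convention for this sketch, not Phi-Ground-Any's actual interface.
    """
    m = re.search(r"\(\s*([01](?:\.\d+)?)\s*,\s*([01](?:\.\d+)?)\s*\)", model_output)
    if m is None:
        raise ValueError(f"no coordinate found in: {model_output!r}")
    x, y = float(m.group(1)), float(m.group(2))
    return round(x * width), round(y * height)

# A 1920x1080 screenshot and a hypothetical model reply for
# "click the Submit button":
print(parse_click("click(0.50, 0.25)", 1920, 1080))  # → (960, 270)
```

The resulting pixel pair would then be handed to whatever input layer the agent controls (an OS automation API, a browser driver, etc.); the grounding model itself only localizes the target.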
Why It Matters
Reaching SOTA GUI grounding with only 4B parameters means the capability is now efficient enough to integrate into broader agent systems without dominating their compute budget. It closes a key capability gap for computer-use agents that need to interact with any software interface, not just those exposing agent-facing APIs.