Microsoft Phi-Ground-Any: 4B Vision Model Achieves SOTA for AI GUI Grounding
Microsoft has released Phi-Ground-Any on Hugging Face — a 4-billion-parameter vision model that achieves state-of-the-art results on ScreenSpot-Pro and UI-Vision, the two primary benchmarks for GUI grounding (the ability of AI agents to identify and precisely interact with on-screen interface elements). The model lets AI agents click specific buttons, form fields, and other UI elements without requiring programmatic API access — a key capability for computer-use agents operating on general desktop or web interfaces.
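To make the grounding workflow concrete, here is a minimal sketch of the glue code an agent might use around such a model: the model receives a screenshot plus an instruction and replies with a target location, which the agent then converts to pixel coordinates for a click. The `click(0.50, 0.25)` text format and normalized-coordinate convention are assumptions for illustration, not Phi-Ground-Any's documented output format.

```python
import re

def parse_click(model_output: str, width: int, height: int) -> tuple[int, int]:
    """Extract a normalized (x, y) coordinate from a grounding model's text
    reply and scale it to the screenshot's pixel space.

    NOTE: the '(x, y)' normalized-to-[0, 1] reply format is a hypothetical
    convention for this sketch, not Phi-Ground-Any's actual interface.
    """
    m = re.search(r"\(\s*([01](?:\.\d+)?)\s*,\s*([01](?:\.\d+)?)\s*\)", model_output)
    if m is None:
        raise ValueError(f"no coordinate found in: {model_output!r}")
    x, y = float(m.group(1)), float(m.group(2))
    return round(x * width), round(y * height)

# A 1920x1080 screenshot and a hypothetical model reply for
# "click the Submit button":
print(parse_click("click(0.50, 0.25)", 1920, 1080))  # → (960, 270)
```

The resulting pixel pair would then be handed to whatever input layer the agent controls (an OS automation API, a browser driver, etc.); the grounding model itself only localizes the target.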
Why It Matters
Reaching SOTA GUI grounding with only 4B parameters means the capability is now efficient enough to integrate into broader agent systems without dominating their compute budget. It closes a key capability gap for computer-use agents that need to interact with any software interface, not just those exposing agent-facing APIs.