Google Releases Gemma 4 12B: Encoder-Free Multimodal
Google's Gemma 4 12B is an encoder-free multimodal model (text, audio, video, image) that fits in 16GB VRAM and ships under Apache 2.0, with day-0 support in Transformers, llama.cpp, MLX, and Red Hat OpenShift.
Google DeepMind's Gemini Embedding 2 is the first unified multimodal embedding model spanning text, audio, video, and image — live on Gemini API and Vertex AI.
Turing's Open MM-RL: PhD-level STEM benchmark with 100% verifiable answers, trending #1 on Hugging Face. Every prompt double-vetted by PhD specialists. 3,000 more tasks coming.
Luma Uni-1 API: intent-first image generation with built-in prompt enhancement and reference gathering, top-3 in Image Arena, priced at less than half of comparable models.
NVIDIA open-releases Nemotron 3 Nano Omni (30B MoE/3B active): unified video/audio/image/text model with 9× video-reasoning capacity improvement vs. predecessors.
DeepSeek's Visual Primitives paper uses coordinate tokens in chain-of-thought to achieve ~10× KV-cache compression vs. Sonnet 4.6 and Gemini 3 Flash on vision tasks.
Meta releases Tribe v2: a multimodal model of human brain responses to audio, visual, and language inputs, with paper, code, and an interactive mobile demo.
Google DeepMind's AI co-clinician uses live video and audio for real-time clinical support—zero critical errors in 97 of 98 primary care queries.

GPT Image 2 claims a 26-point lead in Image Arena blind tests — unprecedented for the category — by wiring a reasoning loop before every pixel render.
Gemini Embedding 2, Google's first natively multimodal embedding model, reaches GA in the Gemini API and Vertex AI.