10 articles

#multimodal

Google Releases Gemma 4 12B: Encoder-Free Multimodal

Google's Gemma 4 12B is an encoder-free multimodal model (text, audio, video, image) that fits in 16GB VRAM and ships under Apache 2.0, with day-0 support in Transformers, llama.cpp, MLX, and Red Hat OpenShift.

June 7, 2026 · 1 min read

Technology · breaking

Google Releases Gemini Embedding 2: One Model for All Modalities

Google DeepMind's Gemini Embedding 2 is the first unified multimodal embedding model spanning text, audio, video, and image — live on Gemini API and Vertex AI.

May 29, 2026 · 1 min read

Research · breaking

Turing's Open MM-RL Hits #1 Trending on HuggingFace with PhD-Level STEM Benchmark

Turing's Open MM-RL is a PhD-level STEM benchmark with 100% verifiable answers, now trending #1 on HuggingFace. Every prompt is double-vetted by PhD specialists, and 3,000 more tasks are coming.

May 14, 2026 · 1 min read

Tools · breaking

Luma Uni-1 Reasoning-First Image Generation API Goes Live

Luma Uni-1 API: intent-first image generation with built-in prompt enhancement and reference gathering, top-3 in Image Arena, priced at less than half of comparable models.

May 6, 2026 · 1 min read

Technology · breaking

NVIDIA Releases Nemotron 3 Nano Omni: Open 30B Multimodal Model

NVIDIA open-releases Nemotron 3 Nano Omni (30B MoE/3B active): unified video/audio/image/text model with 9× video-reasoning capacity improvement vs. predecessors.

May 3, 2026 · 1 min read

Research · breaking

DeepSeek's Visual Primitives Paper Claims 10× KV-Cache Compression

DeepSeek's Visual Primitives paper uses coordinate tokens in chain-of-thought to achieve ~10× KV-cache compression vs. Sonnet 4.6 and Gemini 3 Flash on vision tasks.

May 3, 2026 · 1 min read

Research · breaking

Meta Tribe v2: Foundation Model of Human Brain Responses to Sound, Sight, and Language

Meta releases Tribe v2: a multimodal model of human brain responses to audio, visual, and language inputs, with paper, code, and an interactive mobile demo.

May 1, 2026 · 1 min read

Technology · breaking

Google DeepMind Reveals AI Co-Clinician for Real-Time Clinical Decision Support

Google DeepMind's AI co-clinician uses live video and audio for real-time clinical support—zero critical errors in 97 of 98 primary care queries.

May 1, 2026 · 1 min read

Technology

GPT Image 2 Wins 93% of Blind Tests — Reasoning Joined the Visual Stack

GPT Image 2 claims a 26-point lead in Image Arena blind tests — unprecedented for the category — by wiring a reasoning loop before every pixel render.

April 25, 2026 · 2 min read

Technology · breaking

Gemini Embedding 2 Now Generally Available in Gemini API and Vertex AI

Gemini Embedding 2, Google's first natively multimodal embedding model, reaches GA in the Gemini API and Vertex AI.

April 24, 2026 · 1 min read
