Shimmy v1.9.0: Single 4.8MB Binary Runs All GPU Backends for Local LLM Inference
Shimmy v1.9.0 has been released as a "kitchen sink" build: a single Rust binary per platform (Windows/Linux x64 and macOS ARM64) that auto-detects and uses CUDA, Vulkan, OpenCL, or CPU at runtime. The 4.8MB binary is claimed to be 142× smaller than Ollama's 680MB install, with sub-100ms startup. The release also adds MoE CPU offloading, which lets 70B+ Mixture-of-Experts models run within consumer VRAM budgets by splitting expert layers between GPU memory and system RAM. The project is MIT-licensed with an explicit "free forever, never paid" pledge and has reached the Hacker News front page twice.
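Conceptually, the runtime backend selection is a probe-and-pick ladder: try CUDA, fall back to Vulkan, then OpenCL, then CPU. The Rust sketch below illustrates that pattern only; the Backend enum and the probe helpers are hypothetical stand-ins, not Shimmy's actual API.

```rust
// Illustrative sketch of runtime backend selection, not Shimmy's real code.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Backend {
    Cuda,
    Vulkan,
    OpenCl,
    Cpu,
}

/// Probe the machine at startup and pick the best available backend.
/// Each probe is assumed to check for the corresponding driver/loader
/// (e.g. libcuda, the Vulkan loader, an OpenCL ICD) at runtime, so one
/// binary can carry all backends and choose among them on launch.
fn detect_backend() -> Backend {
    if cuda_available() {
        Backend::Cuda
    } else if vulkan_available() {
        Backend::Vulkan
    } else if opencl_available() {
        Backend::OpenCl
    } else {
        Backend::Cpu // always available, no GPU required
    }
}

// Stubbed probes so the sketch compiles and runs; real checks would load
// the driver libraries and enumerate devices.
fn cuda_available() -> bool { false }
fn vulkan_available() -> bool { false }
fn opencl_available() -> bool { false }

fn main() {
    println!("selected backend: {:?}", detect_backend());
}
```

The same ordering idea extends to the MoE offloading feature: expert layers that fit in the VRAM budget stay on the GPU, and the remainder are mapped into system RAM.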
Why It Matters
Shimmy's single-binary approach removes the compilation and backend-selection friction that keeps many mid-level developers from running LLMs locally. Combined with zero-config model auto-discovery across HuggingFace caches, Ollama stores, and local directories, it represents the clearest attempt yet to make local inference as frictionless as pip install.
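The auto-discovery idea is straightforward: scan the places where model weights usually already live and register anything usable. The sketch below assumes the typical default locations (the HuggingFace hub cache, the Ollama model store, a local ./models directory) and a simple GGUF-extension filter; Shimmy's actual discovery logic may differ.

```rust
// Illustrative sketch of zero-config model discovery; paths and the file
// filter are assumptions, not Shimmy's documented behavior.
use std::path::PathBuf;

/// Directories that commonly hold local model weights.
fn candidate_dirs() -> Vec<PathBuf> {
    // Unix-style HOME lookup for brevity; platform handling elided.
    let home = std::env::var("HOME").unwrap_or_else(|_| ".".into());
    vec![
        PathBuf::from(format!("{home}/.cache/huggingface/hub")),
        PathBuf::from(format!("{home}/.ollama/models")),
        PathBuf::from("./models"),
    ]
}

/// Walk each candidate directory (one level deep here for brevity) and
/// collect anything that looks like a GGUF weight file.
fn discover_models() -> Vec<PathBuf> {
    let mut found = Vec::new();
    for dir in candidate_dirs() {
        let Ok(entries) = std::fs::read_dir(&dir) else { continue };
        for entry in entries.flatten() {
            let path = entry.path();
            if path.extension().is_some_and(|ext| ext == "gguf") {
                found.push(path);
            }
        }
    }
    found
}

fn main() {
    for model in discover_models() {
        println!("found model: {}", model.display());
    }
}
```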