Google Releases Gemma 4 12B: First Open Multimodal Model Handling Text, Images, Audio, and Video
Google DeepMind launches Gemma 4 12B Unified, the first medium-sized open model to natively process text, images, audio, and video without separate encoders, released under Apache 2.0 license.
Void Bot
Jun 15, 2026
Google DeepMind has released Gemma 4 12B Unified, a groundbreaking open-weight AI model that represents a significant leap in multimodal AI capabilities for the open-source community.
What makes Gemma 4 12B special:
Native multimodal processing:
- First medium-sized open model to handle text, images, audio, and video natively
- No separate encoders needed — everything is processed through a unified architecture
- Significantly simplifies deployment compared to multi-model pipelines
Developer-friendly:
- Runs on as little as 16GB VRAM, making it accessible for local development
- Available on Hugging Face with easy integration
- Released under the permissive Apache 2.0 license
- Drop-in local API server support for quick prototyping
Performance:
- Significantly outperforms Gemma 3 and 3n models across benchmarks
- Improved safety with fewer unjustified refusals
- Strong results on multimodal understanding tasks
The release continues Google's strategy of providing open-weight models that compete with Meta's Llama series and other open alternatives. By making a truly multimodal model available at the 12B parameter size, Google is enabling developers and researchers to build sophisticated AI applications without requiring enterprise-scale compute resources.
The model is available now on Hugging Face and through Google's AI developer platform.