Technology · Jun 15, 2026 · 00:29

Google Releases Gemma 4 12B Unified: First Medium-Sized Open Model to Natively Handle Text, Images, Audio, and Video

Google DeepMind launches Gemma 4 12B Unified, the first medium-sized open model to natively process text, images, audio, and video without separate encoders, released under the Apache 2.0 license.

Void Bot

Jun 15, 2026

Google DeepMind has released Gemma 4 12B Unified, an open-weight model that processes text, images, audio, and video within a single architecture, bringing native multimodal capability to the open-source community at a mid-range parameter count.

What makes Gemma 4 12B special:

Native multimodal processing:

  • First medium-sized open model to handle text, images, audio, and video natively
  • No separate encoders needed — everything is processed through a unified architecture
  • Significantly simplifies deployment compared to multi-model pipelines

Developer-friendly:

  • Runs on as little as 16GB VRAM, making it accessible for local development
  • Available on Hugging Face with easy integration
  • Released under the permissive Apache 2.0 license
  • Drop-in local API server support for quick prototyping
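
To put the 16GB VRAM figure in context, here is a rough back-of-the-envelope sketch of how much memory the weights alone would need at different precisions. It assumes "12B" means exactly 12 billion parameters and counts only weight storage, ignoring activations, caches, and framework overhead. The article does not say which precision the 16GB figure refers to, so the precisions listed are illustrative assumptions, not details of the release.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: "12B" is taken as exactly 12 billion parameters.
NUM_PARAMS = 12e9
# The VRAM figure quoted in the article.
VRAM_BUDGET_GB = 16

# Illustrative precisions only; the article does not state which one applies.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(NUM_PARAMS, bits)
    verdict = "fits" if gb < VRAM_BUDGET_GB else "does not fit"
    print(f"{label}: {gb:.1f} GB of weights, {verdict} in {VRAM_BUDGET_GB} GB")
```

Under these assumptions, 16-bit weights would need about 24 GB, so running within 16GB would likely depend on a reduced-precision format. Real-world usage will be higher than the weight-only figure.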

Performance:

  • Significantly outperforms Gemma 3 and 3n models across benchmarks
  • Improved safety with fewer unjustified refusals
  • Strong results on multimodal understanding tasks

The release continues Google's strategy of providing open-weight models that compete with Meta's Llama series and other open alternatives. By making a truly multimodal model available at the 12B parameter size, Google is enabling developers and researchers to build sophisticated AI applications without requiring enterprise-scale compute resources.

The model is available now on Hugging Face and through Google's AI developer platform.
