Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Jun 03, 2026

3 min read

Gemma 4 12B is designed to bring high-performance multimodal intelligence directly to your laptop, combining mobile-first efficiency with advanced reasoning.

Olivier Lacombe

Director of Product Management, Google DeepMind

Gus Martins

Product Manager, Google DeepMind

Today, we are introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our edge-friendly E4B and our more advanced 26B Mixture of Experts (MoE), Gemma 4 12B packages powerful capabilities inside a reduced memory footprint. It is also our first mid-sized model to feature native audio inputs.

Thanks to the developer community, Gemma 4 models have now crossed 150 million downloads. You’ve built everything from wearable robotic arms for physical assistance to enterprise-grade AI security. We're excited to see what you build with this latest addition.

Here’s an overview of what makes Gemma 4 12B unique:

Together, these features bring advanced multimodal capabilities to everyday hardware without sacrificing speed or reasoning. Let's now take a closer look at how Gemma 4 12B achieves this.

Run state-of-the-art agents locally

Gemma 4 12B delivers performance nearing our larger 26B MoE model on standard benchmarks, but at less than half the total memory footprint. Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.
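As a rough back-of-the-envelope check on what fits in 16GB of RAM, the sketch below estimates the memory needed just to hold a model's weights at a few common precisions. It is an illustration only: the parameter count of 12 billion is an assumption read off the model's name, the helper `weight_memory_gib` is hypothetical, and this post does not say which precision or quantization the model ships with.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights alone, in GiB.

    Ignores activations, the KV cache, and runtime overhead, all of
    which add to the real footprint.
    """
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Assumption: "12B" is read as roughly 12 billion parameters.
PARAMS = 12e9

# Common storage precisions; which one applies here is not stated in the post.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(PARAMS, bits):.1f} GiB")
```

Under these assumptions, 16-bit weights alone would need about 22 GiB, while 8-bit and 4-bit weights would need about 11 GiB and 6 GiB. That suggests some form of reduced-precision weights is what makes a 16GB laptop practical, though the actual footprint depends on details this post does not give.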

Experience a uniquely efficient, unified architecture

What makes Gemma 4 12B stand out is its streamlined approach to processing visual and audio inputs. Traditional multimodal models typically rely on separate encoders to translate images and audio before passing those representations to the language model. Because these split encoders add latency and increase memory usage, we trained Gemma 4 12B with an encoder-free architecture to integrate audio and vision input directly.

Here is how Gemma 4 12B processes multimodal inputs natively:

For developers who want a breakdown, head over to our companion Gemma 4 12B Developer Guide.

See native audio processing in action: Watch Gemma 4 12B transcribe, format, and translate voice inputs entirely offline using the Google AI Edge Eloquent app.

Get started today
