Running Gemma 4 on a Mac With MLX — Without Ollama or llama.cpp
The shortest path for trying Gemma 4 on your Mac.
You have probably already heard of Gemma 4.
So had I.
And like every reasonable local-AI person, I wanted to try it out.
The shortest path is Ollama, which now offers MLX-powered Apple Silicon support in preview.
This installs the MLX versions of Gemma 4:
ollama run gemma4:31b-mlx-bf16 # requires ~60GB RAM
# or the smaller:
ollama run gemma4:e4b-mlx-bf16 # requires ~10GB RAM
The article could end here.
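(If you do try the Ollama route first, the usual Ollama housekeeping commands, nothing specific to the MLX builds, let you see what is loaded and reclaim the disk space afterwards:

ollama ps                      # which models are currently loaded in memory
ollama list                    # which models are downloaded on disk
ollama rm gemma4:31b-mlx-bf16  # remove the ~60GB download again
)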
But I wanted to try Gemma 4 with fewer wrappers and more control:
Gemma 4 + MLX + uv + a tiny local server.
Why MLX?
MLX is Apple’s machine learning framework for Apple Silicon.
More precisely, Apple describes MLX as an array framework optimized for the unified memory architecture of Apple Silicon, with Python, Swift, C, and C++ bindings.
On Apple Silicon, the CPU and GPU share one pool of unified memory. A Mac with 64 GB or 128 GB of unified memory is not the same thing as a traditional laptop with a small discrete GPU. MLX is designed for exactly this setup.
llama.cpp is also excellent on Apple Silicon, but it lives in the GGUF/GGML/Metal world. llama.cpp explicitly supports Apple Silicon through ARM NEON, Accelerate, and Metal.
MLX is different.
It is the native Apple ML framework path, and it is typically slightly faster (~10–30%) than the GGUF route.
Which Gemma 4?
Gemma 4 comes in several versions:
| Model | Meaning | Rough use case |
|---|---|---|
| Gemma 4 E2B | small “effective 2B” model | edge/mobile/browser-style use |
| Gemma 4 E4B | small “effective 4B” model | practical local use |
| Gemma 4 31B | dense 31B model | big local workstation model |
| Gemma 4 26B A4B | MoE model, 26B total, about 4B active per token | efficient stronger model |
The “E” in E2B/E4B means effective parameters. Google explains that these smaller models use Per-Layer Embeddings, so their memory use is higher than the effective parameter count alone suggests.
The “A4B” in 26B A4B means roughly 4B active parameters per token, but all 26B parameters still need to be loaded for routing and inference.
Google’s approximate inference memory table says:
| Model | BF16 | 8-bit | 4-bit |
|---|---|---|---|
| Gemma 4 E2B | 9.6 GB | 4.6 GB | 3.2 GB |
| Gemma 4 E4B | 15 GB | 7.5 GB | 5 GB |
| Gemma 4 31B | 58.3 GB | 30.4 GB | 17.4 GB |
| Gemma 4 26B A4B | 48 GB | 25 GB | 15.6 GB |
Those are base model estimates. Runtime overhead and KV cache can push real usage higher.
For this test I used:
mlx-community/gemma-4-31b-it-bf16
This is the large dense BF16 version. It is not polite to your RAM.
In my tiny “Say hello” test, the server reported about 62.8 GB peak memory.
That is not surprising:
31 billion parameters × 2 bytes ≈ 62 GB
BF16 is heavy.
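Before committing to a 62 GB download, it is worth checking how much unified memory your Mac actually has. A quick sanity check from the terminal (macOS reports hw.memsize in bytes):

# Total unified memory in GB
echo "$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 )) GB"

If that number is not comfortably above the 62 GB estimate, pick one of the smaller or quantized variants from the table above instead.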
Setup
Create a clean project:
mkdir -p ~/local-ai/gemma4-31b
cd ~/local-ai/gemma4-31b
# Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh
uv init .
uv python pin 3.12
uv add mlx mlx-lm mlx-vlm huggingface_hub pillow ipython
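A quick way to confirm that MLX is installed and sees the GPU (mx.default_device() is part of the mlx Python API):

uv run python -c "import mlx.core as mx; print(mx.default_device())"
# On Apple Silicon this should print something like: Device(gpu, 0)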
Set the model as a variable:
export MLX_MODEL="mlx-community/gemma-4-31b-it-bf16"
This makes switching models later much easier.
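One small note: the helper functions later in this article read $MLX_MODEL at call time, so it has to be set in every terminal you use them from. If you want the setting to survive new terminals, you can persist it in your zsh config:

# Persist the model choice for new zsh sessions
echo 'export MLX_MODEL="mlx-community/gemma-4-31b-it-bf16"' >> ~/.zshrc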
First test: one-shot generation
uv run python -m mlx_vlm generate \
--model "$MLX_MODEL" \
--max-tokens 150 \
--temperature 0.2 \
--prompt "Explain in simple terms what MLX is and why it is useful on Apple Silicon."This works. It first time loads the model for ~40 minutes to 60 minutes, so please be patient!
The command loads the model, answers, exits, and unloads it again.
For a 31B BF16 model, that is not a workflow. That is like moving a piano for every sentence.
So we start a server.
Run Gemma 4 as a local server
In one terminal:
uv run python -m mlx_vlm server \
--host 127.0.0.1 \
--port 8080 \
--model "$MLX_MODEL"Keep it running.
Now the model stays loaded.
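From a second terminal you can smoke-test the endpoint with a raw curl call; this is the same OpenAI-style /v1/chat/completions route that the helpers below wrap:

# Minimal request straight against the server (expects $MLX_MODEL to be set in this terminal too)
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MLX_MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}], \"max_tokens\": 10}"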
In another terminal, first make sure jq is installed:
brew install jq
Then define a small helper:
askmlx() {
local prompt="$*"
jq -n \
--arg model "$MLX_MODEL" \
--arg prompt "$prompt" \
'{
model: $model,
messages: [
{role: "user", content: $prompt}
],
max_tokens: 300,
temperature: 0.2,
stream: false
}' |
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d @- |
jq -r '
if .error then
"ERROR: " + (.error.message // (.error | tostring))
elif .detail then
"DETAIL: " + (.detail | tostring)
elif .choices[0].message.content then
.choices[0].message.content
elif .choices[0].text then
.choices[0].text
else
.
end
'
}
Now ask:
askmlx "Write me a story about Einstein."That is the whole trick. Congrats! You have now gemma4 running in your computer and can ask it as many questions as you want.
The server keeps the model warm.
The shell function sends prompts.
Your Mac does the work.
Be aware: we set max_tokens to 300, so you have to increase this number for longer answers!
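If you find yourself editing that 300 often, here is one possible variant that reads the limit from an environment variable instead (MLX_MAX_TOKENS is my own name, not anything MLX defines; error handling is trimmed for brevity):

askmlx_long() {
  local prompt="$*"
  jq -n \
    --arg model "$MLX_MODEL" \
    --arg prompt "$prompt" \
    --argjson max_tokens "${MLX_MAX_TOKENS:-2000}" \
    '{
      model: $model,
      messages: [{role: "user", content: $prompt}],
      max_tokens: $max_tokens,
      temperature: 0.2,
      stream: false
    }' |
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @- |
  jq -r '.choices[0].message.content // .'
}
# Usage: MLX_MAX_TOKENS=1500 askmlx_long "Write a longer story about Einstein."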
A more informative helper
The MLX server can return useful stats: tokens, speed, and peak memory.
So I also like this version:
askmlx_info() {
local prompt="$*"
jq -n \
--arg model "$MLX_MODEL" \
--arg prompt "$prompt" \
'{
model: $model,
messages: [
{role: "user", content: $prompt}
],
max_tokens: 300,
temperature: 0.2,
stream: false
}' |
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d @- |
jq '{
model: .model,
answer: .choices[0].message.content,
finish_reason: .choices[0].finish_reason,
usage: {
input_tokens: .usage.input_tokens,
output_tokens: .usage.output_tokens,
total_tokens: .usage.total_tokens,
prompt_tps: .usage.prompt_tps,
generation_tps: .usage.generation_tps,
peak_memory_gb: .usage.peak_memory
}
}'
}
Use it like this:
askmlx_info "Say hello in one sentence."Example result:
{
"model": "mlx-community/gemma-4-31b-it-bf16",
"answer": "Hello!",
"finish_reason": "stop",
"usage": {
"input_tokens": 19,
"output_tokens": 3,
"total_tokens": 22,
"prompt_tps": 29.34344239503555,
"generation_tps": 8.755821998655938,
"peak_memory_gb": 62.818192432
}
}
This is useful because local AI is not just about “does it answer?”
It is about:
How much RAM?
How many tokens per second?
How annoying is the model to use repeatedly?
That is the real local-model question.
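As a crude way to answer the speed question, you can fire the same prompt a few times through askmlx_info and watch generation_tps:

# Rough throughput check: three identical requests, print tokens/second for each
for i in 1 2 3; do
  askmlx_info "Summarize what unified memory means in one sentence." | jq '.usage.generation_tps'
done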
This Model Can Also Look at Images!
One pleasant surprise is that Gemma 4 (like Gemma 3) is not just a text model. It can also "see" images, while many open-source local models are still text-only.
For this, let's define some more helper functions in zsh:
askmlx_image() {
local image_path="$1"
shift
local prompt="$*"
jq -n \
--arg model "$MLX_MODEL" \
--arg prompt "$prompt" \
--arg image_path "$image_path" \
'{
model: $model,
messages: [
{
role: "user",
content: [
{
type: "text",
text: $prompt
},
{
type: "input_image",
image_url: $image_path
}
]
}
],
max_tokens: 500,
temperature: 0.2,
stream: false
}' |
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d @- |
jq -r '
if .error then
"ERROR: " + (.error.message // (.error | tostring))
elif .detail then
"DETAIL: " + (.detail | tostring)
elif .choices[0].message.content then
.choices[0].message.content
elif .choices[0].text then
.choices[0].text
else
.
end
'
}
askmlx_image_json() {
local image_path="$1"
shift
local prompt="$*"
jq -n \
--arg model "$MLX_MODEL" \
--arg prompt "$prompt" \
--arg image_path "$image_path" \
'{
model: $model,
messages: [
{
role: "user",
content: [
{type: "text", text: $prompt},
{type: "input_image", image_url: $image_path}
]
}
],
max_tokens: 500,
temperature: 0.2,
stream: false
}' |
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d @- |
jq '{
answer: .choices[0].message.content,
finish_reason: .choices[0].finish_reason,
usage: {
input_tokens: .usage.input_tokens,
output_tokens: .usage.output_tokens,
total_tokens: .usage.total_tokens,
prompt_tps: .usage.prompt_tps,
generation_tps: .usage.generation_tps,
peak_memory_gb: .usage.peak_memory
}
}'
}
Now, you can query the model with an image:
askmlx_image ~/Desktop/photo.jpg "Describe this image in detail."
# or with more information:
askmlx_image_json ~/Desktop/photo.jpg "Describe this image in detail."
Switching models
Because we used MLX_MODEL, switching models is simple.
Try a smaller Gemma 4 model:
export MLX_MODEL="mlx-community/gemma-4-e4b-it-bf16"(Restart the server!)
uv run python -m mlx_vlm server \
--host 127.0.0.1 \
--port 8080 \
--model "$MLX_MODEL"Or try Gemma 3:
export MLX_MODEL="mlx-community/gemma-3-27b-it-4bit"Restart the server.
uv run python -m mlx_vlm server \
--host 127.0.0.1 \
--port 8080 \
--model "$MLX_MODEL"The rest of the script stays the same.
Model experiments become one-line changes.
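If you switch a lot, you can fold the export and the restart into one small zsh wrapper (mlxserve is my own name; it just reuses the commands above):

mlxserve() {
  export MLX_MODEL="$1"
  uv run python -m mlx_vlm server \
    --host 127.0.0.1 \
    --port 8080 \
    --model "$MLX_MODEL"
}
# Usage (stop the old server with Ctrl+C first):
# mlxserve mlx-community/gemma-4-e4b-it-bf16
# Remember to update MLX_MODEL in the terminal where you run askmlx as well.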
If you want to run a very different model with a very similar setup, just ask ChatGPT, Claude, Gemini, etc. (or Gemma 4, now that it is running locally) for the right setup for that model, e.g. Qwen-3.6-35b-a3b.
Look for the exact model names here:
https://huggingface.co/mlx-community/collections
The main lesson
Ollama is the easy route.
llama.cpp is the GGUF/Metal route.
MLX is the native Apple Silicon route.
For daily convenience, I may still use Ollama.
But when I want more control, and 10–30% more speed, I would use this setup.
Happy tinkering with Gemma 4!