Running Gemma 4 on a Mac With MLX — Without Ollama or llama.cpp
The shortest path for trying Gemma 4 on your Mac.
You have probably already heard of Gemma 4.
So had I.
And like every reasonable local-AI person, I wanted to try it out.
The shortest path is Ollama, which now offers MLX-powered Apple Silicon support in preview.
This installs the MLX versions of Gemma 4:
ollama run gemma4:31b-mlx-bf16 # requires ~60GB RAM
# or the smaller:
ollama run gemma4:e4b-mlx-bf16 # requires ~10GB RAM
The article could end here.
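(If you do try the Ollama route first, the usual Ollama housekeeping commands, nothing specific to the MLX builds, let you see what is loaded and reclaim the disk space afterwards:

ollama ps                      # which models are currently loaded in memory
ollama list                    # which models are downloaded on disk
ollama rm gemma4:31b-mlx-bf16  # remove the ~60GB download again
)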
But I wanted to try Gemma 4 with fewer wrappers and more control:
Gemma 4 + MLX + uv + a tiny local server.
Why MLX?
MLX is Apple’s machine learning framework for Apple Silicon.
More precisely, Apple describes MLX as an array framework optimized for the unified memory architecture of Apple Silicon, with Python, Swift, C, and C++ bindings.
On Apple Silicon, the CPU and GPU share one pool of unified memory. A Mac with 64 GB or 128 GB of unified memory is not the same thing as a traditional laptop with a small discrete GPU. MLX is designed for exactly this setup.
llama.cpp is also excellent on Apple Silicon, but it lives in the GGUF/GGML/Metal world. llama.cpp explicitly supports Apple Silicon through ARM NEON, Accelerate, and Metal.
MLX is different.
It is the native Apple ML framework path, and it is typically slightly faster (~10–30%) than the GGUF route.
Which Gemma 4?
Gemma 4 comes in several versions:
| Model | Meaning | Rough use case |
|---|---|---|
| Gemma 4 E2B | small “effective 2B” model | edge/mobile/browser-style use |
| Gemma 4 E4B | small “effective 4B” model | practical local use |
| Gemma 4 31B | dense 31B model | big local workstation model |
| Gemma 4 26B A4B | MoE model, 26B total, about 4B active per token | efficient stronger model |
The “E” in E2B/E4B means effective parameters. Google explains that these smaller models use Per-Layer Embeddings, so their memory use is higher than the effective parameter count alone suggests.
The “A4B” in 26B A4B means roughly 4B active parameters per token, but all 26B parameters still need to be loaded for routing and inference.
Google’s approximate inference memory table says:
| Model | BF16 | 8-bit | 4-bit |
|---|---|---|---|
| Gemma 4 E2B | 9.6 GB | 4.6 GB | 3.2 GB |
| Gemma 4 E4B | 15 GB | 7.5 GB | 5 GB |
| Gemma 4 31B | 58.3 GB | 30.4 GB | 17.4 GB |
| Gemma 4 26B A4B | 48 GB | 25 GB | 15.6 GB |
Those are base model estimates. Runtime overhead and KV cache can push real usage higher.
For this test I used:
mlx-community/gemma-4-31b-it-bf16
This is the large dense BF16 version. It is not polite to your RAM.
In my tiny “Say hello” test, the server reported about 62.8 GB peak memory.
That is not surprising:
31 billion parameters × 2 bytes ≈ 62 GB
BF16 is heavy.
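Before committing to a 62 GB download, it is worth checking how much unified memory your Mac actually has. A quick sanity check from the terminal (macOS reports hw.memsize in bytes):

# Total unified memory in GB
echo "$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 )) GB"

If that number is not comfortably above the 62 GB estimate, pick one of the smaller or quantized variants from the table above instead.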
Setup
Create a clean project:
mkdir -p ~/local-ai/gemma4-31b
cd ~/local-ai/gemma4-31b
# Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh
uv init .
uv python pin 3.12
uv add mlx mlx-lm mlx-vlm huggingface_hub pillow ipython
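A quick way to confirm that MLX is installed and sees the GPU (mx.default_device() is part of the mlx Python API):

uv run python -c "import mlx.core as mx; print(mx.default_device())"
# On Apple Silicon this should print something like: Device(gpu, 0)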
Set the model as a variable:
export MLX_MODEL="mlx-community/gemma-4-31b-it-bf16"
This makes switching models later much easier.
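One small note: the helper functions later in this article read $MLX_MODEL at call time, so it has to be set in every terminal you use them from. If you want the setting to survive new terminals, you can persist it in your zsh config:

# Persist the model choice for new zsh sessions
echo 'export MLX_MODEL="mlx-community/gemma-4-31b-it-bf16"' >> ~/.zshrc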
First test: one-shot generation
uv run python -m mlx_vlm generate \
--model "$MLX_MODEL" \
--max-tokens 150 \
--temperature 0.2 \
--prompt "Explain in simple terms what MLX is and why it is useful on Apple Silicon."This works. It first time loads the model for ~40 minutes to 60 minutes, so please be patient!
The command loads the model, answers, exits, and unloads it again.
For a 31B BF16 model, that is not a workflow. That is like moving a piano for every sentence.
So we start a server.
Run Gemma 4 as a local server
In one terminal:
uv run python -m mlx_vlm server \
--host 127.0.0.1 \
--port 8080 \
--model "$MLX_MODEL"Keep it running.
Now the model stays loaded.
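From a second terminal you can smoke-test the endpoint with a raw curl call; this is the same OpenAI-style /v1/chat/completions route that the helpers below wrap:

# Minimal request straight against the server (expects $MLX_MODEL to be set in this terminal too)
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$MLX_MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}], \"max_tokens\": 10}"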
In another terminal, first make sure jq is installed:
brew install jq
Then define a small helper:
askmlx() {
local prompt="$*"
jq -n \
--arg model "$MLX_MODEL" \
--arg prompt "$prompt" \
'{
model: $model,
messages: [
{role: "user", content: $prompt}
],
max_tokens: 300,
temperature: 0.2,
stream: false
}' |
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d @- |
jq -r '
if .error then
"ERROR: " + (.error.message // (.error | tostring))
elif .detail then
"DETAIL: " + (.detail | tostring)
elif .choices[0].message.content then
.choices[0].message.content
elif .choices[0].text then
.choices[0].text
else
.
end
'
}
Now ask:
askmlx "Write me a story about Einstein."That is the whole trick. Congrats! You have now gemma4 running in your computer and can ask it as many questions as you want.
The server keeps the model warm.
The shell function sends prompts.
Your Mac does the work.
Be aware: we set max_tokens to 300, so you have to increase this number for longer answers!
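If you find yourself editing that 300 often, here is one possible variant that reads the limit from an environment variable instead (MLX_MAX_TOKENS is my own name, not anything MLX defines; error handling is trimmed for brevity):

askmlx_long() {
  local prompt="$*"
  jq -n \
    --arg model "$MLX_MODEL" \
    --arg prompt "$prompt" \
    --argjson max_tokens "${MLX_MAX_TOKENS:-2000}" \
    '{
      model: $model,
      messages: [{role: "user", content: $prompt}],
      max_tokens: $max_tokens,
      temperature: 0.2,
      stream: false
    }' |
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @- |
  jq -r '.choices[0].message.content // .'
}
# Usage: MLX_MAX_TOKENS=1500 askmlx_long "Write a longer story about Einstein."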
A more informative helper
The MLX server can return useful stats: tokens, speed, and peak memory.
So I also like this version:
askmlx_info() {
local prompt="$*"
jq -n \
--arg model "$MLX_MODEL" \
--arg prompt "$prompt" \
'{
model: $model,
messages: [
{role: "user", content: $prompt}
],
max_tokens: 300,
temperature: 0.2,
stream: false
}' |
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d @- |
jq '{
model: .model,
answer: .choices[0].message.content,
finish_reason: .choices[0].finish_reason,
usage: {
input_tokens: .usage.input_tokens,
output_tokens: .usage.output_tokens,
total_tokens: .usage.total_tokens,
prompt_tps: .usage.prompt_tps,
generation_tps: .usage.generation_tps,
peak_memory_gb: .usage.peak_memory
}
}'
}
Use it like this:
askmlx_info "Say hello in one sentence."Example result:
{
"model": "mlx-community/gemma-4-31b-it-bf16",
"answer": "Hello!",
"finish_reason": "stop",
"usage": {
"input_tokens": 19,
"output_tokens": 3,
"total_tokens": 22,
"prompt_tps": 29.34344239503555,
"generation_tps": 8.755821998655938,
"peak_memory_gb": 62.818192432
}
}
This is useful because local AI is not just about “does it answer?”
It is about:
How much RAM?
How many tokens per second?
How annoying is the model to use repeatedly?
That is the real local-model question.
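As a crude way to answer the speed question, you can fire the same prompt a few times through askmlx_info and watch generation_tps:

# Rough throughput check: three identical requests, print tokens/second for each
for i in 1 2 3; do
  askmlx_info "Summarize what unified memory means in one sentence." | jq '.usage.generation_tps'
done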
This Model Can Also Look at Images!
One pleasant surprise is that Gemma 4 (like Gemma 3) is not just a text model. It can also "see" images, while many open-source local models are still text-only.
For this, let's define some more helper functions in zsh:
askmlx_image() {
local image_path="$1"
shift
local prompt="$*"
jq -n \
--arg model "$MLX_MODEL" \
--arg prompt "$prompt" \
--arg image_path "$image_path" \
'{
model: $model,
messages: [
{
role: "user",
content: [
{
type: "text",
text: $prompt
},
{
type: "input_image",
image_url: $image_path
}
]
}
],
max_tokens: 500,
temperature: 0.2,
stream: false
}' |
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d @- |
jq -r '
if .error then
"ERROR: " + (.error.message // (.error | tostring))
elif .detail then
"DETAIL: " + (.detail | tostring)
elif .choices[0].message.content then
.choices[0].message.content
elif .choices[0].text then
.choices[0].text
else
.
end
'
}
askmlx_image_json() {
local image_path="$1"
shift
local prompt="$*"
jq -n \
--arg model "$MLX_MODEL" \
--arg prompt "$prompt" \
--arg image_path "$image_path" \
'{
model: $model,
messages: [
{
role: "user",
content: [
{type: "text", text: $prompt},
{type: "input_image", image_url: $image_path}
]
}
],
max_tokens: 500,
temperature: 0.2,
stream: false
}' |
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d @- |
jq '{
answer: .choices[0].message.content,
finish_reason: .choices[0].finish_reason,
usage: {
input_tokens: .usage.input_tokens,
output_tokens: .usage.output_tokens,
total_tokens: .usage.total_tokens,
prompt_tps: .usage.prompt_tps,
generation_tps: .usage.generation_tps,
peak_memory_gb: .usage.peak_memory
}
}'
}
Now, you can query the model with an image:
askmlx_image ~/Desktop/photo.jpg "Describe this image in detail."
# or with more information:
askmlx_image_json ~/Desktop/photo.jpg "Describe this image in detail."
Switching models
Because we used MLX_MODEL, switching models is simple.
Try a smaller Gemma 4 model:
export MLX_MODEL="mlx-community/gemma-4-e4b-it-bf16"(Restart the server!)
uv run python -m mlx_vlm server \
--host 127.0.0.1 \
--port 8080 \
--model "$MLX_MODEL"Or try Gemma 3:
export MLX_MODEL="mlx-community/gemma-3-27b-it-4bit"Restart the server.
uv run python -m mlx_vlm server \
--host 127.0.0.1 \
--port 8080 \
--model "$MLX_MODEL"The rest of the script stays the same.
Model experiments become one-line changes.
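If you switch a lot, you can fold the export and the restart into one small zsh wrapper (mlxserve is my own name; it just reuses the commands above):

mlxserve() {
  export MLX_MODEL="$1"
  uv run python -m mlx_vlm server \
    --host 127.0.0.1 \
    --port 8080 \
    --model "$MLX_MODEL"
}
# Usage (stop the old server with Ctrl+C first):
# mlxserve mlx-community/gemma-4-e4b-it-bf16
# Remember to update MLX_MODEL in the terminal where you run askmlx as well.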
If you want to run a very different model with a very similar setup, just ask ChatGPT, Claude, Gemini, etc. (or Gemma 4, now that it is running locally) for the right setup for that model, e.g. Qwen-3.6-35b-a3b.
Look for the exact model names here:
https://huggingface.co/mlx-community/collections
The main lesson
Ollama is the easy route.
llama.cpp is the GGUF/Metal route.
MLX is the native Apple Silicon route.
For daily convenience, I may still use Ollama.
But when I want more control, and 10–30% more speed, I would use this setup.
Happy tinkering with Gemma 4!