Running Gemma 4 on a Mac With MLX — Without Ollama or LLama.cpp


The shortest path for trying Gemma 4 on your Mac.

You have probably already heard of Gemma 4.

So had I.

And like every reasonable local-AI person, I wanted to try it out.

The shortest path is Ollama, which now offers preview support for MLX-powered inference on Apple Silicon.

This pulls and runs the MLX versions of Gemma:

ollama run gemma4:31b-mlx-bf16   # requires ~60GB RAM
# or the smaller:
ollama run gemma4:e4b-mlx-bf16   # requires ~10GB RAM

The article could end here.

But I wanted to try Gemma 4 with fewer wrappers and more control:

Gemma 4 + MLX + uv + a tiny local server.


Why MLX?

MLX is Apple’s machine learning framework for Apple Silicon.

More precisely, Apple describes MLX as an array framework optimized for the unified memory architecture of Apple Silicon, with Python, Swift, C, and C++ bindings.  

Apple Silicon shares CPU and GPU in a unified memory. A Mac with 64 GB or 128 GB unified memory is not the same thing as a traditional laptop with a small discrete GPU. MLX is designed for this exact setup.

llama.cpp is also excellent on Apple Silicon, but it lives in the GGUF/GGML/Metal world. llama.cpp explicitly supports Apple Silicon through ARM NEON, Accelerate, and Metal.  

MLX is different.

It is the native Apple ML framework path. And in my experience it is often faster (roughly 10–30%) than GGUF-based inference.


Which Gemma 4?

Gemma 4 comes in several versions:

| Model | Meaning | Rough use case |
|---|---|---|
| Gemma 4 E2B | small “effective 2B” model | edge/mobile/browser-style use |
| Gemma 4 E4B | small “effective 4B” model | practical local use |
| Gemma 4 31B | dense 31B model | big local workstation model |
| Gemma 4 26B A4B | MoE model, 26B total, about 4B active per token | efficient stronger model |
The “E” in E2B/E4B means effective parameters. Google explains that these smaller models use Per-Layer Embeddings, so their memory use is higher than the effective parameter count alone suggests.  

The “A4B” in 26B A4B means roughly 4B active parameters per token, but all 26B parameters still need to be loaded for routing and inference.  

Google’s approximate inference memory table says:

| Model | BF16 | 8-bit | 4-bit |
|---|---|---|---|
| Gemma 4 E2B | 9.6 GB | 4.6 GB | 3.2 GB |
| Gemma 4 E4B | 15 GB | 7.5 GB | 5 GB |
| Gemma 4 31B | 58.3 GB | 30.4 GB | 17.4 GB |
| Gemma 4 26B A4B | 48 GB | 25 GB | 15.6 GB |

Those are base model estimates. Runtime overhead and KV cache can push real usage higher.  

For this test I used:

mlx-community/gemma-4-31b-it-bf16

This is the large dense BF16 version. It is not polite to your RAM.

In my tiny “Say hello” test, the server reported about 62.8 GB peak memory.

That is not surprising:

31 billion parameters × 2 bytes ≈ 62 GB

BF16 is heavy.


Setup

Create a clean project:

mkdir -p ~/local-ai/gemma4-31b
cd ~/local-ai/gemma4-31b

# Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh
uv init .
uv python pin 3.12
uv add mlx mlx-lm mlx-vlm huggingface_hub pillow ipython

Set the model as a variable:

export MLX_MODEL="mlx-community/gemma-4-31b-it-bf16"

This makes switching models later much easier.


First test: one-shot generation

uv run python -m mlx_vlm generate \
  --model "$MLX_MODEL" \
  --max-tokens 150 \
  --temperature 0.2 \
  --prompt "Explain in simple terms what MLX is and why it is useful on Apple Silicon."

This works. The first run has to download and load the model, which took me somewhere between 40 and 60 minutes, so please be patient!
The command loads the model, answers, and exits, unloading the model again.
The command loads the model, answers, and exits (de-loads it).

For a 31B BF16 model, that is not a workflow. That is like moving a piano for every sentence.

So we start a server.


Run Gemma 4 as a local server

In one terminal:

uv run python -m mlx_vlm server \
  --host 127.0.0.1 \
  --port 8080 \
  --model "$MLX_MODEL"

Keep it running.

Now the model stays loaded.

In another terminal, first make sure jq is installed:

brew install jq

And then define a small helper:

askmlx() {
  local prompt="$*"

  jq -n \
    --arg model "$MLX_MODEL" \
    --arg prompt "$prompt" \
    '{
      model: $model,
      messages: [
        {role: "user", content: $prompt}
      ],
      max_tokens: 300,
      temperature: 0.2,
      stream: false
    }' |
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @- |
  jq -r '
    if .error then
      "ERROR: " + (.error.message // (.error | tostring))
    elif .detail then
      "DETAIL: " + (.detail | tostring)
    elif .choices[0].message.content then
      .choices[0].message.content
    elif .choices[0].text then
      .choices[0].text
    else
      .
    end
  '
}

Now ask:

askmlx "Write me a story about Einstein."

That is the whole trick. Congratulations! You now have Gemma 4 running on your computer and can ask it as many questions as you want.

The server keeps the model warm.
The shell function sends prompts.
Your Mac does the work.

Note that max_tokens is set to 300; increase it if you want longer answers!
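If you prefer Python over zsh, the same helper fits in a few standard-library lines. This is a sketch that assumes the server above is still listening on 127.0.0.1:8080; the fallbacks mirror the jq filter:

```python
import json
import os
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_payload(model: str, prompt: str, max_tokens: int = 300) -> dict:
    """Same request body the shell helper builds with jq."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,
        "stream": False,
    }

def parse_answer(resp: dict) -> str:
    """Mirror the jq fallbacks: error first, then message content, then text."""
    if "error" in resp:
        return "ERROR: " + str(resp["error"])
    choice = (resp.get("choices") or [{}])[0]
    return choice.get("message", {}).get("content") or choice.get("text") or ""

def askmlx(prompt: str) -> str:
    model = os.environ.get("MLX_MODEL", "mlx-community/gemma-4-31b-it-bf16")
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return parse_answer(json.load(r))
```

Then `askmlx("Write me a story about Einstein.")` does the same job as the shell function.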


A more informative helper

The MLX server can return useful stats: tokens, speed, and peak memory.

So I also like this version:

askmlx_info() {
  local prompt="$*"

  jq -n \
    --arg model "$MLX_MODEL" \
    --arg prompt "$prompt" \
    '{
      model: $model,
      messages: [
        {role: "user", content: $prompt}
      ],
      max_tokens: 300,
      temperature: 0.2,
      stream: false
    }' |
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @- |
  jq '{
    model: .model,
    answer: .choices[0].message.content,
    finish_reason: .choices[0].finish_reason,
    usage: {
      input_tokens: .usage.input_tokens,
      output_tokens: .usage.output_tokens,
      total_tokens: .usage.total_tokens,
      prompt_tps: .usage.prompt_tps,
      generation_tps: .usage.generation_tps,
      peak_memory_gb: .usage.peak_memory
    }
  }'
}

Use it like this:

askmlx_info "Say hello in one sentence."

Example result:

{
  "model": "mlx-community/gemma-4-31b-it-bf16",
  "answer": "Hello!",
  "finish_reason": "stop",
  "usage": {
    "input_tokens": 19,
    "output_tokens": 3,
    "total_tokens": 22,
    "prompt_tps": 29.34344239503555,
    "generation_tps": 8.755821998655938,
    "peak_memory_gb": 62.818192432
  }
}

This is useful because local AI is not just about “does it answer?”

It is about:

How much RAM?
How many tokens per second?
How annoying is the model to use repeatedly?

That is the real local-model question.
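The usage block also lets you recover wall-clock numbers. A quick sanity check on the “Say hello” run above:

```python
# Values copied from the example usage block above.
usage = {
    "input_tokens": 19,
    "output_tokens": 3,
    "prompt_tps": 29.34344239503555,
    "generation_tps": 8.755821998655938,
}

# tokens / (tokens per second) = seconds spent in each phase
prompt_s = usage["input_tokens"] / usage["prompt_tps"]
gen_s = usage["output_tokens"] / usage["generation_tps"]
print(f"prompt: {prompt_s:.2f}s, generation: {gen_s:.2f}s")
# -> prompt: 0.65s, generation: 0.34s
```

At roughly 9 generated tokens per second, the big BF16 model is usable but not snappy.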


This Model Can Also Look at Images!

One pleasant surprise: Gemma 4 (like Gemma 3) is not just a text model. It can also "see" images, while many open-source local models are still text-only.

For this, let's define a couple more helper functions in zsh:

askmlx_image() {
  local image_path="$1"
  shift
  local prompt="$*"

  jq -n \
    --arg model "$MLX_MODEL" \
    --arg prompt "$prompt" \
    --arg image_path "$image_path" \
    '{
      model: $model,
      messages: [
        {
          role: "user",
          content: [
            {
              type: "text",
              text: $prompt
            },
            {
              type: "input_image",
              image_url: $image_path
            }
          ]
        }
      ],
      max_tokens: 500,
      temperature: 0.2,
      stream: false
    }' |
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @- |
  jq -r '
    if .error then
      "ERROR: " + (.error.message // (.error | tostring))
    elif .detail then
      "DETAIL: " + (.detail | tostring)
    elif .choices[0].message.content then
      .choices[0].message.content
    elif .choices[0].text then
      .choices[0].text
    else
      .
    end
  '
}

askmlx_image_json() {
  local image_path="$1"
  shift
  local prompt="$*"

  jq -n \
    --arg model "$MLX_MODEL" \
    --arg prompt "$prompt" \
    --arg image_path "$image_path" \
    '{
      model: $model,
      messages: [
        {
          role: "user",
          content: [
            {type: "text", text: $prompt},
            {type: "input_image", image_url: $image_path}
          ]
        }
      ],
      max_tokens: 500,
      temperature: 0.2,
      stream: false
    }' |
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @- |
  jq '{
    answer: .choices[0].message.content,
    finish_reason: .choices[0].finish_reason,
    usage: {
      input_tokens: .usage.input_tokens,
      output_tokens: .usage.output_tokens,
      total_tokens: .usage.total_tokens,
      prompt_tps: .usage.prompt_tps,
      generation_tps: .usage.generation_tps,
      peak_memory_gb: .usage.peak_memory
    }
  }'
}

Now, you can query the model with an image:

askmlx_image ~/Desktop/photo.jpg "Describe this image in detail."

# or with more information:

askmlx_image_json ~/Desktop/photo.jpg "Describe this image in detail."

Switching models

Because we used MLX_MODEL, switching models is simple.

Try a smaller Gemma 4 model:

export MLX_MODEL="mlx-community/gemma-4-e4b-it-bf16"

(Restart the server!)

uv run python -m mlx_vlm server \
  --host 127.0.0.1 \
  --port 8080 \
  --model "$MLX_MODEL"

Or try Gemma 3:

export MLX_MODEL="mlx-community/gemma-3-27b-it-4bit"

Restart the server.

uv run python -m mlx_vlm server \
  --host 127.0.0.1 \
  --port 8080 \
  --model "$MLX_MODEL"

The rest of the script stays the same.

Model experiments become one-line changes.

If you want very different models with a very similar setup, just ask ChatGPT, Claude, Gemini, or your freshly running Gemma 4 for the right setup for the other model, e.g. Qwen-3.6-35b-a3b.

Look for the exact model names here:

https://huggingface.co/mlx-community/collections


The main lesson

Ollama is the easy route.

llama.cpp is the GGUF/Metal route.

MLX is the native Apple Silicon route.

For daily convenience, I may still use Ollama.

But when I want more control, and roughly 10–30% more speed, I use this setup.

Happy tinkering with Gemma 4!