Survey of available implementations for using Gemma 3n on local devices.

In pursuit of making an impactful app that uses the Gemma 3n model, recently published by Google, I have been looking for available implementations for running Gemma 3n on local consumer devices: regular iPhone and Android phones, and maybe macOS and even Windows-based laptops.

Gemma 3n is a multi-modal model that can process text, images, and audio. See the overview: https://ai.google.dev/gemma/docs/gemma-3n

This is the very beginning of the hackathon, so the open-source implementations available today are limited to those that Google pushed to the public just days before the release. The best survey so far was published on Hugging Face, see https://huggingface.co/blog/gemma3n

It says "Gemma 3n fully available in the open-source ecosystem", but the definition of "fully" is blurry: multi-modal context is a complicated topic, and the designs of some ecosystem tools are not even ready to accept such "multi" context to begin with, let alone implement the guts of the new architecture with the sliding cache, offloading, etc.

Let's analyze what is available today across the open-source ecosystem.

Inference with transformers

This is a good and obvious choice, as transformers is the most popular open-source library for inference with LLMs and is basically the reference implementation. The downside is that it's not a component that is easily embeddable in end-user applications across platforms, and CUDA is its only practical GPU backend, so it's not a good fit for Apple Silicon devices.
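For illustration, here is roughly what inference looks like with transformers; a minimal sketch following the pattern from the Hugging Face blog post linked above (the model id, pipeline task, and image URL are my assumptions, so double-check against the blog):

```python
# A minimal sketch, not a verified recipe: multimodal inference with transformers.
# Assumes the "google/gemma-3n-E4B-it" checkpoint and the "image-text-to-text"
# pipeline task, following the Hugging Face gemma3n blog post.
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E4B-it",
    device="cuda",               # the CUDA-only GPU path discussed above
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/some-image.jpg"},  # placeholder URL
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])
```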

Let's remember that Gemma 3n is an AI-edge targeted model, so the goal of the integration is to make it available on the usual phones, not CUDA-based desktop-glued devkits.

Inference with MLX

MLX is a CUDA alternative for Apple Silicon devices. It runs on Metal, so inference is GPU-accelerated across all modern Apple devices: iPhone, iPad, and Macs.

The integration is not trivial, as MLX is a low-level library that requires a lot of manual work to integrate with the end-user application.

The most correct implementation of Gemma 3n with MLX is available in the Blaizzy/mlx-vlm repository.

See PR Blaizzy/mlx-vlm#391 that added it, plus a few follow-up PRs that fix things. MLX-VLM is a great thing: it shows how it's done with MLX on Apple GPUs from Python. When your project runs on Python, like a scientific notebook or a studio like LM Studio https://github.com/lmstudio-ai/mlx-engine/commit/030cc3eca8efefad919718baebe797c4663123cd, the change needed to support the new model is not that big: just bump the mlx-vlm dependency and you are good to go.

The model has to be converted into MLX bfloat16 format, but Gemma 3n is already cooked here https://huggingface.co/collections/mlx-community/gemma-3n-685d6c8d02d7486c7e77a7dc by mlx-community (mlx-vlm handles the conversion too).
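To give a feel for it, here is a sketch of running a converted model with mlx-vlm, based on the load/generate flow from the mlx-vlm README (the exact mlx-community model name is my assumption, so check the collection linked above):

```python
# A minimal sketch assuming mlx-vlm's documented load/generate API and an
# mlx-community conversion named roughly as below; verify against the repo/collection.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/gemma-3n-E2B-it-bf16"  # assumed name, see the collection
model, processor = load(model_path)
config = load_config(model_path)

image = ["cat.png"]  # local path or URL
prompt = "Describe this image."
formatted_prompt = apply_chat_template(processor, config, prompt, num_images=len(image))

# Runs on the Apple GPU via Metal.
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```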

Text-only inference PRs outside of mlx-vlm are currently WIP: ml-explore/mlx-lm#258 and ml-explore/mlx-lm#260, as text inference has historically been handled in a separate repo, ml-explore/mlx-lm.

However, things are more complicated when we're looking at iPhone and Android targets.

MLX Swift

ml-explore/mlx-swift-examples#340 is a PR that adds Gemma 3n support to the MLX Swift examples. That means that once it's finalized, we can run the model on iPhone and iPad easily, and the same goes for M-series Macs.

The challenge is to port the code from Blaizzy/mlx-vlm (visual and audio) and ml-explore/mlx-lm (text only) to Swift and make sure it's not broken in the process, so as of today it's ongoing work.

llama.cpp and Ollama

  • Ollama is a great tool for running LLMs on your local devices. But it's in Go, and the model has to be prepared in GGML/GGUF format, which means it has to be converted from the original format and may lose some quality; additionally, it's text only as of today (see ollama/ollama#11204; the kernels are borrowed from llama.cpp). A minimal text-only sketch follows right after this list.

  • llama.cpp: in addition to MLX, Gemma 3n (text only) works out of the box with llama.cpp, but it has the same limitations: it's text only as of today, and GGML/GGUF quantization only.
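
As referenced in the Ollama bullet above, here is a minimal text-only sketch using the ollama Python client (the gemma3n:e2b tag is my assumption, so check the tag names on ollama.com/library):

```python
# A minimal text-only sketch, assuming a local Ollama server is running and that
# the model tag is "gemma3n:e2b" (verify on ollama.com/library). No image or audio
# inputs here; that is exactly the missing-modality gap mentioned above.
import ollama

response = ollama.chat(
    model="gemma3n:e2b",
    messages=[{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
)
print(response["message"]["content"])
```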

Neither is a good fit for edge devices such as phones, and even on the desktop it's questionable how much a project can benefit from them.

Ollama challenge prize

Given that there is a prize for "the best project that utilizes and showcases the capabilities of Gemma 3n running locally via Ollama", I see this as a bounty for implementing the missing modalities in Ollama and contributing them back to the community.

Google AI Edge

Speaking of bounties, there is a "Google AI Edge implementation of Gemma 3n" bounty:

[Screenshot: Google AI Edge bounty]

The problem is that the implementation in Google AI Edge is not available yet.

That framework uses the LiteRT runtime https://github.com/google-ai-edge/LiteRT, and models have to be converted into its format via https://github.com/google-ai-edge/ai-edge-torch/tree/main/ai_edge_torch/generative/ (from PyTorch to a TFLite model).

The conversion itself is a challenge, as there are very scarce examples of how to do it; the only one that comes close is the Gemma 3 1B conversion to TFLite: https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/gemma3/convert_gemma3_to_tflite.py
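To show what such a conversion script looks like, here is a rough sketch paraphrased from the linked Gemma 3 1B example; the builder function and argument names are my assumptions from memory of that script, and a Gemma 3n builder does not exist in ai-edge-torch yet, which is exactly the gap:

```python
# Rough sketch only: mirrors the linked convert_gemma3_to_tflite.py example.
# gemma3.build_model_1b and the convert_to_tflite arguments are assumptions taken
# from that Gemma 3 1B example; a Gemma 3n equivalent builder does not exist yet.
from ai_edge_torch.generative.examples.gemma3 import gemma3
from ai_edge_torch.generative.utilities import converter

# Build the PyTorch model from a local checkpoint...
pytorch_model = gemma3.build_model_1b("/path/to/gemma-3-1b-it")

# ...then export it to a TFLite flatbuffer that LiteRT can load.
converter.convert_to_tflite(
    pytorch_model,
    output_path="/tmp/gemma3_tflite",
    output_name_prefix="gemma3_1b",
    prefill_seq_len=128,
    quantize=True,
)
```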

The only model that is already converted is a preview version of Gemma 3n that was available months before the release: https://huggingface.co/google/gemma-3n-E2B-it-litert-preview

I find it funny that Google managed to add Gemma 3n to a lot of OSS environments, but not to their own framework. So the $10k prize looks like a bounty for that now.

A bit of context on gemma-3n-E2B-it-litert-preview

Pun intended. The preview model is a multi-modal version as well, but the context window is limited to 4046 tokens, and the demo app is limited to 1 image + 1 text prompt only.

The app is Google AI Edge Gallery https://github.com/google-ai-edge/gallery, and Gemma 3n with the LiteRT preview limitations is available there to download. It runs slowly but just fine on CPU (Snapdragon 8 Elite on a Redmagic 10 Pro). It doesn't run on the GPU (Adreno 830) at all.

So, I am not sure if there is any other demo app available for Gemma 3n with LiteRT. But this example is outdated and doesn't cover the full set of modalities for now.

It's technically possible to convert the existing Gemma 3n distribution from PyTorch to TFLite, then pack it into a .task file and load it into the existing LiteRT Gallery app: https://github.com/google-ai-edge/gallery/wiki/6.-Importing-Local-Models-(optional). Good luck with that lol.

GPU vs NPU

I've mentioned previously that the GPU doesn't currently work for gemma-3n-E2B-it-litert-preview in the app, but to be future-proof, support has to be added for the actual NPU. That one is proprietary and gated by Qualcomm, so to get any access to the SDK one needs to submit a request for the "LiteRT NPU EAP": https://ai.google.dev/edge/litert/next/npu - confidential, yo.

ONNX Runtime

Surprisingly, ONNX Runtime from Microsoft actually supports Gemma 3n from day 0: there are good examples of how to run it via Transformers.js (in a browser!! The true edge!!) and, of course, onnxruntime in Python. Yes, Transformers.js uses ONNX Runtime to run models in the browser. But don't get too excited: "Due to the model's large size, we currently only support Node.js, Deno, and Bun execution." For today, there is no WebGPU support for Gemma 3n via ONNX Runtime.

The best documentation with examples is on the gemma-3n-E2B-it-ONNX Hugging Face page, where the ONNX flavour of the model is posted: https://huggingface.co/onnx-community/gemma-3n-E2B-it-ONNX
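Before wiring up the full multimodal pipeline, it's worth peeking at what the export actually ships; a small sketch (nothing assumed beyond the repo id, which is the page linked above):

```python
# A minimal sketch: list what the onnx-community export ships before committing
# to the full multimodal pipeline, which the model card documents in detail.
from huggingface_hub import list_repo_files

repo = "onnx-community/gemma-3n-E2B-it-ONNX"
files = list_repo_files(repo)
print([f for f in files if f.endswith((".onnx", ".onnx_data"))])
# Expect separate graphs per component (token embeddings, audio and vision
# encoders, decoder) rather than one monolithic model; the Python and
# Transformers.js examples on the model card stitch these together at runtime.
```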

My personal favourite is the official ONNX Runtime for React Native https://www.npmjs.com/package/onnxruntime-react-native that is a wrapper around the onnxruntime native library. This basically enables quick and painless deployment of Gemma 3n on iOS and Android.

I am pretty sure one has to hack a lot inside the actual native wrapper https://github.com/microsoft/onnxruntime/blob/main/js/react_native/ios/OnnxruntimeModule.mm to make it work for Gemma 3n. But the fact that ONNX Runtime for RN is officially supported by Microsoft, with all the bindings already in place, cheers me up.

Unfortunately, the Gemma 3n Impact Challenge has no prize for the best project that utilizes ONNX Runtime for React Native. sob

Conclusion

As of June 28th, 2025, the following table shows the current state of the open-source ecosystem for running Gemma 3n:

| Method | CPU | GPU | Text | Audio | Visual | Mobile | Browser |
| --- | --- | --- | --- | --- | --- | --- | --- |
| transformers (Python) | ✅ | ✅ (CUDA) | ✅ | ✅ | ✅ | ❌ | ❌ |
| MLX-VLM (Python) | ❔ | ✅ (Metal) | ✅ | ✅ | ✅ | ❌ | ❌ |
| MLX Swift (Swift) | ❔ | ✅ (Metal) | ⏳ | ⏳ | ⏳ | ⏳ | ❌ |
| llama.cpp (C++) | ✅ | ✅ (CUDA, Metal) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Ollama (Go) | ✅ | ✅ (CUDA, Metal) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Google AI Edge (LiteRT) | ✅ | ❌ | ⏳ | ⏳ | ⏳ | ✅ (preview) | ❌ |
| ONNX Runtime (Python) | ✅ | ❔ | ✅ | ✅ | ✅ | ❌ | ❌ |
| ONNX Runtime (JS) | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ⏳ |
| ONNX Runtime for RN (React Native) | ❔ | ❔ | ❔ | ❔ | ❔ | ⏳ | ❌ |

✅: Yes, ❌: No, ⏳: Work-in-progress, ❔: Unknown
