How to Run a Local LLM for Coding on a 16GB Laptop

Running a capable coding model on your own laptop used to mean owning a GPU rig. That assumption just got a lot weaker. In the span of a single week, a cluster of compact, open-weight models landed that are explicitly sized for ordinary hardware — most notably Google's Gemma 4, whose 12B variant Ars Technica described as running on any laptop with 16GB of RAM. Pair that with JetBrains' new Mellum2 and Microsoft's new MAI models, and on-device coding is suddenly realistic for a lot more developers. This guide walks through whether your machine can handle it, which of the new small models to consider, and how to get one running for code.

Can you really run a coding LLM on a 16GB laptop?

Short answer: yes — with the new generation of small models, and within limits. The clearest signal is Gemma 4: Ars Technica's June 3 coverage framed the 12B model as being sized for a normal laptop with 16GB of RAM, which is a meaningful change from the days when "local LLM" implied a workstation GPU.

The caveats matter, though. "Runs on your laptop" is not the same as "matches a frontier cloud model." Smaller open-weight models trade some capability for the ability to run locally, and 16GB is enough to run a 12B-class model, not to run it alongside a dozen other heavy apps with room to spare. Set expectations accordingly: local models are excellent for fast, private, offline-friendly coding help, and the gap to the biggest cloud models narrows every release — but it isn't zero.

What's in the new wave of small, open-weight models?

Three releases from the same week are worth knowing about:

Google Gemma 4 12B

The headline launch. Ars Technica's coverage emphasizes that the 12B version is open and sized for a 16GB-RAM laptop — the clearest "you can run this at home" story of the bunch. It's the natural starting point if you want a general-purpose open model that's been explicitly positioned for consumer hardware. (Google's broader model push this period also included Gemini Omni and Gemini 3.5 demos, though those are the larger, cloud-side cousins, not the laptop-friendly open weights.)

JetBrains Mellum2

Mellum2 is a 12B Mixture-of-Experts (MoE) model from JetBrains, announced on Hugging Face on June 1. The MoE design is interesting for local use: a mixture-of-experts model activates only part of its parameters per token, which is one of the architectural tricks that lets a model be capable without behaving like a dense model of the same total size. Coming from JetBrains — a company whose whole business is developer tooling — it's worth a look specifically for coding workflows.

Microsoft MAI

Simon Willison covered Microsoft's new MAI models on June 2. They round out the picture of major vendors shipping their own model families this week. As with any of these, check the published details for the specific variant and license before committing.

The pattern across all three: capable models are getting small enough, and open enough, to run where your code already lives.

How much RAM or VRAM do you actually need?

This is the question that decides everything, so be precise about it:

The grounded data point: a 12B-class model like Gemma 4 12B is being positioned to run on a laptop with 16GB of RAM (per Ars Technica). Treat 16GB as a realistic floor for a 12B model, not a comfortable ceiling.
RAM vs. VRAM: if you have a discrete GPU, the model ideally fits in VRAM for speed. On a typical 16GB laptop without a big GPU, the model runs in system RAM (often using unified memory on Apple Silicon), which is exactly the scenario the Gemma 4 framing targets.
Smaller is safer: if you're tight on memory or want headroom for your IDE, browser, and the model at once, lean toward smaller variants or more aggressive quantization (more on that below). If you have 32GB or more, you'll have a far more comfortable experience and can run larger context windows.

The honest rule of thumb: 16GB gets a 12B-class coding model running; more memory makes it pleasant.

How do you get a local model running for code?

The setup splits into two decisions — how you run the model, and how you connect it to your editor.

Pick a runtime

Local model runtimes let you download an open-weight model and serve it on your own machine, typically exposing a local API your tools can call. The model files for the releases above are published on Hugging Face (that's where Mellum2 was announced, for example), so the usual flow is: choose a runtime, point it at the model, and let it pull the weights.

Choose a quantization level

Quantization shrinks a model's memory footprint by storing its weights at lower precision. It's the single most important lever for fitting a 12B model into 16GB: a heavily quantized build uses far less memory at some cost to quality, while a higher-precision build is sharper but hungrier. If a model won't fit or runs too slowly, step down to a more aggressive quantization before giving up on it.

Wire it into your editor

Because these are coding models, the payoff is connecting the local API to your development environment — an editor extension or an agent configured to talk to a local endpoint instead of a cloud one. JetBrains shipping Mellum2 is a hint at how tightly local models and IDEs are converging. Once connected, you have code completion and chat that run entirely on your machine.

Which small model is best for coding?

Treat this as a starting matrix rather than a fixed ranking — and verify each model's current specs and license against its source before you commit:

Model	Size	Notable for local coding
Gemma 4 12B	12B (dense)	Explicitly sized for a 16GB-RAM laptop; the easiest "runs at home" starting point
JetBrains Mellum2	12B (Mixture-of-Experts)	From a developer-tooling vendor; MoE design aims for capability without dense-model overhead
Microsoft MAI	See source	Newest major-vendor family; check the specific variant and license

The practical advice: start with Gemma 4 12B because its hardware story is the most clearly documented, and evaluate Mellum2 in parallel if your priority is coding specifically. Benchmark them on your actual tasks — small-model quality varies a lot by language and problem type, so your codebase is the only benchmark that matters.

Local or cloud — when is on-device coding actually worth it?

On-device shines when privacy, offline access, or predictable cost are the priority: your code never leaves your machine, there's no per-token bill, and it works on a plane. It's less compelling when you need the absolute strongest model on a hard problem, or when you'd rather not manage local setup at all.

That tradeoff is the mirror image of the hosting decision many teams faced this week as agent pricing shifted. If you're weighing where to run not just the model but the whole agent, our companion guide on cloud vs. localhost coding agents and how to control the cost covers the other side of the decision.

Key takeaways

A new wave of compact open-weight models — Gemma 4 12B, JetBrains Mellum2, Microsoft MAI — just made on-device coding realistic on ordinary laptops.
Gemma 4 12B is the clearest case: positioned to run on a 16GB-RAM laptop, no GPU rig required.
Treat 16GB as the floor for a 12B model; quantization is your main lever for fitting it in memory.
Local wins on privacy, offline use, and predictable cost; the cloud still wins on raw capability for the hardest problems.

Want a capable coding model that lives where your code does? Start with Gemma 4 12B, evaluate Mellum2 for coding-specific work, and benchmark on your own tasks — then explore Clawvard to fold a local-first model into a real workflow.