Qwen 3.5 Just Dropped: Here's What You Need to Run It Locally (With Real Benchmarks)
Source: qwen.ai
Alibaba released the Qwen 3.5 Medium open-weight models on February 24, 2026. The 35B-A3B MoE model hits 111 tokens/sec on an RTX 3090 and handles 1M+ context on 32GB VRAM. Here's a full hardware guide with real tok/sec numbers.
Qwen 3.5 Medium Models Are Here
Four days ago, on February 24, Alibaba’s Qwen team released the open-weight Qwen 3.5 Medium series: three models you can download right now and run on your own hardware. The flagship 397B-A17B had already launched on February 15 via API, but the Medium release is the one that matters for local inference.
The lineup: Qwen3.5-35B-A3B (35 billion parameters, 3 billion active per token), Qwen3.5-122B-A10B (122 billion, 10 billion active), and Qwen3.5-27B (27 billion, dense). All Apache 2.0 licensed. All available on HuggingFace right now.
The 35B-A3B is the headliner. It gets over 100 tokens per second on an RTX 3090, handles a million-token context window on a 32GB GPU, and outperforms the previous-generation Qwen3-235B-A22B, which needed 22 billion active parameters to match what this model does with 3 billion. That’s a 7x reduction in compute for equivalent quality.
The models also introduce a new hybrid attention architecture (Gated Delta Networks combined with traditional attention), native multimodal support for text/image/video, built-in agentic tool calling, and coverage of 201 languages (up from 82 in Qwen 3).
Sources: Qwen 3.5 Official Announcement, Qwen 3.5 on GitHub, VentureBeat Coverage, MarkTechPost Coverage
How We Got Here: The Qwen Timeline
Alibaba’s Qwen team has shipped a major release roughly every two months since spring 2025. If you’re just tuning in, here’s the context:
April 2025: Qwen 3 launched with eight models, from 0.6B to 235B parameters. This was the release that introduced Mixture-of-Experts to the Qwen family: a 30B model with only 3B active parameters, and a 235B model with 22B active. All Apache 2.0. Trained on 36 trillion tokens across 119 languages.
July 2025: Qwen3-Coder brought coding-specific models. The flagship, Qwen3-Coder-480B-A35B, matched Claude Sonnet 4 on agentic coding benchmarks. The lightweight Coder-Next (80B total, 3B active) runs on a single 24GB GPU for everyday coding work.
September 2025: Qwen3-Next and Qwen3-Omni followed with improved general capabilities and multimodal support.
February 2026: Qwen 3.5 is the latest. The 397B-A17B flagship hit APIs on Feb 15, and the three open-weight Medium models (27B, 35B-A3B, 122B-A10B) dropped on Feb 24.
Sources: Qwen 3 Official Blog, CNBC: Alibaba Unveils Qwen3.5
Why MoE Changes Everything for Local Inference
The thing that makes Qwen practical on consumer hardware is Mixture-of-Experts. Here’s the idea: a 35B-parameter model doesn’t activate all 35 billion parameters for every token. It routes each token through a small subset of “expert” networks, activating only 3 billion parameters at a time. You get the knowledge of a 35B model with the compute cost of a 3B model.
In practice, this means:
- Faster generation. Fewer active parameters means fewer floating-point operations per token. The Qwen3.5-35B-A3B generates tokens 3-5x faster than the dense 27B model on identical hardware.
- Lower VRAM needs. At Q4 quantization, the 35B MoE model uses just 19GB of VRAM at short context, and the VRAM scales gracefully with context length (22GB at 131K context, 25GB at 262K).
- Practical long context. The 35B-A3B model can push past one million tokens on a 32GB GPU. The dense 27B model hits VRAM limits at 262K context with 33GB.
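The routing idea is easy to see in a toy implementation. The sketch below is a minimal top-k MoE layer in NumPy, purely illustrative: the expert count, dimensions, and naive per-token dispatch loop are assumptions for clarity, and real MoE layers (including Qwen's) add shared experts, load balancing, and fused kernels.

```python
import numpy as np

def moe_layer(x, experts_w, router_w, top_k=2):
    """Toy Mixture-of-Experts layer: route each token to its top-k experts.

    x: (tokens, d) activations; experts_w: (n_experts, d, d) expert weights;
    router_w: (d, n_experts) router projection. Illustrative only.
    """
    logits = x @ router_w                           # (tokens, n_experts) routing scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # indices of top-k experts per token
    sel = np.take_along_axis(logits, top, axis=-1)  # scores of the selected experts
    gates = np.exp(sel - sel.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)           # softmax over selected experts only
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                     # naive per-token dispatch loop
        for k in range(top_k):
            e = top[t, k]
            out[t] += gates[t, k] * (x[t] @ experts_w[e])
    return out

rng = np.random.default_rng(0)
tokens, d, n_experts = 4, 8, 16
y = moe_layer(rng.normal(size=(tokens, d)),
              rng.normal(size=(n_experts, d, d)),
              rng.normal(size=(d, n_experts)))
print(y.shape)
```

The key property: only `top_k / n_experts` of the expert weights are multiplied per token (2/16 here), which is exactly why a 35B MoE model can have the per-token compute of a much smaller dense model.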
The Qwen team's published benchmarks back this up. Qwen3-30B-A3B outperforms QwQ-32B (a dense 32B reasoning model) despite using roughly 10x fewer active parameters. And Qwen3-4B matches the performance of Qwen2.5-72B-Instruct. Each generation roughly halves the parameters needed for equivalent quality.
Source: Qwen3 Technical Report (arXiv)
Hardware Benchmarks: What Real People Are Getting
These numbers come from published benchmarks and community reports, not marketing claims.
NVIDIA GPUs
Qwen3.5-35B-A3B (Q4 quantization, the model most people should run):
| GPU | Context | Tokens/sec |
|---|---|---|
| RTX 3090 (24GB) | 4K | 111 t/s |
| RTX 3090 (24GB) | 131K | 79 t/s |
| RTX 5090 (32GB) | 4K | 165 t/s |
| RTX 5090 (32GB) | 262K | 97 t/s |
Qwen3-Coder-Next-80B-A3B (Q4, for coding specifically):
| GPU | Context | Tokens/sec |
|---|---|---|
| RTX PRO 6000 (48GB) | 4K | 86 t/s |
| RTX PRO 6000 (48GB) | 256K | 61 t/s |
| Dual RTX 3090 (Q3) | 32K | 71 t/s |
| Dual RTX 3090 (Q3) | 256K | 48 t/s |
Qwen3-32B Dense (Q4_K_M, for when you want a dense model):
| GPU | Tokens/sec |
|---|---|
| RTX 4090 (24GB) | 25-35 t/s |
| RTX 5090 (32GB) | 40-50 t/s |
Sources: Hardware Corner Qwen3.5 Benchmarks, Hardware Corner Qwen3-Coder-Next, Bored Consultant Consumer GPU Benchmarks
Apple Silicon
Apple’s unified memory architecture is a huge advantage for running LLMs. You don’t need a discrete GPU; the model loads into shared memory and runs on the GPU cores. MLX, Apple’s machine learning framework, is the fastest option here (roughly 50% faster than Ollama on the same hardware).
| Model | Mac | Quantization | Tokens/sec |
|---|---|---|---|
| Qwen3-30B-A3B | M4 Max 64GB | 4-bit MLX | ~100+ |
| Qwen3-30B-A3B | M3 Pro 36GB | 4-bit MLX | 60-80 |
| Qwen3-30B-A3B | M2 Max 64GB | 4-bit MLX | 80-90 |
| Qwen3-32B Dense | M4 Max 64GB | Q4_K_M | ~15 |
| Qwen3-235B-A22B | M3 Ultra 512GB | 4-bit MLX | Usable (specific numbers not published) |
| Qwen3-0.6B | M4 Max | 4-bit MLX | ~525 |
Federico Viticci at MacStories tested Qwen3-235B-A22B on a Mac Studio with M3 Ultra and 512GB RAM. The machine ran the full 235B-parameter MoE model in 4-bit with barely audible fan noise. You need a lot of memory for a model that size, but the fact that it runs at all on a desktop machine is remarkable.
Sources: MacStories Mac Studio AI Benchmarks, Apple MLX Research
AMD APUs
AMD’s Strix Halo platform (Ryzen AI Max+ 395) with 64GB of shared memory is another interesting option for local inference:
| Model | Quantization | Context | Tokens/sec |
|---|---|---|---|
| Qwen3.5-35B-A3B | Q8 | 4K | 38.5 t/s |
| Qwen3.5-35B-A3B | Q8 | 131K | 27 t/s |
| Qwen3-Coder-Next-80B-A3B | Q4 | 32K | ~37 t/s |
Source: Hardware Corner
VRAM Requirements at a Glance
Planning your hardware purchase? Here’s what the Qwen3.5 models actually need at Q4 quantization:
| Context Length | 27B Dense | 35B-A3B MoE |
|---|---|---|
| 4K tokens | 16 GB | 19 GB |
| 131K tokens | 24 GB | 22 GB |
| 262K tokens | 33 GB | 25 GB |
The MoE model’s VRAM advantage grows with context length. At 262K context, the 35B MoE needs 8GB less than the 27B dense model, while also generating tokens 3-5x faster. This is why MoE is so well-suited to local inference.
Which Quantization Should You Use?
Quantization compresses model weights to use fewer bits per parameter. Fewer bits mean less memory and faster inference, at the cost of some quality.
Q4_K_M is the community default. It’s the best balance of quality, speed, and memory usage. If you’re unsure, start here.
Q3_K is for squeezing larger models into limited VRAM. You’ll see some quality degradation, but it’s acceptable for most tasks. People use this to run 80B MoE models on dual RTX 3090s.
Q6_K_L gives near-perfect quality when you have VRAM to spare.
Q8_0 is maximum quality quantization, roughly double the memory of Q4_K_M. Worth it on high-memory systems like Apple Silicon Macs or AMD APUs with 64GB+.
4-bit MLX is the standard for Apple Silicon users. The MLX framework handles quantization natively and is optimized for Apple’s GPU architecture.
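A quick way to compare these options is by on-disk size: a quantized model file is roughly parameter count times effective bits per weight. The bits-per-weight figures below are ballpark community values, not exact, since K-quants are mixed-precision and keep some tensors at higher bit depths.

```python
# Approximate effective bits per weight for common GGUF quant types
# (ballpark figures; mixed-precision K-quants vary by model).
BITS_PER_WEIGHT = {
    "Q3_K": 3.4,
    "Q4_K_M": 4.8,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def model_file_gb(params_billion, quant):
    """Rough on-disk size of a quantized model in GB."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"35B @ {quant}: ~{model_file_gb(35, quant):.0f} GB")
```

This is why Q8_0 is described as roughly double the memory of Q4_K_M: for a 35B model, that's the difference between a file that fits a 24GB card with room for context and one that doesn't.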
How to Actually Run Qwen Locally
The easiest path for most people:
Ollama is the simplest option. Install it, then run `ollama run qwen3.5:35b-a3b`. It downloads the quantized model and starts a chat session. Works on Mac, Linux, and Windows.
LM Studio gives you a GUI. Browse models, pick a quantization, click download, start chatting. Good for people who don’t want to touch a terminal.
MLX (Mac only) is the fastest option on Apple Silicon. Install `mlx-lm` via pip, download a model from HuggingFace, and run it. About 50% faster than Ollama for token generation.
llama.cpp gives you maximum control. Build from source, pick your GGUF quantization, tune GPU layer offloading, batch sizes, and context length. This is what the benchmark numbers above mostly come from.
vLLM is the production choice. If you’re running Qwen as a service (local API server, team deployment), vLLM’s PagedAttention and continuous batching give you much better throughput than single-user tools.
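Whichever runtime you pick, most of them expose a local HTTP API. The sketch below talks to Ollama's `/api/generate` endpoint on its default port using only the standard library; the model tag is the one from this article and assumes you've already pulled it with `ollama run`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model, prompt):
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def generate(model, prompt):
    """Send a prompt to a locally running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

With `ollama serve` running, `generate("qwen3.5:35b-a3b", "Explain MoE in one sentence.")` returns the model's reply as a string; set `"stream": True` instead if you want tokens as they're produced.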
Sources: Unsloth Qwen3 Guide, Ollama
Qwen vs. the Competition for Local Use
Qwen vs. DeepSeek-R1: DeepSeek-R1 is stronger at pure mathematical reasoning (92.8% vs 88.8% on MATH-500 at the 7B tier). But Qwen3 is better at everything else: coding, chat, multilingual tasks, instruction following. At 32B, Qwen3 wins on both reasoning and general tasks. For most local use cases, Qwen is the better all-rounder.
Qwen vs. Llama: Llama 4 brought MoE architecture to Meta’s lineup, and it’s competitive on STEM benchmarks. But Qwen3’s wider range of model sizes (0.6B to 235B) gives more flexibility for different hardware setups. Qwen also leads on coding benchmarks.
Qwen vs. Gemma 3: Google’s Gemma 3 is fast at short-form chat and excels on edge devices. For longer, more complex tasks (coding sessions, document analysis, reasoning chains), Qwen is the stronger choice.
The community consensus: Qwen for general tasks and coding, DeepSeek-R1 for when you specifically need strong mathematical reasoning.
Sources: HuggingFace Open-Source LLMs Comparison, DataCamp Qwen3 Guide
Practical Hardware Recommendations
Budget build (~$300-500 used): RTX 3090 (24GB). Still one of the best price-to-performance GPUs for local LLMs. Runs Qwen3.5-35B-A3B at 111 t/s with Q4 quantization. You can find these used for $400-500. Pair it with 32GB system RAM and you’re set.
Mac users: M3/M4 Pro with 36GB+ RAM. The MoE models fly on Apple Silicon. Qwen3-30B-A3B at 4-bit MLX gets 60-80 t/s on an M3 Pro with 36GB. If you already have the machine, you don’t need to buy anything else.
High-end: RTX 5090 (32GB) or M4 Max (64-128GB). The 5090 pushes 165 t/s on the 35B MoE. The M4 Max breaks 100 t/s and can comfortably run even the larger 235B-A22B MoE model in quantized form.
Dual GPU: Two RTX 3090s or 4090s. If you want to run the bigger coding models (Qwen3-Coder-Next at 80B) at high context lengths, dual GPUs give you 48GB of combined VRAM and 33-71 t/s depending on quantization.
No discrete GPU: AMD Ryzen AI Max+ 395. AMD's latest APUs with 64GB shared memory can run the 35B MoE at 38 t/s in Q8. Not as fast as a discrete GPU, but it works.
What This Means
A year ago, running a model this capable locally required enterprise hardware or cloud API credits. Now a used RTX 3090 and a good quantization gets you 111 tokens per second from a 35-billion-parameter model. A MacBook Pro with 36GB of RAM gets you 60-80 tokens per second from the same model family.
The pace of improvement is worth noting. Qwen 3.5’s 35B-A3B with 3 billion active parameters outperforms models that needed 22 billion active parameters just one generation earlier. If that trajectory holds, the next release will likely push frontier-quality models into even smaller hardware envelopes.
For developers, researchers, and anyone who cares about running AI without sending their data to an API, this is the most practical the local LLM landscape has ever been. Qwen isn’t the only good option, but it’s the most consistently excellent one across every size tier.
Sources:
- Qwen 3.5 Official Announcement (Feb 2026)
- Qwen 3.5 on GitHub
- Qwen3.5-35B-A3B on HuggingFace
- VentureBeat: Alibaba’s Qwen3.5 Models Offer Sonnet 4.5 Performance
- MarkTechPost: Qwen 3.5 Medium Model Series
- CNBC: Alibaba Unveils Qwen3.5
- Qwen3: Think Deeper, Act Faster (Official Blog, April 2025)
- Qwen3 Technical Report (arXiv)
- Hardware Corner: Qwen3.5 GPU Benchmarks
- Hardware Corner: Qwen3-Coder-Next Hardware Requirements
- Bored Consultant: Qwen3 and Gemma3 on Consumer Hardware
- MacStories: Mac Studio AI Benchmarks with Qwen3
- Unsloth: Qwen3.5 Local Setup Guide
- HuggingFace: Open-Source LLMs Comparison
- DataCamp: Qwen3 Features and Comparison
- Qwen on Wikipedia