A developer has published detailed benchmarks showing that Apple's M4 MacBook Pro with 24GB of unified memory can run local large language models at practical speeds for everyday use. Testing multiple models, the developer settled on Qwen 3.5 9B (quantized to Q4_K_S) running on LM Studio, achieving approximately 40 tokens per second with thinking mode enabled and a 128K context window.
Multiple Models Tested on Apple Silicon
The developer experimented with several models to find the best balance between performance and resource usage on the M4 chip. Models that fell short included Qwen 3.6 Q3, GPT-OSS 20B, Devstral Small 24B, and Gemma 4B; Gemma 4B ran smoothly but struggled with tool use, making it unsuitable for the developer's workflow. The successful configuration was qwen3.5-9b@q4_k_s with temperature 0.6, top_p 0.95, top_k 20, min_p 0.0, presence penalty 0.0, and repetition penalty 1.0.
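LM Studio exposes an OpenAI-compatible HTTP server (on port 1234 by default), so the settings above can be expressed as a chat-completion request. The sketch below is an assumption about how one might wire these up, not the developer's actual setup: the field names `top_k`, `min_p`, and `repeat_penalty` are llama.cpp-style extensions rather than standard OpenAI parameters, and a given server build may expect them to be configured in the LM Studio UI instead.

```python
# Sketch: assembling a request for a local LM Studio server using the
# sampling parameters quoted in the article. The model identifier is the
# one from the article; the endpoint below is LM Studio's default.

def build_request(prompt: str) -> dict:
    """Build a chat-completion payload with the quoted sampling settings."""
    return {
        "model": "qwen3.5-9b@q4_k_s",   # identifier from the article
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,                    # non-standard OpenAI field
        "min_p": 0.0,                   # non-standard OpenAI field
        "presence_penalty": 0.0,
        "repeat_penalty": 1.0,          # llama.cpp-style name; an assumption
    }

payload = build_request("Outline a migration plan for this repo.")
# POST it to the local server, e.g. with requests:
#   requests.post("http://localhost:1234/v1/chat/completions", json=payload)
```

Keeping the parameters in one helper makes it easy to reuse the exact configuration across scripts while experimenting with different models.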
Practical Performance for Daily Development Work
The Qwen 3.5 setup handled basic tasks, research, and planning while leaving enough headroom for other applications to run concurrently. The developer noted clear limitations relative to state-of-the-art cloud models, however: the model occasionally got distracted, fell into loops, or misinterpreted instructions. The workflow demanded more user guidance and upfront planning than frontier models do, trading convenience for privacy.
Growing Interest in On-Device AI
This benchmark arrives amid increasing developer interest in privacy-preserving on-device models. The practical demonstration shows that mid-tier Apple Silicon configurations can support local AI workflows without requiring top-end hardware. The 24GB unified memory configuration represents a mainstream option rather than a high-end workstation, making these results relevant to a broad developer audience considering local AI deployment.
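A rough estimate illustrates why a 9B model at Q4_K_S fits comfortably in 24GB of unified memory. The ~4.5 bits/weight figure below is an approximation commonly cited for llama.cpp K-quants, not an exact value, and the KV cache for a long context adds further memory on top of the weights.

```python
# Rough weight-memory estimate for a quantized model. Assumes ~4.5
# effective bits per weight for Q4_K_S (an approximation); ignores the
# KV cache, which grows with context length.

def quantized_weight_gb(params_b: float, bits_per_weight: float = 4.5) -> float:
    """Approximate weight footprint in decimal GB for a quantized model."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

weights = quantized_weight_gb(9)  # 9B parameters, as in the article
print(f"~{weights:.1f} GB of weights")  # prints ~5.1 GB of weights
```

Roughly 5GB of weights leaves the rest of the 24GB for the KV cache, macOS, and the concurrent applications the developer mentions.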
Key Takeaways
- M4 MacBook Pro with 24GB RAM runs Qwen 3.5 9B at approximately 40 tokens/second using LM Studio
- Quantized Q4_K_S model variant enables 128K context window while maintaining usable generation speeds
- Several larger models including Qwen 3.6 Q3, GPT-OSS 20B, and Devstral Small 24B failed to run efficiently on this configuration
- Local models require more user guidance than cloud alternatives but enable privacy-preserving workflows
- Mid-tier Apple Silicon configurations prove sufficient for practical local AI development without requiring high-end hardware