A developer has published detailed benchmarks showing that Apple's M4 MacBook Pro with 24GB of unified memory can run local large language models at practical speeds for everyday use. Testing multiple models, the developer settled on Qwen 3.5 9B (quantized to Q4_K_S) running on LM Studio, achieving approximately 40 tokens per second with thinking mode enabled and a 128K context window.
Multiple Models Tested on Apple Silicon
The developer experimented with several models to find the best balance between performance and resource usage on the M4 chip. Models that fell short included Qwen 3.6 Q3, GPT-OSS 20B, Devstral Small 24B, and Gemma 4B; Gemma 4B ran smoothly but struggled with tool use, making it unsuitable for the developer's workflow. The successful configuration was qwen3.5-9b@q4_k_s with temperature 0.6, top_p 0.95, top_k 20, min_p 0.0, presence penalty 0.0, and repetition penalty 1.0.
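LM Studio exposes an OpenAI-compatible HTTP server (on port 1234 by default), so the settings above can be expressed as a chat-completion request. The sketch below is an assumption about how one might wire these up, not the developer's actual setup: the field names `top_k`, `min_p`, and `repeat_penalty` are llama.cpp-style extensions rather than standard OpenAI parameters, and a given server build may expect them to be configured in the LM Studio UI instead.

```python
# Sketch: assembling a request for a local LM Studio server using the
# sampling parameters quoted in the article. The model identifier is the
# one from the article; the endpoint below is LM Studio's default.

def build_request(prompt: str) -> dict:
    """Build a chat-completion payload with the quoted sampling settings."""
    return {
        "model": "qwen3.5-9b@q4_k_s",   # identifier from the article
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,                    # non-standard OpenAI field
        "min_p": 0.0,                   # non-standard OpenAI field
        "presence_penalty": 0.0,
        "repeat_penalty": 1.0,          # llama.cpp-style name; an assumption
    }

payload = build_request("Outline a migration plan for this repo.")
# POST it to the local server, e.g. with requests:
#   requests.post("http://localhost:1234/v1/chat/completions", json=payload)
```

Keeping the parameters in one helper makes it easy to reuse the exact configuration across scripts while experimenting with different models.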
Practical Performance for Daily Development Work
The Qwen 3.5 setup handled basic tasks, research, and planning while leaving enough headroom for other applications to run concurrently. The developer noted clear limitations relative to state-of-the-art cloud models, however: the model occasionally got distracted, fell into loops, or misinterpreted instructions. The workflow demanded more user guidance and upfront planning than frontier models do, trading convenience for privacy.
Growing Interest in On-Device AI
This benchmark arrives amid increasing developer interest in privacy-preserving on-device models. The practical demonstration shows that mid-tier Apple Silicon configurations can support local AI workflows without requiring top-end hardware. The 24GB unified memory configuration represents a mainstream option rather than a high-end workstation, making these results relevant to a broad developer audience considering local AI deployment.
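A rough estimate illustrates why a 9B model at Q4_K_S fits comfortably in 24GB of unified memory. The ~4.5 bits/weight figure below is an approximation commonly cited for llama.cpp K-quants, not an exact value, and the KV cache for a long context adds further memory on top of the weights.

```python
# Rough weight-memory estimate for a quantized model. Assumes ~4.5
# effective bits per weight for Q4_K_S (an approximation); ignores the
# KV cache, which grows with context length.

def quantized_weight_gb(params_b: float, bits_per_weight: float = 4.5) -> float:
    """Approximate weight footprint in decimal GB for a quantized model."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

weights = quantized_weight_gb(9)  # 9B parameters, as in the article
print(f"~{weights:.1f} GB of weights")  # prints ~5.1 GB of weights
```

Roughly 5GB of weights leaves the rest of the 24GB for the KV cache, macOS, and the concurrent applications the developer mentions.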
Key Takeaways
- M4 MacBook Pro with 24GB RAM runs Qwen 3.5 9B at approximately 40 tokens/second using LM Studio
- Quantized Q4_K_S model variant enables 128K context window while maintaining usable generation speeds
- Several larger models including Qwen 3.6 Q3, GPT-OSS 20B, and Devstral Small 24B failed to run efficiently on this configuration
- Local models require more user guidance than cloud alternatives but enable privacy-preserving workflows
- Mid-tier Apple Silicon configurations prove sufficient for practical local AI development without requiring high-end hardware