Matt Gallagher demonstrated that Swift can achieve competitive performance for machine learning workloads on Apple Silicon, publishing a detailed optimization journey on May 10, 2026 that took training throughput from 0.054 tokens per second to 11.123 tokens per second, a roughly 206x improvement. The comprehensive article on Cocoa with Love benchmarks a GPT-2-compatible model with 124 million parameters against Andrej Karpathy's llm.c implementation.
Progressive Optimization Unlocks Dramatic Performance Gains
Gallagher's optimization journey progressed through six distinct stages, each building on previous improvements:
- Basic Swift: 0.054 tokens/s (7.3% of llm.c baseline; see the sketch after this list)
- Memory optimization: 0.533 tokens/s using mutableSpan to eliminate copy-on-write overhead
- SIMD improvements: 0.918 tokens/s with loop restructuring and fused multiply-add operations
- Multithreading: 4.356 tokens/s by parallelizing across 16 CPU cores
- AMX acceleration: 5.884 tokens/s leveraging Apple's undocumented Matrix Coprocessor
- Metal GPU: 11.123 tokens/s using GPU compute shaders with tiling
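For context, the baseline stage corresponds to straightforward Swift along these lines, a naive triple loop over plain Arrays (an illustrative sketch, not Gallagher's actual code):

```swift
// Illustrative baseline (not from the article): a naive O(m·n·k) matrix
// multiply over plain Swift Arrays. Every write to `c` passes through
// Array's bounds and copy-on-write checks, which is exactly the overhead
// the later stages progressively strip away.
func naiveMatmul(_ a: [Float], _ b: [Float], m: Int, n: Int, k: Int) -> [Float] {
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for j in 0..<n {
            var sum: Float = 0
            for p in 0..<k {
                sum += a[i * k + p] * b[p * n + j]  // row-major indexing
            }
            c[i * n + j] = sum
        }
    }
    return c
}
```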
The final implementation achieved 1.1 teraflops, up from an initial 2.8 gigaflops—a nearly 400x improvement in raw computational throughput.
Technical Optimizations Span Memory, CPU, and GPU
Early-stage improvements focused on eliminating Swift's Array overhead, which initially caused _ArrayBuffer.beginCOWMutation() to become the primary bottleneck. Gallagher implemented mutableSpan for direct memory access and applied fused multiply-add operations through Swift Numerics' Relaxed.multiplyAdd.
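A minimal sketch of what that stage's inner loop might look like, assuming Swift 6.2's Span APIs and the swift-numerics package (function and parameter names here are illustrative, not Gallagher's):

```swift
import Numerics  // swift-numerics, for Relaxed.multiplyAdd

// Computes one row of C = A × B. Span/MutableSpan give direct views into
// the arrays' storage, so the hot loop avoids the per-write
// beginCOWMutation() uniqueness checks that dominated the naive profile.
func matmulRow(_ a: [Float], _ b: [Float], into c: inout [Float],
               row: Int, n: Int, k: Int) {
    let aSpan = a.span          // read-only view of A's storage
    let bSpan = b.span
    var cSpan = c.mutableSpan   // exclusive mutable view into C's storage
    for j in 0..<n {
        var acc: Float = 0
        for p in 0..<k {
            // Relaxed.multiplyAdd permits the compiler to fuse the
            // multiply and add into a single FMA instruction.
            acc = Relaxed.multiplyAdd(aSpan[row * k + p], bSpan[p * n + j], acc)
        }
        cSpan[row * n + j] = acc
    }
}
```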
Mid-stage optimizations parallelized matrix operations using DispatchQueue.concurrentPerform, distributing work across CPU cores. Advanced optimizations leveraged Apple Silicon-specific features: AMX (Apple Matrix Coprocessor) instructions provided specialized matrix acceleration, while Metal compute shaders with tiling enabled GPU-accelerated operations.
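A hedged sketch of the multithreading step (again illustrative, not the article's code): each row of the output is independent, so rows can be handed to DispatchQueue.concurrentPerform with no locking.

```swift
import Dispatch

// Parallel C = A × B across rows. Each iteration writes a disjoint slice
// of `c`, so no synchronization is required between dispatch items.
func parallelMatmul(_ a: [Float], _ b: [Float], into c: inout [Float],
                    m: Int, n: Int, k: Int) {
    a.withUnsafeBufferPointer { aBuf in
        b.withUnsafeBufferPointer { bBuf in
            c.withUnsafeMutableBufferPointer { cBuf in
                // libdispatch spreads the m row-jobs across the available
                // cores (16 in the article's setup).
                DispatchQueue.concurrentPerform(iterations: m) { i in
                    for j in 0..<n {
                        var acc: Float = 0
                        for p in 0..<k {
                            // addingProduct is the standard library's fused
                            // multiply-add: acc + a*b in one instruction.
                            acc = acc.addingProduct(aBuf[i * k + p], bBuf[p * n + j])
                        }
                        cBuf[i * n + j] = acc
                    }
                }
            }
        }
    }
}
```

The final stage moves this loop off the CPU entirely, expressing it as a Metal compute shader that tiles the matrices through threadgroup memory.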
The training workload involved approximately 191 trillion floating-point operations per iteration across the forward pass, backward pass, and weight updates. Final training throughput reached roughly 11 tokens per second.
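Taken at face value, and assuming the 1.1 teraflops figure is sustained across a full iteration, those numbers imply roughly 191 × 10¹² FLOPs ÷ 1.1 × 10¹² FLOP/s ≈ 174 seconds of raw arithmetic per training iteration.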
Production Use Not Recommended Despite Performance Gains
Gallagher explicitly cautioned against production deployment of this code, noting that future articles will explore "BLAS, BNNS, CoreML, MPSGraph and other high performance libraries built into macOS." While the 382x speedup demonstrates Swift's potential for high-performance computing, the implementation remains educational rather than production-ready.
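For orientation, delegating the same matrix multiply to Accelerate's BLAS, one of those built-in libraries, looks roughly like the following (a conventional cblas_sgemm call, not code from the article):

```swift
import Accelerate

// Row-major, single-precision C = A × B via Accelerate's CBLAS interface,
// the kind of built-in routine the follow-up articles are slated to cover.
func blasMatmul(_ a: [Float], _ b: [Float], m: Int, n: Int, k: Int) -> [Float] {
    var c = [Float](repeating: 0, count: m * n)
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                Int32(m), Int32(n), Int32(k),
                1.0,           // alpha: scale on A × B
                a, Int32(k),   // A and its leading dimension
                b, Int32(n),   // B and its leading dimension
                0.0,           // beta: overwrite C rather than accumulate
                &c, Int32(n))  // C and its leading dimension
    return c
}
```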
The article received 212 points on Hacker News, highlighting developer interest in Swift-based machine learning optimization techniques.
Key Takeaways
- Matt Gallagher achieved a roughly 206x speedup training an LLM in Swift on Apple Silicon, from 0.054 to 11.123 tokens/second
- Optimization progressed through six stages: a baseline Swift implementation, then memory management, SIMD operations, multithreading, AMX acceleration, and Metal GPU compute
- Final implementation reached 1.1 teraflops computational throughput, up from initial 2.8 gigaflops
- The GPT-2 compatible model with 124 million parameters performed 191 trillion floating-point operations per training iteration
- Author explicitly recommends against production use, positioning the work as educational exploration of Swift's ML capabilities