Matt Gallagher demonstrated that Swift can achieve competitive performance for machine learning workloads on Apple Silicon, publishing a detailed optimization journey on May 10, 2026 that took training throughput from 0.054 tokens per second to 11.123 tokens per second, a roughly 206x improvement. The comprehensive article on Cocoa with Love benchmarks a GPT-2-compatible model with 124 million parameters against Andrej Karpathy's llm.c implementation.
Progressive Optimization Unlocks Dramatic Performance Gains
Gallagher's optimization journey progressed through six distinct stages, each building on previous improvements:
- Basic Swift: 0.054 tokens/s (7.3% of llm.c baseline; see the sketch after this list)
- Memory optimization: 0.533 tokens/s using mutableSpan to eliminate copy-on-write overhead
- SIMD improvements: 0.918 tokens/s with loop restructuring and fused multiply-add operations
- Multithreading: 4.356 tokens/s by parallelizing across 16 CPU cores
- AMX acceleration: 5.884 tokens/s leveraging Apple's undocumented Matrix Coprocessor
- Metal GPU: 11.123 tokens/s using GPU compute shaders with tiling
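For context, the baseline stage corresponds to straightforward Swift along these lines, a naive triple loop over plain Arrays (an illustrative sketch, not Gallagher's actual code):

```swift
// Illustrative baseline (not from the article): a naive O(m·n·k) matrix
// multiply over plain Swift Arrays. Every write to `c` passes through
// Array's bounds and copy-on-write checks, which is exactly the overhead
// the later stages progressively strip away.
func naiveMatmul(_ a: [Float], _ b: [Float], m: Int, n: Int, k: Int) -> [Float] {
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for j in 0..<n {
            var sum: Float = 0
            for p in 0..<k {
                sum += a[i * k + p] * b[p * n + j]  // row-major indexing
            }
            c[i * n + j] = sum
        }
    }
    return c
}
```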
The final implementation achieved 1.1 teraflops, up from an initial 2.8 gigaflops—a nearly 400x improvement in raw computational throughput.
Technical Optimizations Span Memory, CPU, and GPU
Early-stage improvements focused on eliminating Swift's Array overhead, which initially caused _ArrayBuffer.beginCOWMutation() to become the primary bottleneck. Gallagher implemented mutableSpan for direct memory access and applied fused multiply-add operations through Swift Numerics' Relaxed.multiplyAdd.
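A minimal sketch of what that stage's inner loop might look like, assuming Swift 6.2's Span APIs and the swift-numerics package (function and parameter names here are illustrative, not Gallagher's):

```swift
import Numerics  // swift-numerics, for Relaxed.multiplyAdd

// Computes one row of C = A × B. Span/MutableSpan give direct views into
// the arrays' storage, so the hot loop avoids the per-write
// beginCOWMutation() uniqueness checks that dominated the naive profile.
func matmulRow(_ a: [Float], _ b: [Float], into c: inout [Float],
               row: Int, n: Int, k: Int) {
    let aSpan = a.span          // read-only view of A's storage
    let bSpan = b.span
    var cSpan = c.mutableSpan   // exclusive mutable view into C's storage
    for j in 0..<n {
        var acc: Float = 0
        for p in 0..<k {
            // Relaxed.multiplyAdd permits the compiler to fuse the
            // multiply and add into a single FMA instruction.
            acc = Relaxed.multiplyAdd(aSpan[row * k + p], bSpan[p * n + j], acc)
        }
        cSpan[row * n + j] = acc
    }
}
```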
Mid-stage optimizations parallelized matrix operations using DispatchQueue.concurrentPerform, distributing work across CPU cores. Advanced optimizations leveraged Apple Silicon-specific features: AMX (Apple Matrix Coprocessor) instructions provided specialized matrix acceleration, while Metal compute shaders with tiling enabled GPU-accelerated operations.
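A hedged sketch of the multithreading step (again illustrative, not the article's code): each row of the output is independent, so rows can be handed to DispatchQueue.concurrentPerform with no locking.

```swift
import Dispatch

// Parallel C = A × B across rows. Each iteration writes a disjoint slice
// of `c`, so no synchronization is required between dispatch items.
func parallelMatmul(_ a: [Float], _ b: [Float], into c: inout [Float],
                    m: Int, n: Int, k: Int) {
    a.withUnsafeBufferPointer { aBuf in
        b.withUnsafeBufferPointer { bBuf in
            c.withUnsafeMutableBufferPointer { cBuf in
                // libdispatch spreads the m row-jobs across the available
                // cores (16 in the article's setup).
                DispatchQueue.concurrentPerform(iterations: m) { i in
                    for j in 0..<n {
                        var acc: Float = 0
                        for p in 0..<k {
                            // addingProduct is the standard library's fused
                            // multiply-add: acc + a*b in one instruction.
                            acc = acc.addingProduct(aBuf[i * k + p], bBuf[p * n + j])
                        }
                        cBuf[i * n + j] = acc
                    }
                }
            }
        }
    }
}
```

The final stage moves this loop off the CPU entirely, expressing it as a Metal compute shader that tiles the matrices through threadgroup memory.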
The training workload involved approximately 191 trillion floating-point operations per iteration across the forward pass, backward pass, and weight updates. Final training throughput reached roughly 11 tokens per second.
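Taken at face value, and assuming the 1.1 teraflops figure is sustained across a full iteration, those numbers imply roughly 191 × 10¹² FLOPs ÷ 1.1 × 10¹² FLOP/s ≈ 174 seconds of raw arithmetic per training iteration.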
Production Use Not Recommended Despite Performance Gains
Gallagher explicitly cautioned against production deployment of this code, noting that future articles will explore "BLAS, BNNS, CoreML, MPSGraph and other high performance libraries built into macOS." While the 382x speedup demonstrates Swift's potential for high-performance computing, the implementation remains educational rather than production-ready.
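For orientation, delegating the same matrix multiply to Accelerate's BLAS, one of those built-in libraries, looks roughly like the following (a conventional cblas_sgemm call, not code from the article):

```swift
import Accelerate

// Row-major, single-precision C = A × B via Accelerate's CBLAS interface,
// the kind of built-in routine the follow-up articles are slated to cover.
func blasMatmul(_ a: [Float], _ b: [Float], m: Int, n: Int, k: Int) -> [Float] {
    var c = [Float](repeating: 0, count: m * n)
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                Int32(m), Int32(n), Int32(k),
                1.0,           // alpha: scale on A × B
                a, Int32(k),   // A and its leading dimension
                b, Int32(n),   // B and its leading dimension
                0.0,           // beta: overwrite C rather than accumulate
                &c, Int32(n))  // C and its leading dimension
    return c
}
```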
The article received 212 points on Hacker News, highlighting developer interest in Swift-based machine learning optimization techniques.
Key Takeaways
- Matt Gallagher achieved a roughly 206x speedup training an LLM in Swift on Apple Silicon, from 0.054 to 11.123 tokens/second
- Optimization progressed through six stages: a baseline Swift implementation, then memory management, SIMD operations, multithreading, AMX acceleration, and Metal GPU compute
- Final implementation reached 1.1 teraflops computational throughput, up from initial 2.8 gigaflops
- The GPT-2 compatible model with 124 million parameters performed 191 trillion floating-point operations per training iteration
- Author explicitly recommends against production use, positioning the work as educational exploration of Swift's ML capabilities