Fergus Finn of Doubleword, an inference cloud startup focused on high-volume deployments, deployed DeepSeek-V4-Flash on AMD's MI300X GPU and documented the technical challenges he encountered. The effort highlights both the potential and the obstacles of using AMD hardware as an alternative to NVIDIA GPUs for AI inference.
AMD MI300X Hardware Offers Compelling Economics
The MI300X GPU presents attractive specifications for inference workloads:
- Memory capacity: 192GB HBM3 per card versus H100's 80GB
- Performance: Comparable FP8 compute to NVIDIA H100
- Pricing: List price roughly 50% of that for equivalent NVIDIA capacity
- On-demand costs: Significantly lower rental rates than H100/H200
- Architecture: CDNA3 (gfx942) GPU cores
Four Major Technical Obstacles Required Custom Solutions
Finn encountered four software ecosystem challenges that required engineering workarounds:
FP8 Dialect Incompatibility: The MI300X uses AMD's proprietary 'fnuz' FP8 format, while newer AMD chips (MI325, MI350, MI355X) adopted the OCP standard. The two dialects use different exponent biases, so the same bit pattern decodes to values that differ by a factor of two. The vLLM framework's FP8 paths distinguish e4m3 from e5m2 but not fnuz from OCP.
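The factor of two can be made concrete with a minimal Python sketch that decodes one 8-bit e4m3 pattern under both dialects. It assumes the commonly published layouts: OCP e4m3 with exponent bias 7 and a single NaN mantissa pattern, and e4m3fnuz with exponent bias 8 and the byte 0x80 reserved for NaN. The function name `decode_e4m3` is illustrative and is not part of vLLM or AITER.

```python
import math


def decode_e4m3(byte: int, fnuz: bool) -> float:
    """Decode one e4m3 byte (1 sign, 4 exponent, 3 mantissa bits).

    Assumed layouts: OCP uses bias 7; fnuz uses bias 8 and treats 0x80 as NaN.
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7

    if fnuz:
        if byte == 0x80:  # the "negative zero" pattern encodes NaN in fnuz
            return math.nan
        bias = 8
    else:
        if exp == 0xF and man == 0x7:  # OCP e4m3 NaN pattern (no infinities)
            return math.nan
        bias = 7

    if exp == 0:  # zero and subnormals
        return sign * (man / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - bias)


if __name__ == "__main__":
    for bits in (0x38, 0x40, 0x7E):
        ocp = decode_e4m3(bits, fnuz=False)
        amd = decode_e4m3(bits, fnuz=True)
        print(f"{bits:#04x}: OCP={ocp}  fnuz={amd}  ratio={ocp / amd}")
```

Under these assumptions every finite, nonzero bit pattern decodes to exactly half the OCP value when read as fnuz, which is why weights or scales produced for one dialect are off by a factor of two when interpreted as the other.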
Missing AITER Kernel Paths: Three operations lacked optimized implementations on gfx942: paged MQA logits, sparse MLA prefill, and sparse MLA decode. The team implemented fallbacks to Triton kernels to bridge the gap.
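The fallback idea can be sketched as a small dispatch table: look up an optimized kernel for the current architecture and use a portable implementation when none is registered. Everything below is hypothetical and for illustration only. The registry, the helper names, and the `example_arch` label are assumptions, the stand-in kernels just return strings, and none of it reflects how vLLM or AITER actually select kernels.

```python
from typing import Callable, Dict, Tuple

Kernel = Callable[[], str]

# Hypothetical registry mapping (operation, GPU architecture) to an optimized kernel.
_OPTIMIZED: Dict[Tuple[str, str], Kernel] = {}


def register_optimized(op: str, arch: str) -> Callable[[Kernel], Kernel]:
    """Decorator that records an optimized kernel for one (op, arch) pair."""
    def decorator(fn: Kernel) -> Kernel:
        _OPTIMIZED[(op, arch)] = fn
        return fn
    return decorator


def select_kernel(op: str, arch: str, fallback: Kernel) -> Kernel:
    """Return the optimized kernel if one is registered, else the fallback."""
    return _OPTIMIZED.get((op, arch), fallback)


# "example_arch" is a made-up architecture that has an optimized path.
@register_optimized("sparse_mla_decode", "example_arch")
def _optimized_decode() -> str:
    return "optimized"


def portable_decode() -> str:
    # Stand-in for a portable (for example Triton-based) implementation.
    return "portable fallback"


if __name__ == "__main__":
    print(select_kernel("sparse_mla_decode", "example_arch", portable_decode)())
    print(select_kernel("sparse_mla_decode", "gfx942", portable_decode)())
```

Keeping the fallback explicit at the call site means a missing optimized path degrades performance rather than failing outright.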
HIP Graph Constraints: Recording compute graphs requires pure functions with no dynamic host reads or ragged tensor allocations. The team redesigned sparse MLA decode metadata handling to be capture-safe.
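One common way to make ragged metadata capture-safe is to pad it to a fixed upper bound and carry the true lengths alongside, so shapes never change between replays. The sketch below shows that general pattern in plain Python. It is not Finn's implementation: `pad_ragged`, the `PAD` sentinel, and the list-based representation are assumptions, and a real deployment would fill preallocated device tensors in place instead of building new lists.

```python
from typing import List, Tuple

PAD = -1  # hypothetical sentinel marking unused slots


def pad_ragged(rows: List[List[int]], max_len: int) -> Tuple[List[List[int]], List[int]]:
    """Pad variable-length index rows to a fixed width.

    Returns (padded_rows, lengths). Every padded row has exactly max_len
    entries, so the overall shape depends only on len(rows) and max_len.
    """
    padded: List[List[int]] = []
    lengths: List[int] = []
    for row in rows:
        if len(row) > max_len:
            raise ValueError(f"row of length {len(row)} exceeds max_len={max_len}")
        padded.append(row + [PAD] * (max_len - len(row)))
        lengths.append(len(row))
    return padded, lengths


if __name__ == "__main__":
    # Two requests selecting different numbers of KV blocks.
    padded, lengths = pad_ragged([[4, 9], [7]], max_len=3)
    print(padded)   # fixed 2 x 3 layout
    print(lengths)  # true lengths let a kernel skip the padded slots
```

The trade-off is some wasted work on padded slots in exchange for a fixed shape that can be recorded once and replayed.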
Miscellaneous Bugs: Issues included MoE routing mask shape logic gated on the wrong AITER conditions and a Triton kernel padding overflow that corrupted MoE bitmatrices.
Optimizations Deliver 8.6% Throughput Improvement
After addressing the technical challenges, the deployment achieved measurable performance gains:
- Baseline performance: 2,485 tokens/second per GPU
- Optimized performance: 2,699 tokens/second per GPU
- Improvement: +8.6% throughput gain
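The reported improvement follows directly from the two throughput figures; a quick check in Python, using only the numbers listed above:

```python
baseline = 2485   # tokens/second per GPU before optimization
optimized = 2699  # tokens/second per GPU after optimization

# Relative gain = (new - old) / old
gain = (optimized - baseline) / baseline
print(f"{gain:.1%}")  # prints 8.6%
```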
Key optimization techniques included rebuilding sparse MLA metadata as static tensors, caching bf16 projection weights across decode steps, and implementing dynamic tile shape selection for MXFP4 MoE operations.
AMD Ecosystem Challenges Are Temporary, According to Finn
Finn emphasizes that most obstacles are temporary. Newer AMD GPU generations resolve the FP8 dialect issue, and AITER kernel coverage will improve with time. The fixes are available through a public vLLM repository for community contributions, enabling other developers to deploy on AMD hardware.
The write-up appeared on Hacker News, where it received 78 points and 7 comments, suggesting interest within the technical community in AMD alternatives to NVIDIA.
Key Takeaways
- AMD MI300X GPUs offer 192GB of HBM3 memory and FP8 performance comparable to NVIDIA's H100 at roughly 50% of the list price
- Deploying DeepSeek-V4-Flash on MI300X required solving four major technical challenges, including FP8 dialect incompatibility and missing kernel implementations
- Optimizations achieved 2,699 tokens/second per GPU, an 8.6% throughput improvement over the 2,485 tokens/second baseline
- The MI300X uses AMD's proprietary 'fnuz' FP8 format while newer AMD chips adopted the OCP standard, creating compatibility issues with existing frameworks
- Fixes and optimizations are publicly available through a vLLM repository, enabling broader community adoption of AMD hardware for AI inference