Vega 48 MPS Services
Experiment

Every LLM and most online resources say MPS only works on Apple Silicon, or that you need ROCm or CUDA for GPU inference on AMD hardware. I kept being told it wasn't possible, tried it anyway, and it just worked. PyTorch's MPS backend runs via Metal on Intel Macs with AMD discrete GPUs.
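Verifying this takes only a few lines. A minimal sketch, assuming a recent PyTorch build: `torch.backends.mps.is_available()` reports whether the Metal backend can drive the GPU, and it returns True on these Intel + AMD machines, not just Apple Silicon. The CPU fallback here is just so the snippet runs anywhere.

```python
import torch

# On an Intel Mac with an AMD dGPU (e.g. the Vega 48), is_available()
# returns True when Metal can drive the card -- no CUDA or ROCm involved.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Run a real op on the selected device to confirm inference works there.
x = torch.randn(4, 8, device=device)
y = (x @ x.T).relu()
print(device, tuple(y.shape))
```

If the print shows `mps`, the matmul actually executed on the AMD card.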

Three services running on the Vega 48:

- Text embeddings with nomic-embed-text-v2-moe, peaking at 112 embeds/sec (103/sec at the sweet spot of 4 workers × batch 128)
- Vision captioning with Moondream2: 12.3s for short captions, 28.8s for normal (concurrency doesn't help; MPS serializes everything)
- SDXL image generation

All on GPU, not CPU fallback. The 8GB of VRAM is the main constraint: embeddings and captioning fit together, but SDXL needs the card to itself.
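The "4 workers × batch 128" sweet spot can be sketched with stdlib tools alone. This is a hypothetical illustration of the fan-out shape, not the service's actual code: `embed_batch` is a stand-in for the real model call (a sentence-transformers `encode()` on the MPS device), and the threads mainly overlap CPU-side work like tokenization, since MPS serializes the GPU kernels themselves.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 128  # the sweet spot from the benchmark above
WORKERS = 4

def embed_batch(texts):
    # Placeholder: the real service would call something like
    # model.encode(texts, device="mps"). Dummy values keep this runnable.
    return [len(t) for t in texts]

def embed_all(texts):
    # Chunk the incoming texts into fixed-size batches...
    batches = [texts[i:i + BATCH_SIZE] for i in range(0, len(texts), BATCH_SIZE)]
    # ...and fan them out to a small worker pool. pool.map preserves order,
    # so results line up with the input texts.
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        results = pool.map(embed_batch, batches)
    return [vec for batch in results for vec in batch]

print(len(embed_all(["hello"] * 300)))
```

The same shape would not help the Moondream2 captioner: with one long GPU-bound call per image and MPS serializing kernels, extra workers just queue behind each other.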

It might be a recent addition to PyTorch or it might have always worked. Either way, if you've got one of these machines and you've been told it can't do inference — it can.

// Tech stack

Python, PyTorch MPS, Metal, FastAPI, Sentence-Transformers, Diffusers, Transformers
Live in production