Multi-GPU Showdown: Benchmarking vLLM, Llama.cpp, & Ollama for Maximum Performance
Zack Saadioui
8/11/2025
Alright, let's talk about something that's on every AI enthusiast's mind when they start scaling up: how to get the absolute most out of a multi-GPU setup for running large language models. You've got the hardware, maybe a beastly rig with a couple of 3090s or 4090s, & you're ready to move beyond what a single card can do. But then you hit a wall. Which inference engine should you use? The internet is a mess of conflicting advice, with people swearing by different tools.
Honestly, it's a confusing landscape. You've probably heard of vLLM, Llama.cpp, & Ollama. They're the big three right now, but they are NOT created equal. Turns out, the one you choose can make a GIGANTIC difference in performance, especially when you have more than one GPU in the mix. I've spent a ton of time digging through Reddit threads, blog posts, & benchmarks to get to the bottom of this. So, here's the deal on how these three stack up & which one you should be using to avoid crippling your expensive hardware.
The Core of the Issue: It's Not Just About More GPUs
Here's the thing most people get wrong: simply having multiple GPUs doesn't automatically mean faster inference. The real magic lies in how the software uses those GPUs. The goal is to get them all working together in perfect harmony, a concept known as parallelism. Without it, you could have a monster 6-GPU machine that's getting outperformed by a single, well-optimized setup. It's a classic case of having a powerful engine but a terrible transmission.
The two main players in this game are vLLM & ExLlamaV2. These are the tools built from the ground up for serious, high-performance, multi-GPU inference. Llama.cpp & its user-friendly wrapper, Ollama, have their place, but as you'll see, they're not really designed for this kind of high-stakes performance game.
Let's break it down.
vLLM: The Enterprise-Grade Speed Demon
If there's one takeaway from all my research, it's this: for raw, unadulterated, multi-GPU throughput, vLLM is the king. It's designed for enterprise-level serving, where latency & the number of requests you can handle per second are EVERYTHING. If you're building a service that needs to respond to multiple users at once, vLLM is almost certainly your answer.
So, what makes it so fast? Two key technologies: Tensor Parallelism & PagedAttention.
The Magic Behind vLLM's Speed
Tensor Parallelism: Imagine you have a massive calculation to do. Instead of giving the whole thing to one GPU, Tensor Parallelism is like splitting that calculation into smaller pieces & giving one piece to each of your GPUs. They all work on their little piece at the same time, & then the results are combined. This is a HUGE deal for LLMs because the models themselves are just giant collections of matrices (tensors). vLLM expertly slices up these tensors so that all your GPUs are crunching numbers simultaneously. This is what you're paying for with a multi-GPU setup, & vLLM delivers.
PagedAttention (or Paged KV Cache): This one is a bit more technical, but it's a game-changer for memory management. When an LLM generates text, it has to keep track of the conversation so far. This "memory" is called the KV cache, & it can get HUGE, eating up a ton of your precious VRAM. PagedAttention is a clever system, pioneered by the vLLM team, that organizes this KV cache into smaller, non-contiguous blocks, like pages in a book. This means memory is used WAY more efficiently, especially when you're handling lots of different requests at once (batching). Less wasted memory means you can handle more users & longer conversations without running out of VRAM.
vLLM's Performance in the Real World
The numbers speak for themselves. People report throughput jumping from a handful of tokens per second to hundreds. I've seen reports of setups with multiple RTX 3090s hitting 250-350 tokens/second on batched requests. One benchmark even showed a Llama 70B model on 4x H100 GPUs getting a 1.8x throughput improvement with a recent vLLM update. That's some serious speed.
Now, here's the catch. vLLM is a bit of a VRAM hog by design. It pre-allocates about 90% of your available VRAM to itself for maximum speed, assuming the entire model can fit in memory. It's also not great for people who like to switch between different models frequently, as you generally have to restart the vLLM server to load a new one.
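That 90% figure is just the default, though. If you need to leave headroom on the cards, there's a knob for it. A minimal sketch, assuming a recent vLLM release (the model name is just a placeholder):

    # cap vLLM's VRAM pre-allocation at 70% instead of the ~90% default
    vllm serve mistralai/Mistral-7B-Instruct-v0.2 --gpu-memory-utilization 0.7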
Setting Up vLLM for Multi-GPU
Getting vLLM running across multiple GPUs is surprisingly straightforward. The most important part is the --tensor-parallel-size flag. You just set this to the number of GPUs you want to use. So, for a 2-GPU setup, you simply pass --tensor-parallel-size 2 when you launch the server.
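A minimal sketch of that launch, assuming a recent vLLM release with the vllm CLI (the model name here is a placeholder; use whatever you're actually serving):

    # OpenAI-compatible server, model sliced across 2 GPUs via tensor parallelism
    vllm serve mistralai/Mistral-7B-Instruct-v0.2 --tensor-parallel-size 2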
It's that simple. Of course, you need to have your CUDA environment & PyTorch installed correctly, but the vLLM-specific part is a breeze. For even bigger setups that span multiple machines, vLLM uses the Ray framework to create a distributed computing cluster.
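If you do go down the multi-node route, the rough shape looks like this. It's a sketch with made-up addresses & GPU counts (2 nodes x 4 GPUs), assuming every node can reach the head node & load the model weights:

    # on the head node (hypothetical port)
    ray start --head --port=6379

    # on each worker node (hypothetical head-node address)
    ray start --address='<head-node-ip>:6379'

    # then launch vLLM from the head node; a common pattern is tensor parallelism
    # inside each node & pipeline parallelism across nodes
    vllm serve <model> --tensor-parallel-size 4 --pipeline-parallel-size 2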
Llama.cpp: The Jack-of-All-Trades for Consumer Hardware
Now, let's talk about Llama.cpp. This project is a LEGEND in the local AI community, & for good reason. It's an incredibly versatile, lightweight, & efficient C++ implementation that made it possible to run LLMs on everything from a Mac M1 to a Raspberry Pi. It's the champion of CPU inference & squeezing models onto hardware with limited VRAM.
But here's the critical distinction: Llama.cpp's multi-GPU support is fundamentally different from vLLM's. It's not about parallel processing for speed; it's about offloading layers to fit a model that's too big for a single GPU.
How Llama.cpp Handles Multiple GPUs
With Llama.cpp, you're essentially just splitting the model's layers across your GPUs. If you have a 70B model that needs 40GB of VRAM & you have two 24GB GPUs, Llama.cpp will put some layers on the first GPU & the rest on the second. It's a sequential process—one GPU does its work, then the next one does its work. It's better than offloading to your slow system RAM, but it's not the lightning-fast parallel processing that vLLM offers.
This is why you'll see people with powerful multi-GPU rigs complaining about performance with Llama.cpp. They're not getting the speed boost they expect because their GPUs are waiting in line, not working together. Some users even report that performance gets worse when adding a third GPU.
When Should You Use Llama.cpp?
So, is Llama.cpp bad? ABSOLUTELY NOT. It's just for a different purpose. You should use Llama.cpp when:
You have to offload to the CPU. If a model is too big for your combined VRAM, Llama.cpp is your best friend. It's the undisputed king of CPU offloading.
You're running on a single GPU or CPU-only. Its performance on a single device is excellent.
You value simplicity & a single, portable executable.
Setting Up Llama.cpp for Multi-GPU
Setting up Llama.cpp for multi-GPU is a matter of compiling it with CUDA support & then using a flag to specify how many layers to put on the GPU. The -ngl flag (number of GPU layers) is your main tool here. You can tell it to put a certain number of layers on your primary GPU, & it will automatically try to use subsequent GPUs if the model is still too large.
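For a concrete (if hypothetical) example, assuming a recent CUDA build of llama.cpp & a quantized 70B GGUF file whose path is a placeholder:

    # build with CUDA support (flag name on recent llama.cpp versions)
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release

    # -ngl 99 offloads every layer it can to the GPUs; --tensor-split 1,1
    # divides those layers roughly evenly between two cards
    ./build/bin/llama-cli -m ./models/llama-70b.Q4_K_M.gguf -ngl 99 --tensor-split 1,1 -p "Hello"

If you leave --tensor-split out, llama.cpp will work out the split across whatever GPUs it sees on its own.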
Ollama: The User-Friendly Wrapper
And that brings us to Ollama. Ollama is, for all intents & purposes, a super user-friendly interface for Llama.cpp. It has done wonders for making local LLMs accessible to everyone. With a simple ollama run mistral command, you can have a powerful model running in seconds. It's brilliant.
But because it's built on top of Llama.cpp, it inherits all of its strengths & weaknesses. It's fantastic for ease of use, trying out different models quickly, & running on single GPUs or with CPU offload. But for high-performance, multi-GPU setups, it's just not the right tool for the job. You'll run into the same sequential processing limitations as you would with Llama.cpp. There really aren't any benchmarks for Ollama in a multi-GPU setting because that's not its intended use case.
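One handy sanity check, though: recent Ollama versions will tell you how much of a loaded model actually landed on the GPU versus the CPU. Something like:

    # start a model, then (in another terminal) see where it ended up
    ollama run mistral
    ollama ps    # the PROCESSOR column shows e.g. "100% GPU" or a CPU/GPU split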
What About ExLlamaV2? The Quantization Specialist
I can't finish this comparison without mentioning ExLlamaV2. It's another top-tier inference engine that, like vLLM, supports Tensor Parallelism for true multi-GPU speed. Where ExLlamaV2 really shines is with quantized models, specifically in its own EXL2 format.
Quantization is the process of reducing the precision of a model's weights to make it smaller & faster. ExLlamaV2 is incredibly good at this, allowing you to run very large models with minimal performance loss, often with better speed than other methods.
If you're working with quantized models, especially in lower VRAM environments where every gigabyte counts, ExLlamaV2 is a fantastic choice & a worthy competitor to vLLM.
Setting Up ExLlamaV2 for Multi-GPU
Similar to vLLM, setting up ExLlamaV2 for multiple GPUs is straightforward. After you've installed it, you can use the --gpu_split auto flag, & it will automatically distribute the model across your available GPUs.
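As a rough sketch, using the test_inference.py script from the ExLlamaV2 repo (the model directory & prompt are placeholders):

    # load an EXL2-quantized model & let ExLlamaV2 spread it across all visible GPUs
    python test_inference.py -m ./models/Llama-3-70B-4.0bpw-exl2 -p "Hello" --gpu_split auto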
The Bottom Line: Choose the Right Tool for the Job
So, after all that, what's the verdict?
For MAXIMUM multi-GPU performance, especially with batching: Use vLLM. Its implementation of Tensor Parallelism & PagedAttention is state-of-the-art & will give you the highest throughput. This is the choice for production systems & serious power users.
If you're using quantized models (EXL2) & want great multi-GPU speed: Give ExLlamaV2 a try. It's another excellent choice for Tensor Parallelism & is purpose-built for high-performance quantization.
If you need to offload layers to your CPU because you don't have enough VRAM: Llama.cpp is your only real choice. It's the champion of mixed CPU/GPU inference.
For ease of use, model testing, & single-GPU setups: Ollama is the winner. Its simplicity is unmatched for getting up & running quickly.
It's also worth thinking about how you manage customer interactions powered by these models. If you're building a business application, you'll need more than just a fast inference engine. You'll need a platform to manage conversations, provide instant support, & engage visitors. That's where a tool like Arsturn comes in. You can use these powerful open-source models as the brain, & then use Arsturn to build a custom AI chatbot trained on your own business data. This allows you to offer 24/7, instant customer support, answer questions, & even generate leads, all powered by the incredible speed of a well-optimized multi-GPU setup. Arsturn helps bridge the gap between raw model performance & a real-world business solution.
Hope this was helpful. It's a complex topic, but understanding the fundamental differences between these tools is the key to unlocking the true potential of your hardware. Don't cripple your multi-GPU rig with the wrong software! Let me know what you think.