Okay, so you've got the RAG concept down. Now, let's talk about the hardware. If you're serious about this, you're going to want more than one GPU. Why? Because the models you'll be using are BIG. We're talking billions of parameters. Running them on a single GPU is possible, but it can be slow, especially if you want to use the really good models.
So, what kind of hardware are we talking about? A great real-world example comes from a user on Reddit who was building a similar system. They were using an AMD Threadripper CPU (great for lots of PCIe lanes), a motherboard that could run multiple GPUs at full bandwidth, and a couple of NVIDIA RTX 3090s, each with 24GB of VRAM. This is a solid setup that can handle some serious AI workloads.
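Before you start loading models, it's worth a quick sanity check that the OS actually sees both cards. Here's one way to do it, assuming you have a CUDA build of PyTorch installed (any recent version will do):

```python
# Quick sanity check that both GPUs are visible.
# Assumes PyTorch with CUDA support is installed.
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB VRAM")
```

On the dual-3090 build described above, you'd expect this to report two devices with roughly 24 GB each.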
But the hardware is only half the battle. You also need the right software to make all those GPUs work together. This is where things get interesting. A lot of people start with a tool called llama.cpp. It's fantastic for running models on a single GPU, or even for offloading some layers to the CPU if you're short on VRAM. But, and this is a big "but," it's not the best choice for a true multi-GPU setup.
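To make that offloading idea concrete, here's a minimal sketch using the llama-cpp-python bindings. The model path is a placeholder, and the right layer count depends entirely on your model and how much VRAM you have:

```python
# Minimal layer-offloading sketch via llama-cpp-python
# (pip install llama-cpp-python, built with CUDA support).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=35,  # layers to push onto the GPU; -1 means "as many as fit"
    n_ctx=4096,       # context window size
)

output = llm("Q: What is retrieval-augmented generation? A:", max_tokens=128)
print(output["choices"][0]["text"])
```

The `n_gpu_layers` knob is the whole trick: whatever doesn't fit on the GPU stays on the CPU, so you trade speed for the ability to run models bigger than your VRAM.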
For true multi-GPU work, you're going to want to look at something like vLLM. vLLM is designed from the ground up for distributed inference, which is a fancy way of saying it's really, really good at splitting a model across multiple GPUs and making them work together efficiently. It uses a technique called "tensor parallelism": each layer's weight matrices get sharded across the GPUs, so every card computes its slice of every token in parallel, and that's where the massive performance boost comes from.
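Here's roughly what that looks like in code, using vLLM's offline Python API. The model name is just an example; pick anything that fits in 2x24GB:

```python
# Sketch of tensor parallelism with vLLM's offline Python API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model, not a requirement
    tensor_parallel_size=2,  # shard each layer's weights across 2 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```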
Setting up vLLM is surprisingly straightforward. You can run it as a server and then send requests to it from your RAG pipeline. A typical command might look something like this (the model name below is just an example):
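```bash
# Example only: swap in the model you actually want to serve.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90
```

That spins up an OpenAI-compatible API server (on port 8000 by default), so your RAG pipeline can talk to it with any standard OpenAI client pointed at http://localhost:8000/v1.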