8/12/2025

What Hardware Can I Run Ollama On? A Guide for Different Setups

So, you've decided to dive into the world of running large language models (LLMs) locally with Ollama. That's awesome! It's a total game-changer for privacy, offline use, & just tinkering with AI without those pesky API bills. But then comes the big question: "What kind of rig do I actually need to run this thing?"
Honestly, the answer is "it depends," which I know isn't super helpful. But stick with me. The hardware you need really comes down to what you want to do. Are you looking to run smaller models for simple tasks, or are you aiming to tame the 70-billion-parameter beasts?
I've spent a TON of time experimenting with Ollama on different setups, from beefy servers to my trusty MacBook, & I'm here to break it all down for you. We'll get into the nitty-gritty of CPUs, GPUs, RAM, & even running Ollama in virtual machines.

The Bare Minimum: What You Need to Get Started

Let's start with the basics. The official Ollama documentation gives us a pretty good starting point. At a minimum, you'll need:
  • Operating System: macOS 11 Big Sur or later, or a modern Linux distribution like Ubuntu 18.04 or later. Windows is also supported with a native installer, or through WSL2 (Windows Subsystem for Linux) if you prefer.
  • RAM: 8GB is the absolute minimum, & per the official docs that's enough for 7B models (3B models run with room to spare). For 13B models you'll want at least 16GB, & for the 33B class you're looking at 32GB.
  • Disk Space: You'll need some space for the Ollama installation itself, plus room for the models you download — roughly 2GB for a small quantized 3B model up to 40GB+ for a quantized 70B one. A 512GB SSD is a comfortable starting point, but you might want more if you plan on collecting a lot of models.
Now, you might be thinking, "That's it?" & yeah, you can get Ollama up & running on some pretty modest hardware. I've seen people run it on older laptops & even Raspberry Pis! But performance is a whole other story.
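
If you're on a Linux machine that clears those minimums & want to kick the tires right away, installation is a one-liner (on macOS & Windows you just grab the installer from ollama.com). A minimal sketch — the model tag here is just an example of a small 3B model:

# install Ollama on Linux with the official install script
curl -fsSL https://ollama.com/install.sh | sh
# pull & chat with a small 3B model that runs comfortably in 8GB of RAM
ollama run llama3.2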

The Great Debate: CPU vs. GPU

Here's where things get interesting. You can totally run Ollama on just your CPU. It works! But if you want any kind of decent speed, a GPU is HIGHLY recommended. The difference is night & day.

Running Ollama on a CPU

If you're running Ollama on a CPU-only system, the most important factor is memory bandwidth. The more memory channels your CPU has, the better it will perform. One user on the Level1Techs forums did some testing & found that performance scaled almost linearly with the number of memory channels. So, a high-end desktop CPU with dual-channel RAM will be faster than a laptop CPU with single-channel RAM.
Here's a rough idea of what to expect from a CPU-only setup (exact numbers depend heavily on the model size & quantization you're running):
  • Intel Core i7-1355U (10 cores): Around 7.5 tokens per second.
  • AMD Ryzen 5 4600G (6 cores): Around 12.3 tokens per second.
As you can see, it's not exactly lightning-fast. For simple prompts, it's usable. But for more complex tasks or longer responses, you'll be waiting a while.
The main advantage of a CPU-only setup is that you can load much larger models if you have a lot of RAM. If you've got a workstation with 128GB or 256GB of RAM, you can comfortably run 70B-class models, & with enough memory even heavily quantized versions of the 405B-parameter giants that would never fit into the VRAM of a consumer GPU. Just don't expect them to be quick about it.
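Want to see what your own CPU manages? Ollama can print its own timing stats: run a model with the --verbose flag & it reports an "eval rate" in tokens per second after each response, which is the number people quote in benchmarks like the ones above. A quick sketch (the model tag & prompt are just examples):

# run a single prompt & print timing stats, including eval rate (tokens/s)
ollama run llama3.2 --verbose "Explain memory bandwidth in one paragraph."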

The Power of the GPU

Now, let's talk about GPUs. This is where you get the REAL speed. The parallel processing power of a GPU is perfect for the kind of math that LLMs do. Even a modest GPU will give you a significant performance boost over a CPU.
The key here is VRAM (video RAM). You want to be able to fit the entire model into your GPU's VRAM. If the model is too big, it will have to be split between your VRAM & your system RAM, which will slow things down considerably.
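
An easy way to check whether a model actually fits is to load it & then run ollama ps from another terminal: the PROCESSOR column tells you how much of the model landed on the GPU & how much spilled into system RAM. A sketch (model tag is just an example):

# load a model...
ollama run llama3.1:8b
# ...then, in a second terminal, see where it ended up
ollama ps
# PROCESSOR will read something like "100% GPU" (fully in VRAM)
# or "35%/65% CPU/GPU" (partially offloaded & noticeably slower)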

NVIDIA GPUs: The King of the Hill

When it comes to AI, NVIDIA is the undisputed king. Their CUDA platform is the industry standard, & it's what most AI software, including Ollama, is optimized for.
Here's a look at how some different NVIDIA GPUs perform with Ollama:
  • NVIDIA RTX 4090 (24GB VRAM): This is the top-of-the-line consumer GPU, & it's a beast for running LLMs. You can expect incredibly fast performance, even with larger models.
  • NVIDIA RTX 4080 (16GB VRAM): A great all-around card that offers a good balance of price & performance.
  • NVIDIA RTX 3080 (10GB VRAM): Still a very capable card that can handle medium-sized models with ease.
  • NVIDIA RTX 4060 (8GB VRAM): A good budget-friendly option for running smaller models. You can expect around 40-50 tokens per second with 7B models.
  • Older NVIDIA GPUs: You don't necessarily need the latest & greatest. People have reported good performance with older cards like the GTX 1080, & even claims that much older cards like the NVS 510 beat CPU-only inference. Just keep in mind that current Ollama builds officially require CUDA compute capability 5.0 or newer, so don't count on anything pre-Maxwell being picked up.
If you're serious about running LLMs locally, an NVIDIA GPU is a solid investment.
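
One practical tip, whichever NVIDIA card you end up with: while a model is loaded, nvidia-smi shows the ollama process & how much VRAM it's using, which is the quickest way to confirm the GPU is actually doing the work (& whether you've got headroom for a bigger model or a lighter quantization).

# confirm Ollama is on the GPU & see how much VRAM the loaded model takes
nvidia-smi
# if no ollama process shows up, check the server logs (Linux systemd install):
journalctl -u ollama -e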

AMD GPUs: A Worthy Contender

AMD has been making a big push into the AI space, & their GPUs are becoming a viable alternative to NVIDIA. Ollama has support for AMD GPUs through ROCm, which is AMD's open-source platform for GPU computing.
Here's what you need to know about running Ollama on an AMD GPU:
  • Performance: AMD GPUs can offer excellent performance, especially for the price. The RX 7900 XTX (24GB VRAM) is a strong competitor to the RTX 4090, & the RX 7800 XT (16GB VRAM) is a great mid-range option. One user reported getting around 44 tokens per second with a 13B model on a 7900 XTX.
  • Setup: Getting AMD GPUs to work with Ollama can be a bit more involved than with NVIDIA. Official ROCm support covers a specific list of Radeon & Radeon Pro cards, & Linux is generally the smoother path — ROCm support on Windows is still a bit hit-or-miss. (For unsupported cards on Linux, there's a common workaround sketched just after this list.)
  • The Future is Bright: AMD is actively improving its AI software stack, so we can expect performance & compatibility to get even better over time.
If you're on a budget or you're a fan of Team Red, an AMD GPU can be a great choice for your Ollama setup.
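
One Linux-specific workaround worth knowing about (it's an escape hatch, not a guarantee): if your Radeon card isn't on the officially supported ROCm list, you can sometimes get it going by telling Ollama to treat it as a nearby supported LLVM target via the HSA_OVERRIDE_GFX_VERSION environment variable. The value below is just an example for RDNA2-class cards — yours may differ:

# force ROCm to treat a close-but-unsupported RDNA2 card as gfx1030
# for a systemd install, add this under [Service] via: systemctl edit ollama.service
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
# or, when launching the server by hand:
HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve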

Apple Silicon: The New Kid on the Block

Apple's M-series chips (M1, M2, M3) have been a game-changer for Mac users. Their unified memory architecture, which allows the CPU & GPU to share the same pool of memory, is a HUGE advantage for running LLMs.
Here's why Apple Silicon is so great for Ollama:
  • Performance: Even the base M1 chip can offer surprisingly good performance. One user with an M1 MacBook Air with 8GB of RAM reported that it was "quite impressive." The higher-end M3 Max chips are even more capable, with one user reporting speeds of around 40 tokens per second with a 3B model.
  • Ease of Use: Ollama is incredibly easy to set up on a Mac. You just download the app, & you're good to go. On Apple Silicon it automatically accelerates inference on the GPU through Apple's Metal API — no drivers, no configuration.
  • Unified Memory: This is the real magic. Because the CPU & GPU share the same memory, you don't have to worry about a model fitting into a separate pool of VRAM. If you have a Mac with 32GB of unified memory, you can run models that are much larger than what you could fit on a dedicated GPU with less VRAM. (macOS does reserve a slice of that memory for the system, so not every last gigabyte is available to the model, but it's still a huge practical win.)
If you're a Mac user, you're in for a treat. Ollama on Apple Silicon is a fantastic experience.
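
As a quick sketch of just how little setup there is (Homebrew is optional — downloading the app from ollama.com gets you the same thing):

# install the CLI via Homebrew & start the background server
brew install ollama
brew services start ollama
# run a model; on Apple Silicon the GPU is used automatically via Metal
ollama run llama3.2
# sanity check: this should report the model as "100% GPU"
ollama ps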

Running Ollama in a Virtualized Environment

For those of us who like to keep our systems clean & organized, running Ollama in a virtual machine or a Docker container is a great option.

Docker

Running Ollama in a Docker container is super convenient, especially for deployment & reproducibility. However, there are a few things to keep in mind:
  • Performance: There can be a slight performance hit when running in Docker compared to a bare-metal installation, but it's usually marginal.
  • GPU Passthrough: Getting GPU acceleration to work in a Docker container can be a bit tricky. On Linux, you'll need to install the NVIDIA Container Toolkit & hand the GPUs to the container with the --gpus flag (see the example just after this list). On Windows, Docker Desktop with WSL2 makes it pretty easy to enable GPU access.
  • Macs & Docker: Here's a big one: Docker on Mac does NOT currently support GPU passthrough. This means that if you run Ollama in a Docker container on a Mac, it will only use the CPU, which will be MUCH slower. So, if you're on a Mac, it's best to run Ollama natively.
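
Here's what the NVIDIA flavor of that looks like in practice, using the standard docker run pattern for the official ollama/ollama image (this assumes the NVIDIA Container Toolkit is already installed on the host):

# start the Ollama container with access to all GPUs & a persistent model volume
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# run a model inside the container
docker exec -it ollama ollama run llama3.2
# AMD users: there's a ROCm build of the image instead
docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm

The -v ollama:/root/.ollama volume is what keeps your downloaded models around between container restarts, so don't skip it.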

Proxmox & Other Hypervisors

If you're running a home lab with a server, you might be tempted to run Ollama in a virtual machine on a hypervisor like Proxmox. This is a great way to isolate your AI experiments from the rest of your server.
The key to getting good performance in a VM is GPU passthrough. This allows the virtual machine to have direct access to a physical GPU, just as if it were installed in a bare-metal machine.
Setting up GPU passthrough can be a bit of a process, but it's well worth it if you want to run Ollama with GPU acceleration in a virtualized environment.
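
The exact steps depend on your motherboard & GPU, but on a typical Intel-based Proxmox host the broad strokes look something like this (a rough sketch, not a definitive guide — AMD hosts use amd_iommu, & many setups also need the host drivers blacklisted or the card's IDs bound to vfio-pci):

# 1. enable IOMMU in /etc/default/grub on the Proxmox host (Intel example):
#      GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
update-grub
# 2. load the VFIO modules by adding these lines to /etc/modules:
#      vfio
#      vfio_iommu_type1
#      vfio_pci
reboot
# 3. after the reboot, verify IOMMU is active
dmesg | grep -e DMAR -e IOMMU
# 4. in the Proxmox UI: VM -> Hardware -> Add -> PCI Device, pick the GPU,
#    & tick "All Functions" + "PCI-Express" (use the q35 machine type)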

The Role of Quantization

We can't talk about hardware without talking about quantization. In simple terms, quantization is a way of making a model smaller by reducing the precision of its weights. This means that you can run larger models on hardware with less VRAM, & it can also make inference faster.
Ollama makes it super easy to use quantized models. When you pull a model, you can specify the quantization level you want to use. For example,
ollama run llama3:8b-instruct-q4_0
will pull a 4-bit quantized version of the Llama 3 8B model.
Here's a quick rundown of the different quantization levels:
  • FP16: This is the full-precision model, with no quantization. It offers the best quality but requires the most VRAM.
  • Q8: An 8-bit quantization that offers a good balance of quality & performance.
  • Q4: A 4-bit quantization that is very popular for running models on consumer hardware. The quality is still very good for most tasks, & the performance gains can be significant.
  • Q2: A 2-bit quantization that is the most aggressive. The quality can be noticeably lower, but it allows you to run models on very low-resource hardware.
The right quantization for you will depend on your hardware & your needs. If you have a powerful GPU with a lot of VRAM, you might want to stick with FP16 or Q8 for the best quality. But if you're on a more modest setup, Q4 is a great choice.
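
To put rough numbers on those levels, a back-of-the-envelope estimate of a model's footprint is simply parameters × bytes per weight (real GGUF files run a bit larger, & you still need a couple of GB of headroom for the KV cache & runtime overhead). For an 8B model:

FP16: 8B params × 2.0 bytes  ≈ 16 GB
Q8:   8B params × 1.0 byte   ≈  8 GB
Q4:   8B params × 0.5 byte   ≈  4 GB
Q2:   8B params × 0.25 byte  ≈  2 GB

That's why an 8B model at Q4 sits comfortably on an 8GB card while FP16 needs workstation-class VRAM. You can check the real numbers for any tag with ollama show, which reports the parameter count & quantization.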

Putting It All Together: What Should You Get?

So, after all that, what hardware should you actually get for Ollama? Here are a few recommendations for different budgets & needs:
  • The Budget-Friendly Build:
    • CPU: A modern 6-core CPU from Intel or AMD.
    • RAM: 16GB of DDR4 RAM.
    • GPU: An older NVIDIA GPU with at least 8GB of VRAM, like a GTX 1070 or a Quadro P4000.
    • Storage: A 512GB SSD.
  • The Mid-Range Powerhouse:
    • CPU: A recent 8-core CPU from Intel or AMD.
    • RAM: 32GB of DDR5 RAM.
    • GPU: An NVIDIA RTX 4060 Ti (16GB) or an AMD RX 7800 XT (16GB).
    • Storage: A 1TB NVMe SSD.
  • The "No Compromises" Beast:
    • CPU: A high-end CPU like an Intel Core i9 or an AMD Ryzen 9.
    • RAM: 64GB or more of DDR5 RAM.
    • GPU: An NVIDIA RTX 4090 (24GB).
    • Storage: A 2TB or larger NVMe SSD.

Don't Forget the Software!

Of course, hardware is only half of the equation. Once you've got your rig set up, you'll want to start building cool stuff with it. And if you're looking to create custom AI chatbots for your website or business, you should definitely check out Arsturn.
Arsturn is a no-code platform that lets you build custom AI chatbots trained on your own data. You can use it to provide instant customer support, answer questions, & engage with your website visitors 24/7. It's a great way to leverage the power of local LLMs to create a more personalized & interactive experience for your users.
Hope this was helpful! Let me know what you think, & feel free to share your own Ollama hardware experiences in the comments.

Copyright © Arsturn 2025