8/12/2025

So you've got a beast of a machine with an NVIDIA RTX 4080S & 32GB of RAM, & you're ready to dive into the world of local Large Language Models (LLMs) with Ollama. That's awesome! Running these powerful AI models on your own hardware is a game-changer for privacy, customization, & cost. But here's the thing you might've already noticed: just because you have a powerful setup doesn't mean you're automatically getting the best performance.
Honestly, it can be a little frustrating when you know your hardware is capable of more, but your models are running slower than you'd like, especially when the context window starts to fill up. I've been there, tinkering with settings & wondering what I'm missing. Turns out, there are a TON of ways to squeeze every last drop of performance out of your 4080S & 32GB of RAM.
This guide is your deep dive into improving Ollama performance. We'll go through everything from the ground up, from understanding your hardware's strengths & weaknesses to the nitty-gritty of model quantization & advanced configuration. Let's get your Ollama setup running at lightning speed.

Your Hardware: The 4080S & 32GB RAM

First off, let's talk about the gear you're working with. The RTX 4080S is a powerhouse for AI. With its 16GB of VRAM & 10,240 CUDA cores, it's more than capable of handling some pretty hefty models. That 16GB of VRAM is your golden ticket for loading larger models directly onto the GPU, which is where you'll see the biggest speed boosts.
The 32GB of system RAM is also a crucial piece of the puzzle. While the GPU's VRAM is faster & what you'll primarily use for model inference, the system RAM comes into play when a model is too big to fit entirely on the GPU. In these cases, your system will offload some of the model's layers to the RAM, which is slower than VRAM but still gets the job done. With 32GB, you have a decent buffer for running larger models that might not fit entirely on your 4080S's VRAM.
So, you've got a great foundation. Now, let's start optimizing.

The Operating System Matters: Windows vs. Linux

This might come as a surprise to some, but the operating system you're running Ollama on can have a noticeable impact on performance. Several users have reported that Ollama runs significantly faster on Linux compared to Windows. We're talking about a potential 25% increase in inference speed, which is nothing to sneeze at.
Why the difference? It likely comes down to a few factors. Linux generally has lower system overhead than Windows, meaning more of your system's resources are available for Ollama to use. Additionally, the way Linux manages system resources & GPU drivers can be more efficient for AI workloads. Some have even noticed that running Ollama within the Windows Subsystem for Linux (WSL2) can offer a speed boost over running it directly on Windows.
If you're serious about getting the absolute best performance, and you're comfortable with it, dual-booting into a Linux distribution like Ubuntu is a solid option. You'll likely see a snappier, more responsive experience with your models.
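If you do go the Linux (or WSL2) route, getting started is quick. Here's a minimal sketch of the setup on Ubuntu, using the official install script from ollama.com; the nvidia-smi check assumes you already have NVIDIA's proprietary driver installed:

```bash
# Install Ollama with the official one-line installer
curl -fsSL https://ollama.com/install.sh | sh

# Confirm the driver & GPU are visible (you should see your 4080S & its 16GB of VRAM)
nvidia-smi

# Quick smoke test -- pulls the model automatically if it isn't there yet
ollama run llama3.1:8b "Hello!"
```

The same commands work inside WSL2, as long as the NVIDIA driver on the Windows side is up to date.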

Model Selection & Quantization: The Biggest Bang for Your Buck

This is, without a doubt, one of the most critical aspects of Ollama performance. The model you choose & how you use it will have a massive impact on speed. Here's what you need to know:

Not All Models Are Created Equal

There are a ton of different LLMs out there, & they all have different sizes, architectures, & performance characteristics. A 7-billion parameter model will be significantly faster than a 70-billion parameter model, but it might not be as capable for complex tasks.
For a 4080S with 16GB of VRAM, you're in a sweet spot. You can run some surprisingly large models, but you'll get the best performance with models that fit entirely within your VRAM. For instance, a quantized Llama 3.1 8B model will run exceptionally well, while a 30B+ model will likely need some of its layers offloaded to your system RAM, which will slow things down.
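As a concrete starting point, here's what pulling & running a model that fits comfortably in 16GB of VRAM looks like. The llama3.1:8b tag is from the Ollama model library at the time of writing; check ollama.com/library for whatever is current when you read this:

```bash
# Download the default (already quantized) 8B build -- around 5GB on disk
ollama pull llama3.1:8b

# Start an interactive chat session
ollama run llama3.1:8b

# See what you have installed locally
ollama list
```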

The Magic of Quantization

Here's where things get REALLY interesting. Quantization is a process that reduces the precision of a model's weights, which in turn makes the model smaller & faster. It's like compressing a file – you lose a little bit of quality, but the file size is much more manageable.
Most of the models you'll find on the Ollama hub are already quantized. You'll often see them with tags like q4_0 or q5_K_M. These refer to the quantization level. A lower number generally means a smaller, faster model, but with a potential trade-off in accuracy.
For your 4080S, you can experiment with different quantization levels. A Q4 or Q5 quantization is often a great balance of speed & quality. You might even be able to run a larger model with a more aggressive quantization level & still get great results. The key is to experiment & see what works best for your specific use case.
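In practice, you pick a quantization level by pulling a specific tag. The tag names below are illustrative of the pattern you'll see on a model's "Tags" page & may differ for the model you choose, so double-check before copying them:

```bash
# Default tag -- usually a Q4-class quant already
ollama pull llama3.1:8b

# More explicit tags let you trade size for quality (exact names vary per model)
ollama pull llama3.1:8b-instruct-q4_K_M   # smaller & faster
ollama pull llama3.1:8b-instruct-q8_0     # bigger & closer to full precision
```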

Understanding GGUF

You'll often see the term "GGUF" when looking at quantized models. GGUF stands for "GPT-Generated Unified Format," & it's the file format used for quantized models in the llama.cpp ecosystem, which Ollama is built on. It's designed to be fast & efficient, & it's what allows us to run these massive models on consumer hardware.
There are different types of GGUF quants, like the "K-quants" which are often a good choice for a balance of performance & quality. The good news is you don't need to be an expert in GGUF to use it. When you pull a model from the Ollama hub, it's already in the GGUF format & ready to go.
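If you ever grab a GGUF file directly (say, a fine-tune from Hugging Face), Ollama can import it with a simple Modelfile. A minimal sketch, assuming a hypothetical file named my-finetune-q4_K_M.gguf sitting in the current directory:

```bash
# Point a Modelfile at the local GGUF file
cat > Modelfile <<'EOF'
FROM ./my-finetune-q4_K_M.gguf
EOF

# Register it with Ollama under a name of your choosing, then run it
ollama create my-finetune -f Modelfile
ollama run my-finetune
```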
For businesses looking to leverage this technology, the ability to run customized, quantized models locally can be a game-changer. Imagine having a fine-tuned model for your specific business needs, running efficiently on your own hardware. This is where a platform like Arsturn can be incredibly valuable. Arsturn helps businesses build no-code AI chatbots trained on their own data, providing personalized customer experiences. By using an optimized local model, you can power your Arsturn chatbot for instant, private, & cost-effective customer support.

Offloading Layers to the CPU: When Models Are Too Big

So what happens when you want to run a model that's larger than your 16GB of VRAM? This is where offloading comes in. Ollama can automatically offload some of the model's layers to your system RAM. This is a fantastic feature that allows you to run models that would otherwise be out of reach.
However, there's a performance cost. Your system RAM is significantly slower than your GPU's VRAM, so you'll notice a drop in inference speed when offloading is happening. With 32GB of RAM, you have a good amount of space to offload to, but you'll want to strike a balance.
If you find that a model is running too slowly, you might be better off using a smaller, fully GPU-accelerated model, or a more aggressively quantized version of the larger model. It's a trade-off between model size/capability & speed.
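A quick way to tell whether offloading is happening is to check how Ollama has split a loaded model between the GPU & CPU. The exact output formatting varies a bit by version, but something along these lines works:

```bash
# Show loaded models; the PROCESSOR column indicates the GPU/CPU split
# (e.g. "100% GPU" vs. something like "24%/76% CPU/GPU")
ollama ps

# Watch VRAM usage in real time while the model is generating
watch -n 1 nvidia-smi
```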

Tweaking the Context Window: A Balancing Act

The context window is another crucial factor in Ollama performance. It's essentially the model's short-term memory – the amount of text it can "see" at once to understand the conversation. A larger context window is great for long, detailed conversations, but it also consumes more resources & can slow down inference.
Ollama's default context window is often 2048 tokens. For many tasks, this is perfectly fine. But if you're working with large documents or complex, multi-turn conversations, you might want to increase it.
You can adjust the context window size when you run a model in Ollama. For example, to set the context window to 4096 tokens, you can use the command /set parameter num_ctx 4096 within the Ollama CLI.
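The /set command only applies to the current session. If you want a larger context window to stick around, one option is to bake it into a custom model with a Modelfile (the llama3.1-8b-8k name below is just an example):

```bash
# Create a variant of the model with a persistent 8192-token context window
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 8192
EOF

ollama create llama3.1-8b-8k -f Modelfile
ollama run llama3.1-8b-8k
```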
Be mindful that a larger context window will use more VRAM. With your 4080S, you have some headroom to experiment with larger context windows, but keep an eye on your resource usage. If you notice a significant performance drop, you might need to dial it back.

Advanced Configuration & Environment Variables

For those who really want to get under the hood, there are a few environment variables you can use to fine-tune Ollama's performance.
  • OLLAMA_NUM_THREADS: This allows you to specify the number of CPU threads Ollama can use. If you're running a model that's partially on the CPU, adjusting this can help optimize performance.
  • OLLAMA_CUDA: This is typically enabled by default if you have an NVIDIA GPU, but it's worth making sure it's set to 1 to ensure GPU acceleration is active.
These are just a couple of examples, & the Ollama documentation has a more comprehensive list. For most users, the default settings are fine, but if you're an advanced user looking for that extra edge, it's worth exploring these options.
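How you set these depends on your platform, & the set of variables Ollama actually recognizes changes between versions, so treat the names above as examples & check the official docs for your release. A rough sketch of the mechanics:

```bash
# Linux (Ollama installed as a systemd service): add variables via an override
sudo systemctl edit ollama.service
#   In the editor, add lines like:
#   [Service]
#   Environment="OLLAMA_NUM_THREADS=8"
sudo systemctl restart ollama

# Windows (PowerShell): set the variable for your user, then restart Ollama
setx OLLAMA_NUM_THREADS 8
```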
For businesses, optimizing the performance of their AI solutions is paramount. A slow, unresponsive chatbot can be a frustrating experience for customers. This is another area where a platform like Arsturn shines. Arsturn helps businesses create custom AI chatbots that provide instant customer support, answer questions, & engage with website visitors 24/7. By ensuring your underlying Ollama models are finely tuned for performance, you can deliver a seamless & engaging customer experience with your Arsturn-powered chatbot.

Putting It All Together: A Practical Workflow

So, how do you apply all of this information to your 4080S & 32GB RAM setup? Here's a practical workflow you can follow:
  1. Start with the Right OS: If you're serious about performance, consider setting up a dual-boot with a user-friendly Linux distribution like Ubuntu.
  2. Choose Your Model Wisely: For your 4080S, start with a high-quality 7B or 8B parameter model, like a quantized version of Llama 3.1. These should fit comfortably in your VRAM & give you a great baseline for performance.
  3. Experiment with Quantization: Don't be afraid to try different quantization levels. A Q4 or Q5 quant is a great starting point. You might find that a more aggressively quantized larger model gives you the perfect balance of speed & capability.
  4. Monitor Your Resources: Keep an eye on your VRAM usage (see the quick check after this list). If a model is consistently offloading to your system RAM & running slowly, consider a smaller or more quantized alternative.
  5. Adjust the Context Window as Needed: Start with the default context window & only increase it if you find that your conversations are losing context. Remember that a larger context window will consume more resources.
  6. Tweak Advanced Settings (If You're Feeling Adventurous): If you've done all of the above & you're still looking for more, you can start to explore Ollama's environment variables.
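For step 4, a quick & dirty way to benchmark is to run a prompt with the --verbose flag, which prints timing stats (including an eval rate in tokens per second) after the response, & then check the GPU/CPU split:

```bash
# Rough throughput check -- look at the "eval rate" line in the printed stats
ollama run llama3.1:8b --verbose "Explain quantization in two sentences."

# Confirm the model is sitting fully on the GPU
ollama ps
```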

Final Thoughts

I hope this was helpful! Getting the most out of your 4080S & 32GB of RAM for Ollama is a process of experimentation & finding the right balance for your needs. There's no single "best" setting – it all depends on the models you're using & the tasks you're trying to accomplish.
The key is to understand the tools at your disposal – from your choice of OS to the intricacies of model quantization. By following the steps outlined in this guide, you'll be well on your way to a blazing-fast & incredibly powerful local LLM setup.
Let me know what you think! Have you found any other cool tricks for improving Ollama performance? I'd love to hear about them. Happy tinkering!
