Okay, so let's say you have a beast of a Mac with 32GB or even 64GB of RAM. You should be flying, right? But you're still seeing frustrating delays, especially with the really big 70B models. This one is a bit more of a deep cut.
It turns out, macOS is VERY aggressive about how it manages memory, especially VRAM (the memory on your GPU). Even if you tell Ollama to keep a model loaded in memory forever (via its keep-alive setting), macOS might decide it knows better. It can silently offload the model from the super-fast VRAM on the Apple Silicon chip back to your regular system RAM.
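For reference, the usual way to tell Ollama to keep a model resident indefinitely is the `OLLAMA_KEEP_ALIVE` environment variable (or a `keep_alive` value on individual API requests). A minimal sketch, assuming you start the server yourself from a terminal:

```bash
# Keep models loaded indefinitely instead of unloading after the default ~5 minutes.
# Set this in the environment that launches the Ollama server.
export OLLAMA_KEEP_ALIVE=-1
ollama serve
```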
When you go to ask your next question, there’s a delay of a few seconds while the system shuffles that massive model back into VRAM. It's not a huge delay, but it's enough to make the experience feel laggy & unresponsive. You might notice your memory usage spike, then quickly drop after you get a response – that's macOS "helpfully" clearing out the VRAM.
You can actually tell macOS to dedicate more of your unified memory to the GPU. This makes it less likely to offload your Ollama model.
You'll need to open up the Terminal app & run a command. Be careful here, as you're changing a system setting.
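As a rough sketch (assuming an Apple Silicon Mac on macOS Sonoma or later, where the relevant knob is the `iogpu.wired_limit_mb` sysctl; the 24576 value below is just an example for a 32GB machine), it looks something like this:

```bash
# Allow the GPU to wire up to 24GB of unified memory (example for a 32GB Mac).
# Leave a healthy margin (6-8GB) for macOS itself and your other apps.
sudo sysctl iogpu.wired_limit_mb=24576

# Check the current value at any time:
sysctl iogpu.wired_limit_mb
```

On older macOS versions (Ventura and earlier) the equivalent key is reported to be `debug.iogpu.wired_limit`, and either way the change reverts to the default when you reboot, so you'll need to re-run it (or automate it) after a restart.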