8/12/2025

So, you've dived headfirst into the world of local LLMs with Ollama. It's pretty amazing, right? Running powerful models like Llama 3, Phi-3, & others right on your own machine is a game-changer. But as you start to use it more, especially for projects that others rely on, you'll quickly realize that you're flying blind. Is it running slow? Is it eating up all your GPU memory? Which models are being used the most?
Honestly, without proper monitoring, you're just guessing. That's where Prometheus comes in. It's the de-facto standard for metrics-based monitoring in the open-source world. By pulling in data from Ollama, you can get a clear picture of what's going on under the hood.
In this guide, I'm going to walk you through everything you need to know about monitoring Ollama with Prometheus. We'll cover why it's so important, the different ways to get those juicy metrics, how to set up Prometheus to scrape them, & how to build some slick dashboards in Grafana to visualize it all. This is going to be a deep dive, so grab a coffee & let's get started.

Why Bother Monitoring Ollama in the First Place?

Look, if you're just tinkering with Ollama on your personal machine for fun, you probably don't need a full-blown monitoring setup. But the moment you start building applications on top of it, or sharing your Ollama server with your team, monitoring becomes non-negotiable. Here's why:
  • Performance Optimization: You can't fix what you can't see. Is your LLM running slower than you'd like? Monitoring will tell you if it's a problem with prompt evaluation time, token generation speed, or something else entirely. Maybe you're running a model that's too big for your hardware, & you need to scale down. The metrics will tell the story.
  • Resource Management: LLMs are hungry for resources, especially GPU memory. If you're running multiple models, you need to know which ones are taking up the most VRAM. You can also track CPU & memory usage to make sure your server isn't about to fall over.
  • Cost Management (Even on Local Hardware): "Cost" isn't just about dollars & cents. It's about the "cost" of your hardware resources. If you're running Ollama on a shared server, you need to know who's using it & how much. This is especially true in a business context where you need to justify hardware allocation.
  • Reliability & Availability: Is your Ollama server even up & running? Are users getting errors? Monitoring can alert you the moment your server goes down or starts throwing a bunch of errors, so you can fix it before anyone even notices.
  • Usage Patterns & Insights: Which models are the most popular? What's the average number of tokens per request? This kind of information is GOLD for understanding how your LLM is being used & for making decisions about which models to support in the future.

How to Get Metrics from Ollama for Prometheus

Alright, so you're sold on the "why". Now for the "how". Turns out, Ollama doesn't have a built-in `/metrics` endpoint for Prometheus out of the box (at least, not yet!). But don't worry, the community has come up with a few clever ways to get the data we need.

Method 1: The Transparent Metrics Proxy

This is a pretty neat solution I came across on Reddit. Someone built a small, transparent proxy that sits between your application & the Ollama server. Your app makes requests to the proxy, which then forwards them to Ollama. As the requests & responses pass through, the proxy collects a bunch of useful metrics & exposes them on a Prometheus-compatible `/metrics` endpoint.
Pros:
  • Zero client-side changes: You don't have to modify your application code at all. You just point it to the proxy's address instead of Ollama's.
  • Easy to set up: It's a standalone tool, so you just need to run it.
Cons:
  • Another moving part: You have to manage & monitor the proxy itself. If it goes down, your whole setup goes down.
  • Limited to what it's programmed to collect: You're reliant on the metrics the proxy author decided to include.
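To make the idea concrete, here's a minimal sketch of such a proxy — not the actual Reddit project, just an illustration of the pattern. It assumes Flask, requests, & prometheus_client are installed, that Ollama is listening on its default address (http://localhost:11434), & it buffers responses rather than streaming them, to keep things simple:

```python
import time

import requests
from flask import Flask, Response, request
from prometheus_client import Counter, Histogram, generate_latest

OLLAMA_URL = "http://localhost:11434"  # Ollama's default listen address (assumed)

app = Flask(__name__)

REQUESTS = Counter(
    "ollama_proxy_requests_total",
    "Requests forwarded to Ollama",
    ["path", "status"],
)
LATENCY = Histogram(
    "ollama_proxy_request_seconds",
    "End-to-end request latency through the proxy",
    ["path"],
)


@app.route("/metrics")
def metrics():
    # Prometheus scrapes this endpoint.
    return Response(generate_latest(), mimetype="text/plain")


@app.route("/<path:path>", methods=["GET", "POST"])
def proxy(path):
    # Forward everything else to Ollama & time the round trip.
    start = time.time()
    upstream = requests.request(
        method=request.method,
        url=f"{OLLAMA_URL}/{path}",
        json=request.get_json(silent=True),
        timeout=600,
    )
    LATENCY.labels(path=path).observe(time.time() - start)
    REQUESTS.labels(path=path, status=upstream.status_code).inc()
    return Response(
        upstream.content,
        status=upstream.status_code,
        content_type=upstream.headers.get("Content-Type"),
    )


if __name__ == "__main__":
    # Point your app at http://localhost:11435 instead of Ollama directly.
    app.run(port=11435)
```

Real implementations handle streaming responses & parse Ollama's response bodies for token counts, but the shape is the same: sit in the middle, count things, expose `/metrics`.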

Method 2: The "Ollama Monitor" Python Script

There's a cool project on GitHub called `Ollama_monitor` that's a Python script designed to test & monitor an Ollama server. It can check endpoints, run load tests, &, most importantly for us, export Prometheus metrics. You configure it with a `config.yaml` file, telling it which endpoints to check & what to expect.
Pros:
  • Good for health checks: It's great for making sure your Ollama server is up & responding correctly to requests.
  • Customizable: You can define the endpoints & expected responses in the config file.
Cons:
  • More of a "health checker" than a deep metrics collector: It's not going to give you detailed performance metrics like token generation speed or GPU usage.
  • Requires Python & dependencies: You'll need a Python environment to run it.
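Purely to illustrate the kind of thing such a config expresses, here's a hypothetical example. The keys below are made up — the project's actual schema lives in its README — but the Ollama endpoints themselves (`/api/tags`, `/api/generate`) are real:

```yaml
# Hypothetical illustration only -- NOT the real Ollama_monitor schema.
# The idea: list which Ollama endpoints to probe & what a healthy response looks like.
base_url: http://localhost:11434
endpoints:
  - path: /api/tags            # lists installed models
    method: GET
    expected_status: 200
  - path: /api/generate        # quick generation smoke test
    method: POST
    body: { "model": "llama3", "prompt": "ping", "stream": false }
    expected_status: 200
check_interval_seconds: 60
```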

Method 3: OpenTelemetry with OpenLIT

This is, in my opinion, the most robust & future-proof way to monitor Ollama. OpenTelemetry is an open-source observability framework backed by the Cloud Native Computing Foundation (CNCF). It's becoming the standard for instrumenting applications to collect metrics, traces, & logs.
OpenLIT is a project that makes it SUPER easy to use OpenTelemetry with LLM applications. It automatically instruments your Python code, so you don't have to manually add a bunch of monitoring code yourself. It's compatible with the Ollama Python SDK (version `0.2.0` or higher), so it's a natural fit.
Here's how it works:
  1. You add a couple of lines of code to your Python application to initialize OpenLIT.
  2. OpenLIT automatically instruments your Ollama API calls. This means it wraps them in code that collects a ton of useful data.
  3. This data is then sent to an OpenTelemetry Collector. The Collector is a powerful tool that can receive telemetry data in various formats, process it, & then export it to different backends, including... you guessed it... Prometheus!
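Here's a minimal sketch of what that Collector config might look like. It assumes you're running the Collector's contrib distribution (which ships the `prometheus` exporter) & that port 8889 is free; the ports & filenames are just conventions, so adjust to taste:

```yaml
# otel-collector-config.yaml -- minimal sketch, adjust ports/paths as needed
receivers:
  otlp:
    protocols:
      grpc:            # OpenLIT can send OTLP over gRPC (default port 4317)...
      http:            # ...or over HTTP (default port 4318)

processors:
  batch: {}

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # Prometheus will scrape this address

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```

You'd then start the Collector with something like `otelcol-contrib --config otel-collector-config.yaml`.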
Pros:
  • Incredibly detailed metrics: You get a rich set of metrics out of the box, including performance data, token usage, & more.
  • Standardized: OpenTelemetry is a widely adopted standard, so you're not locking yourself into a proprietary solution.
  • Extensible: You can collect traces & logs in addition to metrics, giving you a complete observability solution.
  • Future-proof: As Ollama evolves, the OpenTelemetry community will likely keep up, ensuring you'll always have a way to monitor it.
Cons:
  • A little more complex to set up: You need to run an OpenTelemetry Collector in addition to Prometheus. But trust me, it's worth it.
  • Requires modifying your application code: You have to add a few lines of code to your Python app. But it's a small price to pay for what you get in return.
For the rest of this guide, we're going to focus on the OpenTelemetry approach. It's the most powerful & flexible option, & it's the one I'd recommend for any serious Ollama deployment.

A Deep Dive into Monitoring Ollama with OpenTelemetry, Prometheus, & Grafana

Okay, let's get our hands dirty. Here's a step-by-step guide to setting up a full-blown Ollama monitoring stack.

Step 1: Get Your Tools in Order

First things first, you'll need to have a few things installed & running:
  • Ollama: I'm assuming you already have this up & running. If not, head over to the Ollama website & get it installed.
  • Prometheus: This is our monitoring mothership. If you don't have it, you can find the installation instructions on the Prometheus website. (We'll sketch a minimal scrape config right after this list.)
  • Grafana: This is where we'll build our beautiful dashboards. You can grab it from the Grafana website.
  • OpenTelemetry Collector: This will act as the middleman between our Ollama application & Prometheus. We'll download this in a bit.
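Prometheus works on a pull model, so it has to be told where to scrape. Assuming the Collector sketch from earlier (Prometheus exporter listening on port 8889), a minimal scrape job might look like this — the job name & interval are just placeholders:

```yaml
# prometheus.yml -- minimal sketch; point the target at wherever your Collector runs
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "ollama-otel-collector"
    static_configs:
      - targets: ["localhost:8889"]   # the Collector's Prometheus exporter endpoint
```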

Step 2: Instrument Your Ollama Application with OpenLIT

This is where the magic starts. We need to tell our Python application to start collecting telemetry data.
  1. Install the necessary Python libraries:
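     Assuming you're using the official Ollama Python SDK, that's roughly:

```
pip install openlit ollama
```

  2. Initialize OpenLIT in your application. Here's a minimal sketch, assuming the Collector from earlier is listening for OTLP over HTTP on port 4318 & that you've already pulled the model you reference:

```python
# Minimal sketch -- the Collector endpoint & model name are assumptions for this example.
import openlit
import ollama

# Point OpenLIT at the OTel Collector's OTLP/HTTP endpoint.
openlit.init(otlp_endpoint="http://127.0.0.1:4318")

# From here on, Ollama SDK calls are instrumented automatically.
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])
```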
