Setting up metrics to monitor your applications is crucial in today's fast-paced tech environment. For those diving into the world of AI models, integrating Ollama with Prometheus can unlock new insights into your model's performance, resource utilization, and overall efficiency.
In this blog post, we’ll walk through the steps of setting up Ollama with Prometheus, allowing you to monitor vital metrics like GPU usage, request counts, and latency. By the end of this, you’ll have a fully functional metrics monitoring setup that can offer deeper insights into your LLM (Large Language Model) operations.
Why Ollama and Prometheus?
Ollama, an easy-to-use tool for running AI models locally, simplifies the process of deploying and managing large language models. When combined with Prometheus, a leading monitoring tool, you can gather metrics effortlessly, visualize data trends, and set alerts for optimal performance. This combination is particularly useful if you work with AI applications that need to handle high loads and require timely insights for continuous improvements.
What You'll Need
Before we get started, make sure you have the following ready:
- Prometheus installed on your local machine. You can find installation instructions in the Prometheus documentation.
- Familiarity with the command line and basic terminal commands.
- (Optional) A Grafana setup to visualize your metrics. You can learn more about Grafana in their documentation.
Step 1: Install Ollama and Create a Simple Model
Ensure you have Ollama running correctly on your machine. To quickly install Ollama, you can run:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
To create a model, use the command:
```bash
ollama create mymodel -f ./Modelfile
```
Make sure the Modelfile exists and contains the required configuration.
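As a minimal sketch, a Modelfile might look like the following; the base model and parameter values here are assumptions, so adjust them to your needs:

```plaintext
FROM llama3
PARAMETER temperature 0.7
SYSTEM "You are a concise, helpful assistant."
```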
You can run any AI model you want thereafter. For example:
```bash
ollama run model_name
```
This command will launch your model locally.
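To generate some traffic that will later show up in request-count and latency metrics, you can also call the Ollama API directly. This sketch assumes the llama3 model has already been pulled:

```bash
# Send a single non-streaming generation request to the local Ollama API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'
```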
Step 2: Add Prometheus Metrics Endpoint in Ollama
To allow Prometheus to scrape metrics from your Ollama instance, you need to enable a metrics endpoint. As discussed on GitHub, the Ollama team has been working on a /metrics endpoint that exposes various metrics such as GPU utilization, memory utilization, and request counts.
You can follow along with community discussions and enhancement requests here: GitHub Issue on Metrics Endpoint.
Ensure you have the latest version of Ollama that supports this endpoint. You can check and update your version as necessary. Once your server is running, you should be able to access metrics via:
```plaintext
http://localhost:11434/metrics
```
Run the following command to confirm the metrics endpoint works correctly:
```bash
curl http://127.0.0.1:11434/metrics
```
You should receive a list of metrics currently collected by Ollama.
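The response uses the standard Prometheus text exposition format. As a purely illustrative sketch (the metric name below is hypothetical; the actual names depend on your Ollama version), a counter might look like this:

```plaintext
# HELP ollama_requests_total Total number of requests handled (hypothetical metric name)
# TYPE ollama_requests_total counter
ollama_requests_total 42
```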
Step 3: Configure Prometheus to Scrape the Ollama Metrics
Next, we’ll need to configure Prometheus to scrape metrics from your Ollama instance. Create a configuration file named prometheus.yml with a scrape job pointing at the Ollama metrics endpoint.
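Here is a minimal sketch of what that file might contain, assuming Ollama serves metrics on its default port 11434 at the default /metrics path:

```yaml
# prometheus.yml — minimal scrape configuration for a local Ollama instance
global:
  scrape_interval: 15s        # how often Prometheus scrapes its targets

scrape_configs:
  - job_name: "ollama"
    # metrics_path defaults to /metrics, so it is omitted here
    static_configs:
      - targets: ["localhost:11434"]
```

Step 4: Run Prometheus
Start Prometheus and point it at this configuration file. Assuming you are using the standalone binary from the directory where you extracted it, that looks like:

```bash
./prometheus --config.file=prometheus.yml
```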
You should see logs indicating that Prometheus is running and scraping the metrics from Ollama. Access the Prometheus dashboard by visiting:
```plaintext
http://localhost:9090
```
Here, you can explore, query, and visualize all scraped metrics from both Ollama and any other configured targets.
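To experiment with a query, you can enter PromQL in the expression box. The metric name in this example is hypothetical; replace it with one that actually appears in your Ollama /metrics output:

```plaintext
rate(ollama_requests_total[5m])
```

This expression shows the per-second request rate averaged over the last five minutes.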
Step 5: Visualize Metrics with Grafana (Optional)
To enhance your monitoring experience, we can visualize these metrics using Grafana.
Install Grafana and run it.
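One common way to do this, assuming Docker is available on your machine, is to run the official Grafana image:

```bash
# Start Grafana in the background on port 3000
docker run -d --name=grafana -p 3000:3000 grafana/grafana
```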
Log into the Grafana dashboard by navigating to:
```plaintext
http://localhost:3000
```
In the Grafana dashboard, go to Configuration > Data Sources. Click Add Data Source and select Prometheus.
Point it to your Prometheus server URL:
```plaintext
http://localhost:9090
```
Save & Test your data source.
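If you prefer to configure the data source as code rather than through the UI, Grafana also supports file-based provisioning. A minimal sketch, assuming a default install where provisioning files live under /etc/grafana/provisioning/datasources/ (the filename is an assumption):

```yaml
# prometheus-datasource.yml — Grafana data source provisioning
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```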
Now you can start creating dashboards to visualize the performance metrics of your Ollama instances. You can display key metrics like model latency, request counts, and resource utilization, allowing for proactive insights and adjustments.
Step 6: Adding Alerts (Optional)
Setting up alerts in Prometheus can help you keep an eye on important metrics that might require immediate action. For instance, you can create alerts for high memory usage or an increase in model response time.
Example alerting rule in your configuration:
```yaml
rule_files: