Setting up metrics to monitor your applications is crucial in today's fast-paced tech environment. For those diving into the world of AI models, integrating Ollama with Prometheus can unlock new insights into your model's performance, resource utilization, and overall efficiency.
In this blog post, we’ll walk through the steps of setting up Ollama with Prometheus, allowing you to monitor vital metrics like GPU usage, request counts, and latency. By the end of this, you’ll have a fully functional metrics monitoring setup that can offer deeper insights into your LLM (Large Language Model) operations.
Why Ollama and Prometheus?
Ollama, an easy-to-use tool for running AI models locally, simplifies the process of deploying and managing large language models. When combined with Prometheus, a leading monitoring tool, you can gather metrics effortlessly, visualize data trends, and set alerts for optimal performance. This combination is particularly useful if you work with AI applications that need to handle high loads and require timely insights for continuous improvements.
What You'll Need
Before we get started, make sure you have the following ready:
- Prometheus installed on your local machine. You can find installation instructions in the Prometheus documentation.
- Familiarity with the command line and basic terminal commands.
- (Optional) A Grafana setup to visualize your metrics. You can learn more about Grafana in their documentation.
Step 1: Install Ollama and Create a Simple Model
Ensure you have Ollama running correctly on your machine. To quickly install Ollama, you can run:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
To create a model, use the command:
```bash
ollama create mymodel -f ./Modelfile
```
Make sure the Modelfile exists and contains the required configuration.
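As a minimal sketch, a Modelfile might look like the following; the base model and parameter values here are assumptions, so adjust them to your needs:

```plaintext
FROM llama3
PARAMETER temperature 0.7
SYSTEM "You are a concise, helpful assistant."
```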
You can run any AI model you want thereafter. For example:
```bash
ollama run model_name
```
This command will launch your model locally.
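To generate some traffic that will later show up in request-count and latency metrics, you can also call the Ollama API directly. This sketch assumes the llama3 model has already been pulled:

```bash
# Send a single non-streaming generation request to the local Ollama API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'
```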
Step 2: Add Prometheus Metrics Endpoint in Ollama
To allow Prometheus to scrape metrics from your Ollama instance, you need to enable a metrics endpoint. As discussed on GitHub, the Ollama team has been working on a /metrics endpoint that exposes various metrics such as GPU utilization, memory utilization, and request counts.
You can follow along with community discussions and enhancement requests here: GitHub Issue on Metrics Endpoint.
Ensure you have the latest version of Ollama that supports this endpoint. You can check and update your version as necessary. Once your server is running, you should be able to access metrics via:
```plaintext
http://localhost:11434/metrics
```
Run the following command to confirm the metrics endpoint works correctly:
```bash
curl http://127.0.0.1:11434/metrics
```
You should receive a list of metrics currently collected by Ollama.
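The response uses the standard Prometheus text exposition format. As a purely illustrative sketch (the metric name below is hypothetical; the actual names depend on your Ollama version), a counter might look like this:

```plaintext
# HELP ollama_requests_total Total number of requests handled (hypothetical metric name)
# TYPE ollama_requests_total counter
ollama_requests_total 42
```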
Step 3: Configure Prometheus to Scrape the Ollama Metrics
Next, we’ll need to configure Prometheus to scrape metrics from your Ollama instance. Create a configuration file named prometheus.yml with a scrape job pointing at the Ollama metrics endpoint.
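Here is a minimal sketch of what that file might contain, assuming Ollama serves metrics on its default port 11434 at the default /metrics path:

```yaml
# prometheus.yml — minimal scrape configuration for a local Ollama instance
global:
  scrape_interval: 15s        # how often Prometheus scrapes its targets

scrape_configs:
  - job_name: "ollama"
    # metrics_path defaults to /metrics, so it is omitted here
    static_configs:
      - targets: ["localhost:11434"]
```

Step 4: Run Prometheus
Start Prometheus and point it at this configuration file. Assuming you are using the standalone binary from the directory where you extracted it, that looks like:

```bash
./prometheus --config.file=prometheus.yml
```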
You should see logs indicating that Prometheus is running and scraping the metrics from Ollama. Access the Prometheus dashboard by visiting:
```plaintext
http://localhost:9090
```
Here, you can explore, query, and visualize all scraped metrics from both Ollama and any other configured targets.
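To experiment with a query, you can enter PromQL in the expression box. The metric name in this example is hypothetical; replace it with one that actually appears in your Ollama /metrics output:

```plaintext
rate(ollama_requests_total[5m])
```

This expression shows the per-second request rate averaged over the last five minutes.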
Step 5: Visualize Metrics with Grafana (Optional)
To enhance your monitoring experience, we can visualize these metrics using Grafana.
Install Grafana and run it.
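One common way to do this, assuming Docker is available on your machine, is to run the official Grafana image:

```bash
# Start Grafana in the background on port 3000
docker run -d --name=grafana -p 3000:3000 grafana/grafana
```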
Log into the Grafana dashboard by navigating to:
```plaintext
http://localhost:3000
```
In the Grafana dashboard, go to Configuration > Data Sources. Click Add Data Source and select Prometheus.
Point it to your Prometheus server URL:
```plaintext
http://localhost:9090
```
Save & Test your data source.
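If you prefer to configure the data source as code rather than through the UI, Grafana also supports file-based provisioning. A minimal sketch, assuming a default install where provisioning files live under /etc/grafana/provisioning/datasources/ (the filename is an assumption):

```yaml
# prometheus-datasource.yml — Grafana data source provisioning
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
```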
Now you can start creating dashboards to visualize the performance metrics of your Ollama instances. You can display key metrics like model latency, request counts, and resource utilization, allowing for proactive insights and adjustments.
Step 6: Adding Alerts (Optional)
Setting up alerts in Prometheus can help you keep an eye on important metrics that might require immediate action. For instance, you can create alerts for high memory usage or an increase in model response time.
Example alerting rule in your configuration:
```yaml
rule_files: