Why Prometheus and Grafana
Monitoring is the foundation of reliable infrastructure. Without visibility into CPU usage, memory pressure, disk I/O, network throughput, and application-level metrics, operations teams are flying blind. Prometheus and Grafana have become a de facto standard open-source monitoring stack, used by organizations ranging from startups to large enterprises.
This guide covers a production-grade deployment of the monitoring stack, from installation through alerting, based on patterns we implement across our server management engagements.
Architecture Overview
    Targets (exporters)
    ├── node_exporter (system metrics)
    ├── mysqld_exporter (database metrics)
    ├── nginx_exporter (web server metrics)
    └── application /metrics endpoint
                    │
                    ▼
            Prometheus Server
         (scrape, store, query)
                    │
       ┌────────────┼────────────┐
       ▼            ▼            ▼
    Grafana   Alertmanager   Recording
     (viz)          │          Rules
       ┌────────────┼────────────┐
       ▼            ▼            ▼
     Slack      PagerDuty      Email
Prometheus pulls metrics from targets at configured intervals, stores them in a time-series database, and evaluates alerting rules. Grafana queries Prometheus to render dashboards. Alertmanager handles alert routing, grouping, and notification delivery.
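To make the pull model concrete, here is a minimal sketch of the text an application returns when Prometheus scrapes its /metrics endpoint. The counter values and the render_metrics helper are hypothetical and exist only for illustration; a real service would normally use an official Prometheus client library rather than formatting the text by hand.

```python
# Hypothetical in-process counters keyed by (method, status).
REQUEST_COUNTS = {("GET", "200"): 1027, ("POST", "500"): 3}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        # One line per label combination: name{labels} value
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


# This is the body Prometheus would read on each scrape.
print(render_metrics(REQUEST_COUNTS), end="")
```

Each scrape reads the current counter values; Prometheus stores them with a timestamp and derives rates at query time, which is why the application only needs to expose ever-increasing totals.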
Installing Prometheus
Docker Compose Deployment
# docker-compose.yml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--storage.tsdb.retention.size=50GB"
      - "--web.enable-lifecycle"
    restart: unless-stopped
  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_INSTALL_PLUGINS: grafana-piechart-panel
    restart: unless-stopped
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped
volumes:
  prometheus_data:
  grafana_data:
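Whether the 30d and 50GB retention limits above fit your environment depends on how many samples you ingest. A rough sizing sketch follows, using the rule of thumb from the Prometheus storage documentation (retention seconds × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample). The estimate_tsdb_bytes helper and the figure of 100,000 active series are assumptions chosen for illustration, not measurements.

```python
def estimate_tsdb_bytes(retention_days, active_series, scrape_interval_s,
                        bytes_per_sample=2.0):
    """Rough TSDB disk estimate; bytes_per_sample of 1-2 is a rule of thumb."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed workload: 100,000 active series scraped every 15 seconds, kept 30 days.
estimate = estimate_tsdb_bytes(retention_days=30, active_series=100_000,
                               scrape_interval_s=15)
print(f"Estimated disk usage: {estimate / 1e9:.1f} GB")
```

Under these assumptions the estimate comes to about 34.6 GB, which sits below the 50GB size cap, so the 30-day time limit would be the one that takes effect. Treat the result as a starting point and compare it with actual disk usage once the stack is running.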
Prometheus Configuration
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
rule_files:
  - "rules/*.yml"
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node"
    static_configs:
      - targets:
          - "web-server-1:9100"
          - "web-server-2:9100"
          - "db-server-1:9100"
  - job_name: "mysql"
    static_configs:
      - targets: ["db-server-1:9104"]
  - job_name: "nginx"
    static_configs:
      - targets:
          - "web-server-1:9113"
          - "web-server-2:9113"
  - job_name: "application"
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - "app-server-1:8080"
          - "app-server-2:8080"
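Once Prometheus is running with this configuration, it helps to confirm that every target is actually being scraped. The sketch below reads the response shape of the Prometheus HTTP API endpoint /api/v1/targets and lists targets whose health is not "up". The helper names, the base URL, and the sample payload are assumptions for illustration; the sample stands in for a live response so the logic can be followed without a running server.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload):
    """Return (job, instance, last_error) for each active target not 'up'."""
    result = []
    for target in payload["data"]["activeTargets"]:
        if target["health"] != "up":
            labels = target["labels"]
            result.append(
                (labels.get("job"), labels.get("instance"),
                 target.get("lastError", ""))
            )
    return result


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the live target list; base_url is an assumed address."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)


# Hand-written sample in the shape of an /api/v1/targets response.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "web-server-1:9100"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "mysql", "instance": "db-server-1:9104"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

for job, instance, error in unhealthy_targets(sample):
    print(f"{job} {instance} is not up: {error}")
```

Against a live server, replace the sample with unhealthy_targets(fetch_targets()). The same information is visible on the Targets page of the Prometheus web UI.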
Node Exporter Setup
Install node_exporter on every server to collect system metrics:
# Download and install
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
# Create the unprivileged system user referenced by User= in the unit file
sudo useradd --system --no-create-home --shell /bin/false node_exporter
# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'SERVICE'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.conntrack \
  --no-collector.infiniband \
  --no-collector.nfs
Restart=always

[Install]
WantedBy=multi-user.target
SERVICE
# Reload unit files, then enable and start the exporter
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
Alerting Rules
Define alert rules for critical infrastructure conditions:
# prometheus/rules/infrastructure.yml
# Annotations that call printf are single-quoted so the inner double quotes stay valid YAML.
groups:
  - name: infrastructure
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: 'CPU usage is {{ $value | printf "%.1f" }}% for over 10 minutes.'
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critically low on {{ $labels.instance }}"
          description: 'Only {{ $value | printf "%.1f" }}% disk space remaining.'
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: 'Memory usage is {{ $value | printf "%.1f" }}%.'
      - alert: InstanceDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} exporter on {{ $labels.instance }} has been unreachable for 3 minutes."
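The HighCPUUsage expression is easier to reason about with numbers. The sketch below reproduces its arithmetic for one instance: take the per-core rate of the idle counter over the window, average across cores, and subtract from 100. The cpu_usage_percent helper and the counter readings are hypothetical, and the real rate() function additionally handles counter resets and extrapolation, which this simplification ignores.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Approximate 100 - avg(rate(idle)) * 100 for one instance.

    idle_start and idle_end map a CPU id to the value of
    node_cpu_seconds_total{mode="idle"} at the start and end of the window.
    """
    rates = [
        (idle_end[cpu] - idle_start[cpu]) / window_seconds
        for cpu in idle_start
    ]
    avg_idle_fraction = sum(rates) / len(rates)
    return 100 - avg_idle_fraction * 100


# Hypothetical readings taken 300 seconds (5 minutes) apart on a 2-core host.
start = {"0": 1000.0, "1": 2000.0}
end = {"0": 1037.5, "1": 2018.75}

usage = cpu_usage_percent(start, end, window_seconds=300)
print(f"CPU usage: {usage:.3f}%")
print("Above the 85% threshold:", usage > 85)
```

Here the cores were idle for 12.5% and 6.25% of the window, so the average usage is 90.625% and the condition holds. The alert would only fire if the condition stayed true for the full 10 minutes set by the for clause.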
Alertmanager Configuration
Route alerts to the right channels based on severity:
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/xxxx"
route:
  receiver: "slack-default"
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: "slack-warnings"
      repeat_interval: 4h
receivers:
  - name: "slack-default"
    slack_configs:
      - channel: "#monitoring"
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'
  - name: "slack-warnings"
    slack_configs:
      - channel: "#monitoring-warnings"
  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "your-pagerduty-integration-key"
        severity: critical
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ["instance"]
The inhibit rule suppresses warning alerts when a critical alert is already firing for the same instance, reducing alert noise during incidents.
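The effect of the inhibit rule can be illustrated with a small simulation. This is a simplified sketch of the matching logic only, not Alertmanager's implementation: the apply_inhibition helper and the sample alerts are hypothetical, and each alert is reduced to a flat dictionary of labels.

```python
def apply_inhibition(alerts, source_severity="critical",
                     target_severity="warning", equal=("instance",)):
    """Drop target alerts that share the 'equal' labels with a firing source alert."""
    def key(alert):
        return tuple(alert.get(label) for label in equal)

    # Label values of every firing source (critical) alert.
    source_keys = {key(a) for a in alerts if a.get("severity") == source_severity}

    return [
        a for a in alerts
        if not (a.get("severity") == target_severity and key(a) in source_keys)
    ]


# Hypothetical set of currently firing alerts.
firing = [
    {"alertname": "InstanceDown", "severity": "critical", "instance": "web-server-1:9100"},
    {"alertname": "HighCPUUsage", "severity": "warning", "instance": "web-server-1:9100"},
    {"alertname": "HighMemoryUsage", "severity": "warning", "instance": "web-server-2:9100"},
]

for alert in apply_inhibition(firing):
    print(alert["alertname"], alert["instance"])
```

The warning on web-server-1 is dropped because a critical alert is firing for the same instance, while the warning on web-server-2 still goes out. Keeping instance in the equal list is what stops a critical alert on one host from silencing warnings on every other host.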
Grafana Dashboard Design
Data Source Provisioning
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
Key Dashboard Panels
Build dashboards with these essential panels:
- System Overview: CPU, memory, disk, and network for all servers in a single view using node_exporter metrics.
- Application Performance: Request rate, error rate, and latency percentiles (the RED method).
- Database Health: Query throughput, slow queries, replication lag, connection pool utilization.
Example PromQL Queries
# Request rate per second (5-minute average)
rate(http_requests_total[5m])
# 99th percentile latency across all series (aggregate buckets first, keeping the le label)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# Disk I/O utilization
rate(node_disk_io_time_seconds_total[5m]) * 100
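The 99th percentile query above is an estimate, not an exact measurement: histogram_quantile only sees cumulative bucket counts and interpolates linearly inside the bucket where the requested rank falls. The sketch below is a simplified re-implementation of that idea for illustration. It assumes positive bucket bounds and a +Inf bucket, and the bucket counts are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Simplified sketch: assumes 0 < q <= 1, positive bounds, and a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan

    rank = q * buckets[-1][1]  # the +Inf bucket holds the total count
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break

    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    lower = buckets[i - 1][0] if i > 0 else 0.0
    prev_count = buckets[i - 1][1] if i > 0 else 0.0
    # Linear interpolation inside the bucket that contains the rank.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical cumulative counts for le = 0.1s, 0.5s, 1s, +Inf.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

print(f"p50 ~ {histogram_quantile(0.5, latency_buckets):.3f}s")
print(f"p95 ~ {histogram_quantile(0.95, latency_buckets):.3f}s")
```

Because the result is interpolated, its accuracy depends on bucket boundaries. If your latency objective is 300 ms, having a bucket boundary at or near 0.3 gives a far more trustworthy answer than a bucket spanning 0.1 to 0.5.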
Recording Rules for Performance
Pre-compute expensive queries as recording rules to speed up dashboards:
# prometheus/rules/recording.yml
groups:
  - name: recording
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      - record: instance:node_cpu:utilization
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Best Practices
- Set retention based on your needs — 30 days for operational metrics, use Thanos or Cortex for long-term storage
- Use recording rules for dashboard queries to reduce Prometheus load
- Label targets consistently with environment, team, and service labels
- Create runbooks linked from alert annotations for faster incident response
- Monitor the monitoring stack itself — alert on Prometheus scrape failures and high memory usage
- Secure Grafana with SSO and role-based access control
- Integrate with your infrastructure management workflows for automated remediation
A well-designed monitoring stack turns reactive firefighting into proactive operations, catching issues before they impact users.