Why Prometheus and Grafana

Monitoring is the foundation of reliable infrastructure. Without visibility into CPU usage, memory pressure, disk I/O, network throughput, and application-level metrics, operations teams are flying blind. Prometheus and Grafana form a widely adopted open-source monitoring stack, used by organizations from startups to Fortune 500 companies.

This guide covers a production-grade deployment of the monitoring stack, from installation through alerting, based on patterns we implement across our server management engagements.

Architecture Overview

Targets (exporters)
  ├── node_exporter (system metrics)
  ├── mysqld_exporter (database metrics)
  ├── nginx_exporter (web server metrics)
  └── application /metrics endpoint
               │
               ▼
       Prometheus Server
    (scrape, store, query)
               │
    ┌──────────┼───────────┐
    ▼          ▼           ▼
 Grafana  Alertmanager  Recording
  (viz)        │          Rules
    ┌──────────┼───────────┐
    ▼          ▼           ▼
  Slack    PagerDuty     Email

Prometheus pulls metrics from targets at configured intervals, stores them in a time-series database, and evaluates alerting rules. Grafana queries Prometheus to render dashboards. Alertmanager handles alert routing, grouping, and notification delivery.
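
To make the pull model concrete, here is a minimal Python sketch that parses the kind of plain-text payload an exporter serves on /metrics. It is a simplified illustration, not the parser Prometheus itself uses: the sample payload, the `parse_exposition` name, and the regular expressions are assumptions for this example, and lines carrying timestamps are skipped.

```python
import re

# Hypothetical scrape payload, shaped like the Prometheus text exposition format.
SAMPLE_SCRAPE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="GET",status="500"} 3
"""

# metric_name, optional {labels}, then a single value (no timestamp support).
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and HELP/TYPE comments carry no samples.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # unsupported line shape in this sketch
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Each tuple corresponds to one time series sample that Prometheus would store with the scrape timestamp.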

Installing Prometheus

Docker Compose Deployment

# docker-compose.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--storage.tsdb.retention.size=50GB"
      - "--web.enable-lifecycle"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_INSTALL_PLUGINS: grafana-piechart-panel
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
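
Once the stack is started, a quick health probe can confirm that all three services answer. The sketch below assumes the services are reachable on localhost at the ports published in the Compose file; it uses the `/-/healthy` endpoints of Prometheus and Alertmanager and Grafana's `/api/health` endpoint. The function names and the injectable `fetch` parameter are illustrative choices, not part of any of these tools.

```python
from urllib.request import urlopen

# Ports taken from the Compose file; hostnames are an assumption.
ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
    "alertmanager": "http://localhost:9093/-/healthy",
}


def http_status(url, timeout=3.0):
    """Return the HTTP status code, or None if the service is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses.
        return None


def check_stack(endpoints=ENDPOINTS, fetch=http_status):
    """Map each service name to True when its health endpoint returns 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

Calling `check_stack()` shortly after `docker compose up -d` should report `True` for each service once the containers have finished starting.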

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets:
          - "web-server-1:9100"
          - "web-server-2:9100"
          - "db-server-1:9100"

  - job_name: "mysql"
    static_configs:
      - targets: ["db-server-1:9104"]

  - job_name: "nginx"
    static_configs:
      - targets:
          - "web-server-1:9113"
          - "web-server-2:9113"

  - job_name: "application"
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - "app-server-1:8080"
          - "app-server-2:8080"
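
After loading this configuration, it helps to confirm that every target is actually being scraped. Prometheus reports scrape status through its `/api/v1/targets` HTTP endpoint; the sketch below lists targets whose health is not `up`. The base URL and the helper names are assumptions for illustration.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090", timeout=5.0):
    """Fetch the raw /api/v1/targets response as a dict."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return (job, instance, last_error) for every active target that is not up."""
    problems = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            labels = target.get("labels", {})
            problems.append(
                (labels.get("job", ""), labels.get("instance", ""), target.get("lastError", ""))
            )
    return problems
```

A call such as `unhealthy_targets(fetch_targets())` returning an empty list would indicate that all configured jobs are being scraped successfully.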

Node Exporter Setup

Install node_exporter on every server to collect system metrics:

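# Download and install the binary
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create the unprivileged system user that the service runs as
sudo useradd --system --no-create-home --shell /usr/sbin/nologin node_exporter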

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service <<'SERVICE'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.conntrack \
  --no-collector.infiniband \
  --no-collector.nfs
Restart=always

[Install]
WantedBy=multi-user.target
SERVICE

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

Alerting Rules

Define alert rules for critical infrastructure conditions:

# prometheus/rules/infrastructure.yml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: 'CPU usage is {{ $value | printf "%.1f" }}% for over 10 minutes.'

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critically low on {{ $labels.instance }}"
          description: 'Only {{ $value | printf "%.1f" }}% disk space remaining.'

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: 'Memory usage is {{ $value | printf "%.1f" }}%.'

      - alert: InstanceDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} exporter on {{ $labels.instance }} has been unreachable for 3 minutes."
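
The `for` clause in these rules delays firing until the condition has held continuously. A minimal Python sketch of that pending-to-firing behavior follows, assuming one rule evaluation per sample and ignoring options such as `keep_firing_for`; the function name and the sample values are made up for illustration.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state after each evaluation.

    samples: list of (timestamp_seconds, value) in time order.
    The alert is pending once value > threshold, and firing once the
    condition has held for at least for_seconds without interruption.
    """
    states = []
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            # Any evaluation below the threshold resets the timer.
            active_since = None
            states.append("inactive")
    return states
```

With a threshold of 85 and a two-minute hold, a short spike that drops back below the threshold never leaves the pending state, which is the reason for pairing each expression with a `for` duration.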

Alertmanager Configuration

Route alerts to the right channels based on severity:

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/xxxx"

route:
  receiver: "slack-default"
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
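  routes:
    # Page the on-call engineer for critical alerts
    - matchers:
        - severity="critical"
      receiver: "pagerduty-critical"
      repeat_interval: 1h
    # Send warnings to a dedicated Slack channel
    - matchers:
        - severity="warning"
      receiver: "slack-warnings"
      repeat_interval: 4h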

receivers:
  - name: "slack-default"
    slack_configs:
      - channel: "#monitoring"
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'

  - name: "slack-warnings"
    slack_configs:
      - channel: "#monitoring-warnings"

  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "your-pagerduty-integration-key"
        severity: critical

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ["instance"]

The inhibit rule suppresses warning alerts when a critical alert is already firing for the same instance, reducing alert noise during incidents.
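
The inhibition logic can be sketched in a few lines of Python. This is a simplified model of the behavior described above, not Alertmanager's implementation: alerts are plain label dictionaries, matching is exact equality only, and the function names are hypothetical.

```python
def is_inhibited(alert, firing, source, target, equal):
    """True if `alert` matches `target` and some other firing alert
    matches `source` with identical values for every label in `equal`."""
    if not all(alert.get(k) == v for k, v in target.items()):
        return False
    for other in firing:
        if other is alert:
            continue  # an alert does not inhibit itself
        matches_source = all(other.get(k) == v for k, v in source.items())
        # A missing label is treated like an empty value on both sides.
        same_scope = all(other.get(l, "") == alert.get(l, "") for l in equal)
        if matches_source and same_scope:
            return True
    return False


def notifiable(firing, source, target, equal):
    """Return the firing alerts that would still produce notifications."""
    return [a for a in firing if not is_inhibited(a, firing, source, target, equal)]
```

For example, if `InstanceDown` (critical) and `HighCPUUsage` (warning) fire for the same instance, only `InstanceDown` remains notifiable, while a warning on a different instance is unaffected.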

Grafana Dashboard Design

Data Source Provisioning

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

Key Dashboard Panels

Build dashboards with these essential panels:

System Overview: CPU, memory, disk, and network for all servers in a single view using node_exporter metrics.

Application Performance: Request rate, error rate, and latency percentiles (the RED method).

Database Health: Query throughput, slow queries, replication lag, connection pool utilization.

Example PromQL Queries

# Request rate per second (5-minute average)
rate(http_requests_total[5m])

# 99th percentile latency, aggregated across all series
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Disk I/O utilization
rate(node_disk_io_time_seconds_total[5m]) * 100
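
To show what the percentile query estimates, here is a Python sketch of quantile estimation over cumulative histogram buckets using linear interpolation within the matching bucket. It follows the general approach of `histogram_quantile` but is a simplified illustration: the bucket boundaries and counts in the usage note are invented, and edge cases such as negative bounds are not handled.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count); the last bucket
    must be (math.inf, total_count), mirroring the le="+Inf" bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket: report the
                # highest finite upper bound instead of infinity.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]`, the median falls at the top of the first bucket and the 95th percentile is interpolated between 0.5 and 1.0 seconds, which is why bucket boundaries should bracket the latencies you care about.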

Recording Rules for Performance

Pre-compute expensive queries as recording rules to speed up dashboards:

# prometheus/rules/recording.yml
groups:
  - name: recording
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: instance:node_cpu:utilization
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
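
The CPU utilization expression used in the recording rule can be traced by hand. The Python sketch below approximates `rate()` as the counter increase divided by elapsed time, with basic counter-reset handling, then applies the same `100 - avg(idle rate) * 100` arithmetic. It omits the extrapolation Prometheus performs at window edges, and the function names and sample data are assumptions for illustration.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    samples: list of (timestamp_seconds, counter_value) in time order.
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset, so count up from zero.
        increase += (cur - prev) if cur >= prev else cur
    duration = samples[-1][0] - samples[0][0]
    return increase / duration if duration > 0 else None


def cpu_utilization(idle_by_core):
    """Percent CPU busy, given idle-seconds counters keyed by core."""
    rates = [simple_rate(s) for s in idle_by_core.values()]
    rates = [r for r in rates if r is not None]
    if not rates:
        return None
    return 100 - (sum(rates) / len(rates)) * 100
```

For two cores whose idle counters grow by 75 and 150 seconds over a 300-second window, the idle rates are 0.25 and 0.5, giving a utilization of 62.5 percent.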

Best Practices

Set retention based on your needs: 30 days for operational metrics, and Thanos or Cortex for long-term storage.
Use recording rules for dashboard queries to reduce Prometheus load.
Label targets consistently with environment, team, and service labels.
Create runbooks and link them from alert annotations for faster incident response.
Monitor the monitoring stack itself: alert on Prometheus scrape failures and high memory usage.
Secure Grafana with SSO and role-based access control.
Integrate with your infrastructure management workflows for automated remediation.
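
For the retention guidance, a rough disk estimate helps when choosing time and size limits. The sketch below applies the commonly cited approximation of retention seconds times ingested samples per second times bytes per sample, assuming roughly 1 to 2 bytes per sample after compression. The function name, the default of 2 bytes, and the series count in the usage note are assumptions, so treat the result as an order-of-magnitude figure.

```python
def estimate_tsdb_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough on-disk size for the Prometheus TSDB, in bytes."""
    # Each active series yields one sample per scrape interval.
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86400
    return retention_seconds * samples_per_second * bytes_per_sample
```

As a hypothetical example, 150,000 active series scraped every 15 seconds and kept for 30 days come to roughly 51.8 billion bytes, which is in the same range as a 50GB size limit, so it is worth checking the time and size settings together.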

A well-designed monitoring stack turns reactive firefighting into proactive operations, catching issues before they impact users.

