Server & DevOps · August 1, 2025 · 9 min read

Server Monitoring with Prometheus and Grafana

Build a production monitoring stack with Prometheus, Grafana, and Alertmanager for infrastructure visibility, custom dashboards, and incident alerting.

Why Prometheus and Grafana

Monitoring is the foundation of reliable infrastructure. Without visibility into CPU usage, memory pressure, disk I/O, network throughput, and application-level metrics, operational teams are flying blind. Prometheus and Grafana have become the industry-standard open-source monitoring stack, used by organizations from startups to Fortune 500 companies.

This guide covers a production-grade deployment of the monitoring stack, from installation through alerting, based on patterns we implement across our server management engagements.

Architecture Overview

Targets (exporters)
  ├── node_exporter (system metrics)
  ├── mysqld_exporter (database metrics)
  ├── nginx_exporter (web server metrics)
  └── application /metrics endpoint
          │
          ▼
    Prometheus Server
    (scrape, store, query)
          │
    ┌─────┼──────┐
    ▼     ▼      ▼
Grafana  Alert   Recording
(viz)   Manager   Rules
          │
    ┌─────┼──────┐
    ▼     ▼      ▼
 Slack  PagerDuty Email

Prometheus pulls metrics from targets at configured intervals, stores them in a time-series database, and evaluates alerting rules. Grafana queries Prometheus to render dashboards. Alertmanager handles alert routing, grouping, and notification delivery.

Installing Prometheus

Docker Compose Deployment

# docker-compose.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--storage.tsdb.retention.size=50GB"
      - "--web.enable-lifecycle"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_INSTALL_PLUGINS: grafana-piechart-panel
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
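
Before the first `docker compose up -d`, create the host-side directory layout that the bind mounts above expect. The `.env` line below is one way to supply `GRAFANA_ADMIN_PASSWORD`; adapt it to however you manage secrets:

```shell
# Host directories referenced by the bind mounts in docker-compose.yml
mkdir -p prometheus/rules grafana/provisioning/datasources alertmanager

# Compose reads GRAFANA_ADMIN_PASSWORD from .env; generate a random value
echo "GRAFANA_ADMIN_PASSWORD=$(head -c 18 /dev/urandom | base64)" > .env
```

Because Prometheus starts with `--web.enable-lifecycle`, later configuration edits can be applied without a container restart: `curl -X POST http://localhost:9090/-/reload`.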

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets:
          - "web-server-1:9100"
          - "web-server-2:9100"
          - "db-server-1:9100"

  - job_name: "mysql"
    static_configs:
      - targets: ["db-server-1:9104"]

  - job_name: "nginx"
    static_configs:
      - targets:
          - "web-server-1:9113"
          - "web-server-2:9113"

  - job_name: "application"
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - "app-server-1:8080"
          - "app-server-2:8080"

Node Exporter Setup

Install node_exporter on every server to collect system metrics:

# Download and install
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create the dedicated system user the unit below runs as
sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service <<'SERVICE'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.conntrack \
  --no-collector.infiniband \
  --no-collector.nfs
Restart=always

[Install]
WantedBy=multi-user.target
SERVICE

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

Alerting Rules

Define alert rules for critical infrastructure conditions:

# prometheus/rules/infrastructure.yml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf "%.1f" }}% for over 10 minutes."

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critically low on {{ $labels.instance }}"
          description: "Only {{ $value | printf "%.1f" }}% disk space remaining."

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf "%.1f" }}%."

      - alert: InstanceDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} exporter on {{ $labels.instance }} has been unreachable for 3 minutes."

Alertmanager Configuration

Route alerts to the right channels based on severity:

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/xxxx"

route:
  receiver: "slack-default"
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: "slack-warnings"
      repeat_interval: 4h

receivers:
  - name: "slack-default"
    slack_configs:
      - channel: "#monitoring"
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'

  - name: "slack-warnings"
    slack_configs:
      - channel: "#monitoring-warnings"

  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "your-pagerduty-integration-key"
        severity: critical

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ["instance"]

The inhibit rule suppresses warning alerts when a critical alert is already firing for the same instance, reducing alert noise during incidents.

Grafana Dashboard Design

Data Source Provisioning

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
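
Dashboards can be provisioned the same way, so they live in version control alongside the data source. A sketch of a file provider (the folder name and path are assumptions, and the path must also be mounted into the Grafana container):

```yaml
# grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: default
    folder: Infrastructure
    type: file
    options:
      path: /var/lib/grafana/dashboards
```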

Key Dashboard Panels

Build dashboards with these essential panels:

System Overview: CPU, memory, disk, and network for all servers in a single view using node_exporter metrics.

Application Performance: Request rate, error rate, and latency percentiles (the RED method).

Database Health: Query throughput, slow queries, replication lag, connection pool utilization.

Example PromQL Queries

# Request rate per second (5-minute average)
rate(http_requests_total[5m])

# 99th percentile latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# Disk I/O utilization
rate(node_disk_io_time_seconds_total[5m]) * 100
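
Two more queries worth keeping at hand; the 0.5-second bucket boundary below is an assumption about how the application's latency histogram is configured:

```
# Share of requests completing within 500ms (assumes a 0.5s bucket exists)
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))

# Memory utilization percentage per instance
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
```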

Recording Rules for Performance

Pre-compute expensive queries as recording rules to speed up dashboards:

# prometheus/rules/recording.yml
groups:
  - name: recording
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: instance:node_cpu:utilization
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
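
Recorded series can also back alert expressions, keeping alert definitions short and cheap to evaluate. A sketch building on the two rules above (the file name and 5% threshold are illustrative):

```yaml
# prometheus/rules/error-budget.yml
groups:
  - name: error-budget
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:rate5m / job:http_requests:rate5m > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
```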

Best Practices

  • Set retention to match your needs: 30 days covers most operational use; offload long-term storage to Thanos or Cortex
  • Use recording rules for dashboard queries to reduce Prometheus load
  • Label targets consistently with environment, team, and service labels
  • Create runbooks linked from alert annotations for faster incident response
  • Monitor the monitoring stack itself — alert on Prometheus scrape failures and high memory usage
  • Secure Grafana with SSO and role-based access control
  • Integrate with your infrastructure management workflows for automated remediation
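
Self-monitoring in particular deserves rules from day one. A sketch using built-in Prometheus metrics (thresholds and durations are illustrative):

```yaml
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus config reload failed on {{ $labels.instance }}"

      - alert: PrometheusNotificationErrors
        expr: rate(prometheus_notifications_errors_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus is failing to deliver alerts to Alertmanager"
```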

A well-designed monitoring stack turns reactive firefighting into proactive operations, catching issues before they impact users.

Need help with this?

Our team handles this kind of work daily. Let us take care of your infrastructure.