How to Set Up Server Monitoring: Know When Your Server Goes Down
Set up uptime monitoring, alerts, and a basic Prometheus + Grafana stack. Free and paid options compared.
TL;DR
Use free external monitoring (UptimeRobot or HetrixTools) for uptime checks and instant alerts. For deeper insight into CPU, RAM, disk, and container health, deploy Prometheus + Grafana via Docker Compose. Combine both approaches for comprehensive coverage: external monitoring catches outages you cannot detect from inside your own infrastructure, while Prometheus gives you the metrics to diagnose root causes and predict failures before they happen.
Prerequisites
- A Linux server (Ubuntu 22.04/24.04 or Debian 12 recommended) with root or sudo access
- Docker and Docker Compose installed (`docker compose version` should return v2.x+)
- A domain name pointing to your server (for SSL monitoring and status page)
- Basic familiarity with the Linux command line
- An email address for alert notifications
- Optional: A Slack workspace for Slack alerts
What to Monitor
Before setting up any tools, understand what matters for server reliability:
Infrastructure Metrics
- CPU usage — sustained loads above 80% indicate capacity problems
- RAM usage — memory exhaustion causes OOM kills and service crashes
- Disk usage — a full disk will bring down databases, logging, and often the entire system
- Disk I/O — high wait times point to storage bottlenecks
- Network traffic — unexpected spikes can indicate attacks or misconfiguration
Application Metrics
- HTTP response codes — monitor for 5xx errors
- Response time — detect performance degradation before users complain
- SSL certificate expiry — expired certificates break trust and access
- Docker container status — detect crashed or unhealthy containers
- Service-specific health endpoints — databases, queues, caches
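Before automating anything, you can spot-check several of these infrastructure metrics by hand. A quick sketch using standard Linux tools (paths are standard, the output format is illustrative):

```shell
# Load average (compare the 1-minute figure against your core count)
awk '{print "load (1m):", $1}' /proc/loadavg
nproc

# Memory: percent used, derived from /proc/meminfo
awk '/MemTotal/{t=$2} /MemAvailable/{a=$2} END {printf "mem used: %.1f%%\n", (1-a/t)*100}' /proc/meminfo

# Disk: root filesystem usage
df -h /
```

The meminfo calculation above is the same formula the HighMemoryUsage alert rule later in this guide uses, just computed locally instead of by Prometheus.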
Step 1: Free Uptime Monitoring Services
External uptime monitors check your server from outside your network. This is critical because internal monitoring cannot detect network-level outages or routing problems that prevent users from reaching your server.
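To see what these services do under the hood (or as a stopgap before signing up), note that the core of an external check is just an HTTP request made from a different machine. A minimal sketch, with yourdomain.com as a placeholder:

```shell
#!/bin/bash
# Minimal external uptime check. Run from a machine OUTSIDE the network
# you are monitoring, e.g. via cron on a cheap second VPS.
check_url="https://yourdomain.com"   # placeholder target

# --max-time bounds the probe; on connection failure curl reports "000"
code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$check_url" 2>/dev/null) || code=000

if [ "$code" -lt 200 ] || [ "$code" -ge 400 ]; then
  echo "DOWN: $check_url returned HTTP $code"
fi
```

A hosted service still beats this script (multiple probe locations, escalation, alert routing), but the principle is identical.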
UptimeRobot
UptimeRobot offers 50 monitors at 5-minute intervals on the free plan.
- Create an account at uptimerobot.com
- Click Add New Monitor
- Configure your first monitor:
- Monitor Type: HTTP(s)
- Friendly Name: My Server — Main Site
- URL: https://yourdomain.com
- Monitoring Interval: 5 minutes (free tier)
Add additional monitors for critical endpoints:
- https://yourdomain.com/api/health — API health check
- A Ping monitor for your server IP — catches network-level outages
- A Port monitor for SSH (22), SMTP (587), or database ports if externally accessible
- A Keyword monitor that checks for specific text on a page — catches cases where the server responds 200 but serves error content
HetrixTools
HetrixTools offers 15 monitors at 1-minute intervals on the free plan, plus built-in SSL and blacklist monitoring.
- Create an account at hetrixtools.com
- Navigate to Uptime Monitors and click Add Monitor
- Choose Website and enter your URL
- Enable SSL Certificate Monitoring — you will be alerted 30, 14, and 7 days before expiry
- Add monitoring locations across multiple continents for better outage detection
For SSL expiry monitoring specifically, you can also check certificates from the command line:
echo | openssl s_client -servername yourdomain.com -connect yourdomain.com:443 2>/dev/null | openssl x509 -noout -dates
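To turn that output into a number you can script against, compute the days remaining until expiry. A small helper sketch (GNU `date` assumed; the commented lines show how it would be fed from the `openssl` command above):

```shell
#!/bin/bash
# days_left: days from now until a certificate's notAfter timestamp,
# e.g. "Dec 31 23:59:59 2030 GMT" (the format openssl x509 prints).
days_left() {
  local exp_s now_s
  exp_s=$(date -d "$1" +%s)   # GNU date syntax
  now_s=$(date +%s)
  echo $(( (exp_s - now_s) / 86400 ))
}

# Feeding it from a live host (yourdomain.com is a placeholder):
# exp=$(echo | openssl s_client -servername yourdomain.com -connect yourdomain.com:443 2>/dev/null \
#   | openssl x509 -noout -enddate | cut -d= -f2)
# days_left "$exp"
```

Wrapped in cron with a threshold check, this is a serviceable fallback if the hosted SSL alerts ever fail.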
Step 2: Prometheus + Grafana with Docker Compose
External monitors tell you that something is down. Prometheus and Grafana tell you why and help you see problems coming.
Directory Structure
mkdir -p ~/monitoring/{prometheus,grafana,alertmanager}
cd ~/monitoring
Docker Compose File
Create docker-compose.yml:
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    restart: unless-stopped
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme_s3cure!
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    networks:
      - monitoring

  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    ports:
      - "9115:9115"
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge
Prometheus Configuration
Create prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alert-rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://yourdomain.com
          - https://api.yourdomain.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  - job_name: 'blackbox-ssl'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://yourdomain.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
Alert Rules
Create prometheus/alert-rules.yml:
groups:
  - name: server-alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes (current: {{ $value | printf \"%.1f\" }}%)"

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current: {{ $value | printf \"%.1f\" }}%)"

      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Root filesystem is {{ $value | printf \"%.1f\" }}% full"

      - alert: ContainerDown
        expr: absent(container_last_seen{name=~".+"}) or (time() - container_last_seen{name=~".+"}) > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"

      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} is down"
          description: "HTTP probe has been failing for more than 2 minutes"

      - alert: SslCertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate for {{ $labels.instance }} expires in {{ $value | printf \"%.0f\" }} days"
Start the Stack
cd ~/monitoring
docker compose up -d
# Verify all containers are running
docker compose ps
# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
Open Grafana at http://your-server-ip:3000, log in with the credentials from the compose file, and add Prometheus as a data source with URL http://prometheus:9090. Import dashboard IDs 1860 (Node Exporter Full) and 14282 (cAdvisor) from grafana.com for pre-built visualizations.
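If you prefer not to click through the UI, Grafana can also pick up the data source automatically at startup via its file-based provisioning mechanism. A sketch, assuming you mount a local `./grafana/provisioning` directory into the container (the path under `/etc/grafana/provisioning/datasources/` is where Grafana looks):

```yaml
# grafana/provisioning/datasources/prometheus.yaml
# Mount into the grafana service in docker-compose.yml, e.g.:
#   volumes:
#     - ./grafana/provisioning:/etc/grafana/provisioning
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Provisioned data sources survive container rebuilds, which makes the stack reproducible from the repository alone.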
Step 3: Email and Slack Alerts
Alertmanager Configuration
Create alertmanager/alertmanager.yml:
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email-slack'
  routes:
    - match:
        severity: critical
      receiver: 'email-slack'
      repeat_interval: 1h

receivers:
  - name: 'email-slack'
    email_configs:
      - to: 'admin@yourdomain.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#server-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}ALERT{{ else }}RESOLVED{{ end }}: {{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }} {{ end }}'
After creating the file, restart Alertmanager:
docker compose restart alertmanager
Testing Alerts
Verify Alertmanager is receiving from Prometheus:
# Check pending and firing alerts
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'
# Send a test alert
curl -X POST http://localhost:9093/api/v2/alerts \
-H "Content-Type: application/json" \
-d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"This is a test alert"},"startsAt":"'$(date -u +%Y-%m-%dT%H:%M:%S.000Z)'","generatorURL":"http://localhost:9090"}]'
Step 4: Docker Container Health Checks
Docker health checks let the daemon know whether a container is actually working, not just running. Add them to your application containers:
services:
  webapp:
    image: your-app:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    restart: unless-stopped

  database:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost/ || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped
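One common gotcha: slim and alpine images often ship without curl, so a curl-based healthcheck fails on every run and marks a perfectly healthy container unhealthy. If the image includes BusyBox, its wget works as a substitute. A sketch with a hypothetical api service (the image name and port are placeholders):

```yaml
  api:
    image: your-api:latest   # placeholder image
    healthcheck:
      # BusyBox wget: --spider requests the URL without downloading the body
      test: ["CMD-SHELL", "wget -q --spider http://localhost:8080/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped
```

Verify the check binary actually exists in the image first, e.g. with `docker run --rm your-api:latest which wget`.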
Monitor health status from the command line:
# Check health of all containers
docker ps --format "table {{.Names}}\t{{.Status}}"
# Watch for unhealthy containers
docker events --filter event=health_status
# Script to alert on unhealthy containers
#!/bin/bash
unhealthy=$(docker ps --filter health=unhealthy --format "{{.Names}}" 2>/dev/null)
if [ -n "$unhealthy" ]; then
echo "UNHEALTHY containers: $unhealthy" | mail -s "Docker Health Alert" admin@yourdomain.com
fi
Add that script to cron as a fallback in case your main monitoring is itself down:
# Run every 5 minutes
*/5 * * * * /usr/local/bin/docker-health-check.sh
Step 5: Simple Status Page
A public status page lets users check service availability without contacting you. Both UptimeRobot and HetrixTools offer free hosted status pages.
UptimeRobot Status Page
- Go to My Settings > Status Pages > Add Status Page
- Select the monitors to display
- Customize with your branding
- Use the provided URL or point a CNAME record (e.g., status.yourdomain.com)
Self-Hosted Alternative: Gatus
For a self-hosted status page with built-in monitoring, add Gatus to your stack:
  gatus:
    image: twinproduction/gatus:v5.11.0
    container_name: gatus
    restart: unless-stopped
    volumes:
      - ./gatus/config.yaml:/config/config.yaml
      - ./gatus/data:/data   # persists the sqlite database across restarts
    ports:
      - "8081:8080"
    networks:
      - monitoring
Create gatus/config.yaml:
storage:
  type: sqlite
  path: /data/data.db

endpoints:
  - name: Website
    url: https://yourdomain.com
    interval: 2m
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 2000"
    alerts:
      - type: email
        send-on-resolved: true

  - name: API
    url: https://api.yourdomain.com/health
    interval: 1m
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == UP"
      - "[RESPONSE_TIME] < 1000"

  - name: SSL Certificate
    url: https://yourdomain.com
    interval: 1h
    conditions:
      - "[CERTIFICATE_EXPIRATION] > 14d"
Troubleshooting
Prometheus shows targets as DOWN
# Check if exporters are reachable from within the Docker network
docker exec prometheus wget -qO- http://node-exporter:9100/metrics | head -5
# Check container logs
docker logs node-exporter --tail 20
# Verify network connectivity
docker network inspect monitoring_monitoring
Grafana cannot reach Prometheus
When adding the data source, use the Docker service name http://prometheus:9090, not localhost. Both containers must be on the same Docker network.
Alertmanager not sending emails
# Check Alertmanager logs
docker logs alertmanager --tail 30
# Verify config syntax
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
# For Gmail: use an App Password, not your regular password
# Enable at https://myaccount.google.com/apppasswords
cAdvisor high CPU usage
On systems with many containers, cAdvisor can consume significant resources. Limit its collection scope:
  cadvisor:
    command:
      - '--housekeeping_interval=30s'
      - '--docker_only=true'
      - '--disable_metrics=percpu,sched,tcp,udp,disk,diskIO,hugetlb,referenced_memory'
Disk filling up from Prometheus data
# Check current storage size
du -sh /var/lib/docker/volumes/monitoring_prometheus_data/_data/
# Reduce retention in docker-compose.yml
# Change: --storage.tsdb.retention.time=30d
# To: --storage.tsdb.retention.time=15d
# Restart Prometheus
docker compose restart prometheus
Prevention and Best Practices
Layered Monitoring Strategy
- External uptime monitoring (UptimeRobot/HetrixTools) — catches outages visible to end users
- Internal metrics (Prometheus + Grafana) — provides deep visibility into resource usage and trends
- Container health checks — enables Docker to auto-restart failed services
- Log monitoring — add Loki or ship logs to catch application-level errors
Alert Hygiene
- Only alert on conditions that require human action. Noisy alerts get ignored.
- Use `for:` clauses in Prometheus rules to avoid alerting on brief spikes
- Set `repeat_interval` high enough to avoid alert fatigue (4h for warnings, 1h for critical)
- Always include `send_resolved: true` so you know when issues clear
Security Considerations
- Never expose Prometheus or Alertmanager to the public internet without authentication
- Put Grafana behind a reverse proxy with HTTPS
- Change default Grafana credentials immediately
- Restrict ports in the compose file to `127.0.0.1:9090:9090` if only accessed locally or through a reverse proxy
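As a concrete illustration of the last point, binding a service to loopback in docker-compose.yml keeps it reachable from the host (and from a reverse proxy running there) but invisible to the internet:

```yaml
# In docker-compose.yml, under the prometheus service:
    ports:
      - "127.0.0.1:9090:9090"   # reachable from this host only
# Repeat the same pattern for alertmanager (9093) and cadvisor (8080).
```

Grafana on port 3000 usually stays the only externally reachable piece, and even that belongs behind HTTPS.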
Maintenance Routine
# Monthly: update monitoring stack
cd ~/monitoring
docker compose pull
docker compose up -d
# Weekly: check disk usage trend
df -h / | tail -1
# Verify all Prometheus targets are healthy
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | .labels.job'
A well-configured monitoring setup pays for itself the first time it wakes you up at 2 AM instead of letting your users discover the outage at 9 AM. Start with the free external tools today, then layer in Prometheus and Grafana as your infrastructure grows.
Need Expert Help?
Want monitoring + alerts set up properly? €49, done in 30 min.
Book Now — €49. 100% money-back guarantee.