Last updated: 2026-03-30

How to Set Up Server Monitoring: Know When Your Server Goes Down

Set up uptime monitoring, alerts, and a basic Prometheus + Grafana stack. Free and paid options compared.

TL;DR

Use free external monitoring (UptimeRobot or HetrixTools) for uptime checks and instant alerts. For deeper insight into CPU, RAM, disk, and container health, deploy Prometheus + Grafana via Docker Compose. Combine both approaches for comprehensive coverage: external monitoring catches outages you cannot detect from inside your own infrastructure, while Prometheus gives you the metrics to diagnose root causes and predict failures before they happen.

Prerequisites

  - A Linux server with root or sudo access
  - Docker and Docker Compose installed (for the Prometheus + Grafana stack)
  - A domain name pointing at the server, for HTTPS checks and SSL monitoring

What to Monitor

Before setting up any tools, understand what matters for server reliability:

Infrastructure Metrics

  - CPU usage and load average
  - Memory usage
  - Disk space and disk I/O
  - Network traffic and reachability

Application Metrics

  - HTTP status codes and response times
  - Application error rates
  - Container health and restart counts
  - SSL certificate validity

Step 1: Free Uptime Monitoring Services

External uptime monitors check your server from outside your network. This is critical because internal monitoring cannot detect network-level outages or routing problems that prevent users from reaching your server.
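Before wiring up a hosted service, it helps to see how little an external probe actually is. A minimal sketch in shell (the URL is a placeholder; run it from a machine outside the network you are monitoring):

```shell
#!/bin/sh
# Minimal external HTTP probe: fetch a URL, classify the status code.

classify_code() {
  # 2xx/3xx responses count as up; anything else (including curl's
  # "000" for connection failures) counts as down.
  case "$1" in
    2*|3*) echo "UP" ;;
    *)     echo "DOWN" ;;
  esac
}

probe() {
  url="$1"
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url" 2>/dev/null)
  code=${code:-000}
  echo "$url $(classify_code "$code") ($code)"
}

# Example (placeholder URL):
# probe https://yourdomain.com
```

This is exactly what the hosted services do, except they run it from many locations, retry before alerting, and keep history for you.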

UptimeRobot

UptimeRobot offers 50 monitors at 5-minute intervals on the free plan.

  1. Create an account at uptimerobot.com
  2. Click Add New Monitor
  3. Configure your first monitor:
     - Monitor Type: HTTP(s)
     - Friendly Name: e.g. Main Website
     - URL: https://yourdomain.com
     - Monitoring Interval: 5 minutes (the free-plan minimum)
  4. Add your email address (and optionally the mobile app or a webhook) as an alert contact

Add additional monitors for critical endpoints such as your API health check, login page, and any customer-facing subdomains.

HetrixTools

HetrixTools offers 15 monitors at 1-minute intervals on the free plan, plus built-in SSL and blacklist monitoring.

  1. Create an account at hetrixtools.com
  2. Navigate to Uptime Monitors and click Add Monitor
  3. Choose Website and enter your URL
  4. Enable SSL Certificate Monitoring — you will be alerted 30, 14, and 7 days before expiry
  5. Add monitoring locations across multiple continents for better outage detection

For SSL expiry monitoring specifically, you can also check certificates from the command line:

echo | openssl s_client -servername yourdomain.com -connect yourdomain.com:443 2>/dev/null | openssl x509 -noout -dates
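To turn that output into a number you can act on, a small helper (a sketch; it relies on GNU `date -d`, so it works on Linux but not stock macOS) converts the certificate's notAfter date into days until expiry:

```shell
# Convert an openssl notAfter date (e.g. "Jun  1 12:00:00 2026 GMT")
# into the number of whole days until expiry. Requires GNU date.
days_until() {
  end=$(date -d "$1" +%s)
  now=$(date +%s)
  echo $(( (end - now) / 86400 ))
}

# Usage (network access to yourdomain.com assumed):
# exp=$(echo | openssl s_client -servername yourdomain.com -connect yourdomain.com:443 2>/dev/null \
#   | openssl x509 -noout -enddate | cut -d= -f2)
# days_until "$exp"
```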

Step 2: Prometheus + Grafana with Docker Compose

External monitors tell you that something is down. Prometheus and Grafana tell you why and help you see problems coming.

Directory Structure

mkdir -p ~/monitoring/{prometheus,grafana,alertmanager}
cd ~/monitoring

Docker Compose File

Create docker-compose.yml:

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    restart: unless-stopped
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme_s3cure!
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    networks:
      - monitoring

  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    ports:
      - "9115:9115"
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge

Prometheus Configuration

Create prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

rule_files:
  - "alert-rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://yourdomain.com
        - https://api.yourdomain.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  - job_name: 'blackbox-ssl'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://yourdomain.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
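The http_2xx module referenced above ships with the blackbox exporter by default, so no extra file is needed. If you want to customize probing behavior (for example, force IPv4 or fail probes that are not served over TLS), you can mount your own module file; a sketch, where the field values are examples rather than the defaults:

```yaml
# blackbox/blackbox.yml -- optional; mount it into the container and point
# the exporter at it with --config.file.
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      preferred_ip_protocol: "ip4"
      fail_if_not_ssl: true
```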

Alert Rules

Create prometheus/alert-rules.yml:

groups:
  - name: server-alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes (current: {{ $value | printf \"%.1f\" }}%)"

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current: {{ $value | printf \"%.1f\" }}%)"

      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Root filesystem is {{ $value | printf \"%.1f\" }}% full"

      - alert: ContainerDown
        expr: absent(container_last_seen{name=~".+"}) or (time() - container_last_seen{name=~".+"}) > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"

      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} is down"
          description: "HTTP probe has been failing for more than 2 minutes"

      - alert: SslCertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate for {{ $labels.instance }} expires in {{ $value | printf \"%.0f\" }} days"

Start the Stack

cd ~/monitoring
docker compose up -d

# Verify all containers are running
docker compose ps

# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

Open Grafana at http://your-server-ip:3000, log in with the credentials from the compose file, and add Prometheus as a data source with the URL http://prometheus:9090. Import dashboard IDs 1860 (Node Exporter Full) and 14282 (cAdvisor) from grafana.com for pre-built visualizations.
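You can also skip the manual data-source step with Grafana's file-based provisioning. A sketch (assumes you add a `./grafana/provisioning:/etc/grafana/provisioning` volume to the grafana service in the compose file):

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Provisioned data sources survive container re-creation, which manual clicks in the UI also do (via the grafana_data volume) but are harder to reproduce on a new server.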

Step 3: Email and Slack Alerts

Alertmanager Configuration

Create alertmanager/alertmanager.yml:

global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email-slack'

  routes:
    - matchers:
        - severity="critical"
      receiver: 'email-slack'
      repeat_interval: 1h

receivers:
  - name: 'email-slack'
    email_configs:
      - to: 'admin@yourdomain.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#server-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}ALERT{{ else }}RESOLVED{{ end }}: {{ .CommonLabels.alertname }}'
        text: |-
          {{ range .Alerts }}{{ .Annotations.description }}
          {{ end }}

After creating the file, restart Alertmanager:

docker compose restart alertmanager

Testing Alerts

Verify Alertmanager is receiving from Prometheus:

# Check pending and firing alerts
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'

# Send a test alert
curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"This is a test alert"},"startsAt":"'$(date -u +%Y-%m-%dT%H:%M:%S.000Z)'","generatorURL":"http://localhost:9090"}]'

Step 4: Docker Container Health Checks

Docker health checks let the daemon know whether a container is actually working, not just running. Add them to your application containers:

services:
  webapp:
    image: your-app:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    restart: unless-stopped

  database:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    healthcheck:
      # busybox wget ships with alpine images; curl may not
      test: ["CMD-SHELL", "wget -q --spider http://localhost/ || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped

Monitor health status from the command line:

# Check health of all containers
docker ps --format "table {{.Names}}\t{{.Status}}"

# Watch for unhealthy containers
docker events --filter event=health_status

# Script to alert on unhealthy containers (save as /usr/local/bin/docker-health-check.sh)
#!/bin/bash
unhealthy=$(docker ps --filter health=unhealthy --format "{{.Names}}" 2>/dev/null)
if [ -n "$unhealthy" ]; then
  echo "UNHEALTHY containers: $unhealthy" | mail -s "Docker Health Alert" admin@yourdomain.com
fi

Add that script to cron as a fallback in case your main monitoring is itself down:

# Run every 5 minutes
*/5 * * * * /usr/local/bin/docker-health-check.sh

Step 5: Simple Status Page

A public status page lets users check service availability without contacting you. Both UptimeRobot and HetrixTools offer free hosted status pages.

UptimeRobot Status Page

  1. Go to My Settings > Status Pages > Add Status Page
  2. Select the monitors to display
  3. Customize with your branding
  4. Use the provided URL or point a CNAME record (e.g., status.yourdomain.com)

Self-Hosted Alternative: Gatus

For a self-hosted status page with built-in monitoring, add Gatus to your stack:

  gatus:
    image: twinproduction/gatus:v5.11.0
    container_name: gatus
    restart: unless-stopped
    volumes:
      - ./gatus/config.yaml:/config/config.yaml
      - ./gatus/data:/data
    ports:
      - "8081:8080"
    networks:
      - monitoring

Create gatus/config.yaml:

storage:
  type: sqlite
  path: /data/data.db

endpoints:
  - name: Website
    url: https://yourdomain.com
    interval: 2m
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 2000"
    alerts:
      - type: email
        send-on-resolved: true

  - name: API
    url: https://api.yourdomain.com/health
    interval: 1m
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == UP"
      - "[RESPONSE_TIME] < 1000"

  - name: SSL Certificate
    url: https://yourdomain.com
    interval: 1h
    conditions:
      - "[CERTIFICATE_EXPIRATION] > 14d"
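The `type: email` alert above only fires if a matching provider is configured at the top level of the same file. A sketch reusing the SMTP details from the Alertmanager section (field names follow Gatus's email alerting provider; all values are placeholders):

```yaml
alerting:
  email:
    from: "alerts@yourdomain.com"
    username: "alerts@yourdomain.com"
    password: "your-app-password"
    host: "smtp.gmail.com"
    port: 587
    to: "admin@yourdomain.com"
```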

Troubleshooting

Prometheus shows targets as DOWN

# Check if exporters are reachable from within the Docker network
docker exec prometheus wget -qO- http://node-exporter:9100/metrics | head -5

# Check container logs
docker logs node-exporter --tail 20

# Verify network connectivity
docker network inspect monitoring_monitoring

Grafana cannot reach Prometheus

When adding the data source, use the Docker service name http://prometheus:9090, not localhost. Both containers must be on the same Docker network.

Alertmanager not sending emails

# Check Alertmanager logs
docker logs alertmanager --tail 30

# Verify config syntax
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml

# For Gmail: use an App Password, not your regular password
# Enable at https://myaccount.google.com/apppasswords

cAdvisor high CPU usage

On systems with many containers, cAdvisor can consume significant resources. Limit its collection scope:

  cadvisor:
    command:
      - '--housekeeping_interval=30s'
      - '--docker_only=true'
      - '--disable_metrics=percpu,sched,tcp,udp,disk,diskIO,hugetlb,referenced_memory'

Disk filling up from Prometheus data

# Check current storage size
du -sh /var/lib/docker/volumes/monitoring_prometheus_data/_data/

# Reduce retention in docker-compose.yml
# Change: --storage.tsdb.retention.time=30d
# To:     --storage.tsdb.retention.time=15d

# Restart Prometheus
docker compose restart prometheus

Prevention and Best Practices

Layered Monitoring Strategy

  1. External uptime monitoring (UptimeRobot/HetrixTools) — catches outages visible to end users
  2. Internal metrics (Prometheus + Grafana) — provides deep visibility into resource usage and trends
  3. Container health checks — enables Docker to auto-restart failed services
  4. Log monitoring — add Loki or ship logs to catch application-level errors
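For the log-monitoring layer, a minimal Loki service can be added to the same compose file. A sketch (the image tag is an example; you will still need a shipper such as Promtail or Docker's Loki logging driver to feed it logs):

```yaml
  loki:
    image: grafana/loki:2.9.0
    container_name: loki
    restart: unless-stopped
    volumes:
      - loki_data:/loki   # declare loki_data under the top-level volumes: key
    ports:
      - "3100:3100"
    networks:
      - monitoring
```

Then add Loki as a second Grafana data source at http://loki:3100 and you can correlate a CPU spike on a dashboard with the application errors logged at the same moment.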

Alert Hygiene

  - Alert only on conditions that require action; delete alerts you routinely ignore
  - Use "for:" durations, as in the rules above, so transient spikes do not page you
  - Route warnings to email and reserve instant channels (Slack, SMS) for critical alerts
  - Review thresholds after every false alarm

Security Considerations

  - Do not expose Prometheus (9090), Alertmanager (9093), or the exporters to the public internet; bind them to localhost or restrict them with a firewall
  - Change the default Grafana admin password and keep sign-up disabled
  - Put Grafana behind a reverse proxy with HTTPS if you access it remotely

Maintenance Routine

# Monthly: update monitoring stack
cd ~/monitoring
docker compose pull
docker compose up -d

# Weekly: check disk usage trend
df -h / | tail -1

# Verify all Prometheus targets are healthy
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | .labels.job'

A well-configured monitoring setup pays for itself the first time it wakes you up at 2 AM instead of letting your users discover the outage at 9 AM. Start with the free external tools today, then layer in Prometheus and Grafana as your infrastructure grows.

Need Expert Help?

Want monitoring + alerts set up properly? €49, done in 30 min.

Book Now — €49

100% money-back guarantee


Harald Roessler

Infrastructure Engineer with 20+ years experience. Founder of DSNCON GmbH.