How to Set Up Server Monitoring: Know When Your Server Goes Down
Set up uptime monitoring, alerts, and a basic Prometheus + Grafana stack. Free and paid options compared.
TL;DR
Use free external monitoring (UptimeRobot or HetrixTools) for uptime checks and instant alerts. For deeper insight into CPU, RAM, disk, and container health, deploy Prometheus + Grafana via Docker Compose. Combine both approaches for comprehensive coverage: external monitoring catches outages you cannot detect from inside your own infrastructure, while Prometheus gives you the metrics to diagnose root causes and predict failures before they happen.
Prerequisites
- A Linux server (Ubuntu 22.04/24.04 or Debian 12 recommended) with root or sudo access
- Docker and Docker Compose installed (`docker compose version` should return v2.x+)
- A domain name pointing to your server (for SSL monitoring and status page)
- Basic familiarity with the Linux command line
- An email address for alert notifications
- Optional: A Slack workspace for Slack alerts
What to Monitor
Before setting up any tools, understand what matters for server reliability:
Infrastructure Metrics
- CPU usage — sustained loads above 80% indicate capacity problems
- RAM usage — memory exhaustion causes OOM kills and service crashes
- Disk usage — a full disk will bring down databases, logging, and often the entire system
- Disk I/O — high wait times point to storage bottlenecks
- Network traffic — unexpected spikes can indicate attacks or misconfiguration
Application Metrics
- HTTP response codes — monitor for 5xx errors
- Response time — detect performance degradation before users complain
- SSL certificate expiry — expired certificates break trust and access
- Docker container status — detect crashed or unhealthy containers
- Service-specific health endpoints — databases, queues, caches
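Before automating anything, you can spot-check several of these infrastructure metrics by hand. A quick sketch using standard Linux tools (paths are standard, the output format is illustrative):

```shell
# Load average (compare the 1-minute figure against your core count)
awk '{print "load (1m):", $1}' /proc/loadavg
nproc

# Memory: percent used, derived from /proc/meminfo
awk '/MemTotal/{t=$2} /MemAvailable/{a=$2} END {printf "mem used: %.1f%%\n", (1-a/t)*100}' /proc/meminfo

# Disk: root filesystem usage
df -h /
```

The meminfo calculation above is the same formula the HighMemoryUsage alert rule later in this guide uses, just computed locally instead of by Prometheus.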
Step 1: Free Uptime Monitoring Services
External uptime monitors check your server from outside your network. This is critical because internal monitoring cannot detect network-level outages or routing problems that prevent users from reaching your server.
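To see what these services do under the hood (or as a stopgap before signing up), note that the core of an external check is just an HTTP request made from a different machine. A minimal sketch, with yourdomain.com as a placeholder:

```shell
#!/bin/bash
# Minimal external uptime check. Run from a machine OUTSIDE the network
# you are monitoring, e.g. via cron on a cheap second VPS.
check_url="https://yourdomain.com"   # placeholder target

# --max-time bounds the probe; on connection failure curl reports "000"
code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$check_url" 2>/dev/null) || code=000

if [ "$code" -lt 200 ] || [ "$code" -ge 400 ]; then
  echo "DOWN: $check_url returned HTTP $code"
fi
```

A hosted service still beats this script (multiple probe locations, escalation, alert routing), but the principle is identical.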
UptimeRobot
UptimeRobot offers 50 monitors at 5-minute intervals on the free plan.
- Create an account at uptimerobot.com
- Click Add New Monitor
- Configure your first monitor:
- Monitor Type: HTTP(s)
- Friendly Name: My Server — Main Site
- URL: https://yourdomain.com
- Monitoring Interval: 5 minutes (free tier)
Add additional monitors for critical endpoints:
- https://yourdomain.com/api/health — API health check
- A Ping monitor for your server IP — catches network-level outages
- A Port monitor for SSH (22), SMTP (587), or database ports if externally accessible
- A Keyword monitor that checks for specific text on a page — catches cases where the server responds 200 but serves error content
HetrixTools
HetrixTools offers 15 monitors at 1-minute intervals on the free plan, plus built-in SSL and blacklist monitoring.
- Create an account at hetrixtools.com
- Navigate to Uptime Monitors and click Add Monitor
- Choose Website and enter your URL
- Enable SSL Certificate Monitoring — you will be alerted 30, 14, and 7 days before expiry
- Add monitoring locations across multiple continents for better outage detection
For SSL expiry monitoring specifically, you can also check certificates from the command line:
echo | openssl s_client -servername yourdomain.com -connect yourdomain.com:443 2>/dev/null | openssl x509 -noout -dates
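To turn that output into a number you can script against, compute the days remaining until expiry. A small helper sketch (GNU `date` assumed; the commented lines show how it would be fed from the `openssl` command above):

```shell
#!/bin/bash
# days_left: days from now until a certificate's notAfter timestamp,
# e.g. "Dec 31 23:59:59 2030 GMT" (the format openssl x509 prints).
days_left() {
  local exp_s now_s
  exp_s=$(date -d "$1" +%s)   # GNU date syntax
  now_s=$(date +%s)
  echo $(( (exp_s - now_s) / 86400 ))
}

# Feeding it from a live host (yourdomain.com is a placeholder):
# exp=$(echo | openssl s_client -servername yourdomain.com -connect yourdomain.com:443 2>/dev/null \
#   | openssl x509 -noout -enddate | cut -d= -f2)
# days_left "$exp"
```

Wrapped in cron with a threshold check, this is a serviceable fallback if the hosted SSL alerts ever fail.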
Step 2: Prometheus + Grafana with Docker Compose
External monitors tell you that something is down. Prometheus and Grafana tell you why and help you see problems coming.
Directory Structure
mkdir -p ~/monitoring/{prometheus,grafana,alertmanager}
cd ~/monitoring
Docker Compose File
Create docker-compose.yml:
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.4.0
    container_name: grafana
    restart: unless-stopped
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme_s3cure!
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    networks:
      - monitoring

  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    ports:
      - "9115:9115"
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge
Prometheus Configuration
Create prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alert-rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://yourdomain.com
          - https://api.yourdomain.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  - job_name: 'blackbox-ssl'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://yourdomain.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
Alert Rules
Create prometheus/alert-rules.yml:
groups:
  - name: server-alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes (current: {{ $value | printf \"%.1f\" }}%)"

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current: {{ $value | printf \"%.1f\" }}%)"

      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Root filesystem is {{ $value | printf \"%.1f\" }}% full"

      - alert: ContainerDown
        expr: absent(container_last_seen{name=~".+"}) or (time() - container_last_seen{name=~".+"}) > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"

      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} is down"
          description: "HTTP probe has been failing for more than 2 minutes"

      - alert: SslCertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate for {{ $labels.instance }} expires in {{ $value | printf \"%.0f\" }} days"
Start the Stack
cd ~/monitoring
docker compose up -d
# Verify all containers are running
docker compose ps
# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
Open Grafana at http://your-server-ip:3000, log in with the credentials from the compose file, and add Prometheus as a data source with URL http://prometheus:9090. Import dashboard IDs 1860 (Node Exporter Full) and 14282 (cAdvisor) from grafana.com for pre-built visualizations.
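If you prefer not to click through the UI, Grafana can also pick up the data source automatically at startup via its file-based provisioning mechanism. A sketch, assuming you mount a local `./grafana/provisioning` directory into the container (the path under `/etc/grafana/provisioning/datasources/` is where Grafana looks):

```yaml
# grafana/provisioning/datasources/prometheus.yaml
# Mount into the grafana service in docker-compose.yml, e.g.:
#   volumes:
#     - ./grafana/provisioning:/etc/grafana/provisioning
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Provisioned data sources survive container rebuilds, which makes the stack reproducible from the repository alone.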
Step 3: Email and Slack Alerts
Alertmanager Configuration
Create alertmanager/alertmanager.yml:
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email-slack'
  routes:
    - match:
        severity: critical
      receiver: 'email-slack'
      repeat_interval: 1h

receivers:
  - name: 'email-slack'
    email_configs:
      - to: 'admin@yourdomain.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#server-alerts'
        send_resolved: true
        title: '{{ if eq .Status "firing" }}ALERT{{ else }}RESOLVED{{ end }}: {{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }} {{ end }}'
After creating the file, restart Alertmanager:
docker compose restart alertmanager
Testing Alerts
Verify Alertmanager is receiving from Prometheus:
# Check pending and firing alerts
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels.alertname'
# Send a test alert
curl -X POST http://localhost:9093/api/v2/alerts \
-H "Content-Type: application/json" \
-d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"This is a test alert"},"startsAt":"'$(date -u +%Y-%m-%dT%H:%M:%S.000Z)'","generatorURL":"http://localhost:9090"}]'
Step 4: Docker Container Health Checks
Docker health checks let the daemon know whether a container is actually working, not just running. Add them to your application containers:
services:
  webapp:
    image: your-app:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    restart: unless-stopped

  database:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost/ || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped
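One common gotcha: slim and alpine images often ship without curl, so a curl-based healthcheck fails on every run and marks a perfectly healthy container unhealthy. If the image includes BusyBox, its wget works as a substitute. A sketch with a hypothetical api service (the image name and port are placeholders):

```yaml
  api:
    image: your-api:latest   # placeholder image
    healthcheck:
      # BusyBox wget: --spider requests the URL without downloading the body
      test: ["CMD-SHELL", "wget -q --spider http://localhost:8080/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped
```

Verify the check binary actually exists in the image first, e.g. with `docker run --rm your-api:latest which wget`.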
Monitor health status from the command line:
# Check health of all containers
docker ps --format "table {{.Names}}\t{{.Status}}"
# Watch for unhealthy containers
docker events --filter event=health_status
# Script to alert on unhealthy containers
#!/bin/bash
unhealthy=$(docker ps --filter health=unhealthy --format "{{.Names}}" 2>/dev/null)
if [ -n "$unhealthy" ]; then
echo "UNHEALTHY containers: $unhealthy" | mail -s "Docker Health Alert" admin@yourdomain.com
fi
Add that script to cron as a fallback in case your main monitoring is itself down:
# Run every 5 minutes
*/5 * * * * /usr/local/bin/docker-health-check.sh
Step 5: Simple Status Page
A public status page lets users check service availability without contacting you. Both UptimeRobot and HetrixTools offer free hosted status pages.
UptimeRobot Status Page
- Go to My Settings > Status Pages > Add Status Page
- Select the monitors to display
- Customize with your branding
- Use the provided URL or point a CNAME record (e.g., status.yourdomain.com)
Self-Hosted Alternative: Gatus
For a self-hosted status page with built-in monitoring, add Gatus to your stack:
  gatus:
    image: twinproduction/gatus:v5.11.0
    container_name: gatus
    restart: unless-stopped
    volumes:
      - ./gatus/config.yaml:/config/config.yaml
      - ./gatus/data:/data   # persists the sqlite database across restarts
    ports:
      - "8081:8080"
    networks:
      - monitoring
Create gatus/config.yaml:
storage:
  type: sqlite
  path: /data/data.db

endpoints:
  - name: Website
    url: https://yourdomain.com
    interval: 2m
    conditions:
      - "[STATUS] == 200"
      - "[RESPONSE_TIME] < 2000"
    alerts:
      - type: email
        send-on-resolved: true

  - name: API
    url: https://api.yourdomain.com/health
    interval: 1m
    conditions:
      - "[STATUS] == 200"
      - "[BODY].status == UP"
      - "[RESPONSE_TIME] < 1000"

  - name: SSL Certificate
    url: https://yourdomain.com
    interval: 1h
    conditions:
      - "[CERTIFICATE_EXPIRATION] > 14d"
Troubleshooting
Prometheus shows targets as DOWN
# Check if exporters are reachable from within the Docker network
docker exec prometheus wget -qO- http://node-exporter:9100/metrics | head -5
# Check container logs
docker logs node-exporter --tail 20
# Verify network connectivity
docker network inspect monitoring_monitoring
Grafana cannot reach Prometheus
When adding the data source, use the Docker service name http://prometheus:9090, not localhost. Both containers must be on the same Docker network.
Alertmanager not sending emails
# Check Alertmanager logs
docker logs alertmanager --tail 30
# Verify config syntax
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
# For Gmail: use an App Password, not your regular password
# Enable at https://myaccount.google.com/apppasswords
cAdvisor high CPU usage
On systems with many containers, cAdvisor can consume significant resources. Limit its collection scope:
  cadvisor:
    command:
      - '--housekeeping_interval=30s'
      - '--docker_only=true'
      - '--disable_metrics=percpu,sched,tcp,udp,disk,diskIO,hugetlb,referenced_memory'
Disk filling up from Prometheus data
# Check current storage size
du -sh /var/lib/docker/volumes/monitoring_prometheus_data/_data/
# Reduce retention in docker-compose.yml
# Change: --storage.tsdb.retention.time=30d
# To: --storage.tsdb.retention.time=15d
# Restart Prometheus
docker compose restart prometheus
Prevention and Best Practices
Layered Monitoring Strategy
- External uptime monitoring (UptimeRobot/HetrixTools) — catches outages visible to end users
- Internal metrics (Prometheus + Grafana) — provides deep visibility into resource usage and trends
- Container health checks — enables Docker to auto-restart failed services
- Log monitoring — add Loki or ship logs to catch application-level errors
Alert Hygiene
- Only alert on conditions that require human action. Noisy alerts get ignored.
- Use `for:` clauses in Prometheus rules to avoid alerting on brief spikes
- Set `repeat_interval` high enough to avoid alert fatigue (4h for warnings, 1h for critical)
- Always include `send_resolved: true` so you know when issues clear
Security Considerations
- Never expose Prometheus or Alertmanager to the public internet without authentication
- Put Grafana behind a reverse proxy with HTTPS
- Change default Grafana credentials immediately
- Restrict ports in the compose file to `127.0.0.1:9090:9090` if only accessed locally or through a reverse proxy
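As a concrete illustration of the last point, binding a service to loopback in docker-compose.yml keeps it reachable from the host (and from a reverse proxy running there) but invisible to the internet:

```yaml
# In docker-compose.yml, under the prometheus service:
    ports:
      - "127.0.0.1:9090:9090"   # reachable from this host only
# Repeat the same pattern for alertmanager (9093) and cadvisor (8080).
```

Grafana on port 3000 usually stays the only externally reachable piece, and even that belongs behind HTTPS.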
Maintenance Routine
# Monthly: update monitoring stack
cd ~/monitoring
docker compose pull
docker compose up -d
# Weekly: check disk usage trend
df -h / | tail -1
# Verify all Prometheus targets are healthy
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | .labels.job'
A well-configured monitoring setup pays for itself the first time it wakes you up at 2 AM instead of letting your users discover the outage at 9 AM. Start with the free external tools today, then layer in Prometheus and Grafana as your infrastructure grows.
Need Expert Help?
Want monitoring + alerts set up properly? €49, done in 30 min.
Book Now — €49. 100% money-back guarantee.