Is Your Infrastructure a Risk? How to Assess Hosting, CI/CD, and Monitoring
A practical guide to assessing your infrastructure: hosting evaluation, CI/CD audit, monitoring gaps, disaster recovery, and cloud cost optimization.
TL;DR
Most infrastructure isn't obviously broken — it's silently accumulating risk. No monitoring means you find out about outages from customers. No tested backups means your disaster recovery plan is fiction. Unclear CI/CD means deployments are scary instead of boring. This guide walks you through a systematic assessment of your hosting, CI/CD, monitoring, disaster recovery, and cloud costs — with a decision matrix to prioritize what to fix first.
Prerequisites
- Admin access to your hosting environment (cloud console, server SSH)
- Access to CI/CD configuration (GitHub Actions, GitLab CI, Jenkins, etc.)
- Access to monitoring tools (if any exist)
- A list of all production services and their dependencies
- Cloud billing access for cost analysis
Step 1: Hosting Evaluation
Your hosting setup is the foundation. If it's wrong, everything built on top is unstable.
Hosting Audit Checklist
Infrastructure Inventory:
[ ] All production servers/services are documented
[ ] IP addresses, domains, and DNS records are recorded
[ ] Server OS versions are current (within supported lifecycle)
[ ] No servers are running end-of-life operating systems
[ ] SSH access uses key-based authentication (no passwords)
[ ] Firewall is configured (only necessary ports open)
[ ] TLS/SSL certificates are valid and auto-renewing
Architecture:
[ ] Single points of failure are identified
[ ] Database has replication or automated backup
[ ] Application servers can handle 2x current load
[ ] Static assets are served via CDN
[ ] Load balancer is configured (if multiple app servers)
[ ] Health check endpoints exist for every service
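The TLS certificate item above can be spot-checked from a terminal. A minimal sketch, assuming OpenSSL and GNU `date` are available (the helper name is ours):

```shell
# Days until a PEM certificate expires. For a live endpoint, fetch the
# certificate first, e.g.:
#   echo | openssl s_client -servername example.com \
#     -connect example.com:443 2>/dev/null | openssl x509 > cert.pem
cert_days_left() {
  local end
  # "-enddate" prints "notAfter=<date>"; GNU date parses the remainder
  end=$(openssl x509 -noout -enddate -in "$1" | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}
```

Anything under 30 days without auto-renewal in place should be treated as a finding.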
Server Resource Assessment
# Check current resource usage
# CPU usage (last 15 minutes)
uptime
# Memory usage
free -h
# Disk usage
df -h
# Disk I/O
iostat -x 1 5
# Network connections
ss -tuln
# Running processes by resource consumption
top -bn1 | head -20
# Docker resource usage (if applicable)
docker stats --no-stream
# Kubernetes resource usage (if applicable)
kubectl top nodes
kubectl top pods --all-namespaces
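Raw `df` output is easy to misread under pressure; a small filter that flags anything over 80% full (a common convention, not a standard) can run from cron. Here it is fed a fixed sample so the expected output is visible — in production, replace the `printf` with a real `df -P`:

```shell
# Flag filesystems above 80% usage (sample data stands in for df -P).
printf '%s\n' \
  'Filesystem 1024-blocks Used Available Capacity Mounted' \
  '/dev/sda1 100 90 10 90% /' \
  '/dev/sdb1 100 10 90 10% /data' |
awk 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > 80) print "WARN:", $6, $5"%" }'
# prints: WARN: / 90%
```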
Hosting Type Evaluation
Hosting Type | Good For | Watch Out For
----------------+-----------------------------+----------------------------
Shared hosting | Marketing sites, blogs | No scaling, noisy neighbors
VPS (Hetzner, | Small-medium apps, dev | Manual management, single
DigitalOcean) | environments | region, no auto-scaling
Managed K8s | Production workloads, | Complexity, cost, learning
(GKE, EKS) | microservices | curve for small teams
Serverless | Event-driven, variable | Cold starts, vendor lock-in,
(Lambda, Cloud | load, APIs | debugging difficulty
Functions) | |
PaaS (Vercel, | Frontend, simple backends | Cost at scale, limited
Render, Fly) | | configuration control
Decision: If your team has < 3 engineers and no dedicated DevOps,
use a PaaS or managed service. Kubernetes is for teams that can
afford to operate it.
Step 2: CI/CD Assessment
Deployments should be boring. If they're not, your CI/CD needs work.
CI/CD Health Checklist
Build Pipeline:
[ ] Every commit triggers an automated build
[ ] Build time is under 10 minutes (< 5 preferred)
[ ] Build failures are notified immediately (Slack, email)
[ ] Build artifacts are versioned and stored
[ ] Build is reproducible (same commit = same artifact)
Test Pipeline:
[ ] Unit tests run on every PR
[ ] Integration tests run on merge to main
[ ] Test coverage is measured (target: 70%+ for critical paths)
[ ] Flaky tests are tracked and fixed (not just rerun)
[ ] Security scanning runs automatically (dependency audit, SAST)
Deployment Pipeline:
[ ] Deployment to staging is automated on merge to main
[ ] Deployment to production requires explicit approval
[ ] Rollback is possible in under 5 minutes
[ ] Database migrations are forward-compatible (no breaking changes)
[ ] Deployment creates a tagged release with changelog
[ ] Zero-downtime deployment is configured (rolling update, blue-green)
Environment Management:
[ ] Staging environment mirrors production
[ ] Environment variables are managed securely (not in code)
[ ] Secrets rotation is possible without deployment
[ ] Feature flags exist for gradual rollouts
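For the secrets items, a dedicated scanner such as gitleaks or trufflehog is the right tool; as a stopgap, a crude grep catches the most obvious leak (the function name is ours, and the pattern covers only unencrypted private key headers):

```shell
# List files in a directory tree that contain a private key header.
scan_secrets() {
  grep -rEl -- 'BEGIN (RSA|EC|OPENSSH) PRIVATE KEY' --exclude-dir=.git "$1"
}
```

Run it against your repo root as a first pass before wiring a real scanner into CI.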
Measuring CI/CD Health — DORA Metrics
Metric                    | Elite    | High      | Medium    | Low
--------------------------+----------+-----------+-----------+-----------
Deployment Frequency      | Multiple | Daily -   | Weekly -  | Monthly -
                          | per day  | weekly    | monthly   | 6-monthly
Lead Time for Changes     | < 1 hour | 1 day -   | 1 week -  | 1 month -
                          |          | 1 week    | 1 month   | 6 months
Change Failure Rate       | 0-15%    | 16-30%    | 16-30%    | 46-60%
Mean Time to Recovery     | < 1 hour | < 1 day   | < 1 day   | 1 week -
(MTTR)                    |          |           |           | 1 month
How to measure:
- Deployment frequency: count deployments per week
- Lead time: measure time from commit to production
- Change failure rate: (failed deployments / total deployments)
- MTTR: average time from incident detection to resolution
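The arithmetic is simple enough to script against whatever deployment log you keep. A sketch with hypothetical counts:

```shell
failed=3; total=40    # hypothetical counts from your deployment log
awk -v f="$failed" -v t="$total" \
  'BEGIN { printf "Change failure rate: %.1f%%\n", 100 * f / t }'
# prints: Change failure rate: 7.5%
```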
Common CI/CD Anti-Patterns
Anti-Pattern | Fix
--------------------------------+------------------------------------
"Works on my machine" | Docker-based build environment
Manual deployment steps | Automate everything in pipeline
No staging environment | Mirror production with same config
Tests only run locally | Enforce tests in CI, block merge
Secrets in code/config files | Use CI/CD secret management
No rollback plan | Blue-green or rolling deployment
Build takes 30+ minutes | Parallelize, cache dependencies
Flaky tests are rerun, not fixed | Track flaky tests, fix root cause
Step 3: Monitoring Gaps
If you don't know something is broken, you can't fix it. Most teams have significant monitoring gaps.
The Four Pillars of Monitoring
Pillar | What It Tells You | Tools
----------------+--------------------------------+-----------------------
Metrics | Is the system healthy? | Prometheus, Datadog,
| (CPU, memory, request rate, | CloudWatch
| error rate, latency) |
Logs | What happened? | ELK, Loki, CloudWatch
| (application events, errors, | Logs
| access logs) |
Traces | Where is it slow? | Jaeger, Tempo, Datadog
| (request flow through | APM, New Relic
| services) |
Alerts | When should you wake up? | PagerDuty, Opsgenie,
| (threshold breaches, | Alertmanager
| anomalies, downtime) |
Monitoring Checklist
Infrastructure Monitoring:
[ ] CPU usage per server/container (alert > 80% sustained)
[ ] Memory usage (alert > 85%)
[ ] Disk usage (alert > 80%)
[ ] Network throughput and errors
[ ] SSL certificate expiry (alert 30 days before)
[ ] DNS resolution time
Application Monitoring:
[ ] Request rate (requests per second)
[ ] Error rate (5xx responses / total responses)
[ ] Latency (p50, p95, p99)
[ ] Apdex score or SLI/SLO tracking
[ ] Background job queue depth and failure rate
[ ] Database query performance (slow query log)
Business Monitoring:
[ ] User signups / logins (sudden drops = problem)
[ ] Payment processing (failures, success rate)
[ ] Key feature usage (is the core product working?)
[ ] API usage by client (who is consuming what?)
Alert Quality:
[ ] Every alert has a runbook linked
[ ] Alerts are actionable (not just "CPU is high")
[ ] Alert fatigue is measured (if you ignore alerts, they're wrong)
[ ] On-call rotation exists with escalation policy
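If you have no APM yet, p50/p95/p99 latency can be pulled from raw samples with sort and awk. A sketch using generated data in place of a real access-log extract:

```shell
# Hypothetical data: latencies of 1..100 ms, one sample per line.
seq 1 100 > /tmp/latencies.txt
sort -n /tmp/latencies.txt | awk '{ a[NR] = $1 }
  END { printf "p50=%d p95=%d p99=%d\n",
        a[int(NR * 0.50)], a[int(NR * 0.95)], a[int(NR * 0.99)] }'
# prints: p50=50 p95=95 p99=99
```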
Minimum Viable Monitoring Stack
# For small teams (< 5 engineers), start here:
1. Uptime monitoring: UptimeRobot or Better Stack (free tier)
- Ping every production URL every 1 minute
- Alert via Slack + SMS
2. Error tracking: Sentry (free tier)
- Catches unhandled exceptions
- Groups errors, shows frequency
3. Log aggregation: Loki + Grafana or managed service
- Search logs across all services
- Correlate errors with timestamps
4. Infrastructure metrics: Prometheus + Grafana or Datadog
- CPU, memory, disk, network
- Custom application metrics
Time to set up: 4-8 hours
Cost: $0-50/month for small deployments
Step 4: Disaster Recovery — RTO and RPO
Every business has disaster recovery needs. Most don't know what they are until it's too late.
Define Your RTO and RPO
RTO (Recovery Time Objective): How long can your service be down?
RPO (Recovery Point Objective): How much data can you afford to lose?
Example scenarios:
E-commerce platform:
RTO: 1 hour (every hour of downtime = lost revenue)
RPO: 0 (zero data loss for orders)
→ Requires: hot standby, real-time replication, automated failover
Internal knowledge base:
RTO: 24 hours (team can work without it for a day)
RPO: 24 hours (losing a day of edits is acceptable)
→ Requires: daily backups, documented restore procedure
SaaS product:
RTO: 4 hours (SLA commitment)
RPO: 1 hour (hourly backups or WAL shipping)
→ Requires: automated backups, tested restore, standby infrastructure
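The RPO translates directly into backup frequency. For the 1-hour SaaS case above, a crontab sketch (paths and the rclone remote name are placeholders):

```shell
# Hourly dump to meet a 1-hour RPO; % must be escaped in crontab.
# 0 * * * *  pg_dump -Fc -f /backup/mydb_$(date +\%Y\%m\%d\%H).dump mydb
# Offsite copy shortly after, so the RPO survives a region loss:
# 15 * * * * rclone copy /backup offsite:mydb-backups
```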
Disaster Recovery Audit
Backups:
[ ] Automated backups exist for all databases
[ ] Backup frequency matches RPO requirement
[ ] Backups are stored offsite (different region/provider)
[ ] Backup encryption is enabled
[ ] Backup restore has been tested in the last 90 days
[ ] Time to restore is documented and within RTO
Failover:
[ ] Failover procedure is documented
[ ] Failover has been tested (not just documented)
[ ] DNS TTL is low enough for fast failover (300s or less)
[ ] Application can run in a different region/zone
[ ] Database replication is configured and monitored
Communication:
[ ] Incident response team is defined
[ ] Communication channels exist (not dependent on the failing infra)
[ ] Status page is on separate infrastructure
[ ] Customer notification template is prepared
Test Your Backups
# PostgreSQL backup test
# 1. Create the backup (custom format)
BACKUP=/backup/mydb_$(date +%Y%m%d).dump
pg_dump -Fc -f "$BACKUP" mydb
# 2. Restore it to a test database
createdb mydb_restore_test
pg_restore -d mydb_restore_test "$BACKUP"
# 3. Verify data integrity
psql mydb_restore_test -c "SELECT count(*) FROM users;"
psql mydb_restore_test -c "SELECT count(*) FROM orders;"
# 4. Compare counts with production
# If they match, your backup works. If not, investigate.
# 5. Clean up
dropdb mydb_restore_test
# Schedule this test monthly. Put it in the calendar.
Step 5: Cloud Cost Optimization
Cloud bills grow silently. By the time someone looks, you're paying 2-5x what you should.
Quick Cost Wins
Check | Typical Savings | Effort
------------------------------+-----------------+--------
Rightsize over-provisioned VMs| 20-40% | Low
Delete unused resources | 5-15% | Low
(unattached volumes, old | |
snapshots, idle load balancers)| |
Use reserved/committed use | 30-60% | Medium
discounts | |
Schedule dev/staging shutdown | 40-70% on dev | Low
(stop overnight, weekends) | |
Switch to ARM instances | 20-30% | Medium
(Graviton, Ampere) | |
Optimize data transfer | 10-30% | Medium
(CDN, compression, caching) | |
Review log storage costs | 10-50% | Low
(retention policies) | |
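The dev/staging shutdown line deserves a quick sanity check. Stopping instances nights and weekends leaves roughly 5 × 12 = 60 running hours per week instead of 168, and the percentage saved is independent of the hourly rate:

```shell
awk 'BEGIN {
  always = 168; scheduled = 5 * 12   # running hours per week
  printf "scheduled running saves %.0f%%\n", 100 * (1 - scheduled / always)
}'
# prints: scheduled running saves 64%
```

64% sits inside the 40-70% range in the table; attached storage still bills while instances are stopped, which is why compute-only savings never reach 100% of the line item.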
Cost Visibility Checklist
[ ] Resource tagging is enforced (team, project, environment)
[ ] Monthly cost reports are reviewed by engineering
[ ] Cost per service/team is tracked
[ ] Budget alerts are configured (50%, 80%, 100% of budget)
[ ] Unused resources are identified monthly
[ ] Cost anomaly detection is enabled (sudden spikes)
The "Right Tool" Cost Comparison
# Example: Running a 4-core, 16GB app server
AWS EC2 (on-demand, us-east-1, t3.xlarge):
$0.1664/hour × 730 hours = ~$121/month
AWS EC2 (1-year reserved, no upfront):
$0.105/hour × 730 hours = ~$77/month (36% savings)
Hetzner Cloud (CX41):
~$15/month (88% savings vs AWS on-demand)
Hetzner Dedicated (AX41-NVMe):
~$44/month (64% savings, bare metal performance)
Question: Does your workload need AWS-specific services
(RDS, SQS, Lambda)? If not, a European VPS provider may
deliver the same result at 70-90% less cost.
Step 6: Infrastructure Decision Matrix
After the assessment, prioritize what to fix. Not everything is equally urgent.
Risk Scoring Matrix
For each finding, score:
Impact (1-5):
1 = Cosmetic / minor inconvenience
2 = Degraded experience for some users
3 = Service disruption for some users
4 = Full service outage
5 = Data loss or security breach
Likelihood (1-5):
1 = Very unlikely (once in 5+ years)
2 = Unlikely (once per year)
3 = Possible (once per quarter)
4 = Likely (once per month)
5 = Almost certain (weekly or more)
Risk Score = Impact × Likelihood
Priority:
20-25: Critical — fix this week
12-19: High — fix this month
6-11: Medium — plan for next quarter
1-5: Low — backlog
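The scoring is mechanical enough to script if you keep findings in a file. A sketch of the bucketing, using the thresholds from the priority list above (the function name is ours):

```shell
score() {
  local s=$(( $1 * $2 ))          # impact x likelihood
  if   [ "$s" -ge 20 ]; then echo "$s Critical"
  elif [ "$s" -ge 12 ]; then echo "$s High"
  elif [ "$s" -ge 6  ]; then echo "$s Medium"
  else                       echo "$s Low"
  fi
}
score 5 3    # untested backup restore -> prints: 15 High
score 1 5    # over-provisioned resources -> prints: 5 Low
```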
Example Assessment Output
Finding | Impact | Likelihood | Risk | Priority
--------------------------------+--------+------------+------+----------
No tested backup restore | 5 | 3 | 15 | High
No monitoring or alerting | 4 | 4 | 16 | High
SSH password auth enabled | 5 | 2 | 10 | Medium
No staging environment | 3 | 4 | 12 | High
Manual deployments | 3 | 4 | 12 | High
Over-provisioned cloud resources| 1 | 5 | 5 | Low
No log aggregation | 2 | 4 | 8 | Medium
SSL cert not auto-renewing | 4 | 3 | 12 | High
No incident response plan | 4 | 2 | 8 | Medium
Dev environment uses prod data | 5 | 3 | 15 | High
Troubleshooting & Considerations
"We don't have time to fix everything"
You don't have to. Use the risk matrix to focus on high-impact, high-likelihood items first. Fixing the top 3 findings usually eliminates 80% of the risk. Schedule the rest as part of regular engineering work — one infrastructure improvement per sprint.
"Our cloud bill is too high but we can't migrate"
Start with the quick wins: rightsize instances, delete unused resources, schedule dev environments. These require no architecture changes and typically save 20-40%. Then evaluate reserved instances or committed use discounts for stable workloads.
"We don't know what we're running"
This is finding #1. Create an inventory: every server, every service, every domain, every database. Use cloud console tags, Terraform state, or even a spreadsheet. You cannot assess what you cannot see. Budget 2-4 hours for the initial inventory.
"Management doesn't see infrastructure as a priority"
Translate findings to business language. "No tested backups" becomes "We could lose all customer data with no recovery option." "No monitoring" becomes "We find out about outages when customers complain on Twitter." "Manual deployments" becomes "Every release carries risk of human error causing downtime." The risk matrix quantifies the business impact.
Prevention & Best Practices
Quarterly Infrastructure Reviews
Run this assessment every 90 days. Set a calendar reminder. Infrastructure degrades silently — disk fills up, certificates expire, dependencies become vulnerable, costs drift upward. A quarterly 2-hour review catches problems before they become incidents.
Infrastructure as Code
Every piece of infrastructure should be defined in code (Terraform, Pulumi, CloudFormation). Manual changes in cloud consoles are invisible, undocumented, and unreproducible. If it's not in code, it doesn't exist in disaster recovery.
Runbooks for Everything
For every service in production, write a runbook: how to deploy, how to rollback, how to restore from backup, how to debug common issues, who to contact. The runbook should be usable by someone who has never seen the service before — because at 3 AM, that might be the person on call.
Cost Reviews as Engineering Practice
Include cloud costs in sprint reviews. Make cost visible to the team. When engineers see the bill, they make different architecture decisions. A monthly 15-minute cost review prevents the annual "why is our AWS bill $50k?" surprise.
Game Days
Once a quarter, simulate a failure. Kill a server, corrupt a database (in staging), trigger failover. This is the only way to know if your disaster recovery actually works. Netflix calls this Chaos Engineering. You don't need Netflix's scale to benefit from it — a simple "what happens if this server dies" test is already valuable.
Need Expert Help?
Want a professional assessment? €250 for a 90-minute deep dive plus a written report.
Book Now — €250. 100% money-back guarantee.