Kubernetes Security & Health Checklist: Is Your Cluster Production-Ready?
K8s production readiness checklist. RBAC, network policies, resource limits, secrets management, monitoring.
TL;DR
A Kubernetes cluster that runs workloads is not the same as a production-ready cluster. This checklist covers the 8 areas most commonly misconfigured: RBAC, network policies, resource limits, pod security, secrets management, Helm hygiene, monitoring, and etcd backups. Run through each section, execute the commands, and fix what fails. If everything passes, your cluster is in better shape than 90% of what we audit.
Prerequisites
- kubectl access to the cluster with admin privileges
- Helm 3 installed
- A functioning cluster (managed or self-hosted)
- Access to the cluster's etcd (for backup section, self-hosted only)
Step 1: RBAC Audit
Role-Based Access Control is the foundation of cluster security. Misconfigured RBAC is the most common finding in Kubernetes audits.
Check for Overprivileged Service Accounts
# List all ClusterRoleBindings that grant cluster-admin
kubectl get clusterrolebindings -o json | \
jq '.items[] | select(.roleRef.name=="cluster-admin") |
{name: .metadata.name, subjects: .subjects}'
# Expected: Only system components and your admin user
# Red flag: Application service accounts with cluster-admin
Check Default Service Account Usage
# Find pods using the default service account
kubectl get pods --all-namespaces -o json | \
jq '.items[] | select(.spec.serviceAccountName=="default" or
.spec.serviceAccountName==null) |
{namespace: .metadata.namespace, name: .metadata.name}'
Every workload should have its own service account with minimal permissions. The default service account should not be used — ever.
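As a sketch, a dedicated service account bound to a minimal, namespace-scoped Role might look like this (the names `api-sa` and `api-role`, and the ConfigMap-only permissions, are illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-sa
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-role
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]   # explicit verbs, no wildcards
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-rolebinding
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: api-role
subjects:
  - kind: ServiceAccount
    name: api-sa
    namespace: production
```

Reference it from the workload with `serviceAccountName: api-sa` in the pod spec.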
Disable Automount of Service Account Tokens
# Check which pods automount tokens unnecessarily
kubectl get pods --all-namespaces -o json | \
jq '.items[] | select(.spec.automountServiceAccountToken != false) |
{namespace: .metadata.namespace, name: .metadata.name}'
# Fix: In your pod spec, add:
# spec:
# automountServiceAccountToken: false
If a pod does not need to talk to the Kubernetes API (most don't), disable token automounting. A compromised pod with a service account token is an escalation path to the entire cluster.
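Automounting can also be disabled on the service account itself, which covers every pod that uses it (the name `api-sa` is illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-sa
  namespace: production
automountServiceAccountToken: false
```

A pod that genuinely needs API access can still opt back in by setting `automountServiceAccountToken: true` in its own spec, which takes precedence over the service account setting.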
Audit Existing Roles
# List all roles and their permissions
kubectl get roles --all-namespaces -o json | \
jq '.items[] | {namespace: .metadata.namespace,
name: .metadata.name, rules: .rules}'
# Look for:
# - Wildcard verbs: ["*"] (too broad)
# - Wildcard resources: ["*"] (too broad)
# - Secrets access without justification
# - Pod exec permissions (allows shell into any pod)
Step 2: Network Policies
By default, every pod can talk to every other pod. This means one compromised pod can reach your database, your secrets store, and your internal APIs.
Check if Network Policies Exist
# List all network policies
kubectl get networkpolicies --all-namespaces
# If this returns empty, you have ZERO network segmentation.
# Every pod can communicate with every other pod.
Default Deny Policy
Start with a default-deny policy in every namespace, then explicitly allow only the traffic you need:
# default-deny-all.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}  # Applies to all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
Allow Specific Traffic
# allow-frontend-to-api.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
# allow-api-to-db.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - protocol: TCP
          port: 5432
Allow DNS (Required)
# allow-dns.yaml (apply to every namespace with a deny-all policy)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
Step 3: Resource Limits and Requests
Without resource limits, one misbehaving pod can starve the entire node. Without requests, the scheduler makes bad placement decisions.
Find Pods Without Limits
# Pods where at least one container has no resource limits
kubectl get pods --all-namespaces -o json | \
jq '.items[] | select(any(.spec.containers[]; .resources.limits == null)) |
{namespace: .metadata.namespace, name: .metadata.name}'
Recommended Resource Configuration
resources:
  requests:
    cpu: "100m"      # Minimum CPU guaranteed
    memory: "128Mi"  # Minimum memory guaranteed
  limits:
    cpu: "500m"      # Maximum CPU allowed
    memory: "512Mi"  # Maximum memory allowed (OOMKilled if exceeded)
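One consequence worth knowing: when requests equal limits for every container, the pod gets the Guaranteed QoS class and is evicted last under node memory pressure. A sketch for a critical workload:

```yaml
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:            # requests == limits for all resources: Guaranteed QoS
    cpu: "500m"
    memory: "512Mi"
```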
Set Namespace-Level Defaults with LimitRange
# limitrange.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      type: Container
Set Namespace-Level Quotas
# resourcequota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: production
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
Step 4: Pod Security
Pods running as root with full capabilities are the Kubernetes equivalent of running everything as Administrator on Windows XP.
Find Privileged Pods
# Pods running as root or with privileged mode
kubectl get pods --all-namespaces -o json | \
jq '.items[] | select(
.spec.containers[].securityContext.privileged == true or
.spec.containers[].securityContext.runAsUser == 0
) | {namespace: .metadata.namespace, name: .metadata.name}'
Secure Pod Template
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
Pod Security Standards (PSS)
# Enforce restricted security standard on namespace
kubectl label namespace production \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted
Step 5: Secrets Management
Kubernetes Secrets are base64-encoded, not encrypted. Anyone with read access to Secrets in a namespace can decode them trivially.
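To see how thin that protection is, here is the round trip on a made-up value (`supersecret123` and the `db-credentials` Secret below are illustrative):

```shell
# Base64 is encoding, not encryption: anyone with read access can reverse it.
encoded=$(printf 'supersecret123' | base64)
echo "$encoded"                       # c3VwZXJzZWNyZXQxMjM=
printf '%s' "$encoded" | base64 -d    # supersecret123
# The same decode works on any real Secret:
# kubectl get secret db-credentials -n production \
#   -o jsonpath='{.data.password}' | base64 -d
```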
Check etcd Encryption
# Verify encryption at rest is configured (self-hosted clusters)
# Check the API server for the --encryption-provider-config flag
ps aux | grep kube-apiserver | grep encryption-provider-config
# On kubeadm clusters, the static pod manifest can also be checked:
# grep encryption-provider-config /etc/kubernetes/manifests/kube-apiserver.yaml
# If the flag is absent, Secrets sit in etcd base64-encoded, i.e. effectively unencrypted
Enable etcd Encryption at Rest
# encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>  # generate with: head -c 32 /dev/urandom | base64
      - identity: {}  # Fallback so existing unencrypted secrets remain readable
External Secrets Management
For production, consider external secrets stores:
# Using External Secrets Operator with HashiCorp Vault
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: db-credentials
  data:
    - secretKey: password
      remoteRef:
        key: secret/data/production/database
        property: password
Audit Secret Access
# Who can read secrets in production namespace?
kubectl auth can-i get secrets -n production --as=system:serviceaccount:production:default
# List all subjects with secrets access
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json | \
jq '.items[] | select(.roleRef.name as $role |
["admin","cluster-admin","edit"] | index($role)) |
{name: .metadata.name, subjects: .subjects}'
Step 6: Helm Hygiene
Helm is the most common way to deploy applications in Kubernetes, but sloppy Helm practices create security and reliability issues.
Check for Outdated Charts
# List all Helm releases with their chart versions
helm list --all-namespaces
# Check for newer chart versions (replace <chart> with your chart name)
helm repo update
helm search repo <chart> --versions | head -5
Helm Values Hygiene
# Never put secrets in values.yaml — they end up in:
# 1. Your git repository (even if you .gitignore later, it's in history)
# 2. Helm release secrets (stored as Kubernetes secrets)
# 3. Your CI/CD logs
# BAD:
# values.yaml
# database:
# password: "supersecret123"
# GOOD: Reference external secrets
# values.yaml
# database:
# existingSecret: "db-credentials"
# existingSecretKey: "password"
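The `existingSecret` pattern only helps if the chart's templates actually consume it; a chart-side sketch (assuming the hypothetical `database.existingSecret` values above) would be:

```yaml
# templates/deployment.yaml (container env, chart-side sketch)
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: {{ .Values.database.existingSecret }}
        key: {{ .Values.database.existingSecretKey }}
```

The password now lives only in the referenced Kubernetes Secret, never in values files, Helm release secrets, or CI logs.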
Pin Chart Versions
# BAD: No version pinning
helm install myapp bitnami/postgresql
# GOOD: Pin the chart version
helm install myapp bitnami/postgresql --version 16.4.1
# BETTER: Pin in a helmfile
# helmfile.yaml
releases:
  - name: postgresql
    chart: bitnami/postgresql
    version: 16.4.1
    values:
      - values/postgresql.yaml
Audit Helm Release History
# Check release history for failed deployments
helm history myapp -n production
# Limit stored revisions going forward (keep last 5); note that
# helm upgrade needs the chart as well as the release name
helm upgrade myapp <chart> -n production --reuse-values --history-max 5
Step 7: Monitoring with Prometheus and Grafana
If you cannot see what your cluster is doing, you cannot secure or operate it. Monitoring is not optional for production.
Install kube-prometheus-stack
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace \
--set grafana.adminPassword="change-me-immediately" \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
Essential Alerts to Configure
# Must-have alerts (included in kube-prometheus-stack):
# - KubePodCrashLooping: Pod restarting repeatedly
# - KubePodNotReady: Pod stuck in non-ready state
# - KubeDeploymentReplicasMismatch: Desired != actual replicas
# - KubeNodeNotReady: Node offline
# - KubeQuotaExceeded: Resource quota hit
# - etcdHighNumberOfLeaderChanges: etcd instability
# - CPUThrottlingHigh: Pods being CPU-throttled
# Custom alert example:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: monitoring
spec:
  groups:
    - name: custom.rules
      rules:
        - alert: HighMemoryUsage
          expr: |
            container_memory_working_set_bytes /
            container_spec_memory_limit_bytes > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} memory > 90% of limit"
Key Grafana Dashboards
# Import these dashboard IDs in Grafana:
# 315 - Kubernetes cluster overview
# 6417 - Kubernetes pod resources
# 13770 - Kubernetes node resources
# 14981 - kube-prometheus-stack overview
Step 8: etcd Backup and Disaster Recovery
etcd is the brain of your Kubernetes cluster. Lose etcd, lose everything. This section applies primarily to self-hosted clusters — managed Kubernetes providers handle etcd backup for you.
Manual etcd Backup
# Create a snapshot
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-$(date +%Y%m%d).db \
--write-table
Automated Backup CronJob
# etcd-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: etcd-backup
              image: bitnami/etcd:latest  # pin a specific tag in production
              command:
                - /bin/sh
                - -c
                - |
                  etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db \
                    --endpoints=https://etcd:2379 \
                    --cacert=/certs/ca.crt \
                    --cert=/certs/server.crt \
                    --key=/certs/server.key
              volumeMounts:
                - name: backup-volume
                  mountPath: /backup
                - name: etcd-certs
                  mountPath: /certs
          restartPolicy: OnFailure
          volumes:
            - name: backup-volume
              persistentVolumeClaim:
                claimName: etcd-backup-pvc
            - name: etcd-certs
              secret:
                secretName: etcd-certs
Test Your Restore
# CRITICAL: Test restore on a non-production cluster
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--data-dir=/var/lib/etcd-restored \
--initial-cluster="default=https://127.0.0.1:2380" \
--initial-advertise-peer-urls="https://127.0.0.1:2380" \
--name=default
# A backup you haven't tested restoring is not a backup.
Troubleshooting & Considerations
"Network policies are blocking legitimate traffic"
Start with default deny + allow-dns in a staging namespace first. Then add allow policies one by one. Use kubectl describe networkpolicy to verify selectors match your pods. Check that your CNI plugin supports network policies — not all do (Flannel does not, Calico and Cilium do).
"Pods are OOMKilled after setting memory limits"
Your memory limits are too low. Check actual memory usage with kubectl top pods or Prometheus metrics before setting limits. Set the limit at 1.5-2x the typical usage. Java applications are especially tricky — the JVM uses memory outside the heap that you must account for.
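For JVM workloads, one common mitigation (an illustrative sketch, not a universal setting) is to let the container-aware JVM size its heap relative to the limit, leaving headroom for off-heap memory:

```yaml
resources:
  limits:
    memory: "1Gi"
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"  # heap capped near 75% of the 1Gi limit
```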
"Pods fail to start after enabling Pod Security Standards"
Start with warn mode instead of enforce. This logs violations without blocking. Fix violations one by one, then switch to enforce. The most common issue: containers that require running as root. Use init containers or modified images that run as non-root.
"Prometheus uses too much storage"
Reduce retention period, increase scrape interval for less critical metrics, or use remote write to a long-term storage like Thanos or Cortex. Also check for high-cardinality metrics (metrics with many unique label combinations).
Prevention & Best Practices
Automate the Audit
Use tools that continuously scan your cluster: kube-bench (CIS benchmarks), Trivy (container vulnerabilities), Polaris (configuration best practices), Falco (runtime security). Run these in CI/CD and as cluster-resident agents.
GitOps for Everything
Every configuration in this guide should live in a Git repository and be applied via GitOps (ArgoCD or Flux). No manual kubectl apply in production. If it's not in Git, it didn't happen.
Regular Audit Schedule
Run this checklist quarterly. Set calendar reminders. Security degrades over time as new workloads are added, permissions are expanded "temporarily," and monitoring is overlooked during feature sprints.
Upgrade Strategy
Keep your Kubernetes version within the supported window (latest 3 minor versions). Patch versions should be applied within 2 weeks of release. Have a tested upgrade runbook that includes backup verification before every upgrade.
Need Expert Help?
Want a professional K8s audit? €150, 60-min review + prioritized action report.
Book Now — €150. 100% money-back guarantee.