Last updated: 2026-03-30

Kubernetes Security & Health Checklist: Is Your Cluster Production-Ready?

A production-readiness checklist for Kubernetes covering RBAC, network policies, resource limits, secrets management, and monitoring.

TL;DR

A Kubernetes cluster that runs workloads is not the same as a production-ready cluster. This checklist covers the 8 areas most commonly misconfigured: RBAC, network policies, resource limits, pod security, secrets management, Helm hygiene, monitoring, and etcd backups. Run through each section, execute the commands, and fix what fails. If everything passes, your cluster is in better shape than 90% of what we audit.

Prerequisites

You need kubectl with cluster-wide read access, jq for filtering JSON output, and helm v3. The etcd steps assume a self-hosted cluster with shell access to a control plane node.

Step 1: RBAC Audit

Role-Based Access Control is the foundation of cluster security. Misconfigured RBAC is the most common finding in Kubernetes audits.

Check for Overprivileged Service Accounts

# List all ClusterRoleBindings that grant cluster-admin
kubectl get clusterrolebindings -o json | \
  jq '.items[] | select(.roleRef.name=="cluster-admin") | 
  {name: .metadata.name, subjects: .subjects}'

# Expected: Only system components and your admin user
# Red flag: Application service accounts with cluster-admin

Check Default Service Account Usage

# Find pods using the default service account
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.serviceAccountName=="default" or 
  .spec.serviceAccountName==null) | 
  {namespace: .metadata.namespace, name: .metadata.name}'

Every workload should have its own service account with minimal permissions. The default service account should not be used — ever.
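As a sketch of what that looks like (the name api-sa and namespace are illustrative, not from a specific chart):

```yaml
# Hypothetical example: a dedicated service account for an "api" workload
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-sa
  namespace: production

# Then reference it in the pod spec instead of relying on "default":
# spec:
#   serviceAccountName: api-sa
```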

Disable Automount of Service Account Tokens

# Check which pods automount tokens unnecessarily
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.automountServiceAccountToken != false) | 
  {namespace: .metadata.namespace, name: .metadata.name}'

# Fix: In your pod spec, add:
# spec:
#   automountServiceAccountToken: false

If a pod does not need to talk to the Kubernetes API (most don't), disable token automounting. A compromised pod with a service account token is an escalation path to the entire cluster.

Audit Existing Roles

# List all roles and their permissions
kubectl get roles --all-namespaces -o json | \
  jq '.items[] | {namespace: .metadata.namespace, 
  name: .metadata.name, rules: .rules}'

# Look for:
# - Wildcard verbs: ["*"] (too broad)
# - Wildcard resources: ["*"] (too broad)  
# - Secrets access without justification
# - Pod exec permissions (allows shell into any pod)
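For contrast with the wildcard red flags above, a least-privilege Role grants only the verbs and resources a workload actually needs — a sketch with an illustrative name and resource:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: config-reader          # illustrative name
  namespace: production
rules:
  - apiGroups: [""]            # "" = core API group
    resources: ["configmaps"]  # one concrete resource, no wildcards
    verbs: ["get", "list", "watch"]  # read-only, no create/delete
```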

Step 2: Network Policies

By default, every pod can talk to every other pod. This means one compromised pod can reach your database, your secrets store, and your internal APIs.

Check if Network Policies Exist

# List all network policies
kubectl get networkpolicies --all-namespaces

# If this returns empty, you have ZERO network segmentation.
# Every pod can communicate with every other pod.

Default Deny Policy

Start with default deny for every namespace, then whitelist what's needed:

# default-deny-all.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}  # Applies to all pods in namespace
  policyTypes:
    - Ingress
    - Egress

Allow Specific Traffic

# allow-frontend-to-backend.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080

# allow-api-to-database.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - protocol: TCP
          port: 5432

Allow DNS (Required)

# allow-dns.yaml (apply to every namespace with deny-all)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

Step 3: Resource Limits and Requests

Without resource limits, one misbehaving pod can starve the entire node. Without requests, the scheduler makes bad placement decisions.

Find Pods Without Limits

# Pods without resource limits
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(any(.spec.containers[]; .resources.limits == null)) | 
  {namespace: .metadata.namespace, name: .metadata.name}'

# any() reports each pod once, even when several of its containers lack limits

Recommended Resource Configuration

resources:
  requests:
    cpu: "100m"       # Minimum CPU guaranteed
    memory: "128Mi"   # Minimum memory guaranteed
  limits:
    cpu: "500m"       # Maximum CPU allowed
    memory: "512Mi"   # Maximum memory allowed (OOMKilled if exceeded)

Set Namespace-Level Defaults with LimitRange

# limitrange.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      type: Container

Set Namespace-Level Quotas

# resourcequota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: production
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"

Step 4: Pod Security

Pods running as root with full capabilities are the Kubernetes equivalent of running everything as Administrator on Windows XP.

Find Privileged Pods

# Pods running as root or with privileged mode
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(any(.spec.containers[];
    .securityContext.privileged == true or
    .securityContext.runAsUser == 0)) |
  {namespace: .metadata.namespace, name: .metadata.name}'

Secure Pod Template

spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL

Pod Security Standards (PSS)

# Enforce restricted security standard on namespace
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

Step 5: Secrets Management

Kubernetes Secrets are base64-encoded, not encrypted. Anyone with read access to Secrets in a namespace can decode them trivially.
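To see how weak that protection is, decode a value locally — the secret value here is a made-up example:

```shell
# base64 is an encoding, not encryption: anyone can reverse it
encoded=$(printf 'supersecret123' | base64)
echo "$encoded"              # c3VwZXJzZWNyZXQxMjM=
echo "$encoded" | base64 -d  # prints supersecret123
```

This is exactly what `kubectl get secret <name> -o jsonpath` hands back: one `base64 -d` away from plaintext.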

Check etcd Encryption

# Verify encryption at rest is configured (self-hosted clusters)
# Check the API server for --encryption-provider-config flag
ps aux | grep kube-apiserver | grep encryption-provider-config

# On kubeadm clusters you can also check the static pod manifest:
# grep encryption-provider-config /etc/kubernetes/manifests/kube-apiserver.yaml

# If not present, secrets are stored in PLAIN TEXT in etcd

Enable etcd Encryption at Rest

# encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-32-byte-key>  # e.g. head -c 32 /dev/urandom | base64
      - identity: {}  # Fallback for reading old unencrypted secrets

External Secrets Management

For production, consider external secrets stores:

# Using External Secrets Operator with HashiCorp Vault
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: db-credentials
  data:
    - secretKey: password
      remoteRef:
        key: secret/data/production/database
        property: password

Audit Secret Access

# Who can read secrets in production namespace?
kubectl auth can-i get secrets -n production --as=system:serviceaccount:production:default

# List all subjects with secrets access
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json | \
  jq '.items[] | select(.roleRef.name as $role | 
  ["admin","cluster-admin","edit"] | index($role)) | 
  {name: .metadata.name, subjects: .subjects}'

Step 6: Helm Hygiene

Helm is the most common way to deploy applications in Kubernetes, but sloppy Helm practices create security and reliability issues.

Check for Outdated Charts

# List all Helm releases with their chart versions
helm list --all-namespaces

# Check for updates
helm repo update
helm search repo <chart> --versions | head -5

Helm Values Hygiene

# Never put secrets in values.yaml — they end up in:
# 1. Your git repository (even if you .gitignore later, it's in history)
# 2. Helm release secrets (stored as Kubernetes secrets)
# 3. Your CI/CD logs

# BAD:
# values.yaml
# database:
#   password: "supersecret123"

# GOOD: Reference external secrets
# values.yaml
# database:
#   existingSecret: "db-credentials"
#   existingSecretKey: "password"
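How a chart template consumes the referenced Secret varies by chart; assuming the values keys from the example above, the template side typically looks like this:

```yaml
# Sketch of the chart-template side: inject the password from the
# externally managed Secret rather than from values.yaml
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: {{ .Values.database.existingSecret }}
        key: {{ .Values.database.existingSecretKey }}
```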

Pin Chart Versions

# BAD: No version pinning
helm install myapp bitnami/postgresql

# GOOD: Pin the chart version
helm install myapp bitnami/postgresql --version 16.4.1

# BETTER: Pin in a helmfile or values file
# helmfile.yaml
releases:
  - name: postgresql
    chart: bitnami/postgresql
    version: 16.4.1
    values:
      - values/postgresql.yaml

Audit Helm Release History

# Check release history for failed deployments
helm history myapp -n production

# Clean up old revisions (keep last 5)
helm upgrade myapp <chart> -n production --history-max 5

Step 7: Monitoring with Prometheus and Grafana

If you cannot see what your cluster is doing, you cannot secure or operate it. Monitoring is not optional for production.

Install kube-prometheus-stack

helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.adminPassword="change-me-immediately" \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

Essential Alerts to Configure

# Must-have alerts (included in kube-prometheus-stack):
# - KubePodCrashLooping: Pod restarting repeatedly
# - KubePodNotReady: Pod stuck in non-ready state
# - KubeDeploymentReplicasMismatch: Desired != actual replicas
# - KubeNodeNotReady: Node offline
# - KubeQuotaExceeded: Resource quota hit
# - etcdHighNumberOfLeaderChanges: etcd instability
# - CPUThrottlingHigh: Pods being CPU-throttled

# Custom alert example:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: monitoring
spec:
  groups:
    - name: custom.rules
      rules:
        - alert: HighMemoryUsage
          expr: |
            container_memory_working_set_bytes{container!=""}
              / (container_spec_memory_limit_bytes{container!=""} > 0) > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} memory > 90% of limit"

Key Grafana Dashboards

# Import these dashboard IDs in Grafana:
# 315   - Kubernetes cluster overview
# 6417  - Kubernetes pod resources
# 13770 - Kubernetes node resources
# 14981 - kube-prometheus-stack overview

Step 8: etcd Backup and Disaster Recovery

etcd is the brain of your Kubernetes cluster. Lose etcd, lose everything. This section applies primarily to self-hosted clusters — managed Kubernetes providers handle etcd backup for you.

Manual etcd Backup

# Create a snapshot
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-$(date +%Y%m%d).db \
  --write-table

Automated Backup CronJob

# etcd-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: etcd-backup
              image: bitnami/etcd:latest  # pin a specific tag in production
              command:
                - /bin/sh
                - -c
                - |
                  etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db \
                    --endpoints=https://etcd:2379 \
                    --cacert=/certs/ca.crt \
                    --cert=/certs/server.crt \
                    --key=/certs/server.key
              volumeMounts:
                - name: backup-volume
                  mountPath: /backup
                - name: etcd-certs
                  mountPath: /certs
          restartPolicy: OnFailure
          volumes:
            - name: backup-volume
              persistentVolumeClaim:
                claimName: etcd-backup-pvc
            - name: etcd-certs
              secret:
                secretName: etcd-certs

Test Your Restore

# CRITICAL: Test restore on a non-production cluster
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored \
  --initial-cluster="default=https://127.0.0.1:2380" \
  --initial-advertise-peer-urls="https://127.0.0.1:2380" \
  --name=default

# A backup you haven't tested restoring is not a backup.

Troubleshooting & Considerations

"Network policies are blocking legitimate traffic"

Start with default deny + allow-dns in a staging namespace first. Then add allow policies one by one. Use kubectl describe networkpolicy to verify selectors match your pods. Check that your CNI plugin supports network policies — not all do (Flannel does not, Calico and Cilium do).

"Pods are OOMKilled after setting memory limits"

Your memory limits are too low. Check actual memory usage with kubectl top pods or Prometheus metrics before setting limits. Set the limit at 1.5-2x the typical usage. Java applications are especially tricky — the JVM uses memory outside the heap that you must account for.
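For JVM workloads, one common mitigation is to cap the heap relative to the container limit so native memory still fits — a sketch; the percentage is a starting point, not a rule:

```yaml
# Let the JVM (JDK 10+) size its heap from the container memory limit,
# leaving headroom for metaspace, thread stacks, and native allocations
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"
resources:
  limits:
    memory: "1Gi"   # heap tops out around 768Mi with the flag above
```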

"Pods fail to start after enabling Pod Security Standards"

Start with warn mode instead of enforce: violations are logged without blocking admission. Fix them one by one, then switch to enforce. The most common offender is a container image that requires root; rebuild it to run as a non-root user, or switch to a variant of the image that already does.
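A namespace manifest for that warn-first rollout might look like this (enable enforce only once the warnings are clean):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
    # pod-security.kubernetes.io/enforce: restricted  # add this last
```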

"Prometheus uses too much storage"

Reduce retention period, increase scrape interval for less critical metrics, or use remote write to a long-term storage like Thanos or Cortex. Also check for high-cardinality metrics (metrics with many unique label combinations).
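To hunt down high-cardinality metrics, this PromQL query (run in the Prometheus UI) ranks metric names by active series count; the biggest offenders are usually histogram buckets or metrics with per-pod or per-request-ID labels:

```promql
# Top 10 metric names by number of active time series
topk(10, count by (__name__)({__name__=~".+"}))
```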

Prevention & Best Practices

Automate the Audit

Use tools that continuously scan your cluster: kube-bench (CIS benchmarks), Trivy (container vulnerabilities), Polaris (configuration best practices), Falco (runtime security). Run these in CI/CD and as cluster-resident agents.

GitOps for Everything

Every configuration in this guide should live in a Git repository and be applied via GitOps (ArgoCD or Flux). No manual kubectl apply in production. If it's not in Git, it didn't happen.
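As a sketch of what that looks like with ArgoCD — the repo URL, path, and application name below are hypothetical placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-policies       # illustrative name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/cluster-config.git  # hypothetical repo
    targetRevision: main
    path: policies             # network policies, quotas, PSS labels, etc.
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true       # delete resources removed from Git
      selfHeal: true    # revert manual drift back to the Git state
```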

Regular Audit Schedule

Run this checklist quarterly. Set calendar reminders. Security degrades over time as new workloads are added, permissions are expanded "temporarily," and monitoring is overlooked during feature sprints.

Upgrade Strategy

Keep your Kubernetes version within the supported window (latest 3 minor versions). Patch versions should be applied within 2 weeks of release. Have a tested upgrade runbook that includes backup verification before every upgrade.

Need Expert Help?

Want a professional K8s audit? €150, 60-min review + prioritized action report.


100% money-back guarantee


Harald Roessler

Infrastructure Engineer with 20+ years experience. Founder of DSNCON GmbH.