# Kubernetes CronJobs silently fail more than you think
A backup job missed 24 days of runs. Nobody knew. The CronJob looked fine in `kubectl get cronjobs`. No alerts fired. The last successful run timestamp in the status field just sat there, quietly getting older. The root cause: the CronJob controller had silently given up scheduling after missing 100 runs. Logged an error. Stopped trying. Moved on.

This article explains why Kubernetes CronJobs are structurally unreliable without external monitoring, and what you can do about it.

## The 100-missed-runs limit

This is the one that produces the war stories. The Kubernetes CronJob controller checks how many schedules it has missed since the last successful run. If that number exceeds 100, it permanently stops scheduling that CronJob and logs a single error line:

```
Cannot determine if job needs to be started: too many missed start time (> 100)
```

That's it. No event. No alert. `kubectl describe cronjob` shows the last scheduled time getting stale. The CronJob shows as `ACTIVE: 0`. Everything looks fine until you notice your data is 24 days old.

This happens if:

- The CronJob controller crashes or the API server is unreachable for an extended period
- You set `startingDeadlineSeconds` too low and the cluster was briefly overloaded
- A node outage prevented scheduling for long enough

The fix is restarting the CronJob (delete and recreate it, or bump the schedule), but the point is: you won't know it happened until you check manually.
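As a stopgap for that manual check, you can compare each CronJob's `status.lastScheduleTime` against the clock from somewhere outside the cluster (a CI cron, a laptop). A minimal sketch, assuming `jq` is installed and using an illustrative 25-hour threshold (tune it to your longest schedule interval); note it still depends on the API server being reachable, which is exactly the limitation the rest of this article addresses:

```bash
#!/bin/bash
# Flag CronJobs whose last scheduled run is older than a threshold.
# THRESHOLD is illustrative; set it above your longest schedule interval.
THRESHOLD=$((25 * 3600))
NOW=$(date +%s)

kubectl get cronjobs --all-namespaces -o json |
  jq -r '.items[] | [.metadata.namespace, .metadata.name, (.status.lastScheduleTime // "never")] | @tsv' |
  while IFS=$'\t' read -r ns name last; do
    if [ "$last" = "never" ]; then
      echo "WARN: ${ns}/${name} has never been scheduled"
      continue
    fi
    age=$(( NOW - $(date -d "$last" +%s) ))  # GNU date; use gdate on macOS
    if [ "$age" -gt "$THRESHOLD" ]; then
      echo "STALE: ${ns}/${name} last scheduled $((age / 3600))h ago"
    fi
  done
```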
## Exit 0 doesn't mean the job did anything

Your CronJob container can exit 0 after:

- Connecting to a read replica that's 6 hours behind
- Finding an empty queue and processing nothing
- Silently swallowing an exception in a try/catch
- Successfully completing a database backup of 0 bytes

Kubernetes marks the Job as Succeeded. The CronJob status shows an updated last-successful-run timestamp. Everything looks healthy. Your data pipeline has been doing nothing for a week.

## History limits erase the evidence

By default:

```yaml
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 1
```

After three successful runs, the oldest Job pod is deleted. Its logs go with it. When you eventually notice something's wrong and go looking for "what happened on Tuesday?", the evidence no longer exists. You can increase these limits, but you'll never retain more than a handful of runs. A real audit trail requires shipping logs to an external system.

## The dead man's switch pattern

All of these failure modes share the same root cause: your monitoring system lives inside the cluster, so it fails along with the cluster. If your alerting depends on the cluster being healthy, it won't alert you when the cluster is unhealthy. And CronJob failures almost always correlate with cluster health problems.

What you need is a check that runs outside your cluster and asks: "Did this job run? Did it do something?" If the answer is no, it pages you, regardless of what the cluster thinks.

This is the dead man's switch pattern: instead of your monitoring system checking whether the job ran, the job checks in with an external system, and the external system alerts if it stops hearing from the job.

Add a start/success/fail ping to your job. Here's a minimal implementation in bash:

```bash
#!/bin/bash
set -euo pipefail

BASE="https://deadmancheck.io/ping/${DEADMANCHECK_TOKEN}"

# Signal start (enables duration monitoring); || true so a network blip never kills the job
curl -fsS "${BASE}/start" > /dev/null || true

# Alert on any error
trap 'curl -fsS "${BASE}/fail" > /dev/null' ERR

# Your actual job
ROWS=$(/app/run-export.sh)

# Signal success + row count for output assertion
curl -fsS -X POST -H "Content-Type: application/json" \
  -d "{\"count\": ${ROWS}}" \
  "${BASE}" > /dev/null
```

The same pattern in Python:

```python
import os
import sys

import requests

TOKEN = os.environ["DEADMANCHECK_TOKEN"]
BASE = f"https://deadmancheck.io/ping/{TOKEN}"

def main():
    # Signal start; wrapped so a monitoring outage never kills the job
    try:
        requests.get(f"{BASE}/start", timeout=5)
    except Exception:
        pass

    try:
        records_processed = run_job()  # your actual workload; returns a record count
        # POST the count for the output assertion
        requests.post(BASE, json={"count": records_processed}, timeout=5)
    except Exception:
        try:
            requests.get(f"{BASE}/fail", timeout=5)
        except Exception:
            pass
        sys.exit(1)

if __name__ == "__main__":
    main()
```

Store the token in a Kubernetes Secret:

```bash
kubectl create secret generic deadmancheck-secret \
  --from-literal=token=your-token-here
```

Reference it in your CronJob spec:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-export
  namespace: production
spec:
  schedule: "0 2 * * *"
  successfulJobsHistoryLimit: 5  # keep more history than the default
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: exporter
              image: your-registry/exporter:latest
              env:
                - name: DEADMANCHECK_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: deadmancheck-secret
                      key: token
          restartPolicy: OnFailure
```

## Output assertions

Output assertions are the piece most monitoring tutorials skip. Here's why they matter. Your job runs. Exits 0. Kubernetes marks it Succeeded. But the job processed 0 records. If your monitoring only checks "did the job ping?" (like every other cron monitoring tool), you don't get alerted. The job pinged. It just pinged with `count=0`.

DeadManCheck lets you configure an output assertion: alert if `count` is 0. Now your job can't silently export nothing without triggering an alert. This catches the failure mode that pure heartbeat monitoring misses: the job that runs, succeeds by every technical measure, and still does nothing useful.

## Failure modes, side by side

| Failure mode | kubectl catches it? | External monitoring catches it? |
| --- | --- | --- |
| Pod CrashLoopBackOff | Visible in logs/events | YES (missed ping) |
| 100 missed-schedule limit hit | No alert fires | YES (missed ping) |
| Job exits 0, processes nothing | No | YES (output assertion) |
| Cluster outage kills controller | No | YES (missed ping) |
| Job takes 5× longer than usual | No | YES (duration anomaly) |
| CronJob accidentally deleted | No | YES (missed ping) |

## Wiring it up

For an existing CronJob:

1. Create a free monitor (takes 2 minutes)
2. Set the interval to match your schedule plus a buffer (e.g., 25h for a daily job)
3. Enable an output assertion if your job reports a count
4. Add the start/success/fail pings to your container script
5. Create the Secret, update the CronJob spec
6. Deploy and verify the first ping arrives (see the verification sketch below)

Total: 15-20 minutes including deployment. When the first silent failure happens, you'll wish you'd done it sooner.

While you're in the CronJob spec, increase the history limits from the defaults:

```yaml
successfulJobsHistoryLimit: 10
failedJobsHistoryLimit: 5
```

This doesn't replace external monitoring, but it gives you more context in `kubectl describe cronjob` when you're investigating an incident. The default of 3/1 is genuinely too low for production jobs.
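If the CronJob is already deployed, that change doesn't need a full manifest re-apply; one way to do it in place with `kubectl patch`, using the `daily-export` example from the spec above:

```bash
# Raise the history limits on the live CronJob in place
kubectl patch cronjob daily-export -n production --type merge \
  -p '{"spec":{"successfulJobsHistoryLimit":10,"failedJobsHistoryLimit":5}}'
```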
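And to verify the whole integration end to end (step 6 of the checklist) without waiting a day for the next scheduled run, you can create a one-off Job from the CronJob's template. A sketch, again using the example names; `daily-export-manual` is just an illustrative Job name:

```bash
# Trigger an ad-hoc run from the CronJob's template and follow its logs
kubectl create job daily-export-manual --from=cronjob/daily-export -n production
kubectl logs -n production job/daily-export-manual -f
```

If the start ping and then the success ping (with a non-zero count) show up on the monitor, the wiring is done.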
DeadManCheck is open source and self-hostable if you'd rather run it on your own infrastructure. GitHub →
