How to Monitor Kubernetes CronJobs with a Dead Man's Switch API
Kubernetes CronJobs are the backbone of scheduled workloads—database backups, ETL pipelines, report generation, cache warming. But when they fail, they fail silently. No alert. No notification. Just a pod in Error or OOMKilled status that nobody checks until something downstream breaks. Here's how to fix that with a dead man's switch API and a single line of YAML.
Why Kubernetes CronJobs Fail Silently
If you've operated Kubernetes in production, you've seen this before. A CronJob that ran fine for months suddenly stops completing. The cluster is healthy. The nodes are up. Prometheus shows green dashboards. But the nightly backup hasn't run in three days.
Kubernetes CronJobs fail in ways that don't trigger obvious alerts:
- OOMKilled — The container exceeds its memory limit. Kubernetes kills the process instantly. Exit code 137. No log output, no stack trace. The pod shows `OOMKilled` in its status, but unless you're actively watching `kubectl get pods`, you won't notice.
- ImagePullBackOff — Someone pushed a new tag to the registry but the pull secret expired. Or the image was deleted. The pod never starts. It sits in `ImagePullBackOff` indefinitely while Kubernetes retries.
- Node eviction — The node is under resource pressure. Kubernetes evicts low-priority pods, including your CronJob's pod. The job is marked as failed, and once `backoffLimit` is exhausted, it just stops retrying.
- concurrencyPolicy: Forbid — If a previous job is still running (maybe it's stuck), the new scheduled run is simply skipped. No error. No event. The schedule continues but the work doesn't happen.
- startingDeadlineSeconds exceeded — The scheduler missed the window. This happens during control plane disruptions or heavy cluster load. The job is silently dropped.
- Completed but failed — The container exited with code 0, but the actual task inside failed. Your Python script caught an exception, logged it, and exited cleanly. Kubernetes thinks everything is fine.
The common thread: Kubernetes doesn't have built-in alerting for CronJob failures. It tracks state, but it won't page you at 3 AM when your billing reconciliation job hasn't run. That's your responsibility.
The fundamental problem: Monitoring whether a server is up tells you nothing about whether a scheduled job actually ran. Uptime monitoring checks presence. Dead man's switch monitoring checks absence—it alerts you when something expected didn't happen.
The Dead Man's Switch Pattern for k8s CronJobs
A dead man's switch (also called a heartbeat monitor) works on a simple principle: your job pings an external endpoint every time it completes successfully. If the ping doesn't arrive within the expected window, the monitoring service assumes the job failed and sends an alert.
For Kubernetes CronJobs, the implementation is straightforward:
- Create a monitor with an expected interval matching your CronJob schedule
- Add a `curl` command to the end of your CronJob container spec
- If the job succeeds, the ping is sent. If it fails for any reason—OOMKilled, crash, timeout, eviction—no ping is sent
- The monitoring service detects the missing ping and alerts you via webhook or email
This approach catches every failure mode listed above. It doesn't matter why the job failed. The only thing that matters is whether the ping arrived on time.
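At its core the pattern is just shell short-circuiting. This sketch (with a stand-in `send_ping` function in place of the real `curl` call, and simulated tasks) shows why a failing job never reaches the ping:

```shell
#!/bin/sh
# Chain the ping with && so it only fires when the task exits 0.
# send_ping stands in for the real curl call to the ping URL.
send_ping() { echo "ping sent"; }

good_task() { return 0; }   # simulated successful job
bad_task()  { return 1; }   # simulated OOMKill / crash / any failure

good_task && send_ping          # prints "ping sent" -> monitor stays quiet
bad_task  && send_ping || true  # prints nothing -> monitor alerts later
```

The `|| true` on the last line only keeps this demo script's exit code clean; in a real CronJob you usually want the job to reflect the failure.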
Step-by-Step: Integrating CronPeek with Kubernetes CronJobs
1. Create a CronPeek account
First, create an account via the API. This gives you an API key for managing monitors.
curl -X POST https://cronpeek.web.app/api/accounts \
-H "Content-Type: application/json" \
-d '{
"email": "ops@yourcompany.com",
"webhookUrl": "https://hooks.slack.com/services/T00/B00/xxxxx"
}'
The response includes your accountId and apiKey. Store these securely—a Kubernetes Secret is ideal.
2. Create a monitor for each CronJob
Create a monitor that matches your CronJob's schedule. Set the intervalSeconds to the expected time between successful completions, plus a grace period.
# Monitor for a nightly job: 24h between runs (86400s) plus an hour of slack
curl -X POST https://cronpeek.web.app/api/monitors \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"name": "k8s-nightly-backup",
"intervalSeconds": 90000,
"graceSeconds": 300
}'
# Response:
# {
# "id": "mon_abc123",
# "token": "ping_xxxxxxxxxxxxxxxx",
# "pingUrl": "https://cronpeek.web.app/api/ping/ping_xxxxxxxxxxxxxxxx"
# }
The token is what you'll add to your CronJob spec. The graceSeconds gives your job a buffer—if it usually takes 10 minutes to complete, set grace to 600 so you aren't alerted for slow-but-successful runs.
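As a sanity check on the numbers: for a nightly job, `intervalSeconds` is a day plus whatever slack you want for late scheduling, and `graceSeconds` sits on top of that:

```shell
# Nightly job: one run per day, with an hour of slack folded into the interval
DAY=$((24 * 60 * 60))        # 86400 seconds between runs
INTERVAL=$((DAY + 3600))     # 90000, as in the monitor above
GRACE=300                    # alert only after 5 more minutes of silence
echo "$INTERVAL $GRACE"      # prints: 90000 300
```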
3. Add the ping to your CronJob YAML
The simplest approach: add a curl command after your main task. Use && so the ping only fires on success.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
  namespace: production
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 600
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 3600
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: your-registry/backup-tool:latest
            command:
            - /bin/sh
            - -c
            - |
              # Run the actual backup
              /usr/local/bin/backup-db.sh \
                --host=$DB_HOST \
                --output=/tmp/backup.sql.gz &&
              # Upload to S3
              aws s3 cp /tmp/backup.sql.gz \
                s3://backups/nightly/$(date +%Y-%m-%d).sql.gz &&
              # Ping CronPeek on success
              curl -fsS --retry 3 --max-time 10 \
                https://cronpeek.web.app/api/ping/$CRONPEEK_TOKEN
            env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: host
            - name: CRONPEEK_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cronpeek-secrets
                  key: backup-monitor-token
            resources:
              requests:
                memory: "256Mi"
                cpu: "100m"
              limits:
                memory: "512Mi"
                cpu: "500m"
Store the ping token in a Kubernetes Secret so it's not hardcoded in your manifests:
kubectl create secret generic cronpeek-secrets \
--namespace=production \
--from-literal=backup-monitor-token=ping_xxxxxxxxxxxxxxxx
Why curl -fsS --retry 3? The -f flag makes curl exit non-zero on HTTP errors (4xx/5xx) instead of treating them as success, so && chains break correctly. -sS suppresses the progress meter but still shows errors. --retry 3 handles transient network issues. --max-time 10 prevents curl from hanging if the endpoint is slow. Together these make the ping reliable without blocking your job.
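One subtlety worth deciding deliberately: in the chain above, a failed ping also fails the job, so Kubernetes may re-run an otherwise successful backup up to `backoffLimit` times. If re-running is expensive, decouple the two. A sketch of a wrapper (the `ping_on_success` name is mine, not part of any API):

```shell
#!/bin/sh
# ping_on_success CMD...: run the task, ping only if it succeeded, and never
# let a failed ping change the job's exit code (a missed ping still alerts).
ping_on_success() {
  "$@"
  status=$?
  if [ "$status" -eq 0 ]; then
    curl -fsS --retry 3 --max-time 10 \
      "https://cronpeek.web.app/api/ping/$CRONPEEK_TOKEN" || true
  fi
  return "$status"
}

# Usage inside the container command, e.g.:
#   ping_on_success /usr/local/bin/backup-db.sh --host="$DB_HOST"
```

The trade-off: the job's exit code now reflects only the backup itself, and monitoring of the ping path is left entirely to the dead man's switch.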
4. Alternative: sidecar container pattern
For jobs where you can't easily modify the main container's command, run a small heartbeat sidecar instead. The two containers coordinate through a shared `emptyDir` volume: the main container records whether the task succeeded and then signals completion, and the sidecar sends the ping only on success. This also templates cleanly in Helm charts:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etl-pipeline
  namespace: data
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: etl
            image: your-registry/etl-runner:v2.4
            command:
            - /bin/sh
            - -c
            - |
              # Run the pipeline and record its exit code
              /usr/local/bin/run-etl
              status=$?
              # Mark success for the sidecar, then signal completion
              [ "$status" -eq 0 ] && touch /shared/success
              touch /shared/done
              exit "$status"
            volumeMounts:
            - name: shared
              mountPath: /shared
          - name: heartbeat
            image: curlimages/curl:8.5.0
            command:
            - /bin/sh
            - -c
            - |
              # Wait for the main container to signal completion
              while [ ! -f /shared/done ]; do sleep 5; done
              # Ping only if it succeeded
              if [ -f /shared/success ]; then
                curl -fsS --retry 3 --max-time 10 \
                  https://cronpeek.web.app/api/ping/$CRONPEEK_TOKEN
              fi
            env:
            - name: CRONPEEK_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cronpeek-secrets
                  key: etl-monitor-token
            volumeMounts:
            - name: shared
              mountPath: /shared
          volumes:
          - name: shared
            emptyDir: {}
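The file handshake is easy to get wrong (for example, polling for a marker file that doesn't exist yet exits the loop immediately), so it's worth exercising locally. This simulates both containers against a temp directory standing in for the `emptyDir`:

```shell
#!/bin/sh
# Simulate the shared-volume handshake: the "etl" side marks success and
# completion, the "heartbeat" side waits for completion and decides whether
# to ping. A temp directory stands in for the emptyDir volume.
SHARED=$(mktemp -d)

# etl side (backgrounded here): run the task, record outcome, signal done
( true && touch "$SHARED/success"; touch "$SHARED/done" ) &

# heartbeat side: block on the done marker, then ping only on success
while [ ! -f "$SHARED/done" ]; do sleep 1; done
if [ -f "$SHARED/success" ]; then
  PINGED=yes   # stand-in for the curl call to the ping URL
else
  PINGED=no
fi
echo "$PINGED"   # prints: yes
rm -rf "$SHARED"
```

Ordering matters: `success` must be written before `done`, so the waiter can never observe `done` without the outcome already recorded.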
Setting Up Webhook Alerts for Missed Pings
When CronPeek detects a missed heartbeat, it fires a webhook to the URL you configured. This is where you connect it to your existing incident response pipeline.
Slack webhook
Point your CronPeek webhook at a Slack incoming webhook URL. When a ping is missed, your channel gets a message immediately. No custom integration needed.
PagerDuty / Opsgenie via webhook
Both PagerDuty and Opsgenie accept generic webhook events. Set your CronPeek webhook URL to their event ingestion endpoint. A missed ping becomes an incident automatically.
Custom webhook handler
For more control, point the webhook at your own endpoint. The payload includes the monitor name, expected interval, and time since last ping—enough context to route the alert intelligently.
# CronPeek webhook payload (POST to your URL):
{
"event": "monitor.missed",
"monitor": {
"id": "mon_abc123",
"name": "k8s-nightly-backup",
"lastPing": "2026-03-27T02:14:33Z",
"expectedBy": "2026-03-28T03:19:33Z"
}
}
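A handler mostly needs to pull the name and timestamp out of that payload. Sticking with the shell-and-jq tooling used elsewhere in this post, a minimal routing sketch (the payload literal here is hand-copied from above, not fetched):

```shell
#!/bin/sh
# Extract routing fields from a monitor.missed payload with jq
PAYLOAD='{"event":"monitor.missed","monitor":{"id":"mon_abc123","name":"k8s-nightly-backup","lastPing":"2026-03-27T02:14:33Z"}}'

EVENT=$(printf '%s' "$PAYLOAD" | jq -r '.event')
NAME=$(printf '%s' "$PAYLOAD" | jq -r '.monitor.name')
LAST=$(printf '%s' "$PAYLOAD" | jq -r '.monitor.lastPing')

# Route only missed-ping events; other event types pass through silently
if [ "$EVENT" = "monitor.missed" ]; then
  echo "ALERT: $NAME has been silent since $LAST"
fi
```

With the naming convention from the next section, `$NAME` alone is enough to route the alert to the right cluster and team.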
Monitoring Multiple CronJobs Across Namespaces
In a real cluster, you don't have one CronJob. You have dozens, spread across namespaces: production, staging, data, monitoring. Each needs its own monitor with its own schedule and grace period.
Naming convention
Use a consistent naming pattern so you can identify which cluster, namespace, and job triggered an alert:
# Pattern: {cluster}-{namespace}-{cronjob-name}
curl -X POST https://cronpeek.web.app/api/monitors \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"name": "prod-us-east-production-nightly-backup",
"intervalSeconds": 90000,
"graceSeconds": 300
}'
curl -X POST https://cronpeek.web.app/api/monitors \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"name": "prod-us-east-data-etl-hourly",
"intervalSeconds": 3900,
"graceSeconds": 300
}'
curl -X POST https://cronpeek.web.app/api/monitors \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"name": "prod-us-east-billing-invoice-gen",
"intervalSeconds": 90000,
"graceSeconds": 600
}'
Automating monitor creation with a script
If you're managing many CronJobs, automate the monitor creation. This script reads all CronJobs in a namespace and creates a CronPeek monitor for each:
#!/bin/bash
# create-monitors.sh — Bulk-create CronPeek monitors from k8s CronJobs
CLUSTER="prod-us-east"
NAMESPACE="production"
API_KEY="YOUR_API_KEY"
kubectl get cronjobs -n "$NAMESPACE" -o json | \
jq -r '.items[] | "\(.metadata.name) \(.spec.schedule)"' | \
while read NAME SCHEDULE; do
# Convert cron schedule to approximate interval in seconds
# (simplified — adjust for your schedules)
INTERVAL=$(python3 -c "
from croniter import croniter
from datetime import datetime
c = croniter('$SCHEDULE', datetime.now())
n1 = c.get_next(datetime)
n2 = c.get_next(datetime)
print(int((n2 - n1).total_seconds()))
")
echo "Creating monitor: $CLUSTER-$NAMESPACE-$NAME (interval: ${INTERVAL}s)"
curl -sX POST https://cronpeek.web.app/api/monitors \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_KEY" \
-d "{
\"name\": \"$CLUSTER-$NAMESPACE-$NAME\",
\"intervalSeconds\": $((INTERVAL + 300)),
\"graceSeconds\": 300
}"
echo
done
Storing tokens in a shared Secret
After creating monitors, store all ping tokens in a single Kubernetes Secret per namespace. Your CronJob specs reference the appropriate key:
kubectl create secret generic cronpeek-tokens \
--namespace=production \
--from-literal=nightly-backup=ping_xxxx1 \
--from-literal=etl-hourly=ping_xxxx2 \
--from-literal=invoice-gen=ping_xxxx3
What About Prometheus and Kube-State-Metrics?
You might be thinking: "I already have Prometheus. Can't kube_job_status_failed catch this?" Yes and no.
Prometheus with kube-state-metrics can alert on failed jobs—pods that exited with a non-zero code. But out of the box it won't catch:
- Jobs that never started — skipped due to `concurrencyPolicy: Forbid` or a missed `startingDeadlineSeconds`
- Jobs that "succeeded" but didn't work — exit code 0 but the actual task failed internally
- Schedule drift — the job ran but 4 hours late due to cluster issues
A dead man's switch covers all of these because it checks for the positive signal (the ping arrived) rather than the negative signal (a failure was reported). The two approaches are complementary. Use Prometheus for cluster-level observability. Use CronPeek for business-logic-level assurance that your scheduled work actually happened.
Pricing: CronPeek vs Enterprise Monitoring
If you're running 20–50 CronJobs across a couple of namespaces, here's what monitoring costs look like:
- Cronitor: ~$100/mo for 50 monitors ($2/monitor/mo)
- Datadog Synthetic Monitoring: Included in expensive tiers, but you're already paying $23+/host/mo
- CronPeek: $9/mo for 50 monitors. $29/mo for unlimited.
For a startup running Kubernetes, $9/mo for complete CronJob monitoring is a rounding error in your cloud bill. But it's the $9 that saves you from discovering your database backup hasn't run in two weeks.
Monitor every CronJob in your cluster
Free tier includes 5 monitors. Pro plan: $9/mo for 50 monitors. Set up your first monitor in under 2 minutes.
Start monitoring free →