March 28, 2026 · 9 min read

How to Monitor Kubernetes CronJobs with a Dead Man's Switch API

Kubernetes CronJobs are the backbone of scheduled workloads—database backups, ETL pipelines, report generation, cache warming. But when they fail, they fail silently. No alert. No notification. Just a pod in Error or OOMKilled status that nobody checks until something downstream breaks. Here's how to fix that with a dead man's switch API and a single line of YAML.

Why Kubernetes CronJobs Fail Silently

If you've operated Kubernetes in production, you've seen this before. A CronJob that ran fine for months suddenly stops completing. The cluster is healthy. The nodes are up. Prometheus shows green dashboards. But the nightly backup hasn't run in three days.

Kubernetes CronJobs fail in ways that don't trigger obvious alerts:

  - OOMKilled: the pod exceeds its memory limit and is killed; after backoffLimit retries, the Job quietly gives up
  - Missed start: the run is skipped entirely when no pod can start before startingDeadlineSeconds expires
  - Stuck jobs: with concurrencyPolicy: Forbid, one hung run silently blocks every run after it
  - Suspended schedules: spec.suspend: true set during maintenance and never reverted
  - Evictions: node pressure kills the pod mid-run and the work never completes

The common thread: Kubernetes doesn't have built-in alerting for CronJob failures. It tracks state, but it won't page you at 3 AM when your billing reconciliation job hasn't run. That's your responsibility.

The fundamental problem: Monitoring whether a server is up tells you nothing about whether a scheduled job actually ran. Uptime monitoring checks presence. Dead man's switch monitoring checks absence—it alerts you when something expected didn't happen.

The Dead Man's Switch Pattern for k8s CronJobs

A dead man's switch (also called a heartbeat monitor) works on a simple principle: your job pings an external endpoint every time it completes successfully. If the ping doesn't arrive within the expected window, the monitoring service assumes the job failed and sends an alert.

For Kubernetes CronJobs, the implementation is straightforward:

  1. Create a monitor with an expected interval matching your CronJob schedule
  2. Add a curl command to the end of your CronJob container spec
  3. If the job succeeds, the ping is sent. If it fails for any reason—OOMKilled, crash, timeout, eviction—no ping is sent
  4. The monitoring service detects the missing ping and alerts you via webhook or email

This approach catches every failure mode listed above. It doesn't matter why the job failed. The only thing that matters is whether the ping arrived on time.
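Before touching a cluster, the contract is easy to simulate locally. In this sketch, run_job and send_ping are hypothetical stand-ins for the real workload and the real curl ping; the point is that every failure mode surfaces as a non-zero exit code, which short-circuits the chain and suppresses the ping:

```shell
# Stand-ins for the real workload and the real curl ping (hypothetical).
run_job()   { return "$1"; }      # exit code supplied for the demo
send_ping() { echo "ping sent"; }

# Success: the chain reaches the ping.
run_job 0 && send_ping            # prints "ping sent"

# Failure (crash, OOMKill, timeout all become a non-zero exit):
# the chain short-circuits and no ping is sent, so the monitor alerts.
run_job 1 && send_ping || echo "no ping; monitor will alert"
```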

Step-by-Step: Integrating CronPeek with Kubernetes CronJobs

Step 1: Create a CronPeek account

First, create an account via the API. This gives you an API key for managing monitors.

curl -X POST https://cronpeek.web.app/api/accounts \
  -H "Content-Type: application/json" \
  -d '{
    "email": "ops@yourcompany.com",
    "webhookUrl": "https://hooks.slack.com/services/T00/B00/xxxxx"
  }'

The response includes your accountId and apiKey. Store these securely—a Kubernetes Secret is ideal.
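For example, a Secret holding the credentials might look like this (the Secret name and key name are illustrative, not prescribed by CronPeek):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cronpeek-api
  namespace: production
type: Opaque
stringData:
  api-key: YOUR_API_KEY
```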

Step 2: Create a monitor for each CronJob

Create a monitor that matches your CronJob's schedule. Set the intervalSeconds to the expected time between successful completions, plus a grace period.

# Monitor for a nightly job (24 h interval + 1 h buffer = 90000 s)
curl -X POST https://cronpeek.web.app/api/monitors \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "name": "k8s-nightly-backup",
    "intervalSeconds": 90000,
    "graceSeconds": 300
  }'

# Response:
# {
#   "id": "mon_abc123",
#   "token": "ping_xxxxxxxxxxxxxxxx",
#   "pingUrl": "https://cronpeek.web.app/api/ping/ping_xxxxxxxxxxxxxxxx"
# }

The token is what you'll add to your CronJob spec. The graceSeconds gives your job a buffer—if it usually takes 10 minutes to complete, set grace to 600 so you aren't alerted for slow-but-successful runs.
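Choosing intervalSeconds is plain arithmetic: time between runs plus a scheduling buffer, with graceSeconds as extra slack on top. A quick sketch for the two intervals used in this post:

```shell
# Nightly job ("0 2 * * *"): 24 h between runs + 1 h buffer
NIGHTLY=$(( 24 * 3600 + 3600 ))
echo "$NIGHTLY"    # 90000

# Hourly job: 1 h between runs + 5 min buffer
HOURLY=$(( 3600 + 300 ))
echo "$HOURLY"     # 3900
```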

Step 3: Add the ping to your CronJob YAML

The simplest approach: add a curl command after your main task. Use && so the ping only fires on success.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
  namespace: production
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 600
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 3600
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: your-registry/backup-tool:latest
            command:
            - /bin/sh
            - -c
            - |
              # Run the actual backup
              /usr/local/bin/backup-db.sh \
                --host=$DB_HOST \
                --output=/tmp/backup.sql.gz &&
              # Upload to S3
              aws s3 cp /tmp/backup.sql.gz \
                s3://backups/nightly/$(date +%Y-%m-%d).sql.gz &&
              # Ping CronPeek on success
              curl -fsS --retry 3 --max-time 10 \
                https://cronpeek.web.app/api/ping/$CRONPEEK_TOKEN
            env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: host
            - name: CRONPEEK_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cronpeek-secrets
                  key: backup-monitor-token
            resources:
              requests:
                memory: "256Mi"
                cpu: "100m"
              limits:
                memory: "512Mi"
                cpu: "500m"

Store the ping token in a Kubernetes Secret so it's not hardcoded in your manifests:

kubectl create secret generic cronpeek-secrets \
  --namespace=production \
  --from-literal=backup-monitor-token=ping_xxxxxxxxxxxxxxxx

Why curl -fsS --retry 3? The -f flag makes curl return a non-zero exit code on HTTP errors; without it, a 500 from the ping endpoint would still count as success and the && chain wouldn't reflect reality. -sS suppresses the progress meter but still prints errors. --retry 3 handles transient network issues, and --max-time 10 prevents curl from hanging if the endpoint is slow. Together these make the ping reliable without blocking your job.

Step 4 (alternative): The sidecar container pattern

For jobs where you can't easily modify the main container's command, run a second container in the pod that watches a shared emptyDir volume and sends the ping once the main container has finished. The main container records success and signals completion by touching files on the shared volume; the heartbeat container exits cleanly either way, so the Job's pass/fail status still comes from the main container:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etl-pipeline
  namespace: data
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: etl
            image: your-registry/etl-runner:v2.4
            command:
            - /bin/sh
            - -c
            - |
              # Run the pipeline; record success, then signal completion
              /usr/local/bin/run-etl; STATUS=$?
              [ "$STATUS" -eq 0 ] && touch /shared/success
              touch /shared/done
              exit "$STATUS"
            volumeMounts:
            - name: shared
              mountPath: /shared
          - name: heartbeat
            image: curlimages/curl:8.5.0
            command:
            - /bin/sh
            - -c
            - |
              # Wait for the main container to signal completion
              while [ ! -f /shared/done ]; do sleep 5; done
              # Ping only if the pipeline succeeded
              if [ -f /shared/success ]; then
                curl -fsS --retry 3 --max-time 10 \
                  https://cronpeek.web.app/api/ping/$CRONPEEK_TOKEN
              fi
            env:
            - name: CRONPEEK_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cronpeek-secrets
                  key: etl-monitor-token
            volumeMounts:
            - name: shared
              mountPath: /shared
          volumes:
          - name: shared
            emptyDir: {}

Setting Up Webhook Alerts for Missed Pings

When CronPeek detects a missed heartbeat, it fires a webhook to the URL you configured. This is where you connect it to your existing incident response pipeline.

Slack webhook

Point your CronPeek webhook at a Slack incoming webhook URL. When a ping is missed, your channel gets a message immediately. No custom integration needed.

PagerDuty / Opsgenie via webhook

Both PagerDuty and Opsgenie accept generic webhook events. Set your CronPeek webhook URL to their event ingestion endpoint. A missed ping becomes an incident automatically.

Custom webhook handler

For more control, point the webhook at your own endpoint. The payload includes the monitor name, expected interval, and time since last ping—enough context to route the alert intelligently.

# CronPeek webhook payload (POST to your URL):
{
  "event": "monitor.missed",
  "monitor": {
    "id": "mon_abc123",
    "name": "k8s-nightly-backup",
    "lastPing": "2026-03-27T02:14:33Z",
    "expectedBy": "2026-03-28T02:19:33Z"
  }
}
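A handler can route on these fields however it likes. As a minimal sketch (the payload here just mirrors the example above, and python3 is used for JSON parsing rather than a hypothetical CLI):

```shell
# Extract the monitor name from an incoming webhook payload.
PAYLOAD='{"event":"monitor.missed","monitor":{"id":"mon_abc123","name":"k8s-nightly-backup","lastPing":"2026-03-27T02:14:33Z"}}'

NAME=$(printf '%s' "$PAYLOAD" | python3 -c \
  'import json, sys; print(json.load(sys.stdin)["monitor"]["name"])')

echo "routing alert for: $NAME"   # routing alert for: k8s-nightly-backup
```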

Monitoring Multiple CronJobs Across Namespaces

In a real cluster, you don't have one CronJob. You have dozens, spread across namespaces: production, staging, data, monitoring. Each needs its own monitor with its own schedule and grace period.

Naming convention

Use a consistent naming pattern so you can identify which cluster, namespace, and job triggered an alert:

# Pattern: {cluster}-{namespace}-{cronjob-name}
curl -X POST https://cronpeek.web.app/api/monitors \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "name": "prod-us-east-production-nightly-backup",
    "intervalSeconds": 90000,
    "graceSeconds": 300
  }'

curl -X POST https://cronpeek.web.app/api/monitors \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "name": "prod-us-east-data-etl-hourly",
    "intervalSeconds": 3900,
    "graceSeconds": 300
  }'

curl -X POST https://cronpeek.web.app/api/monitors \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "name": "prod-us-east-billing-invoice-gen",
    "intervalSeconds": 90000,
    "graceSeconds": 600
  }'

Automating monitor creation with a script

If you're managing many CronJobs, automate the monitor creation. This script reads all CronJobs in a namespace and creates a CronPeek monitor for each:

#!/bin/bash
# create-monitors.sh — Bulk-create CronPeek monitors from k8s CronJobs

CLUSTER="prod-us-east"
NAMESPACE="production"
API_KEY="YOUR_API_KEY"

kubectl get cronjobs -n "$NAMESPACE" -o json | \
  jq -r '.items[] | "\(.metadata.name) \(.spec.schedule)"' | \
while read NAME SCHEDULE; do
  # Convert cron schedule to approximate interval in seconds
  # (simplified — adjust for your schedules; requires croniter:
  #  pip install croniter)
  INTERVAL=$(python3 -c "
from croniter import croniter
from datetime import datetime
c = croniter('$SCHEDULE', datetime.now())
n1 = c.get_next(datetime)
n2 = c.get_next(datetime)
print(int((n2 - n1).total_seconds()))
")

  echo "Creating monitor: $CLUSTER-$NAMESPACE-$NAME (interval: ${INTERVAL}s)"

  curl -sX POST https://cronpeek.web.app/api/monitors \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $API_KEY" \
    -d "{
      \"name\": \"$CLUSTER-$NAMESPACE-$NAME\",
      \"intervalSeconds\": $((INTERVAL + 300)),
      \"graceSeconds\": 300
    }"
  echo
done
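If installing croniter is more than you want, common */N minute schedules (like the ETL job's */30 * * * *) can be handled with shell parameter expansion alone. This shortcut assumes only the minute field carries a step:

```shell
# Rough interval for "*/N * * * *" schedules, no croniter needed.
SCHEDULE="*/30 * * * *"
MIN_FIELD="${SCHEDULE%% *}"    # first field: "*/30"
STEP="${MIN_FIELD#\*/}"        # strip the "*/" prefix: "30"
echo $(( STEP * 60 ))          # interval in seconds: 1800
```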

Storing tokens in a shared Secret

After creating monitors, store all ping tokens in a single Kubernetes Secret per namespace. Your CronJob specs reference the appropriate key:

kubectl create secret generic cronpeek-tokens \
  --namespace=production \
  --from-literal=nightly-backup=ping_xxxx1 \
  --from-literal=etl-hourly=ping_xxxx2 \
  --from-literal=invoice-gen=ping_xxxx3
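A CronJob in that namespace then references its own key with a standard secretKeyRef (shown here for the hypothetical ETL job):

```yaml
env:
- name: CRONPEEK_TOKEN
  valueFrom:
    secretKeyRef:
      name: cronpeek-tokens
      key: etl-hourly
```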

What About Prometheus and Kube-State-Metrics?

You might be thinking: "I already have Prometheus. Can't kube_job_status_failed catch this?" Yes and no.

Prometheus with kube-state-metrics can alert on failed jobs—pods that exited with a non-zero code. But it cannot detect:

  - A run that never started, because the CronJob was suspended, the schedule was mistyped, or startingDeadlineSeconds was missed
  - Jobs pruned by successfulJobsHistoryLimit / failedJobsHistoryLimit before Prometheus scraped the failure
  - A CronJob accidentally deleted during a deploy
  - The cron controller itself falling behind or wedging

A dead man's switch covers all of these because it checks for the positive signal (the ping arrived) rather than the negative signal (a failure was reported). The two approaches are complementary. Use Prometheus for cluster-level observability. Use CronPeek for business-logic-level assurance that your scheduled work actually happened.
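If you run the prometheus-operator, the failure-reporting half can be a minimal PrometheusRule. The kube_job_status_failed metric is standard kube-state-metrics, but the threshold and labels below are a sketch to adapt:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cronjob-failures
  namespace: monitoring
spec:
  groups:
  - name: cronjobs
    rules:
    - alert: KubeJobFailed
      expr: kube_job_status_failed > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Job {{ $labels.job_name }} has failed pods"
```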

Pricing: CronPeek vs Enterprise Monitoring

If you're running 20–50 CronJobs across a couple of namespaces, a single CronPeek Pro plan ($9/mo for 50 monitors) covers the whole cluster.

For a startup running Kubernetes, $9/mo for complete CronJob monitoring is a rounding error in your cloud bill. But it's the $9 that saves you from discovering your database backup hasn't run in two weeks.

Monitor every CronJob in your cluster

Free tier includes 5 monitors. Pro plan: $9/mo for 50 monitors. Set up your first monitor in under 2 minutes.

Start monitoring free →

More developer APIs from the Peek Suite