How to Monitor Kubernetes CronJobs with a Dead Man's Switch API
Kubernetes CronJobs are the backbone of scheduled workloads—database backups, ETL pipelines, report generation, cache warming. But when they fail, they fail silently. No alert. No notification. Just a pod in Error or OOMKilled status that nobody checks until something downstream breaks. Here's how to fix that with a dead man's switch API and a single line of YAML.
Why Kubernetes CronJobs Fail Silently
If you've operated Kubernetes in production, you've seen this before. A CronJob that ran fine for months suddenly stops completing. The cluster is healthy. The nodes are up. Prometheus shows green dashboards. But the nightly backup hasn't run in three days.
Kubernetes CronJobs fail in ways that don't trigger obvious alerts:
- OOMKilled — The container exceeds its memory limit. Kubernetes kills the process instantly. Exit code 137. No log output, no stack trace. The pod shows `OOMKilled` in its status, but unless you're actively watching `kubectl get pods`, you won't notice.
- ImagePullBackOff — Someone pushed a new tag to the registry but the pull secret expired. Or the image was deleted. The pod never starts. It sits in `ImagePullBackOff` indefinitely while Kubernetes retries.
- Node eviction — The node is under resource pressure. Kubernetes evicts low-priority pods, including your CronJob's pod. The job is marked as failed, and once `backoffLimit` is exhausted, it just stops retrying.
- concurrencyPolicy: Forbid — If a previous job is still running (maybe it's stuck), the new scheduled run is simply skipped. No error. No event. The schedule continues but the work doesn't happen.
- startingDeadlineSeconds exceeded — The scheduler missed the window. This happens during control plane disruptions or heavy cluster load. The job is silently dropped.
- Completed but failed — The container exited with code 0, but the actual task inside failed. Your Python script caught an exception, logged it, and exited cleanly. Kubernetes thinks everything is fine.
The common thread: Kubernetes doesn't have built-in alerting for CronJob failures. It tracks state, but it won't page you at 3 AM when your billing reconciliation job hasn't run. That's your responsibility.
The fundamental problem: Monitoring whether a server is up tells you nothing about whether a scheduled job actually ran. Uptime monitoring checks presence. Dead man's switch monitoring checks absence—it alerts you when something expected didn't happen.
The Dead Man's Switch Pattern for k8s CronJobs
A dead man's switch (also called a heartbeat monitor) works on a simple principle: your job pings an external endpoint every time it completes successfully. If the ping doesn't arrive within the expected window, the monitoring service assumes the job failed and sends an alert.
For Kubernetes CronJobs, the implementation is straightforward:
- Create a monitor with an expected interval matching your CronJob schedule
- Add a `curl` command to the end of your CronJob container spec
- If the job succeeds, the ping is sent. If it fails for any reason—OOMKilled, crash, timeout, eviction—no ping is sent
- The monitoring service detects the missing ping and alerts you via webhook or email
This approach catches every failure mode listed above. It doesn't matter why the job failed. The only thing that matters is whether the ping arrived on time.
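At its core the pattern is just shell short-circuiting. This sketch (with a stand-in `send_ping` function in place of the real `curl` call, and simulated tasks) shows why a failing job never reaches the ping:

```shell
#!/bin/sh
# Chain the ping with && so it only fires when the task exits 0.
# send_ping stands in for the real curl call to the ping URL.
send_ping() { echo "ping sent"; }

good_task() { return 0; }   # simulated successful job
bad_task()  { return 1; }   # simulated OOMKill / crash / any failure

good_task && send_ping          # prints "ping sent" -> monitor stays quiet
bad_task  && send_ping || true  # prints nothing -> monitor alerts later
```

The `|| true` on the last line only keeps this demo script's exit code clean; in a real CronJob you usually want the job to reflect the failure.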
Step-by-Step: Integrating CronPeek with Kubernetes CronJobs
1. Create a CronPeek account
First, create an account via the API. This gives you an API key for managing monitors.
curl -X POST https://cronpeek.web.app/api/accounts \
-H "Content-Type: application/json" \
-d '{
"email": "ops@yourcompany.com",
"webhookUrl": "https://hooks.slack.com/services/T00/B00/xxxxx"
}'
The response includes your accountId and apiKey. Store these securely—a Kubernetes Secret is ideal.
2. Create a monitor for each CronJob
Create a monitor that matches your CronJob's schedule. Set the intervalSeconds to the expected time between successful completions, plus a grace period.
# Monitor for a nightly job: 24h between runs (86400s) plus an hour of slack
curl -X POST https://cronpeek.web.app/api/monitors \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"name": "k8s-nightly-backup",
"intervalSeconds": 90000,
"graceSeconds": 300
}'
# Response:
# {
# "id": "mon_abc123",
# "token": "ping_xxxxxxxxxxxxxxxx",
# "pingUrl": "https://cronpeek.web.app/api/ping/ping_xxxxxxxxxxxxxxxx"
# }
The token is what you'll add to your CronJob spec. The graceSeconds gives your job a buffer—if it usually takes 10 minutes to complete, set grace to 600 so you aren't alerted for slow-but-successful runs.
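As a sanity check on the numbers: for a nightly job, `intervalSeconds` is a day plus whatever slack you want for late scheduling, and `graceSeconds` sits on top of that:

```shell
# Nightly job: one run per day, with an hour of slack folded into the interval
DAY=$((24 * 60 * 60))        # 86400 seconds between runs
INTERVAL=$((DAY + 3600))     # 90000, as in the monitor above
GRACE=300                    # alert only after 5 more minutes of silence
echo "$INTERVAL $GRACE"      # prints: 90000 300
```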
3. Add the ping to your CronJob YAML
The simplest approach: add a curl command after your main task. Use && so the ping only fires on success.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
  namespace: production
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 600
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 3600
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: your-registry/backup-tool:latest
            command:
            - /bin/sh
            - -c
            - |
              # Run the actual backup
              /usr/local/bin/backup-db.sh \
                --host=$DB_HOST \
                --output=/tmp/backup.sql.gz &&
              # Upload to S3
              aws s3 cp /tmp/backup.sql.gz \
                s3://backups/nightly/$(date +%Y-%m-%d).sql.gz &&
              # Ping CronPeek on success
              curl -fsS --retry 3 --max-time 10 \
                https://cronpeek.web.app/api/ping/$CRONPEEK_TOKEN
            env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: host
            - name: CRONPEEK_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cronpeek-secrets
                  key: backup-monitor-token
            resources:
              requests:
                memory: "256Mi"
                cpu: "100m"
              limits:
                memory: "512Mi"
                cpu: "500m"
Store the ping token in a Kubernetes Secret so it's not hardcoded in your manifests:
kubectl create secret generic cronpeek-secrets \
--namespace=production \
--from-literal=backup-monitor-token=ping_xxxxxxxxxxxxxxxx
Why curl -fsS --retry 3? The -f flag makes curl exit non-zero on HTTP errors (4xx/5xx) instead of treating them as success, so && chains break correctly. -sS suppresses the progress meter but still shows errors. --retry 3 handles transient network issues. --max-time 10 prevents curl from hanging if the endpoint is slow. Together these make the ping reliable without blocking your job.
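One subtlety worth deciding deliberately: in the chain above, a failed ping also fails the job, so Kubernetes may re-run an otherwise successful backup up to `backoffLimit` times. If re-running is expensive, decouple the two. A sketch of a wrapper (the `ping_on_success` name is mine, not part of any API):

```shell
#!/bin/sh
# ping_on_success CMD...: run the task, ping only if it succeeded, and never
# let a failed ping change the job's exit code (a missed ping still alerts).
ping_on_success() {
  "$@"
  status=$?
  if [ "$status" -eq 0 ]; then
    curl -fsS --retry 3 --max-time 10 \
      "https://cronpeek.web.app/api/ping/$CRONPEEK_TOKEN" || true
  fi
  return "$status"
}

# Usage inside the container command, e.g.:
#   ping_on_success /usr/local/bin/backup-db.sh --host="$DB_HOST"
```

The trade-off: the job's exit code now reflects only the backup itself, and monitoring of the ping path is left entirely to the dead man's switch.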
4. Alternative: sidecar container pattern
For jobs where you can't easily modify the main container's command, run a small heartbeat sidecar instead. The two containers coordinate through a shared `emptyDir` volume: the main container records whether the task succeeded and then signals completion, and the sidecar sends the ping only on success. This also templates cleanly in Helm charts:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etl-pipeline
  namespace: data
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: etl
            image: your-registry/etl-runner:v2.4
            command:
            - /bin/sh
            - -c
            - |
              # Run the pipeline and record its exit code
              /usr/local/bin/run-etl
              status=$?
              # Mark success for the sidecar, then signal completion
              [ "$status" -eq 0 ] && touch /shared/success
              touch /shared/done
              exit "$status"
            volumeMounts:
            - name: shared
              mountPath: /shared
          - name: heartbeat
            image: curlimages/curl:8.5.0
            command:
            - /bin/sh
            - -c
            - |
              # Wait for the main container to signal completion
              while [ ! -f /shared/done ]; do sleep 5; done
              # Ping only if it succeeded
              if [ -f /shared/success ]; then
                curl -fsS --retry 3 --max-time 10 \
                  https://cronpeek.web.app/api/ping/$CRONPEEK_TOKEN
              fi
            env:
            - name: CRONPEEK_TOKEN
              valueFrom:
                secretKeyRef:
                  name: cronpeek-secrets
                  key: etl-monitor-token
            volumeMounts:
            - name: shared
              mountPath: /shared
          volumes:
          - name: shared
            emptyDir: {}
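The file handshake is easy to get wrong (for example, polling for a marker file that doesn't exist yet exits the loop immediately), so it's worth exercising locally. This simulates both containers against a temp directory standing in for the `emptyDir`:

```shell
#!/bin/sh
# Simulate the shared-volume handshake: the "etl" side marks success and
# completion, the "heartbeat" side waits for completion and decides whether
# to ping. A temp directory stands in for the emptyDir volume.
SHARED=$(mktemp -d)

# etl side (backgrounded here): run the task, record outcome, signal done
( true && touch "$SHARED/success"; touch "$SHARED/done" ) &

# heartbeat side: block on the done marker, then ping only on success
while [ ! -f "$SHARED/done" ]; do sleep 1; done
if [ -f "$SHARED/success" ]; then
  PINGED=yes   # stand-in for the curl call to the ping URL
else
  PINGED=no
fi
echo "$PINGED"   # prints: yes
rm -rf "$SHARED"
```

Ordering matters: `success` must be written before `done`, so the waiter can never observe `done` without the outcome already recorded.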
Setting Up Webhook Alerts for Missed Pings
When CronPeek detects a missed heartbeat, it fires a webhook to the URL you configured. This is where you connect it to your existing incident response pipeline.
Slack webhook
Point your CronPeek webhook at a Slack incoming webhook URL. When a ping is missed, your channel gets a message immediately. No custom integration needed.
PagerDuty / Opsgenie via webhook
Both PagerDuty and Opsgenie accept generic webhook events. Set your CronPeek webhook URL to their event ingestion endpoint. A missed ping becomes an incident automatically.
Custom webhook handler
For more control, point the webhook at your own endpoint. The payload includes the monitor name, expected interval, and time since last ping—enough context to route the alert intelligently.
# CronPeek webhook payload (POST to your URL):
{
"event": "monitor.missed",
"monitor": {
"id": "mon_abc123",
"name": "k8s-nightly-backup",
"lastPing": "2026-03-27T02:14:33Z",
"expectedBy": "2026-03-28T03:19:33Z"
}
}
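A handler mostly needs to pull the name and timestamp out of that payload. Sticking with the shell-and-jq tooling used elsewhere in this post, a minimal routing sketch (the payload literal here is hand-copied from above, not fetched):

```shell
#!/bin/sh
# Extract routing fields from a monitor.missed payload with jq
PAYLOAD='{"event":"monitor.missed","monitor":{"id":"mon_abc123","name":"k8s-nightly-backup","lastPing":"2026-03-27T02:14:33Z"}}'

EVENT=$(printf '%s' "$PAYLOAD" | jq -r '.event')
NAME=$(printf '%s' "$PAYLOAD" | jq -r '.monitor.name')
LAST=$(printf '%s' "$PAYLOAD" | jq -r '.monitor.lastPing')

# Route only missed-ping events; other event types pass through silently
if [ "$EVENT" = "monitor.missed" ]; then
  echo "ALERT: $NAME has been silent since $LAST"
fi
```

With the naming convention from the next section, `$NAME` alone is enough to route the alert to the right cluster and team.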
Monitoring Multiple CronJobs Across Namespaces
In a real cluster, you don't have one CronJob. You have dozens, spread across namespaces: production, staging, data, monitoring. Each needs its own monitor with its own schedule and grace period.
Naming convention
Use a consistent naming pattern so you can identify which cluster, namespace, and job triggered an alert:
# Pattern: {cluster}-{namespace}-{cronjob-name}
curl -X POST https://cronpeek.web.app/api/monitors \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"name": "prod-us-east-production-nightly-backup",
"intervalSeconds": 90000,
"graceSeconds": 300
}'
curl -X POST https://cronpeek.web.app/api/monitors \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"name": "prod-us-east-data-etl-hourly",
"intervalSeconds": 3900,
"graceSeconds": 300
}'
curl -X POST https://cronpeek.web.app/api/monitors \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"name": "prod-us-east-billing-invoice-gen",
"intervalSeconds": 90000,
"graceSeconds": 600
}'
Automating monitor creation with a script
If you're managing many CronJobs, automate the monitor creation. This script reads all CronJobs in a namespace and creates a CronPeek monitor for each:
#!/bin/bash
# create-monitors.sh — Bulk-create CronPeek monitors from k8s CronJobs
CLUSTER="prod-us-east"
NAMESPACE="production"
API_KEY="YOUR_API_KEY"
kubectl get cronjobs -n "$NAMESPACE" -o json | \
jq -r '.items[] | "\(.metadata.name) \(.spec.schedule)"' | \
while read NAME SCHEDULE; do
# Convert cron schedule to approximate interval in seconds
# (simplified — adjust for your schedules)
INTERVAL=$(python3 -c "
from croniter import croniter
from datetime import datetime
c = croniter('$SCHEDULE', datetime.now())
n1 = c.get_next(datetime)
n2 = c.get_next(datetime)
print(int((n2 - n1).total_seconds()))
")
echo "Creating monitor: $CLUSTER-$NAMESPACE-$NAME (interval: ${INTERVAL}s)"
curl -sX POST https://cronpeek.web.app/api/monitors \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_KEY" \
-d "{
\"name\": \"$CLUSTER-$NAMESPACE-$NAME\",
\"intervalSeconds\": $((INTERVAL + 300)),
\"graceSeconds\": 300
}"
echo
done
Storing tokens in a shared Secret
After creating monitors, store all ping tokens in a single Kubernetes Secret per namespace. Your CronJob specs reference the appropriate key:
kubectl create secret generic cronpeek-tokens \
--namespace=production \
--from-literal=nightly-backup=ping_xxxx1 \
--from-literal=etl-hourly=ping_xxxx2 \
--from-literal=invoice-gen=ping_xxxx3
What About Prometheus and Kube-State-Metrics?
You might be thinking: "I already have Prometheus. Can't kube_job_status_failed catch this?" Yes and no.
Prometheus with kube-state-metrics can alert on failed jobs—pods that exited with a non-zero code. But out of the box it won't catch:
- Jobs that never started — skipped due to `concurrencyPolicy: Forbid` or a missed `startingDeadlineSeconds`
- Jobs that "succeeded" but didn't work — exit code 0 but the actual task failed internally
- Schedule drift — the job ran but 4 hours late due to cluster issues
A dead man's switch covers all of these because it checks for the positive signal (the ping arrived) rather than the negative signal (a failure was reported). The two approaches are complementary. Use Prometheus for cluster-level observability. Use CronPeek for business-logic-level assurance that your scheduled work actually happened.
Pricing: CronPeek vs Enterprise Monitoring
If you're running 20–50 CronJobs across a couple of namespaces, here's what monitoring costs look like:
- Cronitor: ~$100/mo for 50 monitors ($2/monitor/mo)
- Datadog Synthetic Monitoring: Included in expensive tiers, but you're already paying $23+/host/mo
- CronPeek: $9/mo for 50 monitors. $29/mo for unlimited.
For a startup running Kubernetes, $9/mo for complete CronJob monitoring is a rounding error in your cloud bill. But it's the $9 that saves you from discovering your database backup hasn't run in two weeks.
Monitor every CronJob in your cluster
Free tier includes 5 monitors. Pro plan: $9/mo for 50 monitors. Set up your first monitor in under 2 minutes.
Start monitoring free →