Never Miss a Cron Job Failure Again: Heartbeat Monitoring Guide

Horux Team

January 5, 2026

• 9 min read

The Problem with Cron Jobs

Here's a story you might recognize:

It's 3 AM. Your database server just died. No problem—you have backups! You restore from the most recent backup and... wait. The backup is three months old. Your backup cron job has been failing silently for 90 days, and nobody noticed.

I've lived this nightmare. Twice. The second time, we lost six weeks of customer data because our backup job was writing to a disk that filled up in November. It's February now.

Cron jobs fail silently. They run in the background, often on a single server, with no one watching. When they fail, they fail quietly. And by the time you notice, it's usually too late.

What is Heartbeat Monitoring?

Heartbeat monitoring flips the traditional monitoring model on its head.

Instead of Horux checking if your job is running (how would it even know?), your job tells Horux "hey, I'm alive" every time it runs. If Horux doesn't hear from your job within the expected timeframe, it raises an alert.

Think of it like a check-in system:

  • Your backup job is supposed to run every night at 2 AM
  • After it completes, it pings Horux
  • If Horux doesn't get a ping by 3 AM, something went wrong
  • You get an alert immediately, not three months later

Why This Works Better

Traditional monitoring tries to run checks against your cron jobs. But how do you check a backup job? SSH into the server and verify the backup file exists? Parse log files? Both are fragile and complicated.

Heartbeat monitoring is simpler:

  1. Your job pings Horux when it succeeds
  2. Horux expects pings on a schedule
  3. Missing ping = alert

No SSH required. No log parsing. No complex checks. Just a simple "I'm done" message.
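To make the idea concrete, here is a minimal sketch of what the monitoring side has to do. This is not Horux's actual implementation; the `HeartbeatMonitor` shape and the `recordPing` and `isOverdue` names are hypothetical and exist only for illustration.

```typescript
// Hypothetical sketch of a heartbeat ("dead man's switch") check.
// Not Horux's real implementation; all names here are illustrative.
interface HeartbeatMonitor {
  name: string;
  expectedEveryMs: number; // how often a ping should arrive
  graceMs: number;         // extra time allowed before alerting
  lastPingAt: number;      // epoch ms of the last ping (starts at creation time)
}

// The job says "I'm done": remember when we last heard from it.
function recordPing(monitor: HeartbeatMonitor, now: number): HeartbeatMonitor {
  return { ...monitor, lastPingAt: now };
}

// The monitor is overdue once the silence exceeds interval + grace period.
function isOverdue(monitor: HeartbeatMonitor, now: number): boolean {
  return now - monitor.lastPingAt > monitor.expectedEveryMs + monitor.graceMs;
}

const HOUR = 60 * 60 * 1000;

let backup: HeartbeatMonitor = {
  name: 'Database Backup',
  expectedEveryMs: 24 * HOUR,
  graceMs: 0.5 * HOUR,
  lastPingAt: 0,
};

backup = recordPing(backup, 2 * HOUR);       // job pinged at hour 2
console.log(isOverdue(backup, 20 * HOUR));   // false: 18h of silence
console.log(isOverdue(backup, 27 * HOUR));   // true: 25h of silence > 24.5h allowed
```

The job never has to be reachable from the outside. The only thing the monitor needs is a timestamp and a clock.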

Setting Up Heartbeat Monitoring

Let's monitor a database backup job. First, create a cron job monitor in Horux:

In the Horux UI:

  1. Go to your service
  2. Create a new monitor → "Cron Job / Heartbeat"
  3. Configure:
    • Name: "Database Backup"
    • Schedule: "0 2 * * *" (every day at 2 AM)
    • Grace Period: 30 minutes (allow time for job to complete)
    • Timezone: Your server's timezone

Horux gives you a unique heartbeat URL. Copy it.

In your backup script:

#!/bin/bash
# backup-database.sh

# Your unique check-in key from Horux
CHECK_IN_KEY="abc123_your_key_here"
HORUX_URL="https://api.horux.io/api/v1/cron-jobs/check-in/$CHECK_IN_KEY"

# Run the backup and branch on pg_dump's exit status directly
if pg_dump mydb > "/backups/mydb-$(date +%Y%m%d).sql"; then
  # Notify Horux that backup completed successfully.
  # --max-time and --retry keep a slow or flaky network from hanging the job.
  curl -fsS --max-time 10 --retry 3 -X POST "$HORUX_URL" \
    -H "Content-Type: application/json" \
    -d '{"metadata": {"status": "success"}}'
else
  # Notify Horux that backup failed
  curl -fsS --max-time 10 --retry 3 -X POST "$HORUX_URL" \
    -H "Content-Type: application/json" \
    -d '{"metadata": {"status": "failure", "error": "pg_dump failed"}}'
  exit 1 # Non-zero exit so cron and any wrapper scripts also see the failure
fi

That's it. Now if your backup doesn't run at all, runs but fails, or succeeds but its ping never reaches Horux, you'll get an alert.

Using Node.js for Cleaner Code

If your job is in Node.js, you can easily wrap the check-in call. Since the Horux SDK focuses on metrics and logs, we'll use a simple fetch helper for cron heartbeats:

// helper.ts
export async function sendHeartbeat(checkInKey: string, metadata?: Record<string, unknown>) {
  try {
    const response = await fetch(`https://api.horux.io/api/v1/cron-jobs/check-in/${checkInKey}`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ metadata })
    });
    // fetch only rejects on network errors, so check the HTTP status explicitly
    if (!response.ok) {
      console.error(`Heartbeat rejected with HTTP ${response.status}`);
    }
  } catch (error) {
    // Log error but don't crash the job if monitoring fails
    console.error('Failed to send heartbeat:', error);
  }
}

Now use it in your jobs:

import { sendHeartbeat } from './helper';

const CHECK_IN_KEY = process.env.BACKUP_CHECK_IN_KEY!;

async function runBackup() {
  try {
    // Your backup logic here
    await performDatabaseBackup();

    // Report success
    await sendHeartbeat(CHECK_IN_KEY, {
      status: 'success',
      backupSize: '2.4GB',
      duration: 45 // seconds
    });
  } catch (error) {
    // Report failure with details
    await sendHeartbeat(CHECK_IN_KEY, {
      status: 'failure',
      error: error instanceof Error ? error.message : 'Unknown error'
    });

    throw error; // Re-throw so your job scheduler knows it failed
  }
}

Real-World Examples

1. Database Backups (obviously)

Already covered above. This is the most critical one to monitor.

2. Report Generation

// Generate weekly user report
// (cron.schedule here follows the style of scheduler libraries such as node-cron)
cron.schedule('0 9 * * MON', async () => {
  try {
    const report = await generateWeeklyReport();
    await sendReportEmail(report);

    await sendHeartbeat('weekly-report-key', {
      status: 'success',
      reportRows: report.length,
      recipients: 5
    });
  } catch (error) {
    await sendHeartbeat('weekly-report-key', {
      status: 'failure',
      error: error instanceof Error ? error.message : 'Unknown error'
    });
  }
});

3. Cleanup Jobs

// Delete old records every night
cron.schedule('0 1 * * *', async () => {
  try {
    const deleted = await db.userSessions.deleteMany({
      where: { expiresAt: { lt: new Date() } }
    });

    await sendHeartbeat('cleanup-sessions-key', {
      status: 'success',
      deleted: deleted.count
    });
  } catch (error) {
    await sendHeartbeat('cleanup-sessions-key', {
      status: 'failure',
      error: error instanceof Error ? error.message : 'Unknown error'
    });
  }
});

4. Data Sync Jobs

// Sync with external API every hour
cron.schedule('0 * * * *', async () => {
  try {
    const synced = await syncWithExternalAPI();

    await sendHeartbeat('api-sync-key', {
      status: 'success',
      recordsSynced: synced.length,
      lastSyncId: synced[synced.length - 1]?.id
    });
  } catch (error) {
    await sendHeartbeat('api-sync-key', {
      status: 'failure',
      error: error instanceof Error ? error.message : 'Unknown error'
    });
  }
});

Advanced Patterns

Time-Based vs Completion-Based Monitoring

You can monitor in two ways:

Time-based (what we've shown):

  • Expected: Job should ping by 2:30 AM
  • Alert: If no ping by 2:30 AM, send alert

Completion-based:

  • Expected: Job should ping every 24 hours
  • Alert: If no ping in last 24 hours, send alert

Use time-based for scheduled jobs (backups, reports). Use completion-based for continuous processes that should check in regularly.
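The difference between the two modes is easiest to see as two small predicates. The sketch below is illustrative only: `missedScheduledRun` and `missedInterval` are hypothetical names, and Horux's real evaluation logic may differ.

```typescript
// Hypothetical sketch of the two alerting rules; not Horux's actual logic.

// Time-based: after the deadline (scheduled time + grace period) has passed,
// alert unless a ping has arrived since the scheduled time.
function missedScheduledRun(
  scheduledAt: number,
  graceMs: number,
  lastPingAt: number | null,
  now: number,
): boolean {
  const deadline = scheduledAt + graceMs;
  if (now <= deadline) return false; // still inside the allowed window
  return lastPingAt === null || lastPingAt < scheduledAt;
}

// Completion-based: only the gap since the last ping matters.
function missedInterval(lastPingAt: number, maxGapMs: number, now: number): boolean {
  return now - lastPingAt > maxGapMs;
}

const MINUTE = 60 * 1000;
const DAY = 24 * 60 * MINUTE;
const scheduled = Date.UTC(2026, 0, 5, 2, 0); // Jan 5, 2026, 02:00 UTC
const yesterdayPing = scheduled - DAY + 10 * MINUTE;

// 02:31 with only yesterday's ping: the 02:30 deadline was missed.
console.log(missedScheduledRun(scheduled, 30 * MINUTE, yesterdayPing, scheduled + 31 * MINUTE)); // true
// 02:31 with a ping at 02:12 today: fine.
console.log(missedScheduledRun(scheduled, 30 * MINUTE, scheduled + 12 * MINUTE, scheduled + 31 * MINUTE)); // false
// Completion-based: 25 hours of silence against a 24-hour limit.
console.log(missedInterval(scheduled, DAY, scheduled + 25 * 60 * MINUTE)); // true
```

The time-based rule needs to know the schedule. The completion-based rule only needs the last ping, which is why it suits processes without a fixed start time.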

Tracking Job Duration

Include timing information in your heartbeat:

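const start = Date.now();

try {
  // Assumes performBackup() resolves with the backup size
  const size = await performBackup();

  await sendHeartbeat('backup-key', {
    status: 'success',
    duration: Date.now() - start, // milliseconds
    backupSize: size
  });
} catch (error) {
  await sendHeartbeat('backup-key', {
    status: 'failure',
    duration: Date.now() - start, // milliseconds
    error: error instanceof Error ? error.message : 'Unknown error'
  });
}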

Now you can alert if your backup job is taking way longer than usual—often a sign that something's wrong.
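One simple way to define "way longer than usual" is to compare the latest run against the median of recent runs. The sketch below is an assumption about how you might do this yourself. `isUnusuallySlow`, the factor of 2, and the minimum of three historical runs are all illustrative choices, not Horux features.

```typescript
// Hypothetical helper: flag a run that is much slower than recent history.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

function isUnusuallySlow(recentMs: number[], currentMs: number, factor = 2): boolean {
  if (recentMs.length < 3) return false; // not enough history to judge
  return currentMs > median(recentMs) * factor;
}

// Durations of the last five runs, in milliseconds (median is 45000)
const history = [44000, 45000, 47000, 46000, 43000];

console.log(isUnusuallySlow(history, 50000));  // false: within 2x the median
console.log(isUnusuallySlow(history, 120000)); // true: more than 2x the median
```

The median is used rather than the mean so that one earlier slow run does not raise the baseline and hide the next one.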

Handling Failures Gracefully

Don't let a failed heartbeat call crash your job:

// Our helper function already handles errors with try/catch!
// See the implementation above.

Your backup job should complete successfully even if Horux is temporarily down.
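If a single dropped request is enough to trigger a false alert, a small retry wrapper can help. The following is a sketch under stated assumptions: `sendWithRetry` and `backoffDelayMs` are hypothetical helpers, and the retry count and delays are arbitrary starting points. The `send` callback is whatever performs the request and resolves to `true` on success.

```typescript
// Hypothetical retry wrapper; names and defaults are illustrative.

// Delay before the next attempt: base, 2x base, 4x base, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 5000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Try `send` up to maxAttempts times. Never throws, so the job keeps running.
async function sendWithRetry(
  send: () => Promise<boolean>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      if (await send()) return true;
    } catch (error) {
      console.error(`Heartbeat attempt ${attempt + 1} failed:`, error);
    }
    // Wait before retrying, but not after the final attempt
    if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
  }
  return false;
}
```

A caller could pass something like `async () => (await fetch(url, { method: 'POST' })).ok` as `send`. Keeping the total retry time well below the grace period avoids the retries themselves causing a late check-in.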

Multiple Environments

Use different heartbeat URLs/keys for dev/staging/prod:

# production
HORUX_URL="https://api.horux.io/api/v1/cron-jobs/check-in/prod-key-abc"

# staging
HORUX_URL="https://api.horux.io/api/v1/cron-jobs/check-in/staging-key-xyz"

This way you won't get woken up at 3 AM because your staging backup failed.
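In a Node.js job, the same separation can come from environment variables. The sketch below is one possible arrangement: the variable names (`HORUX_KEY_PROD` and so on) and the `checkInUrlFor` helper are hypothetical, and returning `null` is simply a way to skip heartbeats where no key is configured.

```typescript
// Hypothetical per-environment key lookup; variable names are illustrative.
type Environment = 'production' | 'staging' | 'development';

const KEY_VARIABLES: Record<Environment, string> = {
  production: 'HORUX_KEY_PROD',
  staging: 'HORUX_KEY_STAGING',
  development: 'HORUX_KEY_DEV',
};

// Returns the check-in URL, or null when no key is set for that environment.
function checkInUrlFor(
  environment: Environment,
  env: Record<string, string | undefined>,
): string | null {
  const key = env[KEY_VARIABLES[environment]];
  return key ? `https://api.horux.io/api/v1/cron-jobs/check-in/${key}` : null;
}

// In a real job you would pass process.env here
const exampleEnv: Record<string, string | undefined> = {
  HORUX_KEY_PROD: 'prod-key-abc',
  HORUX_KEY_STAGING: 'staging-key-xyz',
};

console.log(checkInUrlFor('production', exampleEnv));
// https://api.horux.io/api/v1/cron-jobs/check-in/prod-key-abc
console.log(checkInUrlFor('development', exampleEnv)); // null
```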

What to Monitor (and What Not To)

Definitely monitor:

  • Database backups
  • Critical data syncs
  • Report generation for customers
  • Cleanup jobs that affect production
  • License/certificate renewal checks

Maybe monitor:

  • Log rotation (usually handled by OS)
  • Cache warming (nice to have, not critical)
  • Analytics aggregation (depends on business impact)

Don't bother:

  • Temporary file cleanup (not critical)
  • Cache invalidation (self-healing)
  • Development environment jobs

The rule of thumb: if its failure would cause an outage or data loss, monitor it. Otherwise, save yourself the alert fatigue.

Alert Configuration

For cron jobs, I recommend:

Critical jobs (backups, payments):

  • Alert immediately when missed
  • Escalate to on-call if not acknowledged in 15 minutes
  • No silence period

Important jobs (reports, sync):

  • Alert after first miss
  • Normal priority, no escalation
  • Can be silenced during maintenance

Nice-to-have jobs:

  • Alert after two consecutive misses
  • Low priority, email only
  • Long silence periods okay

Common Pitfalls

1. Forgetting the Grace Period

If your job takes 20 minutes to run but you set a 5-minute grace period, you'll get false alerts. Always add buffer time.
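As a rough starting point, you can derive the grace period from how long the job has actually taken. The rule below (slowest observed run times 1.5) is only one possible heuristic, and `suggestGraceMinutes` is a hypothetical helper, not something Horux provides.

```typescript
// Hypothetical heuristic: slowest observed run plus a safety buffer.
function suggestGraceMinutes(observedMinutes: number[], bufferFactor = 1.5): number {
  if (observedMinutes.length === 0) {
    throw new Error('Need at least one observed run');
  }
  const slowest = Math.max(...observedMinutes);
  return Math.ceil(slowest * bufferFactor);
}

// A job that has taken 18, 20, and 22 minutes
console.log(suggestGraceMinutes([18, 20, 22])); // 33
```

Revisit the number as the job's data grows, since a backup that takes 20 minutes today may take 40 next year.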

2. Not Handling Partial Failures

What if your backup job backs up 9 out of 10 databases? That's a success, right? Wrong. Track and report partial failures:

// Collect every failure instead of stopping at the first one
const failures = [];
for (const db of databases) {
  try {
    await backup(db);
  } catch (error) {
    failures.push({ db, error: error instanceof Error ? error.message : 'Unknown error' });
  }
}

// A single failed database makes the whole run a failure
await sendHeartbeat('multi-db-backup', {
  status: failures.length === 0 ? 'success' : 'failure',
  total: databases.length,
  failed: failures.length,
  failures: failures
});

3. Monitoring Too Many Jobs

Start with your 5 most critical jobs. Add more later. Don't try to monitor every cron job on day one.

Wrapping Up

Heartbeat monitoring is one of those features that seems simple but saves your ass when it matters most.

I wish we'd had this when our backup job failed silently for three months. I wish we'd had this when our invoice generation job broke and we didn't bill customers for two weeks. I wish we'd had this when our data sync job stopped running and we served stale data for a week.

You can wish, or you can set it up now. It takes 5 minutes and could save you from catastrophic data loss.

Check out our cron monitoring documentation for more examples, or sign up for Horux and start monitoring your jobs today.

Sleep better knowing your jobs are actually running. 😴


Got questions about monitoring specific types of jobs? Drop us a line at contact@horux.io and we'll help you set it up.