Monitoring webhook health is critical for maintaining reliable integrations. Catena provides comprehensive metrics, detailed logs, and replay capabilities to help you track delivery performance and recover from failures.
Proactive Monitoring: Track webhook success rates and response times to identify issues before they impact your integration.

Circuit Breaker

Catena automatically pauses webhooks when delivery performance drops below acceptable thresholds. This prevents wasted resources and signals that immediate attention is needed. When a webhook is paused:
  • Event delivery stops
  • New events are queued but not delivered
  • Webhook status changes to stale
You can subscribe to webhook.staled events to be notified automatically when the circuit breaker trips.
Subscribe to webhook.* to also receive webhook.created, webhook.updated, webhook.deleted, and webhook.paused events — giving you full visibility into your webhook subscription lifecycle.
You can also query for paused webhooks directly:
curl 'https://api.catenatelematics.com/v2/notifications/webhooks?status=stale' \
  -H 'Authorization: Bearer <token>'
The circuit breaker uses different evaluation strategies depending on delivery volume:
Volume | Threshold | Strategy
High (50+ deliveries) | 95% EWMA success rate | Exponentially weighted moving average; more sensitive to recent failures
Medium (20–49 deliveries) | 80% success rate | Simple success rate over the window
Low (<20 deliveries) | 10 consecutive failures | Trips after 10 failures in a row
Grace Period: The circuit breaker will not evaluate a webhook for the first 15 minutes after it is created or updated. This prevents false positives during initial setup or after a configuration change.
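The three strategies above can be sketched in a few lines of Python. This is only an illustration of the documented thresholds, not Catena's implementation: the smoothing factor `alpha`, the function names, and the representation of a delivery window as a list of booleans (oldest first) are all assumptions, and the 15-minute grace period is not modeled.

```python
from typing import Sequence


def ewma_success_rate(outcomes: Sequence[bool], alpha: float = 0.2) -> float:
    """EWMA of delivery outcomes, oldest first. alpha is an assumed value."""
    if not outcomes:
        raise ValueError("at least one delivery outcome is required")
    rate = 1.0 if outcomes[0] else 0.0
    for ok in outcomes[1:]:
        # Recent outcomes carry more weight than older ones.
        rate = alpha * (1.0 if ok else 0.0) + (1 - alpha) * rate
    return rate


def trailing_failures(outcomes: Sequence[bool]) -> int:
    """Number of consecutive failures at the end of the window."""
    count = 0
    for ok in reversed(outcomes):
        if ok:
            break
        count += 1
    return count


def should_trip(outcomes: Sequence[bool], alpha: float = 0.2) -> bool:
    """Apply the documented threshold for the window's delivery volume."""
    n = len(outcomes)
    if n >= 50:  # high volume: EWMA success rate below 95%
        return ewma_success_rate(outcomes, alpha) < 0.95
    if n >= 20:  # medium volume: simple success rate below 80%
        return sum(outcomes) / n < 0.80
    return trailing_failures(outcomes) >= 10  # low volume
```

With these assumptions, 95 successes followed by 5 failures trips the breaker, while 5 failures followed by 95 successes does not, even though both windows have the same overall success rate. That is the sense in which the EWMA is more sensitive to recent failures.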

Reactivation

After identifying and fixing the root cause of delivery failures, reactivate the webhook to resume event delivery:
Catena does not auto-reactivate paused webhooks. Only you know when the underlying issue has been resolved — reactivation must be triggered explicitly through the API.
1. Investigate the Issue

Review metrics and logs to understand why delivery performance degraded.

2. Resolve Problems

Fix endpoint issues, application bugs, or configuration errors.

3. Reactivate Webhook

Call the activate endpoint to resume delivery:
    curl -X POST \
      --url https://api.catenatelematics.com/v2/notifications/webhooks/<webhook_id>/activate \
      -H 'Authorization: Bearer <token>'

4. Replay Missed Events

Replay events from the DLQ to recover any events that were queued while the webhook was paused. See Event Replay below.

5. Monitor Recovery

Track metrics closely to ensure performance improves and remains healthy.
Reactivation Warning: If the underlying issue isn’t resolved before reactivating, the circuit breaker may trip again shortly after. Monitor metrics closely in the first 15 minutes after reactivation.
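If you script recovery, the activate and replay calls can be chained. The following is a minimal sketch using only the Python standard library; the function names and the injectable `send_fn` parameter are illustrative assumptions, and only the two endpoint paths and the Bearer header come from this page.

```python
import urllib.request
from typing import Callable, List

BASE_URL = "https://api.catenatelematics.com/v2/notifications/webhooks"


def build_post(webhook_id: str, action: str, token: str) -> urllib.request.Request:
    """Build a POST request for a webhook action ("activate" or "replay")."""
    return urllib.request.Request(
        f"{BASE_URL}/{webhook_id}/{action}",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


def send(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status; raises on 4xx/5xx."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def reactivate_and_replay(
    webhook_id: str,
    token: str,
    send_fn: Callable[[urllib.request.Request], int] = send,
) -> List[int]:
    """Reactivate the webhook, then replay its DLQ events, in that order."""
    return [
        send_fn(build_post(webhook_id, action, token))
        for action in ("activate", "replay")
    ]
```

Because `urlopen` raises on an error status, a failed activation stops the sequence before the replay is attempted. Run this only after the root cause is fixed; otherwise the circuit breaker may trip again.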

Delivery Metrics

The Notifications API provides real-time and historical metrics to monitor webhook performance and reliability across four rolling time windows: 6h, 24h, 7d, and 14d.
curl --url https://api.catenatelematics.com/v2/notifications/webhooks/<webhook_id>/metrics \
  -H 'Authorization: Bearer <token>'
{
  "webhook_id": "247b2dea-a030-48b7-9a05-ee33c1b6ab0a",
  "http_attempts": {
    "6h": 105,
    "24h": 420,
    "7d": 3000,
    "14d": 6500
  },
  "http_success_attempts": {
    "6h": 100,
    "24h": 400,
    "7d": 2800,
    "14d": 6000
  },
  "http_failure_attempts": {
    "6h": 5,
    "24h": 20,
    "7d": 200,
    "14d": 500
  },
  "message_count": {
    "6h": 100,
    "24h": 400,
    "7d": 2800,
    "14d": 6000
  },
  "message_success_count": {
    "6h": 98,
    "24h": 390,
    "7d": 2700,
    "14d": 5800
  },
  "success_rate": {
    "6h": 98,
    "24h": 97,
    "7d": 96,
    "14d": 96
  },
  "avg_response_time_ms": {
    "6h": 150,
    "24h": 160,
    "7d": 155,
    "14d": 158
  },
  "ewma_success_rate": 0.97,
  "dlq_count": 3
}
In the response above, ewma_success_rate and dlq_count are single values; every other metric is reported once per rolling window. Key fields to monitor:
  • success_rate: The percentage of messages successfully delivered after all retries, reported on a 0–100 scale in the sample above. If any window drops below ~95%, investigate immediately.
  • ewma_success_rate: The exponentially weighted moving average success rate, reported as a fraction (0.97 in the sample above) rather than a percentage. This is what the circuit breaker evaluates for high-volume webhooks; if it falls below 0.95, the webhook will be marked stale.
  • http_attempts vs message_count: http_attempts includes retries; message_count counts unique events. A high ratio of attempts to messages indicates frequent retries.
  • avg_response_time_ms: Sustained values above 2500ms leave little headroom before the 3-second acknowledgment timeout.
  • dlq_count: Any non-zero value means events need to be replayed before the 14-day retention window expires.
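These checks are easy to automate against the metrics response. The sketch below assumes the response has been parsed into a dict shaped like the sample above; the function name and the `max_retry_ratio` cutoff of 1.2 are illustrative assumptions, since this page does not define what counts as a high retry ratio.

```python
from typing import Any, Dict, List

WINDOWS = ("6h", "24h", "7d", "14d")


def metric_warnings(metrics: Dict[str, Any], max_retry_ratio: float = 1.2) -> List[str]:
    """Return human-readable warnings for a parsed metrics response."""
    warnings = []
    for window in WINDOWS:
        # success_rate is a percentage in the sample response.
        if metrics["success_rate"][window] < 95:
            warnings.append(f"success_rate[{window}] is below 95%")
        if metrics["avg_response_time_ms"][window] > 2500:
            warnings.append(f"avg_response_time_ms[{window}] is above 2500ms")
        messages = metrics["message_count"][window]
        # Attempts per unique message; values well above 1 mean many retries.
        if messages and metrics["http_attempts"][window] / messages > max_retry_ratio:
            warnings.append(f"retry ratio[{window}] is above {max_retry_ratio}")
    # ewma_success_rate is a fraction, not a percentage.
    if metrics["ewma_success_rate"] < 0.95:
        warnings.append("ewma_success_rate is below 0.95")
    if metrics["dlq_count"] > 0:
        warnings.append(f"dlq_count is {metrics['dlq_count']}: replay before retention expires")
    return warnings
```

Applied to the sample response above, the only warning is the one for dlq_count, because three events are waiting in the DLQ.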

Delivery Logs

Access detailed logs for every webhook delivery attempt. Logs are available for up to 14 days and include status codes, response times, error messages, and per-attempt detail.
curl --url "https://api.catenatelematics.com/v2/notifications/webhooks/<webhook_id>/logs?status=failed" \
  -H 'Authorization: Bearer <token>'
{
  "logs": [
    {
      "created_at": "2026-01-15T10:30:00.596772Z",
      "webhook_id": "247b2dea-a030-48b7-9a05-ee33c1b6ab0a",
      "message_id": "62cb8fea-e017-4b08-86b7-4469fa872b91",
      "event_name": "vehicle_location.added",
      "status": "failed",
      "status_code": 504,
      "error_message": "Request timeout. Please ensure the webhook endpoint is reachable and acknowledges receipt in less than 3 seconds.",
      "response_time_ms": 3000
    },
    {
      "created_at": "2026-01-15T09:14:22.312445Z",
      "webhook_id": "247b2dea-a030-48b7-9a05-ee33c1b6ab0a",
      "message_id": "91fa2bcd-3301-4e7a-bc12-7734ab991e22",
      "event_name": "vehicle.modified",
      "status": "failed",
      "status_code": 502,
      "error_message": "Connection error: Unable to establish connection to webhook endpoint.",
      "response_time_ms": null
    }
  ],
  "total": 2,
  "page": 1
}
Debugging Strategy: The error_message and status_code fields together usually pinpoint the root cause. A 504 with a “Request timeout” message means your endpoint did not acknowledge within 3 seconds, which often points to synchronous processing in the handler; 502/503 errors suggest infrastructure issues upstream of your handler.
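To spot patterns across many failures, you can bucket log entries by likely cause. This sketch assumes the entries from the "logs" array shown above have been parsed into dicts; the bucket names and the `diagnose` and `summarize_failures` helpers are hypothetical, and the status-code mapping simply follows the guidance in this section.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def diagnose(entry: Dict[str, Any]) -> str:
    """Map a failed delivery log entry to a coarse likely cause."""
    code = entry["status_code"]
    if code == 504:
        return "timeout"  # endpoint did not acknowledge in time
    if code in (502, 503):
        return "infrastructure"  # problem upstream of the handler
    return "other"


def summarize_failures(logs: Iterable[Dict[str, Any]]) -> Counter:
    """Count failed deliveries per likely cause."""
    return Counter(diagnose(e) for e in logs if e["status"] == "failed")
```

For the two sample entries above, this yields one "timeout" and one "infrastructure" failure.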

Event Replay

Recover from delivery failures by replaying events from the Dead Letter Queue (DLQ). When events fail all automatic retry attempts, they’re stored in the DLQ for up to 14 days, giving you time to fix issues and replay them. The replay functionality redelivers all DLQ events for a webhook subscription, allowing you to recover from temporary outages, application bugs, or configuration issues without losing data.

Common Replay Scenarios

Endpoint Downtime

Recover events lost during maintenance windows or infrastructure outages.

Application Errors

Reprocess events after fixing bugs in your webhook handler.

Configuration Issues

Replay events after correcting webhook URL or authentication problems.

Data Recovery

Reprocess historical events after resolving integration issues.

How to Replay Events

1. Identify Failed Events

Use metrics and logs to determine which events are in the DLQ and need replay.

2. Fix the Root Cause

Resolve the underlying issue that caused delivery failures before replaying.

3. Initiate Replay

Call the replay endpoint to redeliver all DLQ events to your webhook:
    curl -X POST \
      --url https://api.catenatelematics.com/v2/notifications/webhooks/<webhook_id>/replay \
      -H 'Authorization: Bearer <token>'

4. Monitor Redelivery

Watch logs and metrics to verify replayed events are successfully delivered.
14-Day Retention: Events are permanently deleted from the DLQ after 14 days. Replay critical events before they expire to avoid data loss.
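To keep track of how long you have left to replay, you can estimate a deadline from a failed delivery's timestamp. This sketch assumes the 14-day clock starts at the log entry's created_at value, which this page does not state explicitly, so treat the result as an approximation; the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DLQ_RETENTION = timedelta(days=14)


def replay_deadline(created_at: str) -> datetime:
    """Estimate when an event leaves the DLQ, from its log timestamp."""
    # Normalize the trailing "Z" so older Python versions can parse it.
    failed_at = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    return failed_at + DLQ_RETENTION


def time_left(created_at: str, now: Optional[datetime] = None) -> timedelta:
    """Time remaining before the estimated deadline (negative if passed)."""
    now = now or datetime.now(timezone.utc)
    return replay_deadline(created_at) - now
```

For example, a failure logged at 2026-01-15T10:30:00.596772Z would have an estimated deadline of 2026-01-29 at the same time of day, UTC.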

Monitoring Best Practices

Set Up Alerts

Configure automated alerts for success rate and response time degradation to catch issues early.

Track Long-Term Trends

Review metrics across multiple time windows to identify patterns and seasonal variations.

Optimize Performance

Keep response times low by processing webhooks asynchronously and returning acknowledgments quickly.

Monitor the DLQ

Regularly check for events in the Dead Letter Queue and replay them before the retention period expires.

Analyze Failure Patterns

Use delivery logs to identify recurring issues and address root causes systematically.

Validate Configuration

Periodically verify webhook URLs, filters, and secrets remain correct and up to date.

Troubleshooting

Low Success Rate

Common Causes:
  • Endpoint responding too slowly or timing out
  • Application errors causing failed responses
  • Network connectivity or infrastructure issues
  • Insufficient server resources to handle load
Solutions:
  • Optimize endpoint performance with asynchronous processing
  • Fix application bugs and handle errors gracefully
  • Scale infrastructure to accommodate webhook volume
  • Review logs to identify specific error patterns

Events in the Dead Letter Queue

Common Causes:
  • Extended endpoint downtime or outages
  • Persistent application errors
  • Misconfigured webhook URL or authentication
Solutions:
  • Verify endpoint accessibility and correct configuration
  • Fix application issues preventing successful processing
  • Replay DLQ events after resolving the root cause

Slow Response Times

Common Causes:
  • Synchronous processing in the webhook handler
  • Database operations or external API calls in the request path
  • Insufficient server resources
Solutions:
  • Return acknowledgment immediately and process asynchronously
  • Move heavy operations to background jobs or queues
  • Optimize database queries and reduce blocking operations
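One way to acknowledge immediately and process asynchronously is to hand each delivery to an in-process queue and let a background worker do the slow work. The sketch below uses only the Python standard library. The payload shape (a JSON object with an event_name field), the `process_event` placeholder, and the `handle_delivery` entry point are assumptions for illustration; in production a durable queue is usually a better fit than an in-memory one.

```python
import json
import queue
import threading
from typing import Any, Dict, List

events: "queue.Queue[Dict[str, Any]]" = queue.Queue()
processed: List[str] = []


def process_event(event: Dict[str, Any]) -> None:
    """Placeholder for slow work (database writes, external API calls)."""
    processed.append(event["event_name"])


def worker() -> None:
    """Drain the queue in the background, outside the request path."""
    while True:
        event = events.get()
        try:
            process_event(event)
        except Exception as exc:  # keep the worker alive on bad events
            print(f"failed to process event: {exc}")
        finally:
            events.task_done()


threading.Thread(target=worker, daemon=True).start()


def handle_delivery(raw_body: bytes) -> int:
    """Called by your HTTP framework; returns the status code to send."""
    try:
        event = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400
    events.put(event)  # enqueue only; no heavy work before responding
    return 200
```

The handler does nothing but parse and enqueue, so its response time stays well under the 3-second limit regardless of how long `process_event` takes.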

Webhook Paused by the Circuit Breaker

Common Causes:
  • Delivery performance dropped below circuit breaker thresholds
  • Persistent endpoint unavailability or misconfiguration
Solutions:
  • Review metrics to identify when and why performance degraded
  • Check logs for error patterns and failure modes
  • Fix the underlying issues before reactivating
  • Monitor closely after reactivation to ensure sustained health