Webhooks Operations & Monitoring - Catena Telematics API

Monitoring webhook health is critical for maintaining reliable integrations. Catena provides comprehensive metrics, detailed logs, and replay capabilities to help you track delivery performance and recover from failures.

Proactive Monitoring: Track webhook success rates and response times to identify issues before they impact your integration.

Circuit Breaker

Catena automatically pauses webhooks when delivery performance drops below acceptable thresholds. This prevents wasted resources and signals that immediate attention is needed. When a webhook is paused:

Event delivery stops
New events are queued but not delivered
Webhook status changes to stale

You can subscribe to webhook.staled events to be notified automatically when the circuit breaker trips.

Subscribe to webhook.* to also receive webhook.created, webhook.updated, webhook.deleted, and webhook.paused events — giving you full visibility into your webhook subscription lifecycle.

You can also query for paused webhooks directly:

curl 'https://api.catenatelematics.com/v2/notifications/webhooks?status=stale' \
  -H 'Authorization: Bearer <token>'

The circuit breaker uses different evaluation strategies depending on delivery volume:

Volume	Threshold	Strategy
High (50+ deliveries)	95% EWMA success rate	Exponentially weighted moving average — more sensitive to recent failures
Medium (20–49 deliveries)	80% success rate	Simple success rate over the window
Low (<20 deliveries)	10 consecutive failures	Trips after 10 failures in a row

Grace Period: The circuit breaker will not evaluate a webhook for the first 15 minutes after it is created or updated. This prevents false positives during initial setup or after a configuration change.
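
The strategy table and grace period can be read as a small decision function. The sketch below is illustrative only, not Catena's implementation: the function names, the EWMA smoothing factor `ALPHA`, and the idea that volume is simply the number of outcomes passed in are all assumptions, since the page does not document the evaluation window or the smoothing factor.

```python
from datetime import datetime, timedelta, timezone
from typing import Sequence

GRACE_PERIOD = timedelta(minutes=15)  # from the Grace Period note above
ALPHA = 0.2  # ASSUMPTION: smoothing factor; the real value is not documented


def ewma_success_rate(outcomes: Sequence[bool], alpha: float = ALPHA) -> float:
    """EWMA over outcomes ordered oldest to newest (True = success)."""
    rate = 1.0 if outcomes[0] else 0.0
    for ok in outcomes[1:]:
        rate = alpha * (1.0 if ok else 0.0) + (1 - alpha) * rate
    return rate


def trailing_failures(outcomes: Sequence[bool]) -> int:
    """Count consecutive failures at the newest end of the sequence."""
    count = 0
    for ok in reversed(outcomes):
        if ok:
            break
        count += 1
    return count


def should_trip(outcomes: Sequence[bool], last_changed_at: datetime, now: datetime) -> bool:
    """Apply the volume-based thresholds from the table above."""
    if now - last_changed_at < GRACE_PERIOD:
        return False  # created or updated too recently to evaluate
    n = len(outcomes)
    if n == 0:
        return False
    if n >= 50:  # high volume: 95% EWMA success rate
        return ewma_success_rate(outcomes) < 0.95
    if n >= 20:  # medium volume: 80% simple success rate
        return sum(outcomes) / n < 0.80
    return trailing_failures(outcomes) >= 10  # low volume: 10 failures in a row
```

Under these assumptions, 55 successes followed by 5 failures trips the breaker, while the same 60 outcomes with the failures first do not. That is the sensitivity to recent failures the table attributes to the EWMA strategy.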

Reactivation

After identifying and fixing the root cause of delivery failures, reactivate the webhook to resume event delivery by working through the steps below.

Catena does not auto-reactivate paused webhooks. Only you know when the underlying issue has been resolved — reactivation must be triggered explicitly through the API.

Investigate the Issue

Review metrics and logs to understand why delivery performance degraded.

Resolve Problems

Fix endpoint issues, application bugs, or configuration errors.

Reactivate Webhook

Call the activate endpoint to resume delivery:

    curl -X POST \
      --url https://api.catenatelematics.com/v2/notifications/webhooks/<webhook_id>/activate \
      -H 'Authorization: Bearer <token>'

Replay Missed Events

Replay events from the Dead Letter Queue (DLQ) to recover any events that were queued while the webhook was paused. See Event Replay below.

Monitor Recovery

Track metrics closely to ensure performance improves and remains healthy.

Reactivation Warning: If the underlying issue isn’t resolved before reactivating, the circuit breaker may trip again shortly after. Monitor metrics closely in the first 15 minutes after reactivation.
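
The reactivate and replay steps can also be scripted. The following is a minimal sketch using only the Python standard library and the activate and replay endpoints shown on this page. The helper names `build_post` and `recover` are hypothetical, and the sketch assumes both endpoints accept an empty POST body, as the curl examples suggest.

```python
import urllib.request
from typing import Callable, List

BASE_URL = "https://api.catenatelematics.com/v2/notifications/webhooks"


def build_post(webhook_id: str, action: str, token: str) -> urllib.request.Request:
    """Build an authenticated POST for the activate or replay endpoint."""
    if action not in ("activate", "replay"):
        raise ValueError(f"unsupported action: {action}")
    return urllib.request.Request(
        f"{BASE_URL}/{webhook_id}/{action}",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


def recover(webhook_id: str, token: str,
            send: Callable = urllib.request.urlopen) -> List[str]:
    """Reactivate first, then replay; returns the actions performed in order."""
    done = []
    for action in ("activate", "replay"):
        # With the default urlopen, an HTTP error status raises and stops the sequence.
        response = send(build_post(webhook_id, action, token))
        close = getattr(response, "close", None)
        if close:
            close()
        done.append(action)
    return done
```

Calling `recover("<webhook_id>", "<token>")` sends both requests in order. Passing a different `send` callable makes it possible to dry-run the sequence without touching the API.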

Delivery Metrics

The Notifications API provides real-time and historical metrics to monitor webhook performance and reliability across four rolling time windows: 6h, 24h, 7d, and 14d.

curl --url https://api.catenatelematics.com/v2/notifications/webhooks/<webhook_id>/metrics \
  -H 'Authorization: Bearer <token>'

{
  "webhook_id": "247b2dea-a030-48b7-9a05-ee33c1b6ab0a",
  "http_attempts": {
    "6h": 105,
    "24h": 420,
    "7d": 3000,
    "14d": 6500
  },
  "http_success_attempts": {
    "6h": 100,
    "24h": 400,
    "7d": 2800,
    "14d": 6000
  },
  "http_failure_attempts": {
    "6h": 5,
    "24h": 20,
    "7d": 200,
    "14d": 500
  },
  "message_count": {
    "6h": 100,
    "24h": 400,
    "7d": 2800,
    "14d": 6000
  },
  "message_success_count": {
    "6h": 98,
    "24h": 390,
    "7d": 2700,
    "14d": 5800
  },
  "success_rate": {
    "6h": 98,
    "24h": 97,
    "7d": 96,
    "14d": 96
  },
  "avg_response_time_ms": {
    "6h": 150,
    "24h": 160,
    "7d": 155,
    "14d": 158
  },
  "ewma_success_rate": 0.97,
  "dlq_count": 3
}

Each time-windowed field is keyed by the same four windows. Key fields to monitor:

success_rate: The percentage (0 to 100) of messages successfully delivered after all retries. If any window drops below ~95%, investigate immediately.
ewma_success_rate: The exponentially weighted moving average success rate, expressed as a fraction (0 to 1) rather than a percentage. This is what the circuit breaker evaluates for high-volume webhooks; if it falls below 0.95, the webhook will be marked stale.
http_attempts vs message_count: HTTP attempts include retries; message count covers unique events. A high ratio of attempts to messages indicates frequent retries.
avg_response_time_ms: Sustained values above 2500ms put you at risk of timeouts.
dlq_count: Any non-zero value means events need to be replayed before the 14-day retention window expires.
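
The checks above translate directly into a small function over the metrics response. This is a sketch rather than an official client: the name `check_metrics` and the `max_retry_ratio` default of 1.5 are assumptions, because the page gives no numeric threshold for what counts as a high retry ratio. The other thresholds come from the field notes above.

```python
WINDOWS = ("6h", "24h", "7d", "14d")


def check_metrics(metrics: dict, max_retry_ratio: float = 1.5) -> list:
    """Return human-readable warnings for a parsed metrics response."""
    warnings = []
    for window in WINDOWS:
        rate = metrics["success_rate"][window]  # percentage, 0 to 100
        if rate < 95:
            warnings.append(f"{window}: success_rate {rate}% is below 95%")

        latency = metrics["avg_response_time_ms"][window]
        if latency > 2500:
            warnings.append(f"{window}: avg_response_time_ms {latency} exceeds 2500")

        messages = metrics["message_count"][window]
        if messages and metrics["http_attempts"][window] / messages > max_retry_ratio:
            warnings.append(f"{window}: frequent retries (attempts per message above {max_retry_ratio})")

    ewma = metrics["ewma_success_rate"]  # fraction, 0 to 1
    if ewma < 0.95:
        warnings.append(f"ewma_success_rate {ewma} is below 0.95")

    if metrics["dlq_count"] > 0:
        warnings.append(f"dlq_count is {metrics['dlq_count']}: replay before retention expires")
    return warnings
```

Run against the sample response above, this returns a single warning for the non-zero dlq_count; every other value is within the stated thresholds.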

Delivery Logs

Access detailed logs for every webhook delivery attempt. Logs are available for up to 14 days and include status codes, response times, error messages, and per-attempt detail.

curl --url "https://api.catenatelematics.com/v2/notifications/webhooks/<webhook_id>/logs?status=failed" \
  -H 'Authorization: Bearer <token>'

{
  "logs": [
    {
      "created_at": "2026-01-15T10:30:00.596772Z",
      "webhook_id": "247b2dea-a030-48b7-9a05-ee33c1b6ab0a",
      "message_id": "62cb8fea-e017-4b08-86b7-4469fa872b91",
      "event_name": "vehicle_location.added",
      "status": "failed",
      "status_code": 504,
      "error_message": "Request timeout. Please ensure the webhook endpoint is reachable and acknowledges receipt in less than 3 seconds.",
      "response_time_ms": 3000
    },
    {
      "created_at": "2026-01-15T09:14:22.312445Z",
      "webhook_id": "247b2dea-a030-48b7-9a05-ee33c1b6ab0a",
      "message_id": "91fa2bcd-3301-4e7a-bc12-7734ab991e22",
      "event_name": "vehicle.modified",
      "status": "failed",
      "status_code": 502,
      "error_message": "Connection error: Unable to establish connection to webhook endpoint.",
      "response_time_ms": null
    }
  ],
  "total": 2,
  "page": 1
}

Debugging Strategy: The error_message and status_code fields together usually pinpoint the root cause. A 504 with a “Request timeout” message means the endpoint did not acknowledge within 3 seconds, which usually points to synchronous processing in the handler; 502 and 503 errors suggest infrastructure issues upstream of your handler.
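
When many attempts have failed, grouping log entries by status code makes the dominant failure mode easier to see. The sketch below works on the `logs` array from the response above. The function name `summarize_failures` and the wording of the hints are illustrative assumptions, not part of the API.

```python
from collections import Counter

# Hints paraphrase the debugging guidance on this page; extend as needed.
HINTS = {
    504: "timeout: acknowledge first, then process asynchronously",
    502: "connection problem: check infrastructure upstream of the handler",
    503: "service unavailable: check infrastructure upstream of the handler",
}


def summarize_failures(logs: list) -> list:
    """Return (status_code, count, hint) tuples, most frequent code first."""
    counts = Counter(
        entry["status_code"] for entry in logs if entry["status"] == "failed"
    )
    return [
        (code, count, HINTS.get(code, "no specific hint: read error_message"))
        for code, count in counts.most_common()
    ]
```

For the two sample entries above, this yields one 504 and one 502, each with a count of 1.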

Event Replay

Recover from delivery failures by replaying events from the Dead Letter Queue (DLQ). When events fail all automatic retry attempts, they’re stored in the DLQ for up to 14 days, giving you time to fix issues and replay them. The replay functionality redelivers all DLQ events for a webhook subscription, allowing you to recover from temporary outages, application bugs, or configuration issues without losing data.

Common Replay Scenarios

Endpoint Downtime

Recover events lost during maintenance windows or infrastructure outages.

Application Errors

Reprocess events after fixing bugs in your webhook handler.

Configuration Issues

Replay events after correcting webhook URL or authentication problems.

Data Recovery

Reprocess historical events after resolving integration issues.

How to Replay Events

Identify Failed Events

Use metrics and logs to determine which events are in the DLQ and need replay.

Fix the Root Cause

Resolve the underlying issue that caused delivery failures before replaying.

Initiate Replay

Call the replay endpoint to redeliver all DLQ events to your webhook:

    curl -X POST \
      --url https://api.catenatelematics.com/v2/notifications/webhooks/<webhook_id>/replay \
      -H 'Authorization: Bearer <token>'

Monitor Redelivery

Watch logs and metrics to verify replayed events are successfully delivered.

14-Day Retention: Events are permanently deleted from the DLQ after 14 days. Replay critical events before they expire to avoid data loss.
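
To avoid missing the retention window, a replay deadline can be estimated from delivery log timestamps. This sketch assumes the 14-day clock starts no earlier than the `created_at` of a failed delivery attempt; the page does not state exactly when the clock starts, so treat the result as a conservative estimate. The helper names are hypothetical.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=14)  # DLQ retention stated on this page


def replay_deadline(created_at: str) -> datetime:
    """Estimate when an event expires from the DLQ, given a log timestamp."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    attempted = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    return attempted + RETENTION


def earliest_deadline(failed_logs: list) -> datetime:
    """The first estimated expiry across all failed log entries."""
    return min(replay_deadline(entry["created_at"]) for entry in failed_logs)
```

For the sample log entry created at 2026-01-15T10:30:00.596772Z, the estimated deadline is 2026-01-29T10:30:00.596772 UTC.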

Monitoring Best Practices

Set Up Alerts

Configure automated alerts for success rate and response time degradation to catch issues early.

Track Long-Term Trends

Review metrics across multiple time windows to identify patterns and seasonal variations.

Optimize Performance

Keep response times low by processing webhooks asynchronously and returning acknowledgments quickly.

Monitor the DLQ

Regularly check for events in the Dead Letter Queue and replay them before the retention period expires.

Analyze Failure Patterns

Use delivery logs to identify recurring issues and address root causes systematically.

Validate Configuration

Periodically verify webhook URLs, filters, and secrets remain correct and up to date.

Troubleshooting

Low Success Rate

Common Causes:

Endpoint responding too slowly or timing out
Application errors causing failed responses
Network connectivity or infrastructure issues
Insufficient server resources to handle load

Solutions:

Optimize endpoint performance with asynchronous processing
Fix application bugs and handle errors gracefully
Scale infrastructure to accommodate webhook volume
Review logs to identify specific error patterns

High DLQ Count

Common Causes:

Extended endpoint downtime or outages
Persistent application errors
Misconfigured webhook URL or authentication

Solutions:

Verify endpoint accessibility and correct configuration
Fix application issues preventing successful processing
Replay DLQ events after resolving the root cause

Slow Response Times

Common Causes:

Synchronous processing in the webhook handler
Database operations or external API calls in the request path
Insufficient server resources

Solutions:

Return acknowledgment immediately and process asynchronously
Move heavy operations to background jobs or queues
Optimize database queries and reduce blocking operations
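
One way to acknowledge immediately and process asynchronously is to hand each payload to a background worker before responding. The sketch below uses only the Python standard library. The handler, the `process` placeholder, the port, and the in-memory queue are all assumptions for illustration; a production handler would typically verify the request and use a durable queue so that events survive a restart.

```python
import json
import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

events: queue.Queue = queue.Queue()
processed = []  # stands in for real side effects in this sketch


def process(event: dict) -> None:
    """Placeholder for slow work such as database writes or external calls."""
    processed.append(event)


def worker() -> None:
    """Drain the queue forever, one event at a time."""
    while True:
        event = events.get()
        try:
            process(event)
        finally:
            events.task_done()


def start_worker() -> threading.Thread:
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            events.put(json.loads(body))  # enqueue only; no heavy work here
        except ValueError:  # invalid JSON or undecodable bytes
            self.send_response(400)
            self.end_headers()
            return
        self.send_response(200)  # acknowledge before processing happens
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the worker and block serving webhook requests."""
    start_worker()
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Because the request path only parses and enqueues, response time stays independent of how long `process` takes.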

Webhook Deactivated

Common Causes:

Delivery performance dropped below circuit breaker thresholds
Persistent endpoint unavailability or misconfiguration

Solutions:

Review metrics to identify when and why performance degraded
Check logs for error patterns and failure modes
Fix the underlying issues before reactivating
Monitor closely after reactivation to ensure sustained health