Proactive Monitoring: Track webhook success rates and response times to identify issues before they impact your integration.
Circuit Breaker
Catena automatically pauses webhooks when delivery performance drops below acceptable thresholds. This prevents wasted resources and signals that immediate attention is needed. When a webhook is paused:
- Event delivery stops
- New events are queued but not delivered
- Webhook status changes to stale

Subscribe to webhook.staled events to be notified automatically when the circuit breaker trips.
You can also query for paused webhooks directly.
The threshold that trips the circuit breaker depends on the webhook's delivery volume:
| Volume | Threshold | Strategy |
|---|---|---|
| High (50+ deliveries) | 95% EWMA success rate | Exponentially weighted moving average — more sensitive to recent failures |
| Medium (20–49 deliveries) | 80% success rate | Simple success rate over the window |
| Low (<20 deliveries) | 10 consecutive failures | Trips after 10 failures in a row |
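To illustrate why an exponentially weighted moving average reacts more strongly to recent failures than a simple success rate, here is a minimal sketch. The smoothing factor `alpha` and the starting value are assumptions chosen for illustration; the source does not state the parameters Catena actually uses.

```python
def ewma_success_rate(outcomes, alpha=0.1, initial=1.0):
    """Exponentially weighted moving average over delivery outcomes.

    outcomes: booleans ordered oldest first (True means delivered).
    alpha:    smoothing factor; an ASSUMED value, not Catena's setting.
    """
    ewma = initial
    for ok in outcomes:
        # Each new outcome gets weight alpha; older history decays by (1 - alpha).
        ewma = alpha * (1.0 if ok else 0.0) + (1.0 - alpha) * ewma
    return ewma


def simple_success_rate(outcomes):
    """Plain fraction of successful deliveries, ignoring their order."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes)


# Two histories with the same 4 failures out of 100 deliveries.
early_failures = [False] * 4 + [True] * 96   # failures happened long ago
recent_failures = [True] * 96 + [False] * 4  # failures are the latest events

print(simple_success_rate(early_failures), simple_success_rate(recent_failures))
print(ewma_success_rate(early_failures), ewma_success_rate(recent_failures))
```

Both histories have the same simple success rate, but under these assumed parameters only the one with recent failures falls below 0.95 on the EWMA.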
Grace Period: The circuit breaker will not evaluate a webhook for the first 15 minutes after it is created or updated. This prevents false positives during initial setup or after a configuration change.
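The volume tiers and the grace period can be read as a single decision rule. The sketch below is one possible reading, not Catena's implementation: the `WebhookStats` type, its field names, and the assumption that rates are fractions between 0 and 1 are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(minutes=15)


@dataclass
class WebhookStats:
    """Hypothetical container for the inputs the thresholds refer to."""
    deliveries: int
    success_rate: float          # assumed to be a fraction, 0.0 to 1.0
    ewma_success_rate: float     # assumed to be a fraction, 0.0 to 1.0
    consecutive_failures: int
    last_changed_at: datetime    # when the webhook was created or last updated


def should_trip(stats: WebhookStats, now: datetime) -> bool:
    """Return True if the circuit breaker would pause the webhook."""
    # No evaluation during the first 15 minutes after create or update.
    if now - stats.last_changed_at < GRACE_PERIOD:
        return False
    if stats.deliveries >= 50:           # high volume: EWMA threshold
        return stats.ewma_success_rate < 0.95
    if stats.deliveries >= 20:           # medium volume: simple success rate
        return stats.success_rate < 0.80
    return stats.consecutive_failures >= 10  # low volume: failure streak


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
busy = WebhookStats(200, 0.97, 0.93, 2, now - timedelta(hours=3))
print(should_trip(busy, now))
```

Note that a high-volume webhook can trip on its EWMA even while its simple success rate still looks healthy, as in the `busy` example.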
Reactivation
After identifying and fixing the root cause of delivery failures, reactivate the webhook to resume event delivery.
Replay Missed Events
Replay events from the DLQ to recover any events that were queued while the webhook was paused. See Event Replay below.
Delivery Metrics
The Notifications API provides real-time and historical metrics to monitor webhook performance and reliability across four rolling time windows: 6h, 24h, 7d, and 14d. Key fields to monitor:
- success_rate: The percentage of messages successfully delivered after all retries. If any window drops below ~95%, investigate immediately.
- ewma_success_rate: The exponentially weighted moving average success rate. This is what the circuit breaker evaluates for high-volume webhooks. If it falls below 0.95, the webhook will be marked stale.
- http_attempts vs message_count: HTTP attempts includes retries; message count counts unique events. A high ratio of attempts to messages indicates frequent retries.
- avg_response_time_ms: Sustained values above 2500 ms put you at risk of timeouts.
- dlq_count: Any non-zero value means events need to be replayed before the 14-day retention window expires.
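The checks described for these fields could be automated along the following lines. This is a sketch under stated assumptions: the metrics for one window are taken to be a flat dictionary keyed by the field names above, both rates are assumed to be fractions between 0 and 1, and the retry-ratio threshold of 1.5 is an arbitrary illustrative value.

```python
RETRY_RATIO_WARN = 1.5  # hypothetical threshold; tune for your own traffic


def check_window(metrics: dict) -> list[str]:
    """Return human-readable warnings for one time window's metrics."""
    warnings = []
    if metrics["success_rate"] < 0.95:
        warnings.append("success_rate below 95%: investigate")
    if metrics["ewma_success_rate"] < 0.95:
        warnings.append("ewma_success_rate below 0.95: circuit breaker risk")
    messages = metrics["message_count"]
    # Attempts include retries, so attempts / messages measures retry pressure.
    if messages > 0 and metrics["http_attempts"] / messages > RETRY_RATIO_WARN:
        warnings.append("high attempts-to-messages ratio: frequent retries")
    if metrics["avg_response_time_ms"] > 2500:
        warnings.append("avg_response_time_ms above 2500: timeout risk")
    if metrics["dlq_count"] > 0:
        warnings.append("dlq_count is non-zero: replay within 14 days")
    return warnings


sample = {
    "success_rate": 0.99,
    "ewma_success_rate": 0.98,
    "http_attempts": 400,
    "message_count": 200,
    "avg_response_time_ms": 310,
    "dlq_count": 3,
}
for warning in check_window(sample):
    print(warning)
```

Running a check like this against each of the four windows makes it easier to tell a brief incident (only 6h affected) from a sustained problem (7d and 14d affected).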
Delivery Logs
Access detailed logs for every webhook delivery attempt. Logs are available for up to 14 days and include status codes, response times, error messages, and per-attempt detail.
Event Replay
Recover from delivery failures by replaying events from the Dead Letter Queue (DLQ). When events fail all automatic retry attempts, they’re stored in the DLQ for up to 14 days, giving you time to fix issues and replay them. The replay functionality redelivers all DLQ events for a webhook subscription, allowing you to recover from temporary outages, application bugs, or configuration issues without losing data.
Common Replay Scenarios
Endpoint Downtime
Recover events lost during maintenance windows or infrastructure outages.
Application Errors
Reprocess events after fixing bugs in your webhook handler.
Configuration Issues
Replay events after correcting webhook URL or authentication problems.
Data Recovery
Reprocess historical events after resolving integration issues.
How to Replay Events
Identify Failed Events
Use metrics and logs to determine which events are in the DLQ and need replay.
Monitoring Best Practices
Set Up Alerts
Configure automated alerts for success rate and response time degradation to catch issues early.
Track Long-Term Trends
Review metrics across multiple time windows to identify patterns and seasonal variations.
Optimize Performance
Keep response times low by processing webhooks asynchronously and returning acknowledgments quickly.
Monitor the DLQ
Regularly check for events in the Dead Letter Queue and replay them before the retention period expires.
Analyze Failure Patterns
Use delivery logs to identify recurring issues and address root causes systematically.
Validate Configuration
Periodically verify webhook URLs, filters, and secrets remain correct and up to date.
Troubleshooting
Low Success Rate
Common Causes:
- Endpoint responding too slowly or timing out
- Application errors causing failed responses
- Network connectivity or infrastructure issues
- Insufficient server resources to handle load
Solutions:
- Optimize endpoint performance with asynchronous processing
- Fix application bugs and handle errors gracefully
- Scale infrastructure to accommodate webhook volume
- Review logs to identify specific error patterns
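One way to spot error patterns is to group failed delivery attempts by outcome. The sketch below assumes log entries have already been fetched and are represented as dictionaries; the `status_code` key is a hypothetical field name, not a confirmed part of the delivery log format.

```python
from collections import Counter


def summarize_failures(log_entries):
    """Count failed attempts by outcome, most frequent first."""
    failures = Counter()
    for entry in log_entries:
        status = entry.get("status_code")
        if status is None:
            # No HTTP response at all, e.g. a timeout or connection error.
            failures["no response"] += 1
        elif not 200 <= status < 300:
            failures[f"HTTP {status}"] += 1
    return failures.most_common()


logs = [
    {"status_code": 200},
    {"status_code": 500},
    {"status_code": 500},
    {"status_code": None},
    {"status_code": 401},
    {"status_code": 500},
]
print(summarize_failures(logs))
```

A dominant 5xx count usually points at handler bugs or overload, a dominant 401 or 404 at configuration, and missing responses at timeouts or network problems.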
High DLQ Count
Common Causes:
- Extended endpoint downtime or outages
- Persistent application errors
- Misconfigured webhook URL or authentication
Solutions:
- Verify endpoint accessibility and correct configuration
- Fix application issues preventing successful processing
- Replay DLQ events after resolving the root cause
Slow Response Times
Common Causes:
- Synchronous processing in the webhook handler
- Database operations or external API calls in the request path
- Insufficient server resources
Solutions:
- Return acknowledgment immediately and process asynchronously
- Move heavy operations to background jobs or queues
- Optimize database queries and reduce blocking operations
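The acknowledge-first pattern can be sketched with only the standard library: the request path parses and enqueues the event, and a background worker does the slow work. Everything here is illustrative. `receive`, `process`, the `id` field, and the choice of a 400 response for malformed bodies are assumptions, and a production setup would more likely use a durable queue than an in-memory one.

```python
import json
import logging
import queue
import threading

events = queue.Queue()
processed = []  # stands in for the results of real processing


def receive(raw_body: bytes) -> int:
    """Request-path handler: validate, enqueue, and return a status code fast."""
    try:
        event = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return 400
    events.put(event)   # no database or external API calls here
    return 200


def process(event) -> None:
    """Placeholder for slow work such as database writes or API calls."""
    processed.append(event["id"])


def worker() -> None:
    """Drain the queue in the background, outside the request path."""
    while True:
        event = events.get()
        try:
            process(event)
        except Exception:
            logging.exception("failed to process event")
        finally:
            events.task_done()


threading.Thread(target=worker, daemon=True).start()

status = receive(b'{"id": "evt_1"}')
events.join()  # only for the demo: wait until the worker has caught up
print(status, processed)
```

Because the response no longer waits on processing, response time depends only on parsing and enqueueing, which keeps it well away from the timeout range.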
Webhook Deactivated
Common Causes:
- Delivery performance dropped below circuit breaker thresholds
- Persistent endpoint unavailability or misconfiguration
Solutions:
- Review metrics to identify when and why performance degraded
- Check logs for error patterns and failure modes
- Fix the underlying issues before reactivating
- Monitor closely after reactivation to ensure sustained health