# Webhooks Operations & Monitoring

> Monitor webhook health, track delivery metrics, replay failed events, and manage webhook lifecycle

Reliable integrations depend on healthy webhooks. Catena exposes delivery metrics, per-attempt logs, and event replay so you can track delivery performance and recover from failures.

<Check>
  **Proactive Monitoring:** Track webhook success rates and response times to identify issues before they impact your integration.
</Check>

***

## Circuit Breaker

Catena automatically pauses webhooks when delivery performance drops below acceptable thresholds. This prevents wasted resources and signals that immediate attention is needed.

When a webhook is paused:

* Event delivery stops
* New events are queued but not delivered
* Webhook status changes to `stale`

You can subscribe to `webhook.staled` events to be notified automatically when the circuit breaker trips.

<Tip>
  Subscribe to `webhook.*` to also receive `webhook.created`, `webhook.updated`, `webhook.deleted`, and `webhook.paused` events — giving you full visibility into your webhook subscription lifecycle.
</Tip>

You can also query for paused webhooks directly:

```bash theme={null}
curl 'https://api.catenatelematics.com/v2/notifications/webhooks?status=stale' \
  -H 'Authorization: Bearer <token>'
```

The circuit breaker uses different evaluation strategies depending on delivery volume:

| Volume                    | Threshold               | Strategy                                                                  |
| ------------------------- | ----------------------- | ------------------------------------------------------------------------- |
| High (50+ deliveries)     | 95% EWMA success rate   | Exponentially weighted moving average — more sensitive to recent failures |
| Medium (20–49 deliveries) | 80% success rate        | Simple success rate over the window                                       |
| Low (\<20 deliveries)     | 10 consecutive failures | Trips after 10 failures in a row                                          |

<Info>
  **Grace Period:** The circuit breaker will not evaluate a webhook for the first **15 minutes** after it is created or updated. This prevents false positives during initial setup or after a configuration change.
</Info>
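
The thresholds in the table can be read as a single decision function. The following is a minimal Python sketch of that reading, not Catena's implementation. The `DeliveryWindow` type, the `should_trip` and `ewma` helpers, and the smoothing factor `alpha=0.1` are all assumptions made for illustration, and the length of the evaluation window is not specified here.

```python
from dataclasses import dataclass


@dataclass
class DeliveryWindow:
    """Hypothetical summary of recent deliveries for one webhook."""
    deliveries: int               # delivery attempts in the evaluation window
    successes: int                # successful deliveries in the window
    ewma_success_rate: float      # 0.0-1.0, as in the metrics response
    consecutive_failures: int     # length of the current run of failures
    minutes_since_change: float   # time since the webhook was created or updated


def ewma(outcomes, alpha=0.1):
    """EWMA over outcomes (1.0 = success, 0.0 = failure), oldest first.

    alpha is an assumed smoothing factor, not a documented Catena value.
    """
    rate = 1.0  # assume a healthy starting point
    for outcome in outcomes:
        rate = alpha * outcome + (1 - alpha) * rate
    return rate


def should_trip(w: DeliveryWindow) -> bool:
    """Apply the documented thresholds to one window."""
    if w.minutes_since_change < 15:       # grace period: no evaluation
        return False
    if w.deliveries >= 50:                # high volume: EWMA below 95%
        return w.ewma_success_rate < 0.95
    if w.deliveries >= 20:                # medium volume: success rate below 80%
        return w.successes / w.deliveries < 0.80
    return w.consecutive_failures >= 10   # low volume: 10 failures in a row
```

Because each new outcome carries weight `alpha`, a failure at the end of a sequence lowers the average more than the same failure earlier on, which is why the EWMA strategy reacts faster to recent failures.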

### Reactivation

After identifying and fixing the root cause of delivery failures, reactivate the webhook to resume event delivery.

<Warning>
  Catena does **not** auto-reactivate paused webhooks. Only you know when the underlying issue has been resolved — reactivation must be triggered explicitly through the API.
</Warning>

<Steps>
  <Step title="Investigate the Issue">
    Review metrics and logs to understand why delivery performance degraded.
  </Step>

  <Step title="Resolve Problems">
    Fix endpoint issues, application bugs, or configuration errors.
  </Step>

  <Step title="Reactivate Webhook">
    Call the activate endpoint to resume delivery:

    ```bash theme={null}
    curl -X POST \
      --url https://api.catenatelematics.com/v2/notifications/webhooks/<webhook_id>/activate \
      -H 'Authorization: Bearer <token>'
    ```
  </Step>

  <Step title="Replay Missed Events">
    Replay events from the DLQ to recover any events that were queued while the webhook was paused. See [Event Replay](#event-replay) below.
  </Step>

  <Step title="Monitor Recovery">
    Track metrics closely to ensure performance improves and remains healthy.
  </Step>
</Steps>
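
The activate and replay calls from the steps above can be scripted together. This is a rough sketch using only Python's standard library. The helper names `build_post` and `reactivate_and_replay` are hypothetical, and the sketch assumes both endpoints accept an empty POST body, as the curl examples suggest.

```python
import urllib.request

BASE_URL = "https://api.catenatelematics.com/v2/notifications/webhooks"


def build_post(webhook_id: str, action: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST to <BASE_URL>/<webhook_id>/<action>."""
    return urllib.request.Request(
        f"{BASE_URL}/{webhook_id}/{action}",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


def reactivate_and_replay(webhook_id, token, send=urllib.request.urlopen):
    """Activate the webhook, then replay its DLQ.

    With the default send (urlopen), a non-2xx response raises HTTPError,
    so replay is not attempted if activation fails.
    """
    results = []
    for action in ("activate", "replay"):
        results.append(send(build_post(webhook_id, action, token)))
    return results
```

Passing a custom `send` callable makes the sequence easy to dry-run before pointing it at the live API.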

<Warning>
  **Reactivation Warning:** If the underlying issue isn't resolved before reactivating, the circuit breaker may trip again shortly after. Monitor metrics closely in the first 15 minutes after reactivation.
</Warning>

***

## Delivery Metrics

The Notifications API provides real-time and historical metrics to monitor webhook performance and reliability across four rolling time windows: **6h**, **24h**, **7d**, and **14d**.

```bash theme={null}
curl --url https://api.catenatelematics.com/v2/notifications/webhooks/<webhook_id>/metrics \
  -H 'Authorization: Bearer <token>'
```

```json theme={null}
{
  "webhook_id": "247b2dea-a030-48b7-9a05-ee33c1b6ab0a",
  "http_attempts": {
    "6h": 105,
    "24h": 420,
    "7d": 3000,
    "14d": 6500
  },
  "http_success_attempts": {
    "6h": 100,
    "24h": 400,
    "7d": 2800,
    "14d": 6000
  },
  "http_failure_attempts": {
    "6h": 5,
    "24h": 20,
    "7d": 200,
    "14d": 500
  },
  "message_count": {
    "6h": 100,
    "24h": 400,
    "7d": 2800,
    "14d": 6000
  },
  "message_success_count": {
    "6h": 98,
    "24h": 390,
    "7d": 2700,
    "14d": 5800
  },
  "success_rate": {
    "6h": 98,
    "24h": 97,
    "7d": 96,
    "14d": 96
  },
  "avg_response_time_ms": {
    "6h": 150,
    "24h": 160,
    "7d": 155,
    "14d": 158
  },
  "ewma_success_rate": 0.97,
  "dlq_count": 3
}
```

Each time-windowed field reports one value per rolling window. Key fields to monitor:

* **`success_rate`**: The percentage of messages successfully delivered after all retries. If any window drops below \~95%, investigate immediately.
* **`ewma_success_rate`**: The exponentially weighted moving average success rate, expressed as a fraction. The circuit breaker evaluates this value for high-volume webhooks (50+ deliveries). If it falls below `0.95`, the webhook is marked `stale`.
* **`http_attempts` vs `message_count`**: `http_attempts` includes retries, while `message_count` counts unique events. A high ratio of attempts to messages indicates frequent retries.
* **`avg_response_time_ms`**: Sustained values above 2500 ms leave little headroom before the 3-second acknowledgment timeout.
* **`dlq_count`**: Any non-zero value means events are waiting in the DLQ and must be replayed before the 14-day retention window expires.
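
The checks in this list translate directly into an automated health check. The sketch below assumes the metrics response has already been parsed into a Python dict shaped like the example above. The function names `check_metrics` and `retry_ratio` are hypothetical, and the thresholds are the ones quoted in this section.

```python
WINDOWS = ("6h", "24h", "7d", "14d")


def check_metrics(metrics: dict) -> list:
    """Return human-readable warnings for one parsed metrics response."""
    warnings = []
    for w in WINDOWS:
        rate = metrics["success_rate"][w]
        if rate < 95:
            warnings.append(f"{w}: success rate {rate}% is below 95%")
        if metrics["avg_response_time_ms"][w] > 2500:
            warnings.append(f"{w}: average response time exceeds 2500 ms")
    if metrics["ewma_success_rate"] < 0.95:
        warnings.append("EWMA success rate is below 0.95")
    if metrics["dlq_count"] > 0:
        warnings.append(f"{metrics['dlq_count']} event(s) in the DLQ await replay")
    return warnings


def retry_ratio(metrics: dict, window: str) -> float:
    """HTTP attempts per unique message; 1.0 means no retries."""
    messages = metrics["message_count"][window]
    return metrics["http_attempts"][window] / messages if messages else 0.0
```

For the example response, the only warning concerns the three events in the DLQ, and the 6h retry ratio is 105 / 100 = 1.05.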

***

## Delivery Logs

Access detailed logs for every webhook delivery attempt. Logs are available for up to **14 days** and include status codes, response times, error messages, and per-attempt detail.

```bash theme={null}
curl --url "https://api.catenatelematics.com/v2/notifications/webhooks/<webhook_id>/logs?status=failed" \
  -H 'Authorization: Bearer <token>'
```

```json theme={null}
{
  "logs": [
    {
      "created_at": "2026-01-15T10:30:00.596772Z",
      "webhook_id": "247b2dea-a030-48b7-9a05-ee33c1b6ab0a",
      "message_id": "62cb8fea-e017-4b08-86b7-4469fa872b91",
      "event_name": "vehicle_location.added",
      "status": "failed",
      "status_code": 504,
      "error_message": "Request timeout. Please ensure the webhook endpoint is reachable and acknowledges receipt in less than 3 seconds.",
      "response_time_ms": 3000
    },
    {
      "created_at": "2026-01-15T09:14:22.312445Z",
      "webhook_id": "247b2dea-a030-48b7-9a05-ee33c1b6ab0a",
      "message_id": "91fa2bcd-3301-4e7a-bc12-7734ab991e22",
      "event_name": "vehicle.modified",
      "status": "failed",
      "status_code": 502,
      "error_message": "Connection error: Unable to establish connection to webhook endpoint.",
      "response_time_ms": null
    }
  ],
  "total": 2,
  "page": 1
}
```

<Tip>
  **Debugging Strategy:** The `error_message` and `status_code` fields together usually pinpoint the root cause. A `504` with a "Request timeout" message suggests your handler is not acknowledging within 3 seconds, often because it processes events synchronously. `502`/`503` errors suggest infrastructure issues upstream of your handler.
</Tip>
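
When many deliveries fail, it can help to group log entries by likely cause before reading them one by one. The following sketch applies the rule of thumb from the tip to entries shaped like the example response. The `diagnose` and `summarize` helpers and the category labels are hypothetical.

```python
from collections import Counter


def diagnose(log: dict) -> str:
    """Map one failed log entry to a likely cause, based on its status code."""
    code = log.get("status_code")
    if code == 504:
        return "timeout: acknowledge within 3 seconds and process asynchronously"
    if code in (502, 503):
        return "infrastructure: check proxies, load balancers, and connectivity"
    return "other: inspect error_message"


def summarize(logs: list) -> Counter:
    """Count failed log entries by the category before the colon."""
    return Counter(diagnose(entry).split(":")[0] for entry in logs)
```

Applied to the two entries in the example response, `summarize` reports one `timeout` and one `infrastructure` failure.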

***

## Event Replay

Recover from delivery failures by replaying events from the Dead Letter Queue (DLQ). When events fail all automatic retry attempts, they're stored in the DLQ for up to 14 days, giving you time to fix issues and replay them.

The replay functionality redelivers all DLQ events for a webhook subscription, allowing you to recover from temporary outages, application bugs, or configuration issues without losing data.

### Common Replay Scenarios

<CardGroup cols={2}>
  <Card icon="server" title="Endpoint Downtime">
    Recover events lost during maintenance windows or infrastructure outages.
  </Card>

  <Card icon="bug" title="Application Errors">
    Reprocess events after fixing bugs in your webhook handler.
  </Card>

  <Card icon="wrench" title="Configuration Issues">
    Replay events after correcting webhook URL or authentication problems.
  </Card>

  <Card icon="clock-rotate-left" title="Data Recovery">
    Reprocess historical events after resolving integration issues.
  </Card>
</CardGroup>

### How to Replay Events

<Steps>
  <Step title="Identify Failed Events">
    Use `dlq_count` in the metrics response and the failed-delivery logs to determine which events are in the DLQ and need replay.
  </Step>

  <Step title="Fix the Root Cause">
    Resolve the underlying issue that caused delivery failures before replaying.
  </Step>

  <Step title="Initiate Replay">
    Call the replay endpoint to redeliver all DLQ events to your webhook:

    ```bash theme={null}
    curl -X POST \
      --url https://api.catenatelematics.com/v2/notifications/webhooks/<webhook_id>/replay \
      -H 'Authorization: Bearer <token>'
    ```
  </Step>

  <Step title="Monitor Redelivery">
    Watch logs and metrics to verify replayed events are successfully delivered.
  </Step>
</Steps>

<Warning>
  **14-Day Retention:** Events are permanently deleted from the DLQ after 14 days. Replay critical events before they expire to avoid data loss.
</Warning>
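
To prioritize replays, you can estimate how long each failed event has left. This sketch assumes the 14-day retention clock starts at the `created_at` timestamp of the failed delivery log, which this page does not confirm, so treat the result as an approximation. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=14)


def replay_deadline(created_at: str) -> datetime:
    """Estimate when an event leaves the DLQ, assuming retention starts at created_at."""
    # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
    started = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    return started + RETENTION


def time_left(created_at: str, now: datetime) -> timedelta:
    """Time remaining before the estimated deadline (negative if already expired)."""
    return replay_deadline(created_at) - now
```

For example, `time_left(entry["created_at"], datetime.now(timezone.utc))` gives the remaining window for a log entry, and sorting failed entries by that value surfaces the events closest to expiry.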

***

## Monitoring Best Practices

<CardGroup cols={2}>
  <Card icon="bell" title="Set Up Alerts">
    Configure automated alerts for success rate and response time degradation to catch issues early.
  </Card>

  <Card icon="chart-line" title="Track Long-Term Trends">
    Review metrics across multiple time windows to identify patterns and seasonal variations.
  </Card>

  <Card icon="clock" title="Optimize Performance">
    Keep response times low by processing webhooks asynchronously and returning acknowledgments quickly.
  </Card>

  <Card icon="inbox" title="Monitor the DLQ">
    Regularly check for events in the Dead Letter Queue and replay them before the retention period expires.
  </Card>

  <Card icon="list" title="Analyze Failure Patterns">
    Use delivery logs to identify recurring issues and address root causes systematically.
  </Card>

  <Card icon="shield-check" title="Validate Configuration">
    Periodically verify webhook URLs, filters, and secrets remain correct and up to date.
  </Card>
</CardGroup>
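
The "acknowledge quickly, process asynchronously" advice can be sketched with an in-process queue. This is an illustration, not a production design: `handle_webhook` stands in for whatever your web framework calls on each request, the `processed` list stands in for real work, and an in-memory queue loses events if the process crashes, so a durable queue is usually the better choice.

```python
import queue
import threading

events = queue.Queue()
processed = []


def handle_webhook(payload: dict) -> int:
    """Called by the HTTP layer: enqueue the event and acknowledge right away."""
    events.put(payload)
    return 200  # status code to return to the sender


def worker() -> None:
    """Background loop that does the slow work off the request path."""
    while True:
        payload = events.get()
        try:
            processed.append(payload["event_name"])  # stand-in for real processing
        finally:
            events.task_done()


# Daemon thread so the worker does not block interpreter shutdown.
threading.Thread(target=worker, daemon=True).start()
```

The handler returns as soon as the event is queued, so response time no longer depends on database writes or downstream API calls.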

***

## Troubleshooting

<AccordionGroup>
  <Accordion title="Low Success Rate" icon="triangle-exclamation">
    **Common Causes:**

    * Endpoint responding too slowly or timing out
    * Application errors causing failed responses
    * Network connectivity or infrastructure issues
    * Insufficient server resources to handle load

    **Solutions:**

    * Optimize endpoint performance with asynchronous processing
    * Fix application bugs and handle errors gracefully
    * Scale infrastructure to accommodate webhook volume
    * Review logs to identify specific error patterns
  </Accordion>

  <Accordion title="High DLQ Count" icon="inbox-full">
    **Common Causes:**

    * Extended endpoint downtime or outages
    * Persistent application errors
    * Misconfigured webhook URL or authentication

    **Solutions:**

    * Verify endpoint accessibility and correct configuration
    * Fix application issues preventing successful processing
    * Replay DLQ events after resolving the root cause
  </Accordion>

  <Accordion title="Slow Response Times" icon="gauge-high">
    **Common Causes:**

    * Synchronous processing in the webhook handler
    * Database operations or external API calls in the request path
    * Insufficient server resources

    **Solutions:**

    * Return acknowledgment immediately and process asynchronously
    * Move heavy operations to background jobs or queues
    * Optimize database queries and reduce blocking operations
  </Accordion>

  <Accordion title="Webhook Deactivated" icon="ban">
    **Common Causes:**

    * Delivery performance dropped below circuit breaker thresholds
    * Persistent endpoint unavailability or misconfiguration

    **Solutions:**

    * Review metrics to identify when and why performance degraded
    * Check logs for error patterns and failure modes
    * Fix the underlying issues before reactivating
    * Monitor closely after reactivation to ensure sustained health
  </Accordion>
</AccordionGroup>
