Stripe Webhook Recovery Pipeline
Rebuilt a failing Stripe webhook system processing 50k+ events per day.
Stripe · Cloudflare Workers · D1
Problem
The client’s Stripe webhook integration was failing silently. Events were being dropped, retries were ineffective, and the team had no visibility into what was being lost. At 50,000+ events per day, even a 1% failure rate meant 500+ missed events daily, which led to failed subscriptions, missing invoices, and angry customers.
The existing implementation used a simple synchronous handler in the application server with no retry logic, no dead-letter queue, and no monitoring.
Approach
We rebuilt the webhook pipeline on Cloudflare Workers with three key components:
- Ingestion Worker — Validates Stripe signatures, acknowledges immediately, and queues the event for processing.
- Processing Worker — Handles idempotent event processing with exponential backoff and a D1-backed dead-letter queue.
- Recovery Worker — Replays failed events from the dead-letter queue on a schedule.
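As an illustration of the Ingestion Worker's first step, here is a minimal sketch of Stripe signature verification. It follows Stripe's documented scheme: the `Stripe-Signature` header carries `t=<timestamp>` and one or more `v1=<hex>` values, and each `v1` value is an HMAC-SHA256 of `<timestamp>.<raw body>` keyed with the endpoint's signing secret. The function names, the 300-second tolerance, and the use of Node's `crypto` module are assumptions made for this example, not the project's actual code. On Workers this would need Node.js compatibility enabled or a port to Web Crypto, and in production the official Stripe library's verification helper would normally be preferred.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Split "t=...,v1=...,v1=..." into the raw timestamp and all v1 signatures.
function parseSignatureHeader(header: string): { timestamp: string; signatures: string[] } {
  let timestamp = "";
  const signatures: string[] = [];
  for (const part of header.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue; // ignore malformed segments
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = value;
    else if (key === "v1") signatures.push(value);
  }
  return { timestamp, signatures };
}

// Hypothetical helper: returns true only if the timestamp is recent and at
// least one v1 signature matches HMAC-SHA256(secret, `${t}.${rawBody}`).
function verifyStripeSignature(
  rawBody: string,
  header: string,
  secret: string,
  nowSeconds: number = Math.floor(Date.now() / 1000),
  toleranceSeconds: number = 300, // assumed replay window
): boolean {
  const { timestamp, signatures } = parseSignatureHeader(header);
  const ts = Number(timestamp);
  if (timestamp === "" || !Number.isFinite(ts) || signatures.length === 0) return false;
  if (Math.abs(nowSeconds - ts) > toleranceSeconds) return false;

  // Sign the exact bytes Stripe sent; re-serialized JSON would not match.
  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`, "utf8")
    .digest();

  return signatures.some((sig) => {
    const candidate = Buffer.from(sig, "hex");
    // timingSafeEqual throws on length mismatch, so check length first.
    return candidate.length === expected.length && timingSafeEqual(candidate, expected);
  });
}
```

In a pipeline shaped like the one above, a check of this kind would run against the raw request body before any JSON parsing. A failed check would be rejected with a 4xx response, and a passing event would be enqueued and acknowledged with a 2xx response without waiting for processing.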
Architecture decisions
- Workers over server: Stateless, globally distributed, auto-scaling. No servers to manage.
- D1 for state: The dead-letter queue and processing state are stored in D1. It is durable, queryable, and requires no additional infrastructure.
- Idempotency keys: Each event's side effects are applied at most once. Duplicates, whether redelivered by Stripe or replayed from the dead-letter queue, are detected by key and skipped.
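The interplay of idempotency, exponential backoff, and the dead-letter queue can be sketched as follows. This is a simplified model, not the production code: `EventStore` is a hypothetical in-memory stand-in for the D1 tables, and the names, the 500 ms base delay, the 60 s cap, and the limit of five attempts are all assumed values chosen for illustration.

```typescript
type WebhookEvent = { id: string; type: string };
type Handler = (event: WebhookEvent) => void;
type Outcome =
  | { status: "processed" }
  | { status: "duplicate" }
  | { status: "retry"; delayMs: number }
  | { status: "dead-lettered" };

// Hypothetical in-memory stand-in for two D1 tables:
// one of processed event IDs, one of dead-lettered events.
class EventStore {
  private processed = new Set<string>();
  private deadLetters = new Map<string, { event: WebhookEvent; error: string }>();

  isProcessed(id: string): boolean {
    return this.processed.has(id);
  }
  markProcessed(id: string): void {
    this.processed.add(id);
    this.deadLetters.delete(id); // a recovered event leaves the dead-letter queue
  }
  deadLetter(event: WebhookEvent, error: string): void {
    this.deadLetters.set(event.id, { event, error });
  }
  pendingDeadLetters(): WebhookEvent[] {
    return Array.from(this.deadLetters.values()).map((d) => d.event);
  }
}

// Exponential backoff: baseMs * 2^attempt, capped at capMs (attempt is zero-based).
function backoffDelayMs(attempt: number, baseMs: number = 500, capMs: number = 60000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Process one delivery attempt: skip duplicates, retry failures with backoff,
// and dead-letter the event once maxAttempts is exhausted.
function processEvent(
  store: EventStore,
  event: WebhookEvent,
  attempt: number,
  handler: Handler,
  maxAttempts: number = 5,
): Outcome {
  if (store.isProcessed(event.id)) return { status: "duplicate" };
  try {
    handler(event);
    store.markProcessed(event.id);
    return { status: "processed" };
  } catch (err) {
    if (attempt + 1 >= maxAttempts) {
      store.deadLetter(event, String(err));
      return { status: "dead-lettered" };
    }
    return { status: "retry", delayMs: backoffDelayMs(attempt) };
  }
}

// Scheduled recovery pass: give each dead-lettered event one more attempt
// and return how many were recovered. Failures stay in the queue.
function replayDeadLetters(store: EventStore, handler: Handler): number {
  let recovered = 0;
  for (const event of store.pendingDeadLetters()) {
    if (processEvent(store, event, 0, handler, 1).status === "processed") recovered++;
  }
  return recovered;
}
```

One caveat of this sketch is that running the handler and recording the event ID are separate steps, so a crash between them could cause a repeat. A real implementation would either commit both together or make the handler itself safe to run twice.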
Outcome
| Metric | Before | After |
|---|---|---|
| Delivery rate | 82% | 99.98% |
| Average processing time | 12s | 340ms |
| Team time spent on webhook issues | 8 hrs/week | < 30 mins/week |
| Missed events per day | ~9,000 | ~5 |