SazM

Stripe Webhook Recovery Pipeline

Rebuilt a failing Stripe webhook system processing 50k+ events per day.

Stripe, Cloudflare Workers, D1

Problem

The client’s Stripe webhook integration was failing silently. Events were being dropped, retries were ineffective, and the team had no visibility into what was being lost. At 50,000+ events per day, even a 1% failure rate would mean 500+ missed events daily, and the measured delivery rate was only 82%. The result was failed subscriptions, missing invoices, and angry customers.

The existing implementation used a simple synchronous handler in the application server with no retry logic, no dead-letter queue, and no monitoring.

Approach

We rebuilt the webhook pipeline on Cloudflare Workers with three key components:

  1. Ingestion Worker: Validates the Stripe signature, acknowledges the request immediately, and queues the event for processing.
  2. Processing Worker: Processes each event idempotently, retries failures with exponential backoff, and moves events that exhaust their retries to a D1-backed dead-letter queue.
  3. Recovery Worker: Replays failed events from the dead-letter queue on a schedule.
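
The retry path in the Processing Worker can be sketched as a small pure function. This is a minimal illustration in TypeScript, not the project's actual code: the `RetryPolicy` shape, the `nextStep` name, and the delay and attempt values are all assumptions chosen for the example.

```typescript
// Hypothetical retry policy. None of these values come from the project.
interface RetryPolicy {
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // cap so delays do not grow without bound
  maxAttempts: number; // total attempts allowed before dead-lettering
}

type NextStep =
  | { kind: "retry"; delayMs: number }
  | { kind: "dead-letter" };

// Decide what to do after `failedAttempts` failed attempts (1-based).
function nextStep(failedAttempts: number, policy: RetryPolicy): NextStep {
  if (failedAttempts >= policy.maxAttempts) {
    // Retries exhausted: hand the event to the dead-letter queue.
    return { kind: "dead-letter" };
  }
  // Exponential backoff: base * 2^(failedAttempts - 1), capped at maxDelayMs.
  const delayMs = Math.min(
    policy.baseDelayMs * 2 ** (failedAttempts - 1),
    policy.maxDelayMs
  );
  return { kind: "retry", delayMs };
}

// Example values only (assumed, not measured).
const examplePolicy: RetryPolicy = {
  baseDelayMs: 1000,
  maxDelayMs: 60000,
  maxAttempts: 5,
};
```

With this example policy, failures are retried after 1 s, 2 s, 4 s, and 8 s, and the fifth failure sends the event to the dead-letter queue, where the Recovery Worker can pick it up later. Retry schedules like this are often combined with random jitter so that many failing events do not retry at the same instant; that is omitted here to keep the sketch deterministic.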

Architecture decisions

  • Workers over server: Stateless, globally distributed, auto-scaling. No servers to manage.
  • D1 for state: The dead-letter queue and processing state are stored in D1. It is durable, queryable, and requires no additional infrastructure.
  • Idempotency keys: Each event takes effect at most once. Stripe and the pipeline's own retries can deliver the same event more than once, so duplicates are detected and skipped.
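
The idempotency decision can be illustrated with a short sketch. It is an assumption-laden example rather than the project's implementation: `ProcessedStore`, `InMemoryProcessedStore`, and `processOnce` are hypothetical names, and the in-memory set stands in for the D1 table that would hold processed event IDs in the real pipeline.

```typescript
interface WebhookEvent {
  id: string;   // unique event identifier, used as the idempotency key
  type: string;
}

// Hypothetical store interface. In the real pipeline this state lives in D1.
interface ProcessedStore {
  // Records the ID and returns true if it was new, false if already present.
  claim(eventId: string): boolean;
  // Removes the ID so a failed event can be retried later.
  release(eventId: string): void;
}

// In-memory stand-in for the durable store, for illustration only.
class InMemoryProcessedStore implements ProcessedStore {
  private seen = new Set<string>();

  claim(eventId: string): boolean {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }

  release(eventId: string): void {
    this.seen.delete(eventId);
  }
}

// Run the handler only for events whose ID has not been claimed before.
function processOnce(
  store: ProcessedStore,
  event: WebhookEvent,
  handler: (event: WebhookEvent) => void
): "processed" | "duplicate" {
  if (!store.claim(event.id)) return "duplicate";
  try {
    handler(event);
    return "processed";
  } catch (err) {
    // Release the claim so the retry path can attempt the event again.
    store.release(event.id);
    throw err;
  }
}
```

Calling `processOnce` twice with the same event runs the handler once and reports the second call as a duplicate. A durable version would need the claim and the handler's side effects to succeed or fail together, which this sketch does not attempt.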

Outcome

Metric                               Before        After
Delivery rate                        82%           99.98%
Average processing time              12 s          340 ms
Team time spent on webhook issues    8 hrs/week    < 30 mins/week
Missed events per day                ~9,000        ~5
