Stripe Webhook Recovery

Problem

The client’s Stripe webhook integration was failing silently: events were being dropped, retries were ineffective, and the team had no visibility into what was being lost. At 50,000+ events per day, even a 1% failure rate would mean 500+ missed events daily, each one a potential failed subscription, missing invoice, or angry customer.

The existing implementation used a simple synchronous handler in the application server with no retry logic, no dead-letter queue, and no monitoring.

Approach

We rebuilt the webhook pipeline on Cloudflare Workers with three key components:

Ingestion Worker — Validates Stripe signatures, acknowledges immediately, and queues the event for processing.
Processing Worker — Handles idempotent event processing with exponential backoff and a D1-backed dead-letter queue.
Recovery Worker — Replays failed events from the dead-letter queue on a schedule.
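As an illustration of the Ingestion Worker's first step, here is a minimal sketch of checking a Stripe-Signature header with the Web Crypto API that Workers expose. It follows Stripe's documented scheme (HMAC-SHA256 over "timestamp.payload", compared against the header's v1 values). The function names, the 300-second tolerance, and the choice to pass the current time as a parameter are assumptions made for this example, not details of the client's code.

```typescript
// Sketch only: verifyStripeSignature, hmacSha256Hex and constantTimeEqual
// are hypothetical names, not part of the original pipeline.
const encoder = new TextEncoder();

// HMAC-SHA256 of `message` under `secret`, returned as lowercase hex.
async function hmacSha256Hex(secret: string, message: string): Promise<string> {
  const key = await crypto.subtle.importKey(
    "raw",
    encoder.encode(secret),
    { name: "HMAC", hash: "SHA-256" },
    false,
    ["sign"],
  );
  const mac = await crypto.subtle.sign("HMAC", key, encoder.encode(message));
  return Array.from(new Uint8Array(mac))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

// Compare two strings without exiting early on the first mismatch.
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  return diff === 0;
}

async function verifyStripeSignature(
  rawBody: string,          // the unparsed request body
  header: string,           // value of the Stripe-Signature header
  secret: string,           // endpoint signing secret
  nowSeconds: number,       // current Unix time, injected for testability
  toleranceSeconds = 300,   // assumed replay window for this sketch
): Promise<boolean> {
  let timestamp = "";
  const candidates: string[] = [];
  // The header looks like "t=1700000000,v1=abc...,v1=def...".
  for (const part of header.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = value;
    if (key === "v1") candidates.push(value);
  }
  const t = Number(timestamp);
  if (timestamp === "" || !Number.isFinite(t) || candidates.length === 0) return false;
  // Reject events whose timestamp is outside the tolerance window.
  if (Math.abs(nowSeconds - t) > toleranceSeconds) return false;
  const expected = await hmacSha256Hex(secret, `${timestamp}.${rawBody}`);
  return candidates.some((c) => constantTimeEqual(c, expected));
}
```

In a Worker, the handler would read the raw body with `await request.text()` before any JSON parsing, since the signature covers the exact bytes Stripe sent. It would return a 2xx response only once the event has been queued.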

Architecture decisions

Workers over a traditional server: Workers are stateless, globally distributed, and scale automatically, with no servers to manage.
D1 for state: The dead-letter queue and processing state are stored in D1. It is durable, queryable, and requires no additional infrastructure.
Idempotency keys: Each event's side effects are applied only once, even when the same event is delivered or retried more than once. Duplicate events are detected and skipped.
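The interaction between idempotency, exponential backoff, and the dead-letter queue can be sketched as follows. This is a simplified model, not the client's code: an in-memory Set and array stand in for the D1 tables, and every name and default here (processEvent, backoffMs, replayDeadLetters, the 500 ms base delay, the four-attempt limit) is hypothetical.

```typescript
// Sketch only: in-memory stand-ins for the D1-backed state.
type StripeEvent = { id: string; type: string };
type DeadLetter = { event: StripeEvent; attempts: number; lastError: string };
type Outcome = "processed" | "duplicate" | "dead-lettered";
type Handler = (e: StripeEvent) => Promise<void>;
type Sleep = (ms: number) => Promise<void>;

const processedIds = new Set<string>(); // would be a D1 table keyed by event id
const deadLetters: DeadLetter[] = [];   // would be the D1 dead-letter table

// Delay before the retry that follows attempt `attempt` (0-based):
// base * 2^attempt, capped at capMs.
function backoffMs(attempt: number, baseMs = 500, capMs = 60000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const realSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processEvent(
  event: StripeEvent,
  handler: Handler,
  maxAttempts = 4,
  sleep: Sleep = realSleep,
): Promise<Outcome> {
  // Idempotency check: skip events whose id has already been handled.
  if (processedIds.has(event.id)) return "duplicate";
  let lastError = "";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await handler(event);
      processedIds.add(event.id);
      return "processed";
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxAttempts - 1) await sleep(backoffMs(attempt));
    }
  }
  // Retries exhausted: park the event for the recovery pass.
  deadLetters.push({ event, attempts: maxAttempts, lastError });
  return "dead-lettered";
}

// Recovery pass: drain the queue and retry each parked event.
// Events that fail again are pushed back by processEvent.
async function replayDeadLetters(
  handler: Handler,
  maxAttempts = 4,
  sleep: Sleep = realSleep,
): Promise<number> {
  const pending = deadLetters.splice(0, deadLetters.length);
  let recovered = 0;
  for (const item of pending) {
    const outcome = await processEvent(item.event, handler, maxAttempts, sleep);
    if (outcome !== "dead-lettered") recovered++;
  }
  return recovered;
}
```

The check-then-mark sequence in this sketch is not atomic. With a real database, a unique constraint on the event id would let two concurrent deliveries of the same event race safely, because only one insert can succeed.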

Outcome

Metric	Before	After
Delivery rate	82%	99.98%
Average processing time	12 s	340 ms
Team time spent on webhook issues	8 hrs/week	< 30 min/week
Missed events per day	~9,000	~5

