Here's something that caught me off guard recently: retry logic isn't a code pattern—it's a policy decision that your engineering team is making on behalf of the business.
I used to think retries were just another implementation detail. Add some exponential backoff, handle the occasional timeout, ship it. But after debugging yet another "mysterious" production issue that turned out to be aggressive retry behavior masking a deeper problem, I realized we've been thinking about this completely wrong.
> When retry assumptions are wrong, systems don't crash. They lie quietly.
This insight hit me hard because it's so counterintuitive. We expect broken code to fail loudly. But broken retry policies? They create the illusion of success while quietly amplifying failures, overloading struggling services, and making irreversible business decisions without anyone realizing it.
The Hidden Assumptions in Every Retry
Every retry policy encodes four critical assumptions:
Trust: How much do you trust the downstream service to recover? A 30-second retry interval says "I trust you'll be back soon." A 5-minute interval says "I don't."
Responsibility: Who's responsible when things go wrong? Aggressive retries say "the service should handle my load." Conservative retries say "I should back off."
Time sensitivity: How urgent is this operation? Immediate retries for payment processing send a very different message than batch job retries.
Failure types: What kinds of failures do you expect? Retrying a 400 Bad Request reveals assumptions about data quality that might be completely wrong.
The problem is that developers typically make these assumptions implicitly, often defaulting to whatever the library provides. But these aren't technical decisions—they're business policy decisions that affect SLAs, costs, and user experience.
When Retries Become Silent Lies
Here's a real scenario I encountered: An e-commerce service was retrying failed payment attempts with exponential backoff. Sounds reasonable, right? Except the "failures" were actually fraud detection blocks (HTTP 403), and the retries were creating duplicate authorization attempts that looked suspicious to the fraud system, which set up a feedback loop.
The worst part? The retries occasionally succeeded due to timing variations in the fraud detection, so the system reported "success" while actually training the fraud detector to be more aggressive. The business metrics looked fine until customer complaints started pouring in.
```javascript
// This looks innocent but makes policy decisions
const paymentResult = await retryWithBackoff(async () => {
  return paymentService.charge(amount, cardToken);
}, {
  maxAttempts: 5,
  initialDelay: 1000,
  backoffFactor: 2.0
});

// What it's actually deciding:
// - Trust: "Payment failures are usually transient"
// - Responsibility: "The payment service should handle my retry load"
// - Time: "5 attempts over ~30 seconds is acceptable delay"
// - Failure types: "All payment failures are worth retrying"
```

A better approach explicitly encodes the business policy:
```javascript
const paymentPolicy = {
  retryableErrors: ['NETWORK_ERROR', 'SERVICE_UNAVAILABLE', 'TIMEOUT'],
  nonRetryableErrors: ['FRAUD_DETECTED', 'INSUFFICIENT_FUNDS', 'INVALID_CARD'],
  maxAttempts: 3, // Business decision: balance UX vs fraud risk
  backoffStrategy: 'exponential',
  maxDelay: 10000 // Business decision: 10s max wait time
};
```
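Here's a minimal sketch of how a wrapper might enforce that policy. `chargeWithPolicy` and `classifyError` are hypothetical names, and the 1-second starting delay is an assumption, but the important behavior is visible: fraud blocks fail fast instead of being retried into a feedback loop.

```javascript
// Sketch only: assumes classifyError() maps a thrown error to one of the
// policy's error codes (e.g. 'FRAUD_DETECTED', 'TIMEOUT').
async function chargeWithPolicy(policy, attemptCharge, classifyError) {
  let delay = 1000; // assumed 1s initial delay

  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return await attemptCharge();
    } catch (err) {
      const code = classifyError(err);

      // Policy decision: these failures are never worth retrying.
      if (policy.nonRetryableErrors.includes(code)) throw err;

      // Unknown errors also fail fast rather than being retried "just in case".
      if (!policy.retryableErrors.includes(code)) throw err;

      // Out of attempts: surface the failure instead of hiding it.
      if (attempt === policy.maxAttempts) throw err;

      await new Promise((resolve) => setTimeout(resolve, delay));
      delay = Math.min(delay * 2, policy.maxDelay); // exponential, capped
    }
  }
}
```

Notice that every unhappy path throws. A policy that doesn't know what to do with an error should fail loudly, not quietly retry it.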
The Retry Storm Problem
One of the most insidious issues with treating retries as code is the compound effect across system layers. I've seen applications where:
- The HTTP client retries (3 attempts)
- The service wrapper retries (3 attempts)
- The business logic retries (2 attempts)
- The queue processor retries (5 attempts)
That's potentially 90 total attempts for a single logical operation. When a service starts struggling, this creates a "retry storm" where the cure becomes worse than the disease.
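A toy sketch makes the multiplication visible (the layer counts mirror the list above; nothing here is production code):

```javascript
// Each wrapper only knows its own retry budget; nobody sees the product.
function withRetries(maxAttempts, fn) {
  return async (...args) => {
    for (let attempt = 1; ; attempt++) {
      try {
        return await fn(...args);
      } catch (err) {
        if (attempt >= maxAttempts) throw err;
      }
    }
  };
}

let downstreamCalls = 0;
const alwaysFailing = async () => {
  downstreamCalls++;
  throw new Error('SERVICE_UNAVAILABLE'); // worst case: the dependency is down
};

// HTTP client (3) -> service wrapper (3) -> business logic (2) -> queue processor (5)
const stacked = withRetries(5, withRetries(2, withRetries(3, withRetries(3, alwaysFailing))));

stacked().catch(() => {
  console.log(downstreamCalls); // 90 calls against a service that is already down
});
```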
> Decentralized retry logic turns transient failures into sustained outages.
The solution is a centralized retry policy. Pick one layer—typically the service client or API wrapper—and make it responsible for all retry logic. Lower layers should fail fast and bubble up; upper layers should implement circuit breakers.
```python
# Good: Centralized retry at the service boundary
class PaymentServiceClient:
    def __init__(self, retry_policy):
        self.retry_policy = retry_policy
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=5,
            recovery_timeout=30
        )
```

Making Policy Decisions Explicit
The key insight is to externalize retry configuration so it becomes a visible business decision rather than hidden code behavior. Here's what good retry governance looks like:
1. Document the assumptions: What failures do you expect? What's the business impact of delays?
2. Make policies configurable: Don't hardcode retry logic. Use configuration that can be reviewed and adjusted.
3. Add observability: Track retry attempts, success rates after N attempts, and total latency including retries (see the sketch after this list).
4. Regular policy review: Retry policies should be reviewed like any other business policy, especially when SLAs or dependencies change.
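To make the observability point concrete, here's a minimal sketch. The `metrics` object is a stand-in for whatever stats client you already use, and backoff between attempts is omitted to keep the instrumentation visible:

```javascript
// Stand-in metrics client; swap in your real one.
const metrics = {
  increment: (name) => console.log(`count ${name}`),
  timing: (name, ms) => console.log(`timing ${name}=${ms}ms`),
};

// Instrument the retry path itself, not just the underlying call.
// (Backoff between attempts omitted for brevity.)
async function instrumentedRetry(policy, operationName, fn) {
  const started = Date.now();

  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      const result = await fn();
      // Success rate after N attempts, and total latency including retries.
      metrics.increment(`${operationName}.success_after_attempt_${attempt}`);
      metrics.timing(`${operationName}.latency_including_retries`, Date.now() - started);
      return result;
    } catch (err) {
      metrics.increment(`${operationName}.retry_attempt`);
      if (attempt === policy.maxAttempts) {
        metrics.increment(`${operationName}.attempts_exhausted`);
        throw err;
      }
    }
  }
}
```

And the configuration side of the same idea, so the policy can be reviewed without reading code: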
```yaml
# Example: Explicit retry policies as configuration
services:
  payment_service:
    retry_policy:
      strategy: "exponential_backoff_with_jitter"
      max_attempts: 3
      initial_delay: "1s"
      max_delay: "10s"
```

Why This Matters Now
As systems become more distributed and dependencies multiply, retry policies become even more critical. The rise of AI agents and automated decision-making means retry behavior can trigger cascading business logic without human oversight.
Your next steps: Audit your current retry logic. For each retry policy, ask:
- What business assumptions am I encoding?
- Who decided these assumptions were correct?
- How do I know when they're wrong?
- What's the blast radius if this policy misbehaves?
Treat retry logic like any other policy decision: document it, review it, and own the consequences. Your future self—and your on-call rotation—will thank you.
