
Retry Pattern

Resilience strategy that automatically re-executes failed operations with increasing delays to handle transient failures.

Updated on January 10, 2026

The Retry Pattern is an architectural resilience pattern that enables applications to automatically handle transient failures by retrying failed operations. This mechanism, essential in distributed architectures, implements retry logic with progressive delays (backoff) to avoid overloading failing systems while maximizing success probability.

Fundamentals of Retry Pattern

  • Distinction between transient errors (unstable network, timeout) and permanent errors (authentication failure, non-existent resource)
  • Backoff strategies: linear, exponential, exponential with jitter to avoid thundering herd (see the sketch after this list)
  • Configurable attempt limits to prevent infinite loops and preserve system resources
  • Idempotency of critical operations to ensure re-execution doesn't create unwanted side effects
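As a rough illustration of the backoff strategies listed above, the following sketch computes the delay for a given attempt number (the helper names and base delay are ours, not taken from a specific library):

BASE_DELAY_MS is the initial delay; attempt is 1-based.

const BASE_DELAY_MS = 1000;

// Linear: delay grows by a fixed step each attempt (1s, 2s, 3s, ...)
const linearDelay = (attempt: number) => BASE_DELAY_MS * attempt;

// Exponential: delay doubles each attempt (1s, 2s, 4s, 8s, ...)
const exponentialDelay = (attempt: number) => BASE_DELAY_MS * 2 ** (attempt - 1);

// Exponential with full jitter: a random delay between 0 and the exponential
// value, which spreads retries out and avoids the thundering herd effect
const exponentialDelayWithJitter = (attempt: number) =>
  Math.random() * exponentialDelay(attempt);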

Technical and Business Benefits

  • Significant improvement in overall system availability in the face of intermittent network failures
  • Reduction of alert false positives by absorbing temporary failures without manual intervention
  • Infrastructure cost optimization by avoiding over-provisioning to compensate for network instability
  • Better user experience, since brief service micro-interruptions are absorbed transparently
  • Easier integration with third-party services whose reliability cannot be guaranteed

Practical Implementation Example

retry-service.ts
interface RetryConfig {
  maxAttempts: number;
  baseDelay: number;
  maxDelay: number;
  backoffMultiplier: number;
  jitter: boolean;
}

class RetryService {
  async executeWithRetry<T>(
    operation: () => Promise<T>,
    config: RetryConfig,
    isRetryable: (error: Error) => boolean = () => true
  ): Promise<T> {
    let lastError: Error | undefined;
    
    for (let attempt = 1; attempt <= config.maxAttempts; attempt++) {
      try {
        return await operation();
      } catch (error) {
        lastError = error as Error;
        
        // Don't retry if error is not transient
        if (!isRetryable(lastError)) {
          throw error;
        }
        
        // Last attempt failed
        if (attempt === config.maxAttempts) {
          break;
        }
        
        // Calculate delay with exponential backoff
        const delay = this.calculateDelay(attempt, config);
        console.warn(`Attempt ${attempt} failed. Retrying in ${delay}ms...`);
        await this.sleep(delay);
      }
    }
    
    throw new Error(`Operation failed after ${config.maxAttempts} attempts: ${lastError?.message}`);
  }
  
  private calculateDelay(attempt: number, config: RetryConfig): number {
    let delay = config.baseDelay * Math.pow(config.backoffMultiplier, attempt - 1);
    delay = Math.min(delay, config.maxDelay);
    
    // Add jitter to avoid thundering herd
    if (config.jitter) {
      delay = delay * (0.5 + Math.random() * 0.5);
    }
    
    return Math.floor(delay);
  }
  
  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage with external API
const retryService = new RetryService();
const config: RetryConfig = {
  maxAttempts: 5,
  baseDelay: 1000,
  maxDelay: 30000,
  backoffMultiplier: 2,
  jitter: true
};

const isNetworkError = (error: Error) => 
  error.message.includes('ECONNRESET') || 
  error.message.includes('ETIMEDOUT');

const data = await retryService.executeWithRetry(
  () => fetch('https://api.example.com/data').then(r => r.json()),
  config,
  isNetworkError
);

Effective Implementation

  1. Identify candidate operations: network calls, database queries, external service access
  2. Classify errors as transient (timeout, 503, connection loss) or permanent (401, 404, validation errors), as illustrated in the sketch after this list
  3. Define appropriate backoff strategy: exponential with jitter for most cases, linear for specific scenarios
  4. Configure appropriate limits: 3-5 attempts for critical operations, progressive timeouts for each attempt
  5. Implement comprehensive observability: structured logs, retry rate metrics, alerts on abnormal failure rates
  6. Ensure operation idempotency or use deduplication identifiers for repeated requests
  7. Test failure scenarios with chaos engineering to validate behavior under load
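As a complement to step 2, here is a minimal sketch of an isRetryable classifier. The error shape (status and code properties) is an assumption about how the HTTP client surfaces failures, not a standard interface:

interface HttpError extends Error {
  status?: number;   // assumed property carrying the HTTP status code
  code?: string;     // e.g. 'ECONNRESET', 'ETIMEDOUT'
}

const TRANSIENT_STATUSES = new Set([408, 429, 500, 502, 503, 504]);
const TRANSIENT_CODES = new Set(['ECONNRESET', 'ETIMEDOUT', 'EAI_AGAIN']);

const isRetryable = (error: Error): boolean => {
  const err = error as HttpError;
  if (err.status !== undefined) {
    // Permanent errors such as 401 or 404 are excluded: retrying will not help
    return TRANSIENT_STATUSES.has(err.status);
  }
  // Fall back to common Node.js network error codes
  return err.code !== undefined && TRANSIENT_CODES.has(err.code);
};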

Production Tip

Combine the Retry Pattern with Circuit Breaker to avoid overloading an already failing service. After 3-5 consecutive failed attempts, open the circuit for 30-60 seconds before retrying. Always add jitter (random variance of ±50%) to delays to prevent all clients from retrying simultaneously after a widespread outage (thundering herd problem).
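As a minimal sketch of that combination (the class and threshold values are illustrative, not a production-ready implementation), a simple circuit breaker can wrap the RetryService from the example above:

type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  private state: CircuitState = 'CLOSED';
  private failureCount = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,   // consecutive failures before opening
    private openDurationMs = 30_000 // how long to stay open before probing again
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.openDurationMs) {
        throw new Error('Circuit is open; failing fast');
      }
      this.state = 'HALF_OPEN'; // allow a single probe call
    }
    try {
      const result = await operation();
      this.failureCount = 0;
      this.state = 'CLOSED';
      return result;
    } catch (error) {
      this.failureCount++;
      if (this.failureCount >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
      }
      throw error;
    }
  }
}

// Retries run inside the breaker, so an open circuit stops them entirely
const breaker = new CircuitBreaker();
const result = await breaker.execute(() =>
  retryService.executeWithRetry(
    () => fetch('https://api.example.com/data').then(r => r.json()),
    { maxAttempts: 3, baseDelay: 500, maxDelay: 5000, backoffMultiplier: 2, jitter: true }
  )
);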

Tools and Libraries

  • Polly (.NET): comprehensive resilience library with retry, circuit breaker, timeout and fallback
  • resilience4j (Java): lightweight framework inspired by Hystrix with native retry pattern support
  • axios-retry (JavaScript): plugin for axios enabling configurable retries on HTTP requests (see the example after this list)
  • Tenacity (Python): general-purpose retry library with multiple backoff strategies
  • AWS SDK: integrated automatic retry with exponential backoff for all AWS services
  • Istio/Envoy: automatic retry at service mesh level with declarative configuration
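For instance, a typical axios-retry setup looks like the following; option names reflect the library's documented API, but verify them against the current documentation for your version:

import axios from 'axios';
import axiosRetry from 'axios-retry';

axiosRetry(axios, {
  retries: 3,                                   // maximum number of retries
  retryDelay: axiosRetry.exponentialDelay,      // built-in exponential backoff
  retryCondition: axiosRetry.isNetworkOrIdempotentRequestError // skip non-retryable errors
});

const response = await axios.get('https://api.example.com/data');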

The Retry Pattern constitutes an essential foundation for building resilient systems in distributed environments. By intelligently managing transient failures, it directly improves user-perceived availability while reducing operational costs related to incidents and manual interventions. Its judicious implementation, combined with other resilience patterns, transforms fragile architectures into robust systems capable of maintaining high service levels despite the inherent instability of modern cloud infrastructures.
