❌

Reading view

There are new articles available, click to refresh the page.

Learning Notes #23 – Retry Pattern | Cloud Patterns

Today, i refreshed Retry pattern. It handles transient failures ( network issues, throttling, or temporary unavailability of a service).

The Retry Pattern provides a structured approach to handle these failures gracefully, ensuring system reliability and fault tolerance. It is often used in conjunction with related patterns like the Circuit Breaker, which prevents repeated retries during prolonged failures, and the Bulkhead Pattern, which isolates system components to prevent cascading failures.

In this blog, i jot down my notes on Retry pattern for better understanding.

What is the Retry Pattern?

The Retry Pattern is a design strategy used to manage transient failures by retrying failed operations. Instead of immediately failing an operation after an error, the pattern retries it with an optional delay or backoff strategy. This is particularly useful in distributed systems where failures are often temporary.

Key Components of the Retry Pattern

  • Retry Logic: The mechanism that determines how many times to retry and under what conditions.
  • Backoff Strategy: A delay mechanism to space out retries. Common strategies include fixed, incremental, and exponential backoff.
  • Termination Policy: A limit on the number of retries or a timeout to prevent infinite retry loops.
  • Error Handling: A fallback mechanism to gracefully handle persistent failures after retries are exhausted.

Retry Pattern Strategies

1. Fixed Interval Retry

  • Retries are performed at regular intervals.
  • Example: Retry every 2 seconds for up to 5 attempts.

2. Incremental Backoff

  • Retry intervals increase linearly.
  • Example: Retry after 1, 2, 3, 4, and 5 seconds.

3. Exponential Backoff

  • Retry intervals grow exponentially, often with jitter to randomize delays.
  • Example: Retry after 1, 2, 4, 8, and 16 seconds.

4. Custom Backoff

  • Tailored to specific use cases, combining strategies or using domain-specific logic.

Implementing the Retry Pattern in Python with Tenacity

tenacity is a powerful Python library that simplifies the implementation of the Retry Pattern. It provides built-in support for various retry strategies, including fixed interval, incremental backoff, and exponential backoff with jitter.

Example with Fixed Interval Retry


from tenacity import retry, stop_after_attempt, wait_fixed

@retry(stop=stop_after_attempt(5), wait=wait_fixed(2))
def example_operation():
    print("Trying operation...")
    raise Exception("Transient error")

try:
    example_operation()
except Exception as e:
    print(f"Operation failed after retries: {e}")

Example with Exponential Backoff and Jitter


from tenacity import retry, stop_after_attempt, wait_exponential_jitter

@retry(stop=stop_after_attempt(5), wait=wwmultiplier=1, max=10))
def example_operation():
    print("Trying operation...")
    raise Exception("Transient error")

try:
    example_operation()
except Exception as e:
    print(f"Operation failed after retries: {e}")

Example with Custom Termination Policy


from tenacity import retry, stop_after_delay, wait_exponential

@retry(stop=stop_after_delay(10), wait=wait_exponential(multiplier=1))
def example_operation():
    print("Trying operation...")
    raise Exception("Transient error")

try:
    example_operation()
except Exception as e:
    print(f"Operation failed after retries: {e}")

Real-World Use Cases

  • API Rate Limiting – Retrying failed API calls when encountering HTTP 429 errors.
  • Database Operations – Retrying failed database queries due to deadlocks or transient connectivity issues.
  • File Uploads/Downloads – Retrying uploads or downloads in case of network interruptions.
  • Message Processing – Retries for message processing failures in systems like RabbitMQ or Kafka.

Learning Notes #18 – Bulk Head Pattern (Resource Isolation) | Cloud Pattern

Today, i learned about bulk head pattern and how it makes the system resilient to failure, resource exhaustion. In this blog i jot down notes on this pattern for better understanding.

In today’s world of distributed systems and microservices, resiliency is key to ensuring applications are robust and can withstand failures.

The Bulkhead Pattern is a design principle used to improve system resilience by isolating different parts of a system to prevent failure in one component from cascading to others.

What is the Bulkhead Pattern?

The term β€œbulkhead” originates from shipbuilding, where bulkheads are partitions that divide a ship into separate compartments. If one compartment is breached, the others remain intact, preventing the entire ship from sinking. Similarly, in software design, the Bulkhead Pattern isolates components or services so that a failure in one part does not bring down the entire system.

In software systems, bulkheads:

  • Isolate resources (e.g., threads, database connections, or network calls) for different components.
  • Limit the scope of failures.
  • Allow other parts of the system to continue functioning even if one part is degraded or completely unavailable.

Example

Consider an e-commerce application with a product-service that has two endpoints

  1. /product/{id} – This endpoint gives detailed information about a specific product, including ratings and reviews. It depends on the rating-service.
  2. /products – This endpoint provides a catalog of products based on search criteria. It does not depend on any external services.

Consider, with a fixed amount of resource allocated to product-service is loaded with /product/{id} calls, then they can monopolize the thread pool. This delays /products requests, causing users to experience slowness even though these requests are independent. Which leads to resource exhaustion and failures.

With bulkhead pattern, we can allocate separate client, connection pools to isolate the service interaction. we can implement bulkhead by allocating some connection pool (10) to /product/{id} requests and /products requests have a different connection pool (5) .

Even if /product/{id} requests are slow or encounter high traffic, /products requests remain unaffected.

Scenarios Where the Bulkhead Pattern is Needed

  1. Microservices with Shared Resources – In a microservices architecture, multiple services might share limited resources such as database connections or threads. If one service experiences a surge in traffic or a failure, it can exhaust these shared resources, impacting all other services. Bulkheading ensures each service gets a dedicated pool of resources, isolating the impact of failures.
  2. Prioritizing Critical Workloads – In systems with mixed workloads (e.g., processing user transactions and generating reports), critical operations like transaction processing must not be delayed or blocked by less critical tasks. Bulkheading allocates separate resources to ensure critical tasks have priority.
  3. Third-Party API Integration – When an application depends on multiple external APIs, one slow or failing API can delay the entire application if not isolated. Using bulkheads ensures that issues with one API do not affect interactions with others.
  4. Multi-Tenant Systems – In SaaS applications serving multiple tenants, a single tenant’s high resource consumption or failure should not degrade the experience for others. Bulkheads can segregate resources per tenant to maintain service quality.
  5. Cloud-Native Applications – In cloud environments, services often scale independently. A spike in one service’s load should not overwhelm shared backend systems. Bulkheads help isolate and manage these spikes.
  6. Event-Driven Systems – In event-driven architectures with message queues, processing backlogs for one type of event can delay others. By applying the Bulkhead Pattern, separate processing pipelines can handle different event types independently.

What are the Key Points of the Bulkhead Pattern? (Simplified)

  • Define Partitions – (Think of a ship) it’s divided into compartments (partitions) to keep water from flooding the whole ship if one section gets damaged. In software, these partitions are designed around how the application works and its technical needs.
  • Designing with Context – If you’re using a design approach like DDD (Domain-Driven Design), make sure your bulkheads (partitions) match the business logic boundaries.
  • Choosing Isolation Levels – Decide how much isolation is needed. For example: Threads for lightweight tasks. Separate containers or virtual machines for more critical separations. Balance between keeping things separate and the costs or extra effort involved.
  • Combining Other Techniques – Bulkheads work even better with patterns like Retry, Circuit Breaker, Throttling.
  • Monitoring – Keep an eye on each partition’s performance. If one starts getting overloaded, you can adjust resources or change limits.

When Should You Use the Bulkhead Pattern?

  • To Isolate Critical Resources – If one part of your system fails, other parts can keep working. For example, you don’t want search functionality to stop working because the reviews section is down.
  • To Prioritize Important Work – For example, make sure payment processing (critical) is separate from background tasks like sending emails.
  • To Avoid Cascading Failures – If one part of the system gets overwhelmed, it won’t drag down everything else.

When Should You Avoid It?

  • Complexity Isn’t Needed – If your system is simple, adding bulkheads might just make it harder to manage.
  • Resource Efficiency is Critical – Sometimes, splitting resources into separate pools can mean less efficient use of those resources. If every thread, connection, or container is underutilized, this might not be the best approach.

Challenges and Best Practices

  1. Overhead: Maintaining separate resource pools can increase system complexity and resource utilization.
  2. Resource Sizing: Properly sizing the pools is critical to ensure resources are efficiently utilized without bottlenecks.
  3. Monitoring: Use tools to monitor the health and performance of each resource pool to detect bottlenecks or saturation.

References:

  1. AWS https://aws.amazon.com/blogs/containers/building-a-fault-tolerant-architecture-with-a-bulkhead-pattern-on-aws-app-mesh/
  2. Resilience https://resilience4j.readme.io/docs/bulkhead
  3. https://medium.com/nerd-for-tech/bulkhead-pattern-distributed-design-pattern-c673d5e81523
  4. Microsoft https://learn.microsoft.com/en-us/azure/architecture/patterns/bulkhead

❌