Today, I refreshed the Retry pattern. It handles transient failures (network issues, throttling, or temporary unavailability of a service).
The Retry Pattern provides a structured approach to handle these failures gracefully, ensuring system reliability and fault tolerance. It is often used in conjunction with related patterns like the Circuit Breaker, which prevents repeated retries during prolonged failures, and the Bulkhead Pattern, which isolates system components to prevent cascading failures.
In this blog, I jot down my notes on the Retry pattern for better understanding.
What is the Retry Pattern?
The Retry Pattern is a design strategy used to manage transient failures by retrying failed operations. Instead of immediately failing an operation after an error, the pattern retries it with an optional delay or backoff strategy. This is particularly useful in distributed systems where failures are often temporary.
Key Components of the Retry Pattern
Retry Logic: The mechanism that determines how many times to retry and under what conditions.
Backoff Strategy: A delay mechanism to space out retries. Common strategies include fixed, incremental, and exponential backoff.
Termination Policy: A limit on the number of retries or a timeout to prevent infinite retry loops.
Error Handling: A fallback mechanism to gracefully handle persistent failures after retries are exhausted. A minimal sketch tying these components together follows this list.
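Before reaching for a library, it helps to see how these four components fit together. Below is a minimal hand-rolled sketch; the function name and parameters are my own illustrative choices, not from any particular library.

import random
import time

def retry_call(operation, max_attempts=5, base_delay=1.0, fallback=None):
    """Run operation, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):          # termination policy
        try:
            return operation()                          # retry logic
        except Exception:
            if attempt == max_attempts:                 # retries exhausted
                if fallback is not None:
                    return fallback()                   # error handling
                raise
            delay = base_delay * (2 ** (attempt - 1))   # backoff strategy
            time.sleep(delay + random.uniform(0, 0.5))  # jitter spaces out retries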
Retry Pattern Strategies
1. Fixed Interval Retry
Retries are performed at regular intervals.
Example: Retry every 2 seconds for up to 5 attempts.
2. Incremental Backoff
Retry intervals increase linearly.
Example: Retry after 1, 2, 3, 4, and 5 seconds.
3. Exponential Backoff
Retry intervals grow exponentially, often with jitter to randomize delays.
Example: Retry after 1, 2, 4, 8, and 16 seconds.
4. Custom Backoff
Tailored to specific use cases, combining strategies or using domain-specific logic. The delay schedules for the first three strategies are sketched below.
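To see how the schedules differ, here is a toy sketch that prints the delay each strategy would produce for five attempts. The numbers match the examples above; real implementations usually also cap the delay and add jitter, as the exponential variant does here.

import random

def fixed_delay(attempt, interval=2):
    return interval                              # 2, 2, 2, 2, 2

def incremental_delay(attempt, step=1):
    return step * attempt                        # 1, 2, 3, 4, 5

def exponential_delay(attempt, base=1, cap=16):
    delay = min(base * 2 ** (attempt - 1), cap)  # 1, 2, 4, 8, 16
    return delay + random.uniform(0, 1)          # jitter randomizes the spacing

for attempt in range(1, 6):
    print(attempt, fixed_delay(attempt), incremental_delay(attempt), exponential_delay(attempt))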
Implementing the Retry Pattern in Python with Tenacity
Tenacity is a powerful Python library that simplifies the implementation of the Retry Pattern. It provides built-in support for various retry strategies, including fixed interval, incremental backoff, and exponential backoff with jitter.
Example with Fixed Interval Retry
from tenacity import retry, stop_after_attempt, wait_fixed

@retry(stop=stop_after_attempt(5), wait=wait_fixed(2))
def example_operation():
    print("Trying operation...")
    raise Exception("Transient error")

try:
    example_operation()
except Exception as e:
    print(f"Operation failed after retries: {e}")
Example with Exponential Backoff and Jitter
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

@retry(stop=stop_after_attempt(5), wait=wait_exponential_jitter(initial=1, max=10))
def example_operation():
    print("Trying operation...")
    raise Exception("Transient error")

try:
    example_operation()
except Exception as e:
    print(f"Operation failed after retries: {e}")
Example with Custom Termination Policy
from tenacity import retry, stop_after_delay, wait_exponential

@retry(stop=stop_after_delay(10), wait=wait_exponential(multiplier=1))
def example_operation():
    print("Trying operation...")
    raise Exception("Transient error")

try:
    example_operation()
except Exception as e:
    print(f"Operation failed after retries: {e}")
Real-World Use Cases
API Rate Limiting: Retrying failed API calls when encountering HTTP 429 errors (see the sketch after this list).
Database Operations: Retrying failed database queries due to deadlocks or transient connectivity issues.
File Uploads/Downloads: Retrying uploads or downloads in case of network interruptions.
Message Processing: Retries for message processing failures in systems like RabbitMQ or Kafka.
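A quick note on the rate-limiting case: many servers send a Retry-After header along with HTTP 429, telling clients exactly how long to wait. Tenacity accepts any callable as its wait argument, so one option is a custom wait that honors that hint when present. This is a sketch under that assumption; the endpoint URL and the 2-second fallback delay are illustrative choices of mine.

import requests
from requests.exceptions import HTTPError
from tenacity import retry, retry_if_exception_type, stop_after_attempt

def wait_for_retry_after(retry_state):
    """Use the server's Retry-After hint when available, else fall back to 2s."""
    exc = retry_state.outcome.exception()
    if isinstance(exc, HTTPError) and exc.response is not None:
        retry_after = exc.response.headers.get("Retry-After")
        if retry_after is not None:
            return float(retry_after)  # Retry-After can also be an HTTP date; only seconds handled here
    return 2.0  # arbitrary fallback delay

@retry(stop=stop_after_attempt(5),
       wait=wait_for_retry_after,
       retry=retry_if_exception_type(HTTPError))
def fetch_with_rate_limit():
    response = requests.get("http://localhost:5000/rate_limit")  # illustrative endpoint
    response.raise_for_status()
    return response.json()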
Meet Jafer, a talented developer (self-boast) working at a fast-growing tech company. His team is building an innovative app that fetches data from multiple third-party APIs in real time to provide users with up-to-date information.
Everything is going smoothly until one day, a spike in traffic causes their app to face a wave of HTTP 500 and timeout errors. Requests start failing left and right, and users are left staring at the dreaded "Data Unavailable" message.
Jafer realizes that he needs a way to make their app more resilient against these unpredictable network hiccups. That's when he discovers Tenacity, a powerful Python library designed to help developers handle retries gracefully.
Join Jafer as he dives into Tenacity and learns how to turn his app from fragile to robust with just a few lines of code!
Step 0: Mock Flask API
from flask import Flask, jsonify, make_response
import random
import time

app = Flask(__name__)

# Scenario 1: Random server errors
@app.route('/random_error', methods=['GET'])
def random_error():
    if random.choice([True, False]):
        return make_response(jsonify({"error": "Server error"}), 500)  # Simulate a 500 error randomly
    return jsonify({"message": "Success"})

# Scenario 2: Timeouts
@app.route('/timeout', methods=['GET'])
def timeout():
    time.sleep(5)  # Simulate a long delay that can cause a timeout
    return jsonify({"message": "Delayed response"})

# Scenario 3: 404 Not Found error
@app.route('/not_found', methods=['GET'])
def not_found():
    return make_response(jsonify({"error": "Not found"}), 404)

# Scenario 4: Rate-limiting (simulated with a fixed chance)
@app.route('/rate_limit', methods=['GET'])
def rate_limit():
    if random.randint(1, 10) <= 3:  # 30% chance to simulate rate limiting
        return make_response(jsonify({"error": "Rate limit exceeded"}), 429)
    return jsonify({"message": "Success"})

# Scenario 5: Empty response
@app.route('/empty_response', methods=['GET'])
def empty_response():
    if random.choice([True, False]):
        return make_response("", 204)  # Simulate an empty response with 204 No Content
    return jsonify({"message": "Success"})

if __name__ == '__main__':
    app.run(host='localhost', port=5000, debug=True)
To run the Flask app, save it as mock_server.py and use the command,
python mock_server.py
Step 1: Introducing Tenacity
Jafer decides to start with the basics. He knows that Tenacity will allow him to retry failed requests without cluttering his codebase with complex loops and error handling. So, he installs the library,
pip install tenacity
With Tenacity ready, Jafer decides to tackle his first problem, retrying a request that fails due to server errors.
Step 2: Retrying on Exceptions
He writes a simple function that fetches data from an API and wraps it with Tenacity's @retry decorator,
import requests
import logging
from tenacity import before_log, after_log
from tenacity import retry, stop_after_attempt, wait_fixed

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@retry(stop=stop_after_attempt(3),
       wait=wait_fixed(2),
       before=before_log(logger, logging.INFO),
       after=after_log(logger, logging.INFO))
def fetch_random_error():
    response = requests.get('http://localhost:5000/random_error')
    response.raise_for_status()  # Raises an HTTPError for 4xx/5xx responses
    return response.json()

if __name__ == '__main__':
    try:
        data = fetch_random_error()
        print("Data fetched successfully:", data)
    except Exception as e:
        print("Failed to fetch data:", str(e))
This code will attempt the request up to 3 times, waiting 2 seconds between each try. Jafer feels confident that this will handle the occasional hiccup. However, he soon realizes that he needs more control over which exceptions trigger a retry.
Step 3: Handling Specific Exceptions
Jafer's app sometimes receives a 404 Not Found error, which should not be retried because the resource doesn't exist. He modifies the retry logic to handle only certain exceptions,
import requests
import logging
from tenacity import before_log, after_log
from requests.exceptions import HTTPError, Timeout
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@retry(stop=stop_after_attempt(3),
       wait=wait_fixed(2),
       retry=retry_if_exception_type((HTTPError, Timeout)),
       before=before_log(logger, logging.INFO),
       after=after_log(logger, logging.INFO))
def fetch_data():
    response = requests.get('http://localhost:5000/timeout', timeout=2)  # Short timeout to simulate failure
    response.raise_for_status()
    return response.json()

if __name__ == '__main__':
    try:
        data = fetch_data()
        print("Data fetched successfully:", data)
    except Exception as e:
        print("Failed to fetch data:", str(e))
Now, the function retries only on HTTPError or Timeout, avoiding unnecessary retries for a 404 error. Jafer's app is starting to feel more resilient!
Step 4: Implementing Exponential Backoff
A few days later, the team notices that they're still getting rate-limited by some APIs. Jafer recalls the concept of exponential backoff, a strategy where the wait time between retries increases exponentially, reducing the load on the server and preventing further rate limiting.
He decides to implement it,
import requests
import logging
from tenacity import before_log, after_log
from tenacity import retry, stop_after_attempt, wait_exponential

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@retry(stop=stop_after_attempt(5),
       wait=wait_exponential(multiplier=1, min=2, max=10),
       before=before_log(logger, logging.INFO),
       after=after_log(logger, logging.INFO))
def fetch_rate_limit():
    response = requests.get('http://localhost:5000/rate_limit')
    response.raise_for_status()
    return response.json()

if __name__ == '__main__':
    try:
        data = fetch_rate_limit()
        print("Data fetched successfully:", data)
    except Exception as e:
        print("Failed to fetch data:", str(e))
With this code, the wait time starts at 2 seconds and doubles with each retry, up to a maximum of 10 seconds. Jafer's app is now much less likely to be rate-limited!
Step 5: Retrying Based on Return Values
Jafer encounters another issue: some APIs occasionally return an empty response (204 No Content). These cases should also trigger a retry. Tenacity makes this easy with the retry_if_result feature,
import requests
import logging
from tenacity import before_log, after_log
from tenacity import retry, stop_after_attempt, retry_if_result

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@retry(retry=retry_if_result(lambda x: x is None),
       stop=stop_after_attempt(3),
       before=before_log(logger, logging.INFO),
       after=after_log(logger, logging.INFO))
def fetch_empty_response():
    response = requests.get('http://localhost:5000/empty_response')
    if response.status_code == 204:
        return None  # Treat an empty response as a retryable result
    response.raise_for_status()
    return response.json()

if __name__ == '__main__':
    try:
        data = fetch_empty_response()
        print("Data fetched successfully:", data)
    except Exception as e:
        print("Failed to fetch data:", str(e))
Now, the function retries when it receives an empty response, ensuring that users get the data they need.
Step 6: Combining Multiple Retry Conditions
But Jafer isn't done yet. Some situations require combining multiple conditions. He wants to retry on HTTPError, Timeout, or a None return value. With Tenacity's retry_any feature, he can do just that,
import requests
import logging
from tenacity import before_log, after_log
from requests.exceptions import HTTPError, Timeout
from tenacity import retry_any, retry, retry_if_exception_type, retry_if_result, stop_after_attempt

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@retry(retry=retry_any(retry_if_exception_type((HTTPError, Timeout)),
                       retry_if_result(lambda x: x is None)),
       stop=stop_after_attempt(3),
       before=before_log(logger, logging.INFO),
       after=after_log(logger, logging.INFO))
def fetch_data():
    response = requests.get("http://localhost:5000/timeout")
    if response.status_code == 204:
        return None
    response.raise_for_status()
    return response.json()

if __name__ == '__main__':
    try:
        data = fetch_data()
        print("Data fetched successfully:", data)
    except Exception as e:
        print("Failed to fetch data:", str(e))
This approach covers all his bases, making the app even more resilient!
Step 7: Logging and Tracking Retries
As the app scales, Jafer wants to keep an eye on how often retries happen and why. He decides to add logging,
import logging
import requests
from tenacity import before_log, after_log
from tenacity import retry, stop_after_attempt, wait_fixed

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@retry(stop=stop_after_attempt(2), wait=wait_fixed(2),
       before=before_log(logger, logging.INFO),
       after=after_log(logger, logging.INFO))
def fetch_data():
    response = requests.get("http://localhost:5000/timeout", timeout=2)
    response.raise_for_status()
    return response.json()

if __name__ == '__main__':
    try:
        data = fetch_data()
        print("Data fetched successfully:", data)
    except Exception as e:
        print("Failed to fetch data:", str(e))
This logs messages before and after each retry attempt, giving Jafer full visibility into the retry process. Now, he can monitor the app's behavior in production and quickly spot any patterns or issues.
The Happy Ending
With Tenacity, Jafer has transformed his app into a resilient powerhouse that gracefully handles intermittent failures. Users are happy, the servers are humming along smoothly, and Jafer's team has more time to work on new features rather than firefighting network errors.
By mastering Tenacity, Jafer has learned that handling network failures gracefully can turn a fragile app into a robust and reliable one. Whether it's dealing with flaky APIs, network blips, or rate limits, Tenacity is his go-to tool for retrying operations in Python.
So, the next time your app faces unpredictable network challenges, remember Jafer's story and give Tenacity a try; you might just save the day!