
How to Manage Multiple Cron Job Executions

16 March 2025 at 06:13

Cron jobs are a fundamental part of automating tasks in Unix-based systems. However, one common problem with cron jobs is multiple executions, where overlapping job runs can cause serious issues like data corruption, race conditions, or unexpected system load.

In this blog, we’ll explore why multiple executions happen, the potential risks, and how flock provides an elegant solution to ensure that a cron job runs only once at a time.

The Problem: Multiple Executions of Cron Jobs

Cron jobs are scheduled to run at fixed intervals, but sometimes a new job instance starts before the previous one finishes.

This can happen due to:

  • Long-running jobs: If a cron job takes longer than its interval, a new instance starts while the old one is still running.
  • System slowdowns: High CPU or memory usage can delay job execution, leading to overlapping runs.
  • Simultaneous executions across servers: In a distributed system, multiple servers might execute the same cron job, causing duplication.

Example of a Problematic Cron Job

Let’s say we have the following cron job that runs every minute:

* * * * * /path/to/script.sh

If script.sh takes more than a minute to execute, a second instance will start before the first one finishes.

This can lead to:

✅ Duplicate database writes → Inconsistent data

✅ Conflicts in file processing → Corrupt files

✅ Overloaded system resources → Performance degradation

Real-World Example

Imagine a job that processes user invoices and sends emails:

* * * * * /usr/bin/python3 /home/user/process_invoices.py

If the script takes longer than a minute to complete, multiple instances might start running, causing:

  1. Users to receive multiple invoices.
  2. The database to get inconsistent updates.
  3. Increased server load due to excessive email sending.

The Solution: Using flock to Prevent Multiple Executions

flock is a Linux utility that manages file locks to ensure that only one instance of a process runs at a time. It works by locking a specific file, preventing other processes from acquiring the same lock.

Using flock in a Cron Job

Modify the cron job as follows:

* * * * * /usr/bin/flock -n /tmp/myjob.lock /path/to/script.sh

How It Works

  • flock -n /tmp/myjob.lock → Tries to acquire a lock on /tmp/myjob.lock.
  • If the lock is available, the script runs.
  • If the lock is already held (i.e., another instance is running), flock prevents the new instance from starting.
  • -n (non-blocking) ensures that the job doesn’t wait for the lock and simply exits if it cannot acquire it.

This guarantees that only one instance of the job runs at a time.
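The same locking idea can also be enforced inside the script itself, which is handy when you cannot change the cron entry. Below is a minimal Python sketch using the standard fcntl module; the lock-file path is illustrative and plays the same role as flock's lock file:

```python
import fcntl
import sys

LOCK_FILE = "/tmp/myjob.lock"  # illustrative path, same role as in the flock example


def acquire_lock(path):
    """Try to take an exclusive, non-blocking lock; return the open file or None."""
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f  # keep the file open: closing it releases the lock
    except BlockingIOError:
        f.close()
        return None  # another instance already holds the lock


if __name__ == "__main__":
    lock = acquire_lock(LOCK_FILE)
    if lock is None:
        sys.exit(0)  # mirror flock -n: exit quietly instead of waiting
    # ... the actual job runs here, protected by the lock ...
```

As with flock -n, the non-blocking LOCK_NB flag makes a second instance exit immediately instead of queueing up behind the first.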

Verifying the Solution

You can test the lock by manually running the script with flock:

/usr/bin/flock -n /tmp/myjob.lock /bin/bash -c 'echo "Running job..."; sleep 30'

Open another terminal and try to run the same command. You’ll see that the second attempt exits immediately because the lock is already acquired.

Preventing multiple executions of cron jobs is essential for maintaining data consistency, system stability, and efficiency. By using flock, you can easily enforce single execution without complex logic.

✅ Simple & efficient solution. ✅ No external dependencies required. ✅ Works seamlessly with cron jobs.

So next time you set up a cron job, add flock and sleep peacefully knowing your tasks won’t collide. 🚀

Golden Feedbacks for Python Sessions 1.0 from last year (2024)

13 February 2025 at 08:49

Many thanks to Shrini for documenting it last year. This serves as a good reference for improving my skills. Hope it helps many others.

📢 What Participants wanted to improve

🚶‍♂️ Go a bit slower so that everyone can understand clearly without feeling rushed.


📚 Provide more basics and examples to make learning easier for beginners.


🖥 Spend the first week explaining programming basics so that newcomers don’t feel lost.


📊 Teach flowcharting methods to help participants understand the logic behind coding.


🕹 Try teaching Scratch as an interactive way to introduce programming concepts.


🗓 Offer weekend batches for those who prefer learning on weekends.


🗣 Encourage more conversations so that participants can actively engage in discussions.


👥 Create sub-groups to allow participants to collaborate and support each other.


🎉 Get “cheerleaders” within the team to make the classes more fun and interactive.


📢 Increase promotion efforts to reach a wider audience and get more participants.


🔍 Provide better examples to make concepts easier to grasp.


❓ Conduct more Q&A sessions so participants can ask and clarify their doubts.


🎙 Ensure that each participant gets a chance to speak and express their thoughts.


📹 Showing your face in videos can help in building a more personal connection with the learners.


🏆 Organize mini-hackathons to provide hands-on experience and encourage practical learning.


🔗 Foster more interactions and connections between participants to build a strong learning community.


✍ Encourage participants to write blogs daily to document their learning and share insights.


🎤 Motivate participants to give talks in class and other communities to build confidence.

📝 Other Learnings & Suggestions

📵 Avoid creating WhatsApp groups for communication, as the 1024 member limit makes it difficult to manage multiple groups.


✉ Telegram works fine for now, but explore using mailing lists as an alternative for structured discussions.


🔕 Mute groups when necessary to prevent unnecessary messages like “Hi, Hello, Good Morning.”


📢 Teach participants how to join mailing lists like ChennaiPy and KanchiLUG and guide them on asking questions in forums like Tamil Linux Community.


📝 Show participants how to create a free blog on platforms like dev.to or WordPress to share their learning journey.


🛠 Avoid spending too much time explaining everything in-depth, as participants should start coding a small project by the 5th or 6th class.


📌 Present topics as solutions to project ideas or real-world problem statements instead of just theory.


👤 Encourage using names when addressing people, rather than calling them “Sir” or “Madam,” to maintain an equal and friendly learning environment.


💸 Zoom is costly, and since only around 50 people complete the training, consider alternatives like Jitsi or Google Meet for better cost-effectiveness.

Will try to incorporate these learnings in our upcoming sessions.

🚀 Let’s make this learning experience engaging, interactive, and impactful! 🎯

📢 Python Learning 2.0 in Tamil – Call for Participants! 🚀

10 February 2025 at 07:58

After an incredible year of Python learning (Watch our journey here), we’re back with an all-new approach for 2025!

If you haven’t subscribed to our channel yet, don’t miss the chance: Support Us by subscribing

This time, we’re shifting gears from theory to practice with mini projects that will help you build real-world solutions. Study materials will be shared beforehand, and you’ll work hands-on to solve practical problems building actual projects that showcase your skills.

🔑 What’s New?

✅ Real-world mini projects
✅ Task-based shortlisting process
✅ Limited seats for focused learning
✅ Dedicated WhatsApp group for discussions & mentorship
✅ Live streaming of sessions for wider participation
✅ Study materials, quizzes, surprise gifts, and more!

📋 How to Join?

  1. Fill out the RSVP below – open for 20 days (till March 2) only!
  2. After RSVP closes, shortlisted participants will receive tasks via email.
  3. Complete the tasks to get shortlisted.
  4. Selected students will be added to an exclusive WhatsApp group for intensive training.
  5. It’s COST-FREE learning. We require only your time, effort, and support.
  6. Course start date will be announced after RSVP.

📜 RSVP Form

☎ How to Contact Us for Queries?

If you have any queries, feel free to message via WhatsApp, Telegram, or Signal at 9176409201.

You can also mail me at learnwithjafer@gmail.com

Follow us for more opportunities, updates and more…

Don’t miss this chance to level up your Python skills, cost-free, with hands-on projects and exciting rewards! RSVP now and be part of Python Learning 2.0! 🚀

Our Previous Monthly meets – https://www.youtube.com/watch?v=cPtyuSzeaa8&list=PLiutOxBS1MizPGGcdfXF61WP5pNUYvxUl&pp=gAQB

Our Previous Sessions,

Postgres – https://www.youtube.com/watch?v=04pE5bK2-VA&list=PLiutOxBS1Miy3PPwxuvlGRpmNo724mAlt&pp=gAQB

Python – https://www.youtube.com/watch?v=lQquVptFreE&list=PLiutOxBS1Mizte0ehfMrRKHSIQcCImwHL&pp=gAQB

Docker – https://www.youtube.com/watch?v=nXgUBanjZP8&list=PLiutOxBS1Mizi9IRQM-N3BFWXJkb-hQ4U&pp=gAQB

Note: If you wish to support me for this initiative please share this with your friends, students and those who are in need.

Learning Notes #68 – Buildpacks and Dockerfile

2 February 2025 at 09:32

  1. What is an OCI ?
  2. Does Docker Create OCI Images?
  3. What is a Buildpack ?
  4. Overview of Buildpack Process
  5. Builder: The Image That Executes the Build
    1. Components of a Builder Image
    2. Stack: The Combination of Build and Run Images
  6. Installation and Initial Setups
  7. Basic Build of an Image (Python Project)
    1. Building an image using buildpack
    2. Building an Image using Dockerfile
  8. Unique Benefits of Buildpacks
    1. No Need for a Dockerfile (Auto-Detection)
    2. Automatic Security Updates
    3. Standardized & Reproducible Builds
    4. Extensibility: Custom Buildpacks
  9. Generating SBOM in Buildpacks
    1. a) Using pack CLI to Generate SBOM
    2. b) Generate SBOM in Docker

Over the last few days, I have been exploring Buildpacks. I am impressed by how this tool reduces developer pain. In this blog, I jot down my experience with Buildpacks.

Before trying Buildpacks, we need to understand what an OCI is.

What is an OCI ?

An OCI Image (Open Container Initiative Image) is a standard format for container images, defined by the Open Container Initiative (OCI) to ensure interoperability across different container runtimes (Docker, Podman, containerd, etc.).

It consists of:

  1. Manifest – Metadata describing the image (layers, config, etc.).
  2. Config JSON – Information about how the container should run (CMD, ENV, etc.).
  3. Filesystem Layers – The actual file system of the container.

OCI Image Specification ensures that container images built once can run on any OCI-compliant runtime.
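To make these three parts concrete, here is a small Python sketch that parses an illustrative OCI image manifest. The digests and sizes below are made up for the example; only the media types follow the real OCI image spec:

```python
import json

# An illustrative, truncated OCI image manifest (digests are placeholders)
manifest_json = """
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:0000000000000000000000000000000000000000000000000000000000000000",
    "size": 1469
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:1111111111111111111111111111111111111111111111111111111111111111",
      "size": 2811969
    }
  ]
}
"""

manifest = json.loads(manifest_json)
config_digest = manifest["config"]["digest"]               # points at the run-config JSON
layer_digests = [l["digest"] for l in manifest["layers"]]  # the filesystem layers
print(config_digest, len(layer_digests))
```

Any OCI-compliant runtime resolves an image the same way: fetch the manifest, then the config, then each layer by digest.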

Does Docker Create OCI Images?

Yes, Docker creates OCI-compliant images. Since Docker v1.10+, Docker has been aligned with the OCI Image Specification, and all Docker images are OCI-compliant by default.

  • When you build an image with docker build, it follows the OCI Image format.
  • When you push/pull images to registries like Docker Hub, they follow the OCI Image Specification.

However, Docker also supports its legacy Docker Image format, which existed before OCI was introduced. Most modern registries and runtimes (Kubernetes, Podman, containerd) support OCI images natively.

What is a Buildpack ?

A buildpack is a framework for transforming application source code into a runnable image by handling dependencies, compilation, and configuration. Buildpacks are widely used in cloud environments like Heroku, Cloud Foundry, and Kubernetes (via Cloud Native Buildpacks).

Overview of Buildpack Process

The buildpack process consists of two primary phases:

  • Detection Phase: Determines if the buildpack should be applied based on the app’s dependencies.
  • Build Phase: Executes the necessary steps to prepare the application for running in a container.

Buildpacks work with a lifecycle manager (e.g., Cloud Native Buildpacks’ lifecycle) that orchestrates the execution of multiple buildpacks in an ordered sequence.

Builder: The Image That Executes the Build

A builder is an image that contains all necessary components to run a buildpack.

Components of a Builder Image

  1. Build Image – Used during the build phase (includes compilers, dependencies, etc.).
  2. Run Image – A minimal environment for running the final built application.
  3. Lifecycle – The core mechanism that executes buildpacks, orchestrates the process, and ensures reproducibility.

Stack: The Combination of Build and Run Images

  • Build Image + Run Image = Stack
  • Build Image: Base OS with tools required for building (e.g., Ubuntu, Alpine).
  • Run Image: Lightweight OS with only the runtime dependencies for execution.

Installation and Initial Setups

Basic Build of an Image (Python Project)

Project Source: https://github.com/syedjaferk/gh_action_docker_build_push_fastapi_app

Building an image using buildpack

Before running these commands, ensure you have Pack CLI (pack) installed.

a) Get a builder suggestion

pack builder suggest

b) Build the image

pack build my-python-app --builder paketobuildpacks/builder:base

c) Run the image locally


docker run -p 8080:8080 my-python-app

Building an Image using Dockerfile

a) Dockerfile


FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .

RUN pip install -r requirements.txt

COPY ./random_id_generator ./random_id_generator
COPY app.py app.py

EXPOSE 8080

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]

b) Build and Run


docker build -t my-python-app .
docker run -p 8080:8080 my-python-app

Unique Benefits of Buildpacks

No Need for a Dockerfile (Auto-Detection)

Buildpacks automatically detect the language and dependencies, removing the need for a Dockerfile.


pack build my-python-app --builder paketobuildpacks/builder:base

It detects Python, installs dependencies, and builds the app into a container. 🚀 Docker requires a Dockerfile, which developers must manually configure and maintain.

Automatic Security Updates

Buildpacks automatically patch base images for security vulnerabilities.

If there’s a CVE in the OS layer, Buildpacks update the base image without rebuilding the app.


pack rebase my-python-app

No need to rebuild! It replaces only the OS layers while keeping the app the same.

Standardized & Reproducible Builds

Ensures consistent images across environments (dev, CI/CD, production). Example: running the same build locally and on Heroku/Cloud Run:


pack build my-app

Extensibility: Custom Buildpacks

Developers can create custom Buildpacks to add special dependencies.

Example: Adding ffmpeg to a Python buildpack:


pack buildpack package my-custom-python-buildpack --path .

Generating SBOM in Buildpacks

a) Using pack CLI to Generate SBOM

After building an image with pack, run:


pack sbom download my-python-app --output-dir ./sbom

  • This fetches the SBOM for your built image.
  • The SBOM is saved in the ./sbom/ directory.

✅ Supported formats:

  • SPDX (sbom.spdx.json)
  • CycloneDX (sbom.cdx.json)

b) Generate SBOM in Docker


trivy image --format cyclonedx -o sbom.json my-python-app

Both approaches are helpful for creating images. It’s all about the trade-offs.

Learning Notes #63 – Change Data Capture. What does it do ?

19 January 2025 at 16:22

A few days back, I came across the concept of CDC, which acts like a notifier of database events. Instead of polling, it makes each event available in a queue, which can be consumed by many consumers. In this blog, I try to explain the concepts and types in a theoretical manner.

You run a library. Every day, books are borrowed, returned, or new books are added. What if you wanted to keep a live record of all these activities so you always know the exact state of your library?

This is essentially what Change Data Capture (CDC) does for your databases. It’s a way to track changes (like inserts, updates, or deletions) in your database tables and send them to another system, like a live dashboard or a backup system. (Might be a bad example. Don’t lose hope. Continue …)

CDC is widely used in modern technology to power:

  • Real-Time Analytics: Live dashboards that show sales, user activity, or system performance.
  • Data Synchronization: Keeping multiple databases or microservices in sync.
  • Event-Driven Architectures: Triggering notifications, workflows, or downstream processes based on database changes.
  • Data Pipelines: Streaming changes to data lakes or warehouses for further processing.
  • Backup and Recovery: Incremental backups by capturing changes instead of full data dumps.

It’s a critical part of tools like Debezium, Kafka, and cloud services such as AWS Database Migration Service (DMS) and Azure Data Factory. CDC enables companies to move towards real-time data-driven decision-making.

What is CDC?

CDC stands for Change Data Capture. It’s a technique that listens to a database and captures every change that happens in it. These changes can then be sent to other systems to:

  • Keep data in sync across multiple databases.
  • Power real-time analytics dashboards.
  • Trigger notifications for certain database events.
  • Process data streams in real time.

In short, CDC ensures your data is always up-to-date wherever it’s needed.

Why is CDC Useful?

Imagine you have an online store. Whenever someone,

  • Places an order,
  • Updates their shipping address, or
  • Cancels an order,

you need these changes to be reflected immediately across,

  • The shipping system.
  • The inventory system.
  • The email notification service.

Instead of having all these systems constantly query the database (which is slow and inefficient, and is one of the main reasons CDC exists), CDC automatically streams these changes to the relevant systems.

This means:

  1. Real-Time Updates: Systems receive changes instantly.
  2. Improved Performance: Your database isn’t overloaded with repeated queries.
  3. Consistency: All systems stay in sync without manual intervention.

How Does CDC Work?

Note: I haven’t yet tried all of these, but I have a conceptual feel for them.

CDC relies on tracking changes in your database. There are a few ways to do this:

1. Query-Based CDC

This method repeatedly checks the database for changes. For example:

  • Every 5 minutes, it queries the database: “What changed since my last check?”
  • Any new or modified data is identified and processed.

Drawbacks: This can miss changes if the timing isn’t right, and it’s not truly real-time (Long Polling).
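Query-based CDC can be sketched as a plain SQL poll over a timestamp "watermark" column. The sqlite3 snippet below is a minimal illustration; the orders table and updated_at column are made up for the example:

```python
import sqlite3


def fetch_changes(conn, last_seen):
    """Return rows modified since the last poll, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # Carry the watermark forward so the next poll only sees newer rows
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark


# A scheduler would call fetch_changes(conn, watermark) every few minutes,
# persisting the watermark between polls.
```

This also makes the drawback visible: changes sharing the same timestamp around a poll boundary can be missed or read twice, and nothing arrives between polls.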

2. Log-Based CDC

Most modern databases (like PostgreSQL or MySQL) keep logs of every operation. Log-based CDC listens to these logs and captures changes as they happen.

Advantages

  • It’s real-time.
  • It’s lightweight since it doesn’t query the database directly.

3. Trigger-Based CDC

In this method, the database uses triggers to log changes into a separate table. Whenever a change occurs, a trigger writes a record of it.

Advantages: Simple to set up.

Drawbacks: Can slow down the database if not carefully managed.
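Trigger-based CDC takes only a few lines of SQL to sketch. The sqlite3 example below logs every update on an illustrative orders table into a separate change table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);

-- Changes captured by the trigger land here
CREATE TABLE orders_changes (
    order_id   INTEGER,
    old_status TEXT,
    new_status TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER orders_update AFTER UPDATE ON orders
BEGIN
    INSERT INTO orders_changes (order_id, old_status, new_status)
    VALUES (OLD.id, OLD.status, NEW.status);
END;
""")

conn.execute("INSERT INTO orders (id, status) VALUES (1, 'new')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
changes = conn.execute(
    "SELECT order_id, old_status, new_status FROM orders_changes"
).fetchall()
print(changes)  # the trigger recorded the update
```

The drawback is visible too: every UPDATE now pays for an extra INSERT inside the same transaction, which is where the slowdown comes from if triggers are not managed carefully.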

Tools That Make CDC Easy

Several tools simplify CDC implementation. Some popular ones are:

  1. Debezium: Open-source and widely used for log-based CDC with databases like PostgreSQL, MySQL, and MongoDB.
  2. Striim: A commercial tool for real-time data integration.
  3. AWS Database Migration Service (DMS): A cloud-based CDC service.
  4. StreamSets: Another tool for real-time data movement.

These tools integrate with databases, capture changes, and deliver them to systems like RabbitMQ, Kafka, or cloud storage.

To help visualize CDC, think of:

  • Social Media Feeds: When someone likes or comments on a post, you see the update instantly. This is CDC in action.
  • Bank Notifications: Whenever you make a transaction, your bank app updates instantly. Another example of CDC.

In upcoming blogs, I will cover a Debezium implementation of CDC.

Learning Notes #62 – Serverless – Just like riding a taxi

19 January 2025 at 04:55

What is Serverless Computing?

Serverless computing allows developers to run applications without having to manage the underlying infrastructure. You write code, deploy it, and the cloud provider takes care of the rest, from provisioning servers to scaling applications.

Popular serverless platforms include AWS Lambda, Azure Functions, and Google Cloud Functions.

The Taxi Analogy

Imagine traveling to a destination. There are multiple ways to get there:

  1. Owning a Car (Traditional Servers): You own and maintain your car. This means handling maintenance, fuel, insurance, parking, and everything else that comes with it. It’s reliable and gives you control, but it’s also time-consuming and expensive to manage.
  2. Hiring a Taxi (Serverless): With a taxi, you simply book a ride when you need it. You don’t worry about maintaining the car, fueling it, or where it’s parked afterward. You pay only for the distance traveled, and the service scales to your needs whether you’re alone or with friends.

Why Is Serverless Like Taking a Taxi?

  1. No Infrastructure Management – With serverless, you don’t have to manage or worry about servers, just like you don’t need to maintain a taxi.
  2. Pay-As-You-Go – In a taxi, you pay only for the distance traveled. Similarly, in serverless, you’re billed only for the compute time your application consumes.
  3. On-Demand Availability – Need a ride at midnight? A taxi is just a booking away. Serverless functions work the same way, available whenever you need them, scaling up or down as required.
  4. Scalability – Whether you’re a solo traveler or part of a group, taxis can adapt by providing a small car or a larger vehicle. Serverless computing scales resources automatically based on traffic, ensuring optimal performance.
  5. Focus on the Destination – When you take a taxi, you focus on reaching your destination without worrying about the vehicle. Serverless lets you concentrate on writing and deploying code rather than worrying about servers.

Key Benefits of Serverless (and Taxi Rides)

  • Cost-Effectiveness – Avoid upfront costs. No need to buy servers (or cars) you might not fully utilize.
  • Flexibility – Serverless platforms support multiple programming languages and integrations.
    Taxis, too, come in various forms: regular cars, SUVs, and even luxury rides for special occasions.
  • Reduced Overhead – Free yourself from maintenance tasks, whether it’s patching servers or checking tire pressure.

When Not to Choose Serverless (or a Taxi)

  1. Predictable, High-Volume Usage – Owning a car might be cheaper if you’re constantly on the road. Similarly, for predictable and sustained workloads, traditional servers or containers might be more cost-effective than serverless.
  2. Special Requirements – Need a specific type of vehicle, like a truck for moving furniture? Owning one might make sense. Similarly, applications with unique infrastructure requirements may not be a perfect fit for serverless.
  3. Latency Sensitivity – Taxis take time to arrive after booking. Likewise, serverless functions may experience cold starts, adding slight delays. For ultra-low-latency applications, other architectures may be preferable.

Learning Notes #52 – Hybrid Origin Failover Pattern

12 January 2025 at 06:29

Today, I learnt about failover patterns from AWS: https://aws.amazon.com/blogs/networking-and-content-delivery/three-advanced-design-patterns-for-high-available-applications-using-amazon-cloudfront/ . In this blog, I jot down my understanding of this pattern for future reference.

Hybrid origin failover is a strategy that combines two distinct approaches to handle origin failures effectively, balancing speed and resilience.

The Need for Origin Failover

When an application’s primary origin server becomes unavailable, the ability to reroute traffic to a secondary origin ensures continuity. The failover process determines how quickly and effectively this switch happens. Broadly, there are two approaches to implement origin failover:

  1. Stateful Failover with DNS-based Routing
  2. Stateless Failover with Application Logic

Each has its strengths and limitations, which the hybrid approach aims to mitigate.

Approach 1: Stateful Failover with DNS-based Routing

Stateful failover is a system that allows a standby server to take over for a failed server and continue active sessions. It’s used to create a resilient network infrastructure and avoid service interruptions.

This method relies on a DNS service with health checks to detect when the primary origin is unavailable. Here’s how it works:

  1. Health Checks: The DNS service continuously monitors the health of the primary origin using health checks (e.g., HTTP, HTTPS).
  2. DNS Failover: When the primary origin is marked unhealthy, the DNS service resolves the origin’s domain name to the secondary origin’s IP address.
  3. TTL Impact: The failover process honors the DNS Time-to-Live (TTL) settings. A low TTL ensures faster propagation, but even in the most optimal configurations, this process introduces a delay—often around 60 to 70 seconds.
  4. Stateful Behavior: Once failover occurs, all traffic is routed to the secondary origin until the primary origin is marked healthy again.

Implementation from AWS (as-is from the AWS blog)

The first approach is using Amazon Route 53 Failover routing policy with health checks on the origin domain name that’s configured as the origin in CloudFront. When the primary origin becomes unhealthy, Route 53 detects it, and then starts resolving the origin domain name with the IP address of the secondary origin. CloudFront honors the origin DNS TTL, which means that traffic will start flowing to the secondary origin within the DNS TTLs. The most optimal configuration (Fast Check activated, a failover threshold of 1, and 60 second DNS TTL) means that the failover will take 70 seconds at minimum to occur. When it does, all of the traffic is switched to the secondary origin, since it’s a stateful failover. Note that this design can be further extended with Route 53 Application Recovery Control for more sophisticated application failover across multiple AWS Regions, Availability Zones, and on-premises.

The second approach is using origin failover, a native feature of CloudFront. This capability of CloudFront tries for the primary origin of every request, and if a configured 4xx or 5xx error is received, then CloudFront attempts a retry with the secondary origin. This approach is simple to configure and provides immediate failover. However, it’s stateless, which means every request must fail independently, thus introducing latency to failed requests. For transient origin issues, this additional latency is an acceptable tradeoff with the speed of failover, but it’s not ideal when the origin is completely out of service. Finally, this approach only works for the GET/HEAD/OPTIONS HTTP methods, because other HTTP methods are not allowed on a CloudFront cache behavior with Origin Failover enabled.

Advantages

  • Works for all HTTP methods and request types.
  • Ensures complete switchover, minimizing ongoing failures.

Disadvantages

  • Relatively slower failover due to DNS propagation time.
  • Requires a reliable health-check mechanism.

Approach 2: Stateless Failover with Application Logic

This method handles failover at the application level. If a request to the primary origin fails (e.g., due to a 4xx or 5xx HTTP response), the application or CDN immediately retries the request with the secondary origin.

How It Works

  1. Primary Request: The application sends a request to the primary origin.
  2. Failure Handling: If the response indicates a failure (configurable for specific error codes), the request is retried with the secondary origin.
  3. Stateless Behavior: Each request operates independently, so failover happens on a per-request basis without waiting for a stateful switchover.
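The per-request retry can be sketched in a few lines of Python. The origin names and the get callable below are illustrative stand-ins for a real HTTP client:

```python
def fetch_with_failover(origins, get):
    """Stateless failover: try each origin in order until one succeeds.

    `get` is any callable that returns a response body, or raises on a
    configured 4xx/5xx response or connection error.
    """
    last_error = None
    for origin in origins:
        try:
            return get(origin)       # first success wins
        except Exception as exc:     # each request fails independently
            last_error = exc
    raise RuntimeError("all origins failed") from last_error


# Illustrative use: the primary is down, the secondary answers.
def flaky_get(origin):
    if origin == "primary.example.com":
        raise ConnectionError("HTTP 503 from primary")
    return "response from " + origin


body = fetch_with_failover(
    ["primary.example.com", "secondary.example.com"], flaky_get
)
```

Note how the retry latency lands on every failed request: the caller always waits for the primary to fail before the secondary is tried, which is exactly the drawback listed below.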

Implementation from AWS (as-is from the AWS blog)

The hybrid origin failover pattern combines both approaches to get the best of both worlds. First, you configure both of your origins with a Failover Policy in Route 53 behind a single origin domain name. Then, you configure an origin failover group with the single origin domain name as primary origin, and the secondary origin domain name as secondary origin. This means that when the primary origin becomes unavailable, requests are immediately retried with the secondary origin until the stateful failover of Route 53 kicks in within tens of seconds, after which requests go directly to the secondary origin without any latency penalty. Note that this pattern only works with the GET/HEAD/OPTIONS HTTP methods.

Advantages

  • Near-instantaneous failover for failed requests.
  • Simple to configure and doesn’t depend on DNS TTL.

Disadvantages

  • Adds latency for failed requests due to retries.
  • Limited to specific HTTP methods like GET, HEAD, and OPTIONS.
  • Not suitable for scenarios where the primary origin is entirely down, as every request must fail first.

The Hybrid Origin Failover Pattern

The hybrid origin failover pattern combines the strengths of both approaches, mitigating their individual limitations. Here’s how it works:

  1. DNS-based Stateful Failover: A DNS service with health checks monitors the primary origin and switches to the secondary origin if the primary becomes unhealthy. This ensures a complete and stateful failover within tens of seconds.
  2. Application-level Stateless Failover: Simultaneously, the application or CDN is configured to retry failed requests with a secondary origin. This provides an immediate failover mechanism for transient or initial failures.

Implementation Steps

  1. DNS Configuration
    • Set up health checks on the primary origin.
    • Define a failover policy in the DNS service, which resolves the origin domain name to the secondary origin when the primary is unhealthy.
  2. Application Configuration
    • Configure the application or CDN to use an origin failover group.
    • Specify the primary origin domain as the primary origin and the secondary origin domain as the backup.

Behavior

  • Initially, if the primary origin encounters issues, requests are retried immediately with the secondary origin.
  • Meanwhile, the DNS failover switches all traffic to the secondary origin within tens of seconds, eliminating retry latencies for subsequent requests.

Benefits of Hybrid Origin Failover

  1. Faster Failover: Immediate retries for failed requests minimize initial impact, while DNS failover ensures long-term stability.
  2. Reduced Latency: After DNS failover, subsequent requests don’t experience retry delays.
  3. High Resilience: Combines stateful and stateless failover for robust redundancy.
  4. Simplicity and Scalability: Leverages existing DNS and application/CDN features without complex configurations.

Limitations and Considerations

  1. HTTP Method Constraints: Stateless failover works only for GET, HEAD, and OPTIONS methods, limiting its use for POST or PUT requests.
  2. TTL Impact: Low TTLs reduce propagation delays but increase DNS query rates, which could lead to higher costs.
  3. Configuration Complexity: Combining DNS and application-level failover requires careful setup and testing to avoid misconfigurations.
  4. Secondary Origin Capacity: Ensure the secondary origin can handle full traffic loads during failover.

Learning Notes #30 – Queue Based Loading | Cloud Patterns

3 January 2025 at 14:47

Today, I learnt about the Queue-Based Loading pattern, which helps manage intermittent peak load to a service via queues, essentially decoupling tasks from services. In this blog, I jot down notes on this pattern for my future self.

In today’s digital landscape, applications are expected to handle large-scale operations efficiently. Whether it’s processing massive data streams, ensuring real-time responsiveness, or integrating with multiple third-party services, scalability and reliability are paramount. One pattern that elegantly addresses these challenges is the Queue-Based Loading Pattern.

What Is the Queue-Based Loading Pattern?

The Queue-Based Loading Pattern leverages message queues to decouple and coordinate tasks between producers (such as applications or services generating data) and consumers (services or workers processing that data). By using queues as intermediaries, this pattern allows systems to manage workloads efficiently, ensuring seamless and scalable operation.

Key Components of the Pattern

  1. Producers: Producers are responsible for generating tasks or data. They send these tasks to a message queue instead of directly interacting with consumers. Examples include:
    • Web applications logging user activity.
    • IoT devices sending sensor data.
  2. Message Queue: The queue acts as a buffer, storing tasks until consumers are ready to process them. Popular tools for implementing queues include RabbitMQ, Apache Kafka, AWS SQS, and Redis.
  3. Consumers: Consumers retrieve messages from the queue and process them asynchronously. They are typically designed to handle tasks independently and at their own pace.
  4. Processing Logic: This is the core functionality that processes the tasks retrieved by consumers. For example, resizing images, sending notifications, or updating a database.

How It Works

  1. Task Generation: Producers push tasks to the queue as they are generated.
  2. Message Storage: The queue stores tasks in a structured manner (FIFO, priority-based, etc.) and ensures reliable delivery.
  3. Task Consumption: Consumers pull tasks from the queue, process them, and optionally acknowledge completion.
  4. Scalability: New consumers can be added dynamically to handle increased workloads, ensuring the system remains responsive.
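The four steps above can be sketched with nothing but Python's standard library: a `queue.Queue` stands in for the message broker, a producer pushes tasks, and a few worker threads consume them at their own pace. This is a toy illustration of the flow, not a replacement for RabbitMQ/Kafka/SQS.

```python
import queue
import threading

task_queue = queue.Queue()        # the buffer between producers and consumers
results = []
results_lock = threading.Lock()

def producer(n_tasks):
    # 1. Task Generation: push tasks as they are created
    for i in range(n_tasks):
        task_queue.put(f"task-{i}")

def consumer():
    # 3. Task Consumption: pull, process, acknowledge via task_done()
    while True:
        task = task_queue.get()
        if task is None:          # sentinel: no more work for this worker
            task_queue.task_done()
            break
        with results_lock:
            results.append(task.upper())   # stand-in for real processing
        task_queue.task_done()

# 4. Scalability: more workers = more throughput
workers = [threading.Thread(target=consumer) for _ in range(3)]
for w in workers:
    w.start()

producer(10)
for _ in workers:                 # one sentinel per worker
    task_queue.put(None)
task_queue.join()                 # wait until every task is acknowledged
for w in workers:
    w.join()

print(len(results))  # 10 processed tasks
```

Adding a fourth worker thread is all it takes to scale consumption, which is the core appeal of the pattern.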

Benefits of the Queue-Based Loading Pattern

  1. Decoupling: Producers and consumers operate independently, reducing tight coupling and improving system maintainability.
  2. Scalability: By adding more consumers, systems can easily scale to handle higher workloads.
  3. Fault Tolerance: If a consumer fails, messages remain in the queue, ensuring no data is lost.
  4. Load Balancing: Tasks are distributed evenly among consumers, preventing any single consumer from becoming a bottleneck.
  5. Asynchronous Processing: Consumers can process tasks in the background, freeing producers to continue generating data without delay.

Issues and Considerations

  1. Rate Limiting: Implement logic to control the rate at which services handle messages to prevent overwhelming the target resource. Test the system under load and adjust the number of queues or service instances to manage demand effectively.
  2. One-Way Communication: Message queues are inherently one-way. If tasks require responses, you may need to implement a separate mechanism for replies.
  3. Autoscaling Challenges: Be cautious when autoscaling consumers, as it can lead to increased contention for shared resources, potentially reducing the effectiveness of load leveling.
  4. Traffic Variability: Consider the variability of incoming traffic to avoid situations where tasks pile up faster than they are processed, creating a perpetual backlog.
  5. Queue Persistence: Ensure your queue is durable and capable of persisting messages. Crashes or system limits could lead to dropped messages, risking data loss.

Use Cases

  1. Email and Notification Systems: Sending bulk emails or push notifications without overloading the main application.
  2. Data Pipelines: Ingesting, transforming, and analyzing large datasets in real-time or batch processing.
  3. Video Processing: Queues facilitate tasks like video encoding and thumbnail generation.
  4. Microservices Communication: Ensures reliable and scalable communication between microservices.

Best Practices

  1. Message Durability: Configure your queue to persist messages to disk, ensuring they are not lost during system failures.
  2. Monitoring and Metrics: Use monitoring tools to track queue lengths, processing rates, and consumer health.
  3. Idempotency: Design consumers to handle duplicate messages gracefully.
  4. Error Handling and Dead Letter Queues (DLQs): Route failed messages to DLQs for later analysis and reprocessing.

Learning Notes #28 – Unlogged Table in Postgres

2 January 2025 at 17:30

Today, as part of my daily reading, I came across https://raphaeldelio.com/2024/07/14/can-postgres-replace-redis-as-a-cache/ where they discuss using Postgres as a cache and compare it with Redis! I was surprised at the title, so I gave it a read. That led me to the concept of the UNLOGGED table, which enables fast, cache-like retrieval. In this blog I jot down notes on unlogged tables for future reference.

Highly Recommended Links: https://martinheinz.dev/blog/105, https://raphaeldelio.com/2024/07/14/can-postgres-replace-redis-as-a-cache/, https://www.crunchydata.com/blog/postgresl-unlogged-tables

Unlogged tables offer unique benefits in scenarios where speed is paramount, and durability (the guarantee that data is written to disk and will survive crashes) is not critical.

What Are Unlogged Tables?

Postgres Architecture : https://miro.com/app/board/uXjVLD2T5os=/

In PostgreSQL, a table is a basic unit of data storage. By default, PostgreSQL ensures that data in regular tables is durable. This means that all data is written to the disk and will survive server crashes. However, in some situations, durability is not necessary. Unlogged tables are special types of tables in PostgreSQL where the database does not write data changes to the WAL (Write-Ahead Log).

The absence of WAL logging for unlogged tables makes them faster than regular tables because PostgreSQL doesn’t need to ensure data consistency across crashes for these tables. However, this also means that if the server crashes or the system is powered off, the data in unlogged tables is lost.

Key Characteristics of Unlogged Tables

  1. No Write-Ahead Logging (WAL) – By default, PostgreSQL writes changes to the WAL to ensure data durability. For unlogged tables, this step is skipped, making operations like INSERTs, UPDATEs, and DELETEs faster.
  2. No Durability – Because WAL writes are skipped, unlogged tables lose their data if the database crashes or shuts down uncleanly. This makes them unsuitable for critical data.
  3. Faster Performance – Since WAL writes are skipped, unlogged tables are faster for data insertion and modification. This can be beneficial for use cases where data is transient and doesn’t need to persist beyond the current session.
  4. Support for Indexes and Constraints – Unlogged tables can have indexes and constraints like regular tables. However, the data in these tables is still non-durable.
  5. Automatic Cleanup After a Crash – If PostgreSQL goes through crash recovery, unlogged tables are automatically truncated. Unlike temporary tables, however, they are visible to all sessions and their data survives a clean restart.

Drawbacks of Unlogged Tables

  1. Data Loss on Crash – The most significant disadvantage of unlogged tables is the loss of data after a crash or unclean shutdown. If the application depends on this data, unlogged tables are not appropriate.
  2. Not Suitable for Critical Applications – Applications that require data persistence (such as financial or inventory systems) should avoid using unlogged tables, as the risk of data loss outweighs any performance benefits.
  3. No Replication – Unlogged tables are not replicated in standby servers in a replication setup, as the data is not written to the WAL.

Creating an Unlogged Table

Creating an unlogged table is very straightforward in PostgreSQL. You simply need to add the UNLOGGED keyword when creating the table.


CREATE UNLOGGED TABLE temp_data (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    value INT
);

In this example, temp_data is an unlogged table. All operations performed on this table will not be logged to the WAL.
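An existing table can also be switched between logged and unlogged (PostgreSQL 9.5+). Note that `SET LOGGED` rewrites the entire table into the WAL, so it can be expensive on large tables:

```sql
-- Bulk-load quickly without WAL, then make the data durable
ALTER TABLE temp_data SET UNLOGGED;
-- ... heavy inserts ...
ALTER TABLE temp_data SET LOGGED;
```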

When to Avoid Unlogged Tables?

  • If you are working with critical data that needs to be durable and persistent across restarts.
  • If your application requires data replication, as unlogged tables are not replicated in standby servers.
  • If your workload involves frequent crash scenarios where data loss cannot be tolerated.

Examples

1. Temporary Storage for Processing


CREATE UNLOGGED TABLE etl_staging (
    source_id INT,
    raw_data JSONB,
    processed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Insert raw data into the staging table
INSERT INTO etl_staging (source_id, raw_data)
VALUES 
    (1, '{"key": "value1"}'),
    (2, '{"key": "value2"}');

-- Perform transformations on the data
INSERT INTO final_table (id, key, value)
SELECT source_id, 
       raw_data->>'key' AS key, 
       'processed_value' AS value
FROM etl_staging;

-- Clear the staging table
TRUNCATE TABLE etl_staging;

2. Caching


CREATE UNLOGGED TABLE user_sessions (
    session_id UUID PRIMARY KEY,
    user_id INT,
    last_accessed TIMESTAMP DEFAULT NOW()
);

-- Insert session data (uuid_generate_v4() requires the uuid-ossp extension)
INSERT INTO user_sessions (session_id, user_id)
VALUES 
    (uuid_generate_v4(), 101),
    (uuid_generate_v4(), 102);

-- Update last accessed timestamp
UPDATE user_sessions
SET last_accessed = NOW()
WHERE session_id = 'some-session-id';

-- Delete expired sessions
DELETE FROM user_sessions WHERE last_accessed < NOW() - INTERVAL '1 hour';

Learning Notes #25 – Valet Key Pattern | Cloud Patterns

1 January 2025 at 17:20

Today, I learnt about the Valet Key Pattern, which lets clients access resources directly, without routing through the server, by using a token. In this blog, I jot down notes on the valet key pattern for better understanding.

The Valet Key Pattern is a security design pattern used to provide limited access to a resource or service without exposing full access credentials or permissions. It is akin to a physical valet key for a car, which allows the valet to drive the car without accessing the trunk or glove box. This pattern is widely employed in distributed systems, cloud services, and API design to ensure secure and controlled resource sharing.

Why Use the Valet Key Pattern?

Modern systems often require sharing access to specific resources while minimizing security risks. For instance:

  • A mobile app needs to upload files to a storage bucket but shouldn’t manage the entire bucket.
  • A third-party service requires temporary access to a user’s resource, such as a document or media file.
  • A system needs to allow time-bound or operation-restricted access to sensitive data.

In these scenarios, the Valet Key Pattern provides a practical solution by issuing a scoped, temporary, and revocable token (valet key) that grants specific permissions.

Core Principles of the Valet Key Pattern

  1. Scoped Access: The valet key grants access only to specific resources or operations.
  2. Time-Limited: The access token is typically valid for a limited duration to minimize exposure.
  3. Revocable: The issuing entity can revoke the token if necessary.
  4. Minimal Permissions: Permissions are restricted to the least privilege required to perform the intended task.

How the Valet Key Pattern Works

1. Resource Owner Issues a Valet Key

The resource owner (or controlling entity) generates a token with limited permissions. This token is often a signed JSON Web Token (JWT) or a pre-signed URL in the case of cloud storage.

2. Token Delivery to the Client

The token is securely delivered to the client or third-party application requiring access. For instance, the token might be sent via HTTPS or embedded in an API response.

3. Client Uses the Valet Key

The client includes the token in subsequent requests to access the resource. The resource server validates the token, checks its permissions, and allows or denies the requested operation accordingly.

4. Expiry or Revocation

Once the token expires or is revoked, it becomes invalid, ensuring the client can no longer access the resource.
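The four steps above can be sketched with only the standard library: an HMAC-signed token that encodes a resource, an allowed operation, and an expiry time. This is an illustration of the idea under made-up names; real systems would use JWTs or cloud pre-signed URLs.

```python
import hashlib
import hmac
import time

SECRET = b"server-side-signing-key"  # hypothetical key, never shared with clients

def issue_valet_key(resource: str, operation: str, ttl_seconds: int) -> str:
    # 1. Resource owner issues a scoped, time-limited token
    expires = int(time.time()) + ttl_seconds
    payload = f"{resource}|{operation}|{expires}"
    signature = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{signature}"

def validate_valet_key(token: str, resource: str, operation: str) -> bool:
    # 3. Resource server validates signature, expiry, and scope
    try:
        res, op, expires, signature = token.split("|")
    except ValueError:
        return False
    payload = f"{res}|{op}|{expires}"
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return False                          # tampered token
    if int(expires) < time.time():
        return False                          # 4. expired token is rejected
    return res == resource and op == operation  # minimal-permission scope check

# 2. Client receives the token and presents it with each request
token = issue_valet_key("reports/2024.pdf", "read", ttl_seconds=900)
print(validate_valet_key(token, "reports/2024.pdf", "read"))    # True
print(validate_valet_key(token, "reports/2024.pdf", "write"))   # False: out of scope
```

Revocation is the one step this sketch omits; in practice the server also checks the token against a revocation list or relies on short lifetimes.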

Examples of the Valet Key Pattern in Action

1. Cloud Storage (Pre-signed URLs)

Amazon S3, Google Cloud Storage, and Azure Blob Storage allow generating pre-signed URLs that enable temporary, scoped access to specific files. For example, a user can upload a file using a URL valid for 15 minutes without needing direct access credentials.

2. API Design

APIs often issue temporary access tokens for limited operations. OAuth 2.0 tokens, for instance, can be scoped to allow access to specific endpoints or resources.

3. Media Sharing Platforms

Platforms like YouTube or Dropbox use the Valet Key Pattern to provide limited access to files. A shareable link often embeds permissions and expiration details.

Implementation Steps

1. Define Permissions Scope

Identify the specific operations or resources the token should allow. Use the principle of least privilege to limit permissions.

2. Generate Secure Tokens

Create tokens with cryptographic signing to ensure authenticity. Include metadata such as:

  • Resource identifiers
  • Permissions
  • Expiry time
  • Issuer information

3. Validate Tokens

The resource server must validate incoming tokens by checking the signature, expiration, and permissions.

4. Monitor and Revoke

Maintain a mechanism to monitor token usage and revoke them if misuse is detected.

Best Practices

  1. Use HTTPS: Always transmit tokens over secure channels to prevent interception.
  2. Minimize Token Lifetime: Short-lived tokens reduce the risk of misuse.
  3. Implement Auditing: Log token usage for monitoring and troubleshooting.
  4. Employ Secure Signing: Use robust cryptographic algorithms to sign tokens and prevent tampering.

Challenges

  • Token Management: Requires robust infrastructure for token generation, validation, and revocation.
  • Revocation Delays: Invalidation mechanisms may not instantly propagate in distributed systems.

Learning Notes #24 – Competing Consumer | Messaging Queue Patterns

1 January 2025 at 09:45

Today, I learnt about the competing consumer pattern, a simple concept of consuming messages with many consumers. In this blog, I jot down notes on competing consumers for better understanding.

The competing consumer pattern is a commonly used design paradigm in distributed systems for handling workloads efficiently. It addresses the challenge of distributing tasks among multiple consumers to ensure scalability, reliability, and better resource utilization. In this blog, we’ll delve into the details of this pattern, its implementation, and its benefits.

What is the Competing Consumer Pattern?

The competing consumer pattern involves multiple consumers that independently compete to process messages or tasks from a shared queue. This pattern is particularly effective in scenarios where the rate of incoming tasks is variable or high, as it allows multiple consumers to process tasks concurrently.

Key Components

  1. Producer: The component that generates tasks or messages.
  2. Queue: A shared storage medium (often a message broker) that holds tasks until a consumer is ready to process them.
  3. Consumer: The component that processes tasks. Multiple consumers operate concurrently and compete for tasks in the queue.
  4. Message Broker: Middleware (e.g., RabbitMQ, Kafka) that manages the queue and facilitates communication between producers and consumers.

How It Works (Messages as Tasks)

  1. Task Generation
    • Producers create tasks and push them into the queue.
    • Tasks can represent anything, such as processing an image, sending an email, or handling a database operation.
  2. Task Storage
    • The queue temporarily stores tasks until they are picked up by consumers.
    • Queues often support features like message persistence and delivery guarantees to enhance reliability.
  3. Task Processing
    • Consumers pull tasks from the queue and process them independently.
    • Each consumer works on one task at a time, and no two consumers process the same task simultaneously.
  4. Task Completion
    • Upon successful processing, the consumer acknowledges the task’s completion to the message broker.
    • The message broker then removes the task from the queue.

Handling Poison Messages

A poison message is a task or message that a consumer repeatedly fails to process. Poison messages can cause delays, block the queue, or crash consumers if not handled appropriately.

Strategies for Handling Poison Messages

  1. Retry Mechanism
    • Allow a fixed number of retries for a task before marking it as failed.
    • Use exponential backoff to reduce the load on the system during retries.
  2. Dead Letter Queue (DLQ)
    • Configure a Dead Letter Queue to store messages that cannot be processed after a predefined number of attempts.
    • Poison messages in the DLQ can be analyzed or reprocessed manually.
  3. Logging and Alerting
    • Log details about the poison message for further debugging.
    • Set up alerts to notify administrators when a poison message is encountered.
  4. Idempotent Consumers
    • Design consumers to handle duplicate processing gracefully. This prevents issues if a message is retried multiple times.
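The retry-with-exponential-backoff idea from strategy 1 can be sketched as a small helper that computes capped delays; a consumer would sleep for `backoff_delay(attempt)` before retrying and hand the message to a DLQ once attempts are exhausted. The function names and parameters here are illustrative, not from any library.

```python
def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay in seconds before retry `attempt` (0-based), doubling each time."""
    return min(cap, base * (2 ** attempt))

MAX_RETRIES = 5

def handle_with_retries(process, message) -> bool:
    """Return True if processed; False means 'route to the dead letter queue'."""
    for attempt in range(MAX_RETRIES):
        try:
            process(message)
            return True
        except Exception:
            delay = backoff_delay(attempt)
            # in a real consumer: time.sleep(delay) or a delayed re-queue
    return False  # poison message: hand off to the DLQ for manual analysis

print([backoff_delay(a) for a in range(6)])  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
```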

RabbitMQ Example

Producer


import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

channel.queue_declare(queue='task_queue', durable=True)

messages = ["Task 1", "Task 2", "Task 3"]

for message in messages:
    channel.basic_publish(
        exchange='',
        routing_key='task_queue',
        body=message,
        properties=pika.BasicProperties(
            delivery_mode=2,  # Makes the message persistent
        )
    )
    print(f"[x] Sent {message}")

connection.close()

Dead Letter Exchange


channel.queue_declare(queue='task_queue', durable=True, arguments={
    'x-dead-letter-exchange': 'dlx_exchange'
})
channel.exchange_declare(exchange='dlx_exchange', exchange_type='fanout')
channel.queue_declare(queue='dlq', durable=True)
channel.queue_bind(exchange='dlx_exchange', queue='dlq')

Consumer Code


import pika
import time

def callback(ch, method, properties, body):
    try:
        print(f"[x] Received {body}")
        # Simulate task processing
        if body == b"Task 2":
            raise ValueError("Cannot process this message")
        time.sleep(1)
        print(f"[x] Processed {body}")
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception as e:
        print(f"[!] Failed to process message: {body}, error: {e}")
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

channel.queue_declare(queue='task_queue', durable=True)
print('[*] Waiting for messages. To exit press CTRL+C')

channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue='task_queue', on_message_callback=callback)

channel.start_consuming()

Benefits of the Competing Consumer Pattern

  1. Scalability – Adding more consumers allows the system to handle higher workloads.
  2. Fault Tolerance – If a consumer fails, other consumers can continue processing tasks.
  3. Resource Optimization – Consumers can be distributed across multiple machines to balance the load.
  4. Asynchronous Processing – Decouples task generation from task processing, enabling asynchronous workflows.

Challenges and Considerations

  1. Message Duplication – In some systems, messages may be delivered more than once. Implement idempotent processing to handle duplicates.
  2. Load Balancing – Ensure tasks are evenly distributed among consumers to avoid bottlenecks.
  3. Queue Overload – High task rates may lead to queue overflow. Use rate limiting or scale your infrastructure to prevent this.
  4. Monitoring and Metrics – Implement monitoring to track queue sizes, processing rates, and consumer health.
  5. Poison Messages – Implement a robust strategy for handling poison messages, such as using a DLQ or retry mechanism.
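Point 1 (message duplication) is usually handled by making consumers idempotent, e.g. by remembering processed message IDs. A toy sketch; in production the seen-ID store would be Redis or a database table rather than an in-memory set:

```python
processed_ids = set()   # stand-in for a durable store (Redis, DB table)
side_effects = []

def handle_message(message_id: str, payload: str) -> None:
    # At-least-once delivery: skip messages we have already processed
    if message_id in processed_ids:
        return
    side_effects.append(payload)      # the real work happens exactly once
    processed_ids.add(message_id)

handle_message("m1", "charge card")
handle_message("m1", "charge card")   # duplicate delivery: ignored
handle_message("m2", "send email")
print(side_effects)  # ['charge card', 'send email']
```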

References

  1. https://www.enterpriseintegrationpatterns.com/patterns/messaging/CompetingConsumers.html
  2. https://dev.to/willvelida/the-competing-consumers-pattern-4h5n
  3. https://medium.com/event-driven-utopia/competing-consumers-pattern-explained-b338d54eff2b

Learning Notes #22 – Claim Check Pattern | Cloud Pattern

31 December 2024 at 17:03

Today, I learnt about the claim check pattern, which describes how to handle large messages in a queue. Every message broker has a defined message size limit; if a message exceeds that limit, it won't go through.

The Claim Check Pattern emerges as a pivotal architectural design to address the challenge of managing large payloads in a decoupled and efficient manner. In this blog, I jot down notes on my learning for my future self.

What is the Claim Check Pattern?

The Claim Check Pattern is a messaging pattern used in distributed systems to manage large messages efficiently. Instead of transmitting bulky data directly between services, this pattern extracts and stores the payload in a dedicated storage system (e.g., object storage or a database).

A lightweight reference or “claim check” is then sent through the message queue, which the receiving service can use to retrieve the full data from the storage.

This pattern is inspired by the physical process of checking in luggage at an airport: you hand over your luggage, receive a claim check (a token), and later use it to retrieve your belongings.

How Does the Claim Check Pattern Work?

The process typically involves the following steps

  1. Data Submission – The sender service splits a message into two parts:
    • Metadata: A small piece of information that provides context about the data.
    • Payload: The main body of data that is too large or sensitive to send through the message queue.
  2. Storing the Payload
    • The sender uploads the payload to a storage service (e.g., AWS S3, Azure Blob Storage, or Google Cloud Storage).
    • The storage service returns a unique identifier (e.g., a URL or object key).
  3. Sending the Claim Check
    • The sender service places the metadata and the unique identifier (claim check) onto the message queue.
  4. Receiving the Claim Check
    • The receiver service consumes the message from the queue, extracts the claim check, and retrieves the payload from the storage system.
  5. Processing
    • The receiver processes the payload alongside the metadata as required.

Use Cases

1. Media Processing Pipelines – In video transcoding systems, raw video files can be uploaded to storage while metadata (e.g., video format and length) is passed through the message queue.

2. IoT Systems – IoT devices generate large datasets. Using the Claim Check Pattern ensures efficient transmission and processing of these data chunks.

3. Data Processing Workflows – In big data systems, datasets can be stored in object storage while processing metadata flows through orchestration tools like Apache Airflow.

4. Event-Driven Architectures – For systems using event-driven models, large event payloads can be offloaded to storage to avoid overloading the messaging layer.

Example with RabbitMQ

1. Sender Service


import boto3
import pika

s3 = boto3.client('s3')
bucket_name = 'my-bucket'
object_key = 'data/large-file.txt'

s3.upload_file('large-file.txt', bucket_name, object_key)  # returns None; raises on failure
claim_check = f's3://{bucket_name}/{object_key}'

# Connect to RabbitMQ
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Declare a queue
channel.queue_declare(queue='claim_check_queue')

# Send the claim check
message = {
    'metadata': 'Some metadata',
    'claim_check': claim_check
}
channel.basic_publish(exchange='', routing_key='claim_check_queue', body=str(message))

connection.close()

2. Consumer


import ast  # safely parse the message body (avoids eval() on untrusted input)
import boto3
import pika

s3 = boto3.client('s3')

# Connect to RabbitMQ
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Declare a queue
channel.queue_declare(queue='claim_check_queue')

# Callback function to process messages
def callback(ch, method, properties, body):
    message = ast.literal_eval(body.decode())
    claim_check = message['claim_check']

    # Use the claim check to fetch the full payload from S3
    bucket_name, object_key = claim_check.replace('s3://', '').split('/', 1)
    s3.download_file(bucket_name, object_key, 'retrieved-large-file.txt')
    print("Payload retrieved and processed.")

# Consume messages
channel.basic_consume(queue='claim_check_queue', on_message_callback=callback, auto_ack=True)

print('Waiting for messages. To exit press CTRL+C')
channel.start_consuming()

References

  1. https://learn.microsoft.com/en-us/azure/architecture/patterns/claim-check
  2. https://medium.com/@dmosyan/claim-check-design-pattern-603dc1f3796d

Learning Notes #19 – Blue Green Deployments – An near ZERO downtime deployment

30 December 2024 at 18:19

Today, I got a refresher on Blue-Green Deployment from a podcast: https://open.spotify.com/episode/03p86zgOuSEbNezK71CELH. Deployment design is an area I haven't touched yet. In this blog I jot down notes on blue-green deployment for my future self.

What is Blue-Green Deployment?

Blue-Green Deployment is a release management strategy that involves maintaining two identical environments, referred to as “Blue” and “Green.” At any point in time, only one environment is live (receiving traffic), while the other remains idle or in standby. Updates are deployed to the idle environment, thoroughly tested, and then switched to live with minimal downtime.

How It Works

  • This approach involves setting up two environments: the Blue environment, which serves live traffic, and the Green environment, a replica used for staging updates.
  • Updates are first deployed to the Green environment, where comprehensive testing is performed to ensure functionality, performance, and integration meet expectations.
  • Once testing is successful, the routing mechanism, such as a DNS or API Gateway or load balancer, is updated to redirect traffic from the Blue environment to the Green environment.
  • The Green environment then becomes live, while the Blue environment transitions to an idle state.
  • If issues arise, traffic can be reverted to the Blue environment for a quick recovery with minimal impact.
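The deploy, switch, and rollback steps above can be sketched as a tiny routing shim. In reality the "router" is DNS, a load balancer, or an API gateway; the class and version names here are illustrative.

```python
class BlueGreenRouter:
    def __init__(self):
        self.environments = {"blue": "v1.0", "green": "v1.0"}
        self.live = "blue"                    # Blue serves traffic initially

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> None:
        # Updates (and tests) happen on the idle environment only
        self.environments[self.idle] = version

    def switch(self) -> None:
        # Flip traffic; the previously live env stays intact for rollback
        self.live = self.idle

    def serve(self) -> str:
        return self.environments[self.live]

router = BlueGreenRouter()
router.deploy("v2.0")       # Green now runs v2.0, Blue is still live on v1.0
print(router.serve())       # v1.0
router.switch()             # cutover: Green goes live
print(router.serve())       # v2.0
router.switch()             # rollback: Blue still holds v1.0
print(router.serve())       # v1.0
```

Because the old version is never destroyed, rollback is just another switch, which is the key property the pattern buys you.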

Benefits of Blue-Green Deployment

  • Blue-Green Deployment provides zero downtime during the deployment process, ensuring uninterrupted user experiences.
  • Rollbacks are simplified because the previous version remains intact in the Blue environment, enabling quick reversion if necessary. Forward and backward compatibility must be considered, e.g., for database schema changes.
  • It also allows seamless testing in the Green environment before updates go live, reducing risks by isolating production from deployment issues.

Challenges and Considerations

  • Maintaining two identical environments can be resource intensive.
  • Ensuring synchronization between environments is critical to prevent discrepancies in configuration and data.
  • Handling live database changes during the environment switch is complex, requiring careful planning for database migrations.

Implementing Blue-Green Deployment (Not Yet Tried)

  • Several tools and platforms support Blue-Green Deployment. Kubernetes simplifies managing multiple environments through namespaces and services.
  • AWS Elastic Beanstalk offers built-in support for Blue-Green Deployment, while HashiCorp Terraform automates the setup of Blue-Green infrastructure.
  • To implement this strategy, organizations should design infrastructure capable of supporting two identical environments, automate deployments using CI/CD pipelines, monitor and test thoroughly, and define rollback procedures to revert to previous versions when necessary.


Learning Notes #18 – Bulk Head Pattern (Resource Isolation) | Cloud Pattern

30 December 2024 at 17:48

Today, I learned about the bulkhead pattern and how it makes a system resilient to failures and resource exhaustion. In this blog I jot down notes on this pattern for better understanding.

In today’s world of distributed systems and microservices, resiliency is key to ensuring applications are robust and can withstand failures.

The Bulkhead Pattern is a design principle used to improve system resilience by isolating different parts of a system to prevent failure in one component from cascading to others.

What is the Bulkhead Pattern?

The term “bulkhead” originates from shipbuilding, where bulkheads are partitions that divide a ship into separate compartments. If one compartment is breached, the others remain intact, preventing the entire ship from sinking. Similarly, in software design, the Bulkhead Pattern isolates components or services so that a failure in one part does not bring down the entire system.

In software systems, bulkheads:

  • Isolate resources (e.g., threads, database connections, or network calls) for different components.
  • Limit the scope of failures.
  • Allow other parts of the system to continue functioning even if one part is degraded or completely unavailable.

Example

Consider an e-commerce application with a product-service that has two endpoints

  1. /product/{id} – This endpoint gives detailed information about a specific product, including ratings and reviews. It depends on the rating-service.
  2. /products – This endpoint provides a catalog of products based on search criteria. It does not depend on any external services.

Now suppose product-service has a fixed pool of resources and is flooded with /product/{id} calls: those calls can monopolize the thread pool. This delays /products requests, causing users to experience slowness even though these requests are independent, and ultimately leads to resource exhaustion and failures.

With the bulkhead pattern, we can allocate separate clients and connection pools to isolate each interaction. For example, we can give /product/{id} requests a connection pool of 10 and /products requests a separate connection pool of 5.

Even if /product/{id} requests are slow or encounter high traffic, /products requests remain unaffected.
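The pool split above can be sketched with two bounded semaphores, one per endpoint: a flood of slow /product/{id} calls exhausts only its own 10 slots while /products keeps its 5. The pool sizes and names follow the hypothetical example, and "fail fast" here stands in for whatever rejection policy (queueing, 503s) a real service would use.

```python
import threading

# Separate bulkheads: each endpoint gets its own fixed pool
product_detail_pool = threading.BoundedSemaphore(10)  # /product/{id}
catalog_pool = threading.BoundedSemaphore(5)          # /products

def call_with_bulkhead(pool: threading.BoundedSemaphore, work):
    # Fail fast instead of waiting when this endpoint's pool is exhausted
    if not pool.acquire(blocking=False):
        return "rejected: bulkhead full"
    try:
        return work()
    finally:
        pool.release()

# Simulate /product/{id} traffic holding all 10 of its slots
held = [product_detail_pool.acquire(blocking=False) for _ in range(10)]

print(call_with_bulkhead(product_detail_pool, lambda: "detail"))  # rejected: bulkhead full
print(call_with_bulkhead(catalog_pool, lambda: "catalog"))        # catalog
```

The detail endpoint being saturated never touches the catalog pool, which is exactly the isolation the pattern promises.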

Scenarios Where the Bulkhead Pattern is Needed

  1. Microservices with Shared Resources – In a microservices architecture, multiple services might share limited resources such as database connections or threads. If one service experiences a surge in traffic or a failure, it can exhaust these shared resources, impacting all other services. Bulkheading ensures each service gets a dedicated pool of resources, isolating the impact of failures.
  2. Prioritizing Critical Workloads – In systems with mixed workloads (e.g., processing user transactions and generating reports), critical operations like transaction processing must not be delayed or blocked by less critical tasks. Bulkheading allocates separate resources to ensure critical tasks have priority.
  3. Third-Party API Integration – When an application depends on multiple external APIs, one slow or failing API can delay the entire application if not isolated. Using bulkheads ensures that issues with one API do not affect interactions with others.
  4. Multi-Tenant Systems – In SaaS applications serving multiple tenants, a single tenant’s high resource consumption or failure should not degrade the experience for others. Bulkheads can segregate resources per tenant to maintain service quality.
  5. Cloud-Native Applications – In cloud environments, services often scale independently. A spike in one service’s load should not overwhelm shared backend systems. Bulkheads help isolate and manage these spikes.
  6. Event-Driven Systems – In event-driven architectures with message queues, processing backlogs for one type of event can delay others. By applying the Bulkhead Pattern, separate processing pipelines can handle different event types independently.

What are the Key Points of the Bulkhead Pattern? (Simplified)

  • Define Partitions – (Think of a ship) it’s divided into compartments (partitions) to keep water from flooding the whole ship if one section gets damaged. In software, these partitions are designed around how the application works and its technical needs.
  • Designing with Context – If you’re using a design approach like DDD (Domain-Driven Design), make sure your bulkheads (partitions) match the business logic boundaries.
  • Choosing Isolation Levels – Decide how much isolation is needed. For example: Threads for lightweight tasks. Separate containers or virtual machines for more critical separations. Balance between keeping things separate and the costs or extra effort involved.
  • Combining Other Techniques – Bulkheads work even better with patterns like Retry, Circuit Breaker, Throttling.
  • Monitoring – Keep an eye on each partition’s performance. If one starts getting overloaded, you can adjust resources or change limits.

When Should You Use the Bulkhead Pattern?

  • To Isolate Critical Resources – If one part of your system fails, other parts can keep working. For example, you don’t want search functionality to stop working because the reviews section is down.
  • To Prioritize Important Work – For example, make sure payment processing (critical) is separate from background tasks like sending emails.
  • To Avoid Cascading Failures – If one part of the system gets overwhelmed, it won’t drag down everything else.
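To make the isolation concrete, here is a minimal sketch of a fail-fast bulkhead in Python (the `Bulkhead` class and the reviews/search naming are my own illustration, not from a specific library):

```python
import threading

class Bulkhead:
    """Limits concurrent calls to a partition; rejects instead of queueing."""
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args):
        # Fail fast when the partition is full, so a flood in this
        # partition never piles up and starves the rest of the system.
        if not self._slots.acquire(blocking=False):
            return "rejected: bulkhead full"
        try:
            return fn(*args)
        finally:
            self._slots.release()

# Reviews get their own small partition; exhausting it cannot
# affect a separately-partitioned feature like search.
reviews = Bulkhead(max_concurrent=2)
print(reviews.run(lambda: "reviews loaded"))   # served

reviews._slots.acquire(blocking=False)         # simulate two in-flight calls
reviews._slots.acquire(blocking=False)
print(reviews.run(lambda: "reviews loaded"))   # rejected: bulkhead full
```

A real implementation would raise a dedicated exception instead of returning a string, and pair this with a timeout or circuit breaker.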

When Should You Avoid It?

  • Complexity Isn’t Needed – If your system is simple, adding bulkheads might just make it harder to manage.
  • Resource Efficiency is Critical – Sometimes, splitting resources into separate pools can mean less efficient use of those resources. If every thread, connection, or container is underutilized, this might not be the best approach.

Challenges and Best Practices

  1. Overhead: Maintaining separate resource pools can increase system complexity and resource utilization.
  2. Resource Sizing: Properly sizing the pools is critical to ensure resources are efficiently utilized without bottlenecks.
  3. Monitoring: Use tools to monitor the health and performance of each resource pool to detect bottlenecks or saturation.
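The separate resource pools described above can also be sketched with one thread pool per workload; pool sizes and task names below are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Separate pools act as bulkheads: exhausting one cannot starve the other.
critical_pool = ThreadPoolExecutor(max_workers=8)    # e.g. payment processing
background_pool = ThreadPoolExecutor(max_workers=2)  # e.g. report generation

def process_payment(order_id):
    return f"payment processed for order {order_id}"

def generate_report(report_id):
    return f"report {report_id} generated"

# Even if every background worker is busy or hung, critical work still runs,
# because the two workloads never compete for the same threads.
payment_future = critical_pool.submit(process_payment, 42)
report_future = background_pool.submit(generate_report, 7)

print(payment_future.result())
print(report_future.result())
```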

References:

  1. AWS https://aws.amazon.com/blogs/containers/building-a-fault-tolerant-architecture-with-a-bulkhead-pattern-on-aws-app-mesh/
  2. Resilience https://resilience4j.readme.io/docs/bulkhead
  3. https://medium.com/nerd-for-tech/bulkhead-pattern-distributed-design-pattern-c673d5e81523
  4. Microsoft https://learn.microsoft.com/en-us/azure/architecture/patterns/bulkhead

Learning Notes #12 – Alternate Exchanges | RabbitMQ

27 December 2024 at 10:36

Today I learnt about Alternate Exchanges, which provide a way to handle undeliverable messages. In this blog, I share my notes on what alternate exchanges are, why they are useful, and how to implement them in your RabbitMQ setup.

What Are Alternate Exchanges?

In the normal flow, a producer sends a message to an exchange, and if a queue is bound correctly, the message is placed in the correct queue.

An alternate exchange in RabbitMQ is a fallback exchange configured for another exchange. If a message cannot be routed to any queue bound to the primary exchange, RabbitMQ will publish the message to the alternate exchange instead. This mechanism ensures that undeliverable messages are not lost but can be processed in a different way, such as logging, alerting, or storing them for later inspection.

When does this scenario happen?

A message goes to an alternate exchange in RabbitMQ in the following scenarios:

1. No Binding for the Routing Key

  • The primary exchange does not have any queue bound to it with the routing key specified in the message.
  • Example: A message with routing key invalid_key is sent to a direct exchange that has no queue bound to invalid_key.

2. Unbound Queues:

  • Even if a queue exists, it is not bound to the primary exchange or the specific routing key used in the message.
  • Example: A queue exists for the primary exchange but is not explicitly bound to any routing key.

3. Exchange Type Mismatch

  • The exchange type (e.g., direct, fanout, topic) does not match the routing pattern of the message.
  • Example: A message is sent with a specific routing key to a fanout exchange that has no queues bound to it, so it cannot be delivered regardless of the key.

4. Misconfigured Bindings

  • Bindings exist but do not align with the routing requirements of the message.
  • Example: A topic exchange has a binding for user.* but receives a message with the routing key order.processed.

5. Queue Deletion After Binding

  • A queue was bound to the exchange but is deleted or unavailable at runtime.
  • Example: A message with a valid routing key arrives, but the corresponding queue is no longer active.

6. TTL (Time-to-Live) Expired Queues

  • Messages routed to a queue with a time-to-live setting expire before being consumed and, if dead-lettering is enabled, are re-routed. Note that this path uses the dead-letter exchange, a mechanism related to (but distinct from) alternate exchanges.
  • Example: A primary exchange routes messages to a TTL-bound queue, and expired messages are forwarded to the configured dead-letter exchange.

7. Exchange Misconfiguration

  • The primary exchange is operational, but its configurations prevent messages from being delivered to any queue.
  • Example: A missing or incorrect alternate-exchange argument setup leads to misrouting.

Use Cases for Alternate Exchanges

  • Error Handling: Route undeliverable messages to a dedicated queue for later inspection or reprocessing.
  • Logging: Keep track of messages that fail routing for auditing purposes.
  • Dead Letter Queues: Use alternate exchanges to implement dead-letter queues to analyze why messages could not be routed.
  • Load Balancing: Forward undeliverable messages to another exchange for alternative processing.

How to Implement Alternate Exchanges in Python

Let’s walk through the steps to configure and use alternate exchanges in RabbitMQ using Python.

Scenario 1: Handling Messages with Valid and Invalid Routing Keys

producer.py

import pika

# Connect to RabbitMQ
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Declare the alternate exchange
channel.exchange_declare(exchange='alternate_exchange', exchange_type='fanout')

# Declare a queue and bind it to the alternate exchange
channel.queue_declare(queue='unroutable_queue')
channel.queue_bind(exchange='alternate_exchange', queue='unroutable_queue')

# Declare the primary exchange with an alternate exchange argument
channel.exchange_declare(
    exchange='primary_exchange',
    exchange_type='direct',
    arguments={'alternate-exchange': 'alternate_exchange'}
)

# Declare and bind a queue to the primary exchange
channel.queue_declare(queue='valid_queue')
channel.queue_bind(exchange='primary_exchange', queue='valid_queue', routing_key='key1')

# Publish a message with a valid routing key
channel.basic_publish(
    exchange='primary_exchange',
    routing_key='key1',
    body='Message with a valid routing key'
)

print("Message with valid routing key sent to 'valid_queue'.")

# Publish a message with an invalid routing key
channel.basic_publish(
    exchange='primary_exchange',
    routing_key='invalid_key',
    body='Message with an invalid routing key'
)

print("Message with invalid routing key sent to 'alternate_exchange'.")

# Close the connection
connection.close()

consumer.py

import pika

# Connect to RabbitMQ
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Consume messages from the alternate queue
method_frame, header_frame, body = channel.basic_get(queue='unroutable_queue', auto_ack=True)
if method_frame:
    print(f"Received message from alternate queue: {body.decode()}")
else:
    print("No messages in the alternate queue")

# Close the connection
connection.close()

Scenario 2: Logging Unroutable Messages

producer.py

import pika

# Connect to RabbitMQ
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Declare the alternate exchange
channel.exchange_declare(exchange='logging_exchange', exchange_type='fanout')

# Declare a logging queue and bind it to the logging exchange
channel.queue_declare(queue='logging_queue')
channel.queue_bind(exchange='logging_exchange', queue='logging_queue')

# Declare the primary exchange with a logging alternate exchange argument
channel.exchange_declare(
    exchange='primary_logging_exchange',
    exchange_type='direct',
    arguments={'alternate-exchange': 'logging_exchange'}
)

# Publish a message with an invalid routing key
channel.basic_publish(
    exchange='primary_logging_exchange',
    routing_key='invalid_logging_key',
    body='Message for logging'
)

print("Message with invalid routing key sent to 'logging_exchange'.")

# Close the connection
connection.close()

consumer.py

import pika

# Connect to RabbitMQ
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Consume messages from the logging queue
method_frame, header_frame, body = channel.basic_get(queue='logging_queue', auto_ack=True)
if method_frame:
    print(f"Logged message: {body.decode()}")
else:
    print("No messages in the logging queue")

# Close the connection
connection.close()

Learning Notes #11 – Sidecar Pattern | Cloud Patterns

26 December 2024 at 17:40

Today, I learnt about the Sidecar Pattern. It seems to be about offloading common functionalities (logging, networking, …) into a separate container within a pod, to be used by the other apps in the pod.

It’s not only about pods, but other deployments as well. In this blog, I am going to curate the items I have learnt for my future self. It’s a pattern, not a strict rule.

What is a Sidecar?

Imagine you’re riding a motorbike, and you attach a little sidecar to carry your friend or groceries. The sidecar isn’t part of the motorbike’s engine or core mechanism, but it helps you achieve your goals—whether it’s carrying more stuff or having a buddy ride along.

In the software world, a sidecar is a similar concept. It’s a separate process or container that runs alongside a primary application. Like the motorbike’s sidecar, it supports the main application by offloading or enhancing certain tasks without interfering with its core functionality.

Why Use a Sidecar?

In traditional applications, all responsibilities (logging, communication, monitoring, etc.) are bundled into the main application. This approach can make the application complex and harder to manage. Sidecars address this by handling auxiliary tasks separately, so the main application can focus on its primary purpose.

Here are some key reasons to use a sidecar

  1. Modularity: Sidecars separate responsibilities, making the system easier to develop, test, and maintain.
  2. Reusability: The same sidecar can be used across multiple services. And its language agnostic.
  3. Scalability: You can scale the sidecar independently from the main application.
  4. Isolation: Sidecars provide a level of isolation, reducing the risk of one part affecting the other.

Real-Life Analogies

To make the concept clearer, here are some real-world analogies:

  1. Coffee Maker with a Milk Frother:
    • The coffee maker (main application) brews coffee.
    • The milk frother (sidecar) prepares frothed milk for your latte.
    • Both work independently but combine their outputs for a better experience.
  2. Movie Subtitles:
    • The movie (main application) provides the visuals and sound.
    • The subtitles (sidecar) add clarity for those who need them.
    • You can watch the movie with or without subtitles—they’re optional but enhance the experience.
  3. A School with a Sports Coach:
    • The school (main application) handles education.
    • The sports coach (sidecar) focuses on physical training.
    • Both have distinct roles but contribute to the overall development of students.

Some Random Sidecar Ideas in Software

Let’s look at how sidecars are used in actual software scenarios

  1. Service Meshes (e.g., Istio, Linkerd):
    • A service mesh helps microservices communicate with each other reliably and securely.
    • The sidecar (proxy like Envoy) handles tasks like load balancing, encryption, and monitoring, so the main application doesn’t have to.
  2. Logging and Monitoring:
    • Instead of the main application generating and managing logs, a sidecar can collect, format, and send logs to a centralized system like Elasticsearch or Splunk.
  3. Authentication and Security:
    • A sidecar can act as a gatekeeper, handling user authentication and ensuring that only authorized requests reach the main application.
  4. Data Caching:
    • If an application frequently queries a database, a sidecar can serve as a local cache, reducing database load and speeding up responses.
  5. Service Discovery:
    • Sidecars can aid in service discovery by automatically registering the main application with a registry service or load balancer, ensuring seamless communication in dynamic environments.

How Sidecars Work

In modern environments like Kubernetes, sidecars are often deployed as separate containers within the same pod as the main application. They share the same network and storage, making communication between the two seamless.

Here’s a simplified workflow

  1. The main application focuses on its core tasks (e.g., serving a web page).
  2. The sidecar handles auxiliary tasks (e.g., compressing and encrypting logs).
  3. The two communicate over local connections within the pod.
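In Kubernetes, that workflow maps onto a pod spec with two containers sharing a volume. A minimal sketch (container names and images are illustrative, not from the original post):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-with-logging-sidecar
spec:
  containers:
    - name: web-app                 # main application: serves the web page
      image: my-web-app:1.0         # illustrative image name
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
    - name: log-shipper             # sidecar: collects and forwards logs
      image: fluent/fluent-bit:2.2
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
          readOnly: true
  volumes:
    - name: app-logs
      emptyDir: {}                  # shared storage between the two containers
```

Both containers share the pod’s network namespace and the `app-logs` volume, which is exactly the "local connections within the pod" communication described above.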

Pros and Cons of Sidecars

Pros:

  • Simplifies the main application.
  • Encourages reusability and modular design.
  • Improves scalability and flexibility.
  • Enhances observability with centralized logging and metrics.
  • Facilitates experimentation—you can deploy or update sidecars independently.

Cons:

  • Adds complexity to deployment and orchestration.
  • Consumes additional resources (CPU, memory).
  • Requires careful design to avoid tight coupling between the sidecar and the main application.
  • Latency (You are adding an another hop).

Do we always need to use sidecars?

No. Not at all.

a. When the latency between the parent application and the sidecar matters, reconsider.

b. If your application is small, reconsider.

c. When the sidecar would scale differently or independently from the parent application, reconsider.

Some other examples

1. Adding HTTPS to a Legacy Application

Consider a legacy web service which serves requests over unencrypted HTTP. We have a requirement to enhance the same legacy system to serve requests over HTTPS in the future.

The legacy app is configured to serve requests exclusively on localhost, which means that only services sharing the local network with the server are able to access the legacy application. In addition to the main container (the legacy app), we can add an Nginx sidecar container which runs in the same network namespace as the main container, so that it can reach the service on localhost and terminate HTTPS on its behalf.

2. For Logging (Image from ByteByteGo)

Sidecars are not just technical solutions; they embody the principle of collaboration and specialization. By dividing responsibilities, they empower the main application to shine while ensuring auxiliary tasks are handled efficiently. Next time you hear about sidecars, you’ll know they’re more than just cool attachments for motorcycles; they’re an essential part of scalable, maintainable software systems.

Also, do you feel it’s closely related to the Adapter and Ambassador Patterns? I do.

References:

  1. Hussein Nasser – https://www.youtube.com/watch?v=zcJWvhzkPsw&pp=ygUHc2lkZWNhcg%3D%3D
  2. Sudo Code – https://www.youtube.com/watch?v=QU5WcwuFpZU&pp=ygUPc2lkZWNhciBwYXR0ZXJu
  3. Software Dude – https://www.youtube.com/watch?v=poPUzN33Oug&pp=ygUPc2lkZWNhciBwYXR0ZXJu
  4. https://medium.com/nerd-for-tech/microservice-design-pattern-sidecar-sidekick-pattern-dbcea9bed783
  5. https://dzone.com/articles/sidecar-design-pattern-in-your-microservices-ecosy-1

Learning Notes #10 – Lazy Queues | RabbitMQ

26 December 2024 at 06:54

What Are Lazy Queues?

  • Lazy Queues are designed to store messages primarily on disk rather than in memory.
  • They are optimized for use cases involving large message backlogs where minimizing memory usage is critical.

Key Characteristics

  1. Disk-Based Storage – Messages are stored on disk immediately upon arrival, rather than being held in memory.
  2. Low Memory Usage – Only minimal metadata for messages is kept in memory.
  3. Scalability – Can handle millions of messages without consuming significant memory.
  4. Message Retrieval – Retrieving messages is slower because messages are fetched from disk.
  5. Durability – Messages persist on disk, reducing the risk of data loss during RabbitMQ restarts.

Trade-offs

  • Latency: Fetching messages from disk is slower than retrieving them from memory.
  • Throughput: Not suitable for high-throughput, low-latency applications.

Choose Lazy Queues if

  • You need to handle very large backlogs of messages.
  • Memory is a constraint in your system.
  • Latency and throughput are less critical.

Implementation

Pre-requisites

1. Install and run RabbitMQ on your local machine.


docker run -it --rm --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:4.0-management

2. Install the pika library


pip install pika

Producer (producer.py)

This script sends a persistent message to a Lazy Queue.

import pika

# RabbitMQ connection parameters for localhost
connection_params = pika.ConnectionParameters(host="localhost")

# Connect to RabbitMQ
connection = pika.BlockingConnection(connection_params)
channel = connection.channel()

# Custom Exchange and Routing Key
exchange_name = "custom_exchange"
routing_key = "custom_routing_key"
queue_name = "lazy_queue_example"

# Declare the custom exchange
channel.exchange_declare(
    exchange=exchange_name,
    exchange_type="direct",  # Direct exchange routes messages based on the routing key
    durable=True
)

# Declare a Lazy Queue
channel.queue_declare(
    queue=queue_name,
    durable=True,
    arguments={"x-queue-mode": "lazy"}  # Configure the queue as lazy
)

# Bind the queue to the custom exchange with the routing key
channel.queue_bind(
    exchange=exchange_name,
    queue=queue_name,
    routing_key=routing_key
)

# Publish a message
message = "Hello from the Producer via Custom Exchange!"
channel.basic_publish(
    exchange=exchange_name,
    routing_key=routing_key,
    body=message,
    properties=pika.BasicProperties(delivery_mode=2)  # Persistent message
)

print(f"Message sent to Lazy Queue via Exchange: {message}")

# Close the connection
connection.close()

Consumer (consumer.py)

import pika

# RabbitMQ connection parameters for localhost
connection_params = pika.ConnectionParameters(host="localhost")

# Connect to RabbitMQ
connection = pika.BlockingConnection(connection_params)
channel = connection.channel()

# Custom Exchange and Routing Key
exchange_name = "custom_exchange"
routing_key = "custom_routing_key"
queue_name = "lazy_queue_example"

# Declare the custom exchange
channel.exchange_declare(
    exchange=exchange_name,
    exchange_type="direct",  # Direct exchange routes messages based on the routing key
    durable=True
)

# Declare the Lazy Queue
channel.queue_declare(
    queue=queue_name,
    durable=True,
    arguments={"x-queue-mode": "lazy"}  # Configure the queue as lazy
)

# Bind the queue to the custom exchange with the routing key
channel.queue_bind(
    exchange=exchange_name,
    queue=queue_name,
    routing_key=routing_key
)

# Callback function to process messages
def callback(ch, method, properties, body):
    print(f"Received message: {body.decode()}")
    ch.basic_ack(delivery_tag=method.delivery_tag)  # Acknowledge the message

# Start consuming messages
channel.basic_consume(queue=queue_name, on_message_callback=callback, auto_ack=False)

print("Waiting for messages. To exit, press CTRL+C")
try:
    channel.start_consuming()
except KeyboardInterrupt:
    print("Stopped consuming.")

# Close the connection
connection.close()

Explanation

  1. Producer
    • Defines a custom exchange (custom_exchange) of type direct.
    • Declares a Lazy Queue (lazy_queue_example).
    • Binds the queue to the exchange using a routing key (custom_routing_key).
    • Publishes a persistent message via the custom exchange and routing key.
  2. Consumer
    • Declares the same exchange and Lazy Queue to ensure they exist.
    • Consumes messages routed to the queue through the custom exchange and routing key.
  3. Custom Exchange and Binding
    • The direct exchange type routes messages based on an exact match of the routing key.
    • Binding ensures the queue receives messages published to the exchange with the specified key.
  4. Lazy Queue Behavior
    • Messages are stored directly on disk to minimize memory usage.

Learning Notes #9 – Quorum Queues | RabbitMQ

25 December 2024 at 16:42

What Are Quorum Queues?

  • Quorum Queues are distributed queues built on the Raft consensus algorithm.
  • They are designed for high availability, durability, and data safety by replicating messages across multiple nodes in a RabbitMQ cluster.
  • They are a replacement for Mirrored Queues.

Key Characteristics

  1. Replication:
    • Messages are replicated across a quorum (a majority of nodes).
    • A quorum consists of an odd number of replicas (e.g., 3, 5, 7) to ensure a majority can elect a leader during failovers.
  2. Leader-Follower Architecture:
    • Each Quorum Queue has one leader and multiple followers.
    • The leader handles all write and read operations, while followers replicate messages and provide redundancy.
  3. Durability:
    • Messages are written to disk on all quorum nodes, ensuring persistence even if nodes fail.
  4. High Availability:
    • If the leader node fails, RabbitMQ elects a new leader from the remaining quorum, ensuring continued operation.
  5. Consistency:
    • Quorum Queues prioritize consistency over availability.
    • Messages are acknowledged only after replication is successful on a majority of nodes.
  6. Message Ordering:
    • Message ordering is preserved during normal operations but may be disrupted during leader failovers.

Use Cases

  • Mission-Critical Applications – Systems where message loss is unacceptable (e.g., financial transactions, order processing).
  • Distributed Systems – Environments requiring high availability and fault tolerance.
  • Data Safety – Applications prioritizing consistency over throughput (e.g., event logs, audit trails).

Setups

Using rabbitmqadmin


rabbitmqadmin declare queue name=quorum_queue durable=true arguments='{"x-queue-type": "quorum"}'

Using Python


channel.queue_declare(queue="quorum_queue", durable=True, arguments={"x-queue-type": "quorum"})

References:

  1. https://www.rabbitmq.com/docs/quorum-queues

Learning Notes #8 – SLI, SLA, SLO

25 December 2024 at 16:11

In this blog, I write about SLIs, SLAs, and SLOs. I got a refreshing session from a podcast: https://open.spotify.com/episode/2Ags7x1WrxaFLRd3KBU50K?si=vbYtW_YVQpOi8HwT9AOM1g. This blog is about that.

In the world of service reliability and performance, the terms SLO, SLA, and SLI are often used interchangeably but have distinct meanings. This blog explains these terms in detail, their importance, and how they relate to each other with practical examples.

1. What are SLIs, SLOs, and SLAs?

Service Level Indicators (SLIs)

An SLI is a metric that quantifies the level of service provided by a system. It measures specific aspects of performance or reliability, such as response time, uptime, or error rate.

Example:

  • Percentage of successful HTTP requests over a time window.
  • Average latency of API responses.

Service Level Objectives (SLOs)

An SLO is a target value or range for an SLI. It defines what “acceptable” performance or reliability looks like from the perspective of the service provider or user.

Example:

  • “99.9% of HTTP requests must succeed within 500ms.”
  • “The application should have 99.95% uptime per quarter.”

Service Level Agreements (SLAs)

An SLA is a formal contract between a service provider and a customer that specifies the agreed-upon SLOs and the consequences of failing to meet them, such as penalties or compensations.

Example:

  • “If the uptime drops below 99.5% in a calendar month, the customer will receive a 10% credit on their monthly bill.”

2. Relationship Between SLIs, SLOs, and SLAs

  • SLIs are the metrics measured.
  • SLOs are the goals or benchmarks derived from SLIs.
  • SLAs are agreements that formalize SLOs and include penalties or incentives.

SLI: Average latency of API requests.
SLO: 95% of API requests should have latency under 200ms.
SLA: If latency exceeds the SLO for two consecutive weeks, the provider will issue service credits.
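A useful corollary of an availability SLO is the error budget it implies. A quick back-of-the-envelope calculation (assuming a 30-day month):

```python
# Error budget implied by an availability SLO over a 30-day month
slo = 0.999                       # 99.9% availability target
minutes_in_month = 30 * 24 * 60   # 43200 minutes

# Everything above the SLO is budget you may "spend" on incidents or deploys
error_budget_minutes = (1 - slo) * minutes_in_month
print(f"Allowed downtime per month: {error_budget_minutes:.1f} minutes")
```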

3. Practical Examples

Example 1: Web Hosting Service

  • SLI: Percentage of time the website is available.
  • SLO: The website must be available 99.9% of the time per month.
  • SLA: If uptime falls below 99.9%, the customer will receive a refund of 20% of their monthly fee.

Example 2: Cloud Storage Service

  • SLI: Time taken to retrieve a file from storage.
  • SLO: 95% of retrieval requests must complete within 300ms.
  • SLA: If retrieval times exceed 300ms for more than 5% of requests in a billing cycle, customers will get free additional storage for the next month.

Example 3: API Service

  • SLI: Error rate of API responses.
  • SLO: Error rate must be below 0.1% for all requests in a day.
  • SLA: If the error rate exceeds 0.1% for more than three days in a row, the customer is entitled to a credit worth 5% of their monthly subscription fee.
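The three concepts can be wired together in a toy calculation; the request log below is made up for illustration:

```python
# Toy request log: (latency_ms, succeeded)
requests = [
    (120, True), (90, True), (250, True), (40, True),
    (510, False), (180, True), (95, True), (60, True),
    (300, True), (75, True),
]

# SLIs: the measured metrics
availability = sum(ok for _, ok in requests) / len(requests)       # success rate
under_200ms = sum(lat < 200 for lat, _ in requests) / len(requests)  # fast-request rate

# SLOs: the targets the SLIs are judged against
availability_slo = 0.90   # 90% of requests must succeed
latency_slo = 0.70        # 70% of requests must finish under 200 ms

print(f"availability SLI = {availability:.2f} (SLO met: {availability >= availability_slo})")
print(f"latency SLI      = {under_200ms:.2f} (SLO met: {under_200ms >= latency_slo})")
```

An SLA would then attach a consequence, e.g. service credits, to sustained SLO misses.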
