
Learning Notes #58 – Command Query Responsibility Segregation – An Idea Overview

Today, I came across a ByteMonk video on Event Sourcing. The video mentioned CQRS, so I delved into it. This blog is about understanding CQRS at a high level. I am planning to dive deep into Event Driven Architecture conceptually in the coming weekend.

In this blog, I jot down notes for a basic understanding of CQRS.

In the world of software development, there are countless patterns and practices aimed at solving specific problems. One such pattern is CQRS, short for Command Query Responsibility Segregation. While it might sound complex (it did for me), the idea is quite straightforward when broken down into simple terms.

What is CQRS?

Imagine you run a small bookstore. Customers interact with your store in two main ways,

  1. They buy books.
  2. They ask for information about books.

These two activities, buying (a command) and asking (a query), are fundamentally different. Buying a book changes something in your store (your inventory decreases), whereas asking for information doesn’t change anything; it just retrieves details.

CQRS applies the same principle to software. It separates the operations that change data (called commands) from those that read data (called queries). This separation brings clarity and efficiency (not sure yet 🙂).

In simpler terms,

  • Commands are actions like “Add this book to the inventory” or “Update the price of this book.” These modify the state of your system.

  • Queries are questions like “How many books are in stock?” or “What’s the price of this book?” These fetch data but don’t alter it.

By keeping these two types of operations separate, you make your system easier to manage and scale.

Why Should You Care About CQRS?

Let’s revisit our bookstore analogy. Imagine if every time someone asked for information about a book, your staff had to dig through boxes in the storage room. It would be slow and inefficient!

Instead, you might keep a catalog at the front desk that’s easy to browse.

In software, this means,

  • Better Performance: By separating commands and queries, you can optimize them individually. For instance, you can have a simple, fast database for queries and a robust, detailed database for commands.

  • Simpler Code: Each part of your system does one thing, making it easier to understand and maintain.

  • Flexibility: You can scale the command and query sides independently. If you get a lot of read requests but fewer writes, you can optimize the query side without touching the command side.

CQRS in Action

Let’s say you’re building an app for managing a library. Here’s how CQRS might look,

  • Command: A librarian adds a new book to the catalog or updates the details of an existing book.

  • Query: A user searches for books by title or checks the availability of a specific book.

The app could use one database to handle commands (storing all the book details and history) and another optimized database to handle queries (focused on quickly retrieving book information).
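To make the separation concrete, here is a minimal sketch in Python (the class and store names are illustrative, not from any CQRS framework): the command side mutates a detailed write store and projects changes into a simplified read model, while the query side only ever reads.

```python
# Minimal CQRS sketch: commands mutate the write model,
# queries read from a separate, simplified read model.

class CommandSide:
    """Handles writes: full book records, including price history."""
    def __init__(self, write_db, read_db):
        self.write_db = write_db   # detailed store
        self.read_db = read_db     # denormalized read model

    def add_book(self, book_id, title, price):
        self.write_db[book_id] = {"title": title, "price": price, "history": []}
        # Project the change into the read model.
        self.read_db[book_id] = {"title": title, "price": price}

    def update_price(self, book_id, price):
        record = self.write_db[book_id]
        record["history"].append(record["price"])
        record["price"] = price
        self.read_db[book_id]["price"] = price


class QuerySide:
    """Handles reads: never mutates anything."""
    def __init__(self, read_db):
        self.read_db = read_db

    def price_of(self, book_id):
        return self.read_db[book_id]["price"]


write_db, read_db = {}, {}
commands = CommandSide(write_db, read_db)
queries = QuerySide(read_db)

commands.add_book(1, "Clean Code", 30)
commands.update_price(1, 25)
print(queries.price_of(1))  # 25
```

In a real system the two stores would typically be separate databases kept in sync by events, but the division of responsibility is the same.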

Does CQRS Always Make Sense?

As of now, it seems to make things more complicated for small applications. As usual, every pattern is devised for its own niche of problems; a single bolt cannot fit every nut.

In upcoming blogs, let’s learn more on CQRS.

Learning Notes #57 – Partial Indexing in Postgres

Today, I learnt about partial indexing in Postgres and how it optimizes indexing to filter a subset of a table more efficiently. In this blog, I jot down notes on partial indexing.

Partial indexing in PostgreSQL is a powerful feature that provides a way to optimize database performance by creating indexes that apply only to a subset of a table’s rows. This selective indexing can result in reduced storage space, faster index maintenance, and improved query performance, especially when queries frequently involve filters or conditions that only target a portion of the data.

An index in PostgreSQL, like in other relational database management systems, is a data structure that improves the speed of data retrieval operations. However, creating an index on an entire table can sometimes be inefficient, especially when dealing with very large datasets where queries often focus on specific subsets of the data. This is where partial indexing becomes invaluable.

Unlike a standard index that covers every row in a table, a partial index only includes rows that satisfy a specified condition. This condition is defined using a WHERE clause when the index is created.

To understand the mechanics, let us consider a practical example.

Suppose you have a table named orders that stores details about customer orders, including columns like order_id, customer_id, order_date, status, and total_amount. If the majority of your queries focus on pending orders (those where the status is 'pending'), creating a partial index specifically for these rows can significantly improve performance.

Example 1:

Here’s how you can create such an index,

CREATE INDEX idx_pending_orders
ON orders (order_date)
WHERE status = 'pending';

In this example, the index idx_pending_orders includes only the rows where status equals pending. This means that any query that involves filtering by status = 'pending' and utilizes the order_date column will leverage this index. For instance, the following query would benefit from the partial index,

SELECT *
FROM orders
WHERE status = 'pending'
AND order_date > '2025-01-01';

The benefits of this approach are significant. By indexing only the rows with status = 'pending', the size of the index is much smaller compared to a full table index.

This reduction in size not only saves disk space but also speeds up the process of scanning the index, as there are fewer entries to traverse. Furthermore, updates or modifications to rows that do not meet the WHERE condition are excluded from index maintenance, thereby reducing the overhead of maintaining the index and improving performance for write operations.
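Although the examples here are Postgres, SQLite supports partial indexes with the same syntax, so the behavior can be tried out with Python's built-in sqlite3 module (the table shape mirrors the orders example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        status TEXT,
        order_date TEXT,
        total_amount REAL
    )
""")
conn.executemany(
    "INSERT INTO orders (status, order_date, total_amount) VALUES (?, ?, ?)",
    [("pending", f"2025-01-{d:02d}", 100.0 * d) for d in range(1, 10)]
    + [("completed", f"2024-12-{d:02d}", 50.0 * d) for d in range(1, 10)],
)

# Partial index: only rows with status = 'pending' are indexed.
conn.execute("""
    CREATE INDEX idx_pending_orders
    ON orders (order_date)
    WHERE status = 'pending'
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM orders
    WHERE status = 'pending' AND order_date > '2025-01-01'
""").fetchall()
print(plan)
```

EXPLAIN QUERY PLAN should report a search using idx_pending_orders, since the query's status = 'pending' condition implies the index predicate, so the planner is allowed to use the partial index.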

Example 2:

Let us explore another example. Suppose your application frequently queries orders that exceed a certain total amount. You can create a partial index tailored to this use case,

CREATE INDEX idx_high_value_orders
ON orders (customer_id)
WHERE total_amount > 1000;

This index would optimize queries like the following,

SELECT *
FROM orders
WHERE total_amount > 1000
AND customer_id = 123;

The key advantage here is that the index only includes rows where total_amount > 1000. For datasets with a wide range of order amounts, this can dramatically reduce the number of indexed entries. Queries that filter by high-value orders become faster because the database does not need to sift through irrelevant rows.

Additionally, as with the previous example, index maintenance is limited to the subset of rows matching the condition, improving overall performance for insertions and updates.

Partial indexes are also useful for enforcing constraints in a selective manner. Consider a scenario where you want to ensure that no two active promotions exist for the same product. You can achieve this using a unique partial index,

CREATE UNIQUE INDEX idx_unique_active_promotion
ON promotions (product_id)
WHERE is_active = true;

This index guarantees that only one row with is_active = true can exist for each product_id.

In conclusion, partial indexing in PostgreSQL offers a flexible and efficient way to optimize database performance by targeting specific subsets of data.

Learning Notes #56 – Push vs Pull Architecture

Today, I learnt about push vs pull architecture. The choice between push and pull architectures can significantly influence system performance, scalability, and user experience. Both approaches have their unique advantages and trade-offs, and understanding these architectures and their ideal use cases can help developers and architects make informed decisions.

What is Push Architecture?

Push architecture is a communication pattern where the server actively sends data to clients as soon as it becomes available. This approach eliminates the need for clients to repeatedly request updates.

How it Works

  • The server maintains a connection with the client.
  • When new data is available, the server “pushes” it to the connected clients.
  • In a message queue context, producers send messages to a queue, and the queue actively delivers these messages to subscribed consumers without explicit requests.

Examples

  • Notifications in Mobile Apps: Users receive instant updates, such as chat messages or alerts.
  • Stock Price Updates: Financial platforms use push to provide real-time market data.
  • Message Queues with Push Delivery: Systems like RabbitMQ or Kafka configured to push messages to consumers.
  • Server-Sent Events (SSE) and WebSockets: These are common implementations of push.

Advantages

  • Low Latency: Clients receive updates instantly, improving responsiveness.
  • Reduced Redundancy: No need for clients to poll servers frequently, reducing bandwidth consumption.

Challenges

  • Complexity: Maintaining open connections, especially for many clients, can be resource-intensive.
  • Scalability: Requires robust infrastructure to handle large-scale deployments.

What is Pull Architecture?

Pull architecture involves clients actively requesting data from the server. This pattern is often used when real-time updates are not critical or predictable intervals suffice.

How it Works

  • The client periodically sends requests to the server.
  • The server responds with the requested data.
  • In a message queue context, consumers actively poll the queue to retrieve messages when ready.

Examples

  • Web Browsing: A browser sends HTTP requests to fetch pages and resources.
  • API Data Fetching: Applications periodically query APIs to update information.
  • Message Queues with Pull Delivery: Systems like SQS or Kafka where consumers poll for messages.
  • Polling: Regularly checking a server or queue for updates.

Advantages

  • Simpler Implementation: No need for persistent connections; standard HTTP requests or queue polling suffice.
  • Server Load Control: The server can limit the frequency of client requests to manage resources better.

Challenges

  • Latency: Updates are only received when the client requests them, which might lead to delays.
  • Increased Bandwidth: Frequent polling can waste resources if no new data is available.
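The difference can be sketched with a toy in-memory queue (all names here are illustrative): the push side delivers through a registered callback as soon as a message is published, while the pull side hands out messages only when a consumer polls.

```python
from collections import deque

class ToyQueue:
    def __init__(self):
        self.messages = deque()
        self.subscribers = []   # push-style consumers

    # Push: the queue actively delivers to registered consumers.
    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        if self.subscribers:
            for callback in self.subscribers:
                callback(message)          # delivered immediately
        else:
            self.messages.append(message)  # held until someone polls

    # Pull: consumers ask for data when they are ready.
    def poll(self):
        return self.messages.popleft() if self.messages else None


# Pull: nothing arrives until the consumer asks; an empty poll is wasted work.
pull_queue = ToyQueue()
pull_queue.publish("report-ready")
first = pull_queue.poll()    # 'report-ready'
second = pull_queue.poll()   # None - no new data, a wasted request

# Push: the consumer registers once and receives messages as they are published.
received = []
push_queue = ToyQueue()
push_queue.subscribe(received.append)
push_queue.publish("price-update")
print(first, second, received)
```

The trade-offs above fall out of this structure: push needs the queue to track live subscribers (connection state), while pull needs consumers to keep asking even when there is nothing new.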

Aspect               | Push Architecture                              | Pull Architecture
Latency              | Low – real-time updates                        | Higher – dependent on polling frequency
Complexity           | Higher – requires persistent connections       | Lower – simple request-response model
Bandwidth Efficiency | Efficient – updates sent only when needed      | Less efficient – redundant polling possible
Scalability          | Challenging – high client connection overhead  | Easier – controlled client request intervals
Message Queue Flow   | Messages actively delivered to consumers       | Consumers poll the queue for messages
Use Cases            | Real-time applications (e.g., chat, live data) | Non-critical updates (e.g., periodic reports)

Learning Notes #55 – API Keys and Tokens

Tokens and API keys are foundational tools that ensure secure communication between systems. They enable authentication, authorization, and access control, facilitating secure data exchange.

What Are Tokens?

Tokens are digital objects that represent a specific set of permissions or claims. They are often used in authentication and authorization processes to verify a user’s identity or grant access to resources. Tokens can be time-bound and carry information like:

  1. User Identity: Information about the user or system initiating the request.
  2. Scope of Access: Details about what actions or resources the token permits.
  3. Validity Period: Start and expiry times for the token.

Common Types of Tokens:

  • JWT (JSON Web Tokens): Compact, URL-safe tokens containing a payload, signature, and header.
  • Opaque Tokens: Tokens without embedded information; they require validation against a server.
  • Refresh Tokens: Used to obtain a new access token when the current one expires.

What Are API Keys?

API keys are unique identifiers used to authenticate applications or systems accessing APIs. They are simple to use and act as a credential to allow systems to make authorized API calls.

Key Characteristics:

  • Static Credential: Unlike tokens, API keys do not typically expire unless explicitly revoked.
  • Simple to Use: They are easy to implement and often passed in headers or query parameters.
  • Application-Specific: Keys are tied to specific applications rather than user accounts.

Functionalities and Usage

Both tokens and API keys enable secure interaction between systems, but their application depends on the scenario.

1. Authentication

  • Tokens: Often used for user authentication in web apps and APIs.
    • Example: A JWT issued after login is included in subsequent API requests to validate the user’s session.
  • API Keys: Authenticate applications rather than users.
    • Example: A weather app uses an API key to fetch data from a weather API.

2. Authorization

  • Tokens: Define user-specific permissions and roles.
    • Example: A token allows read-only access to specific resources for a particular user.
  • API Keys: Grant access to predefined resources for the application.
    • Example: An API key allows access to public datasets but restricts write operations.

3. Rate Limiting and Monitoring

Both tokens and API keys can be used to

  • Enforce usage limits.
  • Monitor and log API usage for analytics and security.

Considerations for Secure Implementation

1. For Tokens

  • Use HTTPS: Always transmit tokens over HTTPS to prevent interception.
  • Implement Expiry: Set reasonable expiry times to minimize risks.
  • Adopt Refresh Tokens: Allow users to obtain new tokens securely when access tokens expire.
  • Validate Signatures: For JWTs, validate the signature to ensure the token’s integrity.

2. For API Keys

  • Restrict IP Usage: Limit the key’s use to specific IPs or networks.
  • Set Permissions: Assign the minimum required permissions for the API key.
  • Regenerate Periodically: Refresh keys periodically to mitigate risks.
  • Monitor Usage: Track API key usage for anomalies and revoke compromised keys promptly.

3. For Both

  • Avoid Hardcoding: Never embed tokens or keys in source code. Use environment variables or secure vaults.
  • Audit and Rotate: Regularly audit and rotate keys and tokens to maintain security.
  • Educate Users: Ensure users and developers understand secure handling practices.
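To make the token-handling advice concrete, here is a simplified, JWT-like sign-and-verify sketch using only Python's standard library hmac module. This is illustrative only, not a real JWT implementation; in practice, use a vetted library.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # illustrative; keep real secrets in env vars or a vault

def issue_token(claims: dict) -> str:
    """Encode the claims and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    signature = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + signature

def verify_token(token: str):
    """Return the claims if the signature checks out, else None."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):  # constant-time comparison
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


token = issue_token({"user": "alice", "scope": "read-only"})
print(verify_token(token))   # {'user': 'alice', 'scope': 'read-only'}

# Flip the last character of the signature: validation must fail.
tampered = token[:-1] + ("0" if token[-1] != "0" else "1")
print(verify_token(tampered))   # None
```

This is the core idea behind "Validate Signatures" above: any change to the payload or signature makes verification fail, so the server can trust the claims without a database lookup.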

Learning Notes #54 – Architecture Decision Records

Over the last few days, I was learning how to make accountable decisions on technical matters. Then I came across ADRs. So far, I haven’t used them, nor seen them used by our team. I think this is a necessary practice to incorporate for making accountable decisions. In this blog, I share details on ADRs for my future reference.

What is an ADR?

An Architectural Decision Record (ADR) is a concise document that captures a single architectural decision, its context, the reasoning behind it, and its consequences. ADRs help teams document, share, and revisit architectural choices, ensuring transparency and better collaboration.

Why Use ADRs?

  1. Documentation: ADRs serve as a historical record of why certain decisions were made.
  2. Collaboration: They promote better understanding across teams.
  3. Traceability: ADRs link architectural decisions to specific project requirements and constraints.
  4. Accountability: They clarify who made a decision and when.
  5. Change Management: ADRs help evaluate the impact of changes and facilitate discussions around reversals or updates.

ADR Structure

A typical ADR document follows a standard format. Here’s an example:

  1. Title: A clear and concise title describing the decision.
  2. Context: Background information explaining the problem or opportunity.
  3. Decision: A summary of the chosen solution.
  4. Consequences: The positive and negative outcomes of the decision.
  5. Status: Indicates whether the decision is proposed, accepted, superseded, or deprecated.

Example:

Optimistic locking on MongoDB https://docs.google.com/document/d/1olCbicQeQzYpCxB0ejPDtnri9rWb2Qhs9_JZuvANAxM/edit?usp=sharing
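A short hypothetical ADR following the structure above might look like this (the project and decision are made up for illustration):

```
Title: Use PostgreSQL as the primary datastore

Context: The service needs a relational database with strong
transactional guarantees. The current prototype uses SQLite,
which will not handle the concurrency expected in production.

Decision: Adopt PostgreSQL for all new services.

Consequences:
  + Mature tooling, ACID guarantees, team familiarity.
  - Requires provisioning and operating a database server.

Status: Accepted
```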

References

  1. https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions
  2. https://www.infoq.com/podcasts/architecture-advice-process/
  3. Recommended: https://github.com/joelparkerhenderson/architecture-decision-record/tree/main

Learning Notes #53 – The Expiration Time Can Be Unexpectedly Lost While Using Redis SET EX

Redis, a high-performance in-memory key-value store, is widely used for caching, session management, and various other scenarios where fast data retrieval is essential. One of its key features is the ability to set expiration times for keys. However, when using the SET command with the EX option, developers might encounter unexpected behaviors where the expiration time is seemingly lost. Let’s explore this issue in detail.

Understanding SET with EX

The Redis SET command with the EX option allows you to set a key’s value and specify its expiration time in seconds. For instance,


SET key value EX 60

This command sets the key key to the value value and sets an expiration time of 60 seconds.

The Problem

In certain cases, the expiration time might be unexpectedly lost. This typically happens when subsequent operations overwrite the key without specifying a new expiration. For example,


SET key value1 EX 60
SET key value2

In the above sequence,

  1. The first SET command assigns a value to key and sets an expiration of 60 seconds.
  2. The second SET command overwrites the value of key but does not include an expiration time, resulting in the key persisting indefinitely.

This behavior can lead to subtle bugs, especially in applications that rely on key expiration for correctness or resource management.

Why Does This Happen?

The Redis SET command is designed to replace the entire state of a key, including its expiration. When you use SET without the EX, PX, or EXAT options, the expiration is removed, and the key becomes persistent. This behavior aligns with the principle that SET is a complete update operation.
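A toy model of these semantics in plain Python (not a Redis client; it just mimics the SET/EX/KEEPTTL behavior described above) makes the pitfall easy to see:

```python
import time

class ToyRedis:
    """Illustrates SET/EX semantics only; not actual Redis."""
    def __init__(self):
        self.data = {}     # key -> value
        self.expiry = {}   # key -> absolute expiry timestamp

    def set(self, key, value, ex=None, keepttl=False):
        self.data[key] = value
        if ex is not None:
            self.expiry[key] = time.time() + ex
        elif not keepttl:
            # Plain SET replaces the whole state of the key, clearing any TTL.
            self.expiry.pop(key, None)

    def ttl(self, key):
        if key not in self.expiry:
            return -1  # like Redis, -1 means the key has no expiry
        return int(self.expiry[key] - time.time())


r = ToyRedis()
r.set("session", "v1", ex=60)
ttl_after_set_ex = r.ttl("session")       # ~60

r.set("session", "v2")                    # overwrite without EX
ttl_after_plain_set = r.ttl("session")    # -1: the key is now persistent

r.set("session", "v3", ex=60)
r.set("session", "v4", keepttl=True)      # preserve the existing expiry
ttl_after_keepttl = r.ttl("session")      # ~60 again
print(ttl_after_set_ex, ttl_after_plain_set, ttl_after_keepttl)
```

In real Redis, the fix is the same shape: either pass EX on every SET that should keep the key volatile, or use the KEEPTTL option (available since Redis 6.0) so an overwrite preserves the existing expiry.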

When using Redis SET with EX, be mindful of operations that might overwrite keys without reapplying expiration. Understanding Redis’s behavior and implementing robust patterns can save you from unexpected issues, ensuring your application remains efficient and reliable.

Learning Notes #52 – Hybrid Origin Failover Pattern

Today, I learnt about failover patterns from AWS: https://aws.amazon.com/blogs/networking-and-content-delivery/three-advanced-design-patterns-for-high-available-applications-using-amazon-cloudfront/ . In this blog, I jot down my understanding of this pattern for future reference.

Hybrid origin failover is a strategy that combines two distinct approaches to handle origin failures effectively, balancing speed and resilience.

The Need for Origin Failover

When an application’s primary origin server becomes unavailable, the ability to reroute traffic to a secondary origin ensures continuity. The failover process determines how quickly and effectively this switch happens. Broadly, there are two approaches to implement origin failover:

  1. Stateful Failover with DNS-based Routing
  2. Stateless Failover with Application Logic

Each has its strengths and limitations, which the hybrid approach aims to mitigate.

Approach 1: Stateful Failover with DNS-based Routing

Stateful failover is a system that allows a standby server to take over for a failed server and continue active sessions. It’s used to create a resilient network infrastructure and avoid service interruptions.

This method relies on a DNS service with health checks to detect when the primary origin is unavailable. Here’s how it works,

  1. Health Checks: The DNS service continuously monitors the health of the primary origin using health checks (e.g., HTTP, HTTPS).
  2. DNS Failover: When the primary origin is marked unhealthy, the DNS service resolves the origin’s domain name to the secondary origin’s IP address.
  3. TTL Impact: The failover process honors the DNS Time-to-Live (TTL) settings. A low TTL ensures faster propagation, but even in the most optimal configurations, this process introduces a delay, often around 60 to 70 seconds.
  4. Stateful Behavior: Once failover occurs, all traffic is routed to the secondary origin until the primary origin is marked healthy again.

Implementation from AWS (as-is from aws blog)

The first approach is using Amazon Route 53 Failover routing policy with health checks on the origin domain name that’s configured as the origin in CloudFront. When the primary origin becomes unhealthy, Route 53 detects it, and then starts resolving the origin domain name with the IP address of the secondary origin. CloudFront honors the origin DNS TTL, which means that traffic will start flowing to the secondary origin within the DNS TTLs. The most optimal configuration (Fast Check activated, a failover threshold of 1, and 60 second DNS TTL) means that the failover will take 70 seconds at minimum to occur. When it does, all of the traffic is switched to the secondary origin, since it’s a stateful failover. Note that this design can be further extended with Route 53 Application Recovery Control for more sophisticated application failover across multiple AWS Regions, Availability Zones, and on-premises.

The second approach is using origin failover, a native feature of CloudFront. This capability of CloudFront tries for the primary origin of every request, and if a configured 4xx or 5xx error is received, then CloudFront attempts a retry with the secondary origin. This approach is simple to configure and provides immediate failover. However, it’s stateless, which means every request must fail independently, thus introducing latency to failed requests. For transient origin issues, this additional latency is an acceptable tradeoff with the speed of failover, but it’s not ideal when the origin is completely out of service. Finally, this approach only works for the GET/HEAD/OPTIONS HTTP methods, because other HTTP methods are not allowed on a CloudFront cache behavior with Origin Failover enabled.

Advantages

  • Works for all HTTP methods and request types.
  • Ensures complete switchover, minimizing ongoing failures.

Disadvantages

  • Relatively slower failover due to DNS propagation time.
  • Requires a reliable health-check mechanism.

Approach 2: Stateless Failover with Application Logic

This method handles failover at the application level. If a request to the primary origin fails (e.g., due to a 4xx or 5xx HTTP response), the application or CDN immediately retries the request with the secondary origin.

How It Works

  1. Primary Request: The application sends a request to the primary origin.
  2. Failure Handling: If the response indicates a failure (configurable for specific error codes), the request is retried with the secondary origin.
  3. Stateless Behavior: Each request operates independently, so failover happens on a per-request basis without waiting for a stateful switchover.
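A per-request failover of this kind can be sketched as follows (the origins are stand-in callables for illustration, not a real CDN API):

```python
def fetch_with_failover(request, primary, secondary,
                        retry_statuses=(500, 502, 503, 504)):
    """Try the primary origin; retry once against the secondary
    when the response status is one of the configured error codes."""
    status, body = primary(request)
    if status in retry_statuses:
        # Stateless: the decision is made per request, so each
        # failed request pays one extra round trip of latency.
        status, body = secondary(request)
    return status, body


# Stand-in origins for illustration.
def broken_primary(request):
    return 503, "primary unavailable"

def healthy_secondary(request):
    return 200, "served " + request + " from secondary"


result = fetch_with_failover("/index.html", broken_primary, healthy_secondary)
print(result)  # (200, 'served /index.html from secondary')
```

Note how nothing is remembered between calls: if the primary is completely down, every single request still tries it first, which is exactly the weakness the hybrid pattern addresses.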

Implementation from AWS (as-is from aws blog)

The hybrid origin failover pattern combines both approaches to get the best of both worlds. First, you configure both of your origins with a Failover Policy in Route 53 behind a single origin domain name. Then, you configure an origin failover group with the single origin domain name as primary origin, and the secondary origin domain name as secondary origin. This means that when the primary origin becomes unavailable, requests are immediately retried with the secondary origin until the stateful failover of Route 53 kicks in within tens of seconds, after which requests go directly to the secondary origin without any latency penalty. Note that this pattern only works with the GET/HEAD/OPTIONS HTTP methods.

Advantages

  • Near-instantaneous failover for failed requests.
  • Simple to configure and doesn’t depend on DNS TTL.

Disadvantages

  • Adds latency for failed requests due to retries.
  • Limited to specific HTTP methods like GET, HEAD, and OPTIONS.
  • Not suitable for scenarios where the primary origin is entirely down, as every request must fail first.

The Hybrid Origin Failover Pattern

The hybrid origin failover pattern combines the strengths of both approaches, mitigating their individual limitations. Here’s how it works:

  1. DNS-based Stateful Failover: A DNS service with health checks monitors the primary origin and switches to the secondary origin if the primary becomes unhealthy. This ensures a complete and stateful failover within tens of seconds.
  2. Application-level Stateless Failover: Simultaneously, the application or CDN is configured to retry failed requests with a secondary origin. This provides an immediate failover mechanism for transient or initial failures.

Implementation Steps

  1. DNS Configuration
    • Set up health checks on the primary origin.
    • Define a failover policy in the DNS service, which resolves the origin domain name to the secondary origin when the primary is unhealthy.
  2. Application Configuration
    • Configure the application or CDN to use an origin failover group.
    • Specify the primary origin domain as the primary origin and the secondary origin domain as the backup.

Behavior

  • Initially, if the primary origin encounters issues, requests are retried immediately with the secondary origin.
  • Meanwhile, the DNS failover switches all traffic to the secondary origin within tens of seconds, eliminating retry latencies for subsequent requests.

Benefits of Hybrid Origin Failover

  1. Faster Failover: Immediate retries for failed requests minimize initial impact, while DNS failover ensures long-term stability.
  2. Reduced Latency: After DNS failover, subsequent requests don’t experience retry delays.
  3. High Resilience: Combines stateful and stateless failover for robust redundancy.
  4. Simplicity and Scalability: Leverages existing DNS and application/CDN features without complex configurations.

Limitations and Considerations

  1. HTTP Method Constraints: Stateless failover works only for GET, HEAD, and OPTIONS methods, limiting its use for POST or PUT requests.
  2. TTL Impact: Low TTLs reduce propagation delays but increase DNS query rates, which could lead to higher costs.
  3. Configuration Complexity: Combining DNS and application-level failover requires careful setup and testing to avoid misconfigurations.
  4. Secondary Origin Capacity: Ensure the secondary origin can handle full traffic loads during failover.

POTD #22 – Longest substring with distinct characters | Geeks For Geeks

Problem Statement

Geeks For Geeks : https://www.geeksforgeeks.org/problems/longest-distinct-characters-in-string5848/1

Given a string s, find the length of the longest substring with all distinct characters.


Input: s = "geeksforgeeks"
Output: 7
Explanation: "eksforg" is the longest substring with all distinct characters.


Input: s = "abcdefabcbb"
Output: 6
Explanation: The longest substring with all distinct characters is "abcdef", which has a length of 6.

My Approach – Sliding Window


class Solution:
    def longestUniqueSubstr(self, s):
        char_index = {}
        max_length = 0
        start = 0
        
        for i, char in enumerate(s):
            if char in char_index and char_index[char] >= start:
                start = char_index[char] + 1 #crux
            
            char_index[char] = i
            
            max_length = max(max_length, i - start + 1)
        
        return max_length
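A quick self-contained check of the same sliding-window logic against the two samples above (rewritten as a plain function for brevity):

```python
def longest_unique_substr(s):
    char_index = {}   # last seen index of each character
    max_length = 0
    start = 0         # left edge of the current window
    for i, char in enumerate(s):
        # If char was seen inside the current window, shrink the window
        # so it starts just past the previous occurrence.
        if char in char_index and char_index[char] >= start:
            start = char_index[char] + 1
        char_index[char] = i
        max_length = max(max_length, i - start + 1)
    return max_length

print(longest_unique_substr("geeksforgeeks"))  # 7
print(longest_unique_substr("abcdefabcbb"))    # 6
```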
                

Learning Notes #51 – Postgres as a Queue using SKIP LOCKED

Yesterday, I came across a blog from inferable.ai (https://www.inferable.ai/blog/posts/postgres-skip-locked) that walks through using Postgres as a queue. In this blog, I jot down notes on using Postgres as a queue for future reference.

PostgreSQL is a robust relational database that can be used for more than just storing structured data. With the SKIP LOCKED feature introduced in PostgreSQL 9.5, you can efficiently turn a PostgreSQL table into a job queue for distributed processing.

Why Use PostgreSQL as a Queue?

Using PostgreSQL as a queue can be advantageous because,

  • Familiarity: If you’re already using PostgreSQL, there’s no need for an additional message broker.
  • Durability: PostgreSQL ensures ACID compliance, offering reliability for your job processing.
  • Simplicity: No need to manage another component like RabbitMQ or Kafka.

Implementing a Queue with SKIP LOCKED

1. Create a Queue Table

To start, you need a table to store the jobs,


CREATE TABLE job_queue (
    id SERIAL PRIMARY KEY,
    job_data JSONB NOT NULL,
    status TEXT DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

This table has the following columns,

  • id: A unique identifier for each job.
  • job_data: The data or payload for the job.
  • status: Tracks the job’s state (‘pending’, ‘in_progress’, or ‘completed’).
  • created_at: Timestamp of job creation.

2. Insert Jobs into the Queue

Adding jobs is straightforward,


INSERT INTO job_queue (job_data)
VALUES ('{"task": "send_email", "email": "user@example.com"}');

3. Fetch Jobs for Processing with SKIP LOCKED

Workers will fetch jobs from the queue using SELECT ... FOR UPDATE SKIP LOCKED to avoid contention,

WITH next_job AS (
    SELECT id, job_data
    FROM job_queue
    WHERE status = 'pending'
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
UPDATE job_queue
SET status = 'in_progress'
FROM next_job
WHERE job_queue.id = next_job.id
RETURNING job_queue.id, job_queue.job_data;

Key Points:

  • FOR UPDATE locks the selected row to prevent other workers from picking it up.
  • SKIP LOCKED ensures locked rows are skipped, enabling concurrent workers to operate without waiting.
  • LIMIT 1 processes one job at a time per worker.

4. Mark Jobs as Completed

Once a worker finishes processing a job, it should update the job’s status,


UPDATE job_queue
SET status = 'completed'
WHERE id = $1; -- Replace $1 with the job ID

5. Delete Old or Processed Jobs

To keep the table clean, you can periodically remove completed jobs,


DELETE FROM job_queue
WHERE status = 'completed' AND created_at < NOW() - INTERVAL '30 days';

Example Worker Implementation

Here’s an example of a worker implemented in Python using psycopg2,


import time

import psycopg2
from psycopg2.extras import RealDictCursor

connection = psycopg2.connect("dbname=yourdb user=youruser")

while True:
    with connection.cursor(cursor_factory=RealDictCursor) as cursor:
        cursor.execute(
            """
            WITH next_job AS (
                SELECT id, job_data
                FROM job_queue
                WHERE status = 'pending'
                FOR UPDATE SKIP LOCKED
                LIMIT 1
            )
            UPDATE job_queue
            SET status = 'in_progress'
            FROM next_job
            WHERE job_queue.id = next_job.id
            RETURNING job_queue.id, job_queue.job_data;
            """
        )

        job = cursor.fetchone()
        if job:
            print(f"Processing job {job['id']}: {job['job_data']}")

            # Simulate job processing
            cursor.execute("UPDATE job_queue SET status = 'completed' WHERE id = %s", (job['id'],))

        else:
            print("No jobs available. Sleeping...")
            time.sleep(5)

    connection.commit()

Considerations

  1. Transaction Isolation: Use the REPEATABLE READ or SERIALIZABLE isolation level cautiously to avoid unnecessary locks.
  2. Row Locking: SKIP LOCKED only skips rows locked by other transactions, not those locked within the same transaction.
  3. Performance: Regularly archive or delete old jobs to prevent the table from growing indefinitely. Consider indexing the status column to improve query performance.
  4. Fault Tolerance: Ensure that workers handle crashes or timeouts gracefully. Use a timeout mechanism to revert jobs stuck in the β€˜in_progress’ state.
  5. Scaling: Distribute workers across multiple nodes to handle a higher job throughput.
  6. The SKIP LOCKED clause only applies to row-level locks – the required ROW SHARE table-level lock is still taken normally.
  7. Using SKIP LOCKED provides an inconsistent view of the data by design. This is why it’s perfect for queue-like tables where we want to distribute work, but not suitable for general purpose work where consistency is required.

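The timeout mechanism mentioned under Fault Tolerance can be sketched in Python. This is a minimal in-memory illustration (the `find_stuck_jobs` helper and the `started_at` field are assumptions, not part of the schema above); in Postgres the same check would be a single UPDATE resetting rows stuck in 'in_progress' past a cutoff back to 'pending'.

```python
from datetime import datetime, timedelta

def find_stuck_jobs(jobs, timeout_minutes=15, now=None):
    # Return IDs of jobs stuck in 'in_progress' longer than the timeout,
    # so a janitor process can reset them to 'pending' for retry.
    now = now or datetime.utcnow()
    cutoff = now - timedelta(minutes=timeout_minutes)
    return [job["id"] for job in jobs
            if job["status"] == "in_progress" and job["started_at"] < cutoff]

now = datetime(2025, 1, 1, 12, 0)
jobs = [
    {"id": 1, "status": "in_progress", "started_at": now - timedelta(minutes=30)},
    {"id": 2, "status": "in_progress", "started_at": now - timedelta(minutes=5)},
    {"id": 3, "status": "completed",   "started_at": now - timedelta(minutes=60)},
]
print(find_stuck_jobs(jobs, timeout_minutes=15, now=now))  # [1]
```

A periodic job (cron or a dedicated janitor worker) would run this check and flip the stuck rows back to 'pending'.
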
Learning Notes #50 – Fixed Partition Pattern | Distributed Pattern

Today, i learnt about the fixed partition pattern, which balances data among servers without heavy data movement. In this blog, i jot down notes on how fixed partitioning helps in solving the problem.

This entire blog is inspired from https://www.linkedin.com/pulse/distributed-systems-design-pattern-fixed-partitions-retail-kumar-v-c34pc/?trackingId=DMovSwEZSfCzKZEKa7yJrg%3D%3D

Problem Statement

In a distributed key-value store system, data items need to be mapped to a set of cluster nodes to ensure efficient storage and retrieval. The system must satisfy the following requirements,

  1. Uniform Distribution: Data should be evenly distributed across all cluster nodes to avoid overloading any single node.
  2. Deterministic Mapping: Given a data item, the specific node responsible for storing it should be determinable without querying all the nodes in the cluster.

A common approach to achieve these goals is to use hashing with a modulo operation. For example, if there are three nodes in the cluster, the key is hashed, and the hash value modulo the number of nodes determines the node to store the data. However, this method has a critical drawback,

Rebalancing Issue: When the cluster size changes (e.g., nodes are added or removed), the mapping for most keys changes. This requires the system to move almost all the data to new nodes, leading to significant overhead in terms of time and resources, especially when dealing with large data volumes.

Challenge: How can we design a mapping mechanism that minimizes data movement during cluster size changes while maintaining uniform distribution and deterministic mapping?

Solution

There is a concept of Fixed Partitioning,

What Is Fixed Partitioning?

This pattern organizes data into a predefined number of fixed partitions that remain constant over time. Data is assigned to these partitions using a hashing algorithm, ensuring that the mapping of data to partitions is permanent. The system separates the fixed partitioning of data from the physical servers managing these partitions, enabling seamless scaling.

Key Features of Fixed Partitioning

  1. Fixed Number of Partitions
    • The number of partitions is determined during system initialization (e.g., 8 partitions).
    • Data is assigned to these partitions based on a consistent hashing algorithm.
  2. Stable Data Mapping
    • Each piece of data is permanently mapped to a specific partition.
    • This eliminates the need for large-scale data reshuffling when scaling the system.
  3. Adjustable Partition-to-Server Mapping
    • Partitions can be reassigned to different servers as the system scales.
    • Only the physical location of the partitions changes; the fixed mapping remains intact.
  4. Balanced Load Distribution
    • Partitions are distributed evenly across servers to balance the workload.
    • Adding new servers involves reassigning partitions without moving or reorganizing data within the partitions.

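The split between the permanent data-to-partition mapping and the adjustable partition-to-server mapping can be sketched as follows (a toy illustration; the function names and server labels are invented for this sketch):

```python
import hashlib

NUM_PARTITIONS = 8  # fixed at system initialization, never changes

def partition_for(key: str) -> int:
    # Permanent data -> partition mapping via a stable hash.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Adjustable partition -> server mapping; only this table changes when scaling.
partition_map = {0: "server1", 1: "server1", 2: "server2", 3: "server2",
                 4: "server3", 5: "server3", 6: "server4", 7: "server4"}

def server_for(key: str) -> str:
    return partition_map[partition_for(key)]

# Adding a fifth server only rewrites entries in partition_map;
# partition_for(key) never changes, so no data is rehashed.
partition_map[7] = "server5"
```

When server5 takes over partition 7, only that partition's rows move; every key's partition number stays the same.
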
Naive Example

We have a banking system with transactions stored in 8 fixed partitions, distributed based on a customer’s account ID.


CREATE TABLE transactions (
    id SERIAL PRIMARY KEY,
    account_id INT NOT NULL,
    transaction_amount NUMERIC(10, 2) NOT NULL,
    transaction_date DATE NOT NULL
) PARTITION BY HASH (account_id);

1. Create Partitions


DO $$
BEGIN
    FOR i IN 0..7 LOOP
        EXECUTE format(
            'CREATE TABLE transactions_p%s PARTITION OF transactions FOR VALUES WITH (modulus 8, remainder %s);',
            i, i
        );
    END LOOP;
END $$;

This creates 8 partitions (transactions_p0 to transactions_p7) based on the hash remainder of account_id modulo 8.

2. Inserting Data

When inserting data into the transactions table, PostgreSQL automatically places it into the correct partition based on the account_id.


INSERT INTO transactions (account_id, transaction_amount, transaction_date)
VALUES (12345, 500.00, '2025-01-01');

The hash of account_id 12345, taken modulo 8, determines the target partition (e.g., transactions_p5).

3. Querying Data

Querying the base table works transparently across all partitions


SELECT * FROM transactions WHERE account_id = 12345;

PostgreSQL automatically routes the query to the correct partition.

4. Scaling by Adding Servers

Initial Setup:

Suppose we have 4 servers managing the partitions,

  • Server 1: transactions_p0, transactions_p1
  • Server 2: transactions_p2, transactions_p3
  • Server 3: transactions_p4, transactions_p5
  • Server 4: transactions_p6, transactions_p7

Adding a New Server:

When a 5th server is added, we redistribute partitions,

  • Server 1: transactions_p0
  • Server 2: transactions_p1
  • Server 3: transactions_p2, transactions_p3
  • Server 4: transactions_p4
  • Server 5: transactions_p5, transactions_p6, transactions_p7

Partition Migration

  • During the migration, transactions_p5 is copied from Server 3 to Server 5.
  • Once the migration is complete, Server 5 becomes responsible for transactions_p5.

Benefits:

  1. Minimal Data Movement – When scaling, only the partitions being reassigned are copied to new servers. Data within partitions remains stable.
  2. Optimized Performance – Queries are routed directly to the relevant partition, minimizing scan times.
  3. Scalability – Adding servers is straightforward, as it involves reassigning partitions, not reorganizing data.

What happens when a new server is added? Don’t we need to copy the data?

When a partition is moved to a new server (e.g., partition_b from server_A to server_B), the data in the partition must be copied to the new server. However,

  1. The copying is limited to the partition being reassigned.
  2. No data within the partition is reorganized.
  3. Once the partition is fully migrated, the original copy is typically deleted.

For example, in PostgreSQL,

  • Export the Partition pg_dump -t partition_b -h server_A -U postgres > partition_b.sql
  • Import on New Server: psql -h server_B -U postgres -d mydb < partition_b.sql

Learning Notes #49 – Pitfall of Implicit Default Values in APIs

Today, we faced a bug in our workflow due to an implicit default value in a 3rd party API. In this blog, i will be sharing my experience for future reference.

Understanding the Problem

Consider an API where some fields are optional, and a default value is used when those fields are not provided by the client. This design is common and seemingly harmless. However, problems arise when,

  1. Unexpected Categorization: The default value influences logic, such as category assignment, in ways the client did not intend.
  2. Implicit Assumptions: The API assumes a default value aligns with the client’s intention, leading to misclassification or incorrect behavior.
  3. Debugging Challenges: When issues occur, clients and developers spend significant time tracing the problem because the default behavior is not transparent.

Here’s an example of how this might manifest,


POST /items
{
  "name": "Sample Item",
  "category": "premium"
}

If the category field is optional and a default value of "basic" is applied when it’s omitted, the following request,


POST /items
{
  "name": "Another Item"
}

might incorrectly classify the item as basic, even if the client intended it to be uncategorized.

Why This is a Code Smell

Implicit default handling for optional fields often signals poor design. Let’s break down why,

  1. Violation of the Principle of Least Astonishment: Clients may be unaware of default behavior, leading to unexpected outcomes.
  2. Hidden Logic: The business logic embedded in defaults is not explicit in the API’s contract, reducing transparency.
  3. Coupling Between API and Business Logic: When defaults dictate core behavior, the API becomes tightly coupled to specific business rules, making it harder to adapt or extend.
  4. Inconsistent Behavior: If the default logic changes in future versions, existing clients may experience breaking changes.

Best Practices to Avoid the Trap

  1. Make Default Behavior Explicit
    • Clearly document default values in the API specification (but we still missed it.)
    • For example, use OpenAPI/Swagger to define optional fields and their default values explicitly
  2. Avoid Implicit Defaults
    • Instead of applying defaults server-side, require the client to explicitly provide values, even if they are defaults.
    • This ensures the client is fully aware of the data being sent and its implications.
  3. Use Null or Explicit Indicators
    • Allow optional fields to be explicitly null or undefined, and handle these cases appropriately.
    • In this case, the API can handle null as β€œno category specified” rather than applying a default.
  4. Fail Fast with Validation
    • Use strict validation to reject ambiguous requests, encouraging clients to provide clear inputs.

{
  "error": "Field 'category' must be provided explicitly."
}

5. Version Your API Thoughtfully:

  • Document changes and provide clear migration paths for clients.
  • If you must change default behaviors, ensure backward compatibility through versioning.

Implicit default values for optional fields can lead to unintended consequences, obscure logic, and hard-to-debug issues. Recognizing this pattern as a code smell is the first step to building more robust APIs. By adopting explicitness, transparency, and rigorous validation, you can create APIs that are easier to use, understand, and maintain.

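The fail-fast validation described above can be sketched in Python (a hypothetical `create_item` handler, not the actual 3rd party API):

```python
def create_item(payload: dict) -> dict:
    # Fail fast: reject ambiguous requests instead of silently applying a default.
    if "category" not in payload:
        raise ValueError("Field 'category' must be provided explicitly.")
    # An explicit null means "no category specified" -- it is never coerced
    # into a default like "basic".
    return {"name": payload["name"], "category": payload["category"]}

print(create_item({"name": "Sample Item", "category": "premium"}))
# {'name': 'Sample Item', 'category': 'premium'}
```

A request that omits category now fails loudly at the boundary instead of being misclassified downstream.
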
POTD #19 – Count the number of possible triangles | Geeks For Geeks

Problem Statement

Geeks For Geeks – https://www.geeksforgeeks.org/problems/count-possible-triangles-1587115620/1

Given an integer array arr[], find the number of triangles that can be formed with three different array elements as the lengths of the three sides. A triangle with three given sides is only possible if the sum of any two sides is greater than the third side.


Input: arr[] = [4, 6, 3, 7]
Output: 3
Explanation: There are three triangles possible [3, 4, 6], [4, 6, 7] and [3, 6, 7]. Note that [3, 4, 7] is not a possible triangle.  


Input: arr[] = [10, 21, 22, 100, 101, 200, 300]
Output: 6
Explanation: There can be 6 possible triangles: [10, 21, 22], [21, 100, 101], [22, 100, 101], [10, 100, 101], [100, 101, 200] and [101, 200, 300]

My Approach


class Solution:
    #Function to count the number of possible triangles.
    def countTriangles(self, arr):
        # code here
        arr.sort()
        n = len(arr)
        cnt = 0
        for itr in range(2, n):
            left = 0
            right = itr - 1
            
            while left < right:
                
                if arr[left] + arr[right] > arr[itr]:
                    cnt += right - left
                    right -= 1
                else:
                    left += 1
        return cnt

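The same two-pointer logic, as a standalone function checked against the two examples above:

```python
def count_triangles(arr):
    # Sort, then for each candidate largest side arr[i], use two pointers to
    # count pairs (left, right) with arr[left] + arr[right] > arr[i].
    arr = sorted(arr)
    cnt = 0
    for i in range(2, len(arr)):
        left, right = 0, i - 1
        while left < right:
            if arr[left] + arr[right] > arr[i]:
                # All pairs (left..right-1, right) also satisfy the inequality.
                cnt += right - left
                right -= 1
            else:
                left += 1
    return cnt

print(count_triangles([4, 6, 3, 7]))                      # 3
print(count_triangles([10, 21, 22, 100, 101, 200, 300]))  # 6
```
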
Learning Notes #48 – Common Pitfalls in Event Driven Architecture

Today, i came across Raul Junco post on mistakes in Event Driven Architecture – https://www.linkedin.com/posts/raul-junco_after-years-building-event-driven-systems-activity-7278770394046631936-zu3-?utm_source=share&utm_medium=member_desktop. In this blog i am highlighting the same for future reference.

Event-driven architectures are awesome, but they come with their own set of challenges. Missteps can lead to unreliable systems, inconsistent data, and frustrated users. Let’s explore some of the most common pitfalls and how to address them effectively.

1. Duplication

Idempotent APIs – https://parottasalna.com/2025/01/08/learning-notes-47-idempotent-post-requests/

Events often get re-delivered due to retries or system failures. Without proper handling, duplicate events can,

  • Charge a customer twice for the same transaction: Imagine a scenario where a payment service retries a payment event after a temporary network glitch, resulting in a duplicate charge.
  • Cause duplicate inventory updates: For example, an e-commerce platform might update stock levels twice for a single order, leading to overestimating available stock.
  • Create inconsistent or broken system states: Duplicates can cascade through downstream systems, introducing mismatched or erroneous data.

Solution:

  • Assign unique IDs: Ensure every event has a globally unique identifier. Consumers can use these IDs to detect and discard duplicates.
  • Design idempotent processing: Structure your operations so they produce the same outcome even when executed multiple times. For instance, an API updating inventory could always set stock levels to a specific value rather than incrementing or decrementing.

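A minimal sketch of consumer-side deduplication using unique event IDs (in production, the processed-ID set would live in a durable store such as Redis or a database table, not in process memory):

```python
processed_ids = set()  # stand-in for a durable store of seen event IDs

def handle_event(event: dict) -> bool:
    # Process an event at most once, keyed by its globally unique ID.
    if event["id"] in processed_ids:
        return False  # duplicate delivery: safely ignored
    processed_ids.add(event["id"])
    # ... perform the actual side effect (charge, inventory update) here ...
    return True

assert handle_event({"id": "evt-1", "amount": 100}) is True
assert handle_event({"id": "evt-1", "amount": 100}) is False  # retry ignored
```
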
2. Not Guaranteeing Order

Events can arrive out of order when distributed across partitions or queues. This can lead to

  • Processing a refund before the payment: If a refund event is processed before the corresponding payment event, the system might show a negative balance or fail to reconcile properly.
  • Breaking logic that relies on correct sequence: Certain workflows, such as assembling logs or transactional data, depend on a strict event order to function correctly.

Solution

  • Use brokers with ordering guarantees: Message brokers like Apache Kafka support partition-level ordering. Design your topics and partitions to align with entities requiring ordered processing (e.g., user or account ID).
  • Add sequence numbers or timestamps: Include metadata in events to indicate their position in a sequence. Consumers can use this data to reorder events if necessary, ensuring logical consistency.

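The sequence-number idea can be sketched with a small buffer that releases events only in order (a toy consumer-side reorderer invented for this note, not a broker feature):

```python
import heapq

class Reorderer:
    # Buffer out-of-order events and release them strictly in sequence order.
    def __init__(self, first_seq=0):
        self.next_seq = first_seq
        self.buffer = []

    def push(self, event):
        heapq.heappush(self.buffer, (event["seq"], event))
        released = []
        while self.buffer and self.buffer[0][0] == self.next_seq:
            released.append(heapq.heappop(self.buffer)[1])
            self.next_seq += 1
        return released

r = Reorderer()
print(r.push({"seq": 1, "type": "refund"}))   # [] -- refund held back
print(r.push({"seq": 0, "type": "payment"}))  # releases payment, then refund
```

The refund arriving first is buffered until the payment with the preceding sequence number shows up.
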
3. The Dual Write Problem

Outbox Pattern: https://parottasalna.com/2025/01/03/learning-notes-31-outbox-pattern-cloud-pattern/

When writing to a database and publishing an event, one might succeed while the other fails. This can

  • Lose events: If the event is not published after the database write, downstream systems might remain unaware of critical changes, such as a new order or a status update.
  • Cause mismatched states: For instance, a transaction might be logged in a database but not propagated to analytical or monitoring systems, creating inconsistencies.

Solution

  • Use the Transactional Outbox Pattern: In this pattern, events are written to an β€œoutbox” table within the same database transaction as the main data write. A separate process then reads from the outbox and publishes events reliably.
  • Adopt Change Data Capture (CDC) tools: CDC tools like Debezium can monitor database changes and publish them as events automatically, ensuring no changes are missed.

4. Non-Backward-Compatible Changes

Changing event schemas without considering existing consumers can break systems. For example:

  • Removing a field: A consumer relying on this field might encounter null values or fail altogether.
  • Renaming or changing field types: This can lead to deserialization errors or misinterpretation of data.

Solution:

  • Maintain versioned schemas: Introduce new schema versions incrementally and ensure consumers can continue using older versions during the transition.
  • Use schema evolution-friendly formats: Formats like Avro or Protobuf natively support schema evolution, allowing you to add fields or make other non-breaking changes easily.
  • Add adapters for compatibility: Build adapters or translators that transform events from new schemas to older formats, ensuring backward compatibility for legacy systems.

Learning Notes #41 – Shared Lock and Exclusive Locks | Postgres

Today, I learnt about various locking mechanisms to prevent double updates. In this blog, i make notes on Shared Locks and Exclusive Locks for my future self.

What Are Locks in Databases?

Locks are mechanisms used by a DBMS to control access to data. They ensure that transactions are executed in a way that maintains the ACID (Atomicity, Consistency, Isolation, Durability) properties of the database. Locks can be classified into several types, including

  • Shared Locks (S Locks): Allow multiple transactions to read a resource simultaneously but prevent any transaction from writing to it.
  • Exclusive Locks (X Locks): Allow a single transaction to modify a resource, preventing both reading and writing by other transactions.
  • Intent Locks: Used to signal the type of lock a transaction intends to acquire at a lower level.
  • Deadlock Prevention Locks: Special locks aimed at preventing deadlock scenarios.

Shared Lock

A shared lock is used when a transaction needs to read a resource (e.g., a database row or table) without altering it. Multiple transactions can acquire a shared lock on the same resource simultaneously. However, as long as one or more shared locks exist on a resource, no transaction can acquire an exclusive lock on that resource.


-- Transaction A: Acquire a shared lock on a row
BEGIN;
SELECT * FROM employees WHERE id = 1 FOR SHARE;
-- Transaction B: Acquire a shared lock on the same row
BEGIN;
SELECT * FROM employees WHERE id = 1 FOR SHARE;
-- Both transactions can read the row concurrently
-- Transaction C: Attempt to update the same row
BEGIN;
UPDATE employees SET salary = salary + 1000 WHERE id = 1;
-- Transaction C will be blocked until Transactions A and B release their locks

Key Characteristics of Shared Locks

1. Concurrent Reads

  • Shared locks allow multiple transactions to read the same resource at the same time.
  • This is ideal for operations like SELECT queries that do not modify data.

2. Write Blocking

  • While a shared lock is active, no transaction can modify the locked resource.
  • Prevents dirty writes and ensures read consistency.

3. Compatibility

  • Shared locks are compatible with other shared locks but not with exclusive locks.

When Are Shared Locks Used?

Shared locks are typically employed in read operations under certain isolation levels. For instance,

1. Read Committed Isolation Level:

  • Shared locks are held for the duration of the read operation.
  • Prevents dirty reads by ensuring the data being read is not modified by other transactions during the read.

2. Repeatable Read Isolation Level:

  • Shared locks are held until the transaction completes.
  • Ensures that the data read during a transaction remains consistent and unmodified.

3. Snapshot Isolation:

  • Shared locks may not be explicitly used, as the DBMS creates a consistent snapshot of the data for the transaction.

Exclusive Locks

An exclusive lock is used when a transaction needs to modify a resource. Only one transaction can hold an exclusive lock on a resource at a time, preventing other transactions from locking that resource for reading (FOR SHARE) or writing.


-- Transaction X: Acquire an exclusive lock to update a row
BEGIN;
UPDATE employees SET salary = salary + 1000 WHERE id = 2;
-- Transaction Y: Attempt to read the same row with a shared lock
BEGIN;
SELECT * FROM employees WHERE id = 2 FOR SHARE;
-- Transaction Y will be blocked until Transaction X completes
-- (a plain SELECT without FOR SHARE would still succeed, thanks to MVCC)
-- Transaction Z: Attempt to update the same row
BEGIN;
UPDATE employees SET salary = salary + 500 WHERE id = 2;
-- Transaction Z will also be blocked until Transaction X completes

Key Characteristics of Exclusive Locks

1. Write Operations: Exclusive locks are essential for operations like INSERT, UPDATE, and DELETE.

2. Blocking Reads and Writes: While an exclusive lock is active, no other transaction can lock the resource for reading or writing (in PostgreSQL, plain MVCC reads still see the previous row version).

3. Isolation: Ensures that changes made by one transaction are not visible to others until the transaction is complete.

When Are Exclusive Locks Used?

Exclusive locks are typically employed in write operations or any operation that modifies the database. For instance:

1. Transactional Updates – A transaction that updates a row acquires an exclusive lock to ensure no other transaction can lock or modify the row during the update.

2. Table Modifications – When altering a table structure, the DBMS may place an exclusive lock on the entire table.

Benefits of Shared and Exclusive Locks

Benefits of Shared Locks

1. Consistency in Multi-User Environments – Ensure that data being read is not altered by other transactions, preserving consistency.
2. Concurrency Support – Allow multiple transactions to read data simultaneously, improving system performance.
3. Data Integrity – Prevent dirty reads and writes, ensuring that operations yield reliable results.

Benefits of Exclusive Locks

1. Data Integrity During Modifications – Prevents other transactions from accessing data being modified, ensuring changes are applied safely.
2. Isolation of Transactions – Ensures that modifications by one transaction are not visible to others until committed.

Limitations and Challenges

Shared Locks

1. Potential for Deadlocks – Deadlocks can occur if two transactions simultaneously hold shared locks and attempt to upgrade to exclusive locks.
2. Blocking Writes – Shared locks can delay write operations, potentially impacting performance in write-heavy systems.
3. Lock Escalation – In systems with high concurrency, shared locks may escalate to table-level locks, reducing granularity and concurrency.

Exclusive Locks

1. Reduced Concurrency – Exclusive locks prevent other transactions from accessing the locked resource, which can lead to bottlenecks in highly concurrent systems.
2. Risk of Deadlocks – Deadlocks can occur if two transactions attempt to acquire exclusive locks on resources held by each other.

Lock Compatibility

  • Shared + Shared: compatible – multiple readers can coexist.
  • Shared + Exclusive: not compatible – a writer waits for readers, and readers wait for a writer.
  • Exclusive + Exclusive: not compatible – only one writer at a time.

POTD #16 – Count Pairs whose sum is less than target | Geeks For Geeks

Problem Statement

Geeks For Geeks : https://www.geeksforgeeks.org/problems/count-pairs-whose-sum-is-less-than-target/1

Given an array arr[] and an integer target. You have to find the number of pairs in the array whose sum is strictly less than the target.


Input: arr[] = [7, 2, 5, 3], target = 8
Output: 2
Explanation: There are 2 pairs with sum less than 8: (2, 5) and (2, 3).


Input: arr[] = [5, 2, 3, 2, 4, 1], target = 5
Output: 4
Explanation: There are 4 pairs whose sum is less than 5: (2, 2), (2, 1), (3, 1) and (2, 1).


My Approach

Sorted the array and used the two pointer approach to find the possible pairs.


class Solution:
    #Complete the below function
    def countPairs(self, arr, target):
        #Your code here
        arr.sort()
        n = len(arr)
        cnt = 0
        left = 0
        right = n - 1
        while right > left:
            if arr[right] + arr[left] < target:
                cnt += right - left
                left += 1
            else:
                right -= 1
        return cnt

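The same counting logic as a standalone function, checked against the examples above:

```python
def count_pairs(arr, target):
    # Sort, then move two pointers inward: when arr[left] + arr[right] < target,
    # every index between left and right also pairs with arr[left].
    arr = sorted(arr)
    cnt = 0
    left, right = 0, len(arr) - 1
    while left < right:
        if arr[left] + arr[right] < target:
            cnt += right - left
            left += 1
        else:
            right -= 1
    return cnt

print(count_pairs([7, 2, 5, 3], 8))        # 2
print(count_pairs([5, 2, 3, 2, 4, 1], 5))  # 4
```
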
Learning Notes #40 – SAGA Pattern | Cloud Patterns

Today, I learnt about SAGA Pattern, followed by Compensation Pattern, Orchestration Pattern, Choreography Pattern and Two Phase Commit. SAGA is a combination of all the above. In this blog, i jot down notes on SAGA, for my future self.

Modern software applications often require the coordination of multiple distributed services to perform complex business operations. In such systems, ensuring consistency and reliability can be challenging, especially when a failure occurs in one of the services. The SAGA design pattern offers a robust solution to manage distributed transactions while maintaining data consistency.

What is the SAGA Pattern?

The SAGA pattern is a distributed transaction management mechanism where a series of independent operations (or steps) are executed sequentially across multiple services. Each operation in the sequence has a corresponding compensating action to roll back changes if a failure occurs. This approach avoids the complexities of distributed transactions, such as two-phase commits, by breaking down the process into smaller, manageable units.

Key Characteristics

1. Decentralized Control: Transactions are managed across services without a central coordinator.
2. Compensating Transactions: Every operation has an undo or rollback mechanism.
3. Asynchronous Communication: Services communicate asynchronously in most implementations, ensuring loose coupling.

Types of SAGA Patterns

There are two primary types of SAGA patterns:

1. Choreography-Based SAGA

• In this approach, services communicate with each other directly to coordinate the workflow.
• Each service knows which operation to trigger next after completing its own task.
• If a failure occurs, each service initiates its compensating action to roll back changes.

Advantages:

• Simple implementation.
• No central coordinator required.

Disadvantages:

• Difficult to manage and debug in complex workflows.
• Tight coupling between services.

import pika

class RabbitMQHandler:
    def __init__(self, queue):
        self.connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
        self.channel = self.connection.channel()
        self.channel.queue_declare(queue=queue)
        self.queue = queue

    def publish(self, message):
        self.channel.basic_publish(exchange='', routing_key=self.queue, body=message)

    def consume(self, callback):
        self.channel.basic_consume(queue=self.queue, on_message_callback=callback, auto_ack=True)
        self.channel.start_consuming()

# Define services
class FlightService:
    def book_flight(self):
        print("Flight booked.")
        RabbitMQHandler('hotel_queue').publish("flight_booked")

    def cancel_flight(self):
        print("Flight booking canceled.")

class HotelService:
    def on_flight_booked(self, ch, method, properties, body):
        try:
            print("Hotel booked.")
            RabbitMQHandler('invoice_queue').publish("hotel_booked")
        except Exception:
            print("Failed to book hotel. Rolling back flight.")
            FlightService().cancel_flight()

# Trigger the workflow, then start consuming
# (start_consuming() blocks, so the booking must be published first)
FlightService().book_flight()
RabbitMQHandler('hotel_queue').consume(HotelService().on_flight_booked)

2. Orchestration-Based SAGA

• A central orchestrator service manages the workflow and coordinates between the services.
• The orchestrator determines the sequence of operations and handles compensating actions in case of failures.

Advantages:

• Clear control and visibility of the workflow.
• Easier to debug and manage.

Disadvantages:

• The orchestrator can become a single point of failure.
• More complex implementation.

import pika

# Reuses the RabbitMQHandler class from the choreography example above
class Orchestrator:
    def __init__(self):
        self.rabbitmq = RabbitMQHandler('orchestrator_queue')

    def execute_saga(self):
        try:
            self.reserve_inventory()
            self.process_payment()
            self.generate_invoice()
        except Exception as e:
            print(f"Error occurred: {e}. Initiating rollback.")
            self.compensate()

    def reserve_inventory(self):
        print("Inventory reserved.")
        self.rabbitmq.publish("inventory_reserved")

    def process_payment(self):
        print("Payment processed.")
        self.rabbitmq.publish("payment_processed")

    def generate_invoice(self):
        print("Invoice generated.")
        self.rabbitmq.publish("invoice_generated")

    def compensate(self):
        print("Rolling back invoice.")
        print("Rolling back payment.")
        print("Rolling back inventory.")

# Trigger the workflow
Orchestrator().execute_saga()

How SAGA Works

1. Transaction Initiation: The first operation is executed by one of the services.
2. Service Communication: Subsequent services execute their operations based on the outcome of the previous step.
3. Failure Handling: If an operation fails, compensating transactions are triggered in reverse order to undo any changes.
4. Completion: Once all operations are successfully executed, the transaction is considered complete.

Benefits of the SAGA Pattern

1. Improved Resilience: Allows partial rollbacks in case of failure.
2. Scalability: Suitable for microservices and distributed systems.
3. Flexibility: Works well with event-driven architectures.
4. No Global Locks: Unlike traditional transactions, SAGA does not require global locking of resources.

Challenges and Limitations

1. Complexity in Rollbacks: Designing compensating transactions for every operation can be challenging.
2. Data Consistency: Achieving eventual consistency may require additional effort.
3. Debugging Issues: Debugging failures in a distributed environment can be cumbersome.
4. Latency: Sequential execution may increase overall latency.

When to Use the SAGA Pattern

• Distributed systems where global ACID transactions are infeasible.
• Microservices architectures with independent services.
• Applications requiring high resilience and eventual consistency.

Real-World Applications

1. E-Commerce Platforms: Managing orders, payments, and inventory updates.
2. Travel Booking Systems: Coordinating flight, hotel, and car rental reservations.
3. Banking Systems: Handling distributed account updates and transfers.
4. Healthcare: Coordinating appointment scheduling and insurance claims.

      Learning Notes #39 – Compensation Pattern | Cloud Pattern

      Today i learnt about compensation pattern, where it rollback a transactions when it face some failures. In this blog i jot down notes on compensating pattern and how it relates with SAGA pattern.

      Distributed systems often involve multiple services working together to perform a business operation. Ensuring data consistency and reliability across these services is challenging, especially in cases of failure. One solution is the use of compensation transactions, a mechanism designed to maintain consistency by reversing the effects of previous operations when errors occur.

      What Are Compensation Transactions?

      A compensation transaction is an operation that undoes the effect of a previously executed operation. Unlike traditional rollback mechanisms in centralized databases, compensation transactions are explicitly defined and executed in distributed systems to maintain consistency after a failure.

      Key Characteristics

      • Explicit Definition: Compensation logic must be explicitly implemented.
      • Independent Execution: Compensation operations are separate from the main transaction.
      • Eventual Consistency: Ensures the system reaches a consistent state over time.
      • Asynchronous Nature: Often triggered asynchronously to avoid blocking main processes.

      Why Are Compensation Transactions Important?

      1. Handling Failures in Distributed Systems

      In a distributed architecture, such as microservices, different services may succeed or fail independently. Compensation transactions allow partial rollbacks to maintain overall consistency.

      2. Avoiding Global Locking

      Traditional transactions with global locks (e.g., two-phase commits) are not feasible in distributed systems due to performance and scalability concerns. Compensation transactions provide a more flexible alternative.

      3. Resilience and Fault Tolerance

      Compensation mechanisms make systems more resilient by allowing recovery from failures without manual intervention.

      How Compensation Transactions Work

      1. Perform Main Operations: Each service performs its assigned operation, such as creating a record or updating a database.
      2. Log Operations: Log actions and context to enable compensating transactions if needed.
      3. Detect Failure: Monitor the workflow for errors or failures in any service.
      4. Trigger Compensation: If a failure occurs, execute compensation transactions for all successfully completed operations to undo their effects.

      Example Workflow

      Imagine an e-commerce checkout process involving three steps

      • Step 1: Reserve inventory.
      • Step 2: Deduct payment.
      • Step 3: Confirm order.

      If Step 3 fails, compensation transactions for Steps 1 and 2 might include

      • Releasing the reserved inventory.
      • Refunding the payment.
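As a rough sketch, the checkout flow above can be coordinated by pairing each step with its compensation and undoing completed steps in reverse order on failure. The step functions here (`reserve_inventory`, `refund_payment`, and so on) are hypothetical stand-ins for real service calls:

```python
def run_with_compensation(steps):
    """steps: list of (action, compensation) pairs. On failure,
    run the compensations of completed steps in reverse order."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        raise

# Hypothetical step and compensation functions for the checkout example.
log = []

def reserve_inventory(): log.append("inventory reserved")
def release_inventory(): log.append("inventory released")
def deduct_payment():    log.append("payment deducted")
def refund_payment():    log.append("payment refunded")
def confirm_order():     raise RuntimeError("order confirmation failed")  # Step 3 fails

try:
    run_with_compensation([
        (reserve_inventory, release_inventory),
        (deduct_payment, refund_payment),
        (confirm_order, lambda: None),
    ])
except RuntimeError as e:
    log.append(f"aborted: {e}")

# Steps 1 and 2 ran, then their compensations ran in reverse order.
print(log)
```

Note that the compensations execute in reverse order of the original steps, mirroring how a database rollback unwinds operations.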

      Design Considerations for Compensation Transactions

      1. Idempotency

      Ensure compensating actions are idempotent, meaning they can be executed multiple times without unintended side effects. This is crucial in distributed systems where retries are common.
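One minimal way to make a compensating refund idempotent is to key it by order id and treat duplicates as no-ops. The `RefundHandler` below is a hypothetical in-memory sketch; a real system would persist the processed ids:

```python
class RefundHandler:
    """Idempotent compensation: refunding the same order twice has no extra effect."""
    def __init__(self):
        self.processed = set()   # ids of already-refunded orders
        self.refunded_total = 0

    def refund(self, order_id, amount):
        if order_id in self.processed:
            return False         # duplicate request: safe no-op
        self.processed.add(order_id)
        self.refunded_total += amount
        return True

handler = RefundHandler()
print(handler.refund(101, 50))  # True  - first refund applied
print(handler.refund(101, 50))  # False - retried refund is a no-op
print(handler.refunded_total)   # 50
```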

      2. Consistency Model

      Adopt an eventual consistency model to align with the asynchronous nature of compensation transactions.

      3. Error Handling

      Design robust error-handling mechanisms for compensating actions, as these too can fail.

      4. Service Communication

      Use reliable communication protocols (e.g., message queues) to trigger and manage compensation transactions.

      5. Isolation of Compensation Logic

      Keep compensation logic isolated from the main business logic to maintain clarity and modularity.

      Use Cases for Compensation Transactions

      1. Financial Systems

      • Reversing failed fund transfers or unauthorized transactions.
      • Refunding payments in e-commerce platforms.

      2. Travel and Booking Systems

      • Canceling a hotel reservation if flight booking fails.
      • Releasing blocked seats if payment is not completed.

      3. Healthcare Systems

      • Undoing scheduled appointments if insurance validation fails.
      • Revoking prescriptions if a linked process encounters errors.

      4. Supply Chain Management

      • Canceling shipment orders if inventory updates fail.
      • Restocking items if order fulfillment is aborted.

      Challenges of Compensation Transactions

      1. Complexity in Implementation: Designing compensating logic for every operation can be tedious and error-prone.
      2. Performance Overhead: Logging operations and executing compensations can introduce latency.
      3. Partial Rollbacks: It may not always be possible to fully undo certain operations, such as sending emails or notifications.
      4. Failure in Compensating Actions: Compensation transactions themselves can fail, requiring additional mechanisms to handle such scenarios.

      Best Practices

      1. Plan for Compensation Early: Design compensating transactions as part of the initial development process.
      2. Use SAGA Pattern: Combine compensation transactions with the SAGA pattern to manage distributed workflows effectively.
      3. Test Extensively: Simulate failures and test compensating logic under various conditions.
      4. Monitor and Log: Maintain detailed logs of operations and compensations for debugging and audits.

      Learning Notes #38 – Choreography Pattern | Cloud Pattern

Today I learnt about the Choreography pattern, where services communicate with each other through a message queue instead of a central controller. In this blog, I jot down notes on the choreography pattern for my future self.

      What is the Choreography Pattern?

      In the Choreography Pattern, services communicate directly with each other via asynchronous events, without a central controller. Each service is responsible for a specific part of the workflow and responds to events produced by other services. This pattern allows for a more autonomous and loosely coupled system.

      Key Features

      • High scalability and independence of services.
      • Decentralized control.
      • Services respond to events they subscribe to.

      When to Use the Choreography Pattern

      • Event-Driven Systems: When workflows can be modeled as events triggering responses.
      • High Scalability: When services need to operate independently and scale autonomously.
      • Loose Coupling: When minimizing dependencies between services is critical.

      Benefits of the Choreography Pattern

      1. Decentralized Control: No single point of failure or bottleneck.
      2. Increased Flexibility: Services can be added or modified without affecting others.
      3. Better Scalability: Services operate independently and scale based on their workloads.
      4. Resilience: The system can handle partial failures more gracefully, as services continue independently.

      Example: E-Commerce Order Fulfillment

      Problem

      A fictional e-commerce platform needs to manage the following workflow:

      1. Accepting an order.
      2. Validating payment.
      3. Reserving inventory.
      4. Sending notifications to the customer.

      Each step is handled by an independent service.

      Solution

      Using the Choreography Pattern, each service listens for specific events and publishes new events as needed. The workflow emerges naturally from the interaction of these services.

      Implementation

      Step 1: Define the Workflow as Events

      • OrderPlaced: Triggered when a customer places an order.
      • PaymentProcessed: Triggered after successful payment.
      • InventoryReserved: Triggered after reserving inventory.
      • NotificationSent: Triggered when the customer is notified.

      Step 2: Implement Services

      Each service subscribes to events and performs its task.

shared_utils.py

      import pika
      import json
      
      def publish_event(exchange, event_type, data):
          connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
          channel = connection.channel()
          channel.exchange_declare(exchange=exchange, exchange_type='fanout')
          message = json.dumps({"event_type": event_type, "data": data})
          channel.basic_publish(exchange=exchange, routing_key='', body=message)
          connection.close()
      
      def subscribe_to_event(exchange, callback):
          connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
          channel = connection.channel()
          channel.exchange_declare(exchange=exchange, exchange_type='fanout')
          queue = channel.queue_declare('', exclusive=True).method.queue
          channel.queue_bind(exchange=exchange, queue=queue)
          channel.basic_consume(queue=queue, on_message_callback=callback, auto_ack=True)
          print(f"Subscribed to events on exchange '{exchange}'")
          channel.start_consuming()
      
      

      Order Service

      
      from shared_utils import publish_event
      
      def place_order(order_id, customer):
          print(f"Placing order {order_id} for {customer}")
          publish_event("order_exchange", "OrderPlaced", {"order_id": order_id, "customer": customer})
      
      if __name__ == "__main__":
          # Simulate placing an order
          place_order(order_id=101, customer="John Doe")
      

      Payment Service

      
from shared_utils import publish_event, subscribe_to_event
import json
import time
      
      def handle_order_placed(ch, method, properties, body):
          event = json.loads(body)
          if event["event_type"] == "OrderPlaced":
              order_id = event["data"]["order_id"]
              print(f"Processing payment for order {order_id}")
              time.sleep(1)  # Simulate payment processing
              publish_event("payment_exchange", "PaymentProcessed", {"order_id": order_id})
      
      if __name__ == "__main__":
          subscribe_to_event("order_exchange", handle_order_placed)
      
      

      Inventory Service

      
from shared_utils import publish_event, subscribe_to_event
import json
import time
      
      def handle_payment_processed(ch, method, properties, body):
          event = json.loads(body)
          if event["event_type"] == "PaymentProcessed":
              order_id = event["data"]["order_id"]
              print(f"Reserving inventory for order {order_id}")
              time.sleep(1)  # Simulate inventory reservation
              publish_event("inventory_exchange", "InventoryReserved", {"order_id": order_id})
      
      if __name__ == "__main__":
          subscribe_to_event("payment_exchange", handle_payment_processed)
      
      

      Notification Service

      
from shared_utils import subscribe_to_event
import json
import time
      
      def handle_inventory_reserved(ch, method, properties, body):
          event = json.loads(body)
          if event["event_type"] == "InventoryReserved":
              order_id = event["data"]["order_id"]
              print(f"Notifying customer for order {order_id}")
              time.sleep(1)  # Simulate notification
              print(f"Customer notified for order {order_id}")
      
      if __name__ == "__main__":
          subscribe_to_event("inventory_exchange", handle_inventory_reserved)
      
      

      Step 3: Run the Workflow

1. Start RabbitMQ, for example with Docker: docker run -d -p 5672:5672 rabbitmq:3
      2. Run the services in the following order:
        • Notification Service: python notification_service.py
        • Inventory Service: python inventory_service.py
        • Payment Service: python payment_service.py
        • Order Service: python order_service.py
      3. Place an order by running the Order Service. The workflow will propagate through the services as events are handled.

      Key Considerations

      1. Event Bus: Use an event broker like RabbitMQ, Kafka, or AWS SNS to manage communication between services.
      2. Event Versioning: Include versioning to handle changes in event formats over time.
      3. Idempotency: Ensure services handle repeated events gracefully to avoid duplication.
      4. Monitoring and Tracing: Use tools like OpenTelemetry to trace and debug distributed workflows.
      5. Error Handling:
        • Dead Letter Queues (DLQs) to capture failed events.
        • Retries with backoff for transient errors.
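For the idempotency consideration above, one sketch is to wrap an event handler so repeated deliveries of the same event are ignored. The event key and the in-memory `seen` set are assumptions for illustration; a real service would use a persistent store shared across instances:

```python
import json

seen = set()  # keys of events already handled (in-memory for this sketch)

def idempotent(handler):
    """Wrap a handler so duplicate deliveries of the same event are skipped."""
    def wrapper(body):
        event = json.loads(body)
        key = (event["event_type"], event["data"]["order_id"])
        if key in seen:
            return  # duplicate delivery: skip
        seen.add(key)
        handler(event)
    return wrapper

handled = []

@idempotent
def handle_payment_processed(event):
    handled.append(event["data"]["order_id"])

msg = json.dumps({"event_type": "PaymentProcessed", "data": {"order_id": 101}})
handle_payment_processed(msg)
handle_payment_processed(msg)  # redelivered by the broker: ignored
print(handled)  # [101]
```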

      Advantages of the Choreography Pattern

      1. Loose Coupling: Services interact via events without direct knowledge of each other.
      2. Resilience: Failures in one service don’t block the entire workflow.
      3. High Autonomy: Services operate independently and can be deployed or scaled separately.
      4. Dynamic Workflows: Adding new services to the workflow requires subscribing them to relevant events.

      Challenges of the Choreography Pattern

      1. Complex Debugging: Tracing errors across distributed services can be difficult.
      2. Event Storms: Poorly designed workflows may generate excessive events, overwhelming the system.
      3. Coordination Overhead: Decentralized logic can lead to inconsistent behavior if not carefully managed.

      Orchestrator vs. Choreography: When to Choose?

      • Use Orchestrator Pattern when workflows are complex, require central control, or involve many dependencies.
      • Use Choreography Pattern when you need high scalability, loose coupling, or event-driven workflows.

      Learning Notes #37 – Orchestrator Pattern | Cloud Pattern

Today, I learnt about the Orchestrator pattern while I was learning about the SAGA pattern. It simplifies the coordination of distributed workflows, making the system more efficient and easier to manage. In this blog I jot down notes on the Orchestrator pattern for better understanding.

      What is the Orchestrator Pattern?

      The Orchestrator Pattern is a design strategy where a central orchestrator coordinates interactions between various services or components to execute a workflow.

      Unlike the Choreography Pattern, where services interact with each other independently and are aware of their peers, the orchestrator acts as the central decision-maker, directing how and when services interact.

      Key Features

      • Centralized control of workflows.
      • Simplified service communication.
      • Enhanced error handling and monitoring.

      When to Use the Orchestrator Pattern

      • Complex Workflows: When multiple services or steps need to be executed in a defined sequence.
      • Error Handling: When failures in one step require recovery strategies or compensating transactions.
      • Centralized Logic: When you want to encapsulate business logic in a single place for easier maintenance.

      Benefits of the Orchestrator Pattern

      1. Simplifies Service Communication: Services remain focused on their core functionality while the orchestrator manages interactions.
      2. Improves Scalability: Workflows can be scaled independently from services.
      3. Centralized Monitoring: Makes it easier to track the progress of workflows and debug issues.
      4. Flexibility: Changing a workflow involves modifying the orchestrator, not the services.

      Example: Order Processing Workflow

      Problem

      A fictional e-commerce platform needs to process orders. The workflow involves:

      1. Validating the order.
      2. Reserving inventory.
      3. Processing payment.
      4. Notifying the user.

      Each step is handled by a separate microservice.

      Solution

      We implement an orchestrator to manage this workflow. Let’s see how this works in practice.

      
      import requests
      
      class OrderOrchestrator:
          def __init__(self):
              self.services = {
                  "validate_order": "http://order-service/validate",
                  "reserve_inventory": "http://inventory-service/reserve",
                  "process_payment": "http://payment-service/process",
                  "notify_user": "http://notification-service/notify",
              }
      
          def execute_workflow(self, order_id):
              try:
                  # Step 1: Validate Order
                  self.call_service("validate_order", {"order_id": order_id})
      
                  # Step 2: Reserve Inventory
                  self.call_service("reserve_inventory", {"order_id": order_id})
      
                  # Step 3: Process Payment
                  self.call_service("process_payment", {"order_id": order_id})
      
                  # Step 4: Notify User
                  self.call_service("notify_user", {"order_id": order_id})
      
                  print(f"Order {order_id} processed successfully!")
              except Exception as e:
                  print(f"Error processing order {order_id}: {e}")
      
          def call_service(self, service_name, payload):
              url = self.services[service_name]
              response = requests.post(url, json=payload)
              if response.status_code != 200:
                  raise Exception(f"{service_name} failed: {response.text}")
      

      Key Tactics for Implementation

      1. Services vs. Serverless: Use serverless functions for steps that are triggered occasionally and don’t need always-on services, reducing costs.
      2. Recovery from Failures:
        • Retry Mechanism: Configure retries with limits and delays to handle transient failures.
        • Circuit Breaker Pattern: Detect and isolate failing services to allow recovery.
        • Graceful Degradation: Use fallbacks like cached results or alternate services to ensure continuity.
      3. Monitoring and Alerting:
        • Implement real-time monitoring with automated recovery strategies.
        • Set up alerts for exceptions and utilize logs for troubleshooting.
      4. Orchestration Service Failures:
        • Service Replication: Deploy multiple instances of the orchestrator for failover.
        • Data Replication: Ensure data consistency for seamless recovery.
        • Request Queues: Use queues to buffer requests during downtime and process them later.
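The retry tactic above can be sketched as a small wrapper around a service call with exponential backoff; the retry limits and delays here are illustrative, not prescriptive:

```python
import time

def call_with_retry(func, retries=3, base_delay=0.1):
    """Retry a callable with exponential backoff; re-raise after the last attempt."""
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, ...

# Simulated flaky service: fails twice with a transient error, then succeeds.
attempts = {"n": 0}

def flaky_service():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(call_with_retry(flaky_service))  # ok, after two retries
```

In the orchestrator above, each `self.call_service(...)` line could be wrapped this way, with the circuit breaker layered on top to stop retrying a service that is persistently failing.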

      Important Considerations

      The primary goal of this architectural pattern is to decompose the entire business workflow into multiple services, making it more flexible and scalable. Due to this, it’s crucial to analyze and comprehend the business processes in detail before implementation. A poorly defined and overly complicated business process will lead to a system that would be hard to maintain and scale.

      Secondly, it’s easy to fall into the trap of adding business logic into the orchestration service. Sometimes it’s inevitable because certain functionalities are too small to create their separate service. But the risk here is that if the orchestration service becomes too intelligent and performs too much business logic, it can evolve into a monolithic application that also happens to talk to microservices. So, it’s crucial to keep track of every addition to the orchestration service and ensure that its work remains within the boundaries of orchestration. Maintaining the scope of the orchestration service will prevent it from becoming a burden on the system, leading to decreased scalability and flexibility.

      Why Use the Orchestration Pattern

      The pattern comes with the following advantages

      • Orchestration makes it easier to understand, monitor, and observe the application, resulting in a better understanding of the core part of the system with less effort.
      • The pattern promotes loose coupling. Each downstream service exposes an API interface and is self-contained, without any need to know about the other services.
      • The pattern simplifies the business workflows and improves the separation of concerns. Each service participates in a long-running transaction without any need to know about it.
      • The orchestrator service can decide what to do in case of failure, making the system fault-tolerant and reliable.
