
Learning Notes #57 – Partial Indexing in Postgres

16 January 2025 at 14:36

Today, I learnt about partial indexing in Postgres and how it optimizes queries that filter on a subset of a table. In this blog, I jot down notes on partial indexing.

Partial indexing in PostgreSQL is a powerful feature that provides a way to optimize database performance by creating indexes that apply only to a subset of a table’s rows. This selective indexing can result in reduced storage space, faster index maintenance, and improved query performance, especially when queries frequently involve filters or conditions that only target a portion of the data.

An index in PostgreSQL, like in other relational database management systems, is a data structure that improves the speed of data retrieval operations. However, creating an index on an entire table can sometimes be inefficient, especially when dealing with very large datasets where queries often focus on specific subsets of the data. This is where partial indexing becomes invaluable.

Unlike a standard index that covers every row in a table, a partial index only includes rows that satisfy a specified condition. This condition is defined using a WHERE clause when the index is created.

To understand the mechanics, let us consider a practical example.

Suppose you have a table named orders that stores details about customer orders, including columns like order_id, customer_id, order_date, status, and total_amount. If the majority of your queries focus on pending orders, those where the status is pending, creating a partial index specifically for these rows can significantly improve performance.

Example 1:

Here’s how you can create such an index,

CREATE INDEX idx_pending_orders
ON orders (order_date)
WHERE status = 'pending';

In this example, the index idx_pending_orders includes only the rows where status equals pending. This means that any query that involves filtering by status = 'pending' and utilizes the order_date column will leverage this index. For instance, the following query would benefit from the partial index,

SELECT *
FROM orders
WHERE status = 'pending'
AND order_date > '2025-01-01';

The benefits of this approach are significant. By indexing only the rows with status = 'pending', the size of the index is much smaller compared to a full table index.

This reduction in size not only saves disk space but also speeds up the process of scanning the index, as there are fewer entries to traverse. Furthermore, updates or modifications to rows that do not meet the WHERE condition are excluded from index maintenance, thereby reducing the overhead of maintaining the index and improving performance for write operations.
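
To see whether the planner actually picks the partial index, and how small it is on disk, a quick check along these lines helps (a sketch using psycopg2 with placeholder connection details, assuming the orders table and idx_pending_orders from above exist),

import psycopg2

conn = psycopg2.connect("dbname=yourdb user=youruser")
with conn.cursor() as cur:
    # Ask the planner how it would execute the query from above
    cur.execute("""
        EXPLAIN
        SELECT * FROM orders
        WHERE status = 'pending' AND order_date > '2025-01-01';
    """)
    for (line,) in cur.fetchall():
        print(line)  # expect an index or bitmap scan on idx_pending_orders

    # Check the on-disk size of the partial index
    cur.execute("SELECT pg_size_pretty(pg_relation_size('idx_pending_orders'));")
    print("partial index size:", cur.fetchone()[0])

conn.close()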

Example 2:

Let us explore another example. Suppose your application frequently queries orders that exceed a certain total amount. You can create a partial index tailored to this use case,

CREATE INDEX idx_high_value_orders
ON orders (customer_id)
WHERE total_amount > 1000;

This index would optimize queries like the following,

SELECT *
FROM orders
WHERE total_amount > 1000
AND customer_id = 123;

The key advantage here is that the index only includes rows where total_amount > 1000. For datasets with a wide range of order amounts, this can dramatically reduce the number of indexed entries. Queries that filter by high-value orders become faster because the database does not need to sift through irrelevant rows.

Additionally, as with the previous example, index maintenance is limited to the subset of rows matching the condition, improving overall performance for insertions and updates.

Partial indexes are also useful for enforcing constraints in a selective manner. Consider a scenario where you want to ensure that no two active promotions exist for the same product. You can achieve this using a unique partial index,

CREATE UNIQUE INDEX idx_unique_active_promotion
ON promotions (product_id)
WHERE is_active = true;

This index guarantees that only one row with is_active = true can exist for each product_id.
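
A small demonstration of that behavior (a psycopg2 sketch; it assumes a promotions table with at least product_id and is_active columns, which this post does not define),

import psycopg2
from psycopg2 import errors

conn = psycopg2.connect("dbname=yourdb user=youruser")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("INSERT INTO promotions (product_id, is_active) VALUES (42, true);")
    try:
        # A second active promotion for the same product violates the unique partial index
        cur.execute("INSERT INTO promotions (product_id, is_active) VALUES (42, true);")
    except errors.UniqueViolation as e:
        print("rejected as expected:", e)

    # An inactive promotion for the same product is still allowed,
    # because rows with is_active = false are not covered by the index
    cur.execute("INSERT INTO promotions (product_id, is_active) VALUES (42, false);")

conn.close()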

In conclusion, partial indexing in PostgreSQL offers a flexible and efficient way to optimize database performance by targeting specific subsets of data.


Learning Notes #54 – Architecture Decision Records

14 January 2025 at 02:35

Over the last few days, I was learning how to make accountable decisions on technical matters. Then I came across ADRs. So far I haven’t used them, or seen them used by our team. I think this is a necessary practice to incorporate for making accountable decisions. In this blog, I share details on ADRs for my future reference.

What is an ADR?

An Architectural Decision Record (ADR) is a concise document that captures a single architectural decision, its context, the reasoning behind it, and its consequences. ADRs help teams document, share, and revisit architectural choices, ensuring transparency and better collaboration.

Why Use ADRs?

  1. Documentation: ADRs serve as a historical record of why certain decisions were made.
  2. Collaboration: They promote better understanding across teams.
  3. Traceability: ADRs link architectural decisions to specific project requirements and constraints.
  4. Accountability: They clarify who made a decision and when.
  5. Change Management: ADRs help evaluate the impact of changes and facilitate discussions around reversals or updates.

ADR Structure

A typical ADR document follows a standard format. Here’s an example:

  1. Title: A clear and concise title describing the decision.
  2. Context: Background information explaining the problem or opportunity.
  3. Decision: A summary of the chosen solution.
  4. Consequences: The positive and negative outcomes of the decision.
  5. Status: Indicates whether the decision is proposed, accepted, superseded, or deprecated.

Example:

Optimistic locking on MongoDB https://docs.google.com/document/d/1olCbicQeQzYpCxB0ejPDtnri9rWb2Qhs9_JZuvANAxM/edit?usp=sharing

References

  1. https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions
  2. https://www.infoq.com/podcasts/architecture-advice-process/
  3. Recommended: https://github.com/joelparkerhenderson/architecture-decision-record/tree/main

Learning Notes #53 – The Expiration Time Can Be Unexpectedly Lost While Using Redis SET EX

12 January 2025 at 09:14

Redis, a high-performance in-memory key-value store, is widely used for caching, session management, and various other scenarios where fast data retrieval is essential. One of its key features is the ability to set expiration times for keys. However, when using the SET command with the EX option, developers might encounter unexpected behaviors where the expiration time is seemingly lost. Let’s explore this issue in detail.

Understanding SET with EX

The Redis SET command with the EX option allows you to set a key’s value and specify its expiration time in seconds. For instance,


SET key value EX 60

This command sets the key key to the value value and sets an expiration time of 60 seconds.

The Problem

In certain cases, the expiration time might be unexpectedly lost. This typically happens when subsequent operations overwrite the key without specifying a new expiration. For example,


SET key value1 EX 60
SET key value2

In the above sequence,

  1. The first SET command assigns a value to key and sets an expiration of 60 seconds.
  2. The second SET command overwrites the value of key but does not include an expiration time, resulting in the key persisting indefinitely.

This behavior can lead to subtle bugs, especially in applications that rely on key expiration for correctness or resource management.

Why Does This Happen?

The Redis SET command is designed to replace the entire state of a key, including its expiration. When you use SET without the EX, PX, or EXAT options, the expiration is removed, and the key becomes persistent. This behavior aligns with the principle that SET is a complete update operation.
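
Here is a minimal redis-py sketch of the behavior and two common ways to preserve the TTL (assuming a local Redis instance; KEEPTTL requires Redis 6.0 or newer),

import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

r.set('key', 'value1', ex=60)
print(r.ttl('key'))            # ~60 seconds remaining

r.set('key', 'value2')         # plain SET: the expiration is silently dropped
print(r.ttl('key'))            # -1, the key is now persistent

# Option 1: KEEPTTL preserves the existing expiration on overwrite
r.set('key', 'value1', ex=60)
r.set('key', 'value3', keepttl=True)
print(r.ttl('key'))            # still counting down

# Option 2: re-apply the expiration explicitly on every write
r.set('key', 'value4', ex=60)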

When using Redis SET with EX, be mindful of operations that might overwrite keys without reapplying expiration. Understanding Redis’s behavior and implementing robust patterns can save you from unexpected issues, ensuring your application remains efficient and reliable.

Learning Notes #52 – Hybrid Origin Failover Pattern

12 January 2025 at 06:29

Today, I learnt about failover patterns from AWS https://aws.amazon.com/blogs/networking-and-content-delivery/three-advanced-design-patterns-for-high-available-applications-using-amazon-cloudfront/ . In this blog, I jot down my understanding of this pattern for future reference.

Hybrid origin failover is a strategy that combines two distinct approaches to handle origin failures effectively, balancing speed and resilience.

The Need for Origin Failover

When an application’s primary origin server becomes unavailable, the ability to reroute traffic to a secondary origin ensures continuity. The failover process determines how quickly and effectively this switch happens. Broadly, there are two approaches to implement origin failover:

  1. Stateful Failover with DNS-based Routing
  2. Stateless Failover with Application Logic

Each has its strengths and limitations, which the hybrid approach aims to mitigate.

Approach 1: Stateful Failover with DNS-based Routing

Stateful failover is a system that allows a standby server to take over for a failed server and continue active sessions. It’s used to create a resilient network infrastructure and avoid service interruptions.

This method relies on a DNS service with health checks to detect when the primary origin is unavailable. Here’s how it works,

  1. Health Checks: The DNS service continuously monitors the health of the primary origin using health checks (e.g., HTTP, HTTPS).
  2. DNS Failover: When the primary origin is marked unhealthy, the DNS service resolves the origin’s domain name to the secondary origin’s IP address.
  3. TTL Impact: The failover process honors the DNS Time-to-Live (TTL) settings. A low TTL ensures faster propagation, but even in the most optimal configurations, this process introduces a delay, often around 60 to 70 seconds.
  4. Stateful Behavior: Once failover occurs, all traffic is routed to the secondary origin until the primary origin is marked healthy again.

Implementation from AWS (as-is from aws blog)

The first approach is using Amazon Route 53 Failover routing policy with health checks on the origin domain name that’s configured as the origin in CloudFront. When the primary origin becomes unhealthy, Route 53 detects it, and then starts resolving the origin domain name with the IP address of the secondary origin. CloudFront honors the origin DNS TTL, which means that traffic will start flowing to the secondary origin within the DNS TTLs. The most optimal configuration (Fast Check activated, a failover threshold of 1, and 60 second DNS TTL) means that the failover will take 70 seconds at minimum to occur. When it does, all of the traffic is switched to the secondary origin, since it’s a stateful failover. Note that this design can be further extended with Route 53 Application Recovery Control for more sophisticated application failover across multiple AWS Regions, Availability Zones, and on-premises.

The second approach is using origin failover, a native feature of CloudFront. This capability of CloudFront tries for the primary origin of every request, and if a configured 4xx or 5xx error is received, then CloudFront attempts a retry with the secondary origin. This approach is simple to configure and provides immediate failover. However, it’s stateless, which means every request must fail independently, thus introducing latency to failed requests. For transient origin issues, this additional latency is an acceptable tradeoff with the speed of failover, but it’s not ideal when the origin is completely out of service. Finally, this approach only works for the GET/HEAD/OPTIONS HTTP methods, because other HTTP methods are not allowed on a CloudFront cache behavior with Origin Failover enabled.

Advantages

  • Works for all HTTP methods and request types.
  • Ensures complete switchover, minimizing ongoing failures.

Disadvantages

  • Relatively slower failover due to DNS propagation time.
  • Requires a reliable health-check mechanism.

Approach 2: Stateless Failover with Application Logic

This method handles failover at the application level. If a request to the primary origin fails (e.g., due to a 4xx or 5xx HTTP response), the application or CDN immediately retries the request with the secondary origin.

How It Works

  1. Primary Request: The application sends a request to the primary origin.
  2. Failure Handling: If the response indicates a failure (configurable for specific error codes), the request is retried with the secondary origin.
  3. Stateless Behavior: Each request operates independently, so failover happens on a per-request basis without waiting for a stateful switchover.
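
As a rough illustration, the per-request retry logic can be as simple as the following sketch (hypothetical origin URLs and retriable status codes; CloudFront origin failover does the equivalent for you when configured),

import requests

PRIMARY = "https://primary.example.com"      # hypothetical origins
SECONDARY = "https://secondary.example.com"
RETRIABLE_STATUS = {500, 502, 503, 504}

def fetch(path, timeout=2):
    # Try the primary origin first; fall back to the secondary on error or timeout
    for origin in (PRIMARY, SECONDARY):
        try:
            response = requests.get(origin + path, timeout=timeout)
            if response.status_code not in RETRIABLE_STATUS:
                return response
        except requests.RequestException:
            pass  # connection error or timeout: try the next origin
    raise RuntimeError(f"both origins failed for {path}")

# Every request pays the retry cost only when the primary actually fails
# response = fetch("/api/orders")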

Implementation from AWS (as-is from aws blog)

The hybrid origin failover pattern combines both approaches to get the best of both worlds. First, you configure both of your origins with a Failover Policy in Route 53 behind a single origin domain name. Then, you configure an origin failover group with the single origin domain name as primary origin, and the secondary origin domain name as secondary origin. This means that when the primary origin becomes unavailable, requests are immediately retried with the secondary origin until the stateful failover of Route 53 kicks in within tens of seconds, after which requests go directly to the secondary origin without any latency penalty. Note that this pattern only works with the GET/HEAD/OPTIONS HTTP methods.

Advantages

  • Near-instantaneous failover for failed requests.
  • Simple to configure and doesn’t depend on DNS TTL.

Disadvantages

  • Adds latency for failed requests due to retries.
  • Limited to specific HTTP methods like GET, HEAD, and OPTIONS.
  • Not suitable for scenarios where the primary origin is entirely down, as every request must fail first.

The Hybrid Origin Failover Pattern

The hybrid origin failover pattern combines the strengths of both approaches, mitigating their individual limitations. Here’s how it works:

  1. DNS-based Stateful Failover: A DNS service with health checks monitors the primary origin and switches to the secondary origin if the primary becomes unhealthy. This ensures a complete and stateful failover within tens of seconds.
  2. Application-level Stateless Failover: Simultaneously, the application or CDN is configured to retry failed requests with a secondary origin. This provides an immediate failover mechanism for transient or initial failures.

Implementation Steps

  1. DNS Configuration
    • Set up health checks on the primary origin.
    • Define a failover policy in the DNS service, which resolves the origin domain name to the secondary origin when the primary is unhealthy.
  2. Application Configuration
    • Configure the application or CDN to use an origin failover group.
    • Specify the primary origin domain as the primary origin and the secondary origin domain as the backup.

Behavior

  • Initially, if the primary origin encounters issues, requests are retried immediately with the secondary origin.
  • Meanwhile, the DNS failover switches all traffic to the secondary origin within tens of seconds, eliminating retry latencies for subsequent requests.

Benefits of Hybrid Origin Failover

  1. Faster Failover: Immediate retries for failed requests minimize initial impact, while DNS failover ensures long-term stability.
  2. Reduced Latency: After DNS failover, subsequent requests don’t experience retry delays.
  3. High Resilience: Combines stateful and stateless failover for robust redundancy.
  4. Simplicity and Scalability: Leverages existing DNS and application/CDN features without complex configurations.

Limitations and Considerations

  1. HTTP Method Constraints: Stateless failover works only for GET, HEAD, and OPTIONS methods, limiting its use for POST or PUT requests.
  2. TTL Impact: Low TTLs reduce propagation delays but increase DNS query rates, which could lead to higher costs.
  3. Configuration Complexity: Combining DNS and application-level failover requires careful setup and testing to avoid misconfigurations.
  4. Secondary Origin Capacity: Ensure the secondary origin can handle full traffic loads during failover.

Learning Notes #51 – Postgres as a Queue using SKIP LOCKED

11 January 2025 at 06:56

Yesterday, I came across a blog from inferable.ai https://www.inferable.ai/blog/posts/postgres-skip-locked, which walks through using Postgres as a queue. In this blog, I jot down notes on using Postgres as a queue for future reference.

PostgreSQL is a robust relational database that can be used for more than just storing structured data. With the SKIP LOCKED feature introduced in PostgreSQL 9.5, you can efficiently turn a PostgreSQL table into a job queue for distributed processing.

Why Use PostgreSQL as a Queue?

Using PostgreSQL as a queue can be advantageous because,

  • Familiarity: If you’re already using PostgreSQL, there’s no need for an additional message broker.
  • Durability: PostgreSQL ensures ACID compliance, offering reliability for your job processing.
  • Simplicity: No need to manage another component like RabbitMQ or Kafka.

Implementing a Queue with SKIP LOCKED

1. Create a Queue Table

To start, you need a table to store the jobs,


CREATE TABLE job_queue (
    id SERIAL PRIMARY KEY,
    job_data JSONB NOT NULL,
    status TEXT DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

This table has the following columns,

  • id: A unique identifier for each job.
  • job_data: The data or payload for the job.
  • status: Tracks the job’s state ('pending', 'in_progress', or 'completed').
  • created_at: Timestamp of job creation.

2. Insert Jobs into the Queue

Adding jobs is straightforward,


INSERT INTO job_queue (job_data)
VALUES ('{"task": "send_email", "email": "user@example.com"}');

3. Fetch Jobs for Processing with SKIP LOCKED

Workers will fetch jobs from the queue using SELECT ... FOR UPDATE SKIP LOCKED to avoid contention,

WITH next_job AS (
    SELECT id, job_data
    FROM job_queue
    WHERE status = 'pending'
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
UPDATE job_queue
SET status = 'in_progress'
FROM next_job
WHERE job_queue.id = next_job.id
RETURNING job_queue.id, job_queue.job_data;

Key Points:

  • FOR UPDATE locks the selected row to prevent other workers from picking it up.
  • SKIP LOCKED ensures locked rows are skipped, enabling concurrent workers to operate without waiting.
  • LIMIT 1 processes one job at a time per worker.

4. Mark Jobs as Completed

Once a worker finishes processing a job, it should update the job’s status,


UPDATE job_queue
SET status = 'completed'
WHERE id = $1; -- Replace $1 with the job ID

5. Delete Old or Processed Jobs

To keep the table clean, you can periodically remove completed jobs,


DELETE FROM job_queue
WHERE status = 'completed' AND created_at < NOW() - INTERVAL '30 days';

Example Worker Implementation

Here’s an example of a worker implemented in Python using psycopg2


import time

import psycopg2
from psycopg2.extras import RealDictCursor

connection = psycopg2.connect("dbname=yourdb user=youruser")

while True:
    with connection.cursor(cursor_factory=RealDictCursor) as cursor:
        cursor.execute(
            """
            WITH next_job AS (
                SELECT id, job_data
                FROM job_queue
                WHERE status = 'pending'
                FOR UPDATE SKIP LOCKED
                LIMIT 1
            )
            UPDATE job_queue
            SET status = 'in_progress'
            FROM next_job
            WHERE job_queue.id = next_job.id
            RETURNING job_queue.id, job_queue.job_data;
            """
        )

        job = cursor.fetchone()
        if job:
            print(f"Processing job {job['id']}: {job['job_data']}")

            # Simulate job processing
            cursor.execute("UPDATE job_queue SET status = 'completed' WHERE id = %s", (job['id'],))

        else:
            print("No jobs available. Sleeping...")
            time.sleep(5)

    connection.commit()

Considerations

  1. Transaction Isolation: Use the REPEATABLE READ or SERIALIZABLE isolation level cautiously to avoid unnecessary locks.
  2. Row Locking: SKIP LOCKED only skips rows locked by other transactions, not those locked within the same transaction.
  3. Performance: Regularly archive or delete old jobs to prevent the table from growing indefinitely. Consider indexing the status column to improve query performance.
  4. Fault Tolerance: Ensure that workers handle crashes or timeouts gracefully. Use a timeout mechanism to revert jobs stuck in the 'in_progress' state (see the sketch after this list).
  5. Scaling: Distribute workers across multiple nodes to handle a higher job throughput.
  6. The SKIP LOCKED clause only applies to row-level locks – the required ROW SHARE table-level lock is still taken normally.
  7. Using SKIP LOCKED provides an inconsistent view of the data by design. This is why it’s perfect for queue-like tables where we want to distribute work, but not suitable for general purpose work where consistency is required.
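
For the fault-tolerance point above, a periodic task can push stuck work back to 'pending'. This is a sketch only; it reuses the created_at column as a rough proxy because the table has no started_at column, and the 10-minute threshold is arbitrary,

import psycopg2

conn = psycopg2.connect("dbname=yourdb user=youruser")
with conn:  # commits on success, rolls back on error
    with conn.cursor() as cur:
        cur.execute("""
            UPDATE job_queue
            SET status = 'pending'
            WHERE status = 'in_progress'
              AND created_at < NOW() - INTERVAL '10 minutes';
        """)
        print(f"requeued {cur.rowcount} stuck jobs")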

Learning Notes #50 – Fixed Partition Pattern | Distributed Pattern

9 January 2025 at 16:51

Today, I learnt about fixed partitioning, which balances data among servers without heavy movement of data. In this blog, I jot down notes on how fixed partitioning helps in solving the problem.

This entire blog is inspired by https://www.linkedin.com/pulse/distributed-systems-design-pattern-fixed-partitions-retail-kumar-v-c34pc/?trackingId=DMovSwEZSfCzKZEKa7yJrg%3D%3D

Problem Statement

In a distributed key-value store system, data items need to be mapped to a set of cluster nodes to ensure efficient storage and retrieval. The system must satisfy the following requirements,

  1. Uniform Distribution: Data should be evenly distributed across all cluster nodes to avoid overloading any single node.
  2. Deterministic Mapping: Given a data item, the specific node responsible for storing it should be determinable without querying all the nodes in the cluster.

A common approach to achieve these goals is to use hashing with a modulo operation. For example, if there are three nodes in the cluster, the key is hashed, and the hash value modulo the number of nodes determines the node to store the data. However, this method has a critical drawback,

Rebalancing Issue: When the cluster size changes (e.g., nodes are added or removed), the mapping for most keys changes. This requires the system to move almost all the data to new nodes, leading to significant overhead in terms of time and resources, especially when dealing with large data volumes.

Challenge: How can we design a mapping mechanism that minimizes data movement during cluster size changes while maintaining uniform distribution and deterministic mapping?

Solution

There is a concept called Fixed Partitioning.

What Is Fixed Partitioning?

This pattern organizes data into a predefined number of fixed partitions that remain constant over time. Data is assigned to these partitions using a hashing algorithm, ensuring that the mapping of data to partitions is permanent. The system separates the fixed partitioning of data from the physical servers managing these partitions, enabling seamless scaling.

Key Features of Fixed Partitioning

  1. Fixed Number of Partitions
    • The number of partitions is determined during system initialization (e.g., 8 partitions).
    • Data is assigned to these partitions based on a consistent hashing algorithm.
  2. Stable Data Mapping
    • Each piece of data is permanently mapped to a specific partition.
    • This eliminates the need for large-scale data reshuffling when scaling the system.
  3. Adjustable Partition-to-Server Mapping
    • Partitions can be reassigned to different servers as the system scales.
    • Only the physical location of the partitions changes; the fixed mapping remains intact.
  4. Balanced Load Distribution
    • Partitions are distributed evenly across servers to balance the workload.
    • Adding new servers involves reassigning partitions without moving or reorganizing data within the partitions.
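
The core idea can be sketched in a few lines of Python (a toy illustration, not from the referenced article): the key-to-partition mapping is fixed forever, while the partition-to-server mapping is the only thing that changes when the cluster scales,

import hashlib

NUM_PARTITIONS = 8  # fixed for the lifetime of the system

def partition_for(key: str) -> int:
    # Key -> partition: permanent, independent of how many servers exist
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Partition -> server: the only mapping that changes when scaling
partition_to_server = {
    0: "server_1", 1: "server_1",
    2: "server_2", 3: "server_2",
    4: "server_3", 5: "server_3",
    6: "server_4", 7: "server_4",
}

def server_for(key: str) -> str:
    return partition_to_server[partition_for(key)]

print(server_for("account:12345"))

# Adding a 5th server only updates this dictionary (and migrates the affected
# partitions); the key -> partition assignment above never changes
partition_to_server[7] = "server_5"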

Naive Example

We have a banking system with transactions stored in 8 fixed partitions, distributed based on a customer’s account ID.


CREATE TABLE transactions (
    id SERIAL PRIMARY KEY,
    account_id INT NOT NULL,
    transaction_amount NUMERIC(10, 2) NOT NULL,
    transaction_date DATE NOT NULL
) PARTITION BY HASH (account_id);

1. Create Partitions


DO $$
BEGIN
    FOR i IN 0..7 LOOP
        EXECUTE format(
            'CREATE TABLE transactions_p%s PARTITION OF transactions FOR VALUES WITH (modulus 8, remainder %s);',
            i, i
        );
    END LOOP;
END $$;

This creates 8 partitions (transactions_p0 to transactions_p7) based on the hash remainder of account_id modulo 8.

2. Inserting Data

When inserting data into the transactions table, PostgreSQL automatically places it into the correct partition based on the account_id.


INSERT INTO transactions (account_id, transaction_amount, transaction_date)
VALUES (12345, 500.00, '2025-01-01');

The hash of 12345, taken modulo 8, determines the target partition (e.g., transactions_p5).
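
To confirm where a row actually landed, tableoid::regclass reports the physical child table (a psycopg2 sketch with placeholder connection details),

import psycopg2

conn = psycopg2.connect("dbname=yourdb user=youruser")
with conn.cursor() as cur:
    cur.execute("""
        SELECT tableoid::regclass AS partition, account_id, transaction_amount
        FROM transactions
        WHERE account_id = 12345;
    """)
    for row in cur.fetchall():
        print(row)  # e.g. ('transactions_p5', 12345, Decimal('500.00'))

conn.close()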

3. Querying Data

Querying the base table works transparently across all partitions


SELECT * FROM transactions WHERE account_id = 12345;

PostgreSQL automatically routes the query to the correct partition.

4. Scaling by Adding Servers

Initial Setup:

Suppose we have 4 servers managing the partitions,

  • Server 1: transactions_p0, transactions_p1
  • Server 2: transactions_p2, transactions_p3
  • Server 3: transactions_p4, transactions_p5
  • Server 4: transactions_p6, transactions_p7

Adding a New Server:

When a 5th server is added, we redistribute partitions,

  • Server 1: transactions_p0
  • Server 2: transactions_p1
  • Server 3: transactions_p2, transactions_p3
  • Server 4: transactions_p4
  • Server 5: transactions_p5, transactions_p6, transactions_p7

Partition Migration

  • During the migration, transactions_p5 is copied from Server 3 to Server 5.
  • Once the migration is complete, Server 5 becomes responsible for transactions_p5.

Benefits:

  1. Minimal Data Movement – When scaling, only the partitions being reassigned are copied to new servers. Data within partitions remains stable.
  2. Optimized Performance – Queries are routed directly to the relevant partition, minimizing scan times.
  3. Scalability – Adding servers is straightforward, as it involves reassigning partitions, not reorganizing data.

What happens when a new server is added, then? Don’t we need to copy the data?

When a partition is moved to a new server (e.g., partition_b from server_A to server_B), the data in the partition must be copied to the new server. However,

  1. The copying is limited to the partition being reassigned.
  2. No data within the partition is reorganized.
  3. Once the partition is fully migrated, the original copy is typically deleted.

For example, in PostgreSQL,

  • Export the partition: pg_dump -t partition_b -h server_A -U postgres > partition_b.sql
  • Import on the new server: psql -h server_B -U postgres -d mydb < partition_b.sql

Learning Notes #49 – Pitfall of Implicit Default Values in APIs

9 January 2025 at 14:00

Today, we faced a bug in our workflow due to an implicit default value in a 3rd-party API. In this blog, I will be sharing my experience for future reference.

Understanding the Problem

Consider an API where some fields are optional, and a default value is used when those fields are not provided by the client. This design is common and seemingly harmless. However, problems arise when,

  1. Unexpected Categorization: The default value influences logic, such as category assignment, in ways the client did not intend.
  2. Implicit Assumptions: The API assumes a default value aligns with the client’s intention, leading to misclassification or incorrect behavior.
  3. Debugging Challenges: When issues occur, clients and developers spend significant time tracing the problem because the default behavior is not transparent.

Here’s an example of how this might manifest,


POST /items
{
  "name": "Sample Item",
  "category": "premium"
}

If the category field is optional and a default value of "basic" is applied when it’s omitted, the following request,


POST /items
{
  "name": "Another Item"
}

might incorrectly classify the item as basic, even if the client intended it to be uncategorized.
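
On the server side, the trap often looks as innocent as this (a hypothetical handler; the field names mirror the example above),

def create_item(payload: dict) -> dict:
    item = {
        "name": payload["name"],
        # The silent fallback is the problem: an omitted field and an explicit
        # "basic" are now indistinguishable to the rest of the system
        "category": payload.get("category", "basic"),
    }
    return item

print(create_item({"name": "Another Item"}))  # {'name': 'Another Item', 'category': 'basic'}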

Why This is a Code Smell

Implicit default handling for optional fields often signals poor design. Let’s break down why,

  1. Violation of the Principle of Least Astonishment: Clients may be unaware of default behavior, leading to unexpected outcomes.
  2. Hidden Logic: The business logic embedded in defaults is not explicit in the API’s contract, reducing transparency.
  3. Coupling Between API and Business Logic: When defaults dictate core behavior, the API becomes tightly coupled to specific business rules, making it harder to adapt or extend.
  4. Inconsistent Behavior: If the default logic changes in future versions, existing clients may experience breaking changes.

Best Practices to Avoid the Trap

  1. Make Default Behavior Explicit
    • Clearly document default values in the API specification (but we still missed it).
    • For example, use OpenAPI/Swagger to define optional fields and their default values explicitly.
  2. Avoid Implicit Defaults
    • Instead of applying defaults server-side, require the client to explicitly provide values, even if they are defaults.
    • This ensures the client is fully aware of the data being sent and its implications.
  3. Use Null or Explicit Indicators
    • Allow optional fields to be explicitly null or undefined, and handle these cases appropriately.
    • In this case, the API can handle null as “no category specified” rather than applying a default.
  4. Fail Fast with Validation
    • Use strict validation to reject ambiguous requests, encouraging clients to provide clear inputs.

{
  "error": "Field 'category' must be provided explicitly."
}

5. Version Your API Thoughtfully:

  • Document changes and provide clear migration paths for clients.
  • If you must change default behaviors, ensure backward compatibility through versioning.

Implicit default values for optional fields can lead to unintended consequences, obscure logic, and hard-to-debug issues. Recognizing this pattern as a code smell is the first step to building more robust APIs. By adopting explicitness, transparency, and rigorous validation, you can create APIs that are easier to use, understand, and maintain.

Learning Notes #37 – Orchestrator Pattern | Cloud Pattern

5 January 2025 at 11:16

Today, I learnt about the Orchestrator Pattern while I was learning about the SAGA Pattern. It simplifies the coordination of distributed workflows, making the system more efficient and easier to manage. In this blog, I jot down notes on the Orchestrator Pattern for better understanding.

What is the Orchestrator Pattern?

The Orchestrator Pattern is a design strategy where a central orchestrator coordinates interactions between various services or components to execute a workflow.

Unlike the Choreography Pattern, where services interact with each other independently and are aware of their peers, the orchestrator acts as the central decision-maker, directing how and when services interact.

Key Features

  • Centralized control of workflows.
  • Simplified service communication.
  • Enhanced error handling and monitoring.

When to Use the Orchestrator Pattern

  • Complex Workflows: When multiple services or steps need to be executed in a defined sequence.
  • Error Handling: When failures in one step require recovery strategies or compensating transactions.
  • Centralized Logic: When you want to encapsulate business logic in a single place for easier maintenance.

Benefits of the Orchestrator Pattern

  1. Simplifies Service Communication: Services remain focused on their core functionality while the orchestrator manages interactions.
  2. Improves Scalability: Workflows can be scaled independently from services.
  3. Centralized Monitoring: Makes it easier to track the progress of workflows and debug issues.
  4. Flexibility: Changing a workflow involves modifying the orchestrator, not the services.

Example: Order Processing Workflow

Problem

A fictional e-commerce platform needs to process orders. The workflow involves:

  1. Validating the order.
  2. Reserving inventory.
  3. Processing payment.
  4. Notifying the user.

Each step is handled by a separate microservice.

Solution

We implement an orchestrator to manage this workflow. Let’s see how this works in practice.


import requests

class OrderOrchestrator:
    def __init__(self):
        self.services = {
            "validate_order": "http://order-service/validate",
            "reserve_inventory": "http://inventory-service/reserve",
            "process_payment": "http://payment-service/process",
            "notify_user": "http://notification-service/notify",
        }

    def execute_workflow(self, order_id):
        try:
            # Step 1: Validate Order
            self.call_service("validate_order", {"order_id": order_id})

            # Step 2: Reserve Inventory
            self.call_service("reserve_inventory", {"order_id": order_id})

            # Step 3: Process Payment
            self.call_service("process_payment", {"order_id": order_id})

            # Step 4: Notify User
            self.call_service("notify_user", {"order_id": order_id})

            print(f"Order {order_id} processed successfully!")
        except Exception as e:
            print(f"Error processing order {order_id}: {e}")

    def call_service(self, service_name, payload):
        url = self.services[service_name]
        response = requests.post(url, json=payload)
        if response.status_code != 200:
            raise Exception(f"{service_name} failed: {response.text}")

Key Tactics for Implementation

  1. Services vs. Serverless: Use serverless functions for steps that are triggered occasionally and don’t need always-on services, reducing costs.
  2. Recovery from Failures:
    • Retry Mechanism: Configure retries with limits and delays to handle transient failures (see the sketch after this list).
    • Circuit Breaker Pattern: Detect and isolate failing services to allow recovery.
    • Graceful Degradation: Use fallbacks like cached results or alternate services to ensure continuity.
  3. Monitoring and Alerting:
    • Implement real-time monitoring with automated recovery strategies.
    • Set up alerts for exceptions and utilize logs for troubleshooting.
  4. Orchestration Service Failures:
    • Service Replication: Deploy multiple instances of the orchestrator for failover.
    • Data Replication: Ensure data consistency for seamless recovery.
    • Request Queues: Use queues to buffer requests during downtime and process them later.
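
As a sketch of the retry tactic mentioned above (hypothetical limits and delays; a production system would likely combine this with the circuit breaker pattern), a single orchestration step could be wrapped like this,

import time
import requests

def call_with_retry(url, payload, retries=3, delay=1.0):
    # Bounded retries with a simple linear backoff between attempts
    for attempt in range(1, retries + 1):
        try:
            response = requests.post(url, json=payload, timeout=5)
            if response.status_code == 200:
                return response.json()
        except requests.RequestException:
            pass
        if attempt < retries:
            time.sleep(delay * attempt)
    raise Exception(f"step failed after {retries} attempts: {url}")

The orchestrator’s call_service method shown earlier could delegate to a helper like this instead of calling requests.post directly.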

Important Considerations

The primary goal of this architectural pattern is to decompose the entire business workflow into multiple services, making it more flexible and scalable. Due to this, it’s crucial to analyze and comprehend the business processes in detail before implementation. A poorly defined and overly complicated business process will lead to a system that would be hard to maintain and scale.

Secondly, it’s easy to fall into the trap of adding business logic into the orchestration service. Sometimes it’s inevitable because certain functionalities are too small to create their separate service. But the risk here is that if the orchestration service becomes too intelligent and performs too much business logic, it can evolve into a monolithic application that also happens to talk to microservices. So, it’s crucial to keep track of every addition to the orchestration service and ensure that its work remains within the boundaries of orchestration. Maintaining the scope of the orchestration service will prevent it from becoming a burden on the system, leading to decreased scalability and flexibility.

Why Use the Orchestration Pattern

The pattern comes with the following advantages

  • Orchestration makes it easier to understand, monitor, and observe the application, resulting in a better understanding of the core part of the system with less effort.
  • The pattern promotes loose coupling. Each downstream service exposes an API interface and is self-contained, without any need to know about the other services.
  • The pattern simplifies the business workflows and improves the separation of concerns. Each service participates in a long-running transaction without any need to know about it.
  • The orchestrator service can decide what to do in case of failure, making the system fault-tolerant and reliable.

Learning Notes #34 – Consistency (Correctness) in ACID | Postgres

4 January 2025 at 12:37

As part of the ACID series, I am refreshing on consistency. In this blog, I jot down notes on consistency (correctness) in the Postgres database.

What is Consistency?

Consistency ensures that a transaction brings the database from one valid state to another, adhering to predefined rules such as constraints, triggers, and relational integrity. If a transaction violates these rules, it is aborted, and the database remains unchanged. This guarantees that only valid data exists in the database.

Consistency works together with other ACID properties:

  • Atomicity ensures the “all-or-nothing” execution of a transaction.
  • Isolation ensures transactions don’t interfere with each other.
  • Durability guarantees committed transactions persist despite system failures.

Key Aspects of Consistency in PostgreSQL

  1. Constraints
    • Primary Key: Ensures uniqueness of rows.
    • Foreign Key: Maintains referential integrity.
    • Check Constraints: Enforces custom business rules.
    • Not Null: Ensures that specific columns cannot have null values.
  2. Triggers
    • Custom logic executed before or after specific database events.
  3. Rules
    • Enforce application-specific invariants on the database.
  4. Transactions
    • Changes are made in a controlled environment, ensuring consistency even in the event of errors or system failures.

Practical Examples of Consistency in PostgreSQL

1. Primary Key Constraint

Ensures that no two rows in a table have the same primary key value.


CREATE TABLE accounts (
    account_id SERIAL PRIMARY KEY,
    account_holder_name VARCHAR(255) NOT NULL,
    balance NUMERIC(15, 2) NOT NULL CHECK (balance >= 0)
);

-- Attempt to insert duplicate primary keys.
INSERT INTO accounts (account_id, account_holder_name, balance)
VALUES (1, 'Alice', 1000.00);

INSERT INTO accounts (account_id, account_holder_name, balance)
VALUES (1, 'Bob', 2000.00); -- This will fail.

2. Foreign Key Constraint

Enforces referential integrity between tables.


CREATE TABLE transactions (
    transaction_id SERIAL PRIMARY KEY,
    account_id INT NOT NULL REFERENCES accounts(account_id),
    amount NUMERIC(15, 2) NOT NULL,
    transaction_type VARCHAR(10) NOT NULL CHECK (transaction_type IN ('credit', 'debit')),
    transaction_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Attempt to insert a transaction for a non-existent account.
INSERT INTO transactions (account_id, amount, transaction_type)
VALUES (999, 500, 'credit'); -- This will fail.

3. Check Constraint

Validates custom business rules.


-- Ensure account balance cannot go negative.
INSERT INTO accounts (account_holder_name, balance)
VALUES ('Charlie', -500); -- This will fail due to the CHECK constraint.

4. Trigger for Business Logic

Ensures derived data or additional checks are implemented.


CREATE OR REPLACE FUNCTION enforce_minimum_balance()
RETURNS TRIGGER AS $$
BEGIN
    IF NEW.balance < 0 THEN
        RAISE EXCEPTION 'Balance cannot be negative';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER check_balance_before_insert
BEFORE INSERT OR UPDATE ON accounts
FOR EACH ROW EXECUTE FUNCTION enforce_minimum_balance();

-- Attempt to update an account with a negative balance.
UPDATE accounts SET balance = -100 WHERE account_id = 1; -- This will fail.

5. Transactions to Maintain Consistency

A transaction groups multiple operations into a single unit, ensuring all succeed or none.


BEGIN;

-- Deduct from sender's account.
UPDATE accounts SET balance = balance - 500 WHERE account_id = 1;

-- Credit to receiver's account.
UPDATE accounts SET balance = balance + 500 WHERE account_id = 2;

-- If any operation fails, rollback the transaction.
COMMIT;

If the system crashes before the COMMIT, the database remains unchanged, ensuring consistency.

How Consistency Works with Other ACID Properties

  1. With Atomicity: If any step in a transaction violates a constraint, the entire transaction is rolled back, ensuring that the database remains consistent.
  2. With Isolation: Concurrent transactions operate independently, preventing inconsistent states caused by interference.
  3. With Durability: Once a transaction is committed, its consistency guarantees persist even in the event of a crash.

Benefits of Consistency

  1. Data Integrity: Prevents invalid data from being stored.
  2. Application Reliability: Reduces the need for additional application-level checks.
  3. Simplified Maintenance: Developers can rely on the database to enforce business rules and relationships.
  4. Error Prevention: Constraints and triggers act as safeguards, catching mistakes early.

Learning Notes #30 – Queue Based Loading | Cloud Patterns

3 January 2025 at 14:47

Today, I learnt about the Queue-Based Loading pattern, which helps manage intermittent peak load to a service via queues, basically decoupling tasks from services. In this blog, I jot down notes on this pattern for my future self.

In today’s digital landscape, applications are expected to handle large-scale operations efficiently. Whether it’s processing massive data streams, ensuring real-time responsiveness, or integrating with multiple third-party services, scalability and reliability are paramount. One pattern that elegantly addresses these challenges is the Queue-Based Loading Pattern.

What Is the Queue-Based Loading Pattern?

The Queue-Based Loading Pattern leverages message queues to decouple and coordinate tasks between producers (such as applications or services generating data) and consumers (services or workers processing that data). By using queues as intermediaries, this pattern allows systems to manage workloads efficiently, ensuring seamless and scalable operation.

Key Components of the Pattern

  1. Producers: Producers are responsible for generating tasks or data. They send these tasks to a message queue instead of directly interacting with consumers. Examples include:
    • Web applications logging user activity.
    • IoT devices sending sensor data.
  2. Message Queue: The queue acts as a buffer, storing tasks until consumers are ready to process them. Popular tools for implementing queues include RabbitMQ, Apache Kafka, AWS SQS, and Redis.
  3. Consumers: Consumers retrieve messages from the queue and process them asynchronously. They are typically designed to handle tasks independently and at their own pace.
  4. Processing Logic: This is the core functionality that processes the tasks retrieved by consumers. For example, resizing images, sending notifications, or updating a database.

How It Works

  1. Task Generation: Producers push tasks to the queue as they are generated.
  2. Message Storage: The queue stores tasks in a structured manner (FIFO, priority-based, etc.) and ensures reliable delivery.
  3. Task Consumption: Consumers pull tasks from the queue, process them, and optionally acknowledge completion.
  4. Scalability: New consumers can be added dynamically to handle increased workloads, ensuring the system remains responsive.
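
A toy, in-process version of this flow using Python’s standard library queue (a real system would put RabbitMQ, Kafka, or SQS in the middle, but the decoupling idea is the same),

import queue
import threading
import time

task_queue = queue.Queue()  # stands in for the message broker in this sketch

def producer():
    for i in range(5):
        task_queue.put({"task_id": i, "payload": f"work-{i}"})
        print(f"produced task {i}")

def consumer(worker_id):
    while True:
        task = task_queue.get()       # blocks until a task is available
        print(f"worker {worker_id} processing task {task['task_id']}")
        time.sleep(0.1)               # simulate work
        task_queue.task_done()        # acknowledge completion

threading.Thread(target=producer).start()
for w in range(2):                    # scaling out = adding more consumers
    threading.Thread(target=consumer, args=(w,), daemon=True).start()

task_queue.join()                     # wait until every produced task is acknowledged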

Benefits of the Queue-Based Loading Pattern

  1. Decoupling: Producers and consumers operate independently, reducing tight coupling and improving system maintainability.
  2. Scalability: By adding more consumers, systems can easily scale to handle higher workloads.
  3. Fault Tolerance: If a consumer fails, messages remain in the queue, ensuring no data is lost.
  4. Load Balancing: Tasks are distributed evenly among consumers, preventing any single consumer from becoming a bottleneck.
  5. Asynchronous Processing: Consumers can process tasks in the background, freeing producers to continue generating data without delay.

Issues and Considerations

  1. Rate Limiting: Implement logic to control the rate at which services handle messages to prevent overwhelming the target resource. Test the system under load and adjust the number of queues or service instances to manage demand effectively.
  2. One-Way Communication: Message queues are inherently one-way. If tasks require responses, you may need to implement a separate mechanism for replies.
  3. Autoscaling Challenges: Be cautious when autoscaling consumers, as it can lead to increased contention for shared resources, potentially reducing the effectiveness of load leveling.
  4. Traffic Variability: Consider the variability of incoming traffic to avoid situations where tasks pile up faster than they are processed, creating a perpetual backlog.
  5. Queue Persistence: Ensure your queue is durable and capable of persisting messages. Crashes or system limits could lead to dropped messages, risking data loss.

Use Cases

  1. Email and Notification Systems: Sending bulk emails or push notifications without overloading the main application.
  2. Data Pipelines: Ingesting, transforming, and analyzing large datasets in real-time or batch processing.
  3. Video Processing: Queues facilitate tasks like video encoding and thumbnail generation.
  4. Microservices Communication: Ensures reliable and scalable communication between microservices.

Best Practices

  1. Message Durability: Configure your queue to persist messages to disk, ensuring they are not lost during system failures.
  2. Monitoring and Metrics: Use monitoring tools to track queue lengths, processing rates, and consumer health.
  3. Idempotency: Design consumers to handle duplicate messages gracefully.
  4. Error Handling and Dead Letter Queues (DLQs): Route failed messages to DLQs for later analysis and reprocessing.

Learning Notes #22 – Claim Check Pattern | Cloud Pattern

31 December 2024 at 17:03

Today, I learnt about the Claim Check Pattern, which describes how to handle a big message in a queue. Every message broker has a defined message size limit; if our message size exceeds it, it won’t work.

The Claim Check Pattern emerges as a pivotal architectural design for addressing the challenge of managing large payloads in a decoupled and efficient manner. In this blog, I jot down notes on my learning for my future self.

What is the Claim Check Pattern?

The Claim Check Pattern is a messaging pattern used in distributed systems to manage large messages efficiently. Instead of transmitting bulky data directly between services, this pattern extracts and stores the payload in a dedicated storage system (e.g., object storage or a database).

A lightweight reference or “claim check” is then sent through the message queue, which the receiving service can use to retrieve the full data from the storage.

This pattern is inspired by the physical process of checking in luggage at an airport: you hand over your luggage, receive a claim check (a token), and later use it to retrieve your belongings.

How Does the Claim Check Pattern Work?

The process typically involves the following steps

  1. Data Submission The sender service splits a message into two parts:
    • Metadata: A small piece of information that provides context about the data.
    • Payload: The main body of data that is too large or sensitive to send through the message queue.
  2. Storing the Payload
    • The sender uploads the payload to a storage service (e.g., AWS S3, Azure Blob Storage, or Google Cloud Storage).
    • The storage service returns a unique identifier (e.g., a URL or object key).
  3. Sending the Claim Check
    • The sender service places the metadata and the unique identifier (claim check) onto the message queue.
  4. Receiving the Claim Check
    • The receiver service consumes the message from the queue, extracts the claim check, and retrieves the payload from the storage system.
  5. Processing
    • The receiver processes the payload alongside the metadata as required.

Use Cases

1. Media Processing Pipelines – In video transcoding systems, raw video files can be uploaded to storage while metadata (e.g., video format and length) is passed through the message queue.

2. IoT Systems – IoT devices generate large datasets. Using the Claim Check Pattern ensures efficient transmission and processing of these data chunks.

3. Data Processing Workflows – In big data systems, datasets can be stored in object storage while processing metadata flows through orchestration tools like Apache Airflow.

4. Event-Driven Architectures – For systems using event-driven models, large event payloads can be offloaded to storage to avoid overloading the messaging layer.

Example with RabbitMQ

1. Sender Service


import json

import boto3
import pika

s3 = boto3.client('s3')
bucket_name = 'my-bucket'
object_key = 'data/large-file.txt'

# Store the large payload in object storage and keep only a reference (claim check)
s3.upload_file('large-file.txt', bucket_name, object_key)
claim_check = f's3://{bucket_name}/{object_key}'

# Connect to RabbitMQ
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Declare a queue
channel.queue_declare(queue='claim_check_queue')

# Send the claim check (metadata + reference, not the payload itself)
message = {
    'metadata': 'Some metadata',
    'claim_check': claim_check
}
channel.basic_publish(exchange='', routing_key='claim_check_queue', body=json.dumps(message))

connection.close()

2. Consumer


import json

import boto3
import pika

s3 = boto3.client('s3')

# Connect to RabbitMQ
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Declare a queue
channel.queue_declare(queue='claim_check_queue')

# Callback function to process messages
def callback(ch, method, properties, body):
    message = json.loads(body)
    claim_check = message['claim_check']

    # Resolve the claim check back into a bucket/key and fetch the payload
    bucket_name, object_key = claim_check.replace('s3://', '').split('/', 1)
    s3.download_file(bucket_name, object_key, 'retrieved-large-file.txt')
    print("Payload retrieved and processed.")

# Consume messages
channel.basic_consume(queue='claim_check_queue', on_message_callback=callback, auto_ack=True)

print('Waiting for messages. To exit press CTRL+C')
channel.start_consuming()

References

  1. https://learn.microsoft.com/en-us/azure/architecture/patterns/claim-check
  2. https://medium.com/@dmosyan/claim-check-design-pattern-603dc1f3796d

Learning Notes #19 – Blue Green Deployments – An near ZERO downtime deployment

30 December 2024 at 18:19

Today, I got refreshed on Blue-Green Deployment from a podcast https://open.spotify.com/episode/03p86zgOuSEbNezK71CELH. Deployment design is an area I haven’t touched yet. In this blog, I jot down notes on blue-green deployment for my future self.

What is Blue-Green Deployment?

Blue-Green Deployment is a release management strategy that involves maintaining two identical environments, referred to as “Blue” and “Green.” At any point in time, only one environment is live (receiving traffic), while the other remains idle or in standby. Updates are deployed to the idle environment, thoroughly tested, and then switched to live with minimal downtime.

How It Works

  • This approach involves setting up two environments: the Blue environment, which serves live traffic, and the Green environment, a replica used for staging updates.
  • Updates are first deployed to the Green environment, where comprehensive testing is performed to ensure functionality, performance, and integration meet expectations.
  • Once testing is successful, the routing mechanism, such as DNS, an API gateway, or a load balancer, is updated to redirect traffic from the Blue environment to the Green environment.
  • The Green environment then becomes live, while the Blue environment transitions to an idle state.
  • If issues arise, traffic can be reverted to the Blue environment for a quick recovery with minimal impact.

Benefits of Blue-Green Deployment

  • Blue-Green Deployment provides zero downtime during the deployment process, ensuring uninterrupted user experiences.
  • Rollbacks are simplified because the previous version remains intact in the Blue environment, enabling quick reversion if necessary. Consideration of forward and backward compatibility is important, e.g., for the database.
  • It also allows seamless testing in the Green environment before updates go live, reducing risks by isolating production from deployment issues.

Challenges and Considerations

  • Maintaining two identical environments can be resource intensive.
  • Ensuring synchronization between environments is critical to prevent discrepancies in configuration and data.
  • Handling live database changes during the environment switch is complex, requiring careful planning for database migrations.

Implementing Blue-Green Deployment (Not Yet Tried)

  • Several tools and platforms support Blue-Green Deployment. Kubernetes simplifies managing multiple environments through namespaces and services.
  • AWS Elastic Beanstalk offers built-in support for Blue-Green Deployment, while HashiCorp Terraform automates the setup of Blue-Green infrastructure.
  • To implement this strategy, organizations should design infrastructure capable of supporting two identical environments, automate deployments using CI/CD pipelines, monitor and test thoroughly, and define rollback procedures to revert to previous versions when necessary.
