Reading view

There are new articles available, click to refresh the page.

Learning Notes #57 – Partial Indexing in Postgres

16 January 2025 at 14:36

Today, i learnt about partial indexing in postgres, how its optimizes the indexing process to filter subset of table more efficiently. In this blog, i jot down notes on partial indexing.

Partial indexing in PostgreSQL is a powerful feature that provides a way to optimize database performance by creating indexes that apply only to a subset of a table’s rows. This selective indexing can result in reduced storage space, faster index maintenance, and improved query performance, especially when queries frequently involve filters or conditions that only target a portion of the data.

An index in PostgreSQL, like in other relational database management systems, is a data structure that improves the speed of data retrieval operations. However, creating an index on an entire table can sometimes be inefficient, especially when dealing with very large datasets where queries often focus on specific subsets of the data. This is where partial indexing becomes invaluable.

Unlike a standard index that covers every row in a table, a partial index only includes rows that satisfy a specified condition. This condition is defined using a WHERE clause when the index is created.

To understand the mechanics, let us consider a practical example.

Suppose you have a table named orders that stores details about customer orders, including columns like order_id, customer_id, order_date, status, and total_amount. If the majority of your queries focus on pending orders those where the status is pending, creating a partial index specifically for these rows can significantly improve performance.

Example 1:

Here’s how you can create such an index,

CREATE INDEX idx_pending_orders
ON orders (order_date)
WHERE status = 'pending';

In this example, the index idx_pending_orders includes only the rows where status equals pending. This means that any query that involves filtering by status = 'pending' and utilizes the order_date column will leverage this index. For instance, the following query would benefit from the partial index,

SELECT *
FROM orders
WHERE status = 'pending'
AND order_date > '2025-01-01';

The benefits of this approach are significant. By indexing only the rows with status = 'pending', the size of the index is much smaller compared to a full table index.

This reduction in size not only saves disk space but also speeds up the process of scanning the index, as there are fewer entries to traverse. Furthermore, updates or modifications to rows that do not meet the WHERE condition are excluded from index maintenance, thereby reducing the overhead of maintaining the index and improving performance for write operations.

Example 2:

Let us explore another example. Suppose your application frequently queries orders that exceed a certain total amount. You can create a partial index tailored to this use case,

CREATE INDEX idx_high_value_orders
ON orders (customer_id)
WHERE total_amount > 1000;

This index would optimize queries like the following,

SELECT *
FROM orders
WHERE total_amount > 1000
AND customer_id = 123;

The key advantage here is that the index only includes rows where total_amount > 1000. For datasets with a wide range of order amounts, this can dramatically reduce the number of indexed entries. Queries that filter by high-value orders become faster because the database does not need to sift through irrelevant rows.

Additionally, as with the previous example, index maintenance is limited to the subset of rows matching the condition, improving overall performance for insertions and updates.

Partial indexes are also useful for enforcing constraints in a selective manner. Consider a scenario where you want to ensure that no two active promotions exist for the same product. You can achieve this using a unique partial index

CREATE UNIQUE INDEX idx_unique_active_promotion
ON promotions (product_id)
WHERE is_active = true;

This index guarantees that only one row with is_active = true can exist for each product_id.

In conclusion, partial indexing in PostgreSQL offers a flexible and efficient way to optimize database performance by targeting specific subsets of data.

Learning Notes #51 – Postgres as a Queue using SKIP LOCKED

Parotta Salna

Mr.ParottaSalna

11 January 2025 at 06:56

Yesterday, i came across a blog from inferable.ai https://www.inferable.ai/blog/posts/postgres-skip-locked, which walkthrough about using postgres as a queue. In this blog, i jot down notes on using postgres as a queue for future references.

PostgreSQL is a robust relational database that can be used for more than just storing structured data. With the SKIP LOCKED feature introduced in PostgreSQL 9.5, you can efficiently turn a PostgreSQL table into a job queue for distributed processing.

Why Use PostgreSQL as a Queue?

Using PostgreSQL as a queue can be advantageous because,

Familiarity: If you’re already using PostgreSQL, there’s no need for an additional message broker.
Durability: PostgreSQL ensures ACID compliance, offering reliability for your job processing.
Simplicity: No need to manage another component like RabbitMQ or Kafka

Implementing a Queue with SKIP LOCKED

1. Create a Queue Table

To start, you need a table to store the jobs,


CREATE TABLE job_queue (
    id SERIAL PRIMARY KEY,
    job_data JSONB NOT NULL,
    status TEXT DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

This table has the following columns,

id: A unique identifier for each job.
job_data: The data or payload for the job.
status: Tracks the job’s state (‘pending’, ‘in_progress’, or ‘completed’).
created_at: Timestamp of job creation.

2. Insert Jobs into the Queue

Adding jobs is straightforward,


INSERT INTO job_queue (job_data)
VALUES ('{"task": "send_email", "email": "user@example.com"}');

3. Fetch Jobs for Processing with SKIP LOCKED

Workers will fetch jobs from the queue using SELECT ... FOR UPDATE SKIP LOCKED to avoid contention,

WITH next_job AS (
    SELECT id, job_data
    FROM job_queue
    WHERE status = 'pending'
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
UPDATE job_queue
SET status = 'in_progress'
FROM next_job
WHERE job_queue.id = next_job.id
RETURNING job_queue.id, job_queue.job_data;

Key Points:

FOR UPDATE locks the selected row to prevent other workers from picking it up.
SKIP LOCKED ensures locked rows are skipped, enabling concurrent workers to operate without waiting.
LIMIT 1 processes one job at a time per worker.

4. Mark Jobs as Completed

Once a worker finishes processing a job, it should update the job’s status,


UPDATE job_queue
SET status = 'completed'
WHERE id = $1; -- Replace $1 with the job ID

5. Delete Old or Processed Jobs

To keep the table clean, you can periodically remove completed jobs,


DELETE FROM job_queue
WHERE status = 'completed' AND created_at < NOW() - INTERVAL '30 days';

Example Worker Implementation

Here’s an example of a worker implemented in Python using psycopg2


import psycopg2
from psycopg2.extras import RealDictCursor

connection = psycopg2.connect("dbname=yourdb user=youruser")

while True:
    with connection.cursor(cursor_factory=RealDictCursor) as cursor:
        cursor.execute(
            """
            WITH next_job AS (
                SELECT id, job_data
                FROM job_queue
                WHERE status = 'pending'
                FOR UPDATE SKIP LOCKED
                LIMIT 1
            )
            UPDATE job_queue
            SET status = 'in_progress'
            FROM next_job
            WHERE job_queue.id = next_job.id
            RETURNING job_queue.id, job_queue.job_data;
            """
        )

        job = cursor.fetchone()
        if job:
            print(f"Processing job {job['id']}: {job['job_data']}")

            # Simulate job processing
            cursor.execute("UPDATE job_queue SET status = 'completed' WHERE id = %s", (job['id'],))

        else:
            print("No jobs available. Sleeping...")
            time.sleep(5)

    connection.commit()

Considerations

Transaction Isolation: Use the REPEATABLE READ or SERIALIZABLE isolation level cautiously to avoid unnecessary locks.
Row Locking: SKIP LOCKED only skips rows locked by other transactions, not those locked within the same transaction.
Performance: Regularly archive or delete old jobs to prevent the table from growing indefinitely. Consider indexing the status column to improve query performance.
Fault Tolerance: Ensure that workers handle crashes or timeouts gracefully. Use a timeout mechanism to revert jobs stuck in the ‘in_progress’ state.
Scaling: Distribute workers across multiple nodes to handle a higher job throughput.
The SKIP LOCKED clause only applies to row-level locks – the required ROW SHARE table-level lock is still taken normally.
Using SKIP LOCKED provides an inconsistent view of the data by design. This is why it’s perfect for queue-like tables where we want to distribute work, but not suitable for general purpose work where consistency is required.

Learning Notes #50 – Fixed Partition Pattern | Distributed Pattern

Parotta Salna

Mr.ParottaSalna

9 January 2025 at 16:51

Today, i learnt about fixed partition, where it handles about balancing the data among servers without high movement of data. In this blog, i jot down notes on how fixed partition helps in solving the problem.

This entire blog is inspired from https://www.linkedin.com/pulse/distributed-systems-design-pattern-fixed-partitions-retail-kumar-v-c34pc/?trackingId=DMovSwEZSfCzKZEKa7yJrg%3D%3D

Problem Statement

In a distributed key-value store system, data items need to be mapped to a set of cluster nodes to ensure efficient storage and retrieval. The system must satisfy the following requirements,

Uniform Distribution: Data should be evenly distributed across all cluster nodes to avoid overloading any single node.
Deterministic Mapping: Given a data item, the specific node responsible for storing it should be determinable without querying all the nodes in the cluster.

A common approach to achieve these goals is to use hashing with a modulo operation. For example, if there are three nodes in the cluster, the key is hashed, and the hash value modulo the number of nodes determines the node to store the data. However, this method has a critical drawback,

Rebalancing Issue: When the cluster size changes (e.g., nodes are added or removed), the mapping for most keys changes. This requires the system to move almost all the data to new nodes, leading to significant overhead in terms of time and resources, especially when dealing with large data volumes.

Challenge: How can we design a mapping mechanism that minimizes data movement during cluster size changes while maintaining uniform distribution and deterministic mapping?

Solution

There is a concept of Fixed Partitioning,

What Is Fixed Partitioning?

This pattern organizes data into a predefined number of fixed partitions that remain constant over time. Data is assigned to these partitions using a hashing algorithm, ensuring that the mapping of data to partitions is permanent. The system separates the fixed partitioning of data from the physical servers managing these partitions, enabling seamless scaling.

Key Features of Fixed Partitioning

Fixed Number of Partitions
- The number of partitions is determined during system initialization (e.g., 8 partitions).
- Data is assigned to these partitions based on a consistent hashing algorithm.
Stable Data Mapping
- Each piece of data is permanently mapped to a specific partition.
- This eliminates the need for large-scale data reshuffling when scaling the system.
Adjustable Partition-to-Server Mapping
- Partitions can be reassigned to different servers as the system scales.
- Only the physical location of the partitions changes; the fixed mapping remains intact.
Balanced Load Distribution
- Partitions are distributed evenly across servers to balance the workload.
- Adding new servers involves reassigning partitions without moving or reorganizing data within the partitions.

Naive Example

We have a banking system with transactions stored in 8 fixed partitions, distributed based on a customer’s account ID.


CREATE TABLE transactions (
    id SERIAL PRIMARY KEY,
    account_id INT NOT NULL,
    transaction_amount NUMERIC(10, 2) NOT NULL,
    transaction_date DATE NOT NULL
) PARTITION BY HASH (account_id);

1. Create Partition


DO $$
BEGIN
    FOR i IN 0..7 LOOP
        EXECUTE format(
            'CREATE TABLE transactions_p%s PARTITION OF transactions FOR VALUES WITH (modulus 8, remainder %s);',
            i, i
        );
    END LOOP;
END $$;

This creates 8 partitions (transactions_p0 to transactions_p7) based on the hash remainder of account_id modulo 8.

2. Inserting Data

When inserting data into the transactions table, PostgreSQL automatically places it into the correct partition based on the account_id.


INSERT INTO transactions (account_id, transaction_amount, transaction_date)
VALUES (12345, 500.00, '2025-01-01');

The hash of 12345 % 8 determines the target partition (e.g., transactions_p5).

3. Querying Data

Querying the base table works transparently across all partitions


SELECT * FROM transactions WHERE account_id = 12345;

PostgreSQL automatically routes the query to the correct partition.

4. Scaling by Adding Servers

Initial Setup:

Suppose we have 4 servers managing the partitions,

Server 1: transactions_p0, transactions_p1
Server 2: transactions_p2, transactions_p3
Server 3: transactions_p4, transactions_p5
Server 4: transactions_p6, transactions_p7

Adding a New Server:

When a 5th server is added, we redistribute partitions,

Server 1: transactions_p0
Server 2: transactions_p1
Server 3: transactions_p2, transactions_p3
Server 4: transactions_p4
Server 5: transactions_p5, transactions_p6, transactions_p7

Partition Migration

During the migration, transactions_p5 is copied from Server 3 to Server 5.
Once the migration is complete, Server 5 becomes responsible for transactions_p5.

Benefits:

Minimal Data Movement – When scaling, only the partitions being reassigned are copied to new servers. Data within partitions remains stable.
Optimized Performance – Queries are routed directly to the relevant partition, minimizing scan times.
Scalability – Adding servers is straightforward, as it involves reassigning partitions, not reorganizing data.

What happens when a new server is added then. Don’t we need to copy the data ?

When a partition is moved to a new server (e.g., partition_b from server_A to server_B), the data in the partition must be copied to the new server. However,

The copying is limited to the partition being reassigned.
No data within the partition is reorganized.
Once the partition is fully migrated, the original copy is typically deleted.

For example, in PostgreSQL,

Export the Partition pg_dump -t partition_b -h server_A -U postgres > partition_b.sql
Import on New Server: psql -h server_B -U postgres -d mydb < partition_b.sql

Learning Notes #41 – Shared Lock and Exclusive Locks | Postgres

Parotta Salna

Mr.ParottaSalna

6 January 2025 at 14:07

Today, I learnt about various locking mechanism to prevent double update. In this blog, i make notes on Shared Lock and Exclusive Lock for my future self.

What Are Locks in Databases?

Locks are mechanisms used by a DBMS to control access to data. They ensure that transactions are executed in a way that maintains the ACID (Atomicity, Consistency, Isolation, Durability) properties of the database. Locks can be classified into several types, including

Shared Locks (S Locks): Allow multiple transactions to read a resource simultaneously but prevent any transaction from writing to it.
Exclusive Locks (X Locks): Allow a single transaction to modify a resource, preventing both reading and writing by other transactions.
Intent Locks: Used to signal the type of lock a transaction intends to acquire at a lower level.
Deadlock Prevention Locks: Special locks aimed at preventing deadlock scenarios.

Shared Lock

A shared lock is used when a transaction needs to read a resource (e.g., a database row or table) without altering it. Multiple transactions can acquire a shared lock on the same resource simultaneously. However, as long as one or more shared locks exist on a resource, no transaction can acquire an exclusive lock on that resource.


-- Transaction A: Acquire a shared lock on a row
BEGIN;
SELECT * FROM employees WHERE id = 1 FOR SHARE;
-- Transaction B: Acquire a shared lock on the same row
BEGIN;
SELECT * FROM employees WHERE id = 1 FOR SHARE;
-- Both transactions can read the row concurrently
-- Transaction C: Attempt to update the same row
BEGIN;
UPDATE employees SET salary = salary + 1000 WHERE id = 1;
-- Transaction C will be blocked until Transactions A and B release their locks

Key Characteristics of Shared Locks

1. Concurrent Reads

Shared locks allow multiple transactions to read the same resource at the same time.
This is ideal for operations like SELECT queries that do not modify data.

2. Write Blocking

While a shared lock is active, no transaction can modify the locked resource.
Prevents dirty writes and ensures read consistency.

3. Compatibility

Shared locks are compatible with other shared locks but not with exclusive locks.

When Are Shared Locks Used?

Shared locks are typically employed in read operations under certain isolation levels. For instance,

1. Read Committed Isolation Level:

Shared locks are held for the duration of the read operation.
Prevents dirty reads by ensuring the data being read is not modified by other transactions during the read.

2. Repeatable Read Isolation Level:

Shared locks are held until the transaction completes.
Ensures that the data read during a transaction remains consistent and unmodified.

3. Snapshot Isolation:

Shared locks may not be explicitly used, as the DBMS creates a consistent snapshot of the data for the transaction.

Exclusive Locks

An exclusive lock is used when a transaction needs to modify a resource. Only one transaction can hold an exclusive lock on a resource at a time, ensuring no other transactions can read or write to the locked resource.


-- Transaction X: Acquire an exclusive lock to update a row
BEGIN;
UPDATE employees SET salary = salary + 1000 WHERE id = 2;
-- Transaction Y: Attempt to read the same row
BEGIN;
SELECT * FROM employees WHERE id = 2;
-- Transaction Y will be blocked until Transaction X completes
-- Transaction Z: Attempt to update the same row
BEGIN;
UPDATE employees SET salary = salary + 500 WHERE id = 2;
-- Transaction Z will also be blocked until Transaction X completes

Key Characteristics of Exclusive Locks

1. Write Operations: Exclusive locks are essential for operations like INSERT, UPDATE, and DELETE.

2. Blocking Reads and Writes: While an exclusive lock is active, no other transaction can read or write to the resource.

3. Isolation: Ensures that changes made by one transaction are not visible to others until the transaction is complete.

When Are Exclusive Locks Used?

Exclusive locks are typically employed in write operations or any operation that modifies the database. For instance:

1. Transactional Updates – A transaction that updates a row acquires an exclusive lock to ensure no other transaction can access or modify the row during the update.

2. Table Modifications – When altering a table structure, the DBMS may place an exclusive lock on the entire table.

Benefits of Shared and Exclusive Locks

Benefits of Shared Locks

Consistency in Multi-User Environments – Ensure that data being read is not altered by other transactions, preserving consistency.
Concurrency Support – Allow multiple transactions to read data simultaneously, improving system performance.
Data Integrity – Prevent dirty reads and writes, ensuring that operations yield reliable results.

Benefits of Exclusive Locks

Data Integrity During Modifications – Prevents other transactions from accessing data being modified, ensuring changes are applied safely.
Isolation of Transactions – Ensures that modifications by one transaction are not visible to others until committed.

Limitations and Challenges

Shared Locks

Potential for Deadlocks – Deadlocks can occur if two transactions simultaneously hold shared locks and attempt to upgrade to exclusive locks.
Blocking Writes – Shared locks can delay write operations, potentially impacting performance in write-heavy systems.
Lock Escalation – In systems with high concurrency, shared locks may escalate to table-level locks, reducing granularity and concurrency.

Exclusive Locks

Reduced Concurrency – Exclusive locks prevent other transactions from accessing the locked resource, which can lead to bottlenecks in highly concurrent systems.
Risk of Deadlocks – Deadlocks can occur if two transactions attempt to acquire exclusive locks on resources held by each other.

Lock Compatibility

Learning Notes #34 – Consistency (Correctness) in ACID | Postgres

Parotta Salna

Mr.ParottaSalna

4 January 2025 at 12:37

As part of the ACID Series, i am refreshing on consistency. In this blog, i jot down notes on consistency (correctness) in postgres database.

What is Consistency?

Consistency ensures that a transaction brings the database from one valid state to another, adhering to predefined rules such as constraints, triggers, and relational integrity. If a transaction violates these rules, it is aborted, and the database remains unchanged. This guarantees that only valid data exists in the database.

Consistency works together with other ACID properties:

Atomicity ensures the “all-or-nothing” execution of a transaction.
Isolation ensures transactions don’t interfere with each other.
Durability guarantees committed transactions persist despite system failures

Key Aspects of Consistency in PostgreSQL

Constraints
- Primary Key: Ensures uniqueness of rows.
- Foreign Key: Maintains referential integrity.
- Check Constraints: Enforces custom business rules.
- Not Null: Ensures that specific columns cannot have null values.
Triggers
- Custom logic executed before or after specific database events.
Rules
- Enforce application-specific invariants on the database.
Transactions
- Changes are made in a controlled environment, ensuring consistency even in the event of errors or system failures.

Practical Examples of Consistency in PostgreSQL

1. Primary Key Constraint

Ensures that no two rows in a table have the same primary key value.


CREATE TABLE accounts (
    account_id SERIAL PRIMARY KEY,
    account_holder_name VARCHAR(255) NOT NULL,
    balance NUMERIC(15, 2) NOT NULL CHECK (balance >= 0)
);

-- Attempt to insert duplicate primary keys.
INSERT INTO accounts (account_id, account_holder_name, balance)
VALUES (1, 'Alice', 1000.00);

INSERT INTO accounts (account_id, account_holder_name, balance)
VALUES (1, 'Bob', 2000.00); -- This will fail.

2. Foreign Key Constraint

Enforces referential integrity between tables.


CREATE TABLE transactions (
    transaction_id SERIAL PRIMARY KEY,
    account_id INT NOT NULL REFERENCES accounts(account_id),
    amount NUMERIC(15, 2) NOT NULL,
    transaction_type VARCHAR(10) NOT NULL CHECK (transaction_type IN ('credit', 'debit')),
    transaction_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Attempt to insert a transaction for a non-existent account.
INSERT INTO transactions (account_id, amount, transaction_type)
VALUES (999, 500, 'credit'); -- This will fail.

3. Check Constraint

Validates custom business rules.


-- Ensure account balance cannot go negative.
INSERT INTO accounts (account_holder_name, balance)
VALUES ('Charlie', -500); -- This will fail due to the CHECK constraint.

4. Trigger for Business Logic

Ensures derived data or additional checks are implemented.


CREATE OR REPLACE FUNCTION enforce_minimum_balance()
RETURNS TRIGGER AS $$
BEGIN
    IF NEW.balance < 0 THEN
        RAISE EXCEPTION 'Balance cannot be negative';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER check_balance_before_insert
BEFORE INSERT OR UPDATE ON accounts
FOR EACH ROW EXECUTE FUNCTION enforce_minimum_balance();

-- Attempt to update an account with a negative balance.
UPDATE accounts SET balance = -100 WHERE account_id = 1; -- This will fail.

5. Transactions to Maintain Consistency

A transaction groups multiple operations into a single unit, ensuring all succeed or none.


BEGIN;

-- Deduct from sender's account.
UPDATE accounts SET balance = balance - 500 WHERE account_id = 1;

-- Credit to receiver's account.
UPDATE accounts SET balance = balance + 500 WHERE account_id = 2;

-- If any operation fails, rollback the transaction.
COMMIT;

If the system crashes before the COMMIT, the database remains unchanged, ensuring consistency.

How Consistency Works with Other ACID Properties

With Atomicity: If any step in a transaction violates a constraint, the entire transaction is rolled back, ensuring that the database remains consistent.
With Isolation: Concurrent transactions operate independently, preventing inconsistent states caused by interference.
With Durability: Once a transaction is committed, its consistency guarantees persist even in the event of a crash.

Benefits of Consistency

Data Integrity: Prevents invalid data from being stored.
Application Reliability: Reduces the need for additional application-level checks.
Simplified Maintenance: Developers can rely on the database to enforce business rules and relationships.
Error Prevention: Constraints and triggers act as safeguards, catching mistakes early.

Learning Notes #28 – Unlogged Table in Postgres

Parotta Salna

Mr.ParottaSalna

2 January 2025 at 17:30

Today, As part of daily reading, i came across https://raphaeldelio.com/2024/07/14/can-postgres-replace-redis-as-a-cache/ where they discussing about postgres as a cache ! and comparing it with redis !! I was surprised at the title so gave a read through. Then i came across a concept of UNLOGGED table which act as a fast retrieval as cache. In this blog i jot down notes on unlogged table for future reference.

Highly Recommended Links: https://martinheinz.dev/blog/105, https://raphaeldelio.com/2024/07/14/can-postgres-replace-redis-as-a-cache/, https://www.crunchydata.com/blog/postgresl-unlogged-tables

Unlogged tables offer unique benefits in scenarios where speed is paramount, and durability (the guarantee that data is written to disk and will survive crashes) is not critical.

What Are Unlogged Tables?

Postgres Architecture : https://miro.com/app/board/uXjVLD2T5os=/

In PostgreSQL, a table is a basic unit of data storage. By default, PostgreSQL ensures that data in regular tables is durable. This means that all data is written to the disk and will survive server crashes. However, in some situations, durability is not necessary. Unlogged tables are special types of tables in PostgreSQL where the database does not write data changes to the WAL (Write-Ahead Log).

The absence of WAL logging for unlogged tables makes them faster than regular tables because PostgreSQL doesn’t need to ensure data consistency across crashes for these tables. However, this also means that if the server crashes or the system is powered off, the data in unlogged tables is lost.

Key Characteristics of Unlogged Tables

No Write-Ahead Logging (WAL) – By default, PostgreSQL writes changes to the WAL to ensure data durability. For unlogged tables, this step is skipped, making operations like INSERTs, UPDATEs, and DELETEs faster.
No Durability – The absence of WAL means that unlogged tables will lose their data if the database crashes or if the server is restarted. This makes them unsuitable for critical data.
Faster Performance – Since WAL writes are skipped, unlogged tables are faster for data insertion and modification. This can be beneficial for use cases where data is transient and doesn’t need to persist beyond the current session.
Support for Indexes and Constraints – Unlogged tables can have indexes and constraints like regular tables. However, the data in these tables is still non-durable.
Automatic Cleanup – When the PostgreSQL server restarts, the data in unlogged tables is automatically dropped. Therefore, unlogged tables only hold data during the current database session.

Drawbacks of Unlogged Tables

Data Loss on Crash – The most significant disadvantage of unlogged tables is the loss of data in case of a crash or restart. If the application depends on this data, then using unlogged tables would not be appropriate.
Not Suitable for Critical Applications – Applications that require data persistence (such as financial or inventory systems) should avoid using unlogged tables, as the risk of data loss outweighs any performance benefits.
No Replication – Unlogged tables are not replicated in standby servers in a replication setup, as the data is not written to the WAL.

Creating an Unlogged Table

Creating an unlogged table is very straightforward in PostgreSQL. You simply need to add the UNLOGGED keyword when creating the table.


CREATE UNLOGGED TABLE temp_data (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    value INT
);

In this example, temp_data is an unlogged table. All operations performed on this table will not be logged to the WAL.

When to Avoid Unlogged Tables?

If you are working with critical data that needs to be durable and persistent across restarts.
If your application requires data replication, as unlogged tables are not replicated in standby servers.
If your workload involves frequent crash scenarios where data loss cannot be tolerated.

Examples

1. Temporary Storage for processing


CREATE UNLOGGED TABLE etl_staging (
    source_id INT,
    raw_data JSONB,
    processed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Insert raw data into the staging table
INSERT INTO etl_staging (source_id, raw_data)
VALUES 
    (1, '{"key": "value1"}'),
    (2, '{"key": "value2"}');

-- Perform transformations on the data
INSERT INTO final_table (id, key, value)
SELECT source_id, 
       raw_data->>'key' AS key, 
       'processed_value' AS value
FROM etl_staging;

-- Clear the staging table
TRUNCATE TABLE etl_staging;

2. Caching


CREATE UNLOGGED TABLE user_sessions (
    session_id UUID PRIMARY KEY,
    user_id INT,
    last_accessed TIMESTAMP DEFAULT NOW()
);

-- Insert session data
INSERT INTO user_sessions (session_id, user_id)
VALUES 
    (uuid_generate_v4(), 101),
    (uuid_generate_v4(), 102);

-- Update last accessed timestamp
UPDATE user_sessions
SET last_accessed = NOW()
WHERE session_id = 'some-session-id';

-- Delete expired sessions
DELETE FROM user_sessions WHERE last_accessed < NOW() - INTERVAL '1 hour';

Learning Notes #21 – Sharding (database) with Postgres

Parotta Salna

Mr.ParottaSalna

31 December 2024 at 07:54

…contd. Early Morning today, i watched a video on partitioning and sharding. In that video, Arpit explained the limitation of Vertical Scaling and ways to infinite scale DB with Sharding and Partitioning. In this blog, i jot down notes on partioining with single node implementation with postgres for my future self.

As the volume of data grows, managing databases efficiently becomes critical and when we understood that vertical scaling has its limits, we have two common strategies to handle large datasets are partitioning and sharding. While they may sound similar, these techniques serve different purposes and are implemented differently. Let’s explore these concepts in detail.

What is Sharding?

Sharding is a type of database architecture where data is horizontally divided across multiple database instances, called shards. Each shard is an independent database, often hosted on separate servers. Sharding is commonly used to scale out databases.

How Sharding Works

Shard Key
- A shard key determines how data is distributed across shards.
- Example: A social media app might use user IDs as the shard key to ensure all data related to a user resides in the same shard.
Data Distribution
- Data is split horizontally; each shard contains a subset of the entire dataset.
- Shards can be distributed geographically or across servers for better performance.

Combining Partitioning and Sharding

In some advanced architectures, partitioning and sharding are combined. Here, partitioned data is further distributed across shards. Each shard can manage its partitions independently, providing both scalability and query optimization.

PostgreSQL Example with Citus

1. Install the Citus extension (https://www.citusdata.com/download/, https://github.com/citusdata/citus?tab=readme-ov-file#getting-started)

2. Create a distributed table with partitioning

CREATE TABLE orders (
    id SERIAL,
    customer_id INT,
    order_date DATE NOT NULL,
    PRIMARY KEY (id, order_date)
) PARTITION BY RANGE (order_date);

SELECT create_distributed_table('orders', 'customer_id');

CREATE TABLE orders_jan PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE orders_feb PARTITION OF orders
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

3. Add worker nodes to manage shards

SELECT master_add_node('worker1', 5432);
SELECT master_add_node('worker2', 5432);

4. Data placement

The data for each partition (e.g., orders_jan) is distributed across the shards (worker1, worker2) based on the shard key (customer_id).

5. Insert and query data

INSERT INTO orders (customer_id, order_date) VALUES (101, '2024-01-15');
SELECT * FROM orders WHERE customer_id = 101;

Learning Notes #20 – Partitioning (data) With Postgres

Parotta Salna

Mr.ParottaSalna

31 December 2024 at 06:55

Early Morning today, i watched a video on partitioning and sharding. In that video, Arpit explained the limitation of Vertical Scaling and ways to infinite scale DB with Sharding and Partitioning. In this blog, i jot down notes on partioining with single node implementation with postgres for my future self.

What is Partitioning?

Partitioning involves dividing a large dataset into smaller, manageable segments, known as partitions. Each partition is stored separately but remains part of a single database instance. Partitioning is typically used to improve query performance and manageability.

Types of Partitioning

1. Range Partitioning

Data is divided based on ranges of a column’s values.
Example: A table storing customer orders might partition data by order date: January orders in one partition, February orders in another.

PostgreSQL Example

CREATE TABLE orders (
    id SERIAL,
    customer_id INT,
    order_date DATE NOT NULL,
    PRIMARY KEY (id, order_date) -- Include the partition key
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_jan PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE orders_feb PARTITION OF orders
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

2. Hash Partitioning

A hash function determines the partition where a record will be stored.
Example: Orders can be distributed across partitions based on the hash of the customer ID.

Postgres Example

CREATE TABLE orders (
    id SERIAL ,
    customer_id INT,
    order_date DATE NOT NULL,
    PRIMARY KEY (id, customer_id)
) PARTITION BY HASH (customer_id, id);

CREATE TABLE orders_part_1 PARTITION OF orders
    FOR VALUES WITH (MODULUS 2, REMAINDER 0);

CREATE TABLE orders_part_2 PARTITION OF orders
    FOR VALUES WITH (MODULUS 2, REMAINDER 1);

3. List Partitioning

Data is divided based on a predefined list of values.
Example: A table storing sales data could partition based on regions: North, South, East, and West

Postgres Example

CREATE TABLE sales (
    id SERIAL ,
    region TEXT NOT NULL,
    amount NUMERIC,
    PRIMARY KEY (id, region)
) PARTITION BY LIST (region);

CREATE TABLE sales_north PARTITION OF sales
    FOR VALUES IN ('North');

CREATE TABLE sales_south PARTITION OF sales
    FOR VALUES IN ('South');

4. Composite Partitioning

Combines two or more partitioning strategies, such as range and list partitioning.
Example: A table partitioned by range on order date and sub-partitioned by list on region.

Postgres Example

CREATE TABLE orders (
    id SERIAL,
    customer_id INT,
    order_date DATE NOT NULL,
    region TEXT NOT NULL,
    PRIMARY KEY (id, order_date, region)
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01')
    PARTITION BY LIST (region);

CREATE TABLE orders_2024_north PARTITION OF orders_2024
    FOR VALUES IN ('North');

CREATE TABLE orders_2024_south PARTITION OF orders_2024
    FOR VALUES IN ('South');

Postgres – Write-Ahead Logging (WAL) in PostgreSQL

Parotta Salna

Mr.ParottaSalna

16 November 2024 at 07:06

Write-Ahead Logging (WAL) is a fundamental feature of PostgreSQL, ensuring data integrity and facilitating critical functionalities like crash recovery, replication, and backup.

This series of experimentation explores WAL in detail, its importance, how it works, and provides examples to demonstrate its usage.

What is Write-Ahead Logging (WAL)?

WAL is a logging mechanism where changes to the database are first written to a log file before being applied to the actual data files. This ensures that in case of a crash or unexpected failure, the database can recover and replay these logs to restore its state.

Your question is right !

Why do we need a WAL, when we do a periodic backup ?

Write-Ahead Logging (WAL) is critical even when periodic backups are in place because it complements backups to provide data consistency, durability, and flexibility in the following scenarios.

1. Crash Recovery

Why It’s Important: Periodic backups only capture the database state at specific intervals. If a crash occurs after the latest backup, all changes made since that backup would be lost.
Role of WAL: WAL ensures that any committed transactions not yet written to data files (due to PostgreSQL’s lazy-writing behavior) are recoverable. During recovery, PostgreSQL replays the WAL logs to restore the database to its last consistent state, bridging the gap between the last checkpoint and the crash.

Example:

Backup Taken: At 12:00 PM.
Crash Occurs: At 1:30 PM.
Without WAL: All changes after 12:00 PM are lost.
With WAL: All changes up to 1:30 PM are recovered.

2. Point-in-Time Recovery (PITR)

Why It’s Important: Periodic backups restore the database to the exact time of the backup. However, this may not be sufficient if you need to recover to a specific point, such as just before a mistake (e.g., accidental data deletion).
Role of WAL: WAL records every change, enabling you to replay transactions up to a specific time. This allows fine-grained recovery beyond what periodic backups can provide.

Example:

Backup Taken: At 12:00 AM.
Mistake Made: At 9:45 AM, an important table is accidentally dropped.
Without WAL: Restore only to 12:00 AM, losing 9 hours and 45 minutes of data.
With WAL: Restore to 9:44 AM, recovering all valid changes except the accidental drop.

3. Replication and High Availability

Why It’s Important: In a high-availability setup, replicas must stay synchronized with the primary database to handle failovers. Periodic backups cannot provide real-time synchronization.
Role of WAL: WAL enables streaming replication by transmitting logs to replicas, ensuring near real-time synchronization.

Example:

A primary database sends WAL logs to replicas as changes occur. If the primary fails, a replica can quickly take over without data loss.

4. Handling Incremental Changes

Why It’s Important: Periodic backups store complete snapshots of the database, which can be time-consuming and resource-intensive. They also do not capture intermediate changes.
Role of WAL: WAL allows incremental updates by recording only the changes made since the last backup or checkpoint. This is crucial for efficient data recovery and backup optimization.

5. Ensuring Data Durability

Why It’s Important: Even during normal operations, a database crash (e.g., power failure) can occur. Without WAL, transactions committed by users but not yet flushed to disk are lost.
Role of WAL: WAL ensures durability by logging all changes before acknowledging transaction commits. This guarantees that committed transactions are recoverable even if the system crashes before flushing the changes to data files.

6. Supporting Hot Backups

Why It’s Important: For large, active databases, taking a backup while the database is running can result in inconsistent snapshots.
Role of WAL: WAL ensures consistency by recording changes that occur during the backup process. When replayed, these logs synchronize the backup, ensuring it is valid and consistent.

7. Debugging and Auditing

Why It’s Important: Periodic backups are static snapshots and don’t provide a record of what happened in the database between backups.
Role of WAL: WAL contains a sequential record of all database modifications, which can help in debugging issues or auditing transactions.

Feature	Periodic Backups	Write-Ahead Logging
Crash Recovery	Limited to the last backup	Ensures full recovery to the crash point
Point-in-Time Recovery	Restores only to the backup time	Allows recovery to any specific point
Replication	Not supported	Enables real-time replication
Efficiency	Full snapshot	Incremental changes
Durability	Relies on backup frequency	Guarantees transaction durability

In upcoming sessions, we will all experiment each one of the failure scenarios for understanding.

Postgres Ep 2 : Amutha Hotel and Issues with Flat Files

Parotta Salna

Mr.ParottaSalna

23 August 2024 at 01:29

Once upon a time in ooty, there was a small business called “Amutha Hotel,” run by a passionate baker named Saravanan. Saravanan bakery was famous for its delicious sambar, and as his customer base grew, he needed to keep track of orders, customer information, and inventory.

Being a techie, he decided to store all this information in a flat file a simple spreadsheet named “HotelData.csv.”

The Early Days: Simple and Sweet

At first, everything was easy. Saravanan’s flat file had only a few columns, OrderID, CustomerName, Product, Quantity, and Price. Each row represented a new order, and it was simple enough to manage. Saravanan could quickly find orders, calculate totals, and even check his inventory by filtering the file.

The Business Grows: Complexity Creeps In

As the business boomed, Saravanan started offering new products, special discounts, and loyalty programs. He added more columns to her flat file, like Discount, LoyaltyPoints, and DeliveryAddress. He once-simple file began to swell with information.

Then, Saravanan decided to start tracking customer preferences and order history. He began adding multiple rows for the same customer, each representing a different order. His flat file now had repeating groups of data for each customer, and it became harder and harder to find the information he needed.

His flat file was getting out of hand. For every new order from a returning customer, he had to re-enter all their information

CustomerName, DeliveryAddress, LoyaltyPoints

over and over again. This duplication wasn’t just tedious; it started to cause mistakes. One day, he accidentally typed “John Smyth” instead of “John Smith,” and suddenly, his loyal customer was split into two different entries.

On a Busy Saturday

One busy Saturday, Saravanan opened his flat file to update the day’s orders, but instead of popping up instantly as it used to, it took several minutes to load. As he scrolled through the endless rows, his computer started to lag, and the spreadsheet software even crashed a few times. The file had become too large and cumbersome for him to handle efficiently.

Customers were waiting longer for their orders to be processed because Saravanan was struggling to find their previous details and apply the right discounts. The flat file that once served his so well was now slowing her down, and it was affecting her business.

The Journaling

Techie Saravanan started to note these issues in to a notepad. He badly wants a solution which will solve these problems. So he started listing out the problems with examples to look for a solution.

His journal continues …

Before databases became common for data storage, flat files (such as CSVs or text files) were often used to store and manage data. The data file that we use has no special structure; it’s just some lines of text that mean something to the particular application that reads it. It has no inherent structure

However, these flat files posed several challenges, particularly when dealing with repeating groups, which are essentially sets of related fields that repeat multiple times within a record. Here are some of the key problems associated with repeating groups in flat files,

1. Data Redundancy

Description: Repeating groups can lead to significant redundancy, as the same data might need to be repeated across multiple records.
Example: If an employee can have multiple skills, a flat file might need to repeat the employee’s name, ID, and other details for each skill.
Problem: This not only increases the file size but also makes data entry, updates, and deletions more prone to errors.

Eg: Suppose you are maintaining a flat file to track employees and their skills. Each employee can have multiple skills, which you store as repeating groups in the file.

EmployeeID, EmployeeName, Skill1, Skill2, Skill3, Skill4
1, John Doe, Python, SQL, Java, 
2, Jane Smith, Excel, PowerPoint, Python, SQL

If an employee has four skills, you need to add four columns (Skill1, Skill2, Skill3, Skill4). If an employee has more than four skills, you must either add more columns or create a new row with repeated employee details.

2. Data Inconsistency

Description: Repeating groups can lead to inconsistencies when data is updated.
Example: If an employee’s name changes, and it’s stored multiple times in different rows because of repeating skills, it’s easy for some instances to be updated while others are not.
Problem: This can lead to situations where the same employee is listed under different names or IDs in the same file.

Eg: Suppose you are maintaining a flat file to track employees and their skills. Each employee can have multiple skills, which you store as repeating groups in the file.

EmployeeID, EmployeeName, Skill1, Skill2, Skill3, Skill4
1, John Doe, Python, SQL, Java, 
2, Jane Smith, Excel, PowerPoint, Python, SQL

If John’s name changes to “John A. Doe,” you must manually update each occurrence of “John Doe” across all rows, which increases the chance of inconsistencies.

3. Difficulty in Querying

Description: Querying data in flat files with repeating groups can be cumbersome and inefficient.
Example: Extracting a list of unique employees with their respective skills requires complex scripting or manual processing.
Problem: Unlike relational databases, which use joins to simplify such queries, flat files require custom logic to manage and extract data, leading to slower processing and more potential for errors.

Eg: Suppose you are maintaining a flat file to track employees and their skills. Each employee can have multiple skills, which you store as repeating groups in the file.

EmployeeID, EmployeeName, Skill1, Skill2, Skill3, Skill4
1, John Doe, Python, SQL, Java, 
2, Jane Smith, Excel, PowerPoint, Python, SQL

Extracting a list of all employees proficient in “Python” requires you to search across multiple skill columns (Skill1, Skill2, etc.), which is cumbersome compared to a relational database where you can use a simple JOIN on a normalized EmployeeSkills table.

4. Limited Scalability

Description: Flat files do not scale well when the number of repeating groups or the size of the data grows.
Example: A file with multiple repeating fields can become extremely large and difficult to manage as the number of records increases.
Problem: This can lead to performance issues, such as slow read/write operations and difficulty in maintaining the file over time.

Eg: You are storing customer orders in a flat file where each customer can place multiple orders.

CustomerID, CustomerName, Order1ID, Order1Date, Order2ID, Order2Date, Order3ID, Order3Date
1001, Alice Brown, 5001, 2023-08-01, 5002, 2023-08-15, 
1002, Bob White, 5003, 2023-08-05,

If Alice places more than three orders, you’ll need to add more columns (Order4ID, Order4Date, etc.), leading to an unwieldy file with many empty cells for customers with fewer orders.

5. Challenges in Data Integrity

Description: Ensuring data integrity in flat files with repeating groups is difficult.
Example: Enforcing rules like “an employee can only have unique skills” is nearly impossible in a flat file format.
Problem: This can result in duplicated or invalid data, which is hard to detect and correct without a database system.

Eg: You are storing customer orders in a flat file where each customer can place multiple orders.

CustomerID, CustomerName, Order1ID, Order1Date, Order2ID, Order2Date, Order3ID, Order3Date
1001, Alice Brown, 5001, 2023-08-01, 5002, 2023-08-15, 
1002, Bob White, 5003, 2023-08-05,

There’s no easy way to enforce that each order ID is unique and corresponds to the correct customer, which could lead to errors or duplicated orders.

6. Complex File Formats

Description: Managing and processing flat files with repeating groups often requires complex file formats.
Example: Custom delimiters or nested formats might be needed to handle repeating groups, making the file harder to understand and work with.
Problem: This increases the likelihood of errors during data entry, processing, or when the file is read by different systems.

Eg: You are storing customer orders in a flat file where each customer can place multiple orders.

CustomerID, CustomerName, Order1ID, Order1Date, Order2ID, Order2Date, Order3ID, Order3Date
1001, Alice Brown, 5001, 2023-08-01, 5002, 2023-08-15, 
1002, Bob White, 5003, 2023-08-05,

As the number of orders grows, the file format becomes increasingly complex, requiring custom scripts to manage and extract order data for each customer.

7. Lack of Referential Integrity

Description: Flat files lack mechanisms to enforce referential integrity between related groups of data.
Example: Ensuring that a skill listed in one file corresponds to a valid skill ID in another file requires manual checks or complex logic.
Problem: This can lead to orphaned records or mismatches between related data sets.

Eg: A fleet management company tracks maintenance records for each vehicle in a flat file. Each vehicle can have multiple maintenance records.

VehicleID, VehicleType, Maintenance1Date, Maintenance1Type, Maintenance2Date, Maintenance2Type
V001, Truck, 2023-01-15, Oil Change, 2023-03-10, Tire Rotation
V002, Van, 2023-02-20, Brake Inspection, ,

There’s no way to ensure that the Maintenance1Type and Maintenance2Type fields are valid maintenance types or that the dates are in correct chronological order.

8. Difficulty in Data Modification

Description: Modifying data in flat files with repeating groups can be complex and error-prone.
Example: Adding or removing an item from a repeating group might require extensive manual edits across multiple records.
Problem: This increases the risk of errors and makes data management time-consuming.

Eg: A university maintains a flat file to record student enrollments in courses. Each student can enroll in multiple courses.

StudentID, StudentName, Course1, Course2, Course3, Course4, Course5
2001, Charlie Green, Math101, Physics102, , , 
2002, Dana Blue, History101, Math101, Chemistry101, ,

If a student drops a course or switches to a different one, manually editing the file can easily lead to errors, especially as the number of students and courses increases.

After listing down all these, Saravanan started looking into solutions. His search goes on…

What is Relational Database and Postgres Sql ?

Parotta Salna

Mr.ParottaSalna

21 August 2024 at 01:56

In the city of Data, the citizens relied heavily on organizing their information. The city was home to many different types of data numbers, names, addresses, and even some exotic types like images and documents. But as the city grew, so did the complexity of managing all this information.

One day, the city’s leaders called a meeting to discuss how best to handle the growing data. They were split between two different systems

the old and trusted Relational Database Management System (RDBMS)
the new, flashy NoSQL databases.

Enter Relational Databases:

Relational databases were like the city’s libraries. They had rows of neatly organized shelves (tables) where every book (data entry) was placed according to a specific category (columns).

Each book had a unique ID (primary key) so that anyone could find it quickly. These libraries had been around for decades, and everyone knew how to use them.

The RDBMS was more than just a library. It enforced rules (constraints) to ensure that no book went missing, was duplicated, or misplaced. It even allowed librarians (queries) to connect different books using relationships (joins).

If you wanted to find all the books by a particular author that were published in the last five years, the RDBMS could do it in a heartbeat.

The Benefits of RDBMS:

The citizens loved the RDBMS because it was:

Organized: Everything was in its place, and data was easy to find.
Reliable: The rules ensured data integrity, so they didn’t have to worry about inconsistencies.
Powerful: It could handle complex queries, making it easy to get insights from their data.
Secure: Access to the data could be controlled, keeping it safe from unauthorized users.

The Rise of NoSQL:

But then came the NoSQL databases, which were more like vast, sprawling warehouses. These warehouses didn’t care much about organization; they just stored everything in a big open space. You could toss in anything, and it would accept it—no need for strict categories or relationships. This flexibility appealed to the tech-savvy citizens who wanted to store newer, more diverse types of data like social media posts, images, and videos.

NoSQL warehouses were fast. They could handle enormous amounts of data without breaking a sweat and were perfect for real-time applications like chat systems and analytics.

The PostgreSQL Advantage:

PostgreSQL was a superstar in the world of RDBMS. It combined the organization and reliability of traditional relational databases with some of the flexibility of NoSQL. It allowed citizens to store structured data in tables while also offering support for unstructured data types like JSON. This made PostgreSQL a versatile choice, bridging the gap between the old and new worlds.

For installing postgres : https://www.postgresql.org/download/

The Dilemma: PostgreSQL vs. NoSQL:

The city faced a dilemma. Should they stick with PostgreSQL, which offered the best of both worlds, or fully embrace NoSQL for its speed and flexibility? The answer wasn’t simple. It depended on what the city valued more: the structured, reliable nature of PostgreSQL or the unstructured, flexible approach of NoSQL.

For applications that required strict data integrity and complex queries, PostgreSQL was the way to go. But for projects that needed to handle massive amounts of unstructured data quickly, NoSQL was the better choice.

Conclusion:

In the end, the city of Data realized that there was no one-size-fits-all solution. They decided to use PostgreSQL for applications where data relationships and integrity were crucial, and NoSQL for those that required speed and flexibility with diverse data types.

And so, the citizens of Data lived happily, managing their information with the right tools for the right tasks, knowing that both systems had their place in the ever-growing city.

Lock your (postgres) tables from being altered

ashish thomas cherian

ashish

14 April 2022 at 20:55

Want to lock your table but only on DDL operations? Use the below code to ensure no one is altering your tables. To achieve this we will create a trigger on all alter table commands, then filter it down by table name and finally throw an exception telling the user that they cannot alter the table.

Ingesting large files to postgres through S3

ashish thomas cherian

ashish

13 April 2022 at 20:16

One of the tasks I recently came accross my job was ingest large files but with the following

Do some processing ( like generate hash for each row )
Insert it into S3 for audit purposes
Insert into postgres

Note:

Keep in mind your postgres database needs to support this and a s3 bucket policy needs to exist in order to allow the data to be copied over.

The setup I am using is a RDS database with S3 in the same region and proper policies and IAM roles already created.

Read more on that here – AWS documentation

For the purpose of this post I will be using dummy data from – eforexcel(1 million records)

The most straight forward way to do this would be to just do a df.to_sql like this

df = pd.read_csv("records.csv")
df.to_sql(
    name="test_table",
    con=connection_detail,
    schema="schema",
    if_exists="replace",
)

Something like this would take more than an hour! Lets do it in less than 5 minutes.

Now ofcourse there are several ways to make this faster – using copy expert, psycogpg driver etc(maybe a sepearate blog post on these), but that’s not the use case I have been tasked with. Since we need to upload the file s3 in the end for audit purposes I will ingest the data from S3 to DB.

Generate table metadata

Before we can assign an s3 operator to ingest the data we need to create the table into which this data will be inserted. We have two ways that I can think of

Each column in the file will be created in the DB with a highest threshold value like varchar(2000)
Each column is created with the data length as max length in each row

I will be going with option 2 here.

This entire process took around 210 seconds instead of more than an hour like the last run.

Let’s go over the code one by one

Read the csv

We can pass the data directly to pandas or stream it into buffered memory something like this

with open("records.csv") as f:
    csv_rdr = csv.reader(f, delimiter=",")
    header = next(csv_rdr)
    with gzip.GzipFile(fileobj=mem_file, mode="wb", compresslevel=6) as gz:
        buff = io.StringIO()
        writer = csv.writer(buff)
        writer.writerows([header])
        for row in csv_rdr:
            writer.writerows([row])
        gz.write(buff.getvalue().encode("utf-8", "replace"))
    mem_file.seek(0)
    s3.put_object(Bucket="mybucket", Key="folder/file.gz", Body=mem_file)

2. Since the file is less than 50 MB i’ll go ahead and load it directly.

Create the table

Get the max lengths of each column and use that to generate the table. We use pandas to_sql() function for this and pass the dtypes.

Copy data from s3 gzipped file to postgres

Finally we use –

aws_s3.table_import_from_s3

to copy over the file to the postgres table.

Granting Access to Read-Only Users and Refreshing Permissions Automatically: A Function-Based Solution

angu10

14 March 2023 at 01:52

Problem Statement

We have a PostgreSQL database with multiple schemas and tables. Some users have read-only access to the database and and they relay on Devops/Support team to refresh their access to view any new schemas or tables added to the database. We need to provide a solution to allow read-only users to refresh their access so they can view new schemas and tables as they are added.

Named Read-only User Group

Function 1: Will create a user and create a read_only group not available. If the group is available, it will create the user and password, attach it to the read_only group, and add all existing schema read-only access.

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE or replace FUNCTION create_users_and_grant_access(users text[]) RETURNS void AS $$
DECLARE
    READONLY_GROUP text := 'readonly';
    password text;
    user_name text;
    schemata text;
BEGIN
    FOREACH user_name IN ARRAY users LOOP
        -- Check if the user already exists
        PERFORM 1 FROM pg_user WHERE usename = user_name;
        IF NOT FOUND THEN
            -- Generate a random password for the new user
            password := encode(gen_random_bytes(12), 'base64');


            -- Create the database user with the hashed password
            RAISE NOTICE 'Creating database user: %', user_name;
            RAISE NOTICE 'Password: %', password;
            EXECUTE format('CREATE USER %I WITH PASSWORD %L', user_name, password);

            -- Create the read-only group if it does not exist
            PERFORM 1 FROM pg_roles WHERE rolname = READONLY_GROUP;
            IF NOT FOUND THEN
                RAISE NOTICE 'Creating read-only group: %', READONLY_GROUP;
                EXECUTE format('CREATE ROLE %I', READONLY_GROUP);
            END IF;

            -- Add the user to the read-only group
            RAISE NOTICE 'Adding user to read-only group: %', READONLY_GROUP;
            EXECUTE format('GRANT %I TO %I', READONLY_GROUP, user_name);
        ELSE
            RAISE NOTICE 'User already exists: %', user_name;
        END IF;
    END LOOP;

    -- Grant read-only access to all schemas for the read-only group
    FOR schemata IN SELECT schema_name FROM information_schema.schemata WHERE schema_name NOT LIKE 'pg_%' AND schema_name != 'information_schema' LOOP
        -- Check if the read-only group already has access to the schema
        PERFORM 1 FROM information_schema.role_table_grants WHERE grantee = READONLY_GROUP AND table_schema = schemata;
        IF NOT FOUND THEN
            -- Grant read-only access to the schema for the read-only group
            RAISE NOTICE 'Granting read-only access to schema: %', schemata;
            EXECUTE format('GRANT USAGE ON SCHEMA %I TO %I', schemata, READONLY_GROUP);
            EXECUTE format('GRANT SELECT ON ALL TABLES IN SCHEMA %I TO %I', schemata, READONLY_GROUP);
            EXECUTE format('GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA %I TO %I', schemata, READONLY_GROUP);
        ELSE
            RAISE NOTICE 'Read-only access already granted to schema: %', schemata;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Function 2:

This function will enable users to refresh read_only permissions, so they don’t have to rely on DevOps

CREATE OR REPLACE FUNCTION grant_readonly_access(schematabe text DEFAULT NULL)
RETURNS void
SECURITY DEFINER
AS $$
DECLARE
  READONLY_GROUP text := 'readonly';
BEGIN
  IF schematabe IS NOT NULL THEN
    -- Grant read-only access to specified schema for the user and read-only group
    PERFORM 1 FROM information_schema.schemata WHERE schema_name = schematabe;
    IF FOUND THEN
      RAISE NOTICE 'Granting read-only access to schema: % for user: %', schematabe, READONLY_GROUP;
      EXECUTE format('GRANT USAGE ON SCHEMA %I TO %I', schematabe, readonly_group);
      EXECUTE format('GRANT SELECT ON ALL TABLES IN SCHEMA %I TO %I', schematabe, readonly_group);
      EXECUTE format('GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA %I TO %I', schematabe, readonly_group);
      EXECUTE format('GRANT USAGE ON SCHEMA %I TO %I', schematabe, READONLY_GROUP);
      EXECUTE format('GRANT SELECT ON ALL TABLES IN SCHEMA %I TO %I', schematabe, READONLY_GROUP);
      EXECUTE format('GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA %I TO %I', schematabe, READONLY_GROUP);
    ELSE
      RAISE EXCEPTION 'Schema not found: %', schematabe;
    END IF;
  ELSE
    -- Grant read-only access to all schemas for the user and read-only group
    FOR schematabe IN SELECT schema_name FROM information_schema.schemata WHERE schema_name NOT LIKE 'pg_%' AND schema_name != 'information_schema' LOOP
      -- Check if the read-only group already has access to the schema
      PERFORM 1 FROM information_schema.role_table_grants WHERE grantee = readonly_group AND table_schema = schematabe;
      IF NOT FOUND THEN
        -- Grant read-only access to the schema for the read-only group
        RAISE NOTICE 'Granting read-only access to schema: % for user: %', schematabe, READONLY_GROUP;
        EXECUTE format('GRANT USAGE ON SCHEMA %I TO %I', schematabe, readonly_group);
        EXECUTE format('GRANT SELECT ON ALL TABLES IN SCHEMA %I TO %I', schematabe, readonly_group);
        EXECUTE format('GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA %I TO %I', schematabe, readonly_group);
        EXECUTE format('GRANT USAGE ON SCHEMA %I TO %I', schematabe, READONLY_GROUP);
        EXECUTE format('GRANT SELECT ON ALL TABLES IN SCHEMA %I TO %I', schematabe, READONLY_GROUP);
        EXECUTE format('GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA %I TO %I', schematabe, READONLY_GROUP);
      ELSE
        RAISE NOTICE 'Read-only access already granted to schema: % for user: %', schematabe, READONLY_GROUP;
      END IF;
    END LOOP;
  END IF;
END;
$$ LANGUAGE plpgsql;