
Postgres Ep 2 : Amutha Hotel and Issues with Flat Files

23 August 2024 at 01:29

Once upon a time in Ooty, there was a small business called “Amutha Hotel,” run by a passionate baker named Saravanan. Saravanan’s bakery was famous for its delicious sambar, and as his customer base grew, he needed to keep track of orders, customer information, and inventory.

Being a techie, he decided to store all this information in a flat file: a simple spreadsheet named “HotelData.csv.”

The Early Days: Simple and Sweet

At first, everything was easy. Saravanan’s flat file had only a few columns: OrderID, CustomerName, Product, Quantity, and Price. Each row represented a new order, and it was simple enough to manage. Saravanan could quickly find orders, calculate totals, and even check his inventory by filtering the file.

The Business Grows: Complexity Creeps In

As the business boomed, Saravanan started offering new products, special discounts, and loyalty programs. He added more columns to his flat file, like Discount, LoyaltyPoints, and DeliveryAddress. His once-simple file began to swell with information.

Then, Saravanan decided to start tracking customer preferences and order history. He began adding multiple rows for the same customer, each representing a different order. His flat file now had repeating groups of data for each customer, and it became harder and harder to find the information he needed.

His flat file was getting out of hand. For every new order from a returning customer, he had to re-enter all their information (CustomerName, DeliveryAddress, LoyaltyPoints) over and over again. This duplication wasn’t just tedious; it started to cause mistakes. One day, he accidentally typed “John Smyth” instead of “John Smith,” and suddenly, his loyal customer was split into two different entries.

On a Busy Saturday

One busy Saturday, Saravanan opened his flat file to update the day’s orders, but instead of popping up instantly as it used to, it took several minutes to load. As he scrolled through the endless rows, his computer started to lag, and the spreadsheet software even crashed a few times. The file had become too large and cumbersome for him to handle efficiently.

Customers were waiting longer for their orders to be processed because Saravanan was struggling to find their previous details and apply the right discounts. The flat file that once served him so well was now slowing him down, and it was affecting his business.

The Journaling

Techie Saravanan started to note these issues down in a notepad. He badly wanted a solution that would solve these problems, so he started listing them out with examples to look for a solution.

His journal continues …

Before databases became common for data storage, flat files (such as CSVs or text files) were often used to store and manage data. A flat file has no special structure; it is just lines of text that mean something only to the particular application that reads it.

However, these flat files posed several challenges, particularly when dealing with repeating groups, which are essentially sets of related fields that repeat multiple times within a record. Here are some of the key problems associated with repeating groups in flat files:

1. Data Redundancy

  • Description: Repeating groups can lead to significant redundancy, as the same data might need to be repeated across multiple records.
  • Example: If an employee can have multiple skills, a flat file might need to repeat the employee’s name, ID, and other details for each skill.
  • Problem: This not only increases the file size but also makes data entry, updates, and deletions more prone to errors.

Eg: Suppose you are maintaining a flat file to track employees and their skills. Each employee can have multiple skills, which you store as repeating groups in the file.

EmployeeID, EmployeeName, Skill1, Skill2, Skill3, Skill4
1, John Doe, Python, SQL, Java, 
2, Jane Smith, Excel, PowerPoint, Python, SQL

If an employee has four skills, you need to add four columns (Skill1, Skill2, Skill3, Skill4). If an employee has more than four skills, you must either add more columns or create a new row with repeated employee details.

2. Data Inconsistency

  • Description: Repeating groups can lead to inconsistencies when data is updated.
  • Example: If an employee’s name changes, and it’s stored multiple times in different rows because of repeating skills, it’s easy for some instances to be updated while others are not.
  • Problem: This can lead to situations where the same employee is listed under different names or IDs in the same file.

Eg: Suppose you are maintaining a flat file to track employees and their skills. Each employee can have multiple skills, which you store as repeating groups in the file.

EmployeeID, EmployeeName, Skill1, Skill2, Skill3, Skill4
1, John Doe, Python, SQL, Java, 
2, Jane Smith, Excel, PowerPoint, Python, SQL

If John’s name changes to “John A. Doe,” you must manually update each occurrence of “John Doe” across all rows, which increases the chance of inconsistencies.
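
To see how easily this happens, here is a minimal Python sketch, assuming the multi-row variant of the file where each skill gets its own row (the file employees.csv is hypothetical):

import csv

# Assumed multi-row layout in a hypothetical employees.csv:
#
#   EmployeeID,EmployeeName,Skill
#   1,John Doe,Python
#   1,John Doe,SQL
#   1,John Doe,Java

with open("employees.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# A careless "update the first match" rename fixes only one of John's rows...
for row in rows:
    if row["EmployeeName"] == "John Doe":
        row["EmployeeName"] = "John A. Doe"
        break  # ...and stops, leaving the remaining rows stale.

names = {r["EmployeeName"] for r in rows if r["EmployeeID"] == "1"}
print(names)  # {'John A. Doe', 'John Doe'} -- one employee, two names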

3. Difficulty in Querying

  • Description: Querying data in flat files with repeating groups can be cumbersome and inefficient.
  • Example: Extracting a list of unique employees with their respective skills requires complex scripting or manual processing.
  • Problem: Unlike relational databases, which use joins to simplify such queries, flat files require custom logic to manage and extract data, leading to slower processing and more potential for errors.

Eg: Suppose you are maintaining a flat file to track employees and their skills. Each employee can have multiple skills, which you store as repeating groups in the file.

EmployeeID, EmployeeName, Skill1, Skill2, Skill3, Skill4
1, John Doe, Python, SQL, Java, 
2, Jane Smith, Excel, PowerPoint, Python, SQL

Extracting a list of all employees proficient in “Python” requires you to search across multiple skill columns (Skill1, Skill2, etc.), which is cumbersome compared to a relational database where you can use a simple JOIN on a normalized EmployeeSkills table.
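
A rough Python sketch of the custom logic this forces on you, assuming the Skill1..Skill4 layout above (the file name is illustrative):

import csv

# Scan all four skill columns by hand to find everyone who knows Python.
# skipinitialspace handles the blanks after the commas in the sample data.
with open("employees.csv", newline="") as f:
    for row in csv.DictReader(f, skipinitialspace=True):
        skills = {row[c] for c in ("Skill1", "Skill2", "Skill3", "Skill4") if row[c]}
        if "Python" in skills:
            print(row["EmployeeID"], row["EmployeeName"])

# In a normalized schema this collapses to one query, something like:
#   SELECT e.EmployeeID, e.EmployeeName
#   FROM Employees e JOIN EmployeeSkills s ON s.EmployeeID = e.EmployeeID
#   WHERE s.Skill = 'Python';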

4. Limited Scalability

  • Description: Flat files do not scale well when the number of repeating groups or the size of the data grows.
  • Example: A file with multiple repeating fields can become extremely large and difficult to manage as the number of records increases.
  • Problem: This can lead to performance issues, such as slow read/write operations and difficulty in maintaining the file over time.

Eg: You are storing customer orders in a flat file where each customer can place multiple orders.

CustomerID, CustomerName, Order1ID, Order1Date, Order2ID, Order2Date, Order3ID, Order3Date
1001, Alice Brown, 5001, 2023-08-01, 5002, 2023-08-15, 
1002, Bob White, 5003, 2023-08-05, 

If Alice places more than three orders, you’ll need to add more columns (Order4ID, Order4Date, etc.), leading to an unwieldy file with many empty cells for customers with fewer orders.

5. Challenges in Data Integrity

  • Description: Ensuring data integrity in flat files with repeating groups is difficult.
  • Example: Enforcing rules like “an employee can only have unique skills” is nearly impossible in a flat file format.
  • Problem: This can result in duplicated or invalid data, which is hard to detect and correct without a database system.

Eg: You are storing customer orders in a flat file where each customer can place multiple orders.

CustomerID, CustomerName, Order1ID, Order1Date, Order2ID, Order2Date, Order3ID, Order3Date
1001, Alice Brown, 5001, 2023-08-01, 5002, 2023-08-15, 
1002, Bob White, 5003, 2023-08-05,

There’s no easy way to enforce that each order ID is unique and corresponds to the correct customer, which could lead to errors or duplicated orders.
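
A small Python sketch of the uniqueness check you would have to hand-roll, assuming the order layout above (the file name is illustrative):

import csv
from collections import Counter

# Collect every OrderID from the repeating Order*ID columns, then look
# for duplicates -- a check a database would enforce declaratively with
# a PRIMARY KEY or UNIQUE constraint.
order_ids = []
with open("orders.csv", newline="") as f:
    for row in csv.DictReader(f, skipinitialspace=True):
        for col in ("Order1ID", "Order2ID", "Order3ID"):
            if row.get(col):
                order_ids.append(row[col])

duplicates = [oid for oid, n in Counter(order_ids).items() if n > 1]
print("duplicate order ids:", duplicates)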

6. Complex File Formats

  • Description: Managing and processing flat files with repeating groups often requires complex file formats.
  • Example: Custom delimiters or nested formats might be needed to handle repeating groups, making the file harder to understand and work with.
  • Problem: This increases the likelihood of errors during data entry, processing, or when the file is read by different systems.

Eg: You are storing customer orders in a flat file where each customer can place multiple orders.

CustomerID, CustomerName, Order1ID, Order1Date, Order2ID, Order2Date, Order3ID, Order3Date
1001, Alice Brown, 5001, 2023-08-01, 5002, 2023-08-15, 
1002, Bob White, 5003, 2023-08-05, 

As the number of orders grows, the file format becomes increasingly complex, requiring custom scripts to manage and extract order data for each customer.

7. Lack of Referential Integrity

  • Description: Flat files lack mechanisms to enforce referential integrity between related groups of data.
  • Example: Ensuring that a skill listed in one file corresponds to a valid skill ID in another file requires manual checks or complex logic.
  • Problem: This can lead to orphaned records or mismatches between related data sets.

Eg: A fleet management company tracks maintenance records for each vehicle in a flat file. Each vehicle can have multiple maintenance records.

VehicleID, VehicleType, Maintenance1Date, Maintenance1Type, Maintenance2Date, Maintenance2Type
V001, Truck, 2023-01-15, Oil Change, 2023-03-10, Tire Rotation
V002, Van, 2023-02-20, Brake Inspection, , 

There’s no way to ensure that the Maintenance1Type and Maintenance2Type fields are valid maintenance types or that the dates are in correct chronological order.

8. Difficulty in Data Modification

  • Description: Modifying data in flat files with repeating groups can be complex and error-prone.
  • Example: Adding or removing an item from a repeating group might require extensive manual edits across multiple records.
  • Problem: This increases the risk of errors and makes data management time-consuming.

Eg: A university maintains a flat file to record student enrollments in courses. Each student can enroll in multiple courses.

StudentID, StudentName, Course1, Course2, Course3, Course4, Course5
2001, Charlie Green, Math101, Physics102, , , 
2002, Dana Blue, History101, Math101, Chemistry101, , 

If a student drops a course or switches to a different one, manually editing the file can easily lead to errors, especially as the number of students and courses increases.
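
A short Python sketch of what dropping a single course costs with this layout (the file name is illustrative): the repeating columns have to be compacted and the entire file rewritten.

import csv

course_cols = ["Course1", "Course2", "Course3", "Course4", "Course5"]

with open("enrollments.csv", newline="") as f:
    reader = csv.DictReader(f, skipinitialspace=True)
    rows = list(reader)
    fieldnames = reader.fieldnames

# Student 2001 drops Physics102: remove it, shift the remaining courses
# left, and re-pad the row with blanks.
for row in rows:
    if row["StudentID"] == "2001":
        courses = [row[c] for c in course_cols if row[c]]
        courses.remove("Physics102")
        for c, value in zip(course_cols, courses + [""] * len(course_cols)):
            row[c] = value

# The whole file must be rewritten for this one change.
with open("enrollments.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# In a normalized enrollments table the same change is a one-row DELETE.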

After listing down all of these problems, Saravanan started looking into solutions. His search goes on…

What are Relational Databases and PostgreSQL?

21 August 2024 at 01:56

In the city of Data, the citizens relied heavily on organizing their information. The city was home to many different types of data: numbers, names, addresses, and even some exotic types like images and documents. But as the city grew, so did the complexity of managing all this information.

One day, the city’s leaders called a meeting to discuss how best to handle the growing data. They were split between two different systems:

  1. the old and trusted Relational Database Management System (RDBMS)
  2. the new, flashy NoSQL databases.

Enter Relational Databases:

Relational databases were like the city’s libraries. They had rows of neatly organized shelves (tables) where every book (data entry) was placed according to a specific category (columns).

Each book had a unique ID (primary key) so that anyone could find it quickly. These libraries had been around for decades, and everyone knew how to use them.

The RDBMS was more than just a library. It enforced rules (constraints) to ensure that no book went missing, was duplicated, or misplaced. It even allowed librarians (queries) to connect different books using relationships (joins).

If you wanted to find all the books by a particular author that were published in the last five years, the RDBMS could do it in a heartbeat.

The Benefits of RDBMS:

The citizens loved the RDBMS because it was:

  1. Organized: Everything was in its place, and data was easy to find.
  2. Reliable: The rules ensured data integrity, so they didn’t have to worry about inconsistencies.
  3. Powerful: It could handle complex queries, making it easy to get insights from their data.
  4. Secure: Access to the data could be controlled, keeping it safe from unauthorized users.

The Rise of NoSQL:

But then came the NoSQL databases, which were more like vast, sprawling warehouses. These warehouses didn’t care much about organization; they just stored everything in a big open space. You could toss in anything, and it would accept it; no need for strict categories or relationships. This flexibility appealed to the tech-savvy citizens who wanted to store newer, more diverse types of data like social media posts, images, and videos.

NoSQL warehouses were fast. They could handle enormous amounts of data without breaking a sweat and were perfect for real-time applications like chat systems and analytics.

The PostgreSQL Advantage:

PostgreSQL was a superstar in the world of RDBMS. It combined the organization and reliability of traditional relational databases with some of the flexibility of NoSQL. It allowed citizens to store structured data in tables while also offering support for unstructured data types like JSON. This made PostgreSQL a versatile choice, bridging the gap between the old and new worlds.
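
For instance, here is a minimal sketch of that versatility (the connection string and table are hypothetical): a plain relational column sitting next to a JSONB document in the same PostgreSQL table, queried with a JSONB operator.

import psycopg2
from psycopg2.extras import Json

# Structured columns and a free-form JSONB document side by side.
conn = psycopg2.connect("dbname=city_of_data user=postgres")  # assumed details
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS citizen_posts (
            id      serial PRIMARY KEY,   -- classic relational structure
            author  text NOT NULL,
            payload jsonb                 -- NoSQL-style document data
        )
    """)
    cur.execute(
        "INSERT INTO citizen_posts (author, payload) VALUES (%s, %s)",
        ("alice", Json({"type": "photo", "tags": ["library", "books"]})),
    )
    # JSONB can be queried with operators like @> (contains):
    cur.execute(
        "SELECT author FROM citizen_posts WHERE payload @> %s",
        (Json({"type": "photo"}),),
    )
    print(cur.fetchall())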

For installing Postgres: https://www.postgresql.org/download/

The Dilemma: PostgreSQL vs. NoSQL:

The city faced a dilemma. Should they stick with PostgreSQL, which offered the best of both worlds, or fully embrace NoSQL for its speed and flexibility? The answer wasn’t simple. It depended on what the city valued more: the structured, reliable nature of PostgreSQL or the unstructured, flexible approach of NoSQL.

For applications that required strict data integrity and complex queries, PostgreSQL was the way to go. But for projects that needed to handle massive amounts of unstructured data quickly, NoSQL was the better choice.

Conclusion:

In the end, the city of Data realized that there was no one-size-fits-all solution. They decided to use PostgreSQL for applications where data relationships and integrity were crucial, and NoSQL for those that required speed and flexibility with diverse data types.

And so, the citizens of Data lived happily, managing their information with the right tools for the right tasks, knowing that both systems had their place in the ever-growing city.

Ingesting large files to postgres through S3

By: ashish
13 April 2022 at 20:16

One of the tasks I recently came across at my job was to ingest large files, but with the following requirements:

  1. Do some processing ( like generate hash for each row )
  2. Insert it into S3 for audit purposes
  3. Insert into postgres

Note:

Keep in mind that your Postgres database needs to support this, and an S3 bucket policy needs to exist in order to allow the data to be copied over.

The setup I am using is an RDS database with S3 in the same region, and proper policies and IAM roles already created.

Read more on that here – AWS documentation

For the purpose of this post I will be using dummy data from – eforexcel (1 million records)

The most straightforward way to do this would be to just do a df.to_sql like this

import pandas as pd

# connection_detail is assumed to be a SQLAlchemy engine or connection
df = pd.read_csv("records.csv")
df.to_sql(
    name="test_table",
    con=connection_detail,
    schema="schema",
    if_exists="replace",
)

Something like this would take more than an hour! Let’s do it in less than 5 minutes.

Now of course there are several ways to make this faster (using copy_expert, the psycopg driver, etc.; maybe a separate blog post on those), but that’s not the use case I have been tasked with. Since we need to upload the file to S3 in the end for audit purposes, I will ingest the data from S3 into the DB.

Generate table metadata

Before we can assign an S3 operator to ingest the data, we need to create the table into which this data will be inserted. There are two ways that I can think of:

  • Each column in the file will be created in the DB with a high threshold value, like varchar(2000)
  2. Each column is created with the data length as max length in each row

I will be going with option 2 here.

This entire process took around 210 seconds instead of more than an hour like the last run.

Let’s go over the code one by one

Read the csv

  1. We can pass the data directly to pandas, or stream it into buffered memory, something like this:

import csv
import gzip
import io

import boto3  # assumed: boto3 is configured with credentials for the bucket

s3 = boto3.client("s3")
mem_file = io.BytesIO()  # in-memory buffer that will hold the gzipped CSV

with open("records.csv") as f:
    csv_rdr = csv.reader(f, delimiter=",")
    header = next(csv_rdr)
    # Re-write the CSV into a string buffer, then gzip it into mem_file.
    with gzip.GzipFile(fileobj=mem_file, mode="wb", compresslevel=6) as gz:
        buff = io.StringIO()
        writer = csv.writer(buff)
        writer.writerow(header)
        for row in csv_rdr:
            writer.writerow(row)
        gz.write(buff.getvalue().encode("utf-8", "replace"))
    mem_file.seek(0)
    s3.put_object(Bucket="mybucket", Key="folder/file.gz", Body=mem_file)

2. Since the file is less than 50 MB, I’ll go ahead and load it directly.

Create the table

Get the max length of each column and use that to generate the table. We use the pandas to_sql() function for this and pass in the dtypes.
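
A sketch of what that step can look like (assuming a SQLAlchemy engine; the connection details, table, and schema names are illustrative):

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import VARCHAR

engine = create_engine("postgresql://user:pass@host:5432/mydb")  # assumed details

df = pd.read_csv("records.csv")

# Option 2: size each column to the longest value actually present in it.
dtypes = {
    col: VARCHAR(int(df[col].astype(str).str.len().max()))
    for col in df.columns
}

# Create an empty table with those widths; the rows themselves will be
# copied in from S3 in the next step.
df.head(0).to_sql(
    name="test_table",
    con=engine,
    schema="schema",
    if_exists="replace",
    index=False,
    dtype=dtypes,
)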

Copy data from s3 gzipped file to postgres

Finally, we use aws_s3.table_import_from_s3 to copy the file over to the Postgres table.
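
Roughly, that call can be sketched like this (names match the dummy setup above; the aws_s3 extension must be available on the RDS instance, and, as far as I understand, a gzipped object needs Content-Encoding: gzip metadata on S3 to be decompressed during the import):

import psycopg2

# The extension is created once on the instance, e.g.:
#   CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;
conn = psycopg2.connect("host=mydb.example.com dbname=mydb user=admin")  # assumed
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT aws_s3.table_import_from_s3(
            'schema.test_table',            -- target table
            '',                             -- empty list = all columns
            '(format csv, header true)',    -- COPY options
            aws_commons.create_s3_uri('mybucket', 'folder/file.gz', 'us-east-1')
        )
    """)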

Granting Access to Read-Only Users and Refreshing Permissions Automatically: A Function-Based Solution

By: angu10
14 March 2023 at 01:52

Problem Statement

We have a PostgreSQL database with multiple schemas and tables. Some users have read-only access to the database, and they rely on the DevOps/Support team to refresh their access to view any new schemas or tables added to the database. We need to provide a solution that allows read-only users to refresh their own access, so they can view new schemas and tables as they are added.

Named Read-only User Group

Function 1: This will create a user and, if one is not available, a read_only group. If the group is available, it will create the user and password, attach the user to the read_only group, and grant read-only access to all existing schemas.

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE OR REPLACE FUNCTION create_users_and_grant_access(users text[]) RETURNS void AS $$
DECLARE
    READONLY_GROUP text := 'readonly';
    password text;
    user_name text;
    schemata text;
BEGIN
    FOREACH user_name IN ARRAY users LOOP
        -- Check if the user already exists
        PERFORM 1 FROM pg_user WHERE usename = user_name;
        IF NOT FOUND THEN
            -- Generate a random password for the new user
            password := encode(gen_random_bytes(12), 'base64');


            -- Create the database user with the hashed password
            RAISE NOTICE 'Creating database user: %', user_name;
            RAISE NOTICE 'Password: %', password;
            EXECUTE format('CREATE USER %I WITH PASSWORD %L', user_name, password);

            -- Create the read-only group if it does not exist
            PERFORM 1 FROM pg_roles WHERE rolname = READONLY_GROUP;
            IF NOT FOUND THEN
                RAISE NOTICE 'Creating read-only group: %', READONLY_GROUP;
                EXECUTE format('CREATE ROLE %I', READONLY_GROUP);
            END IF;

            -- Add the user to the read-only group
            RAISE NOTICE 'Adding user to read-only group: %', READONLY_GROUP;
            EXECUTE format('GRANT %I TO %I', READONLY_GROUP, user_name);
        ELSE
            RAISE NOTICE 'User already exists: %', user_name;
        END IF;
    END LOOP;

    -- Grant read-only access to all schemas for the read-only group
    FOR schemata IN SELECT schema_name FROM information_schema.schemata WHERE schema_name NOT LIKE 'pg_%' AND schema_name != 'information_schema' LOOP
        -- Check if the read-only group already has access to the schema
        PERFORM 1 FROM information_schema.role_table_grants WHERE grantee = READONLY_GROUP AND table_schema = schemata;
        IF NOT FOUND THEN
            -- Grant read-only access to the schema for the read-only group
            RAISE NOTICE 'Granting read-only access to schema: %', schemata;
            EXECUTE format('GRANT USAGE ON SCHEMA %I TO %I', schemata, READONLY_GROUP);
            EXECUTE format('GRANT SELECT ON ALL TABLES IN SCHEMA %I TO %I', schemata, READONLY_GROUP);
            EXECUTE format('GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA %I TO %I', schemata, READONLY_GROUP);
        ELSE
            RAISE NOTICE 'Read-only access already granted to schema: %', schemata;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

Function 2:

This function will enable users to refresh read_only permissions themselves, so they don’t have to rely on DevOps.

CREATE OR REPLACE FUNCTION grant_readonly_access(target_schema text DEFAULT NULL)
RETURNS void
SECURITY DEFINER
AS $$
DECLARE
  READONLY_GROUP text := 'readonly';
BEGIN
  IF target_schema IS NOT NULL THEN
    -- Grant read-only access to the specified schema for the read-only group
    PERFORM 1 FROM information_schema.schemata WHERE schema_name = target_schema;
    IF FOUND THEN
      RAISE NOTICE 'Granting read-only access to schema: % for group: %', target_schema, READONLY_GROUP;
      EXECUTE format('GRANT USAGE ON SCHEMA %I TO %I', target_schema, READONLY_GROUP);
      EXECUTE format('GRANT SELECT ON ALL TABLES IN SCHEMA %I TO %I', target_schema, READONLY_GROUP);
      EXECUTE format('GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA %I TO %I', target_schema, READONLY_GROUP);
    ELSE
      RAISE EXCEPTION 'Schema not found: %', target_schema;
    END IF;
  ELSE
    -- Grant read-only access to all schemas for the read-only group
    FOR target_schema IN SELECT schema_name FROM information_schema.schemata WHERE schema_name NOT LIKE 'pg_%' AND schema_name != 'information_schema' LOOP
      -- Check if the read-only group already has access to the schema
      PERFORM 1 FROM information_schema.role_table_grants WHERE grantee = READONLY_GROUP AND table_schema = target_schema;
      IF NOT FOUND THEN
        -- Grant read-only access to the schema for the read-only group
        RAISE NOTICE 'Granting read-only access to schema: % for group: %', target_schema, READONLY_GROUP;
        EXECUTE format('GRANT USAGE ON SCHEMA %I TO %I', target_schema, READONLY_GROUP);
        EXECUTE format('GRANT SELECT ON ALL TABLES IN SCHEMA %I TO %I', target_schema, READONLY_GROUP);
        EXECUTE format('GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA %I TO %I', target_schema, READONLY_GROUP);
      ELSE
        RAISE NOTICE 'Read-only access already granted to schema: %', target_schema;
      END IF;
    END LOOP;
  END IF;
END;
$$ LANGUAGE plpgsql;
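
To wrap up, here is a sketch of how the two functions might be invoked (user and schema names are illustrative; because grant_readonly_access is SECURITY DEFINER, read-only users can run it themselves):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # assumed connection
with conn, conn.cursor() as cur:
    # DevOps: create two read-only users (and the group, if missing) in one go.
    cur.execute("SELECT create_users_and_grant_access(ARRAY['alice', 'bob'])")

    # Later, a read-only user refreshes their own access: one schema by name,
    # or every schema when called with no argument.
    cur.execute("SELECT grant_readonly_access('reporting')")
    cur.execute("SELECT grant_readonly_access()")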