Learning Notes #21 β Sharding (database) with Postgres
β¦contd. Early Morning today, i watched a video on partitioning and sharding. In that video, Arpit explained the limitation of Vertical Scaling and ways to infinite scale DB with Sharding and Partitioning. In this blog, i jot down notes on partioining with single node implementation with postgres for my future self.
As the volume of data grows, managing databases efficiently becomes critical and when we understood that vertical scaling has its limits, we have two common strategies to handle large datasets are partitioning and sharding. While they may sound similar, these techniques serve different purposes and are implemented differently. Letβs explore these concepts in detail.
What is Sharding?
Sharding is a type of database architecture where data is horizontally divided across multiple database instances, called shards. Each shard is an independent database, often hosted on separate servers. Sharding is commonly used to scale out databases.
How Sharding Works
- Shard Key
- A shard key determines how data is distributed across shards.
- Example: A social media app might use user IDs as the shard key to ensure all data related to a user resides in the same shard.
- Data Distribution
- Data is split horizontally; each shard contains a subset of the entire dataset.
- Shards can be distributed geographically or across servers for better performance.
Combining Partitioning and Sharding
In some advanced architectures, partitioning and sharding are combined. Here, partitioned data is further distributed across shards. Each shard can manage its partitions independently, providing both scalability and query optimization.
PostgreSQL Example with Citus
1. Install the Citus extension (https://www.citusdata.com/download/, https://github.com/citusdata/citus?tab=readme-ov-file#getting-started)
2. Create a distributed table with partitioning
CREATE TABLE orders ( id SERIAL, customer_id INT, order_date DATE NOT NULL, PRIMARY KEY (id, order_date) ) PARTITION BY RANGE (order_date); SELECT create_distributed_table('orders', 'customer_id'); CREATE TABLE orders_jan PARTITION OF orders FOR VALUES FROM ('2024-01-01') TO ('2024-02-01'); CREATE TABLE orders_feb PARTITION OF orders FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
3. Add worker nodes to manage shards
SELECT master_add_node('worker1', 5432); SELECT master_add_node('worker2', 5432);
4. Data placement
The data for each partition (e.g., orders_jan
) is distributed across the shards (worker1
, worker2
) based on the shard key (customer_id
).
5. Insert and query data
INSERT INTO orders (customer_id, order_date) VALUES (101, '2024-01-15'); SELECT * FROM orders WHERE customer_id = 101;