Learning Notes #50 β Fixed Partition Pattern | Distributed Pattern
Today, i learnt about fixed partition, where it handles about balancing the data among servers without high movement of data. In this blog, i jot down notes on how fixed partition helps in solving the problem.
This entire blog is inspired from https://www.linkedin.com/pulse/distributed-systems-design-pattern-fixed-partitions-retail-kumar-v-c34pc/?trackingId=DMovSwEZSfCzKZEKa7yJrg%3D%3D
Problem Statement
In a distributed key-value store system, data items need to be mapped to a set of cluster nodes to ensure efficient storage and retrieval. The system must satisfy the following requirements,
- Uniform Distribution: Data should be evenly distributed across all cluster nodes to avoid overloading any single node.
- Deterministic Mapping: Given a data item, the specific node responsible for storing it should be determinable without querying all the nodes in the cluster.
A common approach to achieve these goals is to use hashing with a modulo operation. For example, if there are three nodes in the cluster, the key is hashed, and the hash value modulo the number of nodes determines the node to store the data. However, this method has a critical drawback,
Rebalancing Issue: When the cluster size changes (e.g., nodes are added or removed), the mapping for most keys changes. This requires the system to move almost all the data to new nodes, leading to significant overhead in terms of time and resources, especially when dealing with large data volumes.
Challenge: How can we design a mapping mechanism that minimizes data movement during cluster size changes while maintaining uniform distribution and deterministic mapping?
Solution
There is a concept of Fixed Partitioning,
What Is Fixed Partitioning?
This pattern organizes data into a predefined number of fixed partitions that remain constant over time. Data is assigned to these partitions using a hashing algorithm, ensuring that the mapping of data to partitions is permanent. The system separates the fixed partitioning of data from the physical servers managing these partitions, enabling seamless scaling.
Key Features of Fixed Partitioning
- Fixed Number of Partitions
- The number of partitions is determined during system initialization (e.g., 8 partitions).
- Data is assigned to these partitions based on a consistent hashing algorithm.
- Stable Data Mapping
- Each piece of data is permanently mapped to a specific partition.
- This eliminates the need for large-scale data reshuffling when scaling the system.
- Adjustable Partition-to-Server Mapping
- Partitions can be reassigned to different servers as the system scales.
- Only the physical location of the partitions changes; the fixed mapping remains intact.
- Balanced Load Distribution
- Partitions are distributed evenly across servers to balance the workload.
- Adding new servers involves reassigning partitions without moving or reorganizing data within the partitions.
Naive Example
We have a banking system with transactions stored in 8 fixed partitions, distributed based on a customerβs account ID.
CREATE TABLE transactions ( id SERIAL PRIMARY KEY, account_id INT NOT NULL, transaction_amount NUMERIC(10, 2) NOT NULL, transaction_date DATE NOT NULL ) PARTITION BY HASH (account_id);
1. Create Partition
DO $$ BEGIN FOR i IN 0..7 LOOP EXECUTE format( 'CREATE TABLE transactions_p%s PARTITION OF transactions FOR VALUES WITH (modulus 8, remainder %s);', i, i ); END LOOP; END $$;
This creates 8 partitions (transactions_p0
to transactions_p7
) based on the hash remainder of account_id
modulo 8.
2. Inserting Data
When inserting data into the transactions
table, PostgreSQL automatically places it into the correct partition based on the account_id
.
INSERT INTO transactions (account_id, transaction_amount, transaction_date) VALUES (12345, 500.00, '2025-01-01');
The hash of 12345 % 8
determines the target partition (e.g., transactions_p5
).
3. Querying Data
Querying the base table works transparently across all partitions
SELECT * FROM transactions WHERE account_id = 12345;
PostgreSQL automatically routes the query to the correct partition.
4. Scaling by Adding Servers
Initial Setup:
Suppose we have 4 servers managing the partitions,
- Server 1:
transactions_p0
,transactions_p1
- Server 2:
transactions_p2
,transactions_p3
- Server 3:
transactions_p4
,transactions_p5
- Server 4:
transactions_p6
,transactions_p7
Adding a New Server:
When a 5th server is added, we redistribute partitions,
- Server 1:
transactions_p0
- Server 2:
transactions_p1
- Server 3:
transactions_p2
,transactions_p3
- Server 4:
transactions_p4
- Server 5:
transactions_p5
,transactions_p6
,transactions_p7
Partition Migration
- During the migration,
transactions_p5
is copied from Server 3 to Server 5. - Once the migration is complete, Server 5 becomes responsible for
transactions_p5
.
Benefits:
- Minimal Data Movement β When scaling, only the partitions being reassigned are copied to new servers. Data within partitions remains stable.
- Optimized Performance β Queries are routed directly to the relevant partition, minimizing scan times.
- Scalability β Adding servers is straightforward, as it involves reassigning partitions, not reorganizing data.
What happens when a new server is added then. Donβt we need to copy the data ?
When a partition is moved to a new server (e.g., partition_b
from server_A
to server_B
), the data in the partition must be copied to the new server. However,
- The copying is limited to the partition being reassigned.
- No data within the partition is reorganized.
- Once the partition is fully migrated, the original copy is typically deleted.
For example, in PostgreSQL,
- Export the Partition
pg_dump -t partition_b -h server_A -U postgres > partition_b.sql
- Import on New Server:
psql -h server_B -U postgres -d mydb < partition_b.sql