Learning Notes #20 – Partitioning (data) With Postgres

31 December 2024 at 06:55

Early this morning, I watched a video on partitioning and sharding. In that video, Arpit explained the limitations of vertical scaling and the ways to scale a database almost without limit using sharding and partitioning. In this blog, I jot down notes on partitioning, with a single-node Postgres implementation, for my future self.

As the volume of data grows, managing databases efficiently becomes critical. Once we understand that vertical scaling has its limits, two common strategies for handling large datasets are partitioning and sharding. While they may sound similar, these techniques serve different purposes and are implemented differently. Let's explore these concepts in detail.

What is Partitioning?

Partitioning involves dividing a large dataset into smaller, manageable segments, known as partitions. Each partition is stored separately but remains part of a single database instance. Partitioning is typically used to improve query performance and manageability.

Types of Partitioning

1. Range Partitioning

  • Data is divided based on ranges of a column’s values.
  • Example: A table storing customer orders might partition data by order date: January orders in one partition, February orders in another.

PostgreSQL Example

CREATE TABLE orders (
    id SERIAL,
    customer_id INT,
    order_date DATE NOT NULL,
    PRIMARY KEY (id, order_date) -- Include the partition key
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_jan PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE orders_feb PARTITION OF orders
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

2. Hash Partitioning

  • A hash function determines the partition where a record will be stored.
  • Example: Orders can be distributed across partitions based on the hash of the customer ID.

Postgres Example

CREATE TABLE orders (
    id SERIAL,
    customer_id INT,
    order_date DATE NOT NULL,
    PRIMARY KEY (id, customer_id) -- Include the partition key
) PARTITION BY HASH (customer_id);

CREATE TABLE orders_part_1 PARTITION OF orders
    FOR VALUES WITH (MODULUS 2, REMAINDER 0);

CREATE TABLE orders_part_2 PARTITION OF orders
    FOR VALUES WITH (MODULUS 2, REMAINDER 1);

3. List Partitioning

  • Data is divided based on a predefined list of values.
  • Example: A table storing sales data could be partitioned by region: North, South, East, and West.

Postgres Example

CREATE TABLE sales (
    id SERIAL,
    region TEXT NOT NULL,
    amount NUMERIC,
    PRIMARY KEY (id, region)
) PARTITION BY LIST (region);

CREATE TABLE sales_north PARTITION OF sales
    FOR VALUES IN ('North');

CREATE TABLE sales_south PARTITION OF sales
    FOR VALUES IN ('South');

4. Composite Partitioning

  • Combines two or more partitioning strategies, such as range and list partitioning.
  • Example: A table partitioned by range on order date and sub-partitioned by list on region.

Postgres Example

CREATE TABLE orders (
    id SERIAL,
    customer_id INT,
    order_date DATE NOT NULL,
    region TEXT NOT NULL,
    PRIMARY KEY (id, order_date, region)
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01')
    PARTITION BY LIST (region);

CREATE TABLE orders_2024_north PARTITION OF orders_2024
    FOR VALUES IN ('North');

CREATE TABLE orders_2024_south PARTITION OF orders_2024
    FOR VALUES IN ('South');

Collecting content for LLM dataset – Part 2 – FreeTamilEbooks

16 June 2024 at 02:35

At FreeTamilEbooks.com we have published 850 ebooks, all under a shareable Creative Commons license. Many people have asked, many times, for the text-only content of all these books. As it is a big task, it took a long time. Thanks to Lenin and Anwar of Kaniyam Foundation, all the contributors, and all the writers and readers for making this project alive and a great success.

We publish the books in EPUB format, along with PDF. An EPUB is just a zip file of HTML files, so we can extract all of its content as Unicode text. Pandoc is a wonderful open source tool that can convert an EPUB to a plain text file.

Here is the list of actions we have to do:

  1. Get the URLs of all the 850+ epub files.
  2. Download them all.
  3. Convert them to text files using pandoc.

So far, we don't have a metadata file for all the published books. Getting the links of all the epub files needs some programming. As Python is a Swiss Army knife for automating anything, I started exploring the WordPress REST API with Python to get the content of all the book pages.

https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/get_Data.py

I wrote the code linked above to get all the books' info.

This gave a JSON file with the book name, author, genre, and links to the epub, mobi, A4 PDF and 6-inch PDF files.
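
The linked get_Data.py does the real work; as a rough illustration of the idea (assuming the books are published as ordinary WordPress posts on freetamilebooks.com), paging through the WordPress REST API looks roughly like this:

import json
import requests

# Sketch only: page through the WordPress REST API and save the raw post data.
API = "https://freetamilebooks.com/wp-json/wp/v2/posts"
books, page = [], 1

while True:
    resp = requests.get(API, params={"per_page": 100, "page": page}, timeout=30)
    if resp.status_code != 200:
        break  # WordPress returns an error once we go past the last page
    posts = resp.json()
    if not posts:
        break
    for post in posts:
        books.append({
            "title": post["title"]["rendered"],
            "link": post["link"],
            # The epub/PDF links sit inside the rendered HTML content and
            # still have to be parsed out, as the real script does.
            "content": post["content"]["rendered"],
        })
    page += 1

with open("books_raw.json", "w", encoding="utf-8") as f:
    json.dump(books, f, ensure_ascii=False)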

I converted this to a CSV file with the code below. https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/parse.py

I had to fix a few things manually in the CSV file.

This is the final CSV file. https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/fte_metadata.csv

The code below downloads all the epub files from their links in the fte_metadata.csv file and uses pandoc to convert them to text.

https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/get_fte_books.py
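
get_fte_books.py is the real script; a minimal sketch of the same idea (assuming an epub column in fte_metadata.csv and pandoc available on the PATH) would be:

import csv
import os
import subprocess
import requests

# Sketch only: download each epub listed in the CSV, then convert it to plain text.
with open("fte_metadata.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

os.makedirs("epubs", exist_ok=True)
os.makedirs("txt", exist_ok=True)

for row in rows:
    url = (row.get("epub") or "").strip()
    if not url:
        continue
    name = url.rsplit("/", 1)[-1]
    epub_path = os.path.join("epubs", name)
    txt_path = os.path.join("txt", name.replace(".epub", ".txt"))

    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    with open(epub_path, "wb") as out:
        out.write(resp.content)

    # pandoc converts the epub (a zip of HTML files) into plain Unicode text.
    subprocess.run(["pandoc", epub_path, "-t", "plain", "-o", txt_path], check=True)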

I got 845 txt files. The total size is 374 MB.

Compressed with 7z, it comes to a 47 MB file.

I published the data here: https://kaniyam.cloudns.nz/tamil_datasets/

Download and share the text data for free. Don't sell it, as most of the books are released under the CC-BY-NC (non-commercial) license.

Use this data to build awesome open source applications and research: spellcheckers, grammar checkers, LLMs, RAG, and whatnot.

Data is always the oil. Let us grow the open data oil.

Please share all your text, audio and video content under a shareable license like Creative Commons. It will be used to build a better future.

Collecting content for LLM dataset – Part 3 – Thamizh_Mann books, project madurai, WikiSource

23 November 2024 at 00:34

We are collecting openly licensed datasets in the Tamil language to build LLMs and other interesting applications in the coming days.

The ML models we build may have a very short lifespan, but the open data will be there forever, or at least for much longer than our lifetimes.

Check out parts 1 and 2 of this effort here:

part 1 – https://goinggnu.wordpress.com/2024/06/11/collecting-content-for-llm-dataset-part-1-tamil-wikipedia-content/

part 2 – https://goinggnu.wordpress.com/2024/06/16/collecting-content-for-llm-dataset-part-2-freetamilebooks/

Here goes part 3.

Thamizh_mann publishers have been publishing public domain and nationalized Tamil books for many years. A few years ago, through a collaboration between the Library at the University of Toronto Scarborough, Canada, and Thamizh_mann publishers, the Kaniyam Foundation team helped release all 1000+ Tamil books in PDF and Docx formats for free online.

You can download them all here: https://tamil.digital.utsc.utoronto.ca/61220/utsc35335

Thanks to the UTSC and Thamizh_mann teams for this great gift to the Tamil diaspora.

Now we have 1000+ books in Unicode Docx format. The next step is to convert them all to plain text and use them. Natkeeran and Parathan helped with this.

Along with this, they helped scrape Project Madurai books and Tamil WikiSource books. They published everything in a git repo here – https://github.com/KaniyamFoundation/open_tamil_texts – along with the scripts and metadata.
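
The real scripts live in the repo above; as a rough sketch of the Docx-to-text idea (assuming pandoc is installed and the Docx files sit in one local folder), it is essentially:

import pathlib
import subprocess

# Sketch only: convert every .docx in a folder to plain text with pandoc.
src = pathlib.Path("thamizh_mann_docx")
dst = pathlib.Path("thamizh_mann_txt")
dst.mkdir(exist_ok=True)

for docx in src.glob("*.docx"):
    txt = dst / (docx.stem + ".txt")
    subprocess.run(["pandoc", str(docx), "-t", "plain", "-o", str(txt)], check=True)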

I am adding those texts to our openly licensed Tamil data collection.

Download them all here: https://kaniyam.cloudns.nz/tamil_datasets/

Here is the current size, in text and compressed formats:

shrini@dell-optiplex-9100 v/w/h/tamil_datasets> du -h compressed
258M compressed/

shrini@dell-optiplex-9100 v/w/h/tamil_datasets> du -h text-files
355M text-files/project_madurai/data/text
355M text-files/project_madurai/data
355M text-files/project_madurai
110M text-files/tamil_wikisource/data
110M text-files/tamil_wikisource
374M text-files/FreeTamilEbooks-txt
714M text-files/thamizh_mann/data
716M text-files/thamizh_mann
1.6G text-files/

We have 1.6 GB of text data to work with for LLMs or other projects.

Go ahead, use it, and build more models and tools with this data.

This may not be enough to get any good output. But if we can produce something from it, even if the results are not great, we can then ask people to release their recent content, blogs and social media posts under a Creative Commons license.

A few bloggers and magazines have already released their content under a CC license. Now we need your help to scrape them. If you know any programming language and can help with this project, please do the web scraping for the websites mentioned here, and share the data and code.

https://github.com/KaniyamFoundation/ProjectIdeas/issues/198

Thanks to all the content providers and contributors.

Task: Moving MP3 Files Based on Metadata Date

By: Sakthivel
6 August 2024 at 15:32
import os
import shutil
from datetime import datetime

def list_files_in_folder(folder_path):
    return os.listdir(folder_path)

def get_file_format():
    return input("Enter the file format (e.g., .mp3, .jpg): ")

def get_creation_date(file_path):
    # Note: on Linux, getctime() returns the inode change time, not the true creation time.
    return datetime.fromtimestamp(os.path.getctime(file_path))

def get_user_date():
    date_str = input("Enter the date (YYYY-MM-DD): ")
    return datetime.strptime(date_str, '%Y-%m-%d')

def move_files_based_on_date(folder_path, file_format, user_date, destination_folder):
    if not os.path.exists(destination_folder):
        os.makedirs(destination_folder)
    
    for file_name in list_files_in_folder(folder_path):
        if file_name.endswith(file_format):
            file_path = os.path.join(folder_path, file_name)
            creation_date = get_creation_date(file_path)
            if creation_date.date() == user_date.date():
                shutil.move(file_path, os.path.join(destination_folder, file_name))
                print(f"Moved: {file_name}")

def main():
    folder_path = "/home/sakthivel/Documents/Practice/task"
    destination_folder = "/home/sakthivel/Documents/Practice/mp3"
    
    if not os.path.exists(folder_path):
        print("Folder does not exist.")
        return
    
    file_format = get_file_format()
    user_date = get_user_date()
    
    move_files_based_on_date(folder_path, file_format, user_date, destination_folder)

if __name__ == "__main__":
    main()

Detailed Explanation:

This Python script automates the task of moving files from one directory to another based on their creation date. The script follows these main steps:

  1. List Files in a Folder:
    • Function: list_files_in_folder(folder_path)
    • Description: This function takes a folder path as an argument and returns a list of all files in that folder.
  2. Get File Format from User:
    • Function: get_file_format()
    • Description: This function prompts the user to enter a file format (e.g., .mp3, .jpg). The entered format is returned as a string.
  3. Get Creation Date of a File:
    • Function: get_creation_date(file_path)
    • Description: This function takes the file path as an argument and returns the file's creation date (on Linux, the inode change time reported by os.path.getctime) as a datetime object.
  4. Get Date from User:
    • Function: get_user_date()
    • Description: This function prompts the user to enter a date in the format YYYY-MM-DD. The entered date is converted to a datetime object and returned.
  5. Move Files Based on Date:
    • Function: move_files_based_on_date(folder_path, file_format, user_date, destination_folder)
    • Description: This function moves files from the source folder to the destination folder based on the specified file format and user-provided date.
      • It first checks if the destination folder exists; if not, it creates it.
      • It then iterates over the files in the source folder, checking if each file matches the specified format and creation date.
      • If a match is found, the file is moved to the destination folder, and a message is printed indicating the file has been moved.
  6. Main Function:
    • Function: main()
    • Description: This is the entry point of the script. It sets the paths for the source and destination folders and performs the following steps:
      • Verifies the existence of the source folder.
      • Retrieves the file format and date from the user.
      • Calls the function to move files based on the provided criteria.
  7. Script Execution:
    • The script is executed by calling the main() function when the script is run directly.

Enhancements for Future Consideration:

  • User Input Validation: Ensure the file format and date inputs are valid.
  • Error Handling: Implement error handling for file operations and user inputs.
  • Logging: Add logging to keep track of the operations performed and any errors encountered.
  • Flexible Date Comparison: Allow for more flexible date comparisons, such as moving files created on or after a specified date.

By following these steps, the script efficiently organizes files based on their creation dates, making it a useful tool for managing large collections of files.
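
As a rough sketch of the input-validation and error-handling enhancements listed above (not part of the original script), the two input helpers could be hardened roughly like this:

from datetime import datetime

def get_file_format():
    # Keep asking until the user enters a plausible extension such as ".mp3".
    while True:
        file_format = input("Enter the file format (e.g., .mp3, .jpg): ").strip().lower()
        if file_format.startswith(".") and len(file_format) > 1:
            return file_format
        print("Please enter an extension starting with a dot, e.g. .mp3")

def get_user_date():
    # Keep asking until the user enters a valid YYYY-MM-DD date.
    while True:
        date_str = input("Enter the date (YYYY-MM-DD): ").strip()
        try:
            return datetime.strptime(date_str, "%Y-%m-%d")
        except ValueError:
            print("Invalid date, expected format YYYY-MM-DD.")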

Using Google Sheets as a makeshift Database [Deprecated]

By: ashish
9 March 2020 at 19:51

Do you need a quick solution without going into the hassle of setting up a database? If your answer was yes, then you've come to the right place. This post will show you how you can use Google Sheets as your database.

For the purposes of this blog post I will be using this Google Sheet.

As you can see, we will be collecting the following data from the user – Name, Email and Age.

Create the API

  • Go to the Google Sheet you want to use.
  • Create column headers in the first row.
  • Click on Tools > Script editor.
  • Copy the following code into the editor.

  • Click on Run > Run function > setup.
  • Now publish your script to get the request URL.

Now let us test this URL in a webpage.

See the Pen "Simple register form" by Thomas Ashish Cherian (@pandawhocodes) on CodePen.

You can enter your details there to see them appear in the Google Sheet above (refresh the sheet to see the changes).
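
Outside the browser, the same endpoint could be exercised with a few lines of Python. This is only a sketch: the URL is a placeholder for your published Apps Script web app, and the field names (Name, Email, Age) assume your script reads those parameters from the request.

import requests

# Placeholder URL for your published Apps Script web app.
SCRIPT_URL = "https://script.google.com/macros/s/YOUR_DEPLOYMENT_ID/exec"

# Field names follow the sheet columns used in this post; adjust them
# to whatever your script actually expects.
payload = {"Name": "Asha", "Email": "asha@example.com", "Age": 28}

response = requests.post(SCRIPT_URL, data=payload)
print(response.status_code, response.text)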

Collecting content for LLM dataset – Part 1 – Tamil wikipedia content

11 June 2024 at 00:00

At Kaniyam Foundation, we have a dream of collecting and publishing terabytes of Tamil text data for Tamil LLMs and other research work. We are documenting the websites that provide openly licensed Tamil content, such as public domain or Creative Commons material, here: https://github.com/KaniyamFoundation/ProjectIdeas/issues/198

From there, we can get the websites, scrape them, and use and share the data.

Today, I started exploring the Tamil Wikipedia data.

All the Wikipedia content is stored as XML and SQL dump files.

Download the Wikipedia dumps for all languages from http://dumps.wikimedia.org/backup-index.html.

For Tamil Wikipedia content, from https://dumps.wikimedia.org/tawiki/ I downloaded this file:

tawiki-20240501-pages-articles-multistream.xml.bz2

It is 223.3 MB.

That page has multiple files, but look for "pages-articles" to get the main Wikipedia content.

Then, I extracted it:

bunzip2 tawiki-20240501-pages-articles-multistream.xml.bz2

It gave a 1.7 GB file, tawiki-20240501-pages-articles-multistream.xml.

It is an XML file; we have to extract the text content from it.

For that, I explored and found a good tool – https://github.com/apertium/WikiExtractor

I downloaded it and used it:

python3 WikiExtractor.py --infn tawiki-20240501-pages-articles-multistream.xml

It ran for 2 minutes and gave a 627 MB file, wiki.txt. It has all the article content as one single big plain text file.

I compressed it with 7z, as that gives better compression:

mv wiki.txt tawiki-20240501-pages-article-wiki.txt
7z a tawiki-20240501-pages-article-text.7z tawiki-20240501-pages-article-wiki.txt

It is 70 MB.

Like this, I will continue to get plain text Tamil data from various sources. We have to find where we can publish a few hundred GBs to TBs of data for free. Till then, I will share these files from my self-hosted desktop PC at home.

I published the file here – https://kaniyam.cloudns.nz/tamil_datasets/

Let me know if you are interested in joining this project.
