Collecting content for LLM dataset - Part 2 - FreeTamilEbooks
At FreeTamilEbooks.com we have published 850 ebooks, all under shareable Creative Commons licenses. Many people have asked, many times, for the text-only content of all these books. As it is a big task, it took a long time. Thanks to Lenin, Anwar of Kaniyam Foundation, all the contributors, all the writers and readers for keeping this project alive and making it a great success.
We publish the books in epub format, along with PDF format. An epub is just a zip file of HTML files, so we can copy all the content from it as Unicode text. Pandoc is a wonderful open source tool that can convert an epub to a plain text file.
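To illustrate the point that an epub is a zip archive, here is a minimal Python sketch that lists the HTML files inside one and builds a pandoc command for converting it to plain text. The function names and file names are hypothetical, chosen only for this example; they are not taken from the project's scripts.

```python
import zipfile


def list_html_members(epub_path):
    """Return the names of the HTML/XHTML files stored inside an epub."""
    # An epub is a zip archive, so the standard zipfile module can open it.
    with zipfile.ZipFile(epub_path) as zf:
        return [
            name
            for name in zf.namelist()
            if name.lower().endswith((".html", ".xhtml", ".htm"))
        ]


def pandoc_command(epub_path, txt_path):
    """Build the pandoc argument list that converts an epub to plain text."""
    # -f epub: input format, -t plain: plain text output, -o: output file.
    return ["pandoc", "-f", "epub", "-t", "plain", "-o", txt_path, epub_path]
```

The argument list returned by pandoc_command can be passed to subprocess.run, provided pandoc is installed and on the PATH.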
Here is the list of actions we have to do.
- Get the URLs of all the 850+ epub files.
- Download them all.
- Using pandoc, convert them to text files.
So far, we don't have a metadata file for all the published books. Getting the links of all the epub files needs some programming. As Python is a Swiss Army knife for automating anything, I started to explore the WordPress REST API with Python to get the content of all the book pages.
https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/get_Data.py
The code at the above link collects the info for all the books.
This gave a JSON file with the book name, author, genre, and the epub, mobi, A4 PDF and 6 inch PDF links.
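As a rough illustration of this step (not the linked script itself), the sketch below pages through a WordPress site's REST API. It assumes the books are published as ordinary posts under the standard /wp-json/wp/v2/posts route and relies on the X-WP-TotalPages response header that WordPress sends with paginated collections. The function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode


def fetch_page(base_url, page, per_page=100):
    """Fetch one page of posts; return (list_of_posts, total_pages)."""
    query = urlencode({"per_page": per_page, "page": page})
    url = f"{base_url}/wp-json/wp/v2/posts?{query}"
    with urllib.request.urlopen(url) as resp:
        # WordPress reports the number of available pages in this header.
        total_pages = int(resp.headers.get("X-WP-TotalPages", 1))
        return json.load(resp), total_pages


def fetch_all_posts(base_url, fetch=fetch_page):
    """Collect the posts from every page, starting at page 1."""
    posts, page, total_pages = [], 1, 1
    while page <= total_pages:
        batch, total_pages = fetch(base_url, page)
        posts.extend(batch)
        page += 1
    return posts
```

The fetch parameter exists only so the paging loop can be tried with a stand-in function, without touching the network. Each returned post is a dictionary; extracting the epub and PDF links from the post content is a separate parsing step not shown here.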
Converted this to a CSV file with the code below. https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/parse.py
I had to fix a few things manually in the CSV file.
This is the final CSV file. https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/fte_metadata.csv
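For readers who want to try a similar JSON to CSV conversion, here is a minimal sketch using only the standard library. The column names in FIELDS are assumptions made for this example; the real fte_metadata.csv may use different headers.

```python
import csv

# Hypothetical column names, one per item of book metadata.
FIELDS = ["name", "author", "genre", "epub", "mobi", "a4_pdf", "six_inch_pdf"]


def books_to_csv(books, out):
    """Write a list of book dictionaries to an open text file as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for book in books:
        # Missing keys become empty cells; keys not in FIELDS are dropped.
        writer.writerow({field: book.get(field, "") for field in FIELDS})
```

When writing to a real file, open it with newline="" and encoding="utf-8" so that Tamil text and line endings are written correctly.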
The code below downloads all the epub files from their links in the fte_metadata.csv file and uses pandoc to convert them to text.
https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/get_fte_books.py
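The download and convert step can be sketched roughly as follows. This is not the linked script; it assumes the CSV has a column named epub holding the download link (a hypothetical header) and that pandoc is installed and on the PATH.

```python
import csv
import subprocess
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlparse


def txt_name_for(epub_url):
    """Derive the output .txt file name from the epub URL."""
    return Path(unquote(urlparse(epub_url).path)).stem + ".txt"


def download_and_convert(csv_path, out_dir, epub_column="epub"):
    """Download every epub listed in the CSV and convert it with pandoc.

    Returns a list of (url, error message) pairs for books that failed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = (row.get(epub_column) or "").strip()
            if not url:
                continue  # no epub link recorded for this book
            epub_path = out_dir / Path(unquote(urlparse(url).path)).name
            txt_path = out_dir / txt_name_for(url)
            try:
                urllib.request.urlretrieve(url, str(epub_path))
                subprocess.run(
                    ["pandoc", "-f", "epub", "-t", "plain",
                     "-o", str(txt_path), str(epub_path)],
                    check=True,
                )
            except (OSError, ValueError, subprocess.CalledProcessError) as err:
                # Record the failure and continue with the remaining books.
                failed.append((url, str(err)))
    return failed
```

Collecting failures instead of stopping at the first error lets one bad link or one broken epub be handled later without blocking the rest of the run.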
Got 845 txt files. Total size is 374 MB.
Compressed with 7z to get a 47 MB compressed file.
Published the data here. https://kaniyam.cloudns.nz/tamil_datasets/fte-books/
Download and share the text data for free. Don't sell it, as most of the books are released under the CC-BY-NC (NonCommercial) license.
Use this data to build awesome open source applications and research, like spell checkers, grammar checkers, LLMs, RAG, and what not.
Data is always the oil. Let us grow the open data oil.
Please share all your text, audio and video content under a shareable license like Creative Commons. It will be used to build a better future.