❌

Normal view

There are new articles available, click to refresh the page.
Before yesterdayMain stream

Collecting content for LLM dataset – Part 2 – FreeTamilEbooks

16 June 2024 at 02:35

At FreeTamilEbooks.com we have published 850 ebooks. All in sharable creative commons license. There are many people asking for the text only content of all these books many times. As it is a big task, took long time for it. Thanks to Lenin, Anwar of Kaniyam Foundation, all the contributors, all the writers and readers for making this project alive and a great success.

We are publishing the books as epub format, along with PDF format. Epub is just a zip file of HTML files. So, we can copy all the content from it as unicode text. Pandoc is a wonderful open source software, which can convert an epub to plaintext file.

There are the list of actions we have to do.

  1. Get URLs of all the 850+ epub files
  2. Download them all.
  3. using pandoc, convert to text file.

So far, we dont have a metadata file for all the books published. Getting the links of all epub files need some programming. As Python is a swiss knife to automate anything, started to explore the wordpress REST api with python to get all the books pages content.

https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/get_Data.py

Wrote the code here to get all the books info.

This gave a JSON file with book name, author, genre, epub, mobi, a4 pdf,6 inch pdf links.

Converted this to a CSV file with the below code. https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/parse.py

I had to fix few things manually on the CSV file.

This is the final CSV file. https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/fte_metadata.csv

The below code is to download all the epub files from their links in the fte_metadata.csv file. Used pandoc to convert to text.

https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/get_fte_books.py

Got 845 txt files. Total size is 374 MB

Compressed with 7z to get 47MB compressed file.

Published the data here. https://kaniyam.cloudns.nz/tamil_datasets/fte-books/

Download, share the text data for free. Dont sell them as most of the books are released as CC-BY-NC ( No Commercial ) license.

Use these data to build awesome open source applications and researches like Spellchekers, grammar checkers, LLm, RAG, what not?

Data is always the oil. Let us grow the open data oil.

Please share all your text, audio, video content in sharable license like creative commons. They will use to build a better future.

Collecting content for LLM dataset – Part 2 – FreeTamilEbooks

16 June 2024 at 02:35

At FreeTamilEbooks.com we have published 850 ebooks. All in sharable creative commons license. There are many people asking for the text only content of all these books many times. As it is a big task, took long time for it. Thanks to Lenin, Anwar of Kaniyam Foundation, all the contributors, all the writers and readers for making this project alive and a great success.

We are publishing the books as epub format, along with PDF format. Epub is just a zip file of HTML files. So, we can copy all the content from it as unicode text. Pandoc is a wonderful open source software, which can convert an epub to plaintext file.

There are the list of actions we have to do.

  1. Get URLs of all the 850+ epub files
  2. Download them all.
  3. using pandoc, convert to text file.

So far, we dont have a metadata file for all the books published. Getting the links of all epub files need some programming. As Python is a swiss knife to automate anything, started to explore the wordpress REST api with python to get all the books pages content.

https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/get_Data.py

Wrote the code here to get all the books info.

This gave a JSON file with book name, author, genre, epub, mobi, a4 pdf,6 inch pdf links.

Converted this to a CSV file with the below code. https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/parse.py

I had to fix few things manually on the CSV file.

This is the final CSV file. https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/fte_metadata.csv

The below code is to download all the epub files from their links in the fte_metadata.csv file. Used pandoc to convert to text.

https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/get_fte_books.py

Got 845 txt files. Total size is 374 MB

Compressed with 7z to get 47MB compressed file.

Published the data here. https://kaniyam.cloudns.nz/tamil_datasets/fte-books/

Download, share the text data for free. Dont sell them as most of the books are released as CC-BY-NC ( No Commercial ) license.

Use these data to build awesome open source applications and researches like Spellchekers, grammar checkers, LLm, RAG, what not?

Data is always the oil. Let us grow the open data oil.

Please share all your text, audio, video content in sharable license like creative commons. They will use to build a better future.

Collecting content for LLM dataset – Part 1 – Tamil wikipedia content

11 June 2024 at 00:00

At Kaniyam Foundation, we have a dream of collecting publishing TerraBytes of Tamil text data for Tamil LLM and other research works. We are documenting the websites that provide Open Licensed tamil content, like Public Domain, Creative Commons license here. https://github.com/KaniyamFoundation/ProjectIdeas/issues/198

From here, we can get the websites, scrap them and use and share the data.

Firstly, Today, I started to explore the tamil wikipedia data.

All the wikepedia content are stored as XML and SQL files here.

Download the Wikipedia dump for the all the languages from http://dumps.wikimedia.org/backup-index.html.

For tamil wikipedia content, from here, https://dumps.wikimedia.org/tawiki/ I downloaded this file

tawiki-20240501-pages-articles-multistream.xml.bz2

it is 223.3 MB

That page has multiple files. But look for β€œpages-articles” to get the main content for wikipedia.

Then, extracted as

bunzip2 tawiki-20240501-pages-articles-multistream.xml.bz2

It gave a file tawiki-20240501-pages-articles-multistream.xml for 1.7 GB

It has a XML file. We have to extract the text content from it.

For it, explored and found a good tool. – https://github.com/apertium/WikiExtractor

Downloaded it and used it.

python3 WikiExtractor.py --infn tawiki-20240501-pages-articles-multistream.xml

It ran for 2 minutes and gave a file wiki.txt for 627 MB. It has all the articles content as a one single big plaintext file.

Compressed it with 7z as it gives better compression.

mv wiki.txt tawiki-20240501-pages-article-wiki.txt
7z a tawiki-20240501-pages-article-text.7z tawiki-20240501-pages-article-wiki.txt

it is 70 MB

Like this, will continue to get plain text tamil data from various sources. We have to find, where we can publish few 100 GBs to TBs of data, for free. Till then, will share these files on my self hosted desktop PC at my home.

Published the file here – https://kaniyam.cloudns.nz/tamil_datasets/

Let me know, if you are interested in joining this project.

Collecting content for LLM dataset – Part 1 – Tamil wikipedia content

11 June 2024 at 00:00

At Kaniyam Foundation, we have a dream of collecting publishing TerraBytes of Tamil text data for Tamil LLM and other research works. We are documenting the websites that provide Open Licensed tamil content, like Public Domain, Creative Commons license here. https://github.com/KaniyamFoundation/ProjectIdeas/issues/198

From here, we can get the websites, scrap them and use and share the data.

Firstly, Today, I started to explore the tamil wikipedia data.

All the wikepedia content are stored as XML and SQL files here.

Download the Wikipedia dump for the all the languages from http://dumps.wikimedia.org/backup-index.html.

For tamil wikipedia content, from here, https://dumps.wikimedia.org/tawiki/ I downloaded this file

tawiki-20240501-pages-articles-multistream.xml.bz2

it is 223.3 MB

That page has multiple files. But look for β€œpages-articles” to get the main content for wikipedia.

Then, extracted as

bunzip2 tawiki-20240501-pages-articles-multistream.xml.bz2

It gave a file tawiki-20240501-pages-articles-multistream.xml for 1.7 GB

It has a XML file. We have to extract the text content from it.

For it, explored and found a good tool. – https://github.com/apertium/WikiExtractor

Downloaded it and used it.

python3 WikiExtractor.py --infn tawiki-20240501-pages-articles-multistream.xml

It ran for 2 minutes and gave a file wiki.txt for 627 MB. It has all the articles content as a one single big plaintext file.

Compressed it with 7z as it gives better compression.

mv wiki.txt tawiki-20240501-pages-article-wiki.txt
7z a tawiki-20240501-pages-article-text.7z tawiki-20240501-pages-article-wiki.txt

it is 70 MB

Like this, will continue to get plain text tamil data from various sources. We have to find, where we can publish few 100 GBs to TBs of data, for free. Till then, will share these files on my self hosted desktop PC at my home.

Published the file here – https://kaniyam.cloudns.nz/tamil_datasets/

Let me know, if you are interested in joining this project.

HuggingBuddy

By: angu10
29 May 2024 at 13:32

Chrome App Link: https://chromewebstore.google.com/detail/huggingbuddy/hhkbebgakgkljpipmdblnabnoagemohb

If anyone would like to contribute more
GitHub Code: https://github.com/angu10/HuggingBuddy

Introducing HuggingBuddy: Your Friendly Companion for Reading Research Papers

Are you tired of feeling overwhelmed by complex research papers? Do you wish you had a friendly companion to help you understand the key ideas and insights? Look no further! Introducing HuggingBuddy, the user-friendly Chrome extension that simplifies the process of reading and understanding research papers from Hugging Face.

πŸ€— AI-Powered Summaries

HuggingBuddy harnesses the power of artificial intelligence to generate concise summaries of research papers. Say goodbye to hours of reading and hello to quick and easy understanding. With HuggingBuddy, you can grasp a paper's main ideas and contributions in just a few minutes.

❓ Interactive Q&A

Curious to learn more? HuggingBuddy has got you covered. The extension generates up to 5 relevant questions based on the paper's content, allowing you to explore and understand the research more deeply. Simply click on a question, and HuggingBuddy will provide a detailed answer using the advanced Gemini language model.

🎨 Customizable Reading Experience

We understand that everyone has different preferences when it comes to reading. That's why HuggingBuddy allows you to personalize your reading experience. Choose from various themes to suit your style and enable text-to-speech functionality to listen to the summaries and answers on the go.

🀝 Integration with Hugging Face

HuggingBuddy seamlessly integrates with the Hugging Face platform, giving you direct access to many research papers. No more searching through multiple websites or repositories. With HuggingBuddy, all the knowledge you need is just a click away.

🌟 Open Source and Community-Driven

HuggingBuddy is an open-source project licensed under the Apache License 2.0. We believe in the power of collaboration and encourage anyone to contribute to the project. Whether you're a developer, researcher, or enthusiast, you can help make HuggingBuddy better for everyone.

We welcome contributions in various forms, including:

  • πŸ› Bug reports and feature requests
  • πŸ’» Code contributions and pull requests
  • πŸ“š Documentation improvements
  • πŸ§ͺ Testing and feedback

By contributing to HuggingBuddy, you'll join a vibrant community of individuals passionate about making research more accessible and understandable. Together, we can create a powerful tool that benefits researchers, students, and anyone interested in exploring scientific knowledge.

πŸš€ Powered by Gemini API

HuggingBuddy leverages Google's cutting-edge Gemini API to generate summaries and provide interactive features. The Gemini API is a state-of-the-art language model that excels at natural language understanding and generation.

We are grateful to Google for making the Gemini API available and enabling us to build innovative tools like HuggingBuddy.

Ready to dive into the world of research papers with a friendly companion by your side? Install HuggingBuddy today and experience the joy of understanding complex ideas with ease. Happy reading! πŸ“–πŸ€—

❌
❌