
Using Google Sheets as a makeshift Database [Deprecated]

By: ashish
9 March 2020 at 19:51

Do you need a quick solution for storing data without the hassle of setting up a database? If your answer is yes, then you've come to the right place. This post will show you how you can use Google Sheets as your database.

For the purposes of this blog post I will be using this Google Sheet.

As you can see, we will be collecting the following data from the user – Name, Email and Age.

Create the API

  • Go to the Google Sheet you want to use.
  • Create column headers in the first row.
  • Click on Tools > Script editor.
  • Copy the following code into the editor:
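
A minimal sketch of such a script, assuming the sheet is named Sheet1 and the column headers (Name, Email, Age) sit in the first row:

// Sketch of a Google Apps Script web app that appends submitted form fields
// as a new row. The sheet name and the JSON response shape are assumptions.
var SHEET_NAME = "Sheet1";

function setup() {
  // Remember this spreadsheet's ID so doPost() can open it later.
  var doc = SpreadsheetApp.getActiveSpreadsheet();
  PropertiesService.getScriptProperties().setProperty("key", doc.getId());
}

function doPost(e) {
  var id = PropertiesService.getScriptProperties().getProperty("key");
  var sheet = SpreadsheetApp.openById(id).getSheetByName(SHEET_NAME);
  // Read the header row and build the new row in the same column order.
  var headers = sheet.getRange(1, 1, 1, sheet.getLastColumn()).getValues()[0];
  var row = headers.map(function (h) { return e.parameter[h] || ""; });
  sheet.appendRow(row);
  return ContentService.createTextOutput(JSON.stringify({ result: "success", row: row }))
    .setMimeType(ContentService.MimeType.JSON);
}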

  • Click on Run > Run function > setup.
  • Now publish your script as a web app to get the request URL (execute the app as yourself, with access for anyone, even anonymous, so the form can post to it).

Now let us test this URL in a webpage.

See the Pen "Simple register form" by Thomas Ashish Cherian (@pandawhocodes) on CodePen.

You can enter your details there to see them being updated in the Google Sheet above (refresh to see the changes).
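
If you want to wire up a similar form yourself, a minimal sketch looks like this (SCRIPT_URL is a placeholder for the web app URL you got when publishing the script; the field names must match the sheet headers):

<form id="register">
  <input name="Name" placeholder="Name">
  <input name="Email" type="email" placeholder="Email">
  <input name="Age" type="number" placeholder="Age">
  <button type="submit">Register</button>
</form>
<script>
  // Placeholder: paste the web app URL from the publish step here.
  var SCRIPT_URL = "https://script.google.com/macros/s/XXXX/exec";
  document.getElementById("register").addEventListener("submit", function (ev) {
    ev.preventDefault();
    // Send the fields as a URL-encoded POST so the Apps Script sees them in e.parameter.
    fetch(SCRIPT_URL, { method: "POST", body: new URLSearchParams(new FormData(ev.target)) })
      .then(function (res) { return res.json(); })
      .then(function (data) { console.log("Saved:", data); })
      .catch(function (err) { console.error(err); });
  });
</script>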

Collecting content for LLM dataset – Part 2 – FreeTamilEbooks

16 June 2024 at 02:35

At FreeTamilEbooks.com we have published 850 ebooks, all under shareable Creative Commons licenses. Many people have asked for the text-only content of all these books. As it is a big task, it took a long time. Thanks to Lenin and Anwar of Kaniyam Foundation, all the contributors, and all the writers and readers for making this project alive and a great success.

We publish the books in epub format, along with PDF. An epub is just a zip file of HTML files, so we can copy all the content out of it as Unicode text. Pandoc is a wonderful open source tool that can convert an epub to a plaintext file.
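
For example, converting a single book looks like this (the file names here are just placeholders):

pandoc -f epub -t plain book.epub -o book.txt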

Here is the list of actions we have to do:

  1. Get the URLs of all the 850+ epub files.
  2. Download them all.
  3. Convert them to text files using pandoc.

So far, we don't have a metadata file for all the published books, and getting the links of all the epub files needs some programming. As Python is a Swiss Army knife for automating anything, I started exploring the WordPress REST API with Python to get the content of all the book pages.

https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/get_Data.py

I wrote the code at the link above to get all the book info.

This gave a JSON file with the book name, author, genre, and links to the epub, mobi, A4 PDF, and 6-inch PDF files.
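
A minimal sketch of fetching all the book pages through the WordPress REST API (not the linked get_Data.py; it assumes the books are exposed as regular WordPress posts):

# Page through the WordPress REST API and list every post.
# The epub/PDF links can then be parsed out of each post's rendered HTML.
import requests

API = "https://freetamilebooks.com/wp-json/wp/v2/posts"

page = 1
while True:
    r = requests.get(API, params={"per_page": 100, "page": page}, timeout=30)
    if r.status_code == 400:   # WordPress answers 400 once we run past the last page
        break
    r.raise_for_status()
    for post in r.json():
        # post["content"]["rendered"] holds the HTML with the download links
        print(post["title"]["rendered"], post["link"])
    page += 1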

I converted this to a CSV file with the code below. https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/parse.py
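
The JSON-to-CSV step can be sketched like this (not the linked parse.py; the field names and file names are assumptions):

# Flatten the collected book metadata from JSON into a CSV file.
import csv
import json

with open("fte_metadata.json", encoding="utf-8") as f:   # assumed input file name
    books = json.load(f)

fields = ["name", "author", "genre", "epub", "mobi", "a4_pdf", "six_inch_pdf"]
with open("fte_metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for book in books:
        writer.writerow(book)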

I had to fix a few things manually in the CSV file.

This is the final CSV file. https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/fte_metadata.csv

The code below downloads all the epub files from the links in the fte_metadata.csv file and uses pandoc to convert them to text.

https://github.com/KaniyamFoundation/create_ebooks/blob/master/get_metadata/get_fte_books.py
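
A minimal sketch of that download-and-convert loop (not the linked get_fte_books.py; the "epub" column name and folder names are assumptions):

# Download each epub listed in fte_metadata.csv and convert it to plain text with pandoc.
import csv
import os
import subprocess
import requests

os.makedirs("epubs", exist_ok=True)
os.makedirs("texts", exist_ok=True)

with open("fte_metadata.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        url = row.get("epub", "").strip()   # assumed column name for the epub link
        if not url:
            continue
        name = os.path.basename(url)
        epub_path = os.path.join("epubs", name)
        with open(epub_path, "wb") as out:
            out.write(requests.get(url, timeout=60).content)
        txt_path = os.path.join("texts", name.replace(".epub", ".txt"))
        subprocess.run(["pandoc", "-f", "epub", "-t", "plain", epub_path, "-o", txt_path], check=True)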

This produced 845 txt files, with a total size of 374 MB.

Compressed with 7z, this comes down to a 47 MB file.

The data is published here: https://kaniyam.cloudns.nz/tamil_datasets/fte-books/

Download and share the text data for free. Don't sell it, as most of the books are released under a CC BY-NC (NonCommercial) license.

Use this data to build awesome open source applications and research: spellcheckers, grammar checkers, LLMs, RAG, and more.

Data is the new oil. Let us grow the pool of open data.

Please share all your text, audio, and video content under a shareable license like Creative Commons. It will be used to build a better future.


Collecting content for LLM dataset – Part 1 – Tamil Wikipedia content

11 June 2024 at 00:00

At Kaniyam Foundation, we have a dream of collecting and publishing terabytes of Tamil text data for Tamil LLMs and other research work. We are documenting the websites that provide openly licensed Tamil content, such as Public Domain and Creative Commons material, here: https://github.com/KaniyamFoundation/ProjectIdeas/issues/198

From there, we can get the websites, scrape them, and use and share the data.

Today, I started to explore the Tamil Wikipedia data.

All the Wikipedia content is stored as XML and SQL dump files.

Download the Wikipedia dump for all the languages from http://dumps.wikimedia.org/backup-index.html.

For Tamil Wikipedia content, I downloaded this file from https://dumps.wikimedia.org/tawiki/:

tawiki-20240501-pages-articles-multistream.xml.bz2

It is 223.3 MB.

That page has multiple files, but look for "pages-articles" to get the main article content of Wikipedia.

Then, I extracted it:

bunzip2 tawiki-20240501-pages-articles-multistream.xml.bz2

It gave a 1.7 GB file, tawiki-20240501-pages-articles-multistream.xml.

It is an XML file; we have to extract the text content from it.

For that, I explored and found a good tool: https://github.com/apertium/WikiExtractor

I downloaded it and ran it:

python3 WikiExtractor.py --infn tawiki-20240501-pages-articles-multistream.xml

It ran for 2 minutes and gave a 627 MB file, wiki.txt, containing all the article content as one single big plaintext file.

I compressed it with 7z, as that gives better compression:

mv wiki.txt tawiki-20240501-pages-article-wiki.txt
7z a tawiki-20240501-pages-article-text.7z tawiki-20240501-pages-article-wiki.txt

The archive is 70 MB.

Like this, I will continue to collect plain text Tamil data from various sources. We have to find where we can publish a few hundred GBs to TBs of data for free. Till then, I will share these files from a self-hosted desktop PC at my home.

The file is published here: https://kaniyam.cloudns.nz/tamil_datasets/

Let me know if you are interested in joining this project.
