Ubuntu: A Problem That Came From Sitting Idle

Oct 17, 2024

It was around 10:35 at night. I opened the computer, logged in, and was checking in the Chromium browser whether anyone had replied to the question I had posted on the Tamil Linux forum.

While I was looking, a message from Jason arrived on Telegram with some news about A2D Nandha's CDK Offers and CDKLabs. After reading through it all, I clicked the mouse on the computer. Nothing responded, although the pointer still moved. I thought I could use the keyboard to run a few commands in a terminal, stop Chromium, and start it again. But even though the keys themselves worked, the computer did not accept their input.

I searched on my smartphone.

I found a post on Reddit. As it suggested, I pressed Ctrl+Alt+F3 and then ran

sudo systemctl status gdm 

It showed gdm in the (inactive) state. Since I use lightdm, I then checked its status.

sudo systemctl status lightdm

It was running, so I restarted lightdm with the following command:

sudo systemctl restart lightdm

After the restart, the keyboard worked properly, and so did the mouse.

Because everything was working again, I was able to write this post too.

Reddit post

React Router

Understanding Routes

React Router renders components based on the URL without reloading the browser page, so navigating from one page to another does not trigger a page reload. To use the router in a React project, first install the react-router-dom package (for example with npm install react-router-dom).

Then import the router components:

import { BrowserRouter, Routes, Route } from 'react-router-dom';

To link between pages, use Link instead of <a href=''>, and add it to the import:

import { BrowserRouter, Routes, Route, Link } from 'react-router-dom';

Example using Link and Routes:

Using Links:

  '/' is the root path; it displays the default home page.

    <Link to='/'>Home</Link>
    <Link to='/about'>About</Link>
    <Link to='/contact'>Contact</Link>

Using Routes:

<BrowserRouter>
  <Routes>
    <Route path='/' element={<Home/>}/>
    <Route path='/about' element={<About/>}/>
    <Route path='/contact' element={<Contact/>}/>
  </Routes>
</BrowserRouter>

Kanchilug Monthly Meet – Oct 13, 2024

Lightning Talk by Krishna

Topic : Desktop Customization : picom, pywal

Resource : Krishna S

SSD File System Comparison :

Notes on Choosing HDD to SSD Linux Installation migration

Chat GPT Prompt : how to backup current linux installation configuration files and packages list as a archive and apply on a new installation


Backing up your current Linux installation’s configuration files and package lists is a great way to ensure you can easily restore or replicate your setup on a new installation. Here’s how you can do it:

Step 1: Backup Configuration Files

  1. Identify Configuration Files: Most system-wide configuration files are located in /etc. You may also have application-specific configuration in your home directory (like .bashrc, .vimrc, etc.).
  2. Create an Archive:
    You can create a compressed archive (a tarball) of your configuration files. Run the following command, replacing your_username with your login name:
   sudo tar -czvf config_backup.tar.gz /etc /home/your_username/.bashrc /home/your_username/.vimrc

Adjust the paths based on what configuration files you want to back up.
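
The same archive can also be built from a script. The following is a minimal sketch using Python's standard tarfile module; the function name backup_paths is hypothetical, and the choice to skip missing paths silently is an assumption of this sketch, not part of the original steps.

```python
import tarfile
from pathlib import Path


def backup_paths(paths, archive_path):
    """Write the given files/directories into a gzip-compressed tar archive.

    Returns the paths that were actually added; paths that do not exist are skipped.
    """
    added = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in map(Path, paths):
            if path.exists():
                # Store members without the leading "/" (as tar does by default),
                # so the archive can later be extracted relative to "/".
                archive.add(path, arcname=str(path).lstrip("/"))
                added.append(str(path))
    return added
```

For example, backup_paths(["/etc", Path.home() / ".bashrc"], "config_backup.tar.gz") would mirror the tar command above; reading everything under /etc generally needs root privileges.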

Step 2: Backup Installed Packages List

The method to back up the installed package list depends on your Linux distribution.

For Debian-based systems (like Ubuntu):

  1. Backup Installed Packages:
   dpkg --get-selections > package_list.txt
  2. Backup Sources List:
   sudo cp /etc/apt/sources.list /etc/apt/sources.list.backup
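
Before reusing the list elsewhere, it can help to see which packages it actually marks for installation. The sketch below parses text in the `dpkg --get-selections` format (package name, whitespace, state); the function name installed_packages and the sample text are made up for illustration.

```python
def installed_packages(selections_text):
    """Return the package names marked 'install' in `dpkg --get-selections` output."""
    packages = []
    for line in selections_text.splitlines():
        fields = line.split()
        # Each line is "<package> <state>"; skip blank lines and entries in
        # other states such as deinstall, hold or purge.
        if len(fields) == 2 and fields[1] == "install":
            packages.append(fields[0])
    return packages


# Hypothetical sample in the same format as package_list.txt
sample = "bash\t\t\t\tinstall\nlibc6:amd64\t\t\tinstall\nold-tool\t\t\tdeinstall\n"
print(installed_packages(sample))  # ['bash', 'libc6:amd64']
```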

For Red Hat-based systems (like Fedora, CentOS):

  1. Backup Installed Packages (names only, so the list is not tied to exact versions):
   rpm -qa --qf '%{NAME}\n' | sort -u > package_list.txt
  2. Backup Repositories:
   sudo cp -r /etc/yum.repos.d/ /path/to/backup/repo_backup/

Step 3: Create a Complete Backup Archive

Combine everything into one archive:

tar -czvf full_backup.tar.gz config_backup.tar.gz package_list.txt /etc/apt/sources.list.backup

Step 4: Restore on a New Installation

  1. Transfer the Backup: Copy the full_backup.tar.gz to your new installation.
  2. Extract the Backup:
   tar -xzvf full_backup.tar.gz
  3. Restore Configuration Files:
   sudo tar -xzvf config_backup.tar.gz -C /
  4. Restore Sources List (Debian-based): tar strips the leading / when archiving, so the sources list is extracted to etc/apt/ under the current directory.
   sudo cp etc/apt/sources.list.backup /etc/apt/sources.list
   sudo apt update
  5. Restore Package List: For Debian-based systems:
   sudo dpkg --set-selections < package_list.txt
   sudo apt-get dselect-upgrade

For Red Hat-based systems:
You can reinstall packages from the list:

   xargs -a package_list.txt sudo dnf install -y

Step 5: Finalize and Reboot

After restoring everything, it’s a good idea to reboot your system to ensure all configurations take effect.


  • Make sure to review any new configurations or installed packages after the restoration, as dependencies and versions may vary.
  • For user-specific configurations, you may want to back up the entire home directory or specific application directories.

This process should help you smoothly transition to a new Linux installation with your desired configurations and packages!

Chat GPT Prompt : I am in a plan of migrating linux os to a new file system which is xfs or zfs brtfs which on is good

Choosing between XFS, ZFS, and Btrfs depends on your specific needs and use cases. Here’s a breakdown of each file system’s strengths and considerations to help you decide:



XFS

Pros:

  • Performance: Excellent performance, especially for large files and high-performance workloads.
  • Scalability: Designed for scalability, making it suitable for high-capacity storage systems.
  • Mature: A well-established file system with a long history of stability.

Cons:

  • Features: Lacks some advanced features like snapshots and built-in RAID compared to ZFS and Btrfs.
  • Management: Does not have as many management tools as ZFS.

ZFS

Pros:

  • Data Integrity: Strong focus on data integrity with built-in checksumming.
  • Snapshots and Clones: Supports efficient snapshots and clones, allowing for easy backups and recovery.
  • Compression: Offers built-in compression options, which can save space.
  • RAID: Supports various RAID levels natively (RAID-Z), enhancing redundancy.

Cons:

  • Memory Usage: Can require more RAM for optimal performance.
  • Complexity: More complex to set up and manage compared to traditional file systems.

Btrfs

Pros:

  • Snapshots and Subvolumes: Supports snapshots and subvolumes, making it flexible for backups and system management.
  • Built-in RAID: Offers support for multiple RAID levels.
  • Compression and Deduplication: Supports on-the-fly compression and out-of-band deduplication.

Cons:

  • Stability: Still considered less mature than XFS and ZFS; while it has improved significantly, it may not be as stable for all workloads.
  • Performance: Can have variable performance depending on the workload and configuration.

Recommendations

  • For General Use: If you want stability and performance for general workloads, XFS is a great choice.
  • For Data Integrity and Advanced Features: If data integrity, snapshots, and advanced features are critical, ZFS is likely the best option.
  • For Flexibility and Modern Features: If you want modern features like snapshots and subvolumes with some flexibility, consider Btrfs.

Final Thoughts

Evaluate your specific use cases (e.g., data integrity, performance, ease of management) and the resources available on your system (e.g., RAM for ZFS). If possible, testing each file system in a non-production environment can also help you determine which one best meets your needs.

An Introduction to Tokenizers in Natural Language Processing


Co-authored by Tamil Arasan, Selvakumar Murugan and Malaikannan Sankarasubbu

In Natural Language Processing (NLP), one of the foundational steps is transforming human language into a format that computational models can understand. This is where tokenizers come into play. Tokenizers are specialized tools that break down text into smaller units called tokens, and convert these tokens into numerical data that models can process.

Imagine you have the sentence:

Artificial intelligence is revolutionizing technology.

To a human, this sentence is clear and meaningful. But we do not understand the whole sentence in one shot (okay, maybe you did, but if I gave you a paragraph or, better still, an essay, you would not be able to take it in at once). We make sense of its parts, the words and then the phrases, and understand the whole sentence as a composition of the meanings of those parts. That is how things work regardless of whether we are trying to make a machine mimic our language understanding. This has nothing to do with the reason ML models, or computers in general, work with numbers. It is purely how language works, and there is no going around it.

ML models, like everything else we run on computers, can only work with numbers, so we need to transform the text into a number or a series of numbers (since we have more than one word). We have a lot of freedom in how we transform text into numbers, and as always, with freedom comes complexity. But basically, tokenization as a whole is a two-step process: finding all the tokens (words, for now) and assigning a unique number, an ID, to each token.
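
The two steps can be sketched in a few lines of Python. This is only an illustration that splits on whitespace; the helper names build_vocab and encode are made up here, and real tokenizers are considerably more involved.

```python
def build_vocab(text):
    """Step 1: find the unique words. Step 2: assign each one an integer ID."""
    vocab = {}
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)  # next unused ID
    return vocab


def encode(text, vocab):
    """Turn text into the series of IDs a model would consume."""
    return [vocab[word] for word in text.split()]


sentence = "Artificial intelligence is revolutionizing technology."
vocab = build_vocab(sentence)
print(vocab)
print(encode("intelligence is revolutionizing technology.", vocab))  # [1, 2, 3, 4]
```

Note that encode raises a KeyError for any word that was not seen while building the vocabulary, which is exactly the out-of-vocabulary problem discussed later.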

There are many ways to segment a sentence or paragraph into pieces: phrases, words, sub-words or even individual characters. Understanding why a particular tokenization scheme is better requires a grasp of how embeddings work. If you're familiar with NLP, you'd ask "Why? Tokenization comes before the embedding, right?" Yes, you're right, but NLP is paradoxical like that. Don't worry, we will cover that as we go.


Before we venture any further, let's understand the difference between neural networks and our typical computer programs. We all know by now that for traditional computer programs we translate the rules into code by hand, whereas NNs learn the rules (the mapping between input and output) from data through a process called training. In normal programming we have a plethora of data structures that can store information in any shape or form we want, along with algorithms that jump up and down, back and forth within the set of instructions we call code. Neural networks do not allow us all the control flow we'd like. In a neural network there is only one direction the "program" can run: left to right.

Unlike traditional programs, which we can feed with input in complicated ways, neural networks accept input in only a fixed number of ways, usually as vectors (a fancy name for lists of numbers) of fixed size (or dimension, more precisely). In most DNNs, input and output sizes are fixed regardless of the problem being solved. For example, in CNNs the input (usually an image) size and the number of channels are fixed. In RNNs, the embedding dimension, the input vocabulary size, the number of output labels (in classification problems, e.g. sentiment classification) and/or the output vocabulary size (in text generation problems, e.g. QA, translation) are all fixed. In Transformer networks even the sentence length is fixed. This is not a bad thing; constraints like these enable the network to compress and capture the necessary information.

Also note that there are only a few tools to test "equality", "relevance" or "correctness" for things inside the network, because the only things that dwell inside the network are vectors. Cosine similarity and attention scores are popular. You can think of vectors as variables that keep track of state inside the neural network program. But unlike in traditional programs, where you can declare variables as you like and print them for troubleshooting, in networks the vector-variables are meaningful only at the boundaries of the layers (not entirely true) within the network.

Let's take a look at the simplest example to understand why just pulling a vector from anywhere in the network will not be of any value to us. In the following code, three functions perform the identical calculation even though their code is slightly different. The intermediate variables temp and growth_factor need not be created at all, as the first function shows; it directly embodies the compound interest formula, $A = P(1+\frac{R}{100})^{T}$. Compared to temp, the variable growth_factor holds a more meaningful interpretation: it represents how much the money will grow due to compounding over time. For more complicated formulae and functions, we might create intermediate variables so that the code is easy on the eye, but they have no significance to the operation of the function.

import math

def compound_interest_1(P,R,T):
    A = P * (math.pow((1 + (R/100)),T))
    CI = A - P
    return CI

def compound_interest_2(P,R,T):
    temp = (1 + (R/100))
    A = P * (math.pow(temp, T))
    CI = A - P
    return CI

def compound_interest_3(P,R,T):
    growth_factor = (math.pow((1 + (R/100)),T))
    A = P * growth_factor
    CI = A - P
    return CI

Here is another example, this time from the perspective of operations: clock arithmetic. Let's assign the numbers 0 through 6 to the days of the week from Sunday to Saturday.

Table 1

| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
| --- | --- | --- | --- | --- | --- | --- |
| 0   | 1   | 2   | 3   | 4   | 5   | 6   |

John Conway suggests a mnemonic device: think of the days of the week as Noneday, Oneday, Twosday, Treblesday, Foursday, Fiveday, and Six-a-day.

Suppose today is Thursday (i.e. 4) and you want to know what day it will be 137 days from now. We can compute $(4+137) \bmod 7 = 1$, i.e. Monday. As you can see, adding numbers (days) in clock arithmetic produces a meaningful output: you can add days together to get another day. Now let's ask: can we multiply two days together? Is it in any way meaningful to multiply days? Just because we can multiply any two numbers mathematically, is it useful to do so in our clock arithmetic?
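
The calculation is easy to check in code. The following is a small sketch using the Sunday = 0 numbering from Table 1; the helper name day_after is made up for illustration.

```python
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]  # Sunday = 0 ... Saturday = 6


def day_after(today, offset):
    """Return the weekday that falls `offset` days after `today`."""
    return DAYS[(DAYS.index(today) + offset) % 7]


print(day_after("Thu", 137))  # (4 + 137) % 7 == 1, i.e. Mon

# Multiplication is just as easy to compute, but the result answers no
# calendar question: "Thursday times Tuesday" is not a meaningful day.
print((DAYS.index("Thu") * DAYS.index("Tue")) % 7)  # 1
```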

All of this digression is to emphasize the following: the embedding is deemed to capture the meaning of words, and the vector from the last layers is deemed to capture the meaning of, let's say, a sentence. But when you take a vector from within the layers (just because you can), it does not refer to any meaningful unit, such as a word, phrase or sentence, as we understand it.

A little bit of history

If you're old enough, you might remember that before transformers became the standard paradigm in NLP, we had another one: EEAP (Embed, Encode, Attend, Predict). I am grossly oversimplifying here, but you can think of it as follows:

Embed
Captures the meaning of words in a matrix of size $N \times D$, where

  • $N$ is the size of the vocabulary, i.e. the number of unique words in the language
  • $D$ is the dimension of the embedding, the vector corresponding to each word.

Looks up the word-vector (embedding) for each word.

Encode
Finds the meaning of a sentence from the meaning captured in the embeddings of its constituent words, with the help of RNNs like LSTM and GRU or transformers like BERT and GPT, which take the embeddings and produce vector(s) for the whole sequence.
Predict
Depending upon the task at hand, either assigns a label to the input sentence or generates another sentence word by word.
Attend
Helps with prediction by focusing on what is important right now, drawing a probability distribution (normalized attention scores) over all the words. Words with high scores are deemed important.

As you can see above, $N$ is the vocabulary size, i.e. the number of unique words in the language. A handful of years ago, "language" usually meant the corpus at hand (on the order of a few thousand sentences), and datasets like CNN/DailyMail were considered huge. There were clever tricks, like anonymizing named entities, to force ML models to focus on language-specific features like grammar instead of open-world words like the names of places, presidents, corporations and countries. Good times they were! The point is, the corpus in your possession might not contain all the words of the language. As we have seen, the size of the embedding must be fixed before training the network. If by good fortune you stumbled upon a new dataset, and hence new words, adding them to your model was not easy, because the embedding has to be extended to accommodate these new out-of-vocabulary (OOV) words, and that requires retraining the whole network. OOV means outside the current model's vocabulary. And this is why simply segmenting the text on empty spaces will not work.
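
To make the fixed $N \times D$ embedding and the OOV problem concrete, here is a toy sketch in plain Python. The four-word vocabulary, the random values and the `<unk>` fallback row are assumptions for illustration; in a real model the rows are learned during training.

```python
import random

random.seed(0)

# A fixed vocabulary of N = 4 entries; "<unk>" is an assumed catch-all for unseen words.
vocab = {"<unk>": 0, "the": 1, "post": 2, "office": 3}
D = 3  # embedding dimension

# The embedding is an N x D table: one row of D numbers per vocabulary entry.
embedding = [[random.uniform(-1, 1) for _ in range(D)] for _ in vocab]


def lookup(word):
    """Return the row for `word`; words outside the vocabulary share the <unk> row."""
    return embedding[vocab.get(word, vocab["<unk>"])]


print(lookup("office"))
print(lookup("telephone") is lookup("<unk>"))  # True: the new word has no row of its own
```

Giving "telephone" its own row would mean growing the table from 4 to 5 rows, which is the extension that forces retraining.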

With that background, lets dive in.


Tokenization is the process of segmenting text into individual pieces (usually words) so that an ML model can digest them. It is the very first step in any NLP system and influences everything that follows. To understand the impact of tokenization, we need to understand how embeddings and sentence length influence the model. From here on we will call sentence length sequence length, because a sentence is understood to be a sequence of words, and we will experiment with sequences of different things, not just words. We will call those things tokens.

Tokens can be anything

  • Words - "telephone" "booth" "is" "nearby" "the" "post" "office"
  • Multiword Expressions (MWEs) - "telephone booth" "is" "nearby" "the" "post office"
  • Sub-words - "tele" "#phone" "booth" "is" "near" "#by" "the" "post" "office"
  • Characters - "t" "e" "l" "e" "p" ... "c" "e"

We know that segmenting the text on empty spaces will not work, because the vocabulary will keep growing. What about punctuation? Surely it will help with words like don't, won't, aren't, o'clock, Wendy's, co-operation, etc.? The same reasoning applies here too. Moreover, segmenting at punctuation creates different problems, e.g. I.S.R.O > I, S, R, O, which is not ideal.
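
A quick experiment shows both behaviours. This sketch uses a made-up example sentence and a simple regular expression as a stand-in for "segmenting at punctuation"; it is not the rule any particular tokenizer uses.

```python
import re

text = "I.S.R.O won't co-operate"

# Splitting on empty spaces keeps every surface form as its own token.
print(text.split())
# ['I.S.R.O', "won't", 'co-operate']

# Splitting at punctuation as well shreds abbreviations and contractions.
print(re.findall(r"\w+|[^\w\s]", text))
# ['I', '.', 'S', '.', 'R', '.', 'O', 'won', "'", 't', 'co', '-', 'operate']
```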

Objectives of Tokenization

The primary objectives of tokenization are:

Handling OOV
Tokenizers should be able to segment the text into pieces so that any word in the language can be represented, whether or not it is in the dataset: any word we might conjure up in the foreseeable future, whether it is technical or domain-specific terminology that scientists might utter to sound intelligent or a word commonly used by everyone in day-to-day life. An ideal tokenizer should be able to deal with any and all of them.
Efficiency
Reducing the size (length) of the input text to make computation feasible and faster.
Meaningful Representation
Capturing the semantic essence of the text so that the model can learn effectively. We will discuss this a bit later.

Simple Tokenization Methods

Go through the code below and see if you can make any inferences from the table it produces. It reads the book The Republic, counts the tokens at the character, word and sentence levels, and also reports the number of unique tokens in the whole book.

``` {.python results="output raw" exports="both"}
from collections import Counter
from nltk.tokenize import sent_tokenize  # needs the NLTK "punkt" sentence tokenizer data

with open('plato.txt') as f:
    text = f.read()

# Word level: split on whitespace; sentence level: NLTK's sentence splitter
words = text.split()
sentences = sent_tokenize(text)

char_counter = Counter()
word_counter = Counter()
sent_counter = Counter()

# Iterating over a string yields characters, so this counts each character
char_counter.update(text)
word_counter.update(words)
sent_counter.update(sentences)

# Vocabulary size = number of unique tokens; sequence length = total tokens
print('#+name: Vocabulary Size')
print('|Type|Vocabulary Size|Sequence Length|')
print(f'|Unique Characters|{len(char_counter)}|{len(text)}')
print(f'|Unique Words|{len(word_counter)}|{len(words)}')
print(f'|Unique Sentences|{len(sent_counter)}|{len(sentences)}')
```

**Table 2**

| Type              | Vocabulary Size | Sequence Length |
| ----------------- | --------------- | --------------- |
| Unique Characters | 115             | 1,213,712       |
| Unique Words      | 20,710          | 219,318         |
| Unique Sentences  | 7,777           | 8,714           |

## Study

Character-Level Tokenization

:   In this most elementary method, text is broken down into individual
    characters.

    *\"data\"* \> `"d" "a" "t" "a"`{.verbatim}

Word-Level Tokenization

:   This is the simplest and, before sub-word methods became popular,
    the most used method of tokenization, where text is split into
    individual words based on spaces and punctuation. It is still useful
    in some applications and as a pedagogical launch pad into other
    tokenization methods.
    *\"Machine learning models require data.\"* \>
    `"Machine", "learning", "models", "require", "data", "."`{.verbatim}

Sentence-Level Tokenization

:   This approach segments text into sentences, which is useful for
    tasks like machine translation or text summarization. Sentence
    tokenization is not as popular as we\'d like it to be.

    *\"Tokenizers convert text. They are essential in NLP.\"* \>
    `"Tokenizers convert text.", "They are essential in NLP."`{.verbatim}

n-gram Tokenization

:   Instead of using sentences as tokens, what if you could use
    phrases of a fixed length? The following shows the n-grams for n=2,
    i.e. 2-grams or bigrams. Yes, the `n`{.verbatim} in n-grams stands
    for how many words are chosen. n-grams can also be built from
    characters instead of words, though these are not as useful as
    word-level n-grams.
    *\"Data science is fun\"* \>
    `"Data science", "science is", "is fun"`{.verbatim}.
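
A sliding window is all it takes to produce n-grams. The sketch below is a minimal illustration (the function name ngrams is made up here); it works for any sequence, so the same helper covers word-level and character-level n-grams.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams("Data science is fun".split(), 2))
# [('Data', 'science'), ('science', 'is'), ('is', 'fun')]

print(ngrams(list("data"), 2))
# [('d', 'a'), ('a', 't'), ('t', 'a')]
```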

**Table 3**

| Tokenization | Advantages                             | Disadvantages                                        |
| ------------ | -------------------------------------- | ---------------------------------------------------- |
| Character    | Minimal vocabulary size                | Very long token sequences                            |
|              | Handles any possible input             | Require huge amount of compute                       |
| Word         | Easy to implement and understand       | Large vocabulary size                                |
|              | Preserves meaning of words             | Cannot cover the whole language                      |
| Sentence     | Preserves the context within sentences | Less granular; may miss important word-level details |
|              | Sentence-level semantics               | Sentence boundary detection is challenging           |

As Table 2 shows, vocabulary size and sequence length pull in opposite
directions: characters give a tiny vocabulary but a very long sequence,
while words give a much larger vocabulary and a far shorter sequence.
Neural networks require that tokens be present in many places and many
times. That is how the networks understand words. Remember how, when
you don\'t know the meaning of a word, you ask someone to use it in a
sentence? Same thing here: the more sentences a token appears in, the
better the network can understand it. But in the case of sentence
tokenization, there are almost as many tokens in the vocabulary (7,777)
as in the tokenized corpus (8,714). It is safe to say that almost every
token occurs only once, and that is not a healthy diet for a network.
This problem occurs in word-level tokenization too, but it is subtler:
the out-of-vocabulary (OOV) problem. To deal with OOV we need to stay
between character-level and word-level tokens. Enter \>\>\> sub-words
\<\<\<.

# Advanced Tokenization Methods

Subword tokenization is an advanced tokenization technique that breaks
text into units smaller than words. It helps in handling rare or unseen
words by decomposing them into known subword units. Our hope is that
the sub-words decomposed from the text can be used to compose new,
unseen words and so act as the tokens for those unseen words. Common
algorithms include Byte Pair Encoding (BPE), WordPiece and
SentencePiece.

*\"unhappiness\"* \> `"un", "happi", "ness"`{.verbatim}

BPE was originally a data compression technique. It has been repurposed
to compress a text corpus by merging frequently occurring pairs of
characters or subwords. Think of it as asking how few unique tokens you
need to recreate the whole book when you are free to arrange those
tokens in a line as many times as you want.


How BPE works

:   1.  *Initialization*: Start with a list of characters (initial
        vocabulary) from the text (whole corpus).
    2.  *Frequency Counting*: Count all occurrences of pairs of
        consecutive characters or subwords in the text.
    3.  *Pair Merging*: Find the most frequent pair and merge it into a
        single new subword.
    4.  *Update Text*: Replace all occurrences of the pair in the text
        with the new subword.
    5.  *Repeat*: Continue the process until reaching the desired
        vocabulary size or merging no longer provides significant
        benefit.

Advantages

:   -   Reduces the vocabulary size significantly.
    -   Handles rare and complex words effectively.
    -   Balances between word-level and character-level tokenization.

Disadvantages

:   -   Tokens may not be meaningful standalone units.
    -   Slightly more complex to implement.
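
The merge loop described above can be sketched in plain Python. This is a toy illustration under a few assumptions that are not from the text: words are split on whitespace so merges never cross word boundaries, training stops after a fixed number of merges instead of a target vocabulary size, and ties go to the pair seen first. The function names are made up, and production libraries are far more optimized.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    # Initialization: every word starts as a tuple of single characters.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)          # frequency counting
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # most frequent pair
        words = merge_pair(words, best)     # update the text
        merges.append(best)
    return merges, words


merges, words = train_bpe("low low low lower lowest", 2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" has become a single token, while "lower" and "lowest" are still spelled out as "low" plus leftover characters.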

## Trained Tokenizers

WordPiece and SentencePiece tokenization methods are extensions of BPE
in which the vocabulary is not created merely by merging the most
frequent pair. These variants evaluate whether a given merge is useful
by measuring how much it increases the likelihood of the corpus. In
simple words, let\'s take two vocabularies, before and after the
merges, and train two language models. If the model trained on the
vocabulary after the merges has lower perplexity (think loss), then we
assume that the merges were useful. And we would need to repeat this
every time we make a merge. That is not practical, and hence there are
some mathematical tricks that make it practical, which we will discuss
in a future post.

The iterative merging process is the training of the tokenizer, and
this training is different from the training of actual models. There
are Python libraries for training your own tokenizer, but when you\'re
planning to use a pretrained language model, it is better to stick with
the pretrained tokenizer associated with that model. In the following
section we see how to train a simple BPE tokenizer and a SentencePiece
tokenizer, and how to use the BERT tokenizer that comes with
Hugging Face\'s `transformers`{.verbatim} library.

## Tokenization Techniques Used in Popular Language Models

### Byte Pair Encoding (BPE) in GPT Models

GPT models, such as GPT-2 and GPT-3, utilize Byte Pair Encoding (BPE)
for tokenization.

``` {.python results="output code" exports="both"}
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Start from an empty BPE model; symbols never seen in training map to [UNK]
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
files = ["plato.txt"]

# Learn the merges from the corpus, then save the vocabulary and merges
# to the current directory with the "bpe_tokenizer" prefix
tokenizer.train(files, trainer)
tokenizer.model.save('.', 'bpe_tokenizer')

output = tokenizer.encode("Tokenization is essential first step for any NLP model.")
print("Tokens:", output.tokens)
print("Token IDs:", output.ids)
print("Length: ", len(output.ids))
```

``` python
Tokens: ['T', 'oken', 'ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'N', 'L', 'P', 'model', '.']
Token IDs: [50, 6436, 2897, 127, 3532, 399, 1697, 184, 256, 44, 42, 46, 3017, 15]
Length:  14
```

SentencePiece in T5

T5 models use a Unigram Language Model for tokenization, implemented via the SentencePiece library. This approach treats tokenization as a probabilistic model over all possible tokenizations.

``` {.python results="output code" exports="both"}
import sentencepiece as spm

# Train a unigram model on the corpus; writes unigram_tokenizer.model and unigram_tokenizer.vocab
spm.SentencePieceTrainer.Train('--input=plato.txt --model_prefix=unigram_tokenizer --vocab_size=3000 --model_type=unigram')

# Load the trained model and tokenize the same sentence
sp = spm.SentencePieceProcessor()
sp.Load("unigram_tokenizer.model")
text = "Tokenization is essential first step for any NLP model."
pieces = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)
print("Pieces:", pieces)
print("Piece IDs:", ids)
print("Length: ", len(ids))
```

``` python
Pieces: ['▁To', 'k', 'en', 'iz', 'ation', '▁is', '▁essential', '▁first', '▁step', '▁for', '▁any', '▁', 'N', 'L', 'P', '▁model', '.']
Piece IDs: [436, 191, 128, 931, 141, 11, 1945, 123, 962, 39, 65, 17, 499, 1054, 1441, 1925, 8]
Length:  17
```

WordPiece Tokenization in BERT

``` {.python results="output code"}
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Tokenization is essential first step for any NLP model."
# add_special_tokens=False leaves out [CLS] and [SEP], matching the 12 tokens listed in Table 4
encoded_input = tokenizer(text, return_tensors='pt', add_special_tokens=False)

print("Tokens:", tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0]))
print("Token IDs:", encoded_input['input_ids'][0].tolist())
print("Length: ", len(encoded_input['input_ids'][0].tolist()))
```

Summary of Tokenization Methods

Table 4

| Method           | Length | Tokens |
| ---------------- | ------ | ------ |
| BPE              | 14     | `['T', 'oken', 'ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'N', 'L', 'P', 'model', '.']` |
| SentencePiece    | 17     | `['▁To', 'k', 'en', 'iz', 'ation', '▁is', '▁essential', '▁first', '▁step', '▁for', '▁any', '▁', 'N', 'L', 'P', '▁model', '.']` |
| WordPiece (BERT) | 12     | `['token', '##ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'nl', '##p', 'model', '.']` |

Different tokenization methods give different results for the same input sentence. As we add more data to the tokenizer training, the differences between WordPiece and SentencePiece might decrease, but they will not vanish, because of the difference in their training process.

Table 5

| Model | Tokenization Method    | Library       | Key Features                             |
| ----- | ---------------------- | ------------- | ---------------------------------------- |
| GPT   | Byte Pair Encoding     | tokenizers    | Balances vocabulary size and granularity |
| BERT  | WordPiece              | transformers  | Efficient vocabulary, handles morphology |
| T5    | Unigram Language Model | sentencepiece | Probabilistic, flexible across languages |

Tokenization and Non English Languages

Tokenizing text is complex, especially when dealing with diverse languages and scripts. Various challenges can impact the effectiveness of tokenization.

Tokenization Issues with Complex Languages: With a focus on Tamil

Tokenizing text in languages like Tamil presents unique challenges due to their linguistic and script characteristics. Understanding these challenges is essential for developing effective NLP applications that handle Tamil text accurately.

Challenges in Tokenizing Tamil Language

  1. Agglutinative Morphology

    Tamil is an agglutinative language, meaning it forms words by concatenating morphemes (roots, suffixes, prefixes) to convey grammatical relationships and meanings. A single word may express what would be a full sentence in English.

    Impact on Tokenization
    • Words can be very lengthy and contain many morphemes.
      • போகமுடியாதவர்களுக்காவேயேதான்
  2. Punarchi and Phonology

    Tamil has specific rules (punarchi) for how two words can be combined, and the resulting word may not be phonologically identical to its parts. These phonological transformations can cause problems for TTS/STT systems too.

    Impact on Tokenization
    • Surface forms of words may change when combined, making boundary detection challenging.
      • மரம் + வேர் > மரவேர்
      • தமிழ் + இனிது > தமிழினிது
  3. Complex Script and Orthography

    The Unicode representation of the Tamil alphabet is suboptimal for everything except as a standardized storage format. Even simple operations that are intuitive to a native Tamil speaker are harder to implement because of this. Techniques like BPE applied to Tamil text will break words at completely inappropriate points, such as cutting an uyirmei letter into a consonant and a vowel sign, resulting in meaningless output.

    தமிழ் > த ம ி ழ, ்
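
The mismatch between stored code points and the letters a Tamil reader sees is easy to demonstrate. The sketch below uses only Python's standard unicodedata module and a simplifying assumption: every combining mark (vowel sign or pulli) belongs to the character before it. That is enough for this word, but it is not full grapheme segmentation; conjuncts such as க்ஷ would need more care.

```python
import unicodedata


def letters(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        # Unicode categories Mn and Mc cover Tamil vowel signs and the pulli.
        if groups and unicodedata.category(ch).startswith("M"):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


word = "தமிழ்"
print(len(word), list(word))  # 5 code points: த, ம, ி, ழ, ்
print(letters(word))          # 3 letters: த, மி, ழ்
```

A tokenizer seeded with the grouped letters instead of raw code points would at least never separate a vowel sign from its consonant.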

Strategies for Effective Tokenization of Tamil Text

  1. Language-Specific Tokenizers

    Train Tamil-specific subword tokenizers with initial seed tokens prepared by better preprocessing techniques, to avoid cases like the one described under "Complex Script and Orthography". Use morphological analyzers to decompose words into roots and affixes, aiding in understanding and processing complex word forms.

Choosing the Right Tokenization Method

Challenges in Tokenization

  • Ambiguity: Words can have multiple meanings, and tokenizers cannot capture context. Example: The word "lead" can be a verb or a noun.
  • Handling Special Characters and Emojis: Modern text often includes emojis, URLs, and hashtags, which require specialized handling.
  • Multilingual Texts: Tokenizing text that includes multiple languages or scripts adds complexity, necessitating adaptable tokenization strategies.

Best Practices for Effective Tokenization

  • Understand Your Data: Analyze the text data to choose the most suitable tokenization method.
  • Consider the Task Requirements: Different NLP tasks may benefit from different tokenization granularities.
  • Use Pre-trained Tokenizers When Possible: Leveraging existing tokenizers associated with pre-trained models can save time and improve performance.
  • Normalize Text Before Tokenization: Cleaning and standardizing text
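
A minimal sketch of the normalization step, assuming that Unicode NFC composition and whitespace cleanup are the normalizations wanted (the function name `normalize_text` is hypothetical, and real pipelines often add further steps):

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# "e" followed by a combining acute accent becomes the single character "é".
print(normalize_text("  Cafe\u0301   au\tlait "))  # Café au lait
```

Normalizing first means that visually identical strings map to the same code point sequence, so the tokenizer does not treat them as different tokens.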

Top Data quality tools

Top 10 Data Quality Tools

Source: Deepak Bhardwaj

➤ Ab Initio
↳ Comprehensive data management with advanced data quality features.
↳ Key Features: Automated data analysis, rule definition, issue ticketing, centralised control.

➤ SAS Data Quality
↳ Enhances data accuracy, consistency, and completeness with comprehensive tools.
↳ Key Features: Data profiling, duplicate merging, standardisation, SAS Quality Knowledge Base.

➤ DQLabs Platform
↳ Holistic data quality and observability tool using AI and machine learning.
↳ Key Features: Automated data profiling, anomaly detection, proactive monitoring.

➤ OpenRefine
↳ A free, open-source tool for cleaning and transforming messy data.
↳ Key Features: Data consistency, error correction, versatile data transformations.

➤ Precisely Data Integrity Suite
↳ A modular suite providing data quality, governance, and mastering.
↳ Key Features: Profiling, cleansing, standardisation, real-time data visualisation.

➤ Oracle Enterprise Data Quality
↳ Comprehensive solution for data governance and quality management.
↳ Key Features: Data profiling, extensible features, batch and real-time processing.

➤ Talend Data Fabric
↳ Unified platform for data integration and quality management.
↳ Key Features: Data profiling, cleansing, integration, real-time monitoring.

➤ SAP Data Services
↳ Advanced tool for data integration and quality across diverse sources.
↳ Key Features: Data transformation, cleansing, profiling, integration with SAP systems.

➤ Ataccama ONE
↳ An AI-powered platform integrating data governance, quality, and master data management.
↳ Key Features: Data profiling, real-time monitoring, integrated data catalog.

➤ Informatica Cloud Data Quality
↳ AI-driven data quality solution for cloud and hybrid environments.
↳ Key Features: Data profiling, cleansing, automated anomaly detection, CLAIRE engine.


Keyboard Tamil99: Found the keyboard practice website I was looking for

Oct 5, 2024

The Tamil99 keyboard is very easy to use for typing in Tamil. But for people like me, who learned to type Tamil with eKalappai and the Tamil phonetic keyboard from the start, practising on the Tamil99 keyboard is very difficult.

So I searched for websites that could help with this problem, and found one.

The website offers 3 layouts for practising the Tamil99 keyboard.

The first layout shows only the Tamil99 keyboard. It has no Grantha (Sanskrit) letters and no English letters.

The second layout shows the Tamil99 keyboard with the Grantha letters, and the English letters appear in a light colour. Use the {shift} key to type the Grantha letters.

The third layout shows the Tamil99 keyboard with the Grantha letters, and the English letters appear in a dark colour.

Beginners can start with the third layout and practise easily using the English letter keys, then move to the first layout once they are well practised.

Come, let's learn Tamil99!

How to Deploy a Laravel application in Shared Hosting

Preparing Hosting Infrastructure for Deployment:

  1. Create a subdomain or main domain
  2. Open the public_html folder of the site or subdomain
  3. Put an index.html file there and check that the subdomain is working

Laravel Project Upload :

  1. Prepare a Zip Archive of the Laravel Project
  2. Create a folder on the server and extract your project into the public_html folder created earlier
  3. Then move the contents of the project's public folder into public_html, and change the bootstrap and other paths in index.php as needed
  4. After that you’re going to check

Ubuntu: How to make partitions mount at startup - in Tamil

Oct 01, 2024

In this article we will see how to mount hard disks and solid-state drives at operating system startup on Ubuntu.

Graphical interface method (for new users): the Disks application edits the configuration file for us automatically and lets us adjust it to suit our needs (without breaking the operating system).

To open the Disks application, search for "disks" in the application launcher.

The Disks application appears in the launcher as shown above. When you open it, the hard disks and solid-state drives are listed as shown below.

When I click the second hard disk, its partitions are shown on screen.

To set up automatic mounting, select that disk and then select the partition that should be mounted.

Three options then appear below the partitions.

First option – Mount

Second option – Delete Partition (do not choose this option: all data on that partition will be deleted and it will be turned into unallocated space)

Third option – other settings can be found under this option.

Clicking the third option opens a context menu. Choosing Edit Mount Options there opens the Mount Options dialog box.

Everything in the Mount Options dialog box appears grayed out (unusable).

To make the settings usable, switch the User Session Defaults toggle button; all the settings then become editable. Once you save, that partition will be mounted automatically at operating system startup.

Stack Exchange:

How to mount partitions from the terminal will be covered in another post.

Hacktoberfest 2024

Hacktoberfest 2024 is just around the corner, and I hope you’re as excited as I am for this month-long celebration of all things open-source!

If you’re looking for beginner-friendly open-source projects to contribute to, we’ve got you covered. The Kaniyam Foundation Team has compiled a list of project ideas that you can take on and make your own. A special thanks to KanchiLug volunteer Syed Jaffer for putting together this list of projects to work on. Check it out at the link below:

If you need any open source project to be developed, share your project idea in detail here.

Register here –

#Hacktoberfest #Hacktoberfest2024

Chennaipy – September meetup

Hi Everyone,

Welcome to the September month meetup.

# Schedule

* AI in Digital marketing
* Novice with Metaprogramming — Decorates with Decorator
* Best practices in optimizing large scale data processing using pandas-like libraries
* Transforming Automotive Electronics Testing with Python and Robot Framework
* Lightning Talks (10 mins/talk)

# Venue

Zilogic Systems
Development Centre I
2nd Floor, Ragula Tech Park,
Type II/16, Dr. VSI Estate (Phase 1),
Chennai – 600 041.


* RSVP to get the meeting link

# Date & Time

* 28/09/2024
* 3:00 PM to 5:00 PM

# New to Python ?

* Learn Python in 30 minutes

* How to think like a computer

Chennaipy mailing list

Docker: Creating a Docker image and uploading it to Docker Hub

Sep 25, 2024

In this post I describe the steps I followed, using what I learned in the Docker class, up to uploading a Docker image to Docker Hub.

Click the link to create a Docker Hub account and sign in there with a Google account (you can also use the other sign-in options).

After signing in successfully, go to Docker Hub by clicking the Docker Hub link.

Before uploading an image to Docker Hub, we need to create a repository to upload it to.

Once the repository is created, we can build the Docker image on our computer and then upload it.

Building a Docker image on the computer

Before building the image, I first clear out the old Docker containers to free up some resources (I am short on memory). Note that docker rm removes only stopped containers, so any running container has to be stopped first.

docker rm $(docker ps -aq)

Then it is time to start writing the Dockerfile.

To build the Docker image, we first download the base images we depend on and get them ready.

I want my Docker image to be very small, so I use python3-alpine.

Verifying the installation

We can verify the installation using the commands above.

Writing the Dockerfile and building the Docker image

# Use the Python Alpine image as the base image
FROM activestate/python3-alpine:latest
# Set the working directory
WORKDIR ./foss-event-aggregator
# Copy the project files into the container
COPY ./foss-event-aggregator ./foss-event-aggregator
# Install the dev dependencies (currently disabled)
# RUN ["pip3","install","-r","./foss-event-aggregator/dev-requirements.txt"]
# Install the system packages needed to build the Python dependencies
RUN ["apk","add","libxml2-dev","libxslt-dev","python-dev"]

# Install the Python dependencies with pip
RUN ["pip3","install","-r","./foss-event-aggregator/requirements.txt"]

CMD ["python",""]

While writing the Dockerfile, check that all the required dependencies install correctly. When an error message appears, change the Dockerfile as needed to fix it.

The Docker image named foss-event-aggregator was built successfully.

The built image has been tested, so it can now be uploaded to Docker Hub.

Uploading to Docker Hub

After testing the image, tag it with the repository name:

docker image tag foss-event-aggregator:v1 itzmrevil/foss-events-aggregator:v1

After tagging, Docker Hub will accept the upload only if you sign in to Docker from the CLI before pushing.

To sign in to Docker, run

docker login

and follow the steps shown in the terminal.

After signing in, run

docker push itzmrevil/foss-events-aggregator:v1

to upload the Docker image.

To see the uploaded image on Docker Hub

Success! Success!! Success!!!

Let's Learn RedHat Linux - 2

nmcli - NetworkManager Command Line Interface

  • nmcli is a command-line tool for controlling NetworkManager. It can be used both interactively by users and from scripts.
  • nmcli can also display network device status and create, edit, activate/deactivate, and delete network connections.

List of commands

nmcli general status


nmcli connection


nmcli connection modify <name> i<tab>

  • Modify the connection first, then bring it up so that the changes take effect.
  • nmcli connection modify "<name>" ipv4.addresses <address>/<prefix> ipv4.gateway <gateway> ipv4.dns <dns>
  • Now bring it up.
nmcli connection up "<name>"


  • The nameserver is also referred to as DNS.
  • Press Tab often: it shows the options available for the next command or word.
  • The values to supply are the IP address, netmask, gateway and nameserver.
  • How to assign number after "/" in the IP address --> TBD
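
On the open question above: the number after "/" is the prefix length, which is the count of 1 bits in the netmask (so 255.255.255.0 corresponds to /24). The sketch below uses Python's standard ipaddress module to do the conversion; the function name `cidr_address` and the sample addresses are hypothetical, chosen only for illustration.

```python
import ipaddress


def cidr_address(ip: str, netmask: str) -> str:
    """Return the address in "ip/prefix" form for a dotted-decimal netmask."""
    return ipaddress.IPv4Interface(f"{ip}/{netmask}").with_prefixlen


print(cidr_address("192.168.1.10", "255.255.255.0"))  # 192.168.1.10/24
print(cidr_address("10.0.0.5", "255.255.0.0"))        # 10.0.0.5/16
```

The resulting string is in the address/prefix form that ipv4.addresses expects.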