Reading view

There are new articles available, click to refresh the page.

VPC Lattice

9 April 2025 at 11:11

Amazon VPC Lattice is a fully managed application networking service that you use to connect, secure, and monitor the services and resources for your application

VPC Lattice, you should be familiar with its key components.

Service
service can run on EC2 instances or ECS/EKS/Fargate containers, or as Lambda functions, within an account or a virtual private cloud (VPC). A VPC Lattice service has the following components: target groups, listeners, and rules.
Resource
Amazon Relational Database Service (Amazon RDS) database, an Amazon EC2 instance, an application endpoint, a domain-name target, or an IP address.
Resource gateway
A resource gateway is a point of ingress into the VPC in which resources reside.
Resource configuration
A resource configuration is a logical object that represents either a single resource or a group of resources. A resource can be an IP address, a domain-name target, or an Amazon RDS database.
Service network
A client can be in a VPC that is associated with the service network. Clients and services that are associated with the same service network can communicate with each other
Service directory
A central registry of all VPC Lattice services that you own or are shared with your account through AWS RAM.
Auth policies
create a policy for how a payment service running on an auto scaling group of EC2 instances should interact with a billing service running in AWS Lambda.

Auth-policies are not supported on resource configurations. Auth policies of a service-network are not applicable to resource configurations in the service network.

Features of VPC Lattice

You need not be concerned about overlapping IP addresses between VPCs.

As the traffic is internal between VPCs, you do not need to modify the route table.

Here we go with the usecase - i have hosted 2 web server on 2 instance and make use of vpc lattice

I have created 2 linux EC2-instance as webserver "poc-server1","poc-server2"
Both have seperate vpc and security groups to access the port from 80 using http.
Here are the servers

poc-server1

poc-server2

request from local machine to verify the web server

VPC lattice connection between the webservers

Go to VPC dashboard
Click “Target groups” under the VPC Lattice section of the VPC Console.
Click “Create target group“.
Create the target group by instance type,protocol and VPC add the poc-server1 to the target group.
Follow the same steps for the poc-server2.

Under the VPC Lattice section, click “Services”

Need to create lattice service to associate the service with service network.
create vpc lattice service.
Click “Next”.
Click “Next” on the next page (the Define routing page).
Click “Next” on the next page (the Create network associations page).
Click “Create VPC Lattice Service” on the next page (the Review and create page).

Follow the same steps for the poc-server2.

VPC Console and click “Service networks” under the VPC Lattice section

Click “Create service network”. Create a service network
Under service association attach the services which we created as "poc-server1","poc-server2".
Under VPC association attach the VPC and security group which we created for the web servers.
Click “Create service network”

Go to the "poc-server1" service overview and Click “Routing”
Click “Add listener”

Follow the same steps for the service "poc-server2".
Return to the payment-svc overview and copy the domain name
VPC Lattice configurations have been completed. Let’s see how the setup works
Try to access the VPC Lattice domains from the EC2 instances

VPC lattice Domain address

VibeCoding

Malaikannan

Malaikannan Sankarasubbu

4 April 2025 at 00:00

VibeCoding

As a CTO, I don’t typically get a lot of time to sit and code—there’s a lot of grunt work involved in my role. Quite a bit of my time goes into the operational aspects of running an engineering team: feature prioritization, production issue reviews, cloud cost reviews, one-on-ones, status updates, budgeting, etc.

Although I’m deeply involved in architecture, design, and scaling decisions, I’m not contributing as a senior developer writing code for features in the product as much as I’d like. It’s not just about writing code—it’s about maintaining it. And with everything else I do, I felt I didn’t have the time or energy to maintain a complex feature.

Over the past couple of years—apart from AI research work—my coding has been pretty much limited to pair programming or contributing to some minor, non-critical features. Sometimes I end up writing small utilities here and there.

Pair Programming as a Connection Tool

I love to pair program with at least two engineers for a couple of hours every week. This gives me an opportunity to connect with them 1:1, understand the problems on the ground, and help them see the bigger picture of how the features they’re working on fit into our product ecosystem.

1:1s were never effective for me. In that 30-minute window, engineers rarely open up. But if you sit and code for 2–3 hours at a stretch, you get to learn a lot about them—their problem-solving approach, what motivates them, and more.

I also send out a weekly video update to the engineering team, where I talk about the engineers I pair programmed with, their background, and what we worked on. It helps the broader engineering team learn more about their peers as well.

The Engineer in Me

The engineer in me always wants to get back to coding—because there’s no joy quite like building something and making it work. I’m happiest when I code.

I’ve worked in several languages over the years—Java, VB6, C#, Perl, good old shell scripting, Python, JavaScript, and more. Once I found Python, I never looked back. I absolutely love the Python community.

I’ve been a full-stack developer throughout my career. My path to becoming a CTO was non-traditional (that’s a story for another blog). I started in a consulting firm and worked across different projects and tech stacks early on, which helped me become a well-rounded full-stack engineer.

I still remember building a simple timesheet entry application in 2006 using HTML and JavaScript (with AJAX) for a client’s invoicing needs. It was a small utility, but it made timesheet entry so much easier for engineers. That experience stuck with me.

I’ll get to why being a full-stack engineer helped me build the app using VibeCoding shortly.

The Spark: Coffee with Shuveb Hussain

I was catching up over coffee with Shuveb Hussain, founder and CEO of ZipStack. Their product, Unstract, is really good for extracting entities from different types of documents. They’ve even open-sourced a version—go check it out.

Shuveb, being a seasoned engineer’s engineer, mentioned how GenAI code editors helped him quickly build a few apps for his own use over a weekend. That sparked something in me: why wasn’t I automating my grunt work with one of these GenAI code editors?

I’ve used GitHub Copilot for a while, but these newer GenAI editors—like Cursor and Windsurf—are in a different league. Based on Shuveb’s recommendation, I chose Windsurf.

Let’s be honest though—I don’t remember any weekend project of mine that ended in just one weekend 😅

The Grunt Work I Wanted to Automate

I was looking for ways to automate the boring but necessary stuff, so I could focus more on external-facing activities.

Every Thursday, I spent about 6 hours analyzing production issues before the weekly Friday review with engineering leaders, the SRE team, product leads, and support. I’d get a CSV dump of all the tickets and manually go through each one to identify patterns or repeated issues. Then I’d start asking questions on Slack or during the review meeting.

The process was painful and time-consuming. I did this for over 6 months and knew I needed to change something.

In addition to that:

I regularly reviewed cloud compute costs across environments and products to identify areas for optimization.
I monitored feature usage metrics to see what customers actually used.
I examined job runtime stats (it’s a low-code platform, so this matters).
I looked at engineering team metrics from the operations side.

Each of these lived in different tools, dashboards, or portals. I was tired of logging into 10 places and context-switching constantly while fielding distractions.

The Build Begins: CTO Dashboard

I decided to build an internal tool I nicknamed CTODashboard—to consolidate everything I needed.

My Tech Stack (via Windsurf):

Frontend: ReactJS
Backend: Python (FastAPI)
Database: Postgres
Deployment: EC2 (with some help from the SRE team)

I used Windsurf’s Cascade interface to prompt out code, even while attending meetings. It was surprisingly effective… except for one time when a prompt completely messed up my day’s work. Lesson learned: commit code at every working logical step.

In a couple of days, I had:

A feature to upload the CSV dump
Filters to slice and dice data
A paginated data table
Trend analytics with visualizations

Even when I hit errors, I just screenshot them or pasted logs into Windsurf and asked it to fix them. It did a decent job. When it hallucinated or got stuck, I just restarted with a fresh Cascade.

I had to rewrite the CSV upload logic manually when the semantic mapping to the backend tables went wrong. But overall, 80% of the code was generated—20% I wrote myself. And I reviewed everything to ensure it worked as intended.

Early Feedback & Iteration

I gave access to a couple of colleagues for early feedback. It was overwhelmingly positive. They even suggested new features like:

Summarizing long tickets
Copy-pasting ticket details directly into Slack
Copying visualizations without taking screenshots

I implemented all of those using Windsurf in just a few more days.

In under a week, I had an MVP that cut my Thursday analysis time from 6+ hours to under 2. Production Issues Dashboard_1 Production Issues Dashboard_2 Production Issues Dashboard_3

Then Abhay Dandekar, another senior developer, offered to help. He built a Lambda function to call our Helpdesk API every hour to fetch the latest tickets and updates. He got it working in 4 hours—like a boss.

Growing Usage

Word about the dashboard started leaking (okay, I may have leaked it myself 😉). As more people requested access, I had to:

Add Google Sign-In
Implement authorization controls
Build a user admin module
Secure the backend APIs with proper access control
Add audit logs to track who was using what

I got all this done over a weekend. It consumed my entire weekend, but it was worth it.

AWS Cost Analytics Module

Next, I added a module to analyze AWS cost trends across production and non-prod environments by product.

Initially, it was another CSV upload feature. Later, Abhay added a Lambda to fetch the data daily. I wanted engineering directors to see the cost implications of design decisions—especially when non-prod environments were always-on.

Before this, I spent 30 minutes daily reviewing AWS cost trends. Once the dashboard launched, engineers started checking it themselves. That awareness led to much smarter decisions and significant cost savings.

I added visualizations for:

Daily cost trends
Monthly breakdowns
Environment-specific views
Product-level costs

AWS Cost Analytics Dashboard_1 AWS Cost Analytics Dashboard_2

More Modules Coming Soon

I’ve since added:

Usage metrics
Capitalization tracking
(In progress): Performance and engineering metrics

The dashboard now has 200+ users, and I’m releasing access in batches to manage performance.

Lessons Learned from VibeCoding

This was a fun experiment to see how far I could go with GenAI-based development using just prompts.

What I Learned:

Strong system design fundamentals are essential.
Windsurf can get stuck in loops—step in and take control.
Commit frequently. Mandatory.
You’re also the tester—don’t skip this.
If the app breaks badly, roll back. Don’t fix bad code.
GenAI editors are great for senior engineers; less convinced about junior devs.
Model training cutoffs matter—it affects library choices.
Write smart prompts with guardrails (e.g., “no files >300 lines”).
GenAI tools struggle to edit large files (my app.py hit 5,000 lines; I had to refactor manually).
Use virtual environments—GenAI often forgets.
Deployments are tedious—I took help from the SRE team for Jenkins/Terraform setup.
If you love coding, VibeCoding is addictive.

Gratitude

Special thanks to:

Bhavani Shankar, Navin Kumaran, Geetha Eswaran, and Sabyasachi Rout for helping with deployment scripts and automation (yes, I bugged them a lot).
Pravin Kumar, Vikas Kishore, Nandini PS, and Prithviraj Subburamanian for feedback and acting as de facto product managers for dashboard features.

Rag

Malaikannan

Malaikannan Sankarasubbu

31 October 2024 at 00:00

How do I use the ResourceTag, condition keys to create an IAM policy for tag-based restriction

Kannan

28 February 2025 at 08:12

The following IAM policies use condition keys to create tag-based restriction.

Before you use tags to control access to your AWS resources, you must understand how AWS grants access. AWS is composed of collections of resources. An Amazon EC2 instance is a resource. An Amazon S3 bucket is a resource. You can use the AWS API, the AWS CLI, or the AWS Management Console to perform an operation, such as creating a bucket in Amazon S3. When you do, you send a request for that operation. Your request specifies an action, a resource, a principal entity (user or role), a principal account, and any necessary request information.
You can then create an IAM policy that allows or denies access to a resource based on that resource's tag. In that policy, you can use tag condition keys to control access to any of the following:
Resource – Control access to AWS service resources based on the tags on those resources. To do this, use the_ aws:ResourceTag/key-name_ condition key to determine whether to allow access to the resource based on the tags that are attached to the resource.

ResourceTag condition key

Use the _aws:ResourceTag/tag-key _condition key to compare the tag key-value pair that's specified in the IAM policy with the key-value pair that's attached to the AWS resource. For more information, see Controlling access to AWS resources.

You can use this condition key with the global aws:ResourceTag version and AWS services, such as ec2:ResourceTag. For more information, see Actions, resources, and condition keys for AWS services.

The following IAM policy allows users to start, stop, and terminate instances that are in the test application tag

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ec2:StartInstances",
                "ec2:StopInstances",
                "ec2:TerminateInstances"
            ],
            "Resource": "arn:aws:ec2:*:3817********:instance/*",
            "Condition": {
                "StringEquals": {
                    "ec2:ResourceTag/application": "test"
                }
            }
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:DescribeTags"
                "ec2:Describedescribe-instance-status"
            ],
            "Resource": "*"
        }
    ]
}

Create the policy and attach the policy to user or role.

Created 2 instance one is with application tag and other is non tagged.

You can see the tagged instance are able to perform Start and Stop action using the IAM resources tag condition.
non-tagged instance we are not able to perform the same.

check the status of the instance

perform the Termination action

reference commands

aws ec2 start-instances --instance-ids "instance-id"

aws ec2 stop-instances --instance-ids "instance-id"

aws ec2 describe-instance-status  --instance-ids "instance-id"

aws ec2 terminate-instances --instance-ids "instance-id"

String condition operators

String condition operators let you construct Condition elements that restrict access based on comparing a key to a string value.

StringEquals - Exact matching, case sensitive
StringNotEquals - Negated matching
StringEqualsIgnoreCase - Exact matching, ignoring case
StringNotEqualsIgnoreCase - Negated matching, ignoring case
StringLike - Case-sensitive matching. The values can include multi-character match wildcards (*) and single-character match wildcards (?) anywhere in the string. You must specify wildcards to achieve partial string matches.
Note
If a key contains multiple values, StringLike can be qualified with set operators—ForAllValues:StringLike and ForAnyValue:StringLike.
StringNotLike - Negated case-sensitive matching. The values can include multi-character match wildcards (*) or single-character match wildcards (?) anywhere in the string.

Script to list the S3 Bucket storage size

Kannan

28 February 2025 at 05:41

If we need to fetch the S3 bucket storage size we need to trace via individual bucket under metrics we get the storage size.
on one go use the below script to get the bucket name with storage size.

s3list=`aws s3 ls | awk  '{print $3}'`
for s3dir in $s3list
do
    echo $s3dir
    aws s3 ls "s3://$s3dir"  --recursive --human-readable --summarize | grep "Total Size" 
done

create the .sh file
copy the code on the file
Excecute the script to get the s3 bucket details

An Introduction to Tokenizers in Natural Language Processing

Malaikannan

Malaikannan Sankarasubbu

25 September 2024 at 00:00

Tokenizers

_Co-authored by Tamil Arasan, Selvakumar Murugan and Malaikannan Sankarasubbu

In Natural Language Processing (NLP), one of the foundational steps is transforming human language into a format that computational models can understand. This is where tokenizers come into play. Tokenizers are specialized tools that break down text into smaller units called tokens, and convert these tokens into numerical data that models can process.

Imagine you have the sentence:

Artificial intelligence is revolutionizing technology.

To a human, this sentence is clear and meaningful. But we do not understand the whole sentence in one shot(okay may be you did, but I am sure if I gave you a paragraph or a even better an essay, you will not be able to understand them in one shot), but we make sense of parts of it like words and then phrases and understand the whole sentence as a composition of meanings from its parts. It is just how things work, regardless whether we are trying to make a machine mimic our language understanding or not. This has nothing to do with the reason ML models or even computers in general work with numbers. It is purely how language works and there is no going around it.

ML models like everything else we run on computers can only work with numbers, and we need to transform the text into number or series of numbers (since we have more than one word). We have a lot of freedom when it comes to how we transform the text into numbers, and as always with freedom comes complexity. But basically, tokenization as a whole is a two step process. Finding all the words and assigning a unique number - an ID to each token.

There are so many ways we can segment a sentence/paragraph into pieces like phrases, words, sub-words or even individual characters. Understanding why particular tokenization scheme is better requires a grasp of how embeddings work. If you're familiar with NLP, you'd ask "Why? Tokenization comes before the Embedding, right?" Yes, you're right, but NLP is paradoxical like that. Don't worry we will cover that as we go.

Background

Before we venture any further, lets understand the difference between Neural networks and our typical computer programs. We all know by now that for traditional computer programs, we write/translate the rules into code by hand whereas, NNs learn the rules(mapping across input and output) from data by the process called training. You see unlike in normal programming style, where we have a plethora of data-structures that can help with storing information in any shape or form we want, along with algorithms that jump up and down, back and forth in a set of instructions we call code, Neural Networks do not allow us to have all sorts of control flow we'd like. In Neural Networks, there is only one direction the "program" can run, left to right.

Unlike in traditional programs where the we can feed a program with input in complicated ways, in Neural Networks, there are only fixed number of ways, we can feed and it is usually in the form of vectors (fancy name for list of numbers) and the vectors are of fixed size (or dimension more precisely). In most DNNs, input and output sizes are fixed regardless of the problem it is trying to solve. For example, CNNs the input (usually image) size and number of channels is fixed. In RNNs, the embedding dimensions, input vocabulary size, number of output labels (classification problem e.g: sentiment classification) and or output vocabulary size (text generation problems e.g: QA, translation) are all fixed. In Transformer networks even the sentence length is fixed. This is not a bad thing, constraints like these enable the network to compress and capture the necessary information.

Also note that there are only few tools to test "equality" or "relevance" or "correctness" for things inside the network because only things that dwell inside the network are vectors. Cosine similarity and attention scores are popular. You can think of vectors as variables that keep track of state inside neural network program. But unlike in traditional programs where you can declare variables as you'd like and print them for troubleshooting, in networks the vector-variables are only meaningful only at the boundaries of the layers(not entirely true) within the networks.

Lets take a look at the simplest example to understand why just pulling a vector from anywhere in the network will not be of any value for us. In the following code, three functions perform the identical calculation despite their code is slightly different. The unnecessarily intentionally named variables temp and growth_factor need not be created as exemplified by the first function, which directly embodies the compound interest calculation formula, $A = P(1+\frac{R}{100})^{T}$. When compared to temp, the variable growth_factor hold a more meaningful interpretation - represents how much the money will grow due to compounding interest over time. For more complicated formulae and functions, we might create intermediate variables so that the code goes easy on the eye, but they have no significance to the operation of the function.

def compound_interest_1(P,R,T):
    A = P * (math.pow((1 + (R/100)),T))
    CI = A - P
    return CI

def compound_interest_2(P,R,T):
    temp = (1 + (R/100))
    A = P * (math.pow(temp, T))
    CI = A - P
    return CI

def compound_interest_3(P,R,T):
    growth_factor = (math.pow((1 + (R/100)),T))
    A = P * growth_factor
    CI = A - P
    return CI

Another example to illustrate from operations perspective. Clock arithmetic. Lets assign numbers 0 through 7 to weekdays starting from Sunday to Saturday.

Table 1

Sun	Mon	Tue	Wed	Thu	Fri	Sat
0	1	2	3	4	5	6

John Conway suggests, a mnemonic device for thinking of the days of the week as Noneday, Oneday, Twosday, Treblesday, Foursday, Fiveday, and Six-a-day.

So if you want to know what day it is 137 days from today if today is say, Thursday (i.e. 4). We can do $(4+137) mod 7 => 1$ i.e Monday. As you can see adding numbers(days) in clock arithmetic results in a meaningful output. You can days together to get another day. Okay lets ask the question can we multiply two days together? Is it is in anyway meaningful to multiply days? Just because we can multiply any number mathematically, is it useful to do so in our clock arithmetic?

All of this digression is to emphasize that the embedding is deemed to capture the meaning of words, vector from the last layers is deemed to capture the meaning of a sentence lets say. But when you take a vector (just because you can) within the layers for instance, it does not refer to any meaningful unit such as words or phrases and sentence as we understand it.

A little bit of history

If you're old enough, you might remember that before transformers became standard paradigm in NLP, we had another one EEAP (Embed, Encode, Attend, Predict). I am grossly oversimplifying here, but you can think of it as follows,

Embedding

Captures the meaning of words A matrix of size $N \times D$, where

$N$ is the size of the vocabulary, i.e unique number of words in the language
$D$ is the dimension of embedding, vector corresponding to each word.

Lookup the word-vector (embedding) for each word

Encoding

Find the meaning of a sentence, by using the meaning captured in embeddings of the constituent words with help of RNNs like LSTM, GRU or transformers like BERT, GPT that take the embeddings and produce vector(s) for whole the sequence.

Prediction

Depending upon the task at hand, either assigns a label to the input sentence, or generate another sentence word by word.

Attention

Helps with Prediction by focusing on what is important right now by drawing a probability distribution (normalized attention scores) over the all words. Words with high score are deemed important.

As you can see above, $N$ is the vocabulary size, i.e unique number of words in the language. And handful of years ago, language usually meant the corpus at hand (in order of few thousands of sentences) and datasets like CNN/DailyMail were considered huge. There were clever tricks like anonymizing named entities to force the ML models to focus on language specific features like grammar instead of open world words like names of Places, Presidents, Corporations and Countries, etc. Good times they were! Point is, it is possible that the corpus you have in your possession might not have all the words of the language. As we have seen, the size of the Embedding must be fixed before training the network. By good fortune if you stumble upon a new dataset and hence new words, adding them to your model was not easy, because Embedding needs to extend to accommodate this new (OOV) words and that requires retraining of the whole network. OOV means Out Of the current model's Vocabulary. And this is why simply segmenting the text on empty spaces will not work.

With that background, lets dive in.

Tokenization

Tokenization is the process of segmenting the text into individual pieces (usually words) so that ML model can digest them. It is the very first step in any NLP system and influences everything that follows. For understanding impact of tokenization, we need to understand how embeddings and sentence length influence the model. We will call sentence length as sequence length from here on, because sentence is understood to be sequence of words, and we will experiment with sequence of different things not just words, which we will call tokens.

Tokens can be anything

Words - "telephone" "booth" "is" "nearby" "the" "post" "office"
Multiword Expressions (MWEs) - "telephone booth" "is" "nearby" "the" "post office"
Sub-words - "tele" "#phone" "booth" "is" "near " "#by" "the" "post" "office"
Characters - "t" "e" "l" "e" "p" ... "c" "e"

We know segmenting the text based on empty spaces will not work, because the vocabulary will keep growing. What about punctuations? Surely they will help with words don't, won't, aren't, o'clock, Wendy's, co-operation{.verbatim} etc, same reasoning applies here too. Moreover segmenting at punctuations will create different problems, e.g: I.S.R.O > I, S, R, O{.verbatim} which is not ideal.

Objectives of Tokenization

The primary objectives of tokenization are:

Handling OOV: Tokenizers should be able to segment the text into pieces so that any word in the language whether it is in the dataset or not, any word we might conjure in foreseeable future, whether it is a technical/domain specific terminology that scientists might utter to sound intelligent or commonly used by everyone in day to day life. An ideal tokenizer should be able to deal with all and any of them.
Efficiency: Reducing the size (length) of the input text to make computation feasible and faster.
Meaningful Representation: Capturing the semantic essence of the text so that the model can learn effectively. Which we will discuss a bit later.

Simple Tokenization Methods

Go through the code below, and see if you can make any inferences on the table produced. It reads the book The Republic and counts the tokens on character, word and sentence levels and also indicated the number of unique tokens in the whole book.

Code

``` {.python results=”output raw” exports=”both”} from collections import Counter from nltk.tokenize import sent_tokenize with open(‘plato.txt’) as f: text = f.read()

words = text.split() sentences = sent_tokenize(text)

char_counter = Counter() word_counter = Counter() sent_counter = Counter()

char_counter.update(text) word_counter.update(words) sent_counter.update(sentences)

print(‘#+name: Vocabulary Size’) print(‘|Type|Vocabulary Size|Sequence Length|’) print(f’|Unique Characters|{len(char_counter)}|{len(text)}’) print(f’|Unique Words|{len(word_counter)}|{len(words)}’) print(f’|Unique Sentences|{len(sent_counter)}|{len(sentences)}’)


**Table 2**

| Type              | Vocabulary Size | Sequence Length |
| ----------------- | --------------- | --------------- |
| Unique Characters | 115             | 1,213,712       |
| Unique Words      | 20,710          | 219,318         |
| Unique Sentences  | 7,777           | 8,714           |



## Study

Character-Level Tokenization

:   In this most elementary method, text is broken down into individual
    characters.

    *\"data\"* \> `"d" "a" "t" "a"`{.verbatim}

Word-Level Tokenization

:   This is the simplest and most used (before sub-word methods became
    popular) method of tokenization, where text is split into individual
    words based on spaces and punctuation. Still useful in some
    applications and as a pedagogical launch pad into other tokenization
    techniques.

    *\"Machine learning models require data.\"* \>
    `"Machine", "learning", "models", "require", "data", "."`{.verbatim}

Sentence-Level Tokenization

:   This approach segments text into sentences, which is useful for
    tasks like machine translation or text summarization. Sentence
    tokenization is not as popular as we\'d like it to be.

    *\"Tokenizers convert text. They are essential in NLP.\"* \>
    `"Tokenizers convert text.", "They are essential in NLP."`{.verbatim}

n-gram Tokenization

:   Instead of using sentences as a tokens, what if you could use
    phrases of fixed length. The following shows the n-grams for n=2,
    i.e 2-gram or bigram. Yes the `n`{.verbatim} in the n-grams stands
    for how many words are chosen. n-grams can also be built from
    characters instead of words, though not as useful as word level
    n-grams.

    *\"Data science is fun\"* \>
    `"Data science", "science is", "is fun"`{.verbatim}.





**Table 3**

| Tokenization | Advantages                             | Disadvantages                                        |
| ------------ | -------------------------------------- | ---------------------------------------------------- |
| Character    | Minimal vocabulary size                | Very long token sequences                            |
|              | Handles any possible input             | Require huge amount of compute                       |
| Word         | Easy to implement and understand       | Large vocabulary size                                |
|              | Preserves meaning of words             | Cannot cover the whole language                      |
| Sentence     | Preserves the context within sentences | Less granular; may miss important word-level details |
|              | Sentence-level semantics               | Sentence boundary detection is challenging           |

As you can see from the table, the vocabulary size and sequence length
have inverse correlation. The Neural networks requires that the tokens
should be present in many places and many times. That is how the
networks understand words. Remember when you don\'t know the meaning of
a word, you ask someone to use it in sentences? Same thing here, the
more sentences the token is present, the better the network can
understand it. But in case of sentence tokenization, you can see there
are as many tokens in its vocabulary as in the tokenized corpus. It is
safe to say that each token is occuring only once and that is not a
healthy diet for a network. This problem occurs in word-level
tokenization too but it is subtle, the out-of-vocabulary(OoV) problem.
To deal with OOV we need to stay between character level and word-level
tokens, enter \>\>\> sub-words \<\<\<.

# Advanced Tokenization Methods

Subword tokenization is an advanced tokenization technique that breaks
text into smaller units, smaller than words. It helps in handling rare
or unseen words by decomposing them into known subword units. Our hope
is that, the sub-words decomposed from text, can be used to compose new
unseen words and so act as the tokens for the unseen words. Common
algorithms include Byte Pair Encoding (BPE), WordPiece, SentencePiece.

*\"unhappiness\"* \> `"un", "happi", "ness"`{.verbatim}

BPE is originally a technique for compression of data. Repurposed to
compress text corpus by merging frequently occurring pairs of characters
or subwords. Think of it like what and how little number of unique
tokens you need to recreate the whole book when you are free to arrange
those tokens in a line as many time as you want.

Algorithm

:   1.  *Initialization*: Start with a list of characters (initial
        vocabulary) from the text(whole corpus).
    2.  *Frequency Counting*: Count all pair occurrences of consecutive
        characters/subwords.
    3.  *Pair Merging*: Find the most frequent pair and merge it into a
        single new subword.
    4.  *Update Text*: Replace all occurrences of the pair in the text
        with the new subword.
    5.  *Repeat*: Continue the process until reaching the desired
        vocabulary size or merging no longer provides significant
        compression.

Advantages

:   -   Reduces the vocabulary size significantly.
    -   Handles rare and complex words effectively.
    -   Balances between word-level and character-level tokenization.

Disadvantages

:   -   Tokens may not be meaningful standalone units.
    -   Slightly more complex to implement.

## Trained Tokenizers

WordPiece and SentencePiece tokenization methods are extensions of BPE
where the vocabulary is not merely created by assuming merging most
frequent pair. These variants evaluate whether the given merges were
useful or not by measuring how much each merge maximizes the likelihood
of the corpus. In simple words, lets take two vocabularies, before and
after the merges, and train two language models and the model trained on
vocabulary after the merges have lower perplexity(think loss) then we
assume that the merges were useful. And we need to repeat this every
time we make a merge. Not practical, and hence there some mathematical
tricks we use to make this more practical that we will discuss in a
future post.

The iterative merging process is the training of tokenizer and this
training is different training of actual models. There are python
libraries for training your own tokenizer, but when you\'re planning to
use a pretrained language model, it is better to stick with the
pretrained tokenizer associated with that model. In the following
section we see how to train a simple BPE tokenizer, SentencePiece
tokenizer and how to use BERT tokenizer that comes with huggingface\'s
`transformers`{.verbatim} library.

## Tokenization Techniques Used in Popular Language Models

### Byte Pair Encoding (BPE) in GPT Models

GPT models, such as GPT-2 and GPT-3, utilize Byte Pair Encoding (BPE)
for tokenization.

``` {.python results="output code" exports="both"}
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer =  Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
                     vocab_size=30000)
files = ["plato.txt"]

tokenizer.train(files, trainer)
tokenizer.model.save('.', 'bpe_tokenizer')

output = tokenizer.encode("Tokenization is essential first step for any NLP model.")
print("Tokens:", output.tokens)
print("Token IDs:", output.ids)
print("Length: ", len(output.ids))

Tokens: ['T', 'oken', 'ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'N', 'L', 'P', 'model', '.']
Token IDs: [50, 6436, 2897, 127, 3532, 399, 1697, 184, 256, 44, 42, 46, 3017, 15]
Length:  14

SentencePiece in T5

T5 models use a Unigram Language Model for tokenization, implemented via the SentencePiece library. This approach treats tokenization as a probabilistic model over all possible tokenizations.

import sentencepiece as spm
spm.SentencePieceTrainer.Train('--input=plato.txt --model_prefix=unigram_tokenizer --vocab_size=3000 --model_type=unigram')

``` {.python results=”output code” exports=”both”} import sentencepiece as spm sp = spm.SentencePieceProcessor() sp.Load(“unigram_tokenizer.model”) text = “Tokenization is essential first step for any NLP model.” pieces = sp.EncodeAsPieces(text) ids = sp.EncodeAsIds(text) print(“Pieces:”, pieces) print(“Piece IDs:”, ids) print(“Length: “, len(ids))


``` python
Pieces: ['▁To', 'k', 'en', 'iz', 'ation', '▁is', '▁essential', '▁first', '▁step', '▁for', '▁any', '▁', 'N', 'L', 'P', '▁model', '.']
Piece IDs: [436, 191, 128, 931, 141, 11, 1945, 123, 962, 39, 65, 17, 499, 1054, 1441, 1925, 8]
Length:  17

WordPiece Tokenization in BERT

``` {.python results=”output code”} from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(‘bert-base-uncased’) text = “Tokenization is essential first step for any NLP model.” encoded_input = tokenizer(text, return_tensors=’pt’)

print(“Tokens:”, tokenizer.convert_ids_to_tokens(encoded_input[‘input_ids’][0])) print(“Token IDs:”, encoded_input[‘input_ids’][0].tolist()) print(“Length: “, len(encoded_input[‘input_ids’][0].tolist())) ```

Summary of Tokenization Methods

Table 4

Method	Length	Tokens
BPE	14	[‘T’, ‘oken’, ‘ization’, ‘is’, ‘essential’, ‘first’, ‘step’, ‘for’, ‘any’, ‘N’, ‘L’, ‘P’, ‘model’, ‘.’]
SentencePiece	17	[‘▁To’, ‘k’, ‘en’, ‘iz’, ‘ation’, ‘▁is’, ‘▁essential’, ‘▁first’, ‘▁step’, ‘▁for’, ‘▁any’, ‘▁’, ‘N’, ‘L’, ‘P’, ‘▁model’, ‘.’]
WordPiece (BERT)	12	[‘token’, ‘##ization’, ‘is’, ‘essential’, ‘first’, ‘step’, ‘for’, ‘any’, ‘nl’, ‘##p’, ‘model’, ‘.’]

Different tokenization methods give different results for the same input sentence. As we add more data to the tokenizer training, the differences between WordPiece and SentencePiece might decrease, but they will not vanish, because of the difference in their training process.

Table 5

Model	Tokenization Method	Library	Key Features
GPT	Byte Pair Encoding	`tokenizers`	Balances vocabulary size and granularity
BERT	WordPiece	`transformers`	Efficient vocabulary, handles morphology
T5	Unigram Language Model	`sentencepiece`	Probabilistic, flexible across languages

Tokenization and Non English Languages

Tokenizing text is complex, especially when dealing with diverse languages and scripts. Various challenges can impact the effectiveness of tokenization.

Tokenization Issues with Complex Languages: With a focus on Tamil

Tokenizing text in languages like Tamil presents unique challenges due to their linguistic and script characteristics. Understanding these challenges is essential for developing effective NLP applications that handle Tamil text accurately.

Challenges in Tokenizing Tamil Language

1. Agglutinative Morphology

Tamil is an agglutinative language, meaning it forms words by concatenating morphemes (roots, suffixes, prefixes) to convey grammatical relationships and meanings. A single word may express what would be a full sentence in English.
Impact on Tokenization
- Words can be very lengthy and contain many morphemes.
  - போகமுடியாதவர்களுக்காவேயேதான்
2. Punarchi and Phonology

Tamil specific rules on how two words can be combined and resulting word may not be phonologically identical to its parts. The phonological transformations can cause problems with TTS/STT systems too.
Impact on Tokenization
- Surface forms of words may change when combined, making boundary detection challenging.
  - மரம் + வேர் > மரவேர்
  - தமிழ் + இனிது > தமிழினிது
3. Complex Script and Orthography

Tamil alphabet representation in Unicode is suboptimal for everything except for standardized storage format. Even simple operations that are intuitive for native Tamil speaker, are harder to implement because of this. Techniques like BPE applied on Tamil text will break words at completely inappropriate points like cutting an uyirmei letter into consonant and diacritic resulting in meaningless output.

தமிழ் > த ம ி ழ, ்

Strategies for Effective Tokenization of Tamil Text

Language-Specific Tokenizers

Train Tamil specific subword tokenizers with initial seed tokens prepared by better preprocessing techniques to avoid [problem-3]{.spurious-link target=”*3. Complex Script and Orthography”} type cases. Use morphological analyzers to decompose words into root and affixes, aiding in understanding and processing complex word forms.

Choosing the Right Tokenization Method

Challenges in Tokenization

Ambiguity: Words can have multiple meanings, and tokenizers cannot capture context. Example: The word "lead" can be a verb or a noun.
Handling Special Characters and Emojis: Modern text often includes emojis, URLs, and hashtags, which require specialized handling.
Multilingual Texts: Tokenizing text that includes multiple languages or scripts adds complexity, necessitating adaptable tokenization strategies.

Best Practices for Effective Tokenization

Understand Your Data: Analyze the text data to choose the most suitable tokenization method.
Consider the Task Requirements: Different NLP tasks may benefit from different tokenization granularities.
Use Pre-trained Tokenizers When Possible: Leveraging existing tokenizers associated with pre-trained models can save time and improve performance.
Normalize Text Before Tokenization: Cleaning and standardizing text

Vectordb

Malaikannan

Malaikannan Sankarasubbu

31 August 2024 at 00:00

Vector Databases 101

_Co-authored by Angu S KrishnaKumar, Kamal raj Kanagarajan and Malaikannan Sankarasubbu

Introduction

In the world of Large Language Models (LLMs), vector databases play a pivotal role in Retrieval Augmented Generation (RAG) applications.** These specialized databases are designed to store and retrieve high-dimensional vectors, which represent complex data structures like text, images, and audio. By leveraging vector databases, LLMs can access vast amounts of information and generate more informative and accurate responses. Retrieval Augmented Generation (RAG) is a technique that combines the power of large language models (LLMs) with external knowledge bases to generate more informative and accurate responses. By retrieving relevant information from a knowledge base and incorporating it into the LLM’s generation process, RAG can produce more comprehensive and contextually appropriate outputs.

How RAG Works:

User Query: A user submits a query or prompt to the RAG system.
Information Retrieval: The system retrieves relevant information from a knowledge base based on the query. VectorDBs play a key role in this. Embeddings aka vectors are stored in VectorDB and retrieval is done using similarity measures.
Language Model Generation: The retrieved information is fed into a language model, which generates a response based on the query and the retrieved context.

In this blog series, we will delve into the intricacies of vector databases, exploring their underlying principles, key features, and real-world applications. We will also discuss the advantages they offer over traditional databases and how they are transforming the way we store, manage, and retrieve data.

What is a Vector?

A vector is a sequence of numbers that forms a group. For example

(3) is a one dimensional vector.
(2,8) is a two dimensional vector.
(12,6,7,4) is a four dimensional vector.

A vector can be represented as by plotting on a graph. Lets take a 2D example

2D Plot

We can only visualize 3 dimensions, anything more than that you can just say it not visualize. Below is an example of 4 dimension vector representation of the word king

King Vector

What is a Vector Database?

A Vector Database (VectorDB) is a specialized database system designed to store, manage, and efficiently query high-dimensional vector data. Unlike traditional relational databases that work with structured data in tables, VectorDBs are optimized for handling vector embeddings – numerical representations of data in multi-dimensional space.

In a VectorDB:

Each item (like a document, image, or concept) is represented as a vector – a list of numbers that describe the item’s features or characteristics.
These vectors are stored in a way that allows for fast similarity searches and comparisons.
The database is optimized for operations like finding the nearest neighbors to a given vector, which is crucial for many AI and machine learning applications.

VectorDBs are particularly useful in scenarios where you need to find similarities or relationships between large amounts of complex data, such as in recommendation systems, image recognition, or natural language processing tasks.

Key Concepts

Vector Embeddings
- Vector embeddings are numerical representations of data in a multi-dimensional space.
- They capture semantic meaning and relationships between different pieces of information.
- In natural language processing, word embeddings are a common type of vector embedding. Each word is represented by a vector of real numbers, where words with similar meanings are closer in the vector space.
- For detail concepts of embedding please refer to earlier blog Embeddings

Let’s look at an example of Word Vector output generated by Word2Vec

from gensim.models import Word2Vec

# Example corpus (a list of sentences, where each sentence is a list of words)
sentences = [
    ["machine", "learning", "is", "fascinating"],
    ["gensim", "is", "a", "useful", "library", "for", "word", "embeddings"],
    ["vector", "representations", "are", "important", "for", "NLP", "tasks"]
]

# Train a Word2Vec model with 300-dimensional vectors
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=4)

# Get the 300-dimensional vector for a specific word
word_vector = model.wv['machine']

# Print the vector
print(f"Vector for 'machine': {word_vector}")

Sample Output for 300 dimension vector


Vector for 'machine': [ 2.41737941e-03 -1.42750892e-03 -4.85344668e-03  3.12493594e-03, 4.84531874e-03 -1.00165956e-03  3.41092921e-03 -3.41384278e-03, 4.22888929e-03  1.44586214e-03 -1.35438916e-03 -3.27448458e-03
  4.70721726e-03 -4.50850562e-03  2.64214014e-03 -3.29884756e-03, -3.13906092e-03  1.09677911e-03 -4.94637461e-03  3.32896863e-03,2.03538216e-03 -1.52456785e-03  2.28793684e-03 -1.43519988e-03, 4.34566711e-03 -1.94705374e-03  1.93231280e-03  4.34081139e-03
  ...
  3.40303702e-03  1.58637420e-03 -3.31261402e-03  2.01543484e-03,4.39879852e-03  2.54576413e-03 -3.30528596e-03  3.01509819e-03,2.15555660e-03  1.64605413e-03  3.02376228e-03 -2.62048110e-03
  3.80181967e-03 -3.14147812e-03  2.23554621e-03  2.68812295e-03,1.80951719e-03  1.74256027e-03 -2.47024545e-03  4.06702763e-03,2.30203426e-03 -4.75471295e-03 -3.66776927e-03  2.06539119e-03]

High Dimensional Space

Vector databases typically work with vectors that have hundreds or thousands of dimensions. This high dimensionality allows for rich and nuanced representations of data.
For example:
- A word might be represented by 300 dimensions
- An image could be represented by 1000 dimensions
- A user’s preferences might be captured in 500 dimensions

Why do you need a Vector Database when there is RDBMS like PostGreSQL or NoSQL DB like Elastic Search or MongoDB?

RDBMS

RDBMS are designed to store and manage structured data in a tabular format. They are based on the relational model, which defines data as a collection of tables, where each table represents a relation.

Key components of RDBMS:

Tables: A collection of rows and columns, where each row represents a record and each column represents an attribute.
Rows: Also known as records, they represent instances of an entity.
Columns: Also known as attributes, they define the properties of an entity.
Primary key: A unique identifier for each row in a table.
Foreign key: A column in one table that references the primary key of another table, establishing a relationship between the two tables.
Normalization: A process of organizing data into tables to minimize redundancy and improve data integrity.

Why RDBMS don’t apply to storing vectors:

Data Representation:
- RDBMS store data in a tabular format, where each row represents an instance of an entity and each column represents an attribute.
- Vectors are represented as a sequence of numbers, which doesn’t fit well into the tabular structure of RDBMS.
Query Patterns:
- RDBMS are optimized for queries based on joining tables and filtering data based on specific conditions.
- Vector databases are optimized for similarity search, which involves finding vectors that are closest to a given query vector. This type of query doesn’t align well with the traditional join-based queries of RDBMS.
Data Relationships:
- RDBMS define relationships between entities using foreign keys and primary keys.
- In vector databases, relationships are implicitly defined by the proximity of vectors in the vector space. There’s no explicit need for foreign keys or primary keys.
Performance Considerations:
- RDBMS are often optimized for join operations and range queries.
- Vector databases are optimized for similarity search, which requires efficient indexing and partitioning techniques.

Let’s also look at a table for a comparison of features

Feature	VectorDB	RDBMS
Dimensional Efficiency	Designed to handle high-dimensional data efficiently	Performance degrades rapidly as dimensions increase
Similarity Search	Implement specialized algorithms for fast approximate nearest neighbor (ANN) searches	Lack native support for ANN algorithms, making similarity searches slow and computationally expensive
Indexing for Vector Spaces	Use index structures optimized for vector data (e.g., HNSW, IVF)	Rely on B-trees and hash indexes, which become ineffective in high-dimensional spaces
Vector Operations	Provide built-in, optimized support for vector operations	Require complex, often inefficient SQL queries to perform vector computations
Scalability for Vector Data	Designed to distribute vector data and parallelize similarity searches across multiple nodes efficiently	While scalable for traditional data, they’re not optimized for distributing and querying vector data at scale
Real-time Processing	Optimized for fast insertions and queries of vector data, supporting real-time applications	May struggle with the speed requirements of real-time vector processing, especially at scale
Storage Efficiency	Use compact, specialized formats for storing dense vector data	Less efficient when storing high-dimensional vectors, often requiring more space and slower retrieval
Machine Learning Integration	Seamlessly integrate with ML workflows, supporting operations common in AI applications	Require additional processing and transformations to work effectively with ML pipelines
Approximate Query Support	Often support approximate queries, trading off some accuracy for significant speed improvements	Primarily designed for exact queries, lacking native support for approximate vector searches

In a nutshell, RDBMS are well-suited for storing and managing structured data, but they are not optimized for storing and querying vectors. Vector databases, on the other hand, are specifically designed for handling vectors and performing similarity search operations.

NoSQL Databases

NoSQL databases are designed to handle large datasets and unstructured or semi-structured data that don’t fit well into the relational model. They offer flexibility in data structures, scalability, and high performance.

Common types of NoSQL databases include:

Key-value stores: Store data as key-value pairs.
Document stores: Store data as documents, often in JSON or BSON format.
Wide-column stores: Store data in wide columns, where each column can have multiple values.
Graph databases: Store data as nodes and relationships, representing connected data.

Key characteristics of NoSQL databases:

Flexibility: NoSQL databases offer flexibility in data structures, allowing for dynamic schema changes and accommodating evolving data requirements.
Scalability: Many NoSQL databases are designed to scale horizontally, allowing for better performance and scalability as data volumes grow.
High performance: NoSQL databases often provide high performance, especially for certain types of workloads.
Eventual consistency: Some NoSQL databases prioritize availability and performance over strong consistency, offering eventual consistency guarantees.

Why NoSQL Databases Might Not Be Ideal for Storing and Retrieving Vectors

While NoSQL databases offer many advantages, they might not be the best choice for storing and retrieving vectors due to the following reasons:

Data Representation: NoSQL databases, while flexible, might not be specifically optimized for storing and querying high-dimensional vectors. The data structures used in NoSQL databases might not be the most efficient for vector-based operations.
Query Patterns: NoSQL databases are often designed for different query patterns than vector-based operations. While they can handle complex queries, they might not be as efficient for similarity search, which is a core operation for vector databases.
Performance Considerations:
- Indexing: NoSQL databases often use different indexing techniques than RDBMS. While they can be efficient for certain types of queries, they might not be as optimized for vector-based similarity search.
- Memory requirements: For vector-based operations, especially in large-scale applications, the memory requirements can be significant. NoSQL databases like Elasticsearch, which are often used for full-text search and analytics, might require substantial memory resources to handle large vector datasets efficiently.

Elasticsearch as an Example:

Elasticsearch is a popular NoSQL database often used for full-text search and analytics. While it can be used to store and retrieve vectors, there are some considerations:

Memory requirements: Storing and indexing large vector datasets in Elasticsearch can be memory-intensive, especially for high-dimensional vectors.
Query performance: The performance of vector-based queries in Elasticsearch can depend on factors like the number of dimensions, the size of the dataset, and the indexing strategy used.
Specialized plugins: Elasticsearch offers plugins like the knn plugin that can be used to optimize vector-based similarity search. However, these plugins might have additional performance and memory implications.

In a nutshell, while NoSQL databases offer many advantages, their suitability for storing and retrieving vectors depends on specific use cases and requirements. For applications that heavily rely on vector-based similarity search and require high performance, specialized vector databases might be a more appropriate choice.

A Deeper Dive into Similarity Search in Vector Databases

Similarity search is a fundamental operation in vector databases, involving finding the closest matches to a given query vector from a large dataset of vectors. This is crucial for applications like recommendation systems, image search, and natural language processing.

Similarity measures, algorithms, and data structures are crucial for efficient similarity search. Similarity measures (e.g., cosine, Euclidean) quantify the closeness between vectors. Algorithms (e.g., brute force, LSH, HNSW) determine how vectors are compared and retrieved. Data structures (e.g., inverted indexes, hierarchical graphs) optimize storage and retrieval. The choice of these components depends on factors like dataset size, dimensionality, and desired accuracy. By selecting appropriate measures, algorithms, and data structures, you can achieve efficient and accurate similarity search in various applications. Let’s look in details about the different similarity measures and algorithms/datastructures in the below section.

Understanding Similarity Measures

Cosine Similarity: Measures the angle between two vectors. It’s suitable when the magnitude of the vectors doesn’t matter (e.g., document similarity based on word counts).


import numpy as np

def cosine_similarity(v1, v2):
    """Calculates the cosine similarity between two vectors.

    Args:
        v1: The first vector.
        v2: The second vector.

    Returns:
        The cosine similarity between the two vectors.
    """

    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)

    return dot_product / (norm_v1 * norm_v2)

# Example usage
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
similarity = cosine_similarity(vector1, vector2)
print(similarity)

Euclidean Distance: Measures the straight-line distance between two points in Euclidean space. It’s suitable when the magnitude of the vectors is important (e.g., image similarity based on pixel values).


import numpy as np

def euclidean_distance(v1, v2):
    """Calculates the Euclidean distance between two vectors.

    Args:
        v1: The first vector.
        v2: The second vector.

    Returns:
        The Euclidean distance between the two vectors.
    """

    return np.linalg.norm(v1 - v2)

# Example usage
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
distance = euclidean_distance(vector1, vector2)
print(distance)

Hamming Distance: Measures the number of positions where two binary vectors differ. It’s useful for comparing binary data.

import numpy as np

def hamming_distance(v1, v2):
    """Calculates the Hamming distance between two binary vectors.

    Args:
        v1: The first binary vector.
        v2: The second binary vector.

    Returns:
        The Hamming distance between the two vectors.
    """

    return np.sum(v1 != v2)

# Example usage
vector1 = np.array([0, 1, 1, 0])
vector2 = np.array([1, 1, 0, 1])
distance = hamming_distance(vector1, vector2)
print(distance)

Manhattan Distance: Also known as L1 distance, it measures the sum of absolute differences between corresponding elements of two vectors.

import numpy as np

def manhattan_distance(v1, v2):
    """Calculates the Manhattan distance between two vectors.

    Args:
        v1: The first vector.
        v2: The second vector.

    Returns:
        The Manhattan distance between the two vectors.
    """

    return np.sum(np.abs(v1 - v2))

# Example usage
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
distance = manhattan_distance(vector1, vector2)
print(distance)

Algorithms and Data Structures

Brute Force: is a straightforward but computationally expensive algorithm for finding the nearest neighbors in a dataset. It involves comparing the query vector with every other vector in the dataset to find the closest matches.

How Brute Force Works

Iterate through the dataset: For each vector in the dataset, calculate its distance to the query vector.
Maintain a list of closest neighbors: Keep track of the closest vectors found so far.
Update the list: If the distance between the current vector and the query vector is smaller than the distance to the farthest neighbor in the list, replace the farthest neighbor with the current vector.
Repeat: Continue this process until all vectors in the dataset have been compared.

Advantages and Disadvantages

Advantages:
- Simple to implement.
- Guaranteed to find the exact nearest neighbors.
Disadvantages:
- Extremely slow for large datasets.
- Inefficient for high-dimensional data.

Python Code Example

import numpy as np

def brute_force_search(query_vector, vectors, k=10):
    """Performs brute force search for the nearest neighbors.

    Args:
        query_vector: The query vector.
        vectors: The dataset of vectors.
        k: The number of nearest neighbors to find.

    Returns:
        A list of indices of the nearest neighbors.
    """

    distances = np.linalg.norm(vectors - query_vector, axis=1)
    nearest_neighbors = np.argsort(distances)[:k]
    return nearest_neighbors

# Example usage
query_vector = np.random.rand(128)
vectors = np.random.rand(1000, 128)
nearest_neighbors = brute_force_search(query_vector, vectors, k=10)

Brute Force is generally not suitable for large datasets or high-dimensional data due to its computational complexity. For these scenarios, more efficient algorithms like LSH, HNSW, or IVF-Flat are typically used. However, it can be useful for small datasets or as a baseline for comparison with other algorithms.

Locality Sensitive Hashing (LSH): is a technique used to efficiently find similar items in large datasets. It works by partitioning the vector space into buckets and hashing similar vectors into the same bucket. This makes it possible to quickly find approximate nearest neighbors without having to compare every vector in the dataset.

How LSH Works

Hash Function Selection: Choose a hash function that is sensitive to the distance between vectors. This means that similar vectors are more likely to be hashed into the same bucket.
Hash Table Creation: Create multiple hash tables, each using a different hash function.
Vector Hashing: For each vector, hash it into each hash table.
Query Processing: When a query vector is given, hash it into each hash table.
Candidate Selection: Retrieve all vectors that are in the same buckets as the query vector.
Similarity Calculation: Calculate the actual similarity between the query vector and the candidate vectors.

LSH Families

Random Projection: Projects vectors onto random hyperplanes.
MinHash: Used for comparing sets of items.
SimHash: Used for comparing documents based on their shingles.

LSH Advantages and Disadvantages

Advantages:
- Efficient for large datasets.
- Can be used for approximate nearest neighbor search.
- Can be parallelized.
Disadvantages:
- Can introduce false positives or negatives.
- Accuracy can be affected by the choice of hash functions and the number of hash tables.

Python Code Example using Annoy

from annoy import Annoy

# Create an Annoy index with LSH
annoy_index = Annoy(128, metric='angular', n_trees=10)

# Add vectors to the index
for i in range(1000):
    vector = np.random.rand(128)
    annoy_index.add_item(i, vector)

# Build the index
annoy_index.build()

# Search for nearest neighbors
query_vector = np.random.rand(128)
nns = annoy_index.get_nns_by_vector(query_vector, 10)

Note: The n_trees parameter in Annoy determines the number of hash tables used. A larger number of trees generally improves accuracy but can increase memory usage.

By understanding the fundamentals of LSH and carefully selecting the appropriate parameters, you can effectively use it for similarity search in your applications.

Hierarchical Navigable Small World (HNSW): is a highly efficient algorithm for approximate nearest neighbor search in high-dimensional spaces. It constructs a hierarchical graph structure that allows for fast and accurate retrieval of similar items.

How HNSW Works

Initialization: The algorithm starts by creating a single layer with all data points.
Layer Creation: New layers are added iteratively. Each new point is connected to a subset of existing points based on their distance.
Hierarchical Structure: The layers form a hierarchical structure, with higher layers having fewer connections and lower layers having more connections.
Search: To find the nearest neighbors of a query point, the search starts from the top layer and gradually moves down the hierarchy, following the connections to find the most promising candidates.

Advantages of HNSW

High Accuracy: HNSW often achieves high accuracy, even for high-dimensional data.
Efficiency: It is very efficient for large datasets and can handle dynamic updates.
Flexibility: The algorithm can be adapted to different distance metrics and data distributions.

Python Code Example using NMSLIB

from nmslib import NMSLIB

# Create an HNSW index
nmslib_index = NMSLIB.init(method='hnsw', space='cos')

# Add vectors to the index
nmslib_index.addDataPointBatch(vectors)

# Create the index
nmslib_index.createIndex()

# Search for nearest neighbors
query_vector = np.random.rand(128)
knn = nmslib_index.knnQuery(query_vector, k=10)

Note: The space parameter in NMSLIB specifies the distance metric used (e.g., cos for cosine similarity). You can also customize other parameters like the number of layers and the number of connections per layer to optimize performance for your specific application.

HNSW is a powerful algorithm for approximate nearest neighbor search, offering a good balance between accuracy and efficiency. It’s particularly well-suited for high-dimensional data and can be used in various applications, such as recommendation systems, image search, and natural language processing.

IVF-Flat: is a hybrid indexing technique that combines Inverted File (IVF) and Flat Hierarchical Indexing (Flat) to efficiently perform approximate nearest neighbor search (ANN) in high-dimensional vector spaces. It’s particularly effective for large datasets and high-dimensional vectors.

How IVF-Flat Works

Quantization: The dataset is divided into n quantized subspaces (quantization cells). Each vector is assigned to a cell based on its similarity to a representative point (centroid) of the cell.
Inverted File: An inverted index is created, where each quantized cell is associated with a list of vectors belonging to that cell.
Flat Index: For each quantized cell, a flat index (e.g., a linear scan or a tree-based structure) is built to store the vectors assigned to that cell.
Query Processing: When a query vector is given, it’s first quantized to find the corresponding cell. Then, the flat index for that cell is searched for the nearest neighbors.
Refinement: The top candidates from the flat index can be further refined using exact nearest neighbor search or other techniques to improve accuracy.

Advantages of IVF-Flat

Efficiency: IVF-Flat can be significantly faster than brute-force search for large datasets.
Accuracy: It can achieve good accuracy, especially when combined with refinement techniques.
Scalability: It can handle large datasets and high-dimensional vectors.
Flexibility: The number of quantized cells and the type of flat index can be adjusted to balance accuracy and efficiency.

Python Code Example using Faiss

import faiss

# Create an IVF-Flat index
index = faiss.IndexIVFFlat(faiss.IndexFlatL2(dim), nlist, nprobe)

# Add vectors to the index
index.add(vectors)

# Search for nearest neighbors
query_vector = np.random.rand(dim)
distances, indices = index.search(query_vector, k)

In this example:

dim is the dimensionality of the vectors.
nlist is the number of quantized cells.
nprobe is the number of cells to query during search.

IVF-Flat is a powerful technique for approximate nearest neighbor search in vector databases, offering a good balance between efficiency and accuracy. By carefully tuning the parameters, you can optimize its performance for your specific application.

ScanNN: is a scalable and efficient approximate nearest neighbor search algorithm designed for large-scale datasets. It combines inverted indexes with quantization techniques to achieve high performance.

How ScanNN Works

Quantization: The dataset is divided into quantized subspaces (quantization cells). Each vector is assigned to a cell based on its similarity to a representative point (centroid) of the cell.
Inverted Index: An inverted index is created, where each quantized cell is associated with a list of vectors belonging to that cell.
Scan: During query processing, the query vector is quantized to find the corresponding cell. Then, the vectors in that cell are scanned to find the nearest neighbors.
Refinement: The top candidates from the scan can be further refined using exact nearest neighbor search or other techniques to improve accuracy.

Advantages of ScanNN

Scalability: ScanNN can handle large datasets and high-dimensional vectors efficiently.
Efficiency: It uses inverted indexes to reduce the search space, making it faster than brute-force search.
Accuracy: ScanNN can achieve good accuracy, especially when combined with refinement techniques.
Flexibility: The number of quantized cells and the refinement strategy can be adjusted to balance accuracy and efficiency.

Python Code Example using Faiss

import faiss

# Create a ScanNN index
index = faiss.IndexScanNN(faiss.IndexFlatL2(dim), nlist, nprobe)

# Add vectors to the index
index.add(vectors)

# Search for nearest neighbors
query_vector = np.random.rand(dim)
distances, indices = index.search(query_vector, k)

In this example:

dim is the dimensionality of the vectors.
nlist is the number of quantized cells.
nprobe is the number of cells to query during search.

ScanNN is a powerful algorithm for approximate nearest neighbor search in large-scale applications. It offers a good balance between efficiency and accuracy, making it a popular choice for various tasks, such as recommendation systems, image search, and natural language processing.

Disk-ANN: is a scalable approximate nearest neighbor search algorithm designed for very large datasets that don’t fit entirely in memory. It combines inverted files with on-disk storage to efficiently handle large-scale vector search.

How Disk-ANN Works

Quantization: The dataset is divided into quantized subspaces (quantization cells), similar to IVF-Flat.
Inverted Index: An inverted index is created, where each quantized cell is associated with a list of vectors belonging to that cell.
On-Disk Storage: The inverted index and the vectors themselves are stored on disk, allowing for efficient handling of large datasets.
Query Processing: When a query vector is given, it’s quantized to find the corresponding cell. The inverted index is used to retrieve the vectors in that cell from disk.
Refinement: The retrieved vectors can be further refined using exact nearest neighbor search or other techniques to improve accuracy.

Advantages of Disk-ANN

Scalability: Disk-ANN can handle extremely large datasets that don’t fit in memory.
Efficiency: It uses inverted indexes and on-disk storage to optimize performance for large-scale search.
Accuracy: Disk-ANN can achieve good accuracy, especially when combined with refinement techniques.
Flexibility: The number of quantized cells and the refinement strategy can be adjusted to balance accuracy and efficiency.

Python Code Example using Faiss

import faiss

# Create a Disk-ANN index
index = faiss.IndexDiskANN(faiss.IndexFlatL2(dim), filename, nlist, nprobe)

# Add vectors to the index
index.add(vectors)

# Search for nearest neighbors
query_vector = np.random.rand(dim)
distances, indices = index.search(query_vector, k)

In this example:

filename is the path to the disk file where the index will be stored.
Other parameters are the same as in IVF-Flat.

Disk-ANN is a powerful algorithm for approximate nearest neighbor search in very large datasets. It provides a scalable and efficient solution for handling massive amounts of data while maintaining good accuracy.

Vector Database Comparison: Features, Use Cases, and Selection Guide

Just like in RDBMS or NOSQL world there are lot of choices for different databases, Vector Databases also have quite a bit choices, choosing the right one for your application matters quite a bit. Below table compares key features, use-cases and a selection guide

VectorDB	Key Features	Best For	When to Choose
Pinecone	Fully managed service, Real-time updates, Hybrid search (vector + metadata), Serverless	Production-ready applications, Rapid development, Scalable solutions	When you need a fully managed solution, For applications requiring real-time updates, When combining vector search with metadata filtering
Milvus	Open-source, Scalable to billions of vectors, Supports multiple index types, Hybrid search capabilities	Large-scale vector search, On-premises deployments, Customizable solutions	When you need an open-source solution, for very large-scale vector search applications, When you require fine-grained control over indexing
Qdrant	Open-source, Rust-based for high performance, Supports filtering with payload, On-prem and cloud options	High-performance vector search, Applications with complex filtering needs	When performance is critical, for applications requiring advanced filtering, When you need both cloud and on-prem options
Weaviate	Open-source, GraphQL API, Multi-modal data support, AI-first database	Semantic search applications, Multi-modal data storage and retrieval	When working with multiple data types (text, images, etc.), If you prefer GraphQL for querying, for AI-centric applications
Faiss (Facebook AI Similarity Search)	Open-source, Highly efficient for dense vectors, GPU support	Research and experimentation, Large-scale similarity search	When you need low-level control, for integration into custom systems, When GPU acceleration is beneficial
Elasticsearch with vector search	Full-text search + vector capabilities, Mature ecosystem and extensive analytics features	Applications combining traditional search and vector search	when you need rich text analytics, When you’re already using Elasticsearch, For hybrid search applications (text + vector), When you need advanced analytics alongside vector search
pgvector (PostgreSQL extension)	Vector similarity search in PostgreSQL, Integrates with existing PostgreSQL databases	Adding vector capabilities to existing PostgreSQL systems, Small to medium-scale applications	When you’re already heavily invested in PostgreSQL, for projects that don’t require specialized vector DB features, When simplicity and familiarity are priorities
Vespa	Open-source, Combines full-text search, vector search, and structured data, Real-time indexing and serving	Complex search and recommendation systems, Applications requiring structured, text, and vector data	For large-scale, multi-modal search applications, When you need a unified platform for different data types, For real-time, high-volume applications
AWS OpenSearch	Fully managed AWS service, Combines traditional full-text search capabilities with vector-based similarity search.	When you need to search for both text-based content and vectors.	When you need to perform real-time searches and analytics on large datasets. When you want to leverage the broader AWS ecosystem for your application. For applications that require processing billions of vectors.

Conclusion

For my previous startup that I (Malaikannan Sankarasubbu) founded Datalog dot ai doing a low code virtual assistant platform, we heavily leveraged FAISS to do Intent similarity, from that point to now there are quite a few options for Vector Databases

Vector databases have emerged as a powerful tool for handling unstructured and semi-structured data, offering efficient similarity search capabilities and supporting a wide range of applications. By understanding the fundamentals of vector databases, including similarity measures, algorithms, and data structures, you can select the right approach for your specific needs.

In future blog posts, we will delve deeper into performance considerations, including indexing techniques, hardware optimization, and best practices for scaling vector databases. We will also explore real-world use cases and discuss the challenges and opportunities that lie ahead in the field of vector databases.

Chunking

Malaikannan

Malaikannan Sankarasubbu

5 August 2024 at 00:00

Breaking Down Data: The Science and Art of Chunking in Text Processing & RAG Pipeline

As the field of Natural Language Processing (NLP) continues to evolve, the combination of retrieval-based and generative models has emerged as a powerful approach for enhancing various NLP applications. One of the key techniques that significantly improves the efficiency and effectiveness of Retrieval-Augmented Generation (RAG) is chunking. In this blog, we will explore what chunking is, why it is important in RAG, the different ways to implement chunking, including content-aware and recursive chunking, how to evaluate the performance of chunking, chunking alternatives, and how it can be applied to optimize NLP systems.

Chunking

What is Retrieval-Augmented Generation (RAG)?

Before diving into chunking, let’s briefly understand RAG. Retrieval-Augmented Generation is a framework that combines the strengths of retrieval-based models and generative models. It involves retrieving relevant information from a large corpus based on a query and using this retrieved information as context for a generative model to produce accurate and contextually relevant responses or content.

What is Chunking?

Chunking is the process of breaking down large text documents or datasets into smaller, manageable pieces, or “chunks.” These chunks can then be individually processed, indexed, and retrieved, making the overall system more efficient and effective. Chunking helps in dealing with large volumes of text by dividing them into smaller, coherent units that are easier to handle.

Why Do We Need Chunking?

Chunking is essential in RAG for several reasons:

Efficiency

Computational cost: Processing smaller chunks of text requires less computational power compared to handling entire documents.
Storage: Chunking allows for more efficient storage and indexing of information.

Accuracy

Relevance: By breaking down documents into smaller units, it’s easier to identify and retrieve the most relevant information for a given query.
Context preservation: Careful chunking can help maintain the original context of the text within each chunk.

Speed

Retrieval time: Smaller chunks can be retrieved and processed faster, leading to quicker response times.
Model processing: Language models can process smaller inputs more efficiently.

Limitations of Large Language Models

Context window: LLMs have limitations on the amount of text they can process at once. Chunking helps to overcome this limitation.

In essence, chunking optimizes the RAG process by making it more efficient, accurate, and responsive.

Different Ways to Implement Chunking

There are various methods to implement chunking, depending on the specific requirements and structure of the text data. Here are some common approaches:

Fixed-Length Chunking: Divide the text into chunks of fixed length, typically based on a predetermined number of words or characters.

def chunk_text_fixed_length(text, chunk_size=200, by='words'):
    if by == 'words':
        words = text.split()
        return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    elif by == 'characters':
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    else:
        raise ValueError("Parameter 'by' must be either 'words' or 'characters'.")

text = "The process is more important than the results. And if you take care of the process, you will get the results."
word_chunks = chunk_text_fixed_length(text, chunk_size=5, by='words')  
character_chunks = chunk_text_fixed_length(text, chunk_size=5, by='characters')

print(word_chunks)
['The process is more important', 'than the results. And if', 'you take care of the', 'process, you will get the', 'results.']

print(character_chunks)
['The p', 'roces', 's is ', 'more ', 'impor', 'tant ', 'than ', 'the r', 'esult', 's. An', 'd if ', 'you t', 'ake c', 'are o', 'f the', ' proc', 'ess, ', 'you w', 'ill g', 'et th', 'e res', 'ults.']

Sentence-Based Chunking: Split the text into chunks based on complete sentences. This method ensures that each chunk contains coherent and complete thoughts.

import nltk
nltk.download('punkt')
   
def chunk_text_sentences(text, max_sentences=3):
    sentences = nltk.sent_tokenize(text)
    return [' '.join(sentences[i:i + max_sentences]) for i in range(0, len(sentences), max_sentences)]

text = """Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence. It deals with the interaction between computers and humans through natural language. NLP techniques are used to apply algorithms to identify and extract the natural language rules such that 
the unstructured language data is converted into a form that computers can understand. Text mining and text classification are common applications of NLP. It's a powerful tool in the modern data-driven world."""

sentence_chunks = chunk_text_sentences(text, max_sentences=2)
   
   
for i, chunk in enumerate(sentence_chunks, 1):
    print(f"Chunk {i}:\n{chunk}\n")

Chunk 1:
Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence. It deals with the interaction between computers and humans through natural language.
   
Chunk 2:
NLP techniques are used to apply algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand. Text mining and text classification are common applications of NLP.
   
Chunk 3:
It's a powerful tool in the modern data-driven world.

Paragraph-Based Chunking: Divide the text into chunks based on paragraphs. This approach is useful when the text is naturally structured into paragraphs that represent distinct sections or topics.

def chunk_text_paragraphs(text):
    paragraphs = text.split('\n\n')
    return [paragraph for paragraph in paragraphs if paragraph.strip()]

paragraph_chunks = chunk_text_paragraphs(text)
   
   
for i, chunk in enumerate(paragraph_chunks, 1):
    print(f"Paragraph {i}:\n{chunk}\n")

Paragraph 1:
Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence.
   
Paragraph 2:
It deals with the interaction between computers and humans through natural language.
   
Paragraph 3:
NLP techniques are used to apply algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand.
   
Paragraph 4:
Text mining and text classification are common applications of NLP. It's a powerful tool in the modern data-driven world.

Thematic or Semantic Chunking: Use NLP techniques to identify and group related sentences or paragraphs into chunks based on their thematic or semantic content. This can be done using topic modeling or clustering algorithms.

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
   
nltk.download('punkt')
   
def chunk_text_thematic(text, n_clusters=5):
    sentences = nltk.sent_tokenize(text)
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(sentences)
    kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
    clusters = kmeans.predict(X)
       
    chunks = [[] for _ in range(n_clusters)]
    for i, sentence in enumerate(sentences):
        chunks[clusters[i]].append(sentence)
       
    return [' '.join(chunk) for chunk in chunks]
   
   
   
thematic_chunks = chunk_text_thematic(text, n_clusters=3)
   
   
for i, chunk in enumerate(thematic_chunks, 1):
    print(f"Chunk {i}:\n{chunk}\n")

Sliding Window Chunking: Use a sliding window approach to create overlapping chunks. This method ensures that important information near the boundaries of chunks is not missed.

 def chunk_text_sliding_window(text, chunk_size=200, overlap=50, unit='word'):
 """Chunks text using a sliding window.

 Args:
     text: The input text.
     chunk_size: The desired size of each chunk.
     overlap: The overlap between consecutive chunks.
     unit: The chunking unit ('word', 'char', or 'token').

 Returns:
     A list of text chunks.
 """

 if unit == 'word':
     data = text.split()
 elif unit == 'char':
     data = text
 else:
     # Implement tokenization for other units
     pass

 chunks = []
 for i in range(0, len(data), chunk_size - overlap):
     if unit == 'word':
     chunk = ' '.join(data[i:i+chunk_size])
     else:
     chunk = data[i:i+chunk_size]
     chunks.append(chunk)

 return chunks

Content-Aware Chunking: This advanced method involves using more sophisticated NLP techniques to chunk the text based on its content and structure. Content-aware chunking can take into account factors such as topic continuity, coherence, and discourse markers. It aims to create chunks that are not only manageable but also meaningful and contextually rich.

Example of Content-Aware Chunking using Sentence Transformers:

from sentence_transformers import SentenceTransformer, util

def content_aware_chunking(text, max_chunk_size=200):
    model = SentenceTransformer('all-MiniLM-L6-v2')
    sentences = nltk.sent_tokenize(text)
    embeddings = model.encode(sentences, convert_to_tensor=True)
    clusters = util.community_detection(embeddings, min_community_size=1)
       
    chunks = []
    for cluster in clusters:
        chunk = ' '.join([sentences[i] for i in cluster])
        if len(chunk.split()) <= max_chunk_size:
            chunks.append(chunk)
        else:
            sub_chunks = chunk_text_fixed_length(chunk, max_chunk_size)
            chunks.extend(sub_chunks)
       
    return chunks

Recursive Chunking: Recursive chunking involves repeatedly breaking down chunks into smaller sub-chunks until each chunk meets a desired size or level of detail. This method ensures that very large texts are reduced to manageable and meaningful units at each level of recursion, making it easier to process and retrieve information.

Example of Recursive Chunking: ```python def recursive_chunk(text, max_chunk_size): “"”Recursively chunks text into smaller chunks.

Args: text: The input text. max_chunk_size: The maximum desired chunk size.

Returns: A list of text chunks. “””

if len(text) <= max_chunk_size: return [text]

# Choose a splitting point based on paragraphs, sentences, or other criteria # For example: paragraphs = text.split(‘\n\n’) if len(paragraphs) > 1: chunks = [] for paragraph in paragraphs: chunks.extend(recursive_chunk(paragraph, max_chunk_size)) return chunks else: # Handle single paragraph chunking, e.g., by sentence splitting # …

# …


8. **Agentic Chunking**: Agent chunking is a sophisticated technique that involves using an LLM to dynamically determine chunk boundaries based on the content and context of the text. Below is an example of a prompt example for Agentic Chunking 

**Example Prompt**:

``` Prompt
<|begin_of_text|><|start_header_id|>system<|end_header_id|> 
## You are an agentic chunker. You will be provided with a content. 
Decompose the content into clear and simple propositions, ensuring they are interpretable out of context. 
1. Split compound sentence into simple sentences. Maintain the original phrasing from the input whenever possible. 
2. For any named entity that is accompanied by additional descriptive informaiton separate this information into its own distinct proposition.
3. Decontextualize proposition by adding necessary modifier to nouns or entire sentence and replacing pronouns (e.g. it, he, she, they, this, that) with the full name of the entities they refer to.
4. Present the results as list of strings, formatted in JSON 
<|eot_id|><|start_header_id|>user<|end_header_id|>
Here is the content : {content}
strictly follow the instructions provided and output in the desired format only.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Chunk Size and Overlapping in Chunking

Determining the appropriate chunk size and whether to use overlapping chunks are critical decisions in the chunking process. These factors significantly impact the efficiency and effectiveness of the retrieval and generation stages in RAG systems.

Chunk Size

Choosing Chunk Size: The ideal chunk size depends on the specific application and the nature of the text. Smaller chunks can provide more precise context but may miss broader information, while larger chunks capture more context but may introduce noise or irrelevant information.
- Small Chunks: Typically 100-200 words. Suitable for fine-grained retrieval where specific details are crucial.
- Medium Chunks: Typically 200-500 words. Balance between detail and context, suitable for most applications.
- Large Chunks: Typically 500-1000 words. Useful for capturing broader context but may be less precise.
Impact of Chunk Size: The chunk size affects the retrieval accuracy and computational efficiency. Smaller chunks generally lead to higher retrieval precision but may require more chunks to cover the same amount of text, increasing computational overhead. Larger chunks reduce the number of chunks but may lower retrieval precision.

Overlapping Chunks

Purpose of Overlapping: Overlapping chunks ensure that important information near the boundaries of chunks is not missed. This approach is particularly useful when the text has high semantic continuity, and critical information may span across chunk boundaries.
Degree of Overlap: The overlap size should be carefully chosen to balance redundancy and completeness. Common overlap sizes range from 10% to 50% of the chunk size.
- Small Overlap: 10-20% of the chunk size. Minimizes redundancy but may still miss some boundary information.
- Medium Overlap: 20-30% of the chunk size. Good balance between coverage and redundancy.
- Large Overlap: 30-50% of the chunk size. Ensures comprehensive coverage but increases redundancy and computational load.

Example of Overlapping Chunking:

def chunk_text_sliding_window(text, chunk_size=200, overlap=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = words[i:i + chunk_size]
        chunks.append(' '.join(chunk))
    return chunks

Evaluating the Performance of Chunking

Evaluating the performance of chunking is crucial to ensure that the chosen method effectively enhances the retrieval and generation processes. Here are some key metrics and approaches for evaluating chunking performance:

Retrieval Metrics

Precision@K: Measures the proportion of relevant chunks among the top K retrieved chunks.

def precision_at_k(retrieved_chunks, relevant_chunks, k):
    return len(set(retrieved_chunks[:k]) & set(relevant_chunks)) / k

Recall@K: Measures the proportion of relevant chunks retrieved among the top K chunks.

def recall_at_k(retrieved_chunks, relevant_chunks, k):
    return len(set(retrieved_chunks[:k]) & set(relevant_chunks)) / len(relevant_chunks)

F1 Score: Harmonic mean of Precision@K and Recall@K, providing a balance between precision and recall.

def f1_score_at_k(precision, recall):
    if precision + recall == 0:
        return 0
    return 2 * (precision * recall) / (precision + recall)

MAP : Mean Average Precision (MAP) is primarily used in information retrieval and object detection tasks to evaluate the ranking of retrieved items

 import numpy as np

 def calculate_ap(y_true, y_score):
 """Calculates average precision for a single query.

 Args:
     y_true: Ground truth labels (0 or 1).
     y_score: Predicted scores.

 Returns:
     Average precision.
 """

 # Sort y_score and corresponding y_true in descending order
 y_score, y_true = zip(*sorted(zip(y_score, y_true), key=lambda x: x[0], reverse=True))

 correct_hits = 0
 sum_precision = 0
 for i, y in enumerate(y_true):
     if y == 1:
     correct_hits += 1
     precision = correct_hits / (i + 1)
     sum_precision += precision
 return sum_precision / sum(y_true)

 def calculate_map(y_true, y_score):
 """Calculates mean average precision.

 Args:
     y_true: Ground truth labels (list of lists).
     y_score: Predicted scores (list of lists).

 Returns:
     Mean average precision.
 """

 aps = []
 for i in range(len(y_true)):
     ap = calculate_ap(y_true[i], y_score[i])
     aps.append(ap)
 return np.mean(aps)

NDCG: NDCG is a metric used to evaluate the quality of a ranking of items. It measures how well the most relevant items are ranked at the top of the list. In the context of chunking, we can potentially apply NDCG by ranking chunks based on a relevance score and evaluating how well the most relevant chunks are placed at the beginning of the list.

import numpy as np

def calculate_dcg(rel):
  """Calculates Discounted Cumulative Gain (DCG).

  Args:
    rel: Relevance scores of items.

  Returns:
    DCG value.
  """

  return np.sum(rel / np.log2(np.arange(len(rel)) + 2))

def calculate_idcg(rel):
  """Calculates Ideal Discounted Cumulative Gain (IDCG).

  Args:
    rel: Relevance scores of items.

  Returns:
    IDCG value.
  """

  rel = np.sort(rel)[::-1]
  return calculate_dcg(rel)

def calculate_ndcg(rel):
  """Calculates Normalized Discounted Cumulative Gain (NDCG).

  Args:
    rel: Relevance scores of items.

  Returns:
    NDCG value.
  """

  dcg = calculate_dcg(rel)
  idcg = calculate_idcg(rel)
  return dcg / idcg

# Example usage
relevance_scores = [3, 2, 1, 0]
ndcg_score = calculate_ndcg(relevance_scores)
print(ndcg_score)

Generation Metrics

BLEU Score: Measures the overlap between the generated text and reference text, considering n-grams.

from nltk.translate.bleu_score import sentence_bleu

def bleu_score(reference, generated):
    return sentence_bleu([reference.split()], generated.split())

ROUGE Score: Measures the overlap of n-grams, longest common subsequence (LCS), and skip-bigram between the generated text and reference text.

from rouge import Rouge

rouge = Rouge()

def rouge_score(reference, generated):
    scores = rouge.get_scores(generated, reference)
    return scores[0]['rouge-l']['f']

Human Evaluation: Involves subjective evaluation by human judges to assess the relevance, coherence, and overall quality of the generated responses. Human evaluation can provide insights that automated metrics might miss.

Chunking Alternatives

While chunking is an effective method for improving the efficiency and effectiveness of RAG systems, there are alternative techniques that can also be considered:

Hierarchical Indexing: Instead of chunking the text, hierarchical indexing organizes the data into a tree structure where each node represents a topic or subtopic. This allows for efficient retrieval by navigating through the tree based on the query’s context. ```python class HierarchicalIndex: def init(self): self.tree = {}

def add_document(self, doc_id, topics):
    current_level = self.tree
    for topic in topics:
        if topic not in current_level:
            current_level[topic] = {}
        current_level = current_level[topic]
    current_level['doc_id'] = doc_id

def retrieve(self, query_topics):
    current_level = self.tree
    for topic in query_topics:
        if topic in current_level:
            current_level = current_level[topic]
        else:
            return []
    return current_level.get('doc_id', [])

Summarization: Instead of retrieving chunks, the system generates summaries of documents or sections that are relevant to the query. This can be done using extractive or abstractive summarization techniques.

from transformers import BartTokenizer, BartForConditionalGeneration

def generate_summary(text):
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

    inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=150, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

Dense Passage Retrieval (DPR): DPR uses dense vector representations for both questions and passages, allowing for efficient similarity search using vector databases like FAISS.

from transformers import DPRQuestionEncoder, DPRContextEncoder, DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer
from sklearn.metrics.pairwise import cosine_similarity

question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
context_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

def encode_texts(texts, tokenizer, encoder):
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    return encoder(**inputs).pooler_output

question_embeddings = encode_texts(["What is chunking?"], question_tokenizer, question_encoder)
context_embeddings = encode_texts(["Chunking is a process...", "Another context..."], context_tokenizer, context_encoder)

similarities = cosine_similarity(question_embeddings, context_embeddings)

Graph-Based Representations: Instead of breaking the text into chunks, graph-based representations model the relationships between different parts of the text. Nodes represent entities, concepts, or chunks of text, and edges represent the relationships between them. This approach allows for more flexible and context-aware retrieval.

   import networkx as nx

   def build_graph(texts):
       graph = nx.Graph()
       for i, text in enumerate(texts):
           graph.add_node(i, text=text)
           # Add edges based on some similarity metric
           for j in range(i + 1, len(texts)):
               similarity = compute_similarity(text, texts[j])
               if similarity > threshold:
                   graph.add_edge(i, j, weight=similarity)
       return graph

   def retrieve_from_graph(graph, query):
       query_node = len(graph.nodes)
       graph.add_node(query_node, text=query)
       for i in range(query_node):
           similarity = compute_similarity(query, graph.nodes[i]['text'])
           if similarity > threshold:
               graph.add_edge(query_node, i, weight=similarity)
       # Retrieve nodes with highest similarity
       neighbors = sorted(graph[query_node], key=lambda x: graph[query_node][x]['weight'], reverse=True)
       return [graph.nodes[n]['text'] for n in neighbors[:k]]

Graph-based representations can capture complex relationships and provide a more holistic view of the text, making them a powerful alternative to chunking.

Conclusion

Chunking plays a pivotal role in enhancing the efficiency and effectiveness of Retrieval-Augmented Generation systems. By breaking down large texts into manageable chunks, we can improve retrieval speed, contextual relevance, scalability, and the overall quality of generated responses. Evaluating the performance of chunking methods involves considering retrieval and generation metrics, as well as efficiency and cost metrics. As NLP continues to advance, techniques like chunking will remain essential for optimizing the performance of RAG and other language processing systems. Additionally, exploring alternatives such as hierarchical indexing, passage retrieval, summarization, dense passage retrieval, and graph-based representations can further enhance the capabilities of RAG systems.

Embark on your journey to harness the power of chunking in RAG and unlock new possibilities in the world of Natural Language Processing!

If you found this blog post helpful, please consider citing it in your work:

@misc{malaikannan2024chunking, author = {Sankarasubbu, Malaikannan}, title = {Breaking Down Data: The Science and Art of Chunking in Text Processing & RAG Pipeline}, year = {2024}, url = {https://malaikannan.github.io/2024/08/05/Chunking/}, note = {Accessed: 2024-08-12} }

Embeddings

Malaikannan

Malaikannan Sankarasubbu

31 July 2024 at 00:00

Computers are meant to crunch numbers; it goes back to the original design of these machines. Representing text as numbers is the holy grail of Natural Language Processing (NLP), but how do we do that? Over the years, various techniques have been developed to achieve this. Early methods like n-grams (like bigrams and trigrams) and TF-IDF were able to convert words into numbers. Not just one number, a collection of them. Each word is represented by the collection of numbers. The collection of numbers is called vector and it had a size that is fixed called the dimension of the vector. Though they were useful, they had their limitations. The most important of the limitations is that the vectors for each words stands alone, i.e we could not do any mathematical operations like addition or subtraction between the vectors(actually we could but the resulting vector will not represent any word). That is where embeddings come in. Embedding is also a vector, and so each word get a corresponding vector but we can now do King - Man + Woman that will give us a vector which is close to the vector corresponding to Queen. Why is this useful? That is what we are going to explore in this article.

What are Embeddings?

Embeddings are numerical representations of text data where words or phrases from the vocabulary are mapped to vectors of real numbers. This mapping is crucial because it allows us to quantify and manipulate textual data in a way that machines can understand and process.

We understand what a word is, lets see what a vector is. A vector is a sequence of numbers that forms a group. For example

(3) is a one dimensional vector.
(2,8) is a two dimensional vector.
(12,6,7,4) is a four dimensional vector.

A vector can be represented as by plotting on a graph. Lets take a 2D example

2D Plot

We can only 3 dimensions, anything more than that you can just say it not visualize.

Below is an example of 4 dimension vector representation of the word king

King Vector

One of the seminal papers that have come out from Google is Word2vec. Lets see how Word2Vec works to get a conceptual understanding of how embedding works

How Word2Vec works

For a input text it looks at each word and the context of words around it. It trains on the text, and recognizes the order of each word, and the structure of the sentences. At the end of training each word is represented by a vector of N (mostly in 100 to 300 range) dimension.

Word2Vec

When we train word2vec algorithm in the example discussed above “SanFrancisco is a beautiful California city. LosAngeles is a lovely California metropolis”

Lets assume that it outputs 2 dimension vectors for each words, since we can’t visualize anything more than 3 dimension.

SanFrancisco (6,6)
beautiful (-13,-4)
California (10,8)
city (2,10)
LosAngeles (6.5,5)
lovely(-12,-7)
metropolis(2.5,8)

Below is a 2D Plot of vectors

2DPlot

You can see in the image that Word2vec algorithm inferred from the input text. SanFrancisco and LosAngeles are grouped together. Beautiful and lovely are grouped together. City and metropolis are grouped together. Beauty about this is, Word2vec deduced this purely from data, without being explicitly taught english or geography.

You will see more embedding approaches in the below sections

Key Characteristics of Embeddings:

Dimensionality: Embeddings are vectors of fixed size. Common sizes range from 50 to 300 dimensions, though they can be larger depending on the complexity of the task.
Continuous Space: Unlike traditional one-hot encoding, embeddings are dense and reside in a continuous vector space, making them more efficient and informative.
Semantic Proximity: Words with similar meanings tend to have vectors that are close to each other in the embedding space.

The Evolution of Embeddings

Embeddings have evolved significantly over the years. Here are some key milestones:

Word2Vec (2013): Developed by Mikolov et al. at Google, Word2Vec was one of the first algorithms to create word embeddings. It uses two architectures—Continuous Bag of Words (CBOW) and Skip-gram—to learn word associations.
GloVe (2014): Developed by the Stanford NLP Group, GloVe (Global Vectors for Word Representation) improves upon Word2Vec by incorporating global statistical information of the corpus.
FastText (2016): Developed by Facebook’s AI Research (FAIR) lab, FastText extends Word2Vec by considering subword information, which helps in handling out-of-vocabulary words and capturing morphological details.
ELMo (2018): Developed by the Allen Institute for AI, ELMo (Embeddings from Language Models) generates context-sensitive embeddings, meaning the representation of a word changes based on its context in a sentence.
BERT (2018): Developed by Google, BERT (Bidirectional Encoder Representations from Transformers) revolutionized embeddings by using transformers to understand the context of a word bidirectionally. This model significantly improved performance on various NLP tasks.

From Word Embeddings to Sentence Embeddings

While word embeddings provide a way to represent individual words, they do not capture the meaning of entire sentences or documents. This limitation led to the development of sentence embeddings, which are designed to represent longer text sequences.

Word Embeddings

Word embeddings, such as those created by Word2Vec, GloVe, and FastText, map individual words to vectors. These embeddings capture semantic similarities between words based on their context within a large corpus of text. For example, the words “king” and “queen” might be close together in the embedding space because they often appear in similar contexts.

Sentence Embeddings

Sentence embeddings extend the concept of word embeddings to entire sentences or even paragraphs. These embeddings aim to capture the meaning of a whole sentence, taking into account the context and relationships between words within the sentence. There are several methods to create sentence embeddings:

Averaging Word Embeddings: One of the simplest methods is to average the word embeddings of all words in a sentence. While this method is straightforward, it often fails to capture the nuances and syntactic structures of sentences.
Doc2Vec: Developed by Mikolov and Le, Doc2Vec extends Word2Vec to larger text segments by considering the paragraph as an additional feature during training. This method generates embeddings for sentences or documents that capture more context compared to averaging word embeddings.
Recurrent Neural Networks (RNNs): RNNs, particularly Long Short-Term Memory (LSTM) networks, can be used to generate sentence embeddings by processing the sequence of words in a sentence. The hidden state of the RNN after processing the entire sentence can serve as the sentence embedding.
Transformers (BERT, GPT, etc.): Modern approaches like BERT and GPT use transformer architectures to generate context-aware embeddings for sentences. These models can process a sentence bidirectionally, capturing dependencies and relationships between words more effectively than previous methods.

Example: BERT Sentence Embeddings

BERT (Bidirectional Encoder Representations from Transformers) has set a new standard for generating high-quality sentence embeddings. By processing a sentence in both directions, BERT captures the full context of each word in relation to the entire sentence. The embeddings generated by BERT can be fine-tuned for various NLP tasks, such as sentiment analysis, question answering, and text classification.

To create a sentence embedding with BERT, you can use the hidden states of the transformer model. Typically, the hidden state corresponding to the [CLS] token (which stands for “classification”) is used as the sentence embedding.

How to Generate Embeddings

Generating embeddings involves training a model on a large corpus of text data. Here’s a step-by-step guide to generating word and sentence embeddings:

Generating Word Embeddings with Word2Vec

Data Preparation: Collect and preprocess a large text corpus. This involves tokenizing the text, removing stop words, and handling punctuation.

Training the Model: Use the Word2Vec algorithm to train the model. You can choose between the CBOW or Skip-gram architecture. Libraries like Gensim in Python provide easy-to-use implementations of Word2Vec.

from gensim.models import Word2Vec

# Example sentences
sentences = [["I", "love", "machine", "learning"], ["Word2Vec", "is", "great"]]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

Using the Embeddings: Once the model is trained, you can use it to get the embedding for any word in the vocabulary.
```
word_embedding = model.wv['machine']
```

Generating Sentence Embeddings with BERT

Install Transformers Library: Use the Hugging Face Transformers library to easily work with BERT.
```
pip install transformers
```

Load Pretrained BERT Model: Load a pretrained BERT model and tokenizer.

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

Tokenize Input Text: Tokenize your input text and convert it to input IDs and attention masks.

sentence = "BERT is amazing for sentence embeddings."
inputs = tokenizer(sentence, return_tensors='pt')

Generate Embeddings: Pass the inputs through the BERT model to get the embeddings.

with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] token embedding
sentence_embedding = outputs.last_hidden_state[0][0]

Using the Embeddings: The sentence_embedding can now be used for various NLP tasks.

Data Needed for Training Embeddings

The quality of embeddings heavily depends on the data used for training. Here are key considerations regarding the data needed:

Size of the Corpus: A large corpus is generally required to capture the diverse contexts in which words can appear. For example, training Word2Vec or BERT models typically requires billions of words. The larger the corpus, the better the embeddings can capture semantic nuances.
Diversity of the Corpus: The corpus should cover a wide range of topics and genres to ensure that the embeddings are generalizable. This means including text from various domains such as news articles, books, social media, academic papers, and more.
Preprocessing: Proper preprocessing of the corpus is essential. This includes:
- Tokenization: Splitting text into words or subwords.
- Lowercasing: Converting all text to lowercase to reduce the vocabulary size.
- Removing Punctuation and Stop Words: Cleaning the text by removing unnecessary punctuation and common stop words that do not contribute to the meaning.
- Handling Special Characters: Dealing with special characters, numbers, and other non-alphabetic tokens appropriately.
Domain-Specific Data: For specialized applications, it is beneficial to include domain-specific data. For instance, medical embeddings should be trained on medical literature to capture the specialized vocabulary and context of the field.
Balanced Dataset: Ensuring that the dataset is balanced and not biased towards a particular topic or genre helps in creating more neutral and representative embeddings.
Data Augmentation: In cases where data is limited, data augmentation techniques such as back-translation, paraphrasing, and synthetic data generation can be used to enhance the corpus.

Applications of Sentence Embeddings

Sentence embeddings have a wide range of applications in NLP:

Text Classification: Sentence embeddings are used to represent sentences for classification tasks, such as identifying the topic of a sentence or determining the sentiment expressed in a review.
Semantic Search: By comparing sentence embeddings, search engines can retrieve documents that are semantically similar to a query, even if the exact keywords are not matched.
Summarization

: Sentence embeddings help in generating summaries by identifying the most important sentences in a document based on their semantic content.

Translation: Sentence embeddings improve machine translation systems by providing a richer representation of the source sentence, leading to more accurate translations.

Embedding Dimension Reduction Methods

High-dimensional embeddings can be computationally expensive and may contain redundant information. Dimension reduction techniques help in simplifying these embeddings while preserving their essential characteristics. Here are some common methods:

Principal Component Analysis (PCA): PCA is a linear method that reduces the dimensionality of data by transforming it into a new coordinate system where the greatest variances by any projection of the data come to lie on the first coordinates (principal components).
```
from sklearn.decomposition import PCA

# Assuming 'embeddings' is a numpy array of shape (n_samples, n_features)
pca = PCA(n_components=50)
reduced_embeddings = pca.fit_transform(embeddings)
```
t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a nonlinear technique primarily used for visualizing high-dimensional data by reducing it to two or three dimensions.
```
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)
reduced_embeddings = tsne.fit_transform(embeddings)
```
Uniform Manifold Approximation and Projection (UMAP): UMAP is another nonlinear technique that is faster and often more effective than t-SNE for dimension reduction, especially for larger datasets.
```
import umap

reducer = umap.UMAP(n_components=2)
reduced_embeddings = reducer.fit_transform(embeddings)
```

Autoencoders: Autoencoders are a type of neural network used to learn efficient codings of input data. An autoencoder consists of an encoder and a decoder. The encoder compresses the input into a lower-dimensional latent space, and the decoder reconstructs the input from this latent space.

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Define encoder
input_dim = embeddings.shape[1]
encoding_dim = 50  # Size of the reduced dimension
input_layer = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation='relu')(input_layer)

# Define decoder
decoded = Dense(input_dim, activation='sigmoid')(encoded)

# Build the autoencoder model
autoencoder = Model(input_layer, decoded)
encoder = Model(input_layer, encoded)

# Compile and train the autoencoder
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
autoencoder.fit(embeddings, embeddings, epochs=50, batch_size=256, shuffle=True)

# Get the reduced embeddings
reduced_embeddings = encoder.predict(embeddings)

Random Projection: Random projection is a simple and computationally efficient technique to reduce dimensionality. It is based on the Johnson-Lindenstrauss lemma, which states that high-dimensional data can be embedded into a lower-dimensional space with minimal distortion.
```
from sklearn.random_projection import SparseRandomProjection

transformer = SparseRandomProjection(n_components=50)
reduced_embeddings = transformer.fit_transform(embeddings)
```

Evaluating Embeddings

Evaluating embeddings is crucial to ensure that they capture meaningful relationships and semantics. Here are some common methods to evaluate embeddings:

Intrinsic Evaluation: These methods evaluate the quality of embeddings based on predefined linguistic tasks or properties without involving downstream tasks.
- Word Similarity: Measure the cosine similarity between word pairs and compare with human-annotated similarity scores. Popular datasets include WordSim-353 and SimLex-999.
```
from scipy.spatial.distance import cosine

similarity = 1 - cosine(embedding1, embedding2)
```
- Analogy Tasks: Evaluate embeddings based on their ability to solve word analogy tasks, such as “king - man + woman = queen.” Datasets like Google Analogy dataset are commonly used.
```
def analogy(model, word1, word2, word3):
    vec = model[word1] - model[word2] + model[word3]
    return model.most_similar([vec])[0][0]
```

Extrinsic Evaluation: These methods evaluate embeddings based on their performance on downstream NLP tasks.

Text Classification: Use embeddings as features for text classification tasks and measure performance using metrics like accuracy, precision, recall, and F1 score.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(train_embeddings, train_labels)
predictions = model.predict(test_embeddings)
accuracy = accuracy_score(test_labels, predictions)

Named Entity Recognition (NER): Evaluate embeddings by their performance on NER tasks, measuring precision, recall, and F1 score.

# Example using spaCy for NER
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")
nlp.entity.add_label("ORG")

train_docs = [nlp(text) for text in train_texts]
train_db = DocBin(docs=train_docs)

Machine Translation: Assess the quality of embeddings by their impact on machine translation tasks, using BLEU or METEOR scores.

Clustering and Visualization: Visualizing embeddings using t-SNE or UMAP can provide qualitative insights into the structure and quality of embeddings.

import matplotlib.pyplot as plt

tsne = TSNE(n_components=2)
reduced_embeddings = tsne.fit_transform(embeddings)

plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(reduced_embeddings[i, 0], reduced_embeddings[i, 1]))
plt.show()

Similarity vs. Retrieval Embeddings

Embeddings can be tailored for different purposes, such as similarity or retrieval tasks. Understanding the distinction between these two types of embeddings is crucial for optimizing their use in various applications.

Similarity Embeddings

Similarity embeddings are designed to capture the semantic similarity between different pieces of text. The primary goal is to ensure that semantically similar texts have similar embeddings.

Use Cases:

Semantic Search: Finding documents or sentences that are semantically similar to a query.
Recommendation Systems: Recommending items (e.g., articles, products) that are similar to a given item.
Paraphrase Detection: Identifying sentences or phrases that convey the same meaning.

Evaluation:

Cosine Similarity: Measure the cosine similarity between embeddings to evaluate their closeness.

from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity([embedding1], [embedding2])

Clustering: Grouping similar items together using clustering algorithms like K-means.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(embeddings)

Retrieval Embeddings

Retrieval embeddings are optimized for information retrieval tasks, where the goal is to retrieve the most relevant documents from a large corpus based on a query.

Use Cases:

Search Engines: Retrieving relevant web pages or documents based on user queries.
Question Answering Systems: Finding relevant passages or documents that contain the answer to a user’s question.
Document Retrieval: Retrieving documents that are most relevant to a given query.

Evaluation:

Precision and Recall: Measure the accuracy of retrieved documents using precision, recall, and F1 score.

from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(true_labels, predicted_labels, average='weighted')
recall = recall_score(true_labels, predicted_labels, average='weighted')
f1 = f1_score(true_labels, predicted_labels, average='weighted')

Mean Reciprocal Rank (MRR): Evaluate the rank of the first relevant document.

def mean_reciprocal_rank(rs):
    """Score is reciprocal of the rank of the first relevant item
    First element is 'rank 1'.  Relevance is binary (nonzero is relevant).
    Example from information retrieval with binary relevance:
    >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
    >>> mean_reciprocal_rank(rs)
    0.61111111111111105
    """
    rs = (np.asarray(r).nonzero()[0] for r in rs)
    return np.mean([1. / (r[0] + 1) if r.size else 0. for r in rs])

Symmetric vs. Asymmetric Embeddings

Symmetric and asymmetric embeddings are designed to handle different types of relationships in data, and understanding their differences can help in choosing the right approach for specific tasks.

Symmetric Embeddings

Symmetric embeddings are used when the relationship between two items is mutual. The similarity between two items is expected to be the same regardless of the order in which they are compared.

Use Cases:

Similarity Search: Comparing the similarity between two items, such as text or images, where the similarity score should be the same in both directions.
Collaborative Filtering: Recommending items

based on mutual user-item interactions, where the relationship is bidirectional.

Evaluation:

Cosine Similarity: Symmetric embeddings often use cosine similarity to measure the closeness of vectors.
```
similarity = cosine_similarity([embedding1], [embedding2])
```

Asymmetric Embeddings

Asymmetric embeddings are used when the relationship between two items is directional. The similarity or relevance of one item to another may not be the same when the order is reversed.

Use Cases:

Information Retrieval: Retrieving relevant documents for a query, where the relevance of a document to a query is not necessarily the same as the relevance of the query to the document.
Knowledge Graph Embeddings: Representing entities and relationships in a knowledge graph, where the relationship is directional (e.g., parent-child, teacher-student).

Evaluation:

Rank-Based Metrics: Asymmetric embeddings often use rank-based metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) to evaluate performance.

def mean_reciprocal_rank(rs):
    rs = (np.asarray(r).nonzero()[0] for r in rs)
    return np.mean([1. / (r[0] + 1) if r.size else 0. for r in rs])

The Future of Embeddings

The field of embeddings is rapidly evolving. Researchers are exploring new ways to create more efficient and accurate representations, such as using unsupervised learning and combining embeddings with other techniques like graph networks. The ongoing advancements in this area promise to further enhance the capabilities of NLP systems.

Conclusion

Embeddings have revolutionized the field of NLP, providing a robust and efficient way to represent and process textual data. From word embeddings to sentence embeddings, these techniques have enabled significant advancements in how machines understand and interact with human language. With the help of dimension reduction methods, evaluation techniques, and tailored similarity and retrieval embeddings, embeddings can be optimized for a wide range of NLP tasks. Understanding the differences between symmetric and asymmetric embeddings further allows for more specialized applications. As we continue to develop more sophisticated models and techniques, embeddings will undoubtedly play a crucial role in advancing our understanding and interaction with human language.

How to find a home with a kitchen you like with deeplearning?

Malaikannan

Malaikannan Sankarasubbu

25 October 2018 at 00:00

Ai in clinical trials

Malaikannan

Malaikannan Sankarasubbu

20 October 2018 at 00:00

American Tamil Enterprenur association invited me for a talk on AI. Below is the video on the talk, this was a fun talk loaded with lot of MEMEs

Learned index

Malaikannan

Malaikannan Sankarasubbu

19 October 2018 at 00:00

Jeff Dean and Co came up with a Seminal Paper on whether Indexes can be learned using Neural Networks. I gave a talk in Saama Tech Talk Series on this topic. Guess I am moving more towards talking rather than writing nowadays.

Natural language processing tools for tamil

Malaikannan

Malaikannan Sankarasubbu

1 August 2018 at 00:00

Language is a beautiful thing, Tamil as a Language is even more beautiful. Need for learning other languages like English is for commerce, communication among other things. Languages that have written form and grammar have survived longer than languages that are without it. Technology is playing the role of written form and grammar now. English being the most spoken and written language in the world has the benefit of lot of technology and tools developed to understand language better. English has lot of good systems developed as open source as well as by the companies like Google,Facebook, Microsoft etc for Speech to Text, Syntactic Parser, Stemmer, Lemmatizer, Parts of Speech Tagger etc. This was due to years of data collection, research, availability of computing power and to great extent due to Deep Learning. We have to develop most of these Natural Language Understanding tools for Tamil, there are some that exist but they are in very very early stages. When a language is beautiful it is complicated too, English has 26 alphabets and Tamil has 247 alphabets, and the grammar rules are different between English and Tamil. In most cases we are limited by the lack of available data for us(tech community) to build these tools. These tools when built should be open sourced under Apache 2.0 license, so it is available for public to use for free. I have listed few key activities that have to be performed for this to be successful

Website to be created to list out mission statement and project status. This can be as simple as github pages.
Android app to be developed for data collection and even gamify it. For e.g. for Speech to Text similar to how Mozilla recently asked people to donate their voices to develop open source Speech to Text System for English. They open sourced the code and the trained model.
Data Access. There are research libraries that have digitized lot of books, dailies, news papers etc. If that data is publicly available, it can spawn multiple research similar to how Project Gutenberg did. One low hanging fruit for algorithm would be a vectorization engine for Tamil words.
Community involvement. This effort will not succeed without community involvement, particularly without student and teacher community for data collection.
Tech community involvement to develop this as open-source.
Eminent linguistic advisors to advise the correct approach for Tamil.
Technology advisors on right technology choices to solving problems.
Advisors to help registering and managing this as a non-profit and may be even seeking grants and sponsorship. This can start as a small thing, but if it has to be taken seriously it has to be registered as a non-profit organization.
Advisors who can help with media and press out-reach.

Why use Activation Functions in Deep Learning ?

Malaikannan

Malaikannan Sankarasubbu

18 July 2017 at 20:31

Machine Learning or Deep Learning is all about using Affine Maps. Affine map is a function which can be expressed as

f(x) = WX + b

Where W and X are matrix, and b (bias term) is a vector. Deep learning learns parameters W and b. In Deep Learning you can stack multiple affine maps on top of one another. for e.g

f(x) = WX + b
g(x) = VX + d

If we stack one affine map over the other then

f(g(x)) = W (VX +d) + b
f(g(x)) = WVx + Wd + b

WV is a matrix , Wd and b are vectors.

Deep learning requires lot of affine maps stacked on top of the other. But Composing one affine map over the other gives another affine map so stacking is not going to give the desired effect and it gives nothing more than what a single affine map is going to give. It still leaves us with a linear model. In a classification problem linear model will not be able to solve for a non-linear decision boundary.

How do we solve this ? By introducing non-linearity between affine maps/layers. Most commonly used non-linear functions are

Tanh
Sigmoid
RELU

When there are lot of non-linear functions why use only the above ones ? Because the derivatives of these functions are easier to compute which is how Deep Learning algorithms learn. Non-Linear functions are called Activation functions in Deep Learning world.

Thanks to Dr Jacob Minz suggestion to add explanation about Universal Approximation Theorem. Universal Approximation Theorem says that when you introduce simple non-linearity between affine layers, you’ll be able to approximate any function to any arbitrary degree (as close to that function as you want). If there is a pattern in the data, the neural network will “learn” it given enough of computation and data.

You can read more about the Activation functions in wiki. Who writes better about Neural Networks than Chris Olah. Refer to his blog for further reading. Spandan Madan has written a quora answer on the similar topic

செயற்கை அறிவாற்றல் வல்லுனர்களுக்கு ஏன் சரித்திரம் தெரிந்து இருக்க வேண்டும் ?

Malaikannan

Malaikannan Sankarasubbu

10 July 2017 at 18:37

தொழில்நுட்ப வளர்ச்சி சாமான்ய மக்களை வியக்க வைக்கின்றது, அவர்களும் தொழில்நுட்ப வளர்ச்சியை முடிந்த வரை அரவணைத்து வருகின்றனர்.வரலாற்றை புரட்டி பார்த்தால் ஒவ்வொரு 20 வருடங்களுக்கு ஒரு முறை மிகவும் சக்தி வாய்ந்த தொழில்நுட்பம் வந்து ஒரு பெரும் புரட்சியை உண்டு பண்ணும். செயற்கை அறிவாற்றல் தொழில்நுப்டமோ 100 ஆண்டுகளுக்கு ஒரு முறை வரும் ஒரு பெரும் மாற்றம். செயற்கை அறிவாற்றல் அறிஞர் ஆண்ட்ரூ ந.ஜி (Andrew Ng) இதை புதிய மின்சாரம் என்று கூறுகிறார். இத்தகைய மிக சக்தி வாய்ந்த தொழில்நுட்பத்தை வல்லுநர்கள் ஆக்கவும் பயன்படுத்த முடியும் அழிக்கவும் பயன்படுத்த முடியும்.

மின்சாரம்

செயற்கை அறிவாற்றல் வல்லுநராக ஆக ஒருவர் மிக சிறந்த மென்பொருள் வல்லுநராக, கணித மேதையாக, விடா முயற்சி கொண்டவராக இருக்க வேண்டும். இவ்வளவு விஷயத்தில் வல்லமை பெற்ற ஒருத்தர் அதி புத்திசாலியாக இருக்க வாய்ப்பு உண்டு. அதி புத்திசாலிகளுக்கு உரிய அகந்தையும் கர்வமும், எதையும் செய்யலாம் எந்த விதிகளும் நமக்கு பொருந்தாது என்ற பண்பு இருக்கும் வாய்ப்பு மிக அதிகம். இத்தகைய பண்பை தான் ஒரு அறிஞர் ஒருவர் “எண்கள் பொய் சொல்லாது ஆனால் நன்றாக பொய் பேசுபவர்கள் எண்களை உபயோகிப்பர்” (Numbers don’t lie but liars use numbers) என்று கூறி இருக்கிறார்.

செயற்கை அறிவாற்றலை சுருக்கமாக சொல்ல வேண்டுமானால் கடந்த கால நிகழ்வுகளை வைத்து எதிர்காலத்தில் நடப்பதை கணிக்கக்கூடிய வல்லமை பெற்றது. இது ஒரு கேள்வியை எழுப்புகிறது, கடந்த கால நிகழ்வு ஒன்று இல்லை என்றால் செயற்கை அறிவாற்றல் அதை எப்படி சரியாக கணிக்க முடியும் ? உதாரணத்துக்கு நாம் செயற்கை அறிவாற்றலை வங்கியில் தொழில்முனைவோருக்கு கடன்கொடுப்பதா இல்லையா என்று முடிவு எடுக்க பயன்படுத்துகிறோம் என்றால். ஒரு சமூகம் தொழில்முனைவதில் பெயர் பெற்றவர்கள், அவர்கள் பற்றிய தகவல்கள் செயற்கை அறிவாற்றலுக்கு தெரியும், அது கடன் கொடுப்பதை பற்றி சாதகமான முடிவு எடுக்க வாய்ப்பு மிகவும் அதிகம். அதை சமயம் இன்னொரு சமூகம் காலம் காலமாக அடிமை பெற்ற சமூகம், அதில் இருந்து தொழில் முனைவோர் வந்தது இல்லை, இந்த சமூகத்தை பற்றி தவகல்கள் செயற்கை அறிவாற்றலுக்கு தெரியாது, அது கடன் கொடுப்பதை பற்றி பாதகமான முடிவு எடுக்க வாய்ப்பு மிகவும் அதிகம். அமெரிக்காவில் சிறையில் இருக்கும் மக்கள் தொகையில் 34% விழுக்காடு கறுப்பினத்தவர் ஆனால் அமெரிக்காவின் மொத்த மக்கள் தொகையில் 12.2% விழுக்காடு தான் கறுப்பினத்தவர். தவறு செய்த ஒருவரை சிறையில் இருந்து சீக்கிரம் விடுவிக்கலாமா இல்லையா என்று முடிவு செய்ய செயற்கை அறிவாற்றலை பயன்படுத்தினால் கறுப்பினத்தவருளுக்கு பாதகமான முடிவு வர சாத்தியம் அதிகம். இதை பற்றி நியூயார்க் டைம்ஸ் ஒரு அருமையான கருத்து வெளியிட்டறிந்தது.

செயற்கை அறிவாற்றலலில் இரண்டு வகைப்பாடுகள் உண்டு. முதல் வகை அது ஒரு முடிவை ஏன் எடுக்கிறன்றது என்று வல்லுநர்களுக்கு தெரியும். இரண்டாம் வகை அது ஏன் ஒரு முடிவை எடுக்கின்றது என்று தெரியாத வகை. DeepLearning இரண்டாவது வகையறாவை சாறும். பேராசிரியர் பீன் கிம் (Prof Been Kim ) அணைத்து செயற்கை ஆறிவாற்றலும் ஏன் ஒரு முடிவை எடுத்தது என்பது மனிதர்களுக்கு புரிய வேண்டும் என்று ஆராய்ச்சி செய்து வருகிறார். வங்கியில் கடன் கொடுப்பதா இல்லையா போன்ற காரியத்துக்கு இரண்டாம் வகை செயற்கை ஆறிவுஆற்றலை பயன் படுத்த கூடாது.

செயற்கை அறிவாற்றல் ஒரு குழந்தை மாதிரி நல்ல தகவல்களை சொல்லி கொடுத்தால் நல்ல முடிவு எடுக்கும், தவறான தகவல்களை சொல்லி கொடுத்தால் தவறான முடிவு எடுக்கும். செயற்கை அறிவாற்றல் வல்லுநர்கள் கண்மூடித்தனமாக சரித்திரம் மற்றும் சமூக கட்டமைப்பு புரியாமல் தகவல்களை செயற்கை ஆறிவாற்றலுக்கு சொல்லி கொடுத்தால் அது பாரபட்சமான முடிவு எடுக்கும் வாய்ப்பு மிகவும் அதிகம்.இதற்கு முன்பு வந்த தொழில்நுட்பங்களை போல செயற்கை அறிவாற்றல் அதன் படைப்பாளிகள் அறநெறிகளை பிரதிபலிக்கும். செயற்கை அறிவாற்றல் வல்லுனர்களுக்கு மிகுந்த பொறுப்பு உள்ளது.

கவிஞர் புலமைப்பித்தனின் வரிகள் “எந்தக்குழந்தையும் நல்ல குழந்தை தான் மண்ணில் பிறக்கையிலே…பின் நல்லவராவதும் தீயவராவதும் அன்னை வளர்ப்பதிலே” செயற்கை அறிவாற்றல் வல்லுநர்களுக்கு மிகவும் பொருந்தும்.

செயற்கை அறிவாற்றல்

Malaikannan

Malaikannan Sankarasubbu

27 June 2017 at 20:50

செயற்கை அறிவாற்றல் என்று சொன்னால் நமக்கு நினைவு வருவது ரஜினியின் எந்திரன் திரைப்படம் அதில் சிட்டி அடிக்கும் லூட்டிகளைக்கண்டு இதெல்லாம் சாத்தியமா என்று வியந்தது உண்டு. சிட்டி செயற்கை அறிவுஆற்றலின் ஒரு பரிமாணம். எல்லா செயற்கை அறிவாற்றல் பரிமாணங்களும் சிட்டி மாதிரி இருக்க வேன்றும் என்று அவசியம் இல்லை.

எந்திரன்

நமக்கு தெரியாமலே அதிகமான செயற்கை அறிவாற்றல் சாதனங்களை நாம் பயன் படுத்தி வருகிறோம். நம்மில் பலபேர் முகநூலில் ஒரு கணக்கு வைத்துயிருக்கிறோம். செய்திகளை அறியக்கூட நாம் முகநூலுக்கு செல்ல தொடங்குகிறோம். உங்களுக்கு ஒரு செய்தியோ அல்லது தகவலோ பிடித்துஇருந்தால் அதற்கான விருப்பத்தை(LIKE) நீங்கள் முகநூலில் பதிவு செய்துயிருந்தால் உங்களுக்கு அதற்கு சம்பந்தப்பட்ட செய்திகளோ தகவல்களோ முகநூலில் அதிகம் வர தொடங்கும் இது ஒரு வகையில் நல்லது. நமக்கு விருப்பமான விவரங்களை மட்டும் நாம் கற்றுக்கொள்கிறோம் விரும்பாத விவரங்களை தவிர்க்கிறோம். அனால் இது நம்மளை கிணற்று தவளையாக மாற்றிவிடுகிறது. இந்த கிணறு தான் நம் உலகம். முகநூலின் செயற்கை அறிவாற்றல் அதிகம் கற்று கொள்ள கற்று கொள்ள அவர்களை சுற்றி ஒரு கிணறு வெட்ட தொடங்குகிறது.

கிணற்று தவளையாக

அட உங்களின் விருப்பத்தை அது எப்படி கற்று கொள்கிறது?

உங்களின் விருப்பத்தினை நீங்கள் பதிவு செய்யும் விரும்புதல்(LIKE) மற்றும் கருத்து தெரிவிப்பதின் (COMMENT) மூலமும் செயற்கை அறிவு ஆற்றலக்கு கற்று கொடுக்கிறீர்கள். செயற்கை அறிவாற்றல் உங்களது விருப்பத்திற்கு இணங்க உரிய விவரங்களை மட்டும் திரும்ப திரும்ப காட்டுகிறது. உதாரணத்திற்கு நீங்கள் ஜெயகாந்தன் கட்டுரைகளுக்கு விருப்பம் தெரிவித்துஇருந்தால் அசோகமித்திரன் கட்டுரைகளும் முகநூலின் தகவல் பலகைகளில் வர தொடங்கும். கவிஞர் கண்ணதாசன் கவிதைகள் விருப்பம் தெரிவுத்துஇருந்தால் கவிஞர் வாலியின் கவிதைகளும் வர தொடங்கும்.

2016அம் ஆண்டு அமெரிக்க தேர்தலில் போலியான செய்திகளின் மூலம் செய்த பிரச்சாரம் முகநூலின் எதிரொலி அறை விளைவாக அதிக தாக்கத்தை ஏற்படுத்தி தேர்தலில் ஒரு மாறுதலை உண்டாக்கியது என வல்லுனர்களின் கருத்து.

மென்பொருளுக்கும் செயற்கை அறிவு ஆற்றலுக்கும் என்ன வித்தியாசம் ?

உதாரணத்துக்கு மைதானத்தில் உடற்பயிற்சி செய்வது எனக்கு ரொம்ப பிடிக்கும். மென்பொருள் பொறியாளரான எனக்கு எந்தவித பிரச்னை இருந்தாலும் அதற்கான தீர்வு செய்ய ஒரு மென்பொருளை நான் வடிவுவமைப்பேன். உடல் பயிற்சி செய்யலாமா என்ற என் முடிவை பாதிக்கும் கீழ்கண்ட காரணங்கள்.

மழை பெய்கிறதா ?
- மழை பெய்தால் உடற்பயிற்சி செய்ய வேண்டாம்.
மணி என்ன ?
- இரவு 10 மணி முதல் காலை 5 மணி வரை : கும்முஇருட்டில் உடற்பயிற்சி செய்யாதே.
- காலை 5 மணி முதல் 7 மணி வரை : உடற்பயிற்சி செய்யலாம்.
- காலை 8 மணி முதல் 9 மணி வரை : வீட்டிலிருந்து அலுவலக வேலை செய்தால் உடற்பயிற்சி செய்யலாம்
- காலை 9 மணி முதல் மாலை 5 மணி வரை : அலுவலக பணி உடற்பயிற்சி செய்யலாம்.
நான் மகனை பள்ளியில் விட வேண்டுமா ?
- காலை 8 மணி முதல் 9 மணி வரை : உடற்பயிற்சி செய்ய முடியாது
- மலை 4 மணி முதல் 5 மணி வரை : உடற்பயிற்சி செய்ய முடியாது
சீதோஷண நிலை
- 22 முதல் 32 டிகிரி செல்சியஸ் : உடற்பயிற்சி செய்யலாம்
- 32 முதல் 44 டிகிரி செல்சியஸ் : உடற்பயிற்சி செய்ய முடியாது

பாரம்பரிய மென்பொருள் முறைப்படி தீர்வு செய்ய முயன்றால் வரிசை மாற்றம் மற்றும் சேர்கை காரணங்கள்(PERMUTATION AND COMBINATION) எண்ணிக்கைகள் மிக அதிகமாக இருக்கும்.

மணி	மழை பெய்கிறதா	பள்ளி சென்று விடும் நேரமா	வெயிலின் தாக்கம்	ஆமாம்/இல்லை
காலை 5-7	ஆமாம்	இல்லை	22-32 டிகிரி செல்சியஸ்	இல்லை
இரவு 9-5	இல்லை	இல்லை	22-32 டிகிரி செல்சியஸ்	இல்லை
காலை 5-7	இல்லை	இல்லை	22-32 டிகிரி செல்சியஸ்	ஆமாம்
…..	….	….	….	…

நான்கே காரணங்குளுக்கு 20க்கு மேற்பட்ட வரிசை மாற்றம் சேர்க்கை காரணங்கள்(PERMUTATION AND COMBINATION) 20க்கு மேல் வருகிறது. இதில் தனிப்பட்ட காரணங்கள் 100க்கு மேல் இருந்தால் வரிசை மாற்றம் சேர்க்கை காரணங்கள்(PERMUTATION AND COMBINATION) 1000க்கு மேல் வரும். இது பாரம்பரிய மென்பொருள் வடிவமைப்பின் மூலம் எளிதில் தீர்வு செய்ய முடியாது. இது மாதிரியான பிரச்சனைக்கு செயற்கை அறிவாற்றல் தான் சரியான தீர்வாக அமையும். நமது நவீன தொலைபேசியில் உடற்பயிற்சியை கண்காணிக்க நிறைய செயளிகள் உள்ளன. Run keeper, Mapmyrun இந்த மாறி செயலிகளை ஒரு வருடம் உபயோகித்து இருந்தால் அதன் தகவல்களை செயற்கை அறிவு ஆற்றலுக்கு ஒரு குழந்தைக்கு எது சரி அது தவறு என்று சொல்லி கொடுப்பதுபோல் நான் எப்போதுஎல்லாம் உடற்பயிற்சி செய்யலாம் /செய்யக்கூடாது என்பதை சொல்லி கொடுக்க முடியும்.

செயற்கை அறிவு ஆற்றலையை சுருக்கமாக சொல்ல வேண்டுமானால் கடந்த கால நிகழ்வுகளை வைத்து எதிர்காலத்தில் நடப்பதை கணிக்கக்கூடிய வல்லமை பெற்றது. மேலே பார்த்த உதாரணம் ஒரு சிறிய நடைமுறையைப் பயன்பாடு. செயற்கை அறிவு ஆற்றல் பல இடங்களில் பயன்படுத்த முடியும். உதாரணம் - வானிலை அறிக்கை, செயற்கோள் மூலமாக பூமியின் அடியில் உள்ள கனிம வளத்தினை அறியலாம். மலைப் பகுதியில் உள்ள வளங்களை அறியலாம். ஒரு புகைப்படத்தினை பயன்படுத்தி தோல் புற்று நோயை அறியலாம்.

செயற்கை அறிவாற்றல் என்பது ஒரு சக்தி வாய்ந்த கருவி அதை சரியாக பயன்படுத்தினால் மனித குலத்திற்கு அநேக நன்மைகளை உண்டாக்கலாம். இதன் மூலம் புதிய வேலை வாய்ப்புகள் பல உருவாகும். அதே நேரத்தில் பழைய தொழில் நுட்பங்கள் மறைவதற்க்கான வாய்ப்புகள் உண்டு. சான்றோர்களும், வல்லுனர்களும் , அரசுஆட்சி செய்ப்பவர்களும் அமர்ந்து விவாதித்து வழிகாட்டுதலை வெளியிட வேண்டும்.

இது எனது முதல் செயற்கை அறிவாற்றல் பற்றிய வலை பதிவு. நேரம் கிடைக்கும் போதுஎல்லாம் மேலும் எனது கருத்தினை எழுத எனது மனது தூண்டுகிறது.

Introduction to Deep Learning Image Classification using Keras

Malaikannan

Malaikannan Sankarasubbu

24 February 2017 at 08:57

DeepLearning is a powerful tool that can be used to solve lot of problems. I can’t solve all the problems in this world, if I can inspire few others to take Deep Learning to solve other problems it will make a positive impact in this society. I believe in spreading knowledge to all segments. Below is the youtube link for a talk i gave in Demystifying Artificial Intelligence conference couple of weeks back

I am part of IDLI group (Indian Deep Learning Initiative) which is trying get DeepLearning promoted in India. We are organizing lot of youtube live sessions. I gave the same talk with screen sharing. If you like to see code when someone is talking and not presenters face then this youtube link is for you

When to use Machine Learning?

Malaikannan

Malaikannan Sankarasubbu

5 February 2017 at 18:46

Should I get started with “Machine Learning” to improve my situation?

If you are trying to pick an action based on a problem you are facing, how should you approach it? You have heard a lot or a little about “Machine Learning.” Here’s my personal story to help you decide.

As a software programmer, when I want solve a problem, I try to figure out a way to automate it. Being a machine learning engineer, I begin comparing to the oldest problem solving method: deterministic programming.

Malai Running

Being active is my passion. Everyday, I decide whether to go outside for a run outdoors. Situations that influence my decision are:

Is it raining outside?
- Yes –> Don’t run
- No –> I think I can run.
Time of the day?
- Is it 10 P.M to 5 A.M –> Dont’ run. Too dark outside.
- 5 A.M to 7 A.M –> I think I can run.
- 8 A.M to 9 A.M –> Am I working from home? I think I can run.
- 9-5 P.M –> Don’t run. Office work.
- 5 P.M -7 P.M –> I think I can run.
- 7 P.M - 9 P.M –> if it is not too dark then I can run.
- 9 P.M - 10 P.M –> After dinner, I don’t run.
Do I have to drop my son in school?
- Is it Morning school drop time ( 8.30 A.M to 9 A.M) –> Dont run
- Is it evening school pickup time (4.45 P.M to 5.15 P.M) –> Dont run.
- Other times. I think I can run.
What is the temperature outside?
- 50-80F –> I think I can run.
- 80F-110F –> I don’t think I can run. Too hot.
- 0F - 50F –> I am not running.

If I do this deterministically, then I will end up with lot of “If” conditions for all the permutations of combinations that are possible . With just 4 attributes, it is getting complex. Imagine having hundreds of attributes influencing whether I want to run or not. This is where machine learning will save my day. (Did I mention that I am a lazy programmer? Good code is better than more code.).

A machine learning model “projects past on future” to “learn” whether I will run outside or not. With historical data of my running (Map my run, Garmin, Fitbit, Apple Watch) + publicly available weather data, I can train a machine learning model (classification problem) to decide whether I want to run or not. A set of machine learning algorithms will approximate a decision boundary for this problem. Output from the machine learning algorithm is probability for 2 classes: a) run or b) don’t run.

Do you see? Machine learning rocks. This story is perfect for a supervised classification problem.

Natural Language Processing using Word2Vec

Malaikannan

Malaikannan Sankarasubbu

22 January 2017 at 07:42

Natural Language Processing is a complicated area. Computers are really good at crunching numbers but not so much with text. We have to specifically instruct them on how to handle text. For Text analytics to be carried out we have to represent text in a form the computers can understand i.e. in the form of numbers.

The method for doing this varies.
A good method is one that captures as much of the meaning of the text as possible.

After this is done, analysis can be carried out on these numbers in the same way as dealing with numbers.

Conventional Text Analysis or Bag of Word Models

Lets take this example text “SanFrancisco is a beautiful California city. LosAngeles is a lovely California metropolis”

Popular way of converting this text to numbers is known as “Bag of Words” model. Each word is extracted from the text and put in a bag together.Each word in the bag is then assigned a suitable value. Lets parse the example text and extract words

“SanFrancisco”, “is”, “a”, “city”, “LosAngeles” ,”lovely”, “California”, “metropolis”.

Stop words are the most common words in a language. In the extracted list of words “is” , “a” are considered stop words hence it will be ignored. So the final list of extracted words will be

“SanFrancisco”, “beautiful”, “city” ,”LosAngeles”, “lovely” , “California” , “metropolis”

In Bag of Words model each word is assigned a value based on number of times it occurs in the text.

SanFrancisco	beautiful	city	LosAngeles	lovely	California	metropolis
1	1	1	1	1	2	1

This approach has lot of disadvantages, few of those are listed below

It does not capture the order of words in the original text.
It does not capture the context of the text.
It does not capture about the meanings of the words.

What we are telling the computer using this approach is city, lovely, metropolis are all equal. Even though city and metropolis are equivalent and lovely is something very different.

Neural Language models

Prof Yoshua Bengio started this work in 2008. A language model is an algorithm for capturing the salient statistical characteristics of the distribution of sequences of words in a natural language.

A neural language model learn distributed representations on words to reduce the impact of the curse of dimensionality. Curse of dimensionality is when number of input variable grows the number of required examples to train a model grows exponentially.

You can read more about this work in this link

Word2vec – Word to Vector

Word2Vec is one of the influential papers in Natural Language Processing. It has nearly 3000 citations. Word2Vec improves on Prof Yoshua Bengio’s earlier work on Neural Language Models. Word2Vec was created by google, with main author being Tomas Mikolov. You can read the original paper here.

Word2vec is an answer to all the disadvantages listed in the previous section.

It is intelligent, learns context of the text.
Understands sentences, and not just individual words.
Understands relationships between words.

We understand what a word is, lets see what a vector is. A vector is a sequence of numbers that forms a group. For example

(3) is a one dimensional vector.
(2,8) is a two dimensional vector.
(12,6,7,4) is a four dimensional vector.

A vector can be represented as by plotting on a graph. Lets take a 2D example

2D Plot

We can only 3 dimensions, anything more than that you can just say it not visualize.

How Word2Vec works

Word2Vec

When we train word2vec algorithm in the example discussed above “SanFrancisco is a beautiful California city. LosAngeles is a lovely California metropolis”

Lets assume that it outputs 2 dimension vectors for each words, since we can’t visualize anything more than 3 dimension.

SanFrancisco (6,6)
beautiful (-13,-4)
California (10,8)
city (2,10)
LosAngeles (6.5,5)
lovely(-12,-7)
metropolis(2.5,8)

Below is a 2D Plot of vectors

2DPlot

Word2vec and Analogies

Word2vec algorithm is really good in discovering analogies on data. In the below plot from relative position analogies can be observed.

2DPlot

Algorithm knows the answer to

If SanFrancisco : beautiful
then LosAngeles : ???

Answer : lovely

To get to the answer do

Draw a line from SanFrancisco to beautiful
Shift this line to LosAngeles
Find the end-point of this line.

Similarly you can draw other analogies like

if SanFrancisco : city
then LosAngeles : metropolis

Math for Analogies is beautiful, it can be expressed in simple vector arithmetic.

SanFrancisco - LosAngeles = beautiful - [unknown]
[unknown] = beautiful + LosAngeles - SanFrancisco
[unknown] = (-12.5,5) which is close to lovely

vector

Google trained Word2Vec on a large volume data, it came up with some interesting analogies

King

countries and capitals

Word2vec depicting relationships between countries and capitals Analogy

Machine translation

In complicated use-cases models can understand translations from one language to another.

MachineTranslation

Contributions

Thanks to Arthur Chan for suggesting changes to the blog.