An Introduction to Tokenizers in Natural Language Processing

Tokenizers

Co-authored by Tamil Arasan, Selvakumar Murugan and Malaikannan Sankarasubbu

In Natural Language Processing (NLP), one of the foundational steps is transforming human language into a format that computational models can understand. This is where tokenizers come into play. Tokenizers are specialized tools that break down text into smaller units called tokens, and convert these tokens into numerical data that models can process.

Imagine you have the sentence:

Artificial intelligence is revolutionizing technology.

To a human, this sentence is clear and meaningful. But we do not understand the whole sentence in one shot (okay, maybe you did, but I am sure that if I gave you a paragraph, or even better an essay, you would not be able to understand it in one shot). Instead, we make sense of its parts, like words and then phrases, and understand the whole sentence as a composition of the meanings of its parts. That is just how things work, regardless of whether we are trying to make a machine mimic our language understanding or not. This has nothing to do with the reason ML models, or computers in general, work with numbers. It is purely how language works, and there is no going around it.

ML models, like everything else we run on computers, can only work with numbers, so we need to transform the text into a number or a series of numbers (since we have more than one word). We have a lot of freedom in how we transform the text into numbers, and as always, with freedom comes complexity. But basically, tokenization as a whole is a two-step process: finding all the tokens (words, for now) in the text, and assigning a unique number, an ID, to each token.
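The two steps can be sketched in a few lines of Python. This is only an illustration under simple assumptions: tokens are found by splitting on whitespace, and the helper names `build_vocab` and `encode` are made up for this example.

```python
def build_vocab(texts):
    """Step 1: find the unique tokens. Step 2: give each one an integer ID."""
    vocab = {}
    for text in texts:
        for token in text.split():  # naive whitespace segmentation
            if token not in vocab:
                vocab[token] = len(vocab)  # next unused integer
    return vocab


def encode(text, vocab):
    """Replace every token in the text with its ID."""
    return [vocab[token] for token in text.split()]


corpus = ["Artificial intelligence is revolutionizing technology."]
vocab = build_vocab(corpus)
ids = encode(corpus[0], vocab)

print(vocab)
print(ids)
```

Note that the final token here is `technology.` with the full stop attached, which already hints at why plain whitespace splitting is too crude.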

There are many ways to segment a sentence or paragraph into pieces such as phrases, words, sub-words or even individual characters. Understanding why a particular tokenization scheme is better requires a grasp of how embeddings work. If you're familiar with NLP, you'd ask "Why? Tokenization comes before the embedding, right?" Yes, you're right, but NLP is paradoxical like that. Don't worry, we will cover that as we go.

Background

Before we venture any further, let's understand the difference between neural networks and our typical computer programs. We all know by now that for traditional computer programs we write/translate the rules into code by hand, whereas NNs learn the rules (the mapping between input and output) from data through a process called training. In normal programming we have a plethora of data structures that can store information in any shape or form we want, along with algorithms that jump up and down, back and forth through the set of instructions we call code. Neural networks do not allow us all the control flow we'd like. In a neural network, there is only one direction the "program" can run: left to right.

Unlike traditional programs, which we can feed with input in complicated ways, neural networks accept input in only a fixed number of ways, usually in the form of vectors (a fancy name for lists of numbers) of fixed size (or dimension, more precisely). In most DNNs, input and output sizes are fixed regardless of the problem being solved. For example, in CNNs the input (usually an image) size and number of channels are fixed. In RNNs, the embedding dimensions, input vocabulary size, number of output labels (in classification problems, e.g. sentiment classification) and/or output vocabulary size (in text generation problems, e.g. QA, translation) are all fixed. In Transformer networks even the (maximum) sentence length is fixed. This is not a bad thing; constraints like these enable the network to compress and capture the necessary information.

Also note that there are only a few tools to test "equality", "relevance" or "correctness" for things inside the network, because the only things that dwell inside the network are vectors. Cosine similarity and attention scores are popular. You can think of vectors as variables that keep track of state inside the neural network program. But unlike in traditional programs, where you can declare variables as you like and print them for troubleshooting, in networks the vector-variables are meaningful only at the boundaries of the layers (not entirely true) within the network.

Let's take a look at the simplest example to understand why just pulling a vector from anywhere in the network will not be of any value to us. In the following code, three functions perform the identical calculation even though their code is slightly different. The intentionally named, and strictly unnecessary, variables temp and growth_factor need not be created, as exemplified by the first function, which directly embodies the compound interest formula, $A = P(1+\frac{R}{100})^{T}$. Compared to temp, the variable growth_factor holds a more meaningful interpretation: it represents how much the money will grow due to compounding interest over time. For more complicated formulae and functions, we might create intermediate variables so that the code goes easy on the eye, but they have no significance to the operation of the function.

import math

def compound_interest_1(P,R,T):
    A = P * (math.pow((1 + (R/100)),T))
    CI = A - P
    return CI

def compound_interest_2(P,R,T):
    temp = (1 + (R/100))
    A = P * (math.pow(temp, T))
    CI = A - P
    return CI

def compound_interest_3(P,R,T):
    growth_factor = (math.pow((1 + (R/100)),T))
    A = P * growth_factor
    CI = A - P
    return CI

Here is another example, this time from the perspective of operations: clock arithmetic. Let's assign the numbers 0 through 6 to the days of the week, from Sunday to Saturday.

Table 1

| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
| --- | --- | --- | --- | --- | --- | --- |
| 0   | 1   | 2   | 3   | 4   | 5   | 6   |

John Conway suggests a mnemonic device: think of the days of the week as Noneday, Oneday, Twosday, Treblesday, Foursday, Fiveday, and Six-a-day.

So if you want to know what day it is 137 days from today, and today is, say, Thursday (i.e. 4), we can compute $(4+137) \bmod 7 = 1$, i.e. Monday. As you can see, adding numbers (days) in clock arithmetic results in a meaningful output. You can add days together to get another day. Okay, let's ask the question: can we multiply two days together? Is it in any way meaningful to multiply days? Just because we can multiply any two numbers mathematically, is it useful to do so in our clock arithmetic?
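The same arithmetic can be checked in a few lines of Python. The helper name `day_after` is made up for this sketch.

```python
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]


def day_after(today, offset):
    """Add a number of days to a weekday index, wrapping around at 7."""
    return (today + offset) % 7


thursday = DAYS.index("Thu")       # 4
result = day_after(thursday, 137)  # (4 + 137) % 7
print(DAYS[result])

# Multiplication is just as easy to compute, but the result has no calendar meaning:
# "Tuesday times Thursday" gives a valid index, yet it answers no real question.
product = (DAYS.index("Tue") * DAYS.index("Thu")) % 7
print(product)
```

Both operations return a number between 0 and 6, but only the addition corresponds to something we can interpret.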

All of this digression is to emphasize that the embedding is deemed to capture the meaning of words, and the vector from the last layers is deemed to capture the meaning of, let's say, a sentence. But when you take a vector from within the layers (just because you can), it does not refer to any meaningful unit such as a word, phrase or sentence as we understand it.

A little bit of history

If you're old enough, you might remember that before transformers became the standard paradigm in NLP, we had another one: EEAP (Embed, Encode, Attend, Predict). I am grossly oversimplifying here, but you can think of it as follows.

Embedding

Captures the meaning of words. A matrix of size $N \times D$, where

  • $N$ is the size of the vocabulary, i.e. the number of unique words in the language
  • $D$ is the dimension of the embedding, i.e. the length of the vector corresponding to each word.

Look up the word-vector (embedding) for each word.

Encoding
Find the meaning of a sentence by using the meaning captured in the embeddings of its constituent words, with the help of RNNs like LSTM and GRU or transformers like BERT and GPT, which take the embeddings and produce vector(s) for the whole sequence.
Prediction
Depending upon the task at hand, either assign a label to the input sentence or generate another sentence word by word.
Attention
Helps with prediction by focusing on what is important right now, by drawing a probability distribution (normalized attention scores) over all the words. Words with high scores are deemed important.
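To make the Embedding step concrete, here is a minimal sketch of an $N \times D$ lookup table in plain Python. The vocabulary, the dimension `D = 4` and the random values are all made up for illustration; in a real model the values are learned during training.

```python
import random

random.seed(0)

# Hypothetical toy vocabulary: word -> row index in the embedding matrix
vocab = {"telephone": 0, "booth": 1, "is": 2, "nearby": 3}
N, D = len(vocab), 4

# An N x D matrix: one D-dimensional vector per word in the vocabulary
embedding = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(N)]


def embed(tokens):
    """Look up the word-vector for each token."""
    return [embedding[vocab[token]] for token in tokens]


vectors = embed(["telephone", "booth"])
print(len(vectors), "vectors of dimension", len(vectors[0]))

# A word outside the vocabulary has no row in the matrix to look up.
try:
    embed(["office"])
    oov_failed = False
except KeyError:
    oov_failed = True
print("lookup failed for unseen word:", oov_failed)
```

Because the matrix has exactly $N$ rows, a word that was not in the vocabulary when the matrix was created simply has nowhere to go.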

As you can see above, $N$ is the vocabulary size, i.e. the number of unique words in the language. A handful of years ago, "language" usually meant the corpus at hand (on the order of a few thousand sentences), and datasets like CNN/DailyMail were considered huge. There were clever tricks like anonymizing named entities to force the ML models to focus on language-specific features like grammar instead of open-world words like the names of places, presidents, corporations and countries. Good times they were! The point is, the corpus in your possession might not contain all the words of the language. As we have seen, the size of the embedding must be fixed before training the network. If by good fortune you stumbled upon a new dataset, and hence new words, adding them to your model was not easy, because the embedding needs to be extended to accommodate these new out-of-vocabulary (OOV) words, and that requires retraining the whole network. OOV means out of the current model's vocabulary. And this is why simply segmenting the text on empty spaces will not work.

With that background, let's dive in.

Tokenization

Tokenization is the process of segmenting the text into individual pieces (usually words) so that an ML model can digest them. It is the very first step in any NLP system and influences everything that follows. To understand the impact of tokenization, we need to understand how embeddings and sentence length influence the model. We will call sentence length sequence length from here on, because a sentence is understood to be a sequence of words, and we will experiment with sequences of different things, not just words, which we will call tokens.

Tokens can be anything

  • Words - "telephone" "booth" "is" "nearby" "the" "post" "office"
  • Multiword Expressions (MWEs) - "telephone booth" "is" "nearby" "the" "post office"
  • Sub-words - "tele" "#phone" "booth" "is" "near" "#by" "the" "post" "office"
  • Characters - "t" "e" "l" "e" "p" ... "c" "e"

We know segmenting the text based on empty spaces will not work, because the vocabulary will keep growing. What about punctuation? Surely it will help with words like don't, won't, aren't, o'clock, Wendy's, co-operation, etc.? The same reasoning applies here too. Moreover, segmenting at punctuation creates different problems, e.g. I.S.R.O > I, S, R, O, which is not ideal.
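A quick experiment shows both failure modes. This sketch assumes the simplest possible rules: `str.split` for whitespace, and a regular expression that treats every punctuation mark as a boundary. The function names are made up for this example.

```python
import re


def split_on_spaces(text):
    """Segment on whitespace only; punctuation stays glued to the words."""
    return text.split()


def split_on_punctuation(text):
    """Keep only runs of letters/digits; every punctuation mark is a boundary."""
    return re.findall(r"\w+", text)


print(split_on_spaces("Wendy's is near I.S.R.O"))
print(split_on_punctuation("don't"))
print(split_on_punctuation("co-operation"))
print(split_on_punctuation("I.S.R.O"))
```

The contraction and the hyphenated word are broken into fragments, and the abbreviation dissolves into single letters.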

Objectives of Tokenization

The primary objectives of tokenization are:

Handling OOV
Tokenizers should be able to segment the text into pieces so that any word in the language can be handled: whether it is in the dataset or not, whether it is a word we might conjure in the foreseeable future, and whether it is technical/domain-specific terminology that scientists might utter to sound intelligent or a word commonly used by everyone in day-to-day life. An ideal tokenizer should be able to deal with any and all of them.
Efficiency
Reducing the size (length) of the input text to make computation feasible and faster.
Meaningful Representation
Capturing the semantic essence of the text so that the model can learn effectively. We will discuss this a bit later.

Simple Tokenization Methods

Go through the code below, and see if you can make any inferences from the table it produces. It reads the book The Republic, counts the tokens at the character, word and sentence levels, and also reports the number of unique tokens in the whole book.

Code

``` python
from collections import Counter
from nltk.tokenize import sent_tokenize

with open('plato.txt') as f:
    text = f.read()

words = text.split()
sentences = sent_tokenize(text)

char_counter = Counter()
word_counter = Counter()
sent_counter = Counter()

char_counter.update(text)
word_counter.update(words)
sent_counter.update(sentences)

print('#+name: Vocabulary Size')
print('|Type|Vocabulary Size|Sequence Length|')
print(f'|Unique Characters|{len(char_counter)}|{len(text)}')
print(f'|Unique Words|{len(word_counter)}|{len(words)}')
print(f'|Unique Sentences|{len(sent_counter)}|{len(sentences)}')
```


**Table 2**

| Type              | Vocabulary Size | Sequence Length |
| ----------------- | --------------- | --------------- |
| Unique Characters | 115             | 1,213,712       |
| Unique Words      | 20,710          | 219,318         |
| Unique Sentences  | 7,777           | 8,714           |



## Study

Character-Level Tokenization

:   In this most elementary method, text is broken down into individual
    characters.

    *"data"* > `"d" "a" "t" "a"`

Word-Level Tokenization

:   This is the simplest and most used (before sub-word methods became
    popular) method of tokenization, where text is split into individual
    words based on spaces and punctuation. Still useful in some
    applications and as a pedagogical launch pad into other tokenization
    techniques.

    *"Machine learning models require data."* >
    `"Machine", "learning", "models", "require", "data", "."`

Sentence-Level Tokenization

:   This approach segments text into sentences, which is useful for
    tasks like machine translation or text summarization. Sentence
    tokenization is not as popular as we'd like it to be.

    *"Tokenizers convert text. They are essential in NLP."* >
    `"Tokenizers convert text.", "They are essential in NLP."`

n-gram Tokenization

:   Instead of using sentences as tokens, what if you could use
    phrases of fixed length? The following shows the n-grams for n=2,
    i.e. 2-grams or bigrams. Yes, the `n` in n-grams stands
    for how many words are chosen. n-grams can also be built from
    characters instead of words, though these are not as useful as
    word-level n-grams.

    *"Data science is fun"* >
    `"Data science", "science is", "is fun"`.
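Word-level n-grams are just a sliding window over the list of words. A minimal sketch follows, assuming whitespace-separated words; the function name `ngrams` is made up for this example.

```python
def ngrams(text, n=2):
    """Slide a window of n words over the text and join each window into one token."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngrams("Data science is fun", n=2))
print(ngrams("Data science is fun", n=3))
```

A text with fewer than `n` words yields an empty list, since no full window fits.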





**Table 3**

| Tokenization | Advantages                             | Disadvantages                                        |
| ------------ | -------------------------------------- | ---------------------------------------------------- |
| Character    | Minimal vocabulary size                | Very long token sequences                            |
|              | Handles any possible input             | Requires a huge amount of compute                    |
| Word         | Easy to implement and understand       | Large vocabulary size                                |
|              | Preserves meaning of words             | Cannot cover the whole language                      |
| Sentence     | Preserves the context within sentences | Less granular; may miss important word-level details |
|              | Sentence-level semantics               | Sentence boundary detection is challenging           |

As you can see from the table, vocabulary size and sequence length are
inversely related. Neural networks require that tokens be present in
many places and many times. That is how the networks understand words.
Remember how, when you don't know the meaning of a word, you ask
someone to use it in a sentence? Same thing here: the more sentences a
token appears in, the better the network can understand it. But in the
case of sentence tokenization, you can see there are nearly as many
tokens in the vocabulary (7,777) as in the tokenized corpus (8,714). It
is safe to say that most tokens occur only once, and that is not a
healthy diet for a network. This problem occurs in word-level
tokenization too, but in a subtler form: the out-of-vocabulary (OOV)
problem. To deal with OOV we need to stay between character-level and
word-level tokens. Enter sub-words.

# Advanced Tokenization Methods

Subword tokenization is an advanced tokenization technique that breaks
text into units smaller than words. It helps in handling rare or unseen
words by decomposing them into known subword units. Our hope is that
the sub-words decomposed from the text can be used to compose new,
unseen words and so act as the tokens for those unseen words. Common
algorithms include Byte Pair Encoding (BPE), WordPiece and
SentencePiece.

*"unhappiness"* > `"un", "happi", "ness"`

BPE was originally a technique for data compression. It has been
repurposed to compress a text corpus by merging frequently occurring
pairs of characters or subwords. Think of it as asking how few unique
tokens you need to recreate the whole book when you are free to arrange
those tokens in a line as many times as you want.

Algorithm

:   1.  *Initialization*: Start with a list of characters (initial
        vocabulary) from the text(whole corpus).
    2.  *Frequency Counting*: Count all pair occurrences of consecutive
        characters/subwords.
    3.  *Pair Merging*: Find the most frequent pair and merge it into a
        single new subword.
    4.  *Update Text*: Replace all occurrences of the pair in the text
        with the new subword.
    5.  *Repeat*: Continue the process until reaching the desired
        vocabulary size or merging no longer provides significant
        compression.
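The steps above can be sketched in plain Python. This is a toy version under stated assumptions, not a production implementation: the corpus is pre-split on whitespace so merges never cross word boundaries, the loop stops after a fixed number of merges rather than at a target vocabulary size, ties between equally frequent pairs go to the pair seen first, and the name `train_bpe` is made up for this example.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn BPE merges from a whitespace-separated corpus."""
    # Initialization: every word is a tuple of single characters, with its count.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Frequency counting: count every pair of adjacent symbols.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pair merging: pick the most frequent pair (first seen wins ties).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Update text: replace each occurrence of the pair with the merged symbol.
        updated = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            updated[tuple(merged)] += freq
        words = updated
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(dict(words))
```

After two merges the frequent stem `low` has become a single token, while the rarer endings are still spelled out character by character.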

Advantages

:   -   Reduces the vocabulary size significantly.
    -   Handles rare and complex words effectively.
    -   Balances between word-level and character-level tokenization.

Disadvantages

:   -   Tokens may not be meaningful standalone units.
    -   Slightly more complex to implement.

## Trained Tokenizers

WordPiece and SentencePiece tokenization methods are extensions of BPE
in which the vocabulary is not created merely by merging the most
frequent pair. These variants evaluate whether a given merge is useful
by measuring how much it increases the likelihood of the corpus. In
simple words, let's take two vocabularies, one before and one after the
merge, and train a language model on each. If the model trained on the
vocabulary after the merge has lower perplexity (think loss), then we
assume that the merge was useful. And we would need to repeat this every
time we make a merge. That is not practical, and hence there are some
mathematical tricks to make this more practical, which we will discuss
in a future post.

The iterative merging process is the training of the tokenizer, and
this training is different from the training of actual models. There
are Python libraries for training your own tokenizer, but when you're
planning to use a pretrained language model, it is better to stick with
the pretrained tokenizer associated with that model. In the following
sections we see how to train a simple BPE tokenizer and a SentencePiece
tokenizer, and how to use the BERT tokenizer that comes with Hugging
Face's `transformers` library.

## Tokenization Techniques Used in Popular Language Models

### Byte Pair Encoding (BPE) in GPT Models

GPT models, such as GPT-2 and GPT-3, utilize Byte Pair Encoding (BPE)
for tokenization.

``` python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
                     vocab_size=30000)
files = ["plato.txt"]

tokenizer.train(files, trainer)
tokenizer.model.save('.', 'bpe_tokenizer')

output = tokenizer.encode("Tokenization is essential first step for any NLP model.")
print("Tokens:", output.tokens)
print("Token IDs:", output.ids)
print("Length: ", len(output.ids))
```

``` python
Tokens: ['T', 'oken', 'ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'N', 'L', 'P', 'model', '.']
Token IDs: [50, 6436, 2897, 127, 3532, 399, 1697, 184, 256, 44, 42, 46, 3017, 15]
Length:  14
```

### SentencePiece in T5

T5 models use a Unigram Language Model for tokenization, implemented via the SentencePiece library. This approach treats tokenization as a probabilistic model over all possible tokenizations.

``` python
import sentencepiece as spm
spm.SentencePieceTrainer.Train('--input=plato.txt --model_prefix=unigram_tokenizer --vocab_size=3000 --model_type=unigram')
```

``` python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("unigram_tokenizer.model")
text = "Tokenization is essential first step for any NLP model."
pieces = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)
print("Pieces:", pieces)
print("Piece IDs:", ids)
print("Length: ", len(ids))
```

``` python
Pieces: ['▁To', 'k', 'en', 'iz', 'ation', '▁is', '▁essential', '▁first', '▁step', '▁for', '▁any', '▁', 'N', 'L', 'P', '▁model', '.']
Piece IDs: [436, 191, 128, 931, 141, 11, 1945, 123, 962, 39, 65, 17, 499, 1054, 1441, 1925, 8]
Length:  17
```

### WordPiece Tokenization in BERT

``` python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Tokenization is essential first step for any NLP model."
encoded_input = tokenizer(text, return_tensors='pt')

# Note: called this way, the tokenizer also adds BERT's special tokens
# [CLS] and [SEP] around the sentence, so they appear in the printed output.
print("Tokens:", tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0]))
print("Token IDs:", encoded_input['input_ids'][0].tolist())
print("Length: ", len(encoded_input['input_ids'][0].tolist()))
```

Summary of Tokenization Methods

Table 4

| Method           | Length | Tokens |
| ---------------- | ------ | ------ |
| BPE              | 14     | `['T', 'oken', 'ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'N', 'L', 'P', 'model', '.']` |
| SentencePiece    | 17     | `['▁To', 'k', 'en', 'iz', 'ation', '▁is', '▁essential', '▁first', '▁step', '▁for', '▁any', '▁', 'N', 'L', 'P', '▁model', '.']` |
| WordPiece (BERT) | 12     | `['token', '##ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'nl', '##p', 'model', '.']` |

Different tokenization methods give different results for the same input sentence. As we add more data to the tokenizer training, the differences between WordPiece and SentencePiece might decrease, but they will not vanish, because of the difference in their training process.

Table 5

| Model | Tokenization Method    | Library       | Key Features                             |
| ----- | ---------------------- | ------------- | ---------------------------------------- |
| GPT   | Byte Pair Encoding     | tokenizers    | Balances vocabulary size and granularity |
| BERT  | WordPiece              | transformers  | Efficient vocabulary, handles morphology |
| T5    | Unigram Language Model | sentencepiece | Probabilistic, flexible across languages |

Tokenization and Non English Languages

Tokenizing text is complex, especially when dealing with diverse languages and scripts. Various challenges can impact the effectiveness of tokenization.

Tokenization Issues with Complex Languages: With a focus on Tamil

Tokenizing text in languages like Tamil presents unique challenges due to their linguistic and script characteristics. Understanding these challenges is essential for developing effective NLP applications that handle Tamil text accurately.

Challenges in Tokenizing Tamil Language

  1. Agglutinative Morphology

    Tamil is an agglutinative language, meaning it forms words by concatenating morphemes (roots, suffixes, prefixes) to convey grammatical relationships and meanings. A single word may express what would be a full sentence in English.

    Impact on Tokenization
    • Words can be very lengthy and contain many morphemes.
      • போகமுடியாதவர்களுக்காவேயேதான்
  2. Punarchi and Phonology

    Tamil has specific rules (punarchi) for how two words can be combined, and the resulting word may not be phonologically identical to its parts. These phonological transformations can cause problems for TTS/STT systems too.

    Impact on Tokenization
    • Surface forms of words may change when combined, making boundary detection challenging.
      • மரம் + வேர் > மரவேர்
      • தமிழ் + இனிது > தமிழினிது
  3. Complex Script and Orthography

    The representation of the Tamil alphabet in Unicode is suboptimal for everything except as a standardized storage format. Even simple operations that are intuitive for a native Tamil speaker are harder to implement because of this. Techniques like BPE applied to Tamil text will break words at completely inappropriate points, such as cutting an uyirmei letter into a consonant and a diacritic, resulting in meaningless output.

    தமிழ் > த, ம, ி, ழ, ்
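The split shown above is exactly what Python sees when it iterates over the string. The sketch below lists the code points of the word and then regroups them so that combining marks stay attached to their base consonant. The grouping rule (attach every character whose Unicode category starts with `M` to the previous character) is a simplification assumed for this example, not a complete grapheme segmentation algorithm, and the function name `clusters` is made up.

```python
import unicodedata

# The word தமிழ், written with escapes so the code points are unambiguous
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

# What a character-level (or naive BPE) tokenizer starts from
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))


def clusters(text):
    """Attach combining marks (vowel signs, virama) to the preceding letter."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


print(len(word), "code points")
print(clusters(word))
```

Using such letter-level units as the seed vocabulary, instead of raw code points, is one way to keep a subword tokenizer from cutting inside a letter.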

Strategies for Effective Tokenization of Tamil Text

  1. Language-Specific Tokenizers

    Train Tamil-specific subword tokenizers with initial seed tokens prepared by better preprocessing techniques, to avoid cases like those described under "Complex Script and Orthography" above. Use morphological analyzers to decompose words into roots and affixes, aiding in understanding and processing complex word forms.

Choosing the Right Tokenization Method

Challenges in Tokenization

  • Ambiguity: Words can have multiple meanings, and tokenizers cannot capture context. Example: The word "lead" can be a verb or a noun.
  • Handling Special Characters and Emojis: Modern text often includes emojis, URLs, and hashtags, which require specialized handling.
  • Multilingual Texts: Tokenizing text that includes multiple languages or scripts adds complexity, necessitating adaptable tokenization strategies.

Best Practices for Effective Tokenization

  • Understand Your Data: Analyze the text data to choose the most suitable tokenization method.
  • Consider the Task Requirements: Different NLP tasks may benefit from different tokenization granularities.
  • Use Pre-trained Tokenizers When Possible: Leveraging existing tokenizers associated with pre-trained models can save time and improve performance.
  • Normalize Text Before Tokenization: Cleaning and standardizing text
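As an illustration of what normalization before tokenization can look like, here is a minimal sketch. The specific choices (Unicode NFKC normalization, lowercasing, collapsing whitespace) are assumptions for this example; the right set of steps depends on the data and the task, and lowercasing in particular is not always desirable.

```python
import re
import unicodedata


def normalize(text):
    """Standardize text before handing it to a tokenizer."""
    text = unicodedata.normalize("NFKC", text)  # unify equivalent Unicode forms
    text = text.lower()                         # case folding (task dependent)
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text


# The first letter below is a full-width "T" (U+FF34), which NFKC maps to a plain "T".
raw = "  \uff34okenizers\tARE   essential "
print(repr(normalize(raw)))
```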

Vectordb

Vector Databases 101

Co-authored by Angu S KrishnaKumar, Kamal raj Kanagarajan and Malaikannan Sankarasubbu

Introduction

In the world of Large Language Models (LLMs), vector databases play a pivotal role in Retrieval Augmented Generation (RAG) applications. These specialized databases are designed to store and retrieve high-dimensional vectors, which represent complex data such as text, images, and audio. RAG is a technique that combines the power of LLMs with external knowledge bases: by retrieving relevant information from a knowledge base and incorporating it into the LLM’s generation process, it can produce more informative, accurate and contextually appropriate responses.

How RAG Works:

  • User Query: A user submits a query or prompt to the RAG system.
  • Information Retrieval: The system retrieves relevant information from a knowledge base based on the query. VectorDBs play a key role in this. Embeddings aka vectors are stored in VectorDB and retrieval is done using similarity measures.
  • Language Model Generation: The retrieved information is fed into a language model, which generates a response based on the query and the retrieved context.
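The three steps can be sketched end to end in plain Python. Everything here is a toy stand-in: the knowledge base is a short list, the three-dimensional embeddings are made-up numbers rather than the output of a real embedding model, and the final step only assembles the prompt that would be sent to a language model. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical knowledge base: (text chunk, made-up embedding)
knowledge_base = [
    ("BPE merges frequent symbol pairs.", [0.9, 0.1, 0.0]),
    ("Tamil is an agglutinative language.", [0.1, 0.9, 0.1]),
    ("Vector databases support similarity search.", [0.0, 0.2, 0.9]),
]


def retrieve(query_vector, k=1):
    """Information retrieval: return the k chunks most similar to the query."""
    ranked = sorted(knowledge_base,
                    key=lambda item: cosine(query_vector, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query, query_vector):
    """Combine the retrieved context with the user query for the language model."""
    context = "\n".join(retrieve(query_vector))
    return f"Context:\n{context}\n\nQuestion: {query}"


# User query, with a made-up embedding for it
prompt = build_prompt("What do vector databases do?", [0.1, 0.1, 0.8])
print(prompt)
```

In a real system the sorted scan in `retrieve` is the part a vector database replaces.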

In this blog series, we will delve into the intricacies of vector databases, exploring their underlying principles, key features, and real-world applications. We will also discuss the advantages they offer over traditional databases and how they are transforming the way we store, manage, and retrieve data.

What is a Vector?

A vector is an ordered sequence of numbers. For example:

  • (3) is a one dimensional vector.
  • (2,8) is a two dimensional vector.
  • (12,6,7,4) is a four dimensional vector.

A vector can be represented by plotting it on a graph. Let’s take a 2D example.

2D Plot

We can only visualize up to 3 dimensions; anything beyond that you can describe but not visualize. Below is an example of a 4-dimensional vector representation of the word king.

King Vector

What is a Vector Database?

A Vector Database (VectorDB) is a specialized database system designed to store, manage, and efficiently query high-dimensional vector data. Unlike traditional relational databases that work with structured data in tables, VectorDBs are optimized for handling vector embeddings – numerical representations of data in multi-dimensional space.

In a VectorDB:

  1. Each item (like a document, image, or concept) is represented as a vector – a list of numbers that describe the item’s features or characteristics.
  2. These vectors are stored in a way that allows for fast similarity searches and comparisons.
  3. The database is optimized for operations like finding the nearest neighbors to a given vector, which is crucial for many AI and machine learning applications.
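A minimal in-memory sketch of these three points follows. The class `TinyVectorStore` and its two-dimensional example vectors are made up for illustration. It finds nearest neighbours by brute force, comparing the query against every stored vector, whereas real vector databases use specialized index structures to avoid that full scan.

```python
import math


class TinyVectorStore:
    """Toy store: keeps (id, vector) pairs and answers nearest-neighbour queries."""

    def __init__(self):
        self.items = []

    def add(self, item_id, vector):
        self.items.append((item_id, vector))

    def nearest(self, query, k=2):
        # Brute force: measure the Euclidean distance to every stored vector.
        ranked = sorted(self.items, key=lambda item: math.dist(query, item[1]))
        return [item_id for item_id, _ in ranked[:k]]


store = TinyVectorStore()
store.add("king", [0.90, 0.80])
store.add("queen", [0.85, 0.90])
store.add("apple", [0.10, 0.20])

neighbours = store.nearest([0.90, 0.85], k=2)
print(neighbours)
```

The cost of `nearest` grows linearly with the number of stored vectors, which is why a full scan stops being practical at scale.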

VectorDBs are particularly useful in scenarios where you need to find similarities or relationships between large amounts of complex data, such as in recommendation systems, image recognition, or natural language processing tasks.

Key Concepts

  1. Vector Embeddings

    • Vector embeddings are numerical representations of data in a multi-dimensional space.
    • They capture semantic meaning and relationships between different pieces of information.
    • In natural language processing, word embeddings are a common type of vector embedding. Each word is represented by a vector of real numbers, where words with similar meanings are closer in the vector space.
    • For a detailed treatment of embeddings, please refer to the earlier blog post, Embeddings.

Let’s look at an example of Word Vector output generated by Word2Vec

from gensim.models import Word2Vec

# Example corpus (a list of sentences, where each sentence is a list of words)
sentences = [
    ["machine", "learning", "is", "fascinating"],
    ["gensim", "is", "a", "useful", "library", "for", "word", "embeddings"],
    ["vector", "representations", "are", "important", "for", "NLP", "tasks"]
]

# Train a Word2Vec model with 300-dimensional vectors
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=4)

# Get the 300-dimensional vector for a specific word
word_vector = model.wv['machine']

# Print the vector
print(f"Vector for 'machine': {word_vector}")

Sample Output for 300 dimension vector


Vector for 'machine': [ 2.41737941e-03 -1.42750892e-03 -4.85344668e-03  3.12493594e-03
  4.84531874e-03 -1.00165956e-03  3.41092921e-03 -3.41384278e-03
  4.22888929e-03  1.44586214e-03 -1.35438916e-03 -3.27448458e-03
  4.70721726e-03 -4.50850562e-03  2.64214014e-03 -3.29884756e-03
 -3.13906092e-03  1.09677911e-03 -4.94637461e-03  3.32896863e-03
  2.03538216e-03 -1.52456785e-03  2.28793684e-03 -1.43519988e-03
  4.34566711e-03 -1.94705374e-03  1.93231280e-03  4.34081139e-03
  ...
  3.40303702e-03  1.58637420e-03 -3.31261402e-03  2.01543484e-03
  4.39879852e-03  2.54576413e-03 -3.30528596e-03  3.01509819e-03
  2.15555660e-03  1.64605413e-03  3.02376228e-03 -2.62048110e-03
  3.80181967e-03 -3.14147812e-03  2.23554621e-03  2.68812295e-03
  1.80951719e-03  1.74256027e-03 -2.47024545e-03  4.06702763e-03
  2.30203426e-03 -4.75471295e-03 -3.66776927e-03  2.06539119e-03]

  2. High Dimensional Space
  • Vector databases typically work with vectors that have hundreds or thousands of dimensions. This high dimensionality allows for rich and nuanced representations of data.
  • For example:
    • A word might be represented by 300 dimensions
    • An image could be represented by 1000 dimensions
    • A user’s preferences might be captured in 500 dimensions

Why do you need a Vector Database when there are RDBMS like PostgreSQL or NoSQL databases like Elasticsearch or MongoDB?

RDBMS

RDBMS are designed to store and manage structured data in a tabular format. They are based on the relational model, which defines data as a collection of tables, where each table represents a relation.

Key components of RDBMS:

  • Tables: A collection of rows and columns, where each row represents a record and each column represents an attribute.
  • Rows: Also known as records, they represent instances of an entity.
  • Columns: Also known as attributes, they define the properties of an entity.
  • Primary key: A unique identifier for each row in a table.
  • Foreign key: A column in one table that references the primary key of another table, establishing a relationship between the two tables.
  • Normalization: A process of organizing data into tables to minimize redundancy and improve data integrity.

Why RDBMS are a poor fit for storing vectors:

  1. Data Representation:
    • RDBMS store data in a tabular format, where each row represents an instance of an entity and each column represents an attribute.
    • Vectors are represented as a sequence of numbers, which doesn’t fit well into the tabular structure of RDBMS.
  2. Query Patterns:
    • RDBMS are optimized for queries based on joining tables and filtering data based on specific conditions.
    • Vector databases are optimized for similarity search, which involves finding vectors that are closest to a given query vector. This type of query doesn’t align well with the traditional join-based queries of RDBMS.
  3. Data Relationships:
    • RDBMS define relationships between entities using foreign keys and primary keys.
    • In vector databases, relationships are implicitly defined by the proximity of vectors in the vector space. There’s no explicit need for foreign keys or primary keys.
  4. Performance Considerations:
    • RDBMS are often optimized for join operations and range queries.
    • Vector databases are optimized for similarity search, which requires efficient indexing and partitioning techniques.
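
To make the query-pattern mismatch concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table name items, the JSON-encoded embedding column, and the tiny dimensionality are all assumptions made for illustration. The filter query can use an ordinary index, while the similarity query has to read every row and score it in application code.

```python
import json
import math
import random
import sqlite3

random.seed(0)
DIM = 8  # tiny dimensionality, for illustration only

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT, embedding TEXT)")
conn.execute("CREATE INDEX idx_category ON items (category)")

# Hypothetical data: each row stores its vector as a JSON-encoded list.
rows = [
    (i, random.choice(["book", "film"]), json.dumps([random.random() for _ in range(DIM)]))
    for i in range(200)
]
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)

# A filter query: the B-tree index on `category` can answer this directly.
n_books = conn.execute("SELECT COUNT(*) FROM items WHERE category = 'book'").fetchone()[0]

# A similarity query: no index helps, so every row is read and scored in Python.
def nearest_rows(query, k=3):
    scored = []
    for item_id, emb_json in conn.execute("SELECT id, embedding FROM items"):
        scored.append((math.dist(query, json.loads(emb_json)), item_id))
    return sorted(scored)[:k]

query = json.loads(rows[42][2])  # reuse a stored vector as the query
top = nearest_rows(query)
print(n_books, top)
```

The full scan is fine for 200 rows, but its cost grows linearly with the table, which is the gap that vector indexes are meant to close.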

Let’s also look at a table for a comparison of features

| Feature | VectorDB | RDBMS |
| --- | --- | --- |
| Dimensional Efficiency | Designed to handle high-dimensional data efficiently | Performance degrades rapidly as dimensions increase |
| Similarity Search | Implement specialized algorithms for fast approximate nearest neighbor (ANN) searches | Lack native support for ANN algorithms, making similarity searches slow and computationally expensive |
| Indexing for Vector Spaces | Use index structures optimized for vector data (e.g., HNSW, IVF) | Rely on B-trees and hash indexes, which become ineffective in high-dimensional spaces |
| Vector Operations | Provide built-in, optimized support for vector operations | Require complex, often inefficient SQL queries to perform vector computations |
| Scalability for Vector Data | Designed to distribute vector data and parallelize similarity searches across multiple nodes efficiently | While scalable for traditional data, they’re not optimized for distributing and querying vector data at scale |
| Real-time Processing | Optimized for fast insertions and queries of vector data, supporting real-time applications | May struggle with the speed requirements of real-time vector processing, especially at scale |
| Storage Efficiency | Use compact, specialized formats for storing dense vector data | Less efficient when storing high-dimensional vectors, often requiring more space and slower retrieval |
| Machine Learning Integration | Seamlessly integrate with ML workflows, supporting operations common in AI applications | Require additional processing and transformations to work effectively with ML pipelines |
| Approximate Query Support | Often support approximate queries, trading off some accuracy for significant speed improvements | Primarily designed for exact queries, lacking native support for approximate vector searches |

In a nutshell, RDBMS are well-suited for storing and managing structured data, but they are not optimized for storing and querying vectors. Vector databases, on the other hand, are specifically designed for handling vectors and performing similarity search operations.

NoSQL Databases

NoSQL databases are designed to handle large datasets and unstructured or semi-structured data that don’t fit well into the relational model. They offer flexibility in data structures, scalability, and high performance.

Common types of NoSQL databases include:

  • Key-value stores: Store data as key-value pairs.
  • Document stores: Store data as documents, often in JSON or BSON format.
  • Wide-column stores: Store data in wide columns, where each column can have multiple values.
  • Graph databases: Store data as nodes and relationships, representing connected data.

Key characteristics of NoSQL databases:

  • Flexibility: NoSQL databases offer flexibility in data structures, allowing for dynamic schema changes and accommodating evolving data requirements.
  • Scalability: Many NoSQL databases are designed to scale horizontally, allowing for better performance and scalability as data volumes grow.
  • High performance: NoSQL databases often provide high performance, especially for certain types of workloads.
  • Eventual consistency: Some NoSQL databases prioritize availability and performance over strong consistency, offering eventual consistency guarantees.

Why NoSQL Databases Might Not Be Ideal for Storing and Retrieving Vectors

While NoSQL databases offer many advantages, they might not be the best choice for storing and retrieving vectors due to the following reasons:

  1. Data Representation: NoSQL databases, while flexible, might not be specifically optimized for storing and querying high-dimensional vectors. The data structures used in NoSQL databases might not be the most efficient for vector-based operations.
  2. Query Patterns: NoSQL databases are often designed for different query patterns than vector-based operations. While they can handle complex queries, they might not be as efficient for similarity search, which is a core operation for vector databases.
  3. Performance Considerations:
    • Indexing: NoSQL databases often use different indexing techniques than RDBMS. While they can be efficient for certain types of queries, they might not be as optimized for vector-based similarity search.
    • Memory requirements: For vector-based operations, especially in large-scale applications, the memory requirements can be significant. NoSQL databases like Elasticsearch, which are often used for full-text search and analytics, might require substantial memory resources to handle large vector datasets efficiently.

Elasticsearch as an Example:

Elasticsearch is a popular NoSQL database often used for full-text search and analytics. While it can be used to store and retrieve vectors, there are some considerations:

  • Memory requirements: Storing and indexing large vector datasets in Elasticsearch can be memory-intensive, especially for high-dimensional vectors.
  • Query performance: The performance of vector-based queries in Elasticsearch can depend on factors like the number of dimensions, the size of the dataset, and the indexing strategy used.
  • Vector search support: Recent Elasticsearch versions provide a dense_vector field type with approximate kNN search, and OpenSearch offers a comparable k-NN plugin. These features carry their own performance and memory implications.
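
For a feel of what vector search looks like in Elasticsearch, the sketch below writes an index mapping and a search body as plain Python dictionaries. It assumes the dense_vector field type and knn search clause of recent (8.x) releases. The field name embedding and the sizes are illustrative, and no client call is made, so check the request shape against the documentation for your version.

```python
import json

DIMS = 128  # assumed embedding size

# Index mapping: a dense_vector field that is indexed for approximate kNN search.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": DIMS,
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}

# Search body: ask for the 10 nearest neighbours, examining 100 candidates per shard.
search_body = {
    "knn": {
        "field": "embedding",
        "query_vector": [0.1] * DIMS,  # placeholder query embedding
        "k": 10,
        "num_candidates": 100,
    }
}

print(json.dumps(mapping, indent=2))
```

Raising num_candidates generally improves recall at the cost of latency, which is the same accuracy versus speed trade-off discussed throughout this post.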

In a nutshell, while NoSQL databases offer many advantages, their suitability for storing and retrieving vectors depends on specific use cases and requirements. For applications that heavily rely on vector-based similarity search and require high performance, specialized vector databases might be a more appropriate choice.

A Deeper Dive into Similarity Search in Vector Databases

Similarity search is a fundamental operation in vector databases, involving finding the closest matches to a given query vector from a large dataset of vectors. This is crucial for applications like recommendation systems, image search, and natural language processing.

Similarity measures, algorithms, and data structures are crucial for efficient similarity search. Similarity measures (e.g., cosine, Euclidean) quantify the closeness between vectors. Algorithms (e.g., brute force, LSH, HNSW) determine how vectors are compared and retrieved. Data structures (e.g., inverted indexes, hierarchical graphs) optimize storage and retrieval. The choice of these components depends on factors like dataset size, dimensionality, and desired accuracy. Let’s look in detail at the different similarity measures, algorithms, and data structures in the sections below.

Understanding Similarity Measures

  • Cosine Similarity: Measures the cosine of the angle between two vectors, so it depends only on their direction. It’s suitable when the magnitude of the vectors doesn’t matter (e.g., document similarity based on word counts).

import numpy as np

def cosine_similarity(v1, v2):
    """Calculates the cosine similarity between two vectors.

    Args:
        v1: The first vector.
        v2: The second vector.

    Returns:
        The cosine similarity between the two vectors.
    """

    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)

    return dot_product / (norm_v1 * norm_v2)

# Example usage
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
similarity = cosine_similarity(vector1, vector2)
print(similarity)

  • Euclidean Distance: Measures the straight-line distance between two points in Euclidean space. It’s suitable when the magnitude of the vectors is important (e.g., image similarity based on pixel values).

import numpy as np

def euclidean_distance(v1, v2):
    """Calculates the Euclidean distance between two vectors.

    Args:
        v1: The first vector.
        v2: The second vector.

    Returns:
        The Euclidean distance between the two vectors.
    """

    return np.linalg.norm(v1 - v2)

# Example usage
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
distance = euclidean_distance(vector1, vector2)
print(distance)

  • Hamming Distance: Measures the number of positions where two binary vectors differ. It’s useful for comparing binary data.

import numpy as np

def hamming_distance(v1, v2):
    """Calculates the Hamming distance between two binary vectors.

    Args:
        v1: The first binary vector.
        v2: The second binary vector.

    Returns:
        The Hamming distance between the two vectors.
    """

    return np.sum(v1 != v2)

# Example usage
vector1 = np.array([0, 1, 1, 0])
vector2 = np.array([1, 1, 0, 1])
distance = hamming_distance(vector1, vector2)
print(distance)

  • Manhattan Distance: Also known as L1 distance, it measures the sum of absolute differences between corresponding elements of two vectors.

import numpy as np

def manhattan_distance(v1, v2):
    """Calculates the Manhattan distance between two vectors.

    Args:
        v1: The first vector.
        v2: The second vector.

    Returns:
        The Manhattan distance between the two vectors.
    """

    return np.sum(np.abs(v1 - v2))

# Example usage
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
distance = manhattan_distance(vector1, vector2)
print(distance)
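
To see how these measures differ in practice, the short sketch below compares a vector with a scaled copy of itself. It is written in plain Python (no NumPy) so it runs without extra dependencies, and the helper names are our own. Cosine similarity ignores the change in magnitude while Euclidean distance does not. The last lines check a handy identity: for unit-length vectors, the squared Euclidean distance equals 2 - 2 * cosine similarity, so both measures rank neighbors the same way once vectors are normalized.

```python
import math

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.hypot(*v1) * math.hypot(*v2))

def euclidean_distance(v1, v2):
    return math.dist(v1, v2)

def normalize(v):
    norm = math.hypot(*v)
    return [x / norm for x in v]

v = [1.0, 2.0, 3.0]
scaled = [2 * x for x in v]  # same direction, twice the magnitude

cos_same_direction = cosine_similarity(v, scaled)    # direction only, so close to 1.0
dist_same_direction = euclidean_distance(v, scaled)  # equals the length of v

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
a, b = normalize([1.0, 2.0, 3.0]), normalize([4.0, 5.0, 6.0])
lhs = euclidean_distance(a, b) ** 2
rhs = 2 - 2 * cosine_similarity(a, b)

print(cos_same_direction, dist_same_direction, lhs, rhs)
```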

Algorithms and Data Structures

Brute Force: A straightforward but computationally expensive algorithm for finding the nearest neighbors in a dataset. It compares the query vector with every vector in the dataset to find the closest matches.

How Brute Force Works

  1. Iterate through the dataset: For each vector in the dataset, calculate its distance to the query vector.
  2. Maintain a list of closest neighbors: Keep track of the closest vectors found so far.
  3. Update the list: If the distance between the current vector and the query vector is smaller than the distance to the farthest neighbor in the list, replace the farthest neighbor with the current vector.
  4. Repeat: Continue this process until all vectors in the dataset have been compared.
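
The steps above can be followed quite literally with a bounded heap that always holds the k closest vectors seen so far. This is a minimal sketch in plain Python, with brute_force_top_k being our own helper name. It returns the same neighbors as sorting all distances, but never keeps more than k entries around.

```python
import heapq
import math
import random

def brute_force_top_k(query, vectors, k=3):
    """Return (distance, index) pairs for the k vectors closest to query."""
    heap = []  # max-heap via negated distances; heap[0] is the farthest kept neighbor
    for idx, vec in enumerate(vectors):
        dist = math.dist(query, vec)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, idx))
        elif dist < -heap[0][0]:
            # Closer than the current farthest neighbor: replace it.
            heapq.heapreplace(heap, (-dist, idx))
    return sorted((-neg, idx) for neg, idx in heap)

random.seed(0)
vectors = [[random.random() for _ in range(16)] for _ in range(500)]
query = vectors[7]  # a stored vector, so its own distance is 0
result = brute_force_top_k(query, vectors, k=3)
print(result)
```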

Advantages and Disadvantages

  • Advantages:
    • Simple to implement.
    • Guaranteed to find the exact nearest neighbors.
  • Disadvantages:
    • Extremely slow for large datasets.
    • Inefficient for high-dimensional data.

Python Code Example

import numpy as np

def brute_force_search(query_vector, vectors, k=10):
    """Performs brute force search for the nearest neighbors.

    Args:
        query_vector: The query vector.
        vectors: The dataset of vectors.
        k: The number of nearest neighbors to find.

    Returns:
        A list of indices of the nearest neighbors.
    """

    distances = np.linalg.norm(vectors - query_vector, axis=1)
    nearest_neighbors = np.argsort(distances)[:k]
    return nearest_neighbors

# Example usage
query_vector = np.random.rand(128)
vectors = np.random.rand(1000, 128)
nearest_neighbors = brute_force_search(query_vector, vectors, k=10)

Brute Force is generally not suitable for large datasets or high-dimensional data due to its computational complexity. For these scenarios, more efficient algorithms like LSH, HNSW, or IVF-Flat are typically used. However, it can be useful for small datasets or as a baseline for comparison with other algorithms.

Locality Sensitive Hashing (LSH): A technique used to efficiently find similar items in large datasets. It works by partitioning the vector space into buckets and hashing similar vectors into the same bucket. This makes it possible to quickly find approximate nearest neighbors without having to compare every vector in the dataset.

How LSH Works

  1. Hash Function Selection: Choose a hash function that is sensitive to the distance between vectors. This means that similar vectors are more likely to be hashed into the same bucket.
  2. Hash Table Creation: Create multiple hash tables, each using a different hash function.
  3. Vector Hashing: For each vector, hash it into each hash table.
  4. Query Processing: When a query vector is given, hash it into each hash table.
  5. Candidate Selection: Retrieve all vectors that are in the same buckets as the query vector.
  6. Similarity Calculation: Calculate the actual similarity between the query vector and the candidate vectors.
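
Here is a minimal sketch of those six steps using random-hyperplane hashing for cosine similarity. It is plain Python for illustration only. The sizes (16 dimensions, 8 bits per signature, 4 tables) are arbitrary assumptions, and a real system would use a tuned library instead.

```python
import math
import random
from collections import defaultdict

random.seed(1)
DIM, N_BITS, N_TABLES = 16, 8, 4

def make_hyperplanes():
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_BITS)]

def signature(vec, planes):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in planes)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

vectors = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(500)]

# Steps 1-3: one hash function (set of hyperplanes) and one bucket table per hash table.
tables = []
for _ in range(N_TABLES):
    planes = make_hyperplanes()
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[signature(vec, planes)].append(idx)
    tables.append((planes, buckets))

def lsh_query(query, k=5):
    # Steps 4-5: hash the query and collect everything sharing a bucket with it.
    candidates = set()
    for planes, buckets in tables:
        candidates.update(buckets.get(signature(query, planes), []))
    # Step 6: exact similarity, computed only on the candidate set.
    ranked = sorted(candidates, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k], len(candidates)

query = list(vectors[10])  # an exact copy of a stored vector
neighbors, n_candidates = lsh_query(query)
print(neighbors, n_candidates)
```

Only the vectors that share a bucket with the query are scored exactly, which is where the speed-up comes from. A query that is merely close to a stored vector usually lands in the same bucket in at least one table, but not always, which is why LSH is approximate.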

LSH Families

  • Random Projection: Projects vectors onto random hyperplanes.
  • MinHash: Used for comparing sets of items.
  • SimHash: Used for comparing documents based on their shingles.
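
As an illustration of one of these families, the sketch below estimates the Jaccard similarity of two small sets with MinHash. The seeded SHA-1 hash, the 128 hash functions, and the word-level shingles are assumptions chosen to keep the example short and dependency-free.

```python
import hashlib

def _hash(seed, item):
    # A family of hash functions, obtained by salting SHA-1 with a seed.
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def minhash_signature(items, num_hashes=128):
    # For each hash function, keep the smallest hash value over the set.
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching positions estimates the Jaccard similarity.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def shingles(text, size=3):
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

doc_a = shingles("vector databases are optimized for fast similarity search over embeddings")
doc_b = shingles("vector databases are optimized for fast similarity search over images")

sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
exact = len(doc_a & doc_b) / len(doc_a | doc_b)
approx = estimated_jaccard(sig_a, sig_b)
print(exact, approx)
```

The signatures have a fixed length no matter how large the sets are, so they can be compared or bucketed cheaply. The estimate gets tighter as the number of hash functions grows.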

LSH Advantages and Disadvantages

  • Advantages:
    • Efficient for large datasets.
    • Can be used for approximate nearest neighbor search.
    • Can be parallelized.
  • Disadvantages:
    • Can introduce false positives or negatives.
    • Accuracy can be affected by the choice of hash functions and the number of hash tables.

Python Code Example using Annoy

Annoy is not a strict LSH implementation: it builds a forest of random-projection trees, which rests on the same idea of splitting the space with random hyperplanes.

import numpy as np
from annoy import AnnoyIndex

# Create an Annoy index for 128-dimensional vectors using angular (cosine) distance
annoy_index = AnnoyIndex(128, 'angular')

# Add vectors to the index
for i in range(1000):
    vector = np.random.rand(128)
    annoy_index.add_item(i, vector)

# Build the index with 10 random-projection trees
annoy_index.build(10)

# Search for the 10 nearest neighbors
query_vector = np.random.rand(128)
nns = annoy_index.get_nns_by_vector(query_vector, 10)

Note: The n_trees argument passed to build() sets the number of trees in the forest. More trees generally improve accuracy but increase build time and index size.

By understanding the fundamentals of LSH and carefully selecting the appropriate parameters, you can effectively use it for similarity search in your applications.

Hierarchical Navigable Small World (HNSW): A highly efficient algorithm for approximate nearest neighbor search in high-dimensional spaces. It constructs a hierarchical graph structure that allows for fast and accurate retrieval of similar items.

How HNSW Works

  1. Layer Assignment: Each inserted point is assigned a random top layer, drawn from an exponentially decaying distribution, so most points live only in the bottom layer and few reach the upper layers.
  2. Graph Construction: On every layer from its top layer down to the bottom layer, the new point is connected to a bounded number of its nearest neighbors found on that layer.
  3. Hierarchical Structure: The bottom layer contains all points, while higher layers contain progressively fewer points connected by longer-range links.
  4. Search: To find the nearest neighbors of a query point, the search starts from an entry point in the top layer, greedily moves toward the query, and descends layer by layer, using the best node found so far as the starting point for the next layer.
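
A full HNSW implementation is too long to show here, but the layered structure itself is easy to simulate. In the HNSW paper, each point draws its top layer as floor(-ln(u) * mL) for a uniform random u, with mL = 1 / ln(M) suggested as a good default. The sketch below assumes M = 16 and simply counts how many of 100,000 points end up on each layer.

```python
import math
import random
from collections import Counter

random.seed(0)
M = 16                 # assumed maximum number of connections per node
m_L = 1 / math.log(M)  # level normalization factor suggested in the HNSW paper

def random_level():
    # Exponentially decaying distribution: most points get level 0,
    # and only a few reach the higher layers.
    return int(-math.log(1.0 - random.random()) * m_L)

top_levels = [random_level() for _ in range(100_000)]
counts = Counter(top_levels)

# A point whose top level is L is present in every layer from 0 up to L.
layer_sizes = [
    sum(c for lvl, c in counts.items() if lvl >= layer)
    for layer in range(max(counts) + 1)
]
print(layer_sizes)
```

Each layer holds roughly 1/M of the points in the layer below it. That is why a search can cover long distances in a few hops near the top and only does fine-grained work on the bottom layer.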

Advantages of HNSW

  • High Accuracy: HNSW often achieves high accuracy, even for high-dimensional data.
  • Efficiency: It is very efficient for large datasets and can handle dynamic updates.
  • Flexibility: The algorithm can be adapted to different distance metrics and data distributions.

Python Code Example using NMSLIB

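import nmslib
import numpy as np

# Example data: 1000 vectors of dimension 128
vectors = np.random.rand(1000, 128).astype(np.float32)

# Create an HNSW index that uses cosine similarity
nmslib_index = nmslib.init(method='hnsw', space='cosinesimil')

# Add vectors to the index
nmslib_index.addDataPointBatch(vectors)

# Build the index; M and efConstruction trade build time for recall
nmslib_index.createIndex({'M': 16, 'efConstruction': 200})

# Search for nearest neighbors
query_vector = np.random.rand(128).astype(np.float32)
ids, distances = nmslib_index.knnQuery(query_vector, k=10)

Note: The space parameter in NMSLIB specifies the distance metric used (e.g., cosinesimil for cosine similarity). Index-time parameters such as M (connections per node) and efConstruction (size of the candidate list during construction) can be tuned to balance recall, memory, and build time for your specific application.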

HNSW is a powerful algorithm for approximate nearest neighbor search, offering a good balance between accuracy and efficiency. It’s particularly well-suited for high-dimensional data and can be used in various applications, such as recommendation systems, image search, and natural language processing.

IVF-Flat: A hybrid indexing technique that combines an Inverted File (IVF) index with flat (uncompressed) vector storage to efficiently perform approximate nearest neighbor (ANN) search in high-dimensional vector spaces. It’s particularly effective for large datasets.

How IVF-Flat Works

  1. Quantization: The dataset is clustered (typically with k-means) into nlist cells. Each vector is assigned to the cell whose representative point (centroid) is closest to it.
  2. Inverted File: An inverted index is created, where each quantized cell is associated with a list of vectors belonging to that cell.
  3. Flat Storage: Within each cell, the vectors are stored as-is, without compression, and are compared with a linear scan.
  4. Query Processing: When a query vector is given, it’s compared against the centroids to find the nprobe closest cells. The vectors in those cells are then scanned for the nearest neighbors.
  5. Refinement: Distances inside the probed cells are already exact because the vectors are stored uncompressed, so accuracy is improved mainly by probing more cells (a larger nprobe).
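
To show the mechanics without any library, here is a small IVF-style sketch in plain Python. The few rounds of k-means, the sizes (10 cells, 3 probed), and the helper names are illustrative assumptions. Production libraries do the same thing with far more efficient code.

```python
import math
import random

random.seed(0)
DIM, N_LIST, N_PROBE = 8, 10, 3

vectors = [[random.random() for _ in range(DIM)] for _ in range(1000)]

def nearest_centroids(vec, centroids, n=1):
    order = sorted(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
    return order[:n]

def train_centroids(data, n_list, iterations=5):
    """A few rounds of Lloyd's k-means, started from randomly sampled points."""
    centroids = random.sample(data, n_list)
    for _ in range(iterations):
        cells = [[] for _ in range(n_list)]
        for vec in data:
            cells[nearest_centroids(vec, centroids)[0]].append(vec)
        centroids = [
            [sum(col) / len(cell) for col in zip(*cell)] if cell else centroids[c]
            for c, cell in enumerate(cells)
        ]
    return centroids

centroids = train_centroids(vectors, N_LIST)

# Inverted file: cell id -> ids of the vectors assigned to that cell.
inverted_lists = [[] for _ in range(N_LIST)]
for idx, vec in enumerate(vectors):
    inverted_lists[nearest_centroids(vec, centroids)[0]].append(idx)

def ivf_search(query, k=5, n_probe=N_PROBE):
    candidates = [i for c in nearest_centroids(query, centroids, n_probe)
                  for i in inverted_lists[c]]
    # "Flat" part: the raw vectors in the probed cells are scanned exhaustively.
    return sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]

query = vectors[123]
result = ivf_search(query)
print(result)
```

Probing only 3 of the 10 cells means that, on average, well under half of the vectors are scanned. Setting n_probe to the number of cells turns the search back into an exact brute-force scan.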

Advantages of IVF-Flat

  • Efficiency: IVF-Flat can be significantly faster than brute-force search for large datasets.
  • Accuracy: It can achieve good accuracy, especially when combined with refinement techniques.
  • Scalability: It can handle large datasets and high-dimensional vectors.
  • Flexibility: The number of quantized cells (nlist) and the number of cells probed per query (nprobe) can be adjusted to balance accuracy and efficiency.

Python Code Example using Faiss

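import faiss
import numpy as np

dim, nlist, nprobe, k = 128, 100, 10, 10

# Faiss expects float32 arrays
vectors = np.random.rand(10000, dim).astype('float32')

# Create an IVF-Flat index; the flat L2 index acts as the coarse quantizer
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist)

# Train the quantizer (k-means over the data), then add vectors to the index
index.train(vectors)
index.add(vectors)

# Search for nearest neighbors, probing nprobe cells per query
index.nprobe = nprobe
query_vector = np.random.rand(1, dim).astype('float32')
distances, indices = index.search(query_vector, k)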

In this example:

  • dim is the dimensionality of the vectors.
  • nlist is the number of quantized cells.
  • nprobe is the number of cells to query during search.

IVF-Flat is a powerful technique for approximate nearest neighbor search in vector databases, offering a good balance between efficiency and accuracy. By carefully tuning the parameters, you can optimize its performance for your specific application.

ScaNN (Scalable Nearest Neighbors): A scalable and efficient approximate nearest neighbor search library from Google Research designed for large-scale datasets. It combines partitioning of the dataset (similar to an inverted index) with quantization techniques to achieve high performance.

How ScaNN Works

  1. Quantization: The dataset is divided into quantized subspaces (quantization cells). Each vector is assigned to a cell based on its similarity to a representative point (centroid) of the cell.
  2. Inverted Index: An inverted index is created, where each quantized cell is associated with a list of vectors belonging to that cell.
  3. Scan: During query processing, the query vector is compared against the centroids to find the closest cells. The quantized vectors in those cells are then scanned to score candidates quickly.
  4. Refinement: The top candidates from the scan are rescored using exact distances to improve accuracy.
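
The scan-then-refine idea can be sketched in a few lines. Note the swap: this example uses plain scalar quantization, not ScaNN’s own anisotropic vector quantization, and the 15 levels and the shortlist of 50 are arbitrary assumptions. It only illustrates why a cheap approximate pass followed by exact rescoring works well.

```python
import math
import random

random.seed(0)
DIM = 16
vectors = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(2000)]

def quantize(vec, levels=15):
    """Map each coordinate in [-1, 1] to one of `levels` integer codes (lossy)."""
    return [round((x + 1) / 2 * (levels - 1)) for x in vec]

def dequantize(codes, levels=15):
    return [c / (levels - 1) * 2 - 1 for c in codes]

codes = [quantize(v) for v in vectors]  # compact representation used for fast scoring

def search(query, k=5, n_rescore=50):
    # Stage 1: approximate distances against the quantized vectors.
    shortlist = sorted(
        range(len(vectors)),
        key=lambda i: math.dist(query, dequantize(codes[i])),
    )[:n_rescore]
    # Stage 2: exact distances, but only for the shortlisted candidates.
    return sorted(shortlist, key=lambda i: math.dist(query, vectors[i]))[:k]

query = [random.uniform(-1, 1) for _ in range(DIM)]
result = search(query)
print(result)
```

The shortlist is deliberately larger than k, so small ranking mistakes caused by quantization can still be corrected in the exact pass.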

Advantages of ScaNN

  • Scalability: ScaNN can handle large datasets and high-dimensional vectors efficiently.
  • Efficiency: It uses partitioning to reduce the search space, making it faster than brute-force search.
  • Accuracy: ScaNN can achieve good accuracy, especially when combined with refinement techniques.
  • Flexibility: The number of quantized cells and the refinement strategy can be adjusted to balance accuracy and efficiency.

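Python Code Example using the scann package

Faiss does not provide a ScaNN index class, so this example uses Google’s scann package instead. The builder calls follow the usage shown in the ScaNN README; check them against the version you have installed.

import numpy as np
import scann

dim, k = 128, 10
vectors = np.random.rand(10000, dim).astype(np.float32)

# Partition the data into leaves, score with asymmetric hashing,
# then rescore the top 100 candidates exactly
searcher = (
    scann.scann_ops_pybind.builder(vectors, k, "squared_l2")
    .tree(num_leaves=100, num_leaves_to_search=10, training_sample_size=10000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

# Search for nearest neighbors
query_vector = np.random.rand(dim).astype(np.float32)
indices, distances = searcher.search(query_vector, final_num_neighbors=k)

In this example:

  • dim is the dimensionality of the vectors.
  • num_leaves is the number of partitions (quantized cells).
  • num_leaves_to_search is the number of partitions to query during search.
  • reorder(100) rescores the top 100 candidates with exact distances.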

ScaNN is a powerful algorithm for approximate nearest neighbor search in large-scale applications. It offers a good balance between efficiency and accuracy, making it a popular choice for various tasks, such as recommendation systems, image search, and natural language processing.

Disk-ANN: A scalable approximate nearest neighbor search algorithm from Microsoft Research designed for very large datasets that don’t fit entirely in memory. It combines a graph-based index with on-disk (SSD) storage to efficiently handle large-scale vector search.

How Disk-ANN Works

  1. Graph Construction: A proximity graph (the Vamana graph) is built over the dataset, linking each vector to a bounded number of neighbors.
  2. Compression: Compressed versions of the vectors (using product quantization) are kept in memory to guide the search.
  3. On-Disk Storage: The graph and the full-precision vectors are stored on disk, allowing for efficient handling of large datasets.
  4. Query Processing: When a query vector is given, the graph is traversed with a beam search. The compressed vectors in memory decide which nodes to visit next, and the neighbors of those nodes are read from disk.
  5. Refinement: The retrieved candidates are re-ranked using their full-precision vectors read from disk to improve accuracy.
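
Whichever index picks the candidates, the disk-resident part of the search comes down to fetching a handful of vectors by position and re-ranking them. The sketch below shows that step with a flat binary file of fixed-size records. The file layout and the candidate_ids list (standing in for the output of an in-memory index) are hypothetical.

```python
import math
import os
import random
import struct
import tempfile

random.seed(0)
DIM = 8
RECORD = struct.Struct(f"<{DIM}f")  # one vector = DIM little-endian float32 values

vectors = [[random.random() for _ in range(DIM)] for _ in range(1000)]

# Write every vector to a flat binary file; record i starts at byte i * RECORD.size.
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
with open(path, "wb") as f:
    for vec in vectors:
        f.write(RECORD.pack(*vec))

def read_vector(f, idx):
    """Fetch a single vector from disk by seeking to its fixed-size record."""
    f.seek(idx * RECORD.size)
    return list(RECORD.unpack(f.read(RECORD.size)))

def rerank_from_disk(query, candidate_ids, k=3):
    # Only the candidates chosen by the in-memory index are read from disk.
    with open(path, "rb") as f:
        scored = [(math.dist(query, read_vector(f, i)), i) for i in candidate_ids]
    return sorted(scored)[:k]

candidate_ids = [5, 17, 42, 256, 999]  # hypothetical output of an in-memory index
result = rerank_from_disk(vectors[42], candidate_ids)
print(result)
```

Because every record has the same size, a vector can be located with a single seek. Memory use is then governed by the index that proposes candidates, not by the full dataset.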

Advantages of Disk-ANN

  • Scalability: Disk-ANN can handle extremely large datasets that don’t fit in memory.
  • Efficiency: It uses a graph index and on-disk storage to optimize performance for large-scale search.
  • Accuracy: Disk-ANN can achieve good accuracy, especially when combined with refinement techniques.
  • Flexibility: The graph degree, the size of the search candidate list, and the beam width can be adjusted to balance accuracy and efficiency.

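Python Code Example using Faiss

To our knowledge, Faiss does not ship a DiskANN index (the reference implementation is Microsoft’s DiskANN project). The closest Faiss analogue is an IVF index that is written to disk and memory-mapped when read back, so the inverted lists stay on disk:

import faiss
import numpy as np

dim, nlist, nprobe, k = 128, 100, 10, 10
filename = "ivf_flat.index"
vectors = np.random.rand(10000, dim).astype('float32')

# Build and train an IVF-Flat index, then write it to disk
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.train(vectors)
index.add(vectors)
faiss.write_index(index, filename)

# Reopen with memory mapping so the inverted lists are read from disk on demand
disk_index = faiss.read_index(filename, faiss.IO_FLAG_MMAP)
disk_index.nprobe = nprobe

# Search for nearest neighbors
query_vector = np.random.rand(1, dim).astype('float32')
distances, indices = disk_index.search(query_vector, k)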

In this example:

  • filename is the path to the disk file where the index will be stored.
  • Other parameters are the same as in IVF-Flat.

Disk-ANN is a powerful algorithm for approximate nearest neighbor search in very large datasets. It provides a scalable and efficient solution for handling massive amounts of data while maintaining good accuracy.

Vector Database Comparison: Features, Use Cases, and Selection Guide

Just as the RDBMS and NoSQL worlds offer many database choices, there are quite a few vector databases to choose from, and picking the right one for your application matters. The table below compares key features and use cases, and serves as a selection guide.

| VectorDB | Key Features | Best For | When to Choose |
| --- | --- | --- | --- |
| Pinecone | Fully managed service; Real-time updates; Hybrid search (vector + metadata); Serverless | Production-ready applications; Rapid development; Scalable solutions | When you need a fully managed solution; For applications requiring real-time updates; When combining vector search with metadata filtering |
| Milvus | Open-source; Scalable to billions of vectors; Supports multiple index types; Hybrid search capabilities | Large-scale vector search; On-premises deployments; Customizable solutions | When you need an open-source solution; For very large-scale vector search applications; When you require fine-grained control over indexing |
| Qdrant | Open-source; Rust-based for high performance; Supports filtering with payload; On-prem and cloud options | High-performance vector search; Applications with complex filtering needs | When performance is critical; For applications requiring advanced filtering; When you need both cloud and on-prem options |
| Weaviate | Open-source; GraphQL API; Multi-modal data support; AI-first database | Semantic search applications; Multi-modal data storage and retrieval | When working with multiple data types (text, images, etc.); If you prefer GraphQL for querying; For AI-centric applications |
| Faiss (Facebook AI Similarity Search) | Open-source; Highly efficient for dense vectors; GPU support | Research and experimentation; Large-scale similarity search | When you need low-level control; For integration into custom systems; When GPU acceleration is beneficial |
| Elasticsearch with vector search | Full-text search + vector capabilities; Mature ecosystem and extensive analytics features | Applications combining traditional search and vector search; When you need rich text analytics | When you’re already using Elasticsearch; For hybrid search applications (text + vector); When you need advanced analytics alongside vector search |
| pgvector (PostgreSQL extension) | Vector similarity search in PostgreSQL; Integrates with existing PostgreSQL databases | Adding vector capabilities to existing PostgreSQL systems; Small to medium-scale applications | When you’re already heavily invested in PostgreSQL; For projects that don’t require specialized vector DB features; When simplicity and familiarity are priorities |
| Vespa | Open-source; Combines full-text search, vector search, and structured data; Real-time indexing and serving | Complex search and recommendation systems; Applications requiring structured, text, and vector data | For large-scale, multi-modal search applications; When you need a unified platform for different data types; For real-time, high-volume applications |
| AWS OpenSearch | Fully managed AWS service; Combines traditional full-text search capabilities with vector-based similarity search | When you need to search for both text-based content and vectors; When you need to perform real-time searches and analytics on large datasets | When you want to leverage the broader AWS ecosystem for your application; For applications that require processing billions of vectors |

Conclusion

At Datalog dot ai, the previous startup that I (Malaikannan Sankarasubbu) founded to build a low-code virtual assistant platform, we relied heavily on FAISS for intent similarity. Since then, quite a few more options for vector databases have emerged.

Vector databases have emerged as a powerful tool for handling unstructured and semi-structured data, offering efficient similarity search capabilities and supporting a wide range of applications. By understanding the fundamentals of vector databases, including similarity measures, algorithms, and data structures, you can select the right approach for your specific needs.

In future blog posts, we will delve deeper into performance considerations, including indexing techniques, hardware optimization, and best practices for scaling vector databases. We will also explore real-world use cases and discuss the challenges and opportunities that lie ahead in the field of vector databases.

Chunking

Breaking Down Data: The Science and Art of Chunking in Text Processing & RAG Pipeline

As the field of Natural Language Processing (NLP) continues to evolve, the combination of retrieval-based and generative models has emerged as a powerful approach for enhancing various NLP applications. One of the key techniques that significantly improves the efficiency and effectiveness of Retrieval-Augmented Generation (RAG) is chunking. In this blog, we will explore what chunking is, why it is important in RAG, the different ways to implement chunking, including content-aware and recursive chunking, how to evaluate the performance of chunking, chunking alternatives, and how it can be applied to optimize NLP systems.

What is Retrieval-Augmented Generation (RAG)?

Before diving into chunking, let’s briefly understand RAG. Retrieval-Augmented Generation is a framework that combines the strengths of retrieval-based models and generative models. It involves retrieving relevant information from a large corpus based on a query and using this retrieved information as context for a generative model to produce accurate and contextually relevant responses or content.
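
To make the retrieve-then-generate flow concrete, here is a toy sketch. It uses a bag-of-words cosine similarity instead of real embeddings, a three-sentence corpus, and stops at building the prompt (no language model is called). All of these are simplifying assumptions for illustration.

```python
import math
import re
from collections import Counter

# A tiny stand-in for the "large corpus" of a real RAG system.
corpus = [
    "HNSW builds a layered graph for approximate nearest neighbor search.",
    "Chunking splits long documents into smaller pieces before indexing.",
    "Cosine similarity compares the direction of two vectors.",
]

def bag_of_words(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    # Retrieval step: rank the corpus by similarity to the query.
    q = bag_of_words(query)
    return sorted(corpus, key=lambda doc: cosine(q, bag_of_words(doc)), reverse=True)[:k]

def build_prompt(query, k=1):
    # Augmentation step: the retrieved text becomes context for the generator.
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt("Why do we split documents into chunks?")
print(prompt)
```

In a real pipeline, the pieces stored in the corpus are exactly the chunks discussed in the rest of this post, and the prompt would be sent to a generative model.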

What is Chunking?

Chunking is the process of breaking down large text documents or datasets into smaller, manageable pieces, or “chunks.” These chunks can then be individually processed, indexed, and retrieved, making the overall system more efficient and effective. Chunking helps in dealing with large volumes of text by dividing them into smaller, coherent units that are easier to handle.

Why Do We Need Chunking?

Chunking is essential in RAG for several reasons:

Efficiency

  • Computational cost: Processing smaller chunks of text requires less computational power compared to handling entire documents.
  • Storage: Chunking allows for more efficient storage and indexing of information.

Accuracy

  • Relevance: By breaking down documents into smaller units, it’s easier to identify and retrieve the most relevant information for a given query.
  • Context preservation: Careful chunking can help maintain the original context of the text within each chunk.

Speed

  • Retrieval time: Smaller chunks can be retrieved and processed faster, leading to quicker response times.
  • Model processing: Language models can process smaller inputs more efficiently.

Limitations of Large Language Models

  • Context window: LLMs have limitations on the amount of text they can process at once. Chunking helps to overcome this limitation.

In essence, chunking optimizes the RAG process by making it more efficient, accurate, and responsive.

Different Ways to Implement Chunking

There are various methods to implement chunking, depending on the specific requirements and structure of the text data. Here are some common approaches:

  1. Fixed-Length Chunking: Divide the text into chunks of fixed length, typically based on a predetermined number of words or characters.

    def chunk_text_fixed_length(text, chunk_size=200, by='words'):
        if by == 'words':
            words = text.split()
            return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
        elif by == 'characters':
            return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        else:
            raise ValueError("Parameter 'by' must be either 'words' or 'characters'.")
    
    text = "The process is more important than the results. And if you take care of the process, you will get the results."
    word_chunks = chunk_text_fixed_length(text, chunk_size=5, by='words')  
    character_chunks = chunk_text_fixed_length(text, chunk_size=5, by='characters')  
       
    
    print(word_chunks)
    # Output:
    # ['The process is more important', 'than the results. And if', 'you take care of the', 'process, you will get the', 'results.']
    
    print(character_chunks)
    # Output:
    # ['The p', 'roces', 's is ', 'more ', 'impor', 'tant ', 'than ', 'the r', 'esult', 's. An', 'd if ', 'you t', 'ake c', 'are o', 'f the', ' proc', 'ess, ', 'you w', 'ill g', 'et th', 'e res', 'ults.']
    
  2. Sentence-Based Chunking: Split the text into chunks based on complete sentences. This method ensures that each chunk contains coherent and complete thoughts.

    import nltk
    # Download the sentence tokenizer models (newer NLTK releases may need 'punkt_tab' instead)
    nltk.download('punkt')
       
    def chunk_text_sentences(text, max_sentences=3):
        sentences = nltk.sent_tokenize(text)
        return [' '.join(sentences[i:i + max_sentences]) for i in range(0, len(sentences), max_sentences)]
    
    text = """Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence. It deals with the interaction between computers and humans through natural language. NLP techniques are used to apply algorithms to identify and extract the natural language rules such that 
    the unstructured language data is converted into a form that computers can understand. Text mining and text classification are common applications of NLP. It's a powerful tool in the modern data-driven world."""
       
       
    
    sentence_chunks = chunk_text_sentences(text, max_sentences=2)
       
       
    for i, chunk in enumerate(sentence_chunks, 1):
        print(f"Chunk {i}:\n{chunk}\n")
    
    Chunk 1:
    Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence. It deals with the interaction between computers and humans through natural language.
       
    Chunk 2:
    NLP techniques are used to apply algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand. Text mining and text classification are common applications of NLP.
       
    Chunk 3:
    It's a powerful tool in the modern data-driven world.
    
  3. Paragraph-Based Chunking: Divide the text into chunks based on paragraphs. This approach is useful when the text is naturally structured into paragraphs that represent distinct sections or topics.

    def chunk_text_paragraphs(text):
        paragraphs = text.split('\n\n')
        return [paragraph for paragraph in paragraphs if paragraph.strip()]
    
    # The sample output below assumes `text` separates its paragraphs with blank lines ('\n\n')
    paragraph_chunks = chunk_text_paragraphs(text)
       
       
    for i, chunk in enumerate(paragraph_chunks, 1):
        print(f"Paragraph {i}:\n{chunk}\n")
    
    Paragraph 1:
    Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence.
       
    Paragraph 2:
    It deals with the interaction between computers and humans through natural language.
       
    Paragraph 3:
    NLP techniques are used to apply algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand.
       
    Paragraph 4:
    Text mining and text classification are common applications of NLP. It's a powerful tool in the modern data-driven world.
    
  4. Thematic or Semantic Chunking: Use NLP techniques to identify and group related sentences or paragraphs into chunks based on their thematic or semantic content. This can be done using topic modeling or clustering algorithms.

    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
       
    nltk.download('punkt')  # newer NLTK releases may require nltk.download('punkt_tab') instead
       
    def chunk_text_thematic(text, n_clusters=5):
        sentences = nltk.sent_tokenize(text)
        vectorizer = TfidfVectorizer(stop_words='english')
        X = vectorizer.fit_transform(sentences)
        kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
        clusters = kmeans.predict(X)
           
        chunks = [[] for _ in range(n_clusters)]
        for i, sentence in enumerate(sentences):
            chunks[clusters[i]].append(sentence)
           
        return [' '.join(chunk) for chunk in chunks]
       
       
       
    thematic_chunks = chunk_text_thematic(text, n_clusters=3)
       
       
    for i, chunk in enumerate(thematic_chunks, 1):
        print(f"Chunk {i}:\n{chunk}\n")
       
    
  5. Sliding Window Chunking: Use a sliding window approach to create overlapping chunks. This method ensures that important information near the boundaries of chunks is not missed.

    def chunk_text_sliding_window(text, chunk_size=200, overlap=50, unit='word'):
        """Chunks text using a sliding window.

        Args:
            text: The input text.
            chunk_size: The desired size of each chunk.
            overlap: The overlap between consecutive chunks.
            unit: The chunking unit ('word' or 'char'; 'token' is not implemented here).

        Returns:
            A list of text chunks.
        """
        # The window must move forward, so the step has to be positive
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")

        if unit == 'word':
            data = text.split()
        elif unit == 'char':
            data = text
        else:
            # Implement tokenization for other units
            raise NotImplementedError(f"Unsupported unit: {unit!r}")

        chunks = []
        for i in range(0, len(data), chunk_size - overlap):
            window = data[i:i + chunk_size]
            chunks.append(' '.join(window) if unit == 'word' else window)

        return chunks
    
    
  6. Content-Aware Chunking: This advanced method involves using more sophisticated NLP techniques to chunk the text based on its content and structure. Content-aware chunking can take into account factors such as topic continuity, coherence, and discourse markers. It aims to create chunks that are not only manageable but also meaningful and contextually rich.

    Example of Content-Aware Chunking using Sentence Transformers:

    import nltk
    from sentence_transformers import SentenceTransformer, util
    
    def content_aware_chunking(text, max_chunk_size=200):
        model = SentenceTransformer('all-MiniLM-L6-v2')
        sentences = nltk.sent_tokenize(text)
        embeddings = model.encode(sentences, convert_to_tensor=True)
        clusters = util.community_detection(embeddings, min_community_size=1)
           
        chunks = []
        for cluster in clusters:
            chunk = ' '.join([sentences[i] for i in cluster])
            if len(chunk.split()) <= max_chunk_size:
                chunks.append(chunk)
            else:
                sub_chunks = chunk_text_fixed_length(chunk, max_chunk_size)
                chunks.extend(sub_chunks)
           
        return chunks
    
  7. Recursive Chunking: Recursive chunking involves repeatedly breaking down chunks into smaller sub-chunks until each chunk meets a desired size or level of detail. This method ensures that very large texts are reduced to manageable and meaningful units at each level of recursion, making it easier to process and retrieve information.

    Example of Recursive Chunking:

    def recursive_chunk(text, max_chunk_size):
        """Recursively chunks text into smaller chunks.

        Args:
            text: The input text.
            max_chunk_size: The maximum desired chunk size.

        Returns:
            A list of text chunks.
        """
        if len(text) <= max_chunk_size:
            return [text]

        # Choose a splitting point based on paragraphs, sentences, or other criteria
        # For example:
        paragraphs = text.split('\n\n')
        if len(paragraphs) > 1:
            chunks = []
            for paragraph in paragraphs:
                chunks.extend(recursive_chunk(paragraph, max_chunk_size))
            return chunks
        else:
            # Handle single paragraph chunking, e.g., by sentence splitting
            # …
            # Placeholder fallback (an assumption, not part of the original): hard split by size
            return [text[i:i + max_chunk_size] for i in range(0, len(text), max_chunk_size)]

    # …


  8. Agentic Chunking: Agentic chunking is a sophisticated technique that uses an LLM to dynamically determine chunk boundaries based on the content and context of the text. Below is an example prompt for agentic chunking.

    Example Prompt:

    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    ## You are an agentic chunker. You will be provided with a content.
    Decompose the content into clear and simple propositions, ensuring they are interpretable out of context.
    1. Split compound sentence into simple sentences. Maintain the original phrasing from the input whenever possible.
    2. For any named entity that is accompanied by additional descriptive information, separate this information into its own distinct proposition.
    3. Decontextualize proposition by adding necessary modifier to nouns or entire sentence and replacing pronouns (e.g. it, he, she, they, this, that) with the full name of the entities they refer to.
    4. Present the results as list of strings, formatted in JSON
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here is the content : {content}
    strictly follow the instructions provided and output in the desired format only.
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
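
One way such a prompt might be wired into code is sketched below. This is only an illustration: `call_llm` is a hypothetical callable (prompt string in, reply string out) standing in for whatever model client you use, `fake_llm` returns a canned reply so the example runs offline, and the short `template` is a stand-in for the full prompt.

```python
import json

def build_prompt(template, content):
    """Fill the {content} placeholder of a prompt template."""
    # str.replace avoids surprises if the template contains other braces
    return template.replace("{content}", content)

def parse_propositions(raw_output):
    """Extract a JSON list of strings from the model's raw reply."""
    start, end = raw_output.find("["), raw_output.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        items = json.loads(raw_output[start:end + 1])
    except json.JSONDecodeError:
        return []
    return [s.strip() for s in items if isinstance(s, str) and s.strip()]

def agentic_chunk(content, template, call_llm):
    """Ask the model for propositions and return them as a list of chunks."""
    return parse_propositions(call_llm(build_prompt(template, content)))

# Hypothetical stand-in for a real model call, used only for illustration
def fake_llm(prompt):
    return 'Here you go: ["Chunking splits text.", "Chunking helps retrieval."]'

template = "Decompose the content into propositions.\nHere is the content : {content}"
chunks = agentic_chunk("Chunking splits text and helps retrieval.", template, fake_llm)
print(chunks)
```

Parsing defensively matters here because models often wrap the JSON in extra prose; returning an empty list on failure lets the caller fall back to a simpler chunking method.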

Chunk Size and Overlapping in Chunking

Determining the appropriate chunk size and whether to use overlapping chunks are critical decisions in the chunking process. These factors significantly impact the efficiency and effectiveness of the retrieval and generation stages in RAG systems.

Chunk Size
  1. Choosing Chunk Size: The ideal chunk size depends on the specific application and the nature of the text. Smaller chunks can provide more precise context but may miss broader information, while larger chunks capture more context but may introduce noise or irrelevant information.
    • Small Chunks: Typically 100-200 words. Suitable for fine-grained retrieval where specific details are crucial.
    • Medium Chunks: Typically 200-500 words. Balance between detail and context, suitable for most applications.
    • Large Chunks: Typically 500-1000 words. Useful for capturing broader context but may be less precise.
  2. Impact of Chunk Size: The chunk size affects the retrieval accuracy and computational efficiency. Smaller chunks generally lead to higher retrieval precision but may require more chunks to cover the same amount of text, increasing computational overhead. Larger chunks reduce the number of chunks but may lower retrieval precision.
Overlapping Chunks
  1. Purpose of Overlapping: Overlapping chunks ensure that important information near the boundaries of chunks is not missed. This approach is particularly useful when the text has high semantic continuity, and critical information may span across chunk boundaries.

  2. Degree of Overlap: The overlap size should be carefully chosen to balance redundancy and completeness. Common overlap sizes range from 10% to 50% of the chunk size.
    • Small Overlap: 10-20% of the chunk size. Minimizes redundancy but may still miss some boundary information.
    • Medium Overlap: 20-30% of the chunk size. Good balance between coverage and redundancy.
    • Large Overlap: 30-50% of the chunk size. Ensures comprehensive coverage but increases redundancy and computational load.
  3. Example of Overlapping Chunking:
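    def chunk_text_sliding_window(text, chunk_size=200, overlap=50):
        # The window must advance, so the overlap has to be smaller than the chunk size
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        words = text.split()
        chunks = []
        for i in range(0, len(words), chunk_size - overlap):
            chunk = words[i:i + chunk_size]
            chunks.append(' '.join(chunk))
        return chunks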
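
To make the redundancy trade-off concrete, here is a small sketch that counts how many chunks a word-level sliding window produces and how many words end up stored in total. The helper name `sliding_window_stats` and the 1000-word document are assumptions chosen for illustration; the start positions follow the same `range(0, n, chunk_size - overlap)` pattern as the example above.

```python
def sliding_window_stats(n_words, chunk_size, overlap):
    """Count chunks and stored words for a word-level sliding window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    # Each window starts `step` words after the previous one; the last may be shorter
    sizes = [min(chunk_size, n_words - start) for start in range(0, n_words, step)]
    stored = sum(sizes)
    return {
        "num_chunks": len(sizes),
        "words_stored": stored,
        "redundancy": stored / n_words if n_words else 0.0,
    }

# Hypothetical 1000-word document, 200-word chunks, overlaps of 0%, 25% and 50%
for overlap in (0, 50, 100):
    print(overlap, sliding_window_stats(1000, 200, overlap))
```

With these numbers, moving from no overlap to a 50% overlap doubles the number of chunks (5 to 10) and nearly doubles the words stored (1000 to 1900), which is the extra embedding and indexing cost paid for better boundary coverage.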
    

Evaluating the Performance of Chunking

Evaluating the performance of chunking is crucial to ensure that the chosen method effectively enhances the retrieval and generation processes. Here are some key metrics and approaches for evaluating chunking performance:

Retrieval Metrics
  1. Precision@K: Measures the proportion of relevant chunks among the top K retrieved chunks.
    def precision_at_k(retrieved_chunks, relevant_chunks, k):
        return len(set(retrieved_chunks[:k]) & set(relevant_chunks)) / k
    
  2. Recall@K: Measures the proportion of all relevant chunks that appear among the top K retrieved chunks.
    def recall_at_k(retrieved_chunks, relevant_chunks, k):
        return len(set(retrieved_chunks[:k]) & set(relevant_chunks)) / len(relevant_chunks)
    
  3. F1 Score: Harmonic mean of Precision@K and Recall@K, providing a balance between precision and recall.
    def f1_score_at_k(precision, recall):
        if precision + recall == 0:
            return 0
        return 2 * (precision * recall) / (precision + recall)
    
  4. MAP: Mean Average Precision (MAP) is primarily used in information retrieval and object detection tasks to evaluate the ranking of retrieved items.
    import numpy as np

    def calculate_ap(y_true, y_score):
        """Calculates average precision for a single query.

        Args:
            y_true: Ground truth labels (0 or 1).
            y_score: Predicted scores.

        Returns:
            Average precision.
        """
        # Pair each score with its label and sort by score in descending order
        ranked = sorted(zip(y_score, y_true), key=lambda x: x[0], reverse=True)

        correct_hits = 0
        sum_precision = 0.0
        for i, (_, y) in enumerate(ranked):
            if y == 1:
                correct_hits += 1
                sum_precision += correct_hits / (i + 1)
        # Guard against queries that have no relevant items
        return sum_precision / correct_hits if correct_hits else 0.0

    def calculate_map(y_true, y_score):
        """Calculates mean average precision.

        Args:
            y_true: Ground truth labels (list of lists).
            y_score: Predicted scores (list of lists).

        Returns:
            Mean average precision.
        """
        aps = [calculate_ap(labels, scores) for labels, scores in zip(y_true, y_score)]
        return np.mean(aps)
    
    
    
  5. NDCG: Normalized Discounted Cumulative Gain (NDCG) is a metric used to evaluate the quality of a ranking of items. It measures how well the most relevant items are ranked at the top of the list. In the context of chunking, we can apply NDCG by ranking chunks based on a relevance score and evaluating how well the most relevant chunks are placed at the beginning of the list.
    import numpy as np

    def calculate_dcg(rel):
        """Calculates Discounted Cumulative Gain (DCG).

        Args:
            rel: Relevance scores of items, in ranked order.

        Returns:
            DCG value.
        """
        rel = np.asarray(rel, dtype=float)
        return np.sum(rel / np.log2(np.arange(len(rel)) + 2))

    def calculate_idcg(rel):
        """Calculates Ideal Discounted Cumulative Gain (IDCG).

        Args:
            rel: Relevance scores of items.

        Returns:
            IDCG value.
        """
        rel = np.sort(rel)[::-1]
        return calculate_dcg(rel)

    def calculate_ndcg(rel):
        """Calculates Normalized Discounted Cumulative Gain (NDCG).

        Args:
            rel: Relevance scores of items, in ranked order.

        Returns:
            NDCG value.
        """
        dcg = calculate_dcg(rel)
        idcg = calculate_idcg(rel)
        # All-zero relevance has no ideal ordering; return 0 instead of dividing by zero
        return dcg / idcg if idcg > 0 else 0.0

    # Example usage
    relevance_scores = [3, 2, 1, 0]
    ndcg_score = calculate_ndcg(relevance_scores)
    print(ndcg_score)  # 1.0, because the list is already in ideal order
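
As a quick sanity check of the retrieval metrics above, the sketch below runs them on a toy ranking. The chunk ids and the ground-truth set are hypothetical, and the functions are re-declared so the snippet is self-contained. Note that this average precision divides by the total number of relevant chunks, so relevant chunks that were never retrieved lower the score.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk ids that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunk ids that appear in the top k."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def average_precision(retrieved, relevant):
    """Mean of the precision values at each rank where a relevant chunk appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Hypothetical retriever output (best first) and hypothetical ground truth
retrieved = ["c3", "c7", "c1", "c9", "c4"]
relevant = ["c3", "c1", "c8"]

p = precision_at_k(retrieved, relevant, k=3)   # 2 of the top 3 are relevant
r = recall_at_k(retrieved, relevant, k=3)      # 2 of the 3 relevant chunks were found
f1 = 2 * p * r / (p + r) if p + r else 0.0
ap = average_precision(retrieved, relevant)
print(f"P@3={p:.3f} R@3={r:.3f} F1={f1:.3f} AP={ap:.3f}")
```

Running the same queries against indexes built with different chunk sizes or overlaps, and comparing these numbers, is one simple way to choose between chunking configurations.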


Generation Metrics
  1. BLEU Score: Measures the overlap between the generated text and reference text, considering n-grams.
    from nltk.translate.bleu_score import sentence_bleu
    
    def bleu_score(reference, generated):
        return sentence_bleu([reference.split()], generated.split())
    
  2. ROUGE Score: Measures the overlap of n-grams, longest common subsequence (LCS), and skip-bigram between the generated text and reference text.
    from rouge import Rouge
    
    rouge = Rouge()
    
    def rouge_score(reference, generated):
        scores = rouge.get_scores(generated, reference)
        return scores[0]['rouge-l']['f']
    
  3. Human Evaluation: Involves subjective evaluation by human judges to assess the relevance, coherence, and overall quality of the generated responses. Human evaluation can provide insights that automated metrics might miss.
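
To show what the LCS-based part of ROUGE measures, here is a minimal pure-Python sketch of a ROUGE-L style F1 score. It assumes simple lowercase whitespace tokenization and an unweighted F1, so its numbers may differ from those reported by library implementations; the two sentences are made up for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, generated):
    """F1 of LCS-based precision and recall between two texts."""
    ref, gen = reference.lower().split(), generated.lower().split()
    if not ref or not gen:
        return 0.0
    lcs = lcs_length(ref, gen)
    if lcs == 0:
        return 0.0
    precision = lcs / len(gen)   # how much of the generated text is in the LCS
    recall = lcs / len(ref)      # how much of the reference is covered by the LCS
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1(
    "chunking splits text into smaller pieces",
    "chunking breaks text into pieces",
)
print(round(score, 4))
```

Here the longest common subsequence is "chunking text into pieces" (4 tokens), giving a precision of 4/5 and a recall of 4/6.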

Chunking Alternatives

While chunking is an effective method for improving the efficiency and effectiveness of RAG systems, there are alternative techniques that can also be considered:

  1. Hierarchical Indexing: Instead of chunking the text, hierarchical indexing organizes the data into a tree structure where each node represents a topic or subtopic. This allows for efficient retrieval by navigating through the tree based on the query’s context.
    class HierarchicalIndex:
        def __init__(self):
            self.tree = {}

        def add_document(self, doc_id, topics):
            current_level = self.tree
            for topic in topics:
                if topic not in current_level:
                    current_level[topic] = {}
                current_level = current_level[topic]
            # Keep a list so several documents can share the same topic path
            current_level.setdefault('doc_ids', []).append(doc_id)

        def retrieve(self, query_topics):
            current_level = self.tree
            for topic in query_topics:
                if topic in current_level:
                    current_level = current_level[topic]
                else:
                    return []
            return current_level.get('doc_ids', [])
    
  2. Summarization: Instead of retrieving chunks, the system generates summaries of documents or sections that are relevant to the query. This can be done using extractive or abstractive summarization techniques.
    from transformers import BartTokenizer, BartForConditionalGeneration
    
    def generate_summary(text):
        tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
        model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    
        inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)
        summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=150, early_stopping=True)
        return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    
  3. Dense Passage Retrieval (DPR): DPR uses dense vector representations for both questions and passages, allowing for efficient similarity search using vector indexes such as FAISS.
    import torch
    from transformers import DPRQuestionEncoder, DPRContextEncoder, DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer
    from sklearn.metrics.pairwise import cosine_similarity
    
    question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    
    question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    context_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    
    def encode_texts(texts, tokenizer, encoder):
        inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        # Inference only: disable gradient tracking so the output converts cleanly to NumPy
        with torch.no_grad():
            return encoder(**inputs).pooler_output.numpy()
    
    question_embeddings = encode_texts(["What is chunking?"], question_tokenizer, question_encoder)
    context_embeddings = encode_texts(["Chunking is a process...", "Another context..."], context_tokenizer, context_encoder)
    
    # One row per question, one column per context
    similarities = cosine_similarity(question_embeddings, context_embeddings)
    
  4. Graph-Based Representations: Instead of breaking the text into chunks, graph-based representations model the relationships between different parts of the text. Nodes represent entities, concepts, or chunks of text, and edges represent the relationships between them. This approach allows for more flexible and context-aware retrieval.
    import networkx as nx

    # compute_similarity(a, b) -> float is supplied by the caller (for example, cosine
    # similarity of embeddings); the threshold and k defaults are illustrative only.
    def build_graph(texts, compute_similarity, threshold=0.5):
        graph = nx.Graph()
        for i, text in enumerate(texts):
            graph.add_node(i, text=text)
            # Add edges based on some similarity metric
            for j in range(i + 1, len(texts)):
                similarity = compute_similarity(text, texts[j])
                if similarity > threshold:
                    graph.add_edge(i, j, weight=similarity)
        return graph

    def retrieve_from_graph(graph, query, compute_similarity, threshold=0.5, k=3):
        # Assumes existing nodes are numbered 0..n-1, as produced by build_graph
        query_node = len(graph.nodes)
        graph.add_node(query_node, text=query)
        for i in range(query_node):
            similarity = compute_similarity(query, graph.nodes[i]['text'])
            if similarity > threshold:
                graph.add_edge(query_node, i, weight=similarity)
        # Retrieve nodes with highest similarity
        neighbors = sorted(graph[query_node], key=lambda x: graph[query_node][x]['weight'], reverse=True)
        return [graph.nodes[n]['text'] for n in neighbors[:k]]

Graph-based representations can capture complex relationships and provide a more holistic view of the text, making them a powerful alternative to chunking.
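
The graph code above depends on a similarity function between two pieces of text. As a minimal sketch, the snippet below uses Jaccard overlap of word sets as a stand-in (an assumption; embedding-based cosine similarity would be a more typical choice) and stores the edges in a plain dictionary so it runs without extra dependencies. The three sample texts and the 0.2 threshold are hypothetical.

```python
def jaccard_similarity(a, b):
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def build_adjacency(texts, threshold=0.2):
    """Link every pair of texts whose similarity exceeds the threshold."""
    edges = {i: {} for i in range(len(texts))}
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            similarity = jaccard_similarity(texts[i], texts[j])
            if similarity > threshold:
                edges[i][j] = similarity
                edges[j][i] = similarity
    return edges

texts = [
    "chunking splits text into pieces",
    "chunking splits documents into pieces",
    "embeddings map words to vectors",
]
edges = build_adjacency(texts)
print(edges)
```

The first two texts share four of six distinct words and get linked, while the third has no words in common with either and stays isolated.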

Conclusion

Chunking plays a pivotal role in enhancing the efficiency and effectiveness of Retrieval-Augmented Generation systems. By breaking down large texts into manageable chunks, we can improve retrieval speed, contextual relevance, scalability, and the overall quality of generated responses. Evaluating the performance of chunking methods involves considering retrieval and generation metrics, as well as efficiency and cost metrics. As NLP continues to advance, techniques like chunking will remain essential for optimizing the performance of RAG and other language processing systems. Additionally, exploring alternatives such as hierarchical indexing, summarization, dense passage retrieval, and graph-based representations can further enhance the capabilities of RAG systems.

Embark on your journey to harness the power of chunking in RAG and unlock new possibilities in the world of Natural Language Processing!

If you found this blog post helpful, please consider citing it in your work:

    @misc{malaikannan2024chunking,
      author = {Sankarasubbu, Malaikannan},
      title = {Breaking Down Data: The Science and Art of Chunking in Text Processing & RAG Pipeline},
      year = {2024},
      url = {https://malaikannan.github.io/2024/08/05/Chunking/},
      note = {Accessed: 2024-08-12}
    }

Embeddings

Computers are meant to crunch numbers; it goes back to the original design of these machines. Representing text as numbers is the holy grail of Natural Language Processing (NLP), but how do we do that? Over the years, various techniques have been developed to achieve this. Early methods such as n-grams (bigrams, trigrams) and TF-IDF were able to convert words into numbers. Not just one number, but a collection of them: each word is represented by a collection of numbers called a vector, and the fixed size of that collection is called the dimension of the vector. Though they were useful, these methods had their limitations. The most important one is that the vector for each word stands alone, i.e., mathematical operations such as addition or subtraction between the vectors are not meaningful (we can still compute them, but the resulting vector will not represent any word). That is where embeddings come in. An embedding is also a vector, so each word still gets a corresponding vector, but now we can compute King - Man + Woman and get a vector that is close to the vector corresponding to Queen. Why is this useful? That is what we are going to explore in this article.

What are Embeddings?

Embeddings are numerical representations of text data where words or phrases from the vocabulary are mapped to vectors of real numbers. This mapping is crucial because it allows us to quantify and manipulate textual data in a way that machines can understand and process.

We understand what a word is, so let's see what a vector is. A vector is an ordered sequence of numbers. For example:

  • (3) is a one dimensional vector.
  • (2,8) is a two dimensional vector.
  • (12,6,7,4) is a four dimensional vector.

A vector can be represented by plotting it on a graph. Let's take a 2D example.

2D Plot

We can only visualize up to 3 dimensions; anything beyond that we can describe with numbers but not plot.

Below is an example of a 4-dimensional vector representation of the word king.

King Vector

One of the seminal papers to come out of Google is Word2Vec. Let's see how Word2Vec works to get a conceptual understanding of how embeddings work.

How Word2Vec works

For an input text, Word2Vec looks at each word and the context of words around it. It trains on the text and learns which words tend to appear in similar contexts. At the end of training, each word is represented by a vector of N dimensions (typically in the 100 to 300 range).
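
To make "the context of words around it" concrete, here is a rough sketch of how (center word, context word) training pairs could be generated in the skip-gram style. The function name `skipgram_pairs` and the window size of 1 are assumptions for illustration; real implementations add steps such as subsampling and negative sampling that are not shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "SanFrancisco is a beautiful California city".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

The model is then trained to predict the context word from the center word, and the word vectors are a by-product of that training.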

Word2Vec

Suppose we train the Word2Vec algorithm on the example discussed above: “SanFrancisco is a beautiful California city. LosAngeles is a lovely California metropolis.”

Let’s assume that it outputs a 2-dimensional vector for each word, since we can’t visualize anything beyond 3 dimensions.

  • SanFrancisco (6,6)
  • beautiful (-13,-4)
  • California (10,8)
  • city (2,10)
  • LosAngeles (6.5,5)
  • lovely (-12,-7)
  • metropolis (2.5,8)

Below is a 2D Plot of vectors

2DPlot

You can see in the image what the Word2Vec algorithm inferred from the input text. SanFrancisco and LosAngeles are grouped together. Beautiful and lovely are grouped together. City and metropolis are grouped together. The beauty of this is that Word2Vec deduced it purely from data, without being explicitly taught English or geography.
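
The grouping can also be checked numerically. The sketch below copies the 2-D vectors from the example and, for a few words, finds the other word with the smallest Euclidean distance. Using Euclidean distance is a choice made for this toy example; cosine similarity is the more common measure for real embeddings.

```python
import math

# 2-D vectors copied from the example above
vectors = {
    "SanFrancisco": (6, 6),
    "beautiful": (-13, -4),
    "California": (10, 8),
    "city": (2, 10),
    "LosAngeles": (6.5, 5),
    "lovely": (-12, -7),
    "metropolis": (2.5, 8),
}

def nearest(word):
    """Return the other word whose vector is closest in Euclidean distance."""
    return min(
        (other for other in vectors if other != word),
        key=lambda other: math.dist(vectors[word], vectors[other]),
    )

for word in ("SanFrancisco", "beautiful", "city"):
    print(word, "->", nearest(word))
# SanFrancisco -> LosAngeles
# beautiful -> lovely
# city -> metropolis
```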

You will see more embedding approaches in the sections below.

Key Characteristics of Embeddings:
  1. Dimensionality: Embeddings are vectors of fixed size. Common sizes range from 50 to 300 dimensions, though they can be larger depending on the complexity of the task.
  2. Continuous Space: Unlike traditional one-hot encoding, embeddings are dense and reside in a continuous vector space, making them more efficient and informative.
  3. Semantic Proximity: Words with similar meanings tend to have vectors that are close to each other in the embedding space.

The Evolution of Embeddings

Embeddings have evolved significantly over the years. Here are some key milestones:

  1. Word2Vec (2013): Developed by Mikolov et al. at Google, Word2Vec was one of the first widely adopted algorithms for learning word embeddings. It uses two architectures, Continuous Bag of Words (CBOW) and Skip-gram, to learn word associations.

  2. GloVe (2014): Developed by the Stanford NLP Group, GloVe (Global Vectors for Word Representation) improves upon Word2Vec by incorporating global statistical information of the corpus.

  3. FastText (2016): Developed by Facebook’s AI Research (FAIR) lab, FastText extends Word2Vec by considering subword information, which helps in handling out-of-vocabulary words and capturing morphological details.

  4. ELMo (2018): Developed by the Allen Institute for AI, ELMo (Embeddings from Language Models) generates context-sensitive embeddings, meaning the representation of a word changes based on its context in a sentence.

  5. BERT (2018): Developed by Google, BERT (Bidirectional Encoder Representations from Transformers) revolutionized embeddings by using transformers to understand the context of a word bidirectionally. This model significantly improved performance on various NLP tasks.

From Word Embeddings to Sentence Embeddings

While word embeddings provide a way to represent individual words, they do not capture the meaning of entire sentences or documents. This limitation led to the development of sentence embeddings, which are designed to represent longer text sequences.

Word Embeddings

Word embeddings, such as those created by Word2Vec, GloVe, and FastText, map individual words to vectors. These embeddings capture semantic similarities between words based on their context within a large corpus of text. For example, the words “king” and “queen” might be close together in the embedding space because they often appear in similar contexts.

Sentence Embeddings

Sentence embeddings extend the concept of word embeddings to entire sentences or even paragraphs. These embeddings aim to capture the meaning of a whole sentence, taking into account the context and relationships between words within the sentence. There are several methods to create sentence embeddings:

  1. Averaging Word Embeddings: One of the simplest methods is to average the word embeddings of all words in a sentence. While this method is straightforward, it often fails to capture the nuances and syntactic structures of sentences.

  2. Doc2Vec: Developed by Le and Mikolov, Doc2Vec extends Word2Vec to larger text segments by considering the paragraph as an additional feature during training. This method generates embeddings for sentences or documents that capture more context compared to averaging word embeddings.

  3. Recurrent Neural Networks (RNNs): RNNs, particularly Long Short-Term Memory (LSTM) networks, can be used to generate sentence embeddings by processing the sequence of words in a sentence. The hidden state of the RNN after processing the entire sentence can serve as the sentence embedding.

  4. Transformers (BERT, GPT, etc.): Modern approaches like BERT and GPT use transformer architectures to generate context-aware embeddings for sentences. Encoder models such as BERT process a sentence bidirectionally, while GPT-style decoder models read it left to right; both capture dependencies and relationships between words more effectively than previous methods.
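
The simplest of these methods, averaging word embeddings, can be sketched in a few lines. The example reuses the illustrative 2-D word vectors from the Word2Vec example earlier in this article; skipping words that have no vector (such as "is" and "a") is an assumption of this sketch.

```python
# Illustrative 2-D word vectors from the Word2Vec example earlier in the article
word_vectors = {
    "SanFrancisco": (6, 6),
    "beautiful": (-13, -4),
    "California": (10, 8),
    "city": (2, 10),
    "LosAngeles": (6.5, 5),
    "lovely": (-12, -7),
    "metropolis": (2.5, 8),
}

def average_embedding(sentence):
    """Average the vectors of the words we have embeddings for; skip the rest."""
    known = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dims)]

s1 = average_embedding("SanFrancisco is a beautiful California city")
s2 = average_embedding("LosAngeles is a lovely California metropolis")
print(s1)  # [1.25, 5.0]
print(s2)  # [1.75, 3.5]
```

Because an average ignores word order, "California city" and "city California" get exactly the same embedding, which is the kind of nuance this method loses.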

Example: BERT Sentence Embeddings

BERT (Bidirectional Encoder Representations from Transformers) has set a new standard for generating high-quality sentence embeddings. By processing a sentence in both directions, BERT captures the full context of each word in relation to the entire sentence. The embeddings generated by BERT can be fine-tuned for various NLP tasks, such as sentiment analysis, question answering, and text classification.

To create a sentence embedding with BERT, you can use the hidden states of the transformer model. Typically, the hidden state corresponding to the [CLS] token (which stands for “classification”) is used as the sentence embedding.

How to Generate Embeddings

Generating embeddings involves training a model on a large corpus of text data. Here’s a step-by-step guide to generating word and sentence embeddings:

Generating Word Embeddings with Word2Vec
  1. Data Preparation: Collect and preprocess a large text corpus. This involves tokenizing the text, removing stop words, and handling punctuation.

  2. Training the Model: Use the Word2Vec algorithm to train the model. You can choose between the CBOW or Skip-gram architecture. Libraries like Gensim in Python provide easy-to-use implementations of Word2Vec.
    from gensim.models import Word2Vec
    
    # Example sentences
    sentences = [["I", "love", "machine", "learning"], ["Word2Vec", "is", "great"]]
    
    # Train Word2Vec model
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    
  3. Using the Embeddings: Once the model is trained, you can use it to get the embedding for any word in the vocabulary.
    word_embedding = model.wv['machine']
    
Generating Sentence Embeddings with BERT
  1. Install Transformers Library: Use the Hugging Face Transformers library to easily work with BERT.
    pip install transformers torch
    
  2. Load Pretrained BERT Model: Load a pretrained BERT model and tokenizer.
    from transformers import BertTokenizer, BertModel
    import torch
    
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    
  3. Tokenize Input Text: Tokenize your input text and convert it to input IDs and attention masks.
    sentence = "BERT is amazing for sentence embeddings."
    inputs = tokenizer(sentence, return_tensors='pt')
    
  4. Generate Embeddings: Pass the inputs through the BERT model to get the embeddings.
    with torch.no_grad():
        outputs = model(**inputs)
    
    # The [CLS] token embedding
    sentence_embedding = outputs.last_hidden_state[0][0]
    
  5. Using the Embeddings: The sentence_embedding can now be used for various NLP tasks.

Data Needed for Training Embeddings

The quality of embeddings heavily depends on the data used for training. Here are key considerations regarding the data needed:

  1. Size of the Corpus: A large corpus is generally required to capture the diverse contexts in which words can appear. For example, training Word2Vec or BERT models typically requires billions of words. The larger the corpus, the better the embeddings can capture semantic nuances.

  2. Diversity of the Corpus: The corpus should cover a wide range of topics and genres to ensure that the embeddings are generalizable. This means including text from various domains such as news articles, books, social media, academic papers, and more.

  3. Preprocessing: Proper preprocessing of the corpus is essential. This includes:
    • Tokenization: Splitting text into words or subwords.
    • Lowercasing: Converting all text to lowercase to reduce the vocabulary size.
    • Removing Punctuation and Stop Words: Cleaning the text by removing unnecessary punctuation and common stop words that do not contribute to the meaning.
    • Handling Special Characters: Dealing with special characters, numbers, and other non-alphabetic tokens appropriately.
  4. Domain-Specific Data: For specialized applications, it is beneficial to include domain-specific data. For instance, medical embeddings should be trained on medical literature to capture the specialized vocabulary and context of the field.

  5. Balanced Dataset: Ensuring that the dataset is balanced and not biased towards a particular topic or genre helps in creating more neutral and representative embeddings.

  6. Data Augmentation: In cases where data is limited, data augmentation techniques such as back-translation, paraphrasing, and synthetic data generation can be used to enhance the corpus.
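
The preprocessing steps listed above can be sketched in a few lines. This is a minimal illustration using only the standard library; the `STOP_WORDS` set and the `preprocess` helper are hypothetical names chosen for this example, and a real pipeline would normally rely on a library tokenizer and a much longer stop-word list.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("BERT is amazing for sentence embeddings.")
print(tokens)
```

These steps mostly suit classic word-level embeddings; pretrained models usually ship their own tokenizer, which should be applied to the raw text instead.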

Applications of Sentence Embeddings

Sentence embeddings have a wide range of applications in NLP:

  1. Text Classification: Sentence embeddings are used to represent sentences for classification tasks, such as identifying the topic of a sentence or determining the sentiment expressed in a review.
  2. Semantic Search: By comparing sentence embeddings, search engines can retrieve documents that are semantically similar to a query, even if the exact keywords are not matched.
  3. Summarization: Sentence embeddings help in generating summaries by identifying the most important sentences in a document based on their semantic content.
  4. Translation: Sentence embeddings improve machine translation systems by providing a richer representation of the source sentence, leading to more accurate translations.
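
To make the semantic search application more concrete, here is a small sketch that ranks documents by cosine similarity to a query. The three-dimensional vectors and the document names are made up for illustration; in practice they would come from a sentence embedding model, and `search` is a hypothetical helper, not a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, doc_vecs, top_k=2):
    """Return the ids of the top_k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:top_k]]

# Made-up embeddings, for illustration only.
doc_vecs = {
    "doc_weather": [0.9, 0.1, 0.0],
    "doc_sports": [0.1, 0.9, 0.2],
    "doc_climate": [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]
results = search(query_vec, doc_vecs)
print(results)
```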

Embedding Dimension Reduction Methods

High-dimensional embeddings can be computationally expensive and may contain redundant information. Dimension reduction techniques help in simplifying these embeddings while preserving their essential characteristics. Here are some common methods:

  1. Principal Component Analysis (PCA): PCA is a linear method that reduces the dimensionality of data by projecting it onto a new set of orthogonal axes (principal components), ordered so that the first components capture the largest share of the variance in the data.
    from sklearn.decomposition import PCA
    
    # Assuming 'embeddings' is a numpy array of shape (n_samples, n_features)
    pca = PCA(n_components=50)
    reduced_embeddings = pca.fit_transform(embeddings)
    
  2. t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a nonlinear technique primarily used for visualizing high-dimensional data by reducing it to two or three dimensions.
    from sklearn.manifold import TSNE
    
    tsne = TSNE(n_components=2)
    reduced_embeddings = tsne.fit_transform(embeddings)
    
  3. Uniform Manifold Approximation and Projection (UMAP): UMAP is another nonlinear technique that is often faster than t-SNE, especially for larger datasets, and can also be used for general dimension reduction beyond two or three dimensions.
    import umap
    
    reducer = umap.UMAP(n_components=2)
    reduced_embeddings = reducer.fit_transform(embeddings)
    
  4. Autoencoders: Autoencoders are a type of neural network used to learn efficient codings of input data. An autoencoder consists of an encoder and a decoder. The encoder compresses the input into a lower-dimensional latent space, and the decoder reconstructs the input from this latent space.
    from tensorflow.keras.layers import Input, Dense
    from tensorflow.keras.models import Model
    
    # Define encoder
    input_dim = embeddings.shape[1]
    encoding_dim = 50  # Size of the reduced dimension
    input_layer = Input(shape=(input_dim,))
    encoded = Dense(encoding_dim, activation='relu')(input_layer)
    
    # Define decoder (linear output, since embedding values are not confined to [0, 1])
    decoded = Dense(input_dim, activation='linear')(encoded)
    
    # Build the autoencoder model
    autoencoder = Model(input_layer, decoded)
    encoder = Model(input_layer, encoded)
    
    # Compile and train the autoencoder
    autoencoder.compile(optimizer='adam', loss='mean_squared_error')
    autoencoder.fit(embeddings, embeddings, epochs=50, batch_size=256, shuffle=True)
    
    # Get the reduced embeddings
    reduced_embeddings = encoder.predict(embeddings)
    
  5. Random Projection: Random projection is a simple and computationally efficient technique to reduce dimensionality. It is based on the Johnson-Lindenstrauss lemma, which states that high-dimensional data can be embedded into a lower-dimensional space with minimal distortion.
    from sklearn.random_projection import SparseRandomProjection
    
    transformer = SparseRandomProjection(n_components=50)
    reduced_embeddings = transformer.fit_transform(embeddings)
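
To see the idea behind random projection without any library, the sketch below multiplies the data by a random Gaussian matrix and checks that a pairwise distance is roughly preserved. Note that this is a dense Gaussian projection, not the sparse variant used by `SparseRandomProjection` above; the helper names and the random data are assumptions made for this example.

```python
import math
import random

def random_projection_matrix(n_features, n_components, seed=0):
    """Gaussian matrix scaled so that squared distances are preserved on average."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
            for _ in range(n_features)]

def project(vectors, matrix):
    """Multiply each row vector by the projection matrix."""
    n_components = len(matrix[0])
    return [[sum(v[i] * matrix[i][j] for i in range(len(v)))
             for j in range(n_components)]
            for v in vectors]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up 300-dimensional "embeddings", for illustration only.
rng = random.Random(42)
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(300)] for _ in range(5)]

matrix = random_projection_matrix(n_features=300, n_components=50)
reduced = project(embeddings, matrix)

# Ratio of a pairwise distance after and before projection; it should be close to 1.
ratio = euclidean(reduced[0], reduced[1]) / euclidean(embeddings[0], embeddings[1])
print(len(reduced[0]), ratio)
```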
    

Evaluating Embeddings

Evaluating embeddings is crucial to ensure that they capture meaningful relationships and semantics. Here are some common methods to evaluate embeddings:

  1. Intrinsic Evaluation: These methods evaluate the quality of embeddings based on predefined linguistic tasks or properties without involving downstream tasks.

    • Word Similarity: Measure the cosine similarity between word pairs and compare with human-annotated similarity scores. Popular datasets include WordSim-353 and SimLex-999.
      from scipy.spatial.distance import cosine
      
      similarity = 1 - cosine(embedding1, embedding2)
      
    • Analogy Tasks: Evaluate embeddings based on their ability to solve word analogy tasks, such as “king - man + woman = queen.” Datasets like Google Analogy dataset are commonly used.
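      def analogy(model, word1, word2, word3):
          # Assumes `model` is a gensim KeyedVectors instance.
          # Solves word1 - word2 + word3 (e.g. king - man + woman) and
          # excludes the three input words from the candidate answers.
          return model.most_similar(positive=[word1, word3], negative=[word2], topn=1)[0][0]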
      
  2. Extrinsic Evaluation: These methods evaluate embeddings based on their performance on downstream NLP tasks.

    • Text Classification: Use embeddings as features for text classification tasks and measure performance using metrics like accuracy, precision, recall, and F1 score.
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score
      
      model = LogisticRegression()
      model.fit(train_embeddings, train_labels)
      predictions = model.predict(test_embeddings)
      accuracy = accuracy_score(test_labels, predictions)
      
    • Named Entity Recognition (NER): Evaluate embeddings by their performance on NER tasks, measuring precision, recall, and F1 score.
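      # Example using spaCy (v3-style API) to prepare data for NER
      import spacy
      from spacy.tokens import DocBin
      
      nlp = spacy.load("en_core_web_sm")
      ner = nlp.get_pipe("ner")  # look up the NER pipeline component by name
      ner.add_label("ORG")
      
      # train_texts is assumed to be a list of strings
      train_docs = [nlp(text) for text in train_texts]
      train_db = DocBin(docs=train_docs)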
      
    • Machine Translation: Assess the quality of embeddings by their impact on machine translation tasks, using BLEU or METEOR scores.
  3. Clustering and Visualization: Visualizing embeddings using t-SNE or UMAP can provide qualitative insights into the structure and quality of embeddings.

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    
    # 'embeddings' has shape (n_words, n_features); 'words' holds the matching labels
    tsne = TSNE(n_components=2)
    reduced_embeddings = tsne.fit_transform(embeddings)
    
    plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1])
    for i, word in enumerate(words):
        plt.annotate(word, xy=(reduced_embeddings[i, 0], reduced_embeddings[i, 1]))
    plt.show()
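
For the word similarity evaluation described above, the usual way to compare model similarities with human-annotated scores is a rank correlation such as Spearman's rho. The sketch below implements it from scratch; the scores are made up for illustration and the helper names are hypothetical (in practice `scipy.stats.spearmanr` does the same job).

```python
import math

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up scores for four word pairs, for illustration only.
human_scores = [9.0, 7.5, 3.0, 1.0]      # human-annotated similarity
model_scores = [0.82, 0.60, 0.65, 0.10]  # cosine similarity from embeddings
rho = spearman(human_scores, model_scores)
print(rho)
```

A value of rho close to 1 means the embeddings order the word pairs almost the same way human annotators do.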
    

Similarity vs. Retrieval Embeddings

Embeddings can be tailored for different purposes, such as similarity or retrieval tasks. Understanding the distinction between these two types of embeddings is crucial for optimizing their use in various applications.

Similarity Embeddings

Similarity embeddings are designed to capture the semantic similarity between different pieces of text. The primary goal is to ensure that semantically similar texts have similar embeddings.

Use Cases:

  • Semantic Search: Finding documents or sentences that are semantically similar to a query.
  • Recommendation Systems: Recommending items (e.g., articles, products) that are similar to a given item.
  • Paraphrase Detection: Identifying sentences or phrases that convey the same meaning.

Evaluation:

  • Cosine Similarity: Measure the cosine similarity between embeddings to evaluate their closeness.
    from sklearn.metrics.pairwise import cosine_similarity
    
    similarity = cosine_similarity([embedding1], [embedding2])
    
  • Clustering: Grouping similar items together using clustering algorithms like K-means.
    from sklearn.cluster import KMeans
    
    kmeans = KMeans(n_clusters=5)
    clusters = kmeans.fit_predict(embeddings)
    
Retrieval Embeddings

Retrieval embeddings are optimized for information retrieval tasks, where the goal is to retrieve the most relevant documents from a large corpus based on a query.

Use Cases:

  • Search Engines: Retrieving relevant web pages or documents based on user queries.
  • Question Answering Systems: Finding relevant passages or documents that contain the answer to a user’s question.
  • Document Retrieval: Retrieving documents that are most relevant to a given query.

Evaluation:

  • Precision and Recall: Measure the accuracy of retrieved documents using precision, recall, and F1 score.
    from sklearn.metrics import precision_score, recall_score, f1_score
    
    precision = precision_score(true_labels, predicted_labels, average='weighted')
    recall = recall_score(true_labels, predicted_labels, average='weighted')
    f1 = f1_score(true_labels, predicted_labels, average='weighted')
    
  • Mean Reciprocal Rank (MRR): Evaluate the rank of the first relevant document.
    import numpy as np
    
    def mean_reciprocal_rank(rs):
        """Score is reciprocal of the rank of the first relevant item
        First element is 'rank 1'.  Relevance is binary (nonzero is relevant).
        Example from information retrieval with binary relevance:
        >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
        >>> mean_reciprocal_rank(rs)
        0.61111111111111105
        """
        rs = (np.asarray(r).nonzero()[0] for r in rs)
        return np.mean([1. / (r[0] + 1) if r.size else 0. for r in rs])
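
In retrieval settings, precision and recall are usually computed over the top k results of a ranked list rather than over class labels. Here is a minimal sketch of precision@k and recall@k; the document ids and the `precision_recall_at_k` helper are assumptions made for this example.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall over the first k retrieved document ids."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up ranked results and relevance judgments, for illustration only.
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
print(precision, recall)
```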
    

Symmetric vs. Asymmetric Embeddings

Symmetric and asymmetric embeddings are designed to handle different types of relationships in data, and understanding their differences can help in choosing the right approach for specific tasks.

Symmetric Embeddings

Symmetric embeddings are used when the relationship between two items is mutual. The similarity between two items is expected to be the same regardless of the order in which they are compared.

Use Cases:

  • Similarity Search: Comparing the similarity between two items, such as text or images, where the similarity score should be the same in both directions.
  • Collaborative Filtering: Recommending items based on mutual user-item interactions, where the relationship is bidirectional.

Evaluation:

  • Cosine Similarity: Symmetric embeddings often use cosine similarity to measure the closeness of vectors.
    similarity = cosine_similarity([embedding1], [embedding2])
    
Asymmetric Embeddings

Asymmetric embeddings are used when the relationship between two items is directional. The similarity or relevance of one item to another may not be the same when the order is reversed.

Use Cases:

  • Information Retrieval: Retrieving relevant documents for a query, where the relevance of a document to a query is not necessarily the same as the relevance of the query to the document.
  • Knowledge Graph Embeddings: Representing entities and relationships in a knowledge graph, where the relationship is directional (e.g., parent-child, teacher-student).

Evaluation:

  • Rank-Based Metrics: Asymmetric embeddings often use rank-based metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) to evaluate performance.
    import numpy as np
    
    def mean_reciprocal_rank(rs):
        rs = (np.asarray(r).nonzero()[0] for r in rs)
        return np.mean([1. / (r[0] + 1) if r.size else 0. for r in rs])
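
NDCG is mentioned above but not shown, so here is a minimal sketch. It uses the linear-gain form of DCG (relevance divided by log2 of rank + 1); another common variant uses 2^relevance - 1 as the gain. The graded relevance values are made up for illustration.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance grades of the returned documents, in ranked order (made up).
ranked_relevances = [3, 2, 0, 1]
score = ndcg(ranked_relevances)
print(score)
```

A perfectly ordered list scores 1.0; swapping a relevant document below an irrelevant one, as in the example, lowers the score slightly.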
    

The Future of Embeddings

The field of embeddings is rapidly evolving. Researchers are exploring new ways to create more efficient and accurate representations, such as using unsupervised learning and combining embeddings with other techniques like graph networks. The ongoing advancements in this area promise to further enhance the capabilities of NLP systems.

Conclusion

Embeddings have revolutionized the field of NLP, providing a robust and efficient way to represent and process textual data. From word embeddings to sentence embeddings, these techniques have enabled significant advancements in how machines understand and interact with human language. With the help of dimension reduction methods, evaluation techniques, and tailored similarity and retrieval embeddings, embeddings can be optimized for a wide range of NLP tasks. Understanding the differences between symmetric and asymmetric embeddings further allows for more specialized applications. As we continue to develop more sophisticated models and techniques, embeddings will undoubtedly play a crucial role in advancing our understanding and interaction with human language.

Learned index

Jeff Dean and his co-authors published a seminal paper on whether indexes can be learned using neural networks. I gave a talk on this topic in the Saama Tech Talk series. I guess I am moving more towards talking than writing nowadays.

Natural language processing tools for Tamil

Language is a beautiful thing, and Tamil as a language is even more beautiful. We need to learn other languages like English for commerce and communication, among other things. Languages that have a written form and a grammar have survived longer than languages without them. Technology is now playing the role that written form and grammar once played. English, being one of the most widely spoken and written languages in the world, benefits from a lot of technology and tools developed to understand the language better. It has many good systems, developed both as open source and by companies like Google, Facebook and Microsoft, for speech to text, syntactic parsing, stemming, lemmatization, part-of-speech tagging and so on. This is the result of years of data collection, research, the availability of computing power and, to a great extent, Deep Learning. We have to develop most of these natural language understanding tools for Tamil; some exist, but they are at a very early stage. A language that is beautiful is complicated too: English has 26 letters while Tamil has 247, and the grammar rules of the two languages are different. In most cases we (the tech community) are limited by the lack of available data to build these tools. Once built, these tools should be open sourced under the Apache 2.0 license so that they are available for the public to use for free. I have listed a few key activities that have to be performed for this to be successful:

  1. A website to list the mission statement and project status. This can be as simple as GitHub Pages.
  2. An Android app for data collection, possibly gamified. For example, for speech to text, similar to how Mozilla recently asked people to donate their voices to develop an open source speech-to-text system for English. They open sourced the code and the trained model.
  3. Data access. There are research libraries that have digitized a lot of books, dailies, newspapers etc. If that data is made publicly available, it can spawn multiple research efforts, similar to what Project Gutenberg did. One low-hanging fruit on the algorithm side would be a vectorization engine for Tamil words.
  4. Community involvement. This effort will not succeed without community involvement, particularly from students and teachers for data collection.
  5. Tech community involvement to develop this as open source.
  6. Eminent linguistic advisors to advise on the correct approach for Tamil.
  7. Technology advisors on the right technology choices for solving these problems.
  8. Advisors to help with registering and managing this as a non-profit, and maybe even seeking grants and sponsorship. This can start as a small thing, but if it is to be taken seriously it has to be registered as a non-profit organization.
  9. Advisors who can help with media and press outreach.

Why use Activation Functions in Deep Learning?

Machine Learning or Deep Learning is all about using affine maps. An affine map is a function which can be expressed as

f(x) = Wx + b

where W is a matrix, x is the input (a vector, or a matrix when a batch of inputs is stacked), and b (the bias term) is a vector. Deep learning learns the parameters W and b. In Deep Learning you can stack multiple affine maps on top of one another, e.g.

  • f(x) = Wx + b
  • g(x) = Vx + d

If we stack one affine map over the other, then

  • f(g(x)) = W(Vx + d) + b
  • f(g(x)) = WVx + Wd + b

WV is a matrix; Wd and b are vectors.

Deep learning requires a lot of affine maps stacked on top of one another. But composing one affine map with another gives just another affine map, so stacking alone does not have the desired effect: it gives nothing more than a single affine map would. It still leaves us with a linear model, and in a classification problem a linear model cannot represent a non-linear decision boundary.
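
The collapse of two stacked affine maps into one can be checked numerically. The sketch below uses small made-up 2x2 matrices and plain Python lists; the helper names (`matvec`, `matmul`, `vec_add`) are chosen for this example only.

```python
def matvec(M, x):
    """Matrix-vector product."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def matmul(A, B):
    """Matrix-matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

# Made-up parameters for two affine maps f(x) = Wx + b and g(x) = Vx + d.
W = [[1.0, 2.0], [0.0, 1.0]]
b = [1.0, -1.0]
V = [[2.0, 0.0], [1.0, 3.0]]
d = [0.5, 0.5]

def f(x):
    return vec_add(matvec(W, x), b)

def g(x):
    return vec_add(matvec(V, x), d)

# The single affine map h(x) = (WV)x + (Wd + b).
WV = matmul(W, V)
Wd_plus_b = vec_add(matvec(W, d), b)

def h(x):
    return vec_add(matvec(WV, x), Wd_plus_b)

x = [3.0, -2.0]
stacked = f(g(x))
single = h(x)
print(stacked, single)  # both are the same vector
```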

How do we solve this? By introducing non-linearity between affine maps/layers. The most commonly used non-linear functions are

  • Tanh
  • Sigmoid
  • ReLU

When there are a lot of non-linear functions, why use only the ones above? Because their derivatives are easy to compute, and derivatives are what Deep Learning algorithms use to learn (through gradient descent and backpropagation). Non-linear functions are called activation functions in the Deep Learning world.
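
As a rough illustration of how cheap these derivatives are, here is a sketch of the three functions and their derivatives in plain Python. Each derivative is a one-line expression, in two cases reusing the function's own output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # derivative expressed through the output itself

def tanh(z):
    return math.tanh(z)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # The derivative is undefined at 0; using 0 there is a common convention.
    return 1.0 if z > 0 else 0.0

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), sigmoid_grad(z), tanh_grad(z), relu(z), relu_grad(z))
```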

Thanks to Dr Jacob Minz for the suggestion to add an explanation of the Universal Approximation Theorem. The theorem says that when you introduce a simple non-linearity between affine layers, a network with enough hidden units can approximate any continuous function on a bounded input range to an arbitrary degree (as close to that function as you want). If there is a pattern in the data, the neural network can “learn” it, given enough computation and data.

You can read more about activation functions on Wikipedia. Who writes better about neural networks than Chris Olah? Refer to his blog for further reading. Spandan Madan has also written a Quora answer on a similar topic.

Why should Artificial Intelligence experts know history?

Technological progress amazes ordinary people, and they embrace it as far as they can. If we look back through history, roughly once every 20 years a very powerful technology arrives and sets off a major revolution. Artificial intelligence, however, is the kind of change that comes once in 100 years. AI scholar Andrew Ng calls it the new electricity. Experts can use such a powerful technology to create, and they can also use it to destroy.

Electricity

To become an AI expert, one has to be an excellent software engineer, a gifted mathematician and a person of relentless perseverance. Someone who has mastered this many things may well be extremely intelligent. There is a very high chance that such a person also has the arrogance and pride that come with extreme intelligence, and the attitude that “we can do anything and no rules apply to us”. It is this trait that a scholar had in mind when he said “Numbers don’t lie but liars use numbers”.

To put artificial intelligence briefly, it is the ability to predict what will happen in the future based on past events. This raises a question: if there is no past event to learn from, how can AI predict correctly? For example, suppose a bank uses AI to decide whether or not to lend to entrepreneurs. One community is renowned for entrepreneurship; the AI has information about it and is very likely to make a favorable lending decision. Another community has been oppressed for generations and has produced no entrepreneurs; the AI has no information about it and is very likely to make an unfavorable lending decision. In America, 34% of the prison population is Black, while only 12.2% of the total population is Black. If AI is used to decide whether a convicted person can be released early from prison, an unfavorable decision for Black people becomes more likely. The New York Times published an excellent opinion piece about this.

There are two categories of AI. In the first, experts know why it makes a decision. In the second, it is not known why it makes a decision. Deep Learning belongs to the second category. Prof Been Kim is doing research so that humans can understand why any AI made a particular decision. The second category of AI should not be used for matters such as whether or not to grant a bank loan.

AI is like a child: teach it good information and it will make good decisions, teach it wrong information and it will make wrong decisions. If AI experts blindly feed information to AI without understanding history and social structure, it is very likely to make biased decisions. Like the technologies that came before it, AI will reflect the ethics of its creators. AI experts carry great responsibility.

The lines of the poet Pulamaipithan, “Every child is a good child when born on this earth… whether they then turn out good or bad lies in the mother’s upbringing”, apply very well to AI experts.

Artificial Intelligence

When we say artificial intelligence, what comes to mind is Rajini’s film Enthiran; we watched Chitti’s antics in it and wondered whether all this is possible. Chitti is one dimension of artificial intelligence. Not every form of AI needs to be like Chitti.

Enthiran

Without knowing it, we are already using many AI products. Many of us have a Facebook account. We have started going to Facebook even to get the news. If you like a news item or a piece of information and register a LIKE for it on Facebook, more news and information related to it starts to appear, which is good in a way. We learn only about the things we like and avoid the things we don’t. But this turns us into a frog in a well, and the well becomes our world. The more Facebook’s AI learns, the more it digs a well around its users.

Frog in a well

So how does it learn your preferences?

You teach the AI through the likes (LIKE) you register and the comments (COMMENT) you make. The AI then repeatedly shows only the items that match your preferences. For example, if you have liked essays by Jayakanthan, essays by Ashokamitran will also start to appear in your Facebook feed. If you have liked poems by the poet Kannadasan, poems by the poet Vaali will also start to appear.

Experts are of the opinion that in the 2016 American election, campaigning through fake news had a large impact as a result of Facebook’s echo chamber effect and produced a shift in the election.

What is the difference between software and artificial intelligence?

As an example, I really like exercising outdoors. As a software engineer, whatever problem I have, I design software to solve it. The following factors affect my decision on whether to exercise.

  1. Is it raining?
    • If it is raining, don’t exercise.
  2. What time is it?
    • 10 PM to 5 AM: don’t exercise in pitch darkness.
    • 5 AM to 7 AM: can exercise.
    • 8 AM to 9 AM: can exercise if working from home.
    • 9 AM to 5 PM: office work, cannot exercise.
  3. Do I have to drop my son at school?
    • 8 AM to 9 AM: cannot exercise
    • 4 PM to 5 PM: cannot exercise
  4. Temperature
    • 22 to 32 degrees Celsius: can exercise
    • 32 to 44 degrees Celsius: cannot exercise

If we try to solve this the traditional software way, the number of permutations and combinations becomes very large.

Time        | Is it raining? | School drop-off time? | Temperature           | Exercise? (Yes/No)
Morning 5-7 | Yes            | No                    | 22-32 degrees Celsius | No
Night 9-5   | No             | No                    | 22-32 degrees Celsius | No
Morning 5-7 | No             | No                    | 22-32 degrees Celsius | Yes
…           | …              | …                     | …                     | …

With just four factors there are already more than 20 permutations and combinations. With more than 100 individual factors, the permutations and combinations would run to well over 1000. This cannot be solved easily through traditional software design. For this kind of problem, artificial intelligence is the right solution. There are many apps on our smartphones to track exercise. If I have used apps like RunKeeper or MapMyRun for a year, that data can be used to teach the AI when I can and cannot exercise, much like teaching a child what is right and what is wrong.
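
The count of combinations can be checked with a few lines of code. This is a small sketch that enumerates every combination of the four factors listed earlier; the factor names and value labels are shorthand chosen for this example.

```python
from itertools import product

# The four factors and their possible values, taken from the list above.
factors = {
    "raining": ["yes", "no"],
    "time_of_day": ["10pm-5am", "5am-7am", "8am-9am", "9am-5pm"],
    "school_run": ["yes", "no"],
    "temperature_c": ["22-32", "32-44"],
}

# Every combination needs its own branch in a hand-written rule set.
combinations = list(product(*factors.values()))
print(len(combinations))  # 2 * 4 * 2 * 2 = 32
```

Each extra factor multiplies the total, which is why hand-written rules stop scaling very quickly.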

To put artificial intelligence briefly, it is the ability to predict what will happen in the future based on past events. The example above is a small practical application. AI can be used in many places, for example in weather forecasting, in detecting mineral resources beneath the earth using satellites, in finding resources in mountainous regions, and in detecting skin cancer from a photograph.

Artificial intelligence is a powerful tool; used properly, it can bring many benefits to humankind. It will create many new job opportunities. At the same time, there is a chance that older technologies will disappear. Scholars, experts and those in government should sit down together, debate and publish guidelines.

This is my first blog post about artificial intelligence. Whenever I find the time, my mind urges me to write more of my thoughts.

Introduction to Deep Learning Image Classification using Keras

Deep Learning is a powerful tool that can be used to solve a lot of problems. I can’t solve all the problems in this world, but if I can inspire a few others to take up Deep Learning to solve other problems, it will make a positive impact on society. I believe in spreading knowledge to all segments. Below is the YouTube link for a talk I gave at the Demystifying Artificial Intelligence conference a couple of weeks back.

I am part of the IDLI group (Indian Deep Learning Initiative), which is trying to get Deep Learning promoted in India. We are organizing a lot of YouTube live sessions. I gave the same talk with screen sharing. If you prefer to see the code rather than the presenter’s face while someone is talking, then this YouTube link is for you.

When to use Machine Learning?

Should I get started with “Machine Learning” to improve my situation?

If you are trying to pick an action based on a problem you are facing, how should you approach it? You have heard a lot or a little about “Machine Learning.” Here’s my personal story to help you decide.

As a software programmer, when I want to solve a problem, I try to figure out a way to automate it. Being a machine learning engineer, I begin by comparing against the oldest problem-solving method: deterministic programming.

Malai Running

Being active is my passion. Every day, I decide whether to go outdoors for a run. The situations that influence my decision are:

  1. Is it raining outside?

    • Yes –> Don’t run
    • No –> I think I can run.
  2. Time of the day?

    • Is it 10 P.M to 5 A.M –> Don’t run. Too dark outside.
    • 5 A.M to 7 A.M –> I think I can run.
    • 8 A.M to 9 A.M –> Am I working from home? I think I can run.
    • 9 A.M to 5 P.M –> Don’t run. Office work.
    • 5 P.M - 7 P.M –> I think I can run.
    • 7 P.M - 9 P.M –> If it is not too dark then I can run.
    • 9 P.M - 10 P.M –> After dinner, I don’t run.
  3. Do I have to drop my son in school?

    • Is it morning school drop time (8.30 A.M to 9 A.M) –> Don’t run.
    • Is it evening school pickup time (4.45 P.M to 5.15 P.M) –> Don’t run.
    • Other times. I think I can run.
  4. What is the temperature outside?

    • 50-80F –> I think I can run.
    • 80F-110F –> I don’t think I can run. Too hot.
    • 0F - 50F –> I am not running.

If I do this deterministically, I will end up with a lot of “if” conditions covering all the possible combinations. With just 4 attributes, it is already getting complex. Imagine having hundreds of attributes influencing whether I want to run or not. This is where machine learning will save my day. (Did I mention that I am a lazy programmer? Good code is better than more code.)
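
To give a feel for the deterministic approach, here is a simplified sketch of the rules above as one function. The function name, argument names and the handling of the gaps in the rules (for example, 7 A.M to 8 A.M is treated as runnable and 7 P.M to 9 P.M as not too dark) are assumptions made for this illustration.

```python
def should_run(raining, hour, school_run, temp_f, working_from_home=False):
    """Hand-written rules; hour is 0-23, temp_f is degrees Fahrenheit."""
    if raining:
        return False
    if school_run:                  # school drop-off or pickup time
        return False
    if not (50 <= temp_f < 80):     # too cold or too hot
        return False
    if hour >= 21 or hour < 5:      # after dinner, or too dark outside
        return False
    if 9 <= hour < 17:              # office work
        return False
    if 8 <= hour < 9:
        return working_from_home
    return True

print(should_run(raining=False, hour=6, school_run=False, temp_f=65))
print(should_run(raining=False, hour=13, school_run=False, temp_f=65))
```

Even this simplified version needs a branch per rule, and every new attribute adds more branches that interact with the existing ones.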

A machine learning model “projects the past onto the future” to “learn” whether I will run outside or not. With historical data of my running (Map my run, Garmin, Fitbit, Apple Watch) plus publicly available weather data, I can train a machine learning model (a classification problem) to decide whether I want to run or not. A machine learning algorithm will approximate a decision boundary for this problem. The output from the algorithm is a probability for each of 2 classes: a) run or b) don’t run.
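
As a toy stand-in for a real classifier, the sketch below “learns” from a made-up running history by counting how often I ran in each situation and turning the counts into probabilities. This is a simple frequency table with smoothing, not the kind of algorithm (logistic regression, decision trees and so on) one would normally use; the history and the names `fit` and `predict_proba` are assumptions for this example.

```python
from collections import defaultdict

# Made-up history: (raining, time_bucket, ran), for illustration only.
history = [
    (False, "morning", True),
    (False, "morning", True),
    (False, "morning", False),
    (True, "morning", False),
    (True, "evening", False),
    (False, "evening", True),
    (False, "office", False),
    (False, "office", False),
]

def fit(history):
    """Count runs and total observations for each situation."""
    counts = defaultdict(lambda: [0, 0])  # situation -> [runs, total]
    for raining, bucket, ran in history:
        entry = counts[(raining, bucket)]
        entry[0] += int(ran)
        entry[1] += 1
    return counts

def predict_proba(counts, raining, bucket):
    """Probability of each class, with add-one smoothing for unseen situations."""
    runs, total = counts.get((raining, bucket), (0, 0))
    p_run = (runs + 1) / (total + 2)
    return {"run": p_run, "dont_run": 1 - p_run}

model = fit(history)
proba = predict_proba(model, raining=False, bucket="morning")
print(proba)
```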

Do you see? Machine learning rocks. This story is perfect for a supervised classification problem.

Natural Language Processing using Word2Vec

Natural Language Processing is a complicated area. Computers are really good at crunching numbers but not so much with text. We have to specifically instruct them on how to handle text. For text analytics to be carried out, we have to represent text in a form that computers can understand, i.e. in the form of numbers.

  • The method for doing this varies.
  • A good method is one that captures as much of the meaning of the text as possible.

After this is done, analysis can be carried out on these numbers in the same way as with any other numerical data.

Conventional Text Analysis or Bag of Words Models

Let’s take this example text: “SanFrancisco is a beautiful California city. LosAngeles is a lovely California metropolis”

A popular way of converting this text to numbers is known as the “Bag of Words” model. Each word is extracted from the text and put in a bag together. Each word in the bag is then assigned a suitable value. Let’s parse the example text and extract the words:

“SanFrancisco”, “is”, “a”, “beautiful”, “city”, “LosAngeles”, “lovely”, “California”, “metropolis”.

Stop words are the most common words in a language. In the extracted list of words, “is” and “a” are considered stop words, hence they will be ignored. So the final list of extracted words will be

“SanFrancisco”, “beautiful”, “city”, “LosAngeles”, “lovely”, “California”, “metropolis”

In Bag of Words model each word is assigned a value based on number of times it occurs in the text.

SanFrancisco | beautiful | city | LosAngeles | lovely | California | metropolis
1            | 1         | 1    | 1          | 1      | 2          | 1
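
The counts in the table can be reproduced with a short sketch using `collections.Counter`. The stop-word set here contains only the two stop words from the example; a real list would be much longer.

```python
from collections import Counter

text = ("SanFrancisco is a beautiful California city. "
        "LosAngeles is a lovely California metropolis")

stop_words = {"is", "a"}  # only the stop words that occur in this example

# Split on whitespace and strip the trailing full stop from "city."
words = [w.strip(".") for w in text.split()]
bag = Counter(w for w in words if w not in stop_words)

for word, count in bag.items():
    print(word, count)
```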

This approach has a lot of disadvantages, a few of which are listed below

  • It does not capture the order of words in the original text.
  • It does not capture the context of the text.
  • It does not capture the meanings of the words.

What we are telling the computer with this approach is that city, lovely and metropolis are all equally unrelated to one another, even though city and metropolis mean nearly the same thing and lovely is something very different.

Neural Language models

Prof Yoshua Bengio and his colleagues pioneered this line of work in the early 2000s with their neural probabilistic language model. A language model is an algorithm for capturing the salient statistical characteristics of the distribution of sequences of words in a natural language.

A neural language model learns distributed representations of words to reduce the impact of the curse of dimensionality. The curse of dimensionality means that as the number of input variables grows, the number of examples required to train a model grows exponentially.

You can read more about this work in this link

Word2vec – Word to Vector

Word2Vec is one of the most influential papers in Natural Language Processing. It has nearly 3000 citations. Word2Vec improves on Prof Yoshua Bengio’s earlier work on Neural Language Models. Word2Vec was created at Google, with the main author being Tomas Mikolov. You can read the original paper here.

Word2vec is an answer to all the disadvantages listed in the previous section.

  • It is intelligent, learns context of the text.
  • Understands sentences, and not just individual words.
  • Understands relationships between words.

We understand what a word is, so let's see what a vector is. A vector is an ordered sequence of numbers. For example

  • (3) is a one dimensional vector.
  • (2,8) is a two dimensional vector.
  • (12,6,7,4) is a four dimensional vector.

A vector can be represented by plotting it on a graph. Let's take a 2D example.

2D Plot

We can only visualize up to 3 dimensions. Anything beyond that you can describe with numbers, but not visualize.

How Word2Vec works

For an input text, it looks at each word and the context of words around it. It trains on the text, learning which words appear near each other and in what kinds of sentences. At the end of training, each word is represented by a vector of N dimensions (N is mostly in the 100 to 300 range).
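To make "each word and the context of words around it" concrete, here is a small sketch that lists (word, context word) pairs from a sliding window. The function name `context_pairs` and the window size are assumptions made for illustration; this only shows how training pairs can be collected, not the training itself.

```python
def context_pairs(tokens, window=1):
    """Return (word, context_word) pairs for every word within `window` positions."""
    pairs = []
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((word, tokens[j]))
    return pairs

tokens = ["SanFrancisco", "is", "a", "beautiful", "California", "city"]
pairs = context_pairs(tokens, window=1)

for word, context in pairs:
    print(word, "->", context)
```

A larger window gives each word more neighbours to learn from, at the cost of more pairs to train on.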

Word2Vec

Suppose we train the Word2vec algorithm on the example discussed above: “SanFrancisco is a beautiful California city. LosAngeles is a lovely California metropolis”

Let's assume that it outputs a 2-dimensional vector for each word, since we can’t visualize anything beyond 3 dimensions.

  • SanFrancisco (6,6)
  • beautiful (-13,-4)
  • California (10,8)
  • city (2,10)
  • LosAngeles (6.5,5)
  • lovely (-12,-7)
  • metropolis (2.5,8)

Below is a 2D Plot of vectors

2DPlot

You can see in the image what the Word2vec algorithm inferred from the input text. SanFrancisco and LosAngeles are grouped together. Beautiful and lovely are grouped together. City and metropolis are grouped together. The beauty of this is that Word2vec deduced it purely from data, without being explicitly taught English or geography.
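The grouping can also be checked with numbers instead of a plot. The sketch below takes the illustrative 2D vectors listed above and finds, for each word, the closest other word by Euclidean distance. The helper name `nearest` is made up for this example, and `math.dist` needs Python 3.8 or newer.

```python
import math

# The illustrative 2D vectors from the list above.
vectors = {
    "SanFrancisco": (6, 6),
    "beautiful": (-13, -4),
    "California": (10, 8),
    "city": (2, 10),
    "LosAngeles": (6.5, 5),
    "lovely": (-12, -7),
    "metropolis": (2.5, 8),
}

def nearest(word):
    """Return the other word whose vector is closest to `word`."""
    others = (w for w in vectors if w != word)
    return min(others, key=lambda w: math.dist(vectors[word], vectors[w]))

neighbours = {word: nearest(word) for word in vectors}

for word, neighbour in neighbours.items():
    print(word, "is closest to", neighbour)
```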

Word2vec and Analogies

The Word2vec algorithm is really good at discovering analogies in data. In the plot below, analogies can be observed from the relative positions of the words.

2DPlot

Algorithm knows the answer to

  • If SanFrancisco : beautiful
  • then LosAngeles : ???

Answer : lovely

To get to the answer do

  1. Draw a line from SanFrancisco to beautiful
  2. Shift this line to LosAngeles
  3. Find the end-point of this line.

Similarly you can draw other analogies like

  • if SanFrancisco : city
  • then LosAngeles : metropolis

The math for analogies is beautiful: it can be expressed in simple vector arithmetic.

  • SanFrancisco - LosAngeles = beautiful - [unknown]
  • [unknown] = beautiful + LosAngeles - SanFrancisco
  • [unknown] = (-13,-4) + (6.5,5) - (6,6) = (-12.5,-5), which is close to lovely (-12,-7)
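The same arithmetic can be sketched in Python using the illustrative 2D vectors from earlier. The function name `analogy` is hypothetical. One assumption to note: the three words that appear in the question are left out of the search for the answer, which is a common convention when evaluating analogies, because otherwise one of the input words (here beautiful) can turn out to be the closest point.

```python
import math

# The illustrative 2D vectors from the earlier list.
vectors = {
    "SanFrancisco": (6, 6),
    "beautiful": (-13, -4),
    "California": (10, 8),
    "city": (2, 10),
    "LosAngeles": (6.5, 5),
    "lovely": (-12, -7),
    "metropolis": (2.5, 8),
}

def analogy(a, b, c):
    """Solve a : b :: c : ? and return (target_point, closest_word)."""
    # [unknown] = b + c - a, computed coordinate by coordinate.
    target = tuple(vb + vc - va
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    # Assumption: the words used in the question are not valid answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    best = min(candidates, key=lambda w: math.dist(target, vectors[w]))
    return target, best

target, answer = analogy("SanFrancisco", "beautiful", "LosAngeles")
print(target, answer)

city_target, city_answer = analogy("SanFrancisco", "city", "LosAngeles")
print(city_target, city_answer)
```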

vector

Google trained Word2Vec on a large volume of data, and it came up with some interesting analogies.

King

countries and capitals

Word2vec depicting relationships between countries and capitals Analogy

Machine translation

In complicated use-cases models can understand translations from one language to another.

MachineTranslation

Contributions

Thanks to Arthur Chan for suggesting changes to the blog.

Further Reading

Deep Learning using Numpy

Deep Learning is a versatile tool to solve problems that cannot be solved using a traditional programming approach. I am a CTO at Datalog.ai, where we solve a lot of cool problems using Deep Learning. ML Researchers and Engineers use a lot of Deep Learning packages like Theano, Tensorflow, Torch, Keras etc. The packages are really good, but when you want to understand how Deep Learning works, it is better to go back to basics and see how it is done. This blog is an attempt at that. It is going to be a 3 part series with the topics being

  1. DeepLearning using Numpy
  2. Why TensorFlow/Theano not Numpy?
  3. Why Keras not TensorFlow/Theano?

Neural Network with 1 hidden layer

Deep learning refers to artificial neural networks that are composed of many layers like the one shown above. Deep Learning has many flavors like Convolution Neural Networks, Recurrent Neural Networks, Reinforcement Learning, Feed Forward Neural Networks etc. This blog is going to take the simplest of them, the Feed Forward Neural Network, as an example to explain.

Machine Learning deals with a lot of Linear Algebra operations like dot product, transpose, reshape etc. If you are not familiar with them, I would suggest referring to my previous blog post, in the All about Math section.

Deep Learning needs an activation function to introduce non-linearity. There are different activation functions like sigmoid, tanh, ReLU etc. For this toy example I have used the sigmoid activation function, which squishes real numbers to values between 0 and 1.

Sigmoid

We are going to use Gradient Descent to find optimal parameters to solve for Y. Gradient descent uses the derivative of the sum of errors to update the system's parameters a little bit in such a way that the error decreases as much as possible. After every update the system learns to predict with a lower error. Let it run for many iterations and it will converge at some (local) optimum. The sigmoid function takes a parameter that tells it to calculate the derivative instead. Don’t worry if you don’t understand this explanation, it is very intuitive if you can follow the code along. If you are looking for more explanation refer to this video by Prof Andrew Ng.
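A sigmoid function with a derivative switch could look like the sketch below. The parameter name `deriv` is an assumption. Also note the convention used here: when `deriv=True`, the input is expected to already be a sigmoid output, so the derivative reduces to x * (1 - x).

```python
import numpy as np

def sigmoid(x, deriv=False):
    """Sigmoid activation, or its derivative when deriv=True.

    With deriv=True, x is assumed to already be a sigmoid output s,
    so the derivative is simply s * (1 - s).
    """
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# Works on single numbers and on whole numpy arrays.
print(sigmoid(0))
print(sigmoid(np.array([-2.0, 0.0, 2.0])))
print(sigmoid(0.5, deriv=True))
```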

For this example on Numpy Deep Learning Code, I am going to use a synthetic dataset. Output is the target we are going to predict.

Input and Output

Randomly initialize weights for 2 synapses. Synapses 0 will be of shape 3x4, Synapses 1 will be of shape 4x1

With Gradient descent you have to run the process for n iterations, in ML lingo each one is called an epoch (since it will take ages to complete). In our case we are going to run it for 50 iterations. Since this is a 1 hidden layer network, we do a dot product between the input l0 and synapses_0 and then squish it using the sigmoid function to get the hidden layer l1. Then we pass the output of l1 to the output layer: do a dot product between l1 and the synapses_1 weights and squish it using the sigmoid function to get l2.

Now we calculate the error of our prediction at the l2 layer. Then we use the derivative to find out how much we should update Synapses 1.

Same step should be done for l1 layer, but error should be calculated based on how much we are off on l2.

Update weights for synapses_0 and synapses_1 based on calculated l1_delta and l2_delta respectively.
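Putting the steps above together, a minimal numpy sketch of the whole loop might look like this. The input and output arrays are an assumed toy dataset (the original data is not reproduced here), and the random seed is arbitrary, so the numbers you get will differ from the ones in this post. The names l0, l1, l2, synapses_0, synapses_1, l1_delta and l2_delta follow the text.

```python
import numpy as np

def sigmoid(x, deriv=False):
    # With deriv=True, x is assumed to already be a sigmoid output.
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# Assumed synthetic dataset: 4 examples with 3 features each, 1 target each.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])

# Randomly initialize weights in [-1, 1): 3x4 and 4x1 as described above.
rng = np.random.default_rng(1)
synapses_0 = rng.uniform(-1, 1, size=(3, 4))
synapses_1 = rng.uniform(-1, 1, size=(4, 1))

losses = []
for epoch in range(50):
    # Forward pass: input -> hidden layer -> output layer.
    l0 = X
    l1 = sigmoid(l0 @ synapses_0)
    l2 = sigmoid(l1 @ synapses_1)

    # How far off is the prediction, and how much should synapses_1 change?
    l2_error = y - l2
    l2_delta = l2_error * sigmoid(l2, deriv=True)

    # Push the error back to the hidden layer.
    l1_error = l2_delta @ synapses_1.T
    l1_delta = l1_error * sigmoid(l1, deriv=True)

    # Update the weights.
    synapses_1 += l1.T @ l2_delta
    synapses_0 += l0.T @ l1_delta

    losses.append(float(np.mean(np.abs(l2_error))))

print("first loss:", losses[0])
print("last loss:", losses[-1])
```

How quickly the loss drops depends on the dataset and the initial weights, so treat this as a way to follow the mechanics rather than to reproduce the exact numbers.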

See below on how loss is decreasing for each iteration.

With just 50 iterations we are very close to actual value

Output

Siraj Raval has a really good YouTube video on Intro to Deep Learning, check it out too.

How to Learn Deep Learning?

I oscillated between different blogs and videos to become a deep learning practitioner. This blog is to document my learning and to lay out an optimal path to become a Deep Learning practitioner faster.

It is all about Math

Don’t be shy if you haven’t brushed up your Math skills for a while. When you have been programming for a while, bad habits creep in, and it takes time to unlearn them and learn new things. I had a tough time initially, then started refreshing my Math again. I used Khan Academy, I like how most of the sessions are only 10 minutes long. I followed the order below.

  1. Algebra – Yes you have to refresh Algebra. Remember the equation for straight line y = mx + b. That is the best equation you learned in your life. Most of machine learning is about finding the value of “m” called weights and “b” called biases.
  2. Trigonometry
  3. Differential Calculus – Machine Learning/Deep Learning is all about finding slope aka derivatives, hence do it thoroughly
  4. Partial Differential Equations
  5. Integral Calculus
  6. Probability and Statistics – this is important for anything in Machine Learning.
  7. Linear Algebra – Most of calculations are done using Matrix multiplication, dot products, transpose so learn this well.
  8. Linear Algebra Advanced – Yes it is that important. I referred to Prof Gilbert Strang lectures from MIT.

Intro into Machine Learning

I took Prof Andrew Ng’s Coursera Machine Learning course in 2012. It is the bible if you are starting with Machine Learning. Take your time and learn the basics.

Mining Massive Datasets

This was one of the best courses I took, it helped me understand the Mathematical intuition behind a lot of Machine Learning algorithms.

Deep Learning

This course is Math heavy, but Prof Ali Ghodsi's lectures explain it well. It is one of the hidden gems. There is quite a series of lectures on YouTube, watch them all. Watch them in a loop, till you get the hang of every concept.

Convolution Neural Networks

Convolution Neural Networks are a class of Deep Learning models predominantly used for computer vision. Andrej Karpathy and Justin Johnson taught a great course, cs231n, at Stanford on CNNs. It gives a lot of practical tips on building Deep Learning models. I wrote an intro level CNN tutorial for Keras.

Natural Language processing

Richard Socher’s class on Natural Language processing is a must if you want to work on unstructured text. CS224d is heavy on Recurrent Neural Networks. Recently Convolution Neural Networks are being used more for NLP.

Learn Python

Python is becoming a de-facto language for scientific and numerical computation. Most Deep Learning libraries have a python front end. If you are new to python then use Byte of Python book to learn. There are lot of good youtube tutorials too.

GPU’s

If you have managed to take all the classes mentioned in the list above, then you are serious about being a Deep Learning practitioner. Invest in a good NVIDIA GPU for trying out different models. You can use AWS for training, but you will end up spending a lot of money to train different models, so in the long run it will make sense to buy your own hardware. Hey, you can use it for gaming too, if you feel bored with Deep Learning.

Must Buy/Read Books

Ian Goodfellow, Aaron Courville, and Yoshua Bengio wrote an awesome Deep Learning Book. I bought it, since it is a text-book theory book. Another book I often referred to is Neural Networks and Deep Learning. This book explains Backpropagation (http://neuralnetworksanddeeplearning.com/chap2.html), one of the most important concepts in deep learning, very well.

Blogs to Read

I read Machine Learning Mastery, it has practical tips and good blogs.

Deep Learning Frameworks

There are quite a few options when it comes to Deep Learning frameworks

  1. Tensorflow
  2. Theano
  3. Keras
  4. Caffe
  5. CNTK
  6. Lasagne
  7. Other

I am personally a big fan of Keras (a wrapper over Tensorflow and Theano), since it abstracts away a lot of the complexity of building a Deep Learning model. I can build a model and test whether it works or not very fast. There are tons of online tutorials on Tensorflow and Theano.

Gokul Krishnan wrote a really good blog on Anatomy of Deep Learning Frameworks

Kaggle

Kaggle is a data science competition forum. A lot of researchers compete there and share the approach they used for solving each problem. Compete actively to learn and improve.

Follow Researchers on Twitter

I used the Twitter recommendation engine (learning machine learning using machine learning) to keep myself updated with the latest research papers. Check whom I am following on my Twitter.

Indian Railway Status Check Chatbot

Indian Railways Status check Chatbot is integrated with Facebook Messenger. It can be used to find details about

  • PNR Number
  • Find Station By name
  • Find Station By stationcode
  • Find Stations on a Train route
  • Find Trains leaving from a Station in next 4 hours
  • Find Trains updating
  • Find Rescheduled Trains for a date
  • Find Cancelled Trains for a date
  • Find Train between stations
  • Find Train fare
  • Find Train Seat availability

The bot will be available for public use in another week.

Python: print() methods

Hi all,
Today, I learned about the Python print() function. It is fascinating to know that Python has so much functionality. I will share some of the things I learned today.

  1. sep: the sep parameter is used with the print() function to specify the separator between multiple arguments when they are printed.
  2. Escape sequences like \n (new line), \t (tab) and \b (backspace).
  3. Concatenation, which joins two strings with the + operator.
  4. Concatenating str and int, which combines a string and an integer by first converting the integer to a string with str().
  5. Raw strings: a raw string in Python is defined by prefixing the string literal with an 'r' or 'R'. Raw strings are often used when working with regular expressions or with file system paths, to avoid unintended interpretation of escape sequences.
  6. format: the format() method is used to format strings by replacing placeholders {} in the string with the values passed as arguments.
  7. String multiplication: the * operator repeats a string a specified number of times.
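Here is a short sketch that tries each of the points above. The sample strings, the name and the score are made up for illustration.

```python
# 1. sep: choose the separator between the printed arguments.
print("2024", "01", "15", sep="-")

# 2. Escape sequences: \n starts a new line, \t inserts a tab.
print("Line one\nLine two")
print("Name\tScore")

# 3. Concatenation of two strings.
greeting = "Hello" + " " + "World"
print(greeting)

# 4. Concatenating str and int: convert the int with str() first.
age_message = "Age: " + str(30)
print(age_message)

# 5. Raw string: the backslashes are kept as they are.
raw_path = r"C:\new\table"
print(raw_path)

# 6. format(): the {} placeholders are filled in order.
formatted = "{} scored {}".format("Asha", 95)
print(formatted)

# 7. String multiplication: repeat a string.
rule = "-" * 10
print(rule)
```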

Python : Simple Calculator

Today I learned about simple calculator arithmetic operations, the if... elif... else statement and the while loop.

Also one learner asked about
num1 = input("Enter First number : ")
choice = input("Enter the Choice : ")
num2 = input("Enter Second number : ")
print ((num1)(choice)(num2)) # expected to print 9 for the input 4 + 5

This will error out with a TypeError, because num1 is a string and cannot be called like a function. It is still possible to get the result using eval.

print(eval(num1+choice+num2))
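One thing to keep in mind is that eval runs whatever text it is given as Python code, so it is risky with input you do not control. As an alternative, here is a sketch that looks up the operator in a dictionary instead. The names `OPERATIONS` and `calculate` are made up for this example, and only the four basic operators are assumed.

```python
import operator

# Map each supported symbol to the function that performs it.
OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

def calculate(num1, choice, num2):
    """Apply the operator named by `choice` to two numbers given as text."""
    if choice not in OPERATIONS:
        raise ValueError(f"Unsupported operator: {choice!r}")
    return OPERATIONS[choice](float(num1), float(num2))

# The values would normally come from input(); fixed here for the example.
print(calculate("4", "+", "5"))
```

An unknown operator raises a ValueError instead of being executed as code.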
