Document Whisperer: Llama-4-Scout and the Future of Intelligent Content Extraction

By: angu10
6 April 2025 at 05:28

In today's data-driven world, the ability to quickly extract insights from documents is becoming increasingly valuable. Whether you're a researcher sifting through academic papers, a legal professional analyzing contracts, or a business analyst reviewing reports, the challenge of efficiently navigating through lengthy documents remains universal.

Enter Llama-4-Scout, Meta's latest language model that's showing remarkable capabilities in document question-answering tasks. Released on April 5, 2025, Scout is a groundbreaking addition to the new Llama 4 family, specifically designed to be more reliable, precise, and effective at finding and extracting information from documents.

What Makes Llama-4-Scout Special?

Llama-4-Scout utilizes a Mixture-of-Experts (MoE) architecture with 17 billion active parameters (out of 109 billion total parameters across 16 experts) that strikes an optimal balance between performance and efficiency. What truly sets Scout apart is its unprecedented 10-million token context window, making it best in class for long-context tasks like summarizing huge codebases and analyzing extensive documents.

Scout was pretrained on approximately 40 trillion tokens of multimodal data and excels at:

Document comprehension: Understanding context across extremely long documents

Information retrieval: Precisely locating specific details in complex content

Source attribution: Providing clear references to where information was found

Multimodal reasoning: Visual recognition, image reasoning, captioning, and answering questions about document visuals

This focused design makes Scout particularly well-suited for professional document analysis tasks that require both accuracy and reasonable computational resources. Remarkably, the model fits on a single H100 GPU when quantized to Int4, making it accessible for organizations without massive computing infrastructure.

Real-World Example: Document Q&A on the AGENTS Framework

I recently had the opportunity to test Llama-4-Scout's capabilities by building a simple PDF Q&A application with a research paper on the AGENTS framework (a system for autonomous language agents). The results demonstrated Scout's remarkable ability to extract, understand, and contextualize information from technical documents. Here's what I observed:

Source Paper: https://arxiv.org/pdf/2309.07870

Example 1: Identifying Core Framework Components

When asked about the main features of the framework described in the document, Scout successfully extracted and summarized the key components:

Query: "What are the main features of the AGENTS framework?"

Scout's Response: "The main features of the AGENTS framework include planning, memory (specifically long-short term memory implementation), tool usage capabilities, multi-agent communication protocols, and fine-grained symbolic control through SOPs (Standard Operating Procedures). Source: Page 4, Framework Architecture section"

Example 2: Understanding Technical Implementation Details

Scout excelled at extracting specific technical implementation details related to memory systems:

Query: "How does long-short term memory work in AGENTS?"

Scout's Response: "The AGENTS framework implements a dual-memory system where long-term memories are constructed as action histories embedded by sentence-transformers and stored in VectorDB for semantic search retrieval. In contrast, short-term memories (working memories) are maintained in natural language form and dynamically updated by an LLM through a carefully tuned prompt system. Source: Pages 7-8, Memory Architecture section"

Example 3: Explaining Human-Agent Interaction Mechanisms

The model demonstrated its ability to identify specific configuration details that enable key functionalities:

Query: "How does the AGENTS framework support human-agent interaction?"

Scout's Response: "The AGENTS framework supports human-agent interaction through a simple but effective configuration mechanism. Users can change an agent's 'is_human' field to 'True' in the config file, which enables humans to take the role of an agent and interact with other language agents in both single-agent and multi-agent scenarios. Source: Page 12, Human-Agent Interaction section"

Example 4: Clarifying Domain-Specific Terminology

When asked about specialized terminology in the document, Scout provided clear and accurate explanations:

Query: "What is SOP in the AGENTS framework?"

Scout's Response: "In the AGENTS framework, SOP (Standard Operating Procedures) is a symbolic plan that provides fine-grained control of agents' behaviors. Technically, it's implemented as a graph of multiple states that define different situations an agent may encounter, with explicit transition rules between these states. This architecture makes agents' behavior more stable and predictable across various scenarios. Source: Page 9, Agent Control Mechanisms section"

Example 5: Identifying Ecosystem Components

Scout successfully extracted information about the broader ecosystem supporting the framework:

Query: "What is the Agent Hub in the AGENTS framework?"

Scout's Response: "The Agent Hub is a platform component of the AGENTS framework that enables users to share their fine-tuned language agents and search/download useful language agents created by others. This significantly reduces the effort required for designing, testing, and tuning language agents from scratch, creating a collaborative ecosystem for agent development. Source: Page 15, Ecosystem and Tools section"

Building Your Own Document Intelligence Tool

Building custom applications around Llama-4-Scout is straightforward through providers like Together.ai. I created a simple Streamlit application that allows users to:

  • Upload any PDF document
  • Process both the text and visual content
  • Ask questions in natural language
  • Receive answers with source references
  • Evaluate answer quality with multiple criteria

This type of application can transform how teams work with documents, significantly reducing the time spent searching for information and improving the accuracy of insights extracted. Because Scout can process images in documents, it can work with charts, tables, diagrams, and other visual elements that are often critical for full document understanding; the model is "optimized for visual recognition, image reasoning, captioning, and answering general questions about an image."
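
If you want to experiment with the same flow, here is a minimal sketch of the kind of call such an app makes through Together.ai's Python SDK. It is a sketch under assumptions: the model identifier, the file name, and the pre-extracted document text are illustrative stand-ins, not the exact code of my Streamlit app.

# Minimal document Q&A call to Llama-4-Scout via Together.ai.
# Assumes `pip install together` and a TOGETHER_API_KEY environment variable;
# the model ID below is an assumed Together identifier for Scout.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

document_text = open("paper.txt").read()  # text already extracted from the PDF
question = "What are the main features of the AGENTS framework?"

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed model identifier
    messages=[
        {"role": "system", "content": "Answer using only the document below and cite the page or section you used."},
        {"role": "user", "content": f"Document:\n{document_text}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)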

Technical Capabilities and Performance

Llama-4-Scout demonstrates impressive performance relative to competing models. In comparative evaluations, Scout has shown "superior performance relative to contemporary models such as Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across recognized benchmark datasets."

What makes Scout particularly practical is its efficiency. Scout "fits on a single H100 GPU when quantized to Int4" while still delivering high-quality results. This efficiency means organizations can implement advanced document intelligence without requiring massive computational resources.

Looking Ahead: The Future of Document Intelligence

As models like Llama-4-Scout continue to evolve, we can expect even more sophisticated document intelligence capabilities. Future developments will likely include:

  • Deeper reasoning across multiple documents
  • More nuanced understanding of domain-specific content
  • Better handling of ambiguity and uncertain information
  • Enhanced multimodal capabilities for complex visual content

Conclusion

Llama-4-Scout represents a significant step forward in making advanced document intelligence accessible. Its balanced approach to performance and efficiency makes it particularly valuable for professional applications where accuracy and attribution matter.

For organizations dealing with large volumes of documents, investing in tools built around models like Scout could yield substantial returns through improved information accessibility and insight generation. The model's ability to "process and work with extremely lengthy documents" makes it ideal for enterprises with extensive documentation needs.

Have you experimented with Llama-4-Scout or similar models for document analysis? I'd love to hear about your experiences and applications in the comments below.

Note: The examples provided are based on actual testing of Llama-4-Scout through Together.ai's API integration. Results may vary depending on document complexity and specific implementation details.

VibeCoding

4 April 2025 at 00:00

As a CTO, I don’t typically get a lot of time to sit and code—there’s a lot of grunt work involved in my role. Quite a bit of my time goes into the operational aspects of running an engineering team: feature prioritization, production issue reviews, cloud cost reviews, one-on-ones, status updates, budgeting, etc.

Although I’m deeply involved in architecture, design, and scaling decisions, I’m not contributing as a senior developer writing code for features in the product as much as I’d like. It’s not just about writing code—it’s about maintaining it. And with everything else I do, I felt I didn’t have the time or energy to maintain a complex feature.

Over the past couple of years—apart from AI research work—my coding has been pretty much limited to pair programming or contributing to some minor, non-critical features. Sometimes I end up writing small utilities here and there.

Pair Programming as a Connection Tool

I love to pair program with at least two engineers for a couple of hours every week. This gives me an opportunity to connect with them 1:1, understand the problems on the ground, and help them see the bigger picture of how the features they’re working on fit into our product ecosystem.

1:1s were never effective for me. In that 30-minute window, engineers rarely open up. But if you sit and code for 2–3 hours at a stretch, you get to learn a lot about them—their problem-solving approach, what motivates them, and more.

I also send out a weekly video update to the engineering team, where I talk about the engineers I pair programmed with, their background, and what we worked on. It helps the broader engineering team learn more about their peers as well.

The Engineer in Me

The engineer in me always wants to get back to coding—because there’s no joy quite like building something and making it work. I’m happiest when I code.

I’ve worked in several languages over the years—Java, VB6, C#, Perl, good old shell scripting, Python, JavaScript, and more. Once I found Python, I never looked back. I absolutely love the Python community.

I’ve been a full-stack developer throughout my career. My path to becoming a CTO was non-traditional (that’s a story for another blog). I started in a consulting firm and worked across different projects and tech stacks early on, which helped me become a well-rounded full-stack engineer.

I still remember building a simple timesheet entry application in 2006 using HTML and JavaScript (with AJAX) for a client’s invoicing needs. It was a small utility, but it made timesheet entry so much easier for engineers. That experience stuck with me.

I’ll get to why being a full-stack engineer helped me build the app using VibeCoding shortly.

The Spark: Coffee with Shuveb Hussain

I was catching up over coffee with Shuveb Hussain, founder and CEO of ZipStack. Their product, Unstract, is really good for extracting entities from different types of documents. They’ve even open-sourced a version—go check it out.

Shuveb, being a seasoned engineer’s engineer, mentioned how GenAI code editors helped him quickly build a few apps for his own use over a weekend. That sparked something in me: why wasn’t I automating my grunt work with one of these GenAI code editors?

I’ve used GitHub Copilot for a while, but these newer GenAI editors—like Cursor and Windsurf—are in a different league. Based on Shuveb’s recommendation, I chose Windsurf.

Let’s be honest though—I don’t remember any weekend project of mine that ended in just one weekend 😅

The Grunt Work I Wanted to Automate

I was looking for ways to automate the boring but necessary stuff, so I could focus more on external-facing activities.

Every Thursday, I spent about 6 hours analyzing production issues before the weekly Friday review with engineering leaders, the SRE team, product leads, and support. I’d get a CSV dump of all the tickets and manually go through each one to identify patterns or repeated issues. Then I’d start asking questions on Slack or during the review meeting.

The process was painful and time-consuming. I did this for over 6 months and knew I needed to change something.

In addition to that:

  • I regularly reviewed cloud compute costs across environments and products to identify areas for optimization.
  • I monitored feature usage metrics to see what customers actually used.
  • I examined job runtime stats (it’s a low-code platform, so this matters).
  • I looked at engineering team metrics from the operations side.

Each of these lived in different tools, dashboards, or portals. I was tired of logging into 10 places and context-switching constantly while fielding distractions.

The Build Begins: CTO Dashboard

I decided to build an internal tool I nicknamed CTODashboard—to consolidate everything I needed.

My Tech Stack (via Windsurf):

  • Frontend: ReactJS
  • Backend: Python (FastAPI)
  • Database: Postgres
  • Deployment: EC2 (with some help from the SRE team)

I used Windsurf’s Cascade interface to prompt out code, even while attending meetings. It was surprisingly effective… except for one time when a prompt completely messed up my day’s work. Lesson learned: commit code at every working logical step.

In a couple of days, I had:

  • A feature to upload the CSV dump
  • Filters to slice and dice data
  • A paginated data table
  • Trend analytics with visualizations

Even when I hit errors, I just screenshotted them or pasted logs into Windsurf and asked it to fix them. It did a decent job. When it hallucinated or got stuck, I just restarted with a fresh Cascade.

I had to rewrite the CSV upload logic manually when the semantic mapping to the backend tables went wrong. But overall, 80% of the code was generated—20% I wrote myself. And I reviewed everything to ensure it worked as intended.

Early Feedback & Iteration

I gave access to a couple of colleagues for early feedback. It was overwhelmingly positive. They even suggested new features like:

  • Summarizing long tickets
  • Copy-pasting ticket details directly into Slack
  • Copying visualizations without taking screenshots

I implemented all of those using Windsurf in just a few more days.

In under a week, I had an MVP that cut my Thursday analysis time from 6+ hours to under 2.

(Screenshots: Production Issues Dashboard)

Then Abhay Dandekar, another senior developer, offered to help. He built a Lambda function to call our Helpdesk API every hour to fetch the latest tickets and updates. He got it working in 4 hours—like a boss.

Growing Usage

Word about the dashboard started leaking (okay, I may have leaked it myself 😉). As more people requested access, I had to:

  • Add Google Sign-In
  • Implement authorization controls
  • Build a user admin module
  • Secure the backend APIs with proper access control
  • Add audit logs to track who was using what

I got all this done over a weekend. It consumed my entire weekend, but it was worth it.

AWS Cost Analytics Module

Next, I added a module to analyze AWS cost trends across production and non-prod environments by product.

Initially, it was another CSV upload feature. Later, Abhay added a Lambda to fetch the data daily. I wanted engineering directors to see the cost implications of design decisions—especially when non-prod environments were always-on.

Before this, I spent 30 minutes daily reviewing AWS cost trends. Once the dashboard launched, engineers started checking it themselves. That awareness led to much smarter decisions and significant cost savings.

I added visualizations for:

  • Daily cost trends
  • Monthly breakdowns
  • Environment-specific views
  • Product-level costs

(Screenshots: AWS Cost Analytics Dashboard)

More Modules Coming Soon

I’ve since added:

  • Usage metrics
  • Capitalization tracking
  • (In progress): Performance and engineering metrics

The dashboard now has 200+ users, and I’m releasing access in batches to manage performance.

Lessons Learned from VibeCoding

This was a fun experiment to see how far I could go with GenAI-based development using just prompts.

What I Learned:

  1. Strong system design fundamentals are essential.
  2. Windsurf can get stuck in loops—step in and take control.
  3. Commit frequently. Mandatory.
  4. You’re also the tester—don’t skip this.
  5. If the app breaks badly, roll back. Don’t fix bad code.
  6. GenAI editors are great for senior engineers; less convinced about junior devs.
  7. Model training cutoffs matter—it affects library choices.
  8. Write smart prompts with guardrails (e.g., “no files >300 lines”).
  9. GenAI tools struggle to edit large files (my app.py hit 5,000 lines; I had to refactor manually).
  10. Use virtual environments—GenAI often forgets.
  11. Deployments are tedious—I took help from the SRE team for Jenkins/Terraform setup.
  12. If you love coding, VibeCoding is addictive.

Gratitude

Special thanks to:

  • Bhavani Shankar, Navin Kumaran, Geetha Eswaran, and Sabyasachi Rout for helping with deployment scripts and automation (yes, I bugged them a lot).
  • Pravin Kumar, Vikas Kishore, Nandini PS, and Prithviraj Subburamanian for feedback and acting as de facto product managers for dashboard features.

OpenAI - Gibili Portrait Assistance: AI-Powered Image Generation Made Simple

By: angu10
31 March 2025 at 17:50

Introduction

Ever wished you could create stunning portraits with just a few clicks? Meet Gibili Portrait Assistance, an AI-powered tool that makes generating high-quality portraits effortless. Whether you’re an artist, designer, or simply someone who loves experimenting with AI, Gibili can help bring your ideas to life.

In this post, we’ll walk you through how to use Gibili Portrait Assistance and explore the OpenAI architecture behind it.

How to Use Gibili Portrait Assistance

Using Gibili is straightforward and requires no prior technical knowledge. Here’s a simple step-by-step guide:

1. Enter Your Description or Upload an Image
You can either type a text description of the portrait you want or upload an existing image to be enhanced or transformed by AI.

Text Prompt Example:

  • “A realistic portrait of a woman with curly brown hair, wearing a red scarf, in a cinematic lighting style.”

Image Upload:

  • If you have an image you want to modify or enhance, simply upload it, and Gibili will apply AI-powered enhancements or transformations.

2. Customize Your Preferences
You can fine-tune details such as:

  • Art Style: Realistic, digital painting, anime, etc.
  • Background: Solid color, blurred, natural scenery.
  • Facial Expressions: Smiling, neutral, surprised.
  • Additional Features: Glasses, hats, jewelry, etc.

3. Generate the Image
Press Enter, and within seconds, Gibili will produce a high-resolution portrait based on your input or uploaded image.

4. Refine and Download
If you want adjustments, you can tweak your input and regenerate until you’re satisfied. Once ready, download your portrait in high-quality format.

The OpenAI Architecture Behind Gibili

Gibili Portrait Assistance is powered by OpenAI’s advanced image generation models, leveraging diffusion models to create highly detailed and realistic portraits. Here’s a simplified breakdown:

1. Text-to-Image & Image-to-Image Generation
When you provide a text prompt, the AI model translates it into a visual representation using deep learning techniques. If you upload an image, the model can enhance, transform, or stylize it while maintaining its core structure.

2. Fine-Tuned on Portrait Data
The model has been trained on a vast dataset of portraits across different styles, ensuring high accuracy and creativity in generated images.

3. Iterative Refinement
Instead of creating the final image instantly, the AI gradually refines it through multiple steps, ensuring greater precision and quality.

4. User-Guided Adjustments
Users can modify parameters like style and background, and the model will intelligently adjust the portrait while maintaining coherence.
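
To make this a bit more concrete, here is a minimal sketch of what a text-to-image request looks like with OpenAI's Python SDK. Gibili's exact model and settings are not public, so treat the model name and parameters below as illustrative assumptions.

# Minimal text-to-image request with the OpenAI Python SDK.
# Assumes `pip install openai` and an OPENAI_API_KEY environment variable;
# the model and parameters are placeholders, not Gibili's actual configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="dall-e-3",  # assumed model for illustration
    prompt=("A realistic portrait of a woman with curly brown hair, "
            "wearing a red scarf, in a cinematic lighting style."),
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # link to the generated portrait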

Why Use Gibili Portrait Assistance?

✅ Easy to Use

No need for advanced design skills — just describe what you want or upload an image, and AI does the rest.

🎨 Customizable Output

From photorealistic portraits to artistic illustrations, you can tailor the results to your liking.

🚀 Fast & High-Quality

Generate high-resolution images within seconds.

🖌️ Creative Freedom

Perfect for artists, marketers, and content creators looking for unique visuals.

Get Started with Gibili Today!

Ready to create amazing AI-generated portraits? Try Gibili Portrait Assistance now and explore the limitless possibilities of AI-powered creativity!

Setting Up Kubernetes and Nginx Ingress Controller on an EC2 Instance

By: Ragul.M
19 March 2025 at 16:52

Introduction

Kubernetes (K8s) is a powerful container orchestration platform that simplifies application deployment and scaling. In this guide, we’ll set up Kubernetes on an AWS EC2 instance, install the Nginx Ingress Controller, and configure Ingress rules to expose multiple services (app1 and app2).

Step 1: Setting Up Kubernetes on an EC2 Instance
1.1 Launch an EC2 Instance
Choose an instance with enough resources (e.g., t3.medium or larger) running Ubuntu 20.04 or Amazon Linux 2; the commands below assume Ubuntu.
1.2 Update Packages

sudo apt update && sudo apt upgrade -y  # For Ubuntu 

1.3 Install Docker

sudo apt install -y docker.io  
sudo systemctl enable --now docker

1.4 Install Kubernetes (kubectl, kubeadm, kubelet)

sudo apt install -y apt-transport-https ca-certificates curl
sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
sudo apt install -y kubelet kubeadm kubectl

1.5 Initialize Kubernetes

sudo kubeadm init --pod-network-cidr=192.168.0.0/16

Follow the output instructions to set up kubectl for your user:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

1.6 Install a Network Plugin (Calico)

For Calico:
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

Now, Kubernetes is ready!

Step 2: Install Nginx Ingress Controller
Nginx Ingress Controller helps manage external traffic to services inside the cluster.

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/cloud/deploy.yaml

Wait until the controller is running:

kubectl get pods -n ingress-nginx

You should see ingress-nginx-controller running.

Step 3: Deploy Two Applications (app1 and app2)
3.1 Deploy app1
Create app1-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app1
  template:
    metadata:
      labels:
        app: app1
    spec:
      containers:
      - name: app1
        image: nginx
        ports:
        - containerPort: 80

Create app1-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: app1-service
spec:
  selector:
    app: app1
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: ClusterIP

Apply the resources:

kubectl apply -f app1-deployment.yaml 
kubectl apply -f app1-service.yaml

3.2 Deploy app2
Create app2-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app2
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app2
  template:
    metadata:
      labels:
        app: app2
    spec:
      containers:
      - name: app2
        image: nginx
        ports:
        - containerPort: 80

Create app2-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: app2-service
spec:
  selector:
    app: app2
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: ClusterIP

Apply the resources:

kubectl apply -f app2-deployment.yaml 
kubectl apply -f app2-service.yaml

Step 4: Configure Ingress for app1 and app2
Create nginx-ingress.yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nginx-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: app1.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app1-service
            port:
              number: 80
  - host: app2.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app2-service
            port:
              number: 80

Apply the Ingress rule:

kubectl apply -f nginx-ingress.yaml

Step 5: Verify Everything
5.1 Get Ingress External IP

kubectl get ingress

5.2 Update /etc/hosts (Local Testing Only)
If you're testing on a local machine, add this to /etc/hosts:

<EXTERNAL-IP> app1.example.com
<EXTERNAL-IP> app2.example.com

Replace with the actual external IP of your Ingress Controller.
5.3 Test in Browser or Curl

curl http://app1.example.com
curl http://app2.example.com

If everything is set up correctly, you should see the default Nginx welcome page for both applications.

Conclusion
In this guide, we:

  • Installed Kubernetes on an EC2 instance
  • Set up Nginx Ingress Controller
  • Deployed two services (app1 and app2)
  • Configured Ingress to expose them via domain names

Now, you can easily manage multiple applications in your cluster using a single Ingress resource.

Follow for more . Happy learning :)

Deploying a Two-Tier Web Application on AWS with MySQL and Apache

By: Ragul.M
12 March 2025 at 12:46

In this blog, I will guide you through step-by-step instructions to set up a two-tier architecture on AWS using VPC, Subnets, Internet Gateway, Route Tables, RDS, EC2, Apache, MySQL, PHP, and HTML. This project will allow you to host a registration web application where users can submit their details, which will be stored in an RDS MySQL database.

Step 1: Create a VPC
1.1 Login to AWS Management Console

  • Navigate to the VPC service
  • Click Create VPC
  • Enter the following details:
  • VPC Name: my-vpc
  • IPv4 CIDR Block: 10.0.0.0/16
  • Tenancy: Default
  • Click Create VPC

Step 2: Create Subnets
2.1 Create a Public Subnet

  • Go to VPC > Subnets
  • Click Create Subnet
  • Choose my-vpc
  • Set Subnet Name: public-subnet
  • IPv4 CIDR Block: 10.0.1.0/24
  • Click Create

2.2 Create a Private Subnet
Repeat the steps above but set:

  • Subnet Name: private-subnet
  • IPv4 CIDR Block: 10.0.2.0/24

Step 3: Create an Internet Gateway (IGW) and Attach to VPC
3.1 Create IGW

  • Go to VPC > Internet Gateways
  • Click Create Internet Gateway
  • Set Name: my-igw
  • Click Create IGW

3.2 Attach IGW to VPC

  • Select my-igw
  • Click Actions > Attach to VPC
  • Choose my-vpc and click Attach

Step 4: Configure Route Tables
4.1 Create a Public Route Table

  • Go to VPC > Route Tables
  • Click Create Route Table
  • Set Name: public-route-table
  • Choose my-vpc and click Create
  • Edit Routes → Add a new route:
  • Destination: 0.0.0.0/0
  • Target: my-igw
  • Edit Subnet Associations → Attach public-subnet

Step 5: Create an RDS Database (MySQL)

  • Go to RDS > Create Database
  • Choose Standard Create
  • Select MySQL
  • Set DB instance identifier: my-rds
  • Master Username: admin
  • Master Password: yourpassword
  • Subnet Group: Select private-subnet
  • VPC Security Group: Allow 3306 (MySQL) from my-vpc
  • Click Create Database

Step 6: Launch an EC2 Instance

  • Go to EC2 > Launch Instance
  • Choose Ubuntu 22.04
  • Set Instance Name: my-ec2
  • Select my-vpc and attach public-subnet
  • Security Group: Allow
  • SSH (22) from your IP
  • HTTP (80) from anywhere
  • MySQL (3306) from my-vpc
  • Click Launch Instance

Step 7: Install Apache, PHP, and MySQL Client
7.1 Connect to EC2

ssh -i your-key.pem ubuntu@your-ec2-public-ip

7.2 Install LAMP Stack

sudo apt update && sudo apt install -y apache2 php libapache2-mod-php php-mysql mysql-client

7.3 Start Apache

sudo systemctl start apache2
sudo systemctl enable apache2

Step 8: Configure Web Application
8.1 Create the Registration Form

cd /var/www/html
sudo nano index.html
<!DOCTYPE html>
<html>
<head>
    <title>Registration Form</title>
</head>
<body>
    <h2>User Registration</h2>
    <form action="submit.php" method="POST">
        Name: <input type="text" name="name" required><br>
        DOB: <input type="date" name="dob" required><br>
        Email: <input type="email" name="email" required><br>
        <input type="submit" value="Register">
    </form>
</body>
</html>

8.2 Create PHP Script (submit.php)

sudo nano /var/www/html/submit.php
<?php
$servername = "your-rds-endpoint";
$username = "admin";
$password = "yourpassword";
$dbname = "registration";
$conn = new mysqli($servername, $username, $password, $dbname);
if ($conn->connect_error) {
    die("Connection failed: " . $conn->connect_error);
}
$name = $_POST['name'];
$dob = $_POST['dob'];
$email = $_POST['email'];
$stmt = $conn->prepare("INSERT INTO users (name, dob, email) VALUES (?, ?, ?)");
$stmt->bind_param("sss", $name, $dob, $email);
if ($stmt->execute()) {
    echo "Registration successful";
} else {
    echo "Error: " . $stmt->error;
}
$stmt->close();
$conn->close();
?>

Step 9: Create Target Group

  1. Go to the AWS EC2 Console → Navigate to Target Groups
  2. Click Create target group
  3. Choose Target type: Instance
  4. Enter Target group name: my-target-group
  5. Select Protocol: HTTP
  6. Select Port: 80
  7. Choose the VPC you created earlier
  8. Click Next
  9. Under Register Targets, select your EC2 instances
  10. Click Include as pending below, then Create target group

Step 10: Create an Application Load Balancer (ALB)

  1. Go to AWS EC2 Console → Navigate to Load Balancers
  2. Click Create Load Balancer
  3. Choose Application Load Balancer
  4. Enter ALB Name: my-alb
  5. Scheme: Internet-facing
  6. IP address type: IPv4
  7. Select the VPC
  8. Select at least two public subnets (for high availability)
  9. Click Next

Step 11: Test the Application

  1. Restart Apache: sudo systemctl restart apache2
  2. Open your browser and visit: http://your-ec2-public-ip/
  3. Fill in the form and Submit
  4. Check MySQL Database (the registration database and its users table referenced in submit.php must already exist on the RDS instance):
mysql -u admin -p -h your-rds-endpoint
USE registration;
SELECT * FROM users;

This setup ensures a scalable, secure, and high-availability application on AWS! 🚀

Follow for more and happy learning :)

Deploying a Scalable AWS Infrastructure with VPC, ALB, and Target Groups Using Terraform

By: Ragul.M
11 March 2025 at 06:01

Introduction
In this blog, we will walk through the process of deploying a scalable AWS infrastructure using Terraform. The setup includes:

  • A VPC with public and private subnets
  • An Internet Gateway for public access
  • Application Load Balancers (ALBs) for distributing traffic
  • Target Groups and EC2 instances for handling incoming requests

By the end of this guide, you’ll have a highly available setup with proper networking, security, and load balancing.

Step 1: Creating a VPC with Public and Private Subnets
The first step is to define our Virtual Private Cloud (VPC) with four subnets (two public, two private) spread across multiple Availability Zones.
Terraform Code: vpc.tf

resource "aws_vpc" "main_vpc" {
  cidr_block = "10.0.0.0/16"
}
# Public Subnet 1 - ap-south-1a
resource "aws_subnet" "public_subnet_1" {
  vpc_id            = aws_vpc.main_vpc.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "ap-south-1a"
  map_public_ip_on_launch = true
}
# Public Subnet 2 - ap-south-1b
resource "aws_subnet" "public_subnet_2" {
  vpc_id            = aws_vpc.main_vpc.id
  cidr_block        = "10.0.2.0/24"
  availability_zone = "ap-south-1b"
  map_public_ip_on_launch = true
}
# Private Subnet 1 - ap-south-1a
resource "aws_subnet" "private_subnet_1" {
  vpc_id            = aws_vpc.main_vpc.id
  cidr_block        = "10.0.3.0/24"
  availability_zone = "ap-south-1a"
}
# Private Subnet 2 - ap-south-1b
resource "aws_subnet" "private_subnet_2" {
  vpc_id            = aws_vpc.main_vpc.id
  cidr_block        = "10.0.4.0/24"
  availability_zone = "ap-south-1b"
}
# Internet Gateway for Public Access
resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main_vpc.id
}
# Public Route Table
resource "aws_route_table" "public_rt" {
  vpc_id = aws_vpc.main_vpc.id
}
resource "aws_route" "internet_access" {
  route_table_id         = aws_route_table.public_rt.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = aws_internet_gateway.igw.id
}
resource "aws_route_table_association" "public_assoc_1" {
  subnet_id      = aws_subnet.public_subnet_1.id
  route_table_id = aws_route_table.public_rt.id
}
resource "aws_route_table_association" "public_assoc_2" {
  subnet_id      = aws_subnet.public_subnet_2.id
  route_table_id = aws_route_table.public_rt.id
}

This configuration ensures that our public subnets can access the internet, while our private subnets remain isolated.

Step 2: Setting Up Security Groups
Next, we define security groups to control access to our ALBs and EC2 instances.
Terraform Code: security_groups.tf

resource "aws_security_group" "alb_sg" {
  vpc_id = aws_vpc.main_vpc.id
  # Allow HTTP and HTTPS traffic to ALB
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  # Allow outbound traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

This allows public access to the ALB but restricts other traffic.

Step 3: Creating the Application Load Balancers (ALB)
Now, let’s define two ALBs—one public and one private.
Terraform Code: alb.tf

# Public ALB
resource "aws_lb" "public_alb" {
  name               = "public-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb_sg.id]
  subnets           = [aws_subnet.public_subnet_1.id, aws_subnet.public_subnet_2.id]
}
# Private ALB
resource "aws_lb" "private_alb" {
  name               = "private-alb"
  internal           = true
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb_sg.id]
  subnets           = [aws_subnet.private_subnet_1.id, aws_subnet.private_subnet_2.id]
}

This ensures redundancy and distributes traffic across different subnets.

Step 4: Creating Target Groups for EC2 Instances
Each ALB needs target groups to route traffic to EC2 instances.
Terraform Code: target_groups.tf

resource "aws_lb_target_group" "public_tg" {
  name     = "public-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main_vpc.id
}
resource "aws_lb_target_group" "private_tg" {
  name     = "private-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main_vpc.id
}

These target groups allow ALBs to forward requests to backend EC2 instances.

Step 5: Launching EC2 Instances
Finally, we deploy EC2 instances and register them with the target groups.
Terraform Code: ec2.tf

resource "aws_instance" "public_instance" {
  ami           = "ami-0abcdef1234567890" # Replace with a valid AMI ID
  instance_type = "t2.micro"
  subnet_id     = aws_subnet.public_subnet_1.id
}
resource "aws_instance" "private_instance" {
  ami           = "ami-0abcdef1234567890" # Replace with a valid AMI ID
  instance_type = "t2.micro"
  subnet_id     = aws_subnet.private_subnet_1.id
}

These instances will serve web requests.

Step 6: Registering Instances to Target Groups

resource "aws_lb_target_group_attachment" "public_attach" {
  target_group_arn = aws_lb_target_group.public_tg.arn
  target_id        = aws_instance.public_instance.id
}
resource "aws_lb_target_group_attachment" "private_attach" {
  target_group_arn = aws_lb_target_group.private_tg.arn
  target_id        = aws_instance.private_instance.id
}

This registers our EC2 instances as backend servers.

Final Step: Terraform Apply!
Run the following command to deploy everything:

terraform init
terraform apply -auto-approve

Once completed, you’ll get ALB DNS names, which you can use to access your deployed infrastructure.

Conclusion
This guide covered how to deploy a highly available AWS infrastructure using Terraform, including VPC, subnets, ALBs, security groups, target groups, and EC2 instances. This setup ensures a secure and scalable architecture.

Follow for more and happy learning :)

The Intelligent Loop: A Guide to Modern LLM Agents

By: angu10
24 February 2025 at 06:07

Introduction

Large Language Model (LLM) based AI agents represent a new paradigm in artificial intelligence. Unlike traditional software agents, these systems leverage the powerful capabilities of LLMs to understand, reason, and interact with their environment in more sophisticated ways. This guide will introduce you to the basics of LLM agents and their think-act-observe cycle.

What is an LLM Agent?

An LLM agent is a system that uses a large language model as its core reasoning engine to:

  1. Process natural language instructions
  2. Make decisions based on context and goals
  3. Generate human-like responses and actions
  4. Interact with external tools and APIs
  5. Learn from interactions and feedback

Think of an LLM agent as an AI assistant who can understand, respond, and take actions in the digital world, like searching the web, writing code, or analyzing data.

The Think-Act-Observe Cycle in LLM Agents

Observe (Input Processing)

LLM agents observe their environment through:

  1. Direct user instructions and queries
  2. Context from previous conversations
  3. Data from connected tools and APIs
  4. System prompts and constraints
  5. Environmental feedback

Think (LLM Processing)

The thinking phase for LLM agents involves:

  1. Parsing and understanding input context
  2. Reasoning about the task and requirements
  3. Planning necessary steps to achieve goals
  4. Selecting appropriate tools or actions
  5. Generating natural language responses

The LLM is the "brain," using its trained knowledge to process information and make decisions.

Act (Execution)

LLM agents can take various actions:

  1. Generate text responses
  2. Call external APIs
  3. Execute code
  4. Use specialized tools
  5. Store and retrieve information
  6. Request clarification from users
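
To tie the three phases together, here is a minimal, framework-agnostic sketch of the observe-think-act loop in Python. The call_llm function and the tools dictionary are hypothetical stand-ins for a real LLM API and real integrations, not any specific agent framework.

# A toy observe-think-act loop. `call_llm` and `tools` are hypothetical stand-ins.
def call_llm(history):
    # Placeholder: a real agent would call an LLM API here and return either
    # {"action": "search", "input": "..."} or {"action": "respond", "input": "..."}
    return {"action": "respond", "input": "Here is my answer."}

tools = {"search": lambda query: f"results for {query}"}  # stand-in tool

def run_agent(user_message, max_steps=5):
    history = [{"role": "user", "content": user_message}]          # observe
    for _ in range(max_steps):
        decision = call_llm(history)                                # think
        if decision["action"] == "respond":
            return decision["input"]                                # act: answer the user
        observation = tools[decision["action"]](decision["input"])  # act: use a tool
        history.append({"role": "tool", "content": observation})    # observe the result
    return "Stopped after reaching the step limit."

print(run_agent("Find recent papers on LLM agents"))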

Key Components of LLM Agents

Core LLM

  1. Serves as the primary reasoning engine
  2. Processes natural language input
  3. Generates responses and decisions
  4. Maintains conversation context

Working Memory

  1. Stores conversation history
  2. Maintains current context
  3. Tracks task progress
  4. Manages temporary information

Tool Use

  1. API integrations
  2. Code execution capabilities
  3. Data processing tools
  4. External knowledge bases
  5. File manipulation utilities

Planning System

  1. Task decomposition
  2. Step-by-step reasoning
  3. Goal tracking
  4. Error handling and recovery

Types of LLM Agent Architectures

Simple Agents

  1. Single LLM with basic tool access
  2. Direct input-output processing
  3. Limited memory and context
  4. Example: Basic chatbots with API access

ReAct Agents

  1. Reasoning and Acting framework
  2. Step-by-step thought process
  3. Explicit action planning
  4. Self-reflection capabilities

Chain-of-Thought Agents

  1. Detailed reasoning steps
  2. Complex problem decomposition
  3. Transparent decision-making
  4. Better error handling

Multi-Agent Systems

  1. Multiple LLM agents working together
  2. Specialized roles and capabilities
  3. Inter-agent communication
  4. Collaborative problem-solving

Common Applications

LLM agents are increasingly used for:

  1. Personal assistance and task automation
  2. Code generation and debugging
  3. Data analysis and research
  4. Content creation and editing
  5. Customer service and support
  6. Process automation and workflow management

Best Practices for LLM Agent Design

Clear Instructions

  1. Provide explicit system prompts
  2. Define constraints and limitations
  3. Specify available tools and capabilities
  4. Set clear success criteria

Effective Memory Management

  1. Implement efficient context tracking
  2. Prioritize relevant information
  3. Clean up unnecessary data
  4. Maintain conversation coherence

Robust Tool Integration

  1. Define clear tool interfaces
  2. Handle API errors gracefully
  3. Validate tool outputs
  4. Monitor resource usage

Safety and Control

  1. Implement ethical guidelines
  2. Add safety checks and filters
  3. Monitor agent behavior
  4. Maintain user control

Ever Wonder How AI "Sees" Like You Do? A Beginner's Guide to Attention

By: angu10
19 February 2025 at 02:05

Understanding Attention in Large Language Models: A Beginner's Guide

Have you ever wondered how ChatGPT or other AI models can understand and respond to your messages so well? The secret lies in a mechanism called ATTENTION - a crucial component that helps these models understand relationships between words and generate meaningful responses. Let's break it down in simple terms!

What is Attention?

Imagine you're reading a long sentence: "The cat sat on the mat because it was comfortable." When you read "it," your brain naturally connects back to either "the cat" or "the mat" to understand what "it" refers to. This is exactly what attention does in AI models - it helps the model figure out which words are related to each other.

How Does Attention Work?

The attention mechanism works like a spotlight that can focus on different words when processing each word in a sentence. Here's a simple breakdown:

  1. For each word, the model calculates how important every other word is in relation to it.
  2. It then uses these importance scores to create a weighted combination of all words.
  3. This helps the model understand context and relationships between words.

Let's visualize this with an example:

(Diagram: the word "it" attending to every other word in the sentence, with arrow thickness indicating attention weight)

In this diagram, the word "it" is paying attention to all other words in the sentence. The thickness of the arrows could represent the attention weights. The model would likely assign higher attention weights to "cat" and "mat" to determine which one "it" refers to.
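
If you like to see the math in code, here is a tiny NumPy sketch of scaled dot-product attention over this sentence. The vectors are random stand-ins for learned query, key, and value projections, so only the mechanics matter: similarity scores, a softmax, and a weighted combination.

# Scaled dot-product attention over a toy sentence, with random vectors
# standing in for learned query/key/value projections.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

tokens = "The cat sat on the mat because it was comfortable".split()
d = 8                                   # embedding size
rng = np.random.default_rng(0)
Q = rng.normal(size=(len(tokens), d))   # queries
K = rng.normal(size=(len(tokens), d))   # keys
V = rng.normal(size=(len(tokens), d))   # values

scores = Q @ K.T / np.sqrt(d)           # how relevant each word is to every other word
weights = softmax(scores)               # attention weights, one row per word
output = weights @ V                    # weighted combination of all words

print(np.round(weights[tokens.index("it")], 2))  # how much "it" attends to each word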

Multi-Head Attention: Looking at Things from Different Angles

In modern language models, we don't just use one attention mechanism - we use several in parallel! This is called Multi-Head Attention. Each "head" can focus on different types of relationships between words.

Let's consider the sentence: The chef who won the competition prepared a delicious meal.

  • Head 1 could focus on subject-verb relationships (chef - prepared)
  • Head 2 might attend to adjective-noun pairs (delicious - meal)
  • Head 3 could look at broader context (competition - meal)

Here's a diagram:

(Diagram: three attention heads over the sentence, linking chef - prepared, delicious - meal, and competition - meal)

This multi-headed approach helps the model understand text from different perspectives, just like how we humans might read a sentence multiple times to understand different aspects of its meaning.

Why Attention Matters

Attention mechanisms have revolutionized natural language processing because they:

  1. Handle long-range dependencies better than previous methods.
  2. Can process input sequences in parallel.
  3. Create interpretable connections between words.
  4. Allow models to focus on relevant information while ignoring irrelevant parts.

Recent Developments and Research

The field of LLMs is rapidly evolving, with new techniques and insights emerging regularly. Here are a few areas of active research:

Contextual Hallucinations

Large language models (LLMs) can sometimes hallucinate details and respond with unsubstantiated answers that are inaccurate with respect to the input context.

The Lookback Lens technique analyzes attention patterns to detect when a model might be generating information not present in the input context.

Extending Context Window

Researchers are working on extending the context window sizes of LLMs, allowing them to process longer text sequences.

Conclusion

While the math behind attention mechanisms can be complex, the core idea is simple: help the model focus on the most relevant parts of the input when processing each word. This allows language models to understand the context and relationships between words better, leading to more accurate and coherent responses.

Remember, this is just a high-level overview - there's much more to learn about attention mechanisms! Hopefully, this will give you a good foundation for understanding how modern AI models process and understand text.

A Step-by-Step Guide to LLM Function Calling in Python

By: angu10
12 February 2025 at 23:06

Function calling allows Claude to interact with external functions and tools in a structured way. This guide will walk you through implementing function calling with Claude using Python, complete with examples and best practices.

Prerequisites

To get started, you'll need:

  • Python 3.7+
  • anthropic Python package
  • A valid API key from Anthropic

Basic Setup

from anthropic import Anthropic
import json
# Initialize the client
anthropic = Anthropic(api_key='your-api-key')

Defining Functions

function_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a specific location",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City name or coordinates"
            },
            "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit"
            }
        },
        "required": ["location"]
    }
}

Making Function Calls

def get_weather(location, unit="celsius"):
    # This is a mock implementation, but you can call your own weather API here
    return {
        "location": location,
        "temperature": 22 if unit == "celsius" else 72,
        "conditions": "sunny"
    }
def process_function_call(tool_call):
    try:
        # tool_call.input is already a dict of parsed arguments
        if tool_call.name == "get_weather":
            result = get_weather(**tool_call.input)
            return json.dumps(result)
        raise ValueError(f"Unknown function: {tool_call.name}")
    except Exception as e:
        return json.dumps({"error": str(e)})
# Example conversation with function calling
messages = [
    {
        "role": "user",
        "content": "What's the weather like in Paris?"
    }
]
while True:
    response = anthropic.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=1024,
        messages=messages,
        tools=[function_schema]
    )
    # Check if Claude wants to call a function
    if response.stop_reason == "tool_use":
        # Keep Claude's turn (including its tool_use blocks) in the history
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                # Execute the function and collect the result
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": process_function_call(block)
                })
        # Tool results go back to Claude in a user message
        messages.append({"role": "user", "content": tool_results})
    else:
        # Normal response - print the text and stop
        print("".join(block.text for block in response.content if block.type == "text"))
        break

Best Practices

  1. Clear Function Descriptions
  • Write detailed descriptions for your functions
  • Specify parameter types and constraints clearly
  • Include examples in the descriptions when helpful
  2. Input Validation
  • Validate all function inputs before processing
  • Return meaningful error messages
  • Handle edge cases gracefully
  3. Response Formatting
  • Return consistent JSON structures
  • Include status indicators in responses
  • Format error messages uniformly
  4. Security Considerations
  • Validate and sanitize all inputs
  • Implement rate limiting if needed
  • Use appropriate authentication
  • Don't expose sensitive information in function descriptions

Conclusion

Function calling with Claude enables powerful integrations between the language model and external tools. By following these best practices and implementing proper error handling, you can create robust and reliable function-calling implementations.

Understanding RAGAS: A Comprehensive Framework for RAG System Evaluation

By: angu10
1 February 2025 at 01:40

In the rapidly evolving landscape of artificial intelligence, Retrieval Augmented Generation (RAG) systems have emerged as a crucial technology for enhancing Large Language Models with external knowledge. However, ensuring the quality and reliability of these systems requires robust evaluation methods. Enter RAGAS (Retrieval Augmented Generation Assessment System), a groundbreaking framework that provides comprehensive metrics for evaluating RAG systems.

The Importance of RAG Evaluation

RAG systems combine the power of retrieval mechanisms with generative AI to produce more accurate and contextually relevant responses. However, their complexity introduces multiple potential points of failure, from retrieval accuracy to answer generation quality. This is where RAGAS steps in, offering a structured approach to assessment that helps developers and organizations maintain high standards in their RAG implementations.

Core RAGAS Metrics

Context Precision

Context precision measures how relevant the retrieved information is to the given query. This metric evaluates whether the system is pulling in the right pieces of information from its knowledge base. A high context precision score indicates that the retrieval component is effectively identifying and selecting relevant content, while a low score might suggest that the system is retrieving tangentially related or irrelevant information.

Faithfulness

Faithfulness assesses the alignment between the generated answer and the provided context. This crucial metric ensures that the system's responses are grounded in the retrieved information rather than hallucinated or drawn from the model's pre-trained knowledge. A faithful response should be directly supported by the context, without introducing external or contradictory information.

Answer Relevancy

The answer relevancy metric evaluates how well the generated response addresses the original question. This goes beyond mere factual accuracy to assess whether the answer provides the information the user was seeking. A highly relevant answer should directly address the query's intent and provide appropriate detail level.

Context Recall

Context recall compares the retrieved contexts against ground truth information, measuring how much of the necessary information was successfully retrieved. This metric helps identify cases where critical information might be missing from the system's responses, even if what was retrieved was accurate.

Practical Implementation

RAGAS's implementation is designed to be straightforward while providing deep insights. The framework accepts evaluation datasets containing:

  • Questions posed to the system
  • Retrieved contexts for each question
  • Generated answers
  • Ground truth answers for comparison

This structured approach allows for automated evaluation across multiple dimensions of RAG system performance, providing a comprehensive view of system quality.

Benefits and Applications

Quality Assurance

RAGAS enables continuous monitoring of RAG system performance, helping teams identify degradation or improvements over time. This is particularly valuable when making changes to the retrieval mechanism or underlying models.

Development Guidance

The granular metrics provided by RAGAS help developers pinpoint specific areas needing improvement. For instance, low context precision scores might indicate the need to refine the retrieval strategy, while poor faithfulness scores might suggest issues with the generation parameters.

Comparative Analysis

Organizations can use RAGAS to compare different RAG implementations or configurations, making it easier to make data-driven decisions about system architecture and deployment.

Best Practices for RAGAS Implementation

  1. Regular Evaluation: Implement RAGAS as part of your regular testing pipeline to catch potential issues early and maintain consistent quality.
  2. Diverse Test Sets: Create evaluation datasets that cover various query types, complexities, and subject matters to ensure robust assessment.
  3. Metric Thresholds: Establish minimum acceptable scores for each metric based on your application's requirements and use these as quality gates in your deployment process.
  4. Iterative Refinement: Use RAGAS metrics to guide iterative improvements to your RAG system, focusing on the areas showing the lowest performance scores.

Practical Code Examples

Basic RAGAS Evaluation

Here's a simple example of how to implement RAGAS evaluation in your Python code:

from ragas import evaluate
from datasets import Dataset
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision
)

def evaluate_rag_system(questions, contexts, answers, references):
    """
    Simple function to evaluate a RAG system using RAGAS

    Args:
        questions (list): List of questions
        contexts (list): List of contexts for each question
        answers (list): List of generated answers
        references (list): List of reference answers (ground truth)

    Returns:
        EvaluationResult: RAGAS evaluation results
    """
    # First, let's make sure you have the required packages
    try:
        import ragas
        import datasets
    except ImportError:
        print("Please install required packages:")
        print("pip install ragas datasets")
        return None

    # Prepare evaluation dataset
    eval_data = {
        "question": questions,
        "contexts": [[ctx] for ctx in contexts],  # RAGAS expects list of lists
        "answer": answers,
        "reference": references
    }

    # Convert to Dataset format
    eval_dataset = Dataset.from_dict(eval_data)

    # Run evaluation with key metrics
    results = evaluate(
        eval_dataset,
        metrics=[
            faithfulness,      # Measures if answer is supported by context
            answer_relevancy,  # Measures if answer is relevant to question
            context_precision  # Measures if retrieved context is relevant
        ]
    )

    return results

# Example usage
if __name__ == "__main__":
    # Sample data
    questions = [
        "What are the key features of Python?",
        "How does Python handle memory management?"
    ]

    contexts = [
        "Python is a high-level programming language known for its simple syntax and readability. It supports multiple programming paradigms including object-oriented, imperative, and functional programming.",
        "Python uses automatic memory management through garbage collection. It employs reference counting as the primary mechanism and has a cycle-detecting garbage collector for handling circular references."
    ]

    answers = [
        "Python is known for its simple syntax and readability, and it supports multiple programming paradigms including OOP.",
        "Python handles memory management automatically through garbage collection, using reference counting and cycle detection."
    ]

    references = [
        "Python's key features include readable syntax and support for multiple programming paradigms like OOP, imperative, and functional programming.",
        "Python uses automatic garbage collection with reference counting and cycle detection for memory management."
    ]

    # Run evaluation
    results = evaluate_rag_system(
        questions=questions,
        contexts=contexts,
        answers=answers,
        references=references
    )

    if results:
        # Print results
        print("\nRAG System Evaluation Results:")
        print(results)  
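
To turn the Metric Thresholds best practice into an automated quality gate, you can compare the aggregate scores against minimum values before promoting a deployment. The sketch below is illustrative: the threshold numbers are assumptions, and how you read the aggregate scores out of the RAGAS result object (for example via dict(results) or results.to_pandas()) can vary between RAGAS versions.

# Minimal quality-gate sketch on top of RAGAS scores.
# Threshold values are illustrative assumptions, not recommendations.
THRESHOLDS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.75,
    "context_precision": 0.70,
}

def passes_quality_gate(scores):
    """Return True only if every thresholded metric meets its minimum score."""
    failures = {
        metric: score
        for metric, score in scores.items()
        if metric in THRESHOLDS and score < THRESHOLDS[metric]
    }
    for metric, score in failures.items():
        print(f"FAIL: {metric} = {score:.2f} (minimum {THRESHOLDS[metric]:.2f})")
    return not failures

# Example usage with aggregate scores pulled from a RAGAS result:
# scores = dict(results)              # or derive means from results.to_pandas()
# if not passes_quality_gate(scores):
#     raise SystemExit("RAG quality gate failed - blocking deployment")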

RAG vs GraphRAG

By: angu10
20 January 2025 at 04:47

Introduction to RAG and GraphRAG

What is RAG?

RAG, or Retrieval-Augmented Generation, is a technique that combines information retrieval with text generation to produce more accurate and contextually relevant responses. It works by retrieving relevant information from a knowledge base and then using that information to augment the input to a large language model (LLM).
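
As a rough, toy illustration of that retrieve-then-generate flow, the sketch below uses a naive keyword-overlap retriever and a stubbed call_llm function; both are placeholders standing in for a real vector store and a real LLM client, not part of any particular library.

def call_llm(prompt):
    # Placeholder for an actual LLM call (OpenAI, Llama, etc.).
    return f"[answer generated from a prompt of {len(prompt)} characters]"

def retrieve(query, documents, top_k=2):
    """Rank documents by naive keyword overlap with the query."""
    query_terms = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def rag_answer(query, documents):
    # Augment the prompt with the retrieved context before generation.
    context = "\n".join(retrieve(query, documents))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

docs = [
    "Python uses automatic garbage collection for memory management.",
    "The Eiffel Tower is located in Paris.",
]
print(rag_answer("How does Python manage memory?", docs))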

What is GraphRAG?

GraphRAG is an extension of the RAG framework that incorporates graph-structured knowledge. Instead of using a flat document-based retrieval system, GraphRAG utilizes graph databases to represent and query complex relationships between entities and concepts.
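
To make the graph idea concrete, here is a small sketch that uses networkx as a stand-in for a real graph database; the entities and relations are invented purely for illustration.

import networkx as nx

# Tiny in-memory knowledge graph; a production GraphRAG system would use a
# graph database, but the multi-hop retrieval idea is the same.
kg = nx.DiGraph()
kg.add_edge("CompanyA", "ProductX", relation="manufactures")
kg.add_edge("ProductX", "ComponentY", relation="contains")
kg.add_edge("ComponentY", "SupplierZ", relation="supplied_by")

def multi_hop_context(graph, entity, hops=2):
    """Collect relation triples within `hops` steps of the query entity."""
    facts = []
    frontier = {entity}
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            for neighbor in graph.successors(node):
                rel = graph.edges[node, neighbor].get("relation", "related_to")
                facts.append(f"{node} {rel} {neighbor}")
                next_frontier.add(neighbor)
        frontier = next_frontier
    return facts

# These triples would then be handed to the LLM as retrieved context.
print(multi_hop_context(kg, "CompanyA"))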

Applications of RAG and GraphRAG

RAG Applications

  1. Question-answering systems
  2. Chatbots and virtual assistants
  3. Content summarization
  4. Fact-checking and information verification
  5. Personalized content generation

GraphRAG Applications

  1. Knowledge graph-based question answering
  2. Complex reasoning tasks
  3. Recommendation systems
  4. Fraud detection and financial analysis
  5. Scientific research and literature review

Pros and Cons of RAG

Pros of RAG

  1. Improved accuracy: By retrieving relevant information, RAG can provide more accurate and up-to-date responses.
  2. Reduced hallucinations: The retrieval step helps ground the model's responses in factual information.
  3. Scalability: Easy to update the knowledge base without retraining the entire model.
  4. Transparency: The retrieved documents can be used to explain the model's reasoning.
  5. Customizability: Can be tailored to specific domains or use cases.

Cons of RAG

  1. Latency: The retrieval step can introduce additional latency compared to pure generation models.
  2. Complexity: Implementing and maintaining a RAG system can be more complex than using a standalone LLM.
  3. Quality-dependent: The system's performance heavily relies on the quality and coverage of the knowledge base.
  4. Potential for irrelevant retrievals: If the retrieval system is not well-tuned, it may fetch irrelevant information.
  5. Storage requirements: Maintaining a large knowledge base can be resource-intensive.

Pros and Cons of GraphRAG

Pros of GraphRAG

  1. Complex relationship modeling: Can represent and query intricate relationships between entities.
  2. Improved context understanding: Graph structure allows for better capturing of contextual information.
  3. Multi-hop reasoning: Enables answering questions that require following multiple steps or connections.
  4. Flexibility: Can incorporate various types of information and relationships in a unified framework.
  5. Efficient querying: Graph databases can be more efficient for certain types of queries compared to traditional databases.

Cons of GraphRAG

  1. Increased complexity: Building and maintaining a knowledge graph is more complex than a document-based system.
  2. Higher computational requirements: Graph operations can be more computationally intensive.
  3. Data preparation challenges: Converting unstructured data into a graph format can be time-consuming and error-prone.
  4. Potential for overfitting: If the graph structure is too specific, it may not generalize well to new queries.
  5. Scalability concerns: As the graph grows, managing and querying it efficiently can become challenging.

Comparing RAG and GraphRAG

When to Use RAG

  • For general-purpose question-answering systems
  • When dealing with primarily textual information
  • In scenarios where quick implementation and simplicity are priorities
  • For applications that don't require complex relationship modeling

When to Use GraphRAG

  • For domain-specific applications with complex relationships (e.g., scientific research, financial analysis)
  • When multi-hop reasoning is crucial
  • In scenarios where understanding context and relationships is more important than raw text retrieval
  • For applications that can benefit from a structured knowledge representation

Future Directions and Challenges

Advancements in RAG

  1. Improved retrieval algorithms
  2. Better integration with LLMs
  3. Real-time knowledge base updates
  4. Multi-modal RAG (incorporating images, audio, etc.)

Advancements in GraphRAG

  1. More efficient graph embedding techniques
  2. Integration with other AI techniques (e.g., reinforcement learning)
  3. Automated graph construction and maintenance
  4. Explainable AI through graph structures

Common Challenges

  1. Ensuring data privacy and security
  2. Handling biases in knowledge bases
  3. Improving computational efficiency
  4. Enhancing the interpretability of results

Conclusion

Both RAG and GraphRAG represent significant advancements in augmenting language models with external knowledge. While RAG offers a more straightforward approach suitable for many general applications, GraphRAG provides a powerful framework for handling complex, relationship-rich domains. The choice between the two depends on the specific requirements of the application, the nature of the data, and the complexity of the reasoning tasks involved. As these technologies continue to evolve, we can expect to see even more sophisticated and efficient ways of combining retrieval, reasoning, and generation in AI systems.

🚀 How I Adopted the Lean Startup Mindset to Drive Innovation in My Team

By: angu10
11 January 2025 at 18:23

How I Adopted a Lean Startup Mindset in My Team’s Product Development 🚀

Developing innovative products in a world of uncertainty requires a mindset shift. On my team, we've adopted the Lean Startup mindset to ensure that every product we build is validated by real user needs and designed for scalability. Here's how we integrated this approach:

1. Value Hypothesis: Testing What Matters Most

We start by hypothesizing the value our product delivers. Since customers may not always articulate their needs, we focus on educating them about the problem and demonstrating how our solution fits into their lives. Through early user engagement and feedback, we validate whether the product solves a real problem.

2. Growth Hypothesis: Building for Scalability

Once we validate the product's value, we focus on testing its technical scalability. We run controlled experiments with system architecture, performance optimization, and infrastructure design to ensure our solution can handle growing user demands. Each iteration helps us identify potential bottlenecks, improve system reliability, and establish robust engineering practices that support future growth.

3. Minimum Viable Product (MVP): Launching to Learn

Instead of waiting to perfect our product, we launch an MVP to get it in front of users quickly. The goal is to learn, not to impress. By observing how users interact with the MVP, we gain valuable insights to prioritize features, fix pain points, and improve the user experience.

Fostering a Lean Mindset

Adopting the Lean Startup framework has been transformative for our team. It’s taught us to embrace experimentation, view failures as learning opportunities, and focus on delivering value to our users.

If you’re building a product and want to innovate smarter, consider adopting the Lean Startup mindset.

Building a Secure Web Application with AWS VPC, RDS, and a Simple Registration Page

31 December 2024 at 09:41

Here, we will see how to set up a Virtual Private Cloud (VPC) with two subnets: a public subnet to host a web application and a private subnet to host a secure RDS (Relational Database Service) instance. We’ll also build a simple registration page hosted in the public subnet, which will log user input into the RDS instance.

By the end of this tutorial, you will have a functional web application where user data from a registration form is captured and stored securely in a private RDS instance.

  1. VPC Setup: We will create a VPC with two subnets:
    • Public Subnet: Hosts a simple HTML-based registration page with an EC2 instance.
    • Private Subnet: Hosts an RDS instance (e.g., MySQL or PostgreSQL) to store registration data.
  2. Web Application: A simple registration page on the public subnet will allow users to input their data (e.g., name, email, and password). When submitted, this data will be logged into the RDS database in the private subnet.
  3. Security:
    • The EC2 instance will be in the public subnet, accessible from the internet.
    • The RDS instance will reside in the private subnet, isolated from direct public access for security purposes.
  4. Routing: We will set up appropriate route tables and security groups to ensure the EC2 instance in the public subnet can communicate with the RDS instance in the private subnet, but the RDS instance will not be accessible from the internet.

Step 1: Create a VPC with Public and Private Subnets

  1. Create the VPC:

    • Open the VPC Console in the AWS Management Console.
    • Click Create VPC and enter the details:
      • CIDR Block: 10.0.0.0/16 (this is the range of IP addresses your VPC will use).
      • Name: e.g., MyVPC.
  2. Create Subnets:

    • Public Subnet:
      • CIDR Block: 10.0.1.0/24
      • Name: PublicSubnet
      • Availability Zone: Choose an available zone.
    • Private Subnet:
      • CIDR Block: 10.0.2.0/24
      • Name: PrivateSubnet
      • Availability Zone: Choose a different zone.
  3. Create an Internet Gateway (IGW):

    • In the VPC Console, create an Internet Gateway and attach it to your VPC.
  4. Update the Route Table for Public Subnet:

    • Create or modify the route table for the public subnet to include a route to the Internet Gateway (0.0.0.0/0 → IGW).
  5. Update the Route Table for Private Subnet:

    • Create or modify the route table for the private subnet to route traffic to the NAT Gateway (for outbound internet access, if needed).

Step 2: Launch EC2 Instance in Public Subnet for Webpage Hosting

  1. Launch EC2 Instance:

    • Go to the EC2 Console, and launch a new EC2 instance using an Ubuntu or Amazon Linux AMI.
    • Select the Public Subnet and assign a public IP to the instance.
    • Attach a Security Group that allows inbound traffic on HTTP (port 80).
  2. Install Apache Web Server:

    • SSH into your EC2 instance and install Apache:
     sudo apt update
     sudo apt install apache2
    
  3. Create the Registration Page:

    • In /var/www/html, create an HTML file for the registration form (e.g., index.html):
     <html>
       <body>
         <h1>Registration Form</h1>
         <form action="register.php" method="post">
           Name: <input type="text" name="name"><br>
           Email: <input type="email" name="email"><br>
           Password: <input type="password" name="password"><br>
           <input type="submit" value="Register">
         </form>
       </body>
     </html>
    
  4. Configure Apache:

  • Edit the Apache config files to ensure the server is serving the HTML page and can handle POST requests. You can use PHP or Python (Flask, Django) for handling backend processing.

Step 3: Launch RDS Instance in Private Subnet

  1. Create the RDS Instance:

    • In the RDS Console, create a new MySQL or PostgreSQL database instance.
    • Ensure the database is not publicly accessible (so it stays secure in the private subnet).
    • Choose the Private Subnet for deployment.
  2. Security Groups:

    • Create a Security Group for the RDS instance that allows inbound traffic on port 3306 (for MySQL) or 5432 (for PostgreSQL) from the public subnet EC2 instance.

Step 4: Connect the EC2 Web Server to RDS

  1. Install MySQL Client on EC2:

    • SSH into your EC2 instance and install the MySQL client:
     sudo apt-get install mysql-client
    
  2. Test Database Connectivity:

    • Test the connection to the RDS instance from the EC2 instance using the database endpoint:
     mysql -h <RDS-endpoint> -u <username> -p
    
  3. Create the Database and Table:

    • Once connected, create a database and table to store the registration data:
     CREATE DATABASE registration_db;
     USE registration_db;
     CREATE TABLE users (
       id INT AUTO_INCREMENT PRIMARY KEY,
       name VARCHAR(100),
       email VARCHAR(100),
       password VARCHAR(100)
     );
    

Step 5: Handle Form Submissions and Store Data in RDS

  1. Backend Processing:

    • You can use PHP, Python (Flask/Django), or Node.js to handle the form submission.
    • Example using PHP:
      • Install PHP and MySQL:
       sudo apt install php libapache2-mod-php php-mysql
    
 - Create a PHP script to handle the form submission (register.php):

     <?php
     if ($_SERVER["REQUEST_METHOD"] == "POST") {
         $name = $_POST['name'];
         $email = $_POST['email'];
         // Store a hash of the password rather than the plain text
         $password = password_hash($_POST['password'], PASSWORD_DEFAULT);

         // Connect to the RDS MySQL database
         $conn = new mysqli("<RDS-endpoint>", "<username>", "<password>", "registration_db");
         if ($conn->connect_error) {
             die("Connection failed: " . $conn->connect_error);
         }

         // Insert user data with a prepared statement to avoid SQL injection
         $stmt = $conn->prepare("INSERT INTO users (name, email, password) VALUES (?, ?, ?)");
         $stmt->bind_param("sss", $name, $email, $password);
         if ($stmt->execute()) {
             echo "New record created successfully";
         } else {
             echo "Error: " . $stmt->error;
         }
         $stmt->close();
         $conn->close();
     }
     ?>

 - Place this script in /var/www/html and configure Apache to serve the form.
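
As mentioned in the backend options above, Python is an equally valid choice. The following is a minimal Flask sketch of the same handler, assuming the flask and pymysql packages are installed; the endpoint, credentials, and the /register route are placeholders you would adapt (the HTML form's action must point to whichever handler you deploy).

# app.py - minimal Flask alternative to register.php (illustrative sketch).
from flask import Flask, request
import pymysql

app = Flask(__name__)

def get_connection():
    # Placeholders: use your actual RDS endpoint and credentials.
    return pymysql.connect(
        host="<RDS-endpoint>",
        user="<username>",
        password="<password>",
        database="registration_db",
    )

@app.route("/register", methods=["POST"])
def register():
    name = request.form["name"]
    email = request.form["email"]
    password = request.form["password"]  # hash this in a real application

    conn = get_connection()
    try:
        with conn.cursor() as cursor:
            # Parameterized query to avoid SQL injection.
            cursor.execute(
                "INSERT INTO users (name, email, password) VALUES (%s, %s, %s)",
                (name, email, password),
            )
        conn.commit()
    finally:
        conn.close()
    return "New record created successfully"

if __name__ == "__main__":
    # Quick test only; put this behind Apache or a WSGI server in practice.
    app.run(host="0.0.0.0", port=80)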




Step 6: Test the Registration Form

  1. Access the Webpage:

    • Open a browser and go to the public IP address of the EC2 instance (e.g., http://<EC2-Public-IP>).
  2. Submit the Registration Form:

    • Enter a name, email, and password, then submit the form.
  • Check the RDS database to ensure the data has been correctly inserted.

MY OUTPUT:

(Screenshots omitted.)

By following these steps, we have successfully built a secure and scalable web application on AWS. The EC2 instance in the public subnet hosts the registration page, and the private subnet securely stores user data in an RDS instance. We have ensured security by isolating the RDS instance from public access, using VPC subnets, and configuring appropriate security groups.

Building a Highly Available and Secure Web Application Architecture with VPCs, Load Balancers, and Private Subnets

31 December 2024 at 09:29

Overview

1. Single VPC with Public and Private Subnets

In this architecture, we will use a single VPC that consists of both public and private subnets. Each subnet serves different purposes:

Public Subnet:

  • Hosts the website served by EC2 instances.
  • The EC2 instances are managed by an Auto Scaling Group (ASG) to ensure high availability and scalability.
  • A Load Balancer (ALB) distributes incoming traffic across the EC2 instances.

Private Subnet:

  • Hosts an RDS database, which securely stores the data submitted via the website.
  • The EC2 instances in the public subnet interact with the RDS instance in the private subnet via a private IP.
  • The private subnet has a VPC Endpoint to access S3 securely without traversing the public internet.

2. Route 53 Integration for Custom Domain Name

Using AWS Route 53, you can create a DNS record to point to the Load Balancer's DNS name, which allows users to access the website via a custom domain name. This step ensures that your application is accessible from a friendly, branded URL.

3. Secure S3 Access via VPC Endpoint

To securely interact with Amazon S3 from the EC2 instances in the private subnet, we will use an S3 VPC Endpoint. This VPC endpoint ensures that all traffic between the EC2 instances and S3 happens entirely within the AWS network, avoiding the public internet and enhancing security.

4. VPC Peering for Inter-VPC Communication

In some cases, you may want to establish communication between two VPCs for resource sharing or integration. VPC Peering or Transit Gateways are used to connect different VPCs, ensuring resources in one VPC can communicate with resources in another VPC securely.

Step 1: Set Up the VPC and Subnets

  1. Create a VPC:

    • Use the AWS VPC Wizard or AWS Management Console to create a VPC with a CIDR block (e.g., 10.0.0.0/16).
  2. Create Subnets:

  • Public Subnet: Assign a CIDR block like 10.0.1.0/24 to the public subnet. This subnet will host your web servers and load balancer.
  • Private Subnet: Assign a CIDR block like 10.0.2.0/24 to the private subnet, where your RDS instances will reside.
  3. Internet Gateway:
    • Attach an Internet Gateway to the VPC and route traffic from the public subnet to the internet.
  4. Route Table for Public Subnet:
    • Ensure that the public subnet has a route to the Internet Gateway so that traffic can flow in and out.
  5. Route Table for Private Subnet:
    • The private subnet should not have direct internet access. Instead, use a NAT Gateway in the public subnet for outbound internet access from the private subnet, if required.

Step 2: Set Up the Load Balancer (ALB)

  1. Create an Application Load Balancer (ALB):

    • Navigate to the EC2 console, select Load Balancers, and create an Application Load Balancer (ALB).
    • Choose the public subnet to deploy the ALB and configure listeners on port 80 (HTTP) or 443 (HTTPS).
    • Assign security groups to the ALB to allow traffic on these ports.
  2. Create Target Groups:

    • Create target groups for the ALB that point to your EC2 instances or Auto Scaling Group.
  3. Add EC2 Instances to the Target Group:

    • Add EC2 instances from the public subnet to the target group for load balancing.
  4. Configure Auto Scaling Group (ASG):

    • Create an Auto Scaling Group (ASG) with a launch configuration to automatically scale EC2 instances based on traffic load.

Step 3: Set Up Amazon RDS in the Private Subnet

  1. Launch an RDS Instance:

    • In the AWS RDS Console, launch an RDS database instance (e.g., MySQL, PostgreSQL) within the private subnet.
    • Ensure the RDS instance is not publicly accessible, keeping it secure within the VPC.
  2. Connect EC2 to RDS:

    • Ensure that your EC2 instances in the public subnet can connect to the RDS instance in the private subnet using private IPs.

Step 4: Set Up the S3 VPC Endpoint for Secure S3 Access

  1. Create a VPC Endpoint for S3:

    • In the VPC Console, navigate to Endpoints and create a Gateway VPC Endpoint for S3.
    • Select the private subnet and configure the route table to ensure traffic to S3 goes through the VPC endpoint.
  2. Configure Security Group and IAM Role:

    • Ensure your EC2 instances have the necessary IAM roles to access the S3 bucket.
    • Attach security groups to allow outbound traffic to the S3 VPC endpoint.
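
The same gateway endpoint can also be created programmatically. This is a hedged boto3 sketch; the region, VPC ID, and route table ID are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

# Gateway VPC endpoint for S3, associated with the private subnet's route table.
response = ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
print(response["VpcEndpoint"]["VpcEndpointId"])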

Step 5: Set Up Route 53 for Custom Domain

  1. Create a Hosted Zone:

    • In the Route 53 Console, create a hosted zone for your domain (e.g., example.com).
  2. Create Record Set for the Load Balancer:

    • Create an A Record or CNAME Record pointing to the DNS name of the ALB (e.g., mywebsite-1234567.elb.amazonaws.com).
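
For reference, the alias record can also be created with boto3; the hosted zone ID, the ALB DNS name, and the ALB's region-specific hosted zone ID below are placeholders.

import boto3

route53 = boto3.client("route53")

# Point example.com at the ALB with an alias A record.
route53.change_resource_record_sets(
    HostedZoneId="Z111111QQQQQQQ",  # your hosted zone (placeholder)
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "example.com",
                    "Type": "A",
                    "AliasTarget": {
                        "HostedZoneId": "Z35SXDOTRQ7X7K",  # ALB zone ID for the region (placeholder)
                        "DNSName": "mywebsite-1234567.elb.amazonaws.com",
                        "EvaluateTargetHealth": False,
                    },
                },
            }
        ]
    },
)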

Step 6: Set Up VPC Peering (Optional)

  1. Create VPC Peering:
    • If you need to connect two VPCs (e.g., for inter-VPC communication), create a VPC Peering Connection.
    • Update the route tables in both VPCs to ensure traffic can flow between the peered VPCs.
  2. Configure Routes:
    • In both VPCs, add routes to the route tables that allow traffic to flow between the VPCs via the peering connection.

With the use of public and private subnets, Auto Scaling Groups, Application Load Balancers, and VPC Endpoints, we can build a resilient infrastructure. Integrating Route 53 for custom domain management and VPC Peering for inter-VPC communication completes the solution for a fully managed, secure web application architecture on AWS.

Managing EKS Clusters Using AWS Lambda: A Step-by-Step Approach

By: Ragul.M
20 December 2024 at 12:20

Efficiently managing Amazon Elastic Kubernetes Service (EKS) clusters is critical for maintaining cost-effectiveness and performance. Automating the process of starting and stopping EKS clusters using AWS Lambda ensures optimal utilization and reduces manual intervention. Below is a structured approach to achieve this.

1. Define the Requirements

  • Identify the clusters that need automated start/stop operations.
  • Determine the dependencies among clusters, if any, to ensure smooth transitions.
  • Establish the scaling logic, such as leveraging tags to specify operational states (e.g., auto-start, auto-stop).

2. Prepare the Environment

  • AWS CLI Configuration: Ensure the AWS CLI is set up with appropriate credentials and access.
  • IAM Role for Lambda:
    • Create a role with permissions to manage EKS clusters (eks:DescribeCluster, eks:UpdateNodegroupConfig, etc.).
    • Include logging permissions for CloudWatch Logs to monitor the Lambda function execution.

3. Tag EKS Clusters

  • Use resource tagging to identify clusters for automation.
  • Example tags:
    • auto-start=true: Indicates clusters that should be started by the Lambda function.
    • dependency=<cluster-name>: Specifies any inter-cluster dependencies.

4. Design the Lambda Function

  • Trigger Setup:
    • Use CloudWatch Events or schedule triggers (e.g., daily or weekly) to invoke the function.
  • Environment Variables: Configure the function with environment variables for managing cluster names and dependency details.
  • Scaling Configuration: Ensure the function dynamically retrieves scaling logic via tags to handle operational states.

5. Define the Workflow

  • Fetch Cluster Information: Use AWS APIs to retrieve cluster details, including their tags and states.
  • Check Dependencies:
    • Identify dependent clusters and validate their status before initiating operations on others.
  • Start/Stop Clusters:
    • Update node group configurations or use cluster-level start/stop APIs where supported.
  • Implement Logging and Alerts: Capture the execution details and errors in CloudWatch Logs.

(If you want my code, just comment "ease-py-code" on my blog and I will share it with you 🫶)
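
As a rough illustration of this workflow only, a minimal boto3 sketch could look like the following; the tag keys, scaling sizes, and the action passed in by the schedule are assumptions, and note that EKS managed node groups require maxSize to be at least 1, so "stopping" here means scaling to zero nodes.

import boto3

eks = boto3.client("eks")

def scale_cluster_nodegroups(cluster_name, min_size, max_size, desired):
    """Scale every managed node group in a cluster to the given sizes."""
    for ng in eks.list_nodegroups(clusterName=cluster_name)["nodegroups"]:
        eks.update_nodegroup_config(
            clusterName=cluster_name,
            nodegroupName=ng,
            scalingConfig={"minSize": min_size, "maxSize": max_size, "desiredSize": desired},
        )

def lambda_handler(event, context):
    # The CloudWatch/EventBridge schedule passes the desired action.
    action = event.get("action", "stop")
    for cluster_name in eks.list_clusters()["clusters"]:
        tags = eks.describe_cluster(name=cluster_name)["cluster"].get("tags", {})
        if action == "stop" and tags.get("auto-stop") == "true":
            scale_cluster_nodegroups(cluster_name, min_size=0, max_size=1, desired=0)
        elif action == "start" and tags.get("auto-start") == "true":
            scale_cluster_nodegroups(cluster_name, min_size=1, max_size=3, desired=2)
    return {"status": "done", "action": action}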

6. Test and Validate

  • Dry Runs: Perform simulations to ensure the function executes as expected without making actual changes.
  • Dependency Scenarios: Test different scenarios involving dependencies to validate the logic.
  • Error Handling: Verify retries and exception handling for potential API failures.

7. Deploy and Monitor

  • Deploy the Function: Once validated, deploy the Lambda function in the desired region.
  • Set Up Monitoring:
    • Use CloudWatch Metrics to monitor function executions and errors.
    • Configure alarms for failure scenarios to take corrective actions.

By automating the start and stop operations for EKS clusters, organizations can significantly enhance resource management and optimize costs. This approach provides scalability and ensures that inter-cluster dependencies are handled efficiently.

Follow for more and happy learning :)

Automating RDS Snapshot Management for Daily Testing

18 December 2024 at 06:07

Creating a snapshot ensures you have a backup of the current RDS state. This snapshot can be used to restore the RDS instance later. 

Steps to Create a Snapshot via AWS Management Console: 

  1. Navigate to the RDS Dashboard
  2. Select the RDS instance you want to back up. 
  3. Click Actions > Take Snapshot
  4. Provide a name for the snapshot (e.g., rds-snapshot-test-date). 
  5. Click Take Snapshot

Automating Snapshot Creation with AWS CLI:

 

aws rds create-db-snapshot \
    --db-snapshot-identifier rds-snapshot-test-date \
    --db-instance-identifier your-rds-instance-id

Step 2: Use the RDS Instance for Testing 
Once the snapshot is created, continue using the RDS instance for your testing activities for the day. Ensure you document any changes made during testing, as these will not persist after restoring the instance from the snapshot. 

Step 3: Rename and Delete the RDS Instance 
At the end of the day, rename the existing RDS instance and delete it to avoid unnecessary costs. 

Steps to Rename the RDS Instance via AWS Management Console: 

  1. Navigate to the RDS Dashboard
  2. Select the RDS instance. 
  3. Click Actions > Modify
  4. Update the DB Instance Identifier (e.g., rds-instance-test-old). 
  5. Save the changes and wait for the instance to update. 

Steps to Delete the RDS Instance: 

  1. Select the renamed instance. 
  2. Click Actions > Delete
  3. Optionally, skip creating a final snapshot if you already have one. 
  4. Confirm the deletion. 

Automating Rename and Delete via AWS CLI:

 

# Rename the RDS instance (apply the new identifier immediately)
aws rds modify-db-instance \
    --db-instance-identifier your-rds-instance-id \
    --new-db-instance-identifier rds-instance-test-old \
    --apply-immediately

# Wait for the rename to complete before deleting
aws rds wait db-instance-available \
    --db-instance-identifier rds-instance-test-old

# Delete the RDS instance
aws rds delete-db-instance \
    --db-instance-identifier rds-instance-test-old \
    --skip-final-snapshot

Step 4: Restore the RDS Instance from the Snapshot 
Before starting the next day’s testing, restore the RDS instance from the snapshot created earlier. 

Steps to Restore an RDS Instance via AWS Management Console: 

  1. Navigate to the Snapshots section in the RDS Dashboard
  2. Select the snapshot you want to restore. 
  3. Click Actions > Restore Snapshot
  4. Provide a new identifier for the RDS instance (e.g., rds-instance-test). 
  5. Configure additional settings if needed and click Restore DB Instance

Automating Restore via AWS CLI:

 

aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier rds-instance-test \
    --db-snapshot-identifier rds-snapshot-test-date

Optional: Automate the Process with a Script 
To streamline these steps, you can use a script combining AWS CLI commands. Below is an example script:

 

#!/bin/bash

# Variables
RDS_INSTANCE_ID="your-rds-instance-id"
SNAPSHOT_ID="rds-snapshot-$(date +%F)"
RESTORED_RDS_INSTANCE_ID="rds-instance-test"

# Step 1: Create a Snapshot and wait for it to complete
echo "Creating snapshot..."
aws rds create-db-snapshot \
    --db-snapshot-identifier $SNAPSHOT_ID \
    --db-instance-identifier $RDS_INSTANCE_ID

aws rds wait db-snapshot-available \
    --db-snapshot-identifier $SNAPSHOT_ID

# Step 2: Rename and Delete RDS Instance
echo "Renaming and deleting RDS instance..."
aws rds modify-db-instance \
    --db-instance-identifier $RDS_INSTANCE_ID \
    --new-db-instance-identifier "${RDS_INSTANCE_ID}-old" \
    --apply-immediately

aws rds wait db-instance-available \
    --db-instance-identifier "${RDS_INSTANCE_ID}-old"

aws rds delete-db-instance \
    --db-instance-identifier "${RDS_INSTANCE_ID}-old" \
    --skip-final-snapshot

# Step 3: Restore RDS from Snapshot
echo "Restoring RDS instance from snapshot..."
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier $RESTORED_RDS_INSTANCE_ID \
    --db-snapshot-identifier $SNAPSHOT_ID

How to Create a Lambda Function to Export IAM Users to S3 as a CSV File

By: Ragul.M
16 December 2024 at 15:36

Managing AWS resources efficiently often requires automation. One common task is exporting a list of IAM users into a CSV file for auditing or reporting purposes. AWS Lambda is an excellent tool to achieve this, combined with the power of S3 for storage. Here's a step-by-step guide:

Step 1: Understand the Requirements
Before starting, ensure you have the following:

  • IAM permissions to list users (iam:ListUsers) and access S3 (s3:PutObject).
  • An existing S3 bucket to store the generated CSV file.
  • A basic understanding of AWS Lambda and its environment.

Step 2: Create an S3 Bucket

  1. Log in to the AWS Management Console.
  2. Navigate to S3 and create a new bucket or use an existing one.
  3. Note the bucket name for use in the Lambda function.

Step 3: Set Up a Lambda Function

  1. Go to the Lambda service in the AWS Console.
  2. Click on Create Function and choose the option to create a function from scratch.
  3. Configure the runtime environment (e.g., Python or Node.js).
  4. Assign an appropriate IAM role to the Lambda function with permissions for IAM and S3 operations. (If you want my code, just comment "ease-py-code" on my blog and I will share it with you 🫶)

Step 4: Implement Logic for IAM and S3

  • The Lambda function will:
    • Retrieve a list of IAM users using the AWS SDK.
    • Format the list into a CSV structure.
    • Upload the file to the specified S3 bucket.
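
A minimal, illustrative boto3 sketch of that logic might look like this; the bucket name and object key are placeholders.

import csv
import io
import boto3

BUCKET_NAME = "my-iam-reports-bucket"  # placeholder

iam = boto3.client("iam")
s3 = boto3.client("s3")

def lambda_handler(event, context):
    # 1. Retrieve all IAM users (the API is paginated).
    users = []
    for page in iam.get_paginator("list_users").paginate():
        users.extend(page["Users"])

    # 2. Format the list as CSV in memory.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["UserName", "UserId", "Arn", "CreateDate"])
    for user in users:
        writer.writerow(
            [user["UserName"], user["UserId"], user["Arn"], user["CreateDate"].isoformat()]
        )

    # 3. Upload the CSV to the S3 bucket.
    s3.put_object(Bucket=BUCKET_NAME, Key="iam-users.csv", Body=buffer.getvalue())
    return {"user_count": len(users)}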

Step 5: Test the Function

  1. Use the AWS Lambda testing tools to trigger the function.
  2. Verify that the CSV file is successfully uploaded to the S3 bucket.

Step 6: Monitor and Review

  • Check the S3 bucket for the uploaded CSV files.
  • Review the Lambda logs in CloudWatch to ensure the function runs successfully.

By following these steps, you can automate the task of exporting IAM user information into a CSV file and store it securely in S3, making it easier to track and manage your AWS users.

Follow for more and happy learning :)

Automating AWS Cost Management Reports with Lambda

By: Ragul.M
11 December 2024 at 16:08

Monitoring AWS costs is essential for keeping budgets in check. In this guide, we’ll walk through creating an AWS Lambda function to retrieve cost details and send them to email (via SES) and Slack.
Prerequisites

  1. AWS Account with IAM permissions for Lambda, SES, and Cost Explorer.
  2. Slack Webhook URL to send messages.
  3. Configured SES Email for notifications.
  4. S3 Bucket for storing cost reports as CSV files.

Step 1: Enable Cost Explorer

  • Go to AWS Billing Dashboard > Cost Explorer.
  • Enable Cost Explorer to access detailed cost data.

Step 2: Create an S3 Bucket

  • Create an S3 bucket (e.g., aws-cost-reports) to store cost reports.
  • Ensure the bucket has appropriate read/write permissions for Lambda.

Step 3: Write the Lambda Code
  1. Create a Lambda Function
    • Go to AWS Lambda > Create Function.
    • Select a Python runtime (e.g., Python 3.9).
  2. Add Dependencies
    • Use a Lambda layer or package libraries like boto3 and slack_sdk.
  3. Write your Python code and execute it. (If you want my code, just comment "ease-py-code" on my blog and I will share it with you 🫶)

Step 4: Add S3 Permissions
Update the Lambda execution role to allow s3:PutObject, ses:SendEmail, and ce:GetCostAndUsage.
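
Purely as an illustrative sketch, a trimmed-down handler combining Cost Explorer, SES, and a Slack webhook might look like this; the date range, e-mail addresses, and webhook URL are placeholders, and the S3 upload from the prerequisites is omitted for brevity.

import json
import urllib.request
from datetime import date, timedelta

import boto3

def lambda_handler(event, context):
    ce = boto3.client("ce")
    ses = boto3.client("ses")

    end = date.today()
    start = end - timedelta(days=7)

    # 1. Pull last week's cost, grouped by service.
    result = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )

    lines = []
    for day in result["ResultsByTime"]:
        for group in day["Groups"]:
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            lines.append(f'{day["TimePeriod"]["Start"]} {group["Keys"][0]}: ${amount:.2f}')
    report = "\n".join(lines) or "No cost data returned."

    # 2. Email the report via SES (addresses are placeholders).
    ses.send_email(
        Source="reports@example.com",
        Destination={"ToAddresses": ["team@example.com"]},
        Message={"Subject": {"Data": "AWS cost report"}, "Body": {"Text": {"Data": report}}},
    )

    # 3. Post the same report to Slack via an incoming webhook (placeholder URL).
    payload = json.dumps({"text": report}).encode("utf-8")
    req = urllib.request.Request(
        "https://hooks.slack.com/services/XXX/YYY/ZZZ",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

    return {"days": len(result["ResultsByTime"])}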

Step 5: Test the Lambda
  1. Trigger the Lambda manually using a test event.
  2. Verify the cost report is:
    • Uploaded to the S3 bucket.
    • Emailed via SES.
    • Posted to Slack.

Conclusion
With this setup, AWS cost reports are automatically delivered to your inbox and Slack, keeping you updated on spending trends. Fine-tune this solution by customizing the report frequency or grouping costs by other dimensions.

Follow for more and happy learning :)

Exploring Kubernetes: A Step Ahead of Basics

By: Ragul.M
10 December 2024 at 05:35

Kubernetes is a powerful platform that simplifies the management of containerized applications. If you’re familiar with the fundamentals, it’s time to take a step further and explore intermediate concepts that enhance your ability to manage and optimize Kubernetes clusters.

  1. Understanding Deployments
    A Deployment ensures your application runs reliably by managing scaling, updates, and rollbacks.

  2. Using ConfigMaps and Secrets
    Kubernetes separates application configuration and sensitive data from the application code using ConfigMaps and Secrets.

    ConfigMaps

    Store non-sensitive configurations, such as environment variables or application settings.

kubectl create configmap app-config --from-literal=ENV=production 

3. Liveness and Readiness Probes

Probes ensure your application is healthy and ready to handle traffic.

Liveness Probe
Checks if your application is running. If it fails, Kubernetes restarts the pod.

Readiness Probe
Checks if your application is ready to accept traffic. If it fails, Kubernetes stops routing requests to the pod.

4. Resource Requests and Limits
To ensure efficient resource utilization, define requests (minimum resources a pod needs) and limits (maximum resources a pod can use).

5. Horizontal Pod Autoscaling (HPA)
Scale your application dynamically based on CPU or memory usage.
Example:

kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10 

This ensures your application scales automatically when resource usage increases or decreases.

6. Network Policies
Control how pods communicate with each other and external resources using Network Policies.

Conclusion
Kubernetes has revolutionized the way we manage containerized applications. By automating tasks like deployment, scaling, and maintenance, it allows developers and organizations to focus on innovation. Whether you're a beginner or a seasoned developer, mastering Kubernetes is a skill that will enhance your ability to build and manage modern applications.

By mastering these slightly advanced Kubernetes concepts, you’ll improve your cluster management, application reliability, and resource utilization. With this knowledge, you’re well-prepared to dive into more advanced topics like Helm, monitoring with Prometheus, and service meshes like Istio.

Follow for more and Happy learning :)

Dynamic Scaling with AWS Auto Scaling Groups via Console

9 December 2024 at 06:00

In this post, we will configure an Auto Scaling Group (ASG) using the AWS Management Console. Auto Scaling Groups are an essential feature of AWS, allowing you to dynamically scale your EC2 instances based on workload demand. We'll walk through creating an ASG, configuring scaling policies, and testing the setup.

Introduction to Auto Scaling Groups

An Auto Scaling Group (ASG) ensures your application has the right number of EC2 instances running at all times. You can define scaling policies based on CloudWatch metrics, such as CPU utilization, to automatically add or remove instances. This provides cost-efficiency and ensures consistent performance. Auto Scaling Groups dynamically adjust EC2 instances based on workload.

Steps to Create an Auto Scaling Group Using the AWS Console

Step 1: Create a Launch Template

  1. Log in to the AWS Management Console and navigate to the EC2 Dashboard.
  2. Create a Launch Template:
    • Go to Launch Templates and click Create Launch Template.
    • Provide a Name and Description.
    • Specify the AMI ID (Amazon Machine Image) for the operating system. For example, use an Ubuntu AMI.
    • Select the Instance Type (e.g., t2.micro).
    • Add your Key Pair for SSH access.
    • Configure Network Settings (use the default VPC and a Subnet).
    • Leave other settings as default and save the Launch Template.
    • Launch Templates simplify EC2 instance configurations for ASG.

Step 2: Create an Auto Scaling Group

  1. Navigate to Auto Scaling Groups under the EC2 Dashboard.
  2. Click "Create Auto Scaling Group".
  3. Select Launch Template: Choose the Launch Template created in Step 1.
  4. Configure Group Size and Scaling Policies:
    • Specify the Minimum size (e.g., 1), Maximum size (e.g., 3), and Desired Capacity (e.g., 1).
    • Set scaling policies to increase or decrease capacity automatically.
  5. Choose Subnets:
    • Select the Subnets from your VPC where the EC2 instances will run.
    • Ensure these Subnets are public if instances need internet access.
  6. Health Checks:
    • Use EC2 health checks to automatically replace unhealthy instances.
    • Set a Health Check Grace Period (e.g., 300 seconds).
  7. Review and Create:
    • Review the settings and click Create Auto Scaling Group.
  8. Dynamic Scaling Policies allow automated scaling based on CloudWatch metrics like CPU utilization.

Step 3: Set Up Scaling Policies

  1. In the ASG configuration, choose Dynamic Scaling Policies.
  2. Add a policy to scale out:
    • Set the policy to add 1 instance when CPU utilization exceeds 70%.
  3. Add a policy to scale in:
    • Set the policy to remove 1 instance when CPU utilization falls below 30%.

Stress Testing the Auto Scaling Group

To test the Auto Scaling Group, you can simulate high CPU usage on one of the instances. This will trigger the scaling policy and add more instances. Stress testing helps verify that scaling policies are working as expected.

  1. Connect to an Instance: Use your private key to SSH into the instance.

   ssh -i "your-key.pem" ubuntu@<Instance-IP>

  2. Install Stress Tool: Update the system and install the stress tool.

   sudo apt update
   sudo apt install stress

  3. Run Stress Test: Simulate high CPU utilization to trigger the scale-out policy.

   stress --cpu 8 --timeout 600

  4. Monitor Scaling:
    • Go to the Auto Scaling Groups dashboard in the AWS Console.
    • Check the Activity tab to observe if new instances are being launched.

My Output

(Screenshots omitted.)

Configuring Auto Scaling Groups using the AWS Management Console is a straightforward process that enables dynamic scaling of EC2 instances. By following these steps, we can ensure your application is resilient, cost-efficient, and capable of handling varying workloads.

Accessing Multiple Instances via Load Balancer in AWS

9 December 2024 at 05:49

When deploying scalable applications, distributing traffic efficiently across multiple instances is crucial for performance, fault tolerance, and reliability. AWS provides Elastic Load Balancing (ELB) to simplify this process. Here, we'll explore the concept of load balancers, target groups, security groups, and subnets, along with a step-by-step process for setting up an Application Load Balancer (ALB) to access multiple instances.

Load Balancer:

A Load Balancer is a service that distributes incoming application traffic across multiple targets (e.g., EC2 instances) in one or more availability zones. It improves the availability and fault tolerance of your application by ensuring no single instance is overwhelmed by traffic.
AWS supports three types of load balancers:

  1. Application Load Balancer (ALB): Works at Layer 7 (HTTP/HTTPS) and is ideal for web applications.
  2. Network Load Balancer (NLB): Operates at Layer 4 (TCP/UDP) for ultra-low latency.
  3. Gateway Load Balancer (GWLB): Works at Layer 3 (IP) for distributing traffic to virtual appliances.

1. Target Groups

  • Target Groups are collections of targets (e.g., EC2 instances, IPs) that receive traffic from a load balancer.
  • You can define health checks for targets to ensure traffic is routed only to healthy instances. Target groups organize and monitor targets (EC2 instances).

2. Security Groups

  • Security Groups act as virtual firewalls for your instances and load balancers.
  • For the load balancer, inbound rules allow traffic on ports like 80 (HTTP) or 443 (HTTPS).
  • For the instances, inbound rules allow traffic only from the load balancer's IP or security group.
  • They protect resources by restricting traffic based on rules.

3. Subnets

  • Subnets are segments of a VPC that isolate resources.
  • Load balancers require at least two public subnets in different availability zones for high availability.
  • EC2 instances are usually deployed in private subnets, accessible only through the load balancer.
  • They isolate resources: public subnets for load balancers and private subnets for instances.

Steps to Set Up a Load Balancer for Multiple Instances

Step 1: Launch EC2 Instances

  1. Create Two or More EC2 Instances:
    • Use the AWS Management Console to launch multiple EC2 instances in a VPC.
    • Place them in private subnets across two different availability zones.
  2. Configure Security Groups for Instances:
    • Allow traffic only from the load balancer's security group on port 80 (HTTP) or 443 (HTTPS).

Step 2: Create a Target Group

  1. Navigate to Target Groups in the EC2 section of the console.
  2. Click Create Target Group and choose Instances as the target type.
  3. Provide the following configurations:
    • Protocol: HTTP or HTTPS
    • VPC: Select the same VPC as the EC2 instances.
    • Health Check Settings: Configure health checks (e.g., Path: / and Port: 80).
  4. Register the EC2 instances as targets in this group.

Step 3: Set Up a Load Balancer
Application Load Balancer Configuration:

  1. Go to the Load Balancers section of the EC2 console.
  2. Click Create Load Balancer and choose Application Load Balancer.
  3. Configure the following:
    • Name: Provide a unique name for the load balancer.
    • Scheme: Select Internet-facing for public access.
    • Listeners: Use port 80 or 443 (for HTTPS).
    • Availability Zones: Select public subnets from at least two availability zones.

Step 4: Attach Target Group to the Load Balancer

  1. In the Listener and Rules section, forward traffic to the target group created earlier.
  2. Save and create the load balancer.
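
If you prefer to script Steps 2 through 4 instead of clicking through the console, a hedged boto3 sketch might look like the following; every ID here is a placeholder.

import boto3

elbv2 = boto3.client("elbv2")

# 1. Target group in the same VPC as the instances.
tg = elbv2.create_target_group(
    Name="web-targets",
    Protocol="HTTP",
    Port=80,
    VpcId="vpc-0123456789abcdef0",
    TargetType="instance",
    HealthCheckPath="/",
)
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]

# 2. Register the EC2 instances with the target group.
elbv2.register_targets(
    TargetGroupArn=tg_arn,
    Targets=[{"Id": "i-0aaaaaaaaaaaaaaaa"}, {"Id": "i-0bbbbbbbbbbbbbbbb"}],
)

# 3. Internet-facing ALB across two public subnets.
lb = elbv2.create_load_balancer(
    Name="web-alb",
    Subnets=["subnet-0aaaaaaaaaaaaaaaa", "subnet-0bbbbbbbbbbbbbbbb"],
    SecurityGroups=["sg-0123456789abcdef0"],
    Scheme="internet-facing",
    Type="application",
)
lb_arn = lb["LoadBalancers"][0]["LoadBalancerArn"]

# 4. Listener on port 80 forwarding to the target group.
elbv2.create_listener(
    LoadBalancerArn=lb_arn,
    Protocol="HTTP",
    Port=80,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
)
print(lb["LoadBalancers"][0]["DNSName"])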

Step 5: Update Security Groups

  1. For the Load Balancer:
    • Allow inbound traffic on port 80 or 443 (if HTTPS).
    • Allow inbound traffic from all IPs (or restrict by source).
  2. For EC2 Instances:
    • Allow inbound traffic from the load balancer's security group.

Step 6: Test the Setup

  1. Get the DNS name of the load balancer from the AWS console.
  2. Access the DNS name in your browser to verify traffic is being distributed to your instances.

Step 7: Scaling with Auto Scaling Groups
Attach an Auto Scaling Group (ASG) to the target group for dynamic scaling based on traffic demand.

To access multiple EC2 instances via a load balancer in AWS, you first deploy your EC2 instances within a Virtual Private Cloud (VPC), ensuring they are in the same target network. Install and configure your desired application (e.g., a web server like Apache) on these instances. Then, create an Application Load Balancer (ALB) or Network Load Balancer (NLB) to distribute traffic. Associate the load balancer with a Target Group that includes your EC2 instances and their ports. Next, configure the load balancer's listener to route incoming traffic (e.g., HTTP or HTTPS) to the Target Group. To make the setup accessible via a domain name, map your load balancer's DNS to a custom domain using Route 53. This ensures users can access your application by visiting the domain, with the load balancer evenly distributing traffic among the EC2 instances for high availability and scalability.

My output:

(Screenshots omitted.)

Understanding Kubernetes Basics: A Beginner’s Guide

By: Ragul.M
29 November 2024 at 17:12

In today’s tech-driven world, Kubernetes has emerged as one of the most powerful tools for container orchestration. Whether you’re managing a few containers or thousands of them, Kubernetes simplifies the process, ensuring high availability, scalability, and efficient resource utilization. This blog will guide you through the basics of Kubernetes, helping you understand its core components and functionality.

What is Kubernetes?
Kubernetes, often abbreviated as K8s, is an open-source platform developed by Google that automates the deployment, scaling, and management of containerized applications. It was later donated to the Cloud Native Computing Foundation (CNCF).
With Kubernetes, developers can focus on building applications, while Kubernetes takes care of managing their deployment and runtime.

Key Features of Kubernetes

  1. Automated Deployment and Scaling Kubernetes automates the deployment of containers and can scale them up or down based on demand.
  2. Self-Healing If a container fails, Kubernetes replaces it automatically, ensuring minimal downtime.
  3. Load Balancing Distributes traffic evenly across containers, optimizing performance and preventing overload.
  4. Rollbacks and Updates Kubernetes manages seamless updates and rollbacks for your applications without disrupting service.
  5. Resource Management Optimizes hardware utilization by efficiently scheduling containers across the cluster.

Core Components of Kubernetes
To understand Kubernetes, let’s break it down into its core components:

  1. Cluster: A Kubernetes cluster consists of:
    • Master Node: The control plane managing the entire cluster.
    • Worker Nodes: Machines where containers run.
  2. Pods: The smallest deployable unit in Kubernetes. A pod can contain one or more containers that share resources like storage and networking.
  3. Nodes: Physical or virtual machines that run the pods. Managed by the Kubelet, a process ensuring pods are running as expected.
  4. Services: Allow communication between pods and other resources, both inside and outside the cluster. Examples include ClusterIP, NodePort, and LoadBalancer services.
  5. ConfigMaps and Secrets: ConfigMaps store configuration data for your applications. Secrets store sensitive data like passwords and tokens securely.
  6. Namespaces: Virtual clusters within a Kubernetes cluster, used for organizing and isolating resources.

Conclusion
Kubernetes has revolutionized the way we manage containerized applications. By automating tasks like deployment, scaling, and maintenance, it allows developers and organizations to focus on innovation. Whether you're a beginner or a seasoned developer, mastering Kubernetes is a skill that will enhance your ability to build and manage modern applications.

Follow for more and Happy learning :)

Deep Dive into AWS

By: Ragul.M
26 November 2024 at 13:50

Hi folks , welcome to my blog. Here we are going to see about some interesting deep topics in AWS.

What is AWS?

AWS is a subsidiary of Amazon that offers on-demand cloud computing services. These services eliminate the need for physical infrastructure, allowing businesses to rent computing power, storage, and other resources as needed. AWS operates on a pay-as-you-go model, which means you only pay for what you use.

Deep Dive: Serverless Architecture

One of AWS’s most revolutionary offerings is serverless computing. Traditional servers are replaced with fully managed services, allowing developers to focus solely on writing code.

Key Components of Serverless Architecture:

  • AWS Lambda: Automatically scales based on the number of requests. Ideal for microservices and event-driven workflows.
  • API Gateway: Connects client applications with backend services using APIs.
  • DynamoDB: High-performance NoSQL database for low-latency reads and writes.
  • EventBridge: Orchestrates serverless workflows using event-driven triggers.

Example Use Case: Build a RESTful API without managing servers. Combine Lambda for compute, DynamoDB for storage, and API Gateway for routing.
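
To make that use case slightly more concrete, here is a hedged sketch of a Lambda handler sitting behind an API Gateway proxy integration and writing to DynamoDB; the table name and payload fields are assumptions.

import json
import uuid

import boto3

# Table name is an assumption for this sketch.
table = boto3.resource("dynamodb").Table("items")

def lambda_handler(event, context):
    # With proxy integration, API Gateway delivers the request body as a JSON string.
    body = json.loads(event.get("body") or "{}")

    item = {
        "id": str(uuid.uuid4()),
        "name": body.get("name", "unnamed"),
    }
    table.put_item(Item=item)

    return {
        "statusCode": 201,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(item),
    }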

Advanced Concepts in AWS

1. Elasticity and Auto-Scaling

AWS Auto Scaling monitors your application and adjusts capacity automatically to maintain performance. For example, if traffic spikes, AWS can add more EC2 instances or scale down when traffic reduces.

2. Hybrid Cloud and Outposts

Hybrid cloud models integrate on-premises infrastructure with AWS. AWS Outposts allow you to run AWS services on your own hardware, enabling low-latency solutions for specific industries.

3. High Availability and Disaster Recovery

AWS provides tools like:

  • Route 53 for DNS failover.
  • Cross-Region Replication for S3.
  • AWS Backup for centralized backup management across multiple services.

4. Monitoring and Logging

  • CloudWatch: Collect and monitor logs, metrics, and events.
  • CloudTrail: Records all API calls for auditing purposes.
  • AWS Config: Tracks changes to your resources for compliance.

Conclusion

AWS empowers organizations to innovate faster by providing scalable, secure, and cost-effective solutions. Whether you’re deploying a simple static website or a complex AI-powered application, AWS has the tools to support your goals. By leveraging its services and following best practices, you can build resilient and future-ready applications.

Follow for more and happy learning :)

Linux Basic Commands III

22 November 2024 at 11:17

Process Management Commands:

ps - Displays running processes.
  -aux: Show all processes.
top - Monitors system processes in real time. It displays a dynamic view of system processes and their resource usage.
kill - Terminates a process.
  -9: Forcefully kill a process.
  kill PID - Terminates the process with the specified process ID.
pkill - Terminates processes based on their name.
  pkill <name> - Terminates all processes with the specified name.
pgrep - Lists processes based on their name.
grep - Searches for specific patterns or regular expressions in text files or streams and displays matching lines.
-i: Ignore case distinctions while searching.
-v: Invert the match, displaying non-matching lines.
-r or -R: Recursively search directories for matching patterns.
-l: Print only the names of files containing matches.
-n: Display line numbers alongside matching lines.
-w: Match whole words only, rather than partial matches.
-c: Count the number of matching lines instead of displaying them.
-e: Specify multiple patterns to search for.
-A: Display lines after the matching line.
-B: Display lines before the matching line.
-C: Display lines both before and after the matching line.

Linux Basic Commands II

21 November 2024 at 14:45

File Permission Commands:

Chmod - Change file permissions.

  • u: User/owner permissions.
  • g: Group permissions.
  • o: Other permissions.
  • +: Add permissions.
  • –: Remove permissions.
  • =: Set permissions explicitly.

Chown - Change file ownership.

Chgrp - Change group ownership.

File Compression and Archiving Commands:

Tar - Create or extract archive files.

  • -c: Create a new archive.
  • -x: Extract files from an archive.
  • -f: Specify the archive file name.
  • -v: Verbose mode.
  • -z: Compress the archive with gzip.
  • -j: Compress the archive with bzip2.

Gzip - Compress files.

  • -d: Decompress files.

Zip - Create compressed zip archives.

  • -r: Recursively include directories.

Introduction to AWS

By: Ragul.M
20 November 2024 at 16:13

Hi folks , welcome to my blog. Here we are going to see about "Introduction to AWS".

Amazon Web Services (AWS) is the world’s leading cloud computing platform, offering a wide range of services to help businesses scale and innovate. Whether you're building an application, hosting a website, or storing data, AWS provides reliable and cost-effective solutions for individuals and organizations of all sizes.

What is AWS?
AWS is a comprehensive cloud computing platform provided by Amazon. It offers on-demand resources such as compute power, storage, networking, and databases on a pay-as-you-go basis. This eliminates the need for businesses to invest in and maintain physical servers.

Core Benefits of AWS

  1. Scalability: AWS allows you to scale your resources up or down based on your needs.
  2. Cost-Effective: With its pay-as-you-go pricing, you only pay for what you use.
  3. Global Availability: AWS has data centers worldwide, ensuring low latency and high availability.
  4. Security: AWS follows a shared responsibility model, offering top-notch security features like encryption and access control.
  5. Flexibility: Supports multiple programming languages, operating systems, and architectures.

Key AWS Services
Here are some of the most widely used AWS services:

  1. Compute:
    • Amazon EC2: Virtual servers to run your applications.
    • AWS Lambda: Serverless computing to run code without managing servers.
  2. Storage:
    • Amazon S3: Object storage for data backup and distribution.
    • Amazon EBS: Block storage for EC2 instances.
  3. Database:
    • Amazon RDS: Managed relational databases like MySQL, PostgreSQL, and Oracle.
    • Amazon DynamoDB: NoSQL database for high-performance applications.
  4. Networking:
    • Amazon VPC: Create isolated networks in the cloud.
    • Amazon Route 53: Domain name system (DNS) and traffic management.
  5. AI/ML:
    • Amazon SageMaker: Build, train, and deploy machine learning models.
  6. DevOps Tools:
    • AWS CodePipeline: Automates the release process.
    • Amazon EKS: Managed Kubernetes service.

Conclusion
AWS has revolutionized the way businesses leverage technology by providing scalable, secure, and flexible cloud solutions. Whether you're a developer, an enterprise, or an enthusiast, understanding AWS basics is the first step toward mastering the cloud. Start your AWS journey today and unlock endless possibilities!

Follow for more and happy learning :)

Basic Linux Commands

15 November 2024 at 15:08
  1. pwd — When you first open the terminal, you are in the home directory of your user. To know which directory you are in, you can use the “pwd” command. It gives us the absolute path, which means the path that starts from the root. The root is the base of the Linux file system and is denoted by a forward slash( / ). The user directory is usually something like “/home/username”.

  2. ls — Use the “ls” command to know what files are in the directory you are in. You can see all the hidden files by using the command “ls -a”.

  3. cd — Use the “cd” command to go to a directory. “cd” expects directory name or path of new directory as input.

  4. mkdir & rmdir — Use the mkdir command when you need to create a folder or a directory. Use rmdir to delete a directory. But rmdir can only be used to delete an empty directory. To delete a directory containing files, use rm.

  5. rm – Use the rm command to delete a file. Use “rm -r” to recursively delete all files within a specific directory.

  6. touch — The touch command is used to create an empty file. For example, “touch new.txt”.

  7. cp — Use the cp command to copy files through the command line.

  8. mv — Use the mv command to move files through the command line. We can also use the mv command to rename a file.

  9. cat — Use the cat command to display the contents of a file. It is usually used to easily view programs.

  10. vi — You can create a new file or modify a file using this editor.

Basic Linux Commands

By: Ragul.M
15 November 2024 at 14:25

Hi folks , welcome to my blog. Here we are going to see some basic and important commands of linux.

One of the most distinctive features of Linux is its command-line interface (CLI). Knowing a few basic commands can unlock many possibilities in Linux.
Essential Commands
Here are some fundamental commands to get you started:
ls - Lists files and directories in the current directory.

ls

cd - Changes to a different directory.

cd /home/user/Documents

pwd - Prints the current working directory.

pwd

cp - Copies files or directories.

cp file1.txt /home/user/backup/

mv - Moves or renames files or directories.

mv file1.txt file2.txt

rm - Removes files or directories.

rm file1.txt

mkdir - Creates a new directory.

mkdir new_folder

touch - Creates a new empty file.

touch newfile.txt

cat - Displays the contents of a file.

cat file1.txt

nano or vim - Opens a file in the text editor.

nano file1.txt

chmod - Changes file permissions.

chmod 755 file1.txt

ps - Displays active processes.

ps

kill - Terminates a process.

kill [PID]

Each command is powerful on its own, and combining them enables you to manage your files and system effectively. We will cover more Linux basics and other interesting topics in upcoming posts.

Follow for more and happy learning :)

Linux basics for beginners

By: Ragul.M
14 November 2024 at 16:04

Introduction:
Linux is one of the most powerful and widely-used operating systems in the world, found everywhere from mobile devices to high-powered servers. Known for its stability, security, and open-source nature, Linux is an essential skill for anyone interested in IT, programming, or system administration.
In this blog, we are going to see what Linux is and why to choose it.

1) What is Linux?
Linux is an open-source operating system that was first introduced by Linus Torvalds in 1991. Built on a Unix-based foundation, Linux is community-driven, meaning anyone can view, modify, and contribute to its code. This collaborative approach has led to the creation of various Linux distributions, or "distros," each tailored to different types of users and use cases. Some of the most popular Linux distributions are:

  • Ubuntu: Known for its user-friendly interface, great for beginners.
  • Fedora: A cutting-edge distro with the latest software versions, popular with developers.
  • CentOS: Stable and widely used in enterprise environments.

Each distribution may look and function slightly differently, but they all share the same core Linux features.

2) Why choose Linux?
Linux is favored for many reasons, including its:

  1. Stability: Linux is well-known for running smoothly without crashing, even in demanding environments.
  2. Security: Its open-source nature allows the community to detect and fix vulnerabilities quickly, making it highly secure.
  3. Customizability: Users have complete control to modify and customize their system.
  4. Performance: Linux is efficient, allowing it to run on a wide range of devices, from servers to small IoT devices.

Conclusion
Learning Linux basics is the first step to becoming proficient in an operating system that powers much of the digital world. We will see more basics and interesting things about Linux in upcoming blogs that I will be posting.

Follow for more and happy learning :)

An Introduction to Tokenizers in Natural Language Processing

25 September 2024 at 00:00

Tokenizers

*Co-authored by Tamil Arasan, Selvakumar Murugan and Malaikannan Sankarasubbu*

In Natural Language Processing (NLP), one of the foundational steps is transforming human language into a format that computational models can understand. This is where tokenizers come into play. Tokenizers are specialized tools that break down text into smaller units called tokens, and convert these tokens into numerical data that models can process.

Imagine you have the sentence:

Artificial intelligence is revolutionizing technology.

To a human, this sentence is clear and meaningful. But we do not understand the whole sentence in one shot (okay, maybe you did, but if I gave you a paragraph, or better yet an essay, you would not grasp it in one shot). Instead, we make sense of its parts, words and then phrases, and understand the whole sentence as a composition of the meanings of its parts. This is simply how language works, regardless of whether we are trying to make a machine mimic our language understanding or not, and it is independent of the fact that ML models, and computers in general, work with numbers. It is purely how language works, and there is no going around it.

ML models, like everything else we run on computers, can only work with numbers, so we need to transform the text into a number or a series of numbers (since we have more than one word). We have a lot of freedom in how we transform the text into numbers, and as always, with freedom comes complexity. At its core, though, tokenization is a two-step process: finding all the tokens and assigning a unique number, an ID, to each token.
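As a toy illustration of those two steps (using naive whitespace splitting, which, as we will see shortly, is not how real tokenizers segment text):

```python
sentence = "Artificial intelligence is revolutionizing technology ."

tokens = sentence.split()                                       # step 1: find the tokens
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}   # step 2: assign an ID to each token
ids = [vocab[tok] for tok in tokens]

print(tokens)   # ['Artificial', 'intelligence', 'is', 'revolutionizing', 'technology', '.']
print(ids)      # [1, 2, 3, 4, 5, 0]  ('.' sorts first, so it gets ID 0)
```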

There are many ways to segment a sentence or paragraph into pieces: phrases, words, sub-words or even individual characters. Understanding why a particular tokenization scheme is better requires a grasp of how embeddings work. If you're familiar with NLP, you'd ask, "Why? Tokenization comes before the embedding, right?" Yes, you're right, but NLP is paradoxical like that. Don't worry, we will cover that as we go.

Background

Before we venture any further, let's understand the difference between neural networks and our typical computer programs. We all know by now that in traditional computer programs we write/translate the rules into code by hand, whereas NNs learn the rules (the mapping between input and output) from data through a process called training. Unlike in normal programming, where we have a plethora of data structures that can store information in any shape or form we want, along with algorithms that jump back and forth through a set of instructions we call code, neural networks do not allow that kind of control flow. In a neural network, there is only one direction the "program" can run: left to right.

Unlike traditional programs, which we can feed input in complicated ways, neural networks accept input in only a fixed number of ways, usually as vectors (a fancy name for lists of numbers) of fixed size (or, more precisely, dimension). In most DNNs, input and output sizes are fixed regardless of the problem being solved. For example, in CNNs the input (usually an image) size and number of channels are fixed. In RNNs, the embedding dimension, input vocabulary size, number of output labels (classification problems, e.g. sentiment classification) and/or output vocabulary size (text generation problems, e.g. QA, translation) are all fixed. In Transformer networks, even the sentence length is fixed. This is not a bad thing; constraints like these enable the network to compress and capture the necessary information.

Also note that there are only a few tools to test "equality", "relevance" or "correctness" for things inside the network, because the only things that dwell inside the network are vectors. Cosine similarity and attention scores are popular. You can think of vectors as variables that keep track of state inside the neural network "program". But unlike in traditional programs, where you can declare variables as you like and print them for troubleshooting, in networks the vector-variables are meaningful only at the boundaries of the layers (not entirely true) within the network.

Let's take a look at the simplest example to understand why just pulling a vector from anywhere in the network will not be of any value to us. In the following code, three functions perform the identical calculation even though their code differs slightly. The intentionally named variables temp and growth_factor need not be created at all, as exemplified by the first function, which directly embodies the compound interest formula, $A = P(1+\frac{R}{100})^{T}$. Compared to temp, the variable growth_factor holds a more meaningful interpretation: it represents how much the money will grow due to compounding interest over time. For more complicated formulae and functions, we might create intermediate variables so that the code goes easy on the eye, but they have no significance to the operation of the function.

import math

def compound_interest_1(P,R,T):
    A = P * (math.pow((1 + (R/100)),T))
    CI = A - P
    return CI

def compound_interest_2(P,R,T):
    temp = (1 + (R/100))
    A = P * (math.pow(temp, T))
    CI = A - P
    return CI

def compound_interest_3(P,R,T):
    growth_factor = (math.pow((1 + (R/100)),T))
    A = P * growth_factor
    CI = A - P
    return CI

Another example, to illustrate from an operations perspective: clock arithmetic. Let's assign the numbers 0 through 6 to the weekdays, from Sunday to Saturday.

Table 1

| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
| --- | --- | --- | --- | --- | --- | --- |
| 0   | 1   | 2   | 3   | 4   | 5   | 6   |

John Conway suggested a mnemonic device for thinking of the days of the week as Noneday, Oneday, Twosday, Treblesday, Foursday, Fiveday, and Six-a-day.

So if you want to know what day it is 137 days from today, and today is, say, Thursday (i.e. 4), we can compute $(4+137) \bmod 7 = 1$, i.e. Monday. As you can see, adding numbers (days) in clock arithmetic yields a meaningful output: you can add days together to get another day. Okay, let's ask the question: can we multiply two days together? Is it in any way meaningful to multiply days? Just because we can multiply any two numbers mathematically, is it useful to do so in our clock arithmetic?
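In code, the day-of-week calculation above is just modular (clock) arithmetic:

```python
days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

today = 4                      # Thursday
later = (today + 137) % 7      # wrap around every 7 days
print(days[later])             # Mon
```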

All of this digression is to emphasize that the embedding is deemed to capture the meaning of words, and the vector from the last layer is deemed to capture the meaning of, say, a sentence. But when you take a vector from within the layers (just because you can), it does not refer to any meaningful unit such as a word, phrase or sentence as we understand it.

A little bit of history

If you're old enough, you might remember that before transformers became the standard paradigm in NLP, we had another one: EEAP (Embed, Encode, Attend, Predict). I am grossly oversimplifying here, but you can think of it as follows.

Embedding

Captures the meaning of words. A matrix of size $N \times D$, where

  • $N$ is the size of the vocabulary, i.e. the number of unique words in the language
  • $D$ is the dimension of the embedding, the size of the vector corresponding to each word.

Look up the word-vector (embedding) for each word

Encoding
Find the meaning of a sentence by using the meaning captured in the embeddings of the constituent words, with the help of RNNs like LSTM or GRU, or transformers like BERT and GPT, which take the embeddings and produce vector(s) for the whole sequence.
Prediction
Depending upon the task at hand, either assigns a label to the input sentence, or generates another sentence word by word.
Attention
Helps with Prediction by focusing on what is important right now, by drawing a probability distribution (normalized attention scores) over all the words. Words with a high score are deemed important.
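To make the Embedding step concrete: looking up a word vector is just indexing a row of an $N \times D$ matrix. A minimal sketch follows; the matrix here is random, whereas in a real model it is learned during training, and the word-to-ID mapping comes from the tokenizer.

```python
import numpy as np

N, D = 10_000, 300                                      # vocabulary size, embedding dimension
embedding = np.random.randn(N, D).astype(np.float32)    # the N x D embedding matrix

word_to_id = {"telephone": 42, "booth": 43}             # toy vocabulary mapping
vector = embedding[word_to_id["telephone"]]             # lookup: one row per word

print(vector.shape)   # (300,)
```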

As you can see above, $N$ is the vocabulary size, i.e. the number of unique words in the language. A handful of years ago, "language" usually meant the corpus at hand (on the order of a few thousand sentences), and datasets like CNN/DailyMail were considered huge. There were clever tricks like anonymizing named entities to force ML models to focus on language-specific features like grammar instead of open-world words such as names of places, presidents, corporations and countries. Good times they were! The point is, the corpus in your possession might not have all the words of the language. As we have seen, the size of the Embedding must be fixed before training the network. If by good fortune you stumble upon a new dataset, and hence new words, adding them to your model is not easy, because the Embedding needs to be extended to accommodate these new (OOV) words, and that requires retraining the whole network. OOV means Out Of the current model's Vocabulary. And this is why simply segmenting the text on empty spaces will not work.

With that background, let's dive in.

Tokenization

Tokenization is the process of segmenting text into individual pieces (usually words) so that an ML model can digest them. It is the very first step in any NLP system and influences everything that follows. To understand the impact of tokenization, we need to understand how embeddings and sentence length influence the model. From here on we will call sentence length sequence length, because a sentence is understood to be a sequence of words, and we will experiment with sequences of different things, not just words, which we will call tokens.

Tokens can be anything

  • Words - "telephone" "booth" "is" "nearby" "the" "post" "office"
  • Multiword Expressions (MWEs) - "telephone booth" "is" "nearby" "the" "post office"
  • Sub-words - "tele" "#phone" "booth" "is" "near " "#by" "the" "post" "office"
  • Characters - "t" "e" "l" "e" "p" ... "c" "e"

We know segmenting the text based on empty spaces will not work, because the vocabulary will keep growing. What about punctuation? Surely it will help with words like don't, won't, aren't, o'clock, Wendy's, co-operation{.verbatim}, etc., but the same reasoning applies here too. Moreover, segmenting at punctuation creates different problems, e.g. I.S.R.O > I, S, R, O{.verbatim}, which is not ideal.

Objectives of Tokenization

The primary objectives of tokenization are:

Handling OOV
Tokenizers should be able to segment the text into pieces such that any word in the language can be represented, whether or not it appears in the dataset, whether it is a word we might conjure in the foreseeable future, a technical or domain-specific term that scientists might utter to sound intelligent, or a word commonly used by everyone in day-to-day life. An ideal tokenizer should be able to deal with any and all of them.
Efficiency
Reducing the size (length) of the input text to make computation feasible and faster.
Meaningful Representation
Capturing the semantic essence of the text so that the model can learn effectively, which we will discuss a bit later.

Simple Tokenization Methods

Go through the code below, and see if you can make any inferences from the table produced. It reads the book The Republic, counts the tokens at the character, word and sentence levels, and also reports the number of unique tokens in the whole book.

Code

``` {.python results="output raw" exports="both"}
from collections import Counter
from nltk.tokenize import sent_tokenize

with open('plato.txt') as f:
    text = f.read()

words = text.split()
sentences = sent_tokenize(text)

char_counter = Counter()
word_counter = Counter()
sent_counter = Counter()

char_counter.update(text)
word_counter.update(words)
sent_counter.update(sentences)

print('#+name: Vocabulary Size')
print('|Type|Vocabulary Size|Sequence Length|')
print(f'|Unique Characters|{len(char_counter)}|{len(text)}')
print(f'|Unique Words|{len(word_counter)}|{len(words)}')
print(f'|Unique Sentences|{len(sent_counter)}|{len(sentences)}')
```


**Table 2**

| Type              | Vocabulary Size | Sequence Length |
| ----------------- | --------------- | --------------- |
| Unique Characters | 115             | 1,213,712       |
| Unique Words      | 20,710          | 219,318         |
| Unique Sentences  | 7,777           | 8,714           |



## Study

Character-Level Tokenization

:   In this most elementary method, text is broken down into individual
    characters.

    *"data"* > `"d" "a" "t" "a"`{.verbatim}

Word-Level Tokenization

:   This is the simplest and most used (before sub-word methods became
    popular) method of tokenization, where text is split into individual
    words based on spaces and punctuation. Still useful in some
    applications and as a pedagogical launch pad into other tokenization
    techniques.

    *"Machine learning models require data."* >
    `"Machine", "learning", "models", "require", "data", "."`{.verbatim}

Sentence-Level Tokenization

:   This approach segments text into sentences, which is useful for
    tasks like machine translation or text summarization. Sentence
    tokenization is not as popular as we'd like it to be.

    *"Tokenizers convert text. They are essential in NLP."* >
    `"Tokenizers convert text.", "They are essential in NLP."`{.verbatim}

n-gram Tokenization

:   Instead of using sentences as tokens, what if you could use
    phrases of fixed length? The following shows the n-grams for n=2,
    i.e. 2-gram or bigram. Yes, the `n`{.verbatim} in n-grams stands
    for how many words are chosen. n-grams can also be built from
    characters instead of words, though these are not as useful as
    word-level n-grams.

    *"Data science is fun"* >
    `"Data science", "science is", "is fun"`{.verbatim}.





**Table 3**

| Tokenization | Advantages                             | Disadvantages                                        |
| ------------ | -------------------------------------- | ---------------------------------------------------- |
| Character    | Minimal vocabulary size                | Very long token sequences                            |
|              | Handles any possible input             | Require huge amount of compute                       |
| Word         | Easy to implement and understand       | Large vocabulary size                                |
|              | Preserves meaning of words             | Cannot cover the whole language                      |
| Sentence     | Preserves the context within sentences | Less granular; may miss important word-level details |
|              | Sentence-level semantics               | Sentence boundary detection is challenging           |

As you can see from the table, vocabulary size and sequence length have an inverse correlation. Neural networks require that tokens appear in many places and many times; that is how networks understand words. Remember when you don't know the meaning of a word, you ask someone to use it in a few sentences? Same thing here: the more sentences a token appears in, the better the network can understand it. But in the case of sentence tokenization, you can see there are almost as many tokens in the vocabulary as there are in the tokenized corpus. It is safe to say that each token occurs only once, and that is not a healthy diet for a network. This problem occurs in word-level tokenization too, but it is subtler: the out-of-vocabulary (OOV) problem. To deal with OOV we need to stay between character-level and word-level tokens; enter >>> sub-words <<<.

# Advanced Tokenization Methods

Subword tokenization is an advanced technique that breaks text into units smaller than words. It helps in handling rare or unseen words by decomposing them into known subword units. The hope is that the sub-words extracted from the text can be recombined to form new, unseen words, and so act as the tokens for those unseen words. Common algorithms include Byte Pair Encoding (BPE), WordPiece and SentencePiece.

*"unhappiness"* > `"un", "happi", "ness"`{.verbatim}

BPE was originally a technique for data compression. It is repurposed to compress a text corpus by merging frequently occurring pairs of characters or subwords. Think of it as asking: what is the smallest set of unique tokens you need to recreate the whole book, when you are free to arrange those tokens in a line as many times as you want?

Algorithm

:   1.  *Initialization*: Start with a list of characters (the initial
        vocabulary) from the text (the whole corpus).
    2.  *Frequency Counting*: Count all pair occurrences of consecutive
        characters/subwords.
    3.  *Pair Merging*: Find the most frequent pair and merge it into a
        single new subword.
    4.  *Update Text*: Replace all occurrences of the pair in the text
        with the new subword.
    5.  *Repeat*: Continue the process until reaching the desired
        vocabulary size or merging no longer provides significant
        compression.
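To make the merge loop concrete, here is a minimal, unoptimized sketch of BPE training on a toy word list. Real implementations work over word frequencies from a corpus and usually add an end-of-word marker, both omitted here for brevity.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: returns the list of merges learned from the given words."""
    vocab = Counter(tuple(word) for word in words)    # 1. each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():           # 2. count adjacent symbol pairs
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # 3. most frequent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():           # 4. merge the pair everywhere
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab                             # 5. repeat
    return merges

print(bpe_train(["low", "lower", "lowest", "newest", "widest"], num_merges=5))
```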

Advantages

:   -   Reduces the vocabulary size significantly.
    -   Handles rare and complex words effectively.
    -   Balances between word-level and character-level tokenization.

Disadvantages

:   -   Tokens may not be meaningful standalone units.
    -   Slightly more complex to implement.

## Trained Tokenizers

WordPiece and SentencePiece are extensions of BPE where the vocabulary is not built merely by merging the most frequent pair. These variants evaluate whether a given merge was useful by measuring how much it maximizes the likelihood of the corpus. In simple words: take two vocabularies, one before and one after a merge, train two language models, and if the model trained on the post-merge vocabulary has lower perplexity (think loss), then we assume the merge was useful. We would need to repeat this for every candidate merge, which is not practical, and hence there are some mathematical tricks used to make this workable, which we will discuss in a future post.

The iterative merging process is the training of the tokenizer, and this training is separate from the training of the actual model. There are Python libraries for training your own tokenizer, but when you're planning to use a pretrained language model, it is better to stick with the pretrained tokenizer associated with that model. In the following sections we see how to train a simple BPE tokenizer and a SentencePiece tokenizer, and how to use the BERT tokenizer that comes with Hugging Face's `transformers`{.verbatim} library.

## Tokenization Techniques Used in Popular Language Models

### Byte Pair Encoding (BPE) in GPT Models

GPT models, such as GPT-2 and GPT-3, utilize Byte Pair Encoding (BPE)
for tokenization.

``` {.python results="output code" exports="both"}
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer =  Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
                     vocab_size=30000)
files = ["plato.txt"]

tokenizer.train(files, trainer)
tokenizer.model.save('.', 'bpe_tokenizer')

output = tokenizer.encode("Tokenization is essential first step for any NLP model.")
print("Tokens:", output.tokens)
print("Token IDs:", output.ids)
print("Length: ", len(output.ids))
```

``` python
Tokens: ['T', 'oken', 'ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'N', 'L', 'P', 'model', '.']
Token IDs: [50, 6436, 2897, 127, 3532, 399, 1697, 184, 256, 44, 42, 46, 3017, 15]
Length:  14
```

### SentencePiece in T5

T5 models use a Unigram Language Model for tokenization, implemented via the SentencePiece library. This approach treats tokenization as a probabilistic model over all possible tokenizations.

``` {.python results="output code" exports="both"}
import sentencepiece as spm

spm.SentencePieceTrainer.Train('--input=plato.txt --model_prefix=unigram_tokenizer --vocab_size=3000 --model_type=unigram')
```

``` {.python results="output code" exports="both"}
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("unigram_tokenizer.model")
text = "Tokenization is essential first step for any NLP model."
pieces = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)
print("Pieces:", pieces)
print("Piece IDs:", ids)
print("Length: ", len(ids))
```

``` python
Pieces: ['▁To', 'k', 'en', 'iz', 'ation', '▁is', '▁essential', '▁first', '▁step', '▁for', '▁any', '▁', 'N', 'L', 'P', '▁model', '.']
Piece IDs: [436, 191, 128, 931, 141, 11, 1945, 123, 962, 39, 65, 17, 499, 1054, 1441, 1925, 8]
Length:  17
```

### WordPiece Tokenization in BERT

``` {.python results="output code"}
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Tokenization is essential first step for any NLP model."
encoded_input = tokenizer(text, return_tensors='pt')

print("Tokens:", tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0]))
print("Token IDs:", encoded_input['input_ids'][0].tolist())
print("Length: ", len(encoded_input['input_ids'][0].tolist()))
```

Summary of Tokenization Methods

Table 4

| Method           | Length | Tokens |
| ---------------- | ------ | ------ |
| BPE              | 14     | ['T', 'oken', 'ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'N', 'L', 'P', 'model', '.'] |
| SentencePiece    | 17     | ['▁To', 'k', 'en', 'iz', 'ation', '▁is', '▁essential', '▁first', '▁step', '▁for', '▁any', '▁', 'N', 'L', 'P', '▁model', '.'] |
| WordPiece (BERT) | 12     | ['token', '##ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'nl', '##p', 'model', '.'] |

Different tokenization methods give different results for the same input sentence. As we add more data to the tokenizer training, the differences between WordPiece and SentencePiece might decrease, but they will not vanish, because of the difference in their training process.

Table 5

| Model | Tokenization Method    | Library       | Key Features                             |
| ----- | ---------------------- | ------------- | ---------------------------------------- |
| GPT   | Byte Pair Encoding     | tokenizers    | Balances vocabulary size and granularity |
| BERT  | WordPiece              | transformers  | Efficient vocabulary, handles morphology |
| T5    | Unigram Language Model | sentencepiece | Probabilistic, flexible across languages |

Tokenization and Non English Languages

Tokenizing text is complex, especially when dealing with diverse languages and scripts. Various challenges can impact the effectiveness of tokenization.

Tokenization Issues with Complex Languages: With a focus on Tamil

Tokenizing text in languages like Tamil presents unique challenges due to their linguistic and script characteristics. Understanding these challenges is essential for developing effective NLP applications that handle Tamil text accurately.

Challenges in Tokenizing Tamil Language

  1. Agglutinative Morphology

    Tamil is an agglutinative language, meaning it forms words by concatenating morphemes (roots, suffixes, prefixes) to convey grammatical relationships and meanings. A single word may express what would be a full sentence in English.

    Impact on Tokenization
    • Words can be very lengthy and contain many morphemes.
      • போகமுடியாதவர்களுக்காவேயேதான்
  2. Punarchi and Phonology

    Tamil has specific rules on how two words can be combined, and the resulting word may not be phonologically identical to its parts. These phonological transformations can cause problems for TTS/STT systems too.

    Impact on Tokenization
    • Surface forms of words may change when combined, making boundary detection challenging.
      • மரம் + வேர் > மரவேர்
      • தமிழ் + இனிது > தமிழினிது
  3. Complex Script and Orthography

    The Tamil alphabet's representation in Unicode is suboptimal for everything except a standardized storage format. Even simple operations that are intuitive for a native Tamil speaker are harder to implement because of this. Techniques like BPE applied to Tamil text will break words at completely inappropriate points, such as cutting an uyirmei letter into a consonant and a diacritic, resulting in meaningless output.

    தமிழ் > த ம ி ழ ்
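The script issue above is easy to reproduce: naive character-level splitting in Python separates an uyirmei letter into its consonant and its vowel sign or pulli, which are meaningless on their own.

```python
word = "தமிழ்"

print(list(word))   # ['த', 'ம', 'ி', 'ழ', '்']  (vowel sign and pulli split off)
print(len(word))    # 5 code points, though a reader sees only 3 letters
```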

Strategies for Effective Tokenization of Tamil Text

  1. Language-Specific Tokenizers

    Train Tamil-specific subword tokenizers with initial seed tokens prepared by better preprocessing techniques, to avoid cases like the script issue described in challenge 3 above. Use morphological analyzers to decompose words into roots and affixes, aiding in understanding and processing complex word forms.

Choosing the Right Tokenization Method

Challenges in Tokenization

  • Ambiguity: Words can have multiple meanings, and tokenizers cannot capture context. Example: The word "lead" can be a verb or a noun.
  • Handling Special Characters and Emojis: Modern text often includes emojis, URLs, and hashtags, which require specialized handling.
  • Multilingual Texts: Tokenizing text that includes multiple languages or scripts adds complexity, necessitating adaptable tokenization strategies.

Best Practices for Effective Tokenization

  • Understand Your Data: Analyze the text data to choose the most suitable tokenization method.
  • Consider the Task Requirements: Different NLP tasks may benefit from different tokenization granularities.
  • Use Pre-trained Tokenizers When Possible: Leveraging existing tokenizers associated with pre-trained models can save time and improve performance.
  • Normalize Text Before Tokenization: Cleaning and standardizing text (for example lowercasing, Unicode normalization and whitespace handling) before tokenization leads to more consistent tokens.

Vectordb

31 August 2024 at 00:00

Vector Databases 101

*Co-authored by Angu S KrishnaKumar, Kamal raj Kanagarajan and Malaikannan Sankarasubbu*

Introduction

In the world of Large Language Models (LLMs), vector databases play a pivotal role in Retrieval Augmented Generation (RAG) applications. These specialized databases are designed to store and retrieve high-dimensional vectors, which represent complex data structures like text, images, and audio. By leveraging vector databases, LLMs can access vast amounts of information and generate more informative and accurate responses. Retrieval Augmented Generation (RAG) is a technique that combines the power of LLMs with external knowledge bases: by retrieving relevant information from a knowledge base and incorporating it into the LLM's generation process, RAG can produce more comprehensive and contextually appropriate outputs.

How RAG Works:

  • User Query: A user submits a query or prompt to the RAG system.
  • Information Retrieval: The system retrieves relevant information from a knowledge base based on the query. VectorDBs play a key role here: embeddings (aka vectors) are stored in the VectorDB and retrieval is done using similarity measures.
  • Language Model Generation: The retrieved information is fed into a language model, which generates a response based on the query and the retrieved context.
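A minimal sketch of that flow is shown below. The embed(), vector_store.search() and llm_generate() calls are placeholders, not real library APIs; they just mark where the embedding model, the VectorDB and the LLM plug in.

```python
def rag_answer(question, embed, vector_store, llm_generate, k=5):
    """Toy RAG pipeline: embed the query, retrieve context, generate an answer."""
    query_vector = embed(question)                      # 1. user query -> embedding
    hits = vector_store.search(query_vector, k=k)       # 2. similarity search in the VectorDB
    context = "\n".join(hit["text"] for hit in hits)    #    gather the retrieved passages
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm_generate(prompt)                         # 3. LLM generates a grounded answer
```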

In this blog series, we will delve into the intricacies of vector databases, exploring their underlying principles, key features, and real-world applications. We will also discuss the advantages they offer over traditional databases and how they are transforming the way we store, manage, and retrieve data.

What is a Vector?

A vector is a sequence of numbers grouped together. For example:

  • (3) is a one-dimensional vector.
  • (2,8) is a two-dimensional vector.
  • (12,6,7,4) is a four-dimensional vector.

A vector can be represented by plotting it on a graph. Let's take a 2D example.

2D Plot

We can only visualize up to 3 dimensions; anything beyond that we can describe but not visualize. Below is an example of a 4-dimensional vector representation of the word king.

King Vector

What is a Vector Database?

A Vector Database (VectorDB) is a specialized database system designed to store, manage, and efficiently query high-dimensional vector data. Unlike traditional relational databases that work with structured data in tables, VectorDBs are optimized for handling vector embeddings – numerical representations of data in multi-dimensional space.

In a VectorDB:

  1. Each item (like a document, image, or concept) is represented as a vector – a list of numbers that describe the item’s features or characteristics.
  2. These vectors are stored in a way that allows for fast similarity searches and comparisons.
  3. The database is optimized for operations like finding the nearest neighbors to a given vector, which is crucial for many AI and machine learning applications.

VectorDBs are particularly useful in scenarios where you need to find similarities or relationships between large amounts of complex data, such as in recommendation systems, image recognition, or natural language processing tasks.
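To make these ideas concrete, here is a toy in-memory "vector store" with brute-force cosine search. It is purely illustrative; a real VectorDB adds persistence, ANN indexes, metadata filtering and distribution.

```python
import numpy as np

class ToyVectorStore:
    """Illustrative in-memory store: maps ids to vectors, brute-force cosine search."""
    def __init__(self, dim):
        self.dim = dim
        self.ids, self.vectors = [], []

    def add(self, item_id, vector):
        self.ids.append(item_id)
        self.vectors.append(np.asarray(vector, dtype=np.float32))

    def search(self, query, k=3):
        mat = np.stack(self.vectors)                     # (n, dim) matrix of stored vectors
        q = np.asarray(query, dtype=np.float32)
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-9)
        top = np.argsort(-sims)[:k]                      # indices of the k most similar items
        return [(self.ids[i], float(sims[i])) for i in top]

store = ToyVectorStore(dim=3)
store.add("doc1", [0.1, 0.9, 0.0])
store.add("doc2", [0.8, 0.1, 0.1])
print(store.search([0.2, 0.8, 0.0], k=1))   # [('doc1', ...)]
```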

Key Concepts

  1. Vector Embeddings

    • Vector embeddings are numerical representations of data in a multi-dimensional space.
    • They capture semantic meaning and relationships between different pieces of information.
    • In natural language processing, word embeddings are a common type of vector embedding. Each word is represented by a vector of real numbers, where words with similar meanings are closer in the vector space.
    • For a detailed treatment of embeddings, please refer to the earlier blog post on Embeddings.

Let’s look at an example of Word Vector output generated by Word2Vec

from gensim.models import Word2Vec

# Example corpus (a list of sentences, where each sentence is a list of words)
sentences = [
    ["machine", "learning", "is", "fascinating"],
    ["gensim", "is", "a", "useful", "library", "for", "word", "embeddings"],
    ["vector", "representations", "are", "important", "for", "NLP", "tasks"]
]

# Train a Word2Vec model with 300-dimensional vectors
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=4)

# Get the 300-dimensional vector for a specific word
word_vector = model.wv['machine']

# Print the vector
print(f"Vector for 'machine': {word_vector}")

Sample Output for 300 dimension vector


Vector for 'machine': [ 2.41737941e-03 -1.42750892e-03 -4.85344668e-03  3.12493594e-03, 4.84531874e-03 -1.00165956e-03  3.41092921e-03 -3.41384278e-03, 4.22888929e-03  1.44586214e-03 -1.35438916e-03 -3.27448458e-03
  4.70721726e-03 -4.50850562e-03  2.64214014e-03 -3.29884756e-03, -3.13906092e-03  1.09677911e-03 -4.94637461e-03  3.32896863e-03,2.03538216e-03 -1.52456785e-03  2.28793684e-03 -1.43519988e-03, 4.34566711e-03 -1.94705374e-03  1.93231280e-03  4.34081139e-03
  ...
  3.40303702e-03  1.58637420e-03 -3.31261402e-03  2.01543484e-03,4.39879852e-03  2.54576413e-03 -3.30528596e-03  3.01509819e-03,2.15555660e-03  1.64605413e-03  3.02376228e-03 -2.62048110e-03
  3.80181967e-03 -3.14147812e-03  2.23554621e-03  2.68812295e-03,1.80951719e-03  1.74256027e-03 -2.47024545e-03  4.06702763e-03,2.30203426e-03 -4.75471295e-03 -3.66776927e-03  2.06539119e-03]

  2. High-Dimensional Space
  • Vector databases typically work with vectors that have hundreds or thousands of dimensions. This high dimensionality allows for rich and nuanced representations of data.
  • For example:
    • A word might be represented by 300 dimensions
    • An image could be represented by 1000 dimensions
    • A user’s preferences might be captured in 500 dimensions

Why do you need a Vector Database when there are RDBMSs like PostgreSQL or NoSQL DBs like Elasticsearch or MongoDB?

RDBMS

RDBMS are designed to store and manage structured data in a tabular format. They are based on the relational model, which defines data as a collection of tables, where each table represents a relation.

Key components of RDBMS:

  • Tables: A collection of rows and columns, where each row represents a record and each column represents an attribute.
  • Rows: Also known as records, they represent instances of an entity.
  • Columns: Also known as attributes, they define the properties of an entity.
  • Primary key: A unique identifier for each row in a table.
  • Foreign key: A column in one table that references the primary key of another table, establishing a relationship between the two tables.
  • Normalization: A process of organizing data into tables to minimize redundancy and improve data integrity.

Why RDBMS don’t apply to storing vectors:

  1. Data Representation:
    • RDBMS store data in a tabular format, where each row represents an instance of an entity and each column represents an attribute.
    • Vectors are represented as a sequence of numbers, which doesn’t fit well into the tabular structure of RDBMS.
  2. Query Patterns:
    • RDBMS are optimized for queries based on joining tables and filtering data based on specific conditions.
    • Vector databases are optimized for similarity search, which involves finding vectors that are closest to a given query vector. This type of query doesn’t align well with the traditional join-based queries of RDBMS.
  3. Data Relationships:
    • RDBMS define relationships between entities using foreign keys and primary keys.
    • In vector databases, relationships are implicitly defined by the proximity of vectors in the vector space. There’s no explicit need for foreign keys or primary keys.
  4. Performance Considerations:
    • RDBMS are often optimized for join operations and range queries.
    • Vector databases are optimized for similarity search, which requires efficient indexing and partitioning techniques.

Let’s also look at a table for a comparison of features

| Feature | VectorDB | RDBMS |
| ------- | -------- | ----- |
| Dimensional Efficiency | Designed to handle high-dimensional data efficiently | Performance degrades rapidly as dimensions increase |
| Similarity Search | Implement specialized algorithms for fast approximate nearest neighbor (ANN) searches | Lack native support for ANN algorithms, making similarity searches slow and computationally expensive |
| Indexing for Vector Spaces | Use index structures optimized for vector data (e.g., HNSW, IVF) | Rely on B-trees and hash indexes, which become ineffective in high-dimensional spaces |
| Vector Operations | Provide built-in, optimized support for vector operations | Require complex, often inefficient SQL queries to perform vector computations |
| Scalability for Vector Data | Designed to distribute vector data and parallelize similarity searches across multiple nodes efficiently | While scalable for traditional data, they’re not optimized for distributing and querying vector data at scale |
| Real-time Processing | Optimized for fast insertions and queries of vector data, supporting real-time applications | May struggle with the speed requirements of real-time vector processing, especially at scale |
| Storage Efficiency | Use compact, specialized formats for storing dense vector data | Less efficient when storing high-dimensional vectors, often requiring more space and slower retrieval |
| Machine Learning Integration | Seamlessly integrate with ML workflows, supporting operations common in AI applications | Require additional processing and transformations to work effectively with ML pipelines |
| Approximate Query Support | Often support approximate queries, trading off some accuracy for significant speed improvements | Primarily designed for exact queries, lacking native support for approximate vector searches |

In a nutshell, RDBMS are well-suited for storing and managing structured data, but they are not optimized for storing and querying vectors. Vector databases, on the other hand, are specifically designed for handling vectors and performing similarity search operations.

NoSQL Databases

NoSQL databases are designed to handle large datasets and unstructured or semi-structured data that don’t fit well into the relational model. They offer flexibility in data structures, scalability, and high performance.

Common types of NoSQL databases include:

  • Key-value stores: Store data as key-value pairs.
  • Document stores: Store data as documents, often in JSON or BSON format.
  • Wide-column stores: Store data in wide columns, where each column can have multiple values.
  • Graph databases: Store data as nodes and relationships, representing connected data.

Key characteristics of NoSQL databases:

  • Flexibility: NoSQL databases offer flexibility in data structures, allowing for dynamic schema changes and accommodating evolving data requirements.
  • Scalability: Many NoSQL databases are designed to scale horizontally, allowing for better performance and scalability as data volumes grow.
  • High performance: NoSQL databases often provide high performance, especially for certain types of workloads.
  • Eventual consistency: Some NoSQL databases prioritize availability and performance over strong consistency, offering eventual consistency guarantees.

Why NoSQL Databases Might Not Be Ideal for Storing and Retrieving Vectors

While NoSQL databases offer many advantages, they might not be the best choice for storing and retrieving vectors due to the following reasons:

  1. Data Representation: NoSQL databases, while flexible, might not be specifically optimized for storing and querying high-dimensional vectors. The data structures used in NoSQL databases might not be the most efficient for vector-based operations.
  2. Query Patterns: NoSQL databases are often designed for different query patterns than vector-based operations. While they can handle complex queries, they might not be as efficient for similarity search, which is a core operation for vector databases.
  3. Performance Considerations:
    • Indexing: NoSQL databases often use different indexing techniques than RDBMS. While they can be efficient for certain types of queries, they might not be as optimized for vector-based similarity search.
    • Memory requirements: For vector-based operations, especially in large-scale applications, the memory requirements can be significant. NoSQL databases like Elasticsearch, which are often used for full-text search and analytics, might require substantial memory resources to handle large vector datasets efficiently.

Elasticsearch as an Example:

Elasticsearch is a popular NoSQL database often used for full-text search and analytics. While it can be used to store and retrieve vectors, there are some considerations:

  • Memory requirements: Storing and indexing large vector datasets in Elasticsearch can be memory-intensive, especially for high-dimensional vectors.
  • Query performance: The performance of vector-based queries in Elasticsearch can depend on factors like the number of dimensions, the size of the dataset, and the indexing strategy used.
  • Specialized plugins: Elasticsearch offers plugins like the knn plugin that can be used to optimize vector-based similarity search. However, these plugins might have additional performance and memory implications.

In a nutshell, while NoSQL databases offer many advantages, their suitability for storing and retrieving vectors depends on specific use cases and requirements. For applications that heavily rely on vector-based similarity search and require high performance, specialized vector databases might be a more appropriate choice.

A Deeper Dive into Similarity Search in Vector Databases

Similarity search is a fundamental operation in vector databases, involving finding the closest matches to a given query vector from a large dataset of vectors. This is crucial for applications like recommendation systems, image search, and natural language processing.

Similarity measures, algorithms, and data structures are crucial for efficient similarity search. Similarity measures (e.g., cosine, Euclidean) quantify the closeness between vectors. Algorithms (e.g., brute force, LSH, HNSW) determine how vectors are compared and retrieved. Data structures (e.g., inverted indexes, hierarchical graphs) optimize storage and retrieval. The choice of these components depends on factors like dataset size, dimensionality, and desired accuracy. By selecting appropriate measures, algorithms, and data structures, you can achieve efficient and accurate similarity search in various applications. Let's look in detail at the different similarity measures and algorithms/data structures in the sections below.

Understanding Similarity Measures

  • Cosine Similarity: Measures the angle between two vectors. It’s suitable when the magnitude of the vectors doesn’t matter (e.g., document similarity based on word counts).

import numpy as np

def cosine_similarity(v1, v2):
    """Calculates the cosine similarity between two vectors.

    Args:
        v1: The first vector.
        v2: The second vector.

    Returns:
        The cosine similarity between the two vectors.
    """

    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)

    return dot_product / (norm_v1 * norm_v2)

# Example usage
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
similarity = cosine_similarity(vector1, vector2)
print(similarity)

  • Euclidean Distance: Measures the straight-line distance between two points in Euclidean space. It’s suitable when the magnitude of the vectors is important (e.g., image similarity based on pixel values).

import numpy as np

def euclidean_distance(v1, v2):
    """Calculates the Euclidean distance between two vectors.

    Args:
        v1: The first vector.
        v2: The second vector.

    Returns:
        The Euclidean distance between the two vectors.
    """

    return np.linalg.norm(v1 - v2)

# Example usage
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
distance = euclidean_distance(vector1, vector2)
print(distance)

  • Hamming Distance: Measures the number of positions where two binary vectors differ. It’s useful for comparing binary data.
import numpy as np

def hamming_distance(v1, v2):
    """Calculates the Hamming distance between two binary vectors.

    Args:
        v1: The first binary vector.
        v2: The second binary vector.

    Returns:
        The Hamming distance between the two vectors.
    """

    return np.sum(v1 != v2)

# Example usage
vector1 = np.array([0, 1, 1, 0])
vector2 = np.array([1, 1, 0, 1])
distance = hamming_distance(vector1, vector2)
print(distance)
  • Manhattan Distance: Also known as L1 distance, it measures the sum of absolute differences between corresponding elements of two vectors.
import numpy as np

def manhattan_distance(v1, v2):
    """Calculates the Manhattan distance between two vectors.

    Args:
        v1: The first vector.
        v2: The second vector.

    Returns:
        The Manhattan distance between the two vectors.
    """

    return np.sum(np.abs(v1 - v2))

# Example usage
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
distance = manhattan_distance(vector1, vector2)
print(distance)

Algorithms and Data Structures

Brute Force: is a straightforward but computationally expensive algorithm for finding the nearest neighbors in a dataset. It involves comparing the query vector with every other vector in the dataset to find the closest matches.

How Brute Force Works

  1. Iterate through the dataset: For each vector in the dataset, calculate its distance to the query vector.
  2. Maintain a list of closest neighbors: Keep track of the closest vectors found so far.
  3. Update the list: If the distance between the current vector and the query vector is smaller than the distance to the farthest neighbor in the list, replace the farthest neighbor with the current vector.
  4. Repeat: Continue this process until all vectors in the dataset have been compared.

Advantages and Disadvantages

  • Advantages:
    • Simple to implement.
    • Guaranteed to find the exact nearest neighbors.
  • Disadvantages:
    • Extremely slow for large datasets.
    • Inefficient for high-dimensional data.

Python Code Example

import numpy as np

def brute_force_search(query_vector, vectors, k=10):
    """Performs brute force search for the nearest neighbors.

    Args:
        query_vector: The query vector.
        vectors: The dataset of vectors.
        k: The number of nearest neighbors to find.

    Returns:
        A list of indices of the nearest neighbors.
    """

    distances = np.linalg.norm(vectors - query_vector, axis=1)
    nearest_neighbors = np.argsort(distances)[:k]
    return nearest_neighbors

# Example usage
query_vector = np.random.rand(128)
vectors = np.random.rand(1000, 128)
nearest_neighbors = brute_force_search(query_vector, vectors, k=10)

Brute Force is generally not suitable for large datasets or high-dimensional data due to its computational complexity. For these scenarios, more efficient algorithms like LSH, HNSW, or IVF-Flat are typically used. However, it can be useful for small datasets or as a baseline for comparison with other algorithms.

Locality Sensitive Hashing (LSH): is a technique used to efficiently find similar items in large datasets. It works by partitioning the vector space into buckets and hashing similar vectors into the same bucket. This makes it possible to quickly find approximate nearest neighbors without having to compare every vector in the dataset.

How LSH Works

  1. Hash Function Selection: Choose a hash function that is sensitive to the distance between vectors. This means that similar vectors are more likely to be hashed into the same bucket.
  2. Hash Table Creation: Create multiple hash tables, each using a different hash function.
  3. Vector Hashing: For each vector, hash it into each hash table.
  4. Query Processing: When a query vector is given, hash it into each hash table.
  5. Candidate Selection: Retrieve all vectors that are in the same buckets as the query vector.
  6. Similarity Calculation: Calculate the actual similarity between the query vector and the candidate vectors.

LSH Families

  • Random Projection: Projects vectors onto random hyperplanes.
  • MinHash: Used for comparing sets of items.
  • SimHash: Used for comparing documents based on their shingles.

LSH Advantages and Disadvantages

  • Advantages:
    • Efficient for large datasets.
    • Can be used for approximate nearest neighbor search.
    • Can be parallelized.
  • Disadvantages:
    • Can introduce false positives or negatives.
    • Accuracy can be affected by the choice of hash functions and the number of hash tables.

Python Code Example using Annoy

import numpy as np
from annoy import AnnoyIndex

# Create an Annoy index for 128-dimensional vectors with the angular (cosine-like) metric
annoy_index = AnnoyIndex(128, 'angular')

# Add vectors to the index
for i in range(1000):
    vector = np.random.rand(128)
    annoy_index.add_item(i, vector)

# Build the index with 10 random-projection trees
annoy_index.build(10)

# Search for the 10 nearest neighbors
query_vector = np.random.rand(128)
nns = annoy_index.get_nns_by_vector(query_vector, 10)

Note: The number of trees passed to build() determines how many random-projection trees Annoy constructs. A larger number of trees generally improves accuracy but increases memory usage and build time.

By understanding the fundamentals of LSH and carefully selecting the appropriate parameters, you can effectively use it for similarity search in your applications.

Hierarchical Navigable Small World (HNSW): is a highly efficient algorithm for approximate nearest neighbor search in high-dimensional spaces. It constructs a hierarchical graph structure that allows for fast and accurate retrieval of similar items.

How HNSW Works

  1. Initialization: The algorithm starts by creating a single layer with all data points.
  2. Layer Creation: New layers are added iteratively. Each new point is connected to a subset of existing points based on their distance.
  3. Hierarchical Structure: The layers form a hierarchical structure, with higher layers having fewer connections and lower layers having more connections.
  4. Search: To find the nearest neighbors of a query point, the search starts from the top layer and gradually moves down the hierarchy, following the connections to find the most promising candidates.

Advantages of HNSW

  • High Accuracy: HNSW often achieves high accuracy, even for high-dimensional data.
  • Efficiency: It is very efficient for large datasets and can handle dynamic updates.
  • Flexibility: The algorithm can be adapted to different distance metrics and data distributions.

Python Code Example using NMSLIB

import numpy as np
import nmslib

# Create an HNSW index using cosine similarity
nmslib_index = nmslib.init(method='hnsw', space='cosinesimil')

# Add vectors to the index
vectors = np.random.rand(1000, 128)
nmslib_index.addDataPointBatch(vectors)

# Build the index
nmslib_index.createIndex()

# Search for the 10 nearest neighbors
query_vector = np.random.rand(128)
ids, distances = nmslib_index.knnQuery(query_vector, k=10)

Note: The space parameter in nmslib specifies the distance metric used (e.g., cosinesimil for cosine similarity). You can also customize other parameters, like the number of layers and the number of connections per layer, to optimize performance for your specific application.

HNSW is a powerful algorithm for approximate nearest neighbor search, offering a good balance between accuracy and efficiency. It’s particularly well-suited for high-dimensional data and can be used in various applications, such as recommendation systems, image search, and natural language processing.

IVF-Flat: is a hybrid indexing technique that combines an Inverted File (IVF) over coarse quantization cells with flat (uncompressed, exact) storage of the vectors inside each cell, to efficiently perform approximate nearest neighbor search (ANN) in high-dimensional vector spaces. It's particularly effective for large datasets and high-dimensional vectors.

How IVF-Flat Works

  1. Quantization: The dataset is divided into n quantized subspaces (quantization cells). Each vector is assigned to a cell based on its similarity to a representative point (centroid) of the cell.
  2. Inverted File: An inverted index is created, where each quantized cell is associated with a list of vectors belonging to that cell.
  3. Flat Index: For each quantized cell, a flat index (e.g., a linear scan or a tree-based structure) is built to store the vectors assigned to that cell.
  4. Query Processing: When a query vector is given, it’s first quantized to find the corresponding cell. Then, the flat index for that cell is searched for the nearest neighbors.
  5. Refinement: The top candidates from the flat index can be further refined using exact nearest neighbor search or other techniques to improve accuracy.

Advantages of IVF-Flat

  • Efficiency: IVF-Flat can be significantly faster than brute-force search for large datasets.
  • Accuracy: It can achieve good accuracy, especially when combined with refinement techniques.
  • Scalability: It can handle large datasets and high-dimensional vectors.
  • Flexibility: The number of quantized cells and the type of flat index can be adjusted to balance accuracy and efficiency.

Python Code Example using Faiss

import faiss
import numpy as np

dim, nlist, nprobe, k = 128, 100, 10, 10
vectors = np.random.rand(10000, dim).astype('float32')

# Create an IVF-Flat index: a flat (exact) quantizer assigns vectors to nlist cells
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist)

# IVF indexes must be trained (to learn the cell centroids) before adding vectors
index.train(vectors)
index.add(vectors)

# Search the nprobe closest cells for the k nearest neighbors
index.nprobe = nprobe
query_vector = np.random.rand(1, dim).astype('float32')
distances, indices = index.search(query_vector, k)

In this example:

  • dim is the dimensionality of the vectors.
  • nlist is the number of quantized cells.
  • nprobe is the number of cells to query during search.

IVF-Flat is a powerful technique for approximate nearest neighbor search in vector databases, offering a good balance between efficiency and accuracy. By carefully tuning the parameters, you can optimize its performance for your specific application.

ScaNN (Scalable Nearest Neighbors): is a scalable and efficient approximate nearest neighbor search library from Google, designed for large-scale datasets. It combines space partitioning with quantization, and optional exact re-scoring, to achieve high performance.

How ScaNN Works

  1. Quantization: The dataset is divided into quantized subspaces (quantization cells). Each vector is assigned to a cell based on its similarity to a representative point (centroid) of the cell.
  2. Inverted Index: An inverted index is created, where each quantized cell is associated with a list of vectors belonging to that cell.
  3. Scan: During query processing, the query vector is quantized to find the corresponding cell. Then, the vectors in that cell are scanned to find the nearest neighbors.
  4. Refinement: The top candidates from the scan can be further refined using exact nearest neighbor search or other techniques to improve accuracy.

Advantages of ScaNN

  • Scalability: ScaNN can handle large datasets and high-dimensional vectors efficiently.
  • Efficiency: It uses partitioning to reduce the search space, making it faster than brute-force search.
  • Accuracy: ScaNN can achieve good accuracy, especially when combined with refinement techniques.
  • Flexibility: The number of partitions and the refinement strategy can be adjusted to balance accuracy and efficiency.

Python Code Example using ScaNN

Note that Faiss does not provide a ScaNN index; ScaNN ships as its own Python package (scann). The sketch below follows the builder pattern from the ScaNN documentation; the parameter values are illustrative and should be tuned for your dataset.

import numpy as np
import scann

# Dataset of 10,000 128-dimensional vectors, normalized so that dot product behaves like cosine
dataset = np.random.rand(10000, 128).astype(np.float32)
dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)

# Build a searcher: partition into leaves, score with asymmetric hashing, re-rank exactly
searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=200, num_leaves_to_search=20, training_sample_size=10000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

# Search for the 10 nearest neighbors of a query
query_vector = np.random.rand(128).astype(np.float32)
neighbors, distances = searcher.search(query_vector, final_num_neighbors=10)

In this example:

  • num_leaves is the number of partitions the dataset is divided into (analogous to quantized cells).
  • num_leaves_to_search is the number of partitions scanned for each query.
  • reorder(100) re-scores the top 100 candidates exactly to improve accuracy.

ScaNN is a powerful algorithm for approximate nearest neighbor search in large-scale applications. It offers a good balance between efficiency and accuracy, making it a popular choice for various tasks, such as recommendation systems, image search, and natural language processing.

DiskANN: is a scalable approximate nearest neighbor search algorithm from Microsoft Research, designed for very large datasets that don't fit entirely in memory. It keeps the index and full-precision vectors on disk (SSD) while holding compressed vectors in memory, allowing it to efficiently handle large-scale vector search.

How DiskANN Works

  1. Graph Construction: A proximity graph (Vamana) is built over the dataset and stored on disk.
  2. Compression: Compressed (product-quantized) versions of the vectors are kept in memory to guide the search cheaply.
  3. On-Disk Storage: The graph and the full-precision vectors reside on disk, allowing datasets far larger than RAM.
  4. Query Processing: When a query vector is given, the search walks the graph, using the in-memory compressed vectors to pick candidates and reading only a small number of disk pages per step.
  5. Refinement: The retrieved candidates are re-ranked using the full-precision vectors read from disk to produce the final neighbors.

Advantages of DiskANN

  • Scalability: DiskANN can handle extremely large datasets that don't fit in memory.
  • Efficiency: It uses a disk-resident graph index together with in-memory compressed vectors to optimize performance for large-scale search.
  • Accuracy: DiskANN can achieve good accuracy, especially when combined with refinement techniques.
  • Flexibility: The graph degree, search list size and caching strategy can be adjusted to balance accuracy and efficiency.

Python Code Example using diskannpy

Note that Faiss does not provide a DiskANN index; DiskANN is available through Microsoft's diskannpy package. The sketch below follows the names used in the diskannpy documentation; treat the exact parameters as illustrative.

import numpy as np
import diskannpy

dim = 128
vectors = np.random.rand(10000, dim).astype(np.float32)

# Build a disk-resident index in the given directory (illustrative parameters)
diskannpy.build_disk_index(data=vectors, distance_metric="l2",
                           index_directory="diskann_index",
                           complexity=64, graph_degree=32,
                           search_memory_maximum=1.0, build_memory_maximum=4.0,
                           num_threads=4, pq_disk_bytes=0)

# Load the on-disk index and search it
index = diskannpy.StaticDiskIndex(index_directory="diskann_index",
                                  num_threads=4, num_nodes_to_cache=1000)
query_vector = np.random.rand(dim).astype(np.float32)
ids, distances = index.search(query_vector, k_neighbors=10, complexity=64, beam_width=2)

In this example:

  • index_directory is the folder where the on-disk index is stored.
  • graph_degree and complexity control the quality of the graph and candidate lists; beam_width controls how many disk reads are issued per search step.

DiskANN is a powerful algorithm for approximate nearest neighbor search in very large datasets. It provides a scalable and efficient solution for handling massive amounts of data while maintaining good accuracy.

Vector Database Comparison: Features, Use Cases, and Selection Guide

Just as there are many database choices in the RDBMS and NoSQL worlds, there are quite a few vector databases to pick from, and choosing the right one for your application matters. The table below compares key features, typical use cases, and when to choose each option.

  • Pinecone
    Key features: Fully managed service, real-time updates, hybrid search (vector + metadata), serverless.
    Best for: Production-ready applications, rapid development, scalable solutions.
    When to choose: When you need a fully managed solution, for applications requiring real-time updates, and when combining vector search with metadata filtering.

  • Milvus
    Key features: Open-source, scalable to billions of vectors, supports multiple index types, hybrid search capabilities.
    Best for: Large-scale vector search, on-premises deployments, customizable solutions.
    When to choose: When you need an open-source solution, for very large-scale vector search applications, and when you require fine-grained control over indexing.

  • Qdrant
    Key features: Open-source, Rust-based for high performance, supports filtering with payload, on-prem and cloud options.
    Best for: High-performance vector search, applications with complex filtering needs.
    When to choose: When performance is critical, for applications requiring advanced filtering, and when you need both cloud and on-prem options.

  • Weaviate
    Key features: Open-source, GraphQL API, multi-modal data support, AI-first database.
    Best for: Semantic search applications, multi-modal data storage and retrieval.
    When to choose: When working with multiple data types (text, images, etc.), if you prefer GraphQL for querying, and for AI-centric applications.

  • Faiss (Facebook AI Similarity Search)
    Key features: Open-source, highly efficient for dense vectors, GPU support.
    Best for: Research and experimentation, large-scale similarity search.
    When to choose: When you need low-level control, for integration into custom systems, and when GPU acceleration is beneficial.

  • Elasticsearch with vector search
    Key features: Full-text search plus vector capabilities, mature ecosystem and extensive analytics features.
    Best for: Applications combining traditional search and vector search, and when you need rich text analytics.
    When to choose: When you are already using Elasticsearch, for hybrid search applications (text + vector), and when you need advanced analytics alongside vector search.

  • pgvector (PostgreSQL extension)
    Key features: Vector similarity search in PostgreSQL, integrates with existing PostgreSQL databases.
    Best for: Adding vector capabilities to existing PostgreSQL systems, small to medium-scale applications.
    When to choose: When you are already heavily invested in PostgreSQL, for projects that do not require specialized vector DB features, and when simplicity and familiarity are priorities.

  • Vespa
    Key features: Open-source, combines full-text search, vector search, and structured data, real-time indexing and serving.
    Best for: Complex search and recommendation systems, applications requiring structured, text, and vector data.
    When to choose: For large-scale, multi-modal search applications, when you need a unified platform for different data types, and for real-time, high-volume applications.

  • AWS OpenSearch
    Key features: Fully managed AWS service, combines traditional full-text search with vector-based similarity search.
    Best for: Searching both text-based content and vectors, real-time search and analytics on large datasets.
    When to choose: When you want to leverage the broader AWS ecosystem, and for applications that require processing billions of vectors.

Conclusion

For my previous startup, Datalog dot ai, which I (Malaikannan Sankarasubbu) founded to build a low-code virtual assistant platform, we leveraged FAISS heavily for intent similarity. From that point to now, the options for vector databases have grown considerably.

Vector databases have emerged as a powerful tool for handling unstructured and semi-structured data, offering efficient similarity search capabilities and supporting a wide range of applications. By understanding the fundamentals of vector databases, including similarity measures, algorithms, and data structures, you can select the right approach for your specific needs.

In future blog posts, we will delve deeper into performance considerations, including indexing techniques, hardware optimization, and best practices for scaling vector databases. We will also explore real-world use cases and discuss the challenges and opportunities that lie ahead in the field of vector databases.

Demystifying IP Addresses and Netmasks: The Complete Overview

24 August 2024 at 13:14

In this blog, we will learn about IP addresses and netmasks.

IP

An Internet Protocol (IP) address is a unique identifier for your device on a network, similar to how a mobile number uniquely identifies your phone.

IPv4 addresses are typically written as four octets separated by dots, with each octet being one byte in size. IPv6 addresses are written as eight groups separated by colons, with each group being two bytes in size.

Examples:

  • IPv4: 192.168.43.64
  • IPv6: 2001:db8:3333:4444:5555:6666:7777:8888

For the purposes of this discussion, we will focus on IPv4.

Do we really require the four-octet structure with dots between them?

The answer is NO

The only requirement for an IPv4 address is that it must be 4 bytes in size. However, it does not have to be written as four octets or even with dots separating them.

Let’s test this by fetching Google’s IP address using the nslookup command and converting the dotted address into a single number (for example with the bc calculator in a Bash shell). If you use that number in place of the dotted form, you can see it still works.

This is because the octet structure and the dots between them are only for human readability. Computers do not interpret dots; they just need an IP address that is 4 bytes in size, and that’s it.
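
If you prefer to do the conversion in Python rather than with bc, here is a small sketch using the standard-library ipaddress module; the address shown is only an example of what nslookup might return.

import ipaddress

ip = ipaddress.IPv4Address("142.250.183.78")   # example address returned by nslookup

# The dotted form is only for humans; the same address is one 32-bit number.
as_int = int(ip)
print(as_int)                          # the address as a single decimal number
print(ipaddress.IPv4Address(as_int))   # converts back to 142.250.183.78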

The range for IPv4 addresses is from 0.0.0.0 to 255.255.255.255.

Types of IP Addresses

IP addresses are classified into two main types: Public IPs and Private IPs.

Private IP addresses are used for communication between local devices without connecting to the Internet. They are free to use and, since they are not directly reachable from the Internet, relatively secure.

You can find your private IP address by using the ifconfig command.


The private IP address ranges are as follows:

10.0.0.0 to 10.255.255.255
172.16.0.0 to 172.31.255.255
192.168.0.0 to 192.168.255.255

Public IP addresses are Internet-facing addresses provided by an Internet Service Provider (ISP). These addresses are used to access the internet and are not free.
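
A quick way to check which category an address falls into is Python's standard-library ipaddress module; the addresses below are only illustrative.

import ipaddress

for addr in ["192.168.43.64", "10.4.3.7", "172.20.0.1", "8.8.8.8"]:
    ip = ipaddress.ip_address(addr)
    print(addr, "private" if ip.is_private else "public")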

By default:

  • Private IP to Private IP communication is possible.
  • Public IP to Public IP communication is possible.

However:

  • Public IP to Private IP communication is not possible.
  • Private IP to Public IP communication is not possible.

Nevertheless, these types of communication can occur through Network Address Translation (NAT), which is typically used by your home router. This is why you can access the Internet even with a private IP address.

Netmasks

Netmasks are used to define the range of IP addresses within a network.

Take the common netmask 255.255.255.0 as an example. Written in binary, it is 11111111.11111111.11111111.00000000, which is 24 ones followed by 8 zeros.

Here is how 255 converts to binary using the repeated division-by-2 method:

255 ÷ 2 = 127 remainder 1

127 ÷ 2 = 63 remainder 1

63 ÷ 2 = 31 remainder 1

31 ÷ 2 = 15 remainder 1

15 ÷ 2 = 7 remainder 1

7 ÷ 2 = 3 remainder 1

3 ÷ 2 = 1 remainder 1

1 ÷ 2 = 0 remainder 1

So, the binary value of 255 is 11111111.
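
As a quick check (not part of the original walkthrough), Python's built-in formatting performs the same conversion:

# 255 in 8-bit binary, matching the division method above
print(format(255, '08b'))  # 11111111
print(format(0, '08b'))    # 00000000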

By using this, we can find the number of IP addresses in the network and their range.

Since the netmask has 8 zeros, the number of IPs is 2^8 = 256. For example, in the 10.4.3.0/24 network, 10.4.3.0 is the network address, the usable IP range is 10.4.3.1 – 10.4.3.254, and the broadcast IP is 10.4.3.255.

We can also write this in CIDR (Classless Inter-Domain Routing) notation as 255.255.255.0/24, or simply /24, where 24 is the number of one bits in the netmask.
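
The same arithmetic can be verified with Python's standard-library ipaddress module; the network below matches the 10.4.3.0/24 example above.

import ipaddress

network = ipaddress.ip_network("10.4.3.0/24")

print(network.netmask)            # 255.255.255.0
print(network.num_addresses)      # 256
print(network.network_address)    # 10.4.3.0
print(network.broadcast_address)  # 10.4.3.255

hosts = list(network.hosts())     # usable host addresses
print(hosts[0], hosts[-1])        # 10.4.3.1 10.4.3.254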

That’s it.

Kindly let me know in the comments if you have any queries on these topics.

Chunking

5 August 2024 at 00:00

Breaking Down Data: The Science and Art of Chunking in Text Processing & RAG Pipeline

As the field of Natural Language Processing (NLP) continues to evolve, the combination of retrieval-based and generative models has emerged as a powerful approach for enhancing various NLP applications. One of the key techniques that significantly improves the efficiency and effectiveness of Retrieval-Augmented Generation (RAG) is chunking. In this blog, we will explore what chunking is, why it is important in RAG, the different ways to implement chunking, including content-aware and recursive chunking, how to evaluate the performance of chunking, chunking alternatives, and how it can be applied to optimize NLP systems.

Chunking

What is Retrieval-Augmented Generation (RAG)?

Before diving into chunking, let’s briefly understand RAG. Retrieval-Augmented Generation is a framework that combines the strengths of retrieval-based models and generative models. It involves retrieving relevant information from a large corpus based on a query and using this retrieved information as context for a generative model to produce accurate and contextually relevant responses or content.

What is Chunking?

Chunking is the process of breaking down large text documents or datasets into smaller, manageable pieces, or “chunks.” These chunks can then be individually processed, indexed, and retrieved, making the overall system more efficient and effective. Chunking helps in dealing with large volumes of text by dividing them into smaller, coherent units that are easier to handle.

Why Do We Need Chunking?

Chunking is essential in RAG for several reasons:

Efficiency

  • Computational cost: Processing smaller chunks of text requires less computational power compared to handling entire documents.
  • Storage: Chunking allows for more efficient storage and indexing of information.

Accuracy

  • Relevance: By breaking down documents into smaller units, it’s easier to identify and retrieve the most relevant information for a given query.
  • Context preservation: Careful chunking can help maintain the original context of the text within each chunk.

Speed

  • Retrieval time: Smaller chunks can be retrieved and processed faster, leading to quicker response times.
  • Model processing: Language models can process smaller inputs more efficiently.

Limitations of Large Language Models

  • Context window: LLMs have limitations on the amount of text they can process at once. Chunking helps to overcome this limitation.

In essence, chunking optimizes the RAG process by making it more efficient, accurate, and responsive.

Different Ways to Implement Chunking

There are various methods to implement chunking, depending on the specific requirements and structure of the text data. Here are some common approaches:

  1. Fixed-Length Chunking: Divide the text into chunks of fixed length, typically based on a predetermined number of words or characters.

    def chunk_text_fixed_length(text, chunk_size=200, by='words'):
        if by == 'words':
            words = text.split()
            return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
        elif by == 'characters':
            return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        else:
            raise ValueError("Parameter 'by' must be either 'words' or 'characters'.")
    
    text = "The process is more important than the results. And if you take care of the process, you will get the results."
    word_chunks = chunk_text_fixed_length(text, chunk_size=5, by='words')  
    character_chunks = chunk_text_fixed_length(text, chunk_size=5, by='characters')  
       
    
    print(word_chunks)
    ['The process is more important', 'than the results. And if', 'you take care of the', 'process, you will get the', 'results.']
    
    print(character_chunks)
    ['The p', 'roces', 's is ', 'more ', 'impor', 'tant ', 'than ', 'the r', 'esult', 's. An', 'd if ', 'you t', 'ake c', 'are o', 'f the', ' proc', 'ess, ', 'you w', 'ill g', 'et th', 'e res', 'ults.']
    
  2. Sentence-Based Chunking: Split the text into chunks based on complete sentences. This method ensures that each chunk contains coherent and complete thoughts.

    import nltk
    nltk.download('punkt')
       
    def chunk_text_sentences(text, max_sentences=3):
        sentences = nltk.sent_tokenize(text)
        return [' '.join(sentences[i:i + max_sentences]) for i in range(0, len(sentences), max_sentences)]
    
    text = """Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence. It deals with the interaction between computers and humans through natural language. NLP techniques are used to apply algorithms to identify and extract the natural language rules such that 
    the unstructured language data is converted into a form that computers can understand. Text mining and text classification are common applications of NLP. It's a powerful tool in the modern data-driven world."""
       
       
    
    sentence_chunks = chunk_text_sentences(text, max_sentences=2)
       
       
    for i, chunk in enumerate(sentence_chunks, 1):
        print(f"Chunk {i}:\n{chunk}\n")
    
    Chunk 1:
    Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence. It deals with the interaction between computers and humans through natural language.
       
    Chunk 2:
    NLP techniques are used to apply algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand. Text mining and text classification are common applications of NLP.
       
    Chunk 3:
    It's a powerful tool in the modern data-driven world.
    
  3. Paragraph-Based Chunking: Divide the text into chunks based on paragraphs. This approach is useful when the text is naturally structured into paragraphs that represent distinct sections or topics.

    def chunk_text_paragraphs(text):
        paragraphs = text.split('\n\n')
        return [paragraph for paragraph in paragraphs if paragraph.strip()]
    
    paragraph_chunks = chunk_text_paragraphs(text)
       
       
    for i, chunk in enumerate(paragraph_chunks, 1):
        print(f"Paragraph {i}:\n{chunk}\n")
    
    Paragraph 1:
    Natural Language Processing (NLP) is a fascinating field of Artificial Intelligence.
       
    Paragraph 2:
    It deals with the interaction between computers and humans through natural language.
       
    Paragraph 3:
    NLP techniques are used to apply algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand.
       
    Paragraph 4:
    Text mining and text classification are common applications of NLP. It's a powerful tool in the modern data-driven world.
    
  4. Thematic or Semantic Chunking: Use NLP techniques to identify and group related sentences or paragraphs into chunks based on their thematic or semantic content. This can be done using topic modeling or clustering algorithms.

    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
       
    nltk.download('punkt')
       
    def chunk_text_thematic(text, n_clusters=5):
        sentences = nltk.sent_tokenize(text)
        vectorizer = TfidfVectorizer(stop_words='english')
        X = vectorizer.fit_transform(sentences)
        kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
        clusters = kmeans.predict(X)
           
        chunks = [[] for _ in range(n_clusters)]
        for i, sentence in enumerate(sentences):
            chunks[clusters[i]].append(sentence)
           
        return [' '.join(chunk) for chunk in chunks]
       
       
       
    thematic_chunks = chunk_text_thematic(text, n_clusters=3)
       
       
    for i, chunk in enumerate(thematic_chunks, 1):
        print(f"Chunk {i}:\n{chunk}\n")
       
    
  5. Sliding Window Chunking: Use a sliding window approach to create overlapping chunks. This method ensures that important information near the boundaries of chunks is not missed.
     def chunk_text_sliding_window(text, chunk_size=200, overlap=50, unit='word'):
         """Chunks text using a sliding window.

         Args:
             text: The input text.
             chunk_size: The desired size of each chunk.
             overlap: The overlap between consecutive chunks.
             unit: The chunking unit ('word', 'char', or 'token').

         Returns:
             A list of text chunks.
         """
         if unit == 'word':
             data = text.split()
         elif unit == 'char':
             data = text
         else:
             # Tokenization for other units is not implemented in this sketch
             raise NotImplementedError("unit must be 'word' or 'char'")

         chunks = []
         for i in range(0, len(data), chunk_size - overlap):
             if unit == 'word':
                 chunk = ' '.join(data[i:i + chunk_size])
             else:
                 chunk = data[i:i + chunk_size]
             chunks.append(chunk)

         return chunks
    
    
  6. Content-Aware Chunking: This advanced method involves using more sophisticated NLP techniques to chunk the text based on its content and structure. Content-aware chunking can take into account factors such as topic continuity, coherence, and discourse markers. It aims to create chunks that are not only manageable but also meaningful and contextually rich.

    Example of Content-Aware Chunking using Sentence Transformers:

    from sentence_transformers import SentenceTransformer, util
    
    def content_aware_chunking(text, max_chunk_size=200):
        model = SentenceTransformer('all-MiniLM-L6-v2')
        sentences = nltk.sent_tokenize(text)
        embeddings = model.encode(sentences, convert_to_tensor=True)
        clusters = util.community_detection(embeddings, min_community_size=1)
           
        chunks = []
        for cluster in clusters:
            chunk = ' '.join([sentences[i] for i in cluster])
            if len(chunk.split()) <= max_chunk_size:
                chunks.append(chunk)
            else:
                sub_chunks = chunk_text_fixed_length(chunk, max_chunk_size)
                chunks.extend(sub_chunks)
           
        return chunks
    
  7. Recursive Chunking: Recursive chunking involves repeatedly breaking down chunks into smaller sub-chunks until each chunk meets a desired size or level of detail. This method ensures that very large texts are reduced to manageable and meaningful units at each level of recursion, making it easier to process and retrieve information.

    Example of Recursive Chunking:

    def recursive_chunk(text, max_chunk_size):
        """Recursively chunks text into smaller chunks.

        Args:
            text: The input text.
            max_chunk_size: The maximum desired chunk size.

        Returns:
            A list of text chunks.
        """
        if len(text) <= max_chunk_size:
            return [text]

        # Choose a splitting point based on paragraphs, sentences, or other criteria.
        # For example, split on blank lines first:
        paragraphs = text.split('\n\n')
        if len(paragraphs) > 1:
            chunks = []
            for paragraph in paragraphs:
                chunks.extend(recursive_chunk(paragraph, max_chunk_size))
            return chunks
        else:
            # Handle single-paragraph chunking, e.g., by sentence splitting.
            # The original post elides this step; one option is to fall back to
            # the fixed-length chunker defined earlier.
            return chunk_text_fixed_length(text, max_chunk_size)


  8. Agentic Chunking: Agentic chunking is a sophisticated technique that uses an LLM to dynamically determine chunk boundaries based on the content and context of the text. Below is an example prompt for agentic chunking (a code sketch follows after the prompt).

    Example Prompt:

    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    ## You are an agentic chunker. You will be provided with a content.
    Decompose the content into clear and simple propositions, ensuring they are interpretable out of context.
    1. Split compound sentences into simple sentences. Maintain the original phrasing from the input whenever possible.
    2. For any named entity that is accompanied by additional descriptive information, separate this information into its own distinct proposition.
    3. Decontextualize each proposition by adding necessary modifiers to nouns or the entire sentence and replacing pronouns (e.g. it, he, she, they, this, that) with the full name of the entities they refer to.
    4. Present the results as a list of strings, formatted in JSON.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    Here is the content : {content}
    Strictly follow the instructions provided and output in the desired format only.
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
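
As an illustrative sketch (not part of the original post), a prompt like the one above could be sent to an instruction-tuned Llama model through the Hugging Face transformers text-generation pipeline; the model name, content string, and generation settings below are assumptions, and any chat-completion API would work just as well.

from transformers import pipeline

# Hypothetical model choice; substitute any instruction-tuned chat model you have access to.
generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

content = "Chunking breaks large documents into smaller pieces. It makes retrieval faster and more accurate."

prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
    "## You are an agentic chunker. You will be provided with a content.\n"
    "Decompose the content into clear and simple propositions, ensuring they are interpretable out of context.\n"
    "Present the results as a list of strings, formatted in JSON.\n"
    "<|eot_id|><|start_header_id|>user<|end_header_id|>\n"
    f"Here is the content : {content}\n"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
)

# The pipeline returns the prompt plus the completion; the completion is the
# JSON list of propositions, which then become the chunks.
result = generator(prompt, max_new_tokens=256, do_sample=False)
print(result[0]["generated_text"])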

Chunk Size and Overlapping in Chunking

Determining the appropriate chunk size and whether to use overlapping chunks are critical decisions in the chunking process. These factors significantly impact the efficiency and effectiveness of the retrieval and generation stages in RAG systems.

Chunk Size
  1. Choosing Chunk Size: The ideal chunk size depends on the specific application and the nature of the text. Smaller chunks can provide more precise context but may miss broader information, while larger chunks capture more context but may introduce noise or irrelevant information.
    • Small Chunks: Typically 100-200 words. Suitable for fine-grained retrieval where specific details are crucial.
    • Medium Chunks: Typically 200-500 words. Balance between detail and context, suitable for most applications.
    • Large Chunks: Typically 500-1000 words. Useful for capturing broader context but may be less precise.
  2. Impact of Chunk Size: The chunk size affects the retrieval accuracy and computational efficiency. Smaller chunks generally lead to higher retrieval precision but may require more chunks to cover the same amount of text, increasing computational overhead. Larger chunks reduce the number of chunks but may lower retrieval precision.
Overlapping Chunks
  1. Purpose of Overlapping: Overlapping chunks ensure that important information near the boundaries of chunks is not missed. This approach is particularly useful when the text has high semantic continuity, and critical information may span across chunk boundaries.

  2. Degree of Overlap: The overlap size should be carefully chosen to balance redundancy and completeness. Common overlap sizes range from 10% to 50% of the chunk size.
    • Small Overlap: 10-20% of the chunk size. Minimizes redundancy but may still miss some boundary information.
    • Medium Overlap: 20-30% of the chunk size. Good balance between coverage and redundancy.
    • Large Overlap: 30-50% of the chunk size. Ensures comprehensive coverage but increases redundancy and computational load.
  3. Example of Overlapping Chunking:
    def chunk_text_sliding_window(text, chunk_size=200, overlap=50):
        words = text.split()
        chunks = []
        for i in range(0, len(words), chunk_size - overlap):
            chunk = words[i:i + chunk_size]
            chunks.append(' '.join(chunk))
        return chunks
    

Evaluating the Performance of Chunking

Evaluating the performance of chunking is crucial to ensure that the chosen method effectively enhances the retrieval and generation processes. Here are some key metrics and approaches for evaluating chunking performance:

Retrieval Metrics
  1. Precision@K: Measures the proportion of relevant chunks among the top K retrieved chunks.
    def precision_at_k(retrieved_chunks, relevant_chunks, k):
        return len(set(retrieved_chunks[:k]) & set(relevant_chunks)) / k
    
  2. Recall@K: Measures the proportion of relevant chunks retrieved among the top K chunks.
    def recall_at_k(retrieved_chunks, relevant_chunks, k):
        return len(set(retrieved_chunks[:k]) & set(relevant_chunks)) / len(relevant_chunks)
    
  3. F1 Score: Harmonic mean of Precision@K and Recall@K, providing a balance between precision and recall.
    def f1_score_at_k(precision, recall):
        if precision + recall == 0:
            return 0
        return 2 * (precision * recall) / (precision + recall)
    
  4. MAP: Mean Average Precision (MAP) is primarily used in information retrieval and object detection tasks to evaluate the ranking of retrieved items.
    import numpy as np

    def calculate_ap(y_true, y_score):
        """Calculates average precision for a single query.

        Args:
            y_true: Ground truth labels (0 or 1).
            y_score: Predicted scores.

        Returns:
            Average precision.
        """
        # Sort y_score and corresponding y_true in descending order
        y_score, y_true = zip(*sorted(zip(y_score, y_true), key=lambda x: x[0], reverse=True))

        correct_hits = 0
        sum_precision = 0
        for i, y in enumerate(y_true):
            if y == 1:
                correct_hits += 1
                precision = correct_hits / (i + 1)
                sum_precision += precision
        return sum_precision / sum(y_true)

    def calculate_map(y_true, y_score):
        """Calculates mean average precision.

        Args:
            y_true: Ground truth labels (list of lists).
            y_score: Predicted scores (list of lists).

        Returns:
            Mean average precision.
        """
        aps = []
        for i in range(len(y_true)):
            ap = calculate_ap(y_true[i], y_score[i])
            aps.append(ap)
        return np.mean(aps)
    
    
    
  5. NDCG: NDCG is a metric used to evaluate the quality of a ranking of items. It measures how well the most relevant items are ranked at the top of the list. In the context of chunking, we can potentially apply NDCG by ranking chunks based on a relevance score and evaluating how well the most relevant chunks are placed at the beginning of the list.
import numpy as np

def calculate_dcg(rel):
  """Calculates Discounted Cumulative Gain (DCG).

  Args:
    rel: Relevance scores of items.

  Returns:
    DCG value.
  """

  return np.sum(rel / np.log2(np.arange(len(rel)) + 2))

def calculate_idcg(rel):
  """Calculates Ideal Discounted Cumulative Gain (IDCG).

  Args:
    rel: Relevance scores of items.

  Returns:
    IDCG value.
  """

  rel = np.sort(rel)[::-1]
  return calculate_dcg(rel)

def calculate_ndcg(rel):
  """Calculates Normalized Discounted Cumulative Gain (NDCG).

  Args:
    rel: Relevance scores of items.

  Returns:
    NDCG value.
  """

  dcg = calculate_dcg(rel)
  idcg = calculate_idcg(rel)
  return dcg / idcg

# Example usage
relevance_scores = [3, 2, 1, 0]
ndcg_score = calculate_ndcg(relevance_scores)
print(ndcg_score)


Generation Metrics
  1. BLEU Score: Measures the overlap between the generated text and reference text, considering n-grams.
    from nltk.translate.bleu_score import sentence_bleu
    
    def bleu_score(reference, generated):
        return sentence_bleu([reference.split()], generated.split())
    
  2. ROUGE Score: Measures the overlap of n-grams, longest common subsequence (LCS), and skip-bigram between the generated text and reference text.
    from rouge import Rouge
    
    rouge = Rouge()
    
    def rouge_score(reference, generated):
        scores = rouge.get_scores(generated, reference)
        return scores[0]['rouge-l']['f']
    
  3. Human Evaluation: Involves subjective evaluation by human judges to assess the relevance, coherence, and overall quality of the generated responses. Human evaluation can provide insights that automated metrics might miss.

Chunking Alternatives

While chunking is an effective method for improving the efficiency and effectiveness of RAG systems, there are alternative techniques that can also be considered:

  1. Hierarchical Indexing: Instead of chunking the text, hierarchical indexing organizes the data into a tree structure where each node represents a topic or subtopic. This allows for efficient retrieval by navigating through the tree based on the query’s context.

    class HierarchicalIndex:
        def __init__(self):
            self.tree = {}

        def add_document(self, doc_id, topics):
            current_level = self.tree
            for topic in topics:
                if topic not in current_level:
                    current_level[topic] = {}
                current_level = current_level[topic]
            current_level['doc_id'] = doc_id

        def retrieve(self, query_topics):
            current_level = self.tree
            for topic in query_topics:
                if topic in current_level:
                    current_level = current_level[topic]
                else:
                    return []
            return current_level.get('doc_id', [])
    
  2. Summarization: Instead of retrieving chunks, the system generates summaries of documents or sections that are relevant to the query. This can be done using extractive or abstractive summarization techniques.
    from transformers import BartTokenizer, BartForConditionalGeneration
    
    def generate_summary(text):
        tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
        model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    
        inputs = tokenizer([text], max_length=1024, return_tensors='pt', truncation=True)
        summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=150, early_stopping=True)
        return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    
  3. Dense Passage Retrieval (DPR): DPR uses dense vector representations for both questions and passages, allowing for efficient similarity search using vector databases like FAISS.
    from transformers import DPRQuestionEncoder, DPRContextEncoder, DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer
    from sklearn.metrics.pairwise import cosine_similarity
    
    question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    
    question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
    context_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
    
    def encode_texts(texts, tokenizer, encoder):
        inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        return encoder(**inputs).pooler_output.detach().numpy()  # detach so sklearn can consume the embeddings
    
    question_embeddings = encode_texts(["What is chunking?"], question_tokenizer, question_encoder)
    context_embeddings = encode_texts(["Chunking is a process...", "Another context..."], context_tokenizer, context_encoder)
    
    similarities = cosine_similarity(question_embeddings, context_embeddings)
    
  4. Graph-Based Representations: Instead of breaking the text into chunks, graph-based representations model the relationships between different parts of the text. Nodes represent entities, concepts, or chunks of text, and edges represent the relationships between them. This approach allows for more flexible and context-aware retrieval.
   import networkx as nx

   def build_graph(texts):
       graph = nx.Graph()
       for i, text in enumerate(texts):
           graph.add_node(i, text=text)
           # Add edges based on some similarity metric
           for j in range(i + 1, len(texts)):
               similarity = compute_similarity(text, texts[j])
               if similarity > threshold:
                   graph.add_edge(i, j, weight=similarity)
       return graph

   def retrieve_from_graph(graph, query):
       query_node = len(graph.nodes)
       graph.add_node(query_node, text=query)
       for i in range(query_node):
           similarity = compute_similarity(query, graph.nodes[i]['text'])
           if similarity > threshold:
               graph.add_edge(query_node, i, weight=similarity)
       # Retrieve nodes with highest similarity
       neighbors = sorted(graph[query_node], key=lambda x: graph[query_node][x]['weight'], reverse=True)
       return [graph.nodes[n]['text'] for n in neighbors[:k]]

Graph-based representations can capture complex relationships and provide a more holistic view of the text, making them a powerful alternative to chunking.

Conclusion

Chunking plays a pivotal role in enhancing the efficiency and effectiveness of Retrieval-Augmented Generation systems. By breaking down large texts into manageable chunks, we can improve retrieval speed, contextual relevance, scalability, and the overall quality of generated responses. Evaluating the performance of chunking methods involves considering retrieval and generation metrics, as well as efficiency and cost metrics. As NLP continues to advance, techniques like chunking will remain essential for optimizing the performance of RAG and other language processing systems. Additionally, exploring alternatives such as hierarchical indexing, passage retrieval, summarization, dense passage retrieval, and graph-based representations can further enhance the capabilities of RAG systems.

Embark on your journey to harness the power of chunking in RAG and unlock new possibilities in the world of Natural Language Processing!

If you found this blog post helpful, please consider citing it in your work:

@misc{malaikannan2024chunking,
  author = {Sankarasubbu, Malaikannan},
  title  = {Breaking Down Data: The Science and Art of Chunking in Text Processing & RAG Pipeline},
  year   = {2024},
  url    = {https://malaikannan.github.io/2024/08/05/Chunking/},
  note   = {Accessed: 2024-08-12}
}

Embeddings

31 July 2024 at 00:00

Computers are meant to crunch numbers; that goes back to the original design of these machines. Representing text as numbers is the holy grail of Natural Language Processing (NLP), but how do we do that? Over the years, various techniques have been developed to achieve this. Early methods like n-grams (bigrams, trigrams) and TF-IDF were able to convert words into numbers. Not just one number each: every word is represented by a collection of numbers. That collection is called a vector, and it has a fixed size called the dimension of the vector. Though these methods were useful, they had their limitations. The most important limitation is that the vector for each word stands alone; we could not do meaningful mathematical operations like addition or subtraction between vectors (we could compute them, but the resulting vector would not represent any word). That is where embeddings come in. An embedding is also a vector, so each word gets a corresponding vector, but now we can compute King - Man + Woman and get a vector that is close to the vector for Queen. Why is this useful? That is what we are going to explore in this article.

What are Embeddings?

Embeddings are numerical representations of text data where words or phrases from the vocabulary are mapped to vectors of real numbers. This mapping is crucial because it allows us to quantify and manipulate textual data in a way that machines can understand and process.

We understand what a word is; let’s see what a vector is. A vector is a sequence of numbers that forms a group. For example:

  • (3) is a one dimensional vector.
  • (2,8) is a two dimensional vector.
  • (12,6,7,4) is a four dimensional vector.

A vector can be represented by plotting it on a graph. Let’s take a 2D example.

2D Plot

We can only visualize up to 3 dimensions; for anything beyond that, we can state the numbers but not visualize them.

Below is an example of a 4-dimensional vector representation of the word king.

King Vector

One of the seminal papers to come out of Google is Word2Vec. Let’s see how Word2Vec works to get a conceptual understanding of how embeddings work.

How Word2Vec works

For an input text, it looks at each word and the context of words around it. It trains on the text, recognizing the order of words and the structure of the sentences. At the end of training, each word is represented by a vector of N dimensions (N is mostly in the 100 to 300 range).

Word2Vec

Let’s train the Word2Vec algorithm on the example text discussed above: “SanFrancisco is a beautiful California city. LosAngeles is a lovely California metropolis”.

Let’s assume that it outputs 2-dimensional vectors for each word, since we can’t visualize anything beyond 3 dimensions.

  • SanFrancisco (6,6)
  • beautiful (-13,-4)
  • California (10,8)
  • city (2,10)
  • LosAngeles (6.5,5)
  • lovely(-12,-7)
  • metropolis(2.5,8)

Below is a 2D Plot of vectors

2DPlot

You can see in the image what the Word2Vec algorithm inferred from the input text: SanFrancisco and LosAngeles are grouped together, beautiful and lovely are grouped together, and city and metropolis are grouped together. The beauty of this is that Word2Vec deduced it purely from the data, without being explicitly taught English or geography.
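
To make the grouping concrete, here is a quick check with NumPy using the toy 2D vectors listed above; cosine similarity is high within each group and negative across unrelated groups, mirroring the plot.

import numpy as np

vectors = {
    "SanFrancisco": np.array([6.0, 6.0]),
    "beautiful": np.array([-13.0, -4.0]),
    "California": np.array([10.0, 8.0]),
    "city": np.array([2.0, 10.0]),
    "LosAngeles": np.array([6.5, 5.0]),
    "lovely": np.array([-12.0, -7.0]),
    "metropolis": np.array([2.5, 8.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["SanFrancisco"], vectors["LosAngeles"]))  # close to 1
print(cosine(vectors["city"], vectors["metropolis"]))          # close to 1
print(cosine(vectors["SanFrancisco"], vectors["beautiful"]))   # negative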

You will see more embedding approaches in the sections below.

Key Characteristics of Embeddings:
  1. Dimensionality: Embeddings are vectors of fixed size. Common sizes range from 50 to 300 dimensions, though they can be larger depending on the complexity of the task.
  2. Continuous Space: Unlike traditional one-hot encoding, embeddings are dense and reside in a continuous vector space, making them more efficient and informative.
  3. Semantic Proximity: Words with similar meanings tend to have vectors that are close to each other in the embedding space.

The Evolution of Embeddings

Embeddings have evolved significantly over the years. Here are some key milestones:

  1. Word2Vec (2013): Developed by Mikolov et al. at Google, Word2Vec was one of the first algorithms to create word embeddings. It uses two architectures—Continuous Bag of Words (CBOW) and Skip-gram—to learn word associations.

  2. GloVe (2014): Developed by the Stanford NLP Group, GloVe (Global Vectors for Word Representation) improves upon Word2Vec by incorporating global statistical information of the corpus.

  3. FastText (2016): Developed by Facebook’s AI Research (FAIR) lab, FastText extends Word2Vec by considering subword information, which helps in handling out-of-vocabulary words and capturing morphological details.

  4. ELMo (2018): Developed by the Allen Institute for AI, ELMo (Embeddings from Language Models) generates context-sensitive embeddings, meaning the representation of a word changes based on its context in a sentence.

  5. BERT (2018): Developed by Google, BERT (Bidirectional Encoder Representations from Transformers) revolutionized embeddings by using transformers to understand the context of a word bidirectionally. This model significantly improved performance on various NLP tasks.

From Word Embeddings to Sentence Embeddings

While word embeddings provide a way to represent individual words, they do not capture the meaning of entire sentences or documents. This limitation led to the development of sentence embeddings, which are designed to represent longer text sequences.

Word Embeddings

Word embeddings, such as those created by Word2Vec, GloVe, and FastText, map individual words to vectors. These embeddings capture semantic similarities between words based on their context within a large corpus of text. For example, the words “king” and “queen” might be close together in the embedding space because they often appear in similar contexts.

Sentence Embeddings

Sentence embeddings extend the concept of word embeddings to entire sentences or even paragraphs. These embeddings aim to capture the meaning of a whole sentence, taking into account the context and relationships between words within the sentence. There are several methods to create sentence embeddings:

  1. Averaging Word Embeddings: One of the simplest methods is to average the word embeddings of all words in a sentence (see the short sketch after this list). While this method is straightforward, it often fails to capture the nuances and syntactic structures of sentences.

  2. Doc2Vec: Developed by Mikolov and Le, Doc2Vec extends Word2Vec to larger text segments by considering the paragraph as an additional feature during training. This method generates embeddings for sentences or documents that capture more context compared to averaging word embeddings.

  3. Recurrent Neural Networks (RNNs): RNNs, particularly Long Short-Term Memory (LSTM) networks, can be used to generate sentence embeddings by processing the sequence of words in a sentence. The hidden state of the RNN after processing the entire sentence can serve as the sentence embedding.

  4. Transformers (BERT, GPT, etc.): Modern approaches like BERT and GPT use transformer architectures to generate context-aware embeddings for sentences. These models can process a sentence bidirectionally, capturing dependencies and relationships between words more effectively than previous methods.
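
As referenced in the first item above, here is a minimal sketch of averaging word embeddings into a sentence embedding; it reuses the toy Word2Vec training setup that appears later in this post, and words missing from the vocabulary are simply skipped.

import numpy as np
from gensim.models import Word2Vec

# Toy corpus and model (same style as the Word2Vec example later in this post)
sentences = [["i", "love", "machine", "learning"], ["word2vec", "is", "great"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

def sentence_embedding(sentence, model):
    """Average the vectors of the in-vocabulary words in a sentence."""
    vectors = [model.wv[w] for w in sentence if w in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

emb = sentence_embedding(["machine", "learning", "is", "great"], model)
print(emb.shape)  # (100,)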

Example: BERT Sentence Embeddings

BERT (Bidirectional Encoder Representations from Transformers) has set a new standard for generating high-quality sentence embeddings. By processing a sentence in both directions, BERT captures the full context of each word in relation to the entire sentence. The embeddings generated by BERT can be fine-tuned for various NLP tasks, such as sentiment analysis, question answering, and text classification.

To create a sentence embedding with BERT, you can use the hidden states of the transformer model. Typically, the hidden state corresponding to the [CLS] token (which stands for “classification”) is used as the sentence embedding.

How to Generate Embeddings

Generating embeddings involves training a model on a large corpus of text data. Here’s a step-by-step guide to generating word and sentence embeddings:

Generating Word Embeddings with Word2Vec
  1. Data Preparation: Collect and preprocess a large text corpus. This involves tokenizing the text, removing stop words, and handling punctuation.

  2. Training the Model: Use the Word2Vec algorithm to train the model. You can choose between the CBOW or Skip-gram architecture. Libraries like Gensim in Python provide easy-to-use implementations of Word2Vec.
    from gensim.models import Word2Vec
    
    # Example sentences
    sentences = [["I", "love", "machine", "learning"], ["Word2Vec", "is", "great"]]
    
    # Train Word2Vec model
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    
  3. Using the Embeddings: Once the model is trained, you can use it to get the embedding for any word in the vocabulary.
    word_embedding = model.wv['machine']
    
Generating Sentence Embeddings with BERT
  1. Install Transformers Library: Use the Hugging Face Transformers library to easily work with BERT.
    pip install transformers
    
  2. Load Pretrained BERT Model: Load a pretrained BERT model and tokenizer.
    from transformers import BertTokenizer, BertModel
    import torch
    
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    
  3. Tokenize Input Text: Tokenize your input text and convert it to input IDs and attention masks.
    sentence = "BERT is amazing for sentence embeddings."
    inputs = tokenizer(sentence, return_tensors='pt')
    
  4. Generate Embeddings: Pass the inputs through the BERT model to get the embeddings.
    with torch.no_grad():
        outputs = model(**inputs)
    
    # The [CLS] token embedding
    sentence_embedding = outputs.last_hidden_state[0][0]
    
  5. Using the Embeddings: The sentence_embedding can now be used for various NLP tasks.

Data Needed for Training Embeddings

The quality of embeddings heavily depends on the data used for training. Here are key considerations regarding the data needed:

  1. Size of the Corpus: A large corpus is generally required to capture the diverse contexts in which words can appear. For example, training Word2Vec or BERT models typically requires billions of words. The larger the corpus, the better the embeddings can capture semantic nuances.

  2. Diversity of the Corpus: The corpus should cover a wide range of topics and genres to ensure that the embeddings are generalizable. This means including text from various domains such as news articles, books, social media, academic papers, and more.

  3. Preprocessing: Proper preprocessing of the corpus is essential (a short sketch follows after this list). This includes:
    • Tokenization: Splitting text into words or subwords.
    • Lowercasing: Converting all text to lowercase to reduce the vocabulary size.
    • Removing Punctuation and Stop Words: Cleaning the text by removing unnecessary punctuation and common stop words that do not contribute to the meaning.
    • Handling Special Characters: Dealing with special characters, numbers, and other non-alphabetic tokens appropriately.
  4. Domain-Specific Data: For specialized applications, it is beneficial to include domain-specific data. For instance, medical embeddings should be trained on medical literature to capture the specialized vocabulary and context of the field.

  5. Balanced Dataset: Ensuring that the dataset is balanced and not biased towards a particular topic or genre helps in creating more neutral and representative embeddings.

  6. Data Augmentation: In cases where data is limited, data augmentation techniques such as back-translation, paraphrasing, and synthetic data generation can be used to enhance the corpus.
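
As a concrete illustration of the preprocessing step above, here is a minimal sketch using NLTK; the tokenizer, stop-word list, and punctuation handling are one reasonable configuration, not a fixed recipe.

import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

def preprocess(text):
    """Tokenize, lowercase, and drop punctuation and English stop words."""
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    return [t for t in tokens if t not in stop_words and t not in string.punctuation]

print(preprocess("Embeddings are numerical representations of text data."))
# ['embeddings', 'numerical', 'representations', 'text', 'data']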

Applications of Sentence Embeddings

Sentence embeddings have a wide range of applications in NLP:

  1. Text Classification: Sentence embeddings are used to represent sentences for classification tasks, such as identifying the topic of a sentence or determining the sentiment expressed in a review.
  2. Semantic Search: By comparing sentence embeddings, search engines can retrieve documents that are semantically similar to a query, even if the exact keywords are not matched.
  3. Summarization: Sentence embeddings help in generating summaries by identifying the most important sentences in a document based on their semantic content.

  4. Translation: Sentence embeddings improve machine translation systems by providing a richer representation of the source sentence, leading to more accurate translations.

Embedding Dimension Reduction Methods

High-dimensional embeddings can be computationally expensive and may contain redundant information. Dimension reduction techniques help in simplifying these embeddings while preserving their essential characteristics. Here are some common methods:

  1. Principal Component Analysis (PCA): PCA is a linear method that reduces the dimensionality of data by transforming it into a new coordinate system where the greatest variances by any projection of the data come to lie on the first coordinates (principal components).
    from sklearn.decomposition import PCA
    
    # Assuming 'embeddings' is a numpy array of shape (n_samples, n_features)
    pca = PCA(n_components=50)
    reduced_embeddings = pca.fit_transform(embeddings)
    
  2. t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a nonlinear technique primarily used for visualizing high-dimensional data by reducing it to two or three dimensions.
    from sklearn.manifold import TSNE
    
    tsne = TSNE(n_components=2)
    reduced_embeddings = tsne.fit_transform(embeddings)
    
  3. Uniform Manifold Approximation and Projection (UMAP): UMAP is another nonlinear technique that is faster and often more effective than t-SNE for dimension reduction, especially for larger datasets.
    import umap
    
    reducer = umap.UMAP(n_components=2)
    reduced_embeddings = reducer.fit_transform(embeddings)
    
  4. Autoencoders: Autoencoders are a type of neural network used to learn efficient codings of input data. An autoencoder consists of an encoder and a decoder. The encoder compresses the input into a lower-dimensional latent space, and the decoder reconstructs the input from this latent space.
    from tensorflow.keras.layers import Input, Dense
    from tensorflow.keras.models import Model
    
    # Define encoder
    input_dim = embeddings.shape[1]
    encoding_dim = 50  # Size of the reduced dimension
    input_layer = Input(shape=(input_dim,))
    encoded = Dense(encoding_dim, activation='relu')(input_layer)
    
    # Define decoder
    decoded = Dense(input_dim, activation='sigmoid')(encoded)
    
    # Build the autoencoder model
    autoencoder = Model(input_layer, decoded)
    encoder = Model(input_layer, encoded)
    
    # Compile and train the autoencoder
    autoencoder.compile(optimizer='adam', loss='mean_squared_error')
    autoencoder.fit(embeddings, embeddings, epochs=50, batch_size=256, shuffle=True)
    
    # Get the reduced embeddings
    reduced_embeddings = encoder.predict(embeddings)
    
  5. Random Projection: Random projection is a simple and computationally efficient technique to reduce dimensionality. It is based on the Johnson-Lindenstrauss lemma, which states that high-dimensional data can be embedded into a lower-dimensional space with minimal distortion.
    from sklearn.random_projection import SparseRandomProjection
    
    transformer = SparseRandomProjection(n_components=50)
    reduced_embeddings = transformer.fit_transform(embeddings)
    

Evaluating Embeddings

Evaluating embeddings is crucial to ensure that they capture meaningful relationships and semantics. Here are some common methods to evaluate embeddings:

  1. Intrinsic Evaluation: These methods evaluate the quality of embeddings based on predefined linguistic tasks or properties without involving downstream tasks.

    • Word Similarity: Measure the cosine similarity between word pairs and compare with human-annotated similarity scores. Popular datasets include WordSim-353 and SimLex-999.
      from scipy.spatial.distance import cosine
      
      similarity = 1 - cosine(embedding1, embedding2)
      
    • Analogy Tasks: Evaluate embeddings based on their ability to solve word analogy tasks, such as “king - man + woman = queen.” Datasets like Google Analogy dataset are commonly used.
      def analogy(model, word1, word2, word3):
          vec = model[word1] - model[word2] + model[word3]
          return model.most_similar([vec])[0][0]
      
  2. Extrinsic Evaluation: These methods evaluate embeddings based on their performance on downstream NLP tasks.

    • Text Classification: Use embeddings as features for text classification tasks and measure performance using metrics like accuracy, precision, recall, and F1 score.
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score
      
      model = LogisticRegression()
      model.fit(train_embeddings, train_labels)
      predictions = model.predict(test_embeddings)
      accuracy = accuracy_score(test_labels, predictions)
      
    • Named Entity Recognition (NER): Evaluate embeddings by their performance on NER tasks, measuring precision, recall, and F1 score.
      # Example using spaCy for NER
      import spacy
      from spacy.tokens import DocBin

      nlp = spacy.load("en_core_web_sm")
      ner = nlp.get_pipe("ner")
      ner.add_label("ORG")

      train_docs = [nlp(text) for text in train_texts]
      train_db = DocBin(docs=train_docs)
      
    • Machine Translation: Assess the quality of embeddings by their impact on machine translation tasks, using BLEU or METEOR scores.
  3. Clustering and Visualization: Visualizing embeddings using t-SNE or UMAP can provide qualitative insights into the structure and quality of embeddings.

    import matplotlib.pyplot as plt
    
    tsne = TSNE(n_components=2)
    reduced_embeddings = tsne.fit_transform(embeddings)
    
    plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1])
    for i, word in enumerate(words):
        plt.annotate(word, xy=(reduced_embeddings[i, 0], reduced_embeddings[i, 1]))
    plt.show()
    

Similarity vs. Retrieval Embeddings

Embeddings can be tailored for different purposes, such as similarity or retrieval tasks. Understanding the distinction between these two types of embeddings is crucial for optimizing their use in various applications.

Similarity Embeddings

Similarity embeddings are designed to capture the semantic similarity between different pieces of text. The primary goal is to ensure that semantically similar texts have similar embeddings.

Use Cases:

  • Semantic Search: Finding documents or sentences that are semantically similar to a query.
  • Recommendation Systems: Recommending items (e.g., articles, products) that are similar to a given item.
  • Paraphrase Detection: Identifying sentences or phrases that convey the same meaning.

Evaluation:

  • Cosine Similarity: Measure the cosine similarity between embeddings to evaluate their closeness.
    from sklearn.metrics.pairwise import cosine_similarity
    
    similarity = cosine_similarity([embedding1], [embedding2])
    
  • Clustering: Grouping similar items together using clustering algorithms like K-means.
    from sklearn.cluster import KMeans
    
    kmeans = KMeans(n_clusters=5)
    clusters = kmeans.fit_predict(embeddings)
    
Retrieval Embeddings

Retrieval embeddings are optimized for information retrieval tasks, where the goal is to retrieve the most relevant documents from a large corpus based on a query.

Use Cases:

  • Search Engines: Retrieving relevant web pages or documents based on user queries.
  • Question Answering Systems: Finding relevant passages or documents that contain the answer to a user’s question.
  • Document Retrieval: Retrieving documents that are most relevant to a given query.

Evaluation:

  • Precision and Recall: Measure the accuracy of retrieved documents using precision, recall, and F1 score.
    from sklearn.metrics import precision_score, recall_score, f1_score
    
    precision = precision_score(true_labels, predicted_labels, average='weighted')
    recall = recall_score(true_labels, predicted_labels, average='weighted')
    f1 = f1_score(true_labels, predicted_labels, average='weighted')
    
  • Mean Reciprocal Rank (MRR): Evaluate the rank of the first relevant document.
    def mean_reciprocal_rank(rs):
        """Score is reciprocal of the rank of the first relevant item
        First element is 'rank 1'.  Relevance is binary (nonzero is relevant).
        Example from information retrieval with binary relevance:
        >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
        >>> mean_reciprocal_rank(rs)
        0.61111111111111105
        """
        rs = (np.asarray(r).nonzero()[0] for r in rs)
        return np.mean([1. / (r[0] + 1) if r.size else 0. for r in rs])
    

Symmetric vs. Asymmetric Embeddings

Symmetric and asymmetric embeddings are designed to handle different types of relationships in data, and understanding their differences can help in choosing the right approach for specific tasks.

Symmetric Embeddings

Symmetric embeddings are used when the relationship between two items is mutual. The similarity between two items is expected to be the same regardless of the order in which they are compared.

Use Cases:

  • Similarity Search: Comparing the similarity between two items, such as text or images, where the similarity score should be the same in both directions.
  • Collaborative Filtering: Recommending items based on mutual user-item interactions, where the relationship is bidirectional.

Evaluation:

  • Cosine Similarity: Symmetric embeddings often use cosine similarity to measure the closeness of vectors.
    similarity = cosine_similarity([embedding1], [embedding2])
    
Asymmetric Embeddings

Asymmetric embeddings are used when the relationship between two items is directional. The similarity or relevance of one item to another may not be the same when the order is reversed.

Use Cases:

  • Information Retrieval: Retrieving relevant documents for a query, where the relevance of a document to a query is not necessarily the same as the relevance of the query to the document.
  • Knowledge Graph Embeddings: Representing entities and relationships in a knowledge graph, where the relationship is directional (e.g., parent-child, teacher-student).

Evaluation:

  • Rank-Based Metrics: Asymmetric embeddings often use rank-based metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) to evaluate performance.
    def mean_reciprocal_rank(rs):
        rs = (np.asarray(r).nonzero()[0] for r in rs)
        return np.mean([1. / (r[0] + 1) if r.size else 0. for r in rs])
    

The Future of Embeddings

The field of embeddings is rapidly evolving. Researchers are exploring new ways to create more efficient and accurate representations, such as using unsupervised learning and combining embeddings with other techniques like graph networks. The ongoing advancements in this area promise to further enhance the capabilities of NLP systems.

Conclusion

Embeddings have revolutionized the field of NLP, providing a robust and efficient way to represent and process textual data. From word embeddings to sentence embeddings, these techniques have enabled significant advancements in how machines understand and interact with human language. With the help of dimension reduction methods, evaluation techniques, and tailored similarity and retrieval embeddings, embeddings can be optimized for a wide range of NLP tasks. Understanding the differences between symmetric and asymmetric embeddings further allows for more specialized applications. As we continue to develop more sophisticated models and techniques, embeddings will undoubtedly play a crucial role in advancing our understanding and interaction with human language.

Learned index

19 October 2018 at 00:00

Jeff Dean and co. came up with a seminal paper on whether indexes can be learned using neural networks. I gave a talk in the Saama Tech Talk Series on this topic. I guess I am moving more towards talking rather than writing nowadays.
