
🚀 How I Adopted the Lean Startup Mindset to Drive Innovation in My Team

By: angu10
11 January 2025 at 18:23

How I Adopted a Lean Startup Mindset in My Team’s Product Development 🚀

Developing innovative products in a world of uncertainty requires a mindset shift. On my team, we’ve adopted the Lean Startup mindset to ensure that every product we build is validated by real user needs and designed for scalability. Here’s how we integrated this approach into our team:

1. Value Hypothesis: Testing What Matters Most

We start by hypothesizing the value our product delivers. Since customers may not always articulate their needs, we focus on educating them about the problem and demonstrating how our solution fits into their lives. Through early user engagement and feedback, we validate whether the product solves a real problem.

2. Growth Hypothesis: Building for Scalability

Once we validate the product's value, we focus on testing its technical scalability. We run controlled experiments with system architecture, performance optimization, and infrastructure design to ensure our solution can handle growing user demands. Each iteration helps us identify potential bottlenecks, improve system reliability, and establish robust engineering practices that support future growth.

3. Minimum Viable Product (MVP): Launching to Learn

Instead of waiting to perfect our product, we launch an MVP to get it in front of users quickly. The goal is to learn, not to impress. By observing how users interact with the MVP, we gain valuable insights to prioritize features, fix pain points, and improve the user experience.

Fostering a Lean Mindset

Adopting the Lean Startup framework has been transformative for our team. It’s taught us to embrace experimentation, view failures as learning opportunities, and focus on delivering value to our users.

If you’re building a product and want to innovate smarter, consider adopting the Lean Startup mindset.

Building a Secure Web Application with AWS VPC, RDS, and a Simple Registration Page

31 December 2024 at 09:41

Here, we will see how to set up a Virtual Private Cloud (VPC) with two subnets: a public subnet to host a web application and a private subnet to host a secure RDS (Relational Database Service) instance. We’ll also build a simple registration page hosted in the public subnet, which will log user input into the RDS instance.

By the end of this tutorial, you will have a functional web application where user data from a registration form is captured and stored securely in a private RDS instance.

  1. VPC Setup: We will create a VPC with two subnets:

    • Public Subnet: Hosts a simple HTML-based registration page with an EC2 instance.
    • Private Subnet: Hosts an RDS instance (e.g., MySQL or PostgreSQL) to store registration data.
  2. Web Application: A simple registration page on the public subnet will allow users to input their data (e.g., name, email, and password). When submitted, this data will be logged into the RDS database in the private subnet.

  3. Security:

    • The EC2 instance will be in the public subnet, accessible from the internet.
    • The RDS instance will reside in the private subnet, isolated from direct public access for security purposes.
  4. Routing: We will set up appropriate route tables and security groups to ensure the EC2 instance in the public subnet can communicate with the RDS instance in the private subnet, but the RDS instance will not be accessible from the internet.

Step 1: Create a VPC with Public and Private Subnets

  1. Create the VPC:

    • Open the VPC Console in the AWS Management Console.
    • Click Create VPC and enter the details:
      • CIDR Block: 10.0.0.0/16 (this is the range of IP addresses your VPC will use).
      • Name: e.g., MyVPC.
  2. Create Subnets:

    • Public Subnet:
      • CIDR Block: 10.0.1.0/24
      • Name: PublicSubnet
      • Availability Zone: Choose an available zone.
    • Private Subnet:
      • CIDR Block: 10.0.2.0/24
      • Name: PrivateSubnet
      • Availability Zone: Choose a different zone.
  3. Create an Internet Gateway (IGW):

    • In the VPC Console, create an Internet Gateway and attach it to your VPC.
  4. Update the Route Table for Public Subnet:

    • Create or modify the route table for the public subnet to include a route to the Internet Gateway (0.0.0.0/0 → IGW).
  5. Update the Route Table for Private Subnet:

    • Create or modify the route table for the private subnet. If the private subnet needs outbound internet access (for example, to download packages), create a NAT Gateway in the public subnet and add a route to it; otherwise the private subnet can remain fully isolated.
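
For reference, the console steps above can also be scripted. Below is a minimal boto3 sketch of Step 1, reusing the example names and CIDR blocks from this walkthrough; the region and Availability Zones are assumptions you should adjust for your account:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # pick your region

# 1. Create the VPC and the two subnets
vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
public_subnet = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.1.0/24", AvailabilityZone="us-east-1a"
)["Subnet"]["SubnetId"]
private_subnet = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.2.0/24", AvailabilityZone="us-east-1b"
)["Subnet"]["SubnetId"]

# 2. Create and attach an Internet Gateway
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

# 3. Route table for the public subnet: send 0.0.0.0/0 to the IGW
public_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=public_rt, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
ec2.associate_route_table(RouteTableId=public_rt, SubnetId=public_subnet)
```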

Step 2: Launch EC2 Instance in Public Subnet for Webpage Hosting

  1. Launch EC2 Instance:

    • Go to the EC2 Console, and launch a new EC2 instance using an Ubuntu or Amazon Linux AMI.
    • Select the Public Subnet and assign a public IP to the instance.
    • Attach a Security Group that allows inbound traffic on HTTP (port 80).
  2. Install Apache Web Server:

    • SSH into your EC2 instance and install Apache:
     sudo apt update
     sudo apt install apache2
    
  3. Create the Registration Page:

    • In /var/www/html, create an HTML file for the registration form (e.g., index.html):
     <html>
       <body>
         <h1>Registration Form</h1>
         <form action="register.php" method="post">
           Name: <input type="text" name="name"><br>
           Email: <input type="email" name="email"><br>
           Password: <input type="password" name="password"><br>
           <input type="submit" value="Register">
         </form>
       </body>
     </html>
    
  4. Configure Apache:

  • Edit the Apache config files to ensure the server is serving the HTML page and can handle POST requests. You can use PHP or Python (Flask, Django) for handling backend processing.

Step 3: Launch RDS Instance in Private Subnet

  1. Create the RDS Instance:

    • In the RDS Console, create a new MySQL or PostgreSQL database instance.
    • Ensure the database is not publicly accessible (so it stays secure in the private subnet).
    • Choose a DB subnet group that contains your private subnet for deployment. (Note: RDS requires a DB subnet group with subnets in at least two Availability Zones, so you may need a second private subnet.)
  2. Security Groups:

    • Create a Security Group for the RDS instance that allows inbound traffic on port 3306 (for MySQL) or 5432 (for PostgreSQL) from the public subnet EC2 instance.
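
The security-group relationship described above (database reachable only from the web server's security group, never from the internet) can be expressed with boto3 roughly as follows; the two group IDs are placeholders for the groups created in the console:

```python
import boto3

ec2 = boto3.client("ec2")

# Allow MySQL traffic into the RDS security group, but only from the
# security group attached to the EC2 web server (not from the internet).
ec2.authorize_security_group_ingress(
    GroupId="sg-0db0000000000000",  # placeholder: RDS security group
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 3306,   # use 5432 for PostgreSQL
            "ToPort": 3306,
            "UserIdGroupPairs": [
                {"GroupId": "sg-0web000000000000"}  # placeholder: EC2 security group
            ],
        }
    ],
)
```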

Step 4: Connect the EC2 Web Server to RDS

  1. Install MySQL Client on EC2:

    • SSH into your EC2 instance and install the MySQL client:
     sudo apt-get install mysql-client
    
  2. Test Database Connectivity:

    • Test the connection to the RDS instance from the EC2 instance using the database endpoint:
     mysql -h <RDS-endpoint> -u <username> -p
    
  3. Create the Database and Table:

    • Once connected, create a database and table to store the registration data:
     CREATE DATABASE registration_db;
     USE registration_db;
     CREATE TABLE users (
       id INT AUTO_INCREMENT PRIMARY KEY,
       name VARCHAR(100),
       email VARCHAR(100),
       password VARCHAR(100)
     );
    

Step 5: Handle Form Submissions and Store Data in RDS

  1. Backend Processing:

    • You can use PHP, Python (Flask/Django), or Node.js to handle the form submission.
    • Example using PHP:
      • Install PHP and the MySQL extension for PHP:
       sudo apt install php libapache2-mod-php php-mysql
    
 - Create a PHP script to handle the form submission (`register.php`):
   ```php
   <?php
   if ($_SERVER["REQUEST_METHOD"] == "POST") {
       $name = $_POST['name'];
       $email = $_POST['email'];
       $password = $_POST['password'];
       // Connect to RDS MySQL database
       $conn = new mysqli("<RDS-endpoint>", "<username>", "<password>", "registration_db");
       if ($conn->connect_error) {
           die("Connection failed: " . $conn->connect_error);
       }
       // Insert user data into database
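        // Note: for anything beyond a demo, use prepared statements (mysqli prepare/bind_param)
        // and hash the password instead of storing it in plain text.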
       $sql = "INSERT INTO users (name, email, password) VALUES ('$name', '$email', '$password')";
       if ($conn->query($sql) === TRUE) {
           echo "New record created successfully";
       } else {
           echo "Error: " . $sql . "<br>" . $conn->error;
       }
       $conn->close();
   }
   ?>
   ```
 - Place this script in /var/www/html (the Apache document root) and configure Apache to serve the form.




Step 6: Test the Registration Form

  1. Access the Webpage:

    • Open a browser and go to the public IP address of the EC2 instance (e.g., http://<EC2-Public-IP>).
  2. Submit the Registration Form:

    • Enter a name, email, and password, then submit the form.
  • Check the RDS database to ensure the data has been correctly inserted.


By following these steps, we have successfully built a secure and scalable web application on AWS. The EC2 instance in the public subnet hosts the registration page, and the private subnet securely stores user data in an RDS instance. We have ensured security by isolating the RDS instance from public access, using VPC subnets, and configuring appropriate security groups.

Building a Highly Available and Secure Web Application Architecture with VPCs, Load Balancers, and Private Subnets

31 December 2024 at 09:29

Overview

1. Single VPC with Public and Private Subnets

In this architecture, we will use a single VPC that consists of both public and private subnets. Each subnet serves different purposes:

Public Subnet:

  • Hosts the website served by EC2 instances.
  • The EC2 instances are managed by an Auto Scaling Group (ASG) to ensure high availability and scalability.
  • A Load Balancer (ALB) distributes incoming traffic across the EC2 instances.

Private Subnet:

  • Hosts an RDS database, which securely stores the data submitted via the website.
  • The EC2 instances in the public subnet interact with the RDS instance in the private subnet via a private IP.
  • The private subnet has a VPC Endpoint to access S3 securely without traversing the public internet.

2. Route 53 Integration for Custom Domain Name

Using AWS Route 53, you can create a DNS record to point to the Load Balancer's DNS name, which allows users to access the website via a custom domain name. This step ensures that your application is accessible from a friendly, branded URL.

3. Secure S3 Access via VPC Endpoint

To securely interact with Amazon S3 from the EC2 instances in the private subnet, we will use an S3 VPC Endpoint. This VPC endpoint ensures that all traffic between the EC2 instances and S3 happens entirely within the AWS network, avoiding the public internet and enhancing security.

4. VPC Peering for Inter-VPC Communication

In some cases, you may want to establish communication between two VPCs for resource sharing or integration. VPC Peering or Transit Gateways are used to connect different VPCs, ensuring resources in one VPC can communicate with resources in another VPC securely.

Step 1: Set Up the VPC and Subnets

  1. Create a VPC:

    • Use the AWS VPC Wizard or AWS Management Console to create a VPC with a CIDR block (e.g., 10.0.0.0/16).
  2. Create Subnets:

    • Public Subnet: Assign a CIDR block like 10.0.1.0/24 to the public subnet. This subnet will host your web servers and load balancer.
    • Private Subnet: Assign a CIDR block like 10.0.2.0/24 to the private subnet, where your RDS instances will reside.
  3. Internet Gateway:

    • Attach an Internet Gateway to the VPC and route traffic from the public subnet to the internet.
  4. Route Table for Public Subnet:

    • Ensure that the public subnet has a route to the Internet Gateway so that traffic can flow in and out.
  5. Route Table for Private Subnet:

    • The private subnet should not have direct internet access. Instead, use a NAT Gateway in the public subnet for outbound internet access from the private subnet, if required.

Step 2: Set Up the Load Balancer (ALB)

  1. Create an Application Load Balancer (ALB):

    • Navigate to the EC2 console, select Load Balancers, and create an Application Load Balancer (ALB).
    • Choose the public subnet to deploy the ALB and configure listeners on port 80 (HTTP) or 443 (HTTPS).
    • Assign security groups to the ALB to allow traffic on these ports.
  2. Create Target Groups:

    • Create target groups for the ALB that point to your EC2 instances or Auto Scaling Group.
  3. Add EC2 Instances to the Target Group:

    • Add EC2 instances from the public subnet to the target group for load balancing.
  4. Configure Auto Scaling Group (ASG):

    • Create an Auto Scaling Group (ASG) with a launch configuration to automatically scale EC2 instances based on traffic load.

Step 3: Set Up Amazon RDS in the Private Subnet

  1. Launch an RDS Instance:

    • In the AWS RDS Console, launch an RDS database instance (e.g., MySQL, PostgreSQL) within the private subnet.
    • Ensure the RDS instance is not publicly accessible, keeping it secure within the VPC.
  2. Connect EC2 to RDS:

    • Ensure that your EC2 instances in the public subnet can connect to the RDS instance in the private subnet using private IPs.

Step 4: Set Up the S3 VPC Endpoint for Secure S3 Access

  1. Create a VPC Endpoint for S3:

    • In the VPC Console, navigate to Endpoints and create a Gateway VPC Endpoint for S3.
    • Select the private subnet and configure the route table to ensure traffic to S3 goes through the VPC endpoint.
  2. Configure Security Group and IAM Role:

    • Ensure your EC2 instances have the necessary IAM roles to access the S3 bucket.
    • Attach security groups to allow outbound traffic to the S3 VPC endpoint.
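
If you prefer to script this part, a Gateway endpoint for S3 can be created with boto3 roughly as follows; the VPC ID, route table ID and region-specific service name are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoints are attached to route tables, so S3-bound traffic from the
# private subnet stays on the AWS network instead of going over the internet.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",                 # placeholder VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",      # matches the chosen region
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],       # private subnet's route table
)
```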

Step 5: Set Up Route 53 for Custom Domain

  1. Create a Hosted Zone:

    • In the Route 53 Console, create a hosted zone for your domain (e.g., example.com).
  2. Create Record Set for the Load Balancer:

    • Create an A Record or CNAME Record pointing to the DNS name of the ALB (e.g., mywebsite-1234567.elb.amazonaws.com).
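
The alias record for the ALB can also be created programmatically. Here is a minimal boto3 sketch, with the hosted zone IDs, domain name and ALB DNS name as placeholders (the ALB's own hosted zone ID is shown on its details page):

```python
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",  # placeholder: your domain's hosted zone
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com",
                    "Type": "A",
                    "AliasTarget": {
                        "HostedZoneId": "Z35SXDOTRQ7X7K",  # placeholder: the ALB's zone ID
                        "DNSName": "mywebsite-1234567.elb.amazonaws.com",
                        "EvaluateTargetHealth": False,
                    },
                },
            }
        ]
    },
)
```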

Step 6: Set Up VPC Peering (Optional)

  1. Create VPC Peering:

    • If you need to connect two VPCs (e.g., for inter-VPC communication), create a VPC Peering Connection.
  2. Configure Routes:

    • Update the route tables in both VPCs, adding routes that allow traffic to flow between the VPCs via the peering connection.

With the use of public and private subnets, Auto Scaling Groups, Application Load Balancers, and VPC Endpoints, we can build a resilient infrastructure. Integrating Route 53 for custom domain management and VPC Peering for inter-VPC communication completes the solution for a fully managed, secure web application architecture on AWS.

Managing EKS Clusters Using AWS Lambda: A Step-by-Step Approach

By: Ragul.M
20 December 2024 at 12:20

Efficiently managing Amazon Elastic Kubernetes Service (EKS) clusters is critical for maintaining cost-effectiveness and performance. Automating the process of starting and stopping EKS clusters using AWS Lambda ensures optimal utilization and reduces manual intervention. Below is a structured approach to achieve this.

1. Define the Requirements

  • Identify the clusters that need automated start/stop operations.
  • Determine the dependencies among clusters, if any, to ensure smooth transitions.
  • Establish the scaling logic, such as leveraging tags to specify operational states (e.g., auto-start, auto-stop).

2. Prepare the Environment

  • AWS CLI Configuration: Ensure the AWS CLI is set up with appropriate credentials and access.
  • IAM Role for Lambda:
    • Create a role with permissions to manage EKS clusters (eks:DescribeCluster, eks:UpdateNodegroupConfig, etc.).
    • Include logging permissions for CloudWatch Logs to monitor the Lambda function execution.

3. Tag EKS Clusters

  • Use resource tagging to identify clusters for automation.
  • Example tags:
    • auto-start=true: Indicates clusters that should be started by the Lambda function.
    • dependency=<cluster-name>: Specifies any inter-cluster dependencies.

4. Design the Lambda Function

  • Trigger Setup:
    • Use CloudWatch Events or schedule triggers (e.g., daily or weekly) to invoke the function.
  • Environment Variables: Configure the function with environment variables for managing cluster names and dependency details.
  • Scaling Configuration: Ensure the function dynamically retrieves scaling logic via tags to handle operational states.

5. Define the Workflow

  • Fetch Cluster Information: Use AWS APIs to retrieve cluster details, including their tags and states.
  • Check Dependencies:
    • Identify dependent clusters and validate their status before initiating operations on others.
  • Start/Stop Clusters:
    • Update node group configurations or use cluster-level start/stop APIs where supported.
  • Implement Logging and Alerts: Capture the execution details and errors in CloudWatch Logs.

(If you want my code, just comment "ease-py-code" on my blog and I will share it with you 🫶)
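
Until then, here is a minimal boto3 sketch of what such a handler could look like. The tag names (auto-start, auto-stop), the event field, and the node-group sizes are assumptions based on the scheme above, not production-ready code:

```python
import boto3

eks = boto3.client("eks")


def lambda_handler(event, context):
    # event["action"] is assumed to be "start" or "stop" (set by the schedule rule)
    action = event.get("action", "stop")

    for cluster_name in eks.list_clusters()["clusters"]:
        cluster = eks.describe_cluster(name=cluster_name)["cluster"]
        tags = cluster.get("tags", {})

        # Only touch clusters explicitly opted in via tags
        if tags.get(f"auto-{action}") != "true":
            continue

        for nodegroup in eks.list_nodegroups(clusterName=cluster_name)["nodegroups"]:
            if action == "stop":
                scaling = {"minSize": 0, "maxSize": 1, "desiredSize": 0}
            else:
                scaling = {"minSize": 1, "maxSize": 3, "desiredSize": 2}  # assumed sizes

            eks.update_nodegroup_config(
                clusterName=cluster_name,
                nodegroupName=nodegroup,
                scalingConfig=scaling,
            )
            print(f"{action}: {cluster_name}/{nodegroup} -> {scaling}")

    return {"status": "done"}
```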

6. Test and Validate

  • Dry Runs: Perform simulations to ensure the function executes as expected without making actual changes.
  • Dependency Scenarios: Test different scenarios involving dependencies to validate the logic.
  • Error Handling: Verify retries and exception handling for potential API failures.

7. Deploy and Monitor

  • Deploy the Function: Once validated, deploy the Lambda function in the desired region.
  • Set Up Monitoring:
    • Use CloudWatch Metrics to monitor function executions and errors.
    • Configure alarms for failure scenarios to take corrective actions.

By automating the start and stop operations for EKS clusters, organizations can significantly enhance resource management and optimize costs. This approach provides scalability and ensures that inter-cluster dependencies are handled efficiently.

Follow for more and happy learning :)

Automating RDS Snapshot Management for Daily Testing

18 December 2024 at 06:07

Step 1: Create a Snapshot of the RDS Instance 
Creating a snapshot ensures you have a backup of the current RDS state. This snapshot can be used to restore the RDS instance later. 

Steps to Create a Snapshot via AWS Management Console: 

  1. Navigate to the RDS Dashboard
  2. Select the RDS instance you want to back up. 
  3. Click Actions > Take Snapshot
  4. Provide a name for the snapshot (e.g., rds-snapshot-test-date). 
  5. Click Take Snapshot

Automating Snapshot Creation with AWS CLI:

 

aws rds create-db-snapshot \
    --db-snapshot-identifier rds-snapshot-test-date \
    --db-instance-identifier your-rds-instance-id

Step 2: Use the RDS Instance for Testing 
Once the snapshot is created, continue using the RDS instance for your testing activities for the day. Ensure you document any changes made during testing, as these will not persist after restoring the instance from the snapshot. 

Step 3: Rename and Delete the RDS Instance 
At the end of the day, rename the existing RDS instance and delete it to avoid unnecessary costs. 

Steps to Rename the RDS Instance via AWS Management Console: 

  1. Navigate to the RDS Dashboard
  2. Select the RDS instance. 
  3. Click Actions > Modify
  4. Update the DB Instance Identifier (e.g., rds-instance-test-old). 
  5. Save the changes and wait for the instance to update. 

Steps to Delete the RDS Instance: 

  1. Select the renamed instance. 
  2. Click Actions > Delete
  3. Optionally, skip creating a final snapshot if you already have one. 
  4. Confirm the deletion. 

Automating Rename and Delete via AWS CLI:

 

# Rename the RDS instance
aws rds modify-db-instance \
    --db-instance-identifier your-rds-instance-id \
    --new-db-instance-identifier rds-instance-test-old

# Delete the RDS instance
aws rds delete-db-instance \
    --db-instance-identifier rds-instance-test-old \
    --skip-final-snapshot

Step 4: Restore the RDS Instance from the Snapshot 
Before starting the next day’s testing, restore the RDS instance from the snapshot created earlier. 

Steps to Restore an RDS Instance via AWS Management Console: 

  1. Navigate to the Snapshots section in the RDS Dashboard
  2. Select the snapshot you want to restore. 
  3. Click Actions > Restore Snapshot
  4. Provide a new identifier for the RDS instance (e.g., rds-instance-test). 
  5. Configure additional settings if needed and click Restore DB Instance

Automating Restore via AWS CLI:

 

aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier rds-instance-test \
    --db-snapshot-identifier rds-snapshot-test-date

Optional: Automate the Process with a Script 
To streamline these steps, you can use a script combining AWS CLI commands. Below is an example script:

 

#!/bin/bash

# Variables
RDS_INSTANCE_ID="your-rds-instance-id"
SNAPSHOT_ID="rds-snapshot-$(date +%F)"
RESTORED_RDS_INSTANCE_ID="rds-instance-test"

# Step 1: Create a Snapshot
echo "Creating snapshot..."
aws rds create-db-snapshot \
    --db-snapshot-identifier $SNAPSHOT_ID \
    --db-instance-identifier $RDS_INSTANCE_ID

# Wait until the snapshot is available before touching the instance
aws rds wait db-snapshot-available \
    --db-snapshot-identifier $SNAPSHOT_ID

# Step 2: Rename and Delete RDS Instance
echo "Renaming and deleting RDS instance..."
aws rds modify-db-instance \
    --db-instance-identifier $RDS_INSTANCE_ID \
    --new-db-instance-identifier "${RDS_INSTANCE_ID}-old" \
    --apply-immediately

# Wait for the rename to take effect before deleting
aws rds wait db-instance-available \
    --db-instance-identifier "${RDS_INSTANCE_ID}-old"

aws rds delete-db-instance \
    --db-instance-identifier "${RDS_INSTANCE_ID}-old" \
    --skip-final-snapshot

# Step 3: Restore RDS from Snapshot
echo "Restoring RDS instance from snapshot..."
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier $RESTORED_RDS_INSTANCE_ID \
    --db-snapshot-identifier $SNAPSHOT_ID

How to Create a Lambda Function to Export IAM Users to S3 as a CSV File

By: Ragul.M
16 December 2024 at 15:36

Managing AWS resources efficiently often requires automation. One common task is exporting a list of IAM users into a CSV file for auditing or reporting purposes. AWS Lambda is an excellent tool to achieve this, combined with the power of S3 for storage. Here's a step-by-step guide:

Step 1: Understand the Requirements
Before starting, ensure you have the following:

  • IAM permissions to list users (iam:ListUsers) and access S3 (s3:PutObject).
  • An existing S3 bucket to store the generated CSV file.
  • A basic understanding of AWS Lambda and its environment.

Step 2: Create an S3 Bucket

  1. Log in to the AWS Management Console.
  2. Navigate to S3 and create a new bucket or use an existing one.
  3. Note the bucket name for use in the Lambda function.

Step 3: Set Up a Lambda Function

  1. Go to the Lambda service in the AWS Console.
  2. Click on Create Function and choose the option to create a function from scratch.
  3. Configure the runtime environment (e.g., Python or Node.js).
  4. Assign an appropriate IAM role to the Lambda function with permissions for IAM and S3 operations. (If you want my code, just comment "ease-py-code" on my blog and I will share it with you 🫶)

Step 4: Implement Logic for IAM and S3

  • The Lambda function will:
    • Retrieve a list of IAM users using the AWS SDK.
    • Format the list into a CSV structure.
    • Upload the file to the specified S3 bucket.
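
The author shares the full code on request; as a reference point, a minimal boto3 sketch of that logic might look like this (the bucket name, key, and CSV columns are placeholders):

```python
import csv
import io
import boto3

iam = boto3.client("iam")
s3 = boto3.client("s3")

BUCKET = "my-iam-report-bucket"  # placeholder bucket name


def lambda_handler(event, context):
    # Collect every IAM user (list_users is paginated, so use a paginator)
    rows = []
    for page in iam.get_paginator("list_users").paginate():
        for user in page["Users"]:
            rows.append([user["UserName"], user["UserId"], user["CreateDate"].isoformat()])

    # Build the CSV in memory
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["UserName", "UserId", "CreateDate"])
    writer.writerows(rows)

    # Upload the CSV to S3
    s3.put_object(Bucket=BUCKET, Key="iam-users.csv", Body=buffer.getvalue())
    return {"users_exported": len(rows)}
```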

Step 5: Test the Function

  1. Use the AWS Lambda testing tools to trigger the function.
  2. Verify that the CSV file is successfully uploaded to the S3 bucket.

Step 6: Monitor and Review

  • Check the S3 bucket for the uploaded CSV files.
  • Review the Lambda logs in CloudWatch to ensure the function runs successfully.

By following these steps, you can automate the task of exporting IAM user information into a CSV file and store it securely in S3, making it easier to track and manage your AWS users.

Follow for more and happy learning :)

Automating AWS Cost Management Reports with Lambda

By: Ragul.M
11 December 2024 at 16:08

Monitoring AWS costs is essential for keeping budgets in check. In this guide, we’ll walk through creating an AWS Lambda function to retrieve cost details and send them to email (via SES) and Slack.
Prerequisites

  1. AWS Account with IAM permissions for Lambda, SES, and Cost Explorer.
  2. Slack Webhook URL to send messages.
  3. Configured SES Email for notifications.
  4. S3 Bucket for storing cost reports as CSV files.

Step 1: Enable Cost Explorer

  • Go to AWS Billing Dashboard > Cost Explorer.
  • Enable Cost Explorer to access detailed cost data.

Step 2: Create an S3 Bucket

  • Create an S3 bucket (e.g., aws-cost-reports) to store cost reports.
  • Ensure the bucket has appropriate read/write permissions for Lambda.

Step 3: Write the Lambda Code

  1. Create a Lambda Function

    • Go to AWS Lambda > Create Function.
    • Select a Python runtime (e.g., Python 3.9).
  2. Add Dependencies

    • Use a Lambda layer or package libraries like boto3 and slack_sdk.
  3. Write your Python code and execute it; a rough sketch follows below. (If you want my code, just comment "ease-py-code" on my blog and I will share it with you 🫶)
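
Since the full code is shared only on request, here is a rough sketch of the pieces described above: the Cost Explorer query, the SES email, and the Slack webhook (the S3 upload is analogous to the IAM export example). The addresses, webhook URL, and date range are placeholders:

```python
import json
import urllib.request
from datetime import date, timedelta

import boto3

ce = boto3.client("ce")
ses = boto3.client("ses")

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


def lambda_handler(event, context):
    end = date.today()
    start = end - timedelta(days=7)

    # 1. Query Cost Explorer for the last week, grouped by service
    result = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    report = json.dumps(result["ResultsByTime"], indent=2, default=str)

    # 2. Email the report via SES (both addresses must be verified in SES)
    ses.send_email(
        Source="reports@example.com",
        Destination={"ToAddresses": ["team@example.com"]},
        Message={
            "Subject": {"Data": "Weekly AWS cost report"},
            "Body": {"Text": {"Data": report}},
        },
    )

    # 3. Post a short summary to Slack via the incoming webhook
    payload = json.dumps({"text": f"AWS cost report for {start} to {end} sent by email."})
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=payload.encode(), headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

    return {"status": "sent"}
```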

Step 4: Add S3 Permissions
Update the Lambda execution role to allow s3:PutObject, ses:SendEmail, and ce:GetCostAndUsage.

Step 5: Test the Lambda

  1. Trigger the Lambda manually using a test event.
  2. Verify the cost report is:
    • Uploaded to the S3 bucket.
    • Emailed via SES.
    • Notified in Slack.

Conclusion
With this setup, AWS cost reports are automatically delivered to your inbox and Slack, keeping you updated on spending trends. Fine-tune this solution by customizing the report frequency or grouping costs by other dimensions.

Follow for more and happy learning :)

Exploring Kubernetes: A Step Ahead of Basics

By: Ragul.M
10 December 2024 at 05:35

Kubernetes is a powerful platform that simplifies the management of containerized applications. If you’re familiar with the fundamentals, it’s time to take a step further and explore intermediate concepts that enhance your ability to manage and optimize Kubernetes clusters.

1. Understanding Deployments

A Deployment ensures your application runs reliably by managing scaling, updates, and rollbacks.

2. Using ConfigMaps and Secrets

Kubernetes separates application configuration and sensitive data from the application code using ConfigMaps and Secrets.

ConfigMaps
Store non-sensitive configurations, such as environment variables or application settings.

kubectl create configmap app-config --from-literal=ENV=production 

3. Liveness and Readiness Probes

Probes ensure your application is healthy and ready to handle traffic.

Liveness Probe
Checks if your application is running. If it fails, Kubernetes restarts the pod.

Readiness Probe
Checks if your application is ready to accept traffic. If it fails, Kubernetes stops routing requests to the pod.
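
If you manage clusters from Python, Deployments and probes can also be expressed with the official Kubernetes Python client. The sketch below is illustrative only: the image, labels, ports and probe paths are placeholders, and it assumes a working kubeconfig:

```python
from kubernetes import client, config

config.load_kube_config()  # assumes local kubeconfig with cluster access

container = client.V1Container(
    name="my-app",
    image="nginx:1.25",  # placeholder image
    ports=[client.V1ContainerPort(container_port=80)],
    # Liveness: restart the pod if this check fails
    liveness_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/", port=80),
        initial_delay_seconds=5,
        period_seconds=10,
    ),
    # Readiness: stop routing traffic to the pod if this check fails
    readiness_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/", port=80),
        period_seconds=5,
    ),
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="my-app"),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels={"app": "my-app"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "my-app"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```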

4. Resource Requests and Limits
To ensure efficient resource utilization, define requests (minimum resources a pod needs) and limits (maximum resources a pod can use).

5. Horizontal Pod Autoscaling (HPA)
Scale your application dynamically based on CPU or memory usage.
Example:

kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10 

This ensures your application scales automatically when resource usage increases or decreases.

6. Network Policies
Control how pods communicate with each other and external resources using Network Policies.

Conclusion

By mastering these slightly advanced Kubernetes concepts, you’ll improve your cluster management, application reliability, and resource utilization. With this knowledge, you’re well-prepared to dive into more advanced topics like Helm, monitoring with Prometheus, and service meshes like Istio.

Follow for more and Happy learning :)

Dynamic Scaling with AWS Auto Scaling Groups via Console

9 December 2024 at 06:00

To configure an Auto Scaling Group (ASG) using the AWS Management Console. Auto Scaling Groups are an essential feature of AWS, allowing you to dynamically scale your EC2 instances based on workload demand. Here, we'll have a clear understanding of creating an ASG, configuring scaling policies, and testing the setup.

Introduction to Auto Scaling Groups

An Auto Scaling Group (ASG) ensures your application has the right number of EC2 instances running at all times. You can define scaling policies based on CloudWatch metrics, such as CPU utilization, to automatically add or remove instances. This provides cost-efficiency and ensures consistent performance. Auto Scaling Groups dynamically adjust EC2 instances based on workload.

Steps to Create an Auto Scaling Group Using the AWS Console

Step 1: Create a Launch Template

  1. Log in to the AWS Management Console and navigate to the EC2 Dashboard.
  2. Create a Launch Template:
    • Go to Launch Templates and click Create Launch Template.
    • Provide a Name and Description.
    • Specify the AMI ID (Amazon Machine Image) for the operating system. For example, use an Ubuntu AMI.
    • Select the Instance Type (e.g., t2.micro).
    • Add your Key Pair for SSH access.
    • Configure Network Settings (use the default VPC and a Subnet).
    • Leave other settings as default and save the Launch Template.
    • Launch Templates simplify EC2 instance configurations for ASG.

Step 2: Create an Auto Scaling Group

  1. Navigate to Auto Scaling Groups under the EC2 Dashboard.
  2. Click "Create Auto Scaling Group".
  3. Select Launch Template: Choose the Launch Template created in Step 1.
  4. Configure Group Size and Scaling Policies:
    • Specify the Minimum size (e.g., 1), Maximum size (e.g., 3), and Desired Capacity (e.g., 1).
    • Set scaling policies to increase or decrease capacity automatically.
  5. Choose Subnets:
    • Select the Subnets from your VPC where the EC2 instances will run.
    • Ensure these Subnets are public if instances need internet access.
  6. Health Checks:
    • Use EC2 health checks to automatically replace unhealthy instances.
    • Set a Health Check Grace Period (e.g., 300 seconds).
  7. Review and Create:
    • Review the settings and click Create Auto Scaling Group.
  8. Dynamic Scaling Policies allow automated scaling based on CloudWatch metrics like CPU utilization.
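
Step 3 below configures scaling policies in the console. As a programmatic alternative, a single target-tracking policy keeps average CPU near a target instead of separate scale-out/scale-in rules; here is a minimal boto3 sketch with an assumed ASG name:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep average CPU around 70% by letting AWS add/remove instances automatically.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-asg",  # placeholder: the ASG created above
    PolicyName="cpu-target-70",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 70.0,
    },
)
```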

Step 3: Set Up Scaling Policies

  1. In the ASG configuration, choose Dynamic Scaling Policies.
  2. Add a policy to scale out:
    • Set the policy to add 1 instance when CPU utilization exceeds 70%.
  3. Add a policy to scale in:
    • Set the policy to remove 1 instance when CPU utilization falls below 30%.

Stress Testing the Auto Scaling Group

To test the Auto Scaling Group, you can simulate high CPU usage on one of the instances. This will trigger the scale-out policy and add more instances. Stress testing helps verify that scaling policies are working as expected.

  1. Connect to an Instance: Use your private key to SSH into the instance.
   ssh -i "your-key.pem" ubuntu@<Instance-IP>
  2. Install Stress Tool: Update the system and install the stress tool.
   sudo apt update 
   sudo apt install stress 
  3. Run Stress Test: Simulate high CPU utilization to trigger the scale-out policy.
   stress --cpu 8 --timeout 600 
  4. Monitor Scaling:
    • Go to the Auto Scaling Groups dashboard in the AWS Console.
    • Check the Activity tab to observe if new instances are being launched.


Configuring Auto Scaling Groups using the AWS Management Console is a straightforward process that enables dynamic scaling of EC2 instances. By following these steps, we can ensure your application is resilient, cost-efficient, and capable of handling varying workloads.

Accessing Multiple Instances via Load Balancer in AWS

9 December 2024 at 05:49

When deploying scalable applications, distributing traffic efficiently across multiple instances is crucial for performance, fault tolerance, and reliability. AWS provides Elastic Load Balancing (ELB) to simplify this process. Here, we’ll explore the concepts of load balancers, target groups, security groups, and subnets, along with a step-by-step process for setting up an Application Load Balancer (ALB) to access multiple instances.

Load Balancer:

A Load Balancer is a service that distributes incoming application traffic across multiple targets (e.g., EC2 instances) in one or more availability zones. It improves the availability and fault tolerance of your application by ensuring no single instance is overwhelmed by traffic.
AWS supports three types of load balancers:

  1. Application Load Balancer (ALB): Works at Layer 7 (HTTP/HTTPS) and is ideal for web applications.
  2. Network Load Balancer (NLB): Operates at Layer 4 (TCP/UDP) for ultra-low latency.
  3. Gateway Load Balancer (GWLB): Works at Layer 3 (IP) for distributing traffic to virtual appliances.

1. Target Groups

  • Target Groups are collections of targets (e.g., EC2 instances, IPs) that receive traffic from a load balancer.
  • You can define health checks for targets to ensure traffic is routed only to healthy instances. Target groups organize and monitor targets (EC2 instances).

2. Security Groups

  • Security Groups act as virtual firewalls for your instances and load balancers.
  • For the load balancer, inbound rules allow traffic on ports like 80 (HTTP) or 443 (HTTPS).
  • For the instances, inbound rules allow traffic only from the load balancer's IP or security group.
  • They protect resources by restricting traffic based on rules.

3. Subnets

  • Subnets are segments of a VPC that isolate resources.
  • Load balancers require at least two public subnets in different availability zones for high availability.
  • EC2 instances are usually deployed in private subnets, accessible only through the load balancer.
  • They isolate resources: public subnets for load balancers, private subnets for instances.

Steps to Set Up a Load Balancer for Multiple Instances

Step 1: Launch EC2 Instances

  1. Create Two or More EC2 Instances:
    • Use the AWS Management Console to launch multiple EC2 instances in a VPC.
    • Place them in private subnets across two different availability zones.
  2. Configure Security Groups for Instances:
    • Allow traffic only from the load balancer's security group on port 80 (HTTP) or 443 (HTTPS).

Step 2: Create a Target Group

  1. Navigate to Target Groups in the EC2 section of the console.
  2. Click Create Target Group and choose Instances as the target type.
  3. Provide the following configurations:
    • Protocol: HTTP or HTTPS
    • VPC: Select the same VPC as the EC2 instances.
    • Health Check Settings: Configure health checks (e.g., Path: / and Port: 80).
  4. Register the EC2 instances as targets in this group.

Step 3: Set Up a Load Balancer
Application Load Balancer Configuration:

  1. Go to the Load Balancers section of the EC2 console.
  2. Click Create Load Balancer and choose Application Load Balancer.
  3. Configure the following:
    • Name: Provide a unique name for the load balancer.
    • Scheme: Select Internet-facing for public access.
    • Listeners: Use port 80 or 443 (for HTTPS).
    • Availability Zones: Select public subnets from at least two availability zones.

Step 4: Attach Target Group to the Load Balancer

  1. In the Listener and Rules section, forward traffic to the target group created earlier.
  2. Save and create the load balancer.
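
For completeness, the target group, load balancer and listener from Steps 2-4 can also be created with boto3. A minimal sketch with placeholder VPC, subnet, security group and instance IDs:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Target group that health-checks instances on HTTP "/"
tg = elbv2.create_target_group(
    Name="web-targets",
    Protocol="HTTP",
    Port=80,
    VpcId="vpc-0123456789abcdef0",   # placeholder VPC ID
    HealthCheckPath="/",
)["TargetGroups"][0]["TargetGroupArn"]

# Register the EC2 instances (placeholder IDs)
elbv2.register_targets(
    TargetGroupArn=tg,
    Targets=[{"Id": "i-0aaaaaaaaaaaaaaaa"}, {"Id": "i-0bbbbbbbbbbbbbbbb"}],
)

# Internet-facing ALB across two public subnets (placeholder IDs)
lb = elbv2.create_load_balancer(
    Name="web-alb",
    Subnets=["subnet-0aaaaaaaa", "subnet-0bbbbbbbb"],
    SecurityGroups=["sg-0123456789abcdef0"],
    Scheme="internet-facing",
    Type="application",
)["LoadBalancers"][0]["LoadBalancerArn"]

# Listener that forwards HTTP traffic to the target group
elbv2.create_listener(
    LoadBalancerArn=lb,
    Protocol="HTTP",
    Port=80,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg}],
)
```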

Step 5: Update Security Groups

  1. For the Load Balancer:
    • Allow inbound traffic on port 80 or 443 (if HTTPS).
    • Allow inbound traffic from all IPs (or restrict by source).
  2. For EC2 Instances:
    • Allow inbound traffic from the load balancer's security group.

Step 6: Test the Setup

  1. Get the DNS name of the load balancer from the AWS console.
  2. Access the DNS name in your browser to verify traffic is being distributed to your instances.

Step 7: Scaling with Auto Scaling Groups
Attach an Auto Scaling Group (ASG) to the target group for dynamic scaling based on traffic demand.

To access multiple EC2 instances via a load balancer in AWS, you first deploy your EC2 instances within a Virtual Private Cloud (VPC), ensuring they are in the same target network. Install and configure your desired application (e.g., a web server like Apache) on these instances. Then, create an Application Load Balancer (ALB) or Network Load Balancer (NLB) to distribute traffic. Associate the load balancer with a Target Group that includes your EC2 instances and their ports. Next, configure the load balancer's listener to route incoming traffic (e.g., HTTP or HTTPS) to the Target Group. To make the setup accessible via a domain name, map your load balancer's DNS to a custom domain using Route 53. This ensures users can access your application by visiting the domain, with the load balancer evenly distributing traffic among the EC2 instances for high availability and scalability.


Understanding Kubernetes Basics: A Beginner’s Guide

By: Ragul.M
29 November 2024 at 17:12

In today’s tech-driven world, Kubernetes has emerged as one of the most powerful tools for container orchestration. Whether you’re managing a few containers or thousands of them, Kubernetes simplifies the process, ensuring high availability, scalability, and efficient resource utilization. This blog will guide you through the basics of Kubernetes, helping you understand its core components and functionality.

What is Kubernetes?
Kubernetes, often abbreviated as K8s, is an open-source platform developed by Google that automates the deployment, scaling, and management of containerized applications. It was later donated to the Cloud Native Computing Foundation (CNCF).
With Kubernetes, developers can focus on building applications, while Kubernetes takes care of managing their deployment and runtime.

Key Features of Kubernetes

  1. Automated Deployment and Scaling Kubernetes automates the deployment of containers and can scale them up or down based on demand.
  2. Self-Healing If a container fails, Kubernetes replaces it automatically, ensuring minimal downtime.
  3. Load Balancing Distributes traffic evenly across containers, optimizing performance and preventing overload.
  4. Rollbacks and Updates Kubernetes manages seamless updates and rollbacks for your applications without disrupting service.
  5. Resource Management Optimizes hardware utilization by efficiently scheduling containers across the cluster.

Core Components of Kubernetes
To understand Kubernetes, let’s break it down into its core components:

  1. Cluster: A Kubernetes cluster consists of:
    • Master Node: The control plane managing the entire cluster.
    • Worker Nodes: Machines where containers run.
  2. Pods: The smallest deployable unit in Kubernetes. A pod can contain one or more containers that share resources like storage and networking.
  3. Nodes: Physical or virtual machines that run the pods. Managed by the Kubelet, a process ensuring pods are running as expected.
  4. Services: Allow communication between pods and other resources, both inside and outside the cluster. Examples include ClusterIP, NodePort, and LoadBalancer services.
  5. ConfigMaps and Secrets: ConfigMaps store configuration data for your applications; Secrets store sensitive data like passwords and tokens securely.
  6. Namespaces: Virtual clusters within a Kubernetes cluster, used for organizing and isolating resources.

Conclusion
Kubernetes has revolutionized the way we manage containerized applications. By automating tasks like deployment, scaling, and maintenance, it allows developers and organizations to focus on innovation. Whether you're a beginner or a seasoned developer, mastering Kubernetes is a skill that will enhance your ability to build and manage modern applications.

Follow for more and Happy learning :)

Deep Dive into AWS

By: Ragul.M
26 November 2024 at 13:50

Hi folks , welcome to my blog. Here we are going to see about some interesting deep topics in AWS.

What is AWS?

AWS is a subsidiary of Amazon that offers on-demand cloud computing services. These services eliminate the need for physical infrastructure, allowing businesses to rent computing power, storage, and other resources as needed. AWS operates on a pay-as-you-go model, which means you only pay for what you use.

Deep Dive: Serverless Architecture

One of AWS’s most revolutionary offerings is serverless computing. Traditional servers are replaced with fully managed services, allowing developers to focus solely on writing code.

Key Components of Serverless Architecture:

  • AWS Lambda: Automatically scales based on the number of requests. Ideal for microservices and event-driven workflows.
  • API Gateway: Connects client applications with backend services using APIs.
  • DynamoDB: High-performance NoSQL database for low-latency reads and writes.
  • EventBridge: Orchestrates serverless workflows using event-driven triggers.

Example Use Case: Build a RESTful API without managing servers. Combine Lambda for compute, DynamoDB for storage, and API Gateway for routing.
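
As an illustration (not a full implementation), the Lambda side of such an API could look like the sketch below: a handler that stores an item in an assumed DynamoDB table and returns an API Gateway-style response. The table name and fields are placeholders.

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("users")  # placeholder table name


def lambda_handler(event, context):
    # API Gateway (proxy integration) delivers the request body as a JSON string.
    body = json.loads(event.get("body") or "{}")

    # Write the item to DynamoDB; "user_id" is an assumed partition key.
    table.put_item(Item={"user_id": body["id"], "name": body.get("name", "")})

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": "saved"}),
    }
```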

Advanced Concepts in AWS

1. Elasticity and Auto-Scaling

AWS Auto Scaling monitors your application and adjusts capacity automatically to maintain performance. For example, if traffic spikes, AWS can add more EC2 instances or scale down when traffic reduces.

2. Hybrid Cloud and Outposts

Hybrid cloud models integrate on-premises infrastructure with AWS. AWS Outposts allow you to run AWS services on your own hardware, enabling low-latency solutions for specific industries.

3. High Availability and Disaster Recovery

AWS provides tools like:

  • Route 53 for DNS failover.
  • Cross-Region Replication for S3.
  • AWS Backup for centralized backup management across multiple services.

4. Monitoring and Logging

  • CloudWatch: Collect and monitor logs, metrics, and events.
  • CloudTrail: Records all API calls for auditing purposes.
  • AWS Config: Tracks changes to your resources for compliance.

Conclusion

AWS empowers organizations to innovate faster by providing scalable, secure, and cost-effective solutions. Whether you’re deploying a simple static website or a complex AI-powered application, AWS has the tools to support your goals. By leveraging its services and following best practices, you can build resilient and future-ready applications.

Follow for more and happy learning :)

Linux Basic Commands III

22 November 2024 at 11:17

Process Management Commands:

ps - Displays running processes.
-aux: Show all processes.
top - Monitors system processes in real time. It displays a dynamic view of system processes and their resource usage.
kill - Terminates a process.
-9: Forcefully kill a process.
kill PID - Terminates the process with the specified process ID.
pkill - Terminates processes based on their name.
pkill <name> - Terminates all processes with the specified name.
pgrep - Lists processes based on their name.
grep - Searches for specific patterns or regular expressions in text files or streams and displays matching lines.
-i: Ignore case distinctions while searching.
-v: Invert the match, displaying non-matching lines.
-r or -R: Recursively search directories for matching patterns.
-l: Print only the names of files containing matches.
-n: Display line numbers alongside matching lines.
-w: Match whole words only, rather than partial matches.
-c: Count the number of matching lines instead of displaying them.
-e: Specify multiple patterns to search for.
-A: Display lines after the matching line.
-B: Display lines before the matching line.
-C: Display lines both before and after the matching line.

Linux Basic Commands II

21 November 2024 at 14:45

File Permission Commands:

chmod - Change file permissions.

  • u: User/owner permissions.
  • g: Group permissions.
  • o: Other permissions.
  • +: Add permissions.
  • -: Remove permissions.
  • =: Set permissions explicitly.

chown - Change file ownership.

chgrp - Change group ownership.

File Compression and Archiving Commands:

tar - Create or extract archive files.

  • -c: Create a new archive.
  • -x: Extract files from an archive.
  • -f: Specify the archive file name.
  • -v: Verbose mode.
  • -z: Compress the archive with gzip.
  • -j: Compress the archive with bzip2.

gzip - Compress files.

  • -d: Decompress files.

zip - Create compressed zip archives.

  • -r: Recursively include directories.

Introduction to AWS

By: Ragul.M
20 November 2024 at 16:13

Hi folks , welcome to my blog. Here we are going to see about "Introduction to AWS".

Amazon Web Services (AWS) is the world’s leading cloud computing platform, offering a wide range of services to help businesses scale and innovate. Whether you're building an application, hosting a website, or storing data, AWS provides reliable and cost-effective solutions for individuals and organizations of all sizes.

What is AWS?
AWS is a comprehensive cloud computing platform provided by Amazon. It offers on-demand resources such as compute power, storage, networking, and databases on a pay-as-you-go basis. This eliminates the need for businesses to invest in and maintain physical servers.

Core Benefits of AWS

  1. Scalability: AWS allows you to scale your resources up or down based on your needs.
  2. Cost-Effective: With its pay-as-you-go pricing, you only pay for what you use.
  3. Global Availability: AWS has data centers worldwide, ensuring low latency and high availability.
  4. Security: AWS follows a shared responsibility model, offering top-notch security features like encryption and access control.
  5. Flexibility: Supports multiple programming languages, operating systems, and architectures.

Key AWS Services
Here are some of the most widely used AWS services:

  1. Compute:
    • Amazon EC2: Virtual servers to run your applications.
    • AWS Lambda: Serverless computing to run code without managing servers.
  2. Storage:
    • Amazon S3: Object storage for data backup and distribution.
    • Amazon EBS: Block storage for EC2 instances.
  3. Database:
    • Amazon RDS: Managed relational databases like MySQL, PostgreSQL, and Oracle.
    • Amazon DynamoDB: NoSQL database for high-performance applications.
  4. Networking:
    • Amazon VPC: Create isolated networks in the cloud.
    • Amazon Route 53: Domain name system (DNS) and traffic management.
  5. AI/ML:
    • Amazon SageMaker: Build, train, and deploy machine learning models.
  6. DevOps Tools:
    • AWS CodePipeline: Automates the release process.
    • Amazon EKS: Managed Kubernetes service.

Conclusion
AWS has revolutionized the way businesses leverage technology by providing scalable, secure, and flexible cloud solutions. Whether you're a developer, an enterprise, or an enthusiast, understanding AWS basics is the first step toward mastering the cloud. Start your AWS journey today and unlock endless possibilities!

Follow for more and happy learning :)

Basic Linux Commands

15 November 2024 at 15:08
  1. pwd - When you first open the terminal, you are in the home directory of your user. To know which directory you are in, you can use the "pwd" command. It gives us the absolute path, which means the path that starts from the root. The root is the base of the Linux file system and is denoted by a forward slash ( / ). The user directory is usually something like "/home/username".
  2. ls - Use the "ls" command to know what files are in the directory you are in. You can see all the hidden files by using the command "ls -a".
  3. cd - Use the "cd" command to go to a directory. "cd" expects a directory name or the path of the new directory as input.
  4. mkdir & rmdir - Use the mkdir command when you need to create a folder or a directory. Use rmdir to delete a directory, but rmdir can only be used to delete an empty directory. To delete a directory containing files, use rm.
  5. rm - Use the rm command to delete a file. Use "rm -r" to recursively delete all files within a specific directory.
  6. touch - The touch command is used to create an empty file. For example, "touch new.txt".
  7. cp - Use the cp command to copy files through the command line.
  8. mv - Use the mv command to move files through the command line. We can also use the mv command to rename a file.
  9. cat - Use the cat command to display the contents of a file. It is usually used to easily view programs.
  10. vi - You can create a new file or modify a file using this editor.

Basic Linux Commands

By: Ragul.M
15 November 2024 at 14:25

Hi folks , welcome to my blog. Here we are going to see some basic and important commands of linux.

One of the most distinctive features of Linux is its command-line interface (CLI). Knowing a few basic commands can unlock many possibilities in Linux.
Essential Commands
Here are some fundamental commands to get you started:
ls - Lists files and directories in the current directory.

ls

cd - Changes to a different directory.

cd /home/user/Documents

pwd - Prints the current working directory.

pwd

cp - Copies files or directories.

cp file1.txt /home/user/backup/

mv - Moves or renames files or directories.

mv file1.txt file2.txt

rm - Removes files or directories.

rm file1.txt

mkdir - Creates a new directory.

mkdir new_folder

touch - Creates a new empty file.

touch newfile.txt

cat - Displays the contents of a file.

cat file1.txt

nano or vim - Opens a file in the text editor.

nano file1.txt

chmod - Changes file permissions.

chmod 755 file1.txt

ps - Displays active processes.

ps

kill - Terminates a process.

kill [PID]

Each command is powerful on its own, and combining them enables you to manage your files and system effectively. We can see more about some basics and interesting things about Linux in further upcoming blogs which I will be posting.

Follow for more and happy learning :)

Linux basics for beginners

By: Ragul.M
14 November 2024 at 16:04

Introduction:
Linux is one of the most powerful and widely-used operating systems in the world, found everywhere from mobile devices to high-powered servers. Known for its stability, security, and open-source nature, Linux is an essential skill for anyone interested in IT, programming, or system administration.
In this blog , we are going to see What is linux and Why choose linux.

1) What is linux
Linux is an open-source operating system that was first introduced by Linus Torvalds in 1991. Built on a Unix-based foundation, Linux is community-driven, meaning anyone can view, modify, and contribute to its code. This collaborative approach has led to the creation of various Linux distributions, or "distros," each tailored to different types of users and use cases. Some of the most popular Linux distributions are:

  • Ubuntu: Known for its user-friendly interface, great for beginners.
  • Fedora: A cutting-edge distro with the latest software versions, popular with developers.
  • CentOS: Stable and widely used in enterprise environments.

Each distribution may look and function slightly differently, but they all share the same core Linux features.

2) Why choose linux
Linux is favored for many reasons, including its:

  1. Stability: Linux is well-known for running smoothly without crashing, even in demanding environments.
  2. Security: Its open-source nature allows the community to detect and fix vulnerabilities quickly, making it highly secure.
  3. Customizability: Users have complete control to modify and customize their system.
  4. Performance: Linux is efficient, allowing it to run on a wide range of devices, from servers to small IoT devices.

Conclusion
Learning Linux basics is the first step to becoming proficient in an operating system that powers much of the digital world. We can see more about some basics and interesting things about linux in further upcoming blogs which I will be posting.

Follow for more and happy learning :)

An Introduction to Tokenizers in Natural Language Processing

25 September 2024 at 00:00

Tokenizers

Co-authored by Tamil Arasan, Selvakumar Murugan and Malaikannan Sankarasubbu

In Natural Language Processing (NLP), one of the foundational steps is transforming human language into a format that computational models can understand. This is where tokenizers come into play. Tokenizers are specialized tools that break down text into smaller units called tokens, and convert these tokens into numerical data that models can process.

Imagine you have the sentence:

Artificial intelligence is revolutionizing technology.

To a human, this sentence is clear and meaningful. But we do not understand the whole sentence in one shot (okay, maybe you did, but I am sure that if I gave you a paragraph, or even better an essay, you would not be able to understand it in one shot). Instead, we make sense of parts of it, like words and then phrases, and understand the whole sentence as a composition of meanings from its parts. That is just how things work, regardless of whether we are trying to make a machine mimic our language understanding or not. This has nothing to do with the fact that ML models, or even computers in general, work with numbers. It is purely how language works, and there is no going around it.

ML models, like everything else we run on computers, can only work with numbers, so we need to transform the text into a number or a series of numbers (since we have more than one word). We have a lot of freedom when it comes to how we transform the text into numbers, and as always, with freedom comes complexity. But basically, tokenization as a whole is a two-step process: finding all the words, and assigning a unique number, an ID, to each token.

There are many ways we can segment a sentence or paragraph into pieces: phrases, words, sub-words, or even individual characters. Understanding why a particular tokenization scheme is better requires a grasp of how embeddings work. If you're familiar with NLP, you'd ask, "Why? Tokenization comes before the embedding, right?" Yes, you're right, but NLP is paradoxical like that. Don't worry, we will cover that as we go.

Background

Before we venture any further, let's understand the difference between neural networks and our typical computer programs. We all know by now that for traditional computer programs we write/translate the rules into code by hand, whereas NNs learn the rules (the mapping from input to output) from data by a process called training. You see, unlike the normal programming style, where we have a plethora of data structures that can help with storing information in any shape or form we want, along with algorithms that jump up and down, back and forth in a set of instructions we call code, neural networks do not allow us all sorts of control flow we'd like. In a neural network, there is only one direction the "program" can run: left to right.

Unlike traditional programs, which we can feed input in complicated ways, neural networks accept input in only a fixed number of ways, usually in the form of vectors (a fancy name for lists of numbers), and the vectors are of fixed size (or dimension, more precisely). In most DNNs, input and output sizes are fixed regardless of the problem being solved. For example, in CNNs the input (usually image) size and number of channels are fixed. In RNNs, the embedding dimensions, input vocabulary size, number of output labels (for classification problems, e.g. sentiment classification) and/or output vocabulary size (for text generation problems, e.g. QA, translation) are all fixed. In Transformer networks, even the sentence length is fixed. This is not a bad thing; constraints like these enable the network to compress and capture the necessary information.

Also note that there are only a few tools to test "equality", "relevance" or "correctness" for things inside the network, because the only things that dwell inside the network are vectors. Cosine similarity and attention scores are popular. You can think of vectors as variables that keep track of state inside the neural network program. But unlike in traditional programs, where you can declare variables as you'd like and print them for troubleshooting, in networks the vector-variables are meaningful only at the boundaries of the layers (not entirely true) within the network.

Let's take a look at the simplest example to understand why just pulling a vector from anywhere in the network will not be of any value to us. In the following code, three functions perform an identical calculation even though their code is slightly different. The intentionally (and unnecessarily) named variables temp and growth_factor need not be created, as exemplified by the first function, which directly embodies the compound interest formula, $A = P(1+\frac{R}{100})^{T}$. Compared to temp, the variable growth_factor holds a more meaningful interpretation: it represents how much the money will grow due to compounding interest over time. For more complicated formulae and functions, we might create intermediate variables so that the code goes easy on the eye, but they have no significance to the operation of the function.

``` python
import math

def compound_interest_1(P, R, T):
    # Direct translation of the formula A = P * (1 + R/100)^T
    A = P * (math.pow((1 + (R / 100)), T))
    CI = A - P
    return CI

def compound_interest_2(P, R, T):
    # 'temp' is just a meaningless intermediate value
    temp = (1 + (R / 100))
    A = P * (math.pow(temp, T))
    CI = A - P
    return CI

def compound_interest_3(P, R, T):
    # 'growth_factor' has an interpretation: how much the money grows over time
    growth_factor = (math.pow((1 + (R / 100)), T))
    A = P * growth_factor
    CI = A - P
    return CI
```

Another example, this time from an operations perspective: clock arithmetic. Let's assign the numbers 0 through 6 to the weekdays, starting from Sunday through Saturday.

Table 1

| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
| --- | --- | --- | --- | --- | --- | --- |
| 0   | 1   | 2   | 3   | 4   | 5   | 6   |

John Conway suggested a mnemonic device for thinking of the days of the week as Noneday, Oneday, Twosday, Treblesday, Foursday, Fiveday, and Six-a-day.

So suppose you want to know what day it is 137 days from today, if today is, say, Thursday (i.e. 4). We can compute $(4+137) \bmod 7 = 1$, i.e. Monday. As you can see, adding numbers (days) in clock arithmetic produces a meaningful output: you can add days together and get another day. Okay, let's ask the question: can we multiply two days together? Is it in any way meaningful to multiply days? Just because we can multiply any two numbers mathematically, is it useful to do so in our clock arithmetic?
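
A two-line sketch of the same wrap-around arithmetic:

``` python
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

def day_after(today: int, n: int) -> str:
    # Clock arithmetic: adding days wraps around every 7
    return DAYS[(today + n) % 7]

print(day_after(4, 137))  # Thursday + 137 days -> 'Mon'
```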

All of this digression is to emphasize that the embedding is deemed to capture the meaning of words, and the vector from the last layer is deemed to capture the meaning of, say, a sentence. But when you take a vector from somewhere inside the layers (just because you can), it does not necessarily refer to any meaningful unit such as a word, phrase or sentence as we understand it.

A little bit of history

If you're old enough, you might remember that before transformers became the standard paradigm in NLP, we had another one: EEAP (Embed, Encode, Attend, Predict). I am grossly oversimplifying here, but you can think of it as follows.

Embedding

Captures the meaning of words. A matrix of size $N \times D$, where

  • $N$ is the size of the vocabulary, i.e. the number of unique words in the language
  • $D$ is the dimension of the embedding, i.e. the length of the vector corresponding to each word.

Look up the word-vector (embedding) for each word (a small lookup sketch follows after this list).

Encoding
Finds the meaning of a sentence by using the meaning captured in the embeddings of the constituent words, with the help of RNNs like LSTM and GRU, or transformers like BERT and GPT, which take the embeddings and produce vector(s) for the whole sequence.
Prediction
Depending on the task at hand, either assigns a label to the input sentence or generates another sentence word by word.
Attention
Helps with Prediction by focusing on what is important right now, by drawing a probability distribution (normalized attention scores) over all the words. Words with a high score are deemed important.
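
To make the Embedding step concrete, here is a minimal sketch of what "looking up the word-vector" means; the vocabulary, dimension and values below are made up, and in a real model the entries of the $N \times D$ matrix are learned during training:

``` python
import numpy as np

N, D = 5, 4  # toy vocabulary size and embedding dimension
vocab = {"the": 0, "post": 1, "office": 2, "is": 3, "nearby": 4}
rng = np.random.default_rng(0)
embedding = rng.normal(size=(N, D))  # stand-in for the learned N x D matrix

sentence = ["the", "post", "office", "is", "nearby"]
ids = [vocab[w] for w in sentence]   # tokens -> IDs
vectors = embedding[ids]             # IDs -> rows of the matrix
print(vectors.shape)                 # (5, 4): one D-dimensional vector per token
```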

As you can see above, $N$ is the vocabulary size, i.e. the number of unique words in the language. A handful of years ago, "language" usually meant the corpus at hand (on the order of a few thousand sentences), and datasets like CNN/DailyMail were considered huge. There were clever tricks like anonymizing named entities to force the ML models to focus on language-specific features like grammar instead of open-world words like names of places, presidents, corporations and countries. Good times they were! The point is, the corpus in your possession might not contain all the words of the language. As we have seen, the size of the Embedding must be fixed before training the network. If by good fortune you stumble upon a new dataset, and hence new words, adding them to your model is not easy, because the Embedding needs to grow to accommodate these new (OOV) words, and that requires retraining the whole network. OOV means Out Of the current model's Vocabulary. And this is why simply segmenting the text on empty spaces will not work.
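
A tiny illustration of the OOV problem with a whitespace-split vocabulary (the corpus and query below are made up):

``` python
corpus = "the republic is a dialogue by plato"
vocab = {w: i for i, w in enumerate(sorted(set(corpus.split())))}

def encode(text):
    # Any word unseen at training time has no ID; fall back to a single <unk> ID
    unk_id = len(vocab)
    return [vocab.get(w, unk_id) for w in text.split()]

print(encode("plato wrote the republic"))  # "wrote" was never seen -> maps to <unk>
```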

With that background, let's dive in.

Tokenization

Tokenization is the process of segmenting text into individual pieces (usually words) so that an ML model can digest them. It is the very first step in any NLP system and influences everything that follows. To understand the impact of tokenization, we need to understand how embeddings and sentence length influence the model. We will call sentence length "sequence length" from here on, because a sentence is understood to be a sequence of words, and we will experiment with sequences of different things, not just words, which we will call tokens.

Tokens can be anything

  • Words - "telephone" "booth" "is" "nearby" "the" "post" "office"
  • Multiword Expressions (MWEs) - "telephone booth" "is" "nearby" "the" "post office"
  • Sub-words - "tele" "#phone" "booth" "is" "near " "#by" "the" "post" "office"
  • Characters - "t" "e" "l" "e" "p" ... "c" "e"

We know that segmenting the text on empty spaces will not work, because the vocabulary will keep growing. What about punctuation? Surely that will help with words like don't, won't, aren't, o'clock, Wendy's, co-operation{.verbatim}, etc.? The same reasoning applies here too. Moreover, segmenting at punctuation creates new problems, e.g. I.S.R.O > I, S, R, O{.verbatim}, which is not ideal.

Objectives of Tokenization

The primary objectives of tokenization are:

Handling OOV
Tokenizers should be able to segment the text into pieces such that any word in the language can be represented, whether it is in the dataset or not: any word we might conjure in the foreseeable future, whether it is technical/domain-specific terminology that scientists might utter to sound intelligent, or words commonly used by everyone in day-to-day life. An ideal tokenizer should be able to deal with any and all of them.
Efficiency
Reducing the size (length) of the input text to make computation feasible and faster.
Meaningful Representation
Capturing the semantic essence of the text so that the model can learn effectively. Which we will discuss a bit later.

Simple Tokenization Methods

Go through the code below and see if you can make any inferences from the table it produces. It reads the book The Republic and counts tokens at the character, word and sentence levels, and also indicates the number of unique tokens in the whole book.

Code

``` {.python results="output raw" exports="both"}
from collections import Counter
from nltk.tokenize import sent_tokenize

with open('plato.txt') as f:
    text = f.read()

words = text.split()
sentences = sent_tokenize(text)

char_counter = Counter()
word_counter = Counter()
sent_counter = Counter()

char_counter.update(text)
word_counter.update(words)
sent_counter.update(sentences)

print('#+name: Vocabulary Size')
print('|Type|Vocabulary Size|Sequence Length|')
print(f'|Unique Characters|{len(char_counter)}|{len(text)}')
print(f'|Unique Words|{len(word_counter)}|{len(words)}')
print(f'|Unique Sentences|{len(sent_counter)}|{len(sentences)}')
```


**Table 2**

| Type              | Vocabulary Size | Sequence Length |
| ----------------- | --------------- | --------------- |
| Unique Characters | 115             | 1,213,712       |
| Unique Words      | 20,710          | 219,318         |
| Unique Sentences  | 7,777           | 8,714           |



## Study

Character-Level Tokenization

:   In this most elementary method, text is broken down into individual
    characters.

    *\"data\"* \> `"d" "a" "t" "a"`{.verbatim}

Word-Level Tokenization

:   This is the simplest and most used (before sub-word methods became
    popular) method of tokenization, where text is split into individual
    words based on spaces and punctuation. Still useful in some
    applications and as a pedagogical launch pad into other tokenization
    techniques.

    *\"Machine learning models require data.\"* \>
    `"Machine", "learning", "models", "require", "data", "."`{.verbatim}

Sentence-Level Tokenization

:   This approach segments text into sentences, which is useful for
    tasks like machine translation or text summarization. Sentence
    tokenization is not as popular as we\'d like it to be.

    *\"Tokenizers convert text. They are essential in NLP.\"* \>
    `"Tokenizers convert text.", "They are essential in NLP."`{.verbatim}

n-gram Tokenization

:   Instead of using sentences as tokens, what if you could use
    phrases of fixed length? The following shows the n-grams for n=2,
    i.e. 2-gram or bigram. Yes, the `n`{.verbatim} in n-grams stands
    for how many words are chosen. n-grams can also be built from
    characters instead of words, though they are not as useful as
    word-level n-grams. (A short code sketch of these simple schemes
    follows below.)

    *\"Data science is fun\"* \>
    `"Data science", "science is", "is fun"`{.verbatim}.





**Table 3**

| Tokenization | Advantages                             | Disadvantages                                        |
| ------------ | -------------------------------------- | ---------------------------------------------------- |
| Character    | Minimal vocabulary size                | Very long token sequences                            |
|              | Handles any possible input             | Requires a huge amount of compute                     |
| Word         | Easy to implement and understand       | Large vocabulary size                                |
|              | Preserves meaning of words             | Cannot cover the whole language                      |
| Sentence     | Preserves the context within sentences | Less granular; may miss important word-level details |
|              | Sentence-level semantics               | Sentence boundary detection is challenging           |

As you can see from the table, vocabulary size and sequence length are
inversely correlated. Neural networks require that tokens appear in
many places, many times; that is how networks come to understand
words. Remember how, when you don\'t know the meaning of a word, you
ask someone to use it in sentences? Same thing here: the more
sentences a token appears in, the better the network can understand
it. But in the case of sentence tokenization, there are nearly as many
tokens in the vocabulary as in the tokenized corpus. It is safe to say
that each token occurs only once, and that is not a healthy diet for a
network. The same problem occurs in word-level tokenization too, just
more subtly: the out-of-vocabulary (OOV) problem. To deal with OOV we
need to stay between character-level and word-level tokens, enter
\>\>\> sub-words \<\<\<.

# Advanced Tokenization Methods

Subword tokenization is an advanced technique that breaks text into
units smaller than words. It helps in handling rare or unseen words by
decomposing them into known subword units. Our hope is that the
sub-words obtained by decomposing the text can be recombined to form
new, unseen words, and so act as the tokens for those unseen words.
Common algorithms include Byte Pair Encoding (BPE), WordPiece and
SentencePiece.

*\"unhappiness\"* \> `"un", "happi", "ness"`{.verbatim}

BPE was originally a data-compression technique. It has been
repurposed to compress a text corpus by merging frequently occurring
pairs of characters or subwords. Think of it as asking: how few unique
tokens do you need to recreate the whole book, if you are free to
arrange those tokens in a line as many times as you want? (A toy
sketch of the merge loop follows the outline below.)

Algorithm

:   1.  *Initialization*: Start with a list of characters (initial
        vocabulary) from the text(whole corpus).
    2.  *Frequency Counting*: Count all pair occurrences of consecutive
        characters/subwords.
    3.  *Pair Merging*: Find the most frequent pair and merge it into a
        single new subword.
    4.  *Update Text*: Replace all occurrences of the pair in the text
        with the new subword.
    5.  *Repeat*: Continue the process until reaching the desired
        vocabulary size or merging no longer provides significant
        compression.

Advantages

:   -   Reduces the vocabulary size significantly.
    -   Handles rare and complex words effectively.
    -   Balances between word-level and character-level tokenization.

Disadvantages

:   -   Tokens may not be meaningful standalone units.
    -   Slightly more complex to implement.
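
To make the algorithm above concrete, here is a minimal toy sketch of the merge loop in the classic Sennrich-et-al. style; the corpus counts are invented, and real implementations (e.g. the `tokenizers`{.verbatim} library) are far more efficient:

``` python
import re
from collections import Counter

def get_pair_counts(vocab):
    # Count occurrences of adjacent symbol pairs across the corpus
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with a single merged symbol
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each "word" is a space-separated sequence of characters, with its frequency
vocab = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}

for step in range(10):  # the number of merges is the tokenizer's main hyperparameter
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")
```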

## Trained Tokenizers

WordPiece and SentencePiece tokenization methods are extensions of BPE
where the vocabulary is not built by simply merging the most frequent
pair. These variants evaluate whether a given merge is useful by
measuring how much it increases the likelihood of the corpus. In
simple words: take two vocabularies, one before and one after the
merge, train a language model on each, and if the model trained on the
post-merge vocabulary has lower perplexity (think loss), then we
assume the merge was useful. Repeating this for every candidate merge
is not practical, so there are mathematical tricks to make it
tractable, which we will discuss in a future post.

The iterative merging process is the training of the tokenizer, and
this training is separate from the training of the actual model. There
are Python libraries for training your own tokenizer, but when you
plan to use a pretrained language model, it is better to stick with
the pretrained tokenizer associated with that model. In the following
sections we see how to train a simple BPE tokenizer and a
SentencePiece tokenizer, and how to use the BERT tokenizer that comes
with huggingface\'s `transformers`{.verbatim} library.

## Tokenization Techniques Used in Popular Language Models

### Byte Pair Encoding (BPE) in GPT Models

GPT models, such as GPT-2 and GPT-3, utilize Byte Pair Encoding (BPE)
for tokenization.

``` {.python results="output code" exports="both"}
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer =  Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
                     vocab_size=30000)
files = ["plato.txt"]

tokenizer.train(files, trainer)
tokenizer.model.save('.', 'bpe_tokenizer')

output = tokenizer.encode("Tokenization is essential first step for any NLP model.")
print("Tokens:", output.tokens)
print("Token IDs:", output.ids)
print("Length: ", len(output.ids))
Tokens: ['T', 'oken', 'ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'N', 'L', 'P', 'model', '.']
Token IDs: [50, 6436, 2897, 127, 3532, 399, 1697, 184, 256, 44, 42, 46, 3017, 15]
Length:  14
```

SentencePiece in T5

T5 models use a Unigram Language Model for tokenization, implemented via the SentencePiece library. This approach treats tokenization as a probabilistic model over all possible tokenizations.

``` python
import sentencepiece as spm
spm.SentencePieceTrainer.Train('--input=plato.txt --model_prefix=unigram_tokenizer --vocab_size=3000 --model_type=unigram')
```

``` {.python results="output code" exports="both"}
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("unigram_tokenizer.model")
text = "Tokenization is essential first step for any NLP model."
pieces = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)
print("Pieces:", pieces)
print("Piece IDs:", ids)
print("Length: ", len(ids))
```

``` python
Pieces: ['▁To', 'k', 'en', 'iz', 'ation', '▁is', '▁essential', '▁first', '▁step', '▁for', '▁any', '▁', 'N', 'L', 'P', '▁model', '.']
Piece IDs: [436, 191, 128, 931, 141, 11, 1945, 123, 962, 39, 65, 17, 499, 1054, 1441, 1925, 8]
Length:  17
```

WordPiece Tokenization in BERT

``` {.python results="output code"}
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Tokenization is essential first step for any NLP model."
encoded_input = tokenizer(text, return_tensors='pt')

print("Tokens:", tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0]))
print("Token IDs:", encoded_input['input_ids'][0].tolist())
print("Length: ", len(encoded_input['input_ids'][0].tolist()))
```

Summary of Tokenization Methods

Table 4

| Method           | Length | Tokens                                                                                                                       |
| ---------------- | ------ | ---------------------------------------------------------------------------------------------------------------------------- |
| BPE              | 14     | ['T', 'oken', 'ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'N', 'L', 'P', 'model', '.']                        |
| SentencePiece    | 17     | ['▁To', 'k', 'en', 'iz', 'ation', '▁is', '▁essential', '▁first', '▁step', '▁for', '▁any', '▁', 'N', 'L', 'P', '▁model', '.']   |
| WordPiece (BERT) | 12     | ['token', '##ization', 'is', 'essential', 'first', 'step', 'for', 'any', 'nl', '##p', 'model', '.']                            |

Different tokenization methods give different results for the same input sentence. As we add more data to the tokenizer training, the differences between WordPiece and SentencePiece might decrease, but they will not vanish, because of the difference in their training process.

Table 5

| Model | Tokenization Method    | Library       | Key Features                             |
| ----- | ---------------------- | ------------- | ---------------------------------------- |
| GPT   | Byte Pair Encoding     | tokenizers    | Balances vocabulary size and granularity |
| BERT  | WordPiece              | transformers  | Efficient vocabulary, handles morphology |
| T5    | Unigram Language Model | sentencepiece | Probabilistic, flexible across languages |

Tokenization and Non-English Languages

Tokenizing text is complex, especially when dealing with diverse languages and scripts. Various challenges can impact the effectiveness of tokenization.

Tokenization Issues with Complex Languages: With a focus on Tamil

Tokenizing text in languages like Tamil presents unique challenges due to their linguistic and script characteristics. Understanding these challenges is essential for developing effective NLP applications that handle Tamil text accurately.

Challenges in Tokenizing Tamil Language

  1. Agglutinative Morphology

    Tamil is an agglutinative language, meaning it forms words by concatenating morphemes (roots, suffixes, prefixes) to convey grammatical relationships and meanings. A single word may express what would be a full sentence in English.

    Impact on Tokenization
    • Words can be very lengthy and contain many morphemes.
      • போகமுடியாதவர்களுக்காவேயேதான்
  2. Punarchi and Phonology

    Tamil has specific rules (punarchi) governing how two words combine, and the resulting word may not be phonologically identical to its parts. These phonological transformations can cause problems for TTS/STT systems too.

    Impact on Tokenization
    • Surface forms of words may change when combined, making boundary detection challenging.
      • மரம் + வேர் > மரவேர்
      • தமிழ் + இனிது > தமிழினிது
  3. Complex Script and Orthography

    The Unicode representation of the Tamil alphabet is suboptimal for everything except standardized storage. Even simple operations that are intuitive for a native Tamil speaker are harder to implement because of this. Techniques like BPE applied to raw Tamil text will break words at completely inappropriate points, for example cutting an uyirmei (consonant-vowel) letter into a consonant and a diacritic, resulting in meaningless output (see the snippet after this list).

    தமிழ் > த ம ி ழ, ்
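
A quick way to see the problem; the word below is "Tamil" written in Tamil script, and the output shows the raw Unicode code points:

``` python
word = "தமிழ்"
print(list(word))  # ['த', 'ம', 'ி', 'ழ', '்'] - the vowel sign and the pulli come apart
# A character- or byte-level scheme (including BPE seeded with raw code points)
# can therefore cut an uyirmei letter into a consonant plus a bare diacritic.
```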

Strategies for Effective Tokenization of Tamil Text

  1. Language-Specific Tokenizers

    Train Tamil-specific subword tokenizers, with initial seed tokens prepared by better preprocessing, to avoid cases like the script-splitting problem described above. Use morphological analyzers to decompose words into roots and affixes, aiding in understanding and processing complex word forms.

Choosing the Right Tokenization Method

Challenges in Tokenization

  • Ambiguity: Words can have multiple meanings, and tokenizers cannot capture context. Example: The word "lead" can be a verb or a noun.
  • Handling Special Characters and Emojis: Modern text often includes emojis, URLs, and hashtags, which require specialized handling.
  • Multilingual Texts: Tokenizing text that includes multiple languages or scripts adds complexity, necessitating adaptable tokenization strategies.

Best Practices for Effective Tokenization

  • Understand Your Data: Analyze the text data to choose the most suitable tokenization method.
  • Consider the Task Requirements: Different NLP tasks may benefit from different tokenization granularities.
  • Use Pre-trained Tokenizers When Possible: Leveraging existing tokenizers associated with pre-trained models can save time and improve performance.
  • Normalize Text Before Tokenization: Cleaning and standardizing text (Unicode normalization, case folding, whitespace cleanup) before tokenization reduces vocabulary noise, as sketched below.
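
For example, a light normalization pass might look like the sketch below; the exact steps depend on your language, your task, and whether your model is cased:

``` python
import unicodedata

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)  # canonical Unicode composition
    text = text.lower()                        # case folding (skip for cased models)
    return " ".join(text.split())              # collapse stray whitespace

print(normalize("Tokenizers   convert\tTEXT."))  # -> 'tokenizers convert text.'
```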