
Adding a custom filter to Airflow logs

By: ashish
10 February 2023 at 09:28

Recently I tried to add a custom logger to our Airflow deployment on Kubernetes. Since our logs are shown to our customers, we cannot have our secrets and other Kubernetes config exposed to them.

On pod failure we get a dump of the entire pod config, and that is what we want to remove today.

Random pod configs that also expose our secrets at random intervals

Note: Our logs are sent to S3 and remote logging is enabled for our version of Airflow (2.5.0), so whatever filter we write must be added to the S3 handler as well. If you are not using remote logging, there is no need to add it there.

Logging filters

In short, a logging filter takes a log record and, based on the conditions we have supplied, tells the logger whether or not the record should be logged:

def filter(self, record):
    # Drop any record whose formatted message contains "word"
    if "word" in record.getMessage():
        return False
    return True

The above function ensures that any log record containing the word "word" is omitted by the logger.

Now let us write a filter class that omits any log that comes from taskinstance.py or standard_task_runner.py.

import logging

class CustomFilter(logging.Filter):
    # Filter out all log records emitted from these files
    FILES_TO_FILTER = {"taskinstance.py", "standard_task_runner.py"}

    def filter(self, record):
        return record.filename not in self.FILES_TO_FILTER

The attributes available to you on record are –

  1. args
  2. levelname
  3. levelno
  4. pathname
  5. filename
  6. exc_info
  7. exc_text
  8. funcName
  9. created
  10. threadName
  11. processName
  12. message (via record.getMessage())

So you can filter logs using any of the above record properties.
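For example, a minimal sketch of a filter keyed on funcName instead of filename (the function name here is made up):

class FuncNameFilter(logging.Filter):
    # Drop records emitted from a specific (hypothetical) function
    def filter(self, record):
        return record.funcName != "some_noisy_function"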

Let us now add the above CustomFilter class to Airflow:

import airflow.logging_config
from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG
from copy import deepcopy

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)

LOGGING_CONFIG["filters"]["custom_filter"] = {
    "()": CustomFilter,
}

if "filters" in LOGGING_CONFIG["handlers"]["task"]:
    LOGGING_CONFIG["handlers"]["task"]["filters"].append("custom_filter")
else:
    LOGGING_CONFIG["handlers"]["task"]["filters"] = ["custom_filter"]

# Since I only want it for tasks I am editing only the "task" handler. But if
# you need to add it to other aspects of Airflow feel free to do so.
airflow.logging_config.dictConfig(LOGGING_CONFIG)

I had some issues with deploying this. Mainly the issue was ModuleNotFoundError: No module named 'config'. The simplest way I could fix it was to just replace the actual local settings file with my custom one while building the Docker image.

So my Dockerfile has the following lines –

FROM apache/airflow:2.5.0-python3.10
COPY ./config/airflow_local_settings.py /home/airflow/.local/lib/python3.10/site-packages/airflow/config_templates

And you can see the changes I made to the airflow_local_settings.py file here – https://gist.github.com/PandaWhoCodes/52ab5ffb93d881ee90113f4eb0e23b5d/revisions

Clubbing multithreaded logs together in Python

By: ashish
4 January 2023 at 09:35

At my current company we rely heavily on Airflow for job orchestration and scheduling. From time to time we need to show these Airflow task logs to our customers. The problem arises when the same task has multiple threads: each thread emits its logs whenever it can, which throws the combined log into a frenzy. With more than 20-30 threads, the output becomes an unreadable jumble.

With that established, in this blog post we will look into how we can club the logs of each thread together and print them in an ordered fashion.

Thread safe logging

Let's first create some sample code for thread-safe logging, sketched below.
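The original sample was an embedded gist; a minimal sketch that reproduces the output below could look like this:

import logging
import threading
import time

# The logging module is itself thread safe: each handler takes a lock
# before emitting a record, so lines never interleave mid-line
logging.basicConfig(
    format="%(asctime)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)

def worker(n):
    logger.info("Hello from thread %s", n)
    time.sleep(3)
    logger.info("Hello from thread %s", n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()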

The current output for this will be

python main.py 
2023-01-04 08:28:45 Hello from thread 0
2023-01-04 08:28:45 Hello from thread 1
2023-01-04 08:28:45 Hello from thread 2
2023-01-04 08:28:45 Hello from thread 3
2023-01-04 08:28:45 Hello from thread 4
2023-01-04 08:28:45 Hello from thread 5
2023-01-04 08:28:45 Hello from thread 6
2023-01-04 08:28:45 Hello from thread 7
2023-01-04 08:28:45 Hello from thread 8
2023-01-04 08:28:45 Hello from thread 9
2023-01-04 08:28:48 Hello from thread 0
2023-01-04 08:28:48 Hello from thread 1
2023-01-04 08:28:48 Hello from thread 4
2023-01-04 08:28:48 Hello from thread 2
2023-01-04 08:28:48 Hello from thread 3
2023-01-04 08:28:48 Hello from thread 5
2023-01-04 08:28:48 Hello from thread 6
2023-01-04 08:28:48 Hello from thread 7
2023-01-04 08:28:48 Hello from thread 9
2023-01-04 08:28:48 Hello from thread 8

As you can see, the ordering of the threads is not maintained.

To get around this issue, we will first have to capture all the logs and then print them out sequentially. There are three ways to go about this:

  1. Write each thread's logs into a separate file
  2. Write each thread's logs into a separate stream
  3. Write each thread's logs into the same I/O stream, with the thread number formatted at the front of each line so that we can filter on it later

Let's write our logs into separate streams and then print them out later.

Moving logs from console to our stream
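The snippet here was an embedded gist; a sketch of the idea, assuming a single StringIO buffer takes the place of the console handler, is:

import io
import logging

# An in-memory buffer that takes the place of the console
log_buffer = io.StringIO()

handler = logging.StreamHandler(log_buffer)
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s", "%Y-%m-%d %H:%M:%S"))

logger = logging.getLogger(__name__)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Later, everything that was logged can be read back in one go:
# print(log_buffer.getvalue())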

Now we have captured all the logs to a string buffer.

The next step is to group these logs by thread. We can do that by placing them in different files, or by using the thread name as a dictionary key instead of an I/O stream.

Note: Based on your use case you can continue using string buffers. My original idea was to use different buffers (file or string) and then aggregate them at the end. While writing this up I thought of another data structure that fits my use case better, which is given below.

Adding logs to a grouped queue
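The gist embed did not survive here either; a sketch of the grouped-queue idea, using a dictionary of lists keyed by thread name as the "queue", might be:

import logging
import threading
import time
from collections import defaultdict

# One list of formatted records per thread name
grouped_logs = defaultdict(list)

class GroupingHandler(logging.Handler):
    # Instead of writing to a stream, park each record in its thread's group
    def emit(self, record):
        grouped_logs[record.threadName].append(self.format(record))

handler = GroupingHandler()
handler.setFormatter(logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s"))

logger = logging.getLogger(__name__)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def worker(n):
    logger.info("Hello from thread %s", n)
    time.sleep(3)
    logger.info("Hello from thread %s", n)

# Zero-padded names so that sorting the groups gives thread 0, 1, 2, ...
threads = [threading.Thread(target=worker, args=(i,), name=f"worker-{i:02d}") for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Print the captured logs, grouped by thread
for name in sorted(grouped_logs):
    for line in grouped_logs[name]:
        print(line)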

The output for the above will be

2023-01-04 09:21:02,136 - __main__ - INFO - Hello from thread 0
2023-01-04 09:21:05,139 - __main__ - INFO - Hello from thread 0
2023-01-04 09:21:02,140 - __main__ - INFO - Hello from thread 1
2023-01-04 09:21:05,143 - __main__ - INFO - Hello from thread 1
2023-01-04 09:21:02,141 - __main__ - INFO - Hello from thread 2
2023-01-04 09:21:05,143 - __main__ - INFO - Hello from thread 2
2023-01-04 09:21:02,141 - __main__ - INFO - Hello from thread 3
2023-01-04 09:21:05,144 - __main__ - INFO - Hello from thread 3
2023-01-04 09:21:02,142 - __main__ - INFO - Hello from thread 4
2023-01-04 09:21:05,145 - __main__ - INFO - Hello from thread 4
2023-01-04 09:21:02,142 - __main__ - INFO - Hello from thread 5
2023-01-04 09:21:05,145 - __main__ - INFO - Hello from thread 5
2023-01-04 09:21:02,142 - __main__ - INFO - Hello from thread 6
2023-01-04 09:21:05,145 - __main__ - INFO - Hello from thread 6
2023-01-04 09:21:02,142 - __main__ - INFO - Hello from thread 7
2023-01-04 09:21:05,147 - __main__ - INFO - Hello from thread 7
2023-01-04 09:21:02,143 - __main__ - INFO - Hello from thread 8
2023-01-04 09:21:05,146 - __main__ - INFO - Hello from thread 8
2023-01-04 09:21:02,143 - __main__ - INFO - Hello from thread 9
2023-01-04 09:21:05,146 - __main__ - INFO - Hello from thread 9

As you can see, the logs now come out nicely grouped by thread.

Now another issue arises for the main thread: we want the main-thread logs to appear, in order, both before and after the threads execute.

Apart from this, once the main thread's logs (or any thread's logs) have been printed and more logs may still be added to it, we must ensure that we flush it first. For now I am only adding the flush for MainThread, as in the sketch below.
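Continuing the sketch above, a hedged version of that flush could replace the main section like so:

def flush_main_thread():
    # Print and clear the main thread's captured logs, so that logs
    # added later are not printed twice
    for line in grouped_logs.pop("MainThread", []):
        print(line)

for i in range(10):
    logger.info("Message %s", i)
flush_main_thread()  # main-thread logs, in order, before the workers run

threads = [threading.Thread(target=worker, args=(i,), name=f"worker-{i:02d}") for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for i in range(10):
    logger.info("Message %s", i)

# Worker logs grouped by thread, then the trailing main-thread logs
for name in sorted(k for k in grouped_logs if k != "MainThread"):
    for line in grouped_logs[name]:
        print(line)
flush_main_thread()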

So the final output comes out just as we wanted:

2023-01-04 09:27:25,036 - __main__ - INFO - Message 0
2023-01-04 09:27:25,036 - __main__ - INFO - Message 1
2023-01-04 09:27:25,036 - __main__ - INFO - Message 2
2023-01-04 09:27:25,036 - __main__ - INFO - Message 3
2023-01-04 09:27:25,036 - __main__ - INFO - Message 4
2023-01-04 09:27:25,036 - __main__ - INFO - Message 5
2023-01-04 09:27:25,036 - __main__ - INFO - Message 6
2023-01-04 09:27:25,036 - __main__ - INFO - Message 7
2023-01-04 09:27:25,036 - __main__ - INFO - Message 8
2023-01-04 09:27:25,036 - __main__ - INFO - Message 9
2023-01-04 09:27:25,036 - __main__ - INFO - Hello from thread 0
2023-01-04 09:27:28,038 - __main__ - INFO - Hello from thread 0
2023-01-04 09:27:25,037 - __main__ - INFO - Hello from thread 1
2023-01-04 09:27:28,039 - __main__ - INFO - Hello from thread 1
2023-01-04 09:27:25,037 - __main__ - INFO - Hello from thread 2
2023-01-04 09:27:28,039 - __main__ - INFO - Hello from thread 2
2023-01-04 09:27:25,037 - __main__ - INFO - Hello from thread 3
2023-01-04 09:27:28,039 - __main__ - INFO - Hello from thread 3
2023-01-04 09:27:25,037 - __main__ - INFO - Hello from thread 4
2023-01-04 09:27:28,039 - __main__ - INFO - Hello from thread 4
2023-01-04 09:27:25,038 - __main__ - INFO - Hello from thread 5
2023-01-04 09:27:28,039 - __main__ - INFO - Hello from thread 5
2023-01-04 09:27:25,038 - __main__ - INFO - Hello from thread 6
2023-01-04 09:27:28,039 - __main__ - INFO - Hello from thread 6
2023-01-04 09:27:25,038 - __main__ - INFO - Hello from thread 7
2023-01-04 09:27:28,041 - __main__ - INFO - Hello from thread 7
2023-01-04 09:27:25,038 - __main__ - INFO - Hello from thread 8
2023-01-04 09:27:28,039 - __main__ - INFO - Hello from thread 8
2023-01-04 09:27:25,038 - __main__ - INFO - Hello from thread 9
2023-01-04 09:27:28,041 - __main__ - INFO - Hello from thread 9
2023-01-04 09:27:28,041 - __main__ - INFO - Message 0
2023-01-04 09:27:28,041 - __main__ - INFO - Message 1
2023-01-04 09:27:28,041 - __main__ - INFO - Message 2
2023-01-04 09:27:28,041 - __main__ - INFO - Message 3
2023-01-04 09:27:28,041 - __main__ - INFO - Message 4
2023-01-04 09:27:28,041 - __main__ - INFO - Message 5
2023-01-04 09:27:28,041 - __main__ - INFO - Message 6
2023-01-04 09:27:28,041 - __main__ - INFO - Message 7
2023-01-04 09:27:28,041 - __main__ - INFO - Message 8
2023-01-04 09:27:28,041 - __main__ - INFO - Message 9

Ingesting large files to postgres through S3

By: ashish
13 April 2022 at 20:16

One of the tasks I recently came across at my job was to ingest large files, but with the following requirements:

  1. Do some processing (like generating a hash for each row)
  2. Insert the file into S3 for audit purposes
  3. Insert the data into postgres

Note:

Keep in mind your postgres database needs to support this, and an S3 bucket policy needs to exist in order to allow the data to be copied over.

The setup I am using is an RDS database with S3 in the same region, with the proper policies and IAM roles already created.

Read more on that here – AWS documentation

For the purpose of this post I will be using dummy data from eforexcel (1 million records).

The most straightforward way to do this would be to just do a df.to_sql like this:

import pandas as pd

df = pd.read_csv("records.csv")
df.to_sql(
    name="test_table",
    con=connection_detail,  # your SQLAlchemy engine/connection
    schema="schema",
    if_exists="replace",
)

Something like this would take more than an hour! Let's do it in less than 5 minutes.

Now of course there are several ways to make this faster – using copy_expert, the psycopg driver, etc. (maybe a separate blog post on these) – but that's not the use case I have been tasked with. Since we need to upload the file to S3 at the end for audit purposes anyway, I will ingest the data from S3 to the DB.

Generate table metadata

Before we can assign an S3 operator to ingest the data, we need to create the table into which this data will be inserted. There are two ways that I can think of:

  1. Create each column in the DB with a high threshold type like varchar(2000)
  2. Create each column with a length equal to the maximum data length found in that column

I will be going with option 2 here.

This entire process took around 210 seconds instead of more than an hour like the last run.

Let's go over the code one by one.

Read the csv

  1. We can pass the data directly to pandas, or stream it into buffered memory, something like this:
import csv
import gzip
import io

import boto3

s3 = boto3.client("s3")
mem_file = io.BytesIO()  # in-memory file for the gzipped csv

with open("records.csv") as f:
    csv_rdr = csv.reader(f, delimiter=",")
    header = next(csv_rdr)
    with gzip.GzipFile(fileobj=mem_file, mode="wb", compresslevel=6) as gz:
        buff = io.StringIO()
        writer = csv.writer(buff)
        writer.writerows([header])
        for row in csv_rdr:
            writer.writerows([row])
        gz.write(buff.getvalue().encode("utf-8", "replace"))
    mem_file.seek(0)
    s3.put_object(Bucket="mybucket", Key="folder/file.gz", Body=mem_file)

2. Since the file is less than 50 MB, I'll go ahead and load it directly.

Create the table

Get the max length of each column and use that to generate the table. We use the pandas to_sql() function for this and pass it the dtypes, as sketched below.
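The snippet for this step was an embedded gist; a sketch of the idea (the connection string is a placeholder) might be:

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import VARCHAR

engine = create_engine("postgresql://user:password@host:5432/dbname")  # placeholder

df = pd.read_csv("records.csv")

# Size each varchar column to the longest value found in that column
dtypes = {col: VARCHAR(int(df[col].astype(str).str.len().max())) for col in df.columns}

# Create an empty table with the right column types; the rows themselves
# will be copied over from S3 in the next step
df.head(0).to_sql(
    name="test_table",
    con=engine,
    schema="schema",
    if_exists="replace",
    dtype=dtypes,
)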

Copy data from the gzipped S3 file to postgres

Finally we use aws_s3.table_import_from_s3 to copy the file over to the postgres table, as sketched below.
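The call itself is SQL run against the database. A sketch, assuming the aws_s3 extension is installed (CREATE EXTENSION aws_s3 CASCADE;) and reusing the bucket, key, and engine placeholders from above – note that per the AWS docs a gzipped file also needs the Content-Encoding: gzip metadata set on the S3 object:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@host:5432/dbname")  # placeholder

import_sql = text("""
    SELECT aws_s3.table_import_from_s3(
        'schema.test_table',                 -- target table created above
        '',                                  -- empty column list imports all columns
        '(format csv, header true)',         -- COPY options
        aws_commons.create_s3_uri('mybucket', 'folder/file.gz', 'us-east-1')
    )
""")

with engine.begin() as conn:
    conn.execute(import_sql)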

Generating Signature Version 4 URLs using boto3

By: ashish
12 April 2022 at 21:16

If your application allows your users to download files directly from S3, you are bound to get this error sometime in the future whenever you scale to other regions – The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.

The issue has been raised on various forums and on GitHub, e.g. https://github.com/rstudio/pins/issues/233 and https://stackoverflow.com/questions/57591989/amazon-web-services-s3-the-authorization-mechanism-you-have-provided-is-not-s. None of those solutions worked for me.

Here is what did work, sketched below.

Replace the region and the Airflow bucket and you're good to go.
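The working snippet was an embedded gist; a minimal sketch of the approach – pinning the client to Signature Version 4 and an explicit region – looks like this (the bucket name, key, and region are placeholders):

import boto3
from botocore.client import Config

# Force Signature Version 4 when presigning, so the URL works in
# regions that reject the legacy signing mechanism
s3 = boto3.client(
    "s3",
    region_name="ap-south-1",
    config=Config(signature_version="s3v4"),
)

url = s3.generate_presigned_url(
    ClientMethod="get_object",
    Params={"Bucket": "my-airflow-bucket", "Key": "path/to/file"},
    ExpiresIn=3600,
)
print(url)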

Using Google Sheets as a makeshift Database [Deprecated]

By: ashish
9 March 2020 at 19:51

Do you need a quick solution without going through the hassle of setting up a database? If your answer was yes, then you've come to the right place. This post will show you how you can use Google Sheets as your database.

For the purposes of this blog post I will be using this Google sheet.

As you can see, we will be collecting the following data from the user – Name, Email and Age.

Create the API

  • Go to the Google sheet you want to use.
  • Create column headers in the first row.
  • Click on Tools > Script editor.
  • Copy the following code into the editor.
  • Click on Run > Run function > setup.
  • Now publish your script to get the request URL, with the following settings.

Now let us test this URL in a webpage.

See the Pen Simple register form by Thomas Ashish Cherian (@pandawhocodes) on CodePen.

You can enter your details here to see them being updated in the Google Sheet above (refresh to see the changes).

Making local changes render on codesandbox embeds

By: ashish
29 January 2020 at 15:49

Today I came across CodeSandbox, an online editor catered to web applications. It is an open-source effort, with all bugs tracked and handled by the community. The feature I liked most about CodeSandbox is its embed feature, which lets you embed code alongside its rendered version.

But for some reason the local changes made to my static HTML embeds did not show up in the rendered half of the screen. Try editing the code below and you'll see what I mean.


This is the template given by CodeSandbox for static HTML pages. Even if you press the reload button after you make changes to the code, the page does not reflect the changes.

To fix this issue

  1. Create a new sandbox.
  2. Select vanilla parcel.
  3. Delete the unnecessary files.
  4. The embed code should work now.

Alternatively you can just use the template I have created for starting new CodeSandbox projects.

Building a [Smarter] bot with Dialogflow

By: ashish
7 September 2019 at 18:54

This post is a follow-up to Building your first [smart] bot in 10 minutes. Do give it a read before reading this post. In this post we will be building a chatbot that handles hotel bookings. There will be minimal [Python] code and it will be hosted on a serverless instance of AWS.

Things you will learn in this post –

  1. Fine tuning your dialogflow intents
  2. Using entities
  3. Using contexts

> Creating Intents

You can think of intents as intentions. "What is your name?" – the intention of this question is to get the name of the person it was asked to. Now, I can use different sentences that carry the same intention –

  1. Can you tell me your name?
  2. What do people call you ?

All the questions have the same intention – to get the person's name.
To create intents, click on the intents tab and add training phrases.

To make hotel bookings I will need the following parameters

  1. Number of people
  2. A date
  3. Number of nights for the stay

For now let the two training phrases be –

> Entities

Entities are the pieces of information you want to extract from the user's input. In the above image you can spot three entities that we are taking from the user input.

  • nights
  • date
  • people

You can change the parameter name by double-clicking it.

Required Parameters

To book a hotel room you need all three of the entities mentioned above. Scroll down to action and parameters and select the required checkbox for all three parameters.

Then define the prompts that will be displayed if the user fails to give one of the three entities.

Everything seems to be working well. But we still need to reply to the user to confirm the given information. To send back a response, scroll down to the response tab and use $parameter_name wherever required.

> Using Context

Context provides you with background information about what's being talked about.

Ashish doesn't like slow internet. He is also not too fond of buggy interfaces.

The second sentence will not make much sense without the first one. Who is "he"? In Dialogflow you can pass variables from one intent to another using context.

To do that, click on the context field for your first intent and add an identifier for that intent. Now go to the next intent and put that identifier as its input context. You can later use the parameters from the first intent in the second one using

#identifier.parameter_name

You can download all my intents using this link and upload them to your Dialogflow dashboard.

> Testing it


Hosting a static website on Heroku – The easy way

By: ashish
26 July 2019 at 18:06

Image credits: milesweb.com

Do you want free hosting for your demo website? Do you want to deploy your webapp for free? Then look no further. You can do that and more with Heroku. Read on to deploy your first static website on Heroku.

Heroku is a platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud. Heroku has a generous free plan allowing students and hobbyists to deploy their apps on the cloud for free.

In this post you are going to learn the following –

  • Basics of git
  • Adding files necessary to host static HTML on Heroku
  • Pushing code to your git repo
  • Creating a new app on Heroku and connecting it with your GitHub account

What you'll need –

  1. A GitHub account (if you are a student, apply for the student pack here)
  2. A Heroku account (if you are a student, avail your benefits here)

1. Git

What is Git?

Git is a version control tool. Think of it as a backup of your code, except instead of backing up all your code, it just backs up the changes. Watch the following video to learn more about Git and GitHub.

For the purpose of this blog post, I am going to use a sample website from Microsoft's sample HTML project. Assuming you are already registered on GitHub, go to the page and click on fork. This will copy all the code to your account so that you can make changes to it.

Once that is done, you need to download and install git on your machine. After installing git, head over to your cmd prompt and copy/paste the following

git clone https://github.com/< username >/project-html-website

Make sure you replace < username > with your own GitHub username, and that you have already forked the project.

Once you run the above command, the project will be copied to your local directory. Go ahead and check it out.

Click on the image to learn basic git commands

2. Adding files necessary to host static HTML on Heroku

  • Add a file called composer.json to the directory. If you want to learn more about why we are adding a composer.json file – click here.
  • Inside the composer.json file add the following line –
  • { }
  • Add another file called index.php. If you want to know more about index files – click here.
  • Inside index.php add the following line –
  • <?php include_once("index.html"); ?>
  • You can replace index.html with the name of the HTML file you want to serve as your home page.

Your folder structure should now look like this

Let us now push these local changes to our GitHub repo. To do that, change directory into the folder using cd directory and perform the following commands.

3. Pushing code to your git repo

If everything goes well you should have the same output as I have in the above screenshot. Let's go over the commands one by one –

  • cd project-html-website – changes the directory of the terminal into the project folder.
  • git add * and git commit -m "commit message" – once we are in the project folder, these stage and commit our changes to git.
  • git push – pushes all the local changes to the GitHub server.

4. Creating a new app on Heroku

  • Go to your Heroku dashboard and create a new app.
  • Give the app a suitable name and click on Create app.
  • Once that is done, you will be redirected to the deploy page for your new app.
  • Click on GitHub and connect your app to your GitHub repo.
  • Once you are connected, scroll down and click on the deploy branch button.
  • Once it is deployed, you can see the website at its own unique Heroku URL.
  • View my app here

Hope this post helped you. If you want more help, feel free to ping me @Ashish_che

A simple guide to building REST APIs in Go

By: ashish
18 June 2019 at 11:30

In this post we will build simple REST APIs using the Go programming language. We will also be using the mux router. I will also explain some of the fundamentals of the language for beginners.

If you want to learn Go, visit awesome-go-in-education, a curated list of resources about Go in education. If you want to do the same but in Python, read A simple guide to creating REST APIs with Flask. I will be using GoLand from JetBrains as my IDE.

Before we get started, a bit of jargon.

REST: a RESTful API uses HTTP requests to GET, PUT, POST and DELETE data.

RESTful API designing: guidelines is a must-read before you continue. It talks about terminologies, endpoints, versioning, status codes and so much more.

Test your environment

Let us first test the environment to check that everything is working fine. For that we will use a simple "Hello World" program.

Running the "Hello World" program
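The snippet was an embedded gist; the classic version is:

package main

import "fmt"

func main() {
    fmt.Println("Hello World")
}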

Once that is done, let us import the necessary packages.

Performing imports
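The import block was also an embedded gist; reconstructed from the list below, it would be:

import (
    "encoding/json"
    "log"
    "net/http"

    "github.com/gorilla/mux"
)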

Let us look at the imports used one by one.

  1. encoding/json – since our API's communications will be handled in JSON format
  2. log – to log errors
  3. net/http – we will use this package to create the APIs and communicate using the HTTP protocol
  4. mux – a powerful URL router and dispatcher for Golang. A router is used to define which function will run when a particular endpoint (URL) is called.

Writing the main function

Do note: in Go, := is for declaration + assignment, whereas = is for assignment only. For example, var foo int = 10 is the same as foo := 10.

  1. First we create a new variable for our multiplexer.
  2. Then we use HandleFunc to define which function will handle which API endpoint.
  3. With http.ListenAndServe we define the port that our program must listen on continuously. We wrap that in log.Fatal so that all exceptions are logged (see the sketch below).
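The original main function was an embedded gist; a minimal sketch along those lines (the handler name and port are placeholders) could be:

package main

import (
    "log"
    "net/http"

    "github.com/gorilla/mux"
)

// homePage runs when the "/" endpoint is called
func homePage(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("Hello World"))
}

func main() {
    // 1. A new variable for our multiplexer
    r := mux.NewRouter()

    // 2. Define which function handles which endpoint
    r.HandleFunc("/", homePage).Methods("GET")

    // 3. Listen on the port continuously; log.Fatal logs any error
    log.Fatal(http.ListenAndServe(":8000", r))
}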

To run your code type the following in your console
go run main.go

If you face an error telling you that mux is not installed, then run
go get -u github.com/gorilla/mux in your console.

Post Requests

Photo by Andrik Langfield on Unsplash

Let us now post some data to the server.

Note: Click here to learn more about JSON in Go.

  1. Adding a new function and a function handler.

2. Creating structs that will hold our JSON data.

3. Writing our add function.

Putting it all together
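The individual snippets were embedded gists; put together, a sketch (the struct fields and endpoint are placeholders) could look like this:

package main

import (
    "encoding/json"
    "log"
    "net/http"

    "github.com/gorilla/mux"
)

// Article is a placeholder struct that will hold our JSON data
type Article struct {
    Title   string `json:"title"`
    Content string `json:"content"`
}

var articles []Article

// add decodes the posted JSON body and stores it in memory
func add(w http.ResponseWriter, r *http.Request) {
    var a Article
    if err := json.NewDecoder(r.Body).Decode(&a); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    articles = append(articles, a)
    json.NewEncoder(w).Encode(a)
}

func main() {
    r := mux.NewRouter()
    r.HandleFunc("/add", add).Methods("POST")
    log.Fatal(http.ListenAndServe(":8000", r))
}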

Testing it using Postman

Hope this post helped you. If you want more help, feel free to ping me @Ashish_che
