A Simple Code Generator Using a Cool Python Feature

For a project I worked on about three years ago, I wrote several code generators: three variants of a Python/Spark application generator and at least four variants of an Airflow DAG generator.

Different variants were needed because the requirements and the complexity of the output evolved over time. Drawing on this experience, I will show how you can start writing your own code generator using a cool feature of Python.

For the purpose of this article, I will use a Python program that generates a basic Python/Spark application to fetch and display 10 rows of a specified table. The application to be generated is shown below:

import os
import sys
import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("generator").getOrCreate()
try:
    df = spark_session.sql("select * from user_names")
    df.show(10, False)
except Exception as e:
    print("Got an error {er}".format(er=str(e)))
spark_session.stop()

Version 1

Given this task, most of us would write a generator like the one below, using print statements:

import os
import sys

print("import os")
print("import sys")
print("import pyspark")
print("from pyspark import SparkContext")
print("from pyspark.sql import SQLContext")
print("from pyspark.sql import SparkSession")
print("")
print("spark_session = SparkSession.builder.appName(\"generator\").getOrCreate()")
print("try:")
print("    df = spark_session.sql(\"select * from user_names\")")
print("    df.show(10, False)")
print("except Exception as e:")
print("    print(\"Got an error {er}\".format(er=str(e)))")
print("spark_session.stop()")

But this approach is cumbersome and hard to maintain: every quote has to be escaped, and the program being generated is buried inside a wall of print statements.

Version 2

What if we want to allow the user to provide the name of the application and the name of the table dynamically?

Let us accept the application name and the table name as command-line arguments when the generator is executed. Our code generator has to be modified as below:

import os
import sys

app_name = sys.argv[1]
table_name = sys.argv[2]

print("import os")
print("import sys")
print("import pyspark")
print("from pyspark import SparkContext")
print("from pyspark.sql import SQLContext")
print("from pyspark.sql import SparkSession")
print("")
print("spark_session = SparkSession.builder.appName(\"" + app_name + "\").getOrCreate()")
print("try:")
print("    df = spark_session.sql(\"select * from " + table_name + "\")")
print("    df.show(10, False)")
print("except Exception as e:")
print("    print(\"Got an error {er}\".format(er=str(e)))")
print("spark_session.stop()")

Even though we have now parameterized the application name and the table name, this approach is still problematic. The biggest problem is that it is inflexible.

Version 3

In version 2, can you tell which part of the code is the generator and which part is the generated code? It is quite difficult to separate the two. Imagine what the code would look like if we had to generate a very large and complex program. As you can imagine, such a generator would not be easy to maintain.

Let us simplify the code generator. Python allows us to define blocks of text inside triple double quotes or triple single quotes. Such a block can not only span multiple lines, but can also contain variable placeholders. What are variable placeholders? They are named slots, written in curly brackets, that are substituted with actual values when the block of text is evaluated. And when is a block of text evaluated? When we call the format() method on it.
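As a minimal illustration of the idea (the greeting text and names here are just an example, not part of the generator):

```python
# A triple-quoted string spanning multiple lines, with two placeholders.
template = """Hello {name},
you have {count} new messages."""

# The placeholders are substituted when format() is called.
message = template.format(name="Asha", count=3)
print(message)
```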

What does our code generator look like now?

import os
import sys

# note the triple quotes that indicate the start of the block
template_application = """\
import os
import sys
import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("{app_name}").getOrCreate()
try:
    df = spark_session.sql("select * from {table_name}")
    df.show(10, False)
except Exception as e:
    print("Got an error {{er}}".format(er=str(e)))
spark_session.stop()
"""  # note the triple quotes that indicate the end of the block

app_name = sys.argv[1]
table_name = sys.argv[2]
print(template_application.format(app_name=app_name, table_name=table_name))

We define the entire generated program in the variable named 'template_application'. The template contains placeholders for the application name (app_name) and the table name (table_name). We have to take care to provide values for these placeholders, which we do in the print statement by calling the format() method.

Important note: we have enclosed the 'er' variable in double curly brackets because we want it to remain a placeholder in the generated code. During evaluation of format(), Python replaces each pair of double curly brackets with a single curly bracket, so 'er' survives as a regular placeholder in the generated code.
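A tiny demonstration of this escaping behavior:

```python
# Double curly brackets survive format() as single curly brackets,
# so {er} remains a placeholder in the generated line of code.
line = 'print("Got an error {{er}}".format(er=str(e)))'
generated_line = line.format()
print(generated_line)
```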

Happy coding!!!

#python #template #code_generation
