A cool Python feature that is invaluable for a code generator
Bipin Patwardhan
Solution Architect, Solution Creator, Cloud, Big Data, TOGAF 9
For my most recent project, I wrote a couple of code generators: three variants of a Python/Spark application generator and at least four variants of an Airflow DAG generator. Different variants were needed as the requirements and the complexity of the output evolved over time. Drawing on this experience, I will show how you can get started on writing a code generator using a cool feature of Python.
For the purpose of this article, I will use a Python program that generates a basic Python/Spark application to get and display 10 rows of the specified table. The application to be generated is shown below:
import os
import sys
import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("generator").getOrCreate()
try:
    df = spark_session.sql("select * from user_names")
    df.show(10, False)
except Exception as e:
    print("Got an error {er}".format(er=str(e)))
spark_session.stop()
Version 1
The simplest method for generating this application is to use print statements, as shown below:
import os
import sys
print("import os")
print("import sys")
print("import pyspark")
print("from pyspark import SparkContext")
print("from pyspark.sql import SQLContext")
print("from pyspark.sql import SparkSession")
print("")
print("spark_session = SparkSession.builder.appName(\"generator\").getOrCreate()")
print("try:")
print("    df = spark_session.sql(\"select * from user_names\")")
print("    df.show(10, False)")
print("except Exception as e:")
print("    print(\"Got an error {er}\".format(er=str(e)))")
print("")
print("spark_session.stop()")
Version 2
What if we want to allow the user to provide the name of the application and the name of the table, so that these can be incorporated in the application? Let us accept the name of the application and the name of the table as command line arguments when the generator is executed. Our code generator has to be modified as below
import os
import sys
app_name = sys.argv[1]
table_name = sys.argv[2]
print("import os")
print("import sys")
print("import pyspark")
print("from pyspark import SparkContext")
print("from pyspark.sql import SQLContext")
print("from pyspark.sql import SparkSession")
print("")
print("spark_session = SparkSession.builder.appName(\"" + app_name + "\").getOrCreate()")
print("try:")
print("    df = spark_session.sql(\"select * from " + table_name + "\")")
print("    df.show(10, False)")
print("except Exception as e:")
print("    print(\"Got an error {er}\".format(er=str(e)))")
print("")
print("spark_session.stop()")
Can you make out which part of the code is the code generator and which part of the code is the generated code? It is quite difficult to separate out the two. Imagine what the code will look like if we have to generate a very large and complex program. As you can imagine, the code generator will not be easy to maintain.
Version 3
Let us simplify the code generator. Python allows us to define blocks of text inside triple double quotes or triple single quotes. The text can not only span multiple lines, but can also contain variable place-holders. What are variable place-holders? These are names enclosed in curly brackets that are replaced by actual values. And when are they replaced? When the format method is called on the block of text, which in our case happens inside the print statement. What does our code generator look like?
import os
import sys
# note the triple quotes that indicate the start of the block
template_application = """
import os
import sys
import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("{app_name}").getOrCreate()
try:
    df = spark_session.sql("select * from {table_name}")
    df.show(10, False)
except Exception as e:
    print("Got an error {{er}}".format(er=str(e)))
spark_session.stop()
""" # note the triple quotes that indicate the end of the block
app_name = sys.argv[1]
table_name = sys.argv[2]
print(template_application.format(app_name=app_name, table_name=table_name))
We define all of the code to be generated in the variable named 'template_application'. The template also contains place-holders for the application name (app_name) and the table name (table_name). We have to take care to provide a value for every place-holder. We do that in the print statement, where we pass the actual values as keyword arguments to the format method.
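To illustrate why every place-holder needs a value, here is a minimal sketch that is separate from the generator above. The shortened template 'template_snippet' and the helper 'render' are hypothetical names introduced only for this illustration. The sketch shows that the template stays unchanged until format is called, and that a missing value makes format raise a KeyError, which the helper turns into a more readable message.

```python
# Shortened, hypothetical template used only for illustration.
template_snippet = 'df = spark_session.sql("select * from {table_name}")'


def render(template, **values):
    """Fill the place-holders in template; report any missing value."""
    try:
        return template.format(**values)
    except KeyError as missing:
        # str.format raises KeyError when a named place-holder has no value.
        raise ValueError(
            "No value supplied for place-holder {m}".format(m=missing)
        )


# The template itself is unchanged until format is called.
print(template_snippet)
print(render(template_snippet, table_name="user_names"))
```

Calling render(template_snippet) without table_name raises a ValueError that names the missing place-holder, which is easier to diagnose than a broken generated file.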
Important Note:
You will note that we have enclosed 'er' inside double curly brackets. This is because we want it to remain a place-holder in the generated code. When the format method is called, Python treats each doubled curly bracket as an escape and emits a single curly bracket in its place. As a result, {{er}} in the template becomes {er} in the generated code, where it is filled in only when the generated application runs.
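The effect of the double curly brackets can be seen in isolation with a small sketch. The one-line template 'template_line' below is a hypothetical example, not part of the generator above.

```python
# Hypothetical one-line template: {table_name} is filled in by the generator,
# while {{er}} must survive as {er} so the generated program can fill it in.
template_line = 'print("Error reading {table_name}: {{er}}".format(er=str(e)))'

generated_line = template_line.format(table_name="user_names")
print(generated_line)
# prints: print("Error reading user_names: {er}".format(er=str(e)))
```

Only the single-bracket place-holder is substituted by the generator; the doubled brackets are reduced to one pair and left for the generated code to use.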
Happy coding!!!