Reading JSON config - Spark vs Python

One of the most common tasks you will face in a PySpark data engineering project is reading a JSON file. By that I do not mean data in JSON format; I am referring to application configuration stored as JSON.

In PySpark, we read a JSON config file with the following code:

try:
    config_js = spark.read.option("multiline", "true").json(global_config).collect()[0]
    log_path = config_js["log_path"]
except Exception as ex:
    print(str(ex))        

Pretty simple, isn't it?

The problem with this approach shows up when you have to deal with nested JSONs. That can be resolved by flattening the JSON, but there is a second problem: handling missing keys. If we ask for a key that is not present in the data, an exception is thrown. To handle this properly, we need another try / except block, as shown below.

try:
    config_js = spark.read.option("multiline", "true").json(global_config).collect()[0]
    try:
        log_path = config_js["log_path"]
    except Exception as ex1:
        log_path = "/log"
except Exception as ex:
    print(str(ex))        

In comparison, the Python approach is pretty simple, even from within PySpark, and it is the one I prefer. We read the config file as a plain text file, concatenate all the lines, and then load the result as a JSON object. An example is below.

import json

try:
    text_data = ""
    # read the config file as plain text; each line becomes one row
    rows = spark.read.format("text").load(file_path).collect()
    for row in rows:
        text_data += " " + row["value"]
    # parse the concatenated lines as a single JSON object
    json_data = json.loads(text_data)
except Exception as ex:
    json_data = {}
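
As an aside, newer Spark versions (2.3 and later, if I recall correctly) let the text reader return the whole file as a single row via the wholetext flag, which removes the need for the concatenation loop. A minimal sketch, assuming that flag is available in your Spark version:

try:
    # wholetext=True returns the entire file content as one row
    raw_text = spark.read.text(file_path, wholetext=True).collect()[0]["value"]
    json_data = json.loads(raw_text)
except Exception as ex:
    json_data = {}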

The Python approach has three advantages over the PySpark approach. First, for missing keys we can use the get() method instead of the [] operator and avoid writing nested try / except blocks. Second, also for missing keys: when get() does not find the key, it returns None by default, and if None is not what we want, we can supply a default value.

Sample code below

log_path = json_data.get("log_path", "/log")        

Thus, not only do we avoid an explicit try / except block, we can also provide default values for missing keys.

The third advantage of the Python approach is that we do not need to flatten the data. When nested JSON is flattened, a single JSON record expands into many rows. In our project, a JSON file containing 18 JSON records exploded into nearly 260K rows because the nesting was so deep. To be fair, that file was data, not configuration; a configuration file nested that deeply would be troublesome in its own right. With the Python approach we would still have 18 records, and the mechanism for accessing nested elements remains the same: use the get() method, and if the value is itself another object, apply another get() to fetch the nested key, as shown below.
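
For illustration, assuming a hypothetical config with a nested "logging" object such as {"logging": {"log_path": "/log", "level": "INFO"}}, chained get() calls could look like this:

# "logging", "log_path" and "level" are hypothetical keys for illustration
logging_cfg = json_data.get("logging", {})
log_path = logging_cfg.get("log_path", "/log")
log_level = logging_cfg.get("level", "INFO")

Passing an empty dict as the default for the intermediate object keeps the chain safe even when the "logging" key itself is missing.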

#json #pyspark #python #configuration
