Using the power of the Databricks (GenAI) Assistant
Bipin Patwardhan
Solution Architect, Solution Creator, Cloud, Big Data, TOGAF 9
Recently, for a customer, we were ingesting JSON data from an API. We had done this before for a different source, but the mechanism this time was different. Earlier, we used two approaches. In the first approach, we connected to the REST API of the source and downloaded the JSON as a file. This file was loaded using the 'spark.read.load' command, which gave us the complete JSON in memory. We flattened the JSON and stored only the required columns in a table. In the second approach, we connected to the source, flattened the response in memory and stored the flattened data directly in the table. We did not generate an intermediate JSON file, so one hop was saved. In both approaches, the data in the table is not available in its original structure. In the first approach we at least still have the JSON as a file; in the second approach, the original JSON is never visible.
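To make "flatten and keep only the required columns" concrete, here is a minimal sketch in plain Python. It is not the code we used on the project; the function name 'flatten', the underscore separator and the sample record are all assumptions made for illustration.

```python
from typing import Any, Dict


def flatten(obj: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Collapse nested dictionaries into one level, joining key names with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent path along.
            flat.update(flatten(value, name, sep))
        else:
            # Scalars (and lists) are kept as they are.
            flat[name] = value
    return flat


# Hypothetical record, used only for illustration.
record = {
    "id": 7,
    "customer": {"name": "Asha", "address": {"city": "Pune"}},
    "amount": 12.5,
}

flat = flatten(record)

# Keep only the columns we care about.
required = {key: flat[key] for key in ("id", "customer_address_city")}
```

Note that once only 'required' is stored, the original nesting cannot be recovered, which is exactly the limitation described above.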
This time, the customer told us that we needed to store the original JSON received from the source. As we were receiving an array in the JSON, we decided to extract the individual objects from the array and store each one as a row in a string column.
On to the next step: loading this data. As the column is a string, Spark treats its content as a string when reading it. To convert the content from text to JSON, we need to use the 'from_json' function. Simple? Yes and no. Simple because it is a single function. Not simple because we need to provide a schema.
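The idea of storing one object per row can be sketched in plain Python as follows. The function name 'split_array_to_rows' and the sample payload are hypothetical, and this is an illustration rather than the code we ran. One caveat: re-serializing each object preserves its content and key order, but not the exact whitespace of the original response.

```python
import json
from typing import List


def split_array_to_rows(payload: str) -> List[str]:
    """Return one compact JSON string per object in a top-level JSON array."""
    records = json.loads(payload)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Serialize each object on its own so it can be stored as one string row.
    return [json.dumps(record, separators=(",", ":")) for record in records]


# Hypothetical API response, used only for illustration.
sample_payload = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b", "tags": ["x"]}]'

rows = split_array_to_rows(sample_payload)
```

Each element of 'rows' can then be written to a single string column of the target table.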
And that is the crux of the story - explicit specification of the schema.
As the input is JSON, its structure can change. Our earlier approaches coped with such changes because we read the JSON directly and extracted only the required columns after flattening the structure. Having to specify the schema explicitly has an impact on the flexibility and adaptability of the code. If the schema changes, we need to make a code change. We wanted to avoid this situation. Hence we decided to specify the schema using metadata (again a JSON). Then we decided to write a generic function that reads the metadata specification and creates the schema structure, which is given to the 'from_json' function.
As we were working in the Databricks environment, and it has introduced GenAI capabilities, we decided to summon the assistant. To start the activity of metadata-based schema specification, we created a schema for the input file we were dealing with. Then we asked the assistant to generate a metadata specification - a Domain Specific Language (DSL) if we want to call it that. The assistant generated the JSON specification. We then asked it to generate code that would read the specification and create the structure as expected by Spark.
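A minimal sketch of such a metadata-driven schema builder is given below. It is not the specification or the code from our project: the metadata layout (the 'fields', 'type' and 'element' keys), the type names and the function names are all assumptions made for illustration. To keep the sketch runnable without a Spark cluster, it produces a DDL-formatted schema string instead of Spark type objects.

```python
import json
from typing import Any, Dict

# Hypothetical metadata specification; in practice this would live in a file or table.
metadata_json = """
{
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "address", "type": "struct", "fields": [
      {"name": "city", "type": "string"},
      {"name": "zip", "type": "string"}
    ]},
    {"name": "tags", "type": "array", "element": {"type": "string"}}
  ]
}
"""

# Assumed mapping from metadata type names to Spark SQL type names.
PRIMITIVES = {
    "string": "STRING",
    "int": "INT",
    "long": "BIGINT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
    "timestamp": "TIMESTAMP",
}


def type_to_ddl(spec: Dict[str, Any]) -> str:
    """Translate one type specification into its DDL type string."""
    kind = spec["type"]
    if kind == "struct":
        # A struct carries a list of fields; each is translated recursively.
        inner = ", ".join(f"{f['name']}: {type_to_ddl(f)}" for f in spec["fields"])
        return f"STRUCT<{inner}>"
    if kind == "array":
        return f"ARRAY<{type_to_ddl(spec['element'])}>"
    if kind in PRIMITIVES:
        return PRIMITIVES[kind]
    raise ValueError(f"unsupported type in metadata: {kind}")


def metadata_to_ddl(metadata: Dict[str, Any]) -> str:
    """Build a DDL-formatted schema string from the top-level field list."""
    return ", ".join(f"{f['name']} {type_to_ddl(f)}" for f in metadata["fields"])


schema_ddl = metadata_to_ddl(json.loads(metadata_json))
```

In PySpark, 'from_json' accepts a DDL-formatted string as its schema argument as well as a 'StructType', so 'schema_ddl' could be passed to it as is. With this design, a change in the source structure becomes a change in the metadata rather than in the code. The sketch does not quote field names, so names containing special characters would need extra handling.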
Once again, the assistant performed the activity and generated Python code.
We executed the Python code and boom! The code refused to work. We examined it and found that one element was missing: when it encountered a structure, the generated code did not generate an array. Once we made the change, we were able to generate the structure expected by Spark, read the JSON data, flatten it and extract the required columns.
What can we learn from this incident? The GenAI features of the Databricks Assistant are helpful and can save some of the effort required to write code. But we have to provide proper instructions. If the instructions are not precise, the generated code may not meet our needs and we will end up spending a lot of time debugging it. One more point: the code generated by GenAI assistants needs examination.
Please do not assume that they generate perfect code, or code that matches your expectations.
#spark #json #genai #databricks #assistant #databricks_assistant #schema