Using the power of the Databricks (GenAI) Assistant

Recently, for a customer, we have been ingesting JSON data from an API. While we have done this kind of activity before (for a different source), the mechanism this time is different. Earlier, we used two approaches. In the first approach, we connected to the REST API of the source and downloaded the JSON. This JSON was loaded using the 'spark.read.load' command, which gave us the complete JSON in memory. We flattened the JSON and stored only the required columns in a table. In the second approach, we connected to the source, flattened the JSON in memory and stored the data in flattened form in the table. We did not generate an intermediate JSON file, so one hop was saved. In both approaches, the data is not available in its original structure: in the first approach we still have the JSON as a file, but in the second approach the JSON is never visible.

This time, the customer told us that we need to store the original JSON received from the source. As we were receiving an array in the JSON, we decided to extract the individual objects from the array and store each one as a row in string format.
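To make that step concrete, here is a minimal sketch of the idea in PySpark; the endpoint, payload shape and table name are hypothetical, not the customer's actual source or destination:

    import json
    import requests  # assumed HTTP client for the source API
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical endpoint; the real API is customer-specific.
    payload = requests.get("https://example.com/api/orders").json()

    # Assume the response is (or contains) an array of objects.
    records = payload if isinstance(payload, list) else payload.get("items", [])

    # Keep one row per object, each row holding the original JSON as a string.
    rows = [(json.dumps(obj),) for obj in records]
    raw_df = spark.createDataFrame(rows, schema="raw_json STRING")

    raw_df.write.mode("append").saveAsTable("bronze.raw_api_payload")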

On to the next step: loading this data. Since the column is stored as a string, Spark treats its content as a string. To convert the content from text to structured JSON, we need to use the 'from_json' method. Simple? Yes and no. Simple because it is a single method. Not simple because we need to provide a schema.
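To illustrate why the schema matters, this is roughly what the parsing step looks like when the schema is hard-coded; the column and field names here are assumptions for the example, not the customer's actual structure:

    from pyspark.sql import functions as F
    from pyspark.sql.types import (
        ArrayType, IntegerType, StringType, StructField, StructType,
    )

    # Hand-written schema for the JSON stored in the string column.
    order_schema = StructType([
        StructField("order_id", StringType()),
        StructField("amount", IntegerType()),
        StructField("lines", ArrayType(StructType([
            StructField("sku", StringType()),
            StructField("qty", IntegerType()),
        ]))),
    ])

    # 'spark' is the active SparkSession (pre-defined in Databricks notebooks).
    raw_df = spark.read.table("bronze.raw_api_payload")  # raw JSON strings
    parsed_df = raw_df.withColumn("parsed", F.from_json("raw_json", order_schema))

    # Flatten and keep only the required columns.
    flat_df = parsed_df.select("parsed.order_id", "parsed.amount")

The hard-coded schema works, but it is exactly the rigidity discussed next.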

And that is the crux of the story - explicit specification of the schema.

As the input is JSON, the input can change. Our earlier approaches handled such changes because we were dealing with the JSON directly and extracted only the required columns after flattening the structure. Having to specify the schema explicitly has an impact on code flexibility and adaptability: if the schema changes, we need to make a code change. We wanted to avoid this situation. Hence we decided to specify the schema using metadata (again a JSON). Then we decided to write a generic function that reads the metadata specification and creates the schema structure, which is given to the 'from_json' function.

As we were working in the Databricks environment, and it has introduced GenAI capabilities, we decided to summon the Assistant. To start the activity of metadata-based schema specification, we created a schema for the input file we were dealing with. Then we asked the Assistant to generate a metadata specification - a Domain Specific Language (DSL), if we want to call it that. The Assistant generated the JSON specification. We then asked it to generate code that would read the specification and create the structure expected by Spark.
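We are not reproducing the Assistant's exact output here, but a plausible shape for such a metadata specification (with purely illustrative field names) is something like:

    # Illustrative metadata specification: each field has a name and a type;
    # 'struct' entries carry nested fields, 'array' entries carry an element type.
    schema_spec = {
        "fields": [
            {"name": "order_id", "type": "string"},
            {"name": "amount", "type": "integer"},
            {
                "name": "lines",
                "type": "array",
                "element": {
                    "type": "struct",
                    "fields": [
                        {"name": "sku", "type": "string"},
                        {"name": "qty", "type": "integer"},
                    ],
                },
            },
        ]
    }

In practice this specification would live in a separate metadata file and be loaded with 'json.load', so a schema change becomes a metadata change rather than a code change.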

Once again the Assistant performed the activity and generated Python code.

We executed the Python code and boom! The code refused to work. We examined the code and found that it had one missing element: when it encountered a structure, it did not generate the corresponding array type. Once we made that change, we were able to generate the structure expected by Spark, read the JSON data, flatten it and extract the required columns.
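For readers who want to attempt something similar, here is our reconstruction (not the Assistant's actual code) of a converter for the specification shape sketched above; the key point is that a structure appearing inside an array must be wrapped in an ArrayType:

    from pyspark.sql import functions as F
    from pyspark.sql.types import (
        ArrayType, BooleanType, DataType, DoubleType,
        IntegerType, StringType, StructField, StructType,
    )

    _SIMPLE_TYPES = {
        "string": StringType(),
        "integer": IntegerType(),
        "double": DoubleType(),
        "boolean": BooleanType(),
    }

    def build_type(spec: dict) -> DataType:
        kind = spec["type"]
        if kind == "struct":
            return StructType(
                [StructField(f["name"], build_type(f)) for f in spec["fields"]]
            )
        if kind == "array":
            # The piece the generated code missed: a struct (or any type)
            # inside an array must be wrapped in ArrayType.
            return ArrayType(build_type(spec["element"]))
        return _SIMPLE_TYPES[kind]

    def build_schema(spec: dict) -> StructType:
        # Convert the metadata specification into the StructType expected by Spark.
        return StructType(
            [StructField(f["name"], build_type(f)) for f in spec["fields"]]
        )

    # 'schema_spec' and 'raw_df' come from the earlier sketches.
    spark_schema = build_schema(schema_spec)
    parsed_df = raw_df.withColumn("parsed", F.from_json("raw_json", spark_schema))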

What can we learn from this incident? The GenAI features of the Databricks Assistant are helpful and can save the effort required to write code. But we have to provide proper instructions. If the instructions are not precise, the generated code may not meet our needs and we will end up spending a lot of time debugging it. One more point: the code generated by GenAI assistants needs examination.

Please do not assume that they generate perfect code, or code that matches your expectations.

#spark #json #genai #databricks #assistant #databricks_assistant #schema
