Flattening XML in a PySpark environment

Most of us are familiar with reading JSON files in a PySpark environment and then extracting the required fields. Most solutions use the explode function followed by extraction. Depending on the need, the explode can be selective (for specific nodes) or applied to all nodes in the data set.
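For reference, here is a minimal sketch of that JSON route. The file path and the field names ('order_id', 'items', and so on) are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

# Read a JSON file whose records contain a nested 'items' array.
df = spark.read.option("multiline", "true").json("/data/orders.json")

# explode() turns each element of the array into its own row.
flat_df = (
    df.select(col("order_id"), explode(col("items")).alias("item"))
      .select("order_id", "item.sku", "item.qty")
)

flat_df.show()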

If the JSON is complex, explode generates a very large number of records, but that is a topic for another post.

Recently, we faced a similar situation: we had to load data from XML. We had to connect to a SOAP interface (yes, it is still around) and fetch data. Our request payload was XML and so was the response (naturally, as it is SOAP). We wanted to extract data from the response and store it in a (Databricks) table.

As we had prior experience flattening JSON, we thought we could do something similar with the XML, but we could not find enough examples. Most of the examples we found were in plain Python (using pandas), not PySpark.

After much thought and trial, I hit upon a solution: the 'xmltodict' Python library. Given an XML document, this library converts it to a Python dictionary. As a Python dictionary and JSON are interchangeable, a successful conversion would get the data from XML to JSON, which could then be flattened in the usual way.
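As a quick illustration of the idea (the XML snippet and its namespace are invented):

import json
import xmltodict

xml = """<ns:order xmlns:ns="http://example.com/ns">
  <ns:id>42</ns:id>
  <ns:item><ns:sku>A1</ns:sku></ns:item>
</ns:order>"""

parsed = xmltodict.parse(xml)        # nested Python dictionary
print(json.dumps(parsed, indent=2))  # dictionary -> JSON string

Note that the namespace prefix ('ns:') survives in the dictionary keys, which is exactly what we had to clean up next.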

We loaded the XML and used 'xmltodict' to parse it. It did the job and generated an equivalent JSON. After removing the namespace prefixes from the keys using Python's replace function, we had a JSON with a standard structure.
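Put together, the flow looked roughly like the sketch below. Here 'response_xml' stands in for the SOAP response string, the namespace prefixes are examples (use whatever your response actually contains), and 'spark' is the session that Databricks provides.

import json
import xmltodict

# XML -> Python dictionary -> JSON string.
parsed = xmltodict.parse(response_xml)
json_str = json.dumps(parsed)

# Strip the namespace prefixes from the keys. A plain string replace is
# blunt (it also touches values that contain the prefix), but it did the job.
for prefix in ("soapenv:", "ns:"):
    json_str = json_str.replace(prefix, "")

# Load the JSON string as a DataFrame and flatten it as usual.
df = spark.read.json(spark.sparkContext.parallelize([json_str]))
df.printSchema()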

And we were in business.

#parsing #conversion #flattening #flatten #json #xml #pyspark #python #ingestion #data_load #databricks
