Handling Nested Schema in Apache Spark
Sachin D N
Data Consultant @ Lumen Technologies
Apache Spark provides powerful tools for working with complex, nested data structures. In this blog, we'll explore two different approaches to handling nested schemas in PySpark.
Let's consider a JSON dataset of customers, where each customer has an ID, a name (consisting of a first name and a last name), and a city. Here's an example of what the data might look like:
[
??{"customer_id":?1,?"fullname":?{"firstname":?"John",?"lastname":?"Doe"},?"city":?"Bangalore"},
??{"customer_id":?2,?"fullname":?{"firstname":?"Jane",?"lastname":?"Doe"},?"city":?"Mysore"},
??{"customer_id":?3,?"fullname":?{"firstname":?"Bob",?"lastname":?"Smith"},?"city":?"Chennai"}
]
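One note before loading this file: by default, Spark's JSON reader expects JSON Lines, i.e. one JSON object per line. If you save the sample above verbatim as a single multi-line JSON array, you will also need to enable the multiLine option when reading. A minimal sketch (the path is a placeholder):

df = spark.read.format("json").option("multiLine", "true").load("/path/to/data.json")

If each customer object sits on its own line instead, the default reader settings are enough.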
We can load this data into a DataFrame and apply the nested schema using either of the two approaches described below. Here's how we can do it:
Approach 1: Using DDL Strings
ddlSchema = "customer_id long, fullname struct<firstname:string,lastname:string>, city string"
df = spark.read.format("json").schema(ddlSchema).load("/path/to/data.json")
Approach 2: Using StructType and StructField
from pyspark.sql.types import StructType, StructField, LongType, StringType

customer_schema = StructType([
    StructField("customer_id", LongType()),
    StructField("fullname", StructType([
        StructField("firstname", StringType()),
        StructField("lastname", StringType())
    ])),
    StructField("city", StringType())
])

df = spark.read.format("json").schema(customer_schema).load("/path/to/data.json")
In both cases, "/path/to/data.json" should be replaced with the actual path to your JSON file. The resulting DataFrame will have a nested schema that matches the structure of the data.
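To confirm that the schema was applied, you can call df.printSchema(); for the schema defined in this post, the output should look roughly like this:

root
 |-- customer_id: long (nullable = true)
 |-- fullname: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- city: string (nullable = true)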
Approach 1: Using DDL Strings
The first approach involves defining the schema using a DDL (Data Definition Language) string. This is a string that specifies the structure of the data in a format similar to the one used in SQL. Here's an example:
ddlSchema = "customer_id long, fullname struct<firstname:string,lastname:string>, city string"
df = spark.read.format("json").schema(ddlSchema).load("/path/to/data")
In this code, ddlSchema defines a schema with three fields: "customer_id", "fullname", and "city". The "fullname" field is a struct that contains two subfields: "firstname" and "lastname". The schema(ddlSchema) call applies this schema when the data is read, instead of letting Spark infer one.
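Once the schema is in place, the subfields of the struct can be addressed with dot notation. A small sketch (the alias names are just illustrative):

from pyspark.sql.functions import col

names_df = df.select(
    col("customer_id"),
    col("fullname.firstname").alias("firstname"),
    col("fullname.lastname").alias("lastname")
)
names_df.show()

This works the same way regardless of whether the schema came from a DDL string or from StructType objects.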
Approach 2: Using StructType and StructField
The second approach involves defining the schema using StructType and StructField objects. This provides more flexibility and allows you to define more complex schemas. Here's an example:
from pyspark.sql.types import StructType, StructField, LongType, StringType
customer_schema = StructType([
    StructField("customer_id", LongType()),
    StructField("fullname", StructType([
        StructField("firstname", StringType()),
        StructField("lastname", StringType())
    ])),
    StructField("city", StringType())
])
df = spark.read.format("json").schema(customer_schema).load("/path/to/data")
In this code, customer_schema defines the same schema as before, but using StructType and StructField objects. The schema(customer_schema) method applies this schema to the data.
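One concrete way this approach is more flexible: each StructField also accepts an optional nullable flag and a metadata dictionary. A minimal sketch, assuming you want customer_id to be non-nullable and want to attach a comment to the city field (the metadata keys are just illustrative):

from pyspark.sql.types import StructType, StructField, LongType, StringType

strict_schema = StructType([
    StructField("customer_id", LongType(), nullable=False),
    StructField("fullname", StructType([
        StructField("firstname", StringType()),
        StructField("lastname", StringType())
    ])),
    StructField("city", StringType(), metadata={"comment": "customer city"})
])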
Both of these approaches let you work with nested data in Spark. DDL strings are concise and readable for simple schemas, while StructType and StructField give you programmatic control when schemas are complex or built dynamically; the best choice depends on your specific needs and the complexity of your data.