Spark Tidbits - Lesson 8


File formats supported by the Apache Spark engine can be classified as either strong or weak in nature. A format like Apache Parquet is considered strong since column names and data types are stored in the file itself. On the other hand, a format like Comma Separated Values (CSV) is weak since the file has to be read completely to infer the data types, and the header row only supplies the column names.


Today, we are going to talk about how to supply a schema to the spark.read method to reduce the processing time for these weak file formats.


To start, we need a simple dataset to work with. If you have not experimented with Machine Learning, you should look at the Kaggle website. It hosts competitions that budding Data Scientists can use to learn. The Titanic dataset is perfect for binary classification. Did passenger cabin location and gender play a role in surviving the disaster?


Please download the sample files from the Kaggle website: train - actual passenger survival, test - passengers with unknown survival, and score - each passenger classified as survived or deceased. The accuracy of the model is the number of correct classifications in the scoring file.


Please notice I created a new lakehouse called lh_titanic and uploaded the three files to the Files section. The first code example reads in the training file. Because a schema is not provided, and the inferSchema option is left at its default, the Spark engine treats every column as a string.

#
# 1 - read csv file without a schema
#

# read in file, no schema supplied
df1 = spark.read.format("csv").option("header", "true").option("delimiter", ",").load("Files/train.csv")

# show top 10 records
display(df1.limit(10))
        


Every column has come back as a string, even the obviously numeric ones. How can we fix this issue?
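One quick fix is to turn on the inferSchema option, which asks the engine to scan the data and guess the types. The sketch below (the df1b name is just for illustration) shows the idea, but that extra pass over the file is exactly the cost we want to avoid on large files. A better approach is to supply the schema ourselves.

#
# 1b - let the engine infer the types (extra pass over the file)
#

# read in file, ask the engine to scan the data and guess the types
df1b = spark.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").load("Files/train.csv")

# confirm the inferred column types
df1b.printSchema()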


Microsoft Fabric supports Scala, Python, SQL and R as languages. Regardless of flavor, all code is ultimately executed on the Java Virtual Machine. The second code example uses the data types from the Spark SQL library to build a StructType object from a list of StructField entries. These structures can be simple or nested. This schema variable tells the engine the column names and column data types. See the documentation for details.

#
# 2 - read using schema (structures)
#

# load libs
from pyspark.sql.types import *
from pyspark.sql.functions import *

# define the schema as a StructType of StructField entries
objSchema = StructType([
    StructField("PassengerId", IntegerType()),
    StructField("Survived", IntegerType()),
    StructField("Pclass", IntegerType()),
    StructField("Name", StringType()),
    StructField("Sex", StringType()),  
    StructField("Age", IntegerType()),  
    StructField("SibSp", StringType()),
    StructField("Parch", StringType()),      
    StructField("Ticket", StringType()),      
    StructField("Fare", FloatType()),      
    StructField("Cabin", StringType()),
    StructField("Embarked", StringType())
    ])


# read in file using the supplied schema
df2 = spark.read.format("csv").schema(objSchema).option("header", "true").option("delimiter", ",").load("Files/train.csv")

# show top 10 records
display(df2.limit(10))        

Please note that many of the type issues have been fixed! Can you find the two columns that can be changed to integer data types?

Like many things in Apache Spark, there are multiple ways to accomplish the same task. Instead of building a StructType object in Python, we can define a string that describes the columns using SQL data definition language (DDL) syntax. Again, the schema method on the read operation is used to supply the column definitions.

#
# 3 - read using schema (SQL DDL)
#

# load libs
from pyspark.sql.types import *
from pyspark.sql.functions import *

# SQL DDL as string
strSchema = " PassengerId INTEGER, Survived INTEGER, Pclass INTEGER, Name STRING, Sex STRING, Age INTEGER, SibSp STRING, Parch STRING, Ticket STRING, Fare FLOAT, Cabin STRING, Embarked STRING "

# read in file using the DDL schema
df3 = spark.read.format("csv").schema(strSchema).option("header", "true").option("delimiter", ",").load("Files/train.csv")

# show top 10 records
display(df3.limit(10))        

The same results are shown using this new coding pattern.

If you have not figured it out yet, the SibSp and Parch columns can be stored as integers instead of strings.
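As a quick sketch (the strSchema2 name is just for illustration), the corrected DDL string simply swaps STRING for INTEGER on those two columns; re-reading the file picks up the fix before we save the data.

#
# 3b - corrected schema with SibSp and Parch as integers
#

# DDL string with the two columns promoted to integers
strSchema2 = "PassengerId INTEGER, Survived INTEGER, Pclass INTEGER, Name STRING, Sex STRING, Age INTEGER, SibSp INTEGER, Parch INTEGER, Ticket STRING, Fare FLOAT, Cabin STRING, Embarked STRING"

# re-read the training file with the corrected schema
df3 = spark.read.format("csv").schema(strSchema2).option("header", "true").option("delimiter", ",").load("Files/train.csv")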

The last step is to save the dataframe as a table. One way is to use the saveAsTable method, which creates what we call a managed table; a quick sketch of this appears after the next code block. Another way is to save the dataframe as a delta file to a path, which allows for both managed and unmanaged tables. Since the path below uses the Tables section of the lakehouse, the new table is considered managed.

#
#  4 - delta files show up as tables
#

df3.write.format("delta").save("Tables/test_data")        
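
For comparison, here is a rough sketch of the saveAsTable method mentioned above; the train_data table name is just an example.

#
#  4b - managed table via saveAsTable
#

# write the dataframe as a managed delta table
df3.write.format("delta").mode("overwrite").saveAsTable("train_data")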

The very last step is to validate our data. This can be done by previewing the table in the lakehouse. We can see that the delta table has 891 rows.
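
A quick row count is another sanity check; this sketch reads the delta folder back and should print 891 for the training file.

#
#  5 - validate the row count
#

# read the delta table back and count the rows (expecting 891)
print(spark.read.format("delta").load("Tables/test_data").count())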


To recap, Spark supports reading both weak and strong file formats. With a weak format, the engine has to spend time reading the whole file if we ask it to infer data types. With small files, the impact of this operation is negligible. For large files, it can be quite time consuming.


Spark allows the developer to speed up the reading of weakly formatted files by supplying a schema. Today, we learnt how to define the schema using predefined data types to build a StructType object and using a SQL data definition language string. Both methods accomplish the same result.


Last but not least, Microsoft Fabric prefers managed tables. This type of table is exposed through the SQL analytics endpoint. Right now, unmanaged tables do not show up for end users to query. Please see my article on SQL Server Central for all the details.



