Spark Tidbits - Lesson 8


File formats supported by the Apache Spark engine can be classified as either strong or weak in nature. A format like Apache Parquet is considered strong since column names and data types are stored in the file itself. On the other hand, a format like Comma Separated Values (CSV) is weak since the file has to be read completely to infer the data types, and the header row only supplies the column names.


Today, we are going to talk about how to supply a schema to the spark.read method to reduce the processing time for these weak file formats.


To start, we need a simple dataset to work with. If you have not experimented with Machine Learning, you should look at the Kaggle website. It hosts competitions that budding Data Scientists can use to learn. The Titanic dataset is perfect for binary classification. Did passenger cabin location and gender play a role in surviving the disaster?


Please download the sample files from the Kaggle website: train - actual passenger survival, test - passengers with unknown survival, and score - each passenger classified as survived or deceased. The accuracy of the model is the number of correct classifications in the scoring file.


Please notice I created a new lakehouse called lh_titanic and uploaded the three files to the Files section. The first code example reads in the training file. Because a schema is not provided, and the inferSchema option is left at its default, the Spark engine treats every column as a string.

#
# 1 - read csv file without a schema
#

# read in file, no schema supplied
df1 = spark.read.format("csv").option("header", "true").option("delimiter", ",").load("Files/train.csv")

# show top 10 records
display(df1.limit(10))
        


Every column has come back as a string, even the obviously numeric ones. How can we fix this issue?
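One quick fix is to turn on the inferSchema option, which asks the engine to scan the data and guess the types. The sketch below (the df1b name is just for illustration) shows the idea, but that extra pass over the file is exactly the cost we want to avoid on large files. A better approach is to supply the schema ourselves.

#
# 1b - let the engine infer the types (extra pass over the file)
#

# read in file, ask the engine to scan the data and guess the types
df1b = spark.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").load("Files/train.csv")

# confirm the inferred column types
df1b.printSchema()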


Microsoft Fabric supports Scala, Python, SQL and R as languages. Regardless of flavor, all code is ultimately executed on the Java Virtual Machine. The second code example uses the data types from the Spark SQL library to build a StructType object from a list of StructField entries. These structures can be simple or nested. This schema variable tells the engine the column names and column data types. See the documentation for details.

#
# 2 - read using schema (structures)
#

# load libs
from pyspark.sql.types import *
from pyspark.sql.functions import *

# define the schema as a StructType of StructField entries
objSchema = StructType([
    StructField("PassengerId", IntegerType()),
    StructField("Survived", IntegerType()),
    StructField("Pclass", IntegerType()),
    StructField("Name", StringType()),
    StructField("Sex", StringType()),  
    StructField("Age", IntegerType()),  
    StructField("SibSp", StringType()),
    StructField("Parch", StringType()),      
    StructField("Ticket", StringType()),      
    StructField("Fare", FloatType()),      
    StructField("Cabin", StringType()),
    StructField("Embarked", StringType())
    ])


# read in file using the supplied schema
df2 = spark.read.format("csv").schema(objSchema).option("header", "true").option("delimiter", ",").load("Files/train.csv")

# show top 10 records
display(df2.limit(10))        

Please note that many of the type issues have been fixed! Can you find the two columns that can be changed to integer data types?

Like many things in Apache Spark, there are multiple ways to accomplish the same task. Instead of building a StructType object in Python, we can define a string that describes the columns using SQL data definition language (DDL) syntax. Again, the schema method on the read operation is used to supply the column definitions.

#
# 3 - read using schema (SQL DDL)
#

# load libs
from pyspark.sql.types import *
from pyspark.sql.functions import *

# SQL DDL as string
strSchema = " PassengerId INTEGER, Survived INTEGER, Pclass INTEGER, Name STRING, Sex STRING, Age INTEGER, SibSp STRING, Parch STRING, Ticket STRING, Fare FLOAT, Cabin STRING, Embarked STRING "

# read in file using the DDL schema
df3 = spark.read.format("csv").schema(strSchema).option("header", "true").option("delimiter", ",").load("Files/train.csv")

# show top 10 records
display(df3.limit(10))        

The same results are shown using this new coding pattern.

If you have not figured it out yet, the SibSp and Parch columns can be stored as integers instead of strings.
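As a quick sketch (the strSchema2 name is just for illustration), the corrected DDL string simply swaps STRING for INTEGER on those two columns; re-reading the file picks up the fix before we save the data.

#
# 3b - corrected schema with SibSp and Parch as integers
#

# DDL string with the two columns promoted to integers
strSchema2 = "PassengerId INTEGER, Survived INTEGER, Pclass INTEGER, Name STRING, Sex STRING, Age INTEGER, SibSp INTEGER, Parch INTEGER, Ticket STRING, Fare FLOAT, Cabin STRING, Embarked STRING"

# re-read the training file with the corrected schema
df3 = spark.read.format("csv").schema(strSchema2).option("header", "true").option("delimiter", ",").load("Files/train.csv")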

The last step is to save the dataframe as a table. One way is to use the saveAsTable method, which creates what we call a managed table; a quick sketch of this appears after the next code block. Another way is to save the dataframe as a delta file to a path, which allows for both managed and unmanaged tables. Since the path below uses the Tables section of the lakehouse, the new table is considered managed.

#
#  4 - delta files show up as tables
#

df3.write.format("delta").save("Tables/test_data")        
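
For comparison, here is a rough sketch of the saveAsTable method mentioned above; the train_data table name is just an example.

#
#  4b - managed table via saveAsTable
#

# write the dataframe as a managed delta table
df3.write.format("delta").mode("overwrite").saveAsTable("train_data")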

The very last step is to validate our data. This can be done by previewing the table in the lakehouse. We can see that the delta table has 891 rows.
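
A quick row count is another sanity check; this sketch reads the delta folder back and should print 891 for the training file.

#
#  5 - validate the row count
#

# read the delta table back and count the rows (expecting 891)
print(spark.read.format("delta").load("Tables/test_data").count())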


To recap, Spark supports reading both weak and strong file formats. With a weak format, the engine has to spend time reading the whole file if we ask it to infer data types. With small files, the impact of this operation is negligible. For large files, it can be quite time consuming.


Spark allows the developer to speed up the reading of weakly formatted files by supplying a schema. Today, we learnt how to define the schema using predefined data types to build a StructType object and using a SQL data definition language string. Both methods accomplish the same result.


Last but not least, Microsoft Fabric prefers managed tables. This type of table is exposed through the SQL analytics endpoint. Right now, unmanaged tables do not show up for end users to query. Please see my article on SQL Server Central for all the details.



