Understanding file formats within the Fabric Lakehouse
I am looking forward to talking to the Cloud Data Driven user group on March 13th. You can find all the presentation materials in my GitHub repo.
Microsoft Fabric has OneLake Storage at the center of all services. Storage is based upon existing Azure Data Lake Storage and can be accessed with tools that you are familiar with. Many different file formats have been used over time. Understanding the pros and cons of each file type is important.
I will be using several different datasets during the talk: zip files, stock data, earthquake data, NASA website data, and the Fisher Iris dataset.
We will start our exploration with the CSV format. However, there are many other formats that you might encounter in real life. Web services typically use a JSON document as the input and/or output of REST API calls. Apache Software Foundation projects produced three different file formats: AVRO shines at data serialization for RPC calls, ORC is suited for Hadoop processing, and PARQUET is optimized for Spark processing.
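To make one CSV versus JSON trade-off concrete, here is a minimal sketch using only the Python standard library. The sample rows and the helper names (`roundtrip_csv`, `roundtrip_json`) are hypothetical and are not taken from the talk's datasets. It shows that CSV carries no type information, so every value comes back as a string, while JSON preserves numbers.

```python
import csv
import io
import json

# Hypothetical sample rows; the values are illustrative only.
rows = [
    {"symbol": "AAA", "close": 10.5, "volume": 1200},
    {"symbol": "BBB", "close": 20.25, "volume": 3400},
]


def roundtrip_csv(records):
    """Write the records to CSV text, then read them back."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


def roundtrip_json(records):
    """Serialize the records to JSON text, then parse them back."""
    return json.loads(json.dumps(records))


csv_rows = roundtrip_csv(rows)
json_rows = roundtrip_json(rows)

# CSV returns strings for every column; JSON keeps the numeric types.
print(type(csv_rows[0]["close"]).__name__)
print(type(json_rows[0]["close"]).__name__)
```

This is why a CSV reader usually needs either a supplied schema or a schema inference pass before the data is useful for analysis.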
There are edge cases in which a file is in a special format. One can always use the TEXT format to read the raw lines and parse out the data. None of the above formats supports the ACID properties of a database.
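As a rough illustration of parsing a special format from plain text, the sketch below applies a regular expression to a single web server log line. It assumes the line follows the Common Log Format; the sample line, `LOG_PATTERN`, and `parse_line` are hypothetical and are not the talk's actual data or code.

```python
import re

# Assumed layout (Common Log Format):
# host ident user [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A dash means the server did not report a response size.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical sample line for illustration only.
sample = (
    'host.example.com - - [01/Jul/1995:00:00:01 -0400] '
    '"GET /index.html HTTP/1.0" 200 1024'
)
print(parse_line(sample))
```

In Spark, the raw lines would typically come from `spark.read.text`, which returns one row per line in a single `value` column; the same pattern could then be applied to that column.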
That is why Databricks developed the DELTA file format, which was open sourced in 2019. This format is the foundation of most files in OneLake. The Fabric Lakehouse is built on Apache Spark.
One can read and write all of these file formats using Spark dataframes. Additionally, one can create either managed (INTERNAL) or unmanaged (EXTERNAL) tables in the Hive catalog. Only managed tables are accessible by the SQL endpoint at this time, so they should be used in most cases. One cool feature is the shorthand notation in Spark SQL that reads a file format directly from a directory. This can be used as the input to a create table as select (CTAS) statement to create managed tables.
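The shorthand notation and the CTAS statement can be sketched as follows. This is a minimal example under stated assumptions: the table name `stocks`, the directory `Files/raw/stocks`, and both helper functions are hypothetical, and `spark` is assumed to be the session object that a Fabric notebook already provides. The `USING DELTA` clause is written out explicitly rather than relying on a default table format.

```python
def ctas_from_directory(table_name, file_format, directory):
    """Build a CTAS statement that reads one file format from a directory.

    The `format`.`path` form in the FROM clause is the Spark SQL shorthand
    for querying files directly.
    """
    return (
        f"CREATE TABLE {table_name} USING DELTA AS "
        f"SELECT * FROM {file_format}.`{directory}`"
    )


def create_managed_table(spark, table_name, file_format, directory):
    """Run the CTAS statement on an existing Spark session (assumed)."""
    spark.sql(ctas_from_directory(table_name, file_format, directory))
    return spark.table(table_name)


# Hypothetical table name and directory, for illustration only.
statement = ctas_from_directory("stocks", "parquet", "Files/raw/stocks")
print(statement)
```

Because no `LOCATION` clause is given, the resulting table is managed by the catalog rather than pointing at external files.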
As a bonus round if time permits, we will talk about how to read from a PostgreSQL database and write to delta tables using the pubs database.
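A hedged sketch of that bonus round might look like the following. It assumes the PostgreSQL JDBC driver is available to the Spark runtime and that `spark` is an existing session; the host, credentials, the `public.authors` table, and both helper functions are hypothetical placeholders rather than the talk's actual setup.

```python
def postgres_jdbc_options(host, port, database, table, user, password):
    """Assemble the options for Spark's built-in JDBC data source."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def copy_table_to_delta(spark, options, target_table):
    """Read one table over JDBC and save it as a managed delta table."""
    df = spark.read.format("jdbc").options(**options).load()
    df.write.format("delta").mode("overwrite").saveAsTable(target_table)
    return df


# Hypothetical connection details, for illustration only.
options = postgres_jdbc_options(
    "localhost", 5432, "pubs", "public.authors", "reader", "secret"
)
print(options["url"])
```

In practice the password would come from a secret store rather than being written into the notebook.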
At the end of this talk, the developer will have a full understanding of all the file formats that can be managed by Fabric.
Mark your calendars and RSVP today!
3 weeks ago: Thanks for sharing, John Miner! Can't wait for this session! Save the date and join the Data Driven Community!