Understanding file formats within the Fabric Lakehouse


I am looking forward to talking to the Cloud Data Driven user group on March 13th. You can find all the presentation materials in my GitHub repo.


Microsoft Fabric has OneLake storage at the center of all its services. OneLake is built on existing Azure Data Lake Storage and can be accessed with tools that you are already familiar with. Many different file formats have been used over time, and understanding the pros and cons of each one is important.
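For example, since OneLake exposes the same ABFS endpoint pattern as ADLS Gen2, a Spark dataframe can point at it with a familiar URI. This is a minimal sketch; the workspace and lakehouse names are made up for illustration.

```python
# A hypothetical OneLake path; the workspace and lakehouse names are made up.
# OneLake exposes the same ABFS endpoint pattern as ADLS Gen2, so Spark and
# other ADLS-aware tools can address it directly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # predefined in a Fabric notebook

onelake_path = (
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
    "MyLakehouse.Lakehouse/Files/raw/"
)
df = spark.read.format("csv").option("header", "true").load(onelake_path)
```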


I will be using several different datasets during the talk: zip files, stock data, earthquake data, NASA website data, and the Fisher Iris dataset.


We will start our exploration with the CSV format. However, there are many other formats that you might encounter in real life. Web services typically use a JSON document as the input and/or output of REST API calls. The Apache Software Foundation contributed three more file formats: AVRO shines at data serialization for RPC calls, ORC is suited for Hadoop processing, and PARQUET is optimized for Spark processing.
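To give a taste of the syntax, here is a minimal PySpark sketch that reads each of these formats with the same dataframe reader. The folder paths are placeholders, not the actual demo files.

```python
# Minimal reads of each format with the same dataframe API; the Files/ paths
# are placeholders. The spark session is predefined in a Fabric notebook.
df_csv  = spark.read.format("csv").option("header", "true").load("Files/stocks/")
df_json = spark.read.format("json").load("Files/earthquakes/")
df_avro = spark.read.format("avro").load("Files/avro/")   # needs spark-avro on some runtimes
df_orc  = spark.read.format("orc").load("Files/orc/")
df_parq = spark.read.format("parquet").load("Files/parquet/")

# Writing uses the same format() switch.
df_csv.write.format("parquet").mode("overwrite").save("Files/stocks_parquet/")
```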


There are edge cases in which a file arrives in a special format. One can always use the TEXT format to read the raw lines and parse out the data. Note that none of the formats above support the ACID properties of a database.
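Here is a small sketch of that technique, assuming a pipe-delimited file; the path and delimiter are made up for illustration.

```python
# Read an oddly delimited file as TEXT, one row per line, then parse the
# columns by hand. The path and the pipe delimiter are assumptions.
from pyspark.sql.functions import split, col

raw = spark.read.format("text").load("Files/special/report.txt")
parsed = (raw
    .select(split(col("value"), r"\|").alias("parts"))
    .select(col("parts")[0].alias("id"),
            col("parts")[1].alias("name")))
parsed.show()
```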


That is why Databricks developed the DELTA file format, which was open sourced in 2019. This format is the foundation of most files in OneLake. The Fabric Lakehouse itself is an implementation of Apache Spark.
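A minimal sketch of writing and reading a delta table, reusing the dataframe from the earlier example; the Tables/ path is an assumption.

```python
# Write a dataframe as DELTA and read it back; the transaction log is what
# adds the ACID guarantees. The Tables/ path is an assumption.
df_parq.write.format("delta").mode("overwrite").save("Tables/stocks")

df_delta = spark.read.format("delta").load("Tables/stocks")
spark.sql("DESCRIBE HISTORY delta.`Tables/stocks`").show()   # inspect the log
```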


One can read and write all of these file formats using Spark dataframes. Additionally, one can create either managed (INTERNAL) or unmanaged (EXTERNAL) tables in the Hive catalog. Only managed tables are accessible by the SQL endpoint at this time, so they should be used in most cases. One cool feature is the shorthand notation in Spark SQL that reads a given file format directly from a directory. This can be used as the input to a CREATE TABLE AS SELECT (CTAS) statement to create managed tables, as shown in the sketch below.
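A quick sketch of both features, with a placeholder directory and table name:

```python
# The shorthand format.`directory` reads files straight from a folder, and
# CTAS turns the result into a managed table. Names and paths are placeholders.
spark.sql("SELECT * FROM parquet.`Files/stocks_parquet/`").show(5)

spark.sql("""
    CREATE TABLE IF NOT EXISTS stock_data AS
    SELECT * FROM parquet.`Files/stocks_parquet/`
""")
```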


As a bonus round, if time permits, we will talk about how to read from a PostgreSQL database and write to delta tables using the pubs database.
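If you want a preview, here is a rough sketch of that pattern; the server name, credentials, and table name are placeholder assumptions, and the PostgreSQL JDBC driver must be on the cluster classpath.

```python
# Read the authors table from the pubs database over JDBC, then land it as a
# managed delta table. Server, user, and password are placeholder assumptions.
jdbc_url = "jdbc:postgresql://myserver.postgres.database.azure.com:5432/pubs"

df_authors = (spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "authors")
    .option("user", "dbadmin")
    .option("password", "<secret>")
    .option("driver", "org.postgresql.Driver")
    .load())

df_authors.write.format("delta").mode("overwrite").saveAsTable("authors")
```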


At the end of this talk, the developer will have a full understanding of all the file formats that can be managed by Fabric.

Mark your calendars and RSVP today!
