
Understanding file formats within the Fabric Lakehouse

John Miner

Data Architect at Insight

Published: February 10, 2025

I am looking forward to talking to the Cloud Data Driven user group on March 13th. You can find all the presentation materials in my GitHub repo.

Microsoft Fabric places OneLake storage at the center of all its services. This storage is based upon the existing Azure Data Lake Storage and can be accessed with tools that you are already familiar with. Many different file formats have come into use over time, and understanding the pros and cons of each file type is important.

I will be using several different datasets during the talk: zip files, stock data, earthquake data, NASA website data, and the Fisher Iris dataset.

We will start our exploration with the CSV format. However, there are many other formats that you might encounter in real life. Web services typically use a JSON document as the input and/or output of REST API calls. Apache Software Foundation projects came up with three different file formats: AVRO shines at data serialization for RPC calls, ORC is suited for Hadoop processing, and PARQUET is optimized for Spark processing.
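One practical difference between the first two formats is how they carry data types. The sketch below uses only the Python standard library (not Spark) and a made-up stock record to show that a CSV round trip returns every value as text, while a JSON round trip keeps numbers as numbers. The record and its field names are assumptions for illustration and are not taken from the talk's datasets.

```python
import csv
import io
import json

# Hypothetical sample record, not taken from the talk's datasets.
record = {"symbol": "MSFT", "close": 410.5, "volume": 1200}

# Round trip through CSV: the format stores text only, so types are lost.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
csv_row = next(csv.DictReader(buffer))

# Round trip through JSON: numbers and strings keep their types.
json_row = json.loads(json.dumps(record))

print(csv_row)   # every value is a string
print(json_row)  # close is a float, volume is an int
```

This is why a CSV reader usually needs either an explicit schema or a type inference pass, whereas self-describing formats such as AVRO, ORC, and PARQUET store the schema with the data.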

There are edge cases in which a file is in a special format. In those cases, one can always fall back to the TEXT format and parse out the data by hand. None of the above formats supports the ACID properties of a database.
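The TEXT fallback mentioned above amounts to reading each line as a raw string and extracting the fields yourself. Here is a minimal sketch in plain Python, assuming a hypothetical line layout of host, bracketed timestamp, status code, and byte count. The layout, the pattern, and the sample line are all invented for illustration and may not match any dataset used in the talk.

```python
import re

# Hypothetical line layout: <host> [<timestamp>] <status> <bytes>
LINE_PATTERN = re.compile(
    r"^(?P<host>\S+) \[(?P<timestamp>[^\]]+)\] (?P<status>\d{3}) (?P<bytes>\d+|-)$"
)


def parse_line(line):
    """Return a dict of typed fields, or None when the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # In this made-up layout, a dash means that no bytes were sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields


sample = "example.org [01/Jul/1995:00:00:01] 200 6245"
parsed = parse_line(sample)
bad = parse_line("not a log line")

print(parsed)
print(bad)
```

In a Spark notebook the same idea would likely start from spark.read.text, which yields one string column named value, followed by regular expression functions to split that column into typed fields.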


Because these formats lack ACID guarantees, Databricks developed the DELTA file format, which was open sourced in 2019. This format is the foundation of most files in OneLake. The Fabric Lakehouse is an implementation of Apache Spark.

One can read and write all of these file formats using Spark dataframes. Additionally, one can create either managed (INTERNAL) or unmanaged (EXTERNAL) tables in the hive catalog. Only managed tables are accessible by the SQL endpoint at this time, so they should be used in most cases. One cool feature is the shorthand notation in Spark SQL that reads a file format given a directory. This can be used as input to the create table as select (CTAS) statement to create managed tables.
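The shorthand notation and the CTAS statement described above can be sketched as two small Python helpers that build the Spark SQL text. The helper names, the table name iris_managed, and the path Files/raw/iris are hypothetical. The USING DELTA clause is an assumption added to make the table format explicit, and the statement has no LOCATION clause, which is what makes the resulting table managed.

```python
def shorthand_select(file_format, directory):
    """Build a query that reads a directory with the format.`path` shorthand."""
    return f"SELECT * FROM {file_format}.`{directory}`"


def ctas_statement(table_name, file_format, directory):
    """Build a CTAS statement that loads the directory into a managed delta table."""
    query = shorthand_select(file_format, directory)
    return f"CREATE TABLE {table_name} USING DELTA AS {query}"


# Hypothetical table name and Lakehouse path, chosen only for illustration.
sql_text = ctas_statement("iris_managed", "parquet", "Files/raw/iris")
print(sql_text)

# In a notebook with an active Spark session, the statement would be run as:
# spark.sql(sql_text)
```

Building the text separately from running it keeps the sketch testable without a Spark session. The same helpers accept other format names, such as csv, json, avro, or orc, provided the runtime has a reader for that format.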

As a bonus round if time permits, we will talk about how to read from a PostgreSQL database and write to delta tables using the pubs database.
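A read from PostgreSQL usually goes through the Spark JDBC data source, which is configured with a handful of options. The sketch below builds those options in plain Python. The host, port, credentials, and the public.authors table name are placeholders assumed for illustration, and the commented Spark calls assume that the PostgreSQL JDBC driver is available to the session.

```python
def postgres_jdbc_options(host, port, database, table, user, password):
    """Return the option dictionary for the Spark JDBC data source."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "driver": "org.postgresql.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


# Placeholder connection details; replace them with real values and keep
# the password in a secret store rather than in the notebook.
options = postgres_jdbc_options(
    "localhost", 5432, "pubs", "public.authors", "reader", "<secret>"
)
print(options["url"])

# In a notebook with an active Spark session, the copy would look like:
# df = spark.read.format("jdbc").options(**options).load()
# df.write.format("delta").mode("overwrite").saveAsTable("authors")
```

Writing with saveAsTable and no explicit path produces a managed table, which matches the earlier recommendation for tables that need to be visible to the SQL endpoint.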

At the end of this talk, the developer will have a full understanding of all the file formats that can be managed by Fabric.

