Spark Tidbits - Lesson 10
Apache Spark was designed to execute Java byte code programs on a cluster of general purpose computers. Most cloud implementations use Linux virtual machines as the nodes in the cluster. Therefore, it is not surprising that mounting storage and working with folders and files is supported by the utility libraries. In Azure Databricks, use the dbutils library; for the Microsoft products (Synapse and Fabric), use the mssparkutils library.
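The two libraries expose very similar file system helpers. Just for orientation, here is a rough mapping of the calls used in this lesson (method names only, not an exhaustive list):
# Azure Databricks -- storage helpers hang off dbutils, for example:
#   dbutils.fs.mounts(), dbutils.fs.ls(path), dbutils.fs.cp(src, dst), dbutils.fs.mount(src, mount_point)

# Synapse and Fabric -- the equivalents hang off mssparkutils:
#   mssparkutils.fs.mounts(), mssparkutils.fs.ls(path), mssparkutils.fs.cp(src, dst), mssparkutils.fs.mount(src, mount_point)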
Today, we are going to work with Microsoft Fabric. Let's list out any existing mounts using the code below.
#
# 1 - existing mounts
#
mssparkutils.fs.mounts()
The default mount gives us access to the lakehouse files and tables. There is also a notebook working directory mount.
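To see where the default mount lands on the driver's local file system, resolve it with the getMountPath method (a quick sketch):
# resolve the default lakehouse mount to its session-local path
default_path = mssparkutils.fs.getMountPath("/default")
print(default_path)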
Right now, I have the weather data loaded in the Files directory. See the image below, which uses the local path with a shell command to list out the folders and files.
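Here is a rough Python equivalent of that shell listing, built on the local mount path:
import os

# build the local (mounted) path to the Files folder of the default lakehouse
local_path = f"{mssparkutils.fs.getMountPath('/default')}/Files"

# list the folders and files, just like the shell command in the image
for entry in os.listdir(local_path):
    print(entry)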
We can create custom mounts to both other lakehouses and Azure Data Lake Storage. The code below uses the workspace and lakehouse ids to mount the Adventure Works lakehouse.
#
# 3 - mount lakehouse in same workspace
#
mssparkutils.fs.mount(
    "abfss://a668a328-9f67-4678-93f2-10d5afdfe3ad@onelake.dfs.fabric.microsoft.com/0416e287-2a33-4093-8fa6-5f46d7e660d5",
    "/advwrks"
)
Please note that mounts only exist for the duration of a session. See the local path details shown below.
Use the getMountPath method of the file system (fs) class to retrieve the fully qualified path when required. The code below shows that only the /raw/saleslt directory exists under the Files folder in the Adventure Works lakehouse. We want to create a new directory called weather and copy the files from the current lakehouse over to the new storage folder.
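Since that step only appears as a screenshot, here is a minimal sketch of the listing it performs, reusing the /advwrks mount:
#
# 4 - list the Adventure Works Files folder (sketch)
#

# resolve the mount to a fully qualified local path
advwrks_files = f"file://{mssparkutils.fs.getMountPath('advwrks')}/Files"

# list the Files folder -- only the raw folder (holding saleslt) exists at this point
for item in mssparkutils.fs.ls(advwrks_files):
    print(item.path)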
The above code was part of step 4 in my learning notebook. Step 5 creates the new folder and executes the file copy.
#
# 5 - create dir + copy files between lakehouses
#
# get file list
files = mssparkutils.fs.ls(f"file://{mssparkutils.fs.getMountPath('/default')}/Files")
print(files)
# create new directory
mssparkutils.fs.mkdirs(f"file://{mssparkutils.fs.getMountPath('advwrks')}/Files/raw/weather")
# copy over files
for file in files:
    print(file.path)
    src = file.path
    dst = f"file://{mssparkutils.fs.getMountPath('advwrks')}/Files/raw/weather/"
    mssparkutils.fs.cp(src, dst)
# show the resulting files
mssparkutils.fs.ls(f"file://{mssparkutils.fs.getMountPath('advwrks')}/Files/raw/weather")
The image below shows the three files in the source location.
For some reason, the library created a Cyclic Redundancy Check (.crc) file for each csv or text file; most likely this is the Hadoop local file system writing a checksum companion when files are copied through the file:// scheme. The image below shows six files in the weather directory.
Microsoft Fabric introduced the concept of shortcuts, which are created at the lakehouse object level. However, other Apache Spark implementations use the mount method. When mounting Azure Data Lake Storage, Microsoft Fabric can authenticate with either an account key or a shared access signature (SAS).
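For reference, the account key flavor follows the same pattern as the SAS mount shown below. The container, storage account, and key in this sketch are placeholders:
# placeholder values -- substitute your own storage account, container, and key
account_key = "<storage-account-key>"

# mount ADLS Gen2 using an account key instead of a SAS token
mssparkutils.fs.mount(
    "abfss://<container>@<storage-account>.dfs.core.windows.net",
    "/adls",
    {"accountKey": account_key}
)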
Because a SAS token has a limited scope, I went that route and stored the SAS token as a secret in an existing key vault named kvs4tips2021.
The credentials class has a getSecret method. The code below attempts to read in the secret and mount the ADLS storage.
#
# 6 - mount adls storage, copy files to lakehouse
#
token = mssparkutils.credentials.getSecret("https://kvs4tips2021.vault.azure.net/", "sec-sas-key-4-adls")
# replace <container> and <storage-account> with your ADLS Gen2 container and account
mssparkutils.fs.mount(
    "abfss://<container>@<storage-account>.dfs.core.windows.net",
    "/adls",
    {"sasToken": token}
)
By default, the notebook runs under the identity of the current session user. In my case, that user does not have access to the Azure Key Vault. Please see the HTTP 403 error code below.
Once access has been granted to the user, the Python code works just fine. The image below shows the new mount point called adls.
If we take a look at the storage folder, we can see that there are five Delta tables stored as subfolders.
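We can confirm that from code as well. This sketch lists the stocks folder on the new mount:
# list the Delta table folders under the stocks directory on the ADLS mount
for item in mssparkutils.fs.ls(f"file://{mssparkutils.fs.getMountPath('adls')}/stocks"):
    print(item.path)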
The following code uses the copy file command with the recurse option.
# copy stocks directory
src = f"file://{mssparkutils.fs.getMountPath('adls')}/stocks"
dst = f"file://{mssparkutils.fs.getMountPath('/default')}/Files"
mssparkutils.fs.cp(src, dst, recurse=True)
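A quick listing of the destination confirms the copy (sketch):
# confirm the stocks folder now exists under Files in the default lakehouse
mssparkutils.fs.ls(f"file://{mssparkutils.fs.getMountPath('/default')}/Files/stocks")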
Using the lakehouse explorer, one can see that the five folders have been copied to a subdirectory called stocks within our sample lakehouse named lh_mounts_n_files.
In a nutshell, there is plenty of PySpark code out there that uses utility libraries to mount storage and manage folders and files. Databricks, Fabric, and Synapse all have the ability to read secrets from a key vault. If you have to use credentials, a key vault is the preferred way to centralize and control that information.
Mounting storage in Databricks results in a mount point under the /mnt directory. However, in Fabric and Synapse, the Spark session id is part of the local path. Therefore, use the getMountPath method to convert the mount name into a fully qualified path.
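In code, the difference looks roughly like this (the Databricks line is shown only as a comment for comparison):
# Databricks: mounts live under a fixed prefix, e.g. dbutils.fs.ls("/mnt/adls")

# Fabric and Synapse: the local path embeds the Spark session id,
# so resolve it at run time instead of hard coding it
print(mssparkutils.fs.getMountPath("/adls"))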
Next time, I will be talking about how to read and write data to a PostgreSQL database from Microsoft Fabric.