Mount ADLS on Databricks
Naeem Akhtar
BigData Service Core System & Integration Engineering @UBS|AZ-104 |Azure Administrator | Databricks administration, Unity Catalog, Data Factory, Azure and DBK CLI, AWS Certified, Terraform, Automation, Azure DevOps
Azure Data Lake Storage Gen2 (ADLS Gen2) and Databricks stand as robust pillars in the Microsoft Azure ecosystem. The synergy achieved by mounting ADLS Gen2 on Databricks unlocks a plethora of opportunities for streamlined data processing and analytics. This article guides you through the step-by-step process of integrating ADLS Gen2 with Databricks, delving into the benefits and considerations that come with this powerful collaboration.
Let's Mount ADLS Gen2 on Databricks:
Step 1: Set Up Infrastructure.
If your infrastructure is not ready, create a Databricks workspace and a cluster, then create a notebook and attach it to the cluster. If the infrastructure already exists, simply create a new notebook in the workspace and attach it to an existing cluster (a quick sanity check is shown below).
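Once the notebook is attached, a quick sanity check is to list the mounts that already exist in the workspace. A minimal sketch using the standard dbutils API:
# List current mount points; built-in mounts such as /databricks-datasets typically appear
display(dbutils.fs.mounts())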
Step 2: Configure Storage Key and Endpoint
Retrieve the authentication material for your ADLS Gen2 account (a storage account access key or a SAS token) along with the DFS endpoint (<storage_account_name>.dfs.core.windows.net). Store these details in the Databricks workspace, ideally in a secret scope, so they can be used to establish the connection.
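For reference, the configuration key used when mounting depends on which credential you retrieved. A minimal sketch for account-key authentication, assuming the key was stored in a hypothetical secret scope named adls-scope:
# For account-key auth, the <conf-key> embeds the storage account name:
# fs.azure.account.key.<storage_account_name>.dfs.core.windows.net
# Retrieve the key from a secret scope rather than pasting it into the notebook
storage_key = dbutils.secrets.get(scope = "adls-scope", key = "storage-account-key")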
Step 3: Mount ADLS Gen2 on Databricks
Use the Databricks notebook to run commands that mount the ADLS Gen2 storage. Specify the storage account, container, mount point, and the access key.
The provided code snippet utilizes Databricks' dbutils.fs.mount function to mount Azure Data Lake Storage Gen2 (ADLS Gen2) onto a specified mount point within the Databricks environment. Let's break down the components of this command:
# Example mount command
dbutils.fs.mount(
  # ABFSS URI of the container to mount
  source = "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/",
  # Path under /mnt/ where the container will appear
  mount_point = "/mnt/<mount_point>",
  # Credential read from a secret scope instead of being hard-coded
  extra_configs = {"<conf-key>": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")}
)
The secret scope referenced in extra_configs can be one of two types:
1. Azure Key Vault-backed secret scope: used in the example above.
2. Databricks-backed secret scope: to be covered in a later post.
By executing this code, you establish a connection between Databricks and ADLS Gen2, enabling seamless data access and processing within the Databricks environment. This mount operation essentially links the specified ADLS Gen2 container to the specified mount point, allowing users to interact with the data as if it were part of the local file system.
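Once the mount exists, the data can be read and written like any local path. A short usage sketch (the file path below is hypothetical):
# Read a CSV file from the mounted container as if it were on the local file system
df = spark.read.option("header", "true").csv("/mnt/<mount_point>/sales/2024.csv")
display(df)
# Remove the mount when it is no longer needed
dbutils.fs.unmount("/mnt/<mount_point>")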
NOTE: If the ADLS Gen2 account allows public network access, no additional networking setup is required.
But if ADLS Gen2 is restricted to selected networks, you must ensure network connectivity from Databricks to ADLS Gen2, either through a Private Endpoint or a Service Endpoint.
I will cover connecting Databricks to ADLS Gen2 over a selected network in my next post.
Pros of Mounting ADLS Gen2 on Databricks
1. Unified Data Management: Mounted storage appears under /mnt/ like local storage, so data in ADLS Gen2 can be browsed and managed alongside workspace data.
2. Scalability and Performance: Compute (Databricks) and storage (ADLS Gen2) scale independently, so each can grow with the workload.
3. Unified Analytics: The same mounted data is available to every notebook, job, and cluster in the workspace.
4. Consistency Across Environments: All clusters see the same mount point, so code paths stay identical across environments.
Cons of Mounting ADLS Gen2 on Databricks
1. Security: A mount point is visible to every user and cluster in the workspace, and all of them share the credential used to create it, so access cannot be scoped per user.
How to overcome the security issue: authenticate with an Azure AD service principal using OAuth instead of an account key. One option is to set the following Spark configuration on the cluster (replace the placeholders with your values):
spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<secret-scope>/<service-credential-key>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token
Alternatively, set the same OAuth configuration for the current notebook session:
# Pull the service principal's client secret from a secret scope
service_credential = dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>")
# Authenticate to ADLS Gen2 with OAuth via an Azure AD service principal
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")