Mount ADLS on Databricks
https://medium.com/@nakhtar.etc/mount-adls-on-databricks-fe3b54da07be

Azure Data Lake Storage Gen2 (ADLS Gen2) and Databricks are two core services in the Microsoft Azure data ecosystem. Mounting an ADLS Gen2 container on Databricks lets notebooks and jobs read and write lake data through a path under /mnt, which simplifies data processing and analytics. This article walks through the integration step by step and then covers the benefits and the considerations that come with it.

Let's Mount ADLS Gen2 on Databricks:

Step 1: Set Up Infrastructure

If the infrastructure is not ready yet, create a Databricks workspace and a cluster, then create a notebook and attach it to the cluster. If a workspace and cluster already exist, create a new notebook in that workspace and attach it to the existing cluster.

Step 2: Configure Storage Key and Endpoint

Retrieve the credential used for authentication (the storage account access key or a SAS token) and the endpoint of your ADLS Gen2 account. Configure these details in the Databricks workspace, where they will be used to establish the connection.
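As a minimal sketch of this step, assuming the access key has already been stored in a Databricks secret scope, the helpers below derive the account's DFS endpoint and check that the secret is registered before it is used. The function names and the scope and key names in the comments are hypothetical; dbutils.secrets.listScopes and dbutils.secrets.list are the notebook utilities being relied on.

```python
def dfs_endpoint(storage_account: str) -> str:
    """Return the ADLS Gen2 (DFS) endpoint for a storage account."""
    return f"https://{storage_account}.dfs.core.windows.net"


def secret_is_registered(dbutils, scope: str, key: str) -> bool:
    """Check that `key` exists in secret scope `scope`.

    `dbutils` is the object Databricks injects into every notebook;
    it is passed in explicitly so the helper can be tried elsewhere.
    """
    scope_names = {s.name for s in dbutils.secrets.listScopes()}
    if scope not in scope_names:
        return False
    # Only key names are listed here; secret values are never returned.
    return key in {meta.key for meta in dbutils.secrets.list(scope)}


# In a notebook (hypothetical names):
# secret_is_registered(dbutils, "<scope-name>", "<key-name>")
```

Checking first gives a clearer error than a failed mount when the scope or key name is mistyped.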

Step 3: Mount ADLS Gen2 on Databricks

Use the Databricks notebook to run commands that mount the ADLS Gen2 storage. Specify the storage account, container, mount point, and the access key.

The provided code snippet utilizes Databricks' dbutils.fs.mount function to mount Azure Data Lake Storage Gen2 (ADLS Gen2) onto a specified mount point within the Databricks environment. Let's break down the components of this command:

# Example mount command
dbutils.fs.mount(
    source="abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/",
    mount_point="/mnt/<mount_point>",
    extra_configs={
        "<conf-key>": dbutils.secrets.get(scope="<scope-name>", key="<key-name>")
    },
)

  • source: Specifies the source location to be mounted. In this case, it is an abfss:// URI, the TLS-secured scheme of the Azure Blob File System (ABFS) driver, pointing at the specified ADLS Gen2 container. Replace <container_name> and <storage_account_name> with the actual names of your ADLS Gen2 container and storage account, respectively.
  • mount_point: Defines the mount point within the Databricks environment where the ADLS Gen2 will be accessible. Replace <mount_point> with the desired path for mounting.
  • extra_configs: Allows you to provide additional configurations. In this example, it includes a configuration key (denoted as <conf-key>) with a corresponding value retrieved from Databricks secrets. This is useful for securely handling sensitive information such as keys or credentials.
  • dbutils.secrets.get: Fetches the secret associated with the specified scope (<scope-name>) and key (<key-name>). This is a secure way to manage and access sensitive information within Databricks. In the snippet above, the credentials are protected in Azure Key Vault.
  • Databricks secret scopes: There are two types of Databricks secret scope.

1. Azure Key Vault-backed secret scope: used in the snippet above.

2. Databricks-backed secret scope: to be explained in a later post.

By executing this code, you establish a connection between Databricks and ADLS Gen2, enabling seamless data access and processing within the Databricks environment. This mount operation essentially links the specified ADLS Gen2 container to the specified mount point, allowing users to interact with the data as if it were part of the local file system.
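The mount call can be wrapped so that rerunning a notebook does not fail on an existing mount point. This is a sketch, not the only way to do it: the helper names are hypothetical, and it assumes access-key authentication, where <conf-key> is the ABFS driver property fs.azure.account.key.<storage-account>.dfs.core.windows.net.

```python
def build_mount_args(container, storage_account, mount_name, account_key):
    """Assemble the keyword arguments for dbutils.fs.mount.

    Assumption: the mount authenticates with the storage account
    access key, so the config key is the ABFS account-key property.
    """
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        "source": f"abfss://{container}@{host}/",
        "mount_point": f"/mnt/{mount_name}",
        "extra_configs": {f"fs.azure.account.key.{host}": account_key},
    }


def mount_if_absent(dbutils, mount_args):
    """Mount only when the mount point is not already in use.

    Returns True if a new mount was created, False otherwise.
    """
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_args["mount_point"] in existing:
        return False
    dbutils.fs.mount(**mount_args)
    return True


# In a notebook (placeholders as in the article):
# key = dbutils.secrets.get(scope="<scope-name>", key="<key-name>")
# args = build_mount_args("<container_name>", "<storage_account_name>",
#                         "<mount_point>", key)
# mount_if_absent(dbutils, args)
# display(dbutils.fs.ls(args["mount_point"]))
```

When the mount is no longer needed, dbutils.fs.unmount("/mnt/<mount_point>") removes it.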

NOTE: If the ADLS Gen2 storage account allows access from all networks (public access), no separate network setup is required.

If the storage account is restricted to selected networks, network connectivity from Databricks to ADLS Gen2 must be enabled, either through a Private Endpoint or through a Service Endpoint.

I will cover connecting Databricks to ADLS Gen2 over selected networks in my next post.
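Until then, one rough way to see which network path a cluster is using is to resolve the storage endpoint from a notebook. The sketch below assumes that a Private Endpoint makes the account's DFS hostname resolve to a private IP address, while public access and Service Endpoints still resolve to a public one. The helper names are hypothetical, and the check says nothing about firewall rules.

```python
import ipaddress
import socket


def adls_host(storage_account: str) -> str:
    """Hostname of the ADLS Gen2 (DFS) endpoint."""
    return f"{storage_account}.dfs.core.windows.net"


def resolved_addresses(host: str, port: int = 443) -> list:
    """Resolve `host` from wherever this code runs (e.g. the driver)."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def looks_private(address: str) -> bool:
    """True for private-range addresses such as 10.x.x.x."""
    return ipaddress.ip_address(address).is_private


# In a notebook:
# for addr in resolved_addresses(adls_host("<storage_account_name>")):
#     print(addr, "private" if looks_private(addr) else "public")
```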

Pros of Mounting ADLS Gen2 on Databricks

1. Unified Data Management:

  • Efficient Data Access: Mounting ADLS Gen2 allows Databricks to directly access data in its native storage, avoiding unnecessary data transfers.

2. Scalability and Performance:

  • Parallel Processing: Databricks can leverage the distributed architecture of ADLS Gen2 for parallel processing, enhancing scalability and performance.

3. Unified Analytics:

  • Seamless Integration: Analysts and data scientists can seamlessly integrate ADLS Gen2 data into their analytics workflows without the need for extensive data movement.

4. Consistency Across Environments:

  • Development to Production: Mounting facilitates consistency from development to production, streamlining the deployment of data pipelines.

Cons of Mounting ADLS Gen2 on Databricks

1. Security:

  • Storage is mounted at the workspace level, so access and permissions cannot be controlled per user on the mount point.
  • If the mount uses an access key, which has full access to ADLS, every user who accesses the mount gets full access to ADLS.
  • Not advisable for shared storage.

How to Overcome the Security Issue:

  • Use the Spark configuration of the Databricks cluster to integrate with ADLS Gen2.
  • Use a service principal with limited access to authenticate with ADLS Gen2:

spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<secret-scope>/<service-credential-key>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token        

  • Another option is to authenticate with ADLS Gen2 from within the notebook:

service_credential = dbutils.secrets.get(scope="<secret-scope>",key="<service-credential-key>")

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")        
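The five settings above differ only in the storage account, application ID, directory ID and secret, so they can be generated from those values. The sketch below does that and builds the abfss:// path used to read data directly, without a mount. The helper names are hypothetical, and the parquet format in the usage comment is only an example.

```python
def oauth_configs(storage_account, application_id, directory_id, client_secret):
    """Build the service-principal (OAuth) settings shown above."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": application_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


def apply_configs(spark, configs):
    """Set each entry on the current Spark session."""
    for name, value in configs.items():
        spark.conf.set(name, value)


def abfss_path(container, storage_account, relative_path=""):
    """Direct path to data in the container, no mount point involved."""
    host = f"{storage_account}.dfs.core.windows.net"
    return f"abfss://{container}@{host}/{relative_path.lstrip('/')}"


# In a notebook (placeholders as in the article):
# secret = dbutils.secrets.get(scope="<secret-scope>",
#                              key="<service-credential-key>")
# apply_configs(spark, oauth_configs("<storage-account>", "<application-id>",
#                                    "<directory-id>", secret))
# df = spark.read.format("parquet").load(
#     abfss_path("<container_name>", "<storage-account>", "<folder>"))
```

Because the settings apply to the Spark session rather than to a workspace-wide mount, access is limited to what the service principal has been granted on the storage account.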

