Step-by-step guide to connecting Mage Pro SQL blocks with Databricks

Link to the original article, written by Mage DevRel Cole Freeman: https://www.mage.ai/blog/step-by-step-guide-to-connecting-mage-pro-sql-blocks-with-databricks

TLDR

The guide provides a step-by-step tutorial for integrating Mage Pro SQL blocks with Databricks to create automated data pipelines. It walks through configuring Databricks credentials in Mage Pro, creating a pipeline, adding a data loader block to fetch API data (using golf rankings as an example), securely storing API keys, creating SQL blocks to transfer data to Databricks, and finally verifying the data through queries in Databricks. The integration aims to streamline data workflows by eliminating manual processes and creating a seamless connection between data sources and Databricks for analysis.

Table of contents

  • Introduction
  • Step 1: Setting up your Databricks configuration
  • Step 2: Navigating to the pipelines page
  • Step 3: Creating a new pipeline
  • Step 4: Adding your first data loader block
  • Step 5: Configuring the API key and base URL
  • Step 6: Creating a SQL block for Databricks connection
  • Step 7: Querying data in Databricks
  • Conclusion: Successful integration with Mage Pro and Databricks

Introduction

Ever feel like your data systems are playing an endless game of "telephone," where nothing quite syncs up the way it should? It’s frustrating, isn’t it? Don’t worry, I’ve got your back. In this guide, we’ll tackle that chaos by combining the magic of Mage Pro’s SQL blocks with Databricks. By the time we’re done, you’ll have automated pipelines so smooth they’ll practically run themselves. Say goodbye to clunky, manual workflows and hello to a system that, well, just works.

Managing and integrating data from various systems can be a complex task for any organization. By combining the capabilities of Mage Pro and Databricks, you can create a data pipeline that automates the flow of information from your source systems into a central data platform, ready for analysis. Let’s get started integrating Mage Pro with Databricks.

Step 1: Setting up your Databricks configuration

To kick things off, we need to set up our Databricks configuration. This is an essential first step to ensure that everything runs smoothly as we integrate Mage Pro with Databricks.

Start by navigating to the left-hand menu in your Mage Pro environment and click on the Editor page. Once your files are loaded, scroll down until you find the io_config.yaml file. Open it and add a connection profile for Databricks. All of the required keys are listed below; fill in your own Databricks values.

You'll need to input the following key details (a sketch of the resulting io_config.yaml profile follows this list):

  • Databricks access token: This is your personal access token for security.
  • Database name: Specify the database you are working in.
  • Host: Enter the host URL for your Databricks instance.
  • HTTP path: This is the path to the Databricks HTTP endpoint.
  • Schema: Define the schema you plan to use.

For security reasons, it's best practice to store sensitive information like your access token in Mage's secrets manager. Make sure to do this for your own setup to keep your credentials secure.
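Here is a minimal sketch of what the Databricks profile in io_config.yaml could look like. The key names follow Mage's io_config conventions, but double-check the Databricks section of your own io_config.yaml template for the exact spelling, and treat every value below as a placeholder. The access token is referenced through Mage's secret templating helper rather than pasted in directly; confirm the exact syntax in your template.

default:
  # Reference the token from the secrets manager instead of hardcoding it.
  DATABRICKS_ACCESS_TOKEN: "{{ mage_secret_var('DATABRICKS_ACCESS_TOKEN') }}"
  DATABRICKS_DATABASE: your_catalog_name
  DATABRICKS_HOST: https://<your-workspace>.cloud.databricks.com
  DATABRICKS_HTTP_PATH: /sql/1.0/warehouses/<your-warehouse-id>
  DATABRICKS_SCHEMA: default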


Step 2: Navigating to the pipelines page

Once your configuration is set up, it’s time to navigate to the pipelines page. Head back to the left-hand navigation menu and click on the “Pipelines” button. This will take you to the pipelines overview where you can create and manage your data pipelines.

Here, you’ll see any existing pipelines and have the option to create new ones. Click the green “New pipeline” button to create a new pipeline.


Step 3: Creating a new pipeline

Now, let’s create a new pipeline! You can start from scratch, and that’s what we’re going to do today. Give your pipeline a name that reflects its purpose, then add a short description.

Don’t forget to tag your pipeline! Tags can help you organize and find your pipelines later. You might tag it with terms like demo or data integration. Once you’ve filled in the necessary details, hit the Create New Pipeline button, and voilà! You have a fresh pipeline ready for action.


Step 4: Adding your first data loader block

With your pipeline created, it’s time to add your first block. In this case, we’re going to add a data loader block. Click on the option to create a new block and select API as the type.

Name your block something descriptive, like “Fetch World Golf Rankings API Data.” This will help you remember its purpose later on. After naming it, hit Save and add.

Now, it’s time to add some code to this block. You can add the code below, but remember you will need a DataGolf API key. I recommend finding a free API where you can practice fetching data from different sources. Remember, if you’re working with sensitive information like an API key, you can store it securely in the secrets manager.
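Here is a minimal sketch of what the data loader block could look like. It assumes the DataGolf rankings endpoint at feeds.datagolf.com (confirm the exact path and query parameters in the DataGolf documentation), a secret named DATAGOLF_API_KEY created in the next step, and a response payload whose player records sit under a 'rankings' key; adjust those pieces for whichever API you practice with.

import pandas as pd
import requests

from mage_ai.data_preparation.shared.secrets import get_secret_value

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_world_golf_rankings(*args, **kwargs):
    # Pull the API key from Mage's secrets manager (configured in Step 5) instead of hardcoding it.
    api_key = get_secret_value('DATAGOLF_API_KEY')

    # Assumed DataGolf rankings endpoint; confirm the path and parameters in the DataGolf docs.
    base_url = 'https://feeds.datagolf.com'
    response = requests.get(
        f'{base_url}/preds/get-dg-rankings',
        params={'file_format': 'json', 'key': api_key},
        timeout=30,
    )
    response.raise_for_status()
    payload = response.json()

    # The player records are assumed to live under a 'rankings' key; fall back to the raw payload otherwise.
    records = payload.get('rankings', payload) if isinstance(payload, dict) else payload
    return pd.DataFrame(records)

Returning a pandas DataFrame here is what lets the downstream SQL block reference this block's output directly.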


Step 5: Configuring the API key and base URL

Before we run the data loader block, we need to ensure that our API key and base URL are correctly configured. Navigate to the right-hand menu and look for the option to add a new secret. Here, you can input a key-value pair for your API key.

Make sure to label it appropriately, like DATAGOLF_API_KEY. Once you hit enter, your new secret will be saved, and you can reference it in your code. This keeps your API key secure while allowing your code to access it seamlessly.


Go ahead and click the run button. This will take a few seconds, depending on the amount of data you are extracting. Once it runs, you should see a sample of your data returned. In our example, the returned data includes player names, player IDs from the API, and additional descriptive fields. It’s exciting to see it all come together!

Step 6: Creating a SQL block for Databricks connection

Now that we’ve successfully run our data loader block, it's time to create a SQL block for connecting to Databricks. To begin, navigate back to your pipeline editor and click on Blocks to add another block. Hover over the Transformer tab and then choose the generic SQL Block from the menu.


Give your SQL block a descriptive name, such as “Databricks SQL Connection.” This will help you identify its purpose in your pipeline. Once named, hit Save and add to proceed. With your SQL block created, it’s time to write the code that defines how the data is structured in Databricks. In the SQL editor, you’ll want to start with a CREATE OR REPLACE TABLE statement, as sketched below. This code takes the data from our earlier API block, Fetch World Golf Rankings API Data, and creates a new table in Databricks called world_golf_rankings.
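Here is a minimal sketch of what the SQL block body could look like. Mage SQL blocks expose upstream block output through template variables (the first upstream block is available as {{ df_1 }}). The catalog, schema, and table name below are illustrative; rename them to match your workspace, and note that the verification queries in Step 7 reference world_golf_rankings_formatted, so use whichever name you intend to query.

-- Materialize the upstream API block's output as a Databricks table.
-- {{ df_1 }} refers to the first upstream block (the data loader above).
CREATE OR REPLACE TABLE workspace.default.world_golf_rankings AS
SELECT *
FROM {{ df_1 }};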

Make sure to adjust the table name and schema according to your preferences. Once you’ve input the SQL code, it’s time to run the SQL block: click the run button to execute your code. This action will create or update the specified table in your Databricks instance.


Step 7: Querying data in Databricks

After the execution is complete, it’s important to verify that the data has been correctly loaded. Head over to your Databricks workspace and check the world_golf_rankings_formatted table.

To confirm the data integrity, you can run a simple query:


SELECT COUNT(*) FROM workspace.default.world_golf_rankings_formatted;

This will give you the total number of rows in the table. If all goes well, you should see the expected number of records that you fetched from the API!

With your data now securely in Databricks, you can start querying it. To get a glimpse of your data, try running a basic query:


SELECT * FROM workspace.default.world_golf_rankings_formatted LIMIT 10;

This query will fetch the first ten records from your table, allowing you to review the data structure and content. Adjust the LIMIT clause as needed to explore more records.


Conclusion: Successful integration with Mage Pro and Databricks

Congratulations! You’ve successfully integrated Mage Pro's SQL blocks with Databricks. By following these steps, you’ve created a data pipeline that pulls data from an API, processes it, and loads it into your Databricks environment. This integration not only enhances your data workflows but also empowers you to leverage the full potential of your data. Keep exploring the capabilities of Mage Pro and Databricks to unlock even more insights and efficiencies in your data management processes.

Want to learn more about Mage Pro? Get a demo today.
