Generic Data Ingestion Process in Apache Spark

In this article, we will understand how to write a generic ingestion process using Spark, and we will be using Databricks for it. Our goal is to create an ingestion framework which can ingest files of the following formats from any cloud location and load them into any database table or cloud directory.

We will create a single notebook to accomplish this. We can extend this framework to ingest data over ODBC/JDBC, but we will leave that for the next article.

Our present focus is ingesting from files, and we will restrict ourselves to the formats below, though we could very well extend it to any supported file format.

  1. Parquet Files
  2. JSON Files
  3. CSV Files

What is Databricks

Databricks is a fast, easy, and collaborative Apache Spark-based analytics service. Databricks builds on the capabilities of Spark by providing a zero-management cloud platform that includes:

  • Fully managed Spark clusters
  • An interactive workspace for exploration and visualization
  • A platform for powering your favourite Spark-based applications

Let's get started.

Step1: We will create a cluster and a Notebook. (This is pretty easy on Databricks.) We will name the notebook Generic_Ingestion_Notebook. We will be working in PySpark, so this is a Python notebook.

Step2: We will create 5 parameters for our Notebook (a sketch of the widget cell follows the list).

  1. InputPath - The path of your cloud location
  2. InputFile - The name of your source file
  3. TargetTable - The target table where we want to load our data
  4. TargetPath - The target cloud path where we want to load our data
  5. LoadType - Table or File
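The original screenshot is not reproduced here, so below is a minimal sketch of what the parameter cell could look like using dbutils.widgets. The variable names (input_path, input_file, target_table, target_path, load_type) are illustrative, not taken from the article.

```python
# Illustrative sketch of the parameter (widget) cell.
dbutils.widgets.text("InputPath", "")
dbutils.widgets.text("InputFile", "")
dbutils.widgets.text("TargetTable", "")
dbutils.widgets.text("TargetPath", "")
dbutils.widgets.dropdown("LoadType", "Table", ["Table", "File"])

# getArgument("InputPath") is the older way to read a widget value;
# dbutils.widgets.get() is the current equivalent.
input_path   = dbutils.widgets.get("InputPath")
input_file   = dbutils.widgets.get("InputFile")
target_table = dbutils.widgets.get("TargetTable")
target_path  = dbutils.widgets.get("TargetPath")
load_type    = dbutils.widgets.get("LoadType")
```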

Step3: Getting the Type of File and printing the File name

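A minimal sketch of this cell; full_path is an illustrative name, and it assumes the widget variables from Step2.

```python
import os

# Build the full path of the incoming file and print it.
full_path = os.path.join(input_path, input_file)
print("File to ingest:", input_file)
print("Full path:", full_path)
```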

Step4: Extracting the extension of the file, i.e. (.csv), (.json), (.parquet)

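A possible version of the extension-extraction cell (file_format is an illustrative name; os was imported in the previous sketch):

```python
# Extract and normalise the extension, e.g. ".CSV" -> "csv".
file_format = os.path.splitext(input_file)[1].lower().lstrip(".")
print("Detected file format:", file_format)
```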

Step5: Function to get the DataFrame based on the File Format

Remember, this is very basic; we can add complexity as required. Also, ideally this should not be part of the main generic notebook: we should keep all reusable methods in a separate folder and call that folder at the beginning of our generic ingestion process.

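A simple sketch of such a function; get_dataframe is an illustrative name, and the read options are kept deliberately minimal.

```python
def get_dataframe(path, file_format):
    """Return a Spark DataFrame for the given path, based on the file format."""
    if file_format == "csv":
        return (spark.read
                     .option("header", "true")
                     .option("inferSchema", "true")
                     .csv(path))
    elif file_format == "json":
        return spark.read.json(path)
    elif file_format == "parquet":
        return spark.read.parquet(path)
    else:
        raise ValueError(f"Unsupported file format: {file_format}")
```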

Step6: Calling the function and getting the DataFrame

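Calling it could look like this (df and the other names carry over from the earlier sketches):

```python
df = get_dataframe(full_path, file_format)
display(df)  # quick sanity check inside the Databricks notebook
```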

Step7: Optionally, if we wish to print / store the schema of the processed file

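For example, something along these lines:

```python
# Print the schema to the notebook output and keep a JSON copy
# in case we want to persist it for auditing later.
df.printSchema()
schema_json = df.schema.json()
print(schema_json)
```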

Step8: Also, if we want to record the count of the processed file.

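A one-liner is enough for a basic version:

```python
record_count = df.count()
print(f"Number of records in {input_file}: {record_count}")
```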

Step9: Now we have to save this DataFrame either to a table or to a file system, provided we have valid connection credentials.

Here, we are loading into the Snowflake cloud data warehouse. (Note: I have removed the Snowflake credentials cell after running it.)

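Since the credentials cell was removed, the sketch below uses placeholder Snowflake options; it assumes the Databricks Snowflake connector and the load_type/target_table/target_path variables from the earlier sketches.

```python
# Placeholder connection options standing in for the removed credentials cell.
sf_options = {
    "sfUrl":       "<account>.snowflakecomputing.com",
    "sfUser":      "<user>",
    "sfPassword":  "<password>",
    "sfDatabase":  "<database>",
    "sfSchema":    "<schema>",
    "sfWarehouse": "<warehouse>",
}

if load_type == "Table":
    # Write the DataFrame into the Snowflake target table.
    (df.write
       .format("snowflake")
       .options(**sf_options)
       .option("dbtable", target_table)
       .mode("overwrite")
       .save())
else:
    # LoadType == "File": write to the target cloud path instead.
    df.write.mode("overwrite").parquet(target_path)
```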

Running the Notebook

We can basically run our notebook in two ways: either manually, or on a schedule at a fixed time (the way we do with a CRON job).

We can create another notebook - Run_Notebook - and use the magic command %run to run our Generic_Ingestion_Notebook with different parameters, like the example below.

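Each cell in Run_Notebook could look something like this; the path and parameter values are made up for illustration, and each %run argument populates the matching widget in Generic_Ingestion_Notebook.

```python
%run ./Generic_Ingestion_Notebook $InputPath="/mnt/raw/sales" $InputFile="sales_data.csv" $TargetTable="SALES_RAW" $TargetPath="" $LoadType="Table"
```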

Note: We can create more complex workflows via dbutils.notebook.run, but that is out of scope for today.

So let's run one of the cells and see what it looks like.


Yeah, that's it. We have loaded the CSV file data into a Snowflake table via our generic ingestion process.

So if you followed it till the end, you can see how easy it is to create a very basic generic framework for ingesting data from heterogeneous sources and loading it anywhere. We can extend this framework to cater to more complex use cases. Please try it yourself and let me know if you are able to recreate it. Also, do let me know if anything needs clarification.

Thanks for reading. I hope you found it informative. Please press the "like" button if you liked it.

Gonzalo Nistal

Sr. Data Engineer at QuintoAndar Group

3 years ago

Great article Deepak, many thanks for sharing your knowledge.

Jeffrey Hasenbos

Founder @ Datashark Consultancy | Crafting Data Platforms, Warehouses, and Solutions for Analytics and Business Intelligence

3 years ago

Nice article! Do you have the code on Github or something?

Antriksh Chourasia

Data Engineer - Databricks, AWS Glue, AWS S3, Redshift, Hive, Pyspark, Shell Scripting, Jenkins CI/CD, GitHub.

3 years ago

Great article to learn from. Liked every part of it. Deepak Rajak can you please publish one for data transformation like removing duplicates from multiple columns, adding new column based on certain conditions or anything you can think of about manipulating the dataset.

Sourabh Joshi

Lead Data Engineer | Wantrepreneur | Blogger

3 years ago

Looks fine! One suggestion - maybe you could add some lines to validate the data before loading it onto the target?

Rakesh Sahoo

Data Engineer| AWS | Python | Core Java | SQL | HIVE | JP Morgan Chase and Ex-TCSer

4 years ago

Adding the getArgument() while creating a notebook is not there in the Databricks free Community Edition???
