Cloud-agnostic analytics using Databricks
Databricks is an enterprise software company that provides cloud-agnostic data engineering tools for processing and transforming huge volumes of data and building machine learning models. Databricks runs on top of distributed cloud computing environments such as Azure, AWS, or Google Cloud, and can execute workloads on CPUs or GPUs depending on the analysis requirements.
In simple terms, Databricks is a single platform for data, analytics, and AI. Databricks is the original creator of Apache Spark, Delta Lake, and MLflow.
On Google Cloud, Databricks is available in the Marketplace.
On AWS, it is likewise available in the Marketplace, as shown below.
In this article I will use Azure Databricks, but the interface is the same on the other cloud providers.
Once you sign in to the Azure portal, search for Databricks as follows.
After selecting the Azure Databricks service, let's first create an Azure Databricks workspace.
Step 2: Set the networking parameters. Here I am allowing a public IP, but in DEV, UAT, and PROD environments you should use private IPs only.
Step 3: Review all the settings and select the Create button at the bottom of the page.
Step 4: Wait for the workspace to be created.
Once the workspace is ready, you will see the following screen.
Once the Azure Databricks workspace is created, select it and launch the workspace.
Now let's create a Databricks cluster.
Step 1: Inside the Azure Databricks workspace, select Create -> Cluster.
Step 2: Configure the cluster. Ideally we would create a multi-node cluster, but since I am using an Azure trial subscription I will create a single-node cluster. You can use "Quotas" in Azure to check the quota per region and request an increase for the region you are interested in.
Now select the "Create Cluster" button at the bottom of the wizard.
After the cluster is successfully created, we will see the following screen.
Now let's create a notebook to analyze the data.
Step 1: Select the Create -> Notebook button.
Step 2: Give the notebook a name and select the programming language you are interested in. Python is selected by default.
Now let's check the data that already exists within the Databricks cluster. All the sample datasets are available inside the dbfs:/databricks-datasets/ folder, and we can list them as shown below.
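A minimal way to list them from the notebook, using the dbutils helper that Databricks notebooks provide out of the box:

```python
# List the sample datasets that ship with every Databricks workspace.
# dbutils and spark are predefined in Databricks notebooks; no import is needed.
for entry in dbutils.fs.ls("dbfs:/databricks-datasets/"):
    print(entry.name)
```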
For this article I will pick the airlines dataset.
Step 3: Let's check all the files in the airlines dataset.
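A quick sketch of the listing, assuming the dataset lives under dbfs:/databricks-datasets/airlines/:

```python
# List all files belonging to the airlines dataset.
for entry in dbutils.fs.ls("dbfs:/databricks-datasets/airlines/"):
    print(entry.name, entry.size)
```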
Since the airlines dataset consists of multiple files, let's inspect the first two to determine the file type and contents. Since we don't know whether the files are CSV or Parquet, we can simply check the head of each file.
Step 4: Check the head of the part-00000 file.
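Something along these lines works, again using dbutils (the file name comes from the listing above):

```python
# Print the first kilobyte of the first part file.
print(dbutils.fs.head("dbfs:/databricks-datasets/airlines/part-00000", 1024))
```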
As we can see, it is a CSV file with header information.
Step 5: Let's check the head of the part-00001 file.
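And the same check for the second file:

```python
# Print the first kilobyte of the second part file.
print(dbutils.fs.head("dbfs:/databricks-datasets/airlines/part-00001", 1024))
```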
The part-00001 file doesn't have header information, so we have to extract the schema from the part-00000 file.
Step 6: Load the part-00000 file first using PySpark.
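A minimal sketch, letting Spark infer the schema from the header row (the variable names are mine):

```python
# Read only part-00000, inferring the schema from its header row.
# The `spark` session is predefined in Databricks notebooks.
df_first = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("dbfs:/databricks-datasets/airlines/part-00000")
)
df_first.printSchema()
```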
As we can see, Spark inferred the ArrTime, DepTime, DepDelay, and ArrDelay fields, which should be integers, as strings. Most likely these columns contain some invalid data.
Step 7: Use the schema extracted from the part-00000 file to read the entire airlines dataset.
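One way to do this, reusing the inferred schema (note that the header line of part-00000 is then read as an ordinary row; it is cleaned out by the later steps):

```python
# Read every part file with the schema inferred from part-00000.
airlines_df = (
    spark.read
    .schema(df_first.schema)
    .csv("dbfs:/databricks-datasets/airlines/part-*")
)
airlines_df.show(5)
```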
As we can see, the DepTime and ArrTime fields contain the value "NA".
Step 8: Let's check the number of records. Note: avoid running counts like this in production, since counting forces a full scan of the dataset.
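For example:

```python
# Count the records; this triggers a full scan of the dataset.
print(airlines_df.count())
```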
Step 9: Remove null rows.
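A first attempt might simply drop rows containing nulls, for example:

```python
# Drop any row that contains a null value in any column.
cleaned_df = airlines_df.na.drop()
```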
Since "NA" is not None or null value, spark will not consider it as null values so we have to do some data cleansing before analysis.
Step 10: Replace all "NA" values with None (null).
Step 10.1: Option 1
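One possible implementation of this option uses DataFrame.replace(), which in recent Spark versions accepts None as the replacement value (the column subset here is my assumption):

```python
# Replace the literal string "NA" with null in the time/delay columns.
option1_df = airlines_df.replace(
    "NA", None, subset=["DepTime", "ArrTime", "DepDelay", "ArrDelay"]
)
option1_df.show(5)
```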
As we can see that "NA" has been replaced by "null".
Step 10.2: Option 2
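Another possibility is an explicit when/otherwise expression per column; a sketch:

```python
from pyspark.sql import functions as F

# Replace "NA" with null column by column using when/otherwise.
option2_df = airlines_df
for c in ["DepTime", "ArrTime", "DepDelay", "ArrDelay"]:
    option2_df = option2_df.withColumn(
        c, F.when(F.col(c) == "NA", F.lit(None)).otherwise(F.col(c))
    )
option2_df.show(5)
```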
Step 10.3: Option 3
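A third approach is to cast the affected columns to integers; values that cannot be parsed, such as "NA", silently become null, and the column types change as a side effect:

```python
from pyspark.sql import functions as F

# Cast the time/delay columns to integers; unparseable values become null.
option3_df = airlines_df
for c in ["DepTime", "ArrTime", "DepDelay", "ArrDelay"]:
    option3_df = option3_df.withColumn(c, F.col(c).cast("int"))
option3_df.printSchema()
```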
As we can see, the datatype has changed to integer. Now let's check the data.
As shown above, the "NA" values have been changed to null.
Step 11: Now let's try to remove null values once again.
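Continuing from the casted DataFrame of Option 3 (an assumption on my part), the null rows can now be dropped:

```python
# Drop rows that contain nulls in the cleaned columns.
final_df = option3_df.na.drop(subset=["DepTime", "ArrTime", "DepDelay", "ArrDelay"])
```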
Step 12: Check the record count after removing null values.
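For example:

```python
# The count is now lower than the raw count from step 8.
print(final_df.count())
```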
After removing null values, the record count is less than the record count in step 8.
Step 13: Let's check the number of unique carriers in the dataset.
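Assuming the carrier code lives in the UniqueCarrier column, as it does in this dataset:

```python
# Count the distinct carrier codes.
print(final_df.select("UniqueCarrier").distinct().count())
```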
Step 14: Let's find out which airline has the largest departure delays.
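One reasonable reading of this step is to rank carriers by average departure delay; a sketch:

```python
from pyspark.sql import functions as F

# Average departure delay per carrier, worst first.
dep_delay_df = (
    final_df.groupBy("UniqueCarrier")
    .agg(F.avg("DepDelay").alias("avg_dep_delay"))
    .orderBy(F.desc("avg_dep_delay"))
)
dep_delay_df.show(5)
```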
Step 15: Let's plot a graph of the most departure-delayed carriers.
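A sketch that converts the small per-carrier aggregate to pandas and plots it with matplotlib (Databricks' built-in display() charting is an alternative):

```python
import matplotlib.pyplot as plt

# Plot average departure delay per carrier as a bar chart.
pdf = dep_delay_df.toPandas()
pdf.plot(kind="bar", x="UniqueCarrier", y="avg_dep_delay", legend=False)
plt.ylabel("Average departure delay (minutes)")
plt.show()
```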
Step 16: Let's find out which carrier, on average, arrived the latest.
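Along the same lines as the departure-delay aggregation:

```python
from pyspark.sql import functions as F

# Average arrival delay per carrier, worst first.
(
    final_df.groupBy("UniqueCarrier")
    .agg(F.avg("ArrDelay").alias("avg_arr_delay"))
    .orderBy(F.desc("avg_arr_delay"))
    .show(5)
)
```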
All of the above steps can be executed as a workflow by selecting the Workflows tab in the UI and providing the path of the notebook we created and the cluster on which the workflow should run. We can even monitor the Spark jobs running behind the scenes.
References:
https://docs.databricks.com/getting-started/introduction/index.html