Cloud-agnostic analytics using Databricks

Databricks is an enterprise software company that provides cloud-agnostic data engineering tools for processing and transforming huge volumes of data and building machine learning models. Databricks runs on top of distributed cloud computing environments such as Azure, AWS, or Google Cloud, and can run workloads on CPUs or GPUs depending on the analysis requirements.

In simple terms, Databricks is a single platform for data, analytics, and AI. Databricks is the original creator of Apache Spark, Delta Lake, and MLflow.

On Google Cloud, it is available in the Marketplace.

It is also available in the AWS Marketplace.

In this article I will use Azure Databricks, but the interface is the same on the other cloud providers.

Once you sign in to the Azure portal, search for Azure Databricks.

Step 1: After selecting the Azure Databricks service, create an Azure Databricks workspace.

Step 2: Set the networking parameters. Here I am allowing a public IP, but in DEV, UAT, and PROD environments you should use private IPs only.

Step 3: Review all the settings and select the Create button at the bottom of the page.

Step 4: Wait for the workspace to be created.

Once the deployment completes, the workspace is ready.

Once the Azure Databricks workspace is created, select it and launch the workspace.

Now let's create a Databricks cluster.

Step 1: Inside the Azure Databricks workspace, select Create -> Cluster.

Step 2: Configure the cluster. Ideally we would create a multi-node cluster, but since I am using an Azure trial subscription I will create a single-node cluster. You can use "Quotas" in Azure to check your quota per region and request an increase in the region you are interested in.

Now select the "Create Cluster" button at the bottom of the wizard.

After the cluster is created successfully, it will show up as running in the workspace.

Now let's create a notebook to analyze the data.

Step 1: Select the Create -> Notebook button.

Step 2: Give the notebook a name and select the programming language you are interested in. Python is selected by default.

Step 3: Now check the sample data that is already available to the Databricks cluster.
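The original cell is only shown as a screenshot in the source, so here is a minimal sketch of what it could look like, using the standard Databricks dbutils file-system utility:

```python
# List the root of the Databricks File System (DBFS) to see the data
# that ships with the workspace
display(dbutils.fs.ls("/"))

# Equivalent file-system magic command in a notebook cell:
# %fs ls /
```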

Step 4: Now check all the datasets available inside the dbfs:/databricks-datasets/ folder.
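A sketch of such a cell (same assumptions as above):

```python
# List the sample datasets bundled under /databricks-datasets
display(dbutils.fs.ls("dbfs:/databricks-datasets/"))
```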

For this article I will pick the airlines dataset.

Step 5: Let's check all the files of the airlines dataset.
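A sketch of the listing cell; the path follows the standard databricks-datasets layout:

```python
# List the files of the airlines sample dataset
display(dbutils.fs.ls("dbfs:/databricks-datasets/airlines/"))
```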

Since the airlines dataset has multiple files, let's inspect the first two files to determine the file type and their contents. Since we don't know whether the files are CSV or Parquet, we can simply check the head of each file.

Step 6: Check the head of the part-00000 file.
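A sketch of the cell, using dbutils.fs.head to print the first few kilobytes of the file:

```python
# Peek at the beginning of part-00000 to identify the file format
print(dbutils.fs.head("dbfs:/databricks-datasets/airlines/part-00000"))
```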

As we can see, it is a CSV file with header information.

Step 7: Let's check the head of the part-00001 file.
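The same check for the second file:

```python
# Peek at the beginning of part-00001
print(dbutils.fs.head("dbfs:/databricks-datasets/airlines/part-00001"))
```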

The part-00001 file doesn't have header information, so we have to extract the schema from the part-00000 file.

Step 8: Load the part-00000 file first using PySpark.
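A sketch of the load; the exact reader options are not visible in the source, and the variable name df_header is mine:

```python
# Read only the file that carries the header and let Spark infer the schema
df_header = (spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv("dbfs:/databricks-datasets/airlines/part-00000"))

df_header.printSchema()
```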

As we can see, Spark inferred the ArrTime, DepTime, DepDelay, and ArrDelay fields, which are supposed to be integers, as strings. Most likely these columns contain some invalid data.

Step 9: Use the schema extracted from the part-00000 file to read the entire airlines dataset.
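A sketch of reusing that schema for the full dataset (glob path assumed; note that the header row inside part-00000 will show up as a data row and may need to be filtered out):

```python
# Read all part files with the schema inferred from part-00000
airlines_df = (spark.read
               .schema(df_header.schema)
               .csv("dbfs:/databricks-datasets/airlines/part-*"))

display(airlines_df)
```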

As we can see, the DepTime and ArrTime fields contain "NA" values.

Step 10: Let's check the number of records. Note: avoid running full counts casually in production, since counting scans the whole dataset.
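For example:

```python
# Counting triggers a full scan of the dataset
airlines_df.count()
```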

Step 11: Remove rows with null values.
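A sketch of the drop (as the next paragraph explains, this will not remove the "NA" rows yet):

```python
# Drop rows that contain nulls; "NA" is a plain string, so those rows survive
airlines_df_dropped = airlines_df.na.drop()
```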

Since "NA" is not None or null value, spark will not consider it as null values so we have to do some data cleansing before analysis.

Step 12: Replace all "NA" values with null.

Step 12.1: Option 1
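The original cell is not reproduced here, so this is only one plausible way to do it, using DataFrame.na.replace on the affected columns (column list assumed):

```python
# Option 1 (assumed approach): replace the literal string "NA" with null
cleaned_df1 = airlines_df.na.replace(
    "NA", None, subset=["DepTime", "ArrTime", "DepDelay", "ArrDelay"]
)
display(cleaned_df1)
```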

As we can see, "NA" has been replaced with null.

Step 12.2: Option 2
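Again an assumed approach, this time with a conditional when/otherwise expression:

```python
from pyspark.sql import functions as F

# Option 2 (assumed approach): null out "NA" column by column
cleaned_df2 = airlines_df
for c in ["DepTime", "ArrTime", "DepDelay", "ArrDelay"]:
    cleaned_df2 = cleaned_df2.withColumn(
        c, F.when(F.col(c) == "NA", None).otherwise(F.col(c))
    )
```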

Step 12.3: Option 3
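Since the result below shows the data type changing to integer, this option most likely casts the string columns; a sketch:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Option 3 (assumed approach): cast to integer; non-numeric values such as
# "NA" become null and the column type changes from string to int
cleaned_df3 = airlines_df
for c in ["DepTime", "ArrTime", "DepDelay", "ArrDelay"]:
    cleaned_df3 = cleaned_df3.withColumn(c, F.col(c).cast(IntegerType()))

cleaned_df3.printSchema()
```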

As we can see, the data type has changed to integer. Now let's check the data.

As shown above, the "NA" values have been changed to null.

Step 13: Now let's try to remove the null values once again.
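A sketch, continuing from the casted DataFrame of option 3:

```python
# Drop rows where the cleaned columns are null
final_df = cleaned_df3.na.drop(subset=["DepTime", "ArrTime", "DepDelay", "ArrDelay"])
```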

Step 14: Check the record count after removing the null values.
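For example:

```python
# This count should be lower than the one from Step 10
final_df.count()
```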

After removing the null values, the record count is lower than the record count from Step 10.

Step 15: Let's check the number of unique carriers in the dataset.
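A sketch, assuming the carrier column is named UniqueCarrier as in the standard airlines dataset:

```python
# Number of distinct carriers
final_df.select("UniqueCarrier").distinct().count()
```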

Step 16: Let's find out which airline had the worst departure delays.
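The exact aggregation used in the original notebook is not visible; averaging the departure delay per carrier is one reasonable reading:

```python
from pyspark.sql import functions as F

# Average departure delay per carrier, worst first
dep_delay_df = (final_df
                .groupBy("UniqueCarrier")
                .agg(F.avg("DepDelay").alias("avg_dep_delay"))
                .orderBy(F.desc("avg_dep_delay")))

display(dep_delay_df)
```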

Step 17: Let's plot a graph of the carriers with the highest departure delays.
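In a Databricks notebook the display() output above already offers built-in charting; a matplotlib-based sketch, reusing dep_delay_df from the previous cell, could look like this:

```python
# Convert the small aggregated result to pandas and plot it as a bar chart
pdf = dep_delay_df.toPandas()
pdf.plot(kind="bar", x="UniqueCarrier", y="avg_dep_delay",
         legend=False, title="Average departure delay by carrier")
```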

Step 18: Let's find out which carrier, on average, arrived the latest.
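A sketch of the same aggregation on the arrival delay column:

```python
from pyspark.sql import functions as F

# Average arrival delay per carrier, worst first
arr_delay_df = (final_df
                .groupBy("UniqueCarrier")
                .agg(F.avg("ArrDelay").alias("avg_arr_delay"))
                .orderBy(F.desc("avg_arr_delay")))

display(arr_delay_df)
```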


All the above steps can be executed as a workflow by selecting the Workflows tab in the UI and providing the path of the notebook we created and the cluster on which the workflow should run. We can even monitor the Spark jobs running behind the scenes.

Disclaimer:

  1. The statements, views, or opinions expressed in my LinkedIn profile and related articles represent my own views and not those of my current or previous employers or LinkedIn.
  2. Any comments from those responding to my postings belong to, and only to, the responder posting the comment. I am not responsible or liable for such comments.
  3. Although I hope you find my articles helpful and perhaps educational, if you decide to rely on them for any purpose whatsoever, you do so at your own risk and I will not be held liable.
  4. "Names, Logos and other proprietary information quoted in this article are property of respective companies/owners and are mentioned here for reference purpose only."
  5. Only public domain data has been used in the examples. If you have any objection with the data then please let me know with the proof of your ownership and I will be happy to update my article.

References:

https://docs.databricks.com/getting-started/introduction/index.html
