Cloud-agnostic analytics using Databricks
Databricks is an enterprise software company that provides cloud-agnostic data engineering tools for processing and transforming huge volumes of data and building machine learning models. Databricks runs on top of distributed cloud computing environments such as Azure, AWS, or Google Cloud, and can execute workloads on CPUs or GPUs depending on the analysis requirements.
In simple terms, Databricks is a single platform for data, analytics, and AI. Databricks is the original creator of Apache Spark, Delta Lake, and MLflow.
On Google Cloud, Databricks is available in the Marketplace.
On AWS, it is likewise available in the Marketplace, as shown below.
In this article I will use Azure Databricks, but the interface is the same on the other cloud providers.
Once you sign in to the Azure portal, search for Databricks as follows.
After selecting the Azure Databricks service, let's first create an Azure Databricks workspace.
Step 2: Set the networking parameters. Here I am allowing a public IP, but in DEV, UAT, and PROD environments you should use private IPs only.
Step 3: Review all the settings and select the Create button at the bottom of the page.
Step 4: Wait for the workspace to be created.
Once the workspace is ready, you will see the following screen.
Once the Azure Databricks workspace is created, select it and launch the workspace.
Now let's create a Databricks cluster.
Step 1: Inside the Azure Databricks workspace, select Create -> Cluster.
Step 2: Configure the cluster. Ideally we would create a multi-node cluster, but since I am using an Azure trial subscription I will create a single-node cluster. You can use "Quotas" in Azure to check the quota per region and request an increase for the region you are interested in.
Now select the "Create Cluster" button at the bottom of the wizard.
After the cluster is successfully created, we will see the following screen.
Now let's create a notebook to analyze the data.
Step 1: Select the Create -> Notebook button.
Step 2: Give the notebook a name and select the programming language you are interested in. Python is selected by default.
Now let's check the data that already exists within the Databricks cluster. All the sample datasets are available inside the dbfs:/databricks-datasets/ folder, and we can list them as shown below.
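A minimal way to list them from the notebook, using the dbutils helper that Databricks notebooks provide out of the box:

```python
# List the sample datasets that ship with every Databricks workspace.
# dbutils and spark are predefined in Databricks notebooks; no import is needed.
for entry in dbutils.fs.ls("dbfs:/databricks-datasets/"):
    print(entry.name)
```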
For this article I will pick the airlines dataset.
Step 3: Let's check all the files in the airlines dataset.
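A quick sketch of the listing, assuming the dataset lives under dbfs:/databricks-datasets/airlines/:

```python
# List all files belonging to the airlines dataset.
for entry in dbutils.fs.ls("dbfs:/databricks-datasets/airlines/"):
    print(entry.name, entry.size)
```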
Since the airlines dataset consists of multiple files, let's inspect the first two to determine the file type and contents. Since we don't know whether the files are CSV or Parquet, we can simply check the head of each file.
Step 4: Check the head of the part-00000 file.
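Something along these lines works, again using dbutils (the file name comes from the listing above):

```python
# Print the first kilobyte of the first part file.
print(dbutils.fs.head("dbfs:/databricks-datasets/airlines/part-00000", 1024))
```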
As we can see, it is a CSV file with header information.
Step 5: Let's check the head of the part-00001 file.
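And the same check for the second file:

```python
# Print the first kilobyte of the second part file.
print(dbutils.fs.head("dbfs:/databricks-datasets/airlines/part-00001", 1024))
```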
The part-00001 file doesn't have header information, so we have to extract the schema from the part-00000 file.
Step 6: Load the part-00000 file first using PySpark.
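A minimal sketch, letting Spark infer the schema from the header row (the variable names are mine):

```python
# Read only part-00000, inferring the schema from its header row.
# The `spark` session is predefined in Databricks notebooks.
df_first = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("dbfs:/databricks-datasets/airlines/part-00000")
)
df_first.printSchema()
```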
As we can see, Spark inferred the ArrTime, DepTime, DepDelay, and ArrDelay fields, which should be integers, as strings. Most likely these columns contain some invalid data.
Step 7: Use the schema extracted from the part-00000 file to read the entire airlines dataset.
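One way to do this, reusing the inferred schema (note that the header line of part-00000 is then read as an ordinary row; it is cleaned out by the later steps):

```python
# Read every part file with the schema inferred from part-00000.
airlines_df = (
    spark.read
    .schema(df_first.schema)
    .csv("dbfs:/databricks-datasets/airlines/part-*")
)
airlines_df.show(5)
```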
As we can see, the DepTime and ArrTime fields contain the value "NA".
Step 8: Let's check the number of records. Note: avoid running counts like this in production, since counting forces a full scan of the dataset.
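For example:

```python
# Count the records; this triggers a full scan of the dataset.
print(airlines_df.count())
```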
Step 9: Remove null rows.
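A first attempt might simply drop rows containing nulls, for example:

```python
# Drop any row that contains a null value in any column.
cleaned_df = airlines_df.na.drop()
```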
Since "NA" is not None or null value, spark will not consider it as null values so we have to do some data cleansing before analysis.
Step 10: Replace all "NA" values with None (null).
Step 10.1: Option 1
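One possible implementation of this option uses DataFrame.replace(), which in recent Spark versions accepts None as the replacement value (the column subset here is my assumption):

```python
# Replace the literal string "NA" with null in the time/delay columns.
option1_df = airlines_df.replace(
    "NA", None, subset=["DepTime", "ArrTime", "DepDelay", "ArrDelay"]
)
option1_df.show(5)
```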
As we can see that "NA" has been replaced by "null".
Step 10.2: Option 2
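Another possibility is an explicit when/otherwise expression per column; a sketch:

```python
from pyspark.sql import functions as F

# Replace "NA" with null column by column using when/otherwise.
option2_df = airlines_df
for c in ["DepTime", "ArrTime", "DepDelay", "ArrDelay"]:
    option2_df = option2_df.withColumn(
        c, F.when(F.col(c) == "NA", F.lit(None)).otherwise(F.col(c))
    )
option2_df.show(5)
```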
Step 10.3: Option 3
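A third approach is to cast the affected columns to integers; values that cannot be parsed, such as "NA", silently become null, and the column types change as a side effect:

```python
from pyspark.sql import functions as F

# Cast the time/delay columns to integers; unparseable values become null.
option3_df = airlines_df
for c in ["DepTime", "ArrTime", "DepDelay", "ArrDelay"]:
    option3_df = option3_df.withColumn(c, F.col(c).cast("int"))
option3_df.printSchema()
```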
As we can see, the datatype has changed to integer. Now let's check the data.
As shown above, the "NA" values have been changed to null.
Step 11: Now let's try to remove null values once again.
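Continuing from the casted DataFrame of Option 3 (an assumption on my part), the null rows can now be dropped:

```python
# Drop rows that contain nulls in the cleaned columns.
final_df = option3_df.na.drop(subset=["DepTime", "ArrTime", "DepDelay", "ArrDelay"])
```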
Step 12: Check the record count after removing null values.
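For example:

```python
# The count is now lower than the raw count from step 8.
print(final_df.count())
```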
After removing null values, the record count is less than the record count in step 8.
Step 13: Let's check the number of unique carriers in the dataset.
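Assuming the carrier code lives in the UniqueCarrier column, as it does in this dataset:

```python
# Count the distinct carrier codes.
print(final_df.select("UniqueCarrier").distinct().count())
```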
Step 14: Let's find out which airline has the largest departure delays.
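One reasonable reading of this step is to rank carriers by average departure delay; a sketch:

```python
from pyspark.sql import functions as F

# Average departure delay per carrier, worst first.
dep_delay_df = (
    final_df.groupBy("UniqueCarrier")
    .agg(F.avg("DepDelay").alias("avg_dep_delay"))
    .orderBy(F.desc("avg_dep_delay"))
)
dep_delay_df.show(5)
```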
Step 15: Let's plot a graph of the most departure-delayed carriers.
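A sketch that converts the small per-carrier aggregate to pandas and plots it with matplotlib (Databricks' built-in display() charting is an alternative):

```python
import matplotlib.pyplot as plt

# Plot average departure delay per carrier as a bar chart.
pdf = dep_delay_df.toPandas()
pdf.plot(kind="bar", x="UniqueCarrier", y="avg_dep_delay", legend=False)
plt.ylabel("Average departure delay (minutes)")
plt.show()
```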
Step 16: Let's find out which carrier, on average, arrived the latest.
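Along the same lines as the departure-delay aggregation:

```python
from pyspark.sql import functions as F

# Average arrival delay per carrier, worst first.
(
    final_df.groupBy("UniqueCarrier")
    .agg(F.avg("ArrDelay").alias("avg_arr_delay"))
    .orderBy(F.desc("avg_arr_delay"))
    .show(5)
)
```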
All of the above steps can be executed as a workflow by selecting the Workflows tab in the UI and providing the path of the notebook we created and the cluster on which the workflow should run. We can even monitor the Spark jobs running behind the scenes.
References:
https://docs.databricks.com/getting-started/introduction/index.html