Ingesting, Parsing and Querying Semi-Structured Data (JSON) in Snowflake vs Databricks!!!
Deepak Rajak
Data Engineering /Advanced Analytics Technical Delivery Lead at Exusia, Inc.
Let’s get directly into the business. In this article we will go through the steps of ingesting, parsing and querying JSON data in two of the most popular and advanced cloud platforms: Snowflake and Databricks.
Let’s start with Snowflake:
What is Snowflake?
Snowflake is a cloud data warehouse built on top of public cloud (AWS / Azure / GCP) infrastructure and is a true SaaS offering. There is no hardware (virtual or physical) for you to select, install, configure, or manage. There is no software for you to install, configure, or manage. All ongoing maintenance, management, and tuning is handled by Snowflake.
Let's start with our process:
Our Data:
The JSON data consists of weather information provided by OpenWeatherMap, detailing the historical conditions of New York City from 2016-07-05 to 2019-06-25. It is staged on AWS S3, where the data represents 57.9k rows across 61 objects, 2.5MB total size compressed.
The raw JSON, stored in gzipped (.gz) files, looks like this in a text editor:
Let's load this data into Snowflake.
Step 1: Create a Database
create database weather;
Step 2: Create a Table
Snowflake’s VARIANT data type allows Snowflake to ingest semi-structured data without having to pre-define the schema.
create table json_weather_data (v variant);
Step 3: Create a Stage
External stages are storage locations outside the Snowflake environment, in another cloud provider's storage: Amazon S3, Microsoft Azure storage, or GCP buckets.
create stage nyc_weather url = 's3://snowflake-workshop-lab/weather-nyc';
List the contents of the stage:
list @nyc_weather;
Step 4: Copy the JSON Data from the Stage to the Table
copy into json_weather_data from @nyc_weather file_format = (type=json);
Check that the data copied correctly!!
select * from json_weather_data limit 10;
Wow!! We have the JSON data loaded into our table. This is pretty cool.
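Even before creating a view, you can drill into the VARIANT column directly with Snowflake's path-and-cast notation. For example (the column aliases here are just illustrative):

-- extract nested attributes from the VARIANT column v
select
    v:city.name::string as city,
    v:main.temp::float as temp_kelvin
from json_weather_data
limit 5;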
Step 5: Create a View to Parse and Format the Data Properly
create view json_weather_data_view as
select
    v:time::timestamp as observation_time,
    v:city.id::int as city_id,
    v:city.name::string as city_name,
    v:city.country::string as country,
    v:city.coord.lat::float as city_lat,
    v:city.coord.lon::float as city_lon,
    v:clouds.all::int as clouds,
    (v:main.temp::float) - 273.15 as temp_avg,
    (v:main.temp_min::float) - 273.15 as temp_min,
    (v:main.temp_max::float) - 273.15 as temp_max,
    v:weather[0].main::string as weather,
    v:weather[0].description::string as weather_desc,
    v:weather[0].icon::string as weather_icon,
    v:wind.deg::float as wind_dir,
    v:wind.speed::float as wind_speed
from json_weather_data
where city_id = 5128638;
Step 6: Query the JSON Data from the View as if You Were Querying a Table
select * from json_weather_data_view where date_trunc('month',observation_time) = '2018-01-01' limit 20;
Hurray!! We have nicely formatted, queryable data in the view. That's it, we are done!!
A Little Bit about Views & Materialized Views
A view allows the result of a query to be accessed as if it were a table. Views can help you: present data to end users in a cleaner manner (as in this article, where we present “ugly” JSON in a columnar format), limit what end users can view in a source table for privacy/security reasons, or write more modular SQL.
There are also Materialized Views in which SQL results are stored, almost as though the results were a table. This allows faster access, but requires storage space. Materialized Views require Snowflake Enterprise Edition or higher.
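As an illustration only (this is not part of the original walkthrough), a materialized view over our raw table might look like the sketch below. The name json_weather_data_mv and the column subset are hypothetical, and this assumes an Enterprise Edition account:

-- stores the computed results, which Snowflake keeps up to date
-- automatically as new data lands in json_weather_data
create materialized view json_weather_data_mv as
select
    v:time::timestamp as observation_time,
    v:city.name::string as city_name,
    (v:main.temp::float) - 273.15 as temp_avg
from json_weather_data;

The trade-off is extra storage and maintenance cost in exchange for faster reads on frequently run queries.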
Now let's quickly look at how we can do the same process in Databricks:
What is Databricks?
Databricks is an industry-leading, cloud-based data engineering tool used for processing and transforming massive quantities of data and exploring the data through machine learning models. This Apache Spark-based platform runs a distributed system behind the scenes, meaning the workload is automatically split across various processors and scales up and down on demand. Increased efficiency results in direct time and cost savings for massive tasks.
Let's follow the same process on an altogether different platform!!
Step 1: Create a Schema to Parse/Read the JSON
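The original notebook cell isn't reproduced here, so the following PySpark sketch is a reconstruction: it covers only a subset of the OpenWeatherMap fields used in the Snowflake view above, and the variable name weather_schema is mine.

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DoubleType, LongType)

# Hypothetical schema mirroring part of the OpenWeatherMap JSON structure
weather_schema = StructType([
    StructField("time", LongType(), True),          # epoch seconds
    StructField("city", StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("country", StringType(), True),
        StructField("coord", StructType([
            StructField("lat", DoubleType(), True),
            StructField("lon", DoubleType(), True),
        ]), True),
    ]), True),
    StructField("main", StructType([
        StructField("temp", DoubleType(), True),     # Kelvin
        StructField("temp_min", DoubleType(), True),
        StructField("temp_max", DoubleType(), True),
    ]), True),
    StructField("wind", StructType([
        StructField("deg", DoubleType(), True),
        StructField("speed", DoubleType(), True),
    ]), True),
])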
Step 2: Read the JSON Data from S3 (I have already mounted the S3 bucket on DBFS, the Databricks File System)
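Again a sketch rather than the original cell: assuming the bucket is mounted at the hypothetical path /mnt/weather-nyc, spark.read can consume the gzipped JSON files directly:

# spark is the SparkSession provided by the Databricks notebook
df = spark.read.schema(weather_schema).json("/mnt/weather-nyc/")
df.printSchema()
df.show(5, truncate=False)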
Yeah!! We have created our Spark DataFrame
Step 3: Some Data Engineering Clean-Up!!
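The exact clean-up from the original notebook isn't shown; a plausible version, mirroring the flattening and Kelvin-to-Celsius conversion done in the Snowflake view, might look like this:

from pyspark.sql import functions as F

# Flatten the nested structure into plain columns and filter to NYC
clean_df = (
    df.select(
        F.from_unixtime(F.col("time")).cast("timestamp").alias("observation_time"),
        F.col("city.id").alias("city_id"),
        F.col("city.name").alias("city_name"),
        (F.col("main.temp") - 273.15).alias("temp_avg"),
        (F.col("main.temp_min") - 273.15).alias("temp_min"),
        (F.col("main.temp_max") - 273.15).alias("temp_max"),
        F.col("wind.deg").alias("wind_dir"),
        F.col("wind.speed").alias("wind_speed"),
    )
    .filter(F.col("city_id") == 5128638)
)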
Step 4: Create a Temp Table from the DataFrame
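In Spark this is done with createOrReplaceTempView (Spark's term for a temp table is a temporary view). Assuming the clean_df from the previous sketch:

# Register the cleaned DataFrame so it can be queried with SQL
clean_df.createOrReplaceTempView("json_weather_data")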
Step 5: Query the Table (Yes, in Databricks itself!!)
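A minimal query sketch against the temporary view, equivalent in spirit to the Snowflake query above (you could also run the same SQL in a %sql notebook cell):

# Query the temporary view for January 2018 observations
spark.sql("""
    select city_name, observation_time, temp_avg
    from json_weather_data
    where observation_time >= '2018-01-01'
      and observation_time <  '2018-02-01'
    order by observation_time
    limit 20
""").show()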
So, we have seen how quickly we can load and query JSON data in both Snowflake and Databricks. Both are just amazing, and I am loving them!!
I hope this step-by-step walkthrough showed you, at least at a high level, how easy it is to work on both platforms.
Credits: Snowflake 0 to 90 Webinar. Databricks Community Edition