Ingesting, Parsing and Querying Semi-Structured Data (JSON) into Snowflake vs Databricks!!!

Let’s get straight down to business. In this article we will walk through the steps of ingesting, parsing and querying JSON data in two very popular and advanced cloud platforms: Snowflake and Databricks.

Let’s start with Snowflake:

What is Snowflake?

Snowflake is a cloud data warehouse built on top of the public cloud (AWS / Azure / GCP ) infrastructure and is a true SaaS offering. There is no hardware (virtual or physical) for you to select, install, configure, or manage. There is no software for you to install, configure, or manage. All ongoing maintenance, management, and tuning is handled by Snowflake.

Let's start with our process:

Our Data:

The JSON data consists of weather information provided by OpenWeatherMap, detailing the historical conditions of New York City from 2016-07-05 to 2019-06-25. It is staged on AWS S3, where the data comprises 57.9k rows across 61 objects, roughly 2.5 MB in total, compressed.

The raw JSON in the GZ files, viewed in a text editor, looks like:

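(The original screenshot is unavailable; the record below is an illustrative reconstruction based on the fields we parse later in this article. The values are made up.)

{
  "time": 1467331200,
  "city": {
    "id": 5128638,
    "name": "New York",
    "country": "US",
    "coord": { "lat": 40.73, "lon": -73.99 }
  },
  "clouds": { "all": 75 },
  "main": { "temp": 300.15, "temp_min": 298.15, "temp_max": 302.15 },
  "weather": [
    { "main": "Clouds", "description": "broken clouds", "icon": "04d" }
  ],
  "wind": { "deg": 180, "speed": 5.1 }
}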

Let's load this data into Snowflake.

Step1: Create a Database

create database weather;

Step2: Create a Table

Snowflake’s VARIANT data type allows Snowflake to ingest semi-structured data without having to pre-define the schema.

create table json_weather_data (v variant);

Step3: Create a Stage

External stages are storage locations outside the Snowflake environment, in another cloud storage service: Amazon S3, Microsoft Azure storage, or GCP buckets.

create stage nyc_weather
url = 's3://snowflake-workshop-lab/weather-nyc';

List the files in the stage (LIST returns each file's name, size, MD5 hash, and last-modified timestamp):

list @nyc_weather;

Step4: Copy the JSON data from the Stage to the Table

copy into json_weather_data 
from @nyc_weather 
file_format = (type=json);

Check whether it copied correctly!!

select * from json_weather_data limit 10;

Wow!! We have the JSON data loaded into our table; each row holds one complete JSON document in the single VARIANT column v. This is pretty cool.

Step5: Create a VIEW to format it properly (we need this to parse the JSON into columns)

The v:city.coord.lat::float syntax traverses the JSON document stored in v and casts the extracted value to a SQL type. Subtracting 273.15 converts the temperatures from Kelvin to Celsius, and the WHERE clause keeps only the New York City records (city id 5128638).

create view json_weather_data_view as
select
  v:time::timestamp as observation_time,
  v:city.id::int as city_id,
  v:city.name::string as city_name,
  v:city.country::string as country,
  v:city.coord.lat::float as city_lat,
  v:city.coord.lon::float as city_lon,
  v:clouds.all::int as clouds,
  (v:main.temp::float)-273.15 as temp_avg,
  (v:main.temp_min::float)-273.15 as temp_min,
  (v:main.temp_max::float)-273.15 as temp_max,
  v:weather[0].main::string as weather,
  v:weather[0].description::string as weather_desc,
  v:weather[0].icon::string as weather_icon,
  v:wind.deg::float as wind_dir,
  v:wind.speed::float as wind_speed
from json_weather_data
where city_id = 5128638;

Step6: Query the JSON data from the view as if you were querying a table

select * from json_weather_data_view
where date_trunc('month',observation_time) = '2018-01-01' 
limit 20;

Hurray!! We have nicely formatted, queryable data in the view. That's it, we are done!!

Little bit about Views & Materialized Views

A view allows the result of a query to be accessed as if it were a table. Views can help you: present data to end users in a cleaner manner (as in this article, where we present “ugly” JSON in a columnar format), limit what end users can view in a source table for privacy/security reasons, or write more modular SQL.

There are also Materialized Views in which SQL results are stored, almost as though the results were a table. This allows faster access, but requires storage space. Materialized Views require Snowflake Enterprise Edition or higher.

Now let's quickly look at how we can do the same process in Databricks:

What is Databricks?

Databricks is an industry-leading, cloud-based data engineering tool used for processing and transforming massive quantities of data and exploring the data through machine learning models. This Apache Spark-based platform runs a distributed system behind the scenes, meaning the workload is automatically split across various processors and scales up and down on demand. The increased efficiency results in direct time and cost savings for massive tasks.

Let's follow the same process on a different platform altogether!!

Step1: Creating a Schema to Parse / Read the JSON

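The original code screenshot is unavailable, so here is a minimal sketch of such a schema, assuming the same field names we parsed in the Snowflake view above:

from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, DoubleType, ArrayType)

# Schema mirroring the fields parsed in the Snowflake view (field names assumed)
weather_schema = StructType([
    StructField("time", LongType()),  # observation time as epoch seconds
    StructField("city", StructType([
        StructField("id", LongType()),
        StructField("name", StringType()),
        StructField("country", StringType()),
        StructField("coord", StructType([
            StructField("lat", DoubleType()),
            StructField("lon", DoubleType()),
        ])),
    ])),
    StructField("clouds", StructType([StructField("all", LongType())])),
    StructField("main", StructType([
        StructField("temp", DoubleType()),      # temperatures in Kelvin
        StructField("temp_min", DoubleType()),
        StructField("temp_max", DoubleType()),
    ])),
    StructField("weather", ArrayType(StructType([
        StructField("main", StringType()),
        StructField("description", StringType()),
        StructField("icon", StringType()),
    ]))),
    StructField("wind", StructType([
        StructField("deg", DoubleType()),
        StructField("speed", DoubleType()),
    ])),
])

Defining the schema up front plays the same role as Snowflake's VARIANT-plus-view approach, except here the structure is declared before the read instead of being parsed afterwards.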

Step2: Read the JSON data from S3 (I have already mounted the S3 bucket on DBFS, the Databricks File System)

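Again a sketch in place of the missing screenshot; the mount path below is hypothetical, so substitute your own:

# Read the gzipped JSON files with the explicit schema (avoids a schema-inference pass)
df = (spark.read
      .schema(weather_schema)
      .json("/mnt/nyc-weather/"))   # hypothetical DBFS mount point

df.printSchema()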

Yeah!! We have created our Spark DataFrame

Step3: Some clean-up Data Engineering!!

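The screenshot is missing here too; below is a plausible sketch of the clean-up, flattening the nested fields and converting temperatures from Kelvin to Celsius, mirroring the Snowflake view (column names assumed):

from pyspark.sql.functions import col, from_unixtime

clean_df = (df
    .select(
        from_unixtime(col("time")).cast("timestamp").alias("observation_time"),
        col("city.id").alias("city_id"),
        col("city.name").alias("city_name"),
        col("city.country").alias("country"),
        col("city.coord.lat").alias("city_lat"),
        col("city.coord.lon").alias("city_lon"),
        col("clouds.all").alias("clouds"),
        (col("main.temp") - 273.15).alias("temp_avg"),      # Kelvin -> Celsius
        (col("main.temp_min") - 273.15).alias("temp_min"),
        (col("main.temp_max") - 273.15).alias("temp_max"),
        col("weather")[0]["main"].alias("weather"),
        col("weather")[0]["description"].alias("weather_desc"),
        col("wind.deg").alias("wind_dir"),
        col("wind.speed").alias("wind_speed"),
    )
    .where(col("city_id") == 5128638))   # keep only New York City, as in the Snowflake view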

Step4: Creating a Temp Table from the DataFrame

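This step is a one-liner; it registers the DataFrame so it can be queried with SQL (the view name here is my choice):

# Register the DataFrame as a temporary view for SQL queries
clean_df.createOrReplaceTempView("json_weather_data")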

Step5: Query the table (yes, within Databricks itself!!)

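For example, the same January 2018 query we ran in Snowflake; a sketch, and in a notebook you could equally run this in a %sql cell:

# display() renders the result as a table in a Databricks notebook
display(spark.sql("""
    select *
    from json_weather_data
    where date_trunc('month', observation_time) = '2018-01-01'
    limit 20
"""))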

So, we have seen how quickly we can load and query JSON data in both Snowflake and Databricks. Both are just amazing and I am loving them!!

I hope this step-by-step explanation showed you, at least at a high level, how easy it is to work on both platforms.

Credits: Snowflake 0 to 90 Webinar. Databricks Community Edition

