Ingesting, Parsing and Querying Semi-Structured Data (JSON) in Snowflake vs Databricks!!!
Deepak Rajak
Data Engineering /Advanced Analytics Technical Delivery Lead at Exusia, Inc.
Let’s get directly into the business. In this article we will go through the steps of ingesting, parsing and querying JSON data in two of the most popular and advanced cloud platforms: Snowflake and Databricks.
Let’s start with Snowflake:
What is Snowflake?
Snowflake is a cloud data warehouse built on top of public cloud (AWS / Azure / GCP) infrastructure and is a true SaaS offering. There is no hardware (virtual or physical) for you to select, install, configure, or manage. There is no software for you to install, configure, or manage. All ongoing maintenance, management, and tuning is handled by Snowflake.
Let's start with our process:
Our Data:
The JSON data consists of weather information provided by OpenWeatherMap, detailing the historical conditions of New York City from 2016-07-05 to 2019-06-25. It is staged on AWS S3, where the data represents 57.9k rows across 61 objects, 2.5MB total size compressed.
The raw JSON, stored in gzipped (.gz) files, looks like this in a text editor:
Let's load this data into Snowflake.
Step 1: Create a Database
create database weather;
Step 2: Create a Table
Snowflake’s VARIANT data type allows Snowflake to ingest semi-structured data without having to pre-define the schema.
create table json_weather_data (v variant);
Step 3: Create a Stage
External stages are storage locations outside the Snowflake environment, in another cloud provider's storage: Amazon S3, Microsoft Azure storage, or GCP buckets.
create stage nyc_weather url = 's3://snowflake-workshop-lab/weather-nyc';
List the contents of the stage:
list @nyc_weather;
Step 4: Copy the JSON Data from the Stage to the Table
copy into json_weather_data from @nyc_weather file_format = (type=json);
Check that the data copied correctly!!
select * from json_weather_data limit 10;
Wow!! We have the JSON data loaded into our table. This is pretty cool.
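Even before creating a view, you can drill into the VARIANT column directly with Snowflake's path-and-cast notation. For example (the column aliases here are just illustrative):

-- extract nested attributes from the VARIANT column v
select
    v:city.name::string as city,
    v:main.temp::float as temp_kelvin
from json_weather_data
limit 5;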
Step 5: Create a View to Parse and Format the Data Properly
create view json_weather_data_view as
select
    v:time::timestamp as observation_time,
    v:city.id::int as city_id,
    v:city.name::string as city_name,
    v:city.country::string as country,
    v:city.coord.lat::float as city_lat,
    v:city.coord.lon::float as city_lon,
    v:clouds.all::int as clouds,
    (v:main.temp::float) - 273.15 as temp_avg,
    (v:main.temp_min::float) - 273.15 as temp_min,
    (v:main.temp_max::float) - 273.15 as temp_max,
    v:weather[0].main::string as weather,
    v:weather[0].description::string as weather_desc,
    v:weather[0].icon::string as weather_icon,
    v:wind.deg::float as wind_dir,
    v:wind.speed::float as wind_speed
from json_weather_data
where city_id = 5128638;
Step 6: Query the JSON Data from the View as if You Were Querying a Table
select * from json_weather_data_view where date_trunc('month',observation_time) = '2018-01-01' limit 20;
Hurray!! We have nicely formatted, queryable data in the view. That's it, we are done!!
A Little Bit about Views & Materialized Views
A view allows the result of a query to be accessed as if it were a table. Views can help you: present data to end users in a cleaner manner (as in this article, where we present “ugly” JSON in a columnar format), limit what end users can view in a source table for privacy/security reasons, or write more modular SQL.
There are also Materialized Views in which SQL results are stored, almost as though the results were a table. This allows faster access, but requires storage space. Materialized Views require Snowflake Enterprise Edition or higher.
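As an illustration only (this is not part of the original walkthrough), a materialized view over our raw table might look like the sketch below. The name json_weather_data_mv and the column subset are hypothetical, and this assumes an Enterprise Edition account:

-- stores the computed results, which Snowflake keeps up to date
-- automatically as new data lands in json_weather_data
create materialized view json_weather_data_mv as
select
    v:time::timestamp as observation_time,
    v:city.name::string as city_name,
    (v:main.temp::float) - 273.15 as temp_avg
from json_weather_data;

The trade-off is extra storage and maintenance cost in exchange for faster reads on frequently run queries.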
Now let's quickly look at how we can do the same process in Databricks:
What is Databricks?
Databricks is an industry-leading, cloud-based data engineering tool used for processing and transforming massive quantities of data and exploring the data through machine learning models. This Apache Spark-based platform runs a distributed system behind the scenes, meaning the workload is automatically split across various processors and scales up and down on demand. Increased efficiency results in direct time and cost savings for massive tasks.
Let's follow the same process on an altogether different platform!!
Step 1: Create a Schema to Parse/Read the JSON
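The original notebook cell isn't reproduced here, so the following PySpark sketch is a reconstruction: it covers only a subset of the OpenWeatherMap fields used in the Snowflake view above, and the variable name weather_schema is mine.

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DoubleType, LongType)

# Hypothetical schema mirroring part of the OpenWeatherMap JSON structure
weather_schema = StructType([
    StructField("time", LongType(), True),          # epoch seconds
    StructField("city", StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("country", StringType(), True),
        StructField("coord", StructType([
            StructField("lat", DoubleType(), True),
            StructField("lon", DoubleType(), True),
        ]), True),
    ]), True),
    StructField("main", StructType([
        StructField("temp", DoubleType(), True),     # Kelvin
        StructField("temp_min", DoubleType(), True),
        StructField("temp_max", DoubleType(), True),
    ]), True),
    StructField("wind", StructType([
        StructField("deg", DoubleType(), True),
        StructField("speed", DoubleType(), True),
    ]), True),
])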
Step 2: Read the JSON Data from S3 (I have already mounted the S3 bucket on DBFS, the Databricks File System)
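Again a sketch rather than the original cell: assuming the bucket is mounted at the hypothetical path /mnt/weather-nyc, spark.read can consume the gzipped JSON files directly:

# spark is the SparkSession provided by the Databricks notebook
df = spark.read.schema(weather_schema).json("/mnt/weather-nyc/")
df.printSchema()
df.show(5, truncate=False)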
Yeah!! We have created our Spark DataFrame
Step 3: Some Data Engineering Clean-Up!!
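The exact clean-up from the original notebook isn't shown; a plausible version, mirroring the flattening and Kelvin-to-Celsius conversion done in the Snowflake view, might look like this:

from pyspark.sql import functions as F

# Flatten the nested structure into plain columns and filter to NYC
clean_df = (
    df.select(
        F.from_unixtime(F.col("time")).cast("timestamp").alias("observation_time"),
        F.col("city.id").alias("city_id"),
        F.col("city.name").alias("city_name"),
        (F.col("main.temp") - 273.15).alias("temp_avg"),
        (F.col("main.temp_min") - 273.15).alias("temp_min"),
        (F.col("main.temp_max") - 273.15).alias("temp_max"),
        F.col("wind.deg").alias("wind_dir"),
        F.col("wind.speed").alias("wind_speed"),
    )
    .filter(F.col("city_id") == 5128638)
)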
Step 4: Create a Temp Table from the DataFrame
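In Spark this is done with createOrReplaceTempView (Spark's term for a temp table is a temporary view). Assuming the clean_df from the previous sketch:

# Register the cleaned DataFrame so it can be queried with SQL
clean_df.createOrReplaceTempView("json_weather_data")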
Step 5: Query the Table (Yes, in Databricks itself!!)
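A minimal query sketch against the temporary view, equivalent in spirit to the Snowflake query above (you could also run the same SQL in a %sql notebook cell):

# Query the temporary view for January 2018 observations
spark.sql("""
    select city_name, observation_time, temp_avg
    from json_weather_data
    where observation_time >= '2018-01-01'
      and observation_time <  '2018-02-01'
    order by observation_time
    limit 20
""").show()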
So, we have seen how quickly we can load and query JSON data in both Snowflake and Databricks. Both are just amazing, and I am loving them!!
I hope this step-by-step walkthrough showed you, at least at a high level, how easy it is to work on both platforms.
Credits: Snowflake 0 to 90 Webinar. Databricks Community Edition