JSON in Fabric Datawarehouse
Jovan Popovic
Principal Program Manager at Microsoft, working on Microsoft Fabric Warehouse. Previously worked on Azure Synapse, Azure SQL, Azure SQL Managed Instance, and SQL Server.
Traditionally, data warehouses operate with strongly schematized tabular data organized in star or snowflake models, providing optimal performance for analytical queries. However, modern data warehouses must handle the famous 3Vs that characterize big data: volume, velocity, and variety.
Variety is particularly challenging because it encompasses diverse data formats and structures.
JSON (JavaScript Object Notation) is the format of choice in many data systems (from plain file storage to NoSQL and relational databases) for storing data without a strongly typed schema. JSON is used to represent documents in NoSQL databases such as Azure CosmosDB, to serialize information sent by IoT devices, or to store logs and information with variable schemas in file systems. JSON’s flexible schema allows it to represent complex data structures, including nested objects and arrays, which are common in today’s diverse data sources. Additionally, properties of JSON can be easily added without schema changes, enhancing its flexibility.
Integrating JSON data into a data warehouse is necessary to enable analytics on semi-structured data.
Fabric DW is a modern data warehouse that enables you to analyze large volumes of data and ingest large batches of data produced in your source systems. Therefore, Fabric DW handles volume and velocity by design. The third V, handling data variety, is addressed with its built-in JSON support.
In this article, we will explore how you can work with JSON data in Fabric DW.
JSON in Fabric DW
Fabric DW is built on the SQL Server/Azure SQL engine, which means it leverages the same functions available in these platforms. The JSON functionalities you use in SQL Server or Azure SQL to process JSON data are also available in Fabric DW.
You can use the following functionalities to work with JSON data in Fabric DW: the JSON_VALUE function for extracting individual values, the OPENJSON function for parsing documents into rowsets, and the JSON_OBJECT and JSON_ARRAY functions together with the FOR JSON clause for formatting query results as JSON.
Since these are standard SQL functions and operators, you can easily combine them with other functionalities in the T-SQL language.
Ingesting JSON data into Fabric DW
Let's start by loading JSON data into the data warehouse. First, we need to create a target table where the JSON data will be stored. Since JSON documents are represented as UTF-8 text, we can use the VARCHAR(MAX) type in Fabric DW to store the JSON documents that we will ingest.
IF schema_id('stage') IS NULL
EXEC('CREATE SCHEMA stage');
CREATE TABLE stage.charging_stations (json VARCHAR(MAX));
Note that Fabric DW uses UTF-8 collation for VARCHAR columns, ensuring that JSON data is stored without any corruption, even if it contains non-Latin characters. Use the VARCHAR(MAX) type if you plan to load JSON documents larger than 8KB.
Now we are ready to import JSON data into the staging table. In this example, I'm using a public JSON dataset containing information about US charging stations, which is stored here: US EV Charging Locations.
After downloading this JSON file and storing it in your Azure Data Lake storage account, you can load it into the staging table using the COPY INTO command:
COPY INTO stage.charging_stations
FROM 'https://<storage>.dfs.core.windows.net/cfahlgren1/us-ev-charging-locations/charging_stations.json'
WITH (FILE_TYPE='CSV', FIELDTERMINATOR='0x00', FIELDQUOTE='0x02')
Although COPY INTO supports only CSV and Parquet file formats, you can easily tweak it to load line-delimited JSON files by overriding the terminators and quotes.
This command will load each JSON line from the source file as a separate VARCHAR(MAX) cell in the staging table. Once the data is loaded, you can easily explore the JSON documents stored in the table:
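A quick SELECT shows the raw documents. The sample document shape in the comment below is illustrative, based on the properties referenced later in this article (lat, lng, and the metadata object); actual rows come from the downloaded dataset:

```sql
-- Peek at a few of the ingested documents
SELECT TOP 2 json
FROM stage.charging_stations;

-- A row might look like (illustrative shape, not actual data):
-- {"lat": 40.7, "lng": -74.0,
--  "metadata": {"name": "...", "brand": "ChargePoint", "addr:city": "New York"}}
```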
Each row contains a separate JSON document with information about a charging station. In the following sections, we will explore how to analyze and transform this data.
Analyzing JSON Data
Now that you have loaded data into your staging table, you can extract values from the JSON documents to analyze the data. Two functions will be useful: JSON_VALUE, which extracts a single value at a given JSON path, and OPENJSON, which parses a document and returns its properties as a rowset.
You can easily extract city and brand values from each JSON document and use them in T-SQL queries:
SELECT TOP 10
JSON_VALUE(json, '$.metadata."addr:city"') AS city,
JSON_VALUE(json, '$.metadata.brand') AS brand,
COUNT(*) AS [Number of charging stations]
FROM stage.charging_stations
WHERE JSON_VALUE(json, '$.metadata."addr:city"') IS NOT NULL
AND JSON_VALUE(json, '$.metadata.brand') IS NOT NULL
GROUP BY JSON_VALUE(json, '$.metadata.brand'),
JSON_VALUE(json, '$.metadata."addr:city"')
ORDER BY COUNT(*) DESC
The JSON_VALUE function takes a VARCHAR value formatted as JSON text and extracts a single value at the specified path. As a result, you can use the returned properties of the JSON documents in GROUP BY and ORDER BY clauses to create reports:
Note that the JSON_VALUE function allows you to access nested properties and properties with special characters in keys (e.g., addr:city).
Alternatively, you can use the OPENJSON function, which takes a JSON document and returns multiple properties. The OPENJSON function enables you to parse the entire JSON document and return all JSON properties as columns in the result set.
The following query demonstrates how to return all relevant properties from each cell in the JSON documents:
SELECT properties.*
FROM stage.charging_stations
CROSS APPLY OPENJSON (json)
WITH (
lat float '$.lat',
lng float '$.lng',
[metadata.@id] varchar(64) '$.metadata."@id"',
[metadata.name] varchar(128) '$.metadata.name',
[metadata.brand] varchar(128) '$.metadata.brand',
[metadata.operator] varchar(256) '$.metadata.operator',
[metadata.amenity] varchar(64) '$.metadata.amenity',
[metadata.source] varchar(64) '$.metadata.source',
[metadata.capacity] varchar(128) '$.metadata.capacity',
[metadata.shop] varchar(8000) '$.metadata.shop',
[metadata.note] varchar(8000) '$.metadata.note',
[metadata.brand:wikidata] varchar(8000) '$.metadata."brand:wikidata"',
[metadata.brand:wikipedia] varchar(8000) '$.metadata."brand:wikipedia"',
[metadata.addr:housenumber] varchar(32) '$.metadata."addr:housenumber"',
[metadata.addr:street] varchar(8000) '$.metadata."addr:street"',
[metadata.addr:city] varchar(8000) '$.metadata."addr:city"',
[metadata.addr:country] varchar(8000) '$.metadata."addr:country"',
[metadata.operator:wikidata] varchar(8000) '$.metadata."operator:wikidata"',
[metadata.fuel:cng] varchar(32) '$.metadata."fuel:cng"',
[metadata.fuel:e10] varchar(32) '$.metadata."fuel:e10"',
[metadata.fuel:e85] varchar(32) '$.metadata."fuel:e85"',
[metadata.fuel:lpg] varchar(32) '$.metadata."fuel:lpg"',
[metadata.fuel:1_25] varchar(32) '$.metadata."fuel:1_25"',
[metadata.fuel:1_50] varchar(32) '$.metadata."fuel:1_50"',
[metadata.designation] varchar(32) '$.metadata.designation',
[metadata.fuel:biogas] varchar(32) '$.metadata."fuel:biogas"',
[metadata.fuel:diesel] varchar(32) '$.metadata."fuel:diesel"',
[metadata.fuel:biodiesel] varchar(32) '$.metadata."fuel:biodiesel"',
[metadata.fuel:octane_91] varchar(32) '$.metadata."fuel:octane_91"',
[metadata.fuel:octane_95] varchar(32) '$.metadata."fuel:octane_95"',
[metadata.fuel:octane_98] varchar(32) '$.metadata."fuel:octane_98"',
[metadata.fuel:GTL_diesel] varchar(32) '$.metadata."fuel:GTL_diesel"',
[metadata.fuel:HGV_diesel] varchar(32) '$.metadata."fuel:HGV_diesel"',
[metadata.fuel:octane_100] varchar(32) '$.metadata."fuel:octane_100"',
[metadata.fuel:electricity] varchar(32) '$.metadata."fuel:electricity"'
) AS properties
WHERE [metadata.fuel:octane_100] IS NOT NULL
As you can see, OPENJSON enables you to parse complex JSON structures with nested fields and handle various property names, including those containing colons (e.g., addr:city or brand:wikidata).
The OPENJSON function is particularly important when you need to move JSON data from the staging table to a strongly schematized table. If you create a DW table (for example gold.charging_stations) with the columns that match the properties of the JSON documents, you can combine the query above with the INSERT INTO statement to load JSON data from the staging to gold table:
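A sketch of such a target table is shown below. The schema name and the abbreviated column list are assumptions; the column names and types mirror the OPENJSON WITH clause from the query above, and you would extend the list to cover every property you need:

```sql
-- Hypothetical gold table whose columns match the parsed JSON properties
IF schema_id('gold') IS NULL
    EXEC('CREATE SCHEMA gold');

CREATE TABLE gold.charging_stations (
    lat float,
    lng float,
    [metadata.name] varchar(128),
    [metadata.brand] varchar(128),
    [metadata.addr:city] varchar(8000)
    -- ... remaining metadata columns, mirroring the WITH clause ...
);
```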
INSERT INTO gold.charging_stations
SELECT properties.*
FROM stage.charging_stations
CROSS APPLY OPENJSON (json)
WITH ( lat float '$.lat', lng float '$.lng', ... ) AS properties
Converting JSON properties to a strongly typed schema and separating columns for each property can boost the performance of your queries.
Another interesting use case for OPENJSON is "schematizing" JSON data. You can create a view whose definition is the SELECT * FROM json_table CROSS APPLY OPENJSON query, so that the view exposes all JSON properties as columns.
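A minimal sketch of such a view, reusing the OPENJSON query from earlier (the view name is hypothetical and the column list is abbreviated):

```sql
-- Sketch: expose JSON properties as view columns;
-- extend the WITH clause with the full property list as needed
CREATE VIEW stage.charging_stations_schematized
AS
SELECT properties.*
FROM stage.charging_stations
CROSS APPLY OPENJSON (json)
WITH (
    lat float '$.lat',
    lng float '$.lng',
    [metadata.brand] varchar(128) '$.metadata.brand',
    [metadata.addr:city] varchar(8000) '$.metadata."addr:city"'
) AS properties;
```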
This might be useful if you need to expose the JSON data to external tools like Power BI to create reports on the raw data.
For the best performance, it is better to transform your JSON data from a staging table into a real DW table with persisted columns, even if that requires an additional transformation step. Parsing JSON on every query cannot match the performance of a table that permanently holds the JSON properties as columns.
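As an alternative to the CREATE TABLE plus INSERT INTO pattern shown above, this transformation can also be written as a single CREATE TABLE AS SELECT statement. A sketch, with a hypothetical target name and an abbreviated column list:

```sql
-- Sketch: materialize parsed JSON properties into a persisted table
-- (extend the WITH clause as in the earlier OPENJSON query)
CREATE TABLE gold.charging_stations_persisted
AS
SELECT properties.*
FROM stage.charging_stations
CROSS APPLY OPENJSON (json)
WITH (
    lat float '$.lat',
    lng float '$.lng',
    [metadata.brand] varchar(128) '$.metadata.brand'
) AS properties;
```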
Formatting Query Results as JSON Text
Instead of tabular results, you can return the results of a T-SQL query as JSON text using the JSON_OBJECT, JSON_ARRAY functions, and the FOR JSON clause, as shown in the following example:
SELECT TOP 3
JSON_OBJECT(
'city': JSON_VALUE(json, '$.metadata."addr:city"'),
'brand': JSON_VALUE(json, '$.metadata.brand')) as data,
COUNT(*) AS [Number of stations]
FROM stage.charging_stations
WHERE JSON_VALUE(json, '$.metadata."addr:city"') IS NOT NULL
AND JSON_VALUE(json, '$.metadata.brand') IS NOT NULL
GROUP BY JSON_VALUE(json, '$.metadata.brand'),
JSON_VALUE(json, '$.metadata."addr:city"')
ORDER BY COUNT(*) DESC
FOR JSON PATH
The FOR JSON query clause formats the entire result set as a JSON string. You can also create JSON sub-objects using the JSON_OBJECT and JSON_ARRAY functions, which return JSON objects based on the values of columns.
The result of the query above might look like the JSON document in the following example:
[
{
"data": {"city":"New York", "brand":"ChargePoint" },
"Number of stations":11
},
{
"data":{"city":"Payson","brand":"Tesla, Inc."},
"Number of stations":11
},
{
"data":{"city":"San Diego","brand":"Tesla Supercharger"},
"Number of stations":8}
]
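The JSON constructors can also be used on their own, outside a FOR JSON query. A small standalone illustration with arbitrary literal values:

```sql
-- JSON_OBJECT builds an object from key:value pairs;
-- JSON_ARRAY builds an array, and they can be nested
SELECT JSON_OBJECT('city': 'Oslo', 'tags': JSON_ARRAY('ev', 'fast')) AS obj;
-- Returns: {"city":"Oslo","tags":["ev","fast"]}
```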
Conclusion
The JSON functionalities in Fabric DW empower you to efficiently parse loaded JSON documents, extract values for comprehensive analysis and reporting, and format query results as JSON text.
These robust capabilities make Fabric DW an indispensable tool for managing and analyzing semi-structured data within your data warehouse.