JSON in Fabric Datawarehouse
Jovan Popovic
Principal Program Manager at Microsoft, working on Microsoft Fabric Warehouse. Previously worked on Azure Synapse, Azure SQL, Azure SQL Managed Instance, and SQL Server.
Traditionally, data warehouses operate with strongly schematized tabular data organized in star or snowflake models, providing optimal performance for analytical queries. However, modern data warehouses must handle the famous 3Vs that characterize big data: volume, velocity, and variety.
Variety is particularly challenging because it encompasses diverse data formats and structures.
JSON (JavaScript Object Notation) is the format of choice in many data systems (from plain file storage to NoSQL and relational databases) for storing data without a strongly typed schema. JSON is used to represent documents in NoSQL databases such as Azure CosmosDB, to serialize information sent by IoT devices, or to store logs and information with variable schemas in file systems. JSON’s flexible schema allows it to represent complex data structures, including nested objects and arrays, which are common in today’s diverse data sources. Additionally, properties of JSON can be easily added without schema changes, enhancing its flexibility.
Integrating JSON data into a data warehouse is necessary to enable analytics on semi-structured data.
Fabric DW is a modern data warehouse that enables you to analyze large volumes of data and ingest large batches of data produced in your source systems. Therefore, Fabric DW handles volume and velocity by design. The third V, handling data variety, is addressed with its built-in JSON support.
In this article, we will explore how you can work with JSON data in Fabric DW.
JSON in Fabric DW
Fabric DW is built on the SQL Server/Azure SQL engine, which means it leverages the same functions available in these platforms. The JSON functionalities you use in SQL Server or Azure SQL to process JSON data are also available in Fabric DW.
You can use the following functionalities to work with JSON data in Fabric DW: the JSON_VALUE function for extracting individual values, the OPENJSON function for parsing documents into rowsets, and the JSON_OBJECT and JSON_ARRAY functions together with the FOR JSON clause for formatting query results as JSON.
Since these are standard SQL functions and operators, you can easily combine them with other functionalities in the T-SQL language.
Ingesting JSON data into Fabric DW
Let's start by loading JSON data into the data warehouse. First, we need to create a target table where the JSON data will be stored. Since JSON documents are represented as UTF-8 text, we can use the VARCHAR(MAX) type in Fabric DW to store the JSON documents that we will ingest.
IF schema_id('stage') IS NULL
EXEC('CREATE SCHEMA stage');
CREATE TABLE stage.charging_stations (json VARCHAR(MAX));
Note that Fabric DW uses UTF-8 collation for VARCHAR columns, ensuring that JSON data is stored without any corruption, even if it contains non-Latin characters. Use the VARCHAR(MAX) type if you plan to load JSON documents larger than 8KB.
Now we are ready to import JSON data into the staging table. In this example, I'm using a public JSON dataset containing information about US charging stations, which is stored here: US EV Charging Locations.
After downloading this JSON file and storing it in your Azure Data Lake storage account, you can load it into the staging table using the COPY INTO command:
COPY INTO stage.charging_stations
FROM 'https://<storage>.dfs.core.windows.net/cfahlgren1/us-ev-charging-locations/charging_stations.json'
WITH (FILE_TYPE='CSV', FIELDTERMINATOR='0x00', FIELDQUOTE='0x02')
Although COPY INTO supports only CSV and Parquet file formats, you can easily tweak it to load line-delimited JSON files by overriding the terminators and quotes.
This command will load each JSON line from the source file as a separate VARCHAR(MAX) cell in the staging table. Once the data is loaded, you can easily explore the JSON documents stored in the table:
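A quick SELECT shows the raw documents. The sample document shape in the comment below is illustrative, based on the properties referenced later in this article (lat, lng, and the metadata object); actual rows come from the downloaded dataset:

```sql
-- Peek at a few of the ingested documents
SELECT TOP 2 json
FROM stage.charging_stations;

-- A row might look like (illustrative shape, not actual data):
-- {"lat": 40.7, "lng": -74.0,
--  "metadata": {"name": "...", "brand": "ChargePoint", "addr:city": "New York"}}
```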
Each row contains a separate JSON document with information about a charging station. In the following sections, we will explore how to analyze and transform this data.
Analyzing JSON Data
Now that you have loaded data into your staging table, you can extract values from the JSON documents to analyze the data. Two functions will be useful: JSON_VALUE, which extracts a single value at a given JSON path, and OPENJSON, which parses a document and returns its properties as a rowset.
You can easily extract city and brand values from each JSON document and use them in T-SQL queries:
SELECT TOP 10
JSON_VALUE(json, '$.metadata."addr:city"') AS city,
JSON_VALUE(json, '$.metadata.brand') AS brand,
COUNT(*) AS [Number of charging stations]
FROM stage.charging_stations
WHERE JSON_VALUE(json, '$.metadata."addr:city"') IS NOT NULL
AND JSON_VALUE(json, '$.metadata.brand') IS NOT NULL
GROUP BY JSON_VALUE(json, '$.metadata.brand'),
JSON_VALUE(json, '$.metadata."addr:city"')
ORDER BY COUNT(*) DESC
The JSON_VALUE function takes a VARCHAR value formatted as JSON text and extracts a single value at the specified path. As a result, you can use the returned properties of the JSON documents in GROUP BY and ORDER BY clauses to create reports:
Note that the JSON_VALUE function allows you to access nested properties and properties with special characters in keys (e.g., addr:city).
Alternatively, you can use the OPENJSON function, which takes a JSON document and returns multiple properties. The OPENJSON function enables you to parse the entire JSON document and return all JSON properties as columns in the result set.
The following query demonstrates how to return all relevant properties from each cell in the JSON documents:
SELECT properties.*
FROM stage.charging_stations
CROSS APPLY OPENJSON (json)
WITH (
lat float '$.lat',
lng float '$.lng',
[metadata.@id] varchar(64) '$.metadata."@id"',
[metadata.name] varchar(128) '$.metadata.name',
[metadata.brand] varchar(128) '$.metadata.brand',
[metadata.operator] varchar(256) '$.metadata.operator',
[metadata.amenity] varchar(64) '$.metadata.amenity',
[metadata.source] varchar(64) '$.metadata.source',
[metadata.capacity] varchar(128) '$.metadata.capacity',
[metadata.shop] varchar(8000) '$.metadata.shop',
[metadata.note] varchar(8000) '$.metadata.note',
[metadata.brand:wikidata] varchar(8000) '$.metadata."brand:wikidata"',
[metadata.brand:wikipedia] varchar(8000) '$.metadata."brand:wikipedia"',
[metadata.addr:housenumber] varchar(32) '$.metadata."addr:housenumber"',
[metadata.addr:street] varchar(8000) '$.metadata."addr:street"',
[metadata.addr:city] varchar(8000) '$.metadata."addr:city"',
[metadata.addr:country] varchar(8000) '$.metadata."addr:country"',
[metadata.operator:wikidata] varchar(8000) '$.metadata."operator:wikidata"',
[metadata.fuel:cng] varchar(32) '$.metadata."fuel:cng"',
[metadata.fuel:e10] varchar(32) '$.metadata."fuel:e10"',
[metadata.fuel:e85] varchar(32) '$.metadata."fuel:e85"',
[metadata.fuel:lpg] varchar(32) '$.metadata."fuel:lpg"',
[metadata.fuel:1_25] varchar(32) '$.metadata."fuel:1_25"',
[metadata.fuel:1_50] varchar(32) '$.metadata."fuel:1_50"',
[metadata.designation] varchar(32) '$.metadata.designation',
[metadata.fuel:biogas] varchar(32) '$.metadata."fuel:biogas"',
[metadata.fuel:diesel] varchar(32) '$.metadata."fuel:diesel"',
[metadata.fuel:biodiesel] varchar(32) '$.metadata."fuel:biodiesel"',
[metadata.fuel:octane_91] varchar(32) '$.metadata."fuel:octane_91"',
[metadata.fuel:octane_95] varchar(32) '$.metadata."fuel:octane_95"',
[metadata.fuel:octane_98] varchar(32) '$.metadata."fuel:octane_98"',
[metadata.fuel:GTL_diesel] varchar(32) '$.metadata."fuel:GTL_diesel"',
[metadata.fuel:HGV_diesel] varchar(32) '$.metadata."fuel:HGV_diesel"',
[metadata.fuel:octane_100] varchar(32) '$.metadata."fuel:octane_100"',
[metadata.fuel:electricity] varchar(32) '$.metadata."fuel:electricity"'
) AS properties
WHERE [metadata.fuel:octane_100] IS NOT NULL
As you can see, OPENJSON enables you to parse complex JSON structures with nested fields and handle various property names, including those containing colons (e.g., addr:city or brand:wikidata).
The OPENJSON function is particularly important when you need to move JSON data from the staging table to a strongly schematized table. If you create a DW table (for example gold.charging_stations) with the columns that match the properties of the JSON documents, you can combine the query above with the INSERT INTO statement to load JSON data from the staging to gold table:
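A sketch of such a target table is shown below. The schema name and the abbreviated column list are assumptions; the column names and types mirror the OPENJSON WITH clause from the query above, and you would extend the list to cover every property you need:

```sql
-- Hypothetical gold table whose columns match the parsed JSON properties
IF schema_id('gold') IS NULL
    EXEC('CREATE SCHEMA gold');

CREATE TABLE gold.charging_stations (
    lat float,
    lng float,
    [metadata.name] varchar(128),
    [metadata.brand] varchar(128),
    [metadata.addr:city] varchar(8000)
    -- ... remaining metadata columns, mirroring the WITH clause ...
);
```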
INSERT INTO gold.charging_stations
SELECT properties.*
FROM stage.charging_stations
CROSS APPLY OPENJSON (json)
WITH ( lat float '$.lat', lng float '$.lng', ... ) AS properties
Converting JSON properties to a strongly typed schema and separating columns for each property can boost the performance of your queries.
Another interesting use case for OPENJSON is "schematizing" JSON data. You can create a view whose definition is the SELECT * FROM json_table CROSS APPLY OPENJSON query, so that the view exposes all JSON properties as columns.
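A minimal sketch of such a view, reusing the OPENJSON query from earlier (the view name is hypothetical and the column list is abbreviated):

```sql
-- Sketch: expose JSON properties as view columns;
-- extend the WITH clause with the full property list as needed
CREATE VIEW stage.charging_stations_schematized
AS
SELECT properties.*
FROM stage.charging_stations
CROSS APPLY OPENJSON (json)
WITH (
    lat float '$.lat',
    lng float '$.lng',
    [metadata.brand] varchar(128) '$.metadata.brand',
    [metadata.addr:city] varchar(8000) '$.metadata."addr:city"'
) AS properties;
```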
This might be useful if you need to expose the JSON data to external tools like Power BI to create reports on the raw data.
For the best performance, it is better to transform your JSON data from a staging table into a real DW table with persisted columns, even if that requires an additional transformation step. Parsing JSON on every query cannot match the performance of a table that permanently holds the JSON properties as columns.
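As an alternative to the CREATE TABLE plus INSERT INTO pattern shown above, this transformation can also be written as a single CREATE TABLE AS SELECT statement. A sketch, with a hypothetical target name and an abbreviated column list:

```sql
-- Sketch: materialize parsed JSON properties into a persisted table
-- (extend the WITH clause as in the earlier OPENJSON query)
CREATE TABLE gold.charging_stations_persisted
AS
SELECT properties.*
FROM stage.charging_stations
CROSS APPLY OPENJSON (json)
WITH (
    lat float '$.lat',
    lng float '$.lng',
    [metadata.brand] varchar(128) '$.metadata.brand'
) AS properties;
```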
Formatting Query Results as JSON Text
Instead of tabular results, you can return the results of a T-SQL query as JSON text using the JSON_OBJECT, JSON_ARRAY functions, and the FOR JSON clause, as shown in the following example:
SELECT TOP 3
JSON_OBJECT(
'city': JSON_VALUE(json, '$.metadata."addr:city"'),
'brand': JSON_VALUE(json, '$.metadata.brand')) as data,
COUNT(*) AS [Number of stations]
FROM stage.charging_stations
WHERE JSON_VALUE(json, '$.metadata."addr:city"') IS NOT NULL
AND JSON_VALUE(json, '$.metadata.brand') IS NOT NULL
GROUP BY JSON_VALUE(json, '$.metadata.brand'),
JSON_VALUE(json, '$.metadata."addr:city"')
ORDER BY COUNT(*) DESC
FOR JSON PATH
The FOR JSON query clause formats the entire result set as a JSON string. You can also create JSON sub-objects using the JSON_OBJECT and JSON_ARRAY functions, which return JSON objects based on the values of columns.
The result of the query above might look like the JSON document in the following example:
[
{
"data": {"city":"New York", "brand":"ChargePoint" },
"Number of stations":11
},
{
"data":{"city":"Payson","brand":"Tesla, Inc."},
"Number of stations":11
},
{
"data":{"city":"San Diego","brand":"Tesla Supercharger"},
"Number of stations":8}
]
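The JSON constructors can also be used on their own, outside a FOR JSON query. A small standalone illustration with arbitrary literal values:

```sql
-- JSON_OBJECT builds an object from key:value pairs;
-- JSON_ARRAY builds an array, and they can be nested
SELECT JSON_OBJECT('city': 'Oslo', 'tags': JSON_ARRAY('ev', 'fast')) AS obj;
-- Returns: {"city":"Oslo","tags":["ev","fast"]}
```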
Conclusion
The JSON functionalities in Fabric DW empower you to efficiently parse loaded JSON documents, extract values for comprehensive analysis and reporting, and format query results as JSON text.
These robust capabilities make Fabric DW an indispensable tool for managing and analyzing semi-structured data within your data warehouse.