Ingesting XML file in Snowflake from Azure Blob
This article discusses inserting XML data into a Snowflake table from Azure Blob Storage. This excludes Azure Data Factory orchestration, and only includes an XML file, blob setups, snowflake object configurations, and SQL queries.
First, we must ensure our objects are in place and prepared. Let's talk about them one by one.
Azure Blob Storage
We need a blob storage and a container where our XML file will be dropped. Make sure you have the proper privileges to drop/create a file.
In Img. 1.1, you can see that I have a blob storage, and under the Data Storage (sidebar left side), you will see Containers, and inside that will be the list of your containers. Later, I'll drop an XML file under the test container.
Select your container, and upload your XML file like the image below. For this example, I uploaded my my_xml_file.xml in my test container.
Now that the file is uploaded in the blob storage, then we're done with this part.
Snowflake Setup
Let's turn our heads to Snowflake because we'll set up some things here.
File Format
File format object preemptively tells Snowflake what kind of data you’ll ingest in the staging table.
CREATE OR REPLACE FILE FORMAT XML_LOAD
TYPE = 'xml';
The query above creates a file format object named XML_LOAD. The second line is a parameter that specifies the input file format, which is "xml". This file format object will be used in the next setup.
Staging Table / Storage Integration
External stages are objects which references data sources outside Snowflake. This may be from Azure Blobs, Amazon S3s, or Google Data Cloud. From here you can view the list of objects inside a specific container.
CREATE OR REPLACE STAGE XML_EXT_STAGE
URL = 'azure://xxxxxxx.blob.core.windows.net/test/'
FILE_FORMAT = 'XML_LOAD';
The first line is the declaration for creating an external stage called XML_EXT_STAGE.
The second parameter is the URL. It specifies the link to the container. To retrieve the URL of your container, click on the ellipsis on the far right side of your container, and click Container properties.
You will then redirect to this page. In the URL section, you'll see the URL value of your container.
Returning to the query, the third line represents the file format that will be ingested in this external stage. This is where we need to use the file format we created earlier, which is XML_LOAD, and set that as a value.
However, the query above only works for public containers, which may be possible for personal projects or subscriptions.
But in a corporate setting where everything is private, we need to tweak the query above to add more parameters.
To provide the external stage a privilege to access our blob storage, we need to create another object: Storage Integration.
CREATE OR REPLACE STORAGE INTEGRATION MY_STORAGE_INTEGRATION
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'AZURE'
ENABLED = TRUE
AZURE_TENANT_ID = 'xx-0000-xxx-0000'
The first line specifies the creation of a storage integration object named MY_STORAGE_INTEGRATION.
On the second line, we specify what type of integration we will be using, and that's external stage.
On the third line, we specify what cloud provider we will be using, and that is Azure. Other possible values could be S3 for Amazon or GCS for Google Cloud Storage.
On the fourth line, we want our storage integration to be utilized, so we're enabling it.
For the fifth line, we need to provide the Azure tenant id.
To find the Azure Tenant ID:
Now that we've created our Storage Integration object, we can include it in our external stage configuration. The query will now look like this:
领英推荐
CREATE OR REPLACE STAGE XML_EXT_STAGE
URL = 'azure://xxxxxxx.blob.core.windows.net/test/'
STORAGE_INTEGRATION = MY_STORAGE_INTEGRATION
FILE_FORMAT = 'XML_LOAD';
To verify if your external stage is successfully mounted to your blob, you can execute this query:
LIST @XML_EXT_STAGE;
If your blob is not empty, it will return a table of items that looks like this.
Every row in the result will represent each item in your container.
Target Table
The next thing we need to prepare is the target table of the xml file.
CREATE OR REPLACE TABLE XML_TARGET (
DATA VARIANT,
ORIGIN_FILE VARCHAR(200),
LOAD_TIMESTAMP TIMESTAMP_LTZ(9)
);
In the query above, we are creating a table named XML_TARGET with three columns - DATA of data type variant, FILENAME of data type varchar (similar to string type), and LOAD_TIMESTAMP of data type timestamp_ltz.
One thing to note is that the XML file format can only generate one variant column, hence the DATA column. The data type variant also describes the XML appropriately since it's semi-structured data, much like a JSON, or a dictionary.
COPY INTO command
Next, we must discuss ingesting the XML file from the external stage to the target table. All we need is a command called COPY INTO.
COPY INTO XML_TARGET FROM (
SELECT
$1 AS DATA,
REGEXP_REPLACE(METADATA$FILENAME, '.+/(.+)', '\\1') AS ORIGIN_FILE,
CURRENT_TIMESTAMP() AS LOAD_TIMESTAMP
FROM @XML_EXT_STAGE
(PATTERN=>'.*.xml')
)
ON_ERROR = SKIP_FILE;
The first line specifies the copy command from a select statement on the external stage towards our target table, XML_TARGET.
The first parameter of the select statement, $1, is the XML file's numeric column position, aka the first column of the XML file (XML files only have one columns).
Typical formats like CSV or XLSX may have multiple columns, so they could be referenced as $2 or $3 for their second and third columns, respectively.
In the second column, we transformed the value of METADATA$FILENAME using regex to filter out only the filename. You can query several metadata values in your external stage (and their value can be copied into tables), and one example is the filename. METADATA$FILENAME's value is the filename plus the container address, and for this example I want to store only the name, so I pruned out the container address using regex.
The third column is just a simple CURRENT_TIMESTAMP() command to generate the current date and time.
Just below the FROM statement, is a pattern matching command that specifies what type of file from the external stage I want to copy to the target table. The query above specifies that only XML files will be copied.
The last line in the query is an optional parameter that states that it will skip the copying of erroneous files.
Once you run the query, your XML file will be on your target table.
References
File Format
Snowflake Stages
Storage Integration
Copy Into Command
Querying Metadata for Staged Files
Here's my contact details should you want to reach out.
Email: [email protected]
Phone: (+63) 917 718 1960