Azure Data Explorer: Real-Time Analytics - Palo Alto Web Traffic Logs
Since remote working has become the norm, risk and information security teams are operating in a completely different landscape and must adapt to meet new monitoring and logging requirements that are critical to the resiliency and security of business operations.
Over the past couple of years, our client's Palo Alto web security log storage costs on Hadoop have increased fivefold, and their ad-hoc Kafka queries run at a snail's pace, preventing the information security team from responding to threats in a timely manner.
Azure Data Explorer (ADX) is a fully managed data analytics service for real-time analysis of huge volumes of data streaming from applications, websites and IoT devices. It performs near real-time analytics over petabytes of data, returning results in less than a second across billions of records.
After switching from Hadoop to ADX as the primary store, our client cut storage costs by 50% and reduced query response times from 30 minutes on the Kafka cluster to a few seconds!
What Data Mastery loves most about Azure Data Explorer is the simplicity of the solution and the ease with which our client can add new data ingestion pipelines!
For a detailed explanation of Azure Data Explorer, click here.
Ingesting Palo Alto Logs from Azure Storage to ADX
In this article, I will demonstrate how to create an ingestion pipeline that ingests and transforms Palo Alto web traffic log files uploaded hourly to an Azure Storage account, accumulating to a daily total of 200GB (when uncompressed).
Each file is a compressed .gz file split into three different formats:
Solution
The solution used follows the high-level steps below:
P.S. This solution is as simple as it sounds!
Ingestion Pipeline
Prerequisites
To create the ingestion pipeline, the following steps must be completed:
Steps
1. Create a container on Azure Storage - ADLS Gen2.
2. Create an ADX staging table with one column of data type string.
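As a minimal sketch (the table and column names below are placeholders rather than the client's actual names), the staging table holds each raw log line in a single string column:

// Staging table with a single string column holding each raw Palo Alto log line
.create table PaloAltoStaging (RawRecord: string)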
3. Set a Retention Policy on the ADX Staging table to only keep 14 days of data.
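A minimal example of the retention policy, assuming the hypothetical PaloAltoStaging table above:

// Keep staged raw data for 14 days; recoverability disabled so storage is released sooner
.alter-merge table PaloAltoStaging policy retention softdelete = 14d recoverability = disabled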
4. Create an ADX Function.
The function reads and transforms the data from the staging table into the desired output. Only a subset of the source columns is required in the output.
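A sketch of such a function, assuming the raw records are comma-delimited and that the field positions and output column names below are illustrative rather than the real Palo Alto field layout:

// Parse each raw log line and project only the columns needed downstream
.create-or-alter function with (docstring = "Transforms raw Palo Alto web traffic logs", folder = "Transformations")
TransformPaloAltoLogs()
{
    PaloAltoStaging
    | extend Fields = split(RawRecord, ",")
    | project
        ReceiveTime   = todatetime(Fields[0]),
        SourceIp      = tostring(Fields[1]),
        DestinationIp = tostring(Fields[2]),
        Url           = tostring(Fields[3]),
        Bytes         = tolong(Fields[4])
}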
5. Create an ADX Destination Table for the curated data.
The ingestion function can be used to create the schema for the destination table using the following script:
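One way to do this (a sketch reusing the hypothetical TransformPaloAltoLogs function above) is to let ADX derive the destination table's schema from the function's output by running an append that returns no rows:

// Creates PaloAltoTraffic with the function's output schema; 'take 0' appends zero rows
.set-or-append PaloAltoTraffic <| TransformPaloAltoLogs() | take 0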
NOTE: Ensure the datetime and numeric columns are typed correctly, as ADX stores metadata and statistics for each column, including the minimum and maximum values held in each data extent. When a user queries the table with filter conditions on those columns, ADX compares the conditions against these extent statistics and scans only the relevant extents when returning results.
6. Create an ADX Update Policy.
The Update Policy instructs ADX to automatically append data to the target table whenever new data is inserted into the staging table, based on the transformation function created above.
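A sketch of the update policy, again using the hypothetical table and function names from the earlier steps:

// On every ingestion into PaloAltoStaging, run the transform and append the output to PaloAltoTraffic
.alter table PaloAltoTraffic policy update
@'[{"IsEnabled": true, "Source": "PaloAltoStaging", "Query": "TransformPaloAltoLogs()", "IsTransactional": true, "PropagateIngestionProperties": false}]'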
7. Create an Event Grid Ingestion Method.
The chosen ingestion method is to ingest data into Data Explorer via Event Grid notifications from ADLS.
NOTE: Txt files do not have mappings. Mappings are only used for CSV, JSON, AVRO, and W3CLOGFILE files.
8. Test :)
Upload a file to the Azure Storage container. If the ingestion fails, run the query below to check why.
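For example, the built-in ingestion failures command surfaces the error details:

// Show recent ingestion failures and why they occurred
.show ingestion failures
| where FailedOn > ago(1d)
| project FailedOn, Table, Details, ErrorCode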
Conclusion
If you would like a copy of my code, please drop me a message on LinkedIn.
I hope you have found this helpful and that Azure Data Explorer will save your company time and money.
Please share your thoughts, questions, corrections and suggestions. All feedback and comments are very welcome.