Parsing XML Logs With Nifi – Part 1 of 3

Chris Gambino

Lead Architect | Co-Founder at Calculated Systems

Published: April 3, 2016

I plan to write a three-part introductory series on how to handle your XML files in NiFi. The subjects will be:

Basic XML and Feature Extraction via Text Management, Splitting and XPath
Interactive Text Handling with XQuery and Regex in relation to XMLs
XML schema validation and transformations

XML data is read into the flowfile content when the file lands in NiFi. As long as it is valid XML, the five dedicated XML processors can be applied to it for management and feature extraction. Commonly, a user will want to load this XML data into a database, which requires extracting features and converting them to a new format such as JSON or Avro.

The simplest of the XML processors is the “SplitXml” processor. It takes the incoming XML and breaks the child elements off into their own flowfiles. The depth of the split relative to the root is configurable. This is helpful, for example, when you have a list of events, each of which should be treated separately.
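
To make the idea of a depth-based split concrete, here is a minimal Python sketch of the same operation outside NiFi. It is only an illustration under assumptions: the sample log and the `split_children` helper are hypothetical, and the real processor may differ in details such as namespace handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample log: a root element holding a list of events.
SAMPLE_LOG = """<events>
  <event id="1"><level>INFO</level><message>service started</message></event>
  <event id="2"><level>ERROR</level><message>disk full</message></event>
</events>"""


def split_children(xml_text, depth=1):
    """Return every element found `depth` levels below the root as its own XML string."""
    level = [ET.fromstring(xml_text)]
    for _ in range(depth):
        # Step one level down: collect the children of every element at this level.
        level = [child for parent in level for child in parent]
    # Serialize each element on its own and drop the whitespace that followed it.
    return [ET.tostring(element, encoding="unicode").strip() for element in level]


fragments = split_children(SAMPLE_LOG, depth=1)
for fragment in fragments:
    print(fragment)
```

With a depth of 1 each event becomes its own fragment; a depth of 2 would instead yield the individual level and message elements.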

XPath is a query syntax for extracting information from an XML document. It allows you to search for nodes based on hierarchy, name, or even attribute. It has limited regex integration and a framework for moderately complex queries (more complete documentation: https://www.w3schools.com/xsl/xpath_syntax.asp). The “EvaluateXPath” processor can be combined with XPath expressions to extract node data and an attribute. XPath should not be confused with XQuery, which I will cover in my next article.
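
The three kinds of search mentioned above (hierarchy, name, attribute) can be sketched in Python. This is an illustration only: Python's ElementTree supports just a subset of XPath, and the sample log, the `extract_features` helper, and the key names are all hypothetical. In EvaluateXPath itself, each expression would instead be entered as a user-added property whose name becomes the attribute name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample log used only for this illustration.
SAMPLE_LOG = """<events>
  <event id="1" source="web"><level>INFO</level><message>service started</message></event>
  <event id="2" source="db"><level>ERROR</level><message>disk full</message></event>
</events>"""


def extract_features(xml_text):
    """Pull a few values out of the log using XPath-style expressions."""
    root = ET.fromstring(xml_text)
    # Search by child value: the first <event> whose <level> child is ERROR.
    error_event = root.find("./event[level='ERROR']")
    return {
        # Search by hierarchy: text of the first <message> under an <event>.
        "first.message": root.findtext("./event/message"),
        # Search by name, then read an XML attribute from each match.
        "event.ids": [event.get("id") for event in root.findall("./event")],
        # Search by attribute: count the events whose source is "db".
        "db.count": len(root.findall(".//event[@source='db']")),
        "error.id": error_event.get("id"),
        "error.message": error_event.findtext("message"),
    }


features = extract_features(SAMPLE_LOG)
print(features)
```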

When EvaluateXPath runs, something very important happens: the values extracted from the XML are now NiFi attributes. This allows us to apply the routing and other intelligence that is NiFi's signature. One of the transformations I have previously worked on is getting XML data into Avro format for easy ingestion. At this time all of the Avro processors in NiFi play nicely with JSON, so the “AttributesToJSON” processor can be used as an out-of-the-box intermediary to get the format you need. Note that I have set the Destination of the processor to “flowfile-content”, which will overwrite the existing XML content with JSON.
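
As a rough model of what happens at this step, the Python sketch below treats a flowfile as content plus attributes, routes on an attribute, and then replaces the content with a JSON rendering of the attributes. The `FlowFile` class, the helper functions, and the attribute names are hypothetical stand-ins, not NiFi APIs.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Hypothetical stand-in for a NiFi flowfile: content plus attributes."""
    content: str
    attributes: dict = field(default_factory=dict)


def route(flowfile):
    """Pick a relationship name from an attribute, as a routing step might."""
    return "errors" if flowfile.attributes.get("event.level") == "ERROR" else "default"


def attributes_to_json(flowfile, names=None):
    """Return a new flowfile whose content is a JSON object built from the attributes.

    `names` optionally restricts which attributes are written; None means all of them.
    """
    selected = {
        key: value
        for key, value in flowfile.attributes.items()
        if names is None or key in names
    }
    # The original XML content is discarded and replaced by the JSON text.
    return FlowFile(content=json.dumps(selected, sort_keys=True),
                    attributes=dict(flowfile.attributes))


original = FlowFile(
    content='<event id="2"><level>ERROR</level><message>disk full</message></event>',
    attributes={"event.id": "2", "event.level": "ERROR", "event.message": "disk full"},
)
converted = attributes_to_json(original)
print(route(original))
print(converted.content)
```

The original XML is gone from the converted flowfile, which is why the choice of destination matters if a later step still needs the raw XML.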

With JSON content plus attributes, this is a very easy flowfile to work with; it can be merged into existing workflows or written out to a file for the Hive SerDe.

Max B?reb?ck

Enterprise architecture supports and enable business to be successful

6 years ago

I suggest an alternative, more generic way of doing this: flattening the XML file out into tables. XML files can be complex, with many branches that contain repeating objects, which can in turn contain child branches that also contain multiple records. I have created a generic Groovy processor that uses an XSD to identify all potential tables and the structure of each table, including data types. This information is then applied to an XML file to generate one flowfile per table found. You can read the whole description on my blog: https://max.bback.se/index.php/2018/06/30/xml-to-tables-csv-with-nifi-and-groovy-part-2-of-2/ All source code is available on GitHub. /Max
