Data Hubs: MarkLogic vs. Hadoop

Fair warning. I'm going to piss off some people in this article. You have been warned :-)

I have worked on a number of data hub oriented projects over the last few years - big projects integrating thousands of relational, XML and JSON systems and feeds. In about half of those projects, there's someone pounding the Hadoop drums, talking about how Hadoop data lakes are the best way of creating centralized data repositories.

The cost argument usually comes into play (look, it's open source, so it's free), as does the developer issue (look, it's in Java, the number one language in the world for big data projects). Yet at the end of the day, Hadoop has frankly had an abysmal record in the enterprise - to the extent that I've started calling these projects public works programs for Java developers, because they seem to employ so-friggin' many of them.

Now, I'm not dismissing Hadoop as a technology. I've seen some pretty cool Hadoop uses, and MarkLogic actually employs Hadoop for its content pump (MLCP, the MarkLogic Content Pump), which does a decent job of handling the threading necessary to submit large numbers of documents. Ironically, what I like about Hadoop - Map/Reduce - is the very thing the Hadoop community is now running away from. Hadoop as a processing framework is reasonably okay. Not great, but for first-pass processing where transformations and interdependencies are not a big problem, it does the job well.

True data hub integration requires sophisticated transformation capabilities, semantic master key management and rich security. Hadoop has none of these things.

The problem that I have with Hadoop is when it's touted as the next major database technology, capable of handling all of the intricacies of data hubs. It is a large, complex system, concentrating principally upon the mechanics of moving data around, and it requires a large number of developers writing complex ETL code - which becomes especially problematic when dealing with heterogeneous data sources coming from large numbers of standard databases.

Data integration requires context - you need to know where your data has come from, how keys relate to one another, and how to integrate those keys with reference data tables that are not always clearly defined and articulated. You need namespaces, because you should never be reliant upon people keeping track of where your data is coming from, and you need transformation tools capable of pulling information from multiple sources simultaneously - something that Map/Reduce by itself is not very good at, because the data you need is on different threads.

Hadoop doesn't do XML very well, and to date it barely does RDF at all, save for a few very limited-scope projects. It's slow to query against, and indexing is usually done as a batch, not necessarily in real time. Sure, Spark and the like have been touted as Hadoop Mark II, but perhaps not surprisingly, many companies that invested heavily in failed Hadoop projects are not exactly rushing out to embrace the next latest and greatest. If you are backing up your five-year-old accounting tapes to Hadoop, it's a great technology, but when you're dealing with on-demand, real-time content, Hadoop is just not competitive.

MarkLogic's bet was simple - make JavaScript a first class language and JSON a first class data format, and the developers will come. By all indications, they are.

Now, I'll admit - I'm biased. I've been working with MarkLogic Server for nearly a decade, sometimes cheek by jowl with the people writing the core code for the product. Maybe that puts me in an exceptional position, and I will be the first person to admit that MarkLogic has a steep learning curve. Having said that, I think it is also fair to say that for the purpose of putting together a consolidated, integrated data hub, MarkLogic is superior to Hadoop-based systems in nearly every way, especially with the release of MarkLogic 8.0 earlier this year.

There are several specific criteria for making this claim:

Security. Multitenancy is a key requirement for any number of databases, because of the need to store information from multiple clients simultaneously on the same systems without that information being visible to potential competitors or to hackers. MarkLogic supports compartmentalized, role- and permission-based security that makes this possible, and it is one of the few databases, SQL or NoSQL, with that level of security. With some work, it also becomes possible to enable field-based security, especially by utilizing the semantic data store. This has become a major factor in MarkLogic's acceptance within the government. Hadoop has nowhere near this level of granular security control.
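
To make that concrete, here is a minimal XQuery sketch of per-document permissions; the role names, URI and content are hypothetical, and the roles would need to be created ahead of time through the security APIs or the Admin UI:

    xquery version "1.0-ml";

    (: Hypothetical roles: only users holding "acme-reader" can read
       this document, and only "acme-writer" holders can update it. :)
    xdmp:document-insert(
      "/hub/acme/customer-1001.xml",
      <customer><name>ACME Corp</name></customer>,
      (xdmp:permission("acme-reader", "read"),
       xdmp:permission("acme-writer", "update")))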

Variability of Formats and Languages. Within MarkLogic, it is possible to store and retrieve XML and JSON data interchangeably, utilizing a common query architecture that is format independent. JSON and XML can be queried and updated through APIs written in both JavaScript and XQuery, with the JavaScript engine being Google's V8. What's more, MarkLogic can be called via node.js, Java, Ruby, Python, C++ and C#, as well as through a RESTful API that's language independent, and MarkLogic supplies an ODBC driver for making the database act like a relational database, at least for output.
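
As a rough illustration of that format independence, a single cts:search can run over XML and JSON documents in the same database; the "status" element/property and the "customers" collection here are made up for the example:

    xquery version "1.0-ml";

    (: One search over mixed content: matches <status>active</status>
       in XML documents and "status": "active" in JSON documents. :)
    cts:search(fn:collection("customers"),
      cts:or-query((
        cts:element-value-query(xs:QName("status"), "active"),
        cts:json-property-value-query("status", "active"))))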

There are Hadoop connectors for many languages, but differences between Java versions, Hadoop versions and integration APIs can make connectivity and portability difficult.

For transformations, Hadoop utilizes Xalan, released nearly fifteen years ago and limited to XSLT 1.0. MarkLogic has XSLT 2.0 (with extensions), all of XQuery, JavaScript and SPARQL.

Transformability. ETL processes require strong transformative capability. Most Java-based implementations are still utilizing the Xalan XSLT 1.0 processor published around 2001, despite the fact that XSLT 2.0 has been around for nearly a decade and XSLT 3.0 is approaching Recommendation status. XSLT 2.0 (which MarkLogic supports, and extends) is a powerful transformation tool, and because it is simple within MarkLogic to transform between XML and JSON, it is possible to use its power on JSON content as well.

Similarly, XQuery can transform JSON content using the Oracle-designed JSONiq extensions, and JavaScript can transform JSON content as well - all within the "application layer" of MarkLogic. This versatility is simply unmatched in Hadoop without extensive reworking of core libraries, and even where it does exist it is typically generations behind what exists in MarkLogic.
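
A minimal sketch of the XML-to-JSON round trip, using the json library that ships with MarkLogic (the sample record is invented):

    xquery version "1.0-ml";
    import module namespace json = "http://marklogic.com/xdmp/json"
      at "/MarkLogic/json/json.xqy";

    (: Convert an XML fragment to JSON using the stock "basic"
       translation strategy; "full" and "custom" also exist. :)
    let $xml := <person><name>Jane</name><age>42</age></person>
    return json:transform-to-json($xml, json:config("basic"))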

Data Scalability. Hadoop started out life as a way to build indexes from an extended corpus of web content, so the idea of scalability has always been an implicit part of its appeal. HDFS plays into that as well, as does the utilization of Hadoop as a way of storing legacy relational content - you can scale out across multiple nodes fairly efficiently, though at the cost of fairly slow total access times.

MarkLogic plays well with HDFS - it is possible to set up an HDFS partition as a storage node within MarkLogic, not only accessing HDFS content but also (significantly) indexing that content within MarkLogic itself. MarkLogic has excellent indexing technology - on the NoSQL side, some of the best in the industry. Such HDFS storage is treated as a slow tier, one of several tiers of data nodes that MarkLogic can use (others include Amazon S3 nodes, as well as ultrafast SSD nodes using flash memory for storage). Because of this architecture, MarkLogic also scales well, and can maintain extensive clusters of nodes that can be expanded or contracted dynamically.
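
Setting up an HDFS-backed forest is, at heart, just a data-directory setting. Here is a sketch using the Admin API; the forest name and hdfs:// path are hypothetical, and the HDFS client libraries must already be installed on the MarkLogic host:

    xquery version "1.0-ml";
    import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";

    (: Create a forest whose data directory lives on HDFS, suitable
       for a slow archival tier. :)
    let $config := admin:get-configuration()
    let $config := admin:forest-create($config, "archive-forest-1",
      xdmp:host(), "hdfs://namenode:8020/marklogic/forests")
    return admin:save-configuration($config)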

Additionally, MarkLogic employs flexible replication to update all or a portion of a database dynamically between systems, typically employing XQuery or JavaScript functions to determine what does or does not get replicated between databases. This in turn can be combined with data synchronization so that MarkLogic systems can remain in sync, even when one system goes off the Internet for a while. This data synchronization is especially useful for data hubs that are semantically designed. Again, Hadoop doesn't have this capability.

Semantics. Most people, when thinking about data, tend to see it in terms of the atomic data that it holds - various and sundry fields for names, dates and other types of information. However, what is usually far more important for data integration is the ability to match primary and foreign keys from various databases. It turns out that this information can actually be encoded quite easily within RDF - the Resource Description Framework, using conventions such as the R2RML mappings (https://www.w3.org/TR/r2rml/). Once in this format, it is far easier to create associated conceptual mappings between different data systems.
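
For example, asserting that a customer key in a CRM system and an account key in a billing system refer to the same entity takes a single triple; the URIs below are hypothetical:

    xquery version "1.0-ml";
    import module namespace sem = "http://marklogic.com/semantics"
      at "/MarkLogic/semantics.xqy";

    (: Record that two identifiers from different source systems
       denote the same real-world customer. :)
    sem:rdf-insert(
      sem:triple(
        sem:iri("http://example.com/crm/customer/1001"),
        sem:iri("http://www.w3.org/2002/07/owl#sameAs"),
        sem:iri("http://example.com/billing/account/A-88213")))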

This in turn makes it easier to create a broad-scale master data management (MDM) solution, because it becomes possible to create the relationships, both direct and indirect, between different identifiers for given content types (tables, generally) using semantic URIs and triples. This graph in turn can be queried using SPARQL, another W3C standard that is built into MarkLogic ... and doesn't exist in the current Hadoop ecosystem.
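
Continuing the hypothetical mapping above, a SPARQL query run through sem:sparql can walk those sameAs links to collect every known identifier for a record:

    xquery version "1.0-ml";
    import module namespace sem = "http://marklogic.com/semantics"
      at "/MarkLogic/semantics.xqy";

    (: Follow one or more owl:sameAs hops out from the CRM key. :)
    sem:sparql('
      SELECT ?other
      WHERE {
        <http://example.com/crm/customer/1001>
          <http://www.w3.org/2002/07/owl#sameAs>+ ?other
      }')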

Similarly, one of the other big data requirements is the effective handling of reference data management (RDM) or metadata management systems - also known as controlled vocabularies. These are the specific terms and codes that are often used to stand for specific states - gender status, marital status, FIPS or ISO codes for countries and states, along with a plethora of similar enumerated terms specialized to government agencies, health care, insurance, finance, research, education, human resources and so forth. Semantic systems are well suited to managing this type of information, and to relating equivalent or near-equivalent terms between different data stores.
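
Vocabularies map the same way. A sketch relating two hypothetical gender codes from different source systems with skos:exactMatch:

    xquery version "1.0-ml";
    import module namespace sem = "http://marklogic.com/semantics"
      at "/MarkLogic/semantics.xqy";

    (: Two source systems, two codes, one concept. :)
    sem:rdf-insert(
      sem:triple(
        sem:iri("http://example.com/hr/gender/M"),
        sem:iri("http://www.w3.org/2004/02/skos/core#exactMatch"),
        sem:iri("http://example.com/claims/gender/male")))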

Versioning and Bitemporality. Hadoop does not have versioning support - the ability to persist data records that change over time. For many applications, versioning support is not necessary. For data hubs, versioning is an absolutely critical requirement, as data provenance and data governance both become major factors. In mission critical applications or ones where financial transaction management is important, being able to assert not only when a transaction was made but also when the system recognized that a transaction was made becomes highly important. A small number of larger scale relational databases are just now incorporating bitemporal storage, especially those that are used for financial data, research data, insurance and similar areas. MarkLogic is one of them. Hadoop is not.
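
A sketch of a bitemporal insert, assuming a temporal collection named "trades" has already been configured with valid-time and system-time axes; the element names carrying the valid interval are hypothetical and must match whatever the axes were configured against:

    xquery version "1.0-ml";
    import module namespace temporal = "http://marklogic.com/xdmp/temporal"
      at "/MarkLogic/temporal.xqy";

    (: System time is stamped automatically; valid time comes from
       the document itself via the configured axis elements. :)
    temporal:document-insert("trades", "/trades/trade-42.xml",
      <trade>
        <valid-start>2015-01-01T00:00:00Z</valid-start>
        <valid-end>2015-06-30T00:00:00Z</valid-end>
        <price>103.25</price>
      </trade>)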

Data Governance is not just a position in an org chart. It has to be built into the database at a foundational level. Data provenance matters.

Developer Skills. A few years ago, MarkLogic faced a real problem. The number of skilled MarkLogic developers capable of working in XQuery was very limited, far smaller than the number of Java developers. This meant that there was legitimate concern about staffing a large project.

Starting with 8.0, MarkLogic introduced three radical changes. The first was that JavaScript became a native server-side language, coexisting with XQuery. The engine is the same V8 engine that powers node.js, and out of the box there are additionally a number of node.js connectors - meaning that if you have JavaScript developers in your organization, you have MarkLogic developers after a surprisingly small amount of training.

The second piece MarkLogic added was to make their training classes free, not just at the basic level but at the upper end as well. They are very good - I've attended several, and even as a seasoned developer I learned quite a bit.

The third piece was to integrate all of their APIs so that they are equally accessible regardless of whether you are working with XQuery and XML, XQuery and JSON, JavaScript and XML, JavaScript and JSON, SPARQL and RDF, or invoking through external tools.

As a consequence, the number of MarkLogic developers has in fact been rising pretty dramatically. It's still a little pricier for more advanced skill-sets, but even that's changing.  

Total Cost of Ownership. Hadoop is free. MarkLogic starts, for certain applications, at about $18K a year. So, on the surface, this may seem to be a slam dunk for Hadoop. However, once you start looking at TCO, the equation shifts pretty dramatically. For starters, MarkLogic requires far fewer people to build and maintain an application. Most projects I've worked on have taken a trained staff of three to four people about three months to develop from beginning to end, and these are typically fairly complex projects involving millions of documents.

Licensing costs would seem to skew in Hadoop's favor. Take total cost of ownership into account, however, and a very different story emerges. 

Once the core infrastructure is built, adding services to an intelligently designed hub where MarkLogic is utilized to its fullest capacity is usually quite easy. Of course, part of that is that you do need to use MarkLogic in that way - the more processing you move out of MarkLogic, the more complexity you're adding and the greater the cost. While many developers would prefer to work with MarkLogic simply as a CRUD repository without taking advantage of most of its other features, doing so ends up costing much more time, energy and money than it would otherwise.

Assuming four people at $80K/year per person, three months of development comes to roughly $80K in labor; add licensing, and the typical cost of putting together a MarkLogic data hub is around $100K, dropping down to around $25K a year after the first year.

In my experience, Hadoop development teams usually run between ten and fifteen developers, who end up having to build much more of the core infrastructure than they would otherwise - the core requirements for a data hub are the same regardless of what technology is used. You are paying less in licensing costs, but you need more systems - MDM systems, transformation systems, semantic systems, integration, orchestration and rules management, security, reporting and so forth. Only a small fraction of these come from Hadoop. Development time is usually closer to a year than three months, and out-of-band licensing costs for those ancillary systems tend to run well north of $100K by themselves.

Five years from now, Hadoop should be where MarkLogic is now. Five years from now, MarkLogic will be a true AI database.

Even being conservative at ten developers at around $80K apiece (and software developers nowadays are earning six figures, so this number is still low-balling), that puts the overall development cost somewhere around $800K a year. Add in maintenance costs and support licensing costs, and the total cost for a Hadoop data hub implementation is pushing the boundaries of a million-dollar project. Not surprisingly, staffing companies love Hadoop.

To be fair, Spark shows potential, and I think that many of the shortcomings of the Hadoop model are typical of any comparatively new technology. I would be willing to say that within five years, the Hadoop/Spark stack will have most of the features that MarkLogic Server has today, though I also think that five years from now, MarkLogic will be showing signs of being a true AI database.

 

Kurt Cagle is the founder and chief ontologist for Semantical LLC. The views expressed here are his own, and he received no consideration for writing this post.

Peter Cresse

Data and Network Executive. Bringing data, AI, and fiber solutions to the next level.

7 years ago

I think Kurt took a good approach by saying that it might make people upset (likely some people who spent massive amounts on Hadoop-driven implementations, but never mind that). The biggest issue in data today, the #1 issue from CIOs: can't find qualified people. So beyond the licensing cost, it is an issue of ROI as well as the opportunity cost of your dear people. The new ROI is based upon the lifetime of three-year learning curves on difficult data management. This kind of lays it out nicely - whatever side you are on.

Soumen Chatterjee

AI, Data, Privacy, Confidential AI Leader | CIPP/E, AIGP, CCSK | 7K+ followers | Multi-cloud AI Innovation | Trusted and Ethical Technologist | Sustainable Architecture|| All expressed views personal

8 years ago

A great detailed work, Kurt. However, I am not sure if we are comparing apples with oranges here. Hadoop is just for a specific type of data problem and ML has its own use-cases. I haven't come across any customer who has replaced Hadoop with ML or vice-versa. There should not be any comparison like that. This should be based on user scenario and specific industry problem. I have used both technologies in large scale implementations. Both of them have their own merits and demerits. Kind Regards, Soumen.

Shyam Kadari

Cloud, Data, Analytics, AI/ML and Generative AI

8 years ago

I work with both Hadoop and MarkLogic and like both of them. I believe that MarkLogic is the best NoSQL technology available today. This is a biased article towards MarkLogic against Hadoop. The article has several errors regarding Hadoop. Comparing Hadoop and MarkLogic is like comparing apples and oranges. Bashing another technology to make your preferred technology look good never works in the real world. Without getting into all the details (which would require a large article), I believe MarkLogic cannot replace Hadoop in Enterprise Data Lakes. Just look at the market and market share. Both Hadoop and MarkLogic can co-exist to provide an optimal solution for managing and deriving insights from enterprise data.


Well written, and I concur with the thought process. A data hub which ingests documents, structured data, unstructured data and triples, with support to enrich with ontologies and make things more consumer-ready for different purposes, is key in today's world. Hadoop is a far cry from most of it.

Fred Simkin

Developing and delivering knowledge based automated decisioning solutions for the Industrial and Agricultural spaces.

8 years ago

Still waiting for "Knowledge Hubs".

