Hadoop Conundrum – An Architect's Dilemma
There is an interesting article by Merv Adrian on Gartner's blog that talks about Hadoop and solutions beyond it. His assessment leans towards newer developments and the transition from Hadoop to the latest technologies that go beyond it.
Another prominent market research paper has predicted that the data warehouse market, currently valued at USD 18.61 billion, will grow at a CAGR of 8.2% through 2025. The research says:
“The increase in adoption of big data and advanced analytics combining into data warehousing drive the adoption of data warehousing in majority of the enterprises.”
The paper also says:
“Column-oriented database systems, also known as column-stores, have an important demand in the past few years. It stores each database column distinctly thus, the attributes from the same column would be stored compressed, contiguously, and densely-packed in the disk.”
However, another research paper expects the Hadoop market to grow at a 13% CAGR, from USD 12.8 billion in 2020 to USD 23.5 billion by 2025. All these differing views create confusion: is this the end of an era for Hadoop, or can we still build better solutions around it?
What’s the Confusion?
It has been more than a decade since Hadoop-based technology was properly introduced to mainstream developers. Like many of you, I have witnessed its development and growth. Throughout this entire decade, one question I keep hearing from customers is whether they should use Hadoop or continue with traditional or modern databases.
Nowadays, consultants and solution architects suggest Hadoop to customers as the de facto choice. I feel worried when I hear that customers with 1 TB or 2 TB of data are implementing Hadoop solutions, and that even customers who simply want to manage their documents are being steered towards the Hadoop platform.
I have come across numerous use cases where a customer implemented the Hadoop platform in the name of a "data lake" and gained very little from it. Another issue is skill availability in the market: over time, enough skilled resources have become available, but quality has become the main concern. The domain knowledge of those resources is also a concern, and as a result the technology is not translating into business benefits.
I remember a case where a customer was facing concurrency and performance challenges while running reports and dashboards on a mere 10 TB of data with 30 concurrent users. On closer inspection, it turned out that 100% of the data came from structured sources, yet the solution had been built on Hive. Technically there is no harm in such an architecture, but we have seen that Hive is not a solution that can provide good concurrency to business users.
My Experience So Far
As a technology, Hadoop still lags when it comes to running interactive, user-facing applications, and databases remain the best bet for structured data management. I know customers who started their Hadoop journey a few years back and are now moving towards data warehousing technologies, if not entirely then for specific uses such as building reporting and dashboarding systems. A few of my learnings I want to share:
1. Hadoop is used primarily as storage and processing for high-volume, high-variety data. There are various ways to achieve velocity without Hadoop: integration tools, complex event processing and streaming tools, CDC, or a separately deployed Kafka cluster (a minimal Kafka sketch follows this list).
2. Use Hadoop as a processing engine rather than as a reporting engine wherever possible. For a smaller enterprise with a few business users, a Hadoop technology such as Hive or Impala may be fine, but if the number of users is high and faster response is required, it is better to go for data warehouse solutions with columnar storage, in-memory processing and so on.
3. Establish the data volume first and check whether it is large enough for Hadoop to save the organization money; otherwise Hadoop will cost dearly in the long run once TCO is compared against databases, including resource and skill-set costs, maintenance, and operational costs such as data centre space.
4. There are databases that manage petabytes of data on a daily basis, but sometimes the complexity of the data itself requires exploring other options, and in such situations Hadoop may be a good fit: for example, data mixing structured and unstructured content, or data arriving in specialised formats such as aggregated network logs, or mining and oil-exploration data in SEG-Y or WITSML formats. Common formats like XML, JSON, GeoJSON and key-value are nowadays easily managed through multi-model databases.
5. Spark has also emerged as a great analytical platform for running and processing analytical models in combination with cloud storage, but it cannot guarantee persistence of large datasets for analytical consumption by end business users; the final output of the processing should therefore be loaded back into a high-end database (see the Spark-to-warehouse sketch after this list).
6. Try to find simple solutions even for complex problems. It is easy to propose a Hadoop platform to a customer without giving the solution deeper thought, but sometimes simpler means are sufficient. For example, if the customer wants to store and analyse key-value data, a non-HDFS-based NoSQL store is a better choice than Hive or HBase on HDFS; if key-value data is only 10-20% of the overall data requirement, a multi-model database such as Oracle is the better choice. Or, if the requirement is real-time analytics, a complex event processing (CEP) or stream analytics solution (standalone, or on top of Kafka + Spark, as in the streaming sketch after this list) is better than implementing a complete Hadoop cluster; persistent data can be stored in NoSQL or an RDBMS for future use.
7. Large Hadoop clusters usually demand high maintenance. Once a cluster grows beyond 40-50 nodes, it requires continuous support and maintenance due to node failures and network latency issues. With object storage becoming the new data lake, it has become the obvious choice to keep data in object storage and build transient clusters on demand.
8. Data management is another area where Hadoop technologies still face many challenges compared to databases. Referential integrity, workload management, backup, replication and the like are handled better in a database, so wherever you come across such requirements, keep your database options open.
9. For small, quick solutions, RDBMS-based data processing and management methods are the better choice; for complex data solutions involving extensive processing and large volumes, you may consider Hadoop as an alternative, keeping the above points in mind.
10. Hadoop is not good for document management where the requirements include versioning, OCR, workflow, role-based permissions and so on. As mentioned earlier, it is good for storing any type of data, documents and multimedia files included, but if the requirement is really a document management system, Hadoop is not the right fit.
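To illustrate point 1, here is a minimal sketch of ingesting high-velocity events through a standalone Kafka cluster, with no Hadoop involved. It assumes the kafka-python client and a broker at localhost:9092; the topic name and event fields are placeholders, not a prescription.

```python
# Minimal sketch: achieving velocity with a standalone Kafka cluster,
# no Hadoop involved. Assumes the kafka-python client is installed and
# a broker is running at localhost:9092 (both are assumptions here).
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each event lands on the (illustrative) "events" topic; downstream
# consumers such as stream processors or CDC sinks read at their own pace.
producer.send("events", {"sensor_id": 42, "reading": 98.6})
producer.flush()
```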
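For point 5, a minimal PySpark sketch of the pattern described: the heavy processing runs in Spark against cloud storage, and only the much smaller result is persisted to a relational warehouse for business consumption. The bucket path, JDBC URL, credentials and table names are hypothetical, and a suitable JDBC driver is assumed to be on the Spark classpath.

```python
# Minimal sketch: process in Spark, persist the final output to a
# high-end database. Paths, URLs and credentials below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("model-output-to-dwh").getOrCreate()

# Raw data sits in cloud object storage (hypothetical bucket/path).
raw = spark.read.parquet("s3a://my-bucket/raw/transactions/")

# The expensive aggregation happens on the Spark cluster.
daily_summary = (
    raw.groupBy("customer_id", F.to_date("event_ts").alias("event_date"))
       .agg(F.sum("amount").alias("total_amount"))
)

# Only the compact result is loaded back into the warehouse; business
# users query the database, not the Spark cluster.
(daily_summary.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://dwh-host:5432/analytics")
    .option("dbtable", "daily_customer_summary")
    .option("user", "etl_user")
    .option("password", "***")
    .mode("overwrite")
    .save())
```

Business users then run their reports against the warehouse table, which is where the concurrency and response-time expectations of point 2 are actually met.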
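And for the real-time option in point 6, a minimal Kafka + Spark Structured Streaming sketch standing in for a CEP-style pipeline, with no full Hadoop cluster required. The broker address, topic, schema, window and alert threshold are all assumptions for illustration, and the spark-sql-kafka connector package is assumed to be available to the Spark job.

```python
# Minimal sketch: real-time analytics on Kafka + Spark Structured
# Streaming instead of a full Hadoop cluster. Requires the
# spark-sql-kafka connector on the classpath (an assumption here).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("cep-lite").getOrCreate()

# Illustrative event schema; real payloads will differ.
schema = (StructType()
          .add("sensor_id", StringType())
          .add("reading", DoubleType())
          .add("event_ts", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# A simple windowed aggregate stands in for CEP-style pattern detection:
# flag any sensor whose 30-second average reading exceeds a threshold.
alerts = (events
          .withWatermark("event_ts", "1 minute")
          .groupBy(F.window("event_ts", "30 seconds"), "sensor_id")
          .agg(F.avg("reading").alias("avg_reading"))
          .filter("avg_reading > 100"))

query = alerts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```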
It took us over three decades to solve the data management challenges surrounding database development: security, performance through indexing and partitioning, high availability, improved workload management, replication, backup and more. We should give this some thought before proposing a Hadoop-based architecture to our customers.
Recent Trends
Recent trends indicate that both the Hadoop and non-Hadoop markets are gaining momentum. In the case of Hadoop distributions, however, organizations are taking a cautious approach to on-premises Hadoop. With the growing demand for cloud, traditional Hadoop distributions such as Cloudera and MapR are seeing declining revenues, while newer offerings from AWS, Google, Databricks, Confluent, Presto and others are showing better growth.
The trends in non-Hadoop data warehousing solutions are also gaining momentum again, as customers realize their mistake in jumping onto the Hadoop technology wagon. These growing technology innovations include Snowflake, VoltDB, MemSQL, MongoDB, Oracle ADW, Oracle DB (multi-model), Google BigQuery, Amazon Redshift, Azure Synapse Analytics and others. I will talk about these technologies in future articles.
This is not to say that Hadoop will not survive or grow; in fact, there are various real-world use cases where it fits the requirement. But as consultants and architects, I feel it is our responsibility to make customers successful with the right advice and to assess our options carefully.
Disclaimer: All views, information or opinions expressed in this article are my own and do not represent those of any entity with which I am affiliated. This article is solely to share knowledge based on my understanding. None of the authors, contributors, administrators or anyone else connected with this article in any way whatsoever can be responsible for your use of the information contained in it.