The vision of self-service Hadoop
I was not a great fan of Hadoop. It was always easy to push data into it, but not that easy to extract value from it. That is not the case anymore. Now there is a different question: with the abundance of technologies popping up around the Hadoop ecosystem, we find ourselves puzzled about where to go next. Although it is obvious that what we did so far was relatively naive, the multiple directions in which new technologies point us raise a different concern. Are we taking the right path?
Hadoop is on fire. Adoption is so fast and widespread that I sometimes think our PCs are soon going to become part of a huge Hadoop cluster (not a bad idea in itself - Dropbox on steroids?). However, what we are doing with the data in Hadoop and the way we manage the resources of the cluster are constantly evolving.
Hadoop started as an affordable yet very reliable storage layer for the BigData that organizations started to generate and collect. Driven by the idea of bringing the program to the data instead of transporting the data to the logic, the MapReduce (MR) paradigm ruled. However, it was clear from the beginning that it was not an optimal approach.
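To make the paradigm concrete, here is a minimal word-count sketch written as Hadoop Streaming scripts in Python. MR jobs are usually written in Java; Python is used here purely for brevity, and the file names are illustrative.

```python
#!/usr/bin/env python
# mapper.py - runs on the nodes that already hold the data blocks
# ("bring the program to the data") and emits (word, 1) pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py - receives the pairs grouped and sorted by key,
# and sums the counts for each word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Such a job is submitted through the hadoop-streaming jar, pointing -mapper and -reducer at the two scripts and -input/-output at HDFS paths. Even this trivial example shows why writing raw MR for everyday analysis gets old quickly.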
Pig, Hive & HBase were a no-brainer. Pig is a high-level language for expressing data analysis programs that is translated into MapReduce. Hive is a “database” on top of Hadoop that provides an SQL-like query language called HiveQL. HBase is a non-relational database, or an enhanced key-value store, if you prefer. However, they all have their limitations in readability, performance, completeness of operations, etc.
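To give a feel for HiveQL, here is a small sketch that issues a query from Python through the PyHive client. The host and the page_views table are made-up examples, not something from the original post.

```python
# Querying Hive from Python via PyHive; host, port and table are illustrative.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()

# HiveQL reads like SQL, but Hive translates it into MapReduce jobs
# under the hood - which is where the performance limits come from.
cursor.execute("""
    SELECT user_id, COUNT(*) AS views
    FROM page_views
    WHERE view_date >= '2014-01-01'
    GROUP BY user_id
    ORDER BY views DESC
    LIMIT 10
""")

for user_id, views in cursor.fetchall():
    print(user_id, views)
```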
Alternatives did not take long to appear. Some are complementary, like Kafka, and some are alternatives, like Impala. Here is a partial list with a one-line description of each. I am sure I missed more than a couple; you are welcome to add your own.
Zookeeper - A centralized coordination service (configuration, naming, synchronization) for the distributed services running on the cluster.
Sqoop - SQL to Hadoop. Imports data from any JDBC-supported database into Hadoop.
Kafka (Flume) - Ingests high-throughput data into Hadoop using an append-only log structure.
Solr (Lucene, Katta) - An enterprise search platform built on Lucene; Katta provides a distributed (cloud) variant.
Nutch - A web crawler written in Java. Integrated with Solr.
Mahout - A scalable machine learning and data mining library.
Spark - An engine for running large-scale data processing fast. Allows for use of SQL, Java, Scala, Python, MLlib (machine learning) and GraphX (graph processing).
Storm - A real-time distributed stream-processing engine for BigData streams.
Tez - An application framework for building low-latency, high-throughput tasks on Hadoop, designed to replace MR for workloads that need interactive (human) response times.
Cascading - An application development platform for Hadoop (application server).
Cassandra - An operational distributed key-value store that allows SQL-like access to data.
CouchDB - A web database that stores data in JSON format and allows you to access it via HTTP.
Impala - An MPP interactive SQL query engine for Hadoop, by Cloudera.
Parquet - An efficient columnar file storage format for Hadoop; it is what makes Impala columnar (Cloudera).
Dataflow - A Java MPP engine for Hadoop with KNIME as its user interface.
Oozie - Workflow tool to coordinate jobs/actions on Hadoop & HDFS.
YARN - A resource management system that allows multiple processing engines to run in parallel on Hadoop 2.
In 2014, leading columnar database vendors like Actian (Matrix, a.k.a. ParAccel) released versions of their high-performance columnar databases running on top of Hadoop. Add to that in-memory databases over Hadoop and relational ACID OLTP databases over Hadoop, and we are left wondering: which technology should we use? Are all of these going to survive?
Unless one tool, or a small set of tools, fits your use cases like a glove, I think that to answer the short-term question you need a long-term view. The reason PCs changed the technology world for good was mass adoption, and that was enabled by the ability of every person to operate a PC autonomously. In fact, a PC today has the computing power and storage it took a whole IT department to provide to an organization just a couple of decades ago. If you come to think of it, the most successful technologies are those that are self-service by nature. From the car to the iPad, if we can drive it ourselves it will prevail.
I think self-service Hadoop will prevail. If you think back to the first killer applications of the PC, you will probably remember three of them: Word, Excel and your favorite game. Since I don’t see any gaming software being developed on top of Hadoop (another interesting idea in itself), and I don’t see a big need for word processing over Hadoop, I think we are left with Excel. Which brings me to Datameer, the spreadsheet-over-Hadoop company. Datameer provides a virtualization layer on top of Hadoop that enables self-service. You don’t need Java skills, nor do you need to know what “ls”, “du” and “rm” are. You can easily bring your data into Hadoop from multiple data sources - files, Google Analytics or your local Excel - and analyze it in an Excel-like web interface that is as intuitive as the iPhone was when it was released. It is a limitless spreadsheet that shows just a sample of the data, but on save it generates code that runs on your entire data set, and the results are brought back to the spreadsheet. This is actual self-service over Hadoop.
Moving forward with your Hadoop, I think the question you should ask yourself is: will this be a self-service tool or a techie tool? If it is a techie tool, it will always be somewhat limited.
Now, if you are not willing to pay for licenses, only for labor, and since I don't know of any open-source self-service solution yet... I think Spark is the safest bet today.
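To show why, here is a minimal PySpark sketch in which the same engine serves plain Python transformations and SQL over the same data, with MLlib and GraphX only an import away. The input path and column names are illustrative.

```python
# One engine, several entry points: RDD transformations plus Spark SQL.
# The HDFS path and the word_counts table name are illustrative.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="self-service-demo")
sqlContext = SQLContext(sc)

# Classic word count as an RDD pipeline.
lines = sc.textFile("hdfs:///data/sample.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.take(10))

# The same result exposed to SQL for those who think in queries.
rows = counts.map(lambda pair: Row(word=pair[0], total=pair[1]))
sqlContext.createDataFrame(rows).registerTempTable("word_counts")
top = sqlContext.sql(
    "SELECT word, total FROM word_counts ORDER BY total DESC LIMIT 10")
print(top.collect())

sc.stop()
```

It is still a developer's tool rather than a spreadsheet, but it is the closest thing in the open-source stack to one engine that analysts, developers and data scientists can share.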