The vision of self-service Hadoop
I was not a great fan of Hadoop. It was always easy to push data into it, but not that easy to extract value from it. That is not the case anymore. Now there is a different question: with the abundance of technologies popping up around the Hadoop ecosystem, we find ourselves puzzled about where to go next. Although it is obvious that what we did so far was relatively naive, the multiple directions in which new technologies point us raise a different concern. Are we taking the right path?
Hadoop is on fire. Adoption is so fast and widespread that I sometimes think our PCs are soon going to become part of a huge Hadoop cluster (not a bad idea in itself - Dropbox on steroids?). However, what we are doing with the data in Hadoop and the way we manage the resources of the cluster are constantly evolving.
Hadoop started as an affordable yet very reliable storage layer for the BigData that organizations started to generate and collect. Driven by the idea of bringing the program to the data instead of transporting the data to the logic, the MapReduce (MR) paradigm ruled. However, it was clear from the beginning that it was not an optimal approach.
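To make the paradigm concrete, here is a minimal word-count sketch written as Hadoop Streaming scripts in Python. MR jobs are usually written in Java; Python is used here purely for brevity, and the file names are illustrative.

```python
#!/usr/bin/env python
# mapper.py - runs on the nodes that already hold the data blocks
# ("bring the program to the data") and emits (word, 1) pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py - receives the pairs grouped and sorted by key,
# and sums the counts for each word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Such a job is submitted through the hadoop-streaming jar, pointing -mapper and -reducer at the two scripts and -input/-output at HDFS paths. Even this trivial example shows why writing raw MR for everyday analysis gets old quickly.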
Pig, Hive & HBase were a no-brainer. Pig is a high-level language for expressing data analysis programs that is translated into MapReduce. Hive is a “database” on top of Hadoop that provides an SQL-like query language called HiveQL. HBase is a non-relational database, or an enhanced key-value store, if you prefer. However, they all have their limitations in readability, performance, completeness of operations, etc.
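To give a feel for HiveQL, here is a small sketch that issues a query from Python through the PyHive client. The host and the page_views table are made-up examples, not something from the original post.

```python
# Querying Hive from Python via PyHive; host, port and table are illustrative.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()

# HiveQL reads like SQL, but Hive translates it into MapReduce jobs
# under the hood - which is where the performance limits come from.
cursor.execute("""
    SELECT user_id, COUNT(*) AS views
    FROM page_views
    WHERE view_date >= '2014-01-01'
    GROUP BY user_id
    ORDER BY views DESC
    LIMIT 10
""")

for user_id, views in cursor.fetchall():
    print(user_id, views)
```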
Alternatives did not take long to appear. Some are complementary, like Kafka, and some are alternatives, like Impala. Here is a partial list with a one-line description of each. I am sure I missed more than a couple; you are welcome to add your own.
Zookeeper - A centralized coordination service (configuration, naming, synchronization) for the distributed services running on the cluster.
Sqoop - SQL to Hadoop. Imports data from any JDBC-supported database into Hadoop.
Kafka (Flume) - Ingests high-throughput data into Hadoop using an append-only log structure.
Solr (Lucene, Katta) - An enterprise search platform built on Lucene; Katta provides a distributed (cloud) variant.
Nutch - A web crawler written in Java. Integrated with Solr.
Mahout - A scalable machine learning and data mining library.
Spark - An engine for running large-scale data processing fast. Allows for use of SQL, Java, Scala, Python, MLlib (machine learning) and GraphX (graph processing).
Storm - A real-time distributed stream-processing engine for BigData streams.
Tez - An application framework for building low-latency, high-throughput tasks on Hadoop, designed to replace MR for workloads that need interactive (human) response times.
Cascading - An application development platform for Hadoop (application server).
Cassandra - An operational distributed key-value store that allows SQL-like access to data.
CouchDB - A web database that stores data in JSON format and allows you to access it via HTTP.
Impala - An MPP interactive SQL query engine for Hadoop, by Cloudera.
Parquet - An efficient columnar file storage format for Hadoop; it is what makes Impala columnar (Cloudera).
Dataflow - A Java MPP engine for Hadoop with KNIME as its user interface.
Oozie - Workflow tool to coordinate jobs/actions on Hadoop & HDFS.
YARN - A resource management system that allows multiple processing engines to run in parallel on Hadoop 2.
In 2014, leading columnar database vendors like Actian (Matrix, a.k.a. ParAccel) released versions of their high-performance columnar databases running on top of Hadoop. Add to that in-memory databases over Hadoop and relational ACID OLTP databases over Hadoop, and we are left wondering: which technology should we use? Are all of these going to survive?
Unless one tool, or a small set of tools, fits your use cases like a glove, I think that to answer the short-term question you need a long-term view. The reason PCs changed the technology world for good was mass adoption, and that was enabled by the ability of every person to operate a PC autonomously. In fact, a PC today has the computing power and storage it took a whole IT department to provide to an organization just a couple of decades ago. If you come to think of it, the most successful technologies are those that are self-service by nature. From the car to the iPad, if we can drive it ourselves it will prevail.
I think self-service Hadoop will prevail. If you think back to the first killer applications of the PC, you will probably remember three of them: Word, Excel and your favorite game. Since I don’t see any gaming software being developed on top of Hadoop (another interesting idea in itself), and I don’t see a big need for word processing over Hadoop, I think we are left with Excel. Which brings me to Datameer, the spreadsheet-over-Hadoop company. Datameer provides a virtualization layer on top of Hadoop that enables self-service. You don’t need Java skills, nor do you need to know what “ls”, “du” and “rm” are. You can easily bring your data into Hadoop from multiple data sources - files, Google Analytics or your local Excel - and analyze it in an Excel-like web interface that is as intuitive as the iPhone was when it was released. It is a limitless spreadsheet that shows just a sample of the data, but on save it generates code that runs on your entire data set, and the results are brought back to the spreadsheet. This is actual self-service over Hadoop.
Moving forward with your Hadoop, I think the question you should ask yourself is: will this be a self-service tool or a techie tool? If it is a techie tool, it will always be somewhat limited.
Now, if you are not willing to pay for licenses, only for labor, and since I don't know of any open-source self-service solution yet... I think Spark is the safest bet today.
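To show why, here is a minimal PySpark sketch in which the same engine serves plain Python transformations and SQL over the same data, with MLlib and GraphX only an import away. The input path and column names are illustrative.

```python
# One engine, several entry points: RDD transformations plus Spark SQL.
# The HDFS path and the word_counts table name are illustrative.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="self-service-demo")
sqlContext = SQLContext(sc)

# Classic word count as an RDD pipeline.
lines = sc.textFile("hdfs:///data/sample.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.take(10))

# The same result exposed to SQL for those who think in queries.
rows = counts.map(lambda pair: Row(word=pair[0], total=pair[1]))
sqlContext.createDataFrame(rows).registerTempTable("word_counts")
top = sqlContext.sql(
    "SELECT word, total FROM word_counts ORDER BY total DESC LIMIT 10")
print(top.collect())

sc.stop()
```

It is still a developer's tool rather than a spreadsheet, but it is the closest thing in the open-source stack to one engine that analysts, developers and data scientists can share.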