Databricks Lights a Spark Underneath Your SaaS

On January 13, Databricks hosted a meetup at its brand-new San Francisco headquarters to walk through what to expect from its roadmap in 2015. You can find the full video here and the slide deck here.

For those who are unfamiliar, Databricks is the company behind the open source software Apache Spark, which is becoming an extremely popular Big Data processing engine. You may have heard Spark described as the next generation of MapReduce, but it is not tied to Hadoop. Some might say that Spark is in the fast lane to become the killer app for Big Data: both the number of patches merged and the number of contributors tripled from 2013 to 2014.

At the core of Spark is a concept called Resilient Distributed Datasets (RDDs). Without RDDs, there is no Spark. An RDD is an immutable, partitioned collection of elements that can be distributed across a cluster in a manner that is fault tolerant, scales linearly, and is mostly* in-memory. An element can be anything that is serializable. Working with collections in a language like Java is convenient when the number of elements is in the hundreds or thousands. However, when that number jumps to millions or billions, you can very quickly run into capacity issues on a single machine. The beauty of Spark is that this collection can be spread out over an entire cluster of machines (in memory) without the developer needing to think too much about it. Each RDD is immutable and remembers how it was created. If there is a failure somewhere in the cluster, the lost partitions are automatically recomputed elsewhere from that lineage. I prefer to think of it as akin to building ETL with CREATE TABLE AS statements, except it is not limited to the resources of a single database server, nor is it limited to just SQL statements. If you are curious about the underpinnings of RDDs, there is no better resource than the 2012 white paper written by their Berkeley creators.

*If the dataset does not fit into memory, Spark can either recompute the partitions that don't fit in RAM each time they are requested or spill them to disk (support added in 2014).
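To make this concrete, here is a minimal Scala sketch of the RDD workflow. It runs against a local master rather than a real cluster, and the dataset, class name, and numbers are purely illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local example only; a real cluster would use a master URL such as spark://... or mesos://...
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Parallelize a collection into a partitioned, distributed dataset.
    // With billions of elements, this would be spread across the cluster's memory.
    val numbers = sc.parallelize(1 to 1000000, numSlices = 8)

    // Transformations are lazy; each RDD only records its lineage ("how was I created?").
    val evens   = numbers.filter(_ % 2 == 0)
    val squares = evens.map(n => n.toLong * n)

    // Actions trigger execution. If an executor dies, lost partitions are
    // recomputed from the lineage rather than restored from a checkpoint.
    println(squares.count())

    sc.stop()
  }
}
```

Nothing in the two transformation lines touches the data; only the final count does, which is what lets Spark schedule the work across a cluster and recompute lost pieces from lineage.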

The big takeaway from this meetup is that Databricks is doubling down on Schema RDDs to help stabilize their APIs so that they can encourage a strong ecosystem of projects outside of Spark. A Spark Packages index has already been launched so that developers can create and submit their own packages, much like the package ecosystems that already exist for Python and R. In particular, Spark needs to play catch-up with more established projects by adding machine learning algorithms, and the introduction of Spark Packages will accelerate this. It was revealed that many Apache Mahout developers have already begun turning their attention to reengineering their algorithms on Spark. The Databricks team recognizes that it will be difficult for Spark Packages to reach critical mass without stable APIs. Therefore, the immediate priority appears to be leveraging Schema RDDs in both internal and pluggable APIs so that those APIs can graduate from alpha.

So what exactly is a Schema RDD? Schema RDDs were introduced last year as part of Spark SQL, the newest component of the Spark family. A Schema RDD is "an RDD of Row objects that has an associated schema." This essentially puts structure and types around your data. Not only does this help to better define interfaces, it also allows Spark to optimize for performance. Data scientists can now interact with Schema RDDs much like they do with data frames in R and Python.
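Here is a sketch of what that looks like in practice, written against the Spark 1.2-era SchemaRDD API as I understand it; the Person case class and the sample rows are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// A schema is just structure and types around your rows.
case class Person(name: String, age: Int)

object SchemaRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("schema-rdd-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit conversion: RDD[Person] -> SchemaRDD

    // An ordinary RDD of case-class objects...
    val people = sc.parallelize(Seq(Person("Ada", 36), Person("Grace", 45), Person("Linus", 25)))

    // ...becomes a SchemaRDD (an RDD of Row objects with an associated schema)
    // that can be registered and queried with SQL like a table.
    people.registerTempTable("people")
    val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 30")

    adults.printSchema()               // the schema travels with the data
    adults.collect().foreach(println)

    sc.stop()
  }
}
```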

It wasn't clear how extensible the metadata in a Schema RDD will be, or whether it will be restricted to basics such as names and types. This could be a very powerful unifying concept for everything Spark. In data warehousing, it is not uncommon to build a bespoke solution with Informatica or DataStage as the ETL engine, Business Objects or Cognos as the BI tool, ER Studio or Embarcadero as the data modeling tool, and a mixture of Oracle, DB2, and SQL Server databases. Each of these applications has its own catalog of metadata to manage, and too much of the work involves keeping all of that metadata in sync. Schema RDDs have the potential to carry a single set of metadata from the point of sourcing data all the way through to dashboards.

Loading these Schema RDDs from any data source will be accomplished by a new Data Source API. Data can be sourced into a Schema RDD using a plugin and then manipulated in any supported Spark language (Java, Scala, Python, SQL, R). The Schema RDD will serve as a common interchange format so that the data can be accessed via Spark SQL, GraphX, MLlib, or Streaming, regardless of which programming language is used. There are already plugins for Avro, Cassandra, Parquet, Hive, and JSON. Support for partitioned data sources will be included in a release later this year, so that a predicate can determine which HDFS directories should be accessed.
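As a rough sketch of the end-to-end flow, the JSON and Parquet support that already ships with Spark SQL works this way; the file paths, table name, and query below are placeholders, and third-party sources such as Avro or Cassandra would come in through their own plugins:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataSourceSketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("data-source-sketch").setMaster("local[*]"))
    val sql = new SQLContext(sc)

    // Built-in JSON support: the schema is inferred and attached automatically.
    // "events.json" is a placeholder path with one JSON object per line.
    val events = sql.jsonFile("events.json")
    events.registerTempTable("events")

    // The same Schema RDD is queryable from SQL (or Scala, Python, Java)...
    val clicks = sql.sql(
      "SELECT userId, COUNT(*) AS n FROM events WHERE type = 'click' GROUP BY userId")

    // ...and can be handed straight to another format or library, here Parquet.
    clicks.saveAsParquetFile("clicks.parquet")

    sc.stop()
  }
}
```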

The Spark Machine Learning Library (MLlib) will leverage Schema RDDs to ensure interoperability and a stable API. The main focus for MLlib in 2015 will be the Pipelines API, which provides a language for describing the workflows that glue together all the data munging and machine learning necessary in today's analytics. There was even a suggestion of partial PMML support for importing and exporting models. It became apparent that MLlib has some catching up to do when a list of 14+ candidate algorithms for 2015 was projected on the screen, along with 5 candidates for optimization primitives. The sooner all of these APIs are stabilized, the sooner the Spark community can get to work on delivering stable packages of these algorithms rather than relying on additions to MLlib itself. The Pipelines API will become the bridge between the data scientists who prototype models and the data engineers who deploy them to production.
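To give a flavor of what such a pipeline looks like, here is a sketch in Scala written against the shape the Pipelines API later settled on (the alpha version discussed at the meetup differs in its details); the toy text-classification dataset and parameter values are invented for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("pipeline-sketch").setMaster("local[*]"))
    val sql = new SQLContext(sc)
    import sql.implicits._

    // Tiny labeled training set: an id, some text, and a 0/1 label.
    val training = sc.parallelize(Seq(
      (0L, "spark is fast", 1.0),
      (1L, "hadoop map reduce", 0.0),
      (2L, "spark rdd schema", 1.0),
      (3L, "legacy batch etl", 0.0)
    )).toDF("id", "text", "label")

    // The pipeline glues the data munging (tokenize, hash into features)
    // to the estimator (logistic regression) in one declarative workflow.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(1000)
    val lr        = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val pipeline  = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Fitting produces a single model that encapsulates every stage.
    val model = pipeline.fit(training)

    val test = sc.parallelize(Seq((4L, "spark streaming"), (5L, "mainframe cobol"))).toDF("id", "text")
    model.transform(test).select("id", "text", "prediction").show()

    sc.stop()
  }
}
```

The point is that pipeline.fit returns one model object that carries every stage, so the exact workflow a data scientist prototyped is what a data engineer schedules in production.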

Those were my main takeaways from the prepared remarks, but there were a few interesting discoveries in the Q&A. A general-availability version of SparkR is expected sometime in the first half of this year; it is currently held up by a licensing incompatibility. This project is being driven by Berkeley's AMPLab rather than Databricks. There still seem to be more questions than answers about how R will fit into the Spark ecosystem beyond leveraging Schema RDDs as data frames.

It was admitted that there is "not much going on" for YARN support in the 2015 roadmap. Databricks leverages Apache Mesos to administer its clusters, so it is logical to assume it will treat Mesos integration as the higher priority. There are plenty of partners in the Spark ecosystem who are more closely aligned with Hadoop, so I would not be surprised to see one or more of them take up the mantle in this area.

All of the above is open source and freely available to install on your own cluster. Databricks is in the business of selling access to its own hosted Spark platform, which has been architected to run on Amazon Web Services. You can leverage their SaaS solution to have an "instant on" environment for working with Big Data. Their "Notebooks" interface lets you write code in a browser and run it interactively. You can scale clusters up and down on the fly, create data pipelines, and even publish dashboards to other users. From a business intelligence and data analysis point of view, it is very attractive to see how easily you can analyze data and quickly generate appealing charts, all without installing a single server in your data center or hiring a team of people to support the inevitable component failures.

Ali Ghodsi leads engineering and product management for Databricks, and the theme of his presentation at the 2014 Spark Summit paid homage to UI pioneer Alan Kay and their shared desire to "make simple things simple, and complex things possible." That sentiment is congruent with the roadmap previewed for 2015. Simple operations, such as basic aggregations, will become easy to perform with Schema RDDs. Complex things, such as engineering features as part of a data pipeline to build a gradient-boosted decision tree model that is then hosted as part of a real-time data stream, will become possible. It is still early days, but the excitement around Databricks is catching fire, and as Bruce Springsteen correctly observed, "you can't start a fire without a spark."

