Is the future Berkeley's Ray or Spark or may be both?
Credit: Michael Jordan - Berkeley

Quite interesting! Both Ray and Spark are products of UC Berkeley, both are distributed frameworks, and both can handle AI/ML workloads. Spark has obviously come a long way with its 2.x releases, packing an incredible number of improvements: project Tungsten, the Catalyst optimizer, Datasets, and many other features.

Ray, although still in its infancy, promises further improvements, not just in I/O speed, where Spark already exceeds many frameworks, but by eliminating some bottlenecks Spark can encounter in very large-scale projects, especially in the AI domain. It aims to provide immediate feedback through tightly integrated feedback loops, which are extremely important in machine learning. One notable goal is the ability to go back in time and re-simulate an environment or use case, so you don't have to restart your iterations from zero. Ray also promises to reduce the cost of object serialization and make object sharing easy by saving object state in a shared store. Spark can integrate with frameworks such as Apache Arrow or Alluxio (previously Tachyon) to improve performance tremendously and reduce the penalty of object conversion, but Ray seems to promise quite a bit in this area natively, and looks like a wonderful addition to the open-source AI offerings, where immediate feedback is much needed to learn quickly and act.
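The object-sharing model described above revolves around futures: a remote call returns a handle immediately, and the result lives in a shared object store until you fetch it. Ray's actual API is `@ray.remote`, `f.remote()`, and `ray.get()`; the hedged sketch below illustrates the same pattern using only the Python standard library so it runs anywhere:

```python
# The futures pattern Ray builds on: submit work, get a handle back
# immediately, and fetch the value only when you need it. Sketched here
# with the standard library rather than Ray itself.
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    handles = [pool.submit(square, i) for i in range(4)]  # like square.remote(i)
    results = [h.result() for h in handles]               # like ray.get(handles)

print(results)  # [0, 1, 4, 9]
```

Because the caller holds only handles until `result()` is called, work can proceed asynchronously, which is part of what enables the tight feedback loops mentioned above.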

Courtesy of Michael Jordan (Director of the AMPLab, now the RISELab): the improvements are not just in the way Ray works but also in its architecture model.

Although support seems to be Python-only at this time, I am impressed by the speed Python gains when invoking the Ray framework: using 4 cores, a regular Python function took me about 4 seconds, while the same call through Ray came back in under a millisecond. Ray's speed also comes from its ability to take advantage of CPU pipelining, and it can take advantage of GPUs as well.
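For anyone who wants to reproduce a comparison like the one above, here is a minimal sketch of the sequential-versus-4-core setup, using the standard library's process pool as a stand-in for Ray. The workload function and job sizes are illustrative assumptions, absolute numbers will differ per machine, and process startup overhead can dominate on small workloads:

```python
# Rough 4-core timing sketch: a CPU-bound function run sequentially versus
# fanned out over four worker processes. Only the shape of the comparison
# matters; the numbers themselves depend on your machine and workload.
import time
from concurrent.futures import ProcessPoolExecutor

def work(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    jobs = [200_000] * 4

    start = time.perf_counter()
    sequential = [work(n) for n in jobs]
    t_seq = time.perf_counter() - start

    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as pool:
        parallel = list(pool.map(work, jobs))
    t_par = time.perf_counter() - start

    assert sequential == parallel  # same answers either way
    print(f"sequential: {t_seq:.4f}s  parallel: {t_par:.4f}s")
```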

I would think it is worth running a few real use cases with large data sets under the same conditions, including the same language (e.g. Python) for now, and benchmarking both Ray and Spark to see where that gets us!
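One way to keep those conditions identical across frameworks is a tiny harness that times the same callable and reports the best of several runs, so scheduler noise doesn't skew the comparison. The function names below are illustrative, not from any framework:

```python
# Minimal benchmark harness: time a callable a few times under identical
# conditions and keep the best wall-clock run.
import time

def bench(fn, *args, repeat=5):
    """Return the best (lowest) wall-clock time of `repeat` runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

def workload(n):
    return sum(i * i for i in range(n))

print(f"best of 5: {bench(workload, 100_000):.6f}s")
```

The same `bench` call can then wrap a Spark job and a Ray task on the same data set, keeping language and input identical.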

It would also be interesting, if and when Ray offers support for Scala, Java, and R, to see how it scales against Spark. I would also think it is worth testing Julia, which can parallelize tasks out of the box through its built-in support for distributed parallel execution (Spark has this in its core engine with the parallelize method), or perhaps we should think of a framework written in Julia from the start!

Please note I am not recommending anything at this point; this is food for thought. Spark has dozens if not hundreds of real-life production use cases. I just wanted to get the thinking going and possibly start some benchmarks and feedback on both frameworks, and others.
