The best programming language for data science and machine learning

Hint: There is no easy answer, and no consensus either.

Arguing about which programming language is the best one is a favorite pastime among software developers. The tricky part, of course, is defining a set of criteria for "best."

With software development being redefined to work in a data science and machine learning context, this timeless question is gaining new relevance. Let's look at some options and their pros and cons, with commentary from domain experts.

Even though, in the end, the choice is at least to some extent a subjective one, some criteria come to mind. Ease of use and syntax may be subjective, but things such as community support, available libraries, speed, and type safety are not. There are a few nuances here, though.

Execution speed and type safety

In machine learning applications, the training and operational (or inference) phases for algorithms are distinct. So, one approach taken by some people is to use one language for the training phase and then another one for the operational phase.

The reasoning here is to work during development with the language that is more familiar or easier to use, or that has the best environment and library support. The trained algorithm is then ported to run in the environment the organization prefers for its operations.
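To make the train-in-one-place, run-in-another idea concrete, here is a minimal sketch in Python. It is an illustration only: the function names and the toy data are hypothetical, and a plain JSON string stands in for whatever exchange format an organization would really use. The point is that the serving side needs only the exported parameters, so it could just as well be reimplemented in another language.

```python
import json


def train_linear_model(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return {"model": "linear_regression", "slope": slope, "intercept": intercept}


def export_model(params):
    """Serialize the trained parameters to a language-neutral JSON string."""
    return json.dumps(params)


def predict_from_export(exported, x):
    """Serving side: parse the exported parameters and apply them."""
    params = json.loads(exported)
    return params["slope"] * x + params["intercept"]


# Training phase, on hypothetical toy data that follows y = 2x + 1.
exported = export_model(
    train_linear_model([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
)

# Operational phase: only the exported string is needed from here on.
prediction = predict_from_export(exported, 10.0)
print(prediction)
```

Every model type needs its own export and its own scoring code on the serving side, and keeping the two in sync is extra operational work.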

While this is an option, especially with standards such as the Predictive Model Markup Language (PMML), it may increase operational complexity. In addition, in many cases things are not clear-cut: code written in one language may call libraries written in another, which dilutes the argument about execution speed.
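The point about one language calling into another can be seen without leaving the Python standard library. The sketch below assumes the standard CPython interpreter, where the built-in sum is implemented in C, and compares it with an equivalent loop written in Python. The helper name python_loop_sum is made up for this example, and timings will vary by machine, so no expected numbers are given.

```python
import timeit


def python_loop_sum(values):
    """Add up the values with a loop executed by the Python interpreter."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(100_000))

# Time both versions over the same data; each is run 20 times.
loop_seconds = timeit.timeit(lambda: python_loop_sum(values), number=20)
builtin_seconds = timeit.timeit(lambda: sum(values), number=20)

print(f"interpreted loop: {loop_seconds:.4f}s")
print(f"built-in sum:     {builtin_seconds:.4f}s")
```

Both calls are "Python code" from the developer's point of view, but the work is done in different places, which is why comparing languages on raw speed alone can be misleading.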

Another thing to note is type safety. Type safety in programming languages is a little like schema in databases: While not having it increases flexibility, it also increases the chances of errors.

In a thread he initiated, Andriy Burkov, machine learning team leader at Gartner, argues against using dynamically typed languages such as Python for machine learning.

"You can run an experiment for several hours, or even days, just to find out that the code crashed because of an incorrect type conversion or a wrong number of attributes in a method call," says Burkov.
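The failure mode Burkov describes can be reproduced in a few lines. The sketch below is a hypothetical illustration, not code from his thread: the function names are invented and a three-step loop stands in for hours of training. Python ignores the type annotations at run time, so the bad argument is only discovered in the final reporting step. A static checker such as mypy, run before the experiment starts, would be expected to flag the call run_experiment(42) because the parameter is annotated as str.

```python
from typing import List


def train(epochs: int) -> List[float]:
    """Stand-in for a long-running experiment; returns one loss per epoch."""
    return [1.0 / (epoch + 1) for epoch in range(epochs)]


def summarize(losses: List[float], label: str) -> str:
    """Report the final loss. Expects `label` to be a string."""
    return label + ": final loss " + str(losses[-1])


def run_experiment(label: str) -> str:
    losses = train(epochs=3)         # imagine this step taking hours
    return summarize(losses, label)  # a bad `label` only fails here, at the end


try:
    run_experiment(42)               # wrong type: an int where a str is expected
except TypeError as error:
    outcome = f"crashed after training: {error}"
else:
    outcome = "ok"

print(outcome)
```

Statically typed languages reject this kind of call at compile time. In Python, the same check is opt-in and depends on annotating the code and running a separate tool.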

Java

Despite having what is arguably the largest footprint in enterprise deployment, Java is not getting much love these days. Some of this may have to do with the "coolness factor," as Java has been challenged by new programming languages, but there are also some very real concerns here.

What has greatly helped Java establish its footprint, namely the JVM, is also a reason why people are skeptical about using it for machine learning. Similarly, garbage collection, a famous Java feature that avoids some of the complexities of C++, may pose problems in production environments.

The topic came up in a discussion of software development trends with Paco Nathan, managing partner at Derwen and a data science practitioner and thought leader.

Nathan notes that the trend he sees is toward real-time applications, and this is not something he believes the JVM is well-suited for, as it is an abstraction over the hardware. Adding a layer between the code and the hardware provides cross-platform portability, but also slows down execution.

Nathan also cites Ion Stoica, the initiator of Apache Spark, which is heavily used for real-time applications. Nathan mentioned that one of the rules Stoica has recently set for his research team in Berkeley is abolishing Java.

Nathan commented that he expects that to spill over from research to industry over a five-year timeframe, as is typical for directions initiated in research environments. But maybe we should not be too fast in writing off Java.

The ups and downs that have followed Java during its stewardship by Oracle may have contributed to its fall from grace. They may also have something to do with the perceived stalemate in the evolution of the JVM.

With enterprise Java being handed off to the Eclipse Foundation, however, there is a chance Java and the JVM may be revitalized. There are also initiatives, such as Gandiva, which aim to optimize Java code for specialized hardware, potentially making it a competitive option for machine learning.

In addition, that large footprint has given rise to initiatives, such as DeepLearning4J, which aim to give Java users access to the same libraries typically used through other languages.

Python

According to a recent survey by KDnuggets, Python is the undisputed leader in use for data science and machine learning. Some often-cited reasons for this preference are the wide choice of libraries and the fact that it's considered an easy language to work with.

Ashok Reddy, GM DevOps at CA Technologies, notes that Python was the language of choice in his recently completed master's in AI and Machine Learning at Georgia Tech.

Reddy goes on to add that Python is gaining popularity in universities due to its simplicity, so graduates are more likely to know Python than Java. Beyond simplicity, he also cites the abundance of libraries as a key reason for this.

Reddy notes that, from a performance perspective, C is also a popular choice for use in AI and embedded-IoT applications, but Java is not going away. Reddy also sees a pattern in using Python for development and then other languages for deployment of machine learning algorithms.

The same pattern applies internally at CA: Reddy notes that the company has legacy code in C and Java, and that the cross-platform portability Java offers is a key priority.

"Many startups use Ruby or Python initially, and when they grow up they switch to Java," says Reddy.

R

In the KDnuggets survey, R's share seems to be dropping compared to last year. R, however, has been gaining enterprise adoption over the last few years.

In some ways R is not a typical programming language, as it's not a general-purpose one. R's roots lie in statistics, as it was developed specifically to deal with such needs.

Read the full article on ZDNet Big on Data

Manuel Gea

Entrepreneur | CEO | Pharma-Biotech-Digital | Thinking out of the box | Heuristic | Holistic | Trusted AI | IA confiance | R&D Life Sciences | Keynote Speaker | Board Member

6 years ago

AI #Alert: THE MUST-READ PRESENTATION TO UNDERSTAND: https://www.dhirubhai.net/pulse/ibms-watson-supercomputer-recommended-unsafe-cancer-manuel-gea-/ Everything you always wanted to know about Digital Health revolution “big promises” but were afraid to ask! https://www.bmsystems.net/download/BMSystems-Pesentation-club-INSEAD-Alumni-BCG-04072017-web.pdf @manuelgea #IBMWatson #AI #IA #medical #digital @sanofi @servier @ucb

Michael Szul

Software Engineering Manager in Higher Education ● DevOps & Conversational Software

6 years ago

Python is the most versatile for the data science and machine learning space, and the most likely to allow for crossover appeal. Barring that, a LISP dialect like Clojure would give you the power of LISP (the original AI language) with access to the JVM and associated Java libraries when needed.

Jochen L. Leidner

AI Professor | Scientist-Engineer | Consultant | Advisor

6 years ago

Python is the winner of the popularity contest, and Java is the winner of the 'what's running in production' contest. I don't know anyone who uses R in production, so while I'm not opposed to using R for research, productization may involve extra re-implementation (which is fine if the productivity wins during research outweigh that cost). BTW, the Python world is split between people who only use Python packages and people who implement these packages in C/C++ (or even Cython).

I don't get it. Java outperforms Python, yet Java is criticised for its JVM and Python is praised for its ease of use.
