Big Data: Python, R or Julia?
Big data projects are becoming common. Organizations seek to take advantage of all that big data has to offer. While many companies are on board with the idea of implementing a big data project, properly executing one is another matter entirely.
Many factors have to be considered, from the legacy systems you have at your disposal to the talent and skills within your organization. One of the most important decisions, and one that can affect nearly every aspect of a big data project, is the programming language you choose. Several programming languages are available for big data work. The popular ones are Python, R, and Julia.
1. Python
Python is a general-purpose scripting language. It can handle complex data processing and implement the mathematical and algorithmic functions that machine learning requires. Many developers are comfortable with Python because it is relatively easy to learn. Python incorporates modules, exceptions, dynamic typing, very high-level dynamic data types, and classes. It has interfaces to many system calls and libraries, as well as to various window systems. We will discuss some pros and cons of Python.
The Pros:
- Free availability
- Cost saving
- Easy integration
As Python is an open source technology, it is freely available. Also, bugs can be easily detected and fixed in Python.
There is a strong argument that Python can save the enterprise money, in both the software creation and the maintenance stage. Python's clean syntax makes code highly readable, even by programmers other than those who worked on the original project. This reduces complexity and helps cut costs.
Python integrates easily with, and can be extended using, C and Java. It also has good support for objects, modules, and other reusability mechanisms. An often overlooked point in favor of adopting Python in enterprises, particularly those with significant commitments to Java, is a Python variant called Jython. Written completely in Java, Jython allows rapid development and testing of applications that leverage the Java class library, in a fraction of the time taken by Java's edit-compile-test cycle. Jython also enables tight integration of Python and Java code, allowing each to take advantage of the other language's capabilities.
The Cons:
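- Limited multiprocessor support
- Lack of prepackaged solutions
- Database access layer limitations
Python's support for multiple processors is limited. In the standard CPython interpreter, the global interpreter lock allows only one thread to execute Python bytecode at a time, so threaded code does not scale across processors; using several cores means running separate processes, for example through the standard multiprocessing module. Python also offers relatively few prepackaged solutions: the distribution includes an extensive standard library, but fewer ready-made packaged solutions are available. Compared to established technologies such as ODBC and JDBC, Python's database access layer seems a bit primitive and underdeveloped.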
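To make the database point concrete, here is a minimal sketch of what Python's database access layer looks like in practice. It assumes the DB-API 2.0 interface (PEP 249) as implemented by the standard-library sqlite3 module; the `sales` table, its columns, and the `load_and_total` helper are hypothetical names chosen only for illustration.

```python
import sqlite3


def load_and_total(rows):
    """Insert (region, amount) rows into an in-memory table and total them per region.

    The table name and schema are illustrative assumptions, not part of any real system.
    """
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        # Parameterized insert: the driver binds the values, no string formatting.
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        conn.commit()
        cur = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
        )
        return cur.fetchall()  # list of (region, total) tuples
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("north", 10.0), ("south", 5.5), ("north", 2.5)]
    print(load_and_total(sample))
```

The interface is deliberately thin: a connection, a cursor, and result tuples. Whether that counts as primitive next to ODBC or JDBC is a judgment call, and other database drivers that follow the same specification may differ in details such as the parameter placeholder style.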
2. R
R is an extremely rich environment, especially for statistics. Inference, statistical modeling, and plotting your data as a bar chart, pie chart, or histogram are simple in R, because the language is built around vectors and matrices. If you're a data analyst who wants to see data distributions before drawing conclusions, R allows you to visualize outliers and data density. We will discuss some pros and cons of R.
The Pros:
- Robust graphical interface
- Numerous external packages
- Free availability
R is an open source, cross-platform technology and is freely available. R is easy to extend, modify, and improve with add-on packages. The number of external packages for R grows almost daily, and most of them are based on up-to-date published books and peer-reviewed articles. R is a programming environment well suited for statistical analysis, and it also deals well with spatial data. It has robust graphical capabilities and an active user mailing list and forum.
The Cons:
- Steep learning curve
- Problems in memory management
- No backward compatibility
R has a very steep learning curve, and users find it difficult to learn and understand. There are memory management problems (depending on your OS), especially when displaying big images at high resolution or working with huge matrices (hundreds of MB). R is still evolving and does not guarantee backward compatibility: code and packages written for one release may not run on earlier versions.
3. Julia
Julia is a high-level, high-performance dynamic programming language for technical computing. It offers many of the mathematical and statistical libraries found in any high-performance environment. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library. Julia's Base library is largely written in Julia itself.
The Pros:
- Easy to install
- Straightforward syntax
- Includes scientific functions
Third-party libraries are often written entirely in Julia, which makes them easy to install. It also makes them easy to dive into, read, and modify. The language is focused on scientific applications, which means that the syntax for common scientific operations can be more straightforward. Julia includes many standard scientific functions as part of the core language.
The Cons:
- Not fully stabilized
- Fewer scientific tools
- Slower
The libraries in Julia have not fully stabilized and are likely to break backward compatibility. The set of existing scientific tools is still only a fraction of what's available in Python. Julia dictionaries are hashed differently from Python dictionaries, which can make them slower in many cases.
Programme Management, WFP
7 years ago: R please. Any day, anytime.
Senior Software Engineering Manager
7 years ago: Python does have multiprocessing and can share memory between processes (as long as the object can be pickled). Also, pandas and NumPy are fast enough for most applications. The REPL is very useful for trying and testing new functions, especially when it takes a long time to create arrays, data frames, etc.
Strategic HealthTech Innovations Leader
7 years ago: I am loving Python, especially with Jupyter in Anaconda Navigator. Is there a way to load packages into the environment so I can use R and Julia from IPython directly?
Data analysis.
7 years ago: When all the major big data platforms are written in Java, why must we always discard it and pick something more trendy or radical chic? This is my mantra.