Data Science and Machine Learning. With Java?

This note has been compiled in part from the webinar Data Science & Machine Learning in Java, presented by Simon Ritter, Deputy CTO of Azul Systems. Any errors, however, are wholly mine.

In this blog, I outline briefly:

-      Common Applications of Data Science

-      Definitions: Machine learning, deep learning, data engineering and data science

-      Why Java for data science workflows, for both production and research.

Common Applications of Data Science

The blogosphere is full of descriptions of how data science and “AI” are changing the world. This table references some key applications.

[Image: table of key data science applications]

The applications outlined are largely not new, nor are "AI" algorithms like neural networks. However, increasingly commoditized, flexible and cheaper hardware, together with readily available algorithms and APIs, has lowered the barriers to the data- and compute-intensive approaches common to data science, making "AI" algorithms much more straightforward to use.

Key Definitions: Machine Learning, Data Science, etc

For practitioners, definitions are well understood. For those less familiar and curious, here are some quick definitions and introductions to baseline everyone.

At their heart, data science workflows transform data from heterogeneous sources, through models and learning, into information that expedites “useful” decisions. Decisions may be automated (e.g. an online search or a retail credit fraud check) or may inform human judgment (e.g. portfolio manager investment decisions or a complex corporate lending negotiation).

Some see a distinction between Data Science and Data Engineering, but the two are sides of the same coin. As U2 once put it, “we’re one, but we’re not the same.” I was recently pointed to this table, which I have adjusted a tad below. I’d argue that developers/DevOps should be called out too, as a distinct column.

[Image: table contrasting data science and data engineering]

In the same article, a commentator observed:

"Most cloud-native-type companies need five data engineers for each data scientist to get the data into the form and location needed for good data science," said Jason Preszler, head data scientist at Karat, a technical hiring service. "Without both roles, the data [that] companies are easily collecting is just sitting around or underutilized."

I’ve seen exceptional domain specialists-cum-data scientists also be CTO-like unicorns, bridging the gap between algorithm, implementation and business insight. I've also seen enterprise architects and CTOs, particularly those gifted with both soft skills and AI-focused STEM PhDs, drive algorithmic research, a chance perhaps to relive their university days. Their direction in turn helps specialists deliver individual tasks, from algorithms and research, through data munging and warehousing, software and application development, right through to business-level reporting and, if applicable, automated activity execution.

Now let’s briefly examine some key algorithmic terminologies, important because we’ll return to them later in the article when exploring emerging Java capabilities:

Machine Learning: “The field of study that gives computers the ability to learn without being explicitly programmed” - Arthur Samuel (1959)

The field subdivides in multiple ways.

Machine learning itself learns from example data in order to make predictions about new data. Supervised learning trains a model on known inputs and outputs (labeled data), while unsupervised learning finds hidden patterns or intrinsic structures in unlabeled input data. Both can apply.
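To make the supervised/unsupervised distinction concrete, here is a minimal plain-Java sketch. It is illustrative only: the class name `LearningStyles`, the toy transaction amounts and the "ok"/"fraud" labels are all made up for the example, and the two methods (a 1-nearest-neighbour classifier and a two-cluster k-means) are just simple stand-ins for each family of algorithms.

```java
import java.util.Arrays;

/** Toy contrast between supervised and unsupervised learning (illustrative only). */
final class LearningStyles {

    /** Supervised: 1-nearest-neighbour. Predicts the label of the closest labeled input. */
    static String nearestNeighbour(double[] inputs, String[] labels, double query) {
        int best = 0;
        for (int i = 1; i < inputs.length; i++) {
            if (Math.abs(inputs[i] - query) < Math.abs(inputs[best] - query)) {
                best = i;
            }
        }
        return labels[best];
    }

    /** Unsupervised: two-cluster k-means on unlabeled values; returns {lowCentre, highCentre}. */
    static double[] twoMeans(double[] values, int iterations) {
        // Assumption: start the two centres at the smallest and largest value.
        double low = Arrays.stream(values).min().getAsDouble();
        double high = Arrays.stream(values).max().getAsDouble();
        for (int it = 0; it < iterations; it++) {
            double sumLow = 0, sumHigh = 0;
            int nLow = 0, nHigh = 0;
            for (double v : values) {
                // Assign each value to its nearest centre.
                if (Math.abs(v - low) <= Math.abs(v - high)) {
                    sumLow += v;
                    nLow++;
                } else {
                    sumHigh += v;
                    nHigh++;
                }
            }
            // Move each centre to the mean of the values assigned to it.
            if (nLow > 0) low = sumLow / nLow;
            if (nHigh > 0) high = sumHigh / nHigh;
        }
        return new double[] {low, high};
    }

    public static void main(String[] args) {
        double[] amounts = {5, 8, 12, 900, 950, 1000};                    // hypothetical data
        String[] labels = {"ok", "ok", "ok", "fraud", "fraud", "fraud"};  // hypothetical labels
        System.out.println(nearestNeighbour(amounts, labels, 870));       // uses the labels
        System.out.println(Arrays.toString(twoMeans(amounts, 10)));       // ignores the labels
    }
}
```

The supervised method needs the labels to answer; the unsupervised one never sees them and only discovers that the amounts fall into two groups.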

In deep learning, a computer model learns to perform classification tasks directly from images, text, signals or sound. Models are trained by using a large set of labeled data and neural network architectures that contain many layers, like below. 

[Image: a neural network with many layers]
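As a rough illustration of what "many layers" means computationally, the sketch below pushes an input through a stack of fully connected layers in plain Java. It is a sketch under stated assumptions: the class name `TinyNetwork`, the ReLU activation and the hand-picked weights are all illustrative choices, and no training is shown (a real network would learn its weights from labeled data).

```java
import java.util.Arrays;

/** Forward pass through a small stack of dense layers (weights are hand-picked, not trained). */
final class TinyNetwork {

    /** One dense layer: out[j] = relu(bias[j] + sum over i of in[i] * weights[i][j]). */
    static double[] denseRelu(double[] in, double[][] weights, double[] bias) {
        double[] out = new double[bias.length];
        for (int j = 0; j < out.length; j++) {
            double sum = bias[j];
            for (int i = 0; i < in.length; i++) {
                sum += in[i] * weights[i][j];
            }
            out[j] = Math.max(0.0, sum);  // ReLU activation (an assumption for this sketch)
        }
        return out;
    }

    /** Feeds the input through each layer in turn; the output of one layer is the input of the next. */
    static double[] forward(double[] input, double[][][] weights, double[][] biases) {
        double[] activation = input;
        for (int layer = 0; layer < weights.length; layer++) {
            activation = denseRelu(activation, weights[layer], biases[layer]);
        }
        return activation;
    }

    public static void main(String[] args) {
        double[][][] weights = {
            {{1, -1}, {1, 1}},  // layer 1: 2 inputs -> 2 hidden units
            {{1}, {-2}}         // layer 2: 2 hidden units -> 1 output
        };
        double[][] biases = {{0, 0}, {0.5}};
        System.out.println(Arrays.toString(forward(new double[] {1, 2}, weights, biases)));
    }
}
```

Deep learning frameworks do essentially this at far larger scale, plus the training step that adjusts the weights.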

Among the various deep learning algorithms, I’ve worked with two in my professional life:

  • Convolutional neural networks, which extract image features. In my case, they counted cars in out-of-town car parks and malls from satellite images, forming the basis of an "alternative" RetailWatch car-count data set that gave daily insights into retail performance.
  • Long Short-Term Memory (LSTM) networks, which I demoed for sentiment classification of news, tweets and earnings announcements to underpin alternative-data-driven trading strategies.

Reinforcement Learning: This uses a human-like, trial-and-error, “agent-based” approach to reinforce paths that work and discard paths that don’t. Such approaches are popular in search, retail and trading strategies, as they can mimic complex human behavior. They are also applied in ADAS (Advanced Driver-Assistance Systems) applications, intersecting well with the human-machine interface on which such systems depend.
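The trial-and-error idea can be sketched with tabular Q-learning, one common reinforcement learning method, on a toy problem. Everything here is an assumption made for illustration: the five-cell corridor, the reward of 1 at the right-hand end, the hyperparameters and the class name `CorridorQLearning`. It uses plain Java rather than any particular RL library.

```java
import java.util.Random;

/** Tabular Q-learning on a five-cell corridor (toy example, all values hypothetical). */
final class CorridorQLearning {
    static final int STATES = 5;           // cells 0..4; cell 4 is the goal
    static final int LEFT = 0, RIGHT = 1;

    /** Moves one cell left or right, staying inside the corridor. */
    static int step(int state, int action) {
        int next = (action == RIGHT) ? state + 1 : state - 1;
        return Math.max(0, Math.min(STATES - 1, next));
    }

    /** Picks the action with the higher learned value in this state. */
    static int greedy(double[][] q, int state) {
        return q[state][RIGHT] >= q[state][LEFT] ? RIGHT : LEFT;
    }

    /** Learns action values by trial and error and returns the Q-table. */
    static double[][] train(int episodes, long seed) {
        double alpha = 0.5, gamma = 0.9, epsilon = 0.2;  // assumed hyperparameters
        double[][] q = new double[STATES][2];
        Random random = new Random(seed);
        for (int e = 0; e < episodes; e++) {
            int state = 0;
            while (state != STATES - 1) {
                boolean tie = q[state][LEFT] == q[state][RIGHT];
                int action = (tie || random.nextDouble() < epsilon)
                        ? random.nextInt(2)   // explore, or break a tie at random
                        : greedy(q, state);   // exploit what has worked so far
                int next = step(state, action);
                double reward = (next == STATES - 1) ? 1.0 : 0.0;
                double bestNext = Math.max(q[next][LEFT], q[next][RIGHT]);
                // Reinforce: nudge the estimate towards reward plus discounted future value.
                q[state][action] += alpha * (reward + gamma * bestNext - q[state][action]);
                state = next;
            }
        }
        return q;
    }

    public static void main(String[] args) {
        double[][] q = train(500, 42L);
        for (int s = 0; s < STATES - 1; s++) {
            System.out.println("cell " + s + " -> " + (greedy(q, s) == RIGHT ? "right" : "left"));
        }
    }
}
```

After training, the greedy choice in every non-goal cell should be "right": moves that led towards the reward were reinforced, and the others were left behind.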

Why Java in Your Data Science Workflows?

All languages are beautiful, though their individual beauty often lies in the eye of the beholder. The open source languages Python and R have dominated upstream Data Science since around 2010-15. Before that it was the commercial language MATLAB, in which many game-changing early neural network algorithms were implemented. Views differ on how far Python and R extend into the enterprise stack. In research, R has a rich ecosystem of statistical libraries, while key libraries like TensorFlow, PyTorch and Keras are accessible from Python, supported by the SciPy stack and pandas. However, other languages are coming to the fore, including Java, C++ and .NET. Gartner machine learning guru Andriy Burkov writes eloquently:

"Some people working in data analysis think that there's something special about Python (or R, or, Scala).

They will tell you that you have to use one of those because otherwise, you will not get the best result. It's not true. The choice of language should be made based on two factors: 1) how well your final product will integrate with the existing ecosystem and 2) the availability of production-grade data analysis libraries for each language.

Currently, almost any popular language has one or more powerful libraries for data analysis. Java is an excellent example, where the development of everything hot is happening right now because of a multitude of existing JVM languages. C++ historically has a huge choice of implemented algorithms. Even proprietary ecosystems such as .NET today contain implementations of most of the state-of-the-art algorithms and learning paradigms. So, if someone tells you that only Python is the way to go, I would be skeptical and look for someone who embraces diversity."

Great advice. Two key points primarily from the Java perspective:

i) “Upstream” data science algorithms, particularly for statistics, machine learning and deep learning (neural nets), were hitherto the province of Python, R and MATLAB but are now more readily available across more languages. In Java, for example, the following frameworks are emerging:

-      DeepLearning4J (DL4J): A toolkit for building, training and deploying neural networks. RL4J extends it with reinforcement learning, targets image processing, and includes Markov Decision Process (MDP) and Deep Q-Network (DQN) methods.

-      ND4J: Key scientific computing libraries for JVM use, modeled on NumPy and core MATLAB, including deep learning capabilities.

-      Amazon Deep Java Library (DJL): Develop and deploy machine learning and deep learning models, drawing on the MXNet, PyTorch and TensorFlow frameworks.

These and other capabilities make Java accessible to developer-savvy scientific programmers.

Note that commercial "upstream" environments such as SAS, KNIME and RapidMiner offer data science platforms with strong Java foundations. MATLAB too has historically integrated well with Java for application development and API connectivity, a theme in Yair Altman’s Java/MATLAB aging classic, Undocumented MATLAB. The MATLAB Production Server is one of several vehicles to deploy MATLAB algorithms into Java enterprise applications. In your Java code, you can define a Java interface to represent the deployed MATLAB function, instantiate a proxy object to communicate with the Production Server, and thus call the MATLAB-generated function.
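The interface-plus-proxy pattern described above can be sketched with nothing but the JDK's own dynamic proxies. To be clear, this is not the MATLAB Production Server client API: the interface `PortfolioMath`, its `mean` method and the class `RemoteFunctionSketch` are hypothetical, and the "remote" call is computed locally so the sketch runs anywhere. A real client library would send the call to the server and decode the response at the point marked in the comment.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

/** Sketch of the interface-plus-proxy pattern using only the JDK's dynamic proxies. */
final class RemoteFunctionSketch {

    /** Hypothetical Java view of a deployed numerical function. */
    interface PortfolioMath {
        double mean(double[] values);
    }

    /** Builds a proxy object that implements the interface on the caller's behalf. */
    static PortfolioMath createProxy() {
        InvocationHandler handler = (proxy, method, args) -> {
            // Stand-in for the remote call: a real client would serialize the
            // arguments, call the server, and decode the result here.
            if (method.getName().equals("mean")) {
                double[] values = (double[]) args[0];
                double sum = 0;
                for (double v : values) {
                    sum += v;
                }
                return sum / values.length;
            }
            throw new UnsupportedOperationException(method.getName());
        };
        return (PortfolioMath) Proxy.newProxyInstance(
                PortfolioMath.class.getClassLoader(),
                new Class<?>[] {PortfolioMath.class},
                handler);
    }

    public static void main(String[] args) {
        PortfolioMath math = createProxy();
        // The caller sees an ordinary Java method call.
        System.out.println(math.mean(new double[] {1.0, 2.0, 6.0}));
    }
}
```

The appeal of the pattern is that the calling code depends only on a plain Java interface, so where the computation actually runs becomes a deployment detail.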

You can also call and deploy open source R code from Java in many ways, including via the Rserve package.

In short, there are increasing capabilities to code (production-ready-ish) data science algorithms in Java and, where that is not possible, to call other languages from Java. Python (with NumPy, SciPy and pandas), R and MATLAB will surely remain algorithmic domain leaders given their matrix algebra, technical computing and statistical foundations, but Java and other languages are increasingly compelling.

A quick nod to C++. Remember that key “Python” libraries, including TensorFlow and PyTorch, have strong C++ foundations. Away from algorithms and toward data engineering, pandas creator Wes McKinney, for example, has highlighted the relevance of C++ to the multi-platform Apache Arrow project.

ii) “Downstream” data science enterprise architectures, particularly those focusing on secure data throughput, are often Java-based and/or built on platforms or languages (e.g. Scala or Clojure) that run on the Java Virtual Machine (JVM), such as:

  • Hadoop: Distributed storage and processing of big data using the MapReduce programming model
  • Spark: Where Hadoop tends towards batch, Spark performs batch and streaming.
  • Kafka: Messaging and Streaming
  • Cassandra: NoSQL Database
  • Neo4j: A popular NoSQL graph database
  • Elasticsearch: A search engine based on the Lucene library, providing a distributed full-text search engine with an HTTP web interface and schema-free JSON documents.
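For readers new to the MapReduce programming model mentioned for Hadoop above, here is a minimal in-memory sketch of the idea using Java streams. It does not use the Hadoop API and does nothing distributed. It only shows the shape of the computation (map each record to keys, group by key, reduce each group) with the classic word count. The class name `WordCountSketch` and the sample lines are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** In-memory sketch of the MapReduce idea (map, group by key, reduce) using Java streams. */
final class WordCountSketch {

    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // Map phase: turn each line into a stream of lower-cased words.
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                // Shuffle and reduce phases: group identical words, then count each group.
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data", "Big compute");  // hypothetical input
        System.out.println(countWords(lines));
    }
}
```

A framework like Hadoop runs the same two phases across many machines and moves the grouped data between them, which is where most of the real engineering lies.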

Java excels in distributed environments. Secure data handling, manipulation, transfer and connectivity are among its natural strengths, and it benefits from a coordinated security strategy maintained over the years by Sun Microsystems, Oracle and now the vibrant OpenJDK community. The cross-platform approach underpinned by the JVM, i.e. develop once, deploy anywhere, facilitates enterprise development. Key projects such as Project Panama, which improves interoperability between the JVM and native code, will bring compute-intensive, deep learning-friendly CUDA and OpenCL-based libraries and GPU hardware within easier reach.

In conclusion, Java is prominent in enterprise architectures and is becoming increasingly versatile in “upstream”, data science-enabling algorithmic capabilities. It will operate in conjunction with Python, R, MATLAB, C++ and others, not instead of them, but it is increasingly possible to use Java across all aspects of data science workflows.

As an MBA student wrote in response to Burkov's post referenced earlier, "I started learning Java last year and I am beyond excited about learning how to do advanced analytics in this language! I am currently taking Java based courses in ‘Data Structures and Algorithms’ & ‘Mathematics in Computing.'"
