Can Microsoft R server turbocharge Analytics Workloads?

Yogesh Kulkarni

Co-Founder and Chief Technology Officer at Ellicium Technology Solutions

Published: April 5, 2016

R is a popular tool used by data scientists and engineers for data mining and statistical computing. It supports procedural as well as object-oriented programming. Thanks to contributions from the user community and its many packages, R offers excellent support for linear and non-linear modelling, machine learning algorithms and time series analysis. These features made R a very attractive tool for our Time Series Analytics product.

Context – We have been using RStudio Open Source R for our data processing requirements. The RStudio IDE has been our favourite due to its simplicity, its multi-platform (Windows, Linux) support and, well, also the fact that it is open source!

To put it in very simple terms, our requirement is to retrieve transaction-level data from a Hadoop cluster, use the corresponding metadata from MySQL tables, perform complex computations to process the data and load the computed data into Hadoop or MySQL as required.

Challenges – We have been facing several challenges using R in production. A few of them are –

Data is processed in batches, and the run time for a batch is in hours and increasing. Getting the run time of a batch down to minutes (rather than hours) is becoming imperative.
By default, RStudio Open Source R does not process data in parallel. We have had to build code using makeCluster (from the parallel package) and the doParallel package for parallel processing. However, this makes the code complex and therefore a challenge to support.
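
To give a flavour of the extra plumbing that explicit parallelism requires, here is a minimal sketch. It is an illustration only, not our production code: it uses just the base parallel package (doParallel and foreach would layer on top of the same cluster object), and the function score_chunk, the worker count of 2 and the sample data are hypothetical.

```r
# Minimal sketch using only the base 'parallel' package.
library(parallel)

# Hypothetical per-chunk computation standing in for real business logic.
score_chunk <- function(values) {
  sum(sqrt(values))
}

# Split the input into four chunks, one list element per chunk.
values <- as.numeric(1:10000)
chunks <- split(values, cut(seq_along(values), 4, labels = FALSE))

# Start two worker processes; the count is an arbitrary choice here.
cl <- makeCluster(2)

# Run the chunks on the workers, stopping the cluster even if an error occurs.
partial <- tryCatch(
  parLapply(cl, chunks, score_chunk),
  finally = stopCluster(cl)
)

# Combine the partial results and compare with a plain serial computation.
parallel_total <- sum(unlist(partial))
serial_total <- sum(sqrt(values))
```

Even in this toy case, the chunking, cluster start-up, clean-up and recombination all have to be written and maintained by hand, which is the kind of complexity we would like the execution environment to absorb.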

Why Microsoft R Server - We are not planning to move away from R; we are looking for an alternative execution environment that would help us overcome the above challenges. While exploring options, we came across Microsoft R Server and the claims about how it speeds up processing in R. Knowing that Microsoft had acquired Revolution Analytics, we were curious to know what Microsoft R Server would bring to the table.

We have decided to conduct our own tests with our actual data to assess whether Microsoft R Server can handle workloads better than RStudio Open Source R, and under what conditions.

Assessment mechanism - We will run a series of identical tests on both RStudio Open Source R and Microsoft R Server. To make the tests comparable, the R code, input data and environment will be the same. The run times required to process the workloads will be measured and used for comparing and reporting results.
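
One simple way to capture run times in R is system.time. The sketch below is an assumption about how such a measurement could be set up, not the exact harness used for these tests; time_workload and sample_workload are hypothetical names, and the placeholder workload stands in for the real batch code.

```r
# Hypothetical helper: run a workload several times and record elapsed seconds.
time_workload <- function(workload, runs = 3) {
  vapply(
    seq_len(runs),
    function(i) system.time(workload())[["elapsed"]],
    numeric(1)
  )
}

# Placeholder workload; the real tests would call the actual batch code.
sample_workload <- function() {
  x <- runif(100000)
  sum(sort(x))
}

# Collect the timings and summarise them in one row for reporting.
timings <- time_workload(sample_workload, runs = 3)
summary_row <- data.frame(
  runs = length(timings),
  median_sec = median(timings),
  min_sec = min(timings)
)
```

Repeating each run and reporting the median rather than a single measurement reduces the influence of one-off effects such as caching or other processes on the machine.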

Sandbox instance details –

Machine – Intel Core i7 quad-core 64-bit processor, 16 GB RAM
R version – 3.0.1
MySQL – 5.6.12

Flow of the R code used for testing –

Load the required R packages
Connect to a table containing varying loads (a few million rows) and retrieve selected data
Process the data based on selected business rules and analytics algorithms in either of two ways -

Process data row-by-row
Hold data in a Data Frame and process it in bulk

Load the processed data into another table
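
The two processing styles in the flow above can be sketched as follows. This is a minimal illustration under stated assumptions: the txns data frame is synthetic, and the fee rule (2% on amounts above 100, otherwise 1%) is a made-up stand-in for the real business rules and algorithms.

```r
# Synthetic stand-in for the retrieved transaction data (not the real schema).
set.seed(42)
txns <- data.frame(
  id = 1:10000,
  amount = round(runif(10000, 1, 500), 2)
)

# Hypothetical business rule: 2% fee on amounts above 100, otherwise 1%.

# Approach 1: process the data row by row.
fee_by_row <- function(df) {
  fee <- numeric(nrow(df))
  for (i in seq_len(nrow(df))) {
    rate <- if (df$amount[i] > 100) 0.02 else 0.01
    fee[i] <- df$amount[i] * rate
  }
  fee
}

# Approach 2: hold the data in a data frame and process it in bulk.
fee_in_bulk <- function(df) {
  df$amount * ifelse(df$amount > 100, 0.02, 0.01)
}

# Both approaches should produce the same fees.
row_fee <- fee_by_row(txns)
bulk_fee <- fee_in_bulk(txns)
```

Both functions apply the same rule; the difference lies only in whether R evaluates it once per row in an interpreted loop or once over whole columns, which is why the two styles are worth timing separately.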

Test Execution – The above R code will be used to run the tests on RStudio and Microsoft R Server for varying workloads, such as 10k and 100k rows, on the sandbox instance.

Results – Results will be published in my next blog. Please stay tuned!

Mariano Silva

8 years ago

I look forward to seeing the results!
