Can Microsoft R Server turbocharge Analytics Workloads?

R is a popular tool used by data scientists and engineers for data mining and statistical computing. It supports procedural as well as object-oriented programming. Thanks to contributions from the user community and a large set of packages, R offers excellent support for linear and non-linear modelling, machine learning algorithms and time series analysis. These features made R a very attractive tool for our Time Series Analytics product.

Context – We have been using RStudio Open Source R for our data processing requirements. The RStudio IDE has been our favourite due to its simplicity, its multi-platform (Windows, Linux) support and, well, also because it is open source!

To put it in very simple terms, our requirement is to retrieve transaction level data from a Hadoop cluster, use the corresponding Metadata from MySQL tables, perform complex computations to process the data and load the computed data to Hadoop or MySQL as per requirements.

Challenges – We have been facing several challenges using R in production. A few of them are –

  1. Data is processed in batches, and the run time for a batch is measured in hours and increasing. Bringing the run time of a batch down to minutes (rather than hours) is becoming imperative.
  2. By default, RStudio Open Source R does not process data in parallel. We have had to build code using makeCluster (from the parallel package) and the doParallel package for parallel processing. However, this makes the code complex, and supporting it therefore becomes a challenge.
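
To give a feel for the extra plumbing the second challenge refers to, here is a minimal sketch of cluster-based parallel processing using only the parallel package that ships with R. It is not our production code: the `process_chunk` function, the toy `batch` data frame and the worker count are assumptions made purely for illustration. A foreach/doParallel version would register the same kind of cluster object and loop with `%dopar%` instead.

```r
library(parallel)

# Hypothetical per-chunk computation standing in for the real business rules.
process_chunk <- function(chunk) {
  chunk$amount_scaled <- chunk$amount * 1.1
  chunk
}

# Toy data standing in for one batch of transaction rows (assumption).
batch <- data.frame(id = 1:1000, amount = as.numeric(1:1000))

# Split the batch into contiguous chunks, one per worker.
n_workers <- 2
chunk_id <- rep(seq_len(n_workers),
                each = ceiling(nrow(batch) / n_workers),
                length.out = nrow(batch))
chunks <- split(batch, chunk_id)

# Start the workers, process the chunks in parallel, and always shut down.
cl <- makeCluster(n_workers)
results <- tryCatch(
  parLapply(cl, chunks, process_chunk),
  finally = stopCluster(cl)
)

# Reassemble the chunks into a single data frame in the original order.
processed <- do.call(rbind, results)
processed <- processed[order(processed$id), ]
row.names(processed) <- NULL
```

Even in this toy form, the splitting, cluster lifecycle and reassembly steps surround a one-line computation, which is the kind of overhead that makes such code harder to support.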

Why Microsoft R Server – We are not planning to move away from R; we are looking for an alternative execution environment that would help us overcome the above challenges. As part of this search, we came across Microsoft R Server and claims that it speeds up processing in R. Knowing that Microsoft had acquired Revolution Analytics, we were curious to know what Microsoft R Server would bring to the table.

We have decided to conduct our own tests with our actual data to assess whether Microsoft R Server can handle workloads better than RStudio Open Source R, and under what conditions.

Assessment mechanism – We will run a series of identical tests on both RStudio Open Source R and Microsoft R Server. To make the tests comparable, the R code, input data and environment will be the same. The run time required to process each workload will be measured and used for comparing and reporting results.
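
As a rough illustration of how run times could be captured consistently on both environments, the sketch below wraps a workload in base R's `system.time` and repeats it a few times. The helper name `time_workload`, the number of runs and the toy workload are all assumptions for illustration; the actual test harness may differ.

```r
# Run a workload several times and summarise the elapsed wall-clock time.
# `workload` is any zero-argument function (hypothetical helper, not a library API).
time_workload <- function(workload, runs = 3) {
  elapsed <- vapply(seq_len(runs), function(i) {
    system.time(workload())[["elapsed"]]
  }, numeric(1))
  c(min = min(elapsed), median = median(elapsed), max = max(elapsed))
}

# Toy stand-in for the real data processing job (assumption).
toy_workload <- function() {
  x <- cumsum(as.numeric(1:100000))
  invisible(x)
}

timings <- time_workload(toy_workload, runs = 3)
print(timings)
```

Repeating each run and reporting the minimum, median and maximum helps separate real differences between environments from one-off noise such as disk caching.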

Sandbox instance details –

  1. Machine – Intel Core i7 Quad-core 64 bit processor, 16 GB RAM
  2. R version – 3.0.1
  3. MySQL – 5.6.12

Flow of the R code used for testing –

  • Load the required R packages
  • Connect to a table containing varying loads (a few million rows) and retrieve selected data
  • Process the data based on selected business rules and analytics algorithms in one of two ways –
      1. Process data row-by-row
      2. Hold data in a Data Frame and process it in bulk
  • Load the processed data into another table
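
The difference between the two processing styles in the flow above can be sketched on an in-memory data frame. The business rule here (flag amounts above a threshold and charge a 2% fee on them), the column names and the random data are hypothetical stand-ins, and the database read and write steps are left out so that the example runs on its own.

```r
set.seed(42)

# Toy transactions standing in for rows retrieved from the source table (assumption).
txns <- data.frame(id = seq_len(10000), amount = runif(10000, 1, 500))

# Style 1: process data row-by-row with an explicit loop.
process_row_by_row <- function(df, threshold = 250) {
  flagged <- logical(nrow(df))
  fee <- numeric(nrow(df))
  for (i in seq_len(nrow(df))) {
    flagged[i] <- df$amount[i] > threshold
    fee[i] <- if (flagged[i]) df$amount[i] * 0.02 else 0
  }
  df$flagged <- flagged
  df$fee <- fee
  df
}

# Style 2: hold the data in a data frame and process whole columns at once.
process_in_bulk <- function(df, threshold = 250) {
  df$flagged <- df$amount > threshold
  df$fee <- ifelse(df$flagged, df$amount * 0.02, 0)
  df
}

row_time <- system.time(row_result <- process_row_by_row(txns))[["elapsed"]]
bulk_time <- system.time(bulk_result <- process_in_bulk(txns))[["elapsed"]]

# Both styles should produce the same output; only the run time differs.
stopifnot(isTRUE(all.equal(row_result, bulk_result)))
cat("row-by-row:", row_time, "s; bulk:", bulk_time, "s\n")
```

Both functions apply the same rule, so any timing gap between them comes from how the work is expressed rather than from what is computed.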

Test Execution – The above R code will be used to run the tests on RStudio and Microsoft R Server for varying workloads, such as 10k and 100k rows, on the Sandbox instance.

Results – Results will be published in my next blog post. Please stay tuned!

Mariano Silva

Analytics | Big Data | Business Intelligence | Data Engineering | Data Governance | Lean 6 Sigma Black Belt | Machine Learning

8 years ago

I look forward to seeing the results!
