Round 1 of the Bout - Microsoft R Server vs RStudio

In my earlier blog, “Can Microsoft R server turbocharge Analytics Workloads”, I explained at length why we wanted to run our own tests to assess whether Microsoft R Server handles workloads better than open-source RStudio. The first round of testing is complete and the results are slightly disappointing: they are nowhere near what a few claims on the datasheet led us to expect. Is there a reason for it? Let’s find out…

To recap what we did for the Testing:

Sandbox instance details –

  1. Machine – Intel Core i7 Quad-core 64 bit processor, 16 GB RAM
  2. R version – 3.0.1
  3. MySQL – 5.6.12

Flow of the R code used for testing –

  • Load the required R packages
  • Connect to a table containing varying loads (a few million rows) and retrieve selected data
  • Process the data based on selected business rules and analytics algorithms in one of two ways:
     1. Process the data and load it into another table one row at a time
     2. Hold the data in a data frame, process all rows, and load the result into another table in bulk
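The two processing styles above can be sketched in base R as follows. This is only a minimal illustration, not the code we ran: the `orders` data frame stands in for the MySQL table, and `apply_rule` is a made-up business rule chosen purely for the example. The write-back to the target table is left out.

```r
# Hypothetical stand-in for the source table; the real test read from MySQL.
set.seed(42)
orders <- data.frame(id = 1:1000, amount = runif(1000, min = 0, max = 500))

# Hypothetical business rule (an assumption): uplift amounts above 250 by 10%.
apply_rule <- function(amount) {
  ifelse(amount > 250, amount * 1.10, amount)
}

# Approach 1: one row at a time. Each loop iteration handles a single row,
# which is where a single-row write to the target table would happen.
process_row_by_row <- function(df) {
  adjusted <- numeric(nrow(df))
  for (i in seq_len(nrow(df))) {
    adjusted[i] <- apply_rule(df$amount[i])
  }
  df$adjusted <- adjusted
  df
}

# Approach 2: bulk. The rule is applied to the whole column in one
# vectorised call, and the result could then be written in a single load.
process_bulk <- function(df) {
  df$adjusted <- apply_rule(df$amount)
  df
}

row_result  <- process_row_by_row(orders)
bulk_result <- process_bulk(orders)

# Both approaches should produce the same output; only the work pattern differs.
print(isTRUE(all.equal(row_result, bulk_result)))
```

Both functions return the same data frame, so any difference in run time comes from how the work is organised rather than from what is computed.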

Test Execution – The above R code was run using RStudio and Microsoft R Server for 10k, 50k and 100k rows, using one-row-at-a-time processing.
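For readers who want to reproduce this kind of run, here is a rough sketch of a timing harness in base R. It is an assumption about how such a test could be structured, not our actual script: `score_row` is a placeholder for the per-row business logic, and the data is generated in memory instead of being read from MySQL.

```r
# Hypothetical per-row rule, used only to give the loop some work to do.
score_row <- function(x) sqrt(x) + 1

# Time a one-row-at-a-time pass over n_rows values and return elapsed seconds.
time_row_at_a_time <- function(n_rows) {
  values <- seq_len(n_rows)
  result <- numeric(n_rows)
  elapsed <- system.time({
    for (i in seq_len(n_rows)) {
      result[i] <- score_row(values[i])
    }
  })[["elapsed"]]
  elapsed
}

# The three workload sizes used in this round of testing.
row_counts <- c(10000, 50000, 100000)

timings <- data.frame(
  rows    = row_counts,
  seconds = vapply(row_counts, time_row_at_a_time, numeric(1))
)
print(timings)
```

Running the same script under each R installation and comparing the `seconds` column gives a like-for-like comparison, provided nothing else is competing for the machine.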

 Test Results -

 Observations – 

  1. For lower volumes of data (10k and 50k rows), Microsoft R Server seems to run 6% faster than RStudio when processing one row at a time. However, a 6% difference is not very significant.
  2. For a higher volume of data (100k rows), still processing one row at a time, RStudio performs about as well as Microsoft R Server.

 

Conclusion –

  1. With one-row-at-a-time processing, Microsoft R Server seems to get little opportunity to parallelise the work or make use of disk. As a result, the performance improvement is not very significant. In other words, the one-row-at-a-time approach does not exercise Microsoft R Server's processing capabilities.
  2. With one-row-at-a-time processing, run times scale linearly, i.e. as the workload increases, run times increase proportionately. In other words, run times for higher row volumes can be predicted with a good degree of precision.
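If run times really do scale linearly, a simple linear fit is enough to estimate the run time for a larger workload. The sketch below shows the idea in base R. The timings in it are placeholders made up for illustration, not the measurements from this test, so substitute your own numbers.

```r
# Placeholder timings, NOT measurements from this test: replace with your own.
observed <- data.frame(
  rows    = c(10000, 50000, 100000),
  seconds = c(12, 60, 120)
)

# Fit run time as a linear function of row count.
fit <- lm(seconds ~ rows, data = observed)

# Estimate the run time for a larger, untested workload (500k rows here).
predicted <- predict(fit, newdata = data.frame(rows = 500000))
print(unname(predicted))
```

The estimate is only as good as the linearity assumption. Once the data no longer fits in memory, or the database becomes the bottleneck, the relationship may stop being linear.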

 

Next Steps – We will now change the approach used in the R code. Instead of processing one row at a time, we will load the data into an R data frame and have R process it in bulk. Let’s see if Microsoft R Server gets an opportunity to prove its prowess!
