Microsoft R Server Wins Round 2 of the Bout!
Yogesh Kulkarni
Co-Founder and Chief Technology Officer at Ellicium Technology Solutions
In my earlier blogs “Can Microsoft R server turbocharge Analytics Workloads” and “Round 1 of the Bout Between Microsoft R Server”, I gave the background of the testing we are doing to see how well Microsoft R Server handles workloads compared to RStudio. The first round of results, which was based on processing data row by row, did not favour either Microsoft R Server or RStudio.
We did the second round of testing using bulk processing of data, and the results are AMAZING!! Microsoft R Server seems to have proven its prowess, and the results are tilted heavily in its favour.
To quickly recap what we did for the testing:
Sandbox instance details –
- Machine – Intel Core i7 quad-core 64-bit processor, 16 GB RAM
- R version – 3.0.1
- MySQL – 5.6.12
Flow of the R code used for testing –
- Load the required R packages
- Connect to a table containing varying loads (a few million rows) and retrieve selected data
- Process the data based on selected business rules and analytics algorithms in one of two ways -
  - Round 1 of testing - Process data and load it into another table using one-row-at-a-time processing
  - Round 2 of testing - Hold data in a Data Frame, process it for all rows and load it into another table in bulk
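The two processing styles in the flow above can be sketched in base R. This is only an illustration, not the code used in our tests: the `orders` data frame is a synthetic stand-in for the MySQL table, and `apply_rule` is a hypothetical business rule made up for the example.

```r
# Synthetic stand-in for the source table (assumption: the real test read from MySQL).
set.seed(42)
n <- 1000
orders <- data.frame(
  id     = seq_len(n),
  amount = round(runif(n, min = 1, max = 500), 2)
)

# Hypothetical business rule: 10% discount on amounts above 250.
apply_rule <- function(amount) ifelse(amount > 250, amount * 0.9, amount)

# Round 1 style: handle one row at a time.
process_row_by_row <- function(df) {
  rows <- vector("list", nrow(df))
  for (i in seq_len(nrow(df))) {
    row <- df[i, ]
    row$net <- apply_rule(row$amount)
    rows[[i]] <- row              # stands in for a one-row INSERT
  }
  do.call(rbind, rows)
}

# Round 2 style: hold everything in a data frame and process all rows at once.
process_bulk <- function(df) {
  df$net <- apply_rule(df$amount) # vectorised over the whole column
  df                              # stands in for a single bulk load
}

res_rows <- process_row_by_row(orders)
res_bulk <- process_bulk(orders)
```

Both functions produce the same `net` column; the difference lies only in how many individual operations the R runtime has to perform to get there.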
Test Execution – The above R code was run using RStudio and Microsoft R Server for 10k, 50k and 100k rows, processing the data in bulk using R Data Frames.
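A timing harness for this kind of run could look roughly like the sketch below. It is an assumption about how such a test might be scripted, not our actual test code: `bulk`, `apply_rule` and `time_workloads` are hypothetical names, and the data is generated in memory instead of being read from MySQL.

```r
# Hypothetical business rule and bulk processing step (assumptions for illustration).
apply_rule <- function(amount) ifelse(amount > 250, amount * 0.9, amount)
bulk <- function(df) {
  df$net <- apply_rule(df$amount)
  df
}

# Time a processing function for each workload size.
time_workloads <- function(process, sizes = c(10000, 50000, 100000)) {
  set.seed(1)
  elapsed <- vapply(sizes, function(n) {
    df <- data.frame(id = seq_len(n), amount = runif(n, 1, 500))
    unname(system.time(process(df))["elapsed"])  # wall-clock seconds
  }, numeric(1))
  data.frame(rows = sizes, seconds = elapsed)
}

timings <- time_workloads(bulk)
print(timings)
```

Running the same script under each R runtime and dividing one `seconds` column by the other gives the "times faster" figure for each workload size.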
Test Results -
Observations –
- For every workload from 10k to 100k rows, when processing data in bulk using R Data Frames, Microsoft R Server processes data much faster than RStudio.
- As the workload increases, Microsoft R Server's lead over RStudio grows, as seen below –
  - for 10k rows, Microsoft R Server is 4 times faster!
  - for 50k rows, Microsoft R Server is 5 times faster!!
  - for 100k rows, Microsoft R Server is 6 times faster!!!
Conclusion –
1. For bulk data processing using R Data Frames, Microsoft R Server overshadows RStudio by a significant margin. This appears to be because it can parallelise data processing and make good use of all the available resources.
Looking at it another way - to make use of the Microsoft R Server capabilities, data needs to be processed in bulk wherever possible. Processing data one row at a time will not give significant performance benefits on Microsoft R Server.
2. As the volume of data being processed in bulk increases, the difference in performance becomes more and more significant. This suggests that Microsoft R Server is able to handle large volumes of data much better than RStudio.
In other words – beyond a certain volume of data, RStudio might not be able to process data within acceptable time frames. It will be imperative to consider Microsoft R Server for such cases.
Summary - Microsoft R Server seems to be able to parallelise data processing, which gives superior results. It also makes good use of the available memory and processing power, which is essential when processing large volumes of data.
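To make the idea of parallelised bulk processing concrete, here is a generic sketch using the `parallel` package that ships with base R. It is not a description of how Microsoft R Server works internally (we have only observed its behaviour from the outside); the worker count, the synthetic data and the `apply_rule` function are all assumptions made for the example.

```r
library(parallel)  # ships with base R

# Hypothetical business rule (assumption for illustration).
apply_rule <- function(amount) ifelse(amount > 250, amount * 0.9, amount)

set.seed(7)
df <- data.frame(id = seq_len(100000), amount = runif(100000, 1, 500))

n_workers <- 2  # assumption; a quad-core machine could use more

# Split the data frame into one contiguous chunk per worker.
chunks <- split(df, cut(seq_len(nrow(df)), n_workers, labels = FALSE))

# Process each chunk in bulk on its own worker process.
cl <- makeCluster(n_workers)
processed <- parLapply(cl, chunks, function(chunk, rule) {
  chunk$net <- rule(chunk$amount)
  chunk
}, rule = apply_rule)
stopCluster(cl)

# Reassemble the chunks in their original order.
result <- do.call(rbind, unname(processed))
rownames(result) <- NULL
```

Each worker still processes its chunk in bulk, so this only pays off when the data is held in a data frame and not handled one row at a time.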
Next Steps – We plan to run similar tests for typical Machine Learning algorithms, e.g. Regression, Classification or Principal Component Analysis (PCA). It will be interesting to see how Microsoft R Server fares in these cases.
Comment (8 years ago) - RStudio and MS R Server have different architectures, so the comparison does not make sense from a parallelisation point of view. It is important to test the parallelisation capability of MS R Server against Spark/MLlib (or HP Vertica Distributed R, RapidMiner), or to compare processing with a particular algorithm such as regression or MCA, on the same 4-core machine and with a high transaction volume of 5-10 million on Hadoop.