The Dynamic Duo: SAS for Big Data Analytics, and R for Plotting
Monika Wahi
Epidemiology & Biostatistics Consultant a/k/a Data Scientist | Exclusive and innovative solutions for data science challenges in public health, research and education
In a previous article, I considered whether it was necessary to learn both R and Python – can you just get by being like me, and only knowing R? A commenter pointed out that Python handles big data better than R, and that is a good point. It reminded me of a point I’d like to make about “best practices” when using SAS and R in the same shop.
SAS is Awesome with Big Data Processing, but Not So Much with Plots
It took me back to when I worked in SAS shops. SAS can easily handle bigger data than R, but one of its known issues is that it has always been terrible with plots. SAS has tried to solve this by building apps around its core engine to pretty up the graphing part.
However, I think there are deeper challenges with SAS’s graphics approach. My opinion is that SAS needs to rebuild its core engine. It follows very old-fashioned (60s-era) routines that are not suited to today’s faster technology. Hence, it simply builds plots very slowly.
When I wrote my book about “R for SAS Users”, I used the same dataset to make a Kaplan-Meier survival plot in PC SAS and then again in the R GUI on my desktop computer. The R version rendered instantly and was camera-ready, while the SAS version almost didn’t display at all. When it did come up, it looked workable for diagnosing a survival curve, but it was definitely not camera-ready.
Difference in Plotting in SAS vs. R
R and SAS plot differently. I recently made this YouTube video to explain how to format a dataset for plotting with R’s ggplot2 package.
Like Excel, R wants you to format a dataframe in a certain way before you plot it. For example, if you want to plot means, you typically calculate them first and put them in your plot data.
But SAS is easier to deal with in this way. SAS assumes it’s going to plot directly from the big data. So if you are plotting means, you actually just give it your native dataset with all the rows in it, and tell it to both calculate those means on the fly and plot them in one big operation. This is likely why it takes so long to do.
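To make the R side of this contrast concrete, here is a minimal sketch of the "calculate first, then plot" pattern. The data frame `raw` and its columns `group` and `value` are invented for illustration and do not come from any real dataset; the plot step runs only if ggplot2 happens to be installed.

```r
# Hypothetical raw data: one row per observation (names are illustrative only)
set.seed(42)
raw <- data.frame(
  group = rep(c("A", "B", "C"), each = 50),
  value = c(rnorm(50, mean = 10), rnorm(50, mean = 12), rnorm(50, mean = 15))
)

# Step 1: calculate the means first, producing a small plotting data frame
plot_data <- aggregate(value ~ group, data = raw, FUN = mean)
names(plot_data)[names(plot_data) == "value"] <- "mean_value"

# Step 2: plot the pre-computed means (skipped if ggplot2 is not installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(plot_data, ggplot2::aes(x = group, y = mean_value)) +
    ggplot2::geom_col() +
    ggplot2::labs(x = "Group", y = "Mean value")
  print(p)
}
```

The point of the sketch is that `plot_data` has one row per bar, not one row per observation, so the plotting call has very little work left to do.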
In Mixed SAS and R Shops, Make SAS do the Big Data and R do the Plotting
If you were starting a new analytics shop today, you would want to think hard about whether to actually set up a SAS shop from scratch. I, personally, would not advise this – but that doesn’t mean SAS is not needed now. It is direly needed now, because just about all shops in government and healthcare are SAS shops right now. I made this point in an earlier article, where I bemoaned that the younger generation is turning its collective nose up at SAS – so who is going to do all this work?!?!?
I love this talk by Dr. Elizabeth Atkinson at the Mayo Clinic. She explains how their SAS shop needed to start incorporating R, and how they did it. This is a model for what we are going to have to do with these SAS shops in the future.
If you find yourself in charge of a SAS shop that is trying to incorporate R, then the first thing you should do is make R do all the plotting.
Just use SAS to grind through the big data and make a plotting dataset, export it, read it into R, and make the plot. It’s a safe, less-disruptive place to start with such a daunting task, isn’t it?
…and trust me, your life will immediately get way more beautiful!
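As a rough sketch of the R half of that handoff: the snippet below assumes SAS has already written a small, summarised plotting dataset to CSV. Because no real export is available here, the code writes a stand-in file first; the file name, column names, and values are all made up for illustration.

```r
# Stand-in for the file SAS would export; in a real shop SAS writes this CSV.
# The file name, column names, and values are hypothetical.
csv_path <- file.path(tempdir(), "plot_data.csv")
writeLines(c(
  "group,mean_value",
  "A,10.2",
  "B,12.1",
  "C,14.9"
), csv_path)

# R side: read the small, already-summarised plotting dataset
plot_data <- read.csv(csv_path, stringsAsFactors = FALSE)

# Quick sanity checks before plotting
stopifnot(is.numeric(plot_data$mean_value), !anyNA(plot_data))

# Base R plot, so no extra packages are needed
barplot(
  height = plot_data$mean_value,
  names.arg = plot_data$group,
  xlab = "Group",
  ylab = "Mean value",
  main = "Means computed upstream, plotted in R"
)
```

Since the heavy lifting happened before the export, R only ever sees a few rows, which is what keeps this a low-risk first step for a shop that is new to R.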
Monika Wahi is a LinkedIn Learning author of data science courses in both SAS and R. Got a question? Contact her on LinkedIn.