Python vs. R vs. SAS

Lately, I have been speaking to a number of people who are convinced that SAS’ days are numbered. Open-source tools are increasingly being chosen over enterprise solutions. Alex Woodie highlighted this in his excellent piece for Datanami, “Python Eats Into R as SAS Dominance Fades,” which you should check out if you get a moment.

The article highlights an interesting pattern: the tool you choose depends on whether you consider yourself a data scientist or a predictive analytics professional. In the survey it cites, 53% of those who identified as data scientists chose Python as their tool of choice. Those who identified as predictive analytics professionals were split 43% in favor of SAS and 41% in favor of R, with only 16% preferring Python.

So, I wanted to get in touch with a couple of senior contacts and hear their thoughts on this subject. This week I am delighted to share the comments of Ola Ottosson of Actionbase and Daniel Tidström of Kambi, both of whom are based in Stockholm.

First up is Ola.

Ola has 15 years of experience working as a consultant in the analytical space, focusing primarily on retail, CRM, customer development, and loyalty management. He started his career at dunnhumby in London as an analyst. He has also worked at EYC with three of the top five grocery retailers in the U.S., consulting on how to use insights from customer data to optimize assortment, pricing, space management, and promotions. He has lived in Stockholm since 2011 and works for Actionbase, where he is the Analytical Director and is responsible for R&D.

Richard: “What is your experience working with SAS, R, and Python?”

Ola: “I started using SAS at dunnhumby and have used it on and off ever since, doing both data manipulation and statistical modeling. I have picked up R over the last 4-5 years, and now I’m using it quite often. With Python, I’m not that experienced, but have tried it out using things like Pandas, NumPy, and SciPy.”

Richard: “With respect to your clients, what is the most commonly used tool, and have you seen that change with the emergence of Python and R?”

Ola: “There has definitely been a change in the last few years, with companies putting more and more effort into analytics, specifically the types of companies that previously did nothing more advanced than BI/reporting. SAS used to be (and still is) the choice for many large companies, whereas a start-up is much more likely to use something like R or Python, partly because they are open source.”

Richard: “What limitations do you think SAS, Python and R have with regard to analysis, and would you take a blended approach with respect to different tools for alternative modeling tasks?”

Ola: “They can all perform most of the analytical tasks that you could imagine. I think R stands out when it comes to visualizing and charting data. R and Python (at least in the open source versions) do of course have limited support (support relies more on online communities) which can mean they will be ruled out for certain companies which are heavily regulated. There are now alternatives like Revolution Analytics (Microsoft R), but then the advantage of being free of charge goes away.

“R has an edge over Python when it comes to statistical analysis and explorative analysis because the language was created with those aspects in mind. Uncommon and brand-new statistical methods are almost always found as R libraries before eventually showing up in the other languages. Although it is possible to work in Python interactively using, for example, Jupyter notebooks or the Python interpreter, using R in combination with RStudio is far superior in this aspect, in my experience. Python on the other hand is a general-purpose language which is far better than R for developing larger applications or applications which have low latency requirements (it is much faster than R, in general). In my experience, Python also has more algorithmic libraries that are not purely statistical, which is an advantage when statistical procedures are to be combined with other algorithms. Both R and Python have great visualization libraries, but I tend to prefer R’s ggplot and plotly libraries.

“SAS procedures are, on the other hand, often more reliable and better documented than R libraries because anyone can implement an R library.

“Blended approaches add complexity to the implementations, but if the analytic applications have specific needs, it can be advantageous. For example, you might want to perform the initial analysis in R, and then blend the R visualization libraries with Python code for production. Or you might stick to SAS libraries because of reliability and support requirements and do the visualizations and exploratory analysis in R.”

Richard: “What is your ‘weapon of choice’ and why?”

Ola: “As a consultant, in many cases you will work with whatever the client has. In cases where we do a one-off analytical project like a predictive model, I would tend to use R, as it would not add any cost to the project itself. In cases where you build a large number of complex models that need to be maintained and monitored, I think that SAS provides the best tools to do this effectively.”

Richard: “If you were a betting man, what tool do you see as being the dominant player over the next few years?”

Ola: “We see a trend moving towards open-source languages, and R was for example recently reported to pass SAS in scholarly use (https://r4stats.com/2016/06/08/r-passes-sas-in-scholarly-use-finally/). Python seems to be growing in popularity as an analytics language, and much development is going on towards bringing many of R’s features to Python. Because R is unlikely to ever be able to support the abundance of additional features of Python, my best bet would be that Python will be the dominant player in the long term, at least for companies having their major presence online. Neither of these languages will go away anytime soon, though, as they currently fulfil vital roles in the analytics sector.”

Next up is Daniel.

Daniel is Head of Analytics at Kambi Sports Solutions. He has more than 15 years of experience in IT and business development, and has worked for the last 3 years in the Big Data and Advanced Analytics area.

Richard: “What experience do you personally have with SAS, R, and Python?”

Daniel: “I started out learning R after trying a lot of different tools like KNIME, RapidMiner, etc. I had some decent progress on R, but have lately switched more or less completely to Python. The switch to Python was very much influenced by the notebook revolution starting with IPython (now Jupyter), but the ease of learning the language and the frameworks is probably the biggest factor. My experience is quite general, with a lot of data-wrangling with Pandas and NumPy, machine learning with scikit-learn, and visualizations with matplotlib and seaborn. Currently, I am trying to learn probabilistic programming with PyMC3. I have also worked extensively with PySpark over Hadoop.”

Richard: “How worried do you think SAS should be with the emergence of R and Python?”

Daniel: “I think that SAS and their peers like IBM are already worried. There is quite a lot of evidence that R and Python are gaining traction, and I believe this trend will continue. The ecosystems of R and Python are growing every day, so I assume that SAS and the other vendors will find it increasingly difficult to justify the price tag in relation to an open-source framework.

“At the same time, they have a lot of industry experience, a well-established installed base with mission-critical applications that I assume they can leverage for up-selling and cross-selling. Packaged analytics is also an area where they seem quite strong. But given that analytics is increasingly becoming a differentiator, there might well be a trend towards moving critical capabilities in-house, so they are likely to face a very rapid change in their markets.

“This is of course a lot of speculation from my side as I have little insight into the internals of these companies and can only judge based on market forces.”

Richard: “Why do you think so many people are switching to R and more recently Python with respect to analysis?”

Daniel: “In general, I would say that the tools are strong choices now, so if you want to pick up a tool for advanced analytics, R and Python are default choices today. That is also likely to be driven by the fact that open source in general is driving the development within BI and analytics. Platforms such as Hadoop are becoming much more common, and any analytical frameworks on top of these platforms will most likely have a Python API and probably also R.

“The community support for R and Python is also very strong, and the fact that many online learning resources like Coursera are so focused on R and Python indicates a bright future for these tools.”

Richard: “And how do you expect this to play out over the next 2-3 years?”

Daniel: “I am fairly convinced that the current trends will continue. Open source will continue to grow, agility will be even more important, and analytics/machine-learning will increasingly be a part of the infrastructure. All these trends are transforming the BI and analytics space, and I believe this is a good thing.

“The coming 2-3 years will see a broader adoption of Hadoop or similar platforms, a continued push towards streaming/real-time analytics, and I think that many companies will be forced to rethink their data pipelines in order to scale out their analytics initiatives. Cloud will likely become more and more of a first choice for deployment.

“I expect the market to mature and best practices to become more widely used. For sure, an exciting time if you like change; not so much if not.”

Richard: “Do you see any new players entering the market that we should be aware of?”

Daniel: “A ‘new’ programming language to look out for is likely to be Scala, as it is gaining a lot of traction for distributed processing. Frameworks like Apache Flink and Spark have Scala as their primary language. For data science, it might be worth looking into Julia, which is much faster than both R and Python.

“Outside languages, there are so many new players entering the market, so it’s really hard to judge whom to watch.”

My name is Richard Downes, and I specialize in helping companies hire experienced Analytics, Artificial Intelligence, and Data Science professionals in Europe and the U.S. I have over 15 years of recruitment and staffing experience. If you are considering your next career move, or are a company needing to make an experienced hire in any of these areas, please feel free to get in touch and take a look at my video introduction embedded below.


I wonder if this advanced analytics could have assisted the ability of scientists to speak up on subjects such as atmospheric nuclear weapons testing, profligate use of OCs in agriculture, asbestos inclusion in building material, and (more lately) the flawed use of neonicotinoid pesticides to 'control' escalating insect pests in farming? Would it really be able to advance conclusive evidence of the dire consequences, which ordinary dialogue was unable to provide, to 'the powers that be' - scientifically ignorant, and morally repugnant? It would appear unlikely.

Richard Downes

Data and AI/Machine Learning Recruitment Specialist

8 years ago

Thanks to Ola Ottosson and Daniel Tidström. Really appreciated your help putting this one together!
