Python eats away at R: Top Software for Analytics, Data Science, Machine Learning in 2018 - Trends and Analysis
Gregory Piatetsky-Shapiro
Part-time philosopher, Retired, Data Scientist, KDD and KDnuggets Founder, was LinkedIn Top Voice on Data Science & Analytics. Currently helping Ukrainian refugees in MA.
The 19th annual KDnuggets Software Poll had over 2,300 voters, somewhat fewer than in 2017, perhaps because only one vendor, RapidMiner, ran a very active campaign encouraging its users to vote. On average, a participant selected about 7 different tools, so votes naming just one tool stood out. We removed about 260 such "lone" votes (mainly from RapidMiner), because even if they represented legitimate users of that tool, their experience was very atypical and would skew the results. Here is my initial analysis, based on the 2,052 participants who remained after "lone" voters were removed. A more detailed association analysis and anonymized data will be published in about 2 weeks.
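The "lone" vote filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ballot structure (one set of tool names per voter), the function names drop_lone_votes and tool_share, and the toy ballots are all hypothetical, and none of this is the actual KDnuggets data or processing code.

```python
from typing import List, Set


def drop_lone_votes(ballots: List[Set[str]]) -> List[Set[str]]:
    """Keep only ballots that name more than one tool."""
    return [ballot for ballot in ballots if len(ballot) > 1]


def tool_share(ballots: List[Set[str]], tool: str) -> float:
    """Percentage of ballots that include the given tool."""
    if not ballots:
        return 0.0
    return 100.0 * sum(tool in ballot for ballot in ballots) / len(ballots)


# Hypothetical toy ballots, not real poll data.
ballots = [
    {"Python", "SQL", "scikit-learn"},
    {"RapidMiner"},                      # "lone" vote
    {"R", "SQL", "Excel"},
    {"RapidMiner"},                      # "lone" vote
    {"Python", "RapidMiner", "Tableau"},
]

kept = drop_lone_votes(ballots)

print(f"Ballots kept: {len(kept)} of {len(ballots)}")
print(f"RapidMiner share, all ballots:  {tool_share(ballots, 'RapidMiner'):.1f}%")
print(f"RapidMiner share, lone removed: {tool_share(kept, 'RapidMiner'):.1f}%")
```

In this toy data, dropping the two single-tool ballots lowers the RapidMiner share from 60% to about 33%, which is the kind of skew the filter is meant to limit.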
Fig 1: KDnuggets Analytics/Data Science 2018 Software Poll: top tools in 2018, and their share in the 2016 and 2017 polls (* for a more valid comparison, we recomputed the results of the 2016 and 2017 polls to exclude "lone" votes)
Here are the top 11 tools, which all had at least 20% share.
Table 1: Top Analytics/Data Science/ML Software in 2018 KDnuggets Poll
Software        2018 % share    % change 2018 vs 2017
Python          65.6%           11%
RapidMiner      52.7%           65%
R               48.5%           -14%
SQL             39.6%           1%
Excel           39.1%           24%
Anaconda        33.4%           37%
Tensorflow      29.9%           32%
Tableau         26.4%           21%
scikit-learn    24.4%           11%
Keras           22.2%           108%
Here, 2018 % share is the percentage of voters who used the tool, and % change is the relative change in share versus the 2017 Software Poll, with green and red highlighting changes up and down of 10% or more. The average number of tools per respondent was 7.0, slightly higher than 6.75 in the 2017 poll (also excluding one-tool responses). Compared to the 2017 Software Poll, the one new entry is Keras. Knime dropped out of the top 11, perhaps because this year it did not run a campaign among its users to vote. Here are some observations.
Python eats away at R
Python already had over 50% share in 2017 and increased its share to 66% in 2018, while R's share decreased for the first time since we started this poll and dropped below 50%.
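Since Table 1 lists only the 2018 share and the change versus 2017, the 2017 shares behind statements like this can be backed out approximately. The sketch below assumes the % change column is a relative change (which is consistent with Keras showing 108%) and that the table values are rounded, so the results are estimates only. The function names are illustrative, not part of any poll tooling.

```python
def implied_prior_share(share_now: float, pct_change: float) -> float:
    """Back out last year's share from this year's share and the relative % change."""
    return share_now / (1.0 + pct_change / 100.0)


def relative_change(share_prior: float, share_now: float) -> float:
    """Relative change in share, in percent."""
    return 100.0 * (share_now - share_prior) / share_prior


# (2018 % share, % change vs 2017) for a few rows copied from Table 1.
table_1 = {
    "Python": (65.6, 11),
    "RapidMiner": (52.7, 65),
    "R": (48.5, -14),
    "Keras": (22.2, 108),
}

for tool, (share_2018, change) in table_1.items():
    share_2017 = implied_prior_share(share_2018, change)
    # Round-trip check: recomputing the change should reproduce the table value.
    check = relative_change(share_2017, share_2018)
    print(f"{tool:<12} implied 2017 share: {share_2017:5.1f}%  (change {check:+.0f}%)")
```

Under these assumptions Python's implied 2017 share comes out near 59% and R's near 56%, in line with Python already being above 50% and R falling below 50% only in 2018.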
RapidMiner surges
RapidMiner, which was the top Data Science platform in the past several polls, dramatically increased its share to about 50%, up from 33% in 2017. What part of this is due to user growth, and what part to vendor promotion? I asked RapidMiner what they did to encourage their users, and here is a response from Ingo Mierswa, RapidMiner founder and president.
"Like many vendors, RapidMiner promotes the KDnuggets survey to users through a number of channels, including sending a few emails to people who have used our product in the past 12 months. We've done the same promotion before, but two different things happened this year. First we received a much better response. Over 400 users personally replied to my email expressing how happy they were to help us out. But more importantly, we've seen a 300% increase in monthly active RapidMiner users over the past year, so we emailed more people than in prior years. We're humbled to have such an engaged and loyal user community."
For the record I note that RapidMiner is not a current advertiser on KDnuggets.
SQL is steady
SQL, including Spark SQL, and SQL to Hadoop tools, continues to have a share of about 40% in each of the last 3 polls. So, if you are an aspiring Data Scientist, learn SQL - it will likely be useful for a long while!
Consolidation
We note that among the 56 tools with 2% or higher share in 2017, 19 (only about one third) increased their share in 2018, while 37 dropped in share. This, along with recent acquisitions (Datawatch buying Angoss, Minitab buying Salford), suggests that consolidation of Data Science platforms is under way.
... See the rest of the article, including our analysis of Deep Learning software, Hadoop vs Spark, other trends, and 3 years of data on all tools, on KDnuggets:
https://www.kdnuggets.com/2018/05/poll-tools-analytics-data-science-machine-learning-results.html
Docker | IIT Madras
6 years ago: True! I expect SQL to move up the list... I feel it's one of the underrated data analytics languages; many Machine Learning algorithms, such as K-means and other clustering algorithms, can easily be done in SQL. An advantage of SQL over Python is that it provides persistent data storage.
Co-Founder
6 years ago: My green school dream https://www.ted.com/talks/john_hardy_my_green_school_dream
Hands-on Technology Executive | Blockchain Cloud Security
6 years ago: In the same spirit, let's try to analyze the market share of various utensils in the food market. Forks are clearly the most frequently used, followed by knives, followed by spoons and by teaspoons (not sure which of the latter two is used more frequently). What's lost from this analysis is the fact that they are all different tools and each is used for a different purpose!
Hands-on Technology Executive | Blockchain Cloud Security
6 years ago: This analysis is orders of magnitude less meaningful than the proverbial apples and oranges! It compares a general-purpose scripting language (Python) with a domain-specific programming language (R) with a general relational query language (SQL) with a general-purpose spreadsheet tool (Excel) with a machine learning framework/toolchain (TensorFlow) with a graphical multi-dimensional data analysis platform (Tableau). Can you spell gobbledygook?