What Big Data, Data Science, Deep Learning software goes together?

Last week, I reported the results of the 2016 KDnuggets Software Poll: R, Python Duel As Top Analytics, Data Science software.

This post looks a little deeper and examines the associations between different tools, their relationship to Big Data and Deep Learning, and regional patterns. At the end of the post there is a link to the anonymized dataset, so that you can do your own analysis (and let me know about the results in the comments).

The question asked in the KDnuggets Poll was:

What software you used for Analytics, Data Mining, Data Science, Machine Learning projects in the past 12 months?



Because the question covers all projects in the past 12 months, two tools selected by the same voter were not necessarily used on the same project. Still, we can assume some affinity between those tools, since data scientists tend to use their favorite tool combinations on many projects.

First, we looked at associations between the top 10 tools.

There are many ways to measure how significant the association between two nominal or binary features is, like chi-square or T-test, but as we did in our 2015 analysis (Which Big Data, Data Mining, and Data Science Tools go together?), we used a simple measure we call "Lift", defined as

Lift (X & Y) = pct (X & Y) / ( pct (X) * pct (Y) )

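where pct(X) is the percent of voters who selected X (expressed as a fraction) and pct(X & Y) is the percent who selected both X and Y. A lift above 1 means the two tools were selected together more often than would be expected if they were independent; a lift below 1 means less often.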
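As a rough illustration of the definition above, here is a minimal Python sketch of the lift computation. The function name `lift`, the representation of each voter as a set of tool names, and the toy votes are all assumptions made for this example; they are not the poll data or the code used for the original analysis. The sketch treats pct as a fraction of voters, so that a lift of 1 corresponds to independence.

```python
def lift(votes, x, y):
    """Lift(X & Y) = pct(X & Y) / (pct(X) * pct(Y)).

    `votes` is a list of sets, one set of selected tools per voter.
    pct is computed as a fraction of voters (0..1), not 0..100.
    Assumes both x and y were selected by at least one voter.
    """
    n = len(votes)
    pct_x = sum(x in v for v in votes) / n
    pct_y = sum(y in v for v in votes) / n
    pct_xy = sum(x in v and y in v for v in votes) / n
    return pct_xy / (pct_x * pct_y)


# Hypothetical toy data, NOT the poll results.
votes = [
    {"Hadoop", "Spark"},
    {"Hadoop", "Spark"},
    {"R"},
    {"R", "Excel"},
]

hadoop_spark = lift(votes, "Hadoop", "Spark")  # always selected together
r_spark = lift(votes, "R", "Spark")            # never selected together
print(hadoop_spark, r_spark)
```

In this toy data Hadoop and Spark always appear together, so their lift is above 1, while R and Spark never co-occur, so their lift is 0.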

Fig. 1 below shows the pairwise lift between the top 10 tools. The diagonal is left blank, since we don't need a lift between a tool and itself.

Note that the chart is symmetric relative to the diagonal, since the lift measure is symmetric:
Lift (X & Y) = Lift (Y & X).

However, we show both upper and lower triangles of the chart to make it easier to see patterns across rows and columns.
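A chart like this could be built from a pairwise lift table. The sketch below is one possible way to do that, assuming each voter is represented as a set of tool names; `lift_matrix` and the toy votes are hypothetical and are not taken from the poll. It computes each pair once, mirrors the value across the diagonal, and leaves the diagonal blank (`None`).

```python
from itertools import combinations


def lift_matrix(votes, tools):
    """Return a nested dict m[a][b] of pairwise lift values.

    Each unordered pair is computed once and written to both triangles,
    since Lift(X & Y) = Lift(Y & X). Diagonal entries stay None.
    Assumes every tool in `tools` was selected by at least one voter.
    """
    n = len(votes)
    pct = {t: sum(t in v for v in votes) / n for t in tools}
    matrix = {a: {b: None for b in tools} for a in tools}
    for a, b in combinations(tools, 2):
        pct_both = sum(a in v and b in v for v in votes) / n
        value = pct_both / (pct[a] * pct[b])
        matrix[a][b] = value
        matrix[b][a] = value
    return matrix


# Hypothetical toy data, NOT the poll results.
votes = [
    {"R", "Python"},
    {"R", "SQL"},
    {"Python", "SQL"},
    {"R", "Python", "SQL"},
]
tools = ["R", "Python", "SQL"]
m = lift_matrix(votes, tools)

for a in tools:
    row = ["  -  " if m[a][b] is None else f"{m[a][b]:.2f}" for b in tools]
    print(f"{a:>6}", *row)
```

Reading across a row of the printed table corresponds to reading one row of the chart: values above 1 mark tools used together more than expected, values below 1 mark tools used together less than expected.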

We note that R and Python generally "play" well with other tools except RapidMiner; SQL and Excel go well with Tableau; RapidMiner is used less with other tools than average; and Hadoop and Spark are used more with other tools. Not surprisingly, scikit-learn is used most with Python, but interestingly, also with Spark.

Spark and Hadoop have the highest lift=2.66, while RapidMiner and scikit-learn have the lowest lift=0.59.

Next, we look at associations between the top 5 commercial and the top 5 free tools.

We note that R, Tableau and MATLAB "play" well with other tools, while RapidMiner users use fewer other tools than average.

Read the rest of the post on KDnuggets, including a very interesting chart ranking top tools by their R/Python bias:

What Big Data, Data Science, Deep Learning software goes together?

https://www.kdnuggets.com/2016/06/big-data-science-deep-learning-software-associations.html

Mike Tatsky

CTO at FancyGrid

8 years ago

Your post reminds me what we are doing at FancyGrid - https://fancygrid.com

Padmakumar (PK) Nambiar

Director Of Product Development @ Oracle | OCI SaaS, GenAI, DS & ML | ex-IBM

8 years ago

Greg, thanks for this post; it has some interesting observations. I'm assuming the data is based on a sample of users of these tools/languages from across the industry?

Eric W.

Statistical Modelling Analyst

8 years ago

Alberto de Santos Sierra, this is relevant to our discussion a few days ago.
