Which Big Data, Data Mining, and Data Science Tools go together?

We took anonymized data from the results of the 2015 KDnuggets Data Mining Software Poll and performed association analysis on the top 20 tools. The dataset consisted of 2759 votes, each for one or more tools. At the bottom of this post there is a link to download the anonymized dataset.

We used a version of the Apriori algorithm to analyze the results.
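The post does not say which Apriori variant was used. As a rough illustration only, the first two passes of a textbook Apriori (count single tools, then count only pairs built from tools that were frequent on their own) might look like the sketch below. The input format (a list of sets of tool names), the function name `frequent_pairs` and the `min_support` parameter are assumptions made for this example, not details from the poll analysis.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(votes, min_support):
    """Return {pair: support} for tool pairs with support >= min_support.

    votes: list of sets of tool names (hypothetical input format).
    min_support: minimum fraction of votes that must contain the itemset.
    """
    n = len(votes)
    # Pass 1: count single tools and keep the frequent ones.
    item_counts = Counter(tool for vote in votes for tool in vote)
    frequent_items = {t for t, c in item_counts.items() if c / n >= min_support}
    # Pass 2: count only pairs made of frequent tools (the Apriori pruning step).
    pair_counts = Counter()
    for vote in votes:
        kept = sorted(vote & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {p: c / n for p, c in pair_counts.items() if c / n >= min_support}


# Toy votes, invented for illustration (not poll data).
toy_votes = [{"R", "Python"}, {"R", "Python", "SQL"}, {"R"}, {"SQL"}]
pairs = frequent_pairs(toy_votes, min_support=0.5)
```

Pairs are stored as alphabetically sorted tuples, so each unordered pair is counted once.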

There are many ways to measure how significant the association between two nominal or binary features is, e.g. chi-square or t-test, but we use a simple measure we call "Lift", defined as

Lift (X & Y) = pct (X & Y) / ( pct (X) * pct (Y) )

where pct(X) is the percent of users who selected X, expressed as a fraction (e.g. 0.25 for 25%).

Lift (X & Y) > 1 indicates that X and Y appear together more often than expected if they were independent,
Lift (X & Y) = 1 if X and Y appear together as often as expected if they are independent, and
Lift (X & Y) < 1 if X and Y appear together less often than expected (negatively correlated).

Note that this measure is symmetric: Lift (X & Y) = Lift (Y & X)
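The lift definition above can be computed directly from the votes. The following is a minimal sketch, assuming each vote is a set of tool names and pct is taken as a fraction of all votes; the function name `lift` and the toy data are made up for illustration and are not the poll dataset.

```python
def lift(votes, x, y):
    """Lift(X & Y) = pct(X & Y) / (pct(X) * pct(Y)), with pct as a fraction.

    votes: list of sets of tool names (hypothetical input format).
    """
    n = len(votes)
    pct_x = sum(x in v for v in votes) / n
    pct_y = sum(y in v for v in votes) / n
    pct_xy = sum(x in v and y in v for v in votes) / n
    if pct_x == 0 or pct_y == 0:
        raise ValueError("lift is undefined when a tool has no votes")
    return pct_xy / (pct_x * pct_y)


# Toy votes, invented for illustration (not poll data).
toy_votes = [{"Spark", "Hadoop"}, {"Spark", "Hadoop"}, {"R"}, {"R"}]
spark_hadoop = lift(toy_votes, "Spark", "Hadoop")
```

In this toy data Spark and Hadoop each appear in half of the votes and always together, so the lift is 0.5 / (0.5 * 0.5) = 2. Swapping the arguments gives the same value, which matches the symmetry noted above.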

Fig. 1 shows the heat map for the top 10 Data Mining tools. The lift values are displayed in their respective matrix positions, and the color gradient represents the degree of association from high to low.
If the lift is above 1.2 the square is reddish, if it is below 0.8 the square is bluish, and otherwise it is grey.
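The coloring rule for the heat map can be written as a small helper. This is only a sketch of the thresholds stated above (1.2 and 0.8); the function name `lift_color` and the plain color names are assumptions, and the actual figure uses a gradient rather than three flat colors.

```python
def lift_color(value, high=1.2, low=0.8):
    """Map a lift value to the color family used for its heat-map square."""
    if value > high:
        return "red"   # stronger than expected association
    if value < low:
        return "blue"  # weaker than expected association
    return "grey"      # close to independence


colors = [lift_color(v) for v in (3.31, 1.0, 0.51)]
```

Values exactly equal to 1.2 or 0.8 fall into the grey band here, since the rule in the text uses strict comparisons.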

Spark and Hadoop have the highest association, with lift=3.31, followed by Spark and Python (lift=2.05). We also note strong associations between Excel and SQL, and between Tableau and SQL.

The lowest associations were found between SAS base and KNIME (0.51), SAS base and RapidMiner (0.52), and KNIME and RapidMiner (0.56).

A similar heat map (Fig. 2) was computed to show the associations between Open Source and Commercial tools.

See the rest of the post on KDnuggets: https://www.kdnuggets.com/2015/06/data-mining-data-science-tools-associations.html

Although new, Oracle's BDD and Infocaptor also work well with Hadoop.

Mark Ortega

BY/JDA Fulfillment, Demand | Product Owner | Rapid Solutions

9 years ago

Thanks. A useful classification tip sheet.

Mark Peterson

AI Leader | Strategic Communication

9 years ago

I agree with Roger: Teradata Aster and R. They complement each other very well.

Roger Fried

Senior Director, Data Management and Business Intelligence

9 years ago

R and Teradata Aster with the AsterR package

Mariano Silva

Analytics | Big Data | Business Intelligence | Data Engineering | Data Governance | Lean 6 Sigma Black Belt | Machine Learning

9 years ago

Nice. We use R+SQL (1.30) and R+Smthg-like-Tableau (1.42)
