Which Big Data, Data Mining, and Data Science Tools go together?

Gregory Piatetsky-Shapiro

Part-time philosopher, Retired, Data Scientist, KDD and KDnuggets Founder, was LinkedIn Top Voice on Data Science & Analytics. Currently helping Ukrainian refugees in MA.

Published: June 25, 2015

We took anonymized data from the results of the 2015 KDnuggets Data Mining Software Poll and performed association analysis on the top 20 tools. The dataset consisted of 2759 votes, each for one or more tools. At the bottom of this post there is a link to download the anonymized dataset.

We used a version of the Apriori algorithm to analyze the results.

There are many ways to measure how significant the association is between two nominal or binary features, e.g. chi-square or t-test, but we use a simple measure we call "Lift", defined as

Lift (X & Y) = pct (X & Y) / ( pct (X) * pct (Y) )

where pct(X) is the percent of users who selected X.

Lift (X & Y) > 1 indicates that X and Y appear together more often than expected if they were independent,
Lift = 1 if X and Y appear together with the frequency expected if they are independent, and
Lift < 1 if X and Y appear together less often than expected (negatively correlated).

Note that this measure is symmetric: Lift (X & Y) = Lift (Y & X)
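The definition above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function name `lift` and the toy `votes` list are hypothetical, and pct is treated as a fraction between 0 and 1 (an assumption, needed so that Lift = 1 corresponds to independence).

```python
def lift(votes, x, y):
    """Lift(X & Y) = pct(X & Y) / (pct(X) * pct(Y)).

    `votes` is a list of sets, one set of tool names per voter.
    pct is computed as a fraction of all votes (assumption).
    """
    n = len(votes)
    pct_x = sum(1 for v in votes if x in v) / n
    pct_y = sum(1 for v in votes if y in v) / n
    pct_xy = sum(1 for v in votes if x in v and y in v) / n
    if pct_x == 0 or pct_y == 0:
        raise ValueError("lift is undefined when a tool has no votes")
    return pct_xy / (pct_x * pct_y)


# Hypothetical toy data, NOT taken from the poll.
votes = [
    {"Spark", "Hadoop"},
    {"Spark", "Hadoop", "Python"},
    {"R"},
    {"R", "Python"},
]

print(lift(votes, "Spark", "Hadoop"))  # tools that always appear together
print(lift(votes, "R", "Python"))      # tools that look independent
print(lift(votes, "Spark", "R"))       # tools that never appear together
```

In this toy data the three pairs land above 1, at 1, and below 1 respectively, and swapping the two arguments gives the same value, which matches the symmetry noted above.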

Fig. 1 shows the heat map for the top 10 Data Mining tools. The lift values are displayed in their respective matrix positions, and the color gradient represents the degree of association from high to low.
If the lift is greater than 1.2 the square is reddish; if less than 0.8, bluish; otherwise grey.
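The coloring rule can be written as a small helper. This is a minimal sketch: the function name `heatmap_color` is hypothetical, and the handling of values exactly equal to 1.2 or 0.8 (treated as grey here) is an assumption based on the strict wording of the rule.

```python
def heatmap_color(lift_value):
    """Map a lift value to the color family used in the heat map."""
    if lift_value > 1.2:
        return "reddish"   # noticeably more co-usage than expected
    if lift_value < 0.8:
        return "bluish"    # noticeably less co-usage than expected
    return "grey"          # close to independence


# Lift values quoted in this post.
for value in (3.31, 2.05, 0.51):
    print(value, heatmap_color(value))
```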

Spark and Hadoop have the highest association, with lift=3.31, followed by Spark and Python (lift=2.05). We also note strong associations between Excel and SQL, and between Tableau and SQL.

The lowest associations were found between SAS base and KNIME (0.51), SAS base and RapidMiner (0.52), and KNIME and RapidMiner (0.56).

A similar heat map (Fig. 2) was computed showing the various associations between Open Source and Commercial tools.

See the rest of the post on KDnuggets: https://www.kdnuggets.com/2015/06/data-mining-data-science-tools-associations.html

Nilesh Jethwa

9 years ago

Although new, Oracle's BDD and Infocaptor also work well with Hadoop.

Mark Ortega

BY/JDA Fulfillment, Demand | Product Owner | Rapid Solutions

9 years ago

Thanks. A useful classification tip sheet.

Mark Peterson

AI Leader | Strategic Communication

9 years ago

I agree with Roger, Teradata Aster and R. They complement each other very well.

Roger Fried

Senior Director, Data Management and Business Intelligence

9 years ago

R and Teradata Aster with the AsterR package

Mariano Silva

9 years ago

Nice. We use R+SQL (1.30) and R+Smthg-like-Tableau (1.42)
