Which Big Data, Data Mining, and Data Science Tools go together?
Gregory Piatetsky-Shapiro
Part-time philosopher, Retired, Data Scientist, KDD and KDnuggets Founder, was LinkedIn Top Voice on Data Science & Analytics. Currently helping Ukrainian refugees in MA.
We took anonymized data from the results of the 2015 KDnuggets Data Mining Software Poll and performed association analysis on the top 20 tools. The dataset consisted of 2759 votes, each for one or more tools. At the bottom of this post there is a link to download the anonymized dataset.
We used a version of the Apriori algorithm to analyze the results.
There are many ways to measure how significant the association between two nominal or binary features is, e.g. the chi-square test or t-test, but we use a simple measure we call "Lift", defined as
Lift (X & Y) = pct (X & Y) / ( pct (X) * pct (Y) )
where pct(X) is the percent of users who selected X.
Lift (X & Y) > 1 indicates that X and Y appear together more often than expected if they were independent,
Lift (X & Y) = 1 if X and Y appear together with the frequency expected if they were independent, and
Lift (X & Y) < 1 if X and Y appear together less often than expected (negatively correlated).
Note that this measure is symmetric: Lift (X & Y) = Lift (Y & X).
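As a rough illustration of the definition above, here is a minimal Python sketch that computes pairwise lift from a list of votes. The function name `pairwise_lift` and the toy votes are hypothetical and are not the poll data. It treats pct(X) as a fraction between 0 and 1, and uses plain pair counting rather than the Apriori variant used for the actual analysis.

```python
from collections import Counter
from itertools import combinations


def pairwise_lift(votes):
    """Return {(x, y): lift} for every pair of tools that co-occurs in at least one vote.

    Each vote is an iterable of tool names. Pair keys are alphabetically sorted tuples.
    Pairs that never appear together are omitted (their lift would be 0).
    """
    n = len(votes)
    single = Counter()  # number of votes containing each tool
    pair = Counter()    # number of votes containing each pair of tools
    for vote in votes:
        tools = sorted(set(vote))
        single.update(tools)
        pair.update(combinations(tools, 2))

    lifts = {}
    for (x, y), count in pair.items():
        pct_xy = count / n
        lifts[(x, y)] = pct_xy / ((single[x] / n) * (single[y] / n))
    return lifts


# Hypothetical toy votes, for illustration only
votes = [
    {"Spark", "Hadoop"},
    {"Spark", "Hadoop", "Python"},
    {"R", "Python"},
    {"R"},
]
lifts = pairwise_lift(votes)
print(lifts[("Hadoop", "Spark")])
```

Because each pair is stored once under a sorted key, the symmetry Lift (X & Y) = Lift (Y & X) holds by construction.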
Fig. 1 shows the heat map for the top 10 Data Mining tools. The lift values are displayed in their respective matrix positions and the color gradient represents the degree of association from high to low.
Squares with lift above 1.2 are reddish, those with lift below 0.8 are bluish, and the rest are grey.
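The coloring rule can be written as a small helper. This is only a sketch: the function name `lift_color` is hypothetical, and treating values of exactly 1.2 or 0.8 as grey is an assumption, since the post does not say how boundary values are handled.

```python
def lift_color(lift, high=1.2, low=0.8):
    """Map a lift value to the color band used in the heat map."""
    if lift > high:
        return "reddish"
    if lift < low:
        return "bluish"
    return "grey"  # assumption: boundary values fall in the grey band


print(lift_color(3.31), lift_color(0.51), lift_color(1.0))
```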
Spark and Hadoop have the highest association, with a lift of 3.31, followed by Spark and Python (lift of 2.05). We also note a strong association between Excel and SQL, and between Tableau and SQL.
The lowest associations were found between SAS base and KNIME (0.51), SAS base and RapidMiner (0.52), and KNIME and RapidMiner (0.56).
A similar heat map (Fig. 2) was computed showing the various associations between Open Source and Commercial tools.
See the rest of the post on KDnuggets: https://www.kdnuggets.com/2015/06/data-mining-data-science-tools-associations.html
Although they are new, Oracle's BDD and Infocaptor also work well with Hadoop.
BY/JDA Fulfillment, Demand | Product Owner | Rapid Solutions
(9 years ago) Thanks. A useful classification tip sheet.
AI Leader | Strategic Communication
(9 years ago) I agree with Roger: Teradata Aster and R. They complement each other very well.
Senior Director, Data Management and Business Intelligence
(9 years ago) R and Teradata Aster with the AsterR package
Analytics | Big Data | Business Intelligence | Data Engineering | Data Governance | Lean 6 Sigma Black Belt | Machine Learning
(9 years ago) Nice. We use R+SQL (1.30) and R + something like Tableau (1.42).