Amazing consistency: Largest Dataset Analyzed / Data Mined – Poll Results and Trends

Gregory Piatetsky-Shapiro

Part-time philosopher, Retired, Data Scientist, KDD and KDnuggets Founder, was LinkedIn Top Voice on Data Science & Analytics. Currently helping Ukrainian refugees in MA.

Published: October 31, 2018

The latest KDnuggets Poll asked:

What was the largest dataset you analyzed / data mined?

This poll received 1108 votes, about 10% fewer than in 2016, but still a large enough sample. The results again show surprising stability, fitting a pattern that had already emerged in 2012: a majority of data scientists and analysts work with data in the gigabyte range, and a small but notable segment works with web-scale data of over 100 petabytes.

Note that the poll asks about the largest dataset ever analyzed, so a typical dataset is expected to be significantly smaller.

Highlights:

Gigabytes still rule: The majority of answers (56% in 2018, 57% in 2016, 56% in 2015, 54% in 2014, 53% in 2013) are in the gigabyte range. In every year since 2012, the overall median response has been between 11 and 100 GB, which comfortably fits on one laptop.
Consistency: The shape of the curve is almost the same each year. In 2018 there were fewer responses in the under-10 MB range and more in the 1-10 GB range, but not significantly so.
Petabyte Big Data Scientists still stand apart: There is a small but significant gap, with almost no answers in the 1-10 PB range, which separates analysts who work with terabyte-size commercial data warehouses from those who work with 100+ petabyte web-scale data stores. See, for example, a recent story on Uber's current 100 PB data warehouse.
Academic researchers on par with Government, Industry: The estimated median for academic researchers is 90 GB, on par with Government (60 GB) and Industry analysts (50 GB). The estimated median answer increased a little for all segments in 2018.
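
The poll reports answers in size ranges, so a "median response" means the range that contains the middle vote. The sketch below shows one way to find that range from binned counts. The bin labels and counts are hypothetical placeholders, not the actual poll data, and the article does not say how its estimated medians were computed.

```python
from itertools import accumulate


def median_bin(labels, counts):
    """Return the label of the ordered bin that contains the median response."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no responses")
    half = total / 2
    # Walk the bins in size order until the running total reaches half the votes.
    for label, running in zip(labels, accumulate(counts)):
        if running >= half:
            return label
    raise AssertionError("unreachable: running total always reaches the total")


# HYPOTHETICAL bins and counts, for illustration only (not the poll's numbers).
labels = ["under 10 MB", "10 MB-1 GB", "1-10 GB", "11-100 GB", "101 GB-1 TB", "over 1 TB"]
counts = [5, 10, 20, 30, 25, 10]

print(median_bin(labels, counts))
```

With these made-up counts the running totals are 5, 15, 35, 65, so the middle vote falls in the fourth bin.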

Fig. 1: KDnuggets Poll: Largest Dataset Analyzed, 2014-2018

2018 data is shown as a column, to stand apart from lines for previous years.

This poll also asked about employment type. The breakdown was:

Company or Self-Employed, 62% (was also 62% in 2016)
Student, 17% (was 20% in 2016)
Academia/University, 13% (was 10% in 2016)
Government/non-profit, 4.8% (was 5.1% in 2016)
Other, 3.2% (was 2.4% in 2016)

Fig. 2: KDnuggets Poll: Largest Dataset 2018, by Employment. Red line shows the estimated median

Circle size corresponds to the number of responses.

Regional trends show slightly more voters from Latin America, the Middle East, and Australia, and slightly fewer from the US. The numbers were:

Europe, 34.9% (was 35.1% in 2016)
US/Canada, 34.4% (was 36.9% in 2016)
Asia, 15.6% (was 17% in 2016)
Latin America, 6.9% (was 5.6% in 2016)
Africa/Middle East, 4.9% (was 3.2% in 2016)
Australia/NZ, 3.2% (was 2.3% in 2016)
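
To make the regional shifts easier to compare, the short sketch below computes the change in percentage points for each region from the figures listed above. It assumes every "was" value refers to the 2016 poll. The variable names are illustrative.

```python
# Regional shares of voters in percent, copied from the list above.
shares_2018 = {
    "Europe": 34.9,
    "US/Canada": 34.4,
    "Asia": 15.6,
    "Latin America": 6.9,
    "Africa/Middle East": 4.9,
    "Australia/NZ": 3.2,
}
# Assumption: all "was" values are from the 2016 poll.
shares_2016 = {
    "Europe": 35.1,
    "US/Canada": 36.9,
    "Asia": 17.0,
    "Latin America": 5.6,
    "Africa/Middle East": 3.2,
    "Australia/NZ": 2.3,
}

# Change in percentage points, rounded to one decimal to hide float noise.
changes = {
    region: round(shares_2018[region] - shares_2016[region], 1)
    for region in shares_2018
}
gainers = [region for region, delta in changes.items() if delta > 0]

for region, delta in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{region:<20}{delta:+.1f} pp")
```

The regions with a positive change are Latin America, Africa/Middle East, and Australia/NZ. US/Canada shows the largest drop, at 2.5 percentage points.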

Finally, we examine the largest dataset analyzed by both employment and region for the 3 largest regions.

Read the rest on KDnuggets:

Amazing consistency: Largest Dataset Analyzed / Data Mined – Poll Results and Trends

https://www.kdnuggets.com/2018/10/poll-results-largest-dataset-analyzed.html

Gene Ferruzza

SVP Decision Sciences at Targetbase Inc.

5 years ago

Interesting survey, particularly the international input. I'm curious if you think this is a measure of available data for analysis, or the typical size of an assembled dataset? In most cases the former is constantly growing. Thanks for conducting this survey!

Alastair Muir, PhD, BSc, BEd, MBB

Data Science Consultant | Risk Analysis and Optimization

5 years ago

It would be interesting to include the maximum number of records and variables rather than just large file sizes. I find problems become more complex with large numbers of data points.
