Amazing consistency: Largest Dataset Analyzed / Data Mined – Poll Results and Trends

The latest KDnuggets Poll asked:

What was the largest dataset you analyzed / data mined?

This poll received 1108 votes, about 10% fewer than in 2016, but still a large enough sample. The results again show surprising stability, fitting a pattern that had already emerged in 2012: a majority of data scientists and analysts work with data in the gigabyte range, and a small but notable segment works with web-scale data of over 100 petabytes.

Note that the poll asks about the largest dataset ever analyzed, so a typical dataset is expected to be significantly smaller.

Highlights:

  • Gigabytes still rule: A majority of answers (56% in 2018, 57% in 2016, 56% in 2015, 54% in 2014, 53% in 2013) are in the gigabyte range. The overall median response has been between 11 and 100 GB (which comfortably fits on one laptop) in each year since 2012.
  • Consistency: The shape of the curve is almost the same each year. In 2018 there were fewer responses in the under-10 MB range and more in the 1-10 GB range, but not significantly so.
  • Petabyte Big Data Scientists still stand apart: There is a small but significant gap, with almost no answers in the 1-10 PB range, which separates analysts who work with terabyte-size commercial data warehouses from those who work with 100+ petabyte web-scale data stores. See, for example, a recent story on Uber's current data warehouse of 100 PB.
  • Academic researchers on par with Government, Industry: The estimated median for academic researchers is 90 GB, on par with Government (60 GB) and Industry analysts (50 GB). The estimated median answer has increased a little for all segments in 2018.
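
The poll collects answers as size ranges, so a median has to be estimated rather than read off directly. The article does not say how its estimates were computed; the Python sketch below shows one common approach, interpolating inside the range that contains the middle response, on a log scale because each range spans roughly a factor of ten. The function name `binned_median` and the counts in `HYPOTHETICAL_BINS` are made up for illustration and are not the poll's data.

```python
def binned_median(bins, log_scale=True):
    """Estimate a median from binned counts.

    bins: list of (lower, upper, count) tuples sorted by lower bound,
          with lower > 0 when log_scale is True.
    """
    total = sum(count for _, _, count in bins)
    if total == 0:
        raise ValueError("no responses")
    half = total / 2
    cumulative = 0
    for lower, upper, count in bins:
        # The first bin where the running total reaches half holds the median.
        if count > 0 and cumulative + count >= half:
            fraction = (half - cumulative) / count
            if log_scale:
                # Geometric interpolation between the bin edges.
                return lower * (upper / lower) ** fraction
            # Linear interpolation between the bin edges.
            return lower + (upper - lower) * fraction
        cumulative += count
    raise ValueError("bins do not contain the median")


# Hypothetical counts (NOT the poll's data); bin edges are in GB.
HYPOTHETICAL_BINS = [
    (0.01, 0.1, 5),     # 10-100 MB
    (0.1, 1, 10),       # 100 MB-1 GB
    (1, 10, 20),        # 1-10 GB
    (10, 100, 30),      # 10-100 GB
    (100, 1000, 20),    # 100 GB-1 TB
    (1000, 10000, 15),  # 1-10 TB
]

log_estimate = binned_median(HYPOTHETICAL_BINS)
linear_estimate = binned_median(HYPOTHETICAL_BINS, log_scale=False)
print(f"log-scale estimate: {log_estimate:.1f} GB")
print(f"linear estimate: {linear_estimate:.1f} GB")
```

With these made-up counts the middle response falls halfway through the 10-100 GB range, so the log-scale estimate is about 31.6 GB while linear interpolation gives 55 GB. The choice of scale matters when ranges are this wide.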

Fig. 1: KDnuggets Poll: Largest Dataset Analyzed, 2014-2018

2018 data is shown as a column, to stand apart from lines for previous years.

This poll also asked about employment type. The breakdown was:

  • Company or Self-Employed, 62% (was also 62% in 2016)
  • Student, 17% (was 20% in 2016)
  • Academia/University, 13% (was 10% in 2016)
  • Government/non-profit, 4.8% (was 5.1% in 2016)
  • Other, 3.2% (was 2.4% in 2016)

Fig. 2: KDnuggets Poll: Largest Dataset 2018, by Employment. Red line shows the estimated median

Circle size corresponds to the number of responses.

Regional trends show slightly more voters from Latin America, the Middle East, and Australia, and slightly fewer from the US. The numbers were:

  • Europe, 34.9% (was 35.1% in 2016)
  • US/Canada, 34.4% (was 36.9% in 2016)
  • Asia, 15.6% (was 17% in 2016)
  • Latin America, 6.9% (was 5.6% in 2016)
  • Africa/Middle East, 4.9% (was 3.2% in 2016)
  • Australia/NZ, 3.2% (was 2.3% in 2016)

Finally, we examine the largest dataset analyzed by both employment and region for the 3 largest regions.

Read the rest on KDnuggets:

https://www.kdnuggets.com/2018/10/poll-results-largest-dataset-analyzed.html

Gene Ferruzza

SVP Decision Sciences at Targetbase Inc.

5 years ago

Interesting survey, particularly the international input. I'm curious if you think this is a measure of available data for analysis, or the typical size of an assembled dataset? In most cases the former is constantly growing. Thanks for conducting this survey!

Alastair Muir, PhD, BSc, BEd, MBB

Data Science Consultant | Risk Analysis and Optimization

5 years ago

It would be interesting to include the maximum number of records and variables rather than just large file sizes. I find problems become more complex with large numbers of data points.
