Amazing consistency: Largest Dataset Analyzed / Data Mined – Poll Results and Trends
Gregory Piatetsky-Shapiro
Part-time philosopher, Retired, Data Scientist, KDD and KDnuggets Founder, was LinkedIn Top Voice on Data Science & Analytics. Currently helping Ukrainian refugees in MA.
The latest KDnuggets Poll asked:
What was the largest dataset you analyzed / data mined?
This poll received 1108 votes, about 10% fewer than in 2016, but still a large enough sample. The results again show surprising stability, fitting a pattern that had already emerged in 2012: a majority of data scientists and analysts work with data in the Gigabyte range, and a small but notable segment works with web-scale data of over 100 Petabytes.
Note that the poll asks about the largest ever dataset, so a typical dataset analyzed is expected to be significantly smaller.
Highlights:
- Gigabytes still rule: The majority of answers (56% in 2018, 57% in 2016, 56% in 2015, 54% in 2014, 53% in 2013) are in the Gigabyte range. The overall median response was again between 11 and 100 GB (which comfortably fits on one laptop), as in each year since 2012.
- Consistency: the shape of the curve is almost the same each year. In 2018 there were fewer responses in the under-10MB range and more in the 1-10GB range, but not significantly so.
- Petabyte Big Data Scientists still stand apart: There is a small but significant gap, with almost no answers in the 1-10 PB range, which separates analysts who work with Terabyte-size commercial data warehouses from those who work with 100+ petabyte web-scale data stores. See, for example, a recent story on Uber's current data warehouse of 100PB.
- Academic researchers on par with Government, Industry: The estimated median for academic researchers is 90GB, on par with Government (60 GB) and Industry analysts (50 GB). The estimated median answer has increased a little for all segments in 2018.
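The article does not say how the estimated medians were computed. One common approach for binned poll answers is to find the bin that contains the middle response and interpolate inside it, on a log scale because the size ranges grow by powers of ten. The sketch below assumes that approach; the function name `binned_median`, the bin edges, and the counts are hypothetical and are not the poll's actual data.

```python
def binned_median(bins, counts):
    """Estimate a median from binned counts by log-scale interpolation.

    bins   -- list of (low, high) bounds per bin in one unit (here GB), low > 0
    counts -- number of responses per bin, in the same order as bins
    """
    half = sum(counts) / 2
    cumulative = 0
    for (low, high), n in zip(bins, counts):
        if n > 0 and cumulative + n >= half:
            # Fraction of this bin needed to reach the middle response.
            fraction = (half - cumulative) / n
            # Interpolate geometrically between the bin edges.
            return low * (high / low) ** fraction
        cumulative += n
    raise ValueError("counts must contain at least one response")


# Hypothetical example, NOT the poll's real counts:
# three bins of 1-10 GB, 10-100 GB, 100-1000 GB.
example_bins = [(1, 10), (10, 100), (100, 1000)]
example_counts = [30, 40, 30]
print(f"{binned_median(example_bins, example_counts):.1f} GB")
```

With these made-up counts the middle response falls halfway through the 10-100 GB bin, so the estimate is the geometric midpoint of that bin, roughly 31.6 GB.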
Fig. 1: KDnuggets Poll: Largest Dataset Analyzed, 2014-2018
2018 data is shown as a column, to stand apart from lines for previous years.
This poll also asked about employment type, and the breakdown was:
- Company or Self-Employed, 62% (was also 62% in 2016)
- Student, 17% (was 20% in 2016)
- Academia/University, 13% (was 10% in 2016)
- Government/non-profit, 4.8% (was 5.1% in 2016)
- Other, 3.2% (was 2.4% in 2016)
Fig. 2: KDnuggets Poll: Largest Dataset 2018, by Employment. Red line shows the estimated median
Circle size corresponds to the number of responses.
Regional trends show slightly more voters from Latin America, the Middle East, and Australia, and slightly fewer from the US. The numbers were:
- Europe, 34.9% (was 35.1%)
- US/Canada, 34.4% (was 36.9% in 2016)
- Asia, 15.6% (was 17%)
- Latin America, 6.9% (was 5.6%)
- Africa/Middle East, 4.9% (was 3.2%)
- Australia/NZ, 3.2% (was 2.3%)
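To make the regional shifts easier to compare, the shares listed above can be turned into percentage-point changes from 2016 to 2018. This is a small illustrative sketch: the numbers are copied from the list, while the names `shares` and `point_changes` are just chosen for this example.

```python
# Regional vote shares in percent as (2018, 2016), copied from the list above.
shares = {
    "Europe": (34.9, 35.1),
    "US/Canada": (34.4, 36.9),
    "Asia": (15.6, 17.0),
    "Latin America": (6.9, 5.6),
    "Africa/Middle East": (4.9, 3.2),
    "Australia/NZ": (3.2, 2.3),
}


def point_changes(shares):
    """Return the 2016-to-2018 change per region, in percentage points."""
    return {
        region: round(now - before, 1)
        for region, (now, before) in shares.items()
    }


changes = point_changes(shares)
# Print regions from the largest drop to the largest gain.
for region, delta in sorted(changes.items(), key=lambda item: item[1]):
    print(f"{region:20s} {delta:+.1f}")
```

The largest drop is US/Canada (2.5 points) and the largest gain is Africa/Middle East (1.7 points). Each column sums to about 100% (99.9 for 2018, 100.1 for 2016), with the small gap due to rounding in the published figures.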
Finally, we examine the largest dataset analyzed by both employment and region for the 3 largest regions.
Read the rest on KDnuggets:
Amazing consistency: Largest Dataset Analyzed / Data Mined – Poll Results and Trends
https://www.kdnuggets.com/2018/10/poll-results-largest-dataset-analyzed.html
SVP Decision Sciences at Targetbase Inc.
5 years ago: Interesting survey, particularly the international input. I'm curious if you think this is a measure of available data for analysis, or the typical size of an assembled dataset? In most cases the former is constantly growing. Thanks for conducting this survey!
Data Science Consultant | Risk Analysis and Optimization
5 years ago: It would be interesting to include the maximum number of records and variables rather than just large file sizes. I find problems become more complex with large numbers of data points.