How Disks Became 150X Slower Since 1985
The sad truth is that computers have become slower than ever. In this post I'm going to explain why that's the case for disks, and prove it with facts. If your current personal computer or business server feels strangely slower than the systems you recall from 20 years ago, that's because (Moore's Law be damned) it actually is.

Unless you're in a cave on an island in the middle of the Indian Ocean, you haven't escaped the buzz around Big Data. Vast amounts of data are collected and analyzed from every transaction, and increasingly from a wide range of devices (the Internet of Things). The insights from this data will change the world, personalize our interactions, and help cure disease. Just when we thought we couldn't possibly create more data than the number of people on the planet, we've found new ways to explode our data by another order of magnitude: capturing it not only from people, but from a wide range of devices and companies. Then, of course, there is metadata: the data that describes the data itself.
All this data, Terabytes, Petabytes and Exabytes of it, surely requires vast physical resources. Even our personal computers and phones now require dozens if not hundreds of gigabytes to store the data we use daily.
Relax - disks are now so cheap and RAM so large that we are swimming in physical resources.
We're swimming in a sea of cheap, plentiful random access storage - whether it is HDDs, SSDs, RAM or the emerging Storage Class Memory technologies. Or so goes the myth. Truth: it's a fantasy wrapped in a delusion. The harsh reality is that companies are buckling under the weight of this data and the costs of the IT infrastructure to contain and process it. While disk prices have fallen, the volumes have risen so quickly, and the uses for the data have expanded so widely, that immense capital expenditures are needed for the associated RAM and CPUs to examine this data and derive useful information. Here are some sobering considerations on the specific issue of disk speed:
Over the past 30 years, disk drives have very impressively increased their capacity by 10,000X and their I/O transfer rates by 65X. At first blush, it's amazing. Storage costs are now pennies per gigabyte, and with several orders of magnitude of improvement in both capacity and speed, we're rockin' it. But there is a subtlety here, hard to see but with dramatic consequences: capacity has grown far faster than the transfer rate - about 150 times faster, in fact. Although disks are now 65X faster, they individually hold 10,000X more data, so the time required to read the full contents of a disk is actually about 150X longer (10,000 / 65 ≈ 154) than it was 30 years ago. Oh, for the glory days of 1985!
[Disk Transfer Rates, image reused with permission from D. DeWitt, PASS Summit Keynote 2009]
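To see where that 150X comes from, here is a quick back-of-the-envelope sketch in Python. The 1985 baseline (a 100 MB drive streaming at 1 MB/s) is an illustrative assumption of mine, not a measurement; the 10,000X and 65X growth ratios are the ones quoted above.

```python
# Back-of-the-envelope: how long does it take to read an entire disk end to end?
# The 1985 figures are illustrative round numbers; the 10,000X and 65X ratios come from the post.

def full_scan_seconds(capacity_mb, transfer_mb_per_s):
    """Time to stream the whole disk at its sequential transfer rate."""
    return capacity_mb / transfer_mb_per_s

scan_1985 = full_scan_seconds(capacity_mb=100, transfer_mb_per_s=1)                # ~1985-class drive (assumed)
scan_now = full_scan_seconds(capacity_mb=100 * 10_000, transfer_mb_per_s=1 * 65)   # same drive scaled by the ratios

print(f"1985 full scan:  {scan_1985:,.0f} s")              # 100 s
print(f"Today full scan: {scan_now:,.0f} s (~4.3 hours)")  # ~15,385 s
print(f"Slowdown: {scan_now / scan_1985:.0f}X")            # ~154X - the '150X slower' in the title
```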
If the total time to read a disk has essentially gotten 150X worse, what about the price-performance of these disks? After all, storage prices have been dropping rapidly. In fact, the cost has dropped from about $30 per megabyte in 1985 to roughly 40 cents per gigabyte today - fractions of a penny per megabyte. In other words, while the effective performance of disks is 150X slower than it was 30 years ago, the price per MB is now roughly 75,000 times cheaper. See the diagram below - note that disk prices are shown as "X" symbols and that the y-axis is logarithmic. So, indeed, price-performance has improved. It's a sorry consolation that we can buy so much disk when we have to wait 150X longer to see what's on it.
(Source: here)
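As a quick sanity check on those figures, a tiny sketch using the numbers quoted above (and assuming 1 GB = 1,000 MB):

```python
# Price per megabyte, then and now, using the figures quoted in the post.
price_1985_per_mb = 30.00          # ~$30 per MB in 1985
price_today_per_mb = 0.40 / 1_000  # ~40 cents per GB today

cost_drop = price_1985_per_mb / price_today_per_mb
print(f"Cost per MB: ~{cost_drop:,.0f}X cheaper")                 # ~75,000X

# Even after dividing by the ~150X slowdown in full-scan time,
# price-performance still comes out well ahead:
print(f"Rough price-performance gain: ~{cost_drop / 150:,.0f}X")  # ~500X
```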
The next time you need to buy disks, give some thought to what you need to put there. These days, less may be more.
[In my next post I'll discuss CPU and RAM - are we further ahead?]
Data & Technology Transformation | Global Data & Analytics Lead
8 年Thanks Sam for this clear and concise article. I agree with John that columnar storage technology available from SAP Hana, Microsoft SQL Server and IBM DB2 and the likes are a great leap forward and to your point; less is more. With Columnstore available in SQL Server since 2012 the "average Joe" now have the ability to increase the responsiveness of his \ her BI solution to customers without costing an arm and a leg. The sad reality is that few technologist utilize this capability effectively in solving business challenges.
Creating Predictable Proven Revenue Processes Daily (9 years ago):
Sam, insightful analysis of the current state of data management. How do we deal with this? Is there a way for us to speed up data access with existing technologies, or is there something new coming soon?
Sam, great analysis! Back when disks were 2 GB each, it would take 512 disks to make up 1 TB, so we automatically had the bandwidth. Today we have 1-2 disks - exactly your point. The other big issue is stalled processing: we have all these cores, and the processing gets stalled. That's a big reason the benefits of columnar in-memory storage and processing are so significant. SAP IQ did a good job of processing columnar data; its pages were large for a relational system, at 128k to 512k, and packed with a single column of compressed data values, but the processor would still need the next page of values. Prefetching helped make sure the page was in the database cache, but it still took time for the cached page to get into the CPU cache (TLB). SAP HANA stores the columns in contiguous memory in a compressed format and uses the Intel parallel vectorization libraries to go through the data. That keeps the cores from stalling (they get their full quantum, i.e. time slice, from the scheduler) and plows through the data at memory speed. Things are getting better for disk-based databases using SSDs and flash instead of slow spinning disks, but small I/O sizes (a large number of IOPS) and row storage (CPU cache thrashing from parsing unnecessary columns) still put a significant drag on performance. Having compressed column data stored and processed independently in memory is a big leap forward.
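To make the layout point in that comment concrete, here is a small, purely illustrative Python sketch (the sizes and data are made up, and real engines such as SAP HANA or SQL Server columnstore add compression and vectorization on top of the layout advantage): summing one column out of twenty is typically faster when that column sits in one contiguous array than when it has to be picked out of row tuples.

```python
# Toy illustration of row vs. column layout: summing one column out of twenty.
# Sizes and data are invented for illustration only.
import array
import random
import time

n_rows, n_cols = 200_000, 20
random.seed(0)

# Row storage: each row is a tuple of 20 values; scanning one column touches every row object.
rows = [tuple(random.random() for _ in range(n_cols)) for _ in range(n_rows)]

# Column storage: the same column lives in one contiguous array of doubles.
col = array.array("d", (r[3] for r in rows))

t0 = time.perf_counter(); row_sum = sum(r[3] for r in rows); t1 = time.perf_counter()
t2 = time.perf_counter(); col_sum = sum(col);                t3 = time.perf_counter()

print(f"row-oriented scan:    {t1 - t0:.4f} s")
print(f"column-oriented scan: {t3 - t2:.4f} s")  # usually several times faster: contiguous and cache-friendly
assert abs(row_sum - col_sum) < 1e-6
```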