When is a terabyte not a terabyte?
Or, "Why didn't I get as much capacity as I thought I bought?"
The following article covers many of the details of the Binary (Base 2) vs. Decimal (Base 10) numeric systems, and how they are used to describe computer data storage capacities.
The two most important takeaways are these:
- Computers only understand binary. So binary capacities are the only capacities that matter.
- Decimal capacities are a marketing invention of drive manufacturers. In other words, they are "fake news".
The problem comes from several areas.
Originally, all memory and storage capacities were stated using standard prefixes with the assumption they were binary capacities. This continued through the multi-megabyte hard drives of the 1980s and early 1990s.
Then, in the mid-1990s, the hard drive manufacturers started using the term "Billion Byte" to describe their new gigabyte-capacity hard drives. This was a clever marketing ploy to emphasize capacity in the easier-to-understand decimal system. However, after partitioning and formatting, these "GB" drives reported much less capacity to their operating systems, which only understood binary. The drive manufacturers dismissed this by claiming the difference was due to partition geometries and file system overhead forced by operating system manufacturers. However, OSs improved their ability to manage large drives, and their file system overhead efficiency, while drives got larger and larger. As a result, as the linked article states, the IEC created new prefixes, and new abbreviations, for binary in an attempt to reduce confusion. But in many ways, it made things more confusing.
For example, my current state-of-the-art Windows 10 computer, with its 512GB SSD, shows the capacity as 511,500,611,584 bytes (511.5 "GB" Base 10).
Assuming the SSD actually has 512,000,000,000 bytes of available raw capacity, the partitioning and filesystem overhead is only 499,388,416 bytes, or less than 1/10th of one percent. Dividing the reported 511,500,611,584 bytes by 1024 three times then yields 476.372 gibibytes (GiB).
But Windows calls that "476 GB", not "GiB". Windows accurately calculates binary capacity, but reports GiB using the decimal "GB" abbreviation. This leads to more confusion. One could easily ask: "What happened to 7% of my storage!"
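The arithmetic in this example can be checked with a short Python sketch. The two byte counts come from the example above; the variable names are only illustrative, and the 512,000,000,000-byte raw capacity is the same assumption made in the text, not a measured value.

```python
# Figures from the 512GB SSD example; RAW_BYTES is an assumed raw capacity.
RAW_BYTES = 512_000_000_000        # 512 GB in decimal (Base 10) bytes
REPORTED_BYTES = 511_500_611_584   # capacity the OS reports for the volume

# Partitioning and filesystem overhead, in bytes and as a percentage of raw.
overhead_bytes = RAW_BYTES - REPORTED_BYTES
overhead_pct = overhead_bytes / RAW_BYTES * 100

# The same byte count expressed in decimal GB and in binary GiB.
decimal_gb = REPORTED_BYTES / 1000**3
binary_gib = REPORTED_BYTES / 1024**3

# The "missing" share a user sees when comparing 476 "GB" against 512 GB.
apparent_loss_pct = (1 - binary_gib / 512) * 100

print(f"overhead: {overhead_bytes:,} bytes ({overhead_pct:.3f}%)")
print(f"decimal:  {decimal_gb:.1f} GB")
print(f"binary:   {binary_gib:.3f} GiB")
print(f"apparent loss vs. 512: {apparent_loss_pct:.1f}%")
```

Almost all of the apparent loss comes from the unit mismatch; the real overhead is the tiny first figure.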
It's not just Microsoft and Windows. VMware and vCenter similarly use "GB" and "TB" to refer to "GiB" and "TiB":
ESXi | Container size in vCenter doesn't correlate with the size in Prism
Again, computers only understand binary.
The solution to this is to use decimal (Base 10) capacities and their abbreviations only for hard drive capacities, solid state drive capacities, and raw system capacities.
Formatted and usable (post-RAID) capacities should only be reported using binary (Base 2) capacities, as these are what operating systems and file systems see, use, and report. Unfortunately, as we see, OS and hypervisor vendors have not enforced this, and often report binary capacities using the legacy (now decimal) terms.
Otherwise, treating a GB as a GiB, or a TB as a TiB, significantly misstates expected storage capacity:
- A megabyte (MB, Base-10) is 4.63% smaller than a mebibyte (MiB)
- A gigabyte (GB, Base-10) is 6.87% smaller than a gibibyte (GiB)
- A terabyte (TB, Base-10) is 9.05% smaller than a tebibyte (TiB)
- A petabyte (PB, Base-10) is 11.18% smaller than a pebibyte (PiB)
As the numbers get bigger, the error gets larger.
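The percentages above follow directly from the ratio of 1000 to 1024 raised to the power of each prefix. Here is a minimal Python sketch that reproduces them; the function name `shortfall_pct` is just an illustrative choice.

```python
# (decimal unit, binary unit, power of 1000 / 1024) for each prefix pair.
PREFIXES = [("MB", "MiB", 2), ("GB", "GiB", 3), ("TB", "TiB", 4), ("PB", "PiB", 5)]

def shortfall_pct(power: int) -> float:
    """Percent by which 1000**power bytes falls short of 1024**power bytes."""
    return (1 - 1000**power / 1024**power) * 100

for decimal_unit, binary_unit, power in PREFIXES:
    pct = shortfall_pct(power)
    print(f"1 {decimal_unit} is {pct:.2f}% smaller than 1 {binary_unit}")
```

Each step up in prefix multiplies the ratio by another 1000/1024, which is why the gap keeps widening.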
What is the cost of usable (post-RAID), formatted (after applying filesystem structure and metadata), all-flash (expensive, especially on enterprise-class RAID arrays), binary (the only kind the OS sees) capacity? Suddenly a 9%-11% loss of capacity on an array or server cluster is real money, easily reaching $10,000 for a $100,000, 90TiB (vs. 100TB) all-flash storage array.
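To put a rough number on that, here is a small Python sketch using the $100,000, 100TB array from the example. It assumes the buyer expected 100 TiB and that price scales linearly with capacity. Both are simplifying assumptions for illustration, not how any vendor actually prices storage.

```python
ARRAY_PRICE_USD = 100_000   # price from the example above
QUOTED_TB = 100             # capacity as quoted, in decimal terabytes

# What the operating system will actually see, in binary tebibytes.
usable_tib = QUOTED_TB * 1000**4 / 1024**4

# Assumption: the buyer expected QUOTED_TB tebibytes, and cost is linear
# in capacity, so the missing fraction maps directly onto the price.
shortfall_fraction = 1 - usable_tib / QUOTED_TB
shortfall_usd = ARRAY_PRICE_USD * shortfall_fraction

print(f"usable capacity: {usable_tib:.2f} TiB")
print(f"value of the shortfall: ${shortfall_usd:,.0f}")
```

At petabyte scale the same calculation uses the 11.18% gap, which pushes the figure past $10,000 per $100,000 spent.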
Storage and computing system vendors are not trying to deceive customers. But their sizing tools can often report either binary or decimal capacities, so sometimes mistakes are made. Always double-check the stated usable capacities of a storage system. This includes software-defined storage such as hyperconverged and object storage systems.