A 'bit' More About Data De-dupe...
And why not all data deduplication ratios and hero-number data protection claims are created equal (Part 2)
Previously, I blogged (link below) about what de-duplication is and why you should carefully evaluate any vendor’s data protection efficiency claims. After all, ‘vendor math’ is a real thing…and all vendors like to put (and pitch) their best foot and product features forward. Who doesn’t?
But if a vendor studiously avoids talking about some key performance attribute, claim or differentiated feature (like deduplication, for example) -- or tries to downplay said feature/performance numbers, or protests "What difference does it make?" -- then you should drill down and ask why "it doesn’t matter". Some claims can be a 'bit' like “Voodoo Marketing” and vaporize under close scrutiny and facts (thank you, Mr. H.W. Bush).
Vendor marketing claims aside, the focus should be on what’s real and most important/critical to your business and IT organization. And, of course, on whether a vendor’s data protection solution truly supports your business objectives, operations and supporting IT infrastructure -- along with your user community. Ultimately, it all comes down to which specific vendor solution has the 'best' direct and measurable bearing on reducing your org's costs and increasing revenue and profitability.
Integral to all of these tangible financial considerations, of course, is your mission-critical data for production workloads -- and just how safe, resilient/secure and recoverable it is. So...PLEASE...whatever your IT strategy or vendor preference happens to be...do yourself a favor and PROTECT YOUR DATA! Regardless of where it resides (e.g., Edge, ROBO, datacenter, remote site, DR site(s), Private Cloud, Hybrid Cloud, Public Cloud or Multi-Cloud, wherever...).
Your data deserves to be well protected and immediately recoverable wherever it may be stored, archived or sequestered a la an isolated cyber recovery solution. So-called "immutable snapshots" are NOT sufficient for true backup and reliable/resilient data protection! For one, they can be inadvertently or deliberately deleted! And, by the way, data recoverability is really why we protect data, right? For those 'rainy day' or 'stuff happens' data unavailability incidents, corrupted data cases, site disasters or successful malware/ransomware attacks. Protected data is no good unless you can recover and restore it FAST, commensurate with your business org's RPO and RTO SLAs (recovery point and recovery time objectives).
So the real question to ask is just HOW efficient, productive and cost-effective a given data protection (DP) solution will be for YOUR real-world, mission-critical apps, data and workloads on a particular vendor's solution architecture or platform. How much more data will you be able to protect reliably and predictably with the least amount of on-premises IT hardware and software infrastructure cost from a given vendor’s DP solution(s)?
Speaking of 'the cloud', there’s the enticing and alluring Public Cloud ‘pay only for what you use’ variable OPEX billing promise...and well-worn mantra. But monthly public cloud compute and data storage service provider usage fees (especially when pulling or recovering data out) can be surprisingly high…and escalate rapidly. We've heard reports of bills of up to $20M USD per month for public cloud usage from a large, but obviously unnamed, customer. Yikes! Now that's some monthly bill...
Then there's the current 'shadow multi-cloud' trend going on...where orgs suddenly find their user community increasingly deploying compute, storage, DR and data protection in multiple clouds (without a coherent multi-cloud strategy, rationalization or monitoring). No wonder many small/medium businesses, mid-range and high-end enterprises are 'repatriating' their apps, workloads and data from Public Clouds back to their own on-premises environments.
So, "Mister Rodger...why is 'a bit more about data dedupe' relevant here?"
It's very simple. The less actual physical data/footprint you end up backing up, copying and storing/retaining in AWS, Azure or Google public clouds…the less $$$ you’ll be charged each month. That makes eminent sense, right? Now that’s customers’ math in action. And you definitely want to get this math right so you can leverage 'the multi-cloud' without runaway monthly utility bills...or cloud chaos. Remember, at Dell EMC we're trying to help cost-effectively transform your business and IT org into modernized, comprehensive and holistic use case and deployment models....covering the whole gamut of a solid IT infrastructure environment and resources. Here, actual 'total cost of ownership' and 'realistic cost to serve' financial numbers count big time, and should be high on your list for evaluating viable solutions for your production workloads.
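To make that customers' math concrete, here's a minimal back-of-envelope sketch in Python. The per-GB storage and egress prices below are purely illustrative placeholders (not any provider's actual rate card), and the data_reduction factor simply stands in for whatever combined dedupe/compression efficiency a given solution actually delivers:

```python
# Back-of-envelope monthly cloud cost for protected data.
# All prices below are illustrative placeholders, NOT real provider rates.

def monthly_cloud_cost(logical_tb, data_reduction, price_per_gb_month,
                       restore_tb=0.0, egress_price_per_gb=0.0):
    """Estimate a monthly bill for storing deduped backup data in a public cloud.

    logical_tb      -- protected (pre-reduction) data footprint in TB
    data_reduction  -- combined dedupe x compression factor (e.g., 10 for 10:1)
    price_per_gb_month, egress_price_per_gb -- hypothetical $/GB rates
    restore_tb      -- data pulled back out this month (egress), in TB
    """
    physical_tb = logical_tb / data_reduction          # what you actually store
    storage_cost = physical_tb * 1024 * price_per_gb_month
    egress_cost = restore_tb * 1024 * egress_price_per_gb
    return physical_tb, storage_cost + egress_cost

# 500 TB of protected data, a 1 TB test restore, hypothetical rates:
for ratio in (2, 10, 30):
    phys, cost = monthly_cloud_cost(500, ratio, 0.02,
                                    restore_tb=1, egress_price_per_gb=0.09)
    print(f"{ratio:>2}:1 reduction -> ~{phys:6.1f} TB stored, ~${cost:,.0f}/month")
```

Run it with your own numbers; the takeaway is simply that the meter runs on your physical footprint, not your logical one.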
So efficient data protection (and dedupe) can dramatically improve on- and off-prem productivity, resource utilization, backup and restore response times, CAPEX and OPEX budgets, cloud costs, and SLA/user community compliance and satisfaction.
Now for a rather surprising 'factoid'. Many customers out there actually assume that backing up data to 'the cloud'...by default...means their data (and corresponding VMs and apps) will 'automagically' be copied and protected in 'the cloud' and included in their subscription rates. Or that any data sent to or created in the cloud is not vulnerable to cyber threats like data unavailability, viruses, corruption, malware or ransomware (from any cause or source). Some learn the hard way that they’re 'under-insured', data-protection-wise. Others put all their data eggs in one basket, so to speak, whether it's a single datacenter site, media type or backup copy (and whether on-prem and/or in the cloud).
Now, here’s another verifiable 'factoid':
IT industry analysts agree…
"Dell EMC is #1 in the Data Protection Appliance & Software Market."
By all means, fact-check me. But as you evaluate various vendor data protection software and hardware solutions, please keep the above in mind. There’s a reason why Dell EMC has achieved that recognition and industry mantle.
Again -- vendor claims and math aside -- the only thing that really matters (or should matter) is just how efficient a vendor’s solution is in terms of doing more storage or data protection with less hardware (or cloud) resources -- for a given raw and usable data storage capacity. This cannot be overemphasized.
In particular, the realm of data protection boils down to these very fundamental questions for all DP vendors... along with getting verifiable performance numbers from them:
- How much backup storage will you need to procure, provision and manage for your protected data?
- How long do (or will) backups, replication and restores really take?
- How much additional infrastructure (i.e., networking) is needed to support the solution?
- How much WAN bandwidth will actual replication copies require and utilize?
- How long will it take to restore my data after some disaster (i.e., Disaster Recovery); and how well can I protect it from cyber threats?
So, is an oft-quoted ‘hero’ dedupe or data reduction figure cited by a specific vendor really better if the ratio or percentage is higher? Answer: “It depends”. Architecting, calculating, testing and measuring actual data reduction efficiency (i.e., both dedupe and compression in play) can vary from vendor to vendor, workload to workload, SLA to SLA and data protection platform to platform. This variation is due to a vendor’s architecture…tempered by your datacenter's workloads, data sets and environment used during data reduction analysis. But equally important is the way it was tested or measured...
Enter that tried and true, realistic and telling production-oriented Proof of Concept (PoC) and/or vendor short-list bake-off. It stands to reason that any vendor’s PoC must be driven by your requirements and business/data environment…not theirs…or ours. You should set the PoC table, not the vendor(s). Be especially wary of those 'quick and dirty', 'drop and go' demos and compelling PoCs which can wow you with some cool results while simultaneously gaming the PoC with favorable conditions. Happens all the time. Trust me, our esteemed competitors do a lot of gaming in order to put their best foot forward. After all, all vendors want to showcase their value add, including Dell EMC. Just make sure their value add is truly valuable and relevant to you and your org.
Below are a few very high-level examples of how different data deduplication and compression measurement hero numbers and claims can be derived. I've included them for the tech-savvy and innately curious.
For the rest of you, here's a good place to exit from this blog. If you do, a sincere thank you for your time and interest!
1. A storage platform’s raw vs usable storage
- Usable capacity matters! All things being equal, a 5:1 ratio on one platform might actually generate MORE effective/logical capacity than a 10:1 ratio from another vendor whose platform yields a lower usable percentage of its ‘raw’ capacity. In the end, physical usable capacity and effective/logical capacity of data actually protected (usable x dedupe efficiency) are the numbers you really should examine/consider.
- Example: a 48TB ‘raw’ capacity array/target storage device with HIGH overhead (SW OS/RAID schema/etc.) -- but with a 10:1 data reduction ratio -- might yield 24TB usable (@ 50% overhead)…and 240TB effective/logical data storage capacity (usable x 10). However, another 48TB raw capacity storage device with 7.3:1 dedupe and a medium overhead of 30% (for example) would yield about 33.6TB usable -- and thus roughly 245TB effective/logical data storage capacity (a quick sketch of this math follows at the end of this section).
- Generally, there are two approaches: inline deduplication and post-process deduplication. The advantage of inline dedupe is that data reduction occurs before data is actually written to the client or target storage media. Post-process dedupe scans data for optimized data reduction after ingest...therefore requiring disk capacity for 'temporary' storage until the post-process is complete. The former can take up host resources and reduce I/O throughput, while the latter consumes usable storage until the post-process finishes.
The diagram above (courtesy of TechStorage) depicts post-process versus inline dedupe. There are trade-offs to both, but generally inline deduplication is quite popular these days.
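For the tech-savvy, here's the raw-vs-usable-vs-effective math from the example above as a quick Python sketch. The 50%/30% overheads and 10:1/7.3:1 ratios are just the illustrative figures used in that example, not any specific product's numbers:

```python
def effective_capacity(raw_tb, overhead_pct, data_reduction):
    """Physical usable and effective/logical capacity for a backup target.

    raw_tb         -- advertised 'raw' capacity in TB
    overhead_pct   -- capacity lost to OS/RAID/filesystem overhead (0-100)
    data_reduction -- claimed dedupe/compression ratio (e.g., 10 for 10:1)
    """
    usable_tb = raw_tb * (1 - overhead_pct / 100.0)
    logical_tb = usable_tb * data_reduction
    return usable_tb, logical_tb

# The two 48 TB raw platforms from the example above:
for label, overhead, ratio in (("high overhead, 10:1 ", 50, 10.0),
                               ("medium overhead, 7.3:1", 30, 7.3)):
    usable, logical = effective_capacity(48, overhead, ratio)
    print(f"{label}: {usable:.1f} TB usable -> {logical:.1f} TB effective/logical")
```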
2. Dedupe-friendly versus dedupe-unfriendly data sets
- Some block and file data sets are more redundant than others, and vendors can exploit this attribute by running certain dedupe (or data reduction) “tests” on favorable data sets to back up dedupe ratio/efficiency claims (a toy measurement sketch follows at the end of this section). And, by the way, dedupe operations should only be run on real data and not “white space”, such as that contained in VMs, when deriving any claimed data reduction numbers. Yes, a couple of 'modern' and 'disruptive' HCI vendors out there have recently (and curiously) increased their dedupe ratio claims…and one even includes that white space in its data reduction calculations.
- Example? The more redundant (i.e., repeated) data you have written to your BU&R appliance (target) or source side (client), the higher the data dedupe ratios you will see...or can 'tweak'. Conversely, servers, clients, targets or apps running multiple/virtualized operating systems, dissimilar or random data sets, databases, etc. can yield lower deduplication ratios and “efficiency” percentages.
- Some dedupe architectures are limited to fixed block-size dedupe (e.g., 64KB or 128KB blocks) for a given 'data granularity', while others are variable block-size based (e.g., between 64KB and 256KB block sizes, dynamically handled). As a rule, smaller fixed block sizes can yield better deduplication efficiency/performance, while variable-block dedupe can assist in ‘on the fly’ data I/O optimization; larger block sizes yield less efficient dedupe. Regardless of which schema is in play, the real 'goodness' criteria (correlated to data granularity) in backup and restores shows up in actual RTO and RPO times (recovery time objective...and recovery point objective).
- Some vendors do NOT offer or clarify the data environment/characteristics behind their dedupe claims (which always seem to be optimistic). So, ask for their performance and data reduction efficiency test data numbers (and test cases). Remember, these numbers can ultimately impact your RTO and RPO times and user/business community SLA policies.
Generally speaking, block and file data are dedupe-friendly compared to object data (e.g., AWS S3 or Azure Blob unstructured data formats).
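And here's the toy measurement sketch promised above. It simply fingerprints fixed-size 64KB chunks with SHA-256 and counts unique ones on three synthetic data sets; real dedupe engines (variable-block, inline, compression-aware, etc.) are far more sophisticated, so treat this purely as an illustration of why the data set fed into a 'test' dominates the resulting hero number:

```python
import hashlib
import os

def fixed_block_dedupe_ratio(data: bytes, block_size: int = 64 * 1024) -> float:
    """Naive dedupe ratio: total blocks divided by unique blocks (SHA-256 fingerprints)."""
    fingerprints = set()
    total = 0
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size].ljust(block_size, b"\0")  # pad the tail block
        fingerprints.add(hashlib.sha256(block).digest())
        total += 1
    return total / max(len(fingerprints), 1)

one_block = (b"same backup payload " * 4000)[:64 * 1024]       # one 64KB repeating pattern
redundant = one_block * 16                                      # 1 MB of highly redundant data
random_data = os.urandom(1024 * 1024)                           # 1 MB of incompressible noise
white_space = random_data[:256 * 1024] + b"\0" * (768 * 1024)   # a mostly-empty 'thin' disk

for name, blob in (("redundant", redundant),
                   ("random", random_data),
                   ("white space", white_space)):
    print(f"{name:>12}: ~{fixed_block_dedupe_ratio(blob):4.1f}:1")
```

The redundant set scores ~16:1, the random set ~1:1, and the mostly-zero 'thin disk' gets a flattering ~3:1 purely from white space.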
3. Data Volatility/change rates
- Data deduplication ratios are related to the number of changes occurring to the data… and data blocks. Changed blocks and files result as new data comes in, gets updated, or gets modified.
- Effectively, each percentage increase in data change assumed/tested lowers the resulting dedupe ratio. For example, a 20:1 dedupe ratio is estimated or measured from a data change rate average of roughly 5%.
- Not only do data change rate assumptions have a bearing on dedupe ratio/efficiency claims, they also affect consumption/utilization of the DP storage platform in question – for both physical and logical data protection.
- Be wary of under-sizing: HCI/SDS scale-out data protection appliance vendors are prone to making undersized, optimistic and compelling capacity sizing quotes based on low 2% data change rates...and low data growth rates (a simple model of this follows below).
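Here's that simple model, a toy formula of my own for illustration (not any vendor's sizing math): with N retained fulls of one data set and a fractional change rate c between them, the achievable dedupe ratio is roughly N / (1 + (N - 1) x c), which tends toward 1/c as retention grows. Note how a 2% assumption flatters the quote versus 5% or 10%:

```python
def dedupe_ratio(retained_fulls: int, change_rate: float) -> float:
    """Toy model: N retained fulls of one data set, fraction `change_rate` new per full.

    Logical data protected  = N x D  (D = size of the data set)
    Unique physical stored ~= D x (1 + (N - 1) x change_rate)
    """
    n = retained_fulls
    return n / (1 + (n - 1) * change_rate)

# 30 retained daily fulls, different assumed daily change rates:
for c in (0.02, 0.05, 0.10):
    print(f"{c:4.0%} change rate -> ~{dedupe_ratio(30, c):4.1f}:1 "
          f"(ceiling ~{1 / c:.0f}:1 with unlimited retention)")
```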
4. Compressed Data
- Virtually all DP vendors rely on data compression as part of their overall data reduction footprint. What industry standard data compression algorithm is being used and how efficient is it?
- Vendors can base their data reduction (or dedupe) claims on ‘built-in’ or ‘presumed’ data compression after the fact. They may claim data reduction or dedupe ratios on data that subsequently gets compressed…for further data reduction. White space can be included in data reduction ratios…with compression further compacting the data.
- Typical data compression algorithm/efficiency assumes a 2:1 further reduction of already-deduplicated data. Ask about compression, because not all data compresses down to the same extent or with the same efficiency.
- Example? If a vendor’s REAL architectural deduplication ratio/efficiency is 15:1, the 2:1 compression ratio above that further mashes down the data would effectively result in a claimed dedupe ratio of 30:1. But, again, what really matters -- or should matter -- is just how small your data footprint becomes after dedupe AND compression. Here, less physical data is more....and more resulting effective/logical capacity (calculated from the average total data reduction ratio) is best! Particularly when sending -- or rather pulling -- data from 'pay for usage' Public Clouds.
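The arithmetic behind that 15:1 vs 30:1 example, in a minimal sketch: the 'claimed' ratio is just the dedupe ratio multiplied by the assumed compression factor, and what actually lands on disk (or in the cloud) is your logical data divided by that product:

```python
def combined_reduction(dedupe_ratio: float, compression_ratio: float) -> float:
    """Overall data reduction when compression is layered on top of dedupe."""
    return dedupe_ratio * compression_ratio

def physical_footprint_tb(logical_tb: float, total_reduction: float) -> float:
    """Physical capacity consumed for a given logical (protected) data set."""
    return logical_tb / total_reduction

total = combined_reduction(15, 2)   # 15:1 dedupe x 2:1 compression = 30:1 claimed
print(f"claimed ratio: {total:.0f}:1")
print(f"100 TB logical -> {physical_footprint_tb(100, total):.1f} TB physical")
print(f"100 TB logical -> {physical_footprint_tb(100, 15):.1f} TB physical (dedupe only)")
```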
5. Data Retention period assumptions
- Length of data retention also affects the overall data reduction footprint. Typically, some vendors might assume optimistic data retention periods -- if divulged at all. Retention duration is very important; the longer data is retained, the better the deduplication efficiency (over time). This is true not only for sequential data on disk drives, but also for random, non-sequential writes on flash storage (i.e., SSDs). And when it comes to flash storage/SSD data caching and ‘tiering’ or ‘offloading’ (i.e., most HCI/SDS hybrid scale-out appliances with nodes containing both HDDs and one or more SSDs), be sure to ascertain whether hero dedupe/data reduction numbers are based on a partially full and/or “warmed up” flash drive(s).
- For example, to achieve a data reduction ratio of 10:1 to 30:1, users would likely need to retain and deduplicate a single data set over a period of at least 20 weeks (see the toy model at the end of this section). The notion here is that the effects of data reduction (mostly deduplication) really kick in over time as warm data ‘cools’. If you don't have the capacity to store data for that long, the data reduction rate will be lower.
- On HCI/SDS scale-out platforms with lower, fixed raw capacity (whether primary, secondary or data protection-specific), the limited (and fixed) on-board/per-node storage capacity can constrain data retention to the point where they CANNOT retain data for 20 weeks or more. They must offload snaps and/or any full backups to ‘the cloud’.
- A certain data protection HCI scale-out appliance start-up vendor, for example, requires data off-loads from host nodes/clusters to the public cloud more frequently than its claimed monthly interval. Ironically, as more nodes with fixed CPU and storage capacity have to be added to meet data protection requirements, I/O throughput performance can actually drop (due to increased intra-cluster I/O data and management traffic). A seeming contradiction of HCI scale-out appliances’ ‘linear scale-out performance’ architectural claims…but customers are finding out those fixed HCI compute and storage resourced nodes quickly run out of steam when it comes to handling their complex workloads and large data sets, change rates, etc.
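Using the same toy model from the change-rate section above (again, a simplification for illustration only, not a vendor formula), here's how the ratio builds up as weekly fulls of a single data set accumulate with a ~5% weekly change rate; it takes roughly 20 weeks of retention before the ratio clears 10:1:

```python
def dedupe_ratio(retained_fulls: int, change_rate: float) -> float:
    """Toy model: ratio of logical fulls protected to unique physical data stored."""
    n = retained_fulls
    return n / (1 + (n - 1) * change_rate)

weekly_change = 0.05
for weeks in (4, 8, 12, 20, 52):
    print(f"{weeks:3d} weeks retained -> ~{dedupe_ratio(weeks, weekly_change):4.1f}:1")
```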
6. Backups vs Snapshots… Frequency and Assumptions
- For complete, true data protection, full backups are a must. Snapshots simply are not enough when it comes to complete data backups. Re-hydrating snaps for data and app restores can be unwieldy and time consuming. But here’s another area where dedupe/data reduction claims can be manipulated: by obfuscating the actual test assumptions or test cases performed, or by claiming that snaps are ‘good enough’ for data protection. [You probably hear a lot about “immutable snaps”…which actually are NOT so immutable, since they can be deleted or corrupted.]
- In contrast to frequent/regular snaps, backups (there are several types, including full, incremental and synthetic) include robust data integrity and deduplication software (e.g., Dell EMC's PowerProtect Software) that provides a more complete, reliable, granular and consistent picture over time of the data 'fulls' being backed up on a regular basis. Do you want a bunch of point-in-time snapshots of your child’s soccer game (aka the J. Geils Band's "Freeze-Frame")…or a running video clip up to that point in time? That, conceptually, is the difference between a data snapshot and a full backup, FWIW.
- Full backups typically yield higher data reduction/dedupe efficiency the more frequently you do them (compared to snaps). Why? Efficient client- and/or target-side BU&R DP software products (and appliances!) can generate greater deduplication ratios and efficiency results largely because they run full deduplication and compression passes (scans) on the storage server each time a backup is run. This is true even though they only back up changes to existing files or new files (i.e., change block tracking or CBT); see the sketch after this section.
- Two small HCI scale-out appliance vendors for data protection, for example, both rely heavily on snaps in between those monthly (or less frequent) full backups of their clusters that get offloaded to the cloud. Hence, they usually receive only data changes sent as part of the backup software's daily incrementals or CBT differences (via metadata tables/indexing). Snapshot-driven data protection schemas -- though represented as ‘good enough’ data protection -- invariably base their dedupe claims on full backups. Why? A bunch of daily snaps ends up taking a lot of usable disk space as they pile up locally…along with generating lower effective/logical data capacities once offloaded up in the cloud.
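Here's the sketch mentioned above. It is NOT Dell EMC's actual CBT or synthetic-full implementation, just a toy illustration of the principle: the backup target keeps a store of unique chunks keyed by fingerprint, so each nightly 'full' only adds new/changed blocks physically while counting as a full logically, and the dedupe ratio climbs as fulls accumulate:

```python
import hashlib
import os
import random

BLOCK = 4 * 1024  # 4KB blocks, purely for illustration

def full_backup(volume, chunk_store):
    """A deduped 'full': every block is fingerprinted, only unseen blocks get stored."""
    logical = 0
    for block in volume:
        logical += len(block)
        chunk_store.setdefault(hashlib.sha256(block).digest(), block)
    return logical

random.seed(1)
volume = [os.urandom(BLOCK) for _ in range(1000)]  # ~4 MB synthetic volume
chunk_store, logical_total = {}, 0

for day in range(1, 31):
    for i in random.sample(range(len(volume)), 50):  # ~5% of blocks change daily
        volume[i] = os.urandom(BLOCK)
    logical_total += full_backup(volume, chunk_store)
    if day in (1, 7, 30):
        physical = len(chunk_store) * BLOCK
        print(f"day {day:2d}: ~{logical_total / physical:4.1f}:1 dedupe ratio")
```

With ~5% of blocks changing daily, the ratio grows from roughly 1:1 after the first full toward the 1/change-rate ceiling over time, which is why full-backup-based schemas post the big dedupe numbers that snapshot-only schemas don't.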
Public Clouds are great for supplemental compute and storage resourcing or virtual datacenter deployments (along with long term data retention or DR), but watch your monthly utility bills and control your costs. Here's where a vendor's higher deduped logical data storage efficiency can lower your monthly storage usage fees. So don't be duped...
Architecture Matters!
Turns out Dell EMC’s data protection architecture is an industry stand-out in efficiency, performance, data integrity/reliability, DR and malware protection, application coverage, and cost to serve and protect. This applies to both our PowerProtect Software and our PowerProtect Hardware appliances (i.e., the Data Domain BU appliance, the IDPA all-in-one DP platform and the X400 scale-out integrated appliance, along with our DD Virtual Appliance software).
Oh...by the way, Dell EMC is the leader in modern, transformative digital IT architectures and solutions...across the entire IT infrastructure spectrum. Today, for example, Dell EMC proudly launches our next-generation PowerProtect DD backup and recovery target appliance. It follows our highly successful, #1 industry-ranked Data Domain appliance family...and joins our equally impressive PowerProtect IDPA (integrated data protection appliance) and X400 scale-out data protection appliances. But that's another story, for another blog...
More proof? It’s quite telling that Dell EMC’s Data Domain Boost software (for enhanced dedupe and data throughput performance) is used by many data protection vendors…including our competitors. This is neither coincidental nor inconsequential. Data Domain Boost and DDBEA (Data Domain Boost for Enterprise Applications) are renowned, well-regarded data protection performance ‘turbochargers’ for on-prem datacenters, ROBO/SMB, remote and cloud deployments. If ‘imitation is the highest form of flattery’, then ‘adoption is the highest form of validation’.
To wrap things up here, Dell EMC makes a concerted, deliberate effort to provide realistic, meaningful, accurate (and conservative) performance, throughput, IOPS, dedupe/efficiency and data integrity claims. That is our culture. That is our mission…. We even provide written guarantees behind many of those performance and efficiency numbers…PLUS generous future-proofing and trade-in/trade-up customer loyalty programs. Our esteemed competitors? Not so much. ("Bueller? Bueller?") It’s an integral part of our global Digital and IT 'transformative' and modernized infrastructure, products, solutions and professional services initiatives.
And you can be assured our solution proposals will be realistic, well scoped, sized and configured to meet your IT department and org’s data volume, data change rate, full-backup SLAs and recovery performance requirements -- with the greatest value prop in the IT data protection market. And quoted right from initial acquisition and installation.
But keep asking us and other vendors your pointed questions, and tailor those PoC definitions to your actual real-world data protection and production requirements. And remember, regardless of which vendor solution you end up going with, be sure to BACK UP AND PROTECT YOUR DATA with an efficient, reliable and high-performance solution -- regardless of where your data resides!