Benchmarks don't lie, but liars use benchmarks
Fawad A. Qureshi
Field CTO @ Snowflake | LinkedIn Learning Instructor | Sustainability, Data Strategy, Business Transformation
"Lies, damned lies, and statistics" is a phrase that describes the persuasive power of statistics in bolstering weak arguments. Another variation of this saying goes: "Statistics don't lie, but liars use statistics."
Assessing and benchmarking different technologies is a common practice across industries. In this discussion, I'll delve into data-centric benchmarks related to data platforms. A former colleague, Marco Ullasci, often remarked, "Benchmarks don't lie, but liars use benchmarks." This points to the tactics data platform vendors employ in their benchmarks to attract new customers.
Let's discuss some of the tactics:
1. Benchmarks are set long before a single line of code is ever written
I have always maintained that the outcome of a benchmark is determined before the first line of code is written. How can that be? Consider this analogy: you own a brand-new Ferrari while I have a bicycle, and I challenge you to a race. Your first reaction might be to question my sanity. But if I get to choose the racecourse on race day, I'll steer us onto a muddy hiking trail.
If your objective is to find the best mode of transport for hiking, then this benchmark is relevant. If it isn't, you'd be left perplexed, wondering why we are racing on a hiking trail at all, beyond the simple fact that I'm resolute in my desire to win.
Data platform vendors behave the same way: they construct benchmarks that showcase their products in a favorable light, even when the scenario offers no apparent advantage to the customer.
2. Sponsored Benchmark Studies
We are witnessing a rise in sponsored benchmark studies orchestrated by "independent industry analysts." These studies or papers consistently conclude with the sponsoring company's product emerging as the top contender. Even more troubling is that I have encountered the same research firm publishing various papers on identical topics, each funded by different companies, yet consistently crowning the sponsoring company's product as the victor.
Sometimes, the study is not directly sponsored by the vendor but by a system integrator whose core business depends on that vendor. System integrators often fail to disclose this relationship, presenting themselves to the unsuspecting reader as independent consulting organizations.
In one recent report, the analysts stated, "We conducted a good faith estimate of the competition." When did benchmarks transition into exercises rooted in faith?
Whenever I come across a sponsored benchmark study, I generally dismiss it, because roughly 99% of the time the study promotes the sponsoring company's product. If a study is not sponsored, I still check whether most of the publishing organization's business relies on the vendor in question; that inherent conflict of interest prevents the analysts from conducting an impartial assessment.
3. Law of Large Numbers in Benchmarks
The Law of Large Numbers states that as the number of trials grows, the average of the results converges to the expected value; the more trials, the closer the match.
For anyone familiar with basic statistics, the probability of getting heads with a fair coin is 50%. But if I flip the coin four times and get three heads, can I conclude that the probability of heads on a fair coin is 75%? Of course not: four trials are far too few. Flip the coin ten thousand times, and the observed proportion of heads will settle very close to 50%.
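To make the convergence concrete, here is a minimal Python sketch (my illustration, not part of the original argument) that simulates coin flips and shows the observed proportion of heads approaching 50% as the number of trials grows:

```python
import random

def heads_proportion(n: int) -> float:
    """Simulate n fair coin flips and return the observed proportion of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    return heads / n

# Tiny samples mislead; large samples converge toward the true value of 0.5.
for n in (4, 100, 10_000, 1_000_000):
    print(f"{n:>9} flips -> proportion of heads = {heads_proportion(n):.4f}")
```

With four flips, a result like 0.75 is common; with a million, the proportion hugs 0.5. The same logic applies to benchmarks: a handful of queries on a tiny dataset tells you almost nothing.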
This principle significantly influences benchmarks. Consider a scenario where you test your queries against only 1TB of data while your production systems manage over 1PB. A 1TB sample can fit into the memory of most contemporary data platforms, yielding exceptionally flattering results. To counteract the bias of tiny samples, run tests that closely mirror production conditions in volume, complexity, and concurrency.
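As a rough sketch of what "mirroring production" means in practice, the Python snippet below runs a query set at production-like concurrency instead of one query at a time. The query list, concurrency level, and the stand-in for the driver call are hypothetical placeholders, not any specific vendor's API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical placeholders -- adapt names and queries to your own platform.
QUERIES = ["SELECT ...", "SELECT ..."]   # your real production query set
CONCURRENCY = 32                          # match peak production concurrency

def run_query(sql: str) -> float:
    """Execute one query and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # stand-in for conn.execute(sql) on production-scale data
    return time.perf_counter() - start

# Fire the workload concurrently, the way production actually behaves,
# instead of running one query at a time against a memory-sized sample.
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(run_query, QUERIES * CONCURRENCY))

print(f"p50={latencies[len(latencies) // 2]:.3f}s  max={latencies[-1]:.3f}s")
```

Concurrency matters as much as volume: a platform that shines on a single-user 1TB run can collapse when dozens of analysts hit petabyte-scale tables at once.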
4. Maintaining Technical Debt
Another technique I've observed legacy vendors employing in benchmarks involves compelling the customer to utilize the code from their current platform as-is, making only minimal modifications. They are well aware that their code has undergone optimization for their platforms over decades, and it bears a substantial load of technical debt. This strategy creates significant hurdles for other vendors to compete equitably.
This approach is defensible if your goal is to preserve the status quo and ensure that your code doesn't change, carrying the technical debt over to the new platform. However, if your goal is to modernize the platform and eliminate lingering technical debt, it becomes imperative to outline a target architecture that all vendors must adhere to, ensuring a level playing field. After all, as the saying goes, "Nothing changes if nothing changes."
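As a hedged illustration of what carried-over technical debt looks like, compare a legacy query stuffed with platform-specific optimizer hints against a portable rewrite. The table and index names are invented; the /*+ ... */ hint syntax is the Oracle-style comment form:

```python
# Hypothetical legacy query: decades of platform-specific tuning baked into the text.
LEGACY_QUERY = """
    SELECT /*+ INDEX(o idx_orders_cust) PARALLEL(o, 8) */
           o.cust_id, SUM(o.amount) AS total
    FROM orders o
    GROUP BY o.cust_id
"""

# Equivalent ANSI SQL: lets every candidate platform optimize on its own terms.
PORTABLE_QUERY = """
    SELECT cust_id, SUM(amount) AS total
    FROM orders
    GROUP BY cust_id
"""
```

Benchmarking the legacy text as-is rewards whichever engine happens to understand those hints, not the platform best suited to the modernized workload.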
5. The Myth of Industry Standard Benchmarks
To address the challenges mentioned earlier in devising meaningful benchmarks, the Transaction Processing Performance Council (TPC) introduced specifications for various data-centric benchmarks in 1994. While this appeared promising in theory, it ultimately failed to capture real-world complexities.
Consider the limitations baked into these benchmarks: the data model has remained essentially unchanged since 1994, and the specifications define only a constrained set of tables and queries, so we again fall victim to the small-numbers effect. Vendors invest years fine-tuning their platforms to excel at TPC-DS, engineering specific index structures and optimizations for the known query set. As a result, TPC benchmarks read like a solved puzzle: stellar published numbers can mask dire performance on real-world queries.
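To show what that over-tuning can look like, here is a hypothetical sketch of a platform "pre-solving" a known benchmark query by shipping a materialized view that matches its exact aggregation pattern. The query is TPC-DS-style (store_sales, date_dim, and item are real TPC-DS tables), but the view and the example itself are my invention:

```python
# A benchmark query whose shape is known years in advance.
KNOWN_BENCHMARK_QUERY = """
    SELECT d_year, i_category, SUM(ss_net_paid) AS total_paid
    FROM store_sales
    JOIN date_dim ON ss_sold_date_sk = d_date_sk
    JOIN item     ON ss_item_sk = i_item_sk
    GROUP BY d_year, i_category
"""

# A vendor can precompute exactly this aggregation, so the benchmark run
# answers from the view in milliseconds -- an edge that evaporates on the
# ad hoc queries a real production workload throws at the platform.
PRECOMPUTED_VIEW = (
    "CREATE MATERIALIZED VIEW mv_sales_by_year_category AS" + KNOWN_BENCHMARK_QUERY
)
```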
Recommendations
Wondering what steps to take? Here are some practical tips:
- Decide what you are optimizing for before any benchmark code is written, so a vendor can't pick the racecourse for you.
- Treat sponsored studies, and studies from firms whose business depends on the vendor being evaluated, with deep skepticism.
- Test at production-like volume, complexity, and concurrency rather than against memory-sized samples.
- If modernization is the goal, define a target architecture that every vendor must follow instead of porting legacy code as-is.
- Don't treat industry-standard benchmarks like TPC-DS as a proxy for your own workload.
Are there any other tips you would add to the list?
If you liked this article, please subscribe to the FAQ on Data newsletter and/or follow Fawad Qureshi on LinkedIn.
Sr. Systems Engineer
Great article. I have a question, though. To personalize a benchmark while keeping a single copy of the data, how practical is it to maintain one genuinely large test dataset with varied data quality? Context: different vendors need and operate on independent datasets. Also, if you have a huge dataset whose values are similar or uniform, then no matter how many trials you run, your test coverage is limited.
Strategic Data and AI Advisor
What about open-sourced benchmark code that can be inspected and independently reproduced? In your view, do benchmarks provide any value? What about for consumption planning?
I think that quote is credited to Mark Twain. So accurate in this case as well. As always, Fawad A. Qureshi, your perspective is right on.