Benchmarks don't lie but liars use benchmarks

"Lies, damned lies, and statistics" is a phrase that describes the persuasive power of statistics in bolstering weak arguments. Another variation of this saying goes: "Statistics don't lie, but liars use statistics."

Assessing and benchmarking different technologies is a common practice across industries. In this discussion, I'll focus on data-centric benchmarks for data platforms. A former colleague, Marco Ullasci, often remarked, "Benchmarks don't lie, but liars use benchmarks." The saying captures the tactics data platform vendors employ in their benchmarks to attract new customers.

Let's discuss some of the tactics:

1. Benchmarks are set long before a single line of code is ever written

I have always maintained that the outcome of a benchmark is determined before the first line of code is ever written. How can that be? Consider this analogy: you own a brand-new Ferrari while I have a bicycle, and I challenge you to a race. Your first question might be about my sanity. But if I get to choose the racecourse on race day, I'll steer us onto a muddy hiking trail.

If your objective is to find the best mode of transport for a muddy trail, then this benchmark is relevant. If that isn't the goal, you'd be left wondering why we are racing on a hiking trail at all, beyond the simple fact that I am determined to win.

In a parallel manner, within the domain of data platforms, vendors often try to construct benchmarks that showcase them in a favorable light, even when there's no apparent advantage to the customer.

2. Sponsored Benchmark Studies

We are witnessing a rise in sponsored benchmark studies orchestrated by "independent industry analysts." These studies or papers consistently conclude with the sponsoring company's product emerging as the top contender. Even more troubling is that I have encountered the same research firm publishing various papers on identical topics, each funded by different companies, yet consistently crowning the sponsoring company's product as the victor.

Sometimes the study is not sponsored by the vendor directly but by a system integrator whose core business depends on that vendor. System integrators often fail to disclose this relationship and present themselves to the unsuspecting reader as an independent consulting organization.

In one recent report, the analysts stated, "We conducted a good faith estimate of the competition." When did benchmarks transition into exercises rooted in faith?

Whenever I come across a sponsored benchmark study, I generally dismiss it: roughly 99% of the time, the study promotes the sponsoring company's product. If a study is not sponsored, I quickly check whether most of the publishing organization's business depends on the vendor in question. That inherent conflict of interest prevents the analysts from conducting an impartial assessment.

3. Law of Large Numbers in Benchmarks

The Law of Large Numbers states that when a significant number of trials are conducted, the average results should closely approach the expected value. This alignment becomes even more precise as the number of trials increases.

For those familiar with basic statistics, it's common knowledge that the probability of obtaining a head on a balanced coin is 50%. However, if I flip the coin four times and achieve three heads, can I conclude that the likelihood of getting a head is 75% for a balanced coin?

In other words, only after a large number of trials does the observed average converge to the expected value; four coin flips tell you almost nothing about the true probability.
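Here is a minimal sketch (plain Python, nothing platform-specific, seed chosen arbitrarily) that makes the point concrete: a handful of flips can easily show 75% heads, but the running average settles near 50% as the flips pile up.

```python
# A minimal sketch of the Law of Large Numbers with a fair coin.
import random

random.seed(7)
heads = 0
for flips in range(1, 100_001):
    heads += random.random() < 0.5            # one fair coin flip
    if flips in (4, 100, 10_000, 100_000):
        print(f"{flips:>7,} flips -> observed P(heads) = {heads / flips:.3f}")

# With 4 flips the observed rate can easily be 0.750 or 0.250; by 100,000 flips
# it sits very close to 0.500. Small benchmark runs are just as noisy.
```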

[Image: the sample mean of repeated trials converging toward the expected value. Source: https://www.statology.org/law-of-large-numbers/]

This principle significantly influences benchmarks. Consider a scenario where you test your queries with only 1TB of data while your production systems manage over 1PB of data. The 1TB sample might fit into the memory of most contemporary data platforms, yielding exceptionally positive results. To counteract potential biases from tiny sample sizes, conducting tests that more closely mirror production conditions, encompassing volume, complexity, and concurrency, is advisable.
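As a back-of-the-envelope illustration, a quick check of whether the benchmark's working set fits in the cluster's aggregate memory tells you how much to trust the numbers. Every figure in the sketch below is an assumption to replace with your own; the point is the comparison, not the values.

```python
# Back-of-the-envelope memory-fit check. All constants are assumptions.
test_data_tb = 1.0           # size of the raw benchmark dataset
compression_ratio = 3.0      # assumed columnar compression on the platform
nodes = 16                   # assumed benchmark cluster size
ram_per_node_tb = 0.5        # assumed memory per node

working_set_tb = test_data_tb / compression_ratio
cluster_ram_tb = nodes * ram_per_node_tb

if working_set_tb < cluster_ram_tb:
    print(f"~{working_set_tb:.2f} TB working set fits in {cluster_ram_tb:.1f} TB "
          "of aggregate RAM: expect flattering results that will not extrapolate.")
else:
    print("Working set exceeds memory: disk and network behaviour will be visible.")
```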

4. Maintaining Technical Debt

Another technique I've observed legacy vendors employing in benchmarks involves compelling the customer to utilize the code from their current platform as-is, making only minimal modifications. They are well aware that their code has undergone optimization for their platforms over decades, and it bears a substantial load of technical debt. This strategy creates significant hurdles for other vendors to compete equitably.

This approach is reasonable if you want to maintain the status quo and ensure that your code doesn't change, carrying the technical debt over to the new platform. However, if your goal is to modernize the platform and eliminate lingering technical debt, it becomes imperative to define a target architecture that all vendors must adhere to, ensuring a level playing field. After all, as the saying goes, "Nothing changes if nothing changes."

5. The Myth of Industry Standard Benchmarks

To address the challenges mentioned earlier in devising meaningful benchmarks, the Transaction Processing Performance Council (TPC) introduced specifications for various data-centric benchmarks in 1994. While this appeared promising in theory, it ultimately failed to capture real-world complexities.

Consider the following limitations within these benchmarks:

  1. Uniform Data Distribution: This notion contradicts real-world data, which rarely exhibits perfect uniformity. (When have we encountered real-world data without any skew?)
  2. Fixed Query Count: Enforcing a fixed number of queries resembles knowing the exam questions in advance. (Do you only run a handful of known queries on your system all the time?)
  3. Sequential Execution Bias: This approach disregards the concurrency in real-world scenarios. (What about the intricacies of concurrent operations?)
  4. Limited Data Model: The benchmarks restrict themselves to 24 tables. However, real-world production data platforms often harbor over 100,000 tables across customers and industries.

Due to these factors, TPC benchmarks seem akin to a solved puzzle. The data model has remained stagnant since 1994. Vendors invest years fine-tuning their platforms to excel in TPC-DS benchmarks. Nonetheless, with the constrained number of tables and queries, we fall victim to the small numbers effect. Vendors can engineer specific index structures and optimizations to enhance the performance of TPC queries, masking the fact that real-world queries suffer from dire performance issues.
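To make limitation 1 concrete, here is a small sketch (assuming only NumPy is available; row counts, key counts, and the skew exponent are all illustrative assumptions) that contrasts the uniform keys a TPC-style data generator produces with the Zipf-like skew most production tables show. The hot keys that dominate real workloads simply never appear in the synthetic data.

```python
# Uniform "benchmark-style" keys vs. Zipf "real-world-style" keys.
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_keys = 1_000_000, 10_000

uniform_keys = rng.integers(0, n_keys, size=n_rows)    # synthetic benchmark data
zipf_keys = rng.zipf(a=1.3, size=n_rows) % n_keys       # skewed, production-like data

def top_share(keys: np.ndarray, pct: float = 0.01) -> float:
    """Fraction of all rows owned by the hottest `pct` of key values."""
    counts = np.sort(np.bincount(keys))[::-1]
    top_n = max(1, int(len(counts) * pct))
    return counts[:top_n].sum() / keys.size

print(f"uniform keys: top 1% of values hold {top_share(uniform_keys):.1%} of rows")
print(f"zipf keys:    top 1% of values hold {top_share(zipf_keys):.1%} of rows")

# On skewed data a handful of hot keys can dominate a join or group-by partition,
# producing stragglers that a uniform-distribution benchmark will never surface.
```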

Recommendations

Wondering what steps to take? Here are some practical tips:

  1. Personalize Your Benchmark: Your ideal benchmark is tailored to your unique data, operating environment, and future objectives.
  2. Unveil Deceptive Techniques: Watch out for the "smoke and mirrors" strategies highlighted earlier. Prioritize genuine data volumes, complexity, and concurrency by simulating real production scenarios.
  3. Set a Target Architecture: Craft a target architecture for your data platforms, then devise tests to assess their compatibility with this future state.
  4. Single Data Copy: Don't allow vendors to juggle multiple copies of the data; ensure all tests run against a single data copy for an authentic evaluation.
  5. Introduce Surprise Queries: Infuse ad-hoc queries during testing to gauge how the system handles unexpected workloads and reveal its adaptability (see the sketch after this list).
  6. Consider Total Cost of Ownership: Embrace a holistic perspective by incorporating the total cost of ownership in your assessment. Prioritize price/performance over merely throwing hardware at software issues.
  7. Evaluate Environmental Impact: Extend your benchmark criteria to encompass the environmental implications of the technology.
  8. Incorporate Non-Functional Attributes: Include non-functional aspects like ease of use, availability, fault tolerance, and agility in your evaluation matrix.
  9. Test Recovery Scenarios: Put systems through their paces by incorporating failure recovery scenarios. Purposefully induce node and component failures to evaluate graceful recovery.
  10. Anticipate Migration Complexities: Grasp the intricacies of migrating from your current solution to the new platform. This awareness is crucial for seamless transitions.
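To tie tips 2, 4, 5, and 6 together, here is a minimal, hedged sketch of a benchmark harness: concurrent sessions, ad-hoc queries mixed in unannounced, a single shared data copy, and a crude price/performance figure at the end. SQLite is only a stand-in engine and every constant is an assumption; swap in your platform's driver, your schema, and your real hourly rate.

```python
# Hedged harness sketch: concurrency, surprise queries, one data copy, cost/query.
import random
import sqlite3
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

KNOWN_QUERIES = [
    "SELECT COUNT(*) FROM sales",
    "SELECT region, SUM(amount) FROM sales GROUP BY region",
]
ADHOC_QUERIES = [  # injected unannounced to test adaptability (tip 5)
    "SELECT region, AVG(amount) FROM sales WHERE amount > 500 GROUP BY region",
]
CLUSTER_COST_PER_HOUR = 40.0   # hypothetical rate for price/performance (tip 6)
SESSIONS, QUERIES_PER_SESSION = 8, 25

def setup(db: str = "bench.db") -> str:
    """Create a toy single copy of the data (tip 4) that all sessions share."""
    conn = sqlite3.connect(db)
    conn.execute("DROP TABLE IF EXISTS sales")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    rows = [(random.choice("NESW"), random.random() * 1000) for _ in range(200_000)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    conn.close()
    return db

def run_session(db: str) -> list[float]:
    """One simulated user: mostly known queries, roughly 20% surprises."""
    conn = sqlite3.connect(db)
    latencies = []
    for _ in range(QUERIES_PER_SESSION):
        sql = random.choice(ADHOC_QUERIES if random.random() < 0.2 else KNOWN_QUERIES)
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        latencies.append(time.perf_counter() - start)
    conn.close()
    return latencies

if __name__ == "__main__":
    db = setup()
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=SESSIONS) as pool:   # concurrency (tip 2)
        sessions = list(pool.map(run_session, [db] * SESSIONS))
    wall = time.perf_counter() - wall_start
    lat = sorted(l for s in sessions for l in s)
    cost = CLUSTER_COST_PER_HOUR * wall / 3600
    print(f"p50 = {statistics.median(lat) * 1000:.1f} ms, "
          f"p95 = {lat[int(len(lat) * 0.95)] * 1000:.1f} ms, "
          f"cost/query = ${cost / len(lat):.6f}")
```

The same skeleton can be extended to replay captured production queries instead of the toy statements above, or to take a worker node down mid-run to exercise the recovery scenarios in tip 9.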

Are there any other tips you would add to the list?


If you like, please subscribe to the FAQ on Data newsletter and/or follow Fawad Qureshi on LinkedIn.


Sajid Abbas

Sr. Systems Engineer


Great article. I have a question though. To personalize a benchmark while keeping a single data copy, how practical is it to maintain one large, real test dataset with varied data quality? Context: different vendors need and operate on independent datasets. Also, if you have a huge dataset but its values are similar or uniform, then no matter how many trials you run, your test coverage is limited.

Franco Patano

Strategic Data and AI Advisor


What about open-sourced benchmark code that can be inspected and independently reproduced? In your view, do benchmarks provide any value? What about for consumption planning?

I think that's a credit to Mark Twain. So accurate in this case as well. As always, Fawad A. Qureshi, your perspective is right on.
