The problem with benchmarks
A brief intro: this is the reconstructed remnants of a blog I originally wrote while I was a Developer Advocate at Starburst. It was a response to some questionable benchmark-based marketing being done by one of our smaller competitors, but it never saw the light of day. Now I find myself at Firebolt where I’m running and soon to be publishing some of our own benchmarks, and I feel like there’s still a lot to be said about why benchmarks are bad… but good. They’re evil… but useful. All of what’s about to follow has been bouncing around in my head a lot over the past couple of weeks, so it’s helpful to put it into words once more.
The blog I’m writing for Firebolt will go into some detail on how we tried to avoid the biggest pitfalls that can limit the relevance of a benchmark, but before that goes out into the world, I’d like to start with my own personal piece about, well… how benchmarks can suck.
There are four key issues with benchmarks:
1. There’s always an agenda
2. They lack context
3. We’re all gullible
4. Performance isn’t everything
You should go grab some tea or coffee, and let’s talk about each one.
There’s always an agenda
If you want to support an argument with data, you can. At certain scales of data, driving a truck loaded to the brim with hard drives back and forth across the country is the fastest way to transfer that data. In certain environments, the fastest approach is a cell network that bounces your data up into orbit and then brings it back down again. At other times, it may be an ethernet cable running across your apartment. If you’re willing to disregard context, you can create some pretty graphs and make a convincing case that trucks, satellites, or cables are the fastest way to move data. And if you want to make a scientific argument that jelly beans cause acne, you can make that happen too.
P-hacking is not the same thing as generating a benchmark, but the practice doesn’t have to be much different. If you set out to prove something with a benchmark, you will be able to do it. There is some scenario, some configuration, some arrangement of inputs and outputs that can prove anything will run a benchmark faster than anything else. It may need to be extremely contrived, but you can do it.
On top of that, benchmarks are known to exist. Software competing on performance is being built by people who know that their software is being judged by benchmarks. It is not an uncommon practice to add optimizations that specifically speed up tasks or queries that exist in well-known, public benchmarks. These optimizations may benefit real users, but they just as frequently may not, making the benchmark performance look better than it ever will in reality thanks to hyper-specific optimizations that are only ever invoked when running Industry Standard Benchmark Query #17. It’s akin to cheating on a test without learning the material.
Now, not every benchmark is this way. Not everyone running a benchmark is setting out to mislead you, and not every system is trying to cook the books by optimizing for benchmarks in niche and impractical ways. But both of these behaviors are normal enough, and they’re important to keep in mind as a possibility when you see numbers and charts that are trying to convince you of something.
They lack context
Even when every step of the way has been honest and taken with the best intentions, the reality is that bias is everywhere. I’m going to briefly pick a fight with a very specific piece that annoyed me into writing this blog in the first place: a post that compares StarRocks to Trino, which I’m not going to link to, but which I’m sure you can go find if you’re truly curious. It has some gems:
If you want to make an informed comparison, and you decide the best way to learn about one thing is to have its engineers tell you about it, you might also realize that a website and some Google searching won’t give you the same depth of answers about the other thing. The asymmetry is impressive. It gets worse later on:
Yes, that’s an entire, standalone paragraph. Tallying up algorithms is a… unique… approach to claiming technological superiority. But it’s a number, and if you throw it in a chart, it looks like a data-driven argument.
Now, most benchmarks and comparison pieces are not this devoid of context. In a perfect world, the individuals publishing any kind of comparison, benchmark or otherwise, have some real technical background knowledge and deeper understanding of what they’re talking about. But even when you have that knowledge, there’s always going to be some asymmetry. You know more about the companies you work for. You know more about the tools you’ve used most recently. You know more about the stuff you care about and are invested in.
And in the context of a benchmark, any bias can be insidious. It could influence how you set up your data, how you set up your tests, what performance tweaks you’re making to each system, or how you go about running the benchmark. There’s a decent number of arbitrary decisions that go into benchmarking, and when it’s time to start picking numbers out of a hat, you’re going to be influenced by what you know. What’s reasonable? What’s standard? What’s the most common use case? What’s this thing normally going to look like? Answers to these questions are going to be influenced by your experience, your knowledge, and what you’ve worked with more.
Worse, even if you didn’t set out to be deceptive from the start, when choices start feeling arbitrary, you’ll find easy opportunities to default to the paths that best support the point you’re hoping to make. The siren song is tempting, especially if you’re getting paid to sing that same song yourself.
So even when a benchmark is being set up and run with the best intentions, it may still be lying to you. Not because the person who made it is evil or deceptive, but because bias is everywhere. There’s no way to avoid it, and it may make that graph look substantively different than if someone else had been doing the work.
Maybe you think we can eliminate bias. By testing different software out of the box, you mitigate the effects of imbalanced expertise. But not every system is designed to run perfectly right out of the box. Configuration and optimizations are a big part of deploying a complex system, so should you even care about how they perform out of the box? Maybe it comes out of the box set up to be the most generally useful, but is the benchmark you’re running the general use case? What if different systems have different interpretations of what constitutes “general” use? Are they using the same hardware? Are they designed to use the same hardware? Is that hardware the hardware you’ll use them on?
In trying to approach fairness, you start to lose practical meaning. The more neutral and devoid of context a benchmark is, the less biased it is, but the less attached to reality it is, too. Testing software in a sterile environment doesn’t reflect how it actually gets deployed and used. Yes, you can try to use industry standards and the most common hardware to chase reality, but how often have you worked at a company where every part of your tech and data stack was some standard, common setup you could find anywhere else?
We’re all gullible
“Ok,” you say, “But I know all this. I’ll approach benchmarks with skepticism and cynicism to avoid being fooled. I’ll look for the benchmarks done by neutral third parties that have as little bias as possible. I know how to look at data and take it with a grain of salt.” That’s not a bad attitude! But how much do you trust those neutral third parties to approach everything with the same diligence? How equitable will that diligence be? Are there inherent differences in the things they’re testing that make an attempt at neutrality ineffectual? Is someone paying them to make and publish that benchmark?
It’s too easy to miss someone else’s mistake or oversight. There are no objectively correct answers to the nuances of choosing the right hardware, the right scale, or the right configuration. The complexity involved in setting up different technical systems to compete in a true, perfectly apples-to-apples comparison means it’s often not practical to do so.
The other big thing is that if the people making the benchmarks don’t have the answers to everything, you don’t, either. If the most well-intentioned, expertly-crafted benchmark can’t be free from bias or favoring one of the things it’s testing, you’re not free from bias, either. Your brain really wants to be right, and if you have any preconceived notions of which thing is better, you might find a way to let confirmation bias interpret the evidence in a way that favors that notion. If you don’t have any prior opinion, a misleading benchmark may lead you down the path towards favoring the wrong technology. Your prior knowledge may give you insight into how to interpret it, but what if you’re adjusting for the same thing the benchmark runners adjusted for, meaning it’s an over-adjustment?
All this is to say - it’s messy. It’s hard to derive the truth from benchmarks. Constructing truth is difficult, interpreting truth is difficult, and when there are opportunities for deception both intentional and unintentional, it can be a mess.
Performance isn’t everything
The reality is that there are things that matter more than performance. Whether that’s necessary security and privacy features, critical integrations, or availability on various platforms, it hardly matters how fast something runs if you can’t get it running in the first place. You should always ask yourself how much time and effort it will take you to make something operational, then compare that to how much time you expect improved performance to save you. If the migration from one system to another takes a mountain of engineering to make happen, an advantage on a benchmark is going to need to be impressive to justify making that migration.
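To make that comparison concrete, here’s a minimal back-of-the-envelope sketch in Python. Every number in it is a hypothetical placeholder, and it deliberately ignores real costs like licensing, retraining, and risk; the point is just the shape of the trade-off between migration effort and ongoing time saved.

```python
# Back-of-the-envelope: does a benchmark win justify a migration?
# Every number here is a hypothetical placeholder -- substitute your own estimates.

migration_effort_hours = 800      # engineering hours to stand up and migrate to the new system
monthly_compute_hours = 400       # compute-hours your current workload burns each month
speedup = 1.4                     # claimed speedup, e.g. the new system is 1.4x faster on your workload
blended_hourly_rate = 5.0         # rough $/hour applied to both compute and engineering time

# Compute-hours and dollars saved each month if the speedup holds for your workload
saved_hours_per_month = monthly_compute_hours * (1 - 1 / speedup)
saved_dollars_per_month = saved_hours_per_month * blended_hourly_rate

# Months until the migration effort pays for itself
months_to_break_even = (migration_effort_hours * blended_hourly_rate) / saved_dollars_per_month

print(f"~{saved_hours_per_month:.0f} compute-hours (~${saved_dollars_per_month:.0f}) saved per month")
print(f"~{months_to_break_even:.1f} months to break even on the migration")
```

With those made-up numbers, a 1.4x speedup pays off an 800-hour migration in about seven months; shrink the speedup or grow the migration and the math gets ugly fast.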
So then what?
If you’ve read this far, you might think that I’m trying to convince you that you should avoid benchmarks entirely. This isn’t the worst advice of all time, though it’s bad advice. Time to walk things back.
Performance isn’t everything, but it does matter! A lot! You get compounding benefits from more performance. Saving time doesn’t just mean saving time - it also means saving money. In a world where compute costs are generally billed as consumption/time, a more performant system can do the same amount of work in less time, which in turn means lower bills. Anyone who’s used to waiting on things to complete waits less, context switches less often, and has their efficiency boosted. So how do you evaluate performance?
Look for stories
Because I worked on Trino for a while, I saw a lot of people who talked about their performance improvements when switching from Hive to Iceberg. On the extreme end, I saw stories where queries using the same engine took literally 1% of the time they took previously. On the modest end, the performance gains were closer to 10-15%. The trend was consistent, though, and it led me to a conclusion that I’m comfortable broadcasting: Iceberg is faster than Hive in production. I didn’t need official, neutral benchmarks to learn that - narratives from real companies comparing real workloads piled up to make a compelling case.
Ideally, you should be able to find similar stories for whatever you want to do. If you’re having performance issues with your current tech stack, others have likely had those same issues. And if you’ve found some alternatives that are promoting themselves as the newer, faster version of what you’re currently using, others have done that, too. See if there are talks, blogs, or stories recounting the swap from the old to the new. Did it work? It’s an anecdote, but the more anecdotes you can compile, the more you understand the real-world trend.
Try it yourself
Proof-of-concepts and bake-offs are not new ideas, but they exist for a reason. When you test a system with your own workload, you learn how fast it will be able to accomplish that work. You can bypass all the mess of generic benchmarks, acts of deceit, and any other factors that separate existing benchmarks from the performance that you care about. This isn’t always easy or cheap to do, particularly depending on what you’re testing, but it is the most accurate approach.
And though cost may be a concern, if you’re operating at such a scale that you need to prioritize “as performant as possible” in order to minimize compute costs, you’re probably also operating at a scale that allows you to run your own tests instead of relying on benchmarks. If you truly care about performance and are trying to squeeze out every last ounce that you can, you should be willing to run your own tests instead of trusting some benchmark that doesn’t represent your data, your hardware, and your use case.
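If you do go down this road, the harness itself doesn’t need to be elaborate. Below is a minimal sketch of the idea in Python: run your own queries against each candidate system a few times and compare the medians. The run_on_system_* functions are stand-ins for whatever client library each system actually exposes (a Trino or Firebolt cursor, a REST call, and so on); the stubs here just sleep so the sketch runs on its own.

```python
import random
import statistics
import time

def time_query(run_query, query, runs=5):
    """Run a query several times and return the median wall-clock seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query(query)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-ins for real client calls. A real version would execute the query
# against each system; these stubs ignore it and sleep for a random interval
# so the sketch is runnable as-is.
def run_on_system_a(query):
    time.sleep(random.uniform(0.05, 0.10))

def run_on_system_b(query):
    time.sleep(random.uniform(0.03, 0.08))

# Use your own production queries here, not a generic benchmark suite.
workload = ["SELECT ...", "SELECT ...", "SELECT ..."]

for name, runner in [("system_a", run_on_system_a), ("system_b", run_on_system_b)]:
    total = sum(time_query(runner, q) for q in workload)
    print(f"{name}: {total:.2f}s total (sum of per-query medians over {len(workload)} queries)")
```

The important part isn’t the timing loop, it’s the workload: the closer those queries, that data, and that hardware are to what you actually run, the more the numbers mean.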
Use benchmarks
And you know, sometimes, you should look at benchmarks. They’re imperfect, but they are a useful tool. Don’t let perfect be the enemy of good. Some benchmarks are lies. Many are flawed or skewed or don’t accurately represent reality. There is always a margin of error. But if a neutral party can show you that something is consistently faster than its alternative across a variety of scenarios or scales, that’s a meaningful datapoint. Take it with a grain of salt, but take it. The key takeaway here should be to be wary of benchmarks, not to disregard them entirely.
And look, I’m working on an assortment of benchmarks comparing Firebolt to a bunch of cloud data warehouses. That’s not because I’m a hypocrite who can’t follow my own advice - it’s because they are genuinely useful. They have flaws, issues with bias and partiality, and there’s no truly fair way to compare technologies to each other in a vacuum on even footing. But despite these flaws, benchmarks still are one of the most scientific ways to get towards the truth of how different systems perform.