Don't write the fastest code
Yes, this is coming from a performance engineer
I came across this video on YouTube. It's getting pretty famous, but the claims under it might give you a bad impression of the actual impact of clean code practices on the performance of your application. Commenting under the video is disabled, and I might have an idea why.
The main takeaway of the video is: if you break all the clean code rules, you get 35x the performance. Is that a breakthrough in computer science, or have all the developers forgotten how to optimize their code? Is it a CPU manufacturers' conspiracy to make you buy faster hardware every time you use inheritance?
To be fair, those measurements are accurate, given this particular scenario. There are, however, a few minor tiny little details making these measurements garbage for around 95% of the software written to date and 99% of all web applications.
Here are a few points you should be aware of before you raise a performance risk as soon as you see the keyword "interface" or "virtual" anywhere in the code base.
1. The test case is a loop over a single array.
This is a subset of Data Oriented Architecture, where you optimize your code layout to reduce cache misses. Its most famous implementation is the Entity Component System, commonly used in physics simulations, heavy computational tasks, and near-realtime soft simulations - also known as video games.
The prerequisite for this performance gain is that your data is relatively static, and so is the sequence of operations you intend to invoke on it. That's an extremely rare case for a web service - I have not seen a single web application capable of that.
The creator claims that the difference between 37 and 35 cycles was caused by an "L3 cache hit", a "cache warmup", or a "branch predictor warmup". On an Intel CPU, an L3 cache fetch alone costs you 30-60 cycles - which means all the calls were already in the L1-L2 domain, and an L3 hit can't explain a 2-cycle difference.
https://www.7-cpu.com/cpu/Haswell.html
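For context, the kind of benchmark in question looks roughly like this - a hypothetical reconstruction, not the video's actual code - contrasting per-element virtual dispatch with a tight loop over a flat primitive array:

```java
// Hypothetical reconstruction of the benchmark's shape, not the video's code.
interface Shape {
    double area();
}

final class Square implements Shape {
    final double side;
    Square(double side) { this.side = side; }
    public double area() { return side * side; }
}

public class DispatchStyles {
    // "Clean code" version: every element pays a virtual call.
    static double totalAreaVirtual(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) total += s.area();
        return total;
    }

    // Data-oriented version: one flat primitive array, trivially prefetchable.
    static double totalAreaFlat(double[] sides) {
        double total = 0;
        for (double s : sides) total += s * s;
        return total;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(2), new Square(3) };
        double[] sides = { 2, 3 };
        System.out.println(totalAreaVirtual(shapes)); // 13.0
        System.out.println(totalAreaFlat(sides));     // 13.0
    }
}
```

The second loop wins precisely because the data is one contiguous block of primitives and the operation is fixed up front - which is exactly the prerequisite discussed above.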
2. The test is single-threaded, and the memory footprint is extremely low
Again, a blessing for CPU cache hits - there's nothing else disturbing the cache's work, and the CPU stays busy.
A standard web application's heap size is measured in GBs, and its thread count is well over a few hundred. A standard web application is bound to use L3 and RAM fetches all the time - and that's a feature. An L3 fetch is around 60-70 cycles; a RAM page fetch is roughly the same. For a scenario where you can't predict the method or the class you'll be invoking, add that to the cost of your method execution.
Knowing that, suddenly a difference between 1 and 35 cycles per function call is not that much of a problem, right?
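A quick back-of-envelope illustration, using the cycle figures quoted above - the number of cold fetches per call is my own assumption, purely for illustration:

```java
// Back-of-envelope math using the cycle figures quoted above.
// The number of memory fetches per call is an assumption, not a measurement.
public class OverheadEstimate {
    public static void main(String[] args) {
        double dispatchOverhead = 35;  // worst-case extra cycles per virtual call
        double coldFetches = 200;      // assumed L3/RAM fetches done by the call's body
        double cyclesPerFetch = 65;    // L3 fetch cost from the numbers above
        double bodyCost = coldFetches * cyclesPerFetch;  // 13,000 cycles
        double relative = 100 * dispatchOverhead / (bodyCost + dispatchOverhead);
        System.out.printf("Dispatch overhead: %.2f%% of the call%n", relative); // ~0.27%
    }
}
```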
3. Performance benchmarks for languages don't lie
Java is famous for designing its architecture around interfaces - everything is an interface there. If the vtable overhead alone told the story, Java should be around 15 to 35 times slower than C, yet the performance difference between C and Java for most algorithms is... negligible? Is the C language implicitly using virtual tables under the hood? (Closer to the opposite: modern JIT compilers devirtualize and inline monomorphic interface calls, so most of them never pay the vtable price at all.)
Looking at the comparison between the two, I can't see anything standing out that much - perhaps you've seen something different?
This only shows that the examples presented in the video are extreme edge cases. They are known to developers - and have been known for a while now - but they're not used, for a reason.
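Don't take my word for it either - measure it on your own JVM. Here's a minimal JMH sketch (the Op interface and benchmark names are mine, purely illustrative); on a monomorphic call site like this, the JIT typically inlines apply() and the two results converge:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class DispatchBench {
    interface Op { long apply(long x); }

    static final class AddOne implements Op {
        public long apply(long x) { return x + 1; }
    }

    Op op = new AddOne();  // interface-typed field: a candidate virtual call
    long value = 42;

    @Benchmark
    public long viaInterface() { return op.apply(value); }  // through the interface

    @Benchmark
    public long direct() { return value + 1; }  // no dispatch at all
}
```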
4. Your APM/Observability doesn't show that?!
Like... what? We have code that's 35 times slower than it should be, and yet your APM complains about some silly slow database call? You've got enough CPU capacity, but your calls are slow because you've used a list instead of a hashmap, and now you do a full list scan to find your record? You're allocating new objects that should have been static in the first place? Your concurrency implementation takes too many locks, creating lock contention? You've implemented an O(n²) function and your datasets only keep growing?
There are way more expensive operations in your application right now, and your monitoring tools should help you find them.
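To make one of those rhetorical questions concrete - the list-instead-of-hashmap one - here's a sketch (class and method names are illustrative, not from any real codebase):

```java
// A full list scan versus a hash lookup - the kind of cost that dwarfs
// any vtable overhead. Names here are illustrative.
import java.util.*;

public class LookupDemo {
    record User(long id, String name) {}

    // O(n) per lookup: scans the whole list to find one record.
    static User findByScan(List<User> users, long id) {
        for (User u : users) {
            if (u.id() == id) return u;
        }
        return null;
    }

    // O(1) per lookup: index the users by id once, then reuse the map.
    static Map<Long, User> indexById(List<User> users) {
        Map<Long, User> byId = new HashMap<>();
        for (User u : users) byId.put(u.id(), u);
        return byId;
    }
}
```

With a million users, each findByScan call is up to a million comparisons and cache misses; the map makes it a single hash lookup.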
The first step before starting any optimization is to measure things right - and to visualize what your biggest pain point is. A function call cost of 35 cycles may sound enormous and "horrible" at first, yet it's almost never your biggest problem, and you have a long way to go optimizing I/O and algorithms before you even start noticing it. Focus on what's important, prioritize optimizations based on their actual, measurable impact, and you probably won't have to worry about the slight overhead of a virtual table.
If you're a game developer, however, it's a really good piece of advice and a good place to start. I strongly recommend "Game Engine Architecture" by Jason Gregory - it covers all the benefits of Data Oriented Architecture, including its footprint on CPUs and memory.