Don't write the fastest code
Yes, this is coming from a performance engineer
I came across this video on YouTube. It's getting pretty famous, but the claims under it might give you a bad impression of the actual impact of clean code practices on the performance of your application. Commenting under the video is disabled, and I might have an idea why.
The main takeaway of the video is: if you break all the clean code rules, you get 35x the performance. Is that a breakthrough in computer science, or have all the developers forgotten how to optimize their code? Is it a CPU manufacturers' conspiracy to make you buy faster hardware every time you use inheritance?
To be fair, those measurements are accurate, given this particular scenario. There are, however, a few minor tiny little details making these measurements garbage for around 95% of the software written to date and 99% of all web applications.
Here are a few points you should be aware of before you raise a performance risk as soon as you see the keyword "interface" or "virtual" anywhere in the code base.
1. The test case is a loop over a single array.
This is a subset of Data Oriented Architecture, where you optimize your code layout to reduce cache misses. Its most famous implementation is the Entity Component System, commonly used in physics simulations, heavy computational tasks, and near-realtime soft simulations - also known as video games.
The prerequisite for this performance gain is that your data is relatively static, and so is the sequence of operations you intend to invoke on it. That's an extremely rare case for a web service - I have not seen a single web application capable of that.
The creator claims that the difference between 37 and 35 cycles was caused by an "L3 cache hit", a "cache warmup", or a "branch predictor warmup". On an Intel CPU, an L3 cache fetch alone costs you 30-60 cycles - which means all the calls were already in the L1-L2 domain, and an L3 hit can't explain a 2-cycle difference.
https://www.7-cpu.com/cpu/Haswell.html
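For context, the kind of benchmark in question looks roughly like this - a hypothetical reconstruction, not the video's actual code - contrasting per-element virtual dispatch with a tight loop over a flat primitive array:

```java
// Hypothetical reconstruction of the benchmark's shape, not the video's code.
interface Shape {
    double area();
}

final class Square implements Shape {
    final double side;
    Square(double side) { this.side = side; }
    public double area() { return side * side; }
}

public class DispatchStyles {
    // "Clean code" version: every element pays a virtual call.
    static double totalAreaVirtual(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) total += s.area();
        return total;
    }

    // Data-oriented version: one flat primitive array, trivially prefetchable.
    static double totalAreaFlat(double[] sides) {
        double total = 0;
        for (double s : sides) total += s * s;
        return total;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(2), new Square(3) };
        double[] sides = { 2, 3 };
        System.out.println(totalAreaVirtual(shapes)); // 13.0
        System.out.println(totalAreaFlat(sides));     // 13.0
    }
}
```

The second loop wins precisely because the data is one contiguous block of primitives and the operation is fixed up front - which is exactly the prerequisite discussed above.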
2. The test is single-threaded, and the memory footprint is extremely low
Again, a blessing for CPU cache hits - there's nothing else disturbing the cache's work, and the CPU stays busy.
A standard web application's heap size is measured in GBs, and its thread count is well over a few hundred. A standard web application is bound to use L3 and RAM fetches all the time - and that's a feature. An L3 fetch is around 60-70 cycles; a RAM page fetch is roughly the same. For a scenario where you can't predict the method or the class you'll be invoking, add that to the cost of your method execution.
Knowing that, suddenly a difference between 1 and 35 cycles per function call is not that much of a problem, right?
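A quick back-of-envelope illustration, using the cycle figures quoted above - the number of cold fetches per call is my own assumption, purely for illustration:

```java
// Back-of-envelope math using the cycle figures quoted above.
// The number of memory fetches per call is an assumption, not a measurement.
public class OverheadEstimate {
    public static void main(String[] args) {
        double dispatchOverhead = 35;  // worst-case extra cycles per virtual call
        double coldFetches = 200;      // assumed L3/RAM fetches done by the call's body
        double cyclesPerFetch = 65;    // L3 fetch cost from the numbers above
        double bodyCost = coldFetches * cyclesPerFetch;  // 13,000 cycles
        double relative = 100 * dispatchOverhead / (bodyCost + dispatchOverhead);
        System.out.printf("Dispatch overhead: %.2f%% of the call%n", relative); // ~0.27%
    }
}
```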
3. Performance benchmarks for languages don't lie
Java is famous for designing its architecture around interfaces - everything is an interface there. If the vtable overhead alone told the story, Java should be around 15 to 35 times slower than C, yet the performance difference between C and Java for most algorithms is... negligible? Is the C language implicitly using virtual tables under the hood? (Closer to the opposite: modern JIT compilers devirtualize and inline monomorphic interface calls, so most of them never pay the vtable price at all.)
Looking at the comparison between the two, I can't see anything standing out that much - perhaps you've seen something different?
This only shows that the examples presented in the video are extreme edge cases. They are known to developers - and have been known for a while now - but they're not used, for a reason.
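Don't take my word for it either - measure it on your own JVM. Here's a minimal JMH sketch (the Op interface and benchmark names are mine, purely illustrative); on a monomorphic call site like this, the JIT typically inlines apply() and the two results converge:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class DispatchBench {
    interface Op { long apply(long x); }

    static final class AddOne implements Op {
        public long apply(long x) { return x + 1; }
    }

    Op op = new AddOne();  // interface-typed field: a candidate virtual call
    long value = 42;

    @Benchmark
    public long viaInterface() { return op.apply(value); }  // through the interface

    @Benchmark
    public long direct() { return value + 1; }  // no dispatch at all
}
```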
4. Your APM/Observability doesn't show that?!
Like... what? We have code that's 35 times slower than it should be, and yet your APM complains about some silly slow database call? You've got enough CPU capacity, but your calls are slow because you've used a list instead of a hashmap, and now you do a full list scan to find your record? You're allocating new objects that should have been static in the first place? Your concurrency implementation takes too many locks, creating lock contention? You've implemented an O(n²) function and your datasets only keep growing?
There are way more expensive operations in your application right now, and your monitoring tools should help you find them.
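To make one of those rhetorical questions concrete - the list-instead-of-hashmap one - here's a sketch (class and method names are illustrative, not from any real codebase):

```java
// A full list scan versus a hash lookup - the kind of cost that dwarfs
// any vtable overhead. Names here are illustrative.
import java.util.*;

public class LookupDemo {
    record User(long id, String name) {}

    // O(n) per lookup: scans the whole list to find one record.
    static User findByScan(List<User> users, long id) {
        for (User u : users) {
            if (u.id() == id) return u;
        }
        return null;
    }

    // O(1) per lookup: index the users by id once, then reuse the map.
    static Map<Long, User> indexById(List<User> users) {
        Map<Long, User> byId = new HashMap<>();
        for (User u : users) byId.put(u.id(), u);
        return byId;
    }
}
```

With a million users, each findByScan call is up to a million comparisons and cache misses; the map makes it a single hash lookup.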
The first step before starting any optimization is to measure things right - and to visualize what your biggest pain point is. A function call cost of 35 cycles may sound enormous and "horrible" at first, yet it's almost never your biggest problem, and you have a long way to go optimizing I/O and algorithms before you even start noticing it. Focus on what's important, prioritize optimizations based on their actual, measurable impact, and you probably won't have to worry about the slight overhead of a virtual table.
If you're a game developer, however, it's a really good piece of advice and a good place to start. I strongly recommend "Game Engine Architecture" by Jason Gregory - it covers all the benefits of Data Oriented Architecture, including its footprint on CPUs and memory.