Mentor low Latency C or C++ code

I received a request to mentor low latency C/C++ code from several readers as a result of my last blog post "C++ faster than C# depends on the coding idom"  I love to mentor but thought it would help a larger group if I shared the some of the basic rules in a separate blog post. I use these techniques for our Stock and Forex trading prediction engines but they are equally useful in most domains. This is a huge topic area easily large enough to fill several books so please forgive items I missed.   

I greatly appreciate your comments since they are often the genesis of my next post.  An updated version of Mentor Low Latency C or C++ code. is available on my Blog or contact me directly.

It is easy to tell people how make high performance systems but it requires years of focus to gain the intuition about where to look and how to think. Feel free to contact us as we provide this kind of help for teams who can not afford the time needed to grow their own gurus. 

I tend to focus on performance at two levels. The first is getting the maximum speed per CPU core. The second is using clever distributed architecture to allow more CPU cores to participate in the solution.  I have found that adequate focus on maximizing performance per CPU core can allow a single CPU to do much more work. This can delay the need for the next set of sophisticated distributed processing techniques. Reducing the number of distributed components can reduce code volume and complexity enough to pay for the cost of local optimization several times over. It is fairly common to see these optimizations increase the work capacity of a single multi-core CPU by at least 10 times which effectively removes the need to break that work among 10 computers.  It really is a question of pay now or pay a lot more not very much latter. *1 

Low latency and high performance are two separate but closely related domains. In low latency systems what most people actually mean is latency that is reliably below an acceptable threshold.   

Early in my career I worked on near real-time systems where latency of CPU availability was critical.  We were gathering telemetry coming in from high bandwidth radio and microwave links. If the CPU was blocked and didn't respond we could loose several frames of data. Since we were stress testing billion dollar machines loosing data when a critical failure occurred could become a multi-million dollar cost and require re-running the test.  This is a good example of when fixed or bounded latency was critical. High performance was important but performance was mostly a tool we used to help meet the bounded latency requirements.

Why I focus on performance and latency

A large fraction of our income comes from clients who have systems that are responding too slow (high latency) normally caused by poor performance.  Most of the industry is living by the maximum "CPU power is increasing so fast that we don't have to worry about performance". One of the most common quotes is "premature optimization is evil" which they interpret as "don't worry about performance until it has manifest as a problem".  

 I disagree with this approach largely because we have major companies hiring us to fix problems in systems were they have invested millions of dollars and are on the verge of cancellation because the team has struggled for months or years without reliably delivering sufficient performance.

I regularly meet very talented executives and senior managers who worked for decades to reach a position of authority and are at risk of loosing their jobs and their credibility due to a poorly performing system. They have not been able to fix the problems despite multiple attempts over an extended period of time.   I may be excessively focused on performance but it is mostly because I have seen lack of focus cause good people harm. 

I fully agree that delivering code capable of solving the business problem as early as possible in the project is fundamental to success. Where I diverge is that when you have code in a working state for medium sized modules or services. You need to determine the highest transaction rates and largest data volumes needed by the production loaded system. Then you need to demonstrate the ability of the module or service to meet at least 2 times the production performance requirement under sustained tests. The critical aspect is testing with data volumes larger than the production system data size.  

The most disruptive and most common failure I encounter is engineers testing with small amounts of data who are seeing good performance and then having the system fail to perform well at production data volumes.   

High caliber engineers find this incredibly embarrassing and make it a career goal to never have one of their services fail to perform adequately when tested at peak production loads using full sized data.  If this is caught by QA or even worse in production the high caliber engineer views it as a personal failure.

Engineers are only partially at fault because their managers or site ops quite often refuse or are unable to budget the hardware and data needed for this level of test.  This has to be imposed on the culture from the senior VP or CTO and should become a significant aspect of the bonus goals for Senior Directors, Directors, Senior managers and Product managers.
The second most common failure is engineers testing in a LAN where they have sub 1ms ping times and then having the application run very slow when deployed in the real world. 

Full scale testing should be done before the module, service or feature is considered complete and before a new feature can be taken off the scrum board.  This is a tax on the development but it should be invested in throughout the life cycle of the product. The only exception is when you string all the distributed components together early in the RAD cycle to test the work flow.  

Developers may resist adopting this process but once they experience it for  a couple years the better developers become advocates and enforcers because they see where it saved them late night crisis sessions. 

High performance focus points:

 Listed in highest priority focus first.  

  1. Seek to obtain latency requirements measured as business units of functionality delivered per second. 
  2. Seek to maximize utilization of available cores while minimizing contention for resources then we
  3. Utilize more cores from distributed systems while minimizing overhead from moving data between these systems. 
  4. Always test latency with 2 to 3 time production data sizes.
  5. Always test latency with at least 150% worst case predicted loads.
  6. Always test latency with at least 100% worst case predicted production loads with 50% of distributed components in failed state.
  7. Always test with increasing loads and increasing data sizes until the system fails or fails to meet performance requirements. 
  8. Latency has a non trivial cost. The lowest latency is the local CPU register, followed by Level 1 CPU cache, Level 2 CPU cache, Main RAM, Virtual RAM High performance local secondary storage, LAN accessible resources and WAN or internet resources. The speed of access at each level can easily be 10X slower than the next more closer point and I have seen it over 1,000 times slower. The more times you cross these boundaries the slower things get. If you have serial dependence on remote access it can rapidly destroy the performance of otherwise well designed systems. 

Low latency focus points:

  1. Define maximum acceptable latency for modules of code.
  2. Measuring and profiling to determine risk points of latency goal violation.
  3. Breaking the work up into smaller units to fit the latency goals.
  4. Removing variable latency that could cause violation of the latency goals.
  5. Testing at the extreme limits of the system to identify Latency violation culprits.
  6. Improving performance of latency risk code so it stop being a risk. 
  7. Defer non-critical processing as needed to meet the latency requirements.
  8. Reserving cores to do the high latency work while allowing other cores to focus on the core processing requirements. Especially important when context switching costs are a substantial fraction of the latency response window. 
  9. Adding queuing and other technical tricks to allow non critical work to be deferred to allow shifting resources to latency critical responses.
  10. Adding clever distributed architecture and multiple processor or multiple core systems to increase resource availability. Shares minimizing dependence on blocking shared resources with high performance code.
  11. Always test with increasing loads and increasing data sizes until the system fails or fails to meet latency requirements by 200%. Quantify and document this curve. 
  12. Memory management techniques are critical especially in garbage collected systems because garbage collection imposes variable delays that are quite often outside the control of the engineer but even in GC systems engineers can control the burden they place on the GC and sometimes they can choose when the GC will run preemptively so they can cause the GC delay to occur when it is less likely to cause latency violations. 
  13. In distributed systems measure latency at the client for every distributed system boundary where resources are accessed via the LAN, Storage or distributed storage subsystems. Buy or build the infrastructure to measure this across thousands to millions of requests end to end so you can identify latency hots spots. This more than any other technique can help focus resources for maximum ROI. 

Latency versus performance are focus points:

As you can see the axis of work focus for high performance can be quite different than the work you do for low latency. The caveat is that if your code is sufficiently high performance then it can respond so fast that you meet the latency requirements without additional work.   The two problem domains tend become blurred because the causal factors overlap.  I think it is important to understand which factor is the primary driver but it is even more important to identify when sloppy poor performance code is the cause of the latency issues. 

C vs C++:

I am currently a better C programmer than C++ although I used to teach OOA/OOD and OOP in C++.  I state this explicitly because many of the optimization techniques I use were pioneered by the guru veterans who wrote the core of operating systems in C and assembler. I was fortunate to learn from one of those veterans who was the primary engineer for a real-time operating system used by NASA.  Unfortunately most of these veterans are now retired but they piqued my interest and it has remained highly entertaining throughout my career. 

I separate C from C++ even because while it is possible to obtain C level performance from C++, the most common Object Oriented patterns used with C++ "Idiomatic C++" tends to create designs very similar to those in Java. While this can simply thinking about the problem domain it also tends to produce systems that violate several of the rules we follow to produce high performance low latency systems. This conceptual approach ends up woven throughout the system in such extensive fashion that it can be labor intensive to fix. This is not a problem with the C++ language or compiler but rather the way people are trained to think about using language to express their designs. I will admit that I spent several years teaching these OOA, OOD and OOP techniques so I have certainly contributed to the problem. The OOP techniques are still valuable but you need to understand the trade-offs especially in high performance / low latency systems. 


Basic High performance rules

  • Test profile and retest
  • Avoid memory allocation wherever possible and Especially avoid lots of small allocations.
  • Double avoid small allocations that are then thrown away quickly. This is the death nail for performance in idiomatic C++, Java and idiomatic Scala is even worse.
  • If you must allocate memory then allocate larger chunks and reuse them whenever possible.
  • Avoid copying where possible. 
  • Sometimes it is faster to scan to end so you know length rather than adopt a dynamic or growing array or to use reallocate. Sometimes it is dozens of times faster. This is more true with modern file cache systems where re-reading a file stream from disk is at nearly RAM speed anyway. 
  • Use as little indirection as possible every pointer->pointer->pointer access has a cost and these really add up in tight loops.  It is even worse when the indirection causes an increase in level-1 CPU cache misses which can dramatically slow things down. You may see me copy things into local variables to avoid the indirect look-ups.  I actually learned this trick in Node.js and python code but found that it had a nearly linear impact in C as well.
  • Think about how many registers your CPU has and size the method complexity wherever possible to maximize A) Keeping things loaded in the registers B) keeping things loaded in level 1 CPU cache, C) keeping things loaded in level 2 cache.  This will yield benefits even when you change CPU because every CPU always has a limited number of registers and level 1 cache.
  • When possible keep your tight loops simple so the compiler can take advantage of the CPU vector operations that quite often run 10X faster. The vector operation consideration is kind of new because the CPU’s have only gotten good at them over the last few years. It has subtle impact because in the past we may add an extra bit of logic to a loop to avoid the loop overhead again with the new vector operations you may choose to use 3 methods that scan the array multiple times because if you gain 10X in basic speed you can scan up to 9 times and still have a speed advantage. The next generation of optimizers are likely to scan the same array simultaneously with multiple cores which will have a profound impact on net performance.
  • CPU pre-fetching features favor arrays where the values to be operated on are sequentially stored in in contiguous memory.  When you use pointer->pointer->pointer style indirection in tight loops the pre-fetching features cannot deliver much benefit.  Complex pointer to pointer structures such as AVL trees can also effectively disable the benefits from pre-fetch. 
  • Think hard about how you can coerce what would naturally be a more complex structure such as AVL tree, linked list or HASH as a simple linear array.   If you can twist the design to allow this it can run dozens of times faster.   This is an area where C++programmers especially those that started with Boost are at a disadvantage because they never learned a critical part of the engineering for speed process and many never learned how to think about their own clever data structures. Some of them don’t even know they are doing anything wrong and will fight to the death to protect that worldview.
  • Learn the semantics of the C restrict keyword and use it wherever possible. It enables a set of compiler optimizations that make it more likely the compiler will choose to use the advanced CPU vector operations. 
  • Avoid re-computing things especially where the lookup cost Is less than 1/3 the compute cost. This is offset by cost of RAM so it is a balancing act that requires testing. Deeply structure object calls tend to be terrible culprits in this area. 
  • Avoid sorts wherever possible.  Sometimes a clever data structure with scan can eliminate a sort even for complex things like eliminating statistical outlives there are algorithmic options that can avoid the sort and they are quite often dozens to hundreds of times faster.   Using a sorted data structure like a AVL or Pam array doesn't count as eliminating the sort because they just shift the cost to another part of the timing. Sometimes this is unavoidable but many times is just lazy engineering.
  • In finance work rather than thinking about an array of bar structures think about a bare set as composed of several arrays each representing a field such as one array for open, one array for close, etc. This allows a large number of design optimizations that fall out as the result of one decision.  You absolutely don’t want to use an array of bars where each bar is instantiated as it is read. This one poor decision will slow things down by dozens of times and will also make it harder to support generic indicator computation code. Many of the open source finance and commercial frameworks I have investigated make this devastatingly bad design decision.
  • In the Stock and FOREX domain building my data structures to accommodate incremental expansion so I can update things without re-computing entire columns has yielded substantial benefits.   This can add a little complexity but it is one of those things you absolutely have to build in early because it is a lot more expensive to add in latter. This one seems to get missed by most of the commercial and open source frameworks. 

Please see my BayesAnalytic blog for more interesting articles. 

Thanks

Joe Ellsworth

CTO Bayes Analytic LLC

contact me directly

Caveats

*1 The ability of higher performance code to delay the next layer of distributed work is important but when you must introduce distributed techniques to provide fault tolerance or greater scalability it is also feasible to solve many performance problems by throwing more hardware at it. If you must do the distributed plumbing work anyway then you may be able to delay some of the intra-service optimization. The caveat is that if the optimized code was 10X faster per CPU then you may need 20X or more CPU with the distributed techniques because you always incur additional overhead for box to box communications.  In addition compute resources are still not free and increasing your compute requirements by 20X is a sufficient amount of operational overhead to make some projects non-viable. The additional compute cost is replicated in the stress testing environments as well as in the QA and stage environments. 

Joe Ellsworth

Principal Product Security Architect at NETGEAR

8 年

Take a look at my Quantized classifier? https://bitbucket.org/joexdobs/ml-classifier-gesture-recognition. It is written in GO, is blinding fast and is delivering precision equal or better than Tensorflow’s Deep Learning based CNN for the test files I have tested I designed this engine based on what I learned building very high performance ML classifiers to predict stock price movements but it is general purpose. Right out of the box it was successful classifying breast cancer 96% of the time. I used a MIT license so it is safe to embed in commercial projects.

回复
Ali Islam, FIA

MarketData Engineer - Vela / TREP / Exegy / Activ Financial / Exchange Connectivity / Application Support / Low Latency

8 年

Great Article

回复
Abby W.

Senior Talent Acquisition Partner @ Galliford Try - Delivered by Chapter 2 (for embedded talent partners)

8 年

Mike Baxter Zack Keane Nicole Balmer... could be an interesting read?

Nipun Patel

Back End Developer,Azure Cloud,Microservices,.Net Core,Mysql,SqlServer,nHibernate,RxNet,Entity Framework,Low Latency,Algo Trading

8 年

Thank you very much for giving answer .

回复

要查看或添加评论,请登录

Joe Ellsworth的更多文章

社区洞察

其他会员也浏览了