C++ faster than C#? It depends on the coding idiom

I recently ran tests parsing 80 million bars of Forex data and then computed several common technical indicators for all 80 million bars. I wrote this test in a number of languages, and my best C code was 5X faster than idiomatic C++, which in turn was 1.5X faster than idiomatic C# code.

When I modified the C# code to use memory management patterns similar to my best C code, the C# version ran faster than idiomatic C++ but was still substantially slower than well optimized C code.
Source for this article is available on GitHub.

C++ is not always faster than C#: A lot depends on how the C++ code was written. A challenge with C++ is that many C++ programmers adopt an OO coding style similar to what they would use with Java, which results in millions of small objects getting created and destroyed. You can write very efficient C++ code, but the idiomatic approach is more common. I am not talking about changing core algorithms, but rather using the same algorithm coded to maximize CPU caching while minimizing memory allocation and free activity.
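The difference between per-object allocation and a cache-friendly layout can be sketched in C. This is only an illustration of the idea, not the article's actual code; the `BarSeries` type and its field names are hypothetical. Instead of one heap object per bar, all bars live in a handful of contiguous arrays allocated once:

```c
#include <stdlib.h>

/* Struct-of-arrays layout: one allocation per field instead of one heap
 * object per bar.  Sequential indicator loops then walk contiguous memory,
 * which is what keeps the CPU caches and prefetcher busy. */
typedef struct {
    size_t  count;
    double *open, *high, *low, *close;
} BarSeries;

static BarSeries *bars_alloc(size_t count) {
    BarSeries *s = malloc(sizeof *s);
    if (!s) return NULL;
    s->count = count;
    s->open  = malloc(count * sizeof(double));
    s->high  = malloc(count * sizeof(double));
    s->low   = malloc(count * sizeof(double));
    s->close = malloc(count * sizeof(double));
    /* 5 allocations total, no matter how many million bars there are
     * (error handling for partial failure omitted in this sketch). */
    return s;
}

static void bars_free(BarSeries *s) {
    free(s->open); free(s->high); free(s->low); free(s->close); free(s);
}
```

The idiomatic OO version would instead allocate 80 million tiny `Bar` objects, each a separate heap block, which is exactly the pattern that thrashes the cache and the allocator.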

The C# garbage collector is pretty good: In an idiomatic OO coding style you tend to create and destroy lots of objects. C# can occasionally outperform idiomatic OO C++ code because the C++ heap can and does become fragmented, while the .NET garbage collector is pretty good at memory recovery and consolidation. Depending on process life, heap fragmentation may never become a factor, but I have seen this degradation occur pretty fast under the extreme loads common in our Machine Learning prediction engines.

Heap fragmentation will normally become a more significant issue for longer-lived C++ processes coded with idiomatic OO patterns. There are many ways to work around the issue, but they can be expensive to retrofit into large codebases.
It is relatively easy to write C++ code that avoids the majority of the problem, but it must be an explicit design decision that is ideally made early in the project life cycle.
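One such early design decision is to carve short-lived objects out of an arena and release the whole arena at once, so per-object frees (and the fragmentation they cause) never happen. Here is a minimal bump-arena sketch in C; the same pattern is available in C++ through custom allocators. All names are illustrative and alignment handling is simplified:

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal bump arena: allocations are pointer bumps into one big block,
 * and "freeing" is resetting the offset to zero. */
typedef struct {
    unsigned char *base;
    size_t         used, cap;
} Arena;

static int arena_init(Arena *a, size_t cap) {
    a->base = malloc(cap);
    a->used = 0;
    a->cap  = cap;
    return a->base != NULL;
}

static void *arena_alloc(Arena *a, size_t n) {
    n = (n + 15) & ~(size_t)15;            /* keep 16-byte alignment */
    if (a->used + n > a->cap) return NULL; /* out of arena space */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

static void arena_reset(Arena *a)   { a->used = 0; }  /* free everything */
static void arena_destroy(Arena *a) { free(a->base); }
```

Because the arena is reset wholesale at a known point (end of a request, end of a batch of bars), the general-purpose heap never sees the churn that fragments it.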

Well optimized C is not idiomatic C++: Different rules apply if you use coding styles similar to those used to implement core OS kernel functions in K&R C. The critical factors are minimizing the number of memory allocations and understanding what the optimizer will do to the generated assembly. Well designed C code quite often outperforms typical C# code by over 300%, and it is fairly common to see carefully optimized C code outperform idiomatic C# code by 9X.
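As an illustration of that style, here is a sketch of a simple moving average, one of the common indicator families mentioned above, written so the hot loop performs zero allocations and makes a single sequential pass. This is my own sketch, not the article's benchmark code; the function and buffer names are hypothetical:

```c
#include <stddef.h>

/* Simple moving average over a preallocated output buffer.
 * The running-sum trick turns the naive O(n * period) loop into O(n),
 * and nothing is allocated inside the loop. Positions before a full
 * window are written as 0.0 in this sketch. */
static void sma(const double *close, double *out, size_t n, size_t period) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += close[i];
        if (i >= period)
            sum -= close[i - period];      /* drop the value leaving the window */
        out[i] = (i + 1 >= period) ? sum / (double)period : 0.0;
    }
}
```

Caller allocates `close` and `out` once; the indicator itself never touches the allocator, which is the essence of the K&R-style discipline described above.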

Mindset over Language: It is entirely possible to write optimized C++ code that performs the same as the best C code, but I find it difficult to train C++ engineers to think that way. I have found that if I want speed, I should hire people who have been writing straight C for embedded chips and retrain them in the domain. It seems to be a smaller mental shift than retraining idiomatic OO Java or C++ programmers. There are exceptions: Java engineers who have worked on the core of high-performance engines like Lucene are forced to learn similar optimization techniques.

CPU Native Vector operations even faster: I recently conducted experiments using the -Ofast option of the C compiler. I have found that minor changes to the way some loops are coded allow the generated assembly to shrink by about 70%, because the compiler is able to replace some code with native CPU vector operations. When this occurs, the results are quite often 300% to 900% faster than the original C loop. Only relatively simple loops can be optimized this way, so I have started using profiling to identify the 5% most visited functions and modifying that C code so the optimizer can use the vector operations. Once you do this, the C code gains so much performance that C# just doesn't compete.
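The shape of a loop that auto-vectorizes well is fairly specific: contiguous arrays, a plain counted loop, no branches in the body, and a promise that the pointers don't alias. A minimal sketch of such a loop, with compiler flag names assuming gcc or clang:

```c
#include <stddef.h>

/* A loop shaped so -Ofast (or -O3 with -ffast-math) can auto-vectorize it:
 * `restrict` tells the compiler the three arrays never overlap, so it is
 * free to load, multiply, and add several doubles per instruction.
 * Compile with e.g.  gcc -Ofast -fopt-info-vec  to see the vectorizer
 * report (flag names are gcc's; clang uses -Rpass=loop-vectorize). */
static void scale_add(double *restrict out,
                      const double *restrict a,
                      const double *restrict b,
                      double k, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * k + b[i];   /* maps to SIMD multiply-add */
}
```

Pointer aliasing, early exits, and function calls inside the body are the usual reasons the vectorizer gives up, which is why rewriting a hot loop into this shape can pay off so dramatically.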

The .NET designers are always working to improve, so in the future you could expect them to use more of the advanced CPU functions, provided you have the data organized in memory so it is well suited to the vector operations.


My next step would be optimizing the C code to directly use advanced CPU vector features, but that is more than I want to tackle and it isn't always portable. For now I will focus on using C code patterns the compiler can optimize into the vector operations. If I am going to go non-portable, I will probably switch to CUDA, where I have more net compute capacity.

The downside of rich libraries: Java, C#, and C++ all offer comprehensive sets of libraries which can reduce engineering costs, but they come with a cost. Many of these libraries were built using idiomatic OO memory usage patterns. In Java, and even more so in Scala, you see code that creates intermediate objects only to pass them into another object which creates more intermediate objects, all of which have to be managed and garbage collected. The net result is you inherit a library that imposes substantial performance penalties on your system, and quite often you don't even know it has done so.


Great engineers tend to test early to identify these kinds of bottlenecks, but the largest group of junior-to-mid-level engineers just assumes the library developer did this level of optimization.

Premature Optimization: I see a lot of comments about premature optimization being undesirable. It is true that you can waste incredible amounts of time improving something that doesn't matter. It is also true that this has become the cop-out for lazy engineers.
I have been employed many times to isolate and fix performance and stability problems in multi-million dollar projects. When I get involved, these projects are quite often at risk of cancellation if they cannot fix the problem fast. These issues almost always resolve down to people who didn't understand the costs of the idioms they had adopted, and quite often they don't even know there is a choice.
Since poor performance has cost my clients millions of dollars and there seem to be lots of clients who need my services, it seems to be an epidemic in the industry. I believe there is a proper balance between premature optimization and sufficient optimization. A lot of development teams, possibly a majority, are on the wrong side of that balance.

The key is identifying where the problems are and designing so you can rip out the poor performance libraries and replace them without undesirable ripples across the code-base.


These performance issues are common across languages, but programmers working at the C level seem to have a better grasp of them, and I tend to find that C-level libraries are better optimized. There are exceptions: I recently found that the double-parsing function in the standard C library was substantially slower than the same functionality in Python or Go. It took 10 minutes of searching, but I found a C version that was 3X faster than either.
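The usual reason a specialized parser beats the standard library one is that it drops generality: no locale handling, no exponent syntax, no error reporting. This sketch parses plain fixed-point strings like Forex price quotes. It is my own illustration of the technique, not the 3X-faster library the article refers to, and it assumes well-formed input:

```c
/* Illustrative fast parser for plain fixed-point numbers like "1.23456".
 * Faster than strtod() mainly because it skips locale lookups, exponent
 * syntax, and error handling; it trusts the input, as a controlled price
 * feed usually allows.  Accumulated rounding differs slightly from a
 * correctly-rounded strtod() in the last bits. */
static double parse_price(const char *s) {
    double v = 0.0;
    int neg = (*s == '-');
    if (neg) s++;
    for (; *s >= '0' && *s <= '9'; s++)
        v = v * 10.0 + (*s - '0');          /* integer part */
    if (*s == '.') {
        double scale = 0.1;
        for (s++; *s >= '0' && *s <= '9'; s++, scale *= 0.1)
            v += (*s - '0') * scale;        /* fractional part */
    }
    return neg ? -v : v;
}
```

The trade-off is explicit: this function is only correct for input it was designed for, which is exactly the kind of contract a general-purpose standard library cannot make.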

Please see my blog https://bayesanalytic.com for other interesting articles. You can also reach me directly via our contact page

Thanks, Joseph Ellsworth

CTO & Chief Algorithms Scientist

Bayes Analytic.com LLC


See Also: Mentor Low Latency C/C++ Programming Idiom

In this context an idiomatic approach to using a programming language means a common or possibly dominant way of using the language to express algorithms. It is typically driven by the influence of an authoritative personality or a dominant group of individuals. It represents a culturally accepted conceptual approach rather than technical capacity of the language. When this conceptual approach becomes the dominant widely accepted way to use the language it becomes idiomatic of that language.  A specific idiomatic approach is considered an idiom. 


You could restate it as the most common or most widely accepted way to use the language especially when that pattern of usage is motivated by cultural acceptance rather than technical capacity. 


Idiom: "the language peculiar to a people or to a district, community, or class." See also: Harmful Java Idioms, What is idiomatic programming, wiki Programming Idiom, Idiomatic Scala, Idiomatic Python or Pythonic, C++ Idioms, Common Java Idioms, Java8 Idioms, More C++ Idioms.

Nipun Patel

Back End Developer,Azure Cloud,Microservices,.Net Core,Mysql,SqlServer,nHibernate,RxNet,Entity Framework,Low Latency,Algo Trading

7y

Nice article

Scott Yeager

Principal Software Dev Engineer

9y

Any chance of getting my hands on your sample C++ code you used in the test and the data it was parsing?

Joe Ellsworth

Principal Product Security Architect at NETGEAR

9y

Hey Jason, We offer services to automate profitable proprietary strategies and other services to fix or optimize automation work done by others. Please keep us in mind if you run into any opportunities where we can help.

I have been leaning towards CUDA, but it wasn't really based on technical merit. I can get CUDA cards to test with for as little as $250 and upper-end cards for about $1,500, which means I can have a top-of-the-line system stuffed for about $15K, which is in the discretionary budget of most directors. I thought making start-up cheap and minimizing barriers to learning and experimentation would help propel the CUDA cards to dominance, simply because more engineers would have experience and eventually some of those engineers would be promoted to positions of authority.

Intel seems to be entering the market from the super-computer high end working down. I think we have seen enough examples of companies working from the low end up eating the market for the high-end, high-profit-margin business to know which side is more likely to win. I think Intel may need to send their people to school focused on the "Innovator's Dilemma," specifically on the lessons about what the smaller-sized hard disks did to the 14" and 20" drive business. The lesson is pretty easy to see. Like many companies, Intel has a history of promoting their cash cows while demoting anything that would endanger them. The GPUs appear to be a potential danger to cash flow for the high-end CPU lines. This will become a real threat when compilers emerge that can use optimizations similar to those used by gcc -Ofast today to split workloads out across the GPU automatically. If this is true, then Intel is more likely to demote the technology, where NVIDIA has every incentive to push and improve the GPU technology. I would love to be wrong. The fun part of the computer industry is that just when you think you understand the landscape, something comes out of left field and changes the game.

One thing that could favor Intel is that they hold a lot of patents, and I would bet they hold patents on some of the techniques we need in the compilers to automatically use the GPU.

Great article. What do you think of Xeon Phi vs the CUDA route?
