An Interesting Comparison between ARM (Thumb-2) (w/ C) and ?CPU (w/ ?PPL)
I have long been interested in computer logic and code generation. There has never been a race between CISC and RISC because no new CISC architectures have been developed since the RISC debate began (i.e. during the last 30 years). Many performance comparisons have been made between CISC and RISC processors, but because RISC architectures are all newer, the only conclusion that can be drawn is that newer processors are often faster than older ones. There is nothing surprising about that. The origins of the x86 and its modern AMD64 mutation stem from long before the RISC debate began, so they can't be considered representative of good CISC CPU design. Their architectures were patterned after the 8080, one of the first 8-bit machines of the 1970s. Still, this architecture dominates the desktop. But the preeminence of this kind of desktop is now being threatened by ARM, which has become the prime electronic brain of mobile computing devices.
In the time since ARM established its foothold, the software development industry has systematically abandoned the practice of using assembly language to enhance the performance and reliability of computer software. This has happened for several reasons. The most obvious is that assembly language is not very portable, primarily because the programming model of the architecture is explicitly visible to the programmer. On top of this, there are different syntax standards coming from different software tool suppliers. For example, in the x86 world, there is considerable difference between Intel vs. Microsoft vs. AT&T syntax. In the ARM world, Keil source code looks very different from GNU source code.
Even with portability aside, assembly language is more difficult to write than the equivalent code in a high level language (HLL). A program written in assembly code almost always requires many more lines than a program written in a HLL such as C. But writing assembly language for an architecture such as ARM is also more difficult than it was for one of the older CISC processors such as the Motorola 68K. It saddens me to read requests for advice from people wanting to learn ARM assembly language, only to see professionals tell them to forget about it and concentrate on C. This shows that, even though ARM has been billed as a RISC processor, it does not have a Reduced Instruction Set and it is not easy to write assembly language programs for it. In fact it is more complicated and has more limitations than a 68K processor. This is partially a result of the fixed instruction size of 32 bits.
After the original 32-bit ARM instruction set had been in use for some time, ARM Holdings recognized that the architecture had suffered in many ways as a consequence of this handicap. Code density was poor and memory usage was high, and performance suffered as a result. These problems inspired the Thumb instruction set, which was limited to 16-bit instructions. Many processor types can run either instruction set, but having to manage two incompatible machine languages put the burden of juggling the two systems in the programmer's lap, which of course caused problems of its own. Then a new ARM architecture came out for the Cortex-M. The 32-bit ARM instruction set was dumped overboard and the Thumb instruction set was given a face lift and dubbed Thumb-2. Instructions can be either 16 or 32 bits, but there is a lot of redundancy in their functionality. The 16-bit variants can only access half of the General Purpose Registers (GPR's), and some other features are more restrictive in nature. A RISC instruction set was beginning to look a bit more like CISC! Have you ever known anyone who overstated his case and later had to retract his claim after empirical evidence argued against it? This is basically what happened.
I do a lot of embedded programming and I have produced many electronic circuit boards along with the firmware that goes with them. I had been paying attention to where ARM was going for years, but I never took the plunge into doing a design around an ARM processor. With Cortex-M, I decided that the time was right, so I bought a software development system based on the GNU tool chain. This supports a lot of processor architectures, so the commands and associated documentation can be quite complicated, and the appropriate versions can be difficult to track down. I got the IDE and associated tools up and running and I am now ready to roll up my sleeves.
So just today, I decided to make a comparison between the Cortex-M Thumb-2 instruction set and that of ?CPU, which I have been developing. I really didn't expect ?CPU to fare as well as it did, since Cortex-M is a 32-bit architecture with many 16-bit instructions while ?CPU is a 64-bit architecture with four times as many GPR's. More registers require more encoding space, so it seemed that my design was going to have a handicap in this particular area. My test was to compare the code generated for a simple function written in C.
The world of ARM development is inextricably linked to the C programming language. This HLL was at the core of its development. Many features of traditional CISC processors were stripped away because C couldn't get to them anyway. One of those features is the ability to do integer math on the smaller data types such as 8-bit and 16-bit values. Computations can only be done efficiently on ARM if all integers are represented as 32-bit values. Variables in the real world often have vastly different ranges in values. One size does not fit all in the real world and the real world is what computers are all about.
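The test function itself was as simple as it gets: adding two 16-bit values. A minimal C version looks something like this (the name here is just illustrative):

/* Add two 16-bit values and return a 16-bit result.
   The point of the test is the 16-bit types, not the math. */
short add16(short a, short b)
{
    return a + b;
}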
?CPU was co-designed with ?PPL, not C. But ?PPL was developed for considerable compatibility with C (it is object-compatible with C), so they are like brothers. It's just that ?PPL is a lot more general in its abilities. Functions written in ?PPL can call just about any function written in C, while C functions can call functions written in ?PPL as long as they don't imply features beyond the capabilities of C (for example, you can't pass arrays as arguments to functions unless they are wrapped inside structs). So the equivalent ?PPL function, a direct translation of the C version, was used to generate code for ?CPU.
An obvious difference between the two test programs is that ?PPL uses double-pointed curly braces (as well as colons in its type declarations). The double points are there to remind you that ?PPL is a Parallel Programming Language (PPL) while C is a Scalar Programming Language (SPL), but also because the normal curly braces in ?PPL are reserved to enclose an element list for a set, as you learned in math class in school. So I ran the C compiler and it generated the following code. I will just list the machine code in hex because few of my readers will understand assembly language and even fewer will care about what ARM assembly looks like. Instructions are always a multiple of two bytes, so the bytes are paired in little-endian order:
80B4 83B0 00AF 0346 0A46 FB80 1346 BB80 FA88 BB88
1344 9BB2 1BB2 1846 0C37 BD46 5DF8 047B 7047
Next I wrote a ?CPU assembly source file since the only ?PPL compiler I have written generates 32-bit x86 code. I then generated the machine code using the assembler and it is listed below. Since ?CPU instructions are always a multiple of one byte, the bytes are listed individually:
38 20 BF 08 00 BF 0C 00 02
The ARM assembly source code file size is 5,611 bytes; it has 18 instructions and the resulting machine code is 38 bytes. The ?CPU assembly source code file size is 99 bytes; there are only two instructions and the resulting machine code is 9 bytes. Granted, many will remind me that the GNU C compiler does not generate very efficient code. But that is what its users are stuck with unless they are among the very few who brave writing source code in assembler.
What are the reasons for such a striking discrepancy in results for these two processor architectures? The ?CPU machine code is less than a quarter the size (9 bytes versus 38, about 24%), and its source file is a minuscule 1.76% of the size of the ARM source file. If performance were to scale directly with code density, then code running on ?CPU should have four times the performance of a Cortex-M system.
The main explanation for the discrepancy in this example is that doing 16-bit computations on the ARM processor is very inefficient: the architecture has to extend the values to 32 bits at every step, only to reduce them back down to the size the programmer specified.
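You can see this cost in C terms with a slightly bigger example (a sketch with illustrative names). Every pass through the loop below widens the 16-bit values to 32 bits for the add and then narrows the sum back to its declared 16-bit type; on Thumb-2 that narrowing costs an explicit sign-extend instruction on every iteration:

#include <stdint.h>

/* Sum an array of 16-bit values. The add happens at 32 bits, then
   the result is narrowed back to 16 bits to honor the declared type
   of 'sum' -- the extend-and-reduce cycle described above. */
int16_t sum16(const int16_t *v, int n)
{
    int16_t sum = 0;
    for (int i = 0; i < n; i++)
        sum = (int16_t)(sum + v[i]);
    return sum;
}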
Another cause of bloat in ARM architecture code is that calling a function is a two-step process which requires that the return address to the caller be explicitly saved and restored using the stack. The Link Register is really a waste of a resource and never should have been designed into the architecture in the first place. Neither the x86 nor the 68K has need for such a thing because pushing the return address before a function call and pulling it from the stack upon return is all done automatically, so these operations don't tie up a register that could be used for something else. However, in the output of GAS (the name of the GNU assembler), there is a note saying that these steps were optimized out of the code. Without this optimization, the Thumb-2 code would have been 42 bytes in size instead of 38.
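That optimization is possible here presumably because the test function is a leaf: it calls nothing, so the return address can simply stay in the Link Register. A minimal illustration in C (hypothetical names):

/* leaf() calls nothing, so its return address can stay in LR and no
   stack traffic is needed. nonleaf() must save and restore LR on the
   stack, because its call to leaf() overwrites LR -- the two extra
   steps described above. */
int leaf(int x)    { return x + 1; }
int nonleaf(int x) { return leaf(x) + 1; }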
As for the file sizes, GNU obviously carries a whole lot of baggage for supporting many target processors. It is an amazing feat that its writers were able to pull it off, but at a huge cost. Another factor is that ?CPU assembly source code is stored in ?Text, not ASCII. This is more efficient than either ASCII or UTF-8, and it stores many more kinds of information, such as text color, style, size and attributes, as well as paragraph formatting.
After doing this test, I had to ponder what the results would have been if I had used an operation that does not exist in the C language. Take, for example, the saturated add. Both the Cortex-M and ?CPU have support in their machine languages for these operations, which are often needed in embedded systems development. In ?PPL, this is easy to write using the saturated add operator, which looks like a '+' with a bar above and below it.
In ?CPU assembly language, this operation is implemented as either an unsigned saturated add or a signed saturated add since the results are dependent on the type class. Only the opcode is different and the resulting machine code is:
3F 20 BF 08 00 BF 0C 00 02
To accomplish the same thing in the ARM environment requires either writing or calling a function written in assembly language to do the operation for you, because C has no direct way of specifying such an operation. You could do it using a C++ class, but the resulting code would be very bloated, would run slowly, and wouldn't use the hardware capability that is right there in the processor.
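To make the cost concrete, here is a minimal portable C sketch of a signed 16-bit saturated add (the name is hypothetical). The widening, the compares and the clamps are all work that a dedicated hardware instruction performs in a single operation:

#include <stdint.h>

/* Signed 16-bit saturated add in portable C. */
int16_t sadd16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;  /* widen so the true sum fits */
    if (sum > INT16_MAX) return INT16_MAX;  /* clamp positive overflow */
    if (sum < INT16_MIN) return INT16_MIN;  /* clamp negative overflow */
    return (int16_t)sum;
}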
The implementations of saturation in ARM and ?CPU are also quite different. In ARM, you do the add and then you have to use a saturation adjust instruction which follows the add. I am not sure how a Cortex-M processor handles the situation where an overflow occurs during the add; if that happens, the result of adjusting the value will be wrong because the data loss has already occurred. In most processors, an unsigned overflow is indicated by the Carry Flag while a signed overflow is indicated by the Overflow Flag. An adjust operation that doesn't know what kind of overflow to account for cannot reliably yield the correct result when an overflow occurs.
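In portable C terms, the two conditions those flags report can be reconstructed like this (a sketch with hypothetical names; the hardware sets both flags for free on every add):

#include <stdint.h>
#include <stdbool.h>

/* The condition the Carry Flag reports: an unsigned add wrapped around. */
bool add_carries(uint32_t a, uint32_t b)
{
    return a + b < a;
}

/* The condition the Overflow Flag reports: a signed add where both
   operands have the same sign but the result's sign differs. */
bool add_overflows(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t us = ua + ub;          /* wraparound is well defined for unsigned */
    return ((~(ua ^ ub) & (ua ^ us)) >> 31) != 0;
}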
I hope my readers have learned something about ARM, Thumb-2 and ?CPU. They are all interesting architectures. I have decided to bite the bullet and learn Thumb-2 as well as ARM64, which is really a completely new (and incompatible) architecture in its own right. In the process of doing this, though, I just hope I don't swallow the bullet and end up with lead poisoning.