The Future of Software: Code Generation Step 2 - Compilers
Team J.A.R.V.I.S


I'm guessing a few of you are wondering why I took the time to talk about 2 very old technologies (Assembly Language and Compilers) in a series of posts that discuss the future of software.  Well, hang in there, we will get there soon!  My conjecture is that we have gone down the wrong path as an industry because most people don't know the history of software development.  If they did, they would not have taken the path they did.  So please bear with me for a couple more posts and we will indeed get to what I think is the single biggest mistake we made and how to correct it.

In part one of this post, we discussed how, in the past, the only way to program a general purpose computer was to write machine code.  Next we learned how Assembly Language provided radical improvements in productivity with no meaningful side effects.  Now we move on to the next step, compilers.  If Assembly Language is a perfect technology, why did we move on?  That's a very good question with a few levels of depth to it.  My mentors at my first job at Unisys would probably still insist that Assembly Language is by far the best language out there.  And for those few people at that tremendously high software engineering talent level, they are correct.  But for the rest of us mere mortals, Assembly Language is just too difficult to use.

First of all, Assembly Language requires intimate knowledge of how the CPU works.  And every time the company that makes the CPU comes out with a new model, it introduces new instructions you need to keep pace with.  You also need to know which model of CPU is running your software so that you don't use an instruction that isn't supported.  The published instruction set reference for the Intel x86 series of CPUs, which shows how the instructions have accumulated over time, is a daunting list to memorize.

The learning curve is long and difficult just to understand a single CPU.  Add to that a whole series of CPUs from a given manufacturer.  And if that isn't bad enough, different manufacturers tend to have completely different CPU designs (if only to stay clear of intellectual property lawsuits).  The instruction sets are different, the number and names of the registers are different, and the number of bits per instruction and bits per operand is different.

The byte order of the operands can even be different.  For example, an integer in an Intel CPU is stored and manipulated in little-endian format, whereas pretty much all the other CPUs at the time used big-endian format.  If you had the task of getting your software to run on both Intel and Motorola CPUs, with all the differences between the two, you would quickly move away from Assembly Language.
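To make that byte order difference concrete, here is a small C sketch (my own illustration, not code from any of the systems mentioned above) that prints the individual bytes of a 32-bit integer.  On a little-endian Intel CPU it prints the low-order byte first (04 03 02 01); on a big-endian CPU such as a Motorola 68000 it prints the high-order byte first (01 02 03 04):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t value = 0x01020304;                    /* one 32-bit integer */
        unsigned char *bytes = (unsigned char *)&value; /* view it byte by byte */

        /* Little-endian (Intel x86):        prints 04 03 02 01 */
        /* Big-endian (e.g. Motorola 68000): prints 01 02 03 04 */
        for (int i = 0; i < 4; i++)
            printf("%02X ", bytes[i]);
        printf("\n");

        return 0;
    }

An Assembly Language programmer has to keep that layout in mind on every load and store; a C programmer mostly gets to ignore it unless the bytes are written to disk or sent over a network.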

The next evolution in programming languages is what are called High Level Languages.  The main advantage of a high level language is that you don't need to understand the CPU architecture in order to write software.  Code in a high level language does not contain any CPU concepts (well, it doesn't require them anyway).  But of course it needs to be transformed along the way into Assembly Language or directly into machine code to execute the program.  The author of the high level language needs to do that for each CPU architecture they want to support.

To better illustrate the point, we need to pick a high level language.  There are so many of them, I won't bore you with the evolution of COBOL, FORTRAN, PL/I, and the many languages that followed.  I'll skip all of that and go right to the C programming language.  The reason for selecting it is that I would claim it is the most prevalent programming language in the world that accomplishes the main goal we are examining here: CPU independence.  Even today it is the foundation of all the major operating systems you rely upon (Windows, Linux, iOS, and Android for example).  Let's compare a simple loop in C and Assembly Language and see which one you would prefer to write.  I can't vouch for the integrity of the code (it was posted by Max Benson), but it makes for a perfectly good illustration:
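That listing isn't reproduced here, so the sketch below, a loop that sums ten numbers, stands in for it.  In C it is a handful of readable lines; the hand-written Assembly Language equivalent needs separate instructions to set up registers, load each element, add, increment the index, compare, and branch, and it looks completely different on an Intel x86 than on a Unisys 1100 Series machine.

    #include <stdio.h>

    int main(void)
    {
        int numbers[10] = {3, 7, 1, 9, 4, 6, 2, 8, 5, 0};
        int sum = 0;

        /* One readable loop.  The equivalent Assembly Language needs explicit
           register setup, a load, an add, an index increment, a compare, and a
           conditional branch on every pass, and every line of it is tied to
           one specific CPU architecture. */
        for (int i = 0; i < 10; i++)
            sum += numbers[i];

        printf("sum = %d\n", sum);
        return 0;
    }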

After 3 years of Unisys 1100 Series programming in Assembly Language (with 36 bit words written out in Octal instead of the more traditional 32 bit words in Hexadecimal), I can tell you my choice was an easy one.  I switched to a C programming language assignment as quickly as I could and never looked back.  The same C code I wrote worked identically on a Unisys mainframe, an MS-DOS PC, and a Unix system.  The Assembly Language programs I wrote would run only on a Unisys 1100 Series mainframe.  The C code took far fewer lines to accomplish the same task and is much more readable to the average person.  No doubt a massive productivity boost over Assembly Language.  But before we go too far, let's use what we learned in the last post and evaluate the technology.  Does it have side effects?  Why did my mainframe mentors stay away from it?

To answer those questions, we need to understand how C programming language code becomes machine code.  There are two basic approaches to making that happen: a language can be compiled or interpreted.  C is compiled, meaning the C statements are translated into Assembly Language specific to the hardware you are running on.  That Assembly Language is then assembled into machine code for execution.  Compilers do all of this work once per program file, up front, then link all the files together and create a cohesive machine code program as the end result.  Interpreters for languages like BASIC do not translate the BASIC code into a standalone machine code program.  For those who don't know what BASIC is, think of JavaScript or Python when it first emerged as a technology.  An interpreter just runs the instructions step by step inside a single program that was itself compiled or assembled previously.  Interpreters don't generate code or machine code programs like compilers do.

Compilers do a lot of work up front so that the resulting program is as fast as possible.  Interpreters don't do any work up front and need to scan each code instruction every time before execution (ignoring modern caching techniques of course).  After scanning a high level instruction, they run the machine code that carries out that instruction, then move on to the next one.
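As a rough sketch of what that looks like on the inside (a toy example of my own, not any particular BASIC implementation), the heart of an interpreter is a dispatch loop.  The interpreter itself is a machine code program that was compiled earlier; at run time it examines each high level instruction and then executes the machine code that implements it:

    #include <stdio.h>

    /* A toy "high level" instruction set, examined one instruction at a time. */
    enum op { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    struct instr { enum op code; int operand; };

    int main(void)
    {
        /* The "program" being interpreted: push 2, push 3, add, print. */
        struct instr program[] = {
            { OP_PUSH, 2 }, { OP_PUSH, 3 }, { OP_ADD, 0 },
            { OP_PRINT, 0 }, { OP_HALT, 0 }
        };

        int stack[16];
        int sp = 0;

        /* The dispatch loop: scan each instruction, then run the machine code
           (compiled into this interpreter) that carries it out. */
        for (int pc = 0; ; pc++) {
            switch (program[pc].code) {
            case OP_PUSH:  stack[sp++] = program[pc].operand;  break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp];   break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]);      break;
            case OP_HALT:  return 0;
            }
        }
    }

Every pass through that dispatch loop is work a compiled program never has to do, which is exactly the overhead described above.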

Therefore, if you don't mind taking the time to compile your program, and that program doesn't modify itself during execution, you should end up with a much faster and more efficient program with a compiled language.  If you don't want to take that time (you make frequent changes, or you change programming logic in the middle of program execution), then you want an interpreted language.

Most people want speed and efficiency, so they select a compiled language given the choice.  Now we get to the fundamental side effect of a higher level language.  Can the compiler generate code that is as readable, as easy to debug, as fast, and as efficient as an Assembly Language programmer?  I can still hear my Unisys mainframe mentors say "Have you ever looked at the generated Assembly Language from the compiler?  It is dreadful, no one in their right mind would write that code!".  I was highly motivated to move on from Assembly Language and left the speed and efficiency issues to others to worry about.  But at the time, back when CPUs were very slow, all but multi-million dollar mainframe systems had only one CPU, CPUs had only one thread, and memory was very small, it indeed was a problem.  Combine that with early compiler implementations and it was a big problem.  The generated code from a compiler could easily be 1000 times slower than the code written by an experienced Assembly Language programmer.  Back then you could actually tell what language a program was written in by its response time.  If it had sub-second response time, it was almost certainly written in Assembly Language.  Compiled and interpreted languages would run 5-10 second response times for character-based end user interfaces.

Few people code in Assembly Language today; CPUs are so fast and so heavily multithreaded, and memory capacities are so large, that inefficiency in most types of programs is simply not noticeable.  Compilers have been thoroughly optimized over the years as well.  But it is a significant side effect to be aware of.

The second significant side effect is the ability to debug a high level language program.  My mainframe mentors would claim that there is no way to verify the correctness of the generated code.  You are trusting the people who built the Compiler code generator to generate Assembly Language code that perfectly matches the high level language program instructions.  And if they didn't, how are you going to debug the program?  Here is where we get into the topic of code generation and round trip engineering.  In Assembly Language, code generation is pretty straightforward.  It is for the most part a one-to-one mapping between the input line of Assembly Language code and the output machine code.  You can take machine code and dis-assemble it to create Assembly Language code.  You can modify the machine code (if you dare) and get pretty good Assembly language back.  The same cannot be said for high level languages.  One programming statement in a high level language can generate many lines of Assembly Language code.  You cannot reverse compile Assembly Language code or machine code back into a high level language.  You cannot modify the generated assembly language code to optimize it.  So how do you go about debugging and optimizing the code?
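To make that one-to-many expansion concrete, here is a rough sketch (my own illustration, not actual compiler output) of how a single C statement fans out into several machine-level steps; nothing in those steps records where the original statement boundaries were, which is why the translation cannot simply be run backwards:

    #include <stdio.h>

    int main(void)
    {
        int price = 19, quantity = 3, tax = 5;

        /* One high level statement... */
        int total = price * quantity + tax;

        /* ...which the compiler expands into several machine-level steps,
           roughly: load price into a register, multiply by quantity, add tax,
           store the result into total.  The generated instructions keep no
           record of the statement they came from, so they cannot simply be
           translated back into the C source. */
        printf("total = %d\n", total);
        return 0;
    }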

Compiled language engineers had to design a whole new set of technologies to debug compiled language programs.  These technologies watch over the machine code as it executes, map it to the assembly language code with special markers in it (called object code), and then map the object code back to the original source code, complete with your line breaks and comments.  CPU manufacturers had to enhance their CPUs to support debugging these programs, adding hooks for watching program instructions as they execute and data as it is modified.  It took a while, but now we have excellent debugging facilities and highly optimized generated code.  Almost everyone trusts that the generated code is "correct" and just debugs their programs at the high level language level.  But again, it is something to be aware of.  You don't have control over the generated code, and you can't optimize it yourself.  If you tried, you would break the debugging facilities, probably break integrations with other generated code, and the next time you compiled the code you would lose your optimizations.  There is no provision for round trip engineering in traditional compiled languages.

Assembly Language is still required for a very small portion of code directly interfacing with hardware or executing precisely timed instructions.  But otherwise, High Level Languages have taken over.  If you still write Assembly Language code, let me know, I'd love to hear from you!

Outside of operating system kernel software, very high volume databases, or embedded devices, there isn't much of a need for C programming either these days.  Therefore we have at least one more programming language evolution to discuss. In the next post we will evaluate object-oriented programming languages with virtual machines.  Until then, take time to think about all the people who invented and perfected high level languages, compilers, and interpreters.  Without their work, we would still be stuck learning very complex CPU architectures and programming Assembly Language! 
