An Architectural Comparison

It was around 2003 that the x86 got a facelift and became AMD64. This saved us all from the looming threat of the (unwelcome) Itanium trying to bully its way into our lives and onto our desktops. I was happy about that outcome because my situation didn't get worse; we all averted a major disaster. But that doesn't mean our situation couldn't be made better than it is. There is certainly plenty of room for improvement in what we now have and in the way we do things.

And so, about two months ago, I decided to start designing a new 64-bit CPU architecture that I call ?Engine. I wrote two programs to automate opcode generation, documentation and the associated source code. I now have a working assembler, and it has been a lot of fun. With these tools it doesn't take long to reorder or redefine the instruction set and produce an updated version that reflects the changes.

This latest project of mine wasn't just a whim. I've been thinking about doing this for many years, and I would never have started a project like this if I hadn't been so disappointed in the systems we are now using. I have gotten to know the AMD64 instruction set pretty well over the past few months while writing an assembler, library manager and linker. But the architecture is not new to me. I've been a user of the x86 (even at the assembler level) ever since I got my first 8088-based PC running MS-DOS back in 1984.

AMD had to make some difficult decisions for the transition from 32 bits to 64 bits to be viable. Intel had been too liberal when defining the x86 instruction set in the first place: they created too many variants of many operations and so used up the code space too quickly. AMD had to retire some opcodes to make room for others that would be needed in the 64-bit environment. But that kind of mess is not the only reason I was so disappointed in the architecture when I first looked at it over thirty years ago.

Many programmers don't like the segmented address space and the complications it brings to programming. In the 386, as in the AMD64, multiple “selectors” are used to reference what appear to be separate spaces, each as large as what the architecture can address: 32 bits for the 386 and 64 bits for the AMD64. But these “segments” all get crammed back into the same shared space, so they really aren't separate spaces after all. That's why they are called segments instead of spaces.

I think I understand pretty well why Intel did it the way they did. Some parts of a program's memory need to be write-protected while others do not. Because these segments have to share a common address space, each segment has to be protected from attempts to write beyond the end of another that precedes it, since they overlap. Only the parts of each segment that are used show through the mapping process. The basic idea was understandable. But it lost its potential power because all of these segments are crowded back into the same limited address space, and that is what brings them into contention and creates the need for extra memory protection. This feature of memory management in the x86 defeats the usefulness it could have had if the "segments" had remained separate spaces. But using them as "far pointers" would have been problematic, since pushing them onto the stack would have misaligned it. Also, trying to pass segment information between pointers is very unwieldy in this architecture. The problem comes from the need to access any one location across all segments with a single register reference. This aspect of the x86 is very awkward. ?Engine solves all of these problems.
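The overlap described above is easy to see in a toy model of x86-style translation, where each "segment" is just a base and a limit mapped into one shared linear space. The base and limit values here are made up for illustration; they are not taken from any real descriptor table.

```python
# Toy model of x86 protected-mode segmentation: every segment maps into
# the SAME linear address space, so two segments can overlap.

def linear(base, limit, offset):
    """Translate a segment-relative offset to a linear address."""
    if offset > limit:
        raise MemoryError("general protection fault: offset beyond limit")
    return base + offset

CODE = (0x1000, 0x0FFF)   # (base, limit), illustrative values
DATA = (0x1800, 0x0FFF)   # overlaps the top half of CODE

a = linear(*CODE, 0x900)  # an access through the CODE segment
b = linear(*DATA, 0x100)  # an access through the DATA segment
print(hex(a), hex(b), a == b)   # both land on the same linear byte
```

Two different selectors reaching the same physical byte is exactly why the segments end up in contention and need the extra protection machinery.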

Where I would like to go in this article is to give you a general comparison of the familiar AMD64 and the new ?Engine. I will also discuss the advantages that these two CISC systems have over a typical RISC system. Many characteristics of the AMD64 must be judged in light of popular operating systems such as Windows and Linux, and this is a sound approach, since those operating systems have limitations because of the hardware they must run on. The ?Engine architecture has not been completely hammered out yet, but enough of it now exists that I can give my readers a good idea of what it is all about.

?System Co-Design

One of my favorite stories is the one about how Dennis Ritchie, Brian Kernighan, Ken Thompson, Rob Pike and others co-developed the C programming language and UNIX. Innovation usually has to happen this way, in a vacuum. My wish has been to see it happen again, but in a more complete way. That team didn't have the luxury of defining the processor that their creation would run on. With all of the new technologies available today and in the near future, ?System does have that luxury. One of the goals of this new system is to eliminate the need for assembly language in operating system code. ?Engine is the final step in a long history of ?System co-design. The process began with ?PPL and a better way of programming. After this came ideas and experiments for implementing ?OS, an operating system that would be written in that language. And with ?Engine, the ?System trinity becomes complete. Many novel techniques can't be implemented using antiquated hardware. It is sometimes better to co-develop the new technologies all at once while using your older systems to run the tools for building the new system.

Instruction Metrics

This is an area where ?Engine has some architectural features similar to AMD64's. Instructions are of variable length (multiples of a byte) and can begin on any byte boundary. This promotes high code density, as there is little waste in code space. RISC architectures incur a lot of waste and suffer from low code density because of their fixed instruction size. High code density translates into better use of cache memory, because the cache is not flushed as often, and it reduces the need for instruction bandwidth. Because instructions can begin on any byte boundary, all bits of the Program Counter (called the IP in x86 parlance) are used. Even though most RISC systems have 32-bit instructions, they still often encode byte offsets in their branch instructions, wasting two bits. I suppose they do this for ease of implementation, but these are wasted opportunities in their designs.
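The cost of those two wasted bits can be quantified. If every instruction is 4 bytes, every branch target is 4-byte aligned, so scaling the displacement by the instruction size quadruples the reach of the same field. The 16-bit field width below is an illustrative assumption, not taken from any particular ISA.

```python
# Reach of a signed branch-displacement field, as a byte offset versus
# an instruction-word offset (fixed 32-bit instructions, 4-byte aligned).

FIELD_BITS = 16                          # assumed displacement field width

byte_reach = 2 ** (FIELD_BITS - 1)       # +/- range as a raw byte offset
word_reach = byte_reach * 4              # same field, scaled by 4 bytes

print(byte_reach, word_reach, word_reach // byte_reach)
```

With word-scaled displacements the same 16 bits reach four times as far, which is the opportunity the byte-offset encodings give up.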

In contrast to AMD64, ?Engine has very few forms of any one operation. This gives it a reduced instruction set (even though it is not a RISC architecture) and promotes high code-space utilization. Most instructions have only two opcodes: one for destructive operations, where one of the sources is also the destination, and the other for operations with unique sources, where the destination is not a source. Where AMD64 needs many special instruction forms to provide additional addressing modes (such as immediate values), ?Engine pushes much of this functionality down into its operands. ?Engine's versatile operand encoding is what makes it possible to meet most addressing needs within a single instruction.

Architectural Orthogonality

Orthogonality is a claim made by most RISC proponents, and it holds only limited validity. In general use, the term indicates the degree to which any operation can use any register equally. The AMD64 is particularly bad in this respect because so many of its instructions require specific registers for specific uses. For example, CX is often required as a counter. Most RISC processors are not as limited in the use of their general-purpose registers.

However, orthogonality doesn't stop there. The principle can also be applied to other aspects of an architecture's design, such as how many addressing modes are equally available to all instructions and which immediate values can be used with any one instruction. RISC processors can be particularly non-orthogonal in these areas. I recently read that an add on the Itanium can use only an 11-bit immediate in one addressing mode, a 6-bit immediate in another and yet another size in a third (granted, Itanium is categorized not as a RISC but as a VLIW). One glaring limitation of all RISC processors is that, with a 32-bit instruction, you can never include a 32-bit immediate value (let alone a 64-bit one!) in an instruction; there is nothing left for opcode control, or not enough room to fit it. Obtaining a 32-bit immediate value means either fetching a data value in a non-instruction-stream memory cycle or building the value with multiple instructions. The AMD64 has no trouble with 32-bit immediate values in its instruction stream, but the instructions that can handle 64-bit immediates are almost non-existent. ?Engine will happily provide immediates of any of the standard data sizes within its instruction stream.
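Here is what "building the value with multiple instructions" looks like in practice, modeled on the RISC-V LUI/ADDI pair: one instruction loads the upper 20 bits, a second adds a sign-extended low 12 bits. Because the add sign-extends, the upper part has to be pre-adjusted whenever bit 11 of the constant is set. The Python below only models the arithmetic; it is a sketch, not an encoder.

```python
# Constructing a 32-bit immediate RISC-style: two instructions where a
# CISC needs none, because the constant cannot fit in one 32-bit opcode.

def split_imm32(value):
    """Split a 32-bit value into (upper20, lower12) for a LUI/ADDI pair."""
    lo = value & 0xFFF
    if lo >= 0x800:          # the add will sign-extend: compensate upstream
        lo -= 0x1000
    hi = (value - lo) >> 12
    return hi & 0xFFFFF, lo

def rebuild(hi, lo):
    """What the CPU computes: (hi << 12) + sign-extended lo, mod 2^32."""
    return ((hi << 12) + lo) & 0xFFFFFFFF

hi, lo = split_imm32(0xDEADBEEF)
print(hex(rebuild(hi, lo)))   # the original constant, at the cost of two opcodes
```

The round trip recovers the constant, but it costs two instruction slots and an extra register write that a variable-length instruction stream avoids.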

Data Conversions

?Engine has a broader range of conversions built into its loads and stores than either AMD64 or a typical RISC. Size extensions are built into two-source operations, so values don't have to be loaded into registers before being used. This is usually the case in source code, and it reduces register overhead for code that references a variable only once. This generally can't be done on RISC processors, which incurs additional overhead such as saving and restoring the registers needed to bring values into the CPU before they can be used.
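The conversion being folded into the load is ordinary sign extension: widening a narrow signed value to register width. A minimal sketch of that step, independent of any particular ISA:

```python
# Sign extension: what a sign-extending load performs in the same step as
# the memory access, before the value ever reaches the ALU.

def sext(value, bits):
    """Sign-extend a `bits`-wide unsigned encoding to a signed integer."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

mem = bytes([0xFF, 0x7F])     # the bytes -1 and +127, stored in memory

print(sext(mem[0], 8))        # -1
print(sext(mem[1], 8))        # 127
```

An architecture without the fused form needs a separate load plus an extend (or a dedicated load-signed opcode) and a register to hold the intermediate, which is the overhead the paragraph above describes.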

Load/Store Architecture

The RISC processor model is a load/store architecture. This means that data are only loaded into and stored from registers; all data manipulation happens between registers. In contrast, the AMD64 performs operations while loading, while storing, and between registers as well. Some addressing modes will even store the result back to the address from which an operand was acquired. ?Engine uses a modified load/store model in which most instructions can operate on registers only or on one or two operands in memory, and the result is almost always placed in a register. In that respect, ?Engine is similar to a RISC processor. For example, an add involves loading two operands, but it also performs the add, with the result placed in a register. A RISC has to execute two load instructions to bring both values closer to the ALU, and then a third instruction to add them together. If the operands already reside in registers, no loads have to take place on any of these architectures; many operations can be done between registers on any of these processors. What ?Engine saves by doing it all in one instruction is the elimination of multiple instructions as well as multiple references to the same registers. This makes code more compact.
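The add example above can be sketched as a toy instruction-count model. This is plain Python with made-up "registers" and "memory"; it counts instructions only and says nothing about timing or any real encoding.

```python
# Instruction counts for adding two memory operands: a pure load/store
# machine versus a machine whose add can take memory sources directly.

mem = {"a": 2, "b": 3}
regs = {}
count = 0

def risc_add():
    """LOAD, LOAD, ADD: three instructions on a load/store machine."""
    global count
    regs["r1"] = mem["a"]; count += 1               # LOAD r1, a
    regs["r2"] = mem["b"]; count += 1               # LOAD r2, b
    regs["r3"] = regs["r1"] + regs["r2"]; count += 1  # ADD r3, r1, r2
    return regs["r3"]

def mem_operand_add():
    """One instruction whose operands come straight from memory."""
    global count
    regs["r3"] = mem["a"] + mem["b"]; count += 1    # ADD r3, a, b
    return regs["r3"]

count = 0; result1 = risc_add(); risc_count = count
count = 0; result2 = mem_operand_add(); cisc_count = count
print(result1, risc_count, result2, cisc_count)
```

Same result, but the fused form issues one instruction instead of three and never names r1 or r2 at all, which is where the code-density saving comes from.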

General Purpose Registers

The sizes of programs and the speed at which they execute are strongly affected by the number of general-purpose registers available to the programmer. There are generally few registers in a system but many memory locations. Both can be referenced explicitly by a program, and when they are, a memory location needs a much longer address than a register does. Registers are also much closer to the CPU than memory locations, so access is faster. Because of these things, having more registers available to the programmer can make programs smaller and faster. But as the number of registers grows, so does the number of bits needed to address them and the size of the context that must be saved and restored when the CPU switches from one thread to another. So there is a trade-off.

The original 8086 had eight GPRs, and this was extended to 16 in the AMD64; the 64-bit ARM has 31. I have spent a lot of my life writing 68K assembly, and it taught me much of what I know and appreciate. That CISC architecture has separate data and address registers, and this was one of its best design features. In almost all cases, on all architectures, data are used quite differently from addresses, and this has many ramifications. When you have different register types for data versus addresses, you can double the number of registers without enlarging the code space needed to address them. A second benefit is that you can structure the address registers to match your addressing needs. You can do neither of these things if you have only one kind of register for both kinds of values. Consequently, ?Engine has 32 integer registers and 32 address registers, for a total of 64, with little impact on the size of the instruction set or on code density. The other benefit is ?Engine's revolutionary virtual memory model, which could only be implemented poorly on any other architecture that I know of.

Ease of Programming

As a programmer, I think one of the most important qualities of an architecture is its ease of programming. I hope my readers understand how important it is for programmers to understand the code they write. With some architectures, especially RISC, the instruction set is difficult to understand; at least I have difficulty understanding them. The AMD64 is probably middle-of-the-road in this area. Any one of its relatively many variants can be simple to understand on its own, but holding all of them in one's head at the same time is not so easy. Also, how segmentation in the x86 works is still not very clear in my mind, and I feel like I have just gotten by somehow all these years. That is not a good legacy for a CPU architecture to leave in its wake.

Enter the ?Engine, which uses a non-destructive data model. Most early CISC processors (as well as many recent RISC ones) must first load a value into a data register that is then used again as the destination to receive the result of an operation. This is called a destructive data model, and it wastes code space because the same register must be referenced more than once during the whole process. Most ?Engine instructions place the result of an operation into a register that can be different from its source operands.

In ?Engine, most operations can load their operands using several addressing modes, perform any necessary conversions and complete in one instruction (one line of code). AMD64 and RISC operations often stretch across two or three instructions, and a programmer tends to lose track of where the operands are kept and where the operations are done. This is not a problem for compiler-generated code, but it is for anyone writing assembly code directly.

Many different kinds of operations are needed in SIMD (vector) instructions, and these are generally not available in standard scalar instruction sets. Things such as overflow can't be handled for individual vector elements. For instance, the saturating add is used in multimedia operations to prevent numbers from wrapping around: a standard add of two white pixel values can produce black, which is bad behavior. Another example is the MIN or MAX operation (separate versions are needed for signed and unsigned values), which takes the smaller or larger of two values. With most standard instruction sets, you can implement these operations only with tests, branches and loads on a per-element basis. AMD64 and ARM provide these kinds of operations only in their vector instructions. Even though ?Engine has fewer instructions, it also supports these kinds of operations in its scalar instruction set.
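Both behaviors are easy to show in scalar form. The sketch below models an 8-bit unsigned saturating add (clamp instead of wrap) and shows why MIN needs signed and unsigned variants: the same byte pattern compares differently under the two interpretations. The 8-bit width is illustrative.

```python
# Scalar models of two SIMD-style operations: saturating add, and
# signed vs. unsigned MIN on the same raw 8-bit values.

def add_sat_u8(a, b):
    """8-bit unsigned add that saturates at 255 instead of wrapping."""
    return min(a + b, 0xFF)

def min_s8(a, b):
    """Signed 8-bit MIN: reinterpret the raw bytes as two's complement."""
    as_signed = lambda x: x - 0x100 if x >= 0x80 else x
    return a if as_signed(a) < as_signed(b) else b

white = 0xF0
print(add_sat_u8(white, white))   # stays bright: 255, not a wrapped 0xE0
print(min(0x01, 0xFF))            # unsigned MIN picks 0x01
print(min_s8(0x01, 0xFF))         # signed MIN picks 0xFF, since it is -1
```

Doing this per element with only tests and branches is exactly the contortion the paragraph above describes; a single saturating or MIN/MAX opcode removes the branch entirely.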

Another way the ?System makes source code easier to understand is by using the richness of Unicode in its programming languages. Many people tell me they think a larger character base would make a programming language more complicated to use, but the opposite is true. People have an ability to recognize images that is wasted when their vision is limited to the small set of ASCII characters. This is why all aspects of the ?System are Unicode-based. ASCII will always be a trivial subset of Unicode, which can simplify tasks for chosen purposes, but the ?System makes full use of the larger set. It makes source code more concise and easier to understand. As I learned in a software engineering course in college, a programmer will look at code many times for every one time it is written (if it is found to be worth using, that is). My readers will probably not appreciate this until they've started using this new technology; then they will understand. I begin to hate programming when I look at the contortions that some UNIX programmers go through in all of the scripting languages connected with that operating system. All the different combinations of slashes, dashes, backslashes, percent signs and other characters make me dizzy. Having a large character set makes it possible to set aside characters for specific uses, so each rarely means more than one thing. This eliminates ambiguity and makes life and work a lot more enjoyable (at least for me).

Even if you had a programming language that could totally eliminate the need for assembly language, I'd still want to keep one around for educational purposes. There is no better way to teach someone how computers work than with assembly language. HLLs hide much of how a program accomplishes what it does, so they serve education poorly in this area. With its rich architecture, clear syntax and orthogonal features, there is no better CPU than ?Engine for this purpose.

Inter-dependencies

What many of us PC users want in a computer is one that is fast. We want fast response, and we want to process a lot of information in a short amount of time. As we have watched the end of Moore's Law draw near over the past few years, the techniques used to gain additional speed have been pipelining and parallelism. Pipelining can speed up single-threaded processing. Multiple-issue of instructions, multi-core and SIMD are all forms of parallelism.

Parts of many instruction sequences must maintain serial execution because of inter-dependencies: you must obtain a result before you can use it in another operation. One kind of dependency occurs when two instructions use the same register. Even if the two instructions do not depend on the same data in that register, they must still serialize their sharing of the common hardware resource. Having many GPRs makes it easier for instructions to use different registers when they aren't sharing common data.

Another resource that often causes contention is the status flags. Many CPU architectures have only one set of status flags, including the AMD64 and some of the most popular RISC processors. Since many instructions depend on these flags, they are often the shared resource that prevents two instructions from running in parallel. ?Engine has four sets of flags, all stored in the Status Register. Another benefit of having more than one set of status flags (also called condition codes) is that a test can be made and its result used much later, while the other flag sets are used for other purposes.

Virtual Memory and Hard Link Addresses

Virtual memory is something that most programmers rarely think about. It is not something that can be changed by using a different programming language. But it is real, and it is much like the foundation of a building, while application programming languages are more like the kind of roof you decide to put on the building to finish it. For any computer architecture, the memory model becomes much more visible to the programmer when writing assembly code than when programming in an HLL. So these requirements were a major factor in the design of ?Engine's MMU architecture, address registers and addressing modes.

What Windows and Linux can do with virtual memory is dictated by the hardware they run on. The realization of an application's run-time environment under Windows became more vivid to me as I wrote the linker for AMD64. It appears as a linear address space made of pages. Because the x86 is deficient in program-relative addressing modes, most programs have to be relocated and linked as they are loaded into each process's address space. If you run two instances of a program, it has to be loaded again, and this eats up more of your memory. The Motorola 68K has a rich set of position-independent addressing modes, so you can run a program unchanged regardless of where it sits in memory. But no OS running on the x86 can do this: if even one instruction in a program is not position-independent, the whole program can run from only one place. It also means that the loading and linking of any DLLs that are used incur a significant delay before they execute. This is one feature that has contributed to Windows earning the nickname “WinDoze”, and it is a scourge to be avoided among users of Real-Time Operating Systems (RTOSes).
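The 68K property described above comes down to PC-relative addressing: the branch target is encoded as a displacement from the instruction itself, so the same bytes work at any load address. A minimal sketch, with made-up addresses:

```python
# Why PC-relative code is position independent: the displacement is baked
# into the instruction, and the target moves with the code.

def branch_target(pc, disp):
    """Resolve a PC-relative branch: target = instruction address + disp."""
    return pc + disp

disp = 0x40                        # fixed displacement inside the code

t1 = branch_target(0x1000, disp)   # same program loaded at 0x1000...
t2 = branch_target(0x8000, disp)   # ...and loaded again at 0x8000

print(hex(t1), hex(t2), t1 - 0x1000 == t2 - 0x8000)   # relative layout identical
```

An absolute-addressed branch, by contrast, encodes the target itself, so the loader must patch every such instruction whenever the program lands somewhere new.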

Because of its memory model, position-independent code is less important in ?Engine, even though its architecture has a rich set of position-independent addressing capabilities. Programs, Modules and Libraries can all be pre-built and ready to run. Because of its virtual-address memory model, these things just have to be mapped into spaces by the operating system and run. Gone are link-loading, DLLs and all of those complicated pre-run processes that give systems poor performance and delayed response. ?OS implements all these capabilities as the world's first real-time object-oriented operating system.
