Processor Design #1: Overview
Simon Southwell
Semi-retired logic, software and systems designer. Technical writer, mentor, educator and presenter.
Introduction
This and the next few articles are based on the notes I made for a mentoring program where I covered processor architecture and logic design using RISC-V as the case study, as this is a modern RISC-based instruction set architecture, is open-source, and is making a lot of noise in the industry right now. From knowing nothing about how processors worked, the mentee produced a fully working implementation that passed all the relevant RISC-V International instruction tests. They then went on, of their own volition, to pipeline it for single-cycle operation on non-memory instructions. I’m hoping that this article will provide enough information that anyone who wants to do so can reproduce what my mentee did, at least to a finite-state-machine based design, but I will also discuss some steps beyond these fundamental principles for more advanced features.
RISC-V is being used as a relevant example instruction set, but this is not a document on all aspects of RISC-V, and I will stick to only those features that allow a processor core to be implemented. Thus, in this article, we will stick to the base implementation and relevant control and status registers, whereas there are many instruction extensions to this base. Also, RISC-V can be 32- or 64-bit (or even 128-bit) and has three privilege modes—machine, supervisor, and user—or four if you count hypervisor—but we will stick to 32 bits and the highest privilege mode only (machine), as that has no restrictions on permissions. RISC-V also supports multiple hardware threads (harts), but we will stick to just one…and so on. There are many, many good resources out there for those who want to know more about RISC-V, including the specifications, to which I will give links at the end of the article.
Throughout the articles I will be making reference to my own RISC-V logic implementation for illustration and example. The source code, along with documentation, is available on GitHub. It is targeted at FPGA, though it could easily be implemented on an ASIC, and is restricted to the specification I have just laid down. I have sometimes sacrificed efficiency for clarity in the design, as it is meant for informative and educative purposes, and there are many better developed, more thoroughly verified implementations than this available as open source (e.g., Ibex), but my core is architected and documented for ease of understanding. Where a ‘better’ approach might be warranted, I will discuss this in the text to explore more general processor design features.
In this first article, though, processor function and design are discussed in a generic way (though looking at some real examples) in order to define what a processor is, where it fits within a system, and what the common traits are for the vast majority of processors. In future articles, we will look at the RISC-V architecture, the instructions it defines for the base system, and the register sets (both general purpose and CSR). Then we will look at the logic architecture to actually implement such a processor core, including optimisations and alternatives. Finally, we will look at assembly language, the lowest-level programming (discounting programming directly in machine code, which nobody would be foolish enough to do since the 1970s). There is an instruction set simulator (ISS) as part of the accompanying RISC-V project, and this can be used to experiment with assembly language programming without the need for processor hardware.
What is a Processor?
Firstly, I want to say that a processor doesn’t do very much. It reads a set of fairly simple instructions from a memory or internal registers, manipulates associated data according to those instructions, and stores this altered data either internally or back to memory. And that’s it. The power comes from the fact that it can do these instructions very fast, and that more complex operations can be achieved by combining the limited set of instructions. The instructions used vary between different processors, but a result that might surprise you is that, in the limit, only one instruction is really necessary! All the processors that have multiple instructions are really doing engineering to make the processor’s operations more efficient. Such One Instruction Set Computers (OISC) actually exist and can perform the same functions as a bigger processor, albeit much less efficiently.
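To make this concrete, here is a minimal sketch of a one-instruction machine in Python, using SUBLEQ (‘subtract and branch if less than or equal to zero’), a common OISC choice. The memory layout, addresses, and use of a negative address as a halt are purely illustrative choices, not part of any standard:

```python
def subleq(mem, pc=0):
    """Run a one-instruction (SUBLEQ) machine until the PC goes negative."""
    while pc >= 0:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]                    # the single instruction: subtract...
        pc = c if mem[b] <= 0 else pc + 3   # ...and branch if the result <= 0

# Program computing B = A + B, using a zero-initialised temporary Z.
A, B, Z = 9, 10, 11          # data addresses, after the 9-word program
mem = [A, Z, 3,              # Z -= A  (Z becomes -A)
       Z, B, 6,              # B -= Z  (B becomes B + A)
       Z, Z, -1,             # Z = 0, then branch to -1 (halt)
       7, 5, 0]              # A = 7, B = 5, Z = 0
subleq(mem)
print(mem[B])                # → 12
```

Even addition needs three of these instructions, which illustrates why real processors trade the theoretical minimum for efficiency.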
A modern processor has more than one instruction, and these are encoded as plain binary numbers (‘machine code’) which the processor reads from memory (maybe RAM, flash, or another storage device) to manipulate data and read and write to memory. This is the software running on the processor. It will have a bus to access memory and other devices such as I/O, using protocols such as AXI or Avalon (see my article on busses). The internal registers vary in number and purpose between different processors, as we shall see, and this register set, together with the instruction set that the processor recognises, is known as the processor’s Instruction Set Architecture (ISA). The ISA defines, then, whether the processor is RISC-V RV32I, ARMv8, IA-64 etc. There may be different implementations of the same ISA, but the processor is classed based on the ISA that it implements.
Processors are often classed as n-bit, for example 8-bit. This (usually) refers to the size of the data and instructions it processes, and this has been increasing with time. For microprocessors, it all started with the Intel 4004 at 4 bits, and the 1980s 8-bit home computer revolution with 8-bit processors, such as the MOS 6502, Zilog Z80, and the Intel 8080—all originally designed in the 1970s. A short period of 16-bit processors (e.g., Motorola 68000 and Intel 80286) was taken over by 32-bit processors such as the Intel 80386, SPARC v8 and ARM Cortex-M. This 32-bit era still persists in the embedded processor world, but modern PCs, workstations, and smartphones use 64-bit processors (I’m ignoring graphics processors), such as the Intel i9 and the Apple A15, incorporating 64-bit ARM-based processors. Beyond this are Very Long Instruction Word (VLIW) processors such as the HP/STMicroelectronics ST200 family of processors, Analog Devices’ SHARC DSP processor and the u-blox software defined modem (SDM) processor (which I worked on as a DSP software engineer doing 4G physical layer code, including the code on the VLIW processor).
CISC versus RISC
Another categorisation of processors is whether they are CISC or RISC. That is, is it a complex instruction set computer, or a reduced instruction set computer? In the early days of processors, iterations of processors tended to add more instructions with more complex functionality to aid the efficiency of programs which, originally, were written directly with the processor’s own specific instructions. With the advent of higher-level programming languages and their compilers, research found that, in general, compilers would use 80% of the instructions only 20% of the time, and 20% of the instructions 80% of the time. Using this result, work was done to design processor architectures that had fewer (i.e., a reduced number of) instructions—the ones that ran most of the time—which run more efficiently. The more complex functions can be emulated using multiple simpler instructions which, although slower than a dedicated instruction, are needed less often on a processor that can run much faster.
The term complex instruction set computer (CISC) was retro-fitted to earlier processors that had the characteristics of a large instruction set, including instructions with complex functionality and multiple ‘modes’ for each instruction, and which could be variable in the size of their encoding. An example of a CISC processor is the Intel x86 family and the processors derived from this architecture. RISC processors, by contrast, have fewer, simpler instructions, implemented for fast execution. The instructions are all of fixed width, allowing ease of pipelining an implementation. Also, RISC processors generally separate data manipulation from memory input and output. So all data manipulation is done from values held in internal registers, with results placed back in internal registers. All movement to or from memory is done with instructions that can’t alter the data. This separation avoids having multiple modes for each data manipulation instruction—e.g., supporting a function that can take data either from memory or a register, or a memory location indirected by another register etc., as is common in CISC processors. Example RISC processor architectures include ARM and, of course, RISC-V. Indeed, even modern CISC processors often have an internal RISC architecture, with stages to break down complex instructions into multiple simpler internal operations.
Role of a Processor in an Embedded System
My particular interests and experiences are with embedded systems and SoCs, so where does a processor fit in such a system? Within an embedded system or SoC, one or more processors might sit on a system bus, such as an AHB or AXI bus, along with a memory sub-system (with caches and an MMU) and with peripherals that it might control, such as Ethernet, UART, USB etc. It might also have an interrupt controller and timer to produce internal and external ‘events’ (more on this later). The diagram below shows a simple SoC arrangement.
This arrangement is oversimplified, but indicative of most processor-based systems, with software in memory running on one or more processors which control a set of peripheral devices over the system bus or interconnect fabric, making up a system such as a controller for a storage device, a smartphone, or even a Raspberry Pi. If there are multiple cores, then the cores may also communicate with each other, usually through memory.
Basic CPU Operation
We’ve not yet defined any kinds of instructions that a processor may use, but we can still discuss what a processor does when it is powered up and taken out of reset, as this is common to the majority of processors. The diagram below shows what happens at this first step.
After reset is removed, a processor will start to read an instruction from some predetermined fixed location, for example at address 0 (shown as step 1 on the diagram). The value of the instruction is returned from memory to the logic for interpretation (step 2). It may be that the particular instruction manipulates data from two of the internal registers, and so these are fetched by the logic (step 3). The result of that manipulation might be, depending on the instruction, placed back into another internal register (step 4), and the instruction execution is completed. In step 1, the address is shown to come from a particular register labelled PC. This is the program counter. When an instruction completes, this is normally incremented to point to the start address of the next instruction, located in memory immediately after the one just executed. For a 32-bit RISC machine, all instructions are 32 bits, or 4 bytes, wide, and so the PC (a byte address) would be incremented by 4 and the whole cycle started again.
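The fetch, decode, execute and write-back steps just described can be sketched as a loop in Python. The three-instruction set, its tuple ‘encoding’, and the register count here are invented purely for illustration:

```python
# Toy fetch/decode/execute loop. Instructions are (op, rd, rs1, rs2)
# tuples rather than packed binary, to keep the steps visible.
regs = [0] * 8          # small general-purpose register file
pc = 0                  # program counter, a byte address

program = {             # instruction 'memory', keyed by byte address
    0x0: ("li",   1, 5,    None),  # regs[1] = 5
    0x4: ("li",   2, 7,    None),  # regs[2] = 7
    0x8: ("add",  3, 1,    2),     # regs[3] = regs[1] + regs[2]
    0xC: ("halt", None, None, None),
}

while True:
    op, rd, rs1, rs2 = program[pc]        # steps 1/2: fetch and decode
    if op == "li":
        regs[rd] = rs1                    # load an immediate value
    elif op == "add":
        regs[rd] = regs[rs1] + regs[rs2]  # steps 3/4: read sources, write back
    elif op == "halt":
        break
    pc += 4                               # default: advance 4 bytes

print(regs[3])  # → 12
```

A real core does the same thing, but the decode step extracts the operation and register indices from fields of a 32-bit binary word rather than a tuple.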
The next instruction, instead of manipulating data, might be a memory access, such as a write.
The diagram shows a store operation, where two internal register values are accessed, one to form an address for the data to be written and one for the actual data to be stored. So, instead of the ‘result’ being written back to an internal register, it is directed to memory. Similarly, a read from memory instruction might read an internal register for an address value, perform a read operation from memory, and the returned data written back to another internal register.
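The store (and the corresponding load) can be sketched as follows, with one register supplying the address and another the data. The little-endian word layout matches RISC-V, but the register numbering and memory size are arbitrary choices for the example:

```python
import struct

mem = bytearray(64)            # small byte-addressable memory
regs = [0] * 8
regs[1] = 0x10                 # address register for the access
regs[2] = 0xDEADBEEF           # data register to be stored

# Store word: memory at address regs[1] takes the value of regs[2],
# packed little-endian as in RISC-V.
struct.pack_into("<I", mem, regs[1], regs[2])

# Load word: read the same address back into a different register.
regs[3] = struct.unpack_from("<I", mem, regs[1])[0]
print(hex(regs[3]))  # → 0xdeadbeef
```

Note that neither operation alters the data in flight, in keeping with the RISC separation of memory access from data manipulation described earlier.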
One last class of instruction is one which can change the value of the PC from its default of moving to the address of the next instruction. The diagram below illustrates this:
Here, the instruction is read from the current PC address and sent to the logic. The logic might read two internal registers to compare their values. If they meet some criterion, such as being equal, the logic then overrides the default increment of the PC to some new address location. Often this is an offset from the current PC value, forward or back, that is encoded in the instruction itself. So, in the diagram, if the two registers did not meet the criterion (of being equal, say), then the PC would increment as normal and the new PC address would be 0xC. If the two registers were equal, then the instruction might have an offset of, say, -8 and the PC would be set back to 0x0, executing instructions from there once more. The offset could just as easily have been positive, +12, say, which would skip the instructions at 0xC and 0x10 and start executing instructions from 0x14.
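The PC update for such a conditional branch can be sketched as a single function (a hypothetical branch-if-equal with a PC-relative offset, matching the example above; the function name is mine, not any ISA’s):

```python
def next_pc(pc, r_a, r_b, offset):
    """Branch-if-equal: take the PC-relative offset when the two register
    values match, otherwise fall through to the next 4-byte instruction."""
    return pc + offset if r_a == r_b else pc + 4

# Branch at PC 0x8 with offset -8, as in the example above:
print(hex(next_pc(0x8, 3, 7, -8)))  # not equal → 0xc (fall through)
print(hex(next_pc(0x8, 3, 3, -8)))  # equal → 0x0 (loop back)
```

This is the whole mechanism behind loops and if-statements in high-level languages: a comparison plus a signed, PC-relative offset.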
This basically describes all the types of operations that a modern RISC processor does: data manipulation to and from registers, memory reads and writes, and overriding the program counter. The only addition to this is an exception where an external signal (interrupt) or an internal error event can also change the program counter value, but this is not normal program flow and we will look at this shortly.
Processor Core Port Architecture
It has been shown above that a processor core reads instructions from memory. Some of those instructions will direct the core to load data from memory or store data to memory. These two classes of memory access, instruction and data, can be handled in one of two ways via the external ports of the core. The diagram below shows two example core memory port configurations.
The first configuration has a single memory port. The diagram shows a simple SRAM-like port with a wait request, but it could be any kind of port to access memory (e.g., AHB). This configuration is known as a von Neumann architecture. It can cause conflict on memory access if an instruction needs to be fetched whilst data is being loaded from, or stored to, memory. If the internal design is such that an instruction completes before the next instruction is fetched, then no conflict arises. However, this is not a very efficient implementation, and modern pipelined implementations can fetch instructions in parallel with accessing memory for data. The second configuration has separate memory ports for instructions (a read-only port) and for data (a read and write port) and is known as a Harvard architecture. Now memory can be accessed for data whilst instructions are fetched in parallel. It may be that the instructions are in a separate ROM which is connected directly to the instruction port, with the data port to RAM or DRAM. Alternatively, the instructions and data may ultimately reside in the same set of memory, which simply moves the conflict to the memory sub-system, which would need to arbitrate access for both ports. However, the memory bandwidth may be higher than that of the core’s memory ports, and so both ports can still be run at 100% efficiency. This is not an uncommon situation in an SoC.
Internal Registers
Also common amongst nearly all processors is a set of internal registers, which we have alluded to already. The type and number vary, but they all have very similar roles, falling into a few broad categories: general-purpose registers for holding working data, a program counter, and control and status registers.
Within the general-purpose registers, certain ones might be nominated as having additional specific purposes but could still be used as a general-purpose register. Also, many modern processors nominate a register to always read as zero.
Below are some examples of real register sets from three different processors, and from different eras.
All three of these examples have registers that fit, roughly, into the listed categories. The 6502 has A, X and Y registers that are more or less general purpose. The ‘stack pointer’ is kind of custom but could be used as a general-purpose register. The PC maps directly to the functionality already described, and the P register (processor flags) is a status register. The ARM processor has a set of general-purpose registers, r0 to r15, where r13 is nominated for a ‘stack pointer’, r14 for a ‘link register’ and r15 for the program counter. The PSR is the status register. Finally, the Tricore processor from Infineon (which I worked on, constructing software models of the processor and system) has 32 general-purpose registers (though split as address and data) and a PC, with control and status in the PCXI and PSW registers.
This, I think, serves to illustrate that there is a commonality of internal registers amongst varied processors. Compilers, such as gcc, work (in broad terms) by having a generic model of registers which they use to map the programming language code to an intermediate code before mapping to the actual registers (and instructions) available on the target processor. Modern processor architecture is now sympathetic to this compilation process, to make the mapping straightforward.
In the diagram for the three processor register sets, looking at the ARM and Tricore, the diagrams show labels in brackets next to some of the register names. I’ve mentioned before that some registers are nominated for particular functions, but this need not be fixed by the hardware (though it might be for some processors). In practice, this freedom only matters when writing code in assembly language (the lowest-level programming). The reason for nominating registers like this is to have a convention when compiling code from a higher-level language. For interoperability of code compiled separately, it is helpful if everyone follows the same convention. This convention is often called the application binary interface (ABI). It dictates things like which register to use for the stack pointer (I’m not going to define these terms here), which registers to use as inputs to a function call, which as outputs, and other such things relevant to a high-level language.
Exceptions and Interrupts
Before we leave this generic discussion on how processors are used and operate, I want to mention exception and interrupts. I have mentioned these in passing in the above sections, but I want to fill in some of the blanks.
We have discussed the flow of a program running on a processor as proceeding in a sequential fashion through memory, though this can be altered using instructions that override the default program counter increment and so change the flow through the running program. Another way to change the flow of the program is with ‘exceptions’. These usually arise from some internal error condition, or from some external event or interrupt.
For the internal events, an error condition might be, for example, an unrecognised instruction. If an instruction read from memory has a value that does not decode to any known instruction, or some fields within an instruction have illegal values, this is an error and causes an exception. Some processors have special instructions to deliberately cause particular exceptions, so that a program can generate these events itself, rather than relying on an error condition.
External events usually come in the form of interrupt input signals. There can be multiple interrupt inputs to a processor core, but fundamentally this boils down to one exception, with other logic (an interrupt controller) sorting out whether a given interrupt input is enabled (so the processor responds) and which is the top priority if multiple interrupts are active. The interrupt controller logic might be within a core, but is often external to it, with just a single input signal to the processor.
For both the internal and external events, a processor will finish its current instruction and then change the program counter from its normal next value to some fixed, predetermined address in memory (not unlike after coming out of reset). It will save the address of the next instruction it would normally have executed to some store (usually a register). In some cases, the source of the exception may add an offset to the fixed address to differentiate the various types and sources of exceptions. The code located at this fixed region of memory is specially written and is known as an exception handler. So, for example, if a UART peripheral has just received a new byte, it might raise an interrupt to indicate that this byte needs processing. The interrupt causes an exception, and the exception handler is called; it will identify that the exception is from the UART, and so call a routine to fetch the byte, which might place it in a buffer in memory for the main software to process. When this is finished (and the handler code is usually kept as small and fast as possible), the exception can be completed and the program counter restored to the saved address so that the program carries on from where it left off at the point of the exception. If the exception handler can’t process the exception, this is when a system can ‘crash’, perhaps displaying a message (if it can) before halting the processor.
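The save, redirect and restore sequence can be sketched as follows. The handler base address, the per-cause offset scheme, and all the names here are illustrative assumptions, not any real ISA’s registers:

```python
# Sketch of exception entry and return. A real core holds saved_pc in a
# dedicated register (e.g. RISC-V's mepc); a variable stands in for it here.
HANDLER_BASE = 0x100   # fixed, predetermined handler region (assumed)

saved_pc = None

def take_exception(pc, cause_offset):
    """Save the resume address and redirect the PC to the handler region."""
    global saved_pc
    saved_pc = pc + 4                   # where normal flow would resume
    return HANDLER_BASE + cause_offset  # fixed base plus per-cause offset

def return_from_exception():
    """Restore the PC saved on exception entry."""
    return saved_pc

pc = take_exception(0x40, 0x8)       # e.g. a UART interrupt taken at PC 0x40
print(hex(pc))                       # → 0x108 (handler entry for this cause)
print(hex(return_from_exception()))  # → 0x44 (resume point)
```

The handler code itself is ordinary software; only the PC redirection and the saved resume address are special hardware behaviour.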
In the simplified SoC diagram above I showed a timer, along with the other peripherals, which can itself be a source of interrupts. This is important when constructing a multi-tasking system, where the processor core is running an operating system and several other processes and threads ‘concurrently’. That is, it appears that multiple programs are running, but actually only one is running at a time, and they are swapped out at regular intervals by the operating system software, which is where the timer comes in. This might be programmed to interrupt after a given time, and then a particular process’s code allowed to run. When the timer interrupts at the end of the period, the exception handler notices this and hands control back to the operating system, which can then swap in another process to start running from where it was when it was last swapped out, and so on. (There are other reasons a process might be swapped out, such as waiting for the UART to receive another byte, as it won’t make progress anyway, so the OS might as well let something else run, but this is still event driven and under the control of the OS software.) Thus multiple processes and threads can make progress on a single core, as if running in parallel on multiple processors. However, from the processor core’s point of view, this is all just interrupts to jump to the handlers, and the handlers are software to be run, just as for the main code, to know what actions need taking.
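This timer-driven swapping can be sketched as a toy round-robin scheduler. The process records here hold just a name and a saved PC; a real OS saves the full register state, and the process names are of course invented:

```python
from collections import deque

# Ready queue of processes waiting for their turn on the core.
ready = deque([{"name": "editor", "pc": 0x1000},
               {"name": "shell",  "pc": 0x2000}])

def timer_interrupt(running):
    """Handler for the timer exception: put the interrupted process at the
    back of the ready queue and resume whichever is at the front."""
    ready.append(running)
    return ready.popleft()

current = ready.popleft()    # the OS starts one process running
for _ in range(3):           # three timer periods elapse
    current = timer_interrupt(current)
print(current["name"])       # → shell
```

Each ‘interrupt’ is just the exception mechanism from the previous section; the policy of who runs next lives entirely in handler software.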
Conclusions
In this introductory article we have restricted ourselves to discussing processor cores in a generic way, without committing to details of supported instructions (which can number just one, but usually don’t) or the particulars of a logic implementation. This is so that the fundamental characteristics can be teased out, without the complexity of the details of a particular architecture or an implementation strategy.
And it turns out that a processor does not do very much that’s particularly complicated (I hope I’ve demonstrated). It reads instructions (which are just numbers in memory) which tell the logic to process data in and out of internal registers, load or store data from memory, or update the program counter to start executing somewhere other than the next sequential location. Exceptions are very much like the PC update instructions, except they are caused by errors, external events (interrupts) or maybe even special instructions. In these cases, the new program counter address is set to a fixed location (or a fixed location plus an offset), and normal program flow can be restored once the exception is handled by the specially written software, having first saved the place where the exception changed the PC.
In the next article I want to map what we have discussed here to a modern real-world example, the RISC-V architecture. We will look at the base configuration, the instructions defined for this, and the registers, both general purpose and the control and status registers. This should set us up nicely for discussing an implementation in logic.