Processor Design #3: Processor Logic
Simon Southwell
Semi-retired logic, software and systems designer. Technical writer, mentor, educator and presenter.
Introduction
In the last article the 32-bit RISC-V base ISA was introduced, along with the Zicsr extensions that are really needed for a useful implementation. The specific classes of instructions were defined for data manipulation in and out of internal registers (also defined), memory accesses, and program counter updating. With the addition of control and status registers and exception handling that they mostly control, we should have enough to implement a compliant RISC-V processor in logic.
In this article is discussed approaches to implementation of a processor, and reference will be made to an example open-source Verilog implementation that is available on github for those who want to study the code alongside this discussion. Some of the design decisions will be highlighted and also alternatives and more advanced approaches. The aim of the example implementation is to provide a working design with some common features to make a useful and somewhat efficient core, but in places clarity has taken precedence at the slight cost of speed and logic size. Where I can, these will be ‘confessed’. Better open-source implementations exist that will be more efficient in gates, speed or power consumption, or a mixture of these.
Before looking at implementation specifics I want, in the first part of the article, to define what functions the logic will need to perform. There are various operations that are commonly defined for the basic operations of a processor core for executing instructions, with slight variations. The diagram below shows a typical set of these operations.
Starting at the top, without implementation specifics, a processor will fetch an instruction from memory, it will decode the instruction and read the values of the indexed source registers, it will execute the instruction using the register values, it will access memory (if a load or store instruction) and it will write the result back to a destination register (unless a store) or the program counter. The process then repeats, either as the normal flow of execution or the PC was updated due to an exception condition that’s occurred, externally or internally.
In the diagram I have shown some optional steps, where the reading of the source registers might be bundled as part of decoding, or possibly split out as a separate prefetch, and for memory accesses combined in the execute step or have its own step. In theory, all of these steps could be done in one big operation and the reality is that different implementations have different sets of combined or split steps to meet specific requirements. None-the-less, the above diagram lends itself to the start of an implementation design, so let’s begin with this.
Design Approaches
When designing a solution to a core implementation, as with any design process, a set of conflicting requirements will shape what approach is taken. Perhaps the engineering problem being solved lends itself to a large number of tiny processing elements, so the core design should be as small as possible whilst trading instruction execution speed (see, for instance, the fascinating bit-serial SERV RISC-V core). Alternatively, instruction execution efficiency might be key, and a superscalar and deeper pipeline solution is required (more on these terms later)—the SiFive series 7 processors are dual-issue superscalar, 8-state pipelined cores. The truth is, it is likely to be mixture of requirements—power, size, speed, complexity, time to market—and a balance between them all. First, though, let’s look at just implementing something that functions. The required functionality from the processor ISA does not change with any of these engineering considerations, and we can keep things simple to begin with.
Finite-State-Machine Based Core
The diagram above, suggesting various operations for an executing processor, looks very much like a finite-state-machine (FSM), if a very simple one. If we were only interested in making a functional core, then the FSM approach is a good strategy. The diagram below adapts the original diagram into?a set of matching states of an FSM.
In its simplest form the FSM can just do exactly what the original diagram showed, with one clock cycle for each step (ignoring wait states for now). However, we can already improve on that by skipping states when there are irrelevant for the particular instruction. For example, if the instruction is not a load or store, then the memory state can be skipped. If the instruction is a store, then the writeback step might be skipped—so long as the PC can be updated for any exception conditions in the memory state. Indeed, exceptions that are internal error conditions might occur in the other stages, and paths from these states directly to writeback when these occur might be another efficiency improvement.
At its simplest, every instruction would take 6 cycles to execute. With some stated improvements, data manipulation and PC updating instructions can be reduced to 5 cycles. This does not sound particularly impressive, but this FSM approach has some major advantages. Firstly, it is very simple to understand and implement and avoids some complex issues of other approaches (as we shall see). It is likely, therefore, to be in a working state much sooner than other methods, with fewer problems along the way. Many early processors are multi-cycle instruction execution implementations, as it was complex enough already to design integrated circuits at that time, and the designers were limited in the amount of logic they could lay down. So, I would suggest that anyone new to this subject and implementing their first processor core starts with this approach. Everything else we are about to talk about does not affect the functional operation of the core and is just there to engineer the efficiency of one or more parameters of an implementation. So, we’ll leave the FSM based implementation discussion here, as I hope it is self-evident as an approach to the required logic design.
Pipeline
The classic step of improving the instruction execution speed of a processor core, or indeed any serial data path such as a streaming interface, is to pipeline the design, where each step is a stage in the pipeline feeding into the next step. The diagram below shows the steps ‘unwrapped’ into a pipeline:
Now, in this arrangement, an instruction is fetched by the fetch ‘stage’ in a given clock cycle and the returned instruction passed to, in this case, the rs prefetch stage. The prefetch stage, in the cycle after the fetch, can start processing the instruction to retrieve the values in the indexed source registers. In parallel, though, the fetch stage can get the next instruction. As this continues, the pipeline is quickly filled in all the stages and instructions are being fetched and fed into the pipeline every clock cycle. So, apart from an initial latency before the first instruction is finished, the pipeline completes an instruction every cycle. We have now gone from, at its basic, 6 cycles for the FSM to 1 cycle for a pipeline, and the pipeline doesn’t look any more complicated than the FSM, perhaps is even simpler. Of course, there is a price to pay, and we have just created a whole set of problems for ourselves that need solving.
Pipeline Hazards
If a processor only ever did memory accesses and data manipulation instructions, where it was never updating a register that was needed by any of the other instructions in the pipe, then things would be fine. However, it is possible that an instruction’s destination register is a source register for an instruction that is in a pipeline stage behind it. Since, during the rs prefetch, an old ‘stale’ value was fetched, the instruction will fail without some intervention. This is known as a register writeback hazard. Fortunately, the internal registers are only updated in one stage, the writeback stage, and it is always just writing to one register (rd), so this can be monitored and compared with the rs indexes for the instructions at each previous stage. If it matches either of the rs indexes, the source value is replaced with that being written to the destination register. This does mean that the indexes for the rs values must be forwarded through the pipeline, and the rd index and writeback value routed to each stage with a comparator and mux to choose between the value from the previous stage or the writeback value fed-forward.
The second issue we have introduced by going to a pipelined architecture is branch hazards. The default behaviour has been to keep fetching instructions linearly through memory (the default PC + 4 behaviour for the RV32I). If a PC updating instruction is processed then, at writeback, the PC is potentially updated to a new non-linear address and all the instructions in the pipeline behind are then wrong. Dealing with this hazard is known as branch prediction which varies in complexity, from avoiding the problem to quite complex solutions.
Branch Prediction
A whole new article could be written on the subject or branch prediction so we will limit ourselves to a few common solutions. Before we look at these, it should be noted that all these schemes make some assumptions about the nature of the code that runs on a processor. In general, instructions aren’t executed from random places in memory but have some localisation, the simplest being located in consecutive addresses in memory. Even when there is a branch, this might be to code that is not far removed from the current PC address, such as a small local loop. If it were truly random, branch prediction wouldn’t really work.
There are two major categories of branch prediction; static and dynamic. The simpler of the two is static branch prediction. Here a predetermined action will be taken when processing a PC updating instruction. Below is a list of three popular strategies.
The first is to assume that the branch or jump is always taken and to flush the pipeline. When encountering a branch or jump, the pipeline effectively stops fetching new instructions until all the stages have completed and any new PC address has been written to the PC. Whilst flushing, the pipeline is ‘stuffed’ with no-operations instructions (e.g., addi x0, x0, 0), This completely avoids any branch hazard but means execution time is increased to flush the pipe, even if the branch would not have been taken. It also requires a partial decode of the instruction to flag a branch or jump. An improvement might be to pre-calculate the new PC value if taken and start fetching from there. This is okay for branch instructions and jump-and-link (jal) instructions, as the potential new PC value is just the current value plus an immediate value from the instruction. The jump-and-link-relative instruction (jalr), though, is also a function of a source register, indexed by rs1, which might be stale in the register file, with an instruction ahead in the pipe about to update it. Partial decode, addition, and handling stale register values quickly becomes complicated, potentially creating critical path logic or the need to add more pipeline stages to relieve this with the writeback hazards also needing addressing in the new stages.
The next strategy is to never take the branch. That is, we will carry on linearly regardless. The advantage of this is that no early partial decode is required and we will be right for a proportion of the time and incur no penalty. How big a proportion depends on the nature of the code running on the core. The disadvantage is that, when wrong, the instructions in the pipe must be cancelled by some method—marked as invalid or forced to a “no-operation” (nop) instruction, etc. Also, if the instruction was a jump, then it could be known in advance that the instruction would be taken. It is perhaps the simplest to implement and is an improvement on the flushing variant of the always-take method but avoids the always-take variant of new PC calculation and complexity of fetching from the non-consecutive address.
The last static method I want to mention is the take-on-negative. That is, for a branch, the branch is taken if the new PC is less-then-or-equal to the current PC (negative) but isn’t taken if the new PC is greater than the current PC. To determine this only the immediate value needs to be processed as either negative or not. For jumps, of course, then it is unconditionally taken. The complexity of logic for always-take methods still applies. The reason this might be an improvement is an assumption about the nature of code—particularly compiled higher level language code—where local loops (such as for-loops of C or C++ for example) are often executed multiple times before exiting, making jumping back more likely than jumping forward. By assuming a backwards branch might be a local loop that is not about to terminate, taking the branch makes the likelihood of being correct higher. Of course, a backwards jump might not be this situation and thus the wrong decision to take.
Lastly, in this section, I want to look at dynamic branch prediction. This differs from static branch prediction as the decision to take or not take is based on state that is going to be kept at run time and so the decision is not-predetermined. There is only space in this article to look at one simple method for illustration purposes, but this is a vast subject and there are many solutions of varying complexity, some of which are proprietary and not in the public domain, I’m sure.
A table is kept of recent branch locations (addresses) along with some state to use in making a decision when next encountered. The table will be finite, so some method for replacing entries when newly encountered branch locations are executed. As such, this is like a cache of branch addresses with a least-recently-used (LRU) replacement algorithm (see my article on caches). Like for static methods this relies on some locality of code execution for some period. The state kept is just a (signed) counter that increments to a maximum value when a branch is taken (that is, actually taken, not the predicted value), and decrements to a minimum value when a branch isn’t (actually) taken. For the decision on taking or not taking, if the counter is positive, then the branch is predicted as taken, and if negative it is predicted as not taken. The number of bits in the counter mustn’t be so small as to add no real prediction, but not so large that it takes too long to respond to changes in the direction. In fact a two-bit counter is a practical value. The diagram below summarises this:
The starting position can be either at a count of 0 or a count of 1 since no previous information is available so it’s a 50:50 situations whether to predict one way or the other. This state is kept separately for each branch address in the table until an old address is swapped out for a newly encountered branch address with initialised state.
As you can see, branch prediction gets messy pretty quickly, but much attention is given to it as it is all about the efficiency of executing instructions and not pausing in execution due to branching. This is not unlike the added complexity of caches for instruction and data in memory sub-systems which also can stall a pipeline waiting for data to be written or read with wait states.
The memory access wait states are also something that needs to be handled within an implementation which we will look at in the example implementation. This is usually done by stalling the pipeline when wait states are inserted by the memory sub-system. Functionally, this is fairly simple with the update of output state held whilst the memory access is pending. However, care in the design needs to be taken so as not to stall a cycle after a wait state is asserted and can lead to critical paths in the pipeline if using an externally generated signal directly for stalling all stages.
Superscalar
With a pipelined architecture, fixing all the hazards it generates, an efficient branch predictor, and a single cycle memory sub-system, a design could approach running at 1 instruction per cycle. This then is the limit of what we can achieve for a single core, right? Can we do any better? The answer is yes, using superscalar architecture.
The limiting factor is the pipeline, if saturated, can’t process instructions any faster. By duplicating the logic within a core, with effectively two (or more) pipelines, instructions can be executed in parallel. The reading of memory instructions needs now to be at a higher bandwidth to match the increased throughput of the core. This might be reading over a wider bus (64-bits when the instructions are 32-bits such as for RV32I) or running the fetch cycles at a higher clock frequency (or a combination of both). The diagram below shows a simplified two pipeline superscalar arrangement.
Two duplicate sets of stages are present, with the increased bandwidth fetch stage, and a new dispatcher stage in between. Like for the initial move to pipelining, going to superscalar creates more problems. If an instruction is to update a register in one pipeline and the next logical instruction has this as a source then if it is dispatched down the other pipeline in the same cycle, then we have a race condition. What if, though, the dispatcher had a following instruction that didn’t rely on the result from either of the instructions in front and didn’t write back to any register that was a source for them? Then the dispatcher could send that instruction into the other pipeline for execution, “out of order”. The instruction that was paused can then be sent into the pipeline for execution. This keeps the pipeline full without changing the result. For example, if the first instruction is addi, x1, x0, 1 (put 1 into x1) followed by addi x2, x1, 1 (add 1 to x1 and place in x2), then addi x3, x0, 6 (place 6 into x3), then the last instruction can be executed in parallel with the first, then the second instruction dispatched. The result will still be x1 = 1, x2 = 2, and x3 = 6.
Here I have assumed that the pipeline is executing at 100% but, of course, memory accesses can stall a pipeline, branch prediction may fail and cause flushing etc. The dispatcher will actually keep tabs on all the stages for all the pipelines about which destination registers are to be updated. It will prefetch later instructions, up to the depth of the pipeline, in order to lookahead about whether any instruction is a candidate for dispatching out-of-order so it can keep the pipelines as full as possible. Of course this is all complicated by branch and jump instructions—and you thought branch prediction was messy. We are getting to quite advanced features now, and still are only just looking under the hood, but we must stop here and look at an example implementation.
领英推荐
Example Implementation
To finish this article, I want to review the design of the RISC-V implementation on github to illustrate some of the topics discussed above. This will not be exhaustive as much of the detail of the design is given in the logic’s documentation. Not everything we’ve talked about is implemented to keep things simple enough to understand but to have advanced enough features for an efficient design. The implementation has the following specification.
The RV32M and RV32E features, though available, need not bother us here and can actually be configured out.
Top Level
The base RV32I logic is implemented as a 5-stage pipe in 3 separate blocks. A register file (regfile) implements the internal registers and the program counter. As such it spans both the fetch stage, by providing the PC, and the writeback phase where registers and the PC are updated. The rs prefetch and decode stages are implemented in the decoder block, and the execute and writeback phases are implemented in the ALU block (with regfile register updates in the writeback phase as well). The diagram shows the top-level block diagram arrangement.
The diagram does not show the Zicsr or RV32M blocks. These logically sit in the execute phase and, in this implementation, they are in separate modules. This is so that the core can be easily configured without these features for reduced size if those functions are not required. It requires some additional muxing to do this but appeared not to be on the critical timing path. Otherwise, the additional instructions could have been added to the ALU and the CSR registers to the regfile blocks. The decoder already partially decodes these instructions and forwards them to the blocks.
Register File
The regfile block contains the general-purpose registers and PC. It is configurable to use either logic gates or memory. Which is best depends on the target technology.?FPGAs tend to have available memory blocks which, if not all already allocated, can be used for the registers, saving on consuming logic resources which might be at a premium in a small device. There is an issue, though, if the memories do not have dual read ports. R-Type instructions, as discussed in the second article in this series, have two source register to fetch. Either an additional pipeline stage for fetching the registers separately can be used (not preferred) or, if resources allow, two memories can be used. When updating registers during writeback, both memories are updated with the same value at the same offset. When fetching source registers data, one memory is allocated for rs1 and the other for rs2. For targeting an ASIC, whether logic or memory is more efficient will depend on the technology. A dual port memory is more likely to be available for ASICs though.
Whether a memory or logic, the registers themselves are just 31 registers of 32 bits wide. Logic intercepts reads and writes to x0, so it always reads 0. The regfile will increment the PC by default, unless the writeback stage indicates an update.
Decoder
The decoder block extracts the source register indexes and sends then to the regfile for reading the value during the rs prefetch phase. This is done for rs1 and rs2?regardless of whether the instruction ultimately require them. The instruction is then registered for the decode phase. The diagram below shows the main functions within the decoder logic:
The decoder extracts the immediate bits from the instruction. Since these bits vary according to instruction type, each type is extracted and then the correct one selected and, where applicable, sign-extended, within the decode phase. The register values returned from the prefetch phase are routed to one of two decoder output parameters, a and b, for routing to the ALU. This diagram shows that these returned register values can be overridden by feedback rd values from the writeback stage, discussed later. As well as rs1 for a, and rs2 for b, these outputs can be selected to be the PC value (from regfile) or 0 for a, and the trap vector (Zicsr’s mtvec + offset value) and immediate value for b, depending on the instruction. The diagram shows which type of instructions these are selected for. Other decoder outputs for the ALU include the PC value forwarded, the immediate value to be used as a PC offset, the instruction’s rd index value and the set of controls to select the ALU execution, decoded from the instruction. These are essentially a bitmap of what operations to perform in the execute phase. All these outputs are registered to place them in the execute phase.
ALU
The ALU is essentially going to take the a and b parameters from the decoder and generate a result on a c output. This is true for all the data manipulation instructions and the store instruction, whilst PC and memory read data are used for loads and jumps respectively. Some addition outputs are generated for memory accesses and PC updates. The diagram below shows a simplified block diagram for the ALU.
The controls from the decoder select which operation to perform. The load instruction has an alignment operation on data returned from memory and the jump instruction takes the forwarded PC and adds 4. The rest of the instructions use a and b, as selected during the decode phase. The result is registered as output c.
In addition to this main ALU operation, the block generates a strobe for updating the pc, with the new PC value being either the PC input plus the forwarded offset value for branches, or a plus b for jumps (which contain PC and immediate values). Load and store strobes are generated as appropriate, with an address output calculated from the forwarded offset value and the b parameter. From this address any byte enables are calculated from the lower bits. The value to be written for stores is on the c output. Finally, the rd index value is output for the writeback of the ?c result to the register file.
In the Verilog implementation the ALU code is not fully optimized for gate count.?For example, there are three shift RV32I shift instructions, sll, srl and sra. In the code this is handled by calculating the result for each of the possible instructions and then selecting the required value depending on the particular instruction active. The code snippet below shows this:
Hopefully this clear how this operates to produce a shift result, but the three innocuous lines with the shifts hide a lot of synthesized gates, with 32-bit barrel shifters which use a fair amount of gates. A single signed barrel shifter might have been employed with a_signed a 33-bit sign extended version of a. For right shifts, a_signed is then used for both srl and sra. For shift left, the input of the barrel shifter is a_signed with bits 31:0 bit reversed and bit 32 set to 0. The output of the barrel shifter is then extracted and bits 31:0 bit reversed for the result. This would result in a single 33-bit barrel shifter and some additional muxing, which ought to consume less resources, but may add to the timing path. If timing is not critical then this is a useful optimisation, but the code is less clear. Actually, even the shift for aligning data on loads and stores could also use the same barrel shifter, adding additional muxing to the inputs and outputs. Similarly, a maximum of 2 adders/subtractors are needed (jal and jalr add 4 to the PC, plus the new PC value addition calculation), but actually 4 are used for adds, subtracts, new PC calculation, and PC + 4. In the implementation, clarity and speed where chosen over resources. All these decisions are a classic compromise between clarity, logic resources and speed.
Hazards and Stalls
In the logic implementation, write back hazards are dealt with in the decoder and ALU by comparing the rd index on an active writeback with the rs indexes for that stage. The diagram below shows the logic for selecting between the input from the previous stage and the fed-back rd value. Note that the writeback logic ensures rd is 0 when no active writeback.
Within the register file, if using memory, it takes two cycles before a value written to the register array can be available at the output. When the decoder reads from the registers, if currently writing to a matching rs index then the write value is sent back. If matching against the value in the previous cycle, this is selected, otherwise that from the register file is sent. This is illustrated in the diagram below:
To deal with branch hazards under the chosen never-take policy, whenever a branch (or jump) updates the PC, the ALU cancels the next instruction it was about to execute and forwards this ‘branch taken’ state to the decoder to clear its output state, effectively turning the decoder output to a “no-operation” so the ALU will not do anything with this instruction as well.
Stalling on memory accesses is a matter of detecting a wait state and forwarding a stall signal to all the stages to hold their state. Global stalls can be a source of critical path and faster means might be employed such as a shallow FIFO between each stage where an output is held-off when the FIFO is full. This keeps the stalling local to an interface between two stages, rather than global, but is more complicated to implement and costly in gates—the outputs from the some of the stages can have many bits.
Zicsr
The control and status registers are implemented in a separate module and are logic gates rather than an array. As indicated in the previous article only a small sub-set of the possible registers will be implemented, similar to the NIOS V/m, with just a few extra. I won’t go over their detailed function which is discussed in the last article but is mainly concerned with exception handling, with some additional counters.
The Zicsr logic handles both the external exceptions (interrupts, software, and timer) and the internal exceptions. The interrupt and software exceptions are signals that originate from outside the core, whereas the timer interrupts are part of the Zicsr logic. The timer can be read and updated from outside the core so that it can be mapped into the memory address space, as per the specification. The timer can be configured to be a 1μs real-time count, or just reflect the clock cycle counts in mcycle and mcycleh, which are always implemented. Since this implementation has a fixed clock, using the cycle counts is valid. If it ever had a variable clock input, that would not be true.
All the internal exceptions originate from the decoder, either as decoding an ecall or ebreak instruction, or as an error condition, such as misaligned addresses etc.
The Zicsr block has outputs to update the PC on an exception and also to write to a general-purpose register when executing a CSR instruction. The decoder partially decodes the CSR instructions and forwards the parameters to the Zicsr block to complete decoding and access the registers. The system instruction, mret, is decoded by the decoder block, and a flag sent to the Zicsr block. In this case the decoder sends a “no-operation” to the ALU as the Zicsr block will handle the instruction. The logic updates the CSR register state and also updates the PC to the exception return address that saved in the mepc register when an exception was generated.
The Zicsr logic also, configurably, adds the retired instruction counter registers (minstret and minstreth) which are incremented on a pulse from the ALU for each instruction completed.
Conclusions
In this article we have looked at what the minimum logic might be to implement a RISC-V processor, starting with just a functional approach based around an FSM. From then on, we looked at ways of improving on this basic design, not to add functionality but to make it operate more efficiently.
Starting with pipelining, the design could suddenly process more instructions per cycle, but at a cost of introducing some hazards which needed to be solved with writeback feedforwards and branch hazard handling. The branch hazards were dealt with using branch prediction algorithms of varying complexity and efficiency and were based on some assumptions about the nature of executing code. This got quickly messy and is only the beginning of the range of possible algorithms employed. With these methods the design could approach executing (in the limit) one instruction per cycle. Looking at superscalar architectures, this can be improved even further but, once again, this has added complications. Complex logic is needed for multiple pipelines and out-of-order execution of instructions without altering the functional results.
Finally, we reviewed an implementation that sits somewhere between the simplest FSM based approach and the complex solutions already looked at which, hopefully, can give a useful example of an IP core for further exploration. The review was cursory, and the implementation documentation gives a fuller picture of the design details for those interested in following this up. Perhaps have a go at implementing a core yourself.
In the next, and final, article in the series we look at the RISC-V assembly language so that the core can be programmed. This may not be of interest to everyone reading these articles, and so it has been left until last, but it ought to be informative and for those wishing to run simulations on the example core (or its accompanying instruction set simulator) will allow them to explore executing all the different instructions.
Staff Digital Design Engineer at Synopsys
2 年Thanks for your time