C++ Modelling of SoC Systems Part 1: Processor Elements

Simon Southwell

Semi-retired logic, software and systems designer. Technical writer, mentor, educator and presenter.

Published: 2 October 2023

Introduction

In this article I want to discuss the modelling of System-on-Chip (SoC) systems in software. In particular in C++, but other programming languages could be used. The choice will depend on many factors but, as we shall see, there are some advantages in modelling with a language that will also be the ‘programming language’ of the model. Modelling processor-based systems in software is not uncommon. In my own career alone, I have seen this done, to varying degrees, at Quadrics, Infineon, Blackberry, u-blox and Global Inkjet Systems and have been involved in constructing some of these models and used the models at all of them.

SoCs are generally characterised by having on a single chip many of the functions that might, in the past, have been separate components on a PCB, or set of PCBs. We can define some common characteristics of an SoC that we’re likely to find on any device. The SoC will have one or more processor cores, and this immediately implies a memory sub-system, with a combination of one or more devices from ROM, Flash, SRAM, DDR DRAM etc., to facilitate running programs and operating systems. The cores may have caches to varying levels, and may support virtual memory, implying an MMU (memory management unit) or, if not VM support, at least memory protection in the form of an MPU (memory protection unit). Some memory mapped interconnect or bus will be needed for the core(s) to access memory and other devices, so an interconnect/bus system will be present, such as Amba busses (APB, AHB, AXI, CHI), or Intel/Altera busses (Avalon). Almost certainly, the processor will need to support interrupts, and an interrupt controller would then be needed for multiple nested interrupts with, perhaps, support for both level and edge triggered interrupts. If support for a multi-tasking and/or real-time operating system is needed, then a real-time timer that can generate interrupts will be present, along with other counter/timer functionality, including perhaps watchdog timers.

Once we have the processor system, with memory and interconnect, the SoC will need to interact with the real world via peripherals. These might be mapped on the main memory address map, but there may be a separate I/O space. The peripherals might be low bandwidth serial interfaces (UART, I2C, SPI, CAN), or higher bandwidth interfaces, such as SDIO, USB, Gigabit Ethernet or PCIe. Moving data from the interfaces (especially those with high bandwidth) might require direct memory access (DMA) functionality, utilising streaming bus protocols. Encryption and security support may also be required. For control and status of external devices, a certain number of general purpose I/O (GPIO) pins may be supported, as well as analogue-to-digital converters (ADCs) and/or digital-to-analogue convertors (DACs). There may also be peripherals to drive display devices such as LCD displays.

Given a set (or sub-set) of these general-purpose functions, custom functions specific to the system being developed can be added where required. In an FPGA-based SoC these might be implemented in the logic part of the device. The diagrams below show two commonly used FPGA devices that have SoC hard macro logic (custom ASIC-type logic implementation): one from AMD and one from Intel.

These two devices have very similar architectures and sets of SoC components. This is not surprising for two reasons. Firstly, they serve the same market and are competitors in that market, and secondly, they are based around the same processor system, namely the ARM Cortex-A, using the same Amba interconnect. They do, however, give a good reference point to what a generic SoC might look like, and what functionality is present. They contain a lot of options for interfaces and protocols, and a specific implementation may not use all of them, so any modelling of a given system need only model what is going to be present and used in the implementation.

SoCs are not restricted to FPGAs, and many ASICs follow this same pattern. I worked on a Zigbee based wireless ASIC which was also ARM based and had a smaller set of peripherals, but not dissimilar to those above, so that customers could adapt the chip for their specific application.

Having defined a typical set of functions one might find in an SoC, we find that there are a lot of complex things present, from the processor cores to the peripherals and all the functionality and protocols in between. How can we make a model that covers all this functionality?

If cycle accurate modelling is required then the model is likely to converge on the complexity of the logic implementation, and we need some strategies to simplify the problem or else the model development will rival the logic in effort and elapsed time. It is possible, if an HDL implementation of the logic is available, to convert this to a programming language. The Verilator simulator can convert SystemVerilog or Verilog to C++ or SystemC (a set of libraries for event driven simulation), which can be interfaced to other C++ based model functions. However, this rather negates some of the advantages of having a C++ model; namely, having a model on which software can be developed before a logic implementation is available, speed of execution, and ease of system modification for architecture experimentation and exploration. So, is it worth making a software model of an SoC system at all?

In the rest of this article, and the following article(s), I want to break down each of the functions we looked at for an SoC and look at strategies for putting together software models to quickly and usefully construct a system that can be used to develop software and explore a design space either before committing to a logic development or used in parallel with a development to shorten schedules and mitigate risk. We will begin in this article by looking at ways to have a processing element on which we can run our embedded software.

Modelling the Processing Element

The beating heart of an SoC is its processor cores. One of the motivations for building a software model is to execute software that is targeted at the final product. The software will run on one or more cores and, in general, have a memory mapped view of the rest of the SoC hardware. There may be a separation of memory and I/O spaces, but this is just another simple level of decode. Other than the memory and I/O views, the only other external interaction the processor cores usually have is that of interrupts. This may range from a single interrupt (e.g., RISC-V), where an external interrupt controller can handle multiple sources of interrupt, to having a vector of interrupt inputs (e.g., ARM Cortex with built in NVIC). Actually, in either case, the interrupt controller could be modelled as an external peripheral. What remains is what I call the ‘processing element’, with just these memory and interrupt interfaces. This simplifies what we have to model considerably. The next question is, what processor do we need to model? There are two answers to this, one of which is obvious, and the other not so obvious:

Model the processor that is targeted for the product.
Don't model the processor.

The first answer is, I hope you’ll see, the obvious answer, and we will look at constructing instruction set simulators later, including with timing accurate modelling.

The second option is not so obvious. Whatever we model, we want to run a program that can read and write to memory (or I/O space) and be interrupted, just like a real processor core. If we present an API to that software so that it can do these memory accesses and have interrupt service routines called when an external interrupt event occurs, then we are close to a solution. The model, presumably, is compiled and run on a PC or workstation, likely compiled for an x86-64 processor. Even if the embedded software is targeted for a different processor, such as a RISC-V RV32G processor, it might still be possible to cross-compile it for the model’s host machine, especially if steps are taken to ease this process, as we will discuss shortly. This saves on constructing a specific processor model, which requires a good understanding of the processor’s instruction set architecture (ISA), and is particularly useful when no third party model is available. Since an instruction set simulator is, itself, just a program, once we have a generic processing element model, we can simply make the program we run on it an ISS and, voila, we have a model that can run code for an architecture other than the host computer’s processor. The diagram below summarises these two cases:

Hopefully it is clear that one is, in general, just an extension of the other and that taking a generic processor route has an ‘upgrade path’ for more accurate processor modelling as the next logical step.

In the next section I want to look at this generic processing element approach, before looking at methods for constructing instruction set simulators for specific processor instruction set architectures (ISAs).

Generic Processor Models

The question of whether to take a generic processor model approach or use an ISS is really down to the timing accuracy required of the model. With an ISS, instruction execution time is usually well documented and can be built into the model, as we will see when discussing this approach. For a generic solution, we can still have timing models, but these will be cruder estimates based on statistical modelling (or educated guesses). Nonetheless, this may still be very useful in constructing the embedded code and running on a model with the desired peripheral functionality.

Memory Accesses

It's fair to say, I think, that most SoC processors' memory accesses will be done through the equivalent of load and store instructions of fairly limited functionality, perhaps being able to load or store anything from bytes to double words. From a software viewpoint, this is largely hidden (unless writing in assembly language), and the software manipulates variables, arrays, structures, class members etc. In a generic processor model these memory data structures can just be part of the program and reside on the host. It gets interesting when accessing memory mapped peripherals and their registers.

The simplest API for accessing memory mapped space within the SoC model is perhaps a pair of C like functions, or C++ methods in a class, to read and write such as shown below (assuming ultimately targeting a 32-bit processor):

uint32_t read_mem  (uint32_t addr, access_type type, bool &fault);
void     write_mem (uint32_t addr, uint32_t data, access_type type,
                    bool &fault);

The type argument defines the type of access: byte, half-word and so on. Of course, these will be wrapped up as methods in an API class, and there may be I/O equivalents. This isn’t an article on C++, but the functions could be overloaded so that the type of data (the return type for read_mem, and the data argument for write_mem) could define the type of access, dropping the type argument. Where possible I will avoid obfuscating the points being made with these kinds of ‘best practice’ optimisations. When writing your own models, you should use good coding style (and comment liberally), but I want to keep things simple. You can, of course, write the whole thing in C, and the embedded code to be run on the model may well be in that language in any case.

In many of the embedded systems I have worked on, the software has a virtualising layer between the main code and accessing the registers of the various memory mapped hardware. This is a Hardware Abstraction Layer (HAL) and might consist of a set of classes that define access methods to all the different registers and their sub-fields (perhaps one per peripheral), built into a hierarchy that matches that of the SoC. That is, a sub-unit may consist of a set of peripherals, each with their own register access class, and even, perhaps, some memory, gathered into a parent class for the sub-unit. The advantage here of having a HAL is that it can be used to hide the access methods we defined above and make compiling the code for both the target and the host running the model that much easier. Ultimately, the HAL will do a load or a store to a memory location. If we arrange things so that, when compiled for the target, the HAL simply makes a memory access (a = *reg or *reg = a), but when compiled for the model it calls the methods (a = read_mem(reg, WORD, fault) or write_mem(reg, a, WORD, fault)), then the embedded software gets the same view of the SoC registers whether running on the target platform or running on the generic processor as part of the SoC model. Indeed, this was done at one of my employers and the HAL was automatically generated from JSON descriptions, as was the register logic, ensuring that the software and hardware views agreed. Again, avoiding C++ nuances, it is possible (for those interested), if the register types are not the standard types (e.g., uint32_t) but a custom type, for accesses such as a = *reg or *reg = a to be overloaded to call the read and write methods, so retaining pointer access. This is more complicated, and a HAL would virtualise this away anyway, making it unnecessary.
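As a minimal sketch of the HAL idea, the code below switches between a direct pointer access and the model's read_mem/write_mem functions at compile time. The BUILD_FOR_TARGET flag, the hal_read/hal_write helpers, the UartHal class and its register offset are all hypothetical names for illustration, and a std::map stands in for the rest of the SoC model.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

enum access_type { BYTE, HWORD, WORD };  // assumed encoding of the access size

#ifdef BUILD_FOR_TARGET                   // hypothetical build flag
// Target build: a register access is just a volatile load or store.
static inline uint32_t hal_read(uint32_t addr) {
    return *reinterpret_cast<volatile uint32_t*>(static_cast<uintptr_t>(addr));
}
static inline void hal_write(uint32_t addr, uint32_t data) {
    *reinterpret_cast<volatile uint32_t*>(static_cast<uintptr_t>(addr)) = data;
}
#else
// Model build: a std::map stands in for the rest of the SoC model.
static std::map<uint32_t, uint32_t> model_regs;

uint32_t read_mem(uint32_t addr, access_type /*type*/, bool &fault) {
    fault = false;
    std::map<uint32_t, uint32_t>::const_iterator it = model_regs.find(addr);
    return (it == model_regs.end()) ? 0 : it->second;
}
void write_mem(uint32_t addr, uint32_t data, access_type /*type*/, bool &fault) {
    fault = false;
    model_regs[addr] = data;
}
static inline uint32_t hal_read(uint32_t addr) {
    bool fault = false;
    return read_mem(addr, WORD, fault);
}
static inline void hal_write(uint32_t addr, uint32_t data) {
    bool fault = false;
    write_mem(addr, data, WORD, fault);
}
#endif

// Hypothetical HAL class for one peripheral: the embedded code only sees this.
class UartHal {
public:
    explicit UartHal(uint32_t base_addr) : base(base_addr) {}
    void     setBaudDiv(uint32_t div) { hal_write(base + BAUD_DIV_OFFSET, div); }
    uint32_t getBaudDiv() const       { return hal_read(base + BAUD_DIV_OFFSET); }
private:
    enum { BAUD_DIV_OFFSET = 0x04 };  // made-up register offset
    uint32_t base;
};

// Example embedded code: identical source for target and model builds.
void uart_setup_example() {
    UartHal uart(0x80001000u);        // made-up base address
    uart.setBaudDiv(27);
    assert(uart.getBaudDiv() == 27);  // holds on the model's simple register store
}
```

The embedded code in uart_setup_example never references read_mem or write_mem directly, so only the HAL needs to know which build is in use.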

Whether overloading pointers, using a HAL, or just calling a read and write API method directly, from a software view we have an API for reading and writing to a memory mapped bus. We haven’t discussed what goes in these methods yet, but we will get to this when we talk about modelling the bus/interconnect.

Interrupts

The other interface to the model of a processing element we identified was for interrupts. Notoriously, on a PC or workstation, when running user privilege programs we don’t have access to the computer’s interrupts directly. Fortunately, we do not need to.

In a real processor core, at each execution of an instruction, the logic will inspect interrupt inputs, gating them through specific and then master interrupt enables and if one is active, and enabled, will alter the flow of the program in accordance with the processor’s architecture. Thus the granularity of an interrupt is at the instruction level. For our generic processor model, we aren’t running at the instruction level, but just running a program on a host machine. We do, however, access the SoC model with the read and write API calls. Since the SoC model will be the source of interrupts, this is a good point to inspect the current interrupt state. Glossing over just how that state might get updated for the moment, so long as, at each read and write call, the interrupt state can be inspected, we can implement interrupts and have interrupt service routine functions.

If the read_mem and write_mem methods of the memory access class call a process_int method as the first thing they do, then this can keep interrupt state and make decisions on whether to call an interrupt service routine (ISR) method. The main program is stalled at the memory access call whilst the ISR is running, and so will return to that point when the ISR method exits. The ISRs themselves can access memory and can also be interrupted by higher priority interrupts allowing hierarchical interrupt modelling to be achieved. A sketch for an API class with interrupts is shown below:

class ApiWithInterrupts
{
public:
      static const int max_interrupts = 32;

      ApiWithInterrupts () {
          int_active        = 0;
          int_enabled       = 0;
          int_req           = 0;
          int_master_enable = false;

          for (int idx = 0; idx < max_interrupts; idx++) {
              isr[idx]      = NULL;
          }
      }

      void  write_mem (uint32_t addr, uint32_t  data, access_type type,
                       bool &fault) {
          process_int();
          /* write access code */
      }

      uint32_t read_mem  (uint32_t addr, access_type type, bool &fault) {
          process_int();
          /* read access code, returning the value read */
      }

      void enableMasterInterrupt  (void);
      void disableMasterInterrupt (void);

      void enableIsr    (const int int_num);
      void disableIsr   (const int int_num);

      void updateIntReq (const uint32_t intReq);
      void registerIsr  (const pVUserInt_t isrFunc, const unsigned level);

private:
      // Inspects interrupt state and calls an ISR if one is due
      void process_int();

      pVUserInt_t isr[max_interrupts];  // Registered ISR functions, one per level
      uint32_t    int_enabled;          // Bitmap of individual interrupt enables
      bool        int_master_enable;    // Master interrupt enable
      uint32_t    int_req;              // Bitmap of current interrupt requests
      uint32_t    int_active;           // Bitmap of active interrupts
};

Here we have a class with the two methods for read and write, and I’ve shown these with some code to show that an internal process_int method is called before actually processing the access. The class contains some state, with an array of function pointers, set to NULL in the constructor, which can be set to point to external functions via the registerIsr() method. A master interrupt variable, int_master_enable, can be set or cleared with enable- or disableMasterInterrupt methods. Similarly, the individual enables can be controlled with enableIsr and disableIsr methods. To actually interrupt the code, the updateIntReq method is called with the new interrupt state, which would set the int_req internal bitmap variable, which process_int will process. A bitmap int_active variable is also used by process_int to indicate which interrupts are active (i.e., requested and enabled). There can be more than one active, and the highest priority will be the one that is executing.
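To make the behaviour of process_int more concrete, here is one possible sketch of the logic. It is not the OSVVM code. It assumes that bit 0 is the highest priority, that an ISR is a function of type int (*)(void), and that int_active tracks the levels whose ISR is currently in progress. The IntModel class name, the demonstration ISRs and run_nested_demo are hypothetical, and process_int is made public here only so the demonstration can call it directly in place of a memory access.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef int (*isr_func_t)(void);   // assumed ISR signature

class IntModel {
public:
    static const int max_interrupts = 32;

    IntModel() : int_enabled(0), int_master_enable(false), int_req(0), int_active(0) {
        for (int idx = 0; idx < max_interrupts; idx++) isr[idx] = nullptr;
    }
    void enableMasterInterrupt()  { int_master_enable = true;  }
    void disableMasterInterrupt() { int_master_enable = false; }
    void enableIsr (int int_num)  { int_enabled |=  (1u << int_num); }
    void disableIsr(int int_num)  { int_enabled &= ~(1u << int_num); }
    void updateIntReq(uint32_t intReq) { int_req = intReq; }
    void registerIsr(isr_func_t isrFunc, unsigned level) { isr[level] = isrFunc; }

    // Would be called at the top of read_mem and write_mem.
    void process_int() {
        if (!int_master_enable) return;
        for (int level = 0; level < max_interrupts; level++) {  // 0 = highest priority
            const uint32_t mask = 1u << level;
            // An ISR in progress at this level blocks itself and all lower levels.
            if (int_active & mask) return;
            if ((int_req & int_enabled & mask) && isr[level] != nullptr) {
                int_active |= mask;
                isr[level]();          // the caller is stalled until the ISR returns
                int_active &= ~mask;
            }
        }
    }
private:
    isr_func_t isr[max_interrupts];
    uint32_t   int_enabled;
    bool       int_master_enable;
    uint32_t   int_req;
    uint32_t   int_active;
};

// Demonstration: ISR 1 is running when the higher priority interrupt 0 arrives.
static IntModel         cpu;
static std::vector<int> trace;

int isr_level0() { trace.push_back(0); cpu.updateIntReq(0); return 0; }
int isr_level1() {
    trace.push_back(1);
    cpu.updateIntReq(1u << 0);   // source 1 cleared, source 0 now requesting
    cpu.process_int();           // as would happen on a memory access in the ISR
    trace.push_back(11);         // ISR 1 resumes after ISR 0 has returned
    return 0;
}

std::vector<int> run_nested_demo() {
    trace.clear();
    cpu.registerIsr(isr_level0, 0);
    cpu.registerIsr(isr_level1, 1);
    cpu.enableIsr(0);
    cpu.enableIsr(1);
    cpu.enableMasterInterrupt();
    cpu.updateIntReq(1u << 1);   // the SoC model raises interrupt 1
    cpu.process_int();           // the main program makes a memory access
    return trace;
}
```

In this sketch the ISRs are responsible for clearing their request (here via updateIntReq), otherwise the same ISR would be called again at the next access, much as a level-sensitive interrupt would behave.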

This type of method is used with the OSVVM co-simulation code, and I write about how this is done in more detail in a blog on that website. In this environment there is an OsvvmCosim class with, amongst other methods, transRead and transWrite methods. This is used as a base class to derive an OsvvmCosimInt class, which then overloads the transRead and transWrite methods (and others) to insert a processInt method call which models the interrupts. The ISRs don’t reside within the class, but external functions can be registered by the user to be called for each of the ISR priority levels. The blog gives more details and the referenced source code can be found on OSVVM’s GitHub repository, so I won’t repeat the description here; the details of the processInt method of that code serve to show how this would be done with the sketch class described above, and its process_int method.

So here we have a framework to build an SoC and run a program. We have defined a class with read and write capabilities and the ability to update interrupt state and have the running program interrupted with prioritised, nested and interruptible interrupt service routines, provided externally by registering them with our class. We can now write a program and a set of ISRs that uses this class to do memory mapped accesses and support interrupts. I’ve left off the details of the read_mem and write_mem methods, for now, as this is how we will talk to the rest of the model, which will be dealt with in another article.

Instruction Set Simulators

With the class defined from the last section we can write arbitrary programs and interact with the rest of the model (when we get that far). Of course, that arbitrary program could just be an instruction set simulator (ISS). One difference is that the granularity of interrupts will be at the instruction level, rather than the read and write memory level, and the ISS model of the processor itself, in some cases, will contain the interrupt handling code. Thus the API class we defined before simplifies considerably, with the read and write methods no longer requiring a process_int call, and all the code associated with interrupts disappears. We still need to inspect interrupt state but, as we shall see, a slightly different method is used. In the OSVVM code, the non-interrupt class (OsvvmCosim) is defined as a base class, and then a derived class (OsvvmCosimInt) overloads the read and write methods to insert an interrupt processing method at the beginning of each one, and then calls the base class’s read or write method. If this split was done to the class from the last section, then the base class could be used for an ISS, which wouldn’t need the interrupt functionality externally. In the rest of this section I want to outline the architecture of an ISS model which is largely common to modelling any processor’s ISA.
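The base and derived class split described above might be sketched as follows. This is an illustration using virtual methods rather than a copy of the OSVVM classes. The MemApi and MemApiInt names are hypothetical, the access type and fault arguments are dropped for brevity, and a std::map stands in for the rest of the model.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Base class: plain memory mapped access, suitable for use by an ISS.
class MemApi {
public:
    virtual ~MemApi() {}
    virtual uint32_t read_mem(uint32_t addr) {
        std::map<uint32_t, uint32_t>::const_iterator it = mem.find(addr);
        return (it == mem.end()) ? 0 : it->second;
    }
    virtual void write_mem(uint32_t addr, uint32_t data) { mem[addr] = data; }
private:
    std::map<uint32_t, uint32_t> mem;   // stand-in for the SoC model
};

// Derived class: the same accesses, with interrupt processing inserted first.
class MemApiInt : public MemApi {
public:
    MemApiInt() : int_checks(0) {}
    uint32_t read_mem(uint32_t addr) override {
        process_int();
        return MemApi::read_mem(addr);
    }
    void write_mem(uint32_t addr, uint32_t data) override {
        process_int();
        MemApi::write_mem(addr, data);
    }
    unsigned int_checks;                // counts calls, for illustration only
private:
    void process_int() { int_checks++; /* interrupt evaluation would go here */ }
};

// A generic program would use MemApiInt; an ISS would use MemApi directly.
void split_example() {
    MemApiInt generic;
    generic.write_mem(0x10, 5);
    assert(generic.read_mem(0x10) == 5);
    assert(generic.int_checks == 2);    // one check per access
}
```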

Just as for a logic implementation, we have some basic operations we must implement:

Reading an instruction
Decoding the instruction
Executing the instruction

These three basic functions are repeated, in an execution loop, indefinitely or until some termination condition has been reached, such as having executed a particular number of instructions, executed a particular instruction (like a break for example) or some such state, set up prior to running the processor. In addition to these basic functions, some state also needs to be modelled for things like internal registers and the program counter. These can all be collected into a processor model class.

For reading an instruction we are already set up, as we can use our API with the read_mem method to read instructions preloaded into memory though, as we’ll see later, this will be done via an indirection. For RISC type processors, such as ARM, RISC-V and LatticeMico32, only one word is read per instruction. Therefore the decode process is completely isolated from the other steps. This is the case for my RISC-V and LatticeMico32 ISS models. For non-RISC, usually older, processors, instructions may be variable in length, with a basic instruction opcode followed by zero or more arguments. Therefore a decode, or partial decode, needs to be done, and then any further bytes/words read before moving to execution. Thus, reading and decoding can be entwined somewhat, complicating the first two stages, though only mildly so. This is the case for my 6502 processor and 8051 microcontroller models.

Decoding an instruction will involve extracting opcode bits to uniquely identify the instruction for execution, with the other bits being ‘arguments’ such as source and destination registers, immediate bits and the like. The number of opcode bits the processor’s ISA defines determines how many possible unique instructions there can be, though they might not all decode to a valid instruction. The number of opcode bits might be quite small, such as for the LatticeMico32, which has 6 bits and 64 possible instructions, and the 8051, which has 8 bits for 256 possible instructions. For other processors it may be much higher, and the RISC-V RV32I processor’s R-type instructions have 17 bits (see my article on RISC-V for more details). Many ISS models I have seen use a switch/case statement for the decoding. For the small opcode processors, like the LatticeMico32 with 6 bits, a switch statement with 64 cases to select the execution code is manageable. For the larger opcode spaces, such as the 17 bits of RISC-V RV32I, this then becomes 131072 cases, most of which will be invalid instructions.

To manage all of the different architectures, I prefer to use a hierarchy of tables which have pointers to instruction execution methods as part of each entry. For the smaller opcode spaces, this table hierarchy can be one deep (i.e., a single flat table), but for the large spaces this is broken down. The RISC-V instruction formats have a common 7-bit opcode, and then have various other functX fields of various sizes, such as a three bit funct3 field or a seven bit funct7 field. We can use this to produce a hierarchy. An initial primary table can be made with the number of entries for the opcode (i.e., 128). Each entry in the table can have a flag saying whether it is an instruction, and then either has a pointer to an instruction execution method or points to another table. A secondary table would have entries for the funct3 field, and a tertiary table would have entries for the funct7 field. This can be repeated for any depth required. Decoding then walks down the tables until it finds an instruction entry.
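The table walk can be sketched in a few lines. The following is a simplified illustration rather than the rv32 ISS code: DecodeEntry, leaf, table, init_tables and decode_and_execute are hypothetical names, only five RV32I instructions are populated (so other valid instructions would decode as reserved here), and the execution functions just return a name instead of updating processor state.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

typedef const char* (*exec_fn_t)(uint32_t instr);  // stand-in execution method

struct DecodeEntry {
    bool               sub_table;  // true: 'next' is valid, false: 'exec' is valid
    const DecodeEntry* next;       // sub-table to index with the next field
    exec_fn_t          exec;       // instruction execution function
};

const char* do_lui  (uint32_t) { return "lui";      }
const char* do_addi (uint32_t) { return "addi";     }
const char* do_andi (uint32_t) { return "andi";     }
const char* do_add  (uint32_t) { return "add";      }
const char* do_sub  (uint32_t) { return "sub";      }
const char* reserved(uint32_t) { return "reserved"; }

static DecodeEntry primary_tbl[128];  // indexed by opcode (instr[6:0])
static DecodeEntry op_imm_tbl [8];    // indexed by funct3 (instr[14:12])
static DecodeEntry op_tbl     [8];    // indexed by funct3 (instr[14:12])
static DecodeEntry add_sub_tbl[128];  // indexed by funct7 (instr[31:25])

DecodeEntry leaf (exec_fn_t fn)        { DecodeEntry e = {false, nullptr, fn}; return e; }
DecodeEntry table(const DecodeEntry* t) { DecodeEntry e = {true, t, nullptr};   return e; }

static void fill_reserved(DecodeEntry* tbl, int n) {
    for (int i = 0; i < n; i++) tbl[i] = leaf(reserved);
}

// Link the hierarchy, as an ISS constructor might.
void init_tables() {
    fill_reserved(primary_tbl, 128);
    fill_reserved(op_imm_tbl,  8);
    fill_reserved(op_tbl,      8);
    fill_reserved(add_sub_tbl, 128);

    primary_tbl[0x37] = leaf(do_lui);       // LUI needs no further decode
    primary_tbl[0x13] = table(op_imm_tbl);  // OP-IMM: decode on funct3
    primary_tbl[0x33] = table(op_tbl);      // OP:     decode on funct3...
    op_imm_tbl[0x0]   = leaf(do_addi);
    op_imm_tbl[0x7]   = leaf(do_andi);
    op_tbl[0x0]       = table(add_sub_tbl); // ...and then on funct7
    add_sub_tbl[0x00] = leaf(do_add);
    add_sub_tbl[0x20] = leaf(do_sub);
}

// Walk down the tables until an instruction entry is found, then execute it.
const char* decode_and_execute(uint32_t instr) {
    const DecodeEntry* e = &primary_tbl[instr & 0x7f];
    if (e->sub_table) e = &e->next[(instr >> 12) & 0x07];
    if (e->sub_table) e = &e->next[(instr >> 25) & 0x7f];
    return e->sub_table ? reserved(instr) : e->exec(instr);
}
```

After a single call to init_tables(), decode_and_execute(0x00500093), which is addi x1, x0, 5, walks the primary and secondary tables and calls do_addi, whereas an add or sub needs all three levels.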

The diagram below, taken from the RV32 ISS Reference Manual, shows this situation.

So what might each table entry look like? Here we define a structure (class) to group all the relevant information and make an array of these structures for the tables. The code snippet below shows a top-level structure for the rv32 ISS.

typedef struct rv32i_decode_table_t
{
    // Flag to indicate 'ref' is a sub-table reference
    // (and not an instruction entry)
    bool                      sub_table;

    // Either a reference to an instruction entry
    // or a reference to a sub-table
    union {
        rv32i_table_entry_t   entry;   // A decoded entry
        rv32i_decode_table_t* p_entry; // A pointer to a sub-table
    } ref;

    // Pointer to an instruction method
    pFunc_t                   p;
} rv32i_decode_table_t;

This structure has the sub-table flag, a union of either a pointer to a decoded instruction data structure or to another table and then a pointer to an instruction execution function (which is null if a sub-table). The decoded instruction data structure is all the fields of the instruction extracted out, which is ‘filled in’ by the decode code. Since this will be passed to all the instruction execution functions, it contains all possible fields for all instruction types, so that the instruction execution methods can simply pick out the appropriate values they need.

typedef struct {
   uint32_t            instr;  // Raw instruction
   uint32_t            opcode; // Opcode field
   uint32_t            funct3; // Sub-function value (R, I, S and B types)
   uint32_t            funct7; // Sub-function value (R-type)
   uint32_t            rd;     // Destination register
   uint32_t            rs1;    // Source register 1
   uint32_t            rs2;    // Source register 2
   uint32_t            imm_i;  // Sign extended immediate value for I type
   uint32_t            imm_s;  // Sign extended immediate value for S type
   uint32_t            imm_b;  // Sign extended immediate value for B type
   uint32_t            imm_u;  // Sign extended immediate value for U type
   uint32_t            imm_j;  // Sign extended immediate value for J type
   rv32i_table_entry_t entry;  // Copy of instruction table entry
} rv32i_decode_t;

The decode table arrays are constructed and filled by the constructor, to link the table hierarchies and point to the instructions’ execution methods. In the execution loop, decoding does the ‘table walk’, indexing down the tables with the opcode and functX values as indexes, until it reaches an instruction execution entry. The decode structure is filled in from the raw instruction value, and then the instruction method, pointed to in the entry, is called with the decode structure as an argument. If the pointer is pointing to a ‘reserved’ method, then the decoding reached an invalid/unsupported instruction, and an exception can be raised.
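Filling in the decoded fields from the raw instruction is a matter of shifting, masking and sign extending. The sketch below follows the RV32I instruction formats, but the DecodedFields structure, sign_extend and extract_fields are hypothetical names and not the rv32 ISS code; the immediates are held as signed values here for clarity.

```cpp
#include <cassert>
#include <cstdint>

struct DecodedFields {
    uint32_t opcode, funct3, funct7, rd, rs1, rs2;
    int32_t  imm_i, imm_s, imm_b, imm_u, imm_j;
};

// Sign extend the low 'bits' bits of value (upper bits must be zero).
static int32_t sign_extend(uint32_t value, unsigned bits) {
    const uint32_t sign = 1u << (bits - 1);
    return static_cast<int32_t>((value ^ sign) - sign);
}

DecodedFields extract_fields(uint32_t instr) {
    DecodedFields d;
    d.opcode =  instr        & 0x7f;
    d.rd     = (instr >>  7) & 0x1f;
    d.funct3 = (instr >> 12) & 0x07;
    d.rs1    = (instr >> 15) & 0x1f;
    d.rs2    = (instr >> 20) & 0x1f;
    d.funct7 = (instr >> 25) & 0x7f;

    // I-type: imm[11:0] = instr[31:20]
    d.imm_i = sign_extend(instr >> 20, 12);

    // S-type: imm[11:5] = instr[31:25], imm[4:0] = instr[11:7]
    d.imm_s = sign_extend(((instr >> 25) << 5) | ((instr >> 7) & 0x1f), 12);

    // B-type: imm[12|10:5] = instr[31|30:25], imm[4:1|11] = instr[11:8|7]
    d.imm_b = sign_extend(((instr >> 31)          << 12) |
                          (((instr >>  7) & 0x01) << 11) |
                          (((instr >> 25) & 0x3f) <<  5) |
                          (((instr >>  8) & 0x0f) <<  1), 13);

    // U-type: imm[31:12] = instr[31:12], low 12 bits zero
    d.imm_u = static_cast<int32_t>(instr & 0xfffff000u);

    // J-type: imm[20|10:1|11|19:12] = instr[31|30:21|20|19:12]
    d.imm_j = sign_extend(((instr >> 31)           << 20) |
                          (((instr >> 12) & 0x0ff) << 12) |
                          (((instr >> 20) & 0x001) << 11) |
                          (((instr >> 21) & 0x3ff) <<  1), 21);
    return d;
}

// Example: addi x1, x0, -1 encodes as 0xfff00093
void extract_example() {
    DecodedFields d = extract_fields(0xfff00093u);
    assert(d.opcode == 0x13 && d.rd == 1 && d.rs1 == 0 && d.imm_i == -1);
}
```

Extracting every immediate form regardless of instruction type costs a little time per instruction, but it keeps the execution methods free of any bit manipulation.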

The instruction execution methods are now simply a matter of executing the functionality of the instruction. Below is an example of an add instruction method:

void rv32i_cpu::addr(const p_rv32i_decode_t d) {
    // Writes to x0 are discarded, as it is hardwired to zero
    if (d->rd) {
        state.hart[curr_hart].x[d->rd] = state.hart[curr_hart].x[d->rs1] +
                                         state.hart[curr_hart].x[d->rs2];
    }
    increment_pc();
}

As you can see, this is now fairly straightforward. The branch and jump instructions will update the PC (with the former on a condition), whilst the load and store instructions will do reads and writes. Now these could use the API class that we defined earlier, but a better way (as I hope I will convince you) is to have internal read and write methods which call an external callback function registered with the ISS model. (If you are not familiar with pointers to functions and callback methods, I talk about these in one of my articles on real-time operating systems, in the Asymmetric Multi-Processor section.)

The reason I think this is better is because it decouples the ISS completely from the rest of the SoC model, allowing the ISS to drop in to any SoC model, which can register its external memory access callback function and run code on the ISS for that environment. The rv32 ISS read and write methods check for any access errors (such as misaligned addresses) and then execute the external memory callback function if one has been registered. If one hasn’t been registered, or if the call to the callback returns indicating it didn’t handle the access, the ISS will attempt to make an access to a small 64Kbyte memory model it has internally. If the access is outside of this range, then an error is generated.
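The order of checks just described can be sketched as below. This is an illustration, not the rv32 ISS source: the IssMemPort class, the callback signature, the EXT_MEM_NOT_PROCESSED value and soc_mem_callback are assumptions, only word accesses are handled, and the internal memory is modelled as little-endian bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

const int EXT_MEM_NOT_PROCESSED = -1;  // hypothetical "not handled" return code

// Hypothetical callback type, provided by the SoC model.
typedef int (*mem_callback_t)(uint32_t addr, uint32_t &data, bool is_write);

class IssMemPort {
public:
    IssMemPort() : callback(nullptr), internal_mem(64 * 1024, 0) {}

    void register_ext_mem_callback(mem_callback_t fn) { callback = fn; }

    // Both return false if the access caused an error.
    bool read_word (uint32_t addr, uint32_t &data) { return access(addr, data, false); }
    bool write_word(uint32_t addr, uint32_t data)  { return access(addr, data, true);  }

private:
    bool access(uint32_t addr, uint32_t &data, bool is_write) {
        // 1. Check for access errors
        if (addr & 0x3u) return false;                  // misaligned word access

        // 2. Offer the access to the external model, if one is registered
        if (callback && callback(addr, data, is_write) != EXT_MEM_NOT_PROCESSED)
            return true;

        // 3. Fall back to the small internal memory
        if (addr >= internal_mem.size()) return false;  // out of range
        if (is_write) {
            for (unsigned b = 0; b < 4; b++)
                internal_mem[addr + b] = (data >> (8 * b)) & 0xff;
        } else {
            data = 0;
            for (unsigned b = 0; b < 4; b++)
                data |= static_cast<uint32_t>(internal_mem[addr + b]) << (8 * b);
        }
        return true;
    }

    mem_callback_t       callback;
    std::vector<uint8_t> internal_mem;
};

// Example SoC callback: claims only addresses from 0x80000000 upwards.
static uint32_t ext_reg = 0;
int soc_mem_callback(uint32_t addr, uint32_t &data, bool is_write) {
    if (addr < 0x80000000u) return EXT_MEM_NOT_PROCESSED;
    if (is_write) ext_reg = data; else data = ext_reg;
    return 0;   // handled
}
```

Because the ISS only ever sees the callback type, the same ISS can be dropped into a different SoC model just by registering a different function.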

Many (but not all) of the instructions can generate exceptions, and in the rv32 ISS a process_trap method is defined to handle these, called as appropriate from the instruction execution methods with the trap type. The process_trap method simply updates register state for the exception and sets the PC to the appropriate exception address. Since interrupts are forms of exception, we can also have a process_interrupts method. This, though, is not called from the instruction methods, but is in the execution loop so that it is called every instruction. Some processors have a mixture of internal and external interrupt sources. So, for example, the RISC-V processor can generate timer interrupts internally (ironically from timers that are allowed to be external), whilst also having an external interrupt input. In order for the ISS to be able to be interrupted by external code we, once again, use an externally registered callback function. At each call to process_interrupts this callback (if one is registered) is executed and returns the external interrupt request state. This is then processed against the interrupt enable state and, if enabled, a call to process_trap is made, with an interrupt type instead of an internal exception type, and the PC will be altered similarly to that for an exception.

So we now have all the components for the basic functionality, with decode, execution, memory access and exception/interrupt handling. To run an actual program we just need a run loop.

  while (!halt) {
      if (!process_interrupts()) {
          // Fetch instruction
          curr_instr = fetch_instruction();

          // Decode
          p_entry = primary_decode(curr_instr, decode);

          // Execute, or trap if decoding found no valid instruction
          if (p_entry != NULL)
              error = execute(decode, p_entry);
          else
              process_trap(RV32I_ILLEGAL_INSTR);
      }
  }

We can now refine our ISS diagram a bit with the callback functions provided by the external model software, called by the ISS for memory accesses and inspecting interrupt state, and then these callbacks making use of the API we defined earlier.

Timing Models

If modelling timing to some degree is important (and it may not be), then what is possible will depend on the choice between a generic model and a processor specific model. With the generic model the software running on the processing element is just a program running on the host machine, and only when it interacts with the rest of the model (e.g., does a read or write) is there any concept of the advancement of ‘model’ time. Of course, if there is some data on the average mix between memory access and non-memory access instructions in a similar real-world system, then an estimate for clock cycles run between calls to the API read and write methods can be made, and some state kept that’s updated when calls to the methods are made. It will, necessarily, be a crude estimate, but may be a useful approximation of how a system might perform. However, the generic method is not really suitable for more accurate performance measurements.
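A crude estimator of this kind needs very little state. In the sketch below, AccessTimeEstimator is a hypothetical class, and the number of cycles assumed to elapse between accesses is a parameter that would have to come from measurements of a similar system (or an educated guess); the values used in the example are placeholders only.

```cpp
#include <cassert>
#include <cstdint>

class AccessTimeEstimator {
public:
    // cycles_between_accesses: estimated cycles of non-memory instructions
    // executed, on average, between two memory mapped accesses.
    explicit AccessTimeEstimator(uint64_t cycles_between_accesses)
        : gap(cycles_between_accesses), cycles(0) {}

    // Call from read_mem/write_mem with the cycles the access itself took.
    void on_access(uint64_t access_cycles) { cycles += gap + access_cycles; }

    // Current estimate of elapsed model time, in clock cycles.
    uint64_t now() const { return cycles; }

private:
    uint64_t gap;
    uint64_t cycles;
};

// Example with placeholder numbers: 10 cycles assumed between accesses.
void estimator_example() {
    AccessTimeEstimator est(10);
    est.on_access(2);               // e.g. a write taking 2 cycles
    est.on_access(3);               // e.g. a read taking 3 cycles
    assert(est.now() == 25);        // (10 + 2) + (10 + 3)
}
```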

With an ISS things become a lot better. Processor execution times are usually well understood and documented for a given implementation. For example, from the documentation of the rv32 RISC-V softcore we have:

1 cycle for arithmetic and logic instructions
Jumps take 4 cycles
Branches take 1 cycle when not taken and 4 cycles when taken
Loads take 3 cycles plus wait states
Stores take 1 cycle plus wait states

This is a fairly straightforward specification, and the ISS instruction execution functions can update some count state to keep track of cycle time. The exception is the memory access instructions, which also have a wait state element. This wait delay is a black box as far as the processor is concerned, as it is determined by the external modelling of the memory sub-system. Therefore, the memory callback function prototype specifies a return value which is the additional cycles added by the memory access, and the SoC model memory functionality will calculate this. Thus, the memory access instructions will add their base timing to the cycle count, and the callback return value is then added on top. From a processor point of view, then, we have cycle accurate behaviour. Of course, if the core has complex features, such as out of order execution or dynamic branch prediction, or is superscalar in architecture, accurate cycle counts become harder, as the model must take these factors into account.
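The cycle accounting described above might be sketched as follows. The base cycle values are those listed for the rv32 softcore, but the class, the method names and the callback signature are assumptions for illustration and are not the rv32 ISS's real interface.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical ISS fragment showing only the cycle accounting.
class timing_sketch {
public:
    // Memory callback: performs the access and returns the number of
    // wait-state cycles that the external memory model adds.
    using mem_callback_t =
        std::function<uint32_t(uint32_t addr, uint32_t& data, bool is_write)>;

    uint64_t cycle_count = 0;

    void register_mem_callback(mem_callback_t cb) { mem_cb = std::move(cb); }

    void exec_alu()               { cycle_count += 1; }
    void exec_jump()              { cycle_count += 4; }
    void exec_branch(bool taken)  { cycle_count += taken ? 4 : 1; }

    uint32_t exec_load(uint32_t addr) {
        uint32_t data = 0;
        cycle_count += 3;                               // base load timing
        if (mem_cb)
            cycle_count += mem_cb(addr, data, false);   // plus wait states
        return data;
    }

    void exec_store(uint32_t addr, uint32_t data) {
        cycle_count += 1;                               // base store timing
        if (mem_cb)
            cycle_count += mem_cb(addr, data, true);    // plus wait states
    }

private:
    mem_callback_t mem_cb;
};
```

The processor side knows only its own base timings; how many wait states a given address costs is entirely the business of whatever sits behind the callback.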

Debugging

With my software hat on, the first question I might ask when presented with a software model of a system to program is, how will I debug my code? Here, I’m talking about the code that is running on the model, rather than the code that is the model—which can be debugged using the normal tools and methods as it’s just an application running on the host machine. In fact, for the generic processor model, the code to be executed is just an extension of that model code, perhaps usefully separated from the model code, but compiled and linked with it. So the same techniques and tools can be used here and the issue is sorted.

For an ISS based model things are a little bit more complicated, but not much. Here the software running on the model (using the ISS) is probably cross-compiled for the particular processor, so the host tools can’t be used and one would need to use those supplied with the toolchain for the modelled processor architecture. However, taking gdb as a common debug tool, it has a remote mode where it can connect to a processor remotely, via a TCP/IP socket, and then send ‘machine’ versions of the common commands to load programs, set breakpoints, run code, inspect memory, step and continue the program etc. If the ISS model has some TCP/IP server code that can receive and decode the gdb machine commands, and then act appropriately, sending any required responses, a debugging session can be set up using the processor’s version of gdb. This has, in fact, been done for my RISC-V and LatticeMico32 ISS models, and the source code can be inspected to see how this is implemented. The LatticeMico32 ISS documentation has sections on the gdb interface, and an appendix on how to set up the Eclipse IDE with gdb so that a full IDE can be used for debugging. So, now we have a full debug environment and can debug code.
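To give a flavour of what the server code has to decode, gdb's remote serial protocol wraps each command as '$', the payload, '#', and a two hex digit checksum that is the modulo 256 sum of the payload characters. The sketch below frames and unframes such packets. The function names are my own for illustration, and it deliberately ignores the protocol's escape characters, run-length encoding and the '+'/'-' acknowledgements.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>

// Modulo 256 sum of the payload characters.
uint8_t rsp_checksum(const std::string& payload) {
    unsigned sum = 0;
    for (unsigned char c : payload)
        sum += c;
    return static_cast<uint8_t>(sum & 0xff);
}

// Wrap a payload as $<payload>#<two hex digit checksum>.
std::string rsp_frame(const std::string& payload) {
    char cs[3];
    std::snprintf(cs, sizeof(cs), "%02x", static_cast<unsigned>(rsp_checksum(payload)));
    return "$" + payload + "#" + cs;
}

// Extract the payload from a received packet. Returns false if the packet
// is malformed or the checksum does not match.
bool rsp_unframe(const std::string& packet, std::string& payload) {
    if (packet.size() < 4 || packet.front() != '$')
        return false;

    std::size_t hash = packet.size() - 3;   // '#' precedes the two checksum digits
    if (packet[hash] != '#')
        return false;

    if (!std::isxdigit(static_cast<unsigned char>(packet[hash + 1])) ||
        !std::isxdigit(static_cast<unsigned char>(packet[hash + 2])))
        return false;

    std::string   body = packet.substr(1, hash - 1);
    unsigned long rx   = std::strtoul(packet.substr(hash + 1).c_str(), nullptr, 16);

    if (rx != rsp_checksum(body))
        return false;

    payload = body;
    return true;
}
```

For example, the read registers command 'g' is sent as $g#67, and a successful reply of OK is framed as $OK#9a. An ISS debug server would sit in a loop reading such packets from the socket, dispatching on the first payload character, and framing its responses the same way.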

Multiple Processing Elements

Many embedded systems have multiple processor cores, and we may need to model this. With the generic processor model, we might ask, do we really need to model multiple cores as the code is just host application code? The main motivation here is that the system may be set up to have different functionality on each core, rather than just have them as a pool of processing resources to run processes and threads as allocated by the operating system. For this latter situation, maybe no further modelling is needed for multiple cores. Even if the embedded code is multi-threaded, these can just be threads running on the host. The only issue to solve is that accesses to the memory read and write API from multiple threads will need to be made in a thread safe way. This might mean wrapping the API calls with mutexes to make sure any access is completed atomically by each thread. For the case with each core performing different tasks, the source code is likely to be structured with this split, and so these could be run as separate host threads and the same methods used to ensure thread safe operation. Again, as for the timing models, the generic processor model will stray from an accurate and predictable flow of code when modelling multiple cores in this way, and it is mainly for software architecture accuracy, with the aim of being able, as much as possible, to compile and run the same code on the model as on the target platform.
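A minimal sketch of wrapping the memory API with a mutex is shown below. The class and method names are hypothetical and the backing store is just a map; the point is only that each call holds the lock for the whole of its access.

```cpp
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical thread safe wrapper around a generic-model memory API.
class locked_mem_api {
public:
    uint32_t read_mem(uint32_t addr) {
        std::lock_guard<std::mutex> lock(mtx);   // released on return
        auto it = mem.find(addr);
        return (it == mem.end()) ? 0 : it->second;
    }

    void write_mem(uint32_t addr, uint32_t data) {
        std::lock_guard<std::mutex> lock(mtx);
        mem[addr] = data;
    }

    // A read-modify-write done under a single lock, so that another
    // thread cannot slip in between the read and the write.
    uint32_t add_mem(uint32_t addr, uint32_t delta) {
        std::lock_guard<std::mutex> lock(mtx);
        uint32_t& cell = mem[addr];              // created as 0 if absent
        cell += delta;
        return cell;
    }

private:
    std::mutex                   mtx;
    std::map<uint32_t, uint32_t> mem;
};
```

Each modelled core's code can then run in its own host thread and share one such object. Note that separate read_mem and write_mem calls are individually atomic but not atomic as a pair, which is why a combined operation like add_mem is worth having where the embedded code relies on it.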

For the ISS based modelling we don’t need to rely on threads or mutexes and can maintain a single threaded application. How? It was implied, when discussing debugging, that the ISS model is able to be stepped one instruction at a time. The run loop code snippet shown earlier had a ‘while not halt’ as the main loop, where halt is doing a lot of heavy lifting. Actual real code will have a whole host of possible reasons to break out of the loop, allowing breaks on reaching a certain address in the instructions (a break point) or after a certain number of instructions have been run, such as 1 (a step). We can use this step feature to advance the processor externally instead of letting it free run. Now a run loop can have calls to step multiple ISS objects and step them in sequence. With access to the ISS objects’ concept of time, the execution order can be improved by, at each loop iteration, only stepping the ISS object that has the smallest cycle count. This, then, minimises the error in cycle counts between the processor models and keeps them synchronised. This is discussed in more detail in the LatticeMico32 documentation, under the Multi-processor System Modelling section, for those wanting to know more about this subject. This method can be used for any number of processor cores required to be modelled, and can even be used with different processor models, which is not an uncommon situation in some embedded systems.
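The scheduling idea can be sketched as below. The core_sketch structure is a hypothetical stand-in for an ISS object, with a fixed cost per instruction to keep the example short; a real ISS would update its cycle count from the instruction actually executed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an ISS object that can be single stepped.
struct core_sketch {
    uint64_t cycle_count = 0;
    uint64_t instr_count = 0;
    uint32_t cycles_per_instr;   // fixed cost, purely to keep the sketch simple

    explicit core_sketch(uint32_t cpi) : cycles_per_instr(cpi) {}

    // Execute one 'instruction' and advance this core's own time.
    void step() {
        cycle_count += cycles_per_instr;
        ++instr_count;
    }
};

// Run all cores, in a single thread, until each has reached end_cycle.
// At every iteration only the core furthest behind in time is stepped.
void run_cores(std::vector<core_sketch>& cores, uint64_t end_cycle) {
    if (cores.empty())
        return;

    while (true) {
        // Find the core with the smallest cycle count
        std::size_t next = 0;
        for (std::size_t i = 1; i < cores.size(); ++i)
            if (cores[i].cycle_count < cores[next].cycle_count)
                next = i;

        // If even the slowest core has reached the end time, we are done
        if (cores[next].cycle_count >= end_cycle)
            break;

        cores[next].step();
    }
}
```

With this ordering the cores never drift apart by more than the cost of one instruction on the slowest core, and nothing stops the vector holding objects for different processor types behind a common step interface.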

Conclusions

We have started our look at constructing software models of SoC systems by looking at ways to allow us to run embedded software on a ‘processing element’. This might be a virtual element where a specific processor isn’t modelled, or an instruction set simulator, where the target processor is fully modelled. In either case we present an API that does the basic processor external operations—read and write to memory space and get interrupted.

For the generic model, the API can be used directly for reads and writes, and strategies were discussed to model nested interrupts whilst maintaining single threaded code. Timing modelling with the generic processor model was shown to be limited in accuracy but may have some useful application with real-world based estimates.

The ISS modelling was broken down into mimicking the basic steps of a core. We looked at using a table hierarchy for instruction decoding, with the final table entry for a decoded instruction holding a pointer to the method that executes it. To de-couple the processor models from the rest of the modelling, the API was not used directly. Instead, callback functions are used for memory space accesses and for inspecting interrupt request state, and these can then use the API. This allows the modelling of the bus and interconnect to be external to the processor model, which we’ll discuss in the next article. Methods were also discussed for accurate timing models and debugging, as well as how to handle the modelling of multiple processor cores.

In this article, we have just focused on processor modelling, and we still have to look at the bus/interconnect, memory sub-system and all the various peripherals, as well as looking at interfacing to external programs to extend the model’s usefulness into such domains as co-simulation or to interface to other external models. I will cover these subjects in the article(s) to come.
