Demystifying Memory Sub-systems Part 2: Virtual Memory

Introduction

In this article we will look at virtual memory and the problems it is trying to solve in a multi-tasking environment. We will discuss dividing memory into 'pages' that are mapped to main memory, as well as the tables used to keep track of these mappings. Hardware support for virtual memory is also covered, in the form of memory management units (MMUs), and a real-world example of VM is given from the RISC-V specification, using Sv32 to show how an MMU could use it to do virtual address translation.

In the first article of this series, discussing caches, I gave a quick history lesson on the situation, pre-caches, in 1981, when the blossoming PC revolution was well underway. Let’s start by going back to this era, which was also pre-virtual memory.

These original PCs could only perform one task at a time; they were single-task machines. After reset the computer would boot and execute start-up operating system code until it was ready to accept user input to load some application. The OS might be in ROM (or loaded into RAM from disk or tape, via some BIOS in ROM), leaving some unreserved RAM for applications. The OS would load an application into this spare RAM and then jump to the start address of that application, which would then run. The OS is no longer running, in the sense that it cannot process any more user input to load and run another program. The application, of course, might make calls to OS service routines, but a single ‘thread’ of execution is maintained. When the application exits, this thread of execution returns to the OS, and new user input can be accepted to load and run other applications.

Given the speed of microprocessors and memory at the time, this was all that was possible on a PC. As CPUs became faster and larger memories became available, PCs started to have the capacity to run more than one application or task at a time. But there is a problem. The application software written for, say, DOS would have some minimum system memory requirements to run, and would expect that memory to be at a known, fixed location within the memory map. Each application would make the same assumptions about where the resources it could use would reside. Running more than one task at once would therefore cause clashes in the use of memory.

One solution is to compile the code to only use relative addresses: so-called position-independent code (PIC). The OS can load separate applications in differing parts of memory (assuming enough memory) and run the code from their loaded locations, swapping between the tasks to give CPU time to each. (This is context switching for multi-tasking, which is outside the scope of this article.) A version of Linux, called μClinux, uses just such a method, so there is no virtual memory. As an example, my LatticeMico32 instruction set simulator (mico32) can boot this version of Linux, which also uses μClibc and BusyBox for a multi-tasking environment and user shell. The diagram below, taken from the ISS manual, shows the fairly straightforward system required to boot this OS without a memory management unit and VM support. (The ISS does actually have a basic MMU model, but it is not used in this scenario.)

[Figure: mico32 ISS system configuration for booting μClinux without an MMU]

This approach, though quite useful, is limiting in that all the relevant code must reside in RAM at the same time, and RAM will quickly run out as more tasks need to be run together ‘concurrently’.

A better solution might involve fooling each task into having the illusion that it is the sole process running and has access to all the memory it would have had in the original situation of a single-task computer. That is, it has a ‘virtual’ view of memory. In reality, the task has its memory dynamically mapped to real physical memory, and each task's memory is mapped to different parts of actual memory, avoiding clashes. On top of this, in order to support multiple tasks whose total memory requirements might exceed the physical memory available, parts of a task’s memory might be removed from physical memory and stored on a slower storage device, such as a hard disk drive. As tasks are given CPU time by the OS, this data may be reloaded to physical memory (most likely to a different physical location from the last time it was loaded). This functionality was already available on mainframe computers of the time, and was quickly adopted in PCs and then embedded SoC systems. Since the application code has a virtual memory view as if it were the only task running, code from the pre-virtual memory era can run without modification.

This system of memory mapping is Virtual Memory, and in the rest of this article we will discuss how one might arrange virtual memory and the operation of hardware, such as a memory management unit (MMU), to support this virtual memory in a multi-tasking system. We will look at how memory is divided into pages, how the pages in virtual memory are mapped to physical memory, and how this mapping is tracked and managed efficiently.

Pages

In order to get a handle on managing virtual memory, the memory is divided into ‘pages’: fixed-size blocks of contiguous memory. A virtual memory system may support only a single page size, such as 4Kbytes, but some systems can support pages of a few different sizes. A 4Kbyte page size is common in operating systems such as Linux, for example. This is quite a small page size, chosen some time ago; modern computers could deal with larger pages (and some do), but 4Kbyte is still widespread.

A particular task or process will have a virtual memory space that is contiguous. This space is divided up into pages, each with a page number. Since this is virtual memory, this is the virtual page number (VPN). The page number is really the top bits of the virtual address of the start of the page. Each page is aligned to its size, so that a 4Kbyte page will have the bottom 12 bits equal to 0 at its start address. The page number is then the remaining address bits. For a 32-bit system, for example, this is bits 31 down to 12. The virtual pages will be mapped to physical pages. Real physical memory is also divided into pages in the same way, with their own physical page numbers (PPN). When a task requires memory for its code or data, the virtual pages will be allocated to physical pages, and tables must be kept to remember which physical page addresses are being referenced when the task accesses its virtual addresses. The diagram below summarises the situation of virtual to physical mapping.
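As a concrete illustration of how the page number and offset come out of an address, the short C sketch below (assuming 4Kbyte pages and 32-bit addresses, purely for illustration) splits a virtual address into its VPN and page offset:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12u                      /* 4Kbyte pages => 12 offset bits   */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)       /* 4096 bytes                       */
    #define PAGE_MASK  (PAGE_SIZE - 1u)         /* 0x00000FFF                       */

    int main(void)
    {
        uint32_t vaddr  = 0x78ffc084u;          /* an example virtual address       */

        uint32_t vpn    = vaddr >> PAGE_SHIFT;  /* virtual page number: bits 31..12 */
        uint32_t offset = vaddr &  PAGE_MASK;   /* byte offset within the page      */

        printf("VA 0x%08x -> VPN 0x%05x, offset 0x%03x\n", vaddr, vpn, offset);
        return 0;
    }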

[Figure: a process’s contiguous virtual buffer mapped onto non-contiguous physical pages, with one page swapped out to secondary storage]

The left of the diagram shows part of a process running requiring some contiguous buffer—a large array, say. The buffer is larger than the page size but is contiguous in virtual memory, with the virtual address ranging from the buffer’s base address (0x78ffc084 in the example) to the base plus buffer size (0x79000084). This address space will map over a set of contiguous virtual pages, though it needn’t necessarily be aligned with the start and/or end of a page. The OS will allocate physical pages for each of the virtual pages referenced. This is done as whole pages, even though the used virtual memory doesn’t use the start or end of the lowest and highest virtual pages’ memory. The physical pages could be allocated to any PPN in physical memory and need not be contiguous.

In order to access the data in physical memory, tables must be kept with the translations from virtual pages to physical pages. The page tables reside in physical memory and, of course, don’t have virtual page equivalents. As was mentioned before, the system can run out of physical pages to hold all the required data for every running process, and so some that have not been accessed for a while can be stored in secondary storage, as shown for one of the pages in the diagram. These are kept on swap partitions or page files, depending on the operating system. On Linux, for instance, there is a swap partition for storing these pages. The diagram below shows a Linux system’s partitions, with the swap partition highlighted.

[Figure: Linux system partitions, with the swap partition highlighted]

The pages that have been ‘swapped’ out to disk do not have a mapping to physical memory, and no entry in the page tables. If a swapped-out page is referenced, the lookup in the page table fails and the swap space is searched. If the page is found, then a resident page in memory is selected for eviction to swap space, the referenced page is loaded to that physical page, and the page tables are updated to map the virtual address of the loaded page to the selected physical page. Of course, if the virtual page is not in the swap space either, then an exception is raised. The details of how an OS does this swapping, and the format of a swap partition, are beyond the scope of this article, which is looking at VM from an MMU point of view. For those who want a deeper dive into OS VM functionality, I have found Mel Gorman’s book Understanding The Linux Virtual Memory Manager helpful, which is freely available online. Chapter 11 discusses swap management.

Note that there are separate page tables for each process that is running, since each process can reference the same virtual address as another process, mapped to different physical pages in memory. That clash, of course, is where we started and why we need virtual memory. Some systems (that I have worked on) combine the page tables for all processes, with additional information to uniquely identify the process with an ID (PID) or context number. A lookup then has to match both a page number and a context number/PID.

The logic unit that manages the translations from virtual to physical pages, along with some other functionality we will discuss, is the memory management unit (MMU).

Memory Management Unit

Hardware support for virtual memory comes in the form of a memory management unit. It is usually fairly autonomous with regard to translating between virtual and physical pages, which needs to be done at high speed in order for the system to perform efficiently. The operating system, however, would usually be looking after the physical page allocations and the context switches, and would configure the MMU accordingly. This might be as simple as pointing the MMU to a different set of page tables when there is a context switch. The MMU would access, for a given process, a set of translations kept in a page table in memory in order to do the translations. As we shall see, it might also keep a local cache of some translations to speed up this process. The MMU would also normally be responsible for memory access protection and privilege.

Fundamentally, then, in co-ordination with the operating system, the MMU does the following things:

  • Translates virtual addresses to physical addresses
  • Reads these translations from page tables in memory and keeps a cache of them
  • Enforces memory protection and monitors access privileges.

TLB

As mentioned above, page tables are kept in physical RAM, and each time the processor does a load or a store to virtual memory the MMU has to translate this to a physical load or store. If it had to access the page tables in main memory every time, then this access latency would be added to each and every load/store that the processor did. To alleviate this penalty, a cache of translations is kept in a translation lookaside buffer (TLB). This acts exactly like the caches for instructions or data that were discussed in the previous article, only now, instead of holding data or instructions, the cache holds virtual to physical page translations. The logic is the same as for a data cache, just with parameters that fit translations. For instance, an entry in the page table (a Page Table Entry or PTE) for a 32-bit system might be 4 bytes, so the TLB’s cache-line size would be 4 bytes. The set size might be smaller, and it may even be a fully associative cache but, in general, the function is the same as that for a data cache.
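To make the idea concrete, below is a minimal C sketch of a small, fully associative TLB. The structure and entry count are purely illustrative assumptions (a real TLB also holds permission bits, context identifiers and so on), and the miss path that would trigger a table walk is not shown:

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 16          /* small and fully associative, for illustration */

    typedef struct {
        bool     valid;
        uint32_t vpn;               /* virtual page number (acts as the tag)   */
        uint32_t ppn;               /* physical page number (the translation)  */
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_ENTRIES];

    /* Returns true on a hit, writing the translation to *ppn. A miss would
     * start a page table walk, and the fetched entry would replace a victim. */
    static bool tlb_lookup(uint32_t vpn, uint32_t *ppn)
    {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                *ppn = tlb[i].ppn;
                return true;
            }
        }
        return false;
    }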

Page Tables

The entries in the page tables (PTEs) in RAM could all be kept in a single flat table, but this would be a problem. Searching through a variable-sized table in which the entries were just randomly scattered would be too slow. Therefore, tables are organized so that the position in the table is a function of the virtual page number. This is similar to the index bits of an address in a set associative cache, as discussed in the previous article. A flat table using this method, though, would require quite large tables. Even in a 32-bit system that can address up to 4Gbytes of memory this requires, if using 4Kbyte pages, 2^20 (around a million) table entries to cover every possible virtual page, which at 4 bytes per entry is 4Mbytes of table. And this is for every process running. For a 64-bit system this rises to impractical values.

The solution is to have a hierarchy of page tables where the virtual page number is divided into sections, with the top bits used to index into a first table. If the PTEs are 4 bytes, then the top 10 bits might be used, and the table fits neatly into a 4Kbyte page. The PTE there, instead of being a translation, is actually a pointer to another table. The next (10) bits are used to index into that table, and so on, until a translation entry is found. The number of tables traversed depends on the architectural size. For example, a 32-bit system might have just two steps, whilst a 64-bit system might have 3 or even 4 steps. Traversing the tables like this is called a table walk. This could be done by the operating system, with the TLB updated by software but, more efficiently, it is done in logic with a table walk engine as part of the MMU functionality. The process of reading an entry and either continuing to another table or fetching the translation is something well suited to logic. This also gives a mechanism whereby different sized pages can be supported. If a table walk terminates early by finding a translation in a higher table, then that page is 2^(n × number-of-early-steps) times bigger than the fundamental page size, where n is the number of index bits per level. For example, with 4Kbyte pages and 10-bit page number sub-indexes, terminating one step before the final translation table gives a page size of 2^10 × 4Kbyte = 4Mbyte.
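As a worked check of that formula, this small C snippet (assuming 4Kbyte base pages and 10-bit indexes per level, as in the example above) prints the page size mapped by a leaf entry found at each level:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const uint32_t page_shift   = 12;  /* 4Kbyte base pages            */
        const uint32_t bits_per_lvl = 10;  /* 10-bit index at each level   */

        /* A leaf found 'early' levels above the final table maps a page of
         * base_page_size * 2^(bits_per_lvl * early) bytes.                 */
        for (uint32_t early = 0; early <= 1; early++) {
            uint64_t size = 1ull << (page_shift + bits_per_lvl * early);
            printf("terminating %u level(s) early maps %llu bytes\n",
                   early, (unsigned long long)size);
        }
        return 0;
    }

For the two-level case this prints 4096 bytes (a normal page) and 4194304 bytes (a 4Mbyte ‘super page’).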

Real World Example, RISC-V

Having described the virtual memory functions of an MMU, let’s look at a real-world example to bring the theory together.

Volume 2 of the RISC-V specification (see sections 4.3 to 4.5) defines page-based support for virtual memory for address spaces of various sizes. It defines three schemes, as listed below:

  • Sv32 maps 32-bit virtual addresses to 34-bit physical addresses: 4Gbyte virtual space, 16Gbyte physical space, two-level page tables
  • Sv39 maps 39-bit virtual addresses to 56-bit physical addresses: 512Gbyte virtual space, 64Pbyte physical space, three-level page tables
  • Sv48 maps 48-bit virtual addresses to 56-bit physical addresses: 256Tbyte virtual space, 64Pbyte physical space, four-level page tables

We will concentrate on Sv32 as an example, as the diagrams are less messy, but I hope it will be clear how this would scale to the larger address spaces. The diagram below shows the format of the virtual and physical addresses, along with the format of a page table entry:

[Figure: Sv32 virtual address, physical address and page table entry (PTE) formats]

Since Sv32 is for a 32-bit system, the virtual address is 32 bits. The pages are 4Kbytes, and so the lower 12 bits of the address are an offset into a page. The next 20 bits are the virtual page number, but these are divided into two 10-bit values as shown. The physical address format is bigger, with the same offset bits for a 4Kbyte page, but 22 bits for the physical page number. For a 32-bit architecture, only 4Gbytes of virtual space can be accessed by an individual process. However, since there are multiple processes, these can be mapped to a larger physical memory space, in this case a 16Gbyte space. Thus, a system can choose to add up to 16Gbytes of memory, even though the architecture can only address 4Gbytes at a time.
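A short C sketch of the Sv32 field extraction follows, with the bit positions taken from the format above (the example values are, of course, arbitrary):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t va = 0x78ffc084u;                 /* an Sv32 virtual address       */

        uint32_t vpn1   = (va >> 22) & 0x3FFu;     /* VPN[1]: bits 31..22 (10 bits) */
        uint32_t vpn0   = (va >> 12) & 0x3FFu;     /* VPN[0]: bits 21..12 (10 bits) */
        uint32_t offset =  va        & 0xFFFu;     /* page offset: bits 11..0       */

        /* The 34-bit physical address is the 22-bit PPN (from the PTE found by
         * the table walk) concatenated with the 12-bit page offset.              */
        uint64_t ppn = 0x123456u;                  /* an arbitrary 22-bit PPN       */
        uint64_t pa  = (ppn << 12) | offset;

        printf("VPN[1]=0x%03x VPN[0]=0x%03x offset=0x%03x PA=0x%09llx\n",
               vpn1, vpn0, offset, (unsigned long long)pa);
        return 0;
    }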

The page table entry (PTE) consists, firstly, of the 22 bits of a physical page number. (Note that bits 33 down to 12 of the physical address are mapped to bits 31 down to 10 of the PTE.) A couple of bits are reserved, and then some control and status bits are defined for privilege and permission control, some TLB cache status and other status bits for use by the system.

Firstly, the valid bit just marks the PTE as containing a translation. This is used both for the table entry in RAM and when in the TLB. The next three bits define the access permissions and hierarchy of the table entry, i.e. whether it points to another table or not. The diagram shows what each combination means, with greyed-out entries reserved. The values are mostly combinations of read, write and/or execute permissions. If all three bits are 0, however, then the entry is a pointer to the next table in a table hierarchy, and the PPN is the physical page number where that table is located.

The user bit indicates that the page has permissions for access in user privilege mode (the lowest privilege for RISC-V). The global bit is used if the system supports multiple address spaces, and the page is mapped into all the spaces. The accessed and dirty bits are used much as for the caches described in the previous article. The accessed bit is used when choosing which pages to swap out when a new page needs to be loaded. This could be from the TLB to RAM, or from RAM to the swap space on disk. Similarly, the dirty bit indicates that the page has been written since it was last written back to the layer below, thus indicating whether or not it needs to be written back if swapped out.
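The flag bits described above can be summarised with a few C definitions, a minimal sketch following the Sv32 PTE layout (bit positions as in the format diagram):

    #include <stdint.h>
    #include <stdbool.h>

    /* Sv32 PTE flag bits, at the bottom of the 32-bit entry. */
    #define PTE_V (1u << 0)   /* valid                        */
    #define PTE_R (1u << 1)   /* readable                     */
    #define PTE_W (1u << 2)   /* writable                     */
    #define PTE_X (1u << 3)   /* executable                   */
    #define PTE_U (1u << 4)   /* accessible in user mode      */
    #define PTE_G (1u << 5)   /* global mapping               */
    #define PTE_A (1u << 6)   /* accessed                     */
    #define PTE_D (1u << 7)   /* dirty                        */

    /* A valid entry with R = W = X = 0 is a pointer to the next level table. */
    static bool pte_is_pointer(uint32_t pte)
    {
        return (pte & PTE_V) && !(pte & (PTE_R | PTE_W | PTE_X));
    }

    /* The 22-bit PPN occupies PTE bits 31 down to 10. */
    static uint32_t pte_ppn(uint32_t pte)
    {
        return pte >> 10;
    }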

Having defined the format of the various addresses and page table entries, let’s look at how this all works together. The diagram below shows how an Sv32 two-level lookup might work.

[Figure: Sv32 two-level lookup, showing the TLB and the table walk through the page tables in memory]

When the processor does a load or store, the VPN bits from the virtual address are presented to the TLB. If the translation is in the TLB (a hit), then the physical page number (PPN) is returned. The physical address to memory is then the PPN concatenated with the 12 bits of offset. The data at this physical address might be in the data (or instruction) cache, or the cache system may have to retrieve this from main memory, just as described in the previous article.

If the entry is not in the TLB (a miss), then a table walk must be started to look up the translation in RAM. In a RISC-V system the root table’s page number for the running process is stored in a supervisor CSR called satp (Supervisor Address Translation and Protection). The equivalent in an x86 system is the CR3 register. This satp register will be updated at each context switch to point to the tables for the process being restarted. The top bits of the virtual address are used to index into the first table. The entry there will have the XWR bits set to 0 (unless it is a ‘super page’), and the PPN in the entry points to the next table’s physical page, where the actual translation resides. The lower VPN bits then index into that page to get the PPN for the access. This would be loaded into the TLB, possibly evicting a resident entry first, and the process continues as if for a hit. It is at this point that the access permissions are checked, and an exception/trap/interrupt raised if the checks fail, so that the operating system can intervene as appropriate. Needless to say, the access is not completed in this circumstance.
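The walk just described can be modelled in software with something like the simplified sketch below. The read_phys32() helper is hypothetical (standing in for a physical memory read in, say, an ISS model), and super pages, A/D bit updates and the full permission checks are left out for brevity:

    #include <stdint.h>
    #include <stdbool.h>

    #define PTE_V (1u << 0)
    #define PTE_R (1u << 1)
    #define PTE_W (1u << 2)
    #define PTE_X (1u << 3)

    /* Hypothetical helper: read a 32-bit PTE from physical memory. */
    extern uint32_t read_phys32(uint64_t paddr);

    /* Simplified Sv32 walk: returns true and the 34-bit physical address on
     * success, false on a fault. */
    static bool sv32_translate(uint32_t satp_ppn, uint32_t va, uint64_t *pa)
    {
        uint32_t vpn[2] = { (va >> 12) & 0x3FFu, (va >> 22) & 0x3FFu };
        uint64_t table  = (uint64_t)satp_ppn << 12;        /* root table from satp  */

        for (int level = 1; level >= 0; level--) {
            uint32_t pte = read_phys32(table + vpn[level] * 4u);

            if (!(pte & PTE_V))
                return false;                              /* page fault            */

            if (pte & (PTE_R | PTE_W | PTE_X)) {           /* leaf entry found      */
                *pa = ((uint64_t)(pte >> 10) << 12) | (va & 0xFFFu);
                return true;
            }
            table = (uint64_t)(pte >> 10) << 12;           /* pointer to next table */
        }
        return false;                                      /* no leaf at last level */
    }

On a successful walk the returned PPN would be loaded into the TLB so that subsequent accesses to the same page hit without a walk.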

So that’s how, in general, an MMU works, using real-world formats, and there’s nothing more to discuss, right? Well, there is one more thing, for which we need to go back to the cache and work out how the virtual memory, the MMU and the data/instruction caches interact.

Virtual Memory and Caches

Until now we have looked at virtual memory independently of a memory sub-system’s caches. When, in the previous article, we looked at caches, it was assumed that all addresses were physical addresses. Looking at both together, the most obvious way to combine the two would be to first translate virtual to physical addresses and then send the physical address accesses through the cache. This is a valid arrangement, and is known as a ‘physical index, physical tag’ (PIPT) architecture. The diagram below shows this arrangement.

[Figure: physical index, physical tag (PIPT) cache arrangement]

The disadvantage of this arrangement is that (assuming the VA hits in the TLB) the TLB must be looked up first, and only then the cache inspected for a hit on the physical tag. The two lookup latencies are added.

A solution might be to work on the virtual addresses instead. This arrangement is known as ‘virtual index, virtual tag’ or VIVT. The diagram below shows this architecture.

[Figure: virtual index, virtual tag (VIVT) cache arrangement]

This arrangement avoids the TLB latency if the data is in the cache, as the hit cache entry is a virtual address and physical memory references are not needed. However, it has a couple of problems. Firstly, the TLB entry must still be looked up for checking the permission bits. Secondly, the cache would need flushing each time there was a context switch, as processes can use the same virtual addresses and there would be a clash if a matching address from a different process remained in the cache. PIDs/context numbers as part of the tag might alleviate that somewhat, but that is more complicated and, whilst a given process is running, the cache entries for a different process will definitely not be hit, wasting cache space for the running process.

A compromise solution is where the virtual address is used for the cache index, but the physical address is used for the tag. This is known as ‘virtual index, physical tag’ (VIPT). The diagram below shows this arrangement.

[Figure: virtual index, physical tag (VIPT) cache arrangement]

In this arrangement both the cache and TLB lookups are done in parallel. Since the physical address’s tag was stored in the cache tag RAM, the physical address returned from the TLB can be compared with the physical address from the cache tag lookup and a hit/miss flagged. This architecture then reduces the latency to just that of the longer of the TLB and cache lookups which, hopefully, are comparable.
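The hit check of a VIPT cache might be sketched as follows. The sizes are illustrative assumptions only (64-byte lines and 64 sets, so the index sits entirely within the 12-bit page offset), and the fill and miss handling are not shown:

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_SHIFT  6u                 /* 64-byte cache lines (illustrative)      */
    #define INDEX_BITS  6u                 /* 64 sets: index fits in the page offset  */
    #define INDEX_MASK  ((1u << INDEX_BITS) - 1u)

    typedef struct {
        bool     valid;
        uint32_t ptag;                     /* physical tag stored when the line filled */
    } cache_line_t;

    static cache_line_t cache[1u << INDEX_BITS];

    /* The cache index comes straight from the virtual address, so the tag RAM can
     * be read while the TLB translates; the TLB's PPN is then used for the compare. */
    static bool vipt_hit(uint32_t vaddr, uint32_t ppn_from_tlb)
    {
        uint32_t index = (vaddr >> LINE_SHIFT) & INDEX_MASK;   /* virtual index */

        return cache[index].valid && cache[index].ptag == ppn_from_tlb;
    }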

Conclusions

In this article we have looked at virtual memory and what problems it is trying to solve for a multi-tasking environment. A virtual view of memory is given to each task/process/program so that the software does not need to be aware of the resource clashes caused by running lots of different code concurrently.

Virtual memory support is implemented by dividing the virtual and physical address spaces into pages, with the virtual pages then mapped to physical pages dynamically. Page tables are resident in memory to keep track of these mappings, with a set of page tables for each running process. The logic to do the table lookups and virtual to physical address translations is implemented in an MMU, which also checks for access permissions. The MMU might keep a cache of translations in a TLB in order to speed up the process, just as a data or instruction cache speeds up main memory accesses.

We looked at a real-world example of virtual memory formats for addresses and page tables to consolidate the theory, using the RISC-V specification. We also had to discuss how the data/instruction caches interact with the virtual memory functionality in order to get an efficient system.

Over these two articles I’ve summarised the main aspects of a typical memory sub-system, and it might seem that every time a feature was added to solve a particular problem, a new one arose which needed solving. For example, coherency problems caused by caches in a distributed multi-processor environment, or the positioning of VM translations with respect to those caches to avoid unnecessary latencies. A processor-based environment, as we saw historically, would function without these systems, but not very efficiently. It is often the case that engineering is about optimising systems, with added complexity, to get the most from their capabilities. Sometimes a solution causes a new issue which must be solved but, over time, solutions are found to all of these to get a system with the best performance.

If you never need to work directly on, or with, memory sub-systems, I hope this and the previous article will act as a case study on how systems need to be refined in engineering to get a better solution, even if that adds complexity. If you work within a processor-based system (as is likely), an awareness of the memory sub-system functionality is useful when designing within that system (in either the logic or software realms), and I hope these summaries are useful to that end. I have, in the past, worked more directly on designs that implement MMU functionality but, even in completely different environments where the design space was outside of the processor systems and sub-systems, I have had to design DMA functionality that can traverse buffers that are contiguous in virtual memory but distributed in physical memory, and solve issues caused by cache coherency in order to make it function correctly. This would not have been possible without a knowledge of the functionality discussed in these articles.
