By the end of this lesson, you should be able to:
Explain what virtual memory is and why modern computer systems rely on it.
Describe paged address translation, single- and multi-level page tables, and the role of the translation look-aside buffer (TLB).
Trace how the operating system handles page faults and uses demand paging and swapping.
Compare common page replacement strategies such as FIFO, LRU, CLOCK, and OPT.
Virtual memory is a fundamental concept in modern computer systems that allows the processes running within a computer's operating system (and the operating system itself) to use more memory than the computer physically possesses. It accomplishes this by using a combination of hardware and software techniques to transparently manage the allocation of data and programs between a computer's physical memory (RAM) and secondary storage, typically a hard disk drive or solid-state drive. The implementation of virtual memory (VM) requires hardware support from the computer's CPU and its memory management unit (MMU); thus, virtual memory is not necessarily available on all computer systems, particularly those using simple (and cheap) processors such as the Z80, 8086, or 68000 family of processors that are still common in embedded and industrial control systems.
The Motivation Behind Virtual Memory: Virtual memory was born out of the need to overcome the limitations imposed by the physical constraints of early computer systems. In the early days of computing, physical memory (RAM) was expensive and limited in size. As a result, running large or multiple programs concurrently was a significant challenge. To address this issue, computer scientists and engineers developed virtual memory as a solution.
Historical Example: Consider the case of early personal computers like the IBM PC, which typically shipped with at most a few hundred kilobytes of RAM. Running complex software, such as word processors or graphical applications, required significantly more memory than was available. Virtual memory emerged as a way to give these systems the illusion of having more memory than they physically possessed.
Virtual memory remains a cornerstone of modern computer systems for several important reasons:
Increased Addressable Memory Space: Virtual memory allows applications to access a much larger addressable memory space than the physical RAM available. This is vital for running large applications or handling vast datasets. In modern 64-bit operating systems, applications can theoretically address up to 18.4 million terabytes (TB) of virtual memory (2^64 bytes), providing ample room for complex computational, data processing, machine learning, and database tasks.
Process Isolation and Protection: Virtual memory provides a robust mechanism for isolating and protecting processes from each other. Each process operates in its own virtual address space, preventing unintended interference and enhancing security. If one application crashes due to a memory error, it doesn’t affect other running applications, enhancing system stability.
Simplified Memory Management: Virtual memory simplifies memory management for both the operating system and application developers. Programs can assume they have access to a vast, contiguous address space without worrying about the physical memory’s constraints. Developers can allocate memory dynamically as needed, making it easier to write flexible and efficient software.
Improved Multitasking: Virtual memory enables efficient multitasking, allowing multiple applications to run concurrently without requiring each to fit entirely into physical RAM. A modern operating system can seamlessly switch between running processes, making it appear as though they are all running simultaneously, even if the physical memory is limited.
In short, virtual memory is not just a concept but a fundamental architectural feature that plays a pivotal role in the efficient functioning of modern computer systems. It has revolutionized the way we utilize memory resources and has become indispensable for supporting today’s complex and resource-intensive computing tasks.
Before diving into the details of virtual memory, paging, address translation, and page tables, let's review how operating systems work with the CPU to run programs. While the specific mechanisms differ between operating systems, the essentials are the same. In the examples below we use the mechanisms most commonly used in Unix-derived operating systems such as Linux and macOS.
Let's start at the beginning and briefly review how an operating system boots, using Linux as an example. When Linux boots, a series of steps occur to initialize and start running the operating system. Below are the key steps of the Linux boot process:
Firmware: The BIOS or UEFI firmware performs a power-on self-test and loads the boot loader from the boot device.
Boot loader: The boot loader (e.g., GRUB) loads the Linux kernel (and an initial RAM disk) into memory and transfers control to it.
Kernel initialization: The kernel initializes the hardware, memory management, and its core data structures, then mounts the root file system.
First process: The kernel starts the first user-space process, init (on most modern distributions, systemd), as process ID 1.
Services and login: init/systemd starts system services and login sessions.
Now the system is ready to run programs and applications within processes.
In Linux, processes are launched using a variety of methods, with the primary method being the execution of a program or command from a terminal window or a command shell (such as bash or zsh). When we say process, we mean a running instance of any program or executable. Even running a simple command like "ls" means that a process is launched and the program's code is loaded from disk and executed as the instructions of the process. No user program code can run without a process.
Here’s a more complete overview of how a program is run with a process and how that process is often launched in Linux:
Command execution: Typing the name of a command such as ls, firefox, or gcc in a terminal will initiate a process associated with that command.
Path search: The shell locates the program's executable by searching the directories listed in the PATH environment variable.
Explicit paths: A program can also be launched by giving its full or relative path, bypassing PATH. For example, /usr/bin/ls or /home/user/myapp.
Background execution: A process can be run in the background by appending & to the command, like command &.
Pipes and redirection: Shells provide operators such as | (pipe), > (redirect output), and < (redirect input) to manipulate how processes communicate and handle data.
Shell scripts: A script can be run with the ./scriptname.sh command, provided the script has been made executable (e.g., with chmod +x scriptname.sh) and contains the appropriate shebang line (e.g., #!/bin/bash) at the beginning.
Process creation: Under the hood, the shell calls fork() followed by a form of exec() to create a new process that is a copy of the parent process and then load new code into that "forked" child process (see the sketch after this list).
Scheduled execution: Processes can also be launched automatically at specified times or intervals using cron or systemd timers.
Process management: Once launched, processes can be monitored and controlled with commands such as ps (process status), kill (terminate a process), nice (adjust process priority), and renice (change process priority).
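To make the fork()/exec() step concrete, here is a minimal sketch, assuming a Unix-like system with the standard C library available, of how a shell might launch the ls command in a child process (error handling is mostly omitted):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();                     /* duplicate the calling (parent) process */
    if (pid == 0) {
        /* child: replace the copied program with new code */
        execlp("ls", "ls", "-l", (char *)NULL);
        perror("execlp");                   /* reached only if exec fails */
        _exit(1);
    }
    /* parent: wait for the child to finish, as an interactive shell would */
    waitpid(pid, NULL, 0);
    printf("child %d finished\n", (int)pid);
    return 0;
}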
The operating system tracks various pieces of information internally for every process to manage and control its execution. This information is stored in a data structure known as the "Process Control Block" (PCB) or "Task Control Block" (TCB). Below is an overview of the key information that the operating system maintains for each process:
Process ID (PID): A unique identifier assigned to each process, allowing the OS to distinguish between different processes.
Process State: Indicates the current state of the process, such as “Running,” “Ready,” “Blocked,” or “Terminated.” The state helps the OS manage process scheduling and resource allocation.
Program Counter (PC): A register that holds the memory address of the next instruction to be executed by the process. If the operating system supports virtual memory, then the PC always references virtual addresses.
CPU Registers: The values stored in CPU registers that belong to the process, including general-purpose registers, stack pointers, and others. Any addresses held by a register, such as the stack pointer, will always be virtual addresses in a virtual memory environment.
Priority and Scheduling Information: Information related to the process’s priority or scheduling priority, which affects its position in the scheduling queue.
Memory Management Information: Details about the process's address space, such as pointers to its page tables (or base and bounds registers on simpler systems) and its memory map.
File Descriptor Table: A table that tracks the files and I/O devices the process has open. Each entry in the table contains information such as file pointers and access permissions.
Resource Information: Information about the resources allocated to the process, including allocated memory, open files, and other system resources.
Parent Process ID (PPID): The PID of the parent process that created this process. Useful for process hierarchy management.
Child Process List: A list of PIDs for any child processes spawned by the current process.
Signal Handlers: Information about how the process handles signals (interrupts) sent by the operating system or other processes.
Accounting and Statistics: Data related to the process’s resource usage, execution time, and other performance metrics.
Environment Variables: A list of environment variables that affect the process’s execution environment.
Working Directory: The current directory associated with the process. This directory is used as the default location for file operations.
User and Group ID: The user and group identifiers associated with the process, used for access control and security purposes.
Inter-Process Communication (IPC) Information: If the process communicates with other processes, information about IPC mechanisms like pipes, sockets, and message queues may be stored.
Stack and Heap Information: Details about the process’s stack and heap memory areas, including their current size and limits.
Exit Status: The exit status or return code of the process, indicating the success or failure of its execution.
Terminal and Session Information: For processes associated with terminal sessions, information about the controlling terminal and session ID.
Security Context: Security-related information, including privileges and access control attributes.
These pieces of information are essential for the operating system to manage processes effectively, including process creation, memory management, scheduling, termination, resource allocation, and communication between processes. The PCB or TCB serves as a data structure (held in memory) that the operating system uses to keep track of and manage these details for each process in the system.
In a computer, physical addresses correspond to real signals going over the motherboard to DIMM sockets, selecting specific locations on memory chips. Virtual addresses, on the other hand, are the addresses used by the CPU to access both program code and data. Remember that virtual addresses are translated to physical addresses by the memory management unit (MMU).
Changing these translations can be used to create multiple virtual address spaces, each one corresponding to a separate translation setting. However, there is still only one physical address space – the mapping from physical addresses (represented by electrical signals) to memory locations in DRAM chips.
A simple mechanism is using base and bounds registers to translate a standard per-process virtual address space (i.e., 0..sizeof(process)) into a different location in physical memory, as illustrated here.
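As a rough illustration, here is a minimal sketch of base-and-bounds translation in C; the register values, and the use of a software function in place of MMU hardware, are purely illustrative:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical base and bounds registers (illustrative values). */
static uint32_t base_reg   = 0x00400000;   /* where the process begins in physical memory */
static uint32_t bounds_reg = 0x00100000;   /* size of the process's address space (1 MB) */

/* Translate a per-process virtual address (0..bounds) to a physical address. */
uint32_t translate(uint32_t va) {
    if (va >= bounds_reg) {                /* out-of-range access: the MMU raises a fault */
        fprintf(stderr, "fault: VA 0x%08x out of bounds\n", va);
        exit(1);
    }
    return base_reg + va;                  /* relocate by the base register */
}

int main(void) {
    printf("VA 0x00000100 -> PA 0x%08x\n", translate(0x100));
    return 0;
}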
The series of videos by Dr. Black-Schaffer of Uppsala University provides an overview of many of the concepts we will address in this lesson and watching it before reading the remainder may help set context. Note that his discussion of virtual memory uses the MIPS rather than Intel architecture. While there are subtle differences, it is interesting how similar they are.
This lesson examines one approach to address translation in detail: the one used by the original Intel Pentium CPU, and by any Intel-compatible CPU (such as those made by AMD) running in 32-bit mode. It uses 32-bit virtual addresses, 32-bit physical addresses, and a page size of 4K (2^12) bytes. Since pages are 2^12 bytes each, addresses can be divided into 20-bit page numbers and 12-bit offsets within each page, as shown here.
The same principles apply to any modern CPU, but keeping to a 32-bit CPU is simpler for explanation, although modern CPUs often have an address space greater than 32 bits. A page size of 4096 bytes is somewhat arbitrary, but it strikes a good balance: smaller pages require larger page tables and can cause more movement of memory and more page faults, while larger pages can lead to more internal fragmentation and wasted memory.
Note that pages are always aligned, i.e., they start with offset 0 and go up through offset 4095. A page can’t start somewhere in the middle, like at address 100, in the same way that the 20th century starts at 1900, not at some arbitrary year between 1900 and 1999.
The role of the computer’s memory management unit (MMU) in this system is to map a 20-bit virtual page number to a 20-bit physical page number; the offset can pass through unchanged, as shown here.
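The split and recombination can be expressed in a few lines of C; the virtual address and the physical page number below are arbitrary examples, since the real mapping is chosen by the operating system:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t va  = 0x12345678;          /* an arbitrary 32-bit virtual address */
    uint32_t vpn = va >> 12;            /* top 20 bits: virtual page number (0x12345) */
    uint32_t off = va & 0xFFF;          /* bottom 12 bits: offset within the page (0x678) */

    uint32_t ppn = 0x00042;             /* hypothetical physical page chosen for this VPN */
    uint32_t pa  = (ppn << 12) | off;   /* the offset passes through unchanged */
    printf("VA 0x%08x -> VPN 0x%05x + offset 0x%03x -> PA 0x%08x\n", va, vpn, off, pa);
    return 0;
}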
Although paged address translation is far more flexible than base and bounds registers, there’s one disadvantage: it requires a great deal more information. Base and bounds translation only requires two values, which can easily be held in registers in the MMU. In contrast, paged translation must be able to handle a separate mapping value (to either a physical page or an invalid indication) for each of over a million virtual pages. This results in single-level page tables being simply too large and not feasible in practice, as we will see below.
The only possible place to store the amount of information required by paged address translation is in memory itself, so the MMU uses page tables in memory to specify virtual-to-physical mappings.
One of the simplest ways to structure a page table for mapping 20-bit page numbers is as a simple array with 2^20 entries. With this configuration, each virtual page has an entry, and the value in that entry is the corresponding physical page number. This single-level table is located in physical memory, and the MMU needs a pointer to this table, which is stored in an MMU register. (On Intel-compatible CPUs, this register is Control Register 3, or CR3; this abbreviation is used to refer to the page table root pointer from now on.)
Although address translation is performed in hardware, pseudo-code is used here to describe the algorithm the MMU would use for a single-level page table. VA and PA stand for virtual and physical addresses, and VPN and PPN represent virtual and physical page numbers.
PA = translate(VA):
VPN, offset = split[20 bits, 12 bits] (VA)
PTE = physical_read(CR3 + VPN * sizeof(PTE), sizeof(PTE))
if not PTE.present:
fault
return (PTE.PPN << 12) + offset
Note that this means that every memory operation performed by the CPU now requires two physical memory operations: one to translate the virtual address, and a second one to perform the actual operation. If this seems inefficient, it is. You’ll learn how to improve this inefficiency in a bit.
Although the page table solves one problem, it causes another: it uses 4MB of memory per map (2^20 entries × 4 bytes per entry).
With 4MB page tables and one table per process, this would require 2.5GB of memory just for page tables if you have 640 processes on a shared server, on a machine with 4GB total, assuming a 32-bit address space. Worse yet, each table would require a contiguous 4MB region of memory, running into the same problem (external fragmentation) that paged address translation tried to solve. And imagine if the address space were 64-bits.
To solve this problem, almost all 32-bit processors (e.g., Intel-compatible, ARM) use a 2-level page table, structured as a tree. Here’s an example of what one looks like.
The left-most (top) ten bits of the virtual page number index into the top-level table (sometimes called the page directory), which holds a pointer (i.e., the memory address) to a second-level page table. The bottom ten bits of the virtual page number are used as an index into this second-level table, giving the location where the actual physical address will be found. At first glance, it appears that this structure takes just as much space as a single-level table. Doesn’t that mean the same problem still exists?
It turns out that to map a full 4GB of memory, you’ll still need 4MB for page tables. But, if a process only needs a small amount of memory, then most of the entries in the top-level directory will be empty (shown here as P=0), and the page table will only require a small amount of memory. Plus we can use any set of 4KB pages for the table, instead of needing a contiguous 4MB block.
Here is an example of address translation using a 2-level table that uses 3 pages: physical pages 00000 (the root directory), 00001, and 00003. Two data pages are mapped: 00002 and 00004. Any entries not shown are assumed to be null, i.e., the present bit is set to 0.
In the short tutorial video below, Dr. Maria Jump of (formerly) Khoury Boston demonstrates address translation by way of examples.
The 2-level address translation process you just learned about is even less efficient than the single-level table, requiring two memory accesses by the MMU (one per level) to retrieve page table entries for every CPU memory access. Even if MMU accesses to memory can be satisfied from the L1 cache, this will still slow down the CPU by a factor of three or more. To reduce this inefficiency, a special-purpose cache called the translation look-aside buffer (TLB) is introduced. Instead of holding memory values, like the L1 and L2 caches, the TLB holds virtual page number to physical page number mappings. It's typically very small, holding from 64 mappings (on various Intel Atom CPUs) to about 1500 mappings (the more expensive Xeon CPUs). This is in part because it has to be very fast - these translations are needed before the CPU can look in its cache for a value.
A famous computer science quote attributed to David Wheeler is: “All problems in computer science can be solved by another level of indirection,” to which many people add “except indirection.” A corollary to this is that most performance problems can be solved by adding a layer of caching. How are these quotes applicable to paged address translation?
Using the TLB, the translation process now looks like this. The following acronyms are used: VPN is virtual page number, and PPN is physical page number. Since each entry is 4 bytes, we can calculate the address of entry n by multiplying n times 4.
translate VA -> PA:
(VPN, offset) = split([20,12], VA)
if VPN is in TLB:
return (TLB[VPN] << 12) + offset
(top10, bottom10) = split([10,10], VPN)
PT = phys_read(CR3 + top10 * 4)
PPN = phys_read(PT + bottom10 * 4)
add (VPN -> PPN) to TLB, evicting another entry
return (PPN << 12) + offset
How well does this perform? If all of the code and data of a process fit into 640 pages (about 2.5MB) on a mid-range machine, all translations will come out of the TLB and there will be no additional overhead for address translation. If the working set, or amount of memory in active use, is larger than this, the page table accesses will cause performance to decrease. However, typically the vast majority of accesses are to pages in the TLB (e.g., currently running code, stack, etc.) and only a small fraction of accesses require going to the page tables in memory for translation.
Like any other cache, a TLB only functions correctly if it is consistent, i.e., the entries in the TLB accurately reflect the in-memory page tables which they are caching. There are two primary sources of inconsistency in TLB entries:
Modifications: Sometimes the OS must modify the address space of a running program. An example is demand paging (covered in the next module), where the OS maps in new pages, and un-maps others when they are paged out. When changing the page table in memory, the OS must ensure that the TLB is not caching a copy of the old entry.
Context switches: To provide both a consistent address space layout as well as memory protection, the OS will typically provide multiple processes with different mappings for the same virtual page; clearly it’s important to make sure that the correct mapping is used according to which process is running.
The issue of modifications can be solved in a fairly straightforward way: the MMU provides one instruction to flush the TLB entry (if there is one) for a particular page, and another to flush the entire TLB. When entries are flushed from the TLB, there is almost always a performance impact, because of the extra memory accesses needed to reload those entries the next time they are required. In the case where individual entries are being flushed the overhead is not that significant, because (a) the OS is already spending a lot of time modifying the page table, and (b) it doesn’t do this very often, anyway.
However, the issue with context switches is harder to solve. The performance overhead can be ignored and the entire TLB can be flushed on every context switch (this is done on most Intel-compatible CPUs). With a 500-entry TLB and a 4-level page table, this results in throwing away 2000 memory accesses worth of work on each context switch. Another solution is to add an MMU register identifying the current process (an Address Space ID, or ASID), and tag each TLB entry with the ASID of the process it is valid in. If a process is interrupted for a short time, most of its TLB entries will remain cached, while the ASID field will prevent them from being mistakenly used in another process.
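The following is a small sketch of what an ASID-tagged TLB entry might look like; the field widths, and the idea of checking entries in software, are illustrative only, since real TLBs do this lookup in hardware:

#include <stdbool.h>
#include <stdint.h>

struct tlb_entry {
    uint32_t vpn;      /* virtual page number */
    uint32_t ppn;      /* physical page number it maps to */
    uint16_t asid;     /* address space (process) this translation belongs to */
    bool     valid;
};

/* A hit requires both the VPN and the ASID of the currently running process to match,
 * so stale entries left by other processes are never used but need not be flushed. */
static bool tlb_hit(const struct tlb_entry *e, uint32_t vpn, uint16_t current_asid) {
    return e->valid && e->vpn == vpn && e->asid == current_asid;
}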
The short tutorial by Dr. Maria Jump (formerly of Khoury Boston) shows what the 32-bit Intel page table entry (PTE) looks like:
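As a rough guide to the layout discussed in the video, here is a sketch of selected 32-bit Intel PTE fields expressed as C bit masks; only the bits used in this lesson are shown:

#include <stdint.h>

#define PTE_P   (1u << 0)    /* present: the translation is valid */
#define PTE_RW  (1u << 1)    /* writable (0 = read-only) */
#define PTE_US  (1u << 2)    /* user-accessible (0 = kernel only) */
#define PTE_A   (1u << 5)    /* accessed: set by the MMU on any access */
#define PTE_D   (1u << 6)    /* dirty: set by the MMU on a write */

/* Bits 12-31 hold the physical page number; bits 0-11 hold flags. */
static inline uint32_t pte_ppn(uint32_t pte)     { return pte >> 12; }
static inline int      pte_present(uint32_t pte) { return (pte & PTE_P) != 0; }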
The page size of a processor plays a large role in determining how much address space can be addressed. In particular, assuming that the page table tree is built out of single pages, a 2-level page table can map N^2 pages, where N is the number of page table entries that fit in a single page. Thus, if the address space is 32 bits, a page table entry (physical page number plus some extra bits) can fit in 4 bytes. The maximum virtual memory that can be mapped with a 2-level page table is as shown here for various page sizes:
2K pages: 512 (2^9) entries per page = virtual address space of 2^18 pages of 2^11 bytes each = 2^29 bytes (½ GB)
4K pages: 1024 (2^10) entries per page = virtual address space of 2^20 pages of 2^12 bytes each = 2^32 bytes (4GB)
8K pages: 2048 (2^11) entries per page = virtual address space of 2^22 pages of 2^13 bytes each = 2^35 bytes (32GB)
In other words, 2K pages are too small for a 32-bit virtual address space unless the processor moves to a deeper page table, while 8K pages are bigger than necessary. For example, the SPARC and Alpha CPUs, early 64-bit processors, used 8KB pages.
64-bit Intel-compatible CPUs like the Intel i5 and i7 use 4K pages for compatibility, and 8-byte page table entries, because four bytes is too small to hold large physical page numbers. This requires a 4-level page table, for a total virtual address space of 2^36 × 2^12 = 2^48 bytes, i.e., 48-bit virtual addresses:
Clearly, the penalty for TLB misses is higher in this case than in the 32-bit case, as there are four memory accesses to the page table instead of two. Intel and AMD have already announced support for 5-level page tables in future processors, allowing address spaces larger than 256 TB.
Consider the following very simple Linux program launched in a process:
char hello[] = "hello world\n";
void _start(void) {
syscall(4, 1, hello, 12); /* syscall 4 = write(fd,buf,len) */
syscall(1, 0); /* syscall 1 = exit(status) */
}
This is a "hello world" program and doesn't use any libraries; instead, it uses system calls to directly write to standard out (file descriptor 1) and to exit. In Linux, _start is the point at which execution of a program begins; normally the _start function is part of the standard library, which performs initialization before calling main.
The short video by Dr. Maria Jump (formerly of Khoury Boston) walks through the process of building a page table for this program:
What happens if a single physical memory page is mapped into two different process address spaces? It works just fine. Each process is able to read from the page, and any modifications it makes are visible to the other process, as well. The MMU only sees one page table at a time, so there’s no conflict here, even though it might appear that there is one at first glance.
This is helpful when, for example, the same program is run multiple times. Think about a server where multiple instances of the same database server are running: the program instructions are the same and are read-only, so the physical pages containing the instructions can be shared among processes. The same is true of common libraries: the code can be shared.
There are two ways in which page sharing is used:
Information sharing: Some databases and other large programs use memory segments shared between processes to efficiently pass information between those processes. This is a form of "inter-process communication" but is rare, as it requires careful synchronization of memory access, which can be difficult to achieve (see the sketch after this list).
Memory saving: Most processes use the same set of shared libraries to communicate with the OS, the graphical interface, etc., and these libraries must be mapped into the address space of each process. But most of the memory used by these libraries (program code, pre-defined strings, and other constant data) is read-only, and so a single copy can be safely mapped into the address space of each process using the library. This is quite common; in fact, your laptop would need far more memory if Windows, macOS, and/or Linux didn't do this type of page sharing.
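Returning to the information-sharing case, here is a minimal sketch using POSIX shared memory; the segment name /vm_demo and the message are made up for the example, and error checking is omitted (older glibc versions may need -lrt when linking):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd = shm_open("/vm_demo", O_CREAT | O_RDWR, 0600);   /* create a named shared segment */
    ftruncate(fd, 4096);                                      /* make it one 4 KB page long */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (fork() == 0) {                 /* child: write into the shared page */
        strcpy(p, "hello from the child");
        return 0;
    }
    wait(NULL);                        /* parent: the child's write is visible here */
    printf("parent read: %s\n", p);
    shm_unlink("/vm_demo");            /* remove the segment name */
    return 0;
}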
So far, we have assumed that page tables contain proper entries to physical pages. But what if that is not the case and the page table does not contain a required mapping? Then a "page fault" occurs and the mapping must be established. A fault can also occur when a second-level page table does not exist or when an attempt is made to write to a read-only page.
A page fault is a special form of exception that has the following two characteristics: first, it is generated when an address translation fails, and second, it occurs in the middle of an instruction, not after it is done. When a page fault occurs, control is passed to a page fault handler – code within the operating system’s kernel that deals with the exception.
Typical information that the MMU passes to the page fault handler includes the faulting virtual address (on Intel-compatible CPUs it is placed in the CR2 register), whether the failed access was a read or a write, and whether it was made from user or kernel mode.
After the page fault handler returns, the instruction that caused the fault resumes, and it retries the memory access that caused the fault in the first place. A single instruction can cause multiple, different page faults, of which there are two different types:
Instruction fetch: A fault can occur when the CPU tries to fetch the instruction at a particular address. If the instruction "straddles" a page boundary (e.g., a 6-byte instruction that starts 2 bytes before the end of a page), then you could (in the worst case) get two page faults while trying to fetch an instruction.
Memory access: Once the instruction has been fetched and decoded, it may require one or more memory accesses that result in page faults. These memory accesses include those to the stack (e.g., for CALL, PUSH, POP, and RET instructions) in addition to load and store instructions. As before, accessing memory that straddles a page boundary will result in additional faults.
Operating systems use two primary strategies in handling page faults:
Kill the program: If the access is in fact an error (this is how many bugs in C programs reveal themselves as "segmentation faults"), the default action is to kill the process, so that the page fault handler never returns.
Resolve the fault: The OS modifies the page tables to establish a valid mapping for the failing address, and then returns from the page fault handler. The CPU retries the memory access, which should succeed this time.
Note that several additional and different strategies for handling page faults are used in hardware virtualization; these are not addressed in this lesson.
Page fault handling was a primary motivation behind the early development of RISC (reduced instruction set) computers. In particular, to avoid the complex logic that CISC (complex instruction set) CPUs use to save CPU state during a page fault, RISC CPUs simplify the instruction set so that instructions can start over from the beginning after a page fault. This means:
Instructions and data words are aligned on natural boundaries, so that no accesses straddle two pages.
Only LOAD and STORE instructions (with no side effects) access memory. There is no auto-incrementing of an address held in a register, even for the stack pointer, because re-executing the instruction after a fault would give the wrong result. (RISC CPUs typically save the return address in a register, and the compiler generates code to save it to the stack.)
In the Linux kernel, the memory map is represented as a list of vm_area_struct objects, each of which contains information such as the start and end virtual addresses of a region, its access permissions, and, for file-backed regions, the file and offset that back it.
The page table is a simple structure defined by the CPU hardware. The virtual memory map in the OS is a purely software data structure, and can be as simple or complex as the OS writers decide.
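The following is a simplified, hypothetical sketch of such a software memory map and of the search the page fault handler performs over it; the real vm_area_struct and its lookup code are considerably more elaborate:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A simplified memory-map entry, loosely modeled on Linux's vm_area_struct. */
struct vm_area {
    uintptr_t start, end;      /* virtual address range [start, end) */
    bool read, write, exec;    /* permissions */
    const char *file;          /* backing file, or NULL for anonymous regions */
    size_t file_offset;        /* offset of this region within that file */
    struct vm_area *next;      /* next region in the process's memory map */
};

/* Find the region containing a faulting address; NULL means no match
 * (the process is terminated with a segmentation fault). */
struct vm_area *find_region(struct vm_area *map, uintptr_t fault_addr) {
    for (struct vm_area *vma = map; vma != NULL; vma = vma->next)
        if (fault_addr >= vma->start && fault_addr < vma->end)
            return vma;
    return NULL;
}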
The page fault handler searches this list for a match, which can result in one of the following outcomes for the memory map below:
08048000-08049000 r-xp 00000000 08:03 4072226 /tmp/a.out
08049000-0804a000 rw-p 00000000 08:03 4072226 /tmp/a.out
0804a000-0804b000 rw-p 00000000 00:00 0 [anon]
bffd5000-bfff6000 rw-p 00000000 00:00 0 [stack]
No match: This is an access to an undefined address. It’s a bug, and the OS terminates the process with a “segmentation fault” error.
Any page in bffd5000-bfff6000: These are demand-allocate stack pages. The page fault handler allocates a physical memory page, zeros it (in case it’s accidentally holding data from another process), puts it into the page table, and returns.
Page 08048000: This page is mapped read-only from the executable file “a.out,” so the page fault handler allocates a memory page, reads the first page from “a.out” into this page, inserts the page into the page table (marked read-only), and returns.
Page 08049000: This page is mapped read/write from the executable file. Just like page 08048000, when a fault occurs the OS allocates a page, reads its contents from the executable, maps the page in the page table (read/write this time) and returns. (although not shown in the listing, the two pages map different offsets in the file)
Page 0804a000: Like the stack, this is demand-allocated and zero-filled. The page fault handler allocates a page, zeroes it, puts it into the page table, and returns.
Note that this is an example of a general-purpose approach known as lazy allocation. It’s often more efficient to defer allocating resources until they’re actually needed, rather than allocating a potentially excessive amount at startup.
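Lazy allocation is easy to observe from user space. In the sketch below (assuming Linux and 4 KB pages), mmap reserves a large anonymous region, but physical pages are allocated only when individual pages are first touched and their page faults are resolved:

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;                     /* reserve 1 GB of virtual address space */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    p[0] = 1;              /* first touch: a page fault allocates one zero-filled page */
    p[4096 * 10] = 1;      /* another fault, another single page */
    /* Only the touched pages consume physical memory; the rest remain unmapped. */
    printf("reserved %zu bytes, touched 2 pages\n", len);
    munmap(p, len);
    return 0;
}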
As an example, here are the steps involved in the execution of the first instruction of a hypothetical process. When the process starts executing, the PC (program counter) is set to 1000, and CR3 points to a top-level page directory, but that directory is empty.
The first few steps taken are shown in the following video by Dr. Maria Jump (formerly of Khoury Boston):
Rather than limit processes to available physical memory, secondary memory on disks (mechanical hard drives or solid state drives) can be used to temporarily store pages that do not fit in memory and swap pages back and forth between drives and physical memory. The area used on disk for saving pages is often called the backing store, backing file, swap space, swapping file, or virtual memory file. It can be a fixed segment on disk, a full drive, or a file (that is either fixed or expands as needed, i.e., it is dynamically sized).
Page faults allow data to be dynamically fetched into memory when it is needed, in the same way that the CPU dynamically fetches data from memory into the cache. This allows the operating system to over-commit memory: the sum of all process address spaces can add up to more memory than is available, although the total amount of memory mapped at any point in time must fit into RAM. This means that when a page fault occurs and a page is allocated to a process, another page (from that or another process) may need to be evicted from memory.
For read-only mappings, this is simple: just drop the mapping, and if a page fault occurs for that page at some later point in time, it can be paged back in again from the disk.
There are two types of virtual segments: file-backed and anonymous. The majority of file-backed segments are demand-paged executables. If a segment is backed by a specific file, on page faults for pages in that segment, the page will be swapped in from that same file.
When a page is evicted from memory, if it is part of a file-backed read-only segment (i.e., program code) it can simply be dropped, and can be read in again the next time it is needed. If the page is part of a writable memory-mapped file (rare, but programs sometimes do this) and it has been modified, it will need to be written back to that file before the page can be released. However, if there is no corresponding file (what in Unix is called an anonymous segment, used often for the stack and heap) then a location in “swap space” on disk must be allocated where the page can be written.
In Linux, this location is typically a dedicated swap partition; in Windows, this temporary storage resides in the PAGEFILE.SYS file. Such segments are typically created in memory; however, if pages from such a segment are evicted to free up memory, they are written to swap space first.
Note that initialized, writable data in a program (e.g. initialized global variables) is faulted in from the executable file, but if it is modified it must be paged out to swap just like dynamically-allocated segments.
How does the operating system determine whether a page has been modified and needs to be written to disk? It uses the D bit in the page table entry for this.
When a page is mapped in the page table, the D bit in the PTE is set to zero; when the CPU writes to a page with D = 0, the MMU re-writes the page table entry with D = 1. When the OS decides to evict a page, the D bit tells it whether the page is “clean,” i.e., it hasn’t been modified, or whether it is “dirty” and has to be written back to disk. Note that the A bit is set to 1 when a page is accessed; we’ll see how it’s used in the next module when we look at page replacement strategies.
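Putting the cases above together, here is a sketch of the decision an OS might make when evicting a page; the page_info structure is hypothetical and greatly simplified compared to what a real kernel tracks:

#include <stdbool.h>

struct page_info {
    bool file_backed;   /* mapped from an executable or other file */
    bool shared;        /* changes must be written back to that file (a writable mapped file) */
    bool dirty;         /* D bit: modified since the page was mapped */
};

enum evict_action { DROP, WRITE_BACK_TO_FILE, WRITE_TO_SWAP };

enum evict_action choose_action(const struct page_info *pg) {
    if (pg->file_backed && !pg->dirty)
        return DROP;                 /* clean file-backed page: re-read it later if needed */
    if (pg->file_backed && pg->shared)
        return WRITE_BACK_TO_FILE;   /* modified, writable memory-mapped file */
    return WRITE_TO_SWAP;            /* modified private or anonymous page: needs swap space */
}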
How does the OS find a page after it’s been written out to swap? When the OS is paging in from a file (e.g., executable code), finding the data is straightforward, as there is a direct mapping between a range of pages in a specific file and corresponding pages in the virtual memory space. This correspondence can easily be stored in the definition of that virtual address segment. When pages are saved to swap space this doesn’t work, however, as the locations they are saved to are allocated dynamically and fairly arbitrarily.
This problem is solved by using the page table itself. After evicting a page, its page table entry is invalidated by setting P = 0; however, the other 31 bits of the entry are ignored by the MMU. These bits are used to store the location of the page in swap space, so it can be found later at page fault time. Thus, the page table entry does dual duty: when the page is present, it points to the physical page itself and is interpreted by the MMU; otherwise, it points to a location in swap space and is ignored by the MMU but used by the page fault handler.
Demand paging from files and from swap provides the mechanisms to create the traditional memory hierarchy, as shown here.
To access address A:
If it is not in the cache, then the old cache line is evicted, and A is loaded into the resulting empty cache line. This is done in hardware.
If it’s not in memory, then the old page is evicted, and the page containing A is loaded into the resulting empty page. This is done in software.
In general, this works because of locality: the cache line, page in memory, etc., get accessed multiple times before eviction.
Note that page replacement occurs even on desktop and laptop computers with lots of memory: if you leave a large program like Chrome or Microsoft Word idle for half an hour while you use another memory-hungry program, memory will be released from the idle process and given to the active one; if you switch back, the original program will run slowly for a while as it swaps these pages back in.
Caching and virtual memory both rely on two key principles of "locality": temporal and spatial locality.
Temporal locality means that if the CPU accessed a particular piece of data recently, it is likely to access it again in the near future.
Spatial locality means that if the CPU accesses one memory location, it is likely to access nearby memory locations soon after.
Cache and virtual memory management policies, such as page replacement algorithms (e.g., Least Recently Used or LRU), determine which data gets evicted when the cache is full and new data needs to be loaded.
The goal of a memory hierarchy is to reduce the latency of memory accesses by providing a small, high-speed buffer for frequently used data in faster memory.
An important aspect of virtual memory (and caching) is what happens when there is a limited amount of memory available and physical memory, in the case of virtual memory, is overcommitted. In this case, every time a page is swapped in from disk, it will be necessary to remove, or evict, another page from memory. The choice of which page to evict is important: the best page to choose would be one that won’t be needed anymore, while the worst page to evict would be one of the next to be used, in which case, paging it back in would force another page to be evicted, and the work of paging it out and back in again would be wasted.
In fact, replacement of items in a cache is a general problem in computer systems; examples include CPU cache lines backed by main memory, TLB entries backed by page tables, the operating system's buffer cache of disk blocks, and web or DNS caches.
The page replacement problem can be stated in abstract form. Given a memory that can hold M pages and a sequence of accesses a_1, a_2, a_3, ... to pages, a demand-paging strategy is an algorithm that, for each access a_i, does the following: if the page being accessed is already in memory, it does nothing; otherwise, it fetches that page from disk, first evicting some other page if memory is full. In other words, it only fetches a page on demand (i.e., in response to a request for it).
There are several common page replacement strategies:
FIFO (first-in first-out): The page that has been in memory the longest is evicted, regardless of how recently or frequently it has been used. It is trivial to implement (a simple queue of pages) but often performs poorly, because a page that is used constantly is still evicted when its turn comes.
LRU (least-recently used): Here, accesses to each page are tracked after it has been loaded into memory, and the least-recently-used page is evicted (unsurprisingly, given the name of the strategy). The idea behind LRU is that pages which have been accessed in the recent past are more likely to be accessed in the near future than pages which have not been accessed in a long time.
Although this is a small example, a performance improvement is noted, with four misses compared to six for FIFO.
OPT (the optimal demand-paging strategy): Although it is simple to describe, it is of course impossible to implement, since it requires knowledge of the future. We look at it because it provides a way of telling how well a real replacement strategy is performing. Is it close to OPT? Or is it far worse?
OPT picks a page to evict by looking forward in time and finding the page which goes for the longest time without being accessed again.
Except for seeing the future, OPT plays by the same rules as other demand-paging algorithms: in particular, it can’t fetch a page until it is accessed. That’s why the OPT strategy still has misses. (another way to think of it is to count a miss every time you have to fetch a page from disk - an extension of the same proof shows that no strategy using prefetching will fetch fewer pages than OPT)
CLOCK (FIFO with Second Chance): LRU is simple and quite effective in many caching applications, and would be ideal to use for the operating system to determine which pages to evict from memory. But there is one small problem: in a virtual memory system, a “miss” corresponds to a page fault and fetching a page from disk, while a “hit” is when the page is already mapped in memory and the access succeeds in hardware. This means that once a page is faulted into memory, any further use of that page is “invisible” to the operating system. If the OS doesn’t know when a page was last used, it can’t implement the Least-Recently-Used replacement strategy.
Despite this issue, it’s still possible to do better than FIFO by using the A (“accessed”) bit in the page table entry, which indicates whether the page has been accessed since the last time the bit was cleared. (It’s like the D bit that we saw in the previous module, except that it gets set to 1 for reads as well as writes.)
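Here is a minimal sketch of the CLOCK (second-chance) algorithm under simplifying assumptions: the per-frame "accessed" bits, which the MMU would really set in page table entries, are modeled as a plain array, and the clock hand sweeps until it finds a frame whose bit is clear:

#include <stdbool.h>
#include <stddef.h>

#define NFRAMES 8

static bool accessed[NFRAMES];   /* A bit per physical frame, set on each access */
static size_t hand = 0;          /* the "clock hand": the next frame to examine */

/* Choose a frame to evict: skip frames whose A bit is set, clearing the bit
 * as we pass (the "second chance"); evict the first frame found with A == 0. */
size_t clock_evict(void) {
    for (;;) {
        if (!accessed[hand]) {               /* not used since the last sweep: evict it */
            size_t victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        accessed[hand] = false;              /* give this frame a second chance */
        hand = (hand + 1) % NFRAMES;
    }
}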
RANDOM: The RANDOM algorithm simply selects a page for replacement at random. It is the easiest to implement but does not consider any access history or locality of reference. Random replacement can lead to inconsistent and unpredictable performance, and it does not guarantee optimal page replacement decisions.
To summarize, the choice of page replacement algorithm depends on the specific requirements and constraints of the system. LRU is often considered the ideal choice for optimizing page replacement decisions, but it can be more computationally expensive to implement. FIFO and CLOCK are simpler alternatives that strike a balance between simplicity and performance. RANDOM is the least predictable and should be avoided in most real-world scenarios.
This lesson is based on Desnoyers, Peter (Fall 2020). How Operating Systems Work, Fall 2020 Edition, Northeastern University, which is provided under a Creative Commons 4.0 license.
In this lesson, we provided insight into the concept of virtual memory, addressing primitive address translation and memory protection using "base and bounds." We then explored the limitations of this basic translation system, why it's no longer in use, and introduced the page table mechanism that has largely replaced it. We showed how to resolve common challenges regarding contiguous allocation, external fragmentation, paged allocation, internal fragmentation, single- and two-level page tables, and translation look-aside buffers (TLBs). We demonstrated address translation by creating a sample page table for a memory layout. Finally, we introduced several page replacement algorithms for virtual memory page swapping.
Desnoyers, Peter (Fall 2020). How Operating Systems Work. Fall 2020 Edition. Northeastern University. Creative Commons License 4.0. This lesson uses excerpts from this book and updates it.