Objectives

By the end of this lesson, you should be able to:

  • Understand the need for virtual memory
  • Describe address translation and the use of page tables
  • Construct single and multiple level page tables
  • Translate virtual to physical addresses
  • Understand page fault handling
  • Describe page replacement algorithms
  • Appreciate virtual memory backing files

Overview

Virtual memory is a fundamental concept in modern computer systems that allows the processes running within a computer’s operating system (and the operating system itself) to use more memory than it physically possesses. It accomplishes this by using a combination of hardware and software techniques to transparently manage the allocation of data and programs between a computer’s physical memory (RAM) and secondary storage, typically a hard disk drive or solid-state drive. The implementation of virtual memory (VM) requires hardware support by the computer’s CPU and the memory management unit; thus, virtual memory is not necessarily available on all computer systems, particularly those using simple (and cheap) processors such as the Z80, 8086, or 68000 family of processors that are still common in embedded and industrial control systems.

The Motivation Behind Virtual Memory: Virtual memory was born out of the need to overcome the limitations imposed by the physical constraints of early computer systems. In the early days of computing, physical memory (RAM) was expensive and limited in size. As a result, running large or multiple programs concurrently was a significant challenge. To address this issue, computer scientists and engineers developed virtual memory as a solution.

Historical Example: Consider the case of early personal computers like the IBM PC, which typically shipped with only tens to a few hundred kilobytes of RAM. Running complex software, such as word processors or graphical applications, required significantly more memory than was available. Virtual memory emerged as a way to give these systems the illusion of having more memory than they physically possessed.

Virtual memory remains a cornerstone of modern computer systems for several important reasons:

  1. Increased Addressable Memory Space: Virtual memory allows applications to access a much larger addressable memory space than the physical RAM available. This is vital for running large applications or handling vast datasets. In modern 64-bit operating systems, applications can theoretically address up to 18.4 million terabytes (TB) of virtual memory (2^64 bytes), providing ample room for complex computational, data processing, machine learning, and database tasks.

  2. Process Isolation and Protection: Virtual memory provides a robust mechanism for isolating and protecting processes from each other. Each process operates in its own virtual address space, preventing unintended interference and enhancing security. If one application crashes due to a memory error, it doesn’t affect other running applications, enhancing system stability.

  3. Simplified Memory Management: Virtual memory simplifies memory management for both the operating system and application developers. Programs can assume they have access to a vast, contiguous address space without worrying about the physical memory’s constraints. Developers can allocate memory dynamically as needed, making it easier to write flexible and efficient software.

  4. Improved Multitasking: Virtual memory enables efficient multitasking, allowing multiple applications to run concurrently without requiring each to fit entirely into physical RAM. A modern operating system can seamlessly switch between running processes, making it appear as though they are all running simultaneously, even if the physical memory is limited.

In short, virtual memory is not just a concept but a fundamental architectural feature that plays a pivotal role in the efficient functioning of modern computer systems. It has revolutionized the way we utilize memory resources and has become indispensable for supporting today’s complex and resource-intensive computing tasks.

Background

Before diving into the details of virtual memory, paging, address translation, and page tables, let’s review how operating systems work with the CPU to run programs. While the specific mechanisms differ between operating systems, the essentials are the same. In the examples below we use the mechanisms most commonly used in Unix-derived operating systems such as Linux and MacOS.

Boot Process

Let’s start at the beginning and briefly review how an operating system boots, using Linux as an example. When Linux boots, a series of steps occur to initialize and start running the operating system. Below are the key steps of the Linux boot process:

  1. BIOS/UEFI Initialization:
    • The boot process begins with the computer’s Basic Input/Output System (BIOS) or Unified Extensible Firmware Interface (UEFI) performing hardware initialization.
    • The BIOS/UEFI conducts a Power-On Self-Test (POST) to check hardware components like CPU, RAM, storage devices, and peripherals.
    • The BIOS/UEFI is located in a read-only memory chip (ROM) and the code for the BIOS/UEFI is written by third parties and is hard-wired into your computer.
  2. Bootloader Execution:
    • After hardware initialization, the BIOS/UEFI searches for a bootable device, typically the system’s storage devices (e.g., hard drive, SSD, USB Drive). The sequence with which devices are searched for a bootable operating system is defined in the BIOS/UEFI setup and can be changed.
    • Once a bootable device is found, the BIOS/UEFI loads the bootloader program from a specific location on the storage device.
    • Common Linux bootloaders include GRUB (Grand Unified Bootloader) and LILO (LInux LOader).
  3. Bootloader Configuration:
    • The bootloader displays a boot menu if multiple operating systems are installed.
    • The user can choose which OS to boot or specify boot options by editing bootloader configurations.
  4. Kernel Loading:
    • The bootloader loads the Linux kernel (vmlinuz) into memory. The kernel is the core of the operating system responsible for managing hardware, processes, and system resources.
    • The bootloader also passes essential parameters and command-line arguments to the kernel.
    • Kernel parameters can be “tweaked” to change how the operating system behaves. For example, paging files for virtual memory can be limited and the device used for the paging file can be selected. Other parameters commonly include the virtual memory page size.
  5. Initramfs (Initial RAM Filesystem):
    • In some cases, an initramfs image is loaded into memory before the actual root filesystem. Initramfs contains essential drivers and tools needed to mount the root filesystem.
    • This step is crucial when the root filesystem resides on an encrypted disk or in a complex storage configuration.
  6. Kernel Initialization:
    • The Linux kernel begins initializing the system.
    • It detects hardware, initializes devices, and sets up crucial data structures.
    • The kernel mounts the root filesystem (specified in bootloader parameters) as the root of the directory tree.
  7. Init Process (PID 1):
    • Once the root filesystem is mounted, the kernel executes the first user-space program known as the init process.
    • In modern Linux distributions, init has been replaced by systemd or other init systems, but the concept remains the same.
    • The init process is responsible for initializing user-space services, daemons (background processes without interaction), and the entire user environment.
  8. User-Space Initialization:
    • User-space initialization involves launching essential system services and user-level processes.
    • systemd, SysVinit, or another init system is responsible for managing the startup of system services.
    • The system goes through various runlevels or targets, bringing up services according to dependencies.
  9. Login Prompt (Optional):
    • If the system is configured as a multi-user environment, it may present a login prompt on one or more virtual terminals.
    • Users can log in and start interacting with the system.
  10. Graphical User Interface (Optional):
    • On systems configured with a graphical user interface (GUI), a display (or window) manager like GDM (GNOME Display Manager) or LightDM is started.
    • Users can log in through the GUI, launching their preferred desktop environment.
  11. User Interaction:
    • Once the system is fully initialized, users can interact with the running Linux environment, run programs/applications, and perform command line tasks.

Now the system is ready to run programs and applications within processes.

Processes

In Linux, processes are launched using a variety of methods, with the primary method being through the execution of a program or command from a terminal window or a command shell (such as bash or zsh). When we say process, we mean any program or executable. Even running a simple command like “ls” means that a process is launched and the program’s code is loaded from disk and executed as the instructions of the process. No user program code can run without a process.

Here’s a more complete overview of how a program is run with a process and how that process is often launched in Linux:

  1. Executing a Command:
    • The most common way to launch a process is by executing a command from a shell (command-line interface) or a graphical user interface (GUI) application launcher.
    • For example, running a command like ls, firefox, or gcc will initiate a process associated with that command.
  2. Command-Line Execution:
    • When you run a command from the terminal, the shell (e.g., bash or zsh) interprets the command and looks for the corresponding executable file in directories listed in the PATH environment variable.
    • If the executable file is found, the shell creates a new process and loads the executable into memory.
    • Most shell commands are programs residing in /usr/bin or some other folder.
  3. Path to Executable:
    • You can specify the full path to the executable if it’s not in one of the directories listed in the PATH. For example, /usr/bin/ls or /home/user/myapp.
  4. Background and Foreground Execution:
    • Processes can be launched in the foreground (where they run in the terminal, and you can see their output and provide input via the keyboard or another input device) or in the background (where they run independently, and you can continue using the terminal, but not interact with the process).
    • To run a command in the background, you can append & to the command, like command &.
  5. Process Control Operators:
    • You can use process control operators like | (pipe), > (redirect output), and < (redirect input) to manipulate how processes communicate and handle data.
  6. Shell Scripts:
    • Shell scripts are text files containing a series of commands. You can launch a process by running a shell script with the ./scriptname.sh command.
    • Make sure the script is executable (chmod +x scriptname.sh) and contains the appropriate shebang line (e.g., #!/bin/bash) at the beginning.
  7. Graphical User Interface (GUI):
    • In a GUI environment like GNOME or KDE, you can launch processes by clicking on application icons, using the system menu, or opening files associated with specific applications.
  8. Process Forking:
    • Processes can be launched programmatically by other processes, most commonly using system calls such as fork() followed by a form of exec() to create a new process that is a copy of the parent process and then load new code into that “forked” child process.
  9. Background Services and Daemons:
    • Linux also launches background processes known as daemons and services during system startup. These processes run independently of user interaction and handle various system tasks, such as managing hardware, network services, and system logging.
  10. Task Scheduler:
    • Scheduled processes or jobs can be launched automatically at specified times or intervals using task schedulers like cron or systemd timers.
  11. Remote Access:
    • Processes can be initiated remotely through protocols like SSH. For example, you can SSH into a remote server and start processes on that server.
  12. Process Management Commands:
    • You can manage processes using commands like ps (process status), kill (terminate a process), nice (adjust process priority), and renice (change process priority).

Process Internals

The operating system tracks various pieces of information internally for every process to manage and control its execution. This information is stored in a data structure known as the “Process Control Block” (PCB) or “Task Control Block” (TCB). Below is an overview of the key information that the operating system maintains for each process:

  1. Process ID (PID): A unique identifier assigned to each process, allowing the OS to distinguish between different processes.

  2. Process State: Indicates the current state of the process, such as “Running,” “Ready,” “Blocked,” or “Terminated.” The state helps the OS manage process scheduling and resource allocation.

  3. Program Counter (PC): A register that holds the memory address of the next instruction to be executed by the process. If the operating system supports virtual memory, then the PC always references virtual addresses.

  4. CPU Registers: The values stored in CPU registers that belong to the process, including general-purpose registers, stack pointers, and others. Any addresses held by a register, such as the stack pointer, will always be virtual addresses in a virtual memory environment.

  5. Priority and Scheduling Information: Information related to the process’s priority or scheduling priority, which affects its position in the scheduling queue.

  6. Memory Management Information:

    • Memory Allocation: Details about the memory allocated to the process, including the base address and size of the process’s address space.
    • Page Tables: Information about the page tables and memory mapping for virtual-to-physical address translation.
  7. File Descriptor Table: A table that tracks the files and I/O devices the process has open. Each entry in the table contains information such as file pointers and access permissions.

  8. Resource Information: Information about the resources allocated to the process, including allocated memory, open files, and other system resources.

  9. Parent Process ID (PPID): The PID of the parent process that created this process. Useful for process hierarchy management.

  10. Child Process List: A list of PIDs for any child processes spawned by the current process.

  11. Signal Handlers: Information about how the process handles signals (interrupts) sent by the operating system or other processes.

  12. Accounting and Statistics: Data related to the process’s resource usage, execution time, and other performance metrics.

  13. Environment Variables: A list of environment variables that affect the process’s execution environment.

  14. Working Directory: The current directory associated with the process. This directory is used as the default location for file operations.

  15. User and Group ID: The user and group identifiers associated with the process, used for access control and security purposes.

  16. Inter-Process Communication (IPC) Information: If the process communicates with other processes, information about IPC mechanisms like pipes, sockets, and message queues may be stored.

  17. Stack and Heap Information: Details about the process’s stack and heap memory areas, including their current size and limits.

  18. Exit Status: The exit status or return code of the process, indicating the success or failure of its execution.

  19. Terminal and Session Information: For processes associated with terminal sessions, information about the controlling terminal and session ID.

  20. Security Context: Security-related information, including privileges and access control attributes.

These pieces of information are essential for the operating system to manage processes effectively, including process creation, memory management, scheduling, termination, resource allocation, and communication between processes. The PCB or TCB serves as a data structure (held in memory) that the operating system uses to keep track of and manage these details for each process in the system.
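
To make the PCB concrete, here is a minimal, hypothetical C sketch of what such a structure might contain. It is not the layout used by any real kernel (Linux’s equivalent, struct task_struct, is far larger), and the field names are illustrative only.

#include <stdint.h>
#include <sys/types.h>   /* pid_t, uid_t, gid_t */

enum proc_state { PROC_RUNNING, PROC_READY, PROC_BLOCKED, PROC_TERMINATED };

struct open_file;        /* opaque: per-process open-file bookkeeping */

/* A simplified, hypothetical Process Control Block. */
struct pcb {
    pid_t             pid;             /* unique process ID                            */
    pid_t             ppid;            /* parent process ID                            */
    enum proc_state   state;           /* Running, Ready, Blocked, Terminated          */
    uintptr_t         program_counter; /* virtual address of the next instruction      */
    uintptr_t         stack_pointer;   /* saved stack pointer (a virtual address)      */
    uintptr_t         registers[16];   /* other saved general-purpose registers        */
    int               priority;        /* scheduling priority                          */
    uintptr_t         page_table_root; /* physical address of the top-level page table */
    struct open_file **fd_table;       /* file descriptor table                        */
    uid_t             uid;             /* owning user, for access control              */
    gid_t             gid;             /* owning group                                 */
    int               exit_status;     /* recorded when the process terminates         */
};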

Simple Address Translation

In a computer, physical addresses correspond to real signals going over the motherboard to DIMM sockets, selecting specific locations on memory chips. Virtual addresses, on the other hand, are the addresses used by the CPU to access both program code and data. Remember that virtual addresses are translated to physical addresses by the memory management unit (MMU).

Changing these translations can be used to create multiple virtual address spaces, each one corresponding to a separate translation setting. However, there is still only one physical address space – the mapping between physical addresses (represented by electrical signals) and memory locations in DRAM chips.

A simple mechanism uses base and bounds registers to translate a standard per-process virtual address space (i.e., 0..sizeof(process)) into a different location in physical memory, as illustrated here.
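
As a rough sketch (not any particular MMU’s interface), base-and-bounds translation amounts to a bounds check followed by an addition. The register values below are made up for illustration:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-process relocation registers. */
struct base_bounds {
    uint32_t base;   /* physical address where the process's memory starts */
    uint32_t bounds; /* size of the process's address space, in bytes      */
};

/* Translate a virtual address: returns 0 and sets *pa on success,
 * or -1 on a protection fault (address outside the process's space). */
int translate(const struct base_bounds *bb, uint32_t va, uint32_t *pa)
{
    if (va >= bb->bounds)
        return -1;           /* fault: address out of bounds    */
    *pa = bb->base + va;     /* relocate: add the base register */
    return 0;
}

int main(void)
{
    struct base_bounds bb = { 0x40000, 0x10000 };   /* a 64 KB process placed at 256 KB */
    uint32_t pa;
    if (translate(&bb, 0x1234, &pa) == 0)
        printf("VA 0x1234 -> PA 0x%x\n", pa);       /* prints 0x41234 */
    return 0;
}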

Tutorial: Virtual Memory

The series of videos by Dr. Black-Schaffer of Uppsala University provides an overview of many of the concepts we will address in this lesson and watching it before reading the remainder may help set context. Note that his discussion of virtual memory uses the MIPS rather than Intel architecture. While there are subtle differences, it is interesting how similar they are.

Paged-Address Translation

This lesson examines one approach to address translation in detail: the one used by the original Intel Pentium CPU, and by any Intel-compatible CPU (such as those made by AMD) running in 32-bit mode. It uses 32-bit virtual addresses, 32-bit physical addresses, and a page size of 4K (2^12) bytes. Since pages are 2^12 bytes each, addresses can be divided into 20-bit page numbers and 12-bit offsets within each page, as shown here.

The same principles apply to any modern CPU, but keeping to a 32-bit CPU is simpler for explanation, although modern CPUs often have an address space greater than 32 bits. A page size of 4096 bytes is somewhat arbitrary, but it strikes a good balance: smaller pages require larger page tables, more movement of memory, and more page faults, while larger pages can lead to more internal fragmentation and wasted memory.

Note that pages are always aligned, i.e., they start with offset 0 and go up through offset 4095. A page can’t start somewhere in the middle, like at address 100, in the same way that the 20th century starts at 1900, not at some arbitrary year between 1900 and 1999.

The role of the computer’s memory management unit (MMU) in this system is to map a 20-bit virtual page number to a 20-bit physical page number; the offset can pass through unchanged, as shown here.
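
In code, extracting the two fields is just a shift and a mask; the address used below is an arbitrary example:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t va  = 0x12345678;   /* an arbitrary 32-bit virtual address  */
    uint32_t vpn = va >> 12;     /* top 20 bits: virtual page number     */
    uint32_t off = va & 0xFFF;   /* bottom 12 bits: offset within a page */
    printf("VPN = 0x%05x, offset = 0x%03x\n", vpn, off);   /* 0x12345, 0x678 */
    return 0;
}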

Although paged address translation is far more flexible than base and bounds registers, there’s one disadvantage: it requires a great deal more information. Base and bounds translation only requires two values, which can easily be held in registers in the MMU. In contrast, paged translation must be able to handle a separate mapping value (to either a physical page or an invalid indication) for each of over a million virtual pages. This results in single-level page tables being simply too large and not feasible in practice, as we will see below.

Single-Level Page Tables

The only possible place to store the amount of information required by paged address translation is in memory itself, so the MMU uses page tables in memory to specify virtual-to-physical mappings.

One of the simplest ways to structure a page table for mapping 20-bit page numbers is as a simple array with 2^20 entries. With this configuration, each virtual page has an entry, and the value in that entry is the corresponding physical page number. This single-level table is located in physical memory, and the MMU needs a pointer to this table, which is stored in an MMU register. (On Intel-compatible CPUs, this register is Control Register 3, or CR3; this abbreviation is used to refer to the page table root pointer from now on.)

Although address translation is performed in hardware, pseudo-code is used here to describe the algorithm the MMU would use for a single-level page table. VA and PA stand for virtual and physical addresses, and VPN and PPN represent virtual and physical page numbers.

PA = translate(VA):
   VPN, offset = split[20 bits, 12 bits] (VA)
   PTE = physical_read(CR3 + VPN * sizeof(PTE), sizeof(PTE))
   if not PTE.present:
      fault                              # no valid mapping: raise a page fault
   return (PTE.PPN << 12) | offset       # combine the 20-bit PPN with the 12-bit offset

Note that this means that every memory operation performed by the CPU now requires two physical memory operations: one to translate the virtual address, and a second one to perform the actual operation. If this seems inefficient, it is. You’ll learn how to improve this inefficiency in a bit.

Although the page table solves one problem, it causes another: it uses 4MB of memory per map (2^20 entries × 4 bytes per entry).

With 4MB page tables and one table per process, this would require 2.5GB of memory just for page tables if you have 640 processes on a shared server, on a machine with 4GB total, assuming a 32-bit address space. Worse yet, each table would require a contiguous 4MB region of memory, running into the same problem (external fragmentation) that paged address translation tried to solve. And imagine if the address space were 64-bits.

2-Level Page Tables

To solve this problem, almost all 32-bit processors (e.g., Intel-compatible, ARM) use a 2-level page table, structured as a tree. Here’s an example of what one looks like.

The left-most (top) ten bits of the virtual page number index into the top-level table (sometimes called the page directory), which holds a pointer (i.e., the memory address) to a second-level page table. The bottom ten bits of the virtual page number are used as an index into this second-level table, giving the location where the actual physical address will be found. At first glance, it appears that this structure takes just as much space as a single-level table. Doesn’t that mean the same problem still exists?

It turns out that to map a full 4GB of memory, you’ll still need 4MB for page tables. But, if a process only needs a small amount of memory, then most of the entries in the top-level directory will be empty (shown here as P=0), and the page table will only require a small amount of memory. Plus we can use any set of 4KB pages for the table, instead of needing a contiguous 4MB block.

Here is an example of address translation using a 2-level table that uses 3 pages: physical pages 00000 (the root directory), 00001, and 00003. Two data pages are mapped: 00002 and 00004. Any entries not shown are assumed to be null, i.e., the present bit is set to 0.
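
The following C sketch mimics this two-level walk, with a small array standing in for physical memory. The specific virtual addresses mapped are assumptions chosen only to loosely match the page numbers in the example above.

#include <stdint.h>
#include <stdio.h>

#define PRESENT 0x1u

/* "Physical memory" holding the page tables: each page holds 1024 4-byte
 * entries, and an entry is (physical page number << 12) | flags. */
static uint32_t phys_mem[5][1024];

/* Walk a 2-level table rooted at physical page 'cr3'; returns -1 on a fault. */
static int64_t translate(uint32_t cr3, uint32_t va)
{
    uint32_t top10    = (va >> 22) & 0x3FF;   /* index into the page directory  */
    uint32_t bottom10 = (va >> 12) & 0x3FF;   /* index into the 2nd-level table */
    uint32_t offset   = va & 0xFFF;

    uint32_t pde = phys_mem[cr3][top10];      /* page directory entry */
    if (!(pde & PRESENT))
        return -1;                            /* page fault */
    uint32_t pte = phys_mem[pde >> 12][bottom10];
    if (!(pte & PRESENT))
        return -1;                            /* page fault */
    return (int64_t)((pte & ~0xFFFu) | offset);
}

int main(void)
{
    /* Hypothetical layout: page 0 is the root directory, pages 1 and 3 are
     * second-level tables, and pages 2 and 4 are the mapped data pages. */
    phys_mem[0][0] = (1u << 12) | PRESENT;    /* directory entry 0 -> table in page 1 */
    phys_mem[1][0] = (2u << 12) | PRESENT;    /* VA 0x00000000 -> data page 2         */
    phys_mem[0][1] = (3u << 12) | PRESENT;    /* directory entry 1 -> table in page 3 */
    phys_mem[3][5] = (4u << 12) | PRESENT;    /* VA 0x00405000 -> data page 4         */

    printf("0x%llx\n", (unsigned long long)translate(0, 0x00000123));  /* prints 0x2123     */
    printf("0x%llx\n", (unsigned long long)translate(0, 0x00405abc));  /* prints 0x4abc     */
    printf("%lld\n",   (long long)translate(0, 0x00800000));           /* prints -1 (fault) */
    return 0;
}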

Tutorial: Address Translation

In the short tutorial video below, Dr. Maria Jump (formerly of Khoury Boston) demonstrates address translation by way of examples.

Translation Look-Aside Buffer

The 2-level table address translation process you just learned about is even less efficient than the single-level table, requiring two additional memory accesses by the MMU to retrieve page table entries for every CPU memory access. Even if MMU accesses to memory can be satisfied from the L1 cache, this will still slow down the CPU by a factor of three or more. To reduce this inefficiency, a special-purpose cache called the translation look-aside buffer (TLB) is introduced. Instead of holding memory values, like the L1 and L2 caches, the TLB holds virtual page number to physical page number mappings. It’s typically very small, holding between 64 mappings (on various Intel Atom CPUs) and about 1500 mappings (on the more expensive Xeon CPUs). This is in part because it has to be very fast - these translations are needed before the CPU can look in its cache for a value.

A famous computer science quote attributed to David Wheeler is: “All problems in computer science can be solved by another level of indirection,” to which many people add “except indirection.” A corollary to this is that most performance problems can be solved by adding a layer of caching. How are these quotes applicable to paged address translation?

Using the TLB

Using the TLB, the translation process now looks like this. The following acronyms are used: VPN is virtual page number, and PPN is physical page number. Since each entry is 4 bytes, we can calculate the address of entry n by multiplying n times 4.

translate VA -> PA:
   (VPN, offset) = split([20,12], VA)
   if VPN is in TLB:
      return (TLB[VPN] << 12) | offset    # hit: no page table access needed
   (top10, bottom10) = split([10,10], VPN)
   PT = phys_read(CR3 + top10 * 4)        # read the page directory entry
   PPN = phys_read(PT + bottom10 * 4)     # read the 2nd-level page table entry
   add (VPN -> PPN) to TLB, evicting another entry
   return (PPN << 12) | offset            # combine the 20-bit PPN with the 12-bit offset

How well does this perform? If all of the code and data of a process fit into 640 pages (about 2.5MB) on a mid-range machine, all translations will come out of the TLB and there will be no additional overhead for address translation. If the working set, or amount of memory in active use, is larger than this, the page table accesses will cause performance to decrease. However, typically the vast majority of accesses are to pages in the TLB (e.g., currently running code, stack, etc.) and only a small fraction of accesses require going to the page tables in memory for translation.
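
To illustrate the bookkeeping involved, here is a small, hypothetical software model of a fully associative TLB with round-robin eviction. Real TLBs are implemented in hardware and use more sophisticated replacement, but lookup, insert, and flush are the same operations the pseudocode above relies on.

#include <stdint.h>
#include <string.h>

#define TLB_SIZE 8

struct tlb_entry { uint32_t vpn, ppn; int valid; };

static struct tlb_entry tlb[TLB_SIZE];
static int next_victim;               /* round-robin eviction pointer */

/* Returns 1 and fills *ppn on a hit, 0 on a miss. */
int tlb_lookup(uint32_t vpn, uint32_t *ppn)
{
    for (int i = 0; i < TLB_SIZE; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return 1;
        }
    return 0;
}

/* Called after a page table walk on a miss: cache the new translation,
 * evicting whichever entry the round-robin pointer selects. */
void tlb_insert(uint32_t vpn, uint32_t ppn)
{
    tlb[next_victim] = (struct tlb_entry){ vpn, ppn, 1 };
    next_victim = (next_victim + 1) % TLB_SIZE;
}

/* Flush everything, e.g., on a context switch in a design without ASIDs. */
void tlb_flush(void)
{
    memset(tlb, 0, sizeof tlb);
    next_victim = 0;
}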

TLB Consistency

Like any other cache, a TLB only functions correctly if it is consistent, i.e., the entries in the TLB accurately reflect the in-memory page tables which they are caching. There are two primary sources of inconsistency in TLB entries:

  • Modifications: Sometimes the OS must modify the address space of a running program. An example is demand paging (covered in the next module), where the OS maps in new pages, and un-maps others when they are paged out. When changing the page table in memory, the OS must ensure that the TLB is not caching a copy of the old entry.

  • Context switches: To provide both a consistent address space layout as well as memory protection, the OS will typically provide multiple processes with different mappings for the same virtual page; clearly it’s important to make sure that the correct mapping is used according to which process is running.

The issue of modifications can be solved in a fairly straightforward way: the MMU provides one instruction to flush the TLB entry (if there is one) for a particular page, and another to flush the entire TLB. When entries are flushed from the TLB, there is almost always a performance impact, because of the extra memory accesses needed to reload those entries the next time they are required. In the case where individual entries are being flushed the overhead is not that significant, because (a) the OS is already spending a lot of time modifying the page table, and (b) it doesn’t do this very often, anyway.

However, the issue with context switches is harder to solve. One option is to ignore the performance overhead and simply flush the entire TLB on every context switch (this is done on most Intel-compatible CPUs). With a 500-entry TLB and a 4-level page table, this results in throwing away 2000 memory accesses worth of work on each context switch. Another solution is to add an MMU register identifying the current process (an Address Space ID, or ASID), and tag each TLB entry with the ASID of the process in which it is valid. If a process is interrupted for a short time, most of its TLB entries will remain cached, while the ASID field will prevent them from being mistakenly used in another process.

Tutorial: Page Table Entries

The short tutorial by Dr. Maria Jump (formerly Khoury Boston) shows what the 32-bit Intel page table entry (PTE) looks like:

Effect of Page Size

The page size of a processor plays a large role in determining how much address space can be addressed. In particular, assuming that the page table tree is built out of single pages, a 2-level page table can map N^2 pages, where N is the number of page table entries that fit in a single page. Thus, if the address space is 32 bits, a page table entry (physical page number plus some extra bits) can fit in 4 bytes. The maximum virtual memory that can be mapped with a 2-level page table is as shown here for various page sizes:

2K pages: 512 (2^9) entries per page = virtual address space of 2^18 pages of 2^11 bytes each = 2^29 bytes (½ GB)

4K pages: 1024 (2^10) entries per page = virtual address space of 2^20 pages of 2^12 bytes each = 2^32 bytes (4GB)

8K pages: 2048 (2^11) entries per page = virtual address space of 2^22 pages of 2^13 bytes each = 2^35 bytes (32GB)

In other words, 2K pages are too small for a 32-bit virtual address space unless the process moves to a deeper page table, while 8K pages are bigger than necessary. For example, the SPARC and Alpha CPUs, early 64-bit processors, used 8KB pages.

64-bit Intel-compatible CPUs like the Intel i5 and i7 use 4K pages for compatibility, and 8-byte page table entries, because four bytes is too small to hold large physical page numbers. This requires a 4-level page table, for a total virtual address space of 2^36 × 2^12 = 2^48 bytes, i.e., 48-bit virtual addresses.


Clearly, the penalty for TLB misses is higher in this case than in the 32-bit case, as there are four memory accesses to the page table instead of two. Intel and AMD have already announced support for 5-level page tables in future processors, allowing address spaces larger than 256 TB.

Page Table Instantiation

Consider the following very simple Linux program launched in a process:

char hello[] = "hello world\n";
void _start(void) {
   syscall(4, 1, hello, 12); /* syscall 4 = write(fd,buf,len) */
   syscall(1);               /* syscall 1 = exit() */
}

This is a “hello world” program and doesn’t use any libraries; instead, it uses system calls to directly write to standard out (file descriptor 1) and to exit. In Linux, _start is the point at which execution of a program begins; normally the _start function is part of the standard library, which performs initialization before calling main.

The short video by Dr. Maria Jump (formerly of Khoury Boston) walks through the process of building a page table for this program:

Page Sharing

What happens if a single physical memory page is mapped into two different process address spaces? It works just fine. Each process is able to read from the page, and any modifications it makes are visible to the other process, as well. The MMU only sees one page table at a time, so there’s no conflict here, even though it might appear that there is one at first glance.

This is helpful when, for example, the same program is run multiple times. Think about a server where multiple database server instances of the same database are running: the program instructions are the same and are read-only, so the physical pages containing the instructions can be shared among processes. Same with common libraries: the code can be shared.

There are two ways in which page sharing is used:

  1. Information sharing: Some databases and other large programs use memory segments shared between processes to efficiently pass information between those processes. This is a form of “inter-process communication” but is rare as it requires careful synchronization of memory access which can be difficult to achieve.

  2. Memory saving: Most processes use the same set of shared libraries to communicate with the OS, the graphical interface, etc., and these libraries must be mapped into the address space of each process. But, most of the memory used by these libraries (program code, pre-defined strings and other constant data) is read-only, and so a single copy can be safely mapped into the address space of each process using the library. This is quite common; in fact your laptop would need far more memory if Windows, MacOS, and/or Linux didn’t do this type of page sharing.

Page Faults

So far, we assumed that page tables contain proper entries mapping to physical pages. But what if that is not the case and the page table does not contain a required mapping? Then a “page fault” occurs and the mapping must be established. A fault can also occur when a second-level page table does not exist, or when a write is attempted to a read-only page.

A page fault is a special form of exception that has the following two characteristics: first, it is generated when an address translation fails, and second, it occurs in the middle of an instruction, not after it is done. When a page fault occurs, control is passed to a page fault handler – code within the operating system’s kernel that deals with the exception.

Typical information that the MMU passes to the page fault handler is:

  • The instruction address when the page fault occurred
  • The address that caused the page fault
  • Whether the access attempt was a read or a write
  • Whether the access was attempted in user or supervisor mode

After the page fault handler returns, the instruction that caused the fault resumes, and it retries the memory access that caused the fault in the first place. A single instruction can cause multiple, different page faults, of which there are two different types:

  1. Instruction fetch: A fault can occur when the CPU tries to fetch the instruction at a particular address. If the instruction “straddles” a page boundary (e.g., a 6-byte instruction that starts 2 bytes before the end of a page) then you could (in the worst case) get two page faults while trying to fetch an instruction.

  2. Memory access: Once the instruction has been fetched and decoded, it may require one or more memory accesses that result in page faults. These memory accesses include those to the stack (e.g., for CALL, PUSH, POP, and RET instructions) in addition to load and store instructions. As before, accessing memory that straddles a page boundary will result in additional faults.

Operating systems use two primary strategies in handling page faults:

  1. Kill the program: If the access is in fact an error, the default action is to kill the process (producing the “segmentation fault” familiar from debugging C programs), so that the page fault handler never returns.

  2. Resolve the fault: The OS modifies the page tables to establish a valid mapping for the failing address, and then returns from the page fault handler. The CPU retries the memory access, which should succeed this time.

Note that several additional and different strategies for handling page faults are used in hardware virtualization; these are not addressed in this lesson.

Page fault handling was a primary motivation behind the early development of RISC (reduced instruction set) computers. In particular, to avoid the complex logic that CISC (complex instruction set) CPUs use to save CPU state during a page fault, RISC CPUs simplify the instruction set so that instructions can start over from the beginning after a page fault. This means:

  1. Instructions and data words are aligned on natural boundaries, so that no accesses straddle two pages.

  2. Using only LOAD and STORE instructions (with no side effects) to access memory. No auto-incrementing an address in a register, even for the stack pointer, because you would get the wrong result if the instruction ran twice. (RISC CPUs typically save the return address in a register, and the compiler generates code to save it to the stack.)

Page Fault Handling

In the Linux kernel, the memory map is represented as a list of vm_area_struct objects, each of which contains the following information:

  1. Start address
  2. End+1 address
  3. Permissions: read/write/execute
  4. Flags: various details on how to handle this segment
  5. File, offset (if mapped from a file)

The page table is a simple structure defined by the CPU hardware. The virtual memory map in the OS is a purely software data structure, and can be as simple or complex as the OS writers decide.
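
A drastically simplified sketch of such a segment record, and of the list search the page fault handler performs, might look like this in C (the real Linux struct vm_area_struct contains many more fields):

#include <stddef.h>
#include <stdint.h>

struct file;               /* opaque handle to a backing file */

#define VM_READ   0x1u
#define VM_WRITE  0x2u
#define VM_EXEC   0x4u

/* One entry in the process's memory map (one segment). */
struct vm_area {
    uintptr_t       start;    /* first virtual address in the segment       */
    uintptr_t       end;      /* one past the last virtual address          */
    unsigned        perms;    /* VM_READ | VM_WRITE | VM_EXEC               */
    unsigned        flags;    /* details on how to handle this segment      */
    struct file    *file;     /* backing file, or NULL for anonymous memory */
    uint64_t        offset;   /* offset of 'start' within the backing file  */
    struct vm_area *next;     /* next segment in the memory map             */
};

/* Search the memory map for the segment containing a faulting address;
 * returns NULL if there is no match (i.e., a segmentation fault). */
struct vm_area *find_vma(struct vm_area *map, uintptr_t addr)
{
    for (struct vm_area *v = map; v != NULL; v = v->next)
        if (addr >= v->start && addr < v->end)
            return v;
    return NULL;
}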

The page fault handler searches this list for a match which can result in one of the following for the memory map below:

08048000-08049000 r-xp 00000000 08:03 4072226 /tmp/a.out
08049000-0804a000 rw-p 00000000 08:03 4072226 /tmp/a.out
0804a000-0804b000 rw-p 00000000 00:00 0 [anon]
bffd5000-bfff6000 rw-p 00000000 00:00 0 [stack]

  1. No match: This is an access to an undefined address. It’s a bug, and the OS terminates the process with a “segmentation fault” error.

  2. Any page in bffd5000-bfff6000: These are demand-allocate stack pages. The page fault handler allocates a physical memory page, zeros it (in case it’s accidentally holding data from another process), puts it into the page table, and returns.

  3. Page 08048000: This page is mapped read-only from the executable file “a.out,” so the page fault handler allocates a memory page, reads the first page from “a.out” into this page, inserts the page into the page table (marked read-only), and returns.

  4. Page 08049000: This page is mapped read/write from the executable file. Just like page 08048000, when a fault occurs the OS allocates a page, reads its contents from the executable, maps the page in the page table (read/write this time) and returns. (although not shown in the listing, the two pages map different offsets in the file)

  5. Page 0804a000: Like the stack, this is demand-allocated and zero-filled. The page fault handler allocates a page, zeroes it, puts it into the page table, and returns.

Note that this is an example of a general-purpose approach known as lazy allocation. It’s often more efficient to defer allocating resources until they’re actually needed, rather than allocating a potentially excessive amount at startup.
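
Lazy allocation can be observed from user space. In the sketch below, mapping anonymous memory with mmap() returns immediately without allocating physical pages; each first touch of a page then triggers a minor page fault, at which point the kernel allocates and zeroes a physical page on demand. (The 64MB size and the 4096-byte page size here are assumptions for illustration.)

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 64 * 1024 * 1024;   /* 64MB of anonymous, demand-allocated memory */

    /* The kernel only records the mapping here (a new segment in the memory
     * map); no physical pages are allocated yet. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Touching one byte per page forces one page fault per page; only now
     * does the kernel allocate and zero physical memory. */
    for (size_t off = 0; off < len; off += 4096)
        p[off] = 1;

    puts("all pages touched");
    munmap(p, len);
    return 0;
}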

As an example, here are the steps involved in the execution of the first instruction of a hypothetical process. When the process starts executing the PC (program counter) is set to 1000, and CR3 points to a top-level page directory, but that directory is empty.

The first few steps taken are shown in the following video by Dr. Maria Jump (formerly of Khoury Boston):

Shared Memory

Consider the case of a multi-user computer, where multiple users are running the same program (i.e., the shell, /bin/bash) at the same time. The memory usage of the three programs will look something like this:

The code section in each process is identical for two reasons. First, because they are mapped from the same file, they all started with the same values. Second, they are mapped read-only, so no changes can be made to any of them. Thus, the different processes can share the memory pages used to map the code section, so that no matter how many copies of the same program are running, only a single copy of the program code is needed in memory. Is this safe? Yes, because each process still sees exactly the same data they were supposed to see in this address range, and are unable to change the values seen by other processes.

How does the OS determine that it can share the same page between two processes? When a page fault happens, and the page fault handler determines that it needs to read a particular file page (i.e., page 10 from the executable /bin/bash) it first searches to see whether that page is already stored in an existing memory page. (The OS only checks for the case where pages in different processes map exactly the same page in exactly the same file. If there are different executable files that happen to be exact copies of each other, the OS won’t know that they’re the same and will happily load them into memory at the same time.) If it finds a copy already in memory, it can increment a reference count on that page and map it into the process page table, instead of having to allocate a new page and read the data in from the disk. When a process exits, instead of blindly de-allocating any memory mapped by that process, the reference count of each page is decremented, and it is only freed when this count goes to zero, indicating that no other address spaces are mapping that page.

Note that the operating system also provides a way for applications to create read/write memory regions which are explicitly shared between processes, and used for communication between them. This can be used for high-performance communication between processes, and is used in a number of database and similar enterprise applications.

Sharing Executables and Libraries

Sharing memory at the program level worked well on multi-user systems, as you just saw, where many people ran the same simple programs (e.g., the shell, editor, and compiler) at the same time. With the advent of graphical interfaces and single-user workstations, it stopped working so well. Instead, now there’s a single user running one copy each of several different programs. Worse yet, each program is far more complicated than in the past, as the libraries for interacting with the display, mouse, and keyboard are inevitably more complex than the simple libraries needed to define functions like printf for terminal output.

xterm is the original graphical terminal emulator for X Windows on Unix, and it uses very few fancy features. On a typical (for example, the one used by Dr. Desnoyers) machine the program itself compiles down to about 372KB of machine instructions and some amount of data. However, it also uses 26 separate external libraries, adding up to 5.6MB of additional program space. A newer program, gnome-terminal, uses only 300KB of memory for the program itself, but links in 48 libraries, for a total of 22MB of additional memory. Although both of these examples are taken from Linux, both MacOS and Windows use similar large libraries for the graphical interface.

The problem here is that even though your browser, text editor, and email program all use the same libraries, each program ends up being a unique combination of code, combining the actual program code with a specific set of libraries.

The main issue is that the resulting executables have no matching sections, because the libraries end up at different locations in each program’s code. In fact, the sections for each library no longer match, as the linker patches call and jump instructions to refer to different addresses in each program.

With shared libraries, each library is given its own region of address space, rather than being packed together. The base programs (program1 and program2 below) still differ, but the libraries remain identical and can be shared between address spaces.

In shared libraries, the program and the libraries are structured so that different programs can share a single copy of the same library. In simple terms, each library is made to look like a separate program, which means that multiple copies of the same library can be shared, even if the different programs that use it can’t be shared.

fork() and copy-on-write

In all the cases you’ve seen so far, page sharing has been used to share read-only pages: these are intrinsically safe to share, because processes are unable to modify the pages and thereby affect other processes. But, can writable pages be shared safely? The answer is yes, but it has to be done carefully.

First, some background on how this can be done. The Unix operating system uses two function calls to create new processes and execute programs: fork() and exec(). fork() makes a copy of the current process (in fact, the system call returns twice, once in the parent and once in the child), while exec(file) replaces the address space of the current process with the program defined by file and begins executing that program at its designated starting point.
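
A minimal sketch of this fork-then-exec pattern is shown below; the choice of /bin/ls as the program to run is arbitrary.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();   /* returns twice: 0 in the child, the child's PID in the parent */
    if (pid < 0) {
        perror("fork");
        return 1;
    }

    if (pid == 0) {
        /* Child: replace the (copied) address space with a new program. */
        execl("/bin/ls", "ls", "-l", (char *)NULL);
        perror("execl");  /* only reached if exec fails */
        _exit(127);
    }

    int status;
    waitpid(pid, &status, 0);   /* parent: wait for the child to finish */
    printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}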

The short video below by Dr. Maria Jump (formerly Khoury Boston) explains the steps taken by Unix when instantiating a process.

There are reasons why this general approach is a good idea, and reasons why it’s not. But, since it’s been working this way for almost 40 years, it’s not likely to change anytime soon. In early versions of Unix, fork() was implemented by literally copying all the writable sections (e.g., stack, data) of the parent process address space into the child process address space. After doing all this work, most (but not all) of the time, the first thing the child process would do is to call exec(), throwing away the entire contents of the address space that was just copied.

Consider the case when a large program (e.g., Chrome) tries to execute a small program (e.g., /bin/ls) in a child process. When the parent process, e.g., Chrome, executes fork(), it is necessary to copy the entire parent address space, which will then be discarded when exec() is called. That would cause significant wasted effort and increase the time required to launch a process. However, there are a few mechanisms that can be implemented to avoid this issue.

  • Sharing read/write pages: The issue of copying the read-only portions of the address space can be avoided by using page sharing, but what about writable data? A quick inspection of several Chrome and Safari instances (using pmap on Linux and vmmap on MacOS) indicates that a browser with two or three open tabs can easily have over 300MB of writable address space (note that this measurement was taken years ago; it’s probably a lot more now). These pages can’t just be given writable mappings in each process, or changes made in one process would be visible in the other. In certain cases (e.g., the stack) this mutual over-writing of memory would most likely be disastrous.

  • Copy-On-Write: To avoid this issue, the pages are shared between the processes, but made read-only. If one of the processes writes to a page, the page fault handler “breaks” the copy-on-write sharing for that page – it allocates a new page and copies the old one into it, so that each process can now have a separate, writable copy of the modified page. When the pages are no longer shared (i.e., the child process address space is replaced by a call to exec()) the remaining copy-on-write mappings can be changed back to normal writable mappings.

Here is an example of a copy-on-write mapping, where at first the pages are mapped read-only in the two page tables. When a page fault occurs, the OS copies the page, giving each address space a writable copy.

Copy-on-write is a basic strategy that can be used in many contexts, from file systems to application data structures. It reduces both the cost of copying and the total space consumed, by only copying when it is absolutely necessary. Like demand allocation, this is an example of a lazy algorithm.
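
The effect can be seen from user space with a small experiment: after fork(), the parent and child each appear to have a private copy of writable data, even though the kernel only duplicates a page when one of them actually writes to it. A sketch:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 42;    /* writable data, shared copy-on-write after fork() */

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }

    if (pid == 0) {
        value = 99;       /* the write breaks the sharing: the child gets its own page */
        printf("child:  value = %d\n", value);   /* prints 99 */
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    printf("parent: value = %d\n", value);       /* still prints 42 */
    return 0;
}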

Backing Files for Virtual Memory

Rather than limit processes to available physical memory, secondary memory on disks (mechanical hard drives or solid state drives) can be used to temporarily store pages that do not fit in memory and swap pages back and forth between drives and physical memory. The area used on disk for saving pages is often called the backing store, backing file, swap space, swapping file, or virtual memory file. It can be a fixed segment on disk, a full drive, or a file (that is either fixed or expands as needed, i.e., it is dynamically sized).

Page faults allow data to be dynamically fetched into memory when it is needed, in the same way that the CPU dynamically fetches data from memory into the cache. This allows the operating system to over-commit memory: the sum of all process address spaces can add up to more memory than is available, although the total amount of memory mapped at any point in time must fit into RAM. This means that when a page fault occurs and a page is allocated to a process, another page (from that or another process) may need to be evicted from memory.

For read-only mappings, this is simple: just forget the mapping, and if a page fault occurs for that page at some later point in time, it can be paged back in again from the disk.

Types of Virtual Segments

There are two types of virtual segments: file-backed and anonymous. The majority of file-backed segments are demand-paged executables. If a segment is backed by a specific file, then on a page fault for a page in that segment, the page is read back in from that same file.

When a page is evicted from memory, if it is part of a file-backed read-only segment (i.e., program code) it can simply be dropped, and can be read in again the next time it is needed. If the page is part of a writable memory-mapped file (rare, but programs sometimes do this) and it has been modified, it will need to be written back to that file before the page can be released. However, if there is no corresponding file (what in Unix is called an anonymous segment, used often for the stack and heap) then a location in “swap space” on disk must be allocated where the page can be written.

In Linux, this location is typically a dedicated swap partition; in Windows, this temporary storage resides in the PAGEFILE.SYS file. Such segments are typically created in memory; however if pages from that segment are evicted to free up memory they are written to swap pages first.

Note that initialized, writable data in a program (e.g. initialized global variables) is faulted in from the executable file, but if it is modified it must be paged out to swap just like dynamically-allocated segments.

Dirty and Clean Pages

How does the operating system determine whether a page has been modified and needs to be written to disk? It uses the D (“dirty”) bit in the page table entry for this purpose.

When a page is mapped in the page table, the D bit in the PTE is set to zero; when the CPU writes to a page with D = 0, the MMU re-writes the page table entry with D = 1. When the OS decides to evict a page, the D bit tells it whether the page is “clean,” i.e., it hasn’t been modified, or whether it is “dirty” and has to be written back to disk. Note that the A bit is set to 1 when a page is accessed; we’ll see how it’s used in the next module when we look at page replacement strategies.
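
For reference, the relevant flag bits of a 32-bit Intel page table entry can be written as constants like those below (a sketch showing only the bits discussed in the text), along with helpers showing how the OS might consult them:

#include <stdint.h>

#define PTE_P  (1u << 0)   /* Present: this translation is valid             */
#define PTE_W  (1u << 1)   /* Writable: writes allowed (otherwise read-only) */
#define PTE_U  (1u << 2)   /* User: accessible from user mode                */
#define PTE_A  (1u << 5)   /* Accessed: set by the MMU on any access         */
#define PTE_D  (1u << 6)   /* Dirty: set by the MMU when the page is written */

/* When the page is present, the top 20 bits hold the physical page number. */
static inline uint32_t pte_ppn(uint32_t pte)        { return pte >> 12; }
static inline int      pte_is_present(uint32_t pte) { return (pte & PTE_P) != 0; }

/* The eviction decision described above: a clean page can simply be dropped,
 * while a dirty page must first be written back to its file or to swap space. */
static inline int      pte_is_dirty(uint32_t pte)   { return (pte & PTE_D) != 0; }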

How does the OS find a page after it’s been written out to swap? When the OS is paging in from a file (e.g., executable code), finding the data is straightforward, as there is a direct mapping between a range of pages in a specific file and corresponding pages in the virtual memory space. This correspondence can easily be stored in the definition of that virtual address segment. When pages are saved to swap space this doesn’t work, however, as the locations they are saved to are allocated dynamically and fairly arbitrarily.

This problem is solved by using the page table itself. After evicting a page, its page table entry is invalidated by setting P = 0; however, the other 31 bits of the entry are ignored by the MMU. These bits are used to store the location of the page in swap space, so it can be found later at page fault time. Thus, the page table entry does dual duty: when the page is present it points to the physical page itself, and is interpreted by the MMU; otherwise, it points to a location in swap space, and is ignored by the MMU and used by the page fault handler.

The Memory Hierarchy

Demand paging from files and from swap provides the mechanisms to create the traditional memory hierarchy, as shown here.

To access address A:

  • If it is not in the cache, then the old cache line is evicted, and A is loaded into the resulting empty cache line. This is done in hardware.

  • If it’s not in memory, then the old page is evicted, and the page containing A is loaded into the resulting empty page. This is done in software.

In general, this works because of locality: the cache line, page in memory, etc., get accessed multiple times before eviction.

Note that page replacement occurs even on desktop and laptop computers with lots of memory: if you leave a large program like Chrome or Microsoft Word idle for half an hour while you use another memory-hungry program, memory will be released from the idle process and given to the active one; if you switch back, the original program will run slowly for a while as it swaps these pages back in.

Principle of Locality

Caching and virtual memory both rely on two key principles of “locality”: temporal and spatial locality.

  • Temporal locality means that if the CPU accessed a particular piece of data recently, it is likely to access it again in the near future.

  • Spatial locality means that if the CPU accesses one memory location, it is likely to access nearby memory locations soon after.

Cache and virtual memory management policies, such as page replacement algorithms (e.g., Least Recently Used or LRU), determine which data gets evicted when the cache is full and new data needs to be loaded.

The goal of a memory hierarchy is to reduce the latency of memory accesses by providing a small, high-speed buffer for frequently used data in faster memory.

Page Replacement

An important aspect of virtual memory (and caching) is what happens when there is a limited amount of memory available and physical memory, in the case of virtual memory, is overcommitted. In this case, every time a page is swapped in from disk, it will be necessary to remove, or evict, another page from memory. The choice of which page to evict is important: the best page to choose would be one that won’t be needed anymore, while the worst page to evict would be one of the next pages to be used, in which case paging it back in would force another page to be evicted, and the work of paging it out and back in again would be wasted.

In fact, replacement of items in a cache is a general problem in computer systems; examples include:

  • Cache line replacement in the hardware CPU cache
  • Entry replacement in the TLB
  • Buffer replacement in a file system or messaging buffer pool
  • Page replacement in virtual memory

The page replacement problem can be stated in abstract form:

Given the following:

  • A disk D holding d (virtual) pages, with virtual addresses 0,1,…d-1;
  • A memory M consisting of m (physical) pages, where each page is either empty or holds one of the d virtual pages, and
  • An access pattern a_1, a_2, a_3, … where each a_i is a virtual address in the range 0, 1, …, d-1

A demand-paging strategy is an algorithm that, for each access a_i, does the following:

  1. If a_i is already in one of the m physical pages in M (a hit), do nothing.
  2. Otherwise (a miss):
    1. Select a physical page j in M (holding some virtual page M_j) and evict it
    2. Fetch virtual page a_i from disk and place it in physical page j

In other words, a virtual page is only fetched on demand (i.e., in response to a request for it).

Page Replacement Strategies

There are several common page replacement strategies:

  1. FIFO (first-in first-out): In this algorithm, the page evicted from memory is the first page to have been fetched into memory. This strategy is very simple to implement, as it only requires keeping track of the order in which pages were fetched into memory.
  2. LRU (least-recently used): Here, accesses to each page are tracked after it has been loaded into memory, and the least-recently-used page is evicted (unsurprisingly, given the name of the strategy). The idea behind LRU is that pages which have been accessed in the recent past are more likely to be accessed in the near future than pages which have not been accessed in a long time.

    Although this is a small example, a performance improvement is noted, with four misses compared to six for FIFO.

  3. OPT (the optimal demand-paged strategy): Although it is simple, it is of course impossible to implement, since it requires knowledge of the future. We look at it because it provides a way of telling how well a real replacement strategy is performing. Is it close to OPT? Or is it far worse?

    OPT picks a page to evict by looking forward in time and finding the page which goes for the longest time without being accessed again.

    Except for seeing the future, OPT plays by the same rules as other demand-paging algorithms: in particular, it can’t fetch a page until it is accessed. That’s why the OPT strategy still has misses. (another way to think of it is to count a miss every time you have to fetch a page from disk - an extension of the same proof shows that no strategy using prefetching will fetch fewer pages than OPT)

  4. CLOCK (FIFO with Second Chance): LRU is simple and quite effective in many caching applications, and would be ideal to use for the operating system to determine which pages to evict from memory. But there is one small problem: in a virtual memory system, a “miss” corresponds to a page fault and fetching a page from disk, while a “hit” is when the page is already mapped in memory and the access succeeds in hardware. This means that once a page is faulted into memory, any further use of that page is “invisible” to the operating system. If the OS doesn’t know when a page was last used, it can’t implement the Least-Recently-Used replacement strategy.

    Despite this issue, it’s still possible to do better than FIFO by using the A (“accessed”) bit in the page table entry, which indicates whether the page has been accessed since the last time the bit was cleared. (It’s like the D bit that we saw in the previous module, except that it gets set to 1 for reads as well as writes.) A minimal sketch of this approach appears after this list.

  5. RANDOM: The RANDOM algorithm simply selects a page for replacement at random. It is the easiest to implement but does not consider any access history or locality of reference. Random replacement can lead to inconsistent and unpredictable performance, and it does not guarantee optimal page replacement decisions.
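
Here is the minimal CLOCK sketch referenced above: frames are treated as a circle, and the eviction “hand” skips over (and clears the accessed bit of) any frame whose page has been used since the last sweep. The frame structure and names are illustrative, not those of any particular kernel.

#define NFRAMES 8

/* One physical page frame; 'accessed' stands in for the A bit the MMU
 * sets in the page table entry whenever the page in this frame is used. */
struct frame {
    int vpn;        /* virtual page currently held in this frame */
    int accessed;   /* copy of the A bit for that page           */
};

static struct frame frames[NFRAMES];
static int hand;    /* the clock "hand": the next frame to examine */

/* Pick a frame to evict. Frames whose A bit is set get a "second chance":
 * the bit is cleared and the hand moves on. The loop always terminates,
 * because after one full revolution every A bit has been cleared. */
int clock_choose_victim(void)
{
    for (;;) {
        struct frame *f = &frames[hand];
        if (!f->accessed) {
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;            /* evict the page in this frame */
        }
        f->accessed = 0;              /* second chance */
        hand = (hand + 1) % NFRAMES;
    }
}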

To summarize, the choice of page replacement algorithm depends on the specific requirements and constraints of the system. LRU is often considered the ideal choice for optimizing page replacement decisions, but it can be more computationally expensive to implement. FIFO and CLOCK are simpler alternatives that strike a balance between simplicity and performance. RANDOM is the least predictable and should be avoided in most real-world scenarios.

Acknowledgement

This lesson is based on Desnoyers, Peter (Fall 2020). How Operating Systems Work, Fall 2020 Edition, Northeastern University, which is provided under a Creative Commons 4.0 license.

Summary

In this lesson, we provided insight into the concept of virtual memory, addressing primitive address translation and memory protection using “base and bounds.” We then explored the limitations of this basic translation system, why it’s no longer in use, and introduced the page table mechanism that has largely replaced it. We showed how to resolve common challenges regarding contiguous allocation, external fragmentation, paged allocation, internal fragmentation, single- and two-level page tables, and translation look-aside buffers (TLBs). We demonstrated address translation by creating a sample page table for a memory layout. We introduced several page replacement algorithms for virtual memory page swapping.

