Kernel:Virtual Memory
Overview
The IRIX virtual memory system manages physical and virtual memory on MIPS-based Silicon Graphics systems, supporting both 32-bit and 64-bit addressing modes with dynamic switching between them. It is designed for high-performance computing, graphics, and real-time applications, and incorporates SVR4 features such as demand paging, memory-mapped files (mmap) with page protection control (mprotect), and ELF executables. Anonymous memory (e.g., heap, stack) is handled through tree-based structures that keep copy-on-write efficient across forks, while file-backed memory uses a page cache integrated with vnodes. The system supports large physical memory (up to 16 GB in later versions), virtual address spaces of up to 1 TB per 64-bit process, and NUMA-aware allocation for multiprocessor scalability.
Key aspects include page frame management, swapping to disk, variable page sizes for optimization, memory migration for load balancing, and reverse mapping for efficient invalidation. The VM subsystem integrates with IRIX's filesystem layers (e.g., XFS, EFS) for unified caching and I/O, and includes real-time enhancements like page locking to minimize latencies. It emphasizes multiprocessor efficiency with fine-grained locking, preemption, and hardware-specific optimizations for MIPS processors (e.g., R8000, R10000).
Key Components and Flow
- **Address Space Management**: Processes run in user or kernel mode, and their virtual addresses map to physical memory or to swap. User space occupies a fixed segment (e.g., kuseg in 32-bit mode, covering 0 to 2 GB), while the kernel can also address the regions above it.
- **Page Frame Data (pfdat)**: Tracks physical pages, including flags for states like hashed, queued, recycled, or bad. Supports replication in NUMA environments.
- **Anonymous Memory Handling**: Uses binary trees of anon structures for copy-on-write sharing post-fork, with caches for pages (pcache) and swap handles (scache).
- **File-Backed Caching**: Vnode page cache integrates with filesystems for demand-loaded pages, handling flush, invalidate, and toss operations.
- **Swapping and Paging**: Manages swap devices (up to 255), demand paging, and precomputation of memory needs (availsmem) for resource-constrained environments.
- **NUMA and Large Pages**: Node-specific data structures (pgdata, nodepda) for free lists and stats; supports multiple page sizes with coalescing and splitting.
- **Migration and Mapping**: Page migration for balancing; reverse maps (rmap) for quick invalidation during unmap or protection changes.
- **Real-Time Features**: Page locking, bounded interrupt latencies, and Miser integration for batch scheduling.
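To make the 32-bit address space split concrete, the sketch below classifies a virtual address by segment. The kuseg range comes from the list above; kseg0, kseg1, and kseg2 are the standard 32-bit MIPS kernel segments, added here as background. `segment_of` is a hypothetical helper written for illustration, not an IRIX interface.

```python
# Fixed 32-bit MIPS address segments (architectural layout, not IRIX-specific).
SEGMENTS = [
    ("kuseg", 0x0000_0000, 0x7FFF_FFFF),  # user space, TLB-mapped, 2 GB
    ("kseg0", 0x8000_0000, 0x9FFF_FFFF),  # kernel, unmapped, cached
    ("kseg1", 0xA000_0000, 0xBFFF_FFFF),  # kernel, unmapped, uncached
    ("kseg2", 0xC000_0000, 0xFFFF_FFFF),  # kernel, TLB-mapped
]


def segment_of(vaddr: int) -> str:
    """Return the name of the 32-bit MIPS segment containing vaddr."""
    if not 0 <= vaddr <= 0xFFFF_FFFF:
        raise ValueError("not a 32-bit address")
    for name, lo, hi in SEGMENTS:
        if lo <= vaddr <= hi:
            return name
    raise AssertionError("unreachable: segments cover the whole 32-bit space")
```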
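Typical flow: On a page fault (vfault for validity faults, pfault for protection faults), the handler looks the page up in the caches; if it is missing, a page is allocated or swapped in. A fork duplicates the anon tree; an exit prunes it and frees its resources. Memory pressure triggers swapping, coalescing, or migration.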
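The fault path described above can be sketched as a toy Python model. It is not kernel code: `handle_fault`, the dictionary-based cache and swap store, and the 4 KB `PAGE_SIZE` are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def handle_fault(page_cache, swap, key):
    """Resolve a validity fault on `key`; return (page, how_resolved)."""
    page = page_cache.get(key)
    if page is not None:
        return page, "cache-hit"        # page is already resident
    if key in swap:
        page = bytearray(swap[key])     # swap in from the backing store
        how = "swap-in"
    else:
        page = bytearray(PAGE_SIZE)     # first touch: zero-filled page
        how = "zero-fill"
    page_cache[key] = page              # later faults on this key will hit
    return page, how
```

The swap copy is deliberately left in place after a swap-in, so an unmodified page could be dropped again without being rewritten to swap.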
Key Functions
The VM system comprises operations for memory allocation, mapping, fault handling, and resource management. Below is a detailed functional overview, describing roles, logic, and interactions.
Anonymous Memory Initialization and Allocation
Initializes global structures like allocation zones and shared anon lists. Creates new anon handles for regions, setting up locks and caches. Allocates and initializes anon nodes with reference counts and depth hints for tree management.
Fork Handling (Duplication)
Duplicates anon structures for child processes, forming or extending binary trees to enable copy-on-write sharing. Allocates new leaf nodes for parent and child, linking them to the original root. Checks for tree depth and prunes if necessary. Reserves resources for potential cache growth.
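A minimal sketch of this duplication, assuming a simplified node with only a parent link, two child links, and a page dictionary standing in for the pcache. The names `Anon`, `anon_dup`, and `anon_lookup` are hypothetical; locking, reference counts, and pruning are left out.

```python
class Anon:
    """Simplified anon node: parent link, two children, private pages."""

    def __init__(self, parent=None):
        self.parent = parent
        self.left = self.right = None
        self.pages = {}  # page number -> contents (stand-in for pcache)
        self.depth = 0 if parent is None else parent.depth + 1


def anon_dup(leaf):
    """Fork: the old leaf becomes a shared interior node with two new leaves."""
    assert leaf.left is None and leaf.right is None, "only leaves can fork"
    leaf.left, leaf.right = Anon(leaf), Anon(leaf)
    return leaf.left, leaf.right  # (parent's new leaf, child's new leaf)


def anon_lookup(leaf, pgno):
    """Search from a leaf toward the root; the nearest copy wins."""
    node = leaf
    while node is not None:
        if pgno in node.pages:
            return node, node.pages[pgno]
        node = node.parent
    return None, None
```

After a fork, both processes find pre-fork pages in the shared interior node. A copy-on-write fault places the private copy in the writer's leaf, where it covers the shared one for that process only.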
Exit or Region Free (Deallocation)
Disassociates regions from anon memory, removing leaf nodes and freeing pages/swap space. Collapses branches with single children recursively toward the root. Releases locks and nodes when reference counts reach zero.
Tree Optimization (Collapse and Prune)
Reduces tree depth by merging parent-child pairs when one child is absent, transferring pages and swap handles. Selects survivor based on page count; discards covered pages. Prunes empty intermediate nodes with no pages/swap to prevent unbounded growth in long-lived forking processes.
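The merge step can be illustrated with plain dictionaries. This is a sketch under the assumption that the survivor is simply the node holding more pages, so fewer entries need to move; `collapse` is a hypothetical name and swap handles are ignored.

```python
def collapse(parent_pages, child_pages):
    """Merge an interior node with its only remaining child.

    A parent page is "covered" when the child holds its own copy of the
    same page number; the child's copy always wins and the parent's copy
    is discarded. Returns (survivor_pages, covered_page_numbers).
    """
    covered = sorted(pg for pg in parent_pages if pg in child_pages)
    if len(parent_pages) >= len(child_pages):
        survivor = parent_pages
        survivor.update(child_pages)       # child's copies replace covered pages
    else:
        survivor = child_pages
        for pg, data in parent_pages.items():
            survivor.setdefault(pg, data)  # move only the uncovered pages
    return survivor, covered
```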
Page Insertion into Anon
Adds pages to leaf node caches, covering lower tree levels if needed. Ensures no duplicates; handles potential cache resizing.
Page Modification Handling
Decides between copying and stealing a page on a write. The page is copied if it is shared or does not reside at the top level; otherwise it is stolen, which clears its swap information and rehashes it to the top level. Heuristics govern the caching of swapped pages when memory is abundant.
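The copy-or-steal rule reduces to a small predicate. The sketch below assumes the caller already knows whether the page is shared and whether it sits at the top level; `resolve_write` and the `swap_handle` key are hypothetical names, and the caching heuristic is not modeled.

```python
def resolve_write(page, shared, at_top_level):
    """Return "copy" or "steal" for a write to an anonymous page."""
    if shared or not at_top_level:
        return "copy"            # leave the original intact for other users
    page["swap_handle"] = None   # the swap copy is about to become stale
    return "steal"
```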
Swap Handle Management
Merges or clears swap handles during collapses; transfers non-covered handles to survivors.
Vnode Page Cache Initialization
Sets up per-vnode caches during allocation or recycling, initializing hash structures and reference counts.
Vnode Page Lookup and Attachment
Searches caches for pages by logical number; attaches by incrementing use counts, resolving races with allocation/recycling.
Vnode Page Insertion
Adds pages to caches, resizing if needed; supports conditional insertion to handle races, freeing duplicates.
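Lookup with attachment and race-tolerant insertion can be sketched as follows. This is a toy model that assumes a single lock per cache and a dictionary keyed by logical page number. `VnodePageCache`, `find`, and `insert_try` are hypothetical names, and resizing is not modeled.

```python
import threading


class VnodePageCache:
    """Toy per-vnode page cache keyed by logical page number."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pages = {}

    def find(self, lpn):
        """Look up a page and attach to it by bumping its use count."""
        with self._lock:
            page = self._pages.get(lpn)
            if page is not None:
                page["use"] += 1
            return page

    def insert_try(self, lpn, new_page):
        """Insert unless another thread won the race.

        Returns (cached_page, inserted). When inserted is False, the
        caller is expected to free its duplicate new_page.
        """
        with self._lock:
            existing = self._pages.get(lpn)
            if existing is not None:
                existing["use"] += 1
                return existing, False
            new_page["use"] = 1
            self._pages[lpn] = new_page
            return new_page, True
```

A caller that misses in `find` allocates a page without holding the lock, then calls `insert_try`; if a concurrent fault inserted the same logical page first, the caller attaches to the winner and discards its own copy.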
Vnode Page Removal and Invalidation
Removes pages from caches, marking bad or requeuing free; flushes/invalidates ranges, tossing to buffers if needed.
Page Recycling and Migration
Recycles pages for reuse, removing from caches; migrates by copying state to new frames, updating caches atomically.
Memory Requirement Estimation
Precomputes swap/memory needs for operations, caching for runtime linker and batch jobs.
Hardware Workaround Checks
Scans for CPU errata indicators, flagging proxy use for instruction rewriting in executable pages.
Undocumented or IRIX-Specific Interfaces and Behaviors
Anonymous Memory Tree Structures
Binary trees of anon nodes provide efficient copy-on-write: leaves hold process-private pages, while internal nodes hold pages shared by their descendants. Depth hints and pruning thresholds (e.g., PRUNE_THRESHOLD) prevent degradation in forking servers. Covered pages are discarded during merges; swap caches (scache) track handles for swapped-out pages.
Special lists for unreferenced, dirty, shared anon pages (psanonmem counter). Node-specific free lists with locks; removal macros handle MP races.
Page and Swap Caches (pcache/scache)
Dual caches per anon/vnode: pcache hashes in-memory pages; scache manages swap handles for faulted-out pages. Dynamic resizing, preemption during long operations. Undocumented tokens for space reservation.
Memory Reservation (availsmem)
Precomputes and reserves physical/swap memory for execs/forks, integrated with Miser for batch retries (EMEMRETRY). Global caching for runtime components.
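The reservation idea can be shown as a simple counter that must cover a request in full before the operation starts. This is a sketch only: the class name `Availsmem` and its methods are hypothetical, and the Miser retry path (EMEMRETRY) is not modeled.

```python
class Availsmem:
    """Toy counter of pages still available for reservation."""

    def __init__(self, pages):
        self.pages = pages

    def reserve(self, npages):
        """Reserve npages up front; fail without side effects if short."""
        if npages > self.pages:
            return False
        self.pages -= npages
        return True

    def unreserve(self, npages):
        """Return a reservation, e.g. on exit or after a failed exec."""
        self.pages += npages
```

Reserving before an exec or fork means the operation fails early rather than running out of memory partway through.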
NUMA-Aware Data Structures
Per-node pgdata for free lists, stats, and rotors; nodepda integration for allocation. Large page (lpage) support with coalescing daemons, splitting, and stats (e.g., vfault/pfault retries by size).
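Coalescing and splitting can be illustrated on a set of free page frame numbers. The sketch assumes a large page must be naturally aligned and built from contiguous free base pages, which is the usual constraint but is not stated in the text above; `coalesce` and `split` are hypothetical names.

```python
def coalesce(free_pfns, pages_per_large):
    """Group free base pages into aligned large pages where possible.

    Returns (large_page_starts, leftover_free_pfns).
    """
    free = set(free_pfns)
    large = []
    # Candidate starts are free frames on a large-page boundary.
    for start in sorted(p for p in free if p % pages_per_large == 0):
        run = range(start, start + pages_per_large)
        if all(p in free for p in run):
            large.append(start)
            free.difference_update(run)
    return large, sorted(free)


def split(start, pages_per_large):
    """Break one large page back into its base page frames."""
    return list(range(start, start + pages_per_large))
```

With four base pages per large page, frames 0 to 3 and 8 to 11 coalesce into two large pages, while frames 5 to 7 stay on the base-page list because frame 4 is in use.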
Page Migration and Reverse Mapping
Migr subsystem for NUMA balancing; rmap for efficient TLB shootdowns and unmaps. Pagemigr transfers states, handling object mutations.
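A reverse map answers the question "which mappings point at this frame?" without scanning every page table. The sketch below assumes mappings are identified by an (address space id, virtual page number) pair; `ReverseMap` and its methods are hypothetical names, not the IRIX rmap interface.

```python
from collections import defaultdict


class ReverseMap:
    """Toy forward and reverse mapping between virtual pages and frames."""

    def __init__(self):
        self.forward = {}                # (asid, vpn) -> pfn
        self.reverse = defaultdict(set)  # pfn -> {(asid, vpn), ...}

    def map(self, asid, vpn, pfn):
        self.unmap(asid, vpn)            # drop any previous mapping first
        self.forward[(asid, vpn)] = pfn
        self.reverse[pfn].add((asid, vpn))

    def unmap(self, asid, vpn):
        pfn = self.forward.pop((asid, vpn), None)
        if pfn is not None:
            self.reverse[pfn].discard((asid, vpn))
            if not self.reverse[pfn]:
                del self.reverse[pfn]

    def invalidate_frame(self, pfn):
        """Remove every mapping of pfn; return the mappings removed."""
        victims = sorted(self.reverse.pop(pfn, ()))
        for key in victims:
            del self.forward[key]
        return victims
```

In a real kernel, the list returned by `invalidate_frame` would identify the entries that need a TLB shootdown before the frame is migrated or reused.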
Hardware and Real-Time Extensions
Flags for MIPS errata (e.g., R5000 CVT, TFP prefetch); mtext proxies for instruction patching. Page locking for real-time, bounded latencies, and VCE avoidance for cache coloring.
Vnode Cache Synchronization
Reference counting (v_pcacheref) with wait bits for reclaim coordination; broadcast on zero refs. Integration with DMAPI, ShareII for extended file ops.
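The reference-count-plus-wait-bit pattern can be sketched with a condition variable. This is a user-space analogy, not the kernel mechanism: `PcacheRef` and its methods are hypothetical, and a Python `threading.Condition` stands in for the kernel's sleep and broadcast primitives.

```python
import threading


class PcacheRef:
    """Toy reference count whose last release wakes waiting reclaimers."""

    def __init__(self):
        self._cv = threading.Condition()
        self.refs = 0
        self.waiting = False  # stand-in for the "wait bit"

    def hold(self):
        with self._cv:
            self.refs += 1

    def release(self):
        with self._cv:
            self.refs -= 1
            if self.refs == 0 and self.waiting:
                self.waiting = False
                self._cv.notify_all()  # broadcast on zero references

    def wait_for_zero(self, timeout=None):
        """Block until no references remain; return False on timeout."""
        with self._cv:
            if self.refs:
                self.waiting = True    # ask the last releaser to wake us
            return self._cv.wait_for(lambda: self.refs == 0, timeout)
```

The wait bit lets the common release path skip the broadcast entirely when nobody is trying to reclaim the vnode.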
These interfaces and behaviors extend the standard UNIX VM design with SGI-specific optimizations for MIPS, NUMA, graphics, and HPC, emphasizing scalability and reliability.