virtual memory integration txt

Guest on 26th July 2022 01:24:51 PM

  2. Recall the aspects of the virtual memory (VM) system:
  4.   * Isolation (illusion of -- debugging support breaches it)
  5.   * Page sharing between apps
  6.   * Demand paging
  7.   * VM as file I/O cache
  9. Isolation is achieved with help from the hardware (see "Solaris x86 internals" pp. 79--83)
  10. and the note on PAE below.
  12. The other properties above are all due to the modern VM design, which
  13.   (a) maintains a "reverse" mapping for all physical pages: for each page,
  14.       the OS records its usage
  15.   (b) uses the page fault handler as the main workhorse for paging in blocks from block
  16.       devices and delegates to it whenever possible (e.g., "open" means "mmap",
  17.       and "mmap" means page table entry setup; a "read" will then cause a #PF handler
  18.       to actually call the driver's block reading code)
  19.   (c) relies on the ELF format's rich knowledge about the structure of executables
  20.       and libraries.
  22. Consider dynamic linking/loading design as motivated by the economics
  23. of reusing and remapping library code.  As code gets more and more
  24. functionality, there is a break-even point between statically
  25. compiling all the needed function code into executables (only needed
  26. functions will be pasted into the final executable), and factoring
  27. shared code into dynamically loaded libraries (aka shared
  28. objects). Shared object code is trimmed off the executables' own
  29. .text, but now one must load the entire page of a dynamically linked
  30. library where a needed function is located. However, these loaded
  31. pages can be shared between multiple virtual address spaces if needed
  32. by them. Such are the VM trade-offs.
  34. Note that it's only non-writable pages of shared object files that can
  35. be shared between processes for their lifetimes; writable pages of
  36. data obviously must receive a separate copy in every address space as
  37. soon as they are written to! This design is called copy-on-write: a
  38. writable page is shared until the first write into it, at which point
  39. the process that wrote into it must receive its own private copy of
  40. that page (while other processes that have not yet written to it may
  41. continue sharing the page as loaded).
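The visible effect of copy-on-write can be checked from user space. The following is a minimal sketch, assuming a POSIX system; the function name cow_parent_view is made up for illustration. It cannot observe the sharing itself, only that the child's first write stays private to the child.

```c
#define _POSIX_C_SOURCE 200809L  /* fork(), waitpid() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fills a heap buffer with 'A', forks, lets the child
 * overwrite the first byte, and returns the byte the parent still sees
 * (or -1 on error).
 */
int
cow_parent_view(void)
{
    char *buf = malloc(4096);
    if (buf == NULL)
        return (-1);
    memset(buf, 'A', 4096);

    pid_t pid = fork();
    if (pid < 0) {
        free(buf);
        return (-1);
    }
    if (pid == 0) {
        /* First write after fork: the child gets its own copy of the page. */
        buf[0] = 'B';
        _exit(buf[0] == 'B' ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        free(buf);
        return (-1);
    }

    int seen = buf[0];  /* the parent's page was never modified */
    free(buf);
    return (seen);
}
```

The parent should still read 'A' after the child has exited, since the child's write landed on a private copy of the page.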
  43. --------------[ Anatomy of address spaces: ]---------------------
  45. "OpenSolaris Internals" Chapter 9.2 explains the theory of address spaces.
  47. Chapter 9.4 explains how address spaces are implemented and handled:
  49. proc_t.p_as  -> struct as -> (AVL Tree) -> "struct seg" (AKA seg_t)
  51. Cf.: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/seg.h#102
  53. Observe seg_ops, the dispatch table of operations/methods that will handle (consecutive)
  54. memory segment mapping, casting the  void* s_data  member of seg_t
  55. into whatever type the operations apply to (e.g., segvn_data for segvn_ops,
  56. segmap_data for segmap_ops, and so on):
  58. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/seg.h#117
  60. In the terms of the C++ object system, seg_t is an "abstract class": a common
  61. base class for a variety of classes that actually represent objects with that common
  62. set of operations, but meaningless and not instantiable as such. In Java, such
  63. abstract non-instantiable definitions are typically written as *interfaces*, to further
  64. distinguish them from the instantiable classes.
  66. Note that we have seen a similar pattern with vnode_t in VFS across
  67. different file systems: see VOP_* macros and fop_* methods in
  68. process-table-traversals.txt.
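The "abstract class in C" pattern can be sketched outside the kernel. Everything below (toy_seg, toy_seg_ops, the two toy drivers) is hypothetical and far smaller than the real seg_ops; it only shows how an ops table plus a void* private-data pointer gives per-driver behavior behind one call site.

```c
#include <assert.h>
#include <stddef.h>

struct toy_seg;

/* Hypothetical, much-reduced analogue of a segment ops table. */
struct toy_seg_ops {
    long (*fault)(struct toy_seg *seg, size_t off);
};

/* Hypothetical analogue of seg_t: common fields plus driver-private data. */
struct toy_seg {
    size_t base;
    size_t size;
    const struct toy_seg_ops *ops;
    void *data;   /* each driver casts this to its own type */
};

struct toy_file_data { long file_off; };   /* file-backed driver state */
struct toy_anon_data { int zero_fills; };  /* anonymous driver state */

/* File-backed driver: report which file offset backs the faulting byte. */
static long
toy_file_fault(struct toy_seg *seg, size_t off)
{
    struct toy_file_data *fd = seg->data;
    return (fd->file_off + (long)off);
}

/* Anonymous driver: count a zero-fill; there is no backing file offset. */
static long
toy_anon_fault(struct toy_seg *seg, size_t off)
{
    struct toy_anon_data *ad = seg->data;
    (void)off;
    ad->zero_fills++;
    return (0);
}

const struct toy_seg_ops toy_file_ops = { toy_file_fault };
const struct toy_seg_ops toy_anon_ops = { toy_anon_fault };

/* Single call site, in the spirit of a SEGOP_* macro. */
long
toy_seg_fault(struct toy_seg *seg, size_t off)
{
    return (seg->ops->fault(seg, off));
}
```

The caller never inspects the segment's type; only the function reached through ops knows what data really points to.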
  70. ---------------------[ Address space life cycle ]---------------------
  72. Life cycle of a "struct as":
  74. as_alloc() [once, at system boot/init time] -> as_dup() [by fork()]
  76. as_dup() is how all address spaces after init's get created:
  77. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/vm_as.c#774
  79. Observe how AVL structures get copied in as_dup (the loop at line 791).
  81. Look for the call to as_dup() in fork():
  82. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/os/fork.c#286
  84. ---------------------[ The trajectory of a page fault ]---------------------
  86. For the purposes of this trace, we'll start with trap(),
  87. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/i86pc/os/trap.c#455,
  88. which is the common point where Illumos C code starts handling
  89. traps and faults.
  91. Of course, to get to trap() from an instruction that, e.g., issues a
  92. virtual address that lacks a matching entry in the hardware page
  93. tables, or an instruction that can't be decoded in a valid way,
  94. execution must be dispatched through the Interrupt Descriptor Table to
  95. the entry that corresponds to the particular kind of fault or trap. We
  96. say that the hardware "raises" that exception: control flow that can no
  97. longer continue due to an error (such as a virtual address unmapped in
  98. the page table pointed to by the current value of CR3) is switched
  99. to a special entry point in the IDT.
  101. These entry points, one per exceptional trapped condition, are encoded
  102. in IDT entries; they typically lead to assembly routines such as
  103. cmntrap(). These assembly routines line up the CPU-provided info
  104. about the trap (such as the faulting virtual address) according to
  105. the C calling convention, so that it can be fed into a C function.
  106. This is what cmntrap() does, so that trap() can be called.
  108. In turn, trap() dispatches on the trap code, trapno. Observe the
  109. big switch statement at
  110. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/i86pc/os/trap.c#579,
  111. where the cases are trap numbers from the x86 CPU manual, defined as constants at
  112. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/sys/trap.h#41
  114. Page fault #PF is number 14, 0xe, called T_PGFLT.
  116. Then pagefault() is called from trap() to handle various cases of #PF.
  117. It contains simple process-related checks, but all the
  118. real process-specific work is done in as_fault().
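The dispatch-on-trapno idea reduces to a switch. This is a toy sketch, not the logic of the real trap(): the TOY_ names and the action enum are invented, and treating every kernel-mode fault other than #PF as a panic is a deliberate simplification. Only the vector numbers (0x0, 0x6, 0xd, 0xe) come from the x86 architecture.

```c
#include <assert.h>

/* x86 exception vector numbers; the TOY_ names are made up for this sketch. */
#define TOY_T_ZERODIV 0x0   /* #DE, divide error */
#define TOY_T_ILLINST 0x6   /* #UD, invalid opcode */
#define TOY_T_GPFLT   0xd   /* #GP, general protection */
#define TOY_T_PGFLT   0xe   /* #PF, page fault */

enum toy_action {
    TOY_CALL_PAGEFAULT,  /* hand off to the page fault path */
    TOY_SIGFPE,
    TOY_SIGILL,
    TOY_SIGSEGV,
    TOY_PANIC
};

/* Decide what to do for a trap number; from_user is nonzero for user mode. */
enum toy_action
toy_trap(unsigned trapno, int from_user)
{
    switch (trapno) {
    case TOY_T_PGFLT:
        /* Page faults are resolved by the VM system in either mode. */
        return (TOY_CALL_PAGEFAULT);
    case TOY_T_ZERODIV:
        return (from_user ? TOY_SIGFPE : TOY_PANIC);
    case TOY_T_ILLINST:
        return (from_user ? TOY_SIGILL : TOY_PANIC);
    case TOY_T_GPFLT:
        return (from_user ? TOY_SIGSEGV : TOY_PANIC);
    default:
        return (TOY_PANIC);
    }
}
```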
  120. Note how as_fault()
  121. (http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/vm_as.c#841)
  122. redispatches the handling of the fault by finding the right segment in the
  123. address space's AVL tree and then calls _that segment's_ SEGOP_FAULT (line 944):
  125.         res = SEGOP_FAULT(hat, seg, raddr, ssize, type, rw);
  127. This is very important. A page fault on a swapped-out memory page merely
  128. means that page needs to be brought back from disk. A page fault on an
  129. unmapped address (no segment containing it) may mean either that a SIGSEGV
  130. signal should be sent to the process OR that a new page should be claimed
  131. for the process' stack. A page fault on a previously mmap-ed file (such
  132. as a library shared object) means that a physical page needs to be
  133. filled with the corresponding page-sized chunk of the file on disk, and
  134. so on. The segment will invoke the right kind of operation, based
  135. on its type as recorded in its s_ops (and supported by the right kind
  136. of s_data).
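The "find the segment, then let it handle the fault" step can be sketched as follows. This is a toy under stated assumptions: all mini_ names are hypothetical, a binary search over a sorted array stands in for the AVL tree, and the stack-growth case is left out, so any address outside every segment is simply reported as a SIGSEGV.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum mini_result { MINI_PAGED_IN, MINI_FILLED_FROM_FILE, MINI_SEGV };

/* Hypothetical segment: an address range plus its own fault handler. */
struct mini_seg {
    uintptr_t base;
    size_t size;
    enum mini_result (*fault)(const struct mini_seg *seg, uintptr_t addr);
};

/* Toy handler for swapped-out anonymous memory. */
enum mini_result
mini_anon_fault(const struct mini_seg *seg, uintptr_t addr)
{
    (void)seg; (void)addr;
    return (MINI_PAGED_IN);
}

/* Toy handler for a file-backed mapping. */
enum mini_result
mini_file_fault(const struct mini_seg *seg, uintptr_t addr)
{
    (void)seg; (void)addr;
    return (MINI_FILLED_FROM_FILE);
}

/* Find the segment containing addr; segs must be sorted and non-overlapping. */
const struct mini_seg *
mini_find_seg(const struct mini_seg *segs, size_t n, uintptr_t addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < segs[mid].base)
            hi = mid;
        else if (addr - segs[mid].base >= segs[mid].size)
            lo = mid + 1;
        else
            return (&segs[mid]);
    }
    return (NULL);
}

/* No segment means SIGSEGV here; otherwise the segment's own handler decides. */
enum mini_result
mini_as_fault(const struct mini_seg *segs, size_t n, uintptr_t addr)
{
    const struct mini_seg *seg = mini_find_seg(segs, n, addr);

    if (seg == NULL)
        return (MINI_SEGV);
    return (seg->fault(seg, addr));
}
```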
  138. See "x86 OpenSolaris internals", Section 6.2 about
  139. OpenSolaris' unified trap, faults, and exceptions handling.
  141. ---------------------[ Per-page <vnode,offset> hash table ]---------------------
  143. The kernel maintains a "struct page" page_t data structure for every physical
  144. page, explained in the beginning of Chapter 10.
  146. This is the heart of mmap-ed file sharing (libraries, in particular)
  147. and file I/O caching: it associates a physical page with
  148. a piece of a file/device, i.e. maps
  150.   <vnode, offset_into_file> --> an instance of page_t <--> phys. page
  152. page_t defined at:
  154. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/page.h#498
  156. Look through the big comment that starts at line 149, and explains the uses
  157. of this page_t structure!
  159. Multiple kernel functions using this table means multiple locks
  160. protecting its different members. The long comment at line 149 of
  161. page.h tells the whole story. The story, especially the "locking
  162. protocol", is a perfect example of OS programming optimizations.
  164. Lookup in the page hash table: page_find
  165. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/vm_page.c#993
  167. Note that the hashing is done in a macro:
  168. PAGE_HASH_FUNC, http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/page.h#590
  170. The hash table itself is page_hash, defined in
  171. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/i86pc/os/startup.c#306
  172. next to other globals that enable other traversals of phys memory pages.
  174. We've seen similar code in /proc traversal.
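The <vnode, offset> lookup can be sketched as a small chained hash table. This is a toy: the toy_ structures, the bucket count, and the hash mixing are all invented (it is not PAGE_HASH_FUNC), and the locking that the page.h comment describes is omitted entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_SIZE 64   /* arbitrary power of two */

struct toy_vnode { int id; };

/* Hypothetical per-physical-page record, identified by <vnode, offset>. */
struct toy_page {
    const struct toy_vnode *vp;
    uint64_t off;
    struct toy_page *hash_next;   /* chain within one bucket */
    unsigned long pfn;            /* physical frame number */
};

struct toy_page_hash {
    struct toy_page *bucket[TOY_HASH_SIZE];
};

/* Arbitrary mixing of vnode address and page-aligned offset. */
static size_t
toy_hash(const struct toy_vnode *vp, uint64_t off)
{
    uintptr_t v = (uintptr_t)vp >> 4;
    return ((size_t)((v ^ (off >> 12)) & (TOY_HASH_SIZE - 1)));
}

/* Push the page onto the front of its bucket's chain. */
void
toy_page_insert(struct toy_page_hash *h, struct toy_page *pp)
{
    size_t i = toy_hash(pp->vp, pp->off);
    pp->hash_next = h->bucket[i];
    h->bucket[i] = pp;
}

/* Return the page caching <vp, off>, or NULL if it is not resident. */
struct toy_page *
toy_page_find(const struct toy_page_hash *h, const struct toy_vnode *vp,
    uint64_t off)
{
    struct toy_page *pp;

    for (pp = h->bucket[toy_hash(vp, off)]; pp != NULL; pp = pp->hash_next) {
        if (pp->vp == vp && pp->off == off)
            return (pp);
    }
    return (NULL);
}
```

Two processes that map the same library look up the same <vnode, offset> key, find the same record, and can therefore share the same physical page.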
  176. ---------------------[ Segment Drivers ]---------------------
  178. Named in Table 9.4 (p. 479).
  180. Primary example: "seg_vn":
  181. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/vm/seg_vn.c#140
  183. Observe *static* methods of this driver being packed into a "struct seg_ops",
  184. called segvn_ops.
  186. Note that "struct seg"'s void* s_data member in segments created by the segvn driver
  187. will be pointing to a segvn_data (p. 482 explains it), and thus
  188. through it to the mapped file's vnode ("vp" member) and offset
  189. ("offset" member).
  191. The same driver also handles anonymous mappings. In that case, the segvn_data's "amp"
  192. member will be used instead (shown in Fig. 9.11)
  194. When faulting in a file-backed page:
  196. trap() -> pagefault() -> as_fault() -> segvn_fault() -> <fs>_getpage()
  198. where <fs> is the underlying filesystem, predominantly zfs.
  200. Observe the "dives" from abstraction layers to specific systems' implementation
  201. workhorse methods and back:
  203. mmap() -> <fs>_map() -> segvn_create() -> hat_map()
  204.            ufs_map
  205.            zfs_map
  206.            ...
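Seen from user space, the mmap() path above looks like this. A minimal sketch, assuming a POSIX system with a writable /tmp; the helper name read_back_via_mmap is made up. mmap() only sets up the mapping; the bytes are fetched when the mapping is first touched, which is where the fault path described earlier does the work.

```c
#define _POSIX_C_SOURCE 200809L  /* mkstemp() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: writes msg to a temporary file, maps the file, and
 * copies len bytes back out through the mapping into out.
 * Returns 0 on success, -1 on error.
 */
int
read_back_via_mmap(const char *msg, size_t len, char *out)
{
    char path[] = "/tmp/vmdemoXXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return (-1);
    (void) unlink(path);   /* the open descriptor keeps the file alive */

    if (len == 0 || write(fd, msg, len) != (ssize_t)len) {
        (void) close(fd);
        return (-1);
    }

    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    (void) close(fd);      /* the mapping holds its own reference */
    if (p == MAP_FAILED)
        return (-1);

    /* First touch of each mapped page may fault it in. */
    memcpy(out, p, len);
    (void) munmap(p, len);
    return (0);
}
```

Because the file was just written, its pages are probably still cached, so any faults taken here are likely to be minor ones.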
  208. ---------------------[ Trap handling in Illumos ]---------------------
  210. All traps are handled uniformly by trap(). This is a conscious design
  211. decision: all registers are saved on the stack by the respective ASM
  212. interrupt handler pointed to from the IDT, and then C routines are
  213. presented with the same data structure. Observe the "struct regs *rp"
  214. argument in trap(), and also "caddr_t addr" (which must be extracted
  215. from the special register CR2 when the page fault handler is called). This
  216. is done by cmntrap ("common trap"),
  217. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/i86pc/ml/locore.s#1098
  219. ---which, in turn, starts with pushing all regs on the stack by
  220. INTR_PUSH, for 64-bit and 32-bit respectively:
  221. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/amd64/sys/privregs.h#202
  222. http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/intel/ia32/sys/privregs.h#153
  224. Eventually, this register push surfaces the trapno in trap() as a C struct member:
  226. type = rp->r_trapno;
  228. -----------------------------------------------------------------------
  230. Suggestions:
  232. Using DTrace's vminfo::: provider, observe all four
  233. kinds of page mappings in Fig. 9.2 (described on p. 457) in
  234. action for your favorite process. E.g., write simple "memory hog"
  235. programs to malloc() a lot of anonymous pages, or call functions
  236. with lots of stack-allocated local arrays.
  238. Observe file sharing between processes.
  239. Notice how *minor faults* are handled (see 9.4.4 for definitions)
  241. See Table 9.3 for address space manipulation functions.
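A possible starting point for the "memory hog" suggestion, as a sketch only: the function name touch_anon_pages is made up, and whether malloc() is served from freshly mapped anonymous memory depends on the allocator and the request size. Writing one byte per page forces each page to be materialized.

```c
#define _POSIX_C_SOURCE 200809L  /* sysconf() under strict -std= modes */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocates npages pages of heap memory and writes one
 * byte into each page. Returns the number of pages touched, or -1 on error.
 */
long
touch_anon_pages(size_t npages)
{
    long pagesz = sysconf(_SC_PAGESIZE);

    if (pagesz <= 0 || npages == 0 || npages > SIZE_MAX / (size_t)pagesz)
        return (-1);

    /* volatile keeps the compiler from dropping the otherwise unused stores */
    volatile char *buf = malloc(npages * (size_t)pagesz);
    if (buf == NULL)
        return (-1);

    long touched = 0;
    for (size_t i = 0; i < npages; i++) {
        buf[i * (size_t)pagesz] = 1;   /* first write to this page */
        touched++;
    }

    free((void *)buf);
    return (touched);
}
```

Calling this in a loop from a small main() while tracing the process with the vminfo provider should make the anonymous-page faults stand out.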
  244. ---------------------[ PAE, the pre-64-bit paging ]---------------------
  246. I mentioned PAE as a stage between 32-bit x86 MMUs and the current
  247. 64-bit designs. It's a taste of how actual hardware progresses.
  249. PAE overview: http://en.wikipedia.org/wiki/Physical_Address_Extension
  250. (36 bits vs classical 32 bits of physical address, i.e. 4GB -> 64GB)
  252. Classic 32-bit page translation without PAE:
  254. CR3 ->  4 KB "page directory"
  255.         (4 byte entry)*1024 -------> 4 KB "page table"
  256.                                       (4 byte entry)*1024      
  258. Page translation with PAE: (bit 5 of CR4 := 1)
  260. CR3 -> Page Dir Ptr Table
  261.        (8 byte entry)*4 -------> 4 KB "page directory"
  262.                                  (8 byte entry)*512 ----> 4 KB "page table"
  263.                                                           (8 byte entry)*512
  265. bit 7 in each PDE eliminates the last lookup stage when set; instead,
  266. the rest of the address is interpreted as an offset into a 4MB (no
  267. PAE, 22 bits) or 2MB page (with PAE, 21 bits).
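The two layouts above determine how a 32-bit virtual address is cut into table indices: 10/10/12 bits without PAE (1024 entries per table) and 2/9/9/12 bits with PAE (4 and 512 entries). A small sketch of that arithmetic; the struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Indices for classic two-level translation: 10 + 10 + 12 bits. */
struct va_classic { unsigned pd, pt, off; };

/* Indices for PAE three-level translation: 2 + 9 + 9 + 12 bits. */
struct va_pae { unsigned pdpt, pd, pt, off; };

struct va_classic
split_classic(uint32_t va)
{
    struct va_classic r;
    r.pd  = va >> 22;             /* index into the page directory */
    r.pt  = (va >> 12) & 0x3ff;   /* index into the page table */
    r.off = va & 0xfff;           /* byte offset within the 4 KB page */
    return (r);
}

struct va_pae
split_pae(uint32_t va)
{
    struct va_pae r;
    r.pdpt = va >> 30;            /* index into the 4-entry PDPT */
    r.pd   = (va >> 21) & 0x1ff;  /* index into the page directory */
    r.pt   = (va >> 12) & 0x1ff;  /* index into the page table */
    r.off  = va & 0xfff;          /* byte offset within the 4 KB page */
    return (r);
}

/* Offset within a large page: 22 bits (4 MB) without PAE, 21 bits (2 MB) with. */
uint32_t
large_page_offset(uint32_t va, int pae)
{
    return (va & (pae ? 0x1fffffu : 0x3fffffu));
}
```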
  269. Bit 0 in the PTE is the crucial "Page Present" bit. When hardware translation
  270. sees it cleared, it raises the #PF trap; the handler swaps the page back in and the
  271. faulting instruction is retried (the faulting virtual address is in %cr2 on entry to the #PF handler).
  273. The PAE entry format was the first on x86 to make room for the per-page NX bit
  274. in the PTE descriptor layout, as the top bit (bit 63) of the 64-bit struct. It
  275. remains there to this day.
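The bit positions mentioned in this section (bit 0 = present, bit 7 = page size in a PDE, bit 63 = NX in the 64-bit entry format) can be tested with simple masks. A sketch with made-up macro and function names; toy_access_faults ignores the write and user/supervisor bits, and assumes NX enforcement is switched on.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PTE_PRESENT  (UINT64_C(1) << 0)   /* bit 0: page present */
#define TOY_PDE_PS       (UINT64_C(1) << 7)   /* bit 7: PDE maps a large page */
#define TOY_PTE_NX       (UINT64_C(1) << 63)  /* bit 63: no-execute */

int
toy_pte_present(uint64_t pte)
{
    return ((pte & TOY_PTE_PRESENT) != 0);
}

int
toy_pde_is_large(uint64_t pde)
{
    return ((pde & TOY_PDE_PS) != 0);
}

int
toy_pte_no_execute(uint64_t pte)
{
    return ((pte & TOY_PTE_NX) != 0);
}

/*
 * Simplified fault check: a non-present entry always faults; a present
 * entry faults on an instruction fetch if NX is set.
 */
int
toy_access_faults(uint64_t pte, int is_fetch)
{
    if (!toy_pte_present(pte))
        return (1);
    return (is_fetch && toy_pte_no_execute(pte));
}
```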
  277. See "x86 OpenSolaris internals", Section 4.3.
  279. For the OS-developer level of documentation on PTEs and PDEs in detail:
  280. "Intel 64 and IA-32 Architectures Software Developer's Manual
  281. Volume 3A: System Programming Guide", pp. 3-35 -- 3-45
  282. http://download.intel.com/design/processor/manuals/253668.pdf
  283. (from 2011: superseded but more readable).
  285. See also Section 3.12, p. 3-51 for a brief summary of TLBs.
  287. ------------------------------------------------------------------------
