= Userfaultfd =

== Objective ==

Userfaults allow the implementation of on-demand paging from userland
and more generally they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do.

For example userfaults allow a proper and more optimal implementation
of the PROT_NONE+SIGSEGV trick.

== Design ==

Userfaults are delivered and resolved through the userfaultfd syscall.

The userfaultfd (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:

1) a read/POLLIN protocol to notify a userland thread of the faults
   happening

2) various UFFDIO_* ioctls that can manage the virtual memory regions
   registered in the userfaultfd, allowing userland to efficiently
   resolve the userfaults it receives via 1) or to manage the virtual
   memory in the background

The real advantage of userfaults compared to regular virtual memory
management with mremap/mprotect is that userfaults never involve
heavyweight structures like vmas in any of their operations (in fact
the userfaultfd runtime load never takes the mmap_sem for writing).

Vmas are not suitable for page- (or hugepage-) granular fault tracking
when dealing with virtual address spaces that could span
terabytes. Too many vmas would be needed for that.

The userfaultfd, once opened by invoking the syscall, can also be
passed over unix domain sockets to a manager process, so the same
manager process can handle the userfaults of a multitude of
different processes without them being aware of what is going on
(unless, of course, they later try to use the userfaultfd
themselves on the same region the manager is already tracking, which
is a corner case that would currently return -EBUSY).
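
Handing a descriptor to a manager process over a unix domain socket is
normally done with an SCM_RIGHTS control message. The sketch below
shows that mechanism in isolation; send_fd() and recv_fd() are
hypothetical helper names, not part of the userfaultfd API, and the
descriptor passed can be any open file descriptor, a userfaultfd
included.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper: send descriptor fd over the unix socket sock.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 0;  /* dummy data byte that carries the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;  /* forces suitable alignment of buf */
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor from the unix socket sock.
 * Returns the new descriptor, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET ||
        c->cmsg_type != SCM_RIGHTS || c->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

The process that owns the memory would call send_fd() with its
userfaultfd, and the manager would call recv_fd() and then service
faults on the descriptor it receives.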

== API ==

When first opened, the userfaultfd must be enabled by invoking the
UFFDIO_API ioctl with uffdio_api.api set to UFFD_API (or a later API
version). This specifies the read/POLLIN protocol userland intends to
speak on the UFFD and the uffdio_api.features userland requires. If
successful (i.e. if the requested uffdio_api.api is also spoken by the
running kernel and the requested features are going to be enabled),
the UFFDIO_API ioctl returns in uffdio_api.features and
uffdio_api.ioctls two 64-bit bitmasks of, respectively, all the
available features of the read(2) protocol and the generic ioctls
available.
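
A minimal sketch of the open and handshake steps follows. The helper
names uffd_create() and uffd_handshake() are hypothetical; the
structure, ioctl and constants come from <linux/userfaultfd.h>. The
sketch requests no optional features and assumes the calling process is
permitted to use the syscall at all.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: open a non-blocking userfaultfd.
 * Returns the descriptor, or -1 with errno set. */
int uffd_create(void)
{
    return (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
}

/* Hypothetical helper: run the UFFDIO_API handshake on uffd.
 * On success returns 0 and stores the two bitmasks the kernel reported. */
int uffd_handshake(int uffd, unsigned long long *features,
                   unsigned long long *ioctls)
{
    struct uffdio_api api;

    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;   /* protocol version userland intends to speak */
    api.features = 0;     /* no optional features requested */

    if (ioctl(uffd, UFFDIO_API, &api) < 0)
        return -1;

    *features = api.features;
    *ioctls = api.ioctls;

    /* UFFDIO_REGISTER must be advertised before it may be used. */
    if (!(api.ioctls & (1ULL << _UFFDIO_REGISTER))) {
        errno = ENOTSUP;
        return -1;
    }
    return 0;
}
```

A caller would run uffd_create() once, then uffd_handshake() on the
returned descriptor before issuing any other UFFDIO_* ioctl.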

Once the userfaultfd has been enabled, the UFFDIO_REGISTER ioctl should
be invoked (if present in the returned uffdio_api.ioctls bitmask) to
register a memory range in the userfaultfd by setting the
uffdio_register structure accordingly. The uffdio_register.mode
bitmask specifies to the kernel which kind of faults to track for
the range (UFFDIO_REGISTER_MODE_MISSING tracks missing
pages). The UFFDIO_REGISTER ioctl returns the
uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
userfaults on the registered range. Not all ioctls will necessarily be
supported for all memory types; this depends on the underlying virtual
memory backend (anonymous memory vs tmpfs vs real file-backed
mappings).

Userland can use the ioctls listed in uffdio_register.ioctls to manage
the virtual address space in the background (to add or potentially also
remove memory from the userfaultfd registered range). This means a
userfault could trigger just before userland maps the user-faulted
page in the background.

The primary ioctl to resolve userfaults is UFFDIO_COPY. It
atomically copies a page into the userfault registered range and wakes
up the blocked userfaults (unless uffdio_copy.mode &
UFFDIO_COPY_MODE_DONTWAKE is set). The other ioctls work similarly to
UFFDIO_COPY. They are atomic in the sense that nothing can ever see a
half-copied page, since any access keeps userfaulting until the copy
has finished.

== QEMU/KVM ==

QEMU/KVM uses the userfaultfd syscall to implement postcopy live
migration. Postcopy live migration is one form of memory
externalization consisting of a virtual machine running with part or
all of its memory residing on a different node in the cloud. The
userfaultfd abstraction is generic enough that not a single line of
KVM kernel code had to be modified in order to add postcopy live
migration to QEMU.

Guest async page faults, FOLL_NOWAIT and all other GUP features work
just fine in combination with userfaults. Userfaults trigger async
page faults in the guest scheduler so those guest processes that
aren't waiting for userfaults (i.e. network bound) can keep running in
the guest vcpus.

It is generally beneficial to run one pass of precopy live migration
just before starting postcopy live migration, in order to avoid
generating userfaults for read-only guest regions.

The implementation of postcopy live migration currently uses one
single bidirectional socket but in the future two different sockets
will be used (to reduce the latency of the userfaults to the minimum
possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).

The QEMU in the source node writes all pages that it knows are missing
in the destination node into the socket, and the migration thread of
the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
ioctls on the userfaultfd in order to map the received pages into the
guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).

A different postcopy thread in the destination node listens with
poll() to the userfaultfd in parallel. When a POLLIN event is
generated after a userfault triggers, the postcopy thread calls read()
on the userfaultfd and receives the fault address (or -EAGAIN in case
the userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE
run by the parallel QEMU migration thread).
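
One iteration of such a poll()/read() loop might look like the sketch
below. uffd_next_fault() is a hypothetical helper name and the
descriptor is assumed to have been opened with O_NONBLOCK; struct
uffd_msg and UFFD_EVENT_PAGEFAULT come from <linux/userfaultfd.h>.

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for one userfault.
 * Returns 1 and stores the fault address in *addr, 0 if nothing is
 * pending (timeout, or the fault was already resolved and read()
 * reported EAGAIN), or -1 with errno set on error. */
int uffd_next_fault(int uffd, int timeout_ms, unsigned long long *addr)
{
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    struct uffd_msg msg;

    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    if (!(pfd.revents & POLLIN)) {
        errno = EIO;  /* POLLERR, POLLHUP or POLLNVAL */
        return -1;
    }

    ssize_t r = read(uffd, &msg, sizeof(msg));
    if (r < 0)
        return errno == EAGAIN ? 0 : -1;
    if (r != (ssize_t)sizeof(msg) || msg.event != UFFD_EVENT_PAGEFAULT) {
        errno = EPROTO;  /* short read or an event this sketch ignores */
        return -1;
    }

    *addr = msg.arg.pagefault.address;
    return 1;
}
```

The postcopy thread would call this in a loop and forward each address
it obtains to the source node.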

After the QEMU postcopy thread (running in the destination node) gets
the userfault address, it writes the information about the missing page
into the socket. The QEMU source node receives the information,
roughly "seeks" to that page address and continues sending all
remaining missing pages from that new page offset. Soon after that
(just the time to flush the tcp_wmem queue through the network) the
migration thread in the QEMU running in the destination node will
receive the page that triggered the userfault and it'll map it as
usual with UFFDIO_COPY|ZEROPAGE (without actually knowing if it
was spontaneously sent by the source or if it was an urgent page
requested through a userfault).

By the time the userfaults start, the QEMU in the destination node
doesn't need to keep any per-page state bitmap relative to the live
migration around, and a single per-page bitmap has to be maintained in
the QEMU running in the source node to know which pages are still
missing in the destination node. The bitmap in the source node is
checked to find which missing pages to send in round robin, and we seek
over it when receiving incoming userfaults. Of course, the bitmap is
updated accordingly after each page is sent. It's also useful to avoid
sending the same page twice (in case the userfault is read by the
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
thread).
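
The source-side bookkeeping described above can be illustrated with a
small bitmap and a cursor. This is only a sketch of the idea, not
QEMU's data structure: struct missing_map and the mm_* functions are
hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

struct missing_map {
    uint64_t *bits;  /* bit i set: page i is still missing on the destination */
    size_t npages;
    size_t cursor;   /* page index where the round-robin scan resumes */
};

/* Returns nonzero if the page has not been sent yet. */
int mm_is_missing(const struct missing_map *m, size_t page)
{
    return (int)((m->bits[page / 64] >> (page % 64)) & 1u);
}

/* An incoming userfault moves the scan to the requested page. */
void mm_seek(struct missing_map *m, size_t page)
{
    if (page < m->npages)
        m->cursor = page;
}

/* Picks the next missing page at or after the cursor (wrapping around),
 * marks it as sent and returns its index. Returns npages when no page
 * is missing any more. */
size_t mm_next(struct missing_map *m)
{
    for (size_t n = 0; n < m->npages; n++) {
        size_t page = (m->cursor + n) % m->npages;

        if (mm_is_missing(m, page)) {
            m->bits[page / 64] &= ~(UINT64_C(1) << (page % 64));
            m->cursor = (page + 1) % m->npages;
            return page;
        }
    }
    return m->npages;
}
```

Because mm_next() clears the bit before returning, a userfault that
arrives for a page already sent only moves the cursor: the scan skips
that page and continues with the next one still missing, so no page is
sent twice.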