unshare system call:
--------------------
This document describes the new system call, unshare. The document
provides an overview of the feature, why it is needed, how it can
be used, its interface specification, design, implementation and
how it can be tested.

Change Log:
-----------
version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006

Contents:
---------
    1) Overview
    2) Benefits
    3) Cost
    4) Requirements
    5) Functional Specification
    6) High Level Design
    7) Low Level Design
    8) Test Specification
    9) Future Work

1) Overview
-----------
Most legacy operating system kernels support an abstraction of threads
as multiple execution contexts within a process. These kernels provide
special resources and mechanisms to maintain these "threads". The Linux
kernel, in a clever and simple manner, does not make a distinction
between processes and "threads". The kernel allows processes to share
resources and thus they can achieve legacy "threads" behavior without
requiring additional data structures and mechanisms in the kernel. The
power of implementing threads in this manner comes not only from
its simplicity but also from allowing application programmers to work
outside the confinement of all-or-nothing shared resources of legacy
threads. On Linux, at the time of thread creation using the clone system
call, applications can selectively choose which resources to share
between threads.
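
As a rough illustration of selective sharing at creation time, the
sketch below creates a child with the glibc clone() wrapper and checks
whether a chdir() in the child is visible in the parent. The helper
name parent_follows_child_chdir, the 64 KB stack and the use of /tmp as
a starting directory are assumptions made for this example only.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

/* Runs in the child: move to the root directory. */
static int child_chdir(void *arg)
{
    (void)arg;
    return chdir("/") == 0 ? 0 : 1;
}

/*
 * Create a child with the given clone flags, let it chdir("/"), and
 * report whether the parent's working directory moved as well.
 * Returns 1 if the parent ended up in "/", 0 if it did not, and -1 on
 * setup failure. Side effect: the caller's cwd is first set to /tmp.
 */
int parent_follows_child_chdir(int clone_flags)
{
    char cwd[4096];
    char *stack = malloc(STACK_SIZE);
    int status = 0, result = -1;
    pid_t pid;

    if (stack == NULL || chdir("/tmp") != 0)
        goto out;
    /* The stack grows down on common architectures, so pass its top. */
    pid = clone(child_chdir, stack + STACK_SIZE, clone_flags | SIGCHLD, NULL);
    if (pid < 0 || waitpid(pid, &status, 0) != pid)
        goto out;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        goto out;
    if (getcwd(cwd, sizeof(cwd)) != NULL)
        result = (strcmp(cwd, "/") == 0);
out:
    free(stack);
    return result;
}
```

With CLONE_FS the child shares the parent's filesystem information, so
the parent is expected to follow the child's chdir(); with no sharing
flags the child gets its own copy and the parent stays where it was.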

The unshare system call adds a primitive to the Linux thread model that
allows threads to selectively 'unshare' any resources that were being
shared at the time of their creation. unshare was conceptualized by
Al Viro in August of 2000, on the Linux-Kernel mailing list, as part
of the discussion on POSIX threads on Linux. unshare augments the
usefulness of Linux threads for applications that would like to control
shared resources without creating a new process. unshare is a natural
addition to the set of available primitives on Linux that implement
the concept of process/thread as a virtual machine.

2) Benefits
-----------
unshare would be useful to large application frameworks such as PAM
where creating a new process to control sharing/unsharing of process
resources is not possible. Since namespaces are shared by default
when creating a new process using fork or clone, unshare can benefit
even non-threaded applications if they need to disassociate from the
default shared namespace. The following lists two use-cases
where unshare can be used.

2.1 Per-security context namespaces
-----------------------------------
unshare can be used to implement polyinstantiated directories using
the kernel's per-process namespace mechanism. Polyinstantiated directories,
such as per-user and/or per-security context instance of /tmp, /var/tmp or
per-security context instance of a user's home directory, isolate user
processes when working with these directories. Using unshare, a PAM
module can easily set up a private namespace for a user at login.
Polyinstantiated directories are required for Common Criteria certification
with Labeled System Protection Profile. However, with the availability
of the shared-tree feature in the Linux kernel, even regular Linux systems
can benefit from setting up private namespaces at login and
polyinstantiating /tmp, /var/tmp and other directories deemed
appropriate by system administrators.
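
A minimal sketch of how a login-time component might polyinstantiate a
directory is shown below. It is not taken from any PAM module: the
function name polyinstantiate and both path arguments are hypothetical,
and marking "/" as a private mount is an assumption about how one might
keep the bind mount from propagating on a system that uses shared
trees. The calls require CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/mount.h>

/*
 * Give the calling process a private namespace and bind-mount
 * instance_dir over target (for example a per-user directory over
 * /tmp). Returns 0 on success or a negative errno value.
 */
int polyinstantiate(const char *instance_dir, const char *target)
{
    /* Detach from the shared namespace; EPERM without CAP_SYS_ADMIN. */
    if (unshare(CLONE_NEWNS) != 0)
        return -errno;

    /* Assumption: stop mount events from propagating to other
     * namespaces when "/" is a shared mount. */
    if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0)
        return -errno;

    /* Overlay the private instance on the well-known path. */
    if (mount(instance_dir, target, NULL, MS_BIND, NULL) != 0)
        return -errno;

    return 0;
}
```

Processes started by the caller after this point inherit the private
namespace, so they see the per-user instance at the usual path.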

2.2 unsharing of virtual memory and/or open files
-------------------------------------------------
Consider a client/server application where the server is processing
client requests by creating processes that share resources such as
virtual memory and open files. Without unshare, the server has to
decide what needs to be shared at the time of creating the process
which services the request. unshare allows the server to
disassociate parts of the context during the servicing of the
request. For large and complex middleware application frameworks, this
ability to unshare after the process was created can be very
useful.
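
The following sketch illustrates the idea for open files. A worker is
created sharing the parent's file descriptor table, detaches from it
with unshare(CLONE_FILES), and then closes a descriptor without
affecting the parent. The names worker and fd_survives_worker, the use
of /dev/null and the stack size are assumptions for this example.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

/* Worker created with CLONE_FILES: it starts out sharing the parent's
 * descriptor table and detaches from it before closing fd. */
static int worker(void *arg)
{
    int fd = *(int *)arg;

    if (unshare(CLONE_FILES) != 0)
        return 2;               /* could not get a private table */
    close(fd);                  /* affects only the worker's copy */
    return 0;
}

/*
 * Returns 1 if fd is still open in the parent after the worker ran,
 * 0 if the worker's close() leaked into the parent, and -1 if the
 * experiment could not be carried out (e.g. unshare was refused).
 */
int fd_survives_worker(void)
{
    int fd = open("/dev/null", O_RDONLY);
    char *stack = malloc(STACK_SIZE);
    int status = 0, result = -1;
    pid_t pid;

    if (fd < 0 || stack == NULL)
        goto out;
    pid = clone(worker, stack + STACK_SIZE, CLONE_FILES | SIGCHLD, &fd);
    if (pid < 0 || waitpid(pid, &status, 0) != pid)
        goto out;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        goto out;
    /* F_GETFD fails with EBADF if the descriptor was closed. */
    result = (fcntl(fd, F_GETFD) != -1);
out:
    if (fd >= 0)
        close(fd);
    free(stack);
    return result;
}
```

Without the unshare() call in the worker, its close() would act on the
shared table and the parent's descriptor would be closed as well.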

3) Cost
-------
In order to not duplicate code and to handle the fact that unshare
works on an active task (as opposed to clone/fork working on a newly
allocated inactive task), unshare had to make minor reorganizational
changes to the copy_* functions used by the clone/fork system calls.
There is a cost associated with altering existing, well tested and
stable code to implement a new feature that may not get exercised
extensively in the beginning. However, with proper design and code
review of the changes and creation of an unshare test for the LTP,
the benefits of this new feature can exceed its cost.

4) Requirements
---------------
unshare reverses sharing that was done using the clone(2) system call,
so unshare should have a similar interface to clone(2). That is,
since flags in clone(int flags, void *stack) specify what should
be shared, similar flags in unshare(int flags) should specify
what should be unshared. Unfortunately, this may appear to invert
the meaning of the flags from the way they are used in clone(2).
However, there was no easy solution that was less confusing and that
allowed incremental context unsharing in the future without an ABI change.

The unshare interface should accommodate possible future addition of
new context flags without requiring a rebuild of old applications.
If and when new context flags are added, the unshare design should allow
incremental unsharing of those resources on an as-needed basis.

5) Functional Specification
---------------------------
NAME
    unshare - disassociate parts of the process execution context

SYNOPSIS
    #include <sched.h>

    int unshare(int flags);

DESCRIPTION
    unshare allows a process to disassociate parts of its execution
    context that are currently being shared with other processes. Part
    of the execution context, such as the namespace, is shared by default
    when a new process is created using fork(2), while other parts,
    such as the virtual memory, open file descriptors, etc, may be
    shared by explicit request to share them when creating a process
    using clone(2).

    The main use of unshare is to allow a process to control its
    shared execution context without creating a new process.

    The flags argument specifies one or more of the following
    constants, bitwise-or'ed together.

    CLONE_FS
        If CLONE_FS is set, file system information of the caller
        is disassociated from the shared file system information.

    CLONE_FILES
        If CLONE_FILES is set, the file descriptor table of the
        caller is disassociated from the shared file descriptor
        table.

    CLONE_NEWNS
        If CLONE_NEWNS is set, the namespace of the caller is
        disassociated from the shared namespace.

    CLONE_VM
        If CLONE_VM is set, the virtual memory of the caller is
        disassociated from the shared virtual memory.

RETURN VALUE
    On success, zero is returned. On failure, -1 is returned and errno
    is set to indicate the error.

ERRORS
    EPERM   CLONE_NEWNS was specified by a non-root process (process
            without CAP_SYS_ADMIN).

    ENOMEM  Cannot allocate sufficient memory to copy parts of caller's
            context that need to be unshared.

    EINVAL  Invalid flag was specified as an argument.

CONFORMING TO
    The unshare() call is Linux-specific and should not be used
    in programs intended to be portable.

SEE ALSO
    clone(2), fork(2)
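
A caller might wrap the interface specified above roughly as follows.
The helper names try_unshare, unshare_error_text and unshare_or_report
are hypothetical; only unshare() itself, the CLONE_* constants and the
three errno values come from the specification.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>

/* Call unshare() and return 0 on success or the errno value on failure. */
int try_unshare(int flags)
{
    return unshare(flags) == 0 ? 0 : errno;
}

/* Map the error codes listed under ERRORS to a short explanation. */
const char *unshare_error_text(int err)
{
    switch (err) {
    case 0:      return "success";
    case EPERM:  return "not permitted (CLONE_NEWNS needs CAP_SYS_ADMIN)";
    case ENOMEM: return "out of memory while copying context";
    case EINVAL: return "invalid flag";
    default:     return "unexpected error";
    }
}

/* Unshare the given context and print a diagnostic on failure.
 * Returns 0 on success and -1 on failure. */
int unshare_or_report(int flags)
{
    int err = try_unshare(flags);

    if (err != 0) {
        fprintf(stderr, "unshare(0x%x): %s\n",
                (unsigned int)flags, unshare_error_text(err));
        return -1;
    }
    return 0;
}
```

For example, unshare_or_report(CLONE_FS | CLONE_FILES) asks for a
private copy of both the filesystem information and the file descriptor
table in one call.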

6) High Level Design
--------------------
Depending on the flags argument, the unshare system call allocates
appropriate process context structures, populates them with values from
the current shared version, associates the newly duplicated structures
with the current task structure and releases the corresponding shared
versions. Helper functions of clone (copy_*) could not be used
directly by unshare for the following two reasons.

  1) clone operates on a newly allocated not-yet-active task
     structure, whereas unshare operates on the current active
     task. Therefore unshare has to take the appropriate task_lock()
     before associating newly duplicated context structures.

  2) unshare has to allocate and duplicate all context structures
     that are being unshared, before associating them with the
     current task and releasing older shared structures. Failure
     to do so will create race conditions and/or oops when trying
     to back out due to an error. Consider the case of unsharing
     both virtual memory and namespace. After successfully unsharing
     vm, if the system call encounters an error while allocating
     the new namespace structure, the error return path will have to
     reverse the unsharing of vm. As part of the reversal the
     system call will have to go back to the older, shared, vm
     structure, which may not exist anymore.

Therefore code from copy_* functions that allocated and duplicated
the current context structure was moved into new dup_* functions. Now,
copy_* functions call dup_* functions to allocate and duplicate
appropriate context structures and then associate them with the
task structure that is being constructed. The unshare system call, on
the other hand, performs the following:

  1) Check flags to force missing, but implied, flags

  2) For each context structure, call the corresponding unshare
     helper function to allocate and duplicate a new context
     structure, if the appropriate bit is set in the flags argument.

  3) If there is no error in allocation and duplication and there
     are new context structures then lock the current task structure,
     associate new context structures with the current task structure,
     and release the lock on the current task structure.

  4) Appropriately release older, shared, context structures.
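
The ordering in steps 2 to 4 can be modeled in user space as shown
below. This is a simplified sketch, not kernel code: struct ctx, struct
task, the CTX_* flag bits and the fail_alloc_after test hook are all
invented for the illustration, and the point where the kernel would
hold task_lock() is only marked by a comment.

```c
#include <errno.h>
#include <stdlib.h>

#define CTX_FS    0x1   /* hypothetical flag bits for this model only */
#define CTX_FILES 0x2

struct ctx  { int refs; int data; };          /* stand-in for a context */
struct task { struct ctx *fs; struct ctx *files; };

/* Test hook: allocations that succeed before one fails; -1 = never. */
int fail_alloc_after = -1;

static struct ctx *dup_ctx(const struct ctx *old)
{
    struct ctx *c;

    if (fail_alloc_after == 0)
        return NULL;
    if (fail_alloc_after > 0)
        fail_alloc_after--;
    c = malloc(sizeof(*c));
    if (c != NULL) {
        c->refs = 1;
        c->data = old->data;
    }
    return c;
}

static void put_ctx(struct ctx *c)
{
    if (c != NULL && --c->refs == 0)
        free(c);
}

int unshare_model(struct task *t, int flags)
{
    struct ctx *new_fs = NULL, *new_files = NULL, *old;

    /* Step 2: allocate and duplicate everything; the task is untouched. */
    if ((flags & CTX_FS) && (new_fs = dup_ctx(t->fs)) == NULL)
        goto nomem;
    if ((flags & CTX_FILES) && (new_files = dup_ctx(t->files)) == NULL)
        goto nomem;

    /* Step 3: commit. The kernel would hold task_lock() around this. */
    if (new_fs != NULL) {
        old = t->fs; t->fs = new_fs; new_fs = old;
    }
    if (new_files != NULL) {
        old = t->files; t->files = new_files; new_files = old;
    }

    /* Step 4: release the older, shared structures. */
    put_ctx(new_fs);
    put_ctx(new_files);
    return 0;

nomem:
    /* Back out: only private, never-published copies are freed. */
    put_ctx(new_fs);
    put_ctx(new_files);
    return -ENOMEM;
}
```

Because nothing is attached to the task until every duplication has
succeeded, the error path never needs to restore an older structure.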

7) Low Level Design
-------------------
Implementation of unshare can be grouped in the following 4 different
items:

  a) Reorganization of existing copy_* functions
  b) unshare system call service function
  c) unshare helper functions for each different process context
  d) Registration of system call number for different architectures

7.1) Reorganization of copy_* functions
    Each copy function such as copy_mm, copy_namespace, copy_files,
    etc, had roughly two components. The first component allocated
    and duplicated the appropriate structure and the second component
    linked it to the task structure passed in as an argument to the copy
    function. The first component was split into its own function.
    These dup_* functions allocated and duplicated the appropriate
    context structure. The reorganized copy_* functions invoked
    their corresponding dup_* functions and then linked the newly
    duplicated structures to the task structure with which the
    copy function was called.
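
The split can be pictured with the following user-space model. The
struct and function names (fs_model, task_model, dup_fs_model,
copy_fs_model) and the MODEL_CLONE_FS bit are hypothetical stand-ins
and do not reproduce the kernel's actual definitions.

```c
#include <stdlib.h>

#define MODEL_CLONE_FS 0x1      /* hypothetical flag for this model */

struct fs_model   { int refs; int umask; };
struct task_model { struct fs_model *fs; };

/* First component: allocate and duplicate; no task is involved, so
 * the same function can serve both clone and unshare. */
struct fs_model *dup_fs_model(const struct fs_model *old)
{
    struct fs_model *fs = malloc(sizeof(*fs));

    if (fs != NULL) {
        fs->refs = 1;
        fs->umask = old->umask;
    }
    return fs;
}

/* Second component: link either the shared structure or a fresh
 * duplicate into the task that is being constructed. */
int copy_fs_model(int flags, const struct task_model *parent,
                  struct task_model *child)
{
    if (flags & MODEL_CLONE_FS) {
        parent->fs->refs++;             /* share the existing structure */
        child->fs = parent->fs;
        return 0;
    }
    child->fs = dup_fs_model(parent->fs);
    return child->fs != NULL ? 0 : -1;
}
```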

7.2) unshare system call service function
    * Check flags
      Force implied flags. If CLONE_THREAD is set, force CLONE_VM.
      If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
      set and signals are also being shared, force CLONE_THREAD. If
      CLONE_NEWNS is set, force CLONE_FS.
    * For each context flag, invoke the corresponding unshare_*
      helper routine with the flags passed into the system call and a
      reference to a pointer that will point to the new unshared
      structure.
    * If any new structures are created by unshare_* helper
      functions, take the task_lock() on the current task,
      modify appropriate context pointers, and release the
      task lock.
    * For all newly unshared structures, release the corresponding
      older, shared, structures.
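
The flag check in the first bullet can be written as a small pure
function. The sketch below applies the rules in the order they are
listed; the function name force_implied_flags and the signals_shared
parameter (standing in for the kernel's own test of whether signals are
shared) are assumptions.

```c
#define _GNU_SOURCE
#include <sched.h>

/*
 * Return flags with all implied bits forced on. signals_shared is
 * non-zero when the caller's signals are shared with another task.
 */
unsigned long force_implied_flags(unsigned long flags, int signals_shared)
{
    /* Unsharing the thread group implies unsharing the vm. */
    if (flags & CLONE_THREAD)
        flags |= CLONE_VM;

    /* Unsharing the vm implies unsharing signal handlers. */
    if (flags & CLONE_VM)
        flags |= CLONE_SIGHAND;

    /* Unsharing signal handlers while signals are shared implies
     * unsharing the thread group. */
    if ((flags & CLONE_SIGHAND) && signals_shared)
        flags |= CLONE_THREAD;

    /* Unsharing the namespace implies unsharing filesystem info. */
    if (flags & CLONE_NEWNS)
        flags |= CLONE_FS;

    return flags;
}
```

For instance, a request for CLONE_NEWNS alone comes back as
CLONE_NEWNS | CLONE_FS.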

7.3) unshare_* helper functions
    For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
    and CLONE_THREAD, return -EINVAL since they are not implemented yet.
    For others, check the flag value to see if the unsharing is
    required for that structure. If it is, invoke the corresponding
    dup_* function to allocate and duplicate the structure and return
    a pointer to it.
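
A user-space model of the two kinds of helper might look like this.
All names and flag bits are hypothetical, and returning -EINVAL only
when the unimplemented flag is actually requested is an assumption
about the intended behavior.

```c
#include <errno.h>
#include <stdlib.h>

#define MODEL_CLONE_FS      0x1   /* hypothetical flag bits, model only */
#define MODEL_CLONE_SIGHAND 0x2

struct fs_model { int refs; int umask; };

static struct fs_model *dup_fs_model(const struct fs_model *old)
{
    struct fs_model *fs = malloc(sizeof(*fs));

    if (fs != NULL) {
        fs->refs = 1;
        fs->umask = old->umask;
    }
    return fs;
}

/* Helper for a context whose unsharing is not implemented: reject
 * the request when its flag is set, otherwise do nothing. */
int unshare_sighand_model(int flags)
{
    return (flags & MODEL_CLONE_SIGHAND) ? -EINVAL : 0;
}

/* Helper for a supported context: duplicate only if requested and
 * hand the new structure back through new_fsp. */
int unshare_fs_model(int flags, const struct fs_model *cur,
                     struct fs_model **new_fsp)
{
    *new_fsp = NULL;
    if (!(flags & MODEL_CLONE_FS))
        return 0;               /* nothing to unshare */
    *new_fsp = dup_fs_model(cur);
    return *new_fsp != NULL ? 0 : -ENOMEM;
}
```

A NULL result with a zero return tells the caller that no new
structure was created, so there is nothing to commit or release for
that context.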

7.4) Appropriately modify architecture specific code to register the
    new system call.

8) Test Specification
---------------------
The test for unshare should test the following:

  1) Valid flags: Test to check that clone flags for signal and
     signal handlers, for which unsharing is not implemented
     yet, return -EINVAL.

  2) Missing/implied flags: Test to make sure that unsharing the
     namespace without specifying unsharing of the filesystem correctly
     unshares both namespace and filesystem information.

  3) For each of the four (namespace, filesystem, files and vm)
     supported unsharing, verify that the system call correctly
     unshares the appropriate structure. Verify that unsharing
     them individually as well as in combination with each
     other works as expected.

  4) Concurrent execution: Use shared memory segments and futex on
     an address in the shm segment to synchronize execution of
     about 10 threads. Have a couple of threads execute execve,
     a couple _exit and the rest unshare with different combinations
     of flags. Verify that unsharing is performed as expected and
     that there are no oops or hangs.
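
As one possible shape for the filesystem part of item 3, the sketch
below creates a child that shares filesystem information with its
parent, has the child call unshare(CLONE_FS) and then chdir(), and
checks that the parent's working directory did not move. The function
names, exit codes, stack size and the use of /tmp are assumptions; this
is not the LTP test.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)
#define EXIT_UNSHARE_FAILED 2

/* Child: detach from the shared filesystem information, then move. */
static int child_unshare_then_chdir(void *arg)
{
    (void)arg;
    if (unshare(CLONE_FS) != 0)
        return EXIT_UNSHARE_FAILED;
    return chdir("/") == 0 ? 0 : 1;
}

/*
 * Returns 1 if the parent's cwd is unchanged (pass), 0 if it followed
 * the child (fail), and -1 if the test could not be run.
 * Side effect: the caller's cwd is set to /tmp.
 */
int test_unshare_fs(void)
{
    char before[4096], after[4096];
    char *stack = malloc(STACK_SIZE);
    int status = 0, result = -1;
    pid_t pid;

    if (stack == NULL || chdir("/tmp") != 0 ||
        getcwd(before, sizeof(before)) == NULL)
        goto out;
    pid = clone(child_unshare_then_chdir, stack + STACK_SIZE,
                CLONE_FS | SIGCHLD, NULL);
    if (pid < 0 || waitpid(pid, &status, 0) != pid)
        goto out;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        goto out;
    if (getcwd(after, sizeof(after)) != NULL)
        result = (strcmp(before, after) == 0);
out:
    free(stack);
    return result;
}
```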

9) Future Work
--------------
The current implementation of unshare does not allow unsharing of
signals and signal handlers. Signals are complex to begin with, and
unsharing signals and/or signal handlers of a currently running
process is even more complex. If in the future there is a specific
need to allow unsharing of signals and/or signal handlers, it can
be incrementally added to unshare without affecting legacy
applications using unshare.