hrtimers - subsystem for high-resolution kernel timers
------------------------------------------------------

This patch introduces a new subsystem for high-resolution kernel timers.

One might ask the question: we already have a timer subsystem
(kernel/timer.c), why do we need two timer subsystems? After a lot of
back and forth trying to integrate high-resolution and high-precision
features into the existing timer framework, and after testing various
such high-resolution timer implementations in practice, we came to the
conclusion that the timer wheel code is fundamentally not suitable for
such an approach. We initially didn't believe this ('there must be a way
to solve this'), and spent a considerable effort trying to integrate
things into the timer wheel, but we failed. In hindsight, there are
several reasons why such integration is hard/impossible:

- the forced handling of low-resolution and high-resolution timers in
  the same way leads to a lot of compromises, macro magic and #ifdef
  mess. The timer.c code is very "tightly coded" around jiffies and
  32-bitness assumptions, and has been honed and micro-optimized for a
  relatively narrow use case (jiffies in a relatively narrow HZ range)
  for many years - and thus even small extensions to it easily break
  the wheel concept, leading to even worse compromises. The timer wheel
  code is very good and tight code, and there are no problems with it in
  its current usage - but it is simply not suitable to be extended for
  high-res timers.

- the unpredictable [O(N)] overhead of cascading leads to delays which
  necessitate a more complex handling of high-resolution timers, which
  in turn decreases robustness. Such a design still led to rather large
  timing inaccuracies. Cascading is a fundamental property of the timer
  wheel concept; it cannot be 'designed out' without inevitably
  degrading other portions of the timer.c code in an unacceptable way.

- the implementation of the current posix-timer subsystem on top of
  the timer wheel has already introduced a quite complex handling of
  the required readjusting of absolute CLOCK_REALTIME timers at
  settimeofday or NTP time - further underlining our experience by
  example: the timer wheel data structure is too rigid for high-res
  timers.

- the timer wheel code is best suited to use cases which can be
  identified as "timeouts". Such timeouts are usually set up to cover
  error conditions in various I/O paths, such as networking and block
  I/O. The vast majority of those timers never expire and are rarely
  recascaded because the expected correct event arrives in time so they
  can be removed from the timer wheel before any further processing of
  them becomes necessary. Thus the users of these timeouts can accept
  the granularity and precision tradeoffs of the timer wheel, and
  largely expect the timer subsystem to have near-zero overhead.
  Accurate timing for them is not a core purpose - in fact most of the
  timeout values used are ad-hoc. For them it is at most a necessary
  evil to guarantee the processing of actual timeout completions
  (because most of the timeouts are deleted before completion), which
  should thus be as cheap and unintrusive as possible.

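The granularity tradeoff named in the list above can be made concrete
with a little arithmetic. The following is a minimal user-space sketch,
not kernel code: tick_ns() and round_up_to_tick() are hypothetical
helpers which model a timer whose expiry can only fall on a tick
boundary at a given HZ. The real timer wheel conversion differs in
detail.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Length of one tick in nanoseconds for a given tick frequency. */
static int64_t tick_ns(int hz)
{
	assert(hz > 0);
	return NSEC_PER_SEC / hz;
}

/* Smallest tick-aligned delay which is not shorter than the request. */
static int64_t round_up_to_tick(int64_t request_ns, int hz)
{
	int64_t tick = tick_ns(hz);
	int64_t ticks = (request_ns + tick - 1) / tick;

	return ticks * tick;
}
```

Under these assumptions a 1.5 ms request becomes 4 ms at HZ=250 and
10 ms at HZ=100, which is fine for an I/O timeout but not for a precise
timer.
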
The primary users of precision timers are user-space applications that
utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
users like drivers and subsystems which require precise timed events
(e.g. multimedia) can benefit from the availability of a separate
high-resolution timer subsystem as well.

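As an illustration of such a user-space consumer, the sketch below
sleeps until an absolute CLOCK_MONOTONIC deadline with the POSIX
clock_nanosleep() call and measures how long the sleep really took. The
helper names timespec_diff_ns() and sleep_and_measure() are made up for
this example, and the delay is assumed to be non-negative.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Nanoseconds elapsed from a to b. */
static int64_t timespec_diff_ns(const struct timespec *a,
				const struct timespec *b)
{
	return (int64_t)(b->tv_sec - a->tv_sec) * NSEC_PER_SEC +
	       (b->tv_nsec - a->tv_nsec);
}

/*
 * Sleep until an absolute deadline delay_ns from now and return the
 * measured duration in nanoseconds, or -1 on error.
 */
static int64_t sleep_and_measure(long delay_ns)
{
	struct timespec start, deadline, end;
	int ret;

	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return -1;

	deadline = start;
	deadline.tv_sec += delay_ns / NSEC_PER_SEC;
	deadline.tv_nsec += delay_ns % NSEC_PER_SEC;
	if (deadline.tv_nsec >= NSEC_PER_SEC) {
		deadline.tv_nsec -= NSEC_PER_SEC;
		deadline.tv_sec++;
	}

	/* clock_nanosleep() returns the error number; retry on signals. */
	do {
		ret = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
				      &deadline, NULL);
	} while (ret == EINTR);
	if (ret != 0)
		return -1;

	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return -1;
	return timespec_diff_ns(&start, &end);
}
```

The difference between the measured duration and the requested delay is
the wakeup latency, which is what a precise timer subsystem tries to
keep small.
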
While this subsystem does not offer high-resolution clock sources just
yet, the hrtimer subsystem can be easily extended with high-resolution
clock capabilities, and patches for that exist and are maturing quickly.

The increasing demand for realtime and multimedia applications along
with other potential users for precise timers gives another reason to
separate the "timeout" and "precise timer" subsystems.

Another potential benefit is that such a separation allows even more
special-purpose optimization of the existing timer wheel for the low
resolution and low precision use cases - once the precision-sensitive
APIs are separated from the timer wheel and are migrated over to
hrtimers. E.g. we could decrease the frequency of the timeout subsystem
from 250 Hz to 100 Hz (or even lower).

hrtimer subsystem implementation details
----------------------------------------

The basic design considerations were:

- simplicity

- data structure not bound to jiffies or any other granularity. All the
  kernel logic works at 64-bit nanoseconds resolution - no compromises.

- simplification of existing, timing related kernel code

Another basic requirement was the immediate enqueueing and ordering of
timers at activation time. After looking at several possible solutions
such as radix trees and hashes, we chose the red-black tree as the basic
data structure. Rbtrees are available as a library in the kernel and are
used in various performance-critical areas of e.g. memory management and
file systems. The rbtree is solely used for time sorted ordering, while
a separate list is used to give the expiry code fast access to the
queued timers, without having to walk the rbtree.

(This separate list is also useful for later when we'll introduce
high-resolution clocks, where we need separate pending and expired
queues while keeping the time-order intact.)

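The idea of ordering timers at activation time can be sketched in
user-space C. Note the substitution: to stay short, this sketch keeps
the timers in a sorted singly linked list instead of an rbtree, so
insertion is O(N) here rather than O(log N). All names (model_timer,
model_queue, model_enqueue, model_pop_expired) are hypothetical and are
not the hrtimer API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a timer: only the absolute expiry matters here. */
struct model_timer {
	int64_t expires_ns;		/* absolute expiry in nanoseconds */
	struct model_timer *next;	/* next timer in expiry order */
};

/* The first element is always the next timer to expire. */
struct model_queue {
	struct model_timer *first;
};

/* Insert at activation time so that the queue stays sorted by expiry. */
static void model_enqueue(struct model_queue *q, struct model_timer *t)
{
	struct model_timer **link = &q->first;

	/* Timers with equal expiry keep their activation order. */
	while (*link && (*link)->expires_ns <= t->expires_ns)
		link = &(*link)->next;
	t->next = *link;
	*link = t;
}

/* Remove and return the earliest timer if it has expired, else NULL. */
static struct model_timer *model_pop_expired(struct model_queue *q,
					     int64_t now_ns)
{
	struct model_timer *t = q->first;

	if (!t || t->expires_ns > now_ns)
		return NULL;
	q->first = t->next;
	t->next = NULL;
	return t;
}
```

Because the queue is always sorted, the expiry path only ever looks at
the head and stops at the first timer which is not yet due.
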
Time-ordered enqueueing is not purely for the purposes of
high-resolution clocks though, it also simplifies the handling of
absolute timers based on a low-resolution CLOCK_REALTIME. The existing
implementation needed to keep an extra list of all armed absolute
CLOCK_REALTIME timers along with complex locking. In case of
settimeofday and NTP, all the timers (!) had to be dequeued, the
time-changing code had to fix them up one by one, and all of them had to
be enqueued again. The time-ordered enqueueing and the storage of the
expiry time in absolute time units removes all this complex and poorly
scaling code from the posix-timer implementation - the clock can simply
be set without having to touch the rbtree. This also makes the handling
of posix-timers simpler in general.

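Why absolute expiry times make a clock step cheap can be shown with a
small model. This is a sketch under stated assumptions, not the hrtimer
implementation: the wall clock is modelled as a steadily advancing
monotonic value plus an offset, and the names model_clock,
model_set_realtime() and model_count_due() are invented for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed clock model: wall time = monotonic time + offset. */
struct model_clock {
	int64_t monotonic_ns;	/* advances steadily */
	int64_t offset_ns;	/* changed by a settimeofday-like step */
};

static int64_t model_realtime(const struct model_clock *c)
{
	return c->monotonic_ns + c->offset_ns;
}

/* Step the wall clock: only the clock changes, no timer is touched. */
static void model_set_realtime(struct model_clock *c, int64_t new_ns)
{
	c->offset_ns = new_ns - c->monotonic_ns;
}

/* Count the due timers at the head of a sorted array of absolute expiries. */
static size_t model_count_due(const int64_t *expiries_ns, size_t n,
			      const struct model_clock *c)
{
	int64_t now = model_realtime(c);
	size_t due = 0;

	while (due < n && expiries_ns[due] <= now)
		due++;
	return due;
}
```

After model_set_realtime() the array of expiries is unchanged; the next
expiry check simply compares the same absolute values against the new
clock reading.
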
The locking and per-CPU behavior of hrtimers was mostly taken from the
existing timer wheel code, as it is mature and well suited. Sharing code
was not really a win, due to the different data structures. Also, the
hrtimer functions now have clearer behavior and clearer names - such as
hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
equivalent to del_timer() and del_timer_sync()] - so there's no direct
1:1 mapping between them on the algorithmic level, and thus no real
potential for code sharing either.

Basic data types: every time value, absolute or relative, is in a
special nanosecond-resolution type: ktime_t. The kernel-internal
representation of ktime_t values and operations is implemented via
macros and inline functions, and can be switched between a "hybrid
union" type and a plain "scalar" 64-bit nanoseconds representation (at
compile time). The hybrid union type optimizes time conversions on
32-bit CPUs. This build-time-selectable ktime_t storage format was
implemented to avoid the performance impact of 64-bit multiplications
and divisions on 32-bit CPUs. Such operations are frequently necessary
to convert between the storage formats provided by kernel and userspace
interfaces and the internal time format. (See include/linux/ktime.h for
further details.)

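The difference between the two storage formats can be sketched as
follows. This is a simplified user-space model, not the definition from
include/linux/ktime.h: scalar_time, hybrid_time and the helper
functions are hypothetical names, and the hybrid form is shown as a
plain struct rather than a union.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* "Scalar" model: one 64-bit nanosecond count. */
typedef int64_t scalar_time;

/* "Hybrid" model: seconds and nanoseconds kept apart, as in a timespec. */
typedef struct {
	int32_t sec;
	int32_t nsec;	/* always in [0, NSEC_PER_SEC) */
} hybrid_time;

/* Scalar conversion needs a 64-bit multiplication. */
static scalar_time scalar_from_timespec(const struct timespec *ts)
{
	return (int64_t)ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

/* Hybrid conversion is two plain stores. */
static hybrid_time hybrid_from_timespec(const struct timespec *ts)
{
	hybrid_time t = { (int32_t)ts->tv_sec, (int32_t)ts->tv_nsec };

	return t;
}

/* Hybrid addition: add both fields, then carry with compare and subtract. */
static hybrid_time hybrid_add(hybrid_time a, hybrid_time b)
{
	hybrid_time r = { a.sec + b.sec, a.nsec + b.nsec };

	if (r.nsec >= NSEC_PER_SEC) {
		r.nsec -= NSEC_PER_SEC;
		r.sec++;
	}
	return r;
}

/* Only used in this sketch to compare against the scalar form. */
static int64_t hybrid_to_ns(hybrid_time t)
{
	return (int64_t)t.sec * NSEC_PER_SEC + t.nsec;
}
```

In this model the hybrid form converts from and to a timespec without
any multiplication or division, while the scalar form makes addition
and comparison a single 64-bit operation.
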
hrtimers - rounding of timer values
-----------------------------------

The hrtimer code rounds timer events to the resolution of a
lower-resolution clock only because it has to. Otherwise it does no
artificial rounding at all.

One question is what resolution value the clock_getres() interface
should return to the user. It returns whatever real resolution a given
clock has - be it low-res, high-res, or artificially-low-res.

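From user space the reported resolution can be read with the POSIX
clock_getres() call. The wrapper below is a minimal sketch;
clock_resolution_ns() is a name made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Resolution reported for a clock in nanoseconds, or -1 on error. */
static int64_t clock_resolution_ns(clockid_t clk)
{
	struct timespec res;

	if (clock_getres(clk, &res) != 0)
		return -1;
	return (int64_t)res.tv_sec * 1000000000LL + res.tv_nsec;
}
```

Calling clock_resolution_ns(CLOCK_MONOTONIC) on a system with a
low-resolution clock would be expected to report roughly one tick
period, and a much smaller value where a high-resolution clock is in
use; the exact numbers depend on the kernel configuration.
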
hrtimers - testing and verification
-----------------------------------

We used the high-resolution clock subsystem on top of hrtimers to verify
the hrtimer implementation details in practice, and we also ran the
posix timer tests in order to ensure specification compliance. We also
ran tests on low-resolution clocks.

The hrtimer patch converts the following kernel functionality to use
hrtimers:

 - nanosleep
 - itimers
 - posix-timers

The conversion of nanosleep and posix-timers enabled the unification of
nanosleep and clock_nanosleep.

The code was successfully compiled for the following platforms:

 i386, x86_64, ARM, PPC, PPC64, IA64

The code was run-tested on the following platforms:

 i386(UP/SMP), x86_64(UP/SMP), ARM, PPC

hrtimers were also integrated into the -rt tree, along with a
hrtimers-based high-resolution clock implementation, so the hrtimers
code got a healthy amount of testing and use in practice.

	Thomas Gleixner, Ingo Molnar