- hrtimers - subsystem for high-resolution kernel timers
- ----------------------------------------------------
- This patch introduces a new subsystem for high-resolution kernel timers.
- One might ask the question: we already have a timer subsystem
- (kernel/timers.c), why do we need two timer subsystems? After a lot of
- back and forth trying to integrate high-resolution and high-precision
- features into the existing timer framework, and after testing various
- such high-resolution timer implementations in practice, we came to the
- conclusion that the timer wheel code is fundamentally not suitable for
- such an approach. We initially didn't believe this ('there must be a way
- to solve this'), and spent a considerable effort trying to integrate
- things into the timer wheel, but we failed. In hindsight, there are
- several reasons why such integration is hard/impossible:
- - the forced handling of low-resolution and high-resolution timers in
- the same way leads to a lot of compromises, macro magic and #ifdef
- mess. The timers.c code is very "tightly coded" around jiffies and
- 32-bitness assumptions, and has been honed and micro-optimized for a
- relatively narrow use case (jiffies in a relatively narrow HZ range)
- for many years - and thus even small extensions to it easily break
- the wheel concept, leading to even worse compromises. The timer wheel
- code is very good and tight code, and there are zero problems with it
- in its current usage - but it is simply not suitable to be extended
- for high-res timers.
- - the unpredictable [O(N)] overhead of cascading leads to delays which
- necessitate a more complex handling of high resolution timers, which
- in turn decreases robustness. Such a design still led to rather large
- timing inaccuracies. Cascading is a fundamental property of the timer
- wheel concept; it cannot be 'designed out' without inevitably
- degrading other portions of the timers.c code in an unacceptable way.
- - the implementation of the current posix-timer subsystem on top of
- the timer wheel has already introduced a quite complex handling of
- the required readjusting of absolute CLOCK_REALTIME timers at
- settimeofday or NTP time - further underlining our experience by
- example: the timer wheel data structure is too rigid for high-res
- timers.
- the timer wheel code is best suited to use cases which can be
- identified as "timeouts". Such timeouts are usually set up to cover
- error conditions in various I/O paths, such as networking and block
- I/O. The vast majority of those timers never expire and are rarely
- recascaded because the expected correct event arrives in time so they
- can be removed from the timer wheel before any further processing of
- them becomes necessary. Thus the users of these timeouts can accept
- the granularity and precision tradeoffs of the timer wheel, and
- largely expect the timer subsystem to have near-zero overhead.
- Accurate timing for them is not a core purpose - in fact most of the
- timeout values used are ad-hoc. For them it is at most a necessary
- evil to guarantee the processing of actual timeout completions
- (because most of the timeouts are deleted before completion), which
- should thus be as cheap and unintrusive as possible.
- The primary users of precision timers are user-space applications that
- utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
- users like drivers and subsystems which require precise timed events
- (e.g. multimedia) can benefit from the availability of a separate
- high-resolution timer subsystem as well.
- While this subsystem does not offer high-resolution clock sources just
- yet, the hrtimer subsystem can be easily extended with high-resolution
- clock capabilities, and patches for that exist and are maturing quickly.
- The increasing demand for realtime and multimedia applications along
- with other potential users for precise timers gives another reason to
- separate the "timeout" and "precise timer" subsystems.
- Another potential benefit is that such a separation allows even more
- special-purpose optimization of the existing timer wheel for the low
- resolution and low precision use cases - once the precision-sensitive
- APIs are separated from the timer wheel and are migrated over to
- hrtimers. E.g. we could decrease the frequency of the timeout subsystem
- from 250 Hz to 100 Hz (or even lower).
- hrtimer subsystem implementation details
- ----------------------------------------
- The basic design considerations were:
- - simplicity
- - data structure not bound to jiffies or any other granularity. All the
- kernel logic works at 64-bit nanoseconds resolution - no compromises.
- - simplification of existing, timing related kernel code
- Another basic requirement was the immediate enqueueing and ordering of
- timers at activation time. After looking at several possible solutions
- such as radix trees and hashes, we chose the red-black tree as the basic
- data structure. Rbtrees are available as a library in the kernel and are
- used in various performance-critical areas of e.g. memory management and
- file systems. The rbtree is solely used for time sorted ordering, while
- a separate list is used to give the expiry code fast access to the
- queued timers, without having to walk the rbtree.
- (This separate list is also useful for later when we'll introduce
- high-resolution clocks, where we need separate pending and expired
- queues while keeping the time-order intact.)
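The enqueue-at-activation idea described above can be sketched in user-space C. This is only an illustration, not the kernel's code: every name here (sketch_timer, sketch_queue, sketch_enqueue, sketch_pop_expired) is hypothetical, and a sorted linked list stands in for the rbtree, so insertion is O(N) where a balanced tree would need O(log N) steps. What the sketch does share with the text is that ordering happens when the timer is armed, and that the expiry path only ever looks at the earliest timer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the kernel's hrtimer code. */
struct sketch_timer {
    int64_t expires_ns;         /* absolute expiry time in nanoseconds */
    struct sketch_timer *next;  /* next timer in expiry order */
};

struct sketch_queue {
    struct sketch_timer *first; /* earliest-expiring timer, or NULL */
};

/*
 * Insert at activation time, keeping the queue sorted by expiry.
 * A sorted list stands in for the rbtree here, so this walk is O(N).
 * Timers with equal expiry keep their insertion order.
 */
static void sketch_enqueue(struct sketch_queue *q, struct sketch_timer *t)
{
    struct sketch_timer **link = &q->first;

    while (*link && (*link)->expires_ns <= t->expires_ns)
        link = &(*link)->next;
    t->next = *link;
    *link = t;
}

/*
 * Remove and return the earliest timer if it has expired at 'now_ns',
 * otherwise return NULL. Only the head is inspected, so the expiry
 * path never has to search the queue.
 */
static struct sketch_timer *sketch_pop_expired(struct sketch_queue *q,
                                               int64_t now_ns)
{
    struct sketch_timer *t = q->first;

    if (!t || t->expires_ns > now_ns)
        return NULL;
    q->first = t->next;
    t->next = NULL;
    return t;
}
```

A caller would arm timers in any order with sketch_enqueue() and then call sketch_pop_expired() repeatedly from its tick or interrupt path until it returns NULL.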
- Time-ordered enqueueing is not purely for the purposes of
- high-resolution clocks though, it also simplifies the handling of
- absolute timers based on a low-resolution CLOCK_REALTIME. The existing
- implementation needed to keep an extra list of all armed absolute
- CLOCK_REALTIME timers along with complex locking. In case of
- settimeofday and NTP, all the timers (!) had to be dequeued, the
- time-changing code had to fix them up one by one, and all of them had to
- be enqueued again. The time-ordered enqueueing and the storage of the
- expiry time in absolute time units removes all this complex and poorly
- scaling code from the posix-timer implementation - the clock can simply
- be set without having to touch the rbtree. This also makes the handling
- of posix-timers simpler in general.
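To illustrate why storing absolute expiry times lets the clock be set without touching the queue, here is a small user-space sketch. It assumes, purely for illustration, that the wall clock is modelled as a steadily advancing base plus an adjustable offset; the names sketch_clock, sketch_clock_set and sketch_count_expired are hypothetical and not part of any kernel interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wall clock: a steadily advancing base plus an offset. */
struct sketch_clock {
    int64_t monotonic_ns;  /* advances steadily, never set */
    int64_t offset_ns;     /* changes when the clock is set */
};

static int64_t sketch_clock_now(const struct sketch_clock *c)
{
    return c->monotonic_ns + c->offset_ns;
}

/* Setting the clock only changes the offset; no timer is touched. */
static void sketch_clock_set(struct sketch_clock *c, int64_t new_now_ns)
{
    c->offset_ns = new_now_ns - c->monotonic_ns;
}

/*
 * Count how many timers have expired. 'expires_ns' holds absolute
 * expiry times sorted in ascending order, so the scan stops at the
 * first timer that is still in the future.
 */
static size_t sketch_count_expired(const int64_t *expires_ns, size_t n,
                                   const struct sketch_clock *c)
{
    int64_t now = sketch_clock_now(c);
    size_t i = 0;

    while (i < n && expires_ns[i] <= now)
        i++;
    return i;
}
```

After sketch_clock_set() moves the clock forward or backward, the next expiry check simply compares the unchanged expiry values against the new time, so nothing has to be dequeued and requeued.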
- The locking and per-CPU behavior of hrtimers was mostly taken from the
- existing timer wheel code, as it is mature and well suited. Sharing code
- was not really a win, due to the different data structures. Also, the
- hrtimer functions now have clearer behavior and clearer names - such as
- hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
- equivalent to del_timer() and del_timer_sync()] - so there's no direct
- 1:1 mapping between them on the algorithmic level, and thus no real
- potential for code sharing either.
- Basic data types: every time value, absolute or relative, is in a
- special nanosecond-resolution type: ktime_t. The kernel-internal
- representation of ktime_t values and operations is implemented via
- macros and inline functions, and can be switched between a "hybrid
- union" type and a plain "scalar" 64-bit nanoseconds representation (at
- compile time). The hybrid union type optimizes time conversions on 32-bit
- CPUs. This build-time-selectable ktime_t storage format was implemented
- to avoid the performance impact of 64-bit multiplications and divisions
- on 32-bit CPUs. Such operations are frequently necessary to convert
- between the storage formats provided by kernel and userspace interfaces
- and the internal time format. (See include/linux/ktime.h for further
- details.)
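As a rough user-space illustration of the plain "scalar" representation, the sketch below stores a time value as signed 64-bit nanoseconds and converts to and from struct timespec. The sketch_ prefixed names are hypothetical and are not the helpers from include/linux/ktime.h. The 64-bit multiplication and division in the conversions are the kind of operation the text says the hybrid union format was meant to avoid on 32-bit CPUs.

```c
#include <stdint.h>
#include <time.h>

#define SKETCH_NSEC_PER_SEC 1000000000LL

/* Hypothetical scalar time type: signed 64-bit nanoseconds. */
typedef int64_t sketch_ktime_t;

/* Build a time value from seconds and nanoseconds (64-bit multiply). */
static sketch_ktime_t sketch_ktime_set(int64_t secs, long nsecs)
{
    return secs * SKETCH_NSEC_PER_SEC + nsecs;
}

static sketch_ktime_t sketch_timespec_to_ktime(struct timespec ts)
{
    return sketch_ktime_set((int64_t)ts.tv_sec, ts.tv_nsec);
}

/*
 * Convert back to seconds and nanoseconds (64-bit divide and modulo).
 * This sketch only handles non-negative values.
 */
static struct timespec sketch_ktime_to_timespec(sketch_ktime_t kt)
{
    struct timespec ts;

    ts.tv_sec = (time_t)(kt / SKETCH_NSEC_PER_SEC);
    ts.tv_nsec = (long)(kt % SKETCH_NSEC_PER_SEC);
    return ts;
}

/* With a scalar representation, addition is a single 64-bit add. */
static sketch_ktime_t sketch_ktime_add(sketch_ktime_t a, sketch_ktime_t b)
{
    return a + b;
}
```

For example, adding 0.75 s to a timespec of 2.5 s and converting back would give 3 s and 250000000 ns.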
- hrtimers - rounding of timer values
- -----------------------------------
- The hrtimer code will round timer events to lower-resolution clocks
- because it has to. Otherwise it will do no artificial rounding at all.
- One question is what resolution value should be returned to the user by
- the clock_getres() interface. It will return whatever real resolution
- a given clock has - be it low-res, high-res, or artificially-low-res.
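From user space, the resolution discussed above can be queried with the POSIX clock_getres() call. The sketch below wraps that call and adds a hypothetical helper that rounds an expiry time to a multiple of the resolution. The rounding direction is an assumption made for this example (rounding up, so that a timer does not fire early); the text above does not specify it, and the sketch_ names are not kernel functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Return the resolution of a clock in nanoseconds, or -1 on error. */
static int64_t sketch_clock_res_ns(clockid_t id)
{
    struct timespec res;

    if (clock_getres(id, &res) != 0)
        return -1;
    return (int64_t)res.tv_sec * 1000000000LL + res.tv_nsec;
}

/*
 * Round a non-negative expiry time up to the next multiple of the
 * clock resolution (res_ns must be > 0). Rounding up is an assumption
 * of this sketch, chosen so that the timer never fires early.
 */
static int64_t sketch_round_up(int64_t expires_ns, int64_t res_ns)
{
    int64_t rem = expires_ns % res_ns;

    return rem ? expires_ns + (res_ns - rem) : expires_ns;
}
```

With a 4 ms resolution (4000000 ns), an expiry of 1000001 ns would be rounded to 4000000 ns, while an expiry that already sits on a 4 ms boundary is left unchanged.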
- hrtimers - testing and verification
- ----------------------------------
- We used the high-resolution clock subsystem on top of hrtimers to verify
- the hrtimer implementation details in practice, and we also ran the posix
- timer tests in order to ensure specification compliance. We also ran
- tests on low-resolution clocks.
- The hrtimer patch converts the following kernel functionality to use
- hrtimers:
- - nanosleep
- - itimers
- - posix-timers
- The conversion of nanosleep and posix-timers enabled the unification of
- nanosleep and clock_nanosleep.
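For readers who want to exercise the converted sleep path from user space, here is a minimal sketch using the POSIX clock_nanosleep() and clock_gettime() calls. The helper name sketch_sleep_ns is hypothetical. It performs a relative sleep on CLOCK_MONOTONIC, restarts the sleep if a signal interrupts it, and reports how long the sleep took as measured on the same clock.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <time.h>

/*
 * Sleep for 'ns' nanoseconds (ns >= 0) on CLOCK_MONOTONIC.
 * Returns the measured elapsed time in nanoseconds, or -1 on error.
 */
static int64_t sketch_sleep_ns(long ns)
{
    struct timespec start, end, req, rem;
    int rc;

    req.tv_sec = ns / 1000000000L;
    req.tv_nsec = ns % 1000000000L;

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1;

    /*
     * flags == 0 requests a relative sleep. clock_nanosleep() returns
     * the error number directly; on EINTR, continue with the remainder.
     */
    while ((rc = clock_nanosleep(CLOCK_MONOTONIC, 0, &req, &rem)) == EINTR)
        req = rem;
    if (rc != 0)
        return -1;

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1;

    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
           + (end.tv_nsec - start.tv_nsec);
}
```

Calling sketch_sleep_ns(2000000L) requests a 2 ms sleep; how far the measured value exceeds the request depends on the clock resolution and scheduling latency of the system it runs on.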
- The code was successfully compiled for the following platforms:
- i386, x86_64, ARM, PPC, PPC64, IA64
- The code was run-tested on the following platforms:
- i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
- hrtimers were also integrated into the -rt tree, along with an
- hrtimers-based high-resolution clock implementation, so the hrtimers
- code got a healthy amount of testing and use in practice.
- Thomas Gleixner, Ingo Molnar