123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274 |
- REDUCING OS JITTER DUE TO PER-CPU KTHREADS
- This document lists per-CPU kthreads in the Linux kernel and presents
- options to control their OS jitter. Note that non-per-CPU kthreads are
- not listed here. To reduce OS jitter from non-per-CPU kthreads, bind
- them to a "housekeeping" CPU dedicated to such work.
- REFERENCES
- o Documentation/IRQ-affinity.txt: Binding interrupts to sets of CPUs.
- o Documentation/cgroup-v1: Using cgroups to bind tasks to sets of CPUs.
- o man taskset: Using the taskset command to bind tasks to sets
- of CPUs.
- o man sched_setaffinity: Using the sched_setaffinity() system
- call to bind tasks to sets of CPUs.
- o /sys/devices/system/cpu/cpuN/online: Control CPU N's hotplug state,
- writing "0" to offline and "1" to online.
- o In order to locate kernel-generated OS jitter on CPU N:
- cd /sys/kernel/debug/tracing
- echo 1 > max_graph_depth # Increase the "1" for more detail
- echo function_graph > current_tracer
- # run workload
- cat per_cpu/cpuN/trace
- KTHREADS
- Name: ehca_comp/%u
- Purpose: Periodically process Infiniband-related work.
- To reduce its OS jitter, do any of the following:
- 1. Don't use eHCA Infiniband hardware, instead choosing hardware
- that does not require per-CPU kthreads. This will prevent these
- kthreads from being created in the first place. (This will
- work for most people, as this hardware, though important, is
- relatively old and is produced in relatively low unit volumes.)
- 2. Do all eHCA-Infiniband-related work on other CPUs, including
- interrupts.
- 3. Rework the eHCA driver so that its per-CPU kthreads are
- provisioned only on selected CPUs.
- Name: irq/%d-%s
- Purpose: Handle threaded interrupts.
- To reduce its OS jitter, do the following:
- 1. Use irq affinity to force the irq threads to execute on
- some other CPU.
- Name: kcmtpd_ctr_%d
- Purpose: Handle Bluetooth work.
- To reduce its OS jitter, do one of the following:
- 1. Don't use Bluetooth, in which case these kthreads won't be
- created in the first place.
- 2. Use irq affinity to force Bluetooth-related interrupts to
- occur on some other CPU and furthermore initiate all
- Bluetooth activity on some other CPU.
- Name: ksoftirqd/%u
- Purpose: Execute softirq handlers when threaded or when under heavy load.
- To reduce its OS jitter, each softirq vector must be handled
- separately as follows:
- TIMER_SOFTIRQ: Do all of the following:
- 1. To the extent possible, keep the CPU out of the kernel when it
- is non-idle, for example, by avoiding system calls and by forcing
- both kernel threads and interrupts to execute elsewhere.
- 2. Build with CONFIG_HOTPLUG_CPU=y. After boot completes, force
- the CPU offline, then bring it back online. This forces
- recurring timers to migrate elsewhere. If you are concerned
- with multiple CPUs, force them all offline before bringing the
- first one back online. Once you have onlined the CPUs in question,
- do not offline any other CPUs, because doing so could force the
- timer back onto one of the CPUs in question.
- NET_TX_SOFTIRQ and NET_RX_SOFTIRQ: Do all of the following:
- 1. Force networking interrupts onto other CPUs.
- 2. Initiate any network I/O on other CPUs.
- 3. Once your application has started, prevent CPU-hotplug operations
- from being initiated from tasks that might run on the CPU to
- be de-jittered. (It is OK to force this CPU offline and then
- bring it back online before you start your application.)
- BLOCK_SOFTIRQ: Do all of the following:
- 1. Force block-device interrupts onto some other CPU.
- 2. Initiate any block I/O on other CPUs.
- 3. Once your application has started, prevent CPU-hotplug operations
- from being initiated from tasks that might run on the CPU to
- be de-jittered. (It is OK to force this CPU offline and then
- bring it back online before you start your application.)
- IRQ_POLL_SOFTIRQ: Do all of the following:
- 1. Force block-device interrupts onto some other CPU.
- 2. Initiate any block I/O and block-I/O polling on other CPUs.
- 3. Once your application has started, prevent CPU-hotplug operations
- from being initiated from tasks that might run on the CPU to
- be de-jittered. (It is OK to force this CPU offline and then
- bring it back online before you start your application.)
- TASKLET_SOFTIRQ: Do one or more of the following:
- 1. Avoid use of drivers that use tasklets. (Such drivers will contain
- calls to things like tasklet_schedule().)
- 2. Convert all drivers that you must use from tasklets to workqueues.
- 3. Force interrupts for drivers using tasklets onto other CPUs,
- and also do I/O involving these drivers on other CPUs.
- SCHED_SOFTIRQ: Do all of the following:
- 1. Avoid sending scheduler IPIs to the CPU to be de-jittered,
- for example, ensure that at most one runnable kthread is present
- on that CPU. If a thread that expects to run on the de-jittered
- CPU awakens, the scheduler will send an IPI that can result in
- a subsequent SCHED_SOFTIRQ.
- 2. Build with CONFIG_RCU_NOCB_CPU=y, CONFIG_RCU_NOCB_CPU_ALL=y,
- CONFIG_NO_HZ_FULL=y, and, in addition, ensure that the CPU
- to be de-jittered is marked as an adaptive-ticks CPU using the
- "nohz_full=" boot parameter. This reduces the number of
- scheduler-clock interrupts that the de-jittered CPU receives,
- minimizing its chances of being selected to do the load balancing
- work that runs in SCHED_SOFTIRQ context.
- 3. To the extent possible, keep the CPU out of the kernel when it
- is non-idle, for example, by avoiding system calls and by
- forcing both kernel threads and interrupts to execute elsewhere.
- This further reduces the number of scheduler-clock interrupts
- received by the de-jittered CPU.
- HRTIMER_SOFTIRQ: Do all of the following:
- 1. To the extent possible, keep the CPU out of the kernel when it
- is non-idle. For example, avoid system calls and force both
- kernel threads and interrupts to execute elsewhere.
- 2. Build with CONFIG_HOTPLUG_CPU=y. Once boot completes, force the
- CPU offline, then bring it back online. This forces recurring
- timers to migrate elsewhere. If you are concerned with multiple
- CPUs, force them all offline before bringing the first one
- back online. Once you have onlined the CPUs in question, do not
- offline any other CPUs, because doing so could force the timer
- back onto one of the CPUs in question.
- RCU_SOFTIRQ: Do at least one of the following:
- 1. Offload callbacks and keep the CPU in either dyntick-idle or
- adaptive-ticks state by doing all of the following:
- a. Build with CONFIG_RCU_NOCB_CPU=y, CONFIG_RCU_NOCB_CPU_ALL=y,
- CONFIG_NO_HZ_FULL=y, and, in addition ensure that the CPU
- to be de-jittered is marked as an adaptive-ticks CPU using
- the "nohz_full=" boot parameter. Bind the rcuo kthreads
- to housekeeping CPUs, which can tolerate OS jitter.
- b. To the extent possible, keep the CPU out of the kernel
- when it is non-idle, for example, by avoiding system
- calls and by forcing both kernel threads and interrupts
- to execute elsewhere.
- 2. Enable RCU to do its processing remotely via dyntick-idle by
- doing all of the following:
- a. Build with CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y.
- b. Ensure that the CPU goes idle frequently, allowing other
- CPUs to detect that it has passed through an RCU quiescent
- state. If the kernel is built with CONFIG_NO_HZ_FULL=y,
- userspace execution also allows other CPUs to detect that
- the CPU in question has passed through a quiescent state.
- c. To the extent possible, keep the CPU out of the kernel
- when it is non-idle, for example, by avoiding system
- calls and by forcing both kernel threads and interrupts
- to execute elsewhere.
- Name: kworker/%u:%d%s (cpu, id, priority)
- Purpose: Execute workqueue requests
- To reduce its OS jitter, do any of the following:
- 1. Run your workload at a real-time priority, which will allow
- preempting the kworker daemons.
- 2. A given workqueue can be made visible in the sysfs filesystem
- by passing the WQ_SYSFS to that workqueue's alloc_workqueue().
- Such a workqueue can be confined to a given subset of the
- CPUs using the /sys/devices/virtual/workqueue/*/cpumask sysfs
- files. The set of WQ_SYSFS workqueues can be displayed using
- "ls sys/devices/virtual/workqueue". That said, the workqueues
- maintainer would like to caution people against indiscriminately
- sprinkling WQ_SYSFS across all the workqueues. The reason for
- caution is that it is easy to add WQ_SYSFS, but because sysfs is
- part of the formal user/kernel API, it can be nearly impossible
- to remove it, even if its addition was a mistake.
- 3. Do any of the following needed to avoid jitter that your
- application cannot tolerate:
- a. Build your kernel with CONFIG_SLUB=y rather than
- CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
- use of each CPU's workqueues to run its cache_reap()
- function.
- b. Avoid using oprofile, thus avoiding OS jitter from
- wq_sync_buffer().
- c. Limit your CPU frequency so that a CPU-frequency
- governor is not required, possibly enlisting the aid of
- special heatsinks or other cooling technologies. If done
- correctly, and if you CPU architecture permits, you should
- be able to build your kernel with CONFIG_CPU_FREQ=n to
- avoid the CPU-frequency governor periodically running
- on each CPU, including cs_dbs_timer() and od_dbs_timer().
- WARNING: Please check your CPU specifications to
- make sure that this is safe on your particular system.
- d. As of v3.18, Christoph Lameter's on-demand vmstat workers
- commit prevents OS jitter due to vmstat_update() on
- CONFIG_SMP=y systems. Before v3.18, is not possible
- to entirely get rid of the OS jitter, but you can
- decrease its frequency by writing a large value to
- /proc/sys/vm/stat_interval. The default value is HZ,
- for an interval of one second. Of course, larger values
- will make your virtual-memory statistics update more
- slowly. Of course, you can also run your workload at
- a real-time priority, thus preempting vmstat_update(),
- but if your workload is CPU-bound, this is a bad idea.
- However, there is an RFC patch from Christoph Lameter
- (based on an earlier one from Gilad Ben-Yossef) that
- reduces or even eliminates vmstat overhead for some
- workloads at https://lkml.org/lkml/2013/9/4/379.
- e. Boot with "elevator=noop" to avoid workqueue use by
- the block layer.
- f. If running on high-end powerpc servers, build with
- CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS
- daemon from running on each CPU every second or so.
- (This will require editing Kconfig files and will defeat
- this platform's RAS functionality.) This avoids jitter
- due to the rtas_event_scan() function.
- WARNING: Please check your CPU specifications to
- make sure that this is safe on your particular system.
- g. If running on Cell Processor, build your kernel with
- CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
- spu_gov_work().
- WARNING: Please check your CPU specifications to
- make sure that this is safe on your particular system.
- h. If running on PowerMAC, build your kernel with
- CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
- avoiding OS jitter from rackmeter_do_timer().
- Name: rcuc/%u
- Purpose: Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels.
- To reduce its OS jitter, do at least one of the following:
- 1. Build the kernel with CONFIG_PREEMPT=n. This prevents these
- kthreads from being created in the first place, and also obviates
- the need for RCU priority boosting. This approach is feasible
- for workloads that do not require high degrees of responsiveness.
- 2. Build the kernel with CONFIG_RCU_BOOST=n. This prevents these
- kthreads from being created in the first place. This approach
- is feasible only if your workload never requires RCU priority
- boosting, for example, if you ensure frequent idle time on all
- CPUs that might execute within the kernel.
- 3. Build with CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y,
- which offloads all RCU callbacks to kthreads that can be moved
- off of CPUs susceptible to OS jitter. This approach prevents the
- rcuc/%u kthreads from having any work to do, so that they are
- never awakened.
- 4. Ensure that the CPU never enters the kernel, and, in particular,
- avoid initiating any CPU hotplug operations on this CPU. This is
- another way of preventing any callbacks from being queued on the
- CPU, again preventing the rcuc/%u kthreads from having any work
- to do.
- Name: rcuob/%d, rcuop/%d, and rcuos/%d
- Purpose: Offload RCU callbacks from the corresponding CPU.
- To reduce its OS jitter, do at least one of the following:
- 1. Use affinity, cgroups, or other mechanism to force these kthreads
- to execute on some other CPU.
- 2. Build with CONFIG_RCU_NOCB_CPU=n, which will prevent these
- kthreads from being created in the first place. However, please
- note that this will not eliminate OS jitter, but will instead
- shift it to RCU_SOFTIRQ.
- Name: watchdog/%u
- Purpose: Detect software lockups on each CPU.
- To reduce its OS jitter, do at least one of the following:
- 1. Build with CONFIG_LOCKUP_DETECTOR=n, which will prevent these
- kthreads from being created in the first place.
- 2. Boot with "nosoftlockup=0", which will also prevent these kthreads
- from being created. Other related watchdog and softlockup boot
- parameters may be found in Documentation/kernel-parameters.txt
- and Documentation/watchdog/watchdog-parameters.txt.
- 3. Echo a zero to /proc/sys/kernel/watchdog to disable the
- watchdog timer.
- 4. Echo a large number of /proc/sys/kernel/watchdog_thresh in
- order to reduce the frequency of OS jitter due to the watchdog
- timer down to a level that is acceptable for your workload.
|