Message ID | 20231107230822.371443-10-ankur.a.arora@oracle.com (mailing list archive)
---|---
State | New
Series | Make the kernel preemptible
On Tue, Nov 07, 2023 at 03:08:02PM -0800, Ankur Arora wrote:
> There are broadly three sets of uses of cond_resched():
>
> 1. Calls to cond_resched() out of the goodness of our heart,
>    otherwise known as avoiding lockup splats.
>
> 2. Open coded variants of cond_resched_lock() which call
>    cond_resched().
>
> 3. Retry or error handling loops, where cond_resched() is used as a
>    quick alternative to spinning in a tight-loop.
>
> When running under a full preemption model, the cond_resched() reduces
> to a NOP (not even a barrier) so removing it obviously cannot matter.
>
> But considering only voluntary preemption models (for say code that
> has been mostly tested under those), for set-1 and set-2 the
> scheduler can now preempt kernel tasks running beyond their time
> quanta anywhere they are preemptible() [1]. Which removes any need
> for these explicitly placed scheduling points.
>
> The cond_resched() calls in set-3 are a little more difficult.
> To start with, given it's NOP character under full preemption, it
> never actually saved us from a tight loop.
> With voluntary preemption, it's not a NOP, but it might as well be --
> for most workloads the scheduler does not have an interminable supply
> of runnable tasks on the runqueue.
>
> So, cond_resched() is useful to not get softlockup splats, but not
> terribly good for error handling. Ideally, these should be replaced
> with some kind of timed or event wait.
> For now we use cond_resched_stall(), which tries to schedule if
> possible, and executes a cpu_relax() if not.
>
> All of these are from set-1 except for the retry loops in
> task_function_call() or the mutex testing logic.
>
> Replace these with cond_resched_stall(). The others can be removed.
>
> [1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/
>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Zefan Li <lizefan.x@bytedance.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Luis Chamberlain <mcgrof@kernel.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

Sounds like the sort of test which should be put into linux-next to get
test coverage right away. To see what really blows up.

  Luis
On Fri, 17 Nov 2023 10:14:33 -0800
Luis Chamberlain <mcgrof@kernel.org> wrote:

> Sounds like the sort of test which should be put into linux-next to get
> test coverage right away. To see what really blows up.

No, it shouldn't be added this early in the development. It depends on
the first part of this patch series, which needs to be finished first.
Ankur just gave his full vision of the RFC. You can ignore the removal
of the cond_resched() for the time being.

Thanks for looking at it Luis!

-- Steve
diff --git a/include/linux/sched/cond_resched.h b/include/linux/sched/cond_resched.h
deleted file mode 100644
index 227f5be81bcd..000000000000
--- a/include/linux/sched/cond_resched.h
+++ /dev/null
@@ -1 +0,0 @@
-#include <linux/sched.h>
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 6f0d6fb6523f..47abfc1e6c75 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -2460,8 +2460,6 @@ void __audit_inode_child(struct inode *parent,
 		}
 	}
 
-	cond_resched();
-
 	/* is there a matching child entry? */
 	list_for_each_entry(n, &context->names_list, list) {
 		/* can only match entries that have a name */
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d80d7a608141..d61dc98d1d2f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -210,8 +210,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 		/* play nice and yield if necessary */
 		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
 			spin_unlock_irq(&cgroup_rstat_lock);
-			if (!cond_resched())
-				cpu_relax();
+			cond_resched_stall();
 			spin_lock_irq(&cgroup_rstat_lock);
 		}
 	}
diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index 06366acd27b0..fb8e7aed9751 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -543,8 +543,6 @@ void debug_dma_dump_mappings(struct device *dev)
 			}
 		}
 		spin_unlock_irqrestore(&bucket->lock, flags);
-
-		cond_resched();
 	}
 }
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a2f2a9525d72..02330c190472 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -125,7 +125,7 @@ task_function_call(struct task_struct *p, remote_function_f func, void *info)
 		if (ret != -EAGAIN)
 			break;
 
-		cond_resched();
+		cond_resched_stall();
 	}
 
 	return ret;
diff --git a/kernel/gcov/base.c b/kernel/gcov/base.c
index 073a3738c5e6..3c22a15065b3 100644
--- a/kernel/gcov/base.c
+++ b/kernel/gcov/base.c
@@ -43,7 +43,6 @@ void gcov_enable_events(void)
 	/* Perform event callback for previously registered entries. */
 	while ((info = gcov_info_next(info))) {
 		gcov_event(GCOV_ADD, info);
-		cond_resched();
 	}
 
 	mutex_unlock(&gcov_lock);
diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index 18edd57b5fe8..a3c5ce9246cd 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -19,7 +19,7 @@
 #include <linux/kdb.h>
 #include <linux/err.h>
 #include <linux/proc_fs.h>
-#include <linux/sched.h>	/* for cond_resched */
+#include <linux/sched.h>
 #include <linux/ctype.h>
 #include <linux/slab.h>
 #include <linux/filter.h>
@@ -295,7 +295,6 @@ int kallsyms_on_each_symbol(int (*fn)(void *, const char *, unsigned long),
 		ret = fn(data, namebuf, kallsyms_sym_address(i));
 		if (ret != 0)
 			return ret;
-		cond_resched();
 	}
 	return 0;
 }
@@ -312,7 +311,6 @@ int kallsyms_on_each_match_symbol(int (*fn)(void *, unsigned long),
 
 	for (i = start; !ret && i <= end; i++) {
 		ret = fn(data, kallsyms_sym_address(get_symbol_seq(i)));
-		cond_resched();
 	}
 
 	return ret;
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 9dc728982d79..40699ea33034 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -452,8 +452,6 @@ static struct page *kimage_alloc_crash_control_pages(struct kimage *image,
 	while (hole_end <= crashk_res.end) {
 		unsigned long i;
 
-		cond_resched();
-
 		if (hole_end > KEXEC_CRASH_CONTROL_MEMORY_LIMIT)
 			break;
 		/* See if I overlap any of the segments */
@@ -832,8 +830,6 @@ static int kimage_load_normal_segment(struct kimage *image,
 		else
 			buf += mchunk;
 		mbytes -= mchunk;
-
-		cond_resched();
 	}
 out:
 	return result;
@@ -900,8 +896,6 @@ static int kimage_load_crash_segment(struct kimage *image,
 		else
 			buf += mchunk;
 		mbytes -= mchunk;
-
-		cond_resched();
 	}
 out:
 	return result;
diff --git a/kernel/locking/test-ww_mutex.c b/kernel/locking/test-ww_mutex.c
index 93cca6e69860..b1bb683274f8 100644
--- a/kernel/locking/test-ww_mutex.c
+++ b/kernel/locking/test-ww_mutex.c
@@ -46,7 +46,7 @@ static void test_mutex_work(struct work_struct *work)
 
 	if (mtx->flags & TEST_MTX_TRY) {
 		while (!ww_mutex_trylock(&mtx->mutex, NULL))
-			cond_resched();
+			cond_resched_stall();
 	} else {
 		ww_mutex_lock(&mtx->mutex, NULL);
 	}
@@ -84,7 +84,7 @@ static int __test_mutex(unsigned int flags)
 				ret = -EINVAL;
 				break;
 			}
-			cond_resched();
+			cond_resched_stall();
 		} while (time_before(jiffies, timeout));
 	} else {
 		ret = wait_for_completion_timeout(&mtx.done, TIMEOUT);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 98fedfdb8db5..03f6fcfa87f8 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1908,7 +1907,6 @@ static int copy_chunked_from_user(void *dst, const void __user *usrc, unsigned l
 
 		if (copy_from_user(dst, usrc, n) != 0)
 			return -EFAULT;
-		cond_resched();
 		dst += n;
 		usrc += n;
 		len -= n;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 443057bee87c..83a65a3c614a 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -798,8 +796,6 @@ static int ptrace_peek_siginfo(struct task_struct *child,
 
 		if (signal_pending(current))
 			break;
-
-		cond_resched();
 	}
 
 	if (i > 0)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3467a3a7d4bf..691b50791e04 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -25,7 +25,6 @@
 #include <linux/refcount_api.h>
 #include <linux/topology.h>
 #include <linux/sched/clock.h>
-#include <linux/sched/cond_resched.h>
 #include <linux/sched/cputime.h>
 #include <linux/sched/debug.h>
 #include <linux/sched/hotplug.h>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 448fe36e7bbb..4e67e88282a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -33,7 +33,6 @@
 #include <linux/refcount_api.h>
 #include <linux/topology.h>
 #include <linux/sched/clock.h>
-#include <linux/sched/cond_resched.h>
 #include <linux/sched/cputime.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/nohz.h>
@@ -51,8 +50,6 @@
 
 #include <asm/switch_to.h>
 
-#include <linux/sched/cond_resched.h>
-
 #include "sched.h"
 #include "stats.h"
 #include "autogroup.h"
@@ -3374,7 +3371,6 @@ static void task_numa_work(struct callback_head *work)
 
 			if (pages <= 0 || virtpages <= 0)
 				goto out;
-			cond_resched();
 		} while (end != vma->vm_end);
 	}
 	for_each_vma(vmi, vma);
There are broadly three sets of uses of cond_resched():

1. Calls to cond_resched() out of the goodness of our heart,
   otherwise known as avoiding lockup splats.

2. Open coded variants of cond_resched_lock() which call
   cond_resched().

3. Retry or error handling loops, where cond_resched() is used as a
   quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1], which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given its NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

All of these are from set-1 except for the retry loops in
task_function_call() or the mutex testing logic.

Replace these with cond_resched_stall(). The others can be removed.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Will Deacon <will@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/sched/cond_resched.h | 1 -
 kernel/auditsc.c                   | 2 --
 kernel/cgroup/rstat.c              | 3 +--
 kernel/dma/debug.c                 | 2 --
 kernel/events/core.c               | 2 +-
 kernel/gcov/base.c                 | 1 -
 kernel/kallsyms.c                  | 4 +---
 kernel/kexec_core.c                | 6 ------
 kernel/locking/test-ww_mutex.c     | 4 ++--
 kernel/module/main.c               | 1 -
 kernel/ptrace.c                    | 2 --
 kernel/sched/core.c                | 1 -
 kernel/sched/fair.c                | 4 ----
 13 files changed, 5 insertions(+), 28 deletions(-)
 delete mode 100644 include/linux/sched/cond_resched.h
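For readers without the rest of the series in hand: cond_resched_stall() is
defined in an earlier patch of this series, so the sketch below is only an
illustration of the semantics the changelog describes (try to schedule when
a reschedule is pending and it is safe to do so, otherwise fall back to
cpu_relax()). The preempt_count() guard and the exact shape of the helper
are assumptions for illustration, not the actual definition.

	/*
	 * Illustrative sketch only -- not the definition from this series.
	 * Per the changelog: reschedule if the scheduler wants the CPU back
	 * and we are in a preemptible context; otherwise issue a cpu_relax()
	 * so retry loops do not spin flat out.
	 */
	static inline int cond_resched_stall(void)
	{
		if (need_resched() && !preempt_count()) {
			schedule();	/* yield to a runnable task */
			return 1;
		}

		cpu_relax();		/* polite pause in the retry loop */
		return 0;
	}

With that behaviour, the retry loop in task_function_call() above still
polls for -EAGAIN, but no longer relies on cond_resched() being a real
scheduling point, which it is not under full preemption.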