| Message ID | 20200528195442.190116-1-surenb@google.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | [1/1] psi: eliminate kthread_worker from psi trigger scheduling mechanism |
On Thu, May 28, 2020 at 12:54:42PM -0700, Suren Baghdasaryan wrote:
> Each psi group requires a dedicated kthread_delayed_work and kthread_worker. Since no other work can be performed using psi_group's kthread_worker, the same result can be obtained using a task_struct and a timer directly. This makes psi triggering simpler by removing lists and locks involved with kthread_worker usage and eliminates the need for poll_scheduled atomic use in the hot path.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> This patch is meant to address Peter's request in [1] to pull kthread_queue_delayed_work() out from under rq->lock. This should also address the lockdep warning about possibility of a circular dependency described in [2]

I think you could've just fixed kthread_queue_delayed_work(), that code is sub-optimal.

But I suppose this works too.
On Thu, Jun 4, 2020 at 6:12 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, May 28, 2020 at 12:54:42PM -0700, Suren Baghdasaryan wrote:
> > Each psi group requires a dedicated kthread_delayed_work and kthread_worker. Since no other work can be performed using psi_group's kthread_worker, the same result can be obtained using a task_struct and a timer directly. This makes psi triggering simpler by removing lists and locks involved with kthread_worker usage and eliminates the need for poll_scheduled atomic use in the hot path.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > This patch is meant to address Peter's request in [1] to pull kthread_queue_delayed_work() out from under rq->lock. This should also address the lockdep warning about possibility of a circular dependency described in [2]
>
> I think you could've just fixed kthread_queue_delayed_work(), that code is sub-optimal.

Ok, let me look into it some more. My understanding was that the worker->lock in kthread_queue_delayed_work() was needed to synchronize worker->delayed_work_list access. But maybe I'm missing something... I assume you are talking about optimizing this beyond what https://lkml.org/lkml/2020/5/4/1148 was doing?

BTW, any objections against taking https://lkml.org/lkml/2020/5/4/1148 ? It's not the ultimate fix but it is an improvement since it gets some of the operations that were unnecessarily under worker->lock out of it.

> But I suppose this works too.

In PSI's case there is always one work for each worker, so the delayed_work_list and work_list are not needed and therefore I can replace kthread_worker machinery with a task and a timer. I think I can simplify this a bit further. For example group->poll_wakeup doesn't have to be an atomic. Originally I wanted to avoid a possibility of a race when poll_timer_fn sets it and psi_poll_worker resets it and as a result misses a wakeup, however if psi_poll_worker resets it before calling psi_poll_work then there is no harm in missing a wakeup because we called psi_poll_work and did the required work anyway.

One question about this patch I'm not sure about and wanted to ask you Peter is whether it's ok to call mod_timer from within a hotpath (while holding rq->lock). As I described in the additional comment, there is a possibility of a race between when I check timer_pending and the call to mod_timer, so it's possible that mod_timer might be called both from psi_poll_work (psi poll work handler) and from psi_task_change (hotpath under rq->lock). I see that mod_timer takes base->lock spinlock, and IIUC such a race might block the hotpath and therefore is unacceptable. If this is true I'll need to revive the poll_scheduled atomic to close this race and then I can change mod_timer into add_timer. WDYT? And sorry for my ignorance if this is a trivial question. I'm not sure about the rules when it comes to rq->locks.

Thanks,
Suren.
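The wakeup handshake being discussed, as it appears in the patch below (abridged here; the FIFO priority setup is omitted). The field names are the ones the diff adds to struct psi_group; the comments spelling out the ordering argument are annotations, not part of the posted patch:

static void poll_timer_fn(struct timer_list *t)
{
	struct psi_group *group = from_timer(group, t, poll_timer);

	/* Publish the wakeup request, then wake the worker. */
	atomic_set(&group->poll_wakeup, 1);
	wake_up_interruptible(&group->poll_wait);
}

static int psi_poll_worker(void *data)
{
	struct psi_group *group = (struct psi_group *)data;

	while (true) {
		/*
		 * poll_wakeup is cleared *before* psi_poll_work() runs. A
		 * wakeup that races with the clear is not lost: the work
		 * about to run covers it, and any later wakeup leaves the
		 * flag set for the next wait_event_interruptible() pass.
		 */
		wait_event_interruptible(group->poll_wait,
				atomic_cmpxchg(&group->poll_wakeup, 1, 0) ||
				kthread_should_stop());
		if (kthread_should_stop())
			break;

		psi_poll_work(group);
	}
	return 0;
}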
On Thu, Jun 4, 2020 at 12:20 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Jun 4, 2020 at 6:12 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, May 28, 2020 at 12:54:42PM -0700, Suren Baghdasaryan wrote:
> > > Each psi group requires a dedicated kthread_delayed_work and kthread_worker. Since no other work can be performed using psi_group's kthread_worker, the same result can be obtained using a task_struct and a timer directly. This makes psi triggering simpler by removing lists and locks involved with kthread_worker usage and eliminates the need for poll_scheduled atomic use in the hot path.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > > This patch is meant to address Peter's request in [1] to pull kthread_queue_delayed_work() out from under rq->lock. This should also address the lockdep warning about possibility of a circular dependency described in [2]
> >
> > I think you could've just fixed kthread_queue_delayed_work(), that code is sub-optimal.

After some more staring into kthread code I think I understand what Peter's comment meant about delayed_work_list. worker->delayed_work_list seems to be unnecessary because each kthread_delayed_work has its own timer which will add the work into worker->work_list when the time comes. So there is no need to store the delayed work in an intermediate worker->delayed_work_list.

However I think kthread_destroy_worker() has an issue if it's called while worker->delayed_work_list is non-empty. The issue is that kthread_destroy_worker() does not stop all the kthread_delayed_work->timers scheduled on the worker->delayed_work_list. So if such a timer fires after a call to kthread_destroy_worker(), the timer's handler will dereference the already destroyed worker.

If I'm right and this is indeed an issue then I think we do need worker->delayed_work_list to cancel all the scheduled timers. The issue can be avoided if we assume that the caller will always call kthread_cancel_delayed_work_sync() for each delayed_work scheduled on worker->delayed_work_list before calling kthread_destroy_worker(). If that's what we expect I think this expectation should be reflected in the comments and a WARN_ON(!list_empty(&worker->delayed_work_list)) be added in kthread_destroy_worker(). WDYT?

> Ok, let me look into it some more. My understanding was that the worker->lock in kthread_queue_delayed_work() was needed to synchronize worker->delayed_work_list access. But maybe I'm missing something... I assume you are talking about optimizing this beyond what https://lkml.org/lkml/2020/5/4/1148 was doing?
>
> BTW, any objections against taking https://lkml.org/lkml/2020/5/4/1148 ? It's not the ultimate fix but it is an improvement since it gets some of the operations that were unnecessarily under worker->lock out of it.
>
> > But I suppose this works too.
>
> In PSI's case there is always one work for each worker, so the delayed_work_list and work_list are not needed and therefore I can replace kthread_worker machinery with a task and a timer. I think I can simplify this a bit further. For example group->poll_wakeup doesn't have to be an atomic. Originally I wanted to avoid a possibility of a race when poll_timer_fn sets it and psi_poll_worker resets it and as a result misses a wakeup, however if psi_poll_worker resets it before calling psi_poll_work then there is no harm in missing a wakeup because we called psi_poll_work and did the required work anyway.
>
> One question about this patch I'm not sure about and wanted to ask you Peter is whether it's ok to call mod_timer from within a hotpath (while holding rq->lock). As I described in the additional comment, there is a possibility of a race between when I check timer_pending and the call to mod_timer, so it's possible that mod_timer might be called both from psi_poll_work (psi poll work handler) and from psi_task_change (hotpath under rq->lock). I see that mod_timer takes base->lock spinlock, and IIUC such a race might block the hotpath and therefore is unacceptable. If this is true I'll need to revive the poll_scheduled atomic to close this race and then I can change mod_timer into add_timer. WDYT? And sorry for my ignorance if this is a trivial question. I'm not sure about the rules when it comes to rq->locks.
>
> Thanks,
> Suren.
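A rough sketch of the check proposed above. The surrounding function body follows kernel/kthread.c as of this thread's timeframe and should be treated as an approximation; only the second WARN_ON and its comment are the suggested addition:

void kthread_destroy_worker(struct kthread_worker *worker)
{
	struct task_struct *task;

	task = worker->task;
	if (WARN_ON(!task))
		return;

	kthread_flush_worker(worker);
	kthread_stop(task);
	WARN_ON(!list_empty(&worker->work_list));
	/*
	 * Proposed addition: callers are expected to have cancelled all
	 * delayed work (kthread_cancel_delayed_work_sync()) before
	 * destroying the worker; otherwise a still-pending timer would
	 * dereference the freed worker.
	 */
	WARN_ON(!list_empty(&worker->delayed_work_list));
	kfree(worker);
}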
On Mon, Jun 8, 2020 at 7:56 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Jun 4, 2020 at 12:20 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Jun 4, 2020 at 6:12 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, May 28, 2020 at 12:54:42PM -0700, Suren Baghdasaryan wrote:
> > > > Each psi group requires a dedicated kthread_delayed_work and kthread_worker. Since no other work can be performed using psi_group's kthread_worker, the same result can be obtained using a task_struct and a timer directly. This makes psi triggering simpler by removing lists and locks involved with kthread_worker usage and eliminates the need for poll_scheduled atomic use in the hot path.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > ---
> > > > This patch is meant to address Peter's request in [1] to pull kthread_queue_delayed_work() out from under rq->lock. This should also address the lockdep warning about possibility of a circular dependency described in [2]
> > >
> > > I think you could've just fixed kthread_queue_delayed_work(), that code is sub-optimal.
>
> After some more staring into kthread code I think I understand what Peter's comment meant about delayed_work_list. worker->delayed_work_list seems to be unnecessary because each kthread_delayed_work has its own timer which will add the work into worker->work_list when the time comes. So there is no need to store the delayed work in an intermediate worker->delayed_work_list.
>
> However I think kthread_destroy_worker() has an issue if it's called while worker->delayed_work_list is non-empty. The issue is that kthread_destroy_worker() does not stop all the kthread_delayed_work->timers scheduled on the worker->delayed_work_list. So if such a timer fires after a call to kthread_destroy_worker(), the timer's handler will dereference the already destroyed worker.
>
> If I'm right and this is indeed an issue then I think we do need worker->delayed_work_list to cancel all the scheduled timers. The issue can be avoided if we assume that the caller will always call kthread_cancel_delayed_work_sync() for each delayed_work scheduled on worker->delayed_work_list before calling kthread_destroy_worker(). If that's what we expect I think this expectation should be reflected in the comments and a WARN_ON(!list_empty(&worker->delayed_work_list)) be added in kthread_destroy_worker(). WDYT?
>
> > Ok, let me look into it some more. My understanding was that the worker->lock in kthread_queue_delayed_work() was needed to synchronize worker->delayed_work_list access. But maybe I'm missing something... I assume you are talking about optimizing this beyond what https://lkml.org/lkml/2020/5/4/1148 was doing?
> >
> > BTW, any objections against taking https://lkml.org/lkml/2020/5/4/1148 ? It's not the ultimate fix but it is an improvement since it gets some of the operations that were unnecessarily under worker->lock out of it.
> >
> > > But I suppose this works too.
> >
> > In PSI's case there is always one work for each worker, so the delayed_work_list and work_list are not needed and therefore I can replace kthread_worker machinery with a task and a timer. I think I can simplify this a bit further. For example group->poll_wakeup doesn't have to be an atomic. Originally I wanted to avoid a possibility of a race when poll_timer_fn sets it and psi_poll_worker resets it and as a result misses a wakeup, however if psi_poll_worker resets it before calling psi_poll_work then there is no harm in missing a wakeup because we called psi_poll_work and did the required work anyway.
> >
> > One question about this patch I'm not sure about and wanted to ask you Peter is whether it's ok to call mod_timer from within a hotpath (while holding rq->lock). As I described in the additional comment, there is a possibility of a race between when I check timer_pending and the call to mod_timer, so it's possible that mod_timer might be called both from psi_poll_work (psi poll work handler) and from psi_task_change (hotpath under rq->lock). I see that mod_timer takes base->lock spinlock, and IIUC such a race might block the hotpath and therefore is unacceptable. If this is true I'll need to revive the poll_scheduled atomic to close this race and then I can change mod_timer into add_timer. WDYT? And sorry for my ignorance if this is a trivial question. I'm not sure about the rules when it comes to rq->locks.

Thanks for taking this patch, Peter. I just wanted to double-check if you considered the race I mentioned in the above paragraph and decided it's a non-issue. If it is an issue I can send a small follow-up patch to close the race or I can send a new version of the whole patch with the fix if that's preferable. Please LMK.

> >
> > Thanks,
> > Suren.
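For completeness, a minimal sketch of the follow-up Suren offers to send: reviving a poll_scheduled atomic (as in the pre-patch code) so that only one path ever arms the timer, which also allows mod_timer() to become add_timer(). This is an illustration of the idea, not code from the posted patch:

static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay)
{
	struct task_struct *task;

	/*
	 * Whoever flips poll_scheduled 0 -> 1 owns arming the timer, so the
	 * hotpath can no longer contend with the worker on base->lock
	 * inside mod_timer().
	 */
	if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) != 0)
		return;

	rcu_read_lock();
	task = rcu_dereference(group->poll_task);
	if (likely(task)) {
		group->poll_timer.expires = jiffies + delay;
		add_timer(&group->poll_timer);
	} else {
		atomic_set(&group->poll_scheduled, 0);
	}
	rcu_read_unlock();
}

As in the pre-patch code, psi_poll_work() would clear poll_scheduled again before processing, so a new request arriving while the worker runs can re-arm the timer.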
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 4b7258495a04..b95f3211566a 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -153,9 +153,10 @@ struct psi_group {
 	unsigned long avg[NR_PSI_STATES - 1][3];
 
 	/* Monitor work control */
-	atomic_t poll_scheduled;
-	struct kthread_worker __rcu *poll_kworker;
-	struct kthread_delayed_work poll_work;
+	struct task_struct __rcu *poll_task;
+	struct timer_list poll_timer;
+	wait_queue_head_t poll_wait;
+	atomic_t poll_wakeup;
 
 	/* Protects data used by the monitor */
 	struct mutex trigger_lock;
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 8f45cdb6463b..e53b711bd643 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -190,7 +190,6 @@ static void group_init(struct psi_group *group)
 	INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
 	mutex_init(&group->avgs_lock);
 	/* Init trigger-related members */
-	atomic_set(&group->poll_scheduled, 0);
 	mutex_init(&group->trigger_lock);
 	INIT_LIST_HEAD(&group->triggers);
 	memset(group->nr_triggers, 0, sizeof(group->nr_triggers));
@@ -199,7 +198,7 @@ static void group_init(struct psi_group *group)
 	memset(group->polling_total, 0, sizeof(group->polling_total));
 	group->polling_next_update = ULLONG_MAX;
 	group->polling_until = 0;
-	rcu_assign_pointer(group->poll_kworker, NULL);
+	rcu_assign_pointer(group->poll_task, NULL);
 }
 
 void __init psi_init(void)
@@ -547,47 +546,38 @@ static u64 update_triggers(struct psi_group *group, u64 now)
 	return now + group->poll_min_period;
 }
 
-/*
- * Schedule polling if it's not already scheduled. It's safe to call even from
- * hotpath because even though kthread_queue_delayed_work takes worker->lock
- * spinlock that spinlock is never contended due to poll_scheduled atomic
- * preventing such competition.
- */
+/* Schedule polling if it's not already scheduled. */
 static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay)
 {
-	struct kthread_worker *kworker;
+	struct task_struct *task;
 
-	/* Do not reschedule if already scheduled */
-	if (atomic_cmpxchg(&group->poll_scheduled, 0, 1) != 0)
+	/*
+	 * Do not reschedule if already scheduled.
+	 * Possible race with a timer scheduled after this check but before
+	 * mod_timer below can be tolerated because group->polling_next_update
+	 * will keep updates on schedule.
+	 */
+	if (timer_pending(&group->poll_timer))
 		return;
 
 	rcu_read_lock();
 
-	kworker = rcu_dereference(group->poll_kworker);
+	task = rcu_dereference(group->poll_task);
 	/*
 	 * kworker might be NULL in case psi_trigger_destroy races with
 	 * psi_task_change (hotpath) which can't use locks
 	 */
-	if (likely(kworker))
-		kthread_queue_delayed_work(kworker, &group->poll_work, delay);
-	else
-		atomic_set(&group->poll_scheduled, 0);
+	if (likely(task))
+		mod_timer(&group->poll_timer, jiffies + delay);
 
 	rcu_read_unlock();
 }
 
-static void psi_poll_work(struct kthread_work *work)
+static void psi_poll_work(struct psi_group *group)
 {
-	struct kthread_delayed_work *dwork;
-	struct psi_group *group;
 	u32 changed_states;
 	u64 now;
 
-	dwork = container_of(work, struct kthread_delayed_work, work);
-	group = container_of(dwork, struct psi_group, poll_work);
-
-	atomic_set(&group->poll_scheduled, 0);
-
 	mutex_lock(&group->trigger_lock);
 
 	now = sched_clock();
@@ -623,6 +613,35 @@ static void psi_poll_work(struct kthread_work *work)
 	mutex_unlock(&group->trigger_lock);
 }
 
+static int psi_poll_worker(void *data)
+{
+	struct psi_group *group = (struct psi_group *)data;
+	struct sched_param param = {
+		.sched_priority = 1,
+	};
+
+	sched_setscheduler_nocheck(current, SCHED_FIFO, &param);
+
+	while (true) {
+		wait_event_interruptible(group->poll_wait,
+				atomic_cmpxchg(&group->poll_wakeup, 1, 0) ||
+				kthread_should_stop());
+		if (kthread_should_stop())
+			break;
+
+		psi_poll_work(group);
+	}
+	return 0;
+}
+
+static void poll_timer_fn(struct timer_list *t)
+{
+	struct psi_group *group = from_timer(group, t, poll_timer);
+
+	atomic_set(&group->poll_wakeup, 1);
+	wake_up_interruptible(&group->poll_wait);
+}
+
 static void record_times(struct psi_group_cpu *groupc, int cpu,
 			 bool memstall_tick)
 {
@@ -1099,22 +1118,20 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
 
 	mutex_lock(&group->trigger_lock);
 
-	if (!rcu_access_pointer(group->poll_kworker)) {
-		struct sched_param param = {
-			.sched_priority = 1,
-		};
-		struct kthread_worker *kworker;
+	if (!rcu_access_pointer(group->poll_task)) {
+		struct task_struct *task;
 
-		kworker = kthread_create_worker(0, "psimon");
-		if (IS_ERR(kworker)) {
+		task = kthread_create(psi_poll_worker, group, "psimon");
+		if (IS_ERR(task)) {
 			kfree(t);
 			mutex_unlock(&group->trigger_lock);
-			return ERR_CAST(kworker);
+			return ERR_CAST(task);
 		}
-		sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, &param);
-		kthread_init_delayed_work(&group->poll_work,
-				psi_poll_work);
-		rcu_assign_pointer(group->poll_kworker, kworker);
+		atomic_set(&group->poll_wakeup, 0);
+		init_waitqueue_head(&group->poll_wait);
+		wake_up_process(task);
+		timer_setup(&group->poll_timer, poll_timer_fn, 0);
+		rcu_assign_pointer(group->poll_task, task);
 	}
 
 	list_add(&t->node, &group->triggers);
@@ -1132,7 +1149,7 @@ static void psi_trigger_destroy(struct kref *ref)
 {
 	struct psi_trigger *t = container_of(ref, struct psi_trigger, refcount);
 	struct psi_group *group = t->group;
-	struct kthread_worker *kworker_to_destroy = NULL;
+	struct task_struct *task_to_destroy = NULL;
 
 	if (static_branch_likely(&psi_disabled))
 		return;
@@ -1158,13 +1175,13 @@ static void psi_trigger_destroy(struct kref *ref)
 			period = min(period, div_u64(tmp->win.size,
 					UPDATES_PER_WINDOW));
 		group->poll_min_period = period;
-		/* Destroy poll_kworker when the last trigger is destroyed */
+		/* Destroy poll_task when the last trigger is destroyed */
 		if (group->poll_states == 0) {
 			group->polling_until = 0;
-			kworker_to_destroy = rcu_dereference_protected(
-					group->poll_kworker,
+			task_to_destroy = rcu_dereference_protected(
+					group->poll_task,
 					lockdep_is_held(&group->trigger_lock));
-			rcu_assign_pointer(group->poll_kworker, NULL);
+			rcu_assign_pointer(group->poll_task, NULL);
 		}
 	}
 
@@ -1172,25 +1189,23 @@ static void psi_trigger_destroy(struct kref *ref)
 	/*
 	 * Wait for both *trigger_ptr from psi_trigger_replace and
-	 * poll_kworker RCUs to complete their read-side critical sections
-	 * before destroying the trigger and optionally the poll_kworker
+	 * poll_task RCUs to complete their read-side critical sections
+	 * before destroying the trigger and optionally the poll_task
 	 */
 	synchronize_rcu();
 
 	/*
 	 * Destroy the kworker after releasing trigger_lock to prevent a
 	 * deadlock while waiting for psi_poll_work to acquire trigger_lock
 	 */
-	if (kworker_to_destroy) {
+	if (task_to_destroy) {
 		/*
 		 * After the RCU grace period has expired, the worker
-		 * can no longer be found through group->poll_kworker.
+		 * can no longer be found through group->poll_task.
 		 * But it might have been already scheduled before
 		 * that - deschedule it cleanly before destroying it.
 		 */
-		kthread_cancel_delayed_work_sync(&group->poll_work);
-		atomic_set(&group->poll_scheduled, 0);
-
-		kthread_destroy_worker(kworker_to_destroy);
+		del_timer_sync(&group->poll_timer);
+		kthread_stop(task_to_destroy);
 	}
 	kfree(t);
 }
Each psi group requires a dedicated kthread_delayed_work and kthread_worker. Since no other work can be performed using psi_group's kthread_worker, the same result can be obtained using a task_struct and a timer directly. This makes psi triggering simpler by removing lists and locks involved with kthread_worker usage and eliminates the need for poll_scheduled atomic use in the hot path.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
This patch is meant to address Peter's request in [1] to pull kthread_queue_delayed_work() out from under rq->lock. This should also address the lockdep warning about possibility of a circular dependency described in [2]

[1]: https://lore.kernel.org/lkml/20200428163125.GC16910@hirez.programming.kicks-ass.net/
[2]: https://lore.kernel.org/lkml/CAJuCfpG4NkhpQvZjgXZ_3gm6Hf1QgN_eUOQ8iX9Cv1k9whLwSQ@mail.gmail.com
---
 include/linux/psi_types.h |   7 ++-
 kernel/sched/psi.c        | 113 +++++++++++++++++++++-----------------
 2 files changed, 68 insertions(+), 52 deletions(-)