Message ID | 20240305105419.21077-1-jinpu.wang@ionos.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
Series | [PATCHv2] md: Replace md_thread's wait queue with the swait variant | expand |
Dear Jack, dear Florian-Ewald, Thank you for your patch. Am 05.03.24 um 11:54 schrieb Jack Wang: > From: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> > > Replace md_thread's wait_event()/wake_up() related calls with their > simple swait~ variants to improve performance as well as memory and > stack footprint. > > Use the IDLE state for the worker thread put to sleep instead of > misusing the INTERRUPTIBLE state combined with flushing signals > just for not contributing to system's cpu load-average stats. > > Also check for sleeping thread before attempting its wake_up in > md_wakeup_thread() for avoiding unnecessary spinlock contention. > > With this patch (backported) on a kernel 6.1, the IOPS improved > by around 4% with raid1 and/or raid5, in high IO load scenarios. It’d be great if you elaborated on your test setup. Should this have? Fixes: 93588e2284b6 ("[PATCH] md: make md threads interruptible again") I ask, because the simple waitqueue (swait) implementation was only introduced in 2016, so 11 years later than the mentioned commit. > Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> > Signed-off-by: Jack Wang <jinpu.wang@ionos.com> > --- > v2: fix a type error for thread > drivers/md/md.c | 23 +++++++---------------- > drivers/md/md.h | 2 +- > 2 files changed, 8 insertions(+), 17 deletions(-) > > diff --git a/drivers/md/md.c b/drivers/md/md.c > index 48ae2b1cb57a..14d12ee4b06b 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -7970,22 +7970,12 @@ static int md_thread(void *arg) > * many dirty RAID5 blocks. > */ > > - allow_signal(SIGKILL); > while (!kthread_should_stop()) { > > - /* We need to wait INTERRUPTIBLE so that > - * we don't add to the load-average. > - * That means we need to be sure no signals are > - * pending > - */ > - if (signal_pending(current)) > - flush_signals(current); > - > - wait_event_interruptible_timeout > - (thread->wqueue, > - test_bit(THREAD_WAKEUP, &thread->flags) > - || kthread_should_stop() || kthread_should_park(), > - thread->timeout); > + swait_event_idle_timeout_exclusive(thread->wqueue, > + test_bit(THREAD_WAKEUP, &thread->flags) || > + kthread_should_stop() || kthread_should_park(), > + thread->timeout); > > clear_bit(THREAD_WAKEUP, &thread->flags); > if (kthread_should_park()) > @@ -8017,7 +8007,8 @@ void md_wakeup_thread(struct md_thread __rcu *thread) > if (t) { > pr_debug("md: waking up MD thread %s.\n", t->tsk->comm); > set_bit(THREAD_WAKEUP, &t->flags); > - wake_up(&t->wqueue); > + if (swq_has_sleeper(&t->wqueue)) > + swake_up_one(&t->wqueue); > } > rcu_read_unlock(); > } > @@ -8032,7 +8023,7 @@ struct md_thread *md_register_thread(void (*run) (struct md_thread *), > if (!thread) > return NULL; > > - init_waitqueue_head(&thread->wqueue); > + init_swait_queue_head(&thread->wqueue); > > thread->run = run; > thread->mddev = mddev; > diff --git a/drivers/md/md.h b/drivers/md/md.h > index b2076a165c10..1dc01bc5c038 100644 > --- a/drivers/md/md.h > +++ b/drivers/md/md.h > @@ -716,7 +716,7 @@ static inline void sysfs_unlink_rdev(struct mddev *mddev, struct md_rdev *rdev) > struct md_thread { > void (*run) (struct md_thread *thread); > struct mddev *mddev; > - wait_queue_head_t wqueue; > + struct swait_queue_head wqueue; > unsigned long flags; > struct task_struct *tsk; > unsigned long timeout; Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Kind regards, Paul
Hi Paul, Thank you for your answer. Regarding the INTERRUPTIBLE wait: - it was introduced that time only for _not_ contributing to load-average. - now obsolete, see also explanation in our patch. Regarding our tests: - we used 100 logical volumes (and more) doing full-blown random rd/wr IO (fio) to the same md-raid1/raid5 device. Best regards, On Tue, Mar 5, 2024 at 1:01 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote: > > Dear Jack, dear Florian-Ewald, > > > Thank you for your patch. > > Am 05.03.24 um 11:54 schrieb Jack Wang: > > From: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> > > > > Replace md_thread's wait_event()/wake_up() related calls with their > > simple swait~ variants to improve performance as well as memory and > > stack footprint. > > > > Use the IDLE state for the worker thread put to sleep instead of > > misusing the INTERRUPTIBLE state combined with flushing signals > > just for not contributing to system's cpu load-average stats. > > > > Also check for sleeping thread before attempting its wake_up in > > md_wakeup_thread() for avoiding unnecessary spinlock contention. > > > > With this patch (backported) on a kernel 6.1, the IOPS improved > > by around 4% with raid1 and/or raid5, in high IO load scenarios. > > It’d be great if you elaborated on your test setup. > > Should this have? > > Fixes: 93588e2284b6 ("[PATCH] md: make md threads interruptible again") > > I ask, because the simple waitqueue (swait) implementation was only > introduced in 2016, so 11 years later than the mentioned commit. > > > Signed-off-by: Florian-Ewald Mueller <florian-ewald.mueller@ionos.com> > > Signed-off-by: Jack Wang <jinpu.wang@ionos.com> > > --- > > v2: fix a type error for thread > > drivers/md/md.c | 23 +++++++---------------- > > drivers/md/md.h | 2 +- > > 2 files changed, 8 insertions(+), 17 deletions(-) > > > > diff --git a/drivers/md/md.c b/drivers/md/md.c > > index 48ae2b1cb57a..14d12ee4b06b 100644 > > --- a/drivers/md/md.c > > +++ b/drivers/md/md.c > > @@ -7970,22 +7970,12 @@ static int md_thread(void *arg) > > * many dirty RAID5 blocks. > > */ > > > > - allow_signal(SIGKILL); > > while (!kthread_should_stop()) { > > > > - /* We need to wait INTERRUPTIBLE so that > > - * we don't add to the load-average. > > - * That means we need to be sure no signals are > > - * pending > > - */ > > - if (signal_pending(current)) > > - flush_signals(current); > > - > > - wait_event_interruptible_timeout > > - (thread->wqueue, > > - test_bit(THREAD_WAKEUP, &thread->flags) > > - || kthread_should_stop() || kthread_should_park(), > > - thread->timeout); > > + swait_event_idle_timeout_exclusive(thread->wqueue, > > + test_bit(THREAD_WAKEUP, &thread->flags) || > > + kthread_should_stop() || kthread_should_park(), > > + thread->timeout); > > > > clear_bit(THREAD_WAKEUP, &thread->flags); > > if (kthread_should_park()) > > @@ -8017,7 +8007,8 @@ void md_wakeup_thread(struct md_thread __rcu *thread) > > if (t) { > > pr_debug("md: waking up MD thread %s.\n", t->tsk->comm); > > set_bit(THREAD_WAKEUP, &t->flags); > > - wake_up(&t->wqueue); > > + if (swq_has_sleeper(&t->wqueue)) > > + swake_up_one(&t->wqueue); > > } > > rcu_read_unlock(); > > } > > @@ -8032,7 +8023,7 @@ struct md_thread *md_register_thread(void (*run) (struct md_thread *), > > if (!thread) > > return NULL; > > > > - init_waitqueue_head(&thread->wqueue); > > + init_swait_queue_head(&thread->wqueue); > > > > thread->run = run; > > thread->mddev = mddev; > > diff --git a/drivers/md/md.h b/drivers/md/md.h > > index b2076a165c10..1dc01bc5c038 100644 > > --- a/drivers/md/md.h > > +++ b/drivers/md/md.h > > @@ -716,7 +716,7 @@ static inline void sysfs_unlink_rdev(struct mddev *mddev, struct md_rdev *rdev) > > struct md_thread { > > void (*run) (struct md_thread *thread); > > struct mddev *mddev; > > - wait_queue_head_t wqueue; > > + struct swait_queue_head wqueue; > > unsigned long flags; > > struct task_struct *tsk; > > unsigned long timeout; > > Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> > > > Kind regards, > > Paul
Dear Florian-Ewald, Thank you for your quick reply. Am 05.03.24 um 13:25 schrieb Florian-Ewald Müller: > Regarding the INTERRUPTIBLE wait: > - it was introduced that time only for _not_ contributing to load-average. > - now obsolete, see also explanation in our patch. Yes, I understood that. My question was regarding, if this should be backported to the stable series. > Regarding our tests: > - we used 100 logical volumes (and more) doing full-blown random rd/wr > IO (fio) to the same md-raid1/raid5 device. Do you have a script to create these volumes and running the fio test, so it’s easy to reproduce? Was it a spinning disk or SSD device? Kind regards, Paul
Hi Paul, Regarding the IDLE state waiting: - yes, this could/should be backported to any version which already has swait_event_idle() macro(s) - and so also TASK_IDLE - defined. About our fio driven tests - we used fio "ini" files similar to: [global] description=Random Read/Write Access Pattern bssplit=512/8:1k/4:2k/4:4k/30:8k/20:16k/10:32k/8:64k/12:128k/4 cpus_allowed=0-31 cpus_allowed_policy=split numa_mem_policy=local fadvise_hint=0 rw=randrw direct=1 size=100% random_distribution=zipf:1.2 percentage_random=80 ioengine=libaio iodepth=128 numjobs=1 gtod_reduce=1 group_reporting ramp_time=60 runtime=300 time_based [job0] filename=/dev/disk/by-name/vol0 [job1] filename=/dev/disk/by-name/vol1 ... and so on for >= 100 logical volumes. Best regards, On Tue, Mar 5, 2024 at 3:14 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote: > > Dear Florian-Ewald, > > > Thank you for your quick reply. > > Am 05.03.24 um 13:25 schrieb Florian-Ewald Müller: > > > Regarding the INTERRUPTIBLE wait: > > - it was introduced that time only for _not_ contributing to load-average. > > - now obsolete, see also explanation in our patch. > > Yes, I understood that. My question was regarding, if this should be > backported to the stable series. > > > Regarding our tests: > > - we used 100 logical volumes (and more) doing full-blown random rd/wr > > IO (fio) to the same md-raid1/raid5 device. > > Do you have a script to create these volumes and running the fio test, > so it’s easy to reproduce? > > Was it a spinning disk or SSD device? > > > Kind regards, > > Paul
PS: backend devices were/are SSDs. On Tue, Mar 5, 2024 at 3:52 PM Florian-Ewald Müller <florian-ewald.mueller@ionos.com> wrote: > > Hi Paul, > > Regarding the IDLE state waiting: > - yes, this could/should be backported to any version which already > has swait_event_idle() macro(s) - and so also TASK_IDLE - defined. > > About our fio driven tests > - we used fio "ini" files similar to: > > [global] > description=Random Read/Write Access Pattern > bssplit=512/8:1k/4:2k/4:4k/30:8k/20:16k/10:32k/8:64k/12:128k/4 > cpus_allowed=0-31 > cpus_allowed_policy=split > numa_mem_policy=local > fadvise_hint=0 > rw=randrw > direct=1 > size=100% > random_distribution=zipf:1.2 > percentage_random=80 > ioengine=libaio > iodepth=128 > numjobs=1 > gtod_reduce=1 > group_reporting > ramp_time=60 > runtime=300 > time_based > > [job0] > filename=/dev/disk/by-name/vol0 > > [job1] > filename=/dev/disk/by-name/vol1 > > ... and so on for >= 100 logical volumes. > > Best regards, > > On Tue, Mar 5, 2024 at 3:14 PM Paul Menzel <pmenzel@molgen.mpg.de> wrote: > > > > Dear Florian-Ewald, > > > > > > Thank you for your quick reply. > > > > Am 05.03.24 um 13:25 schrieb Florian-Ewald Müller: > > > > > Regarding the INTERRUPTIBLE wait: > > > - it was introduced that time only for _not_ contributing to load-average. > > > - now obsolete, see also explanation in our patch. > > > > Yes, I understood that. My question was regarding, if this should be > > backported to the stable series. > > > > > Regarding our tests: > > > - we used 100 logical volumes (and more) doing full-blown random rd/wr > > > IO (fio) to the same md-raid1/raid5 device. > > > > Do you have a script to create these volumes and running the fio test, > > so it’s easy to reproduce? > > > > Was it a spinning disk or SSD device? > > > > > > Kind regards, > > > > Paul
diff --git a/drivers/md/md.c b/drivers/md/md.c index 48ae2b1cb57a..14d12ee4b06b 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -7970,22 +7970,12 @@ static int md_thread(void *arg) * many dirty RAID5 blocks. */ - allow_signal(SIGKILL); while (!kthread_should_stop()) { - /* We need to wait INTERRUPTIBLE so that - * we don't add to the load-average. - * That means we need to be sure no signals are - * pending - */ - if (signal_pending(current)) - flush_signals(current); - - wait_event_interruptible_timeout - (thread->wqueue, - test_bit(THREAD_WAKEUP, &thread->flags) - || kthread_should_stop() || kthread_should_park(), - thread->timeout); + swait_event_idle_timeout_exclusive(thread->wqueue, + test_bit(THREAD_WAKEUP, &thread->flags) || + kthread_should_stop() || kthread_should_park(), + thread->timeout); clear_bit(THREAD_WAKEUP, &thread->flags); if (kthread_should_park()) @@ -8017,7 +8007,8 @@ void md_wakeup_thread(struct md_thread __rcu *thread) if (t) { pr_debug("md: waking up MD thread %s.\n", t->tsk->comm); set_bit(THREAD_WAKEUP, &t->flags); - wake_up(&t->wqueue); + if (swq_has_sleeper(&t->wqueue)) + swake_up_one(&t->wqueue); } rcu_read_unlock(); } @@ -8032,7 +8023,7 @@ struct md_thread *md_register_thread(void (*run) (struct md_thread *), if (!thread) return NULL; - init_waitqueue_head(&thread->wqueue); + init_swait_queue_head(&thread->wqueue); thread->run = run; thread->mddev = mddev; diff --git a/drivers/md/md.h b/drivers/md/md.h index b2076a165c10..1dc01bc5c038 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -716,7 +716,7 @@ static inline void sysfs_unlink_rdev(struct mddev *mddev, struct md_rdev *rdev) struct md_thread { void (*run) (struct md_thread *thread); struct mddev *mddev; - wait_queue_head_t wqueue; + struct swait_queue_head wqueue; unsigned long flags; struct task_struct *tsk; unsigned long timeout;