
[v6,0/6] Reduce synchronize_rcu() latency (v6)

Message ID 20240308173409.335345-1-urezki@gmail.com (mailing list archive)
Series Reduce synchronize_rcu() latency (v6)

Message

Uladzislau Rezki March 8, 2024, 5:34 p.m. UTC
This is v6. It is based on Paul's "dev" branch:

HEAD: f1bfe538c7970283040a7188a291aca9f18f0c42

Please note that the patches should be applied from scratch,
i.e. v5 has to be dropped from "dev".

v5 -> v6:
 - Fix a race caused by releasing a wait-head from the GP kthread;
 - Use our own private workqueue with WQ_MEM_RECLAIM to have
   at least one execution context (see the illustration below).
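
As a quick illustration of that guaranteed execution context: a
WQ_MEM_RECLAIM workqueue always gets a dedicated rescuer kthread, and
the rescuer's task name is the workqueue name. Assuming, purely for
illustration, that the workqueue is called "sync_wq", the rescuer can
be spotted on a running system:

<snip>
# "sync_wq" is a hypothetical name; substitute whatever name is
# passed to alloc_workqueue() in the actual patch.
ps -e -o comm= | grep -x sync_wq
<snip>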

v5: https://lore.kernel.org/lkml/20240220183115.74124-1-urezki@gmail.com/
v4: https://lore.kernel.org/lkml/ZZ2bi5iPwXLgjB-f@google.com/T/
v3: https://lore.kernel.org/lkml/cd45b0b5-f86b-43fb-a5f3-47d340cd4f9f@paulmck-laptop/T/
v2: https://lore.kernel.org/all/20231030131254.488186-1-urezki@gmail.com/T/
v1: https://lore.kernel.org/lkml/20231025140915.590390-1-urezki@gmail.com/T/


Uladzislau Rezki (Sony) (6):
  rcu: Add data structures for synchronize_rcu()
  rcu: Reduce synchronize_rcu() latency
  rcu: Add a trace event for synchronize_rcu_normal()
  rcu: Support direct wake-up of synchronize_rcu() users
  rcu: Do not release a wait-head from a GP kthread
  rcu: Allocate WQ with WQ_MEM_RECLAIM bit set

 .../admin-guide/kernel-parameters.txt         |  14 +
 include/trace/events/rcu.h                    |  27 ++
 kernel/rcu/tree.c                             | 361 +++++++++++++++++-
 kernel/rcu/tree.h                             |  20 +
 kernel/rcu/tree_exp.h                         |   2 +-
 5 files changed, 422 insertions(+), 2 deletions(-)

Comments

Paul E. McKenney March 8, 2024, 9:51 p.m. UTC | #1
On Fri, Mar 08, 2024 at 06:34:03PM +0100, Uladzislau Rezki (Sony) wrote:
> This is v6. It is based on Paul's "dev" branch:
> 
> HEAD: f1bfe538c7970283040a7188a291aca9f18f0c42
> 
> Please note that the patches should be applied from scratch,
> i.e. v5 has to be dropped from "dev".
> 
> v5 -> v6:
>  - Fix a race caused by releasing a wait-head from the GP kthread;
>  - Use our own private workqueue with WQ_MEM_RECLAIM to have
>    at least one execution context.
> 
> v5: https://lore.kernel.org/lkml/20240220183115.74124-1-urezki@gmail.com/
> v4: https://lore.kernel.org/lkml/ZZ2bi5iPwXLgjB-f@google.com/T/
> v3: https://lore.kernel.org/lkml/cd45b0b5-f86b-43fb-a5f3-47d340cd4f9f@paulmck-laptop/T/
> v2: https://lore.kernel.org/all/20231030131254.488186-1-urezki@gmail.com/T/
> v1: https://lore.kernel.org/lkml/20231025140915.590390-1-urezki@gmail.com/T/

Queued in place of your earlier series, thank you!

Not urgent, but which rcutorture scenario should be pressed into service
testing this?

							Thanx, Paul

> Uladzislau Rezki (Sony) (6):
>   rcu: Add data structures for synchronize_rcu()
>   rcu: Reduce synchronize_rcu() latency
>   rcu: Add a trace event for synchronize_rcu_normal()
>   rcu: Support direct wake-up of synchronize_rcu() users
>   rcu: Do not release a wait-head from a GP kthread
>   rcu: Allocate WQ with WQ_MEM_RECLAIM bit set
> 
>  .../admin-guide/kernel-parameters.txt         |  14 +
>  include/trace/events/rcu.h                    |  27 ++
>  kernel/rcu/tree.c                             | 361 +++++++++++++++++-
>  kernel/rcu/tree.h                             |  20 +
>  kernel/rcu/tree_exp.h                         |   2 +-
>  5 files changed, 422 insertions(+), 2 deletions(-)
> 
> -- 
> 2.39.2
>
Uladzislau Rezki March 11, 2024, 8:43 a.m. UTC | #2
On Fri, Mar 08, 2024 at 01:51:29PM -0800, Paul E. McKenney wrote:
> On Fri, Mar 08, 2024 at 06:34:03PM +0100, Uladzislau Rezki (Sony) wrote:
> > This is v6. It is based on Paul's "dev" branch:
> > 
> > HEAD: f1bfe538c7970283040a7188a291aca9f18f0c42
> > 
> > Please note that the patches should be applied from scratch,
> > i.e. v5 has to be dropped from "dev".
> > 
> > v5 -> v6:
> >  - Fix a race caused by releasing a wait-head from the GP kthread;
> >  - Use our own private workqueue with WQ_MEM_RECLAIM to have
> >    at least one execution context.
> > 
> > v5: https://lore.kernel.org/lkml/20240220183115.74124-1-urezki@gmail.com/
> > v4: https://lore.kernel.org/lkml/ZZ2bi5iPwXLgjB-f@google.com/T/
> > v3: https://lore.kernel.org/lkml/cd45b0b5-f86b-43fb-a5f3-47d340cd4f9f@paulmck-laptop/T/
> > v2: https://lore.kernel.org/all/20231030131254.488186-1-urezki@gmail.com/T/
> > v1: https://lore.kernel.org/lkml/20231025140915.590390-1-urezki@gmail.com/T/
> 
> Queued in place of your earlier series, thank you!
> 
Thank you!

>
> Not urgent, but which rcutorture scenario should be pressed into service
> testing this?
> 
I tested with the '5*TREE01 5*TREE02 5*TREE03 5*TREE04' setting; apart from
that, I used some private test cases. The rcutree.rcu_normal_wake_from_gp=1
boot parameter has to be passed as well.
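
For reference, a minimal invocation along those lines (illustrative only;
the flags follow the usual kvm.sh conventions and can be adjusted) is:

<snip>
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus \
    --configs "5*TREE01 5*TREE02 5*TREE03 5*TREE04" \
    --bootargs "rcutree.rcu_normal_wake_from_gp=1" --trust-make
<snip>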

Also, "rcuscale" can be used to stress the "cur_ops->sync()" path:

<snip>
#!/usr/bin/env bash

LOOPS=1

for (( i=0; i<$LOOPS; i++ )); do
    tools/testing/selftests/rcutorture/bin/kvm.sh --memory 10G --torture rcuscale \
        --allcpus \
        --kconfig CONFIG_NR_CPUS=64 \
        --kconfig CONFIG_RCU_NOCB_CPU=y \
        --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
        --kconfig CONFIG_RCU_LAZY=n \
        --bootargs "rcuscale.nwriters=200 rcuscale.nreaders=220 rcuscale.minruntime=50000 \
                    torture.disable_onoff_at_boot rcutree.rcu_normal_wake_from_gp=1" \
        --trust-make
    echo "Done $i"
done
<snip>

--
Uladzislau Rezki
Paul E. McKenney March 11, 2024, 7:19 p.m. UTC | #3
On Mon, Mar 11, 2024 at 09:43:51AM +0100, Uladzislau Rezki wrote:
> On Fri, Mar 08, 2024 at 01:51:29PM -0800, Paul E. McKenney wrote:
> > On Fri, Mar 08, 2024 at 06:34:03PM +0100, Uladzislau Rezki (Sony) wrote:
> > > This is v6. It is based on Paul's "dev" branch:
> > > 
> > > HEAD: f1bfe538c7970283040a7188a291aca9f18f0c42
> > > 
> > > Please note that the patches should be applied from scratch,
> > > i.e. v5 has to be dropped from "dev".
> > > 
> > > v5 -> v6:
> > >  - Fix a race caused by releasing a wait-head from the GP kthread;
> > >  - Use our own private workqueue with WQ_MEM_RECLAIM to have
> > >    at least one execution context.
> > > 
> > > v5: https://lore.kernel.org/lkml/20240220183115.74124-1-urezki@gmail.com/
> > > v4: https://lore.kernel.org/lkml/ZZ2bi5iPwXLgjB-f@google.com/T/
> > > v3: https://lore.kernel.org/lkml/cd45b0b5-f86b-43fb-a5f3-47d340cd4f9f@paulmck-laptop/T/
> > > v2: https://lore.kernel.org/all/20231030131254.488186-1-urezki@gmail.com/T/
> > > v1: https://lore.kernel.org/lkml/20231025140915.590390-1-urezki@gmail.com/T/
> > 
> > Queued in place of your earlier series, thank you!
> > 
> Thank you!
> 
> >
> > Not urgent, but which rcutorture scenario should be pressed into service
> > testing this?
> > 
> I tested with the '5*TREE01 5*TREE02 5*TREE03 5*TREE04' setting; apart from
> that, I used some private test cases. The rcutree.rcu_normal_wake_from_gp=1
> boot parameter has to be passed as well.
> 
> Also, "rcuscale" can be used to stress the "cur_ops->sync()" path:
> 
> <snip>
> #!/usr/bin/env bash
> 
> LOOPS=1
> 
> for (( i=0; i<$LOOPS; i++ )); do
>     tools/testing/selftests/rcutorture/bin/kvm.sh --memory 10G --torture rcuscale \
>         --allcpus \
>         --kconfig CONFIG_NR_CPUS=64 \
>         --kconfig CONFIG_RCU_NOCB_CPU=y \
>         --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \
>         --kconfig CONFIG_RCU_LAZY=n \
>         --bootargs "rcuscale.nwriters=200 rcuscale.nreaders=220 rcuscale.minruntime=50000 \
>                     torture.disable_onoff_at_boot rcutree.rcu_normal_wake_from_gp=1" \
>         --trust-make
>     echo "Done $i"
> done
> <snip>

Very good, thank you!

Of those five options (TREE01, TREE02, TREE03, TREE04, and rcuscale),
which one should be changed so that my own testing automatically covers
the rcutree.rcu_normal_wake_from_gp=1 case?  I would guess that we should
leave out TREE03, since it covers tall rcu_node trees.  TREE01 looks
closest to the ChromeOS/Android use case, but you tell me!
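
If TREE01 does turn out to be the closest fit, the change itself would
be a one-liner against that scenario's boot-parameter file (a sketch,
assuming the usual rcutorture scenario layout):

<snip>
echo "rcutree.rcu_normal_wake_from_gp=1" >> \
    tools/testing/selftests/rcutorture/configs/rcu/TREE01.boot
<snip>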

And it might be time to rework the test cases to better align with
the use cases.  For example, I created TREE10 to cover Meta's fleet.
But ChromeOS and Android have relatively small numbers of CPUs, so it
should be possible to rework things a bit to make one of the existing
tests cover that case, while modifying other tests to take up any
situations that these changes exclude.

Thoughts?

							Thanx, Paul