[v4,1/4] rcu: Reduce synchronize_rcu() latency

A call to synchronize_rcu() can be optimized from a latency
point of view. Workloads that depend on it can benefit from this.

The delay of the wakeme_after_rcu() callback, which unblocks a waiter,
depends on several factors:

- how fast the offloading process is started. This depends on a combination of:
    - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
    - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
    - other factors.
- once started, the invoking path can be interrupted due to:
    - the time limit;
    - need_resched();
    - the batch limit being reached.
- where in a nocb list it is located;
- how fast the previous callbacks completed.
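
The factors above can be put into a toy model. This is only a sketch, not kernel code: struct batch_model, its fields and wake_delay_ns() are hypothetical names, and the per-callback cost and the pause after an interruption are assumed constants rather than measured values.

```c
#include <stddef.h>

/* Hypothetical model parameters; none of these come from the kernel. */
struct batch_model {
	size_t blimit;            /* callbacks invoked before the path is interrupted */
	unsigned long cb_ns;      /* assumed cost of invoking one callback */
	unsigned long resched_ns; /* assumed pause before invocation resumes */
};

/*
 * Estimated time until the callback at 0-based position 'pos' in the
 * list has been invoked: it waits for every callback queued ahead of
 * it, runs itself, and pays one pause per full batch completed before it.
 */
unsigned long wake_delay_ns(const struct batch_model *m, size_t pos)
{
	size_t interruptions = m->blimit ? pos / m->blimit : 0;

	return (pos + 1) * m->cb_ns + interruptions * m->resched_ns;
}
```

In this model the delay grows linearly with the position in the list, so a wakeme_after_rcu() callback queued last pays for every callback ahead of it, even when no interruption occurs.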

Example:

1. On our embedded devices I can easily trigger a scenario where
it is the last in a list of ~3600 callbacks:

<snip>
  <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
...
  <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
  <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
  <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
<snip>

2. We use cpuset/cgroup to classify tasks and assign them to
different cgroups. For example, the "background" group binds tasks
only to little CPUs, while "foreground" makes use of all CPUs.
Tasks can be migrated between groups on request when acceleration
is needed.

Below is an example of how the "surfaceflinger" task gets migrated.
Initially it is located in the "system-background" cgroup, which
allows it to run only on little cores. To speed it up, it
can be temporarily moved into the "foreground" cgroup, which allows
it to use big/all CPUs:

cgroup_attach_task():
 -> cgroup_migrate_execute()
   -> cpuset_can_attach()
     -> percpu_down_write()
       -> rcu_sync_enter()
         -> synchronize_rcu()
   -> now move tasks to the new cgroup.
 -> cgroup_migrate_finish()

<snip>
         rcuop/1-29      [000] .....  7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt
    PERFD-SERVER-1855    [000] d..1.  7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
   TimerDispatch-2768    [002] d..5.  7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4
<snip>

"Boosting a task" depends on synchronize_rcu() latency:

- the first trace line shows the completion of synchronize_rcu();
- the second shows the task being attached to a new group;
- the last shows the final step, when the migration occurs.

3. To address this drawback, maintain a separate track that consists
of synchronize_rcu() callers only. After a grace period completes,
these users are handed over to a dedicated worker, which processes
their requests.

4. This patch reduces the latency of synchronize_rcu() by
approximately 30-40% on synthetic tests. The real test case, camera
launch time, shows (time is in milliseconds):

1-run 542 vs 489 improvement 9%
2-run 540 vs 466 improvement 13%
3-run 518 vs 468 improvement 9%
4-run 531 vs 457 improvement 13%
5-run 548 vs 475 improvement 13%
6-run 509 vs 484 improvement 4%
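
The percentages above appear to be the relative reduction in launch time, truncated to an integer. The helper below is a small sketch of that arithmetic; improvement_pct() is a hypothetical name, and it assumes the first number of each pair is the baseline and the second is the patched kernel.

```c
/*
 * Relative improvement of after_ms over before_ms, in percent,
 * truncated toward zero. Returns 0 when there is no improvement.
 */
unsigned int improvement_pct(unsigned int before_ms, unsigned int after_ms)
{
	if (before_ms == 0 || after_ms >= before_ms)
		return 0;

	return (before_ms - after_ms) * 100 / before_ms;
}
```

For example, improvement_pct(542, 489) evaluates to 9 and improvement_pct(540, 466) to 13, matching the first two runs.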

Synthetic test (no "noise" from other callbacks):
Hardware: x86_64, 64 CPUs, 64GB of memory
Linux-6.6

- 10K tasks (simultaneous);
- each task does (1000 loops):
     synchronize_rcu();
     kfree(p);

default: CONFIG_RCU_NOCB_CPU: takes 54 seconds to complete all users;
patch: CONFIG_RCU_NOCB_CPU: takes 35 seconds to complete all users.

Running 60K tasks gives approximately the same results on my setup.
Please note that this is without any interaction with other types of
callbacks; otherwise the default case would be impacted much more.

5. An extra CONFIG_RCU_SR_NORMAL_DEBUG_GP kernel option is added.
It enables additional debugging that detects an incomplete grace
period for synchronize_rcu() users. If a GP has not fully passed
for any user, a warning message is emitted.

6. By default it is disabled. To enable it, do one of the
following:

echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
or pass the boot parameter "rcutree.rcu_normal_wake_from_gp=1"

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 .../admin-guide/kernel-parameters.txt         |  14 ++
 kernel/rcu/Kconfig.debug                      |  12 ++
 kernel/rcu/tree.c                             | 138 +++++++++++++++++-
 kernel/rcu/tree_exp.h                         |   2 +-
 4 files changed, 164 insertions(+), 2 deletions(-)

Message ID: 20240104162510.72773-2-urezki@gmail.com
State: New, archived
From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
To: "Paul E . McKenney" <paulmck@kernel.org>
Cc: RCU <rcu@vger.kernel.org>, Neeraj upadhyay <Neeraj.Upadhyay@amd.com>,
    Boqun Feng <boqun.feng@gmail.com>, Hillf Danton <hdanton@sina.com>,
    Joel Fernandes <joel@joelfernandes.org>, LKML <linux-kernel@vger.kernel.org>,
    Uladzislau Rezki <urezki@gmail.com>,
    Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>,
    Frederic Weisbecker <frederic@kernel.org>
Subject: [PATCH v4 1/4] rcu: Reduce synchronize_rcu() latency
Date: Thu, 4 Jan 2024 17:25:07 +0100
In-Reply-To: <20240104162510.72773-1-urezki@gmail.com>
Series: Reduce synchronize_rcu() latency(v4)
    [v4,0/4] Reduce synchronize_rcu() latency(v4)
    [v4,1/4] rcu: Reduce synchronize_rcu() latency
    [v4,2/4] rcu: Add a trace event for synchronize_rcu_normal()
    [v4,3/4] rcu: Improve handling of synchronize_rcu() users
    [v4,4/4] rcu: Support direct wake-up of synchronize_rcu() users
