Message ID | 20200803163029.1997-1-urezki@gmail.com
---|---
State | New, archived
Series | [RFC-PROTOTYPE,1/1] mm: Add __GFP_FAST_TRY flag
On 8/3/20 6:30 PM, Uladzislau Rezki (Sony) wrote:
> Some background and kfree_rcu()
> ===============================
> The pointers to be freed are stored in the per-cpu array to improve
> performance, to enable an easier-to-use API, to accommodate vmalloc
> memmory and to support a single argument of the kfree_rcu() when only
> a pointer is passed. More details are below.
>
> In order to maintain such per-CPU arrays there is a need in dynamic
> allocation when a current array is fully populated and a new block is
> required. See below the example:
>
>  0 1 2 3         0 1 2 3
> |p|p|p|p|  ->  |p|p|p|p|  ->  NULL
>
> there are two pointer-blocks, each one can store 4 addresses
> which will be freed after a grace period is passed. In reality
> we store PAGE_SIZE / sizeof(void *).

So what do you actually have without the dynamic allocation, 8 addresses or
PAGE_SIZE / sizeof(void *) addresses? And how many dynamically allocated pages
did you observe you might need in practice? Can the benefit of allocating up
to X pages dynamically from the pcplists be somehow quantified, vs a fixed
number of pages held just for that purpose + fallback?

...

> A number of pre-fetched elements seems does not depend on amount of the
> physical memory in a system. In my case it is 63 pages. This step is not

It may depend, if you tune the vm.percpu_pagelist_fraction sysctl. But I
wouldn't know the exact formulas immediately. See
pageset_set_high_and_batch(). In any case, for your purpose the 'high' value
(in e.g. /proc/zoneinfo) is more relevant (it is the maximum number of pages
you might find cached) than the 'batch' (how much is cached in one refill).

> lock-less. It uses spinlock_t for accessing to the body's zone. This
> step is fully covered in the rmqueue_bulk() function.
>
> Summarizing. The __GFP_FAST_TRY covers only [1] and can not do step [2],
> due to the fact that [2] acquires spinlock_t. It implies that it is super
> fast, but a higher rate of fails is also expected.
>
> Usage: __get_free_page(__GFP_FAST_TRY);
>
> 2) There was a proposal from Matthew Wilcox: https://lkml.org/lkml/2020/7/31/1015
>
> <snip>
> On non-RT, we could make that lock a raw spinlock. On RT, we could
> decline to take the lock. We'd need to abstract the spin_lock() away
> behind zone_lock(zone), but that should be OK.
> <snip>
>
> It would be great to use any existing flag, say GFP_NOWAIT. Suppose we
> decline to take the lock across the page allocator for RT. But there is
> at least one path that does it outside of the page allocator. GFP_NOWAIT
> can wakeup the kswapd, whereas a "wake-up path" uses sleepable lock:
>
> wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait).
>
> Probably it can be fixed by the excluding of waking of the kswapd process
> defining something like below:

Is something missing here?

> what is equal to zero and i am not sure if __get_free_page(0) handles
> all that correctly, though it allocates and seems working on my test
> machine! Please note it is related to "if we can reuse existing flags".
>
> In the meantime, please see below for a patch that adds a __GFP_FAST_TRY,
> which can at least serve as a baseline against which other proposals can
> be compared. The patch is based on the 5.8.0-rc3.
>
> Please RFC.

At first glance __GFP_FAST_TRY (more descriptive name? __GFP_NO_LOCKS?) seems
better than doing weird things with GFP_NOWAIT, but it depends on the real
benefits (hence my first questions).

Thanks,
Vlastimil
On Tue, Aug 04, 2020 at 07:02:14PM +0200, Vlastimil Babka wrote:
> > 2) There was a proposal from Matthew Wilcox: https://lkml.org/lkml/2020/7/31/1015
> >
> > <snip>
> > On non-RT, we could make that lock a raw spinlock. On RT, we could
> > decline to take the lock. We'd need to abstract the spin_lock() away
> > behind zone_lock(zone), but that should be OK.
> > <snip>
> >
> > It would be great to use any existing flag, say GFP_NOWAIT. Suppose we
> > decline to take the lock across the page allocator for RT. But there is
> > at least one path that does it outside of the page allocator. GFP_NOWAIT
> > can wakeup the kswapd, whereas a "wake-up path" uses sleepable lock:
> >
> > wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait).
> >
> > Probably it can be fixed by the excluding of waking of the kswapd process
> > defining something like below:
>
> Is something missing here?
>
> > what is equal to zero and i am not sure if __get_free_page(0) handles
> > all that correctly, though it allocates and seems working on my test
> > machine! Please note it is related to "if we can reuse existing flags".
> >
> > In the meantime, please see below for a patch that adds a __GFP_FAST_TRY,
> > which can at least serve as a baseline against which other proposals can
> > be compared. The patch is based on the 5.8.0-rc3.
> >
> > Please RFC.
>
> At first glance __GFP_FAST_TRY (more descriptive name? __GFP_NO_LOCKS?) seems
> better than doing weird things with GFP_NOWAIT, but depends on the real benefits
> (hence my first questions).

I think what Vlad is trying to say is that even GFP_NOWAIT will wake
kswapd, which involves taking a spinlock. If you specify 0 in your GFP
flags, then we won't wake kswapd. So a simple:

#define GFP_NOLOCKS 0

should do the trick (modulo various casting, blah blah blah)
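Purely as an illustration of the zero-flag idea above (GFP_NOLOCKS is only a
suggested name, not an existing kernel define, and the helper below is
hypothetical), a call site might look like this:

<snip>
    #include <linux/gfp.h>

    /* Suggested name only; the __force cast is the "casting" alluded to above. */
    #define GFP_NOLOCKS	((__force gfp_t)0)

    static unsigned long try_get_page_no_kswapd(void)
    {
        /*
         * Without __GFP_KSWAPD_RECLAIM set, the allocator will not wake
         * kswapd; a zero return is an expected outcome under pressure.
         */
        return __get_free_page(GFP_NOLOCKS);
    }
<snip>

Note that with the current allocator this only avoids the kswapd wake-up; the
pcp refill path can still take zone->lock, which is what the thread discusses
next.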
On 8/4/20 7:12 PM, Matthew Wilcox wrote:
> On Tue, Aug 04, 2020 at 07:02:14PM +0200, Vlastimil Babka wrote:
>> > 2) There was a proposal from Matthew Wilcox: https://lkml.org/lkml/2020/7/31/1015
>> >
>> > <snip>
>> > On non-RT, we could make that lock a raw spinlock. On RT, we could
>> > decline to take the lock. We'd need to abstract the spin_lock() away
>> > behind zone_lock(zone), but that should be OK.
>> > <snip>
>> >
>> > It would be great to use any existing flag, say GFP_NOWAIT. Suppose we
>> > decline to take the lock across the page allocator for RT. But there is
>> > at least one path that does it outside of the page allocator. GFP_NOWAIT
>> > can wakeup the kswapd, whereas a "wake-up path" uses sleepable lock:
>> >
>> > wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait).
>> >
>> > Probably it can be fixed by the excluding of waking of the kswapd process
>> > defining something like below:
>>
>> Is something missing here?
>>
>> > what is equal to zero and i am not sure if __get_free_page(0) handles
>> > all that correctly, though it allocates and seems working on my test
>> > machine! Please note it is related to "if we can reuse existing flags".
>> >
>> > In the meantime, please see below for a patch that adds a __GFP_FAST_TRY,
>> > which can at least serve as a baseline against which other proposals can
>> > be compared. The patch is based on the 5.8.0-rc3.
>> >
>> > Please RFC.
>>
>> At first glance __GFP_FAST_TRY (more descriptive name? __GFP_NO_LOCKS?) seems
>> better than doing weird things with GFP_NOWAIT, but depends on the real benefits
>> (hence my first questions).
>
> I think what Vlad is trying to say is that even GFP_NOWAIT will wake
> kswapd, which involves taking a spinlock. If you specify 0 in your GFP
> flags, then we won't wake kswapd. So a simple:
>
> #define GFP_NOLOCKS 0
>
> should do the trick (modulo various casting, blah blah blah)

Ah, you're right, waking up kswapd is only done with __GFP_KSWAPD_RECLAIM, and
GFP_NOWAIT equals exactly that. So that's easy to avoid for the rcu allocation.

But still, IIUC, option 2) would mean that even with "#define GFP_NOLOCKS 0" we
would need to abstract away the zone lock, behave differently depending on the
kernel being RT, and inadvertently change other users that happen to specify a
gfp where "gfp & GFP_RECLAIM_MASK == 0" (or however exactly we would check
whether we can take the lock on an RT kernel). That sounds too complicated to me.
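To make concrete what that "abstract away the zone lock" option would involve
(zone_lock() below is a hypothetical sketch, not an existing kernel helper; the
RT policy shown is only the one being discussed):

<snip>
    /*
     * Hypothetical "option 2" wrapper: on PREEMPT_RT, decline to take the
     * (sleepable) zone lock when the caller passed no reclaim flags at all.
     * Every user of the buddy lists would then have to handle 'false'.
     */
    static inline bool zone_lock(struct zone *zone, gfp_t gfp_mask,
                                 unsigned long *flags)
    {
        if (IS_ENABLED(CONFIG_PREEMPT_RT) && !(gfp_mask & GFP_RECLAIM_MASK))
            return false;

        spin_lock_irqsave(&zone->lock, *flags);
        return true;
    }
<snip>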
On Tue, Aug 04, 2020 at 07:02:14PM +0200, Vlastimil Babka wrote:
> On 8/3/20 6:30 PM, Uladzislau Rezki (Sony) wrote:
> > Some background and kfree_rcu()
> > ===============================
> > The pointers to be freed are stored in the per-cpu array to improve
> > performance, to enable an easier-to-use API, to accommodate vmalloc
> > memmory and to support a single argument of the kfree_rcu() when only
> > a pointer is passed. More details are below.
> >
> > In order to maintain such per-CPU arrays there is a need in dynamic
> > allocation when a current array is fully populated and a new block is
> > required. See below the example:
> >
> >  0 1 2 3         0 1 2 3
> > |p|p|p|p|  ->  |p|p|p|p|  ->  NULL
> >
> > there are two pointer-blocks, each one can store 4 addresses
> > which will be freed after a grace period is passed. In reality
> > we store PAGE_SIZE / sizeof(void *).
>
> So what do you actually have without the dynamic allocation, 8 addresses or
> PAGE_SIZE / sizeof(void *) addresses? And how many dynamically allocated pages
> did you observe you might need in practice? Can it be somehow quantified the
> benefit that you are able to allocate up to X pages dynamically from the
> pcplists, vs a fixed number of pages held just for that purpose + fallback?
>
We have PAGE_SIZE / sizeof(void *). The ASCII art above was just an example :)

Answering the second question about a fixed number of preloaded pages, please
see some concerns:

- It is hard to achieve because the logic does not stick to a certain static
  test case, i.e. it depends on how heavily kfree_rcu(single/double) is used.
  Based on that "how heavily", a number of pages builds up until the
  drain/reclaimer thread frees them.

- Preloading pages and keeping them for internal use, IMHO, seems not optimal
  from the point of view of wasting resources. It is better to have a fast
  mechanism to request a page and release it back for the needs of others.
  As described above, we do not know how much we will need.

- As for the fallback, that is something we would like to avoid (please see
  the cover letter). Just to mention one concern here: for the single-argument
  case it is an entrance to synchronize_rcu(), which can significantly slow
  down the reclamation process - exactly what we would like to speed up.

> > A number of pre-fetched elements seems does not depend on amount of the
> > physical memory in a system. In my case it is 63 pages. This step is not
>
> It may depend, if you tune vm.percpu_pagelist_fraction sysctl. But I wouldn't
> know the exact formulas immediately. See pageset_set_high_and_batch(). In any
> case for your purpose the 'high' value (in e.g. /proc/zoneinfo) is more relevant
> (it means the maximum pages you might find cached) for you than the 'batch' (how
> much is cached in one refill).
>
Thanks. I will have a look at it :) It is good that we can control it!

> > lock-less. It uses spinlock_t for accessing to the body's zone. This
> > step is fully covered in the rmqueue_bulk() function.
> >
> > Summarizing. The __GFP_FAST_TRY covers only [1] and can not do step [2],
> > due to the fact that [2] acquires spinlock_t. It implies that it is super
> > fast, but a higher rate of fails is also expected.
> >
> > Usage: __get_free_page(__GFP_FAST_TRY);
> >
> > 2) There was a proposal from Matthew Wilcox: https://lkml.org/lkml/2020/7/31/1015
> >
> > <snip>
> > On non-RT, we could make that lock a raw spinlock. On RT, we could
> > decline to take the lock. We'd need to abstract the spin_lock() away
> > behind zone_lock(zone), but that should be OK.
> > <snip>
> >
> > It would be great to use any existing flag, say GFP_NOWAIT. Suppose we
> > decline to take the lock across the page allocator for RT. But there is
> > at least one path that does it outside of the page allocator. GFP_NOWAIT
> > can wakeup the kswapd, whereas a "wake-up path" uses sleepable lock:
> >
> > wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait).
> >
> > Probably it can be fixed by the excluding of waking of the kswapd process
> > defining something like below:
>
> Is something missing here?
>
I was talking about how to bypass the waking up of kswapd, which uses a
sleepable lock. So __get_free_page(0) would do the trick. But of course that
is not enough, because we also have the prefetching pcp-list logic.

> > what is equal to zero and i am not sure if __get_free_page(0) handles
> > all that correctly, though it allocates and seems working on my test
> > machine! Please note it is related to "if we can reuse existing flags".
> >
> > In the meantime, please see below for a patch that adds a __GFP_FAST_TRY,
> > which can at least serve as a baseline against which other proposals can
> > be compared. The patch is based on the 5.8.0-rc3.
> >
> > Please RFC.
>
> At first glance __GFP_FAST_TRY (more descriptive name? __GFP_NO_LOCKS?) seems
> better than doing weird things with GFP_NOWAIT, but depends on the real benefits
> (hence my first questions).
>
No, I do not want to break GFP_NOWAIT, as Matthew mentioned later :)
__GFP_NO_LOCKS looks nice. I think something like "TRY" should be added as
well, for example __GFP_NO_LOCKS_FAST_TRY. I am glad for the reaction on it :)

Thank you, Vlastimil!

--
Vlad Rezki
On Tue, Aug 04, 2020 at 06:12:03PM +0100, Matthew Wilcox wrote:
> On Tue, Aug 04, 2020 at 07:02:14PM +0200, Vlastimil Babka wrote:
> > > 2) There was a proposal from Matthew Wilcox: https://lkml.org/lkml/2020/7/31/1015
> > >
> > > <snip>
> > > On non-RT, we could make that lock a raw spinlock. On RT, we could
> > > decline to take the lock. We'd need to abstract the spin_lock() away
> > > behind zone_lock(zone), but that should be OK.
> > > <snip>
> > >
> > > It would be great to use any existing flag, say GFP_NOWAIT. Suppose we
> > > decline to take the lock across the page allocator for RT. But there is
> > > at least one path that does it outside of the page allocator. GFP_NOWAIT
> > > can wakeup the kswapd, whereas a "wake-up path" uses sleepable lock:
> > >
> > > wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait).
> > >
> > > Probably it can be fixed by the excluding of waking of the kswapd process
> > > defining something like below:
> >
> > Is something missing here?
> >
> > > what is equal to zero and i am not sure if __get_free_page(0) handles
> > > all that correctly, though it allocates and seems working on my test
> > > machine! Please note it is related to "if we can reuse existing flags".
> > >
> > > In the meantime, please see below for a patch that adds a __GFP_FAST_TRY,
> > > which can at least serve as a baseline against which other proposals can
> > > be compared. The patch is based on the 5.8.0-rc3.
> > >
> > > Please RFC.
> >
> > At first glance __GFP_FAST_TRY (more descriptive name? __GFP_NO_LOCKS?) seems
> > better than doing weird things with GFP_NOWAIT, but depends on the real benefits
> > (hence my first questions).
>
> I think what Vlad is trying to say is that even GFP_NOWAIT will wake
> kswapd, which involves taking a spinlock. If you specify 0 in your GFP
> flags, then we won't wake kswapd. So a simple:
>
> #define GFP_NOLOCKS 0
>
> should do the trick (modulo various casting, blah blah blah)
>
Yep, I meant that. Thank you, Matthew!

--
Vlad Rezki
On Tue, Aug 04, 2020 at 07:34:18PM +0200, Vlastimil Babka wrote:
> On 8/4/20 7:12 PM, Matthew Wilcox wrote:
> > On Tue, Aug 04, 2020 at 07:02:14PM +0200, Vlastimil Babka wrote:
> >> > 2) There was a proposal from Matthew Wilcox: https://lkml.org/lkml/2020/7/31/1015
> >> >
> >> > <snip>
> >> > On non-RT, we could make that lock a raw spinlock. On RT, we could
> >> > decline to take the lock. We'd need to abstract the spin_lock() away
> >> > behind zone_lock(zone), but that should be OK.
> >> > <snip>
> >> >
> >> > It would be great to use any existing flag, say GFP_NOWAIT. Suppose we
> >> > decline to take the lock across the page allocator for RT. But there is
> >> > at least one path that does it outside of the page allocator. GFP_NOWAIT
> >> > can wakeup the kswapd, whereas a "wake-up path" uses sleepable lock:
> >> >
> >> > wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait).
> >> >
> >> > Probably it can be fixed by the excluding of waking of the kswapd process
> >> > defining something like below:
> >>
> >> Is something missing here?
> >>
> >> > what is equal to zero and i am not sure if __get_free_page(0) handles
> >> > all that correctly, though it allocates and seems working on my test
> >> > machine! Please note it is related to "if we can reuse existing flags".
> >> >
> >> > In the meantime, please see below for a patch that adds a __GFP_FAST_TRY,
> >> > which can at least serve as a baseline against which other proposals can
> >> > be compared. The patch is based on the 5.8.0-rc3.
> >> >
> >> > Please RFC.
> >>
> >> At first glance __GFP_FAST_TRY (more descriptive name? __GFP_NO_LOCKS?) seems
> >> better than doing weird things with GFP_NOWAIT, but depends on the real benefits
> >> (hence my first questions).
> >
> > I think what Vlad is trying to say is that even GFP_NOWAIT will wake
> > kswapd, which involves taking a spinlock. If you specify 0 in your GFP
> > flags, then we won't wake kswapd. So a simple:
> >
> > #define GFP_NOLOCKS 0
> >
> > should do the trick (modulo various casting, blah blah blah)
>
> Ah, you're right, waking up kswapd is is only done with __GFP_KSWAPD_RECLAIM and
> GFP_NOWAIT equals to that. So that's easy to avoid for the rcu allocation.
>
> But still IIUC option 2) would mean that even with "#define GFP_NOLOCKS 0" would
> mean we need to abstract away the zone lock, and behave differently depending on
> the kernel being RT, and inadvertedly changing other users that happen to
> specify gfp where "gfp & GFP_RECLAIM_MASK == 0" (or however we would exactly
> check if we can take the lock on RT kernel). That sounds too complicated to me.
>
I think a different behaviour, I mean RT/non-RT, is not a way forward, because
things would become over-complicated. Please note, the proposed variant is
generic: it provides fast access to the pcp-cache, which can be done lock-less.

If we could extend the "fast path" to even do a lock-less prefetch from the
buddy (making the fast path fully lock-less), that would be fantastic, but it
is a bit out of the question. For example, implementing the removing/inserting
of pages from "zone->free_area" as lock-less: llist_add()/llist_del(). But that
is theory and at a high level; during investigation things might become
complicated.

--
Vlad Rezki
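For reference, the llist primitives mentioned above would be used roughly as in
the sketch below (the structure and helpers are made up for illustration). Note
that llist_del_first() still needs a single consumer or external serialization,
which hints at why a fully lock-less free_area would not be a trivial change:

<snip>
    #include <linux/kernel.h>
    #include <linux/llist.h>

    /* Hypothetical lock-less cache of free entries, illustrative only. */
    struct free_entry {
        struct llist_node node;
    };

    static LLIST_HEAD(free_cache);

    static void free_cache_push(struct free_entry *e)
    {
        /* Lock-less push; safe with any number of concurrent producers. */
        llist_add(&e->node, &free_cache);
    }

    static struct free_entry *free_cache_pop(void)
    {
        /* Only one consumer may run llist_del_first() at a time. */
        struct llist_node *n = llist_del_first(&free_cache);

        return n ? container_of(n, struct free_entry, node) : NULL;
    }
<snip>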
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..2df8d1646102 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,8 +39,9 @@ struct vm_area_struct;
 #define ___GFP_HARDWALL		0x100000u
 #define ___GFP_THISNODE		0x200000u
 #define ___GFP_ACCOUNT		0x400000u
+#define ___GFP_FAST_TRY		0x800000u
 #ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP	0x800000u
+#define ___GFP_NOLOCKDEP	0x1000000u
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
@@ -219,12 +220,13 @@ struct vm_area_struct;
 #define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
 #define __GFP_COMP	((__force gfp_t)___GFP_COMP)
 #define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
+#define __GFP_FAST_TRY	((__force gfp_t)___GFP_FAST_TRY)
 
 /* Disable lockdep for GFP context tracking */
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (24 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /**
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 5fb752034386..0d2b5678029b 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -45,6 +45,7 @@
 	{(unsigned long)__GFP_RECLAIMABLE,	"__GFP_RECLAIMABLE"},	\
 	{(unsigned long)__GFP_MOVABLE,		"__GFP_MOVABLE"},	\
 	{(unsigned long)__GFP_ACCOUNT,		"__GFP_ACCOUNT"},	\
+	{(unsigned long)__GFP_FAST_TRY,		"__GFP_FAST_TRY"},	\
 	{(unsigned long)__GFP_WRITE,		"__GFP_WRITE"},		\
 	{(unsigned long)__GFP_RECLAIM,		"__GFP_RECLAIM"},	\
 	{(unsigned long)__GFP_DIRECT_RECLAIM,	"__GFP_DIRECT_RECLAIM"},\
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..3733169384d2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3307,7 +3307,8 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
 }
 
 /* Remove page from the per-cpu list, caller must protect the list */
-static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
+static struct page *__rmqueue_pcplist(struct zone *zone, gfp_t gfp_flags,
+			int migratetype,
 			unsigned int alloc_flags,
 			struct per_cpu_pages *pcp,
 			struct list_head *list)
@@ -3316,7 +3317,8 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 
 	do {
 		if (list_empty(list)) {
-			pcp->count += rmqueue_bulk(zone, 0,
+			if (!(gfp_flags & __GFP_FAST_TRY))
+				pcp->count += rmqueue_bulk(zone, 0,
 					pcp->batch, list,
 					migratetype, alloc_flags);
 			if (unlikely(list_empty(list)))
@@ -3340,11 +3342,23 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	struct list_head *list;
 	struct page *page;
 	unsigned long flags;
+	int i;
 
 	local_irq_save(flags);
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
-	list = &pcp->lists[migratetype];
-	page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
+
+	if (!(gfp_flags & __GFP_FAST_TRY)) {
+		list = &pcp->lists[migratetype];
+		page = __rmqueue_pcplist(zone, gfp_flags, migratetype, alloc_flags, pcp, list);
+	} else {
+		for (i = migratetype; i < MIGRATE_PCPTYPES; i++) {
+			list = &pcp->lists[i];
+			page = __rmqueue_pcplist(zone, gfp_flags, i, alloc_flags, pcp, list);
+			if (page)
+				break;
+		}
+	}
+
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
 		zone_statistics(preferred_zone, zone);
@@ -3756,7 +3770,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			 * grow this zone if it contains deferred pages.
 			 */
 			if (static_branch_unlikely(&deferred_pages)) {
-				if (_deferred_grow_zone(zone, order))
+				if (!(gfp_mask & __GFP_FAST_TRY) &&
+					_deferred_grow_zone(zone, order))
 					goto try_this_zone;
 			}
 #endif
@@ -3801,7 +3816,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				reserve_highatomic_pageblock(page, zone, order);
 
 			return page;
-		} else {
+		} else if (!(gfp_mask & __GFP_FAST_TRY)) {
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 			/* Try again if zone has deferred pages */
 			if (static_branch_unlikely(&deferred_pages)) {
@@ -4845,6 +4860,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	if (likely(page))
 		goto out;
 
+	/* Bypass slow path if __GFP_FAST_TRY. */
+	if ((gfp_mask & __GFP_FAST_TRY))
+		goto out;
+
 	/*
 	 * Apply scoped allocation constraints. This is mainly about GFP_NOFS
 	 * resp. GFP_NOIO which has to be inherited for all allocation requests
diff --git a/mm/slab.c b/mm/slab.c
index 9350062ffc1a..f65570b28fdd 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3224,6 +3224,7 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	int slab_node = numa_mem_id();
 
 	flags &= gfp_allowed_mask;
+	/* flags &= (gfp_allowed_mask & ~__GFP_FAST_TRY); */
 	cachep = slab_pre_alloc_hook(cachep, flags);
 	if (unlikely(!cachep))
 		return NULL;
@@ -3303,6 +3304,7 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 	void *objp;
 
 	flags &= gfp_allowed_mask;
+	/* flags &= (gfp_allowed_mask & ~__GFP_FAST_TRY); */
 	cachep = slab_pre_alloc_hook(cachep, flags);
 	if (unlikely(!cachep))
 		return NULL;
diff --git a/mm/slab.h b/mm/slab.h
index 74f7e09a7cfd..56b84fc617e1 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -560,6 +560,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 						     gfp_t flags)
 {
 	flags &= gfp_allowed_mask;
+	/* flags &= (gfp_allowed_mask & ~__GFP_FAST_TRY); */
 
 	fs_reclaim_acquire(flags);
 	fs_reclaim_release(flags);
@@ -582,6 +583,7 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
 	size_t i;
 
 	flags &= gfp_allowed_mask;
+	/* flags &= (gfp_allowed_mask & ~__GFP_FAST_TRY); */
 	for (i = 0; i < size; i++) {
 		p[i] = kasan_slab_alloc(s, p[i], flags);
 		/* As p[i] might get tagged, call kmemleak hook after KASAN. */
diff --git a/mm/slob.c b/mm/slob.c
index ac2aecfbc7a8..f322c6f8c7a6 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -473,6 +473,7 @@ __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
 	void *ret;
 
 	gfp &= gfp_allowed_mask;
+	/* gfp &= (gfp_allowed_mask & ~__GFP_FAST_TRY); */
 
 	fs_reclaim_acquire(gfp);
 	fs_reclaim_release(gfp);
@@ -596,6 +597,7 @@ static void *slob_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
 	void *b;
 
 	flags &= gfp_allowed_mask;
+	/* flags &= (gfp_allowed_mask & ~__GFP_FAST_TRY); */
 
 	fs_reclaim_acquire(flags);
 	fs_reclaim_release(flags);
diff --git a/mm/slub.c b/mm/slub.c
index ef303070d175..9712a91533a4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1673,6 +1673,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	bool shuffle;
 
 	flags &= gfp_allowed_mask;
+	/* flags &= (gfp_allowed_mask & ~__GFP_FAST_TRY); */
 
 	if (gfpflags_allow_blocking(flags))
 		local_irq_enable();
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index 38a5ab683ebc..71df51f0a90f 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -656,6 +656,7 @@ static const struct {
 	{ "__GFP_RECLAIMABLE",		"RC" },
 	{ "__GFP_MOVABLE",		"M" },
 	{ "__GFP_ACCOUNT",		"AC" },
+	{ "__GFP_FAST_TRY",		"FT" },
 	{ "__GFP_WRITE",		"WR" },
 	{ "__GFP_RECLAIM",		"R" },
 	{ "__GFP_DIRECT_RECLAIM",	"DR" },
Some background and kfree_rcu()
===============================
The pointers to be freed are stored in a per-cpu array to improve
performance, to enable an easier-to-use API, to accommodate vmalloc
memory and to support the single-argument form of kfree_rcu() when
only a pointer is passed. More details are below.

In order to maintain such per-CPU arrays there is a need for dynamic
allocation when the current array is fully populated and a new block
is required. See the example below:

 0 1 2 3         0 1 2 3
|p|p|p|p|  ->  |p|p|p|p|  ->  NULL

There are two pointer-blocks; each one can store 4 addresses, which
will be freed after a grace period has passed. In reality we store
PAGE_SIZE / sizeof(void *).

So, to maintain such blocks a single page is obtained via the page
allocator:

    bnode = (struct kvfree_rcu_bulk_data *)
        __get_free_page(GFP_NOWAIT | __GFP_NOWARN);

After that it is attached to the "head" and its "next" pointer is set
to the previous "head", so the list of blocks can be maintained and
grow dynamically until it gets drained by the reclaiming thread.

Please note: there is always a fallback if an allocation fails. For
the single-argument case this is a call to synchronize_rcu(); for the
two-argument case it is to use the rcu_head structure embedded in the
object being freed, paying a cache-miss penalty and invoking kfree()
per object instead of kfree_bulk() for groups of objects.

Why do we maintain arrays/blocks instead of linking objects via the
regular "struct rcu_head" technique? The main reasons are:

a) Memory can be reclaimed by invoking the kfree_bulk() interface,
   which requires passing an array and the number of entries in it.
   That cuts the per-object overhead caused by calling kfree() per
   object and reduces the reclamation time.

b) It improves locality and reduces the number of cache misses caused
   by "pointer chasing" between objects, which can be spread far apart
   from each other.

c) It supports a "single argument" in kvfree_rcu():

       void *ptr = kvmalloc(some_bytes, GFP_KERNEL);
       if (ptr)
           kvfree_rcu(ptr);

   We need it when an "rcu_head" is not embedded into a structure but
   the object must still be freed after a grace period. Such objects
   cannot be queued on a linked list, hence the single-argument form.
   Nowadays, since we do not have a single-argument variant but we see
   demand for it, people work around it with a simple but inefficient
   sequence:

   <snip>
       synchronize_rcu(); /* Can be long and blocks the current context */
       kfree(p);
   <snip>

   More details are here: https://lkml.org/lkml/2020/4/28/1626

d) It distinguishes vmalloc pointers from SLAB ones. It becomes
   possible to invoke the right freeing API for the right kind of
   pointer, kfree_bulk() or TBD: vmalloc_bulk().

Also, please have a look here: https://lkml.org/lkml/2020/7/30/1166

Limitations and concerns (Main part)
====================================
The current memory-allocation interface presents the following
difficulties that this patch is designed to overcome:

a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, lockdep will complain
   about a violation ("BUG: Invalid wait context") of the nesting
   rules. It performs raw_spinlock vs. spinlock nesting checks, i.e.
   it is not legal to acquire a spinlock_t while holding a
   raw_spinlock_t.

   Internally kfree_rcu() uses a raw_spinlock_t (in the rcu-dev
   branch), whereas the page allocator internally deals with a
   spinlock_t to access its zones. The code can also be broken from a
   higher-level point of view:

   <snip>
       raw_spin_lock(&some_lock);
       kfree_rcu(some_pointer, some_field_offset);
   <snip>

b) If built with CONFIG_PREEMPT_RT. Please note, in that case
   spinlock_t is converted into a sleepable variant. Invoking the page
   allocator from atomic contexts then leads to "BUG: scheduling while
   atomic".

Proposals
=========
1) Add a GFP_* flag that makes the allocator return NULL rather than
acquire its own spinlock_t. Having such a flag addresses limitations
(a) and (b) described above. It also makes the kfree_rcu() code common
for RT and regular kernels, cleaner, with fewer corner cases to handle,
and reduces the code size.

Description: the page allocator has two phases, a fast path and a slow
one. We are interested in the fast path and order-0 allocations. The
fast path is in turn divided into two steps, a lock-less one and one
that is not:

a) As a first step the page allocator tries to obtain a page from the
   per-cpu list, so each CPU has its own one. This step is lock-less
   and fast. Basically it disables irqs on the current CPU in order to
   access the per-cpu data and removes the first element from the
   pcp-list. The element/page is returned to the user.

b) If there is no page available in the per-cpu list, the second step
   is involved. It removes a specified number of elements from the
   buddy allocator, transferring them to the per-cpu list described in
   [a]. The number of pre-fetched elements does not seem to depend on
   the amount of physical memory in a system; in my case it is 63
   pages. This step is not lock-less. It uses spinlock_t for accessing
   the buddy allocator's zone. This step is fully covered by the
   rmqueue_bulk() function.

Summarizing: __GFP_FAST_TRY covers only step [a] and cannot do step
[b], because [b] acquires a spinlock_t. This implies that it is super
fast, but a higher failure rate is also expected.

Usage: __get_free_page(__GFP_FAST_TRY);

2) There was a proposal from Matthew Wilcox: https://lkml.org/lkml/2020/7/31/1015

<snip>
    On non-RT, we could make that lock a raw spinlock. On RT, we could
    decline to take the lock. We'd need to abstract the spin_lock()
    away behind zone_lock(zone), but that should be OK.
<snip>

It would be great to use an existing flag, say GFP_NOWAIT. Suppose we
decline to take the lock across the page allocator for RT. But there
is at least one path that takes it outside of the page allocator:
GFP_NOWAIT can wake up kswapd, and the wake-up path uses a sleepable
lock:

    wakeup_kswapd() -> wake_up_interruptible(&pgdat->kswapd_wait).

Probably it can be fixed by excluding the waking of the kswapd
process, defining something like below:

which is equal to zero, and I am not sure if __get_free_page(0)
handles all that correctly, though it allocates and seems to work on
my test machine! Please note this is related to "if we can reuse
existing flags".

In the meantime, please see below for a patch that adds
__GFP_FAST_TRY, which can at least serve as a baseline against which
other proposals can be compared. The patch is based on 5.8.0-rc3.

Please RFC.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 include/linux/gfp.h            |  6 ++++--
 include/trace/events/mmflags.h |  1 +
 mm/page_alloc.c                | 31 +++++++++++++++++++++++++------
 mm/slab.c                      |  2 ++
 mm/slab.h                      |  2 ++
 mm/slob.c                      |  2 ++
 mm/slub.c                      |  1 +
 tools/perf/builtin-kmem.c      |  1 +
 8 files changed, 38 insertions(+), 8 deletions(-)
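As a usage illustration of proposal 1) from the caller's side (this is not part
of the patch; apart from __GFP_FAST_TRY, __GFP_NOWARN and __get_free_page(),
the helper name and field use below are made up), the kvfree_rcu() path could
treat a NULL result from the lock-less attempt as the normal trigger for its
existing fallback:

<snip>
    static bool add_block_fast(struct kvfree_rcu_bulk_data **head)
    {
        struct kvfree_rcu_bulk_data *bnode;

        /*
         * Lock-less attempt only: __GFP_FAST_TRY consumes what is already
         * in the per-cpu list and returns NULL instead of taking zone->lock.
         */
        bnode = (struct kvfree_rcu_bulk_data *)
                __get_free_page(__GFP_FAST_TRY | __GFP_NOWARN);
        if (!bnode)
            return false; /* fall back to rcu_head or synchronize_rcu() */

        bnode->next = *head; /* attach the new block in front of the old head */
        *head = bnode;
        return true;
    }
<snip>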