
[0/3,v5] Introduce a bulk order-0 page allocator

Message ID 20210322091845.16437-1-mgorman@techsingularity.net (mailing list archive)
Series: Introduce a bulk order-0 page allocator

Message

Mel Gorman March 22, 2021, 9:18 a.m. UTC
This series is based on top of Matthew Wilcox's series "Rationalise
__alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
test and are not using Andrew's tree as a baseline, I suggest using the
following git tree

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9

The users of the API have been dropped in this version as the callers
need to decide whether they prefer an array or a list interface (whether
that preference is based on convenience or performance).

Changelog since v4
o Drop users of the API
o Remove free_pages_bulk interface, no users
o Add array interface
o Allocate single page if watermark checks on local zones fail

Changelog since v3
o Rebase on top of Matthew's series consolidating the alloc_pages API
o Rename alloced to allocated
o Split out preparation patch for prepare_alloc_pages
o Defensive check for bulk allocation or <= 0 pages
o Call single page allocation path only if no pages were allocated
o Minor cosmetic cleanups
o Reorder patch dependencies by subsystem. As this is a cross-subsystem
  series, the mm patches have to be merged before the sunrpc and net
  users.

Changelog since v2
o Prep new pages with IRQs enabled
o Minor documentation update

Changelog since v1
o Parenthesise binary and boolean comparisons
o Add reviewed-bys
o Rebase to 5.12-rc2

This series introduces a bulk order-0 page allocator with the
intent that sunrpc and the network page pool become the first users.
The implementation is not particularly efficient and the intention is to
iron out what the semantics of the API should be for users. Despite
that, this is a performance-related enhancement for users that require
multiple pages for an operation without multiple round-trips to the page
allocator. Quoting the last patch on the prototype high-speed networking
use case:

    For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
    redirecting xdp_frame packets into a veth, that does XDP_PASS to
    create an SKB from the xdp_frame, which then cannot return the page
    to the page_pool. In this case, we saw[1] an improvement of 18.8%
    from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).

Both potential users in this series are corner cases (NFS and high-speed
networks) so it is unlikely that most users will see any benefit in the
short term. Other potential users are batch allocations for page
cache readahead, fault-around and SLUB allocations when high-order pages
are unavailable. It's unknown how much benefit would be seen by converting
multiple page allocation calls to a single batch or what difference it may
make to headline performance. It's a chicken and egg problem given that
the potential benefit cannot be investigated without an implementation
to test against.

Light testing passed; I'm relying on Chuck and Jesper to test their
implementations, choose between lists and arrays, and document
performance gains/losses in the changelogs.

Patch 1 renames a particularly unpopular variable

Patch 2 adds a bulk page allocator

Patch 3 adds an array-based version of the bulk allocator

 include/linux/gfp.h |  18 +++++
 mm/page_alloc.c     | 171 ++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 185 insertions(+), 4 deletions(-)
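
For reference, a rough sketch (not part of the series itself) of the two
call patterns under discussion. The wrapper names and argument order
follow the examples that appear later in this thread (alloc_pages_bulk()
for the list interface, alloc_pages_bulk_array() for the array interface)
and may not match the final API exactly.

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>

static void bulk_alloc_sketch(void)
{
	struct page *page, *tmp;
	struct page *array[8] = { NULL };	/* only NULL slots get filled */
	LIST_HEAD(list);
	int allocated, filled, i;

	/* List interface: pages come back linked through page->lru. */
	allocated = alloc_pages_bulk(GFP_KERNEL, 8, &list);
	if (allocated < 8) {
		/* Fewer pages than requested; the caller decides how to retry. */
	}
	list_for_each_entry_safe(page, tmp, &list, lru) {
		list_del(&page->lru);
		/* ... use the page ... */
	}

	/* Array interface: NULL elements are populated in place. */
	filled = alloc_pages_bulk_array(GFP_KERNEL, ARRAY_SIZE(array), array);
	for (i = 0; i < filled; i++) {
		/* ... use array[i] ... */
	}
}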

Comments

Jesper Dangaard Brouer March 22, 2021, 12:04 p.m. UTC | #1
On Mon, 22 Mar 2021 09:18:42 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> This series is based on top of Matthew Wilcox's series "Rationalise
> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> test and are not using Andrew's tree as a baseline, I suggest using the
> following git tree
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9

page_bench04_bulk[1] micro-benchmark on branch: mm-bulk-rebase-v5r9
 [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c

BASELINE
 single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns

LIST variant: time_bulk_page_alloc_free_list: step=bulk size

 Per elem: 206 cycles(tsc) 57.478 ns (step:1)
 Per elem: 154 cycles(tsc) 42.861 ns (step:2)
 Per elem: 145 cycles(tsc) 40.536 ns (step:3)
 Per elem: 142 cycles(tsc) 39.477 ns (step:4)
 Per elem: 142 cycles(tsc) 39.610 ns (step:8)
 Per elem: 137 cycles(tsc) 38.155 ns (step:16)
 Per elem: 135 cycles(tsc) 37.739 ns (step:32)
 Per elem: 134 cycles(tsc) 37.282 ns (step:64)
 Per elem: 133 cycles(tsc) 36.993 ns (step:128)

ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

 Per elem: 202 cycles(tsc) 56.383 ns (step:1)
 Per elem: 144 cycles(tsc) 40.047 ns (step:2)
 Per elem: 134 cycles(tsc) 37.339 ns (step:3)
 Per elem: 128 cycles(tsc) 35.578 ns (step:4)
 Per elem: 120 cycles(tsc) 33.592 ns (step:8)
 Per elem: 116 cycles(tsc) 32.362 ns (step:16)
 Per elem: 113 cycles(tsc) 31.476 ns (step:32)
 Per elem: 110 cycles(tsc) 30.633 ns (step:64)
 Per elem: 110 cycles(tsc) 30.596 ns (step:128)

Compared to the previous results (see below), the list variant got
faster, but the array variant is still faster.  The array variant lost a
little performance; I think this may be related to the stats counters
being added/moved inside the loop in this patchset.

Previous results from:
 https://lore.kernel.org/netdev/20210319181031.44dd3113@carbon/

On Fri, 19 Mar 2021 18:10:31 +0100 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> BASELINE
>  single_page alloc+put: 207 cycles(tsc) 57.773 ns
> 
> LIST variant: time_bulk_page_alloc_free_list: step=bulk size
> 
>  Per elem: 294 cycles(tsc) 81.866 ns (step:1)
>  Per elem: 214 cycles(tsc) 59.477 ns (step:2)
>  Per elem: 199 cycles(tsc) 55.504 ns (step:3)
>  Per elem: 192 cycles(tsc) 53.489 ns (step:4)
>  Per elem: 188 cycles(tsc) 52.456 ns (step:8)
>  Per elem: 184 cycles(tsc) 51.346 ns (step:16)
>  Per elem: 183 cycles(tsc) 50.883 ns (step:32)
>  Per elem: 184 cycles(tsc) 51.236 ns (step:64)
>  Per elem: 189 cycles(tsc) 52.620 ns (step:128)
> 
> ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
> 
>  Per elem: 195 cycles(tsc) 54.174 ns (step:1)
>  Per elem: 123 cycles(tsc) 34.224 ns (step:2)
>  Per elem: 113 cycles(tsc) 31.430 ns (step:3)
>  Per elem: 108 cycles(tsc) 30.003 ns (step:4)
>  Per elem: 102 cycles(tsc) 28.417 ns (step:8)
>  Per elem:  98 cycles(tsc) 27.475 ns (step:16)
>  Per elem:  96 cycles(tsc) 26.901 ns (step:32)
>  Per elem:  95 cycles(tsc) 26.487 ns (step:64)
>  Per elem:  94 cycles(tsc) 26.170 ns (step:128)

> The users of the API have been dropped in this version as the callers
> need to check whether they prefer an array or list interface (whether
> preference is based on convenience or performance).

I'll start coding up the page_pool API user and benchmark that.

> Changelog since v4
> o Drop users of the API
> o Remove free_pages_bulk interface, no users

In the benchmark [1] I just open-coded free_pages_bulk():
 https://github.com/netoptimizer/prototype-kernel/commit/49d224b19850b767c

> o Add array interface
> o Allocate single page if watermark checks on local zones fail
Mel Gorman March 22, 2021, 4:44 p.m. UTC | #2
On Mon, Mar 22, 2021 at 01:04:46PM +0100, Jesper Dangaard Brouer wrote:
> On Mon, 22 Mar 2021 09:18:42 +0000
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > This series is based on top of Matthew Wilcox's series "Rationalise
> > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > test and are not using Andrew's tree as a baseline, I suggest using the
> > following git tree
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
> 
> page_bench04_bulk[1] micro-benchmark on branch: mm-bulk-rebase-v5r9
>  [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c
> 
> BASELINE
>  single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns
> 
> LIST variant: time_bulk_page_alloc_free_list: step=bulk size
> 
>  Per elem: 206 cycles(tsc) 57.478 ns (step:1)
>  Per elem: 154 cycles(tsc) 42.861 ns (step:2)
>  Per elem: 145 cycles(tsc) 40.536 ns (step:3)
>  Per elem: 142 cycles(tsc) 39.477 ns (step:4)
>  Per elem: 142 cycles(tsc) 39.610 ns (step:8)
>  Per elem: 137 cycles(tsc) 38.155 ns (step:16)
>  Per elem: 135 cycles(tsc) 37.739 ns (step:32)
>  Per elem: 134 cycles(tsc) 37.282 ns (step:64)
>  Per elem: 133 cycles(tsc) 36.993 ns (step:128)
> 
> ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
> 
>  Per elem: 202 cycles(tsc) 56.383 ns (step:1)
>  Per elem: 144 cycles(tsc) 40.047 ns (step:2)
>  Per elem: 134 cycles(tsc) 37.339 ns (step:3)
>  Per elem: 128 cycles(tsc) 35.578 ns (step:4)
>  Per elem: 120 cycles(tsc) 33.592 ns (step:8)
>  Per elem: 116 cycles(tsc) 32.362 ns (step:16)
>  Per elem: 113 cycles(tsc) 31.476 ns (step:32)
>  Per elem: 110 cycles(tsc) 30.633 ns (step:64)
>  Per elem: 110 cycles(tsc) 30.596 ns (step:128)
> 
> Compared to the previous results (see below) list-variant got faster,
> but array-variant is still faster.  The array variant lost a little
> performance.  I think this can be related to the stats counters got
> added/moved inside the loop, in this patchset.
> 

If you are feeling particularly brave, take a look at
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r10

It's a prototype series rebased on top of the bulk allocator and this
version has not even been boot tested.  While it'll get rough treatment
during review, it should reduce the cost of the stat updates in the
bulk allocator as a side-effect.
Chuck Lever March 22, 2021, 6:25 p.m. UTC | #3
> On Mar 22, 2021, at 5:18 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> This series is based on top of Matthew Wilcox's series "Rationalise
> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> test and are not using Andrew's tree as a baseline, I suggest using the
> following git tree
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
> 
> The users of the API have been dropped in this version as the callers
> need to check whether they prefer an array or list interface (whether
> preference is based on convenience or performance).

I now have a consumer implementation that uses the array
API. If I understand the contract correctly, the return
value is the last array index that __alloc_pages_bulk()
visits. My consumer uses the return value to determine
if it needs to call the allocator again.

It is returning some confusing (to me) results. I'd like
to get these resolved before posting any benchmark
results.

1. When it has visited every array element, it returns the
same value as was passed in @nr_pages. That's the N + 1th
array element, which shouldn't be touched. Should the
allocator return nr_pages - 1 in the fully successful case?
Or should the documentation describe the return value as
"the number of elements visited" ?

2. Frequently the allocator returns a number smaller than
the total number of elements. As you may recall, sunrpc
will delay a bit (via a call to schedule_timeout) then call
again. This is supposed to be a rare event, and the delay
is substantial. But with the array-based API, a not-fully-
successful allocator call seems to happen more than half
the time. Is that expected? I'm calling with GFP_KERNEL; it
seems like the allocator should be trying harder.

3. Is the current design intended such that, if the consumer
does call again, it should pass in the array address
+ the returned index (and @nr_pages reduced by the returned
index)?

Thanks for all your hard work, Mel.


> Changelog since v4
> o Drop users of the API
> o Remove free_pages_bulk interface, no users
> o Add array interface
> o Allocate single page if watermark checks on local zones fail
> 
> Changelog since v3
> o Rebase on top of Matthew's series consolidating the alloc_pages API
> o Rename alloced to allocated
> o Split out preparation patch for prepare_alloc_pages
> o Defensive check for bulk allocation or <= 0 pages
> o Call single page allocation path only if no pages were allocated
> o Minor cosmetic cleanups
> o Reorder patch dependencies by subsystem. As this is a cross-subsystem
>  series, the mm patches have to be merged before the sunrpc and net
>  users.
> 
> Changelog since v2
> o Prep new pages with IRQs enabled
> o Minor documentation update
> 
> Changelog since v1
> o Parenthesise binary and boolean comparisons
> o Add reviewed-bys
> o Rebase to 5.12-rc2
> 
> This series introduces a bulk order-0 page allocator with the
> intent that sunrpc and the network page pool become the first users.
> The implementation is not particularly efficient and the intention is to
> iron out what the semantics of the API should have for users. Despite
> that, this is a performance-related enhancement for users that require
> multiple pages for an operation without multiple round-trips to the page
> allocator. Quoting the last patch for the prototype high-speed networking
> use-case.
> 
>    For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
>    redirecting xdp_frame packets into a veth, that does XDP_PASS to
>    create an SKB from the xdp_frame, which then cannot return the page
>    to the page_pool. In this case, we saw[1] an improvement of 18.8%
>    from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
> 
> Both potential users in this series are corner cases (NFS and high-speed
> networks) so it is unlikely that most users will see any benefit in the
> short term. Other potential other users are batch allocations for page
> cache readahead, fault around and SLUB allocations when high-order pages
> are unavailable. It's unknown how much benefit would be seen by converting
> multiple page allocation calls to a single batch or what difference it may
> make to headline performance. It's a chicken and egg problem given that
> the potential benefit cannot be investigated without an implementation
> to test against.
> 
> Light testing passed, I'm relying on Chuck and Jesper to test their
> implementations, choose whether to use lists or arrays and document
> performance gains/losses in the changelogs.
> 
> Patch 1 renames a variable name that is particularly unpopular
> 
> Patch 2 adds a bulk page allocator
> 
> Patch 3 adds an array-based version of the bulk allocator
> 
> include/linux/gfp.h |  18 +++++
> mm/page_alloc.c     | 171 ++++++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 185 insertions(+), 4 deletions(-)
> 
> -- 
> 2.26.2
> 

--
Chuck Lever
Mel Gorman March 22, 2021, 7:49 p.m. UTC | #4
On Mon, Mar 22, 2021 at 06:25:03PM +0000, Chuck Lever III wrote:
> 
> 
> > On Mar 22, 2021, at 5:18 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
> > 
> > This series is based on top of Matthew Wilcox's series "Rationalise
> > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > test and are not using Andrew's tree as a baseline, I suggest using the
> > following git tree
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
> > 
> > The users of the API have been dropped in this version as the callers
> > need to check whether they prefer an array or list interface (whether
> > preference is based on convenience or performance).
> 
> I now have a consumer implementation that uses the array
> API. If I understand the contract correctly, the return
> value is the last array index that __alloc_pages_bulk()
> visits. My consumer uses the return value to determine
> if it needs to call the allocator again.
> 

For either arrays or lists, the return value is the number of valid
pages. For arrays, the pattern is expected to be

nr_pages = alloc_pages_bulk(gfp, nr_requested, page_array);
for (i = 0; i < nr_pages; i++) {
	/* do something with page_array[i] */
}

There *could* be populated valid elements on and after nr_pages but the
implementation did not visit those elements. The implementation can abort
early if the array looks like this

PPP....PPP

Where P is a page and . is NULL. The implementation would skip the
first three pages, allocate the four NULL slots and then stop when the
next populated element was encountered. This is an implementation detail
around how I handled
prep_new_page. It could be addressed if many callers expect to pass in
an array that has holes in the middle.

> It is returning some confusing (to me) results. I'd like
> to get these resolved before posting any benchmark
> results.
> 
> 1. When it has visited every array element, it returns the
> same value as was passed in @nr_pages. That's the N + 1th
> array element, which shouldn't be touched. Should the
> allocator return nr_pages - 1 in the fully successful case?
> Or should the documentation describe the return value as
> "the number of elements visited" ?
> 

I phrased it as "the known number of populated elements in the
page_array". I did not want to write it as "the number of valid elements
in the array" because that is not necessarily the case if an array is
passed in with holes in the middle. I'm open to any suggestions on how
the __alloc_pages_bulk description can be improved.

The definition of the return value as-is makes sense for either a list
or an array. Returning "nr_pages - 1" suits an array because it's the
last valid index but it makes less sense when returning a list.

> 2. Frequently the allocator returns a number smaller than
> the total number of elements. As you may recall, sunrpc
> will delay a bit (via a call to schedule_timeout) then call
> again. This is supposed to be a rare event, and the delay
> is substantial. But with the array-based API, a not-fully-
> successful allocator call seems to happen more than half
> the time. Is that expected? I'm calling with GFP_KERNEL,
> seems like the allocator should be trying harder.
> 

It's not expected that the array implementation would be worse *unless*
you are passing in arrays with holes in the middle. Otherwise, the success
rate should be similar.

> 3. Is the current design intended so that if the consumer
> does call again, is it supposed to pass in the array address
> + the returned index (and @nr_pages reduced by the returned
> index) ?
> 

The caller does not have to pass in array address + returned index but
it is more efficient if it does.
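
As a hedged illustration of that pattern (the variable names are
hypothetical and the wrapper name follows the examples used elsewhere in
this thread), a caller could do something like:

filled = alloc_pages_bulk_array(GFP_KERNEL, nr_pages, page_array);
while (filled < nr_pages) {
	/* A real caller would back off here (sunrpc sleeps) rather than
	 * spin if the allocator keeps coming up short. */
	filled += alloc_pages_bulk_array(GFP_KERNEL, nr_pages - filled,
					 &page_array[filled]);
}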

If you are passing in arrays with holes in the middle then the following
might work (not tested)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c83d38dfe936..4dc38516a5bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5002,6 +5002,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	gfp_t alloc_gfp;
 	unsigned int alloc_flags;
 	int nr_populated = 0, prep_index = 0;
+	bool hole = false;
 
 	if (WARN_ON_ONCE(nr_pages <= 0))
 		return 0;
@@ -5057,6 +5058,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	if (!zone)
 		goto failed;
 
+retry_hole:
 	/* Attempt the batch allocation */
 	local_irq_save(flags);
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
@@ -5069,6 +5071,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		 * IRQs are enabled.
 		 */
 		if (page_array && page_array[nr_populated]) {
+			hole = true;
 			nr_populated++;
 			break;
 		}
@@ -5109,6 +5112,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 			prep_new_page(page_array[prep_index++], 0, gfp, 0);
 	}
 
+	if (hole && nr_populated < nr_pages)
+		goto retry_hole;
+
 	return nr_populated;
 
 failed_irq:
Chuck Lever March 22, 2021, 8:32 p.m. UTC | #5
> On Mar 22, 2021, at 3:49 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> On Mon, Mar 22, 2021 at 06:25:03PM +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Mar 22, 2021, at 5:18 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
>>> 
>>> This series is based on top of Matthew Wilcox's series "Rationalise
>>> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
>>> test and are not using Andrew's tree as a baseline, I suggest using the
>>> following git tree
>>> 
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
>>> 
>>> The users of the API have been dropped in this version as the callers
>>> need to check whether they prefer an array or list interface (whether
>>> preference is based on convenience or performance).
>> 
>> I now have a consumer implementation that uses the array
>> API. If I understand the contract correctly, the return
>> value is the last array index that __alloc_pages_bulk()
>> visits. My consumer uses the return value to determine
>> if it needs to call the allocator again.
>> 
> 
> For either arrays or lists, the return value is the number of valid
> pages. For arrays, the pattern is expected to be
> 
> nr_pages = alloc_pages_bulk(gfp, nr_requested, page_array);
> for (i = 0; i < nr_pages; i++) {
> 	do something with page_array[i] 
> }
> 
> There *could* be populated valid elements on and after nr_pages but the
> implementation did not visit those elements. The implementation can abort
> early if the array looks like this
> 
> PPP....PPP
> 
> Where P is a page and . is NULL. The implementation would skip the
> first three pages, allocate four pages and then abort when a new page
> was encountered. This is an implementation detail around how I handled
> prep_new_page. It could be addressed if many callers expect to pass in
> an array that has holes in the middle.
> 
>> It is returning some confusing (to me) results. I'd like
>> to get these resolved before posting any benchmark
>> results.
>> 
>> 1. When it has visited every array element, it returns the
>> same value as was passed in @nr_pages. That's the N + 1th
>> array element, which shouldn't be touched. Should the
>> allocator return nr_pages - 1 in the fully successful case?
>> Or should the documentation describe the return value as
>> "the number of elements visited" ?
>> 
> 
> I phrased it as "the known number of populated elements in the
> page_array".

The comment you added states:

+ * For lists, nr_pages is the number of pages that should be allocated.
+ *
+ * For arrays, only NULL elements are populated with pages and nr_pages
+ * is the maximum number of pages that will be stored in the array.
+ *
+ * Returns the number of pages added to the page_list or the index of the
+ * last known populated element of page_array.


> I did not want to write it as "the number of valid elements
> in the array" because that is not necessarily the case if an array is
> passed in with holes in the middle. I'm open to any suggestions on how
> the __alloc_pages_bulk description can be improved.

The comment states that, for the array case, a /count/ of
pages is passed in, and an /index/ is returned. If you want
to return the same type for lists and arrays, it should be
documented as a count in both cases, to match @nr_pages.
Consumers will want to compare @nr_pages with the return
value to see if they need to call again.

Comparing a count to an index is a notorious source of
off-by-one errors.
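
For illustration only (hypothetical names), the two readings lead to
different consumer checks:

ret = alloc_pages_bulk_array(GFP_KERNEL, nr_pages, page_array);

/* If ret is a count, the natural test is: */
if (ret < nr_pages)
	; /* not fully populated, call again */

/* If ret is the index of the last populated element, the equivalent
 * test is shifted by one: */
if (ret < nr_pages - 1)
	; /* not fully populated, call again */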


> The definition of the return value as-is makes sense for either a list
> or an array. Returning "nr_pages - 1" suits an array because it's the
> last valid index but it makes less sense when returning a list.
> 
>> 2. Frequently the allocator returns a number smaller than
>> the total number of elements. As you may recall, sunrpc
>> will delay a bit (via a call to schedule_timeout) then call
>> again. This is supposed to be a rare event, and the delay
>> is substantial. But with the array-based API, a not-fully-
>> successful allocator call seems to happen more than half
>> the time. Is that expected? I'm calling with GFP_KERNEL,
>> seems like the allocator should be trying harder.
>> 
> 
> It's not expected that the array implementation would be worse *unless*
> you are passing in arrays with holes in the middle. Otherwise, the success
> rate should be similar.

Essentially, sunrpc will always pass an array with a hole.
Each RPC consumes the first N elements in the rq_pages array.
Sometimes N == ARRAY_SIZE(rq_pages). AFAIK sunrpc will not
pass in an array with more than one hole. Typically:

.....PPPP

My results show that, because svc_alloc_arg() ends up calling
__alloc_pages_bulk() twice in this case, it ends up being
twice as expensive as the list case, on average, for the same
workload.


>> 3. Is the current design intended so that if the consumer
>> does call again, is it supposed to pass in the array address
>> + the returned index (and @nr_pages reduced by the returned
>> index) ?
>> 
> 
> The caller does not have to pass in array address + returned index but
> it's more efficient if it does.
> 
> If you are passing in arrays with holes in the middle then the following
> might work (not tested)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c83d38dfe936..4dc38516a5bd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5002,6 +5002,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> 	gfp_t alloc_gfp;
> 	unsigned int alloc_flags;
> 	int nr_populated = 0, prep_index = 0;
> +	bool hole = false;
> 
> 	if (WARN_ON_ONCE(nr_pages <= 0))
> 		return 0;
> @@ -5057,6 +5058,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> 	if (!zone)
> 		goto failed;
> 
> +retry_hole:
> 	/* Attempt the batch allocation */
> 	local_irq_save(flags);
> 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> @@ -5069,6 +5071,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> 		 * IRQs are enabled.
> 		 */
> 		if (page_array && page_array[nr_populated]) {
> +			hole = true;
> 			nr_populated++;
> 			break;
> 		}
> @@ -5109,6 +5112,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> 			prep_new_page(page_array[prep_index++], 0, gfp, 0);
> 	}
> 
> +	if (hole && nr_populated < nr_pages && hole)
> +		goto retry_hole;
> +
> 	return nr_populated;
> 
> failed_irq:
> 
> -- 
> Mel Gorman
> SUSE Labs

If a local_irq_save() is done more than once in this case, I don't
expect that the result will be much better.

To make the array API as performant as the list API, the sunrpc
consumer will have to check if the N + 1th element is populated,
upon return, rather than checking the return value against
@nr_pages.
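
A sketch of one reading of that check, using the rq_pages/pages names
from the sunrpc conversion discussed later in the thread (not tested):

filled = alloc_pages_bulk_array(GFP_KERNEL, pages, rqstp->rq_pages);
if (filled < pages && !rqstp->rq_pages[filled]) {
	/* The element just past the last one visited is still NULL,
	 * so the array really is short: back off and retry. */
} else {
	/* Either every element was visited or the remaining tail was
	 * already populated; either way the array is full. */
}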

--
Chuck Lever
Mel Gorman March 22, 2021, 8:58 p.m. UTC | #6
On Mon, Mar 22, 2021 at 08:32:54PM +0000, Chuck Lever III wrote:
> >> It is returning some confusing (to me) results. I'd like
> >> to get these resolved before posting any benchmark
> >> results.
> >> 
> >> 1. When it has visited every array element, it returns the
> >> same value as was passed in @nr_pages. That's the N + 1th
> >> array element, which shouldn't be touched. Should the
> >> allocator return nr_pages - 1 in the fully successful case?
> >> Or should the documentation describe the return value as
> >> "the number of elements visited" ?
> >> 
> > 
> > I phrased it as "the known number of populated elements in the
> > page_array".
> 
> The comment you added states:
> 
> + * For lists, nr_pages is the number of pages that should be allocated.
> + *
> + * For arrays, only NULL elements are populated with pages and nr_pages
> + * is the maximum number of pages that will be stored in the array.
> + *
> + * Returns the number of pages added to the page_list or the index of the
> + * last known populated element of page_array.
> 
> 
> > I did not want to write it as "the number of valid elements
> > in the array" because that is not necessarily the case if an array is
> > passed in with holes in the middle. I'm open to any suggestions on how
> > the __alloc_pages_bulk description can be improved.
> 
> The comments states that, for the array case, a /count/ of
> pages is passed in, and an /index/ is returned. If you want
> to return the same type for lists and arrays, it should be
> documented as a count in both cases, to match @nr_pages.
> Consumers will want to compare @nr_pages with the return
> value to see if they need to call again.
> 

Then I'll just say it's the known count of pages in the array. That
might still be less than the number of requested pages if holes are
encountered.

> > The definition of the return value as-is makes sense for either a list
> > or an array. Returning "nr_pages - 1" suits an array because it's the
> > last valid index but it makes less sense when returning a list.
> > 
> >> 2. Frequently the allocator returns a number smaller than
> >> the total number of elements. As you may recall, sunrpc
> >> will delay a bit (via a call to schedule_timeout) then call
> >> again. This is supposed to be a rare event, and the delay
> >> is substantial. But with the array-based API, a not-fully-
> >> successful allocator call seems to happen more than half
> >> the time. Is that expected? I'm calling with GFP_KERNEL,
> >> seems like the allocator should be trying harder.
> >> 
> > 
> > It's not expected that the array implementation would be worse *unless*
> > you are passing in arrays with holes in the middle. Otherwise, the success
> > rate should be similar.
> 
> Essentially, sunrpc will always pass an array with a hole.
> Each RPC consumes the first N elements in the rq_pages array.
> Sometimes N == ARRAY_SIZE(rq_pages). AFAIK sunrpc will not
> pass in an array with more than one hole. Typically:
> 
> .....PPPP
> 
> My results show that, because svc_alloc_arg() ends up calling
> __alloc_pages_bulk() twice in this case, it ends up being
> twice as expensive as the list case, on average, for the same
> workload.
> 

Ok, so in this case the caller knows that holes are always at the
start. If the API returns an index that is valid and populated, the
caller can check the next index; if that is also valid then the whole
array must be populated.

Right now, the implementation checks for populated elements at the *start*
because it is required for calling prep_new_page starting at the correct
index and the API cannot make assumptions about the location of the hole.

The patch below would check the rest of the array but note that it's
slower for the API to do this check because it has to check every element
while the sunrpc user could check just one element. Let me know if a) this
hunk helps and b) this is the desired behaviour.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c83d38dfe936..4bf20650e5f5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5107,6 +5107,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	} else {
 		while (prep_index < nr_populated)
 			prep_new_page(page_array[prep_index++], 0, gfp, 0);
+
+		while (nr_populated < nr_pages && page_array[nr_populated])
+			nr_populated++;
 	}
 
 	return nr_populated;
Mel Gorman March 23, 2021, 10:44 a.m. UTC | #7
On Mon, Mar 22, 2021 at 09:18:42AM +0000, Mel Gorman wrote:
> This series is based on top of Matthew Wilcox's series "Rationalise
> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> test and are not using Andrew's tree as a baseline, I suggest using the
> following git tree
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
> 

Jesper and Chuck, would you mind rebasing on top of the following branch
please? 

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r2

The interface is the same so the rebase should be trivial.

Jesper, I'm hoping you see no differences in performance but it's best
to check.

For Chuck, this version will check for holes and scan the remainder of
the array to see if nr_pages are allocated before returning. If the holes
in the array are always at the start (which they should be for sunrpc)
then it should still be a single IRQ disable/enable. Specifically, each
contiguous hole in the array will disable/enable IRQs once. I prototyped
NFS array support and it had a 100% success rate with no sleeps running
dbench over the network with no memory pressure but that's a basic test
on a 10G switch.

The basic patch I used to convert sunrpc from using lists to an array
for testing is as follows:

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 922118968986..0ce33c1742d9 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -642,12 +642,10 @@ static void svc_check_conn_limits(struct svc_serv *serv)
 static int svc_alloc_arg(struct svc_rqst *rqstp)
 {
 	struct svc_serv *serv = rqstp->rq_server;
-	unsigned long needed;
 	struct xdr_buf *arg;
-	struct page *page;
 	LIST_HEAD(list);
 	int pages;
-	int i;
+	int i = 0;
 
 	pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
 	if (pages > RPCSVC_MAXPAGES) {
@@ -657,29 +655,15 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 		pages = RPCSVC_MAXPAGES;
 	}
 
-	for (needed = 0, i = 0; i < pages ; i++) {
-		if (!rqstp->rq_pages[i])
-			needed++;
-	}
-	i = 0;
-	while (needed) {
-		needed -= alloc_pages_bulk(GFP_KERNEL, needed, &list);
-		for (; i < pages; i++) {
-			if (rqstp->rq_pages[i])
-				continue;
-			page = list_first_entry_or_null(&list, struct page, lru);
-			if (likely(page)) {
-				list_del(&page->lru);
-				rqstp->rq_pages[i] = page;
-				continue;
-			}
+	while (i < pages) {
+		i = alloc_pages_bulk_array(GFP_KERNEL, pages, &rqstp->rq_pages[0]);
+		if (i < pages) {
 			set_current_state(TASK_INTERRUPTIBLE);
 			if (signalled() || kthread_should_stop()) {
 				set_current_state(TASK_RUNNING);
 				return -EINTR;
 			}
 			schedule_timeout(msecs_to_jiffies(500));
-			break;
 		}
 	}
 	rqstp->rq_page_end = &rqstp->rq_pages[pages];
Jesper Dangaard Brouer March 23, 2021, 11:08 a.m. UTC | #8
On Mon, 22 Mar 2021 20:58:27 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Mon, Mar 22, 2021 at 08:32:54PM +0000, Chuck Lever III wrote:
> > >> It is returning some confusing (to me) results. I'd like
> > >> to get these resolved before posting any benchmark
> > >> results.
> > >> 
> > >> 1. When it has visited every array element, it returns the
> > >> same value as was passed in @nr_pages. That's the N + 1th
> > >> array element, which shouldn't be touched. Should the
> > >> allocator return nr_pages - 1 in the fully successful case?
> > >> Or should the documentation describe the return value as
> > >> "the number of elements visited" ?
> > >>   
> > > 
> > > I phrased it as "the known number of populated elements in the
> > > page_array".  
> > 
> > The comment you added states:
> > 
> > + * For lists, nr_pages is the number of pages that should be allocated.
> > + *
> > + * For arrays, only NULL elements are populated with pages and nr_pages
> > + * is the maximum number of pages that will be stored in the array.
> > + *
> > + * Returns the number of pages added to the page_list or the index of the
> > + * last known populated element of page_array.
> > 
> >   
> > > I did not want to write it as "the number of valid elements
> > > in the array" because that is not necessarily the case if an array is
> > > passed in with holes in the middle. I'm open to any suggestions on how
> > > the __alloc_pages_bulk description can be improved.  
> > 
> > The comments states that, for the array case, a /count/ of
> > pages is passed in, and an /index/ is returned. If you want
> > to return the same type for lists and arrays, it should be
> > documented as a count in both cases, to match @nr_pages.
> > Consumers will want to compare @nr_pages with the return
> > value to see if they need to call again.
> >   
> 
> Then I'll just say it's the known count of pages in the array. That
> might still be less than the number of requested pages if holes are
> encountered.
> 
> > > The definition of the return value as-is makes sense for either a list
> > > or an array. Returning "nr_pages - 1" suits an array because it's the
> > > last valid index but it makes less sense when returning a list.
> > >   
> > >> 2. Frequently the allocator returns a number smaller than
> > >> the total number of elements. As you may recall, sunrpc
> > >> will delay a bit (via a call to schedule_timeout) then call
> > >> again. This is supposed to be a rare event, and the delay
> > >> is substantial. But with the array-based API, a not-fully-
> > >> successful allocator call seems to happen more than half
> > >> the time. Is that expected? I'm calling with GFP_KERNEL,
> > >> seems like the allocator should be trying harder.
> > >>   
> > > 
> > > It's not expected that the array implementation would be worse *unless*
> > > you are passing in arrays with holes in the middle. Otherwise, the success
> > > rate should be similar.  
> > 
> > Essentially, sunrpc will always pass an array with a hole.
> > Each RPC consumes the first N elements in the rq_pages array.
> > Sometimes N == ARRAY_SIZE(rq_pages). AFAIK sunrpc will not
> > pass in an array with more than one hole. Typically:
> > 
> > .....PPPP
> > 
> > My results show that, because svc_alloc_arg() ends up calling
> > __alloc_pages_bulk() twice in this case, it ends up being
> > twice as expensive as the list case, on average, for the same
> > workload.
> >   
> 
> Ok, so in this case the caller knows that holes are always at the
> start. If the API returns an index that is a valid index and populated,
> it can check the next index and if it is valid then the whole array
> must be populated.
> 
> Right now, the implementation checks for populated elements at the *start*
> because it is required for calling prep_new_page starting at the correct
> index and the API cannot make assumptions about the location of the hole.
> 
> The patch below would check the rest of the array but note that it's
> slower for the API to do this check because it has to check every element
> while the sunrpc user could check one element. Let me know if a) this
> hunk helps and b) is desired behaviour.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c83d38dfe936..4bf20650e5f5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5107,6 +5107,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  	} else {
>  		while (prep_index < nr_populated)
>  			prep_new_page(page_array[prep_index++], 0, gfp, 0);
> +
> +		while (nr_populated < nr_pages && page_array[nr_populated])
> +			nr_populated++;
>  	}
>  
>  	return nr_populated;

I do know that I suggested moving prep_new_page() out of the
IRQ-disabled loop, but maybe it was a bad idea, for several reasons.

All prep_new_page() does is write into the struct page, unless some
debugging stuff (like KASAN) is enabled. This cache line is hot, as the
LRU-list update just wrote into it.  As the bulk size goes up, as
Matthew pointed out, this cache line might be pushed into the L2 cache,
and then needs to be accessed again when prep_new_page() is called.

Another observation is that moving prep_new_page() into the loop reduced
the function size by 253 bytes (which affects the I-cache).

   ./scripts/bloat-o-meter mm/page_alloc.o-prep_new_page-outside mm/page_alloc.o-prep_new_page-inside
    add/remove: 18/18 grow/shrink: 0/1 up/down: 144/-397 (-253)
    Function                                     old     new   delta
    __alloc_pages_bulk                          1965    1712    -253
    Total: Before=60799, After=60546, chg -0.42%

Maybe it is better to keep prep_new_page() inside the loop.  This also
allows the list and array variants to share the call.  And it should
simplify the array variant code.
Matthew Wilcox March 23, 2021, 11:13 a.m. UTC | #9
On Mon, Mar 22, 2021 at 08:32:54PM +0000, Chuck Lever III wrote:
> > It's not expected that the array implementation would be worse *unless*
> > you are passing in arrays with holes in the middle. Otherwise, the success
> > rate should be similar.
> 
> Essentially, sunrpc will always pass an array with a hole.
> Each RPC consumes the first N elements in the rq_pages array.
> Sometimes N == ARRAY_SIZE(rq_pages). AFAIK sunrpc will not
> pass in an array with more than one hole. Typically:
> 
> .....PPPP
> 
> My results show that, because svc_alloc_arg() ends up calling
> __alloc_pages_bulk() twice in this case, it ends up being
> twice as expensive as the list case, on average, for the same
> workload.

Can you call memmove() to shift all the pointers down to be the
first N elements?  That prevents creating a situation where we have

PPPPPPPP (consume 6)
......PP (try to allocate 6, only 4 available)
PPPP..PP

instead, you'd do:

PPPPPPPP (consume 6)
PP...... (try to allocate 6, only 4 available)
PPPPPP..

Alternatively, you could consume from the tail of the array instead of
the head.  Some CPUs aren't as effective at backwards walks as they
are at forwards walks, but let's keep the pressure on CPU manufacturers
to make better CPUs.
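
A hedged sketch of that shift, reusing the rq_pages name from the sunrpc
code discussed in this thread; 'consumed' is a hypothetical count of the
pages the previous RPC took from the front of the array and 'pages' is
the number of usable slots:

size_t remaining = pages - consumed;

/* Move the surviving page pointers to the front and clear the tail,
 * so any hole is always at the end of the array. */
memmove(&rqstp->rq_pages[0], &rqstp->rq_pages[consumed],
	remaining * sizeof(struct page *));
memset(&rqstp->rq_pages[remaining], 0,
       consumed * sizeof(struct page *));

/* A single bulk call can now top the array back up without ever
 * seeing a hole in the middle. */
filled = alloc_pages_bulk_array(GFP_KERNEL, pages, rqstp->rq_pages);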
Mel Gorman March 23, 2021, 2:45 p.m. UTC | #10
On Tue, Mar 23, 2021 at 12:08:51PM +0100, Jesper Dangaard Brouer wrote:
> > > <SNIP>
> > > My results show that, because svc_alloc_arg() ends up calling
> > > __alloc_pages_bulk() twice in this case, it ends up being
> > > twice as expensive as the list case, on average, for the same
> > > workload.
> > >   
> > 
> > Ok, so in this case the caller knows that holes are always at the
> > start. If the API returns an index that is a valid index and populated,
> > it can check the next index and if it is valid then the whole array
> > must be populated.
> > 
> > <SNIP>
> 
> I do know that I suggested moving prep_new_page() out of the
> IRQ-disabled loop, but maybe was a bad idea, for several reasons.
> 
> All prep_new_page does is to write into struct page, unless some
> debugging stuff (like kasan) is enabled. This cache-line is hot as
> LRU-list update just wrote into this cache-line.  As the bulk size goes
> up, as Matthew pointed out, this cache-line might be pushed into
> L2-cache, and then need to be accessed again when prep_new_page() is
> called.
> 
> Another observation is that moving prep_new_page() into loop reduced
> function size with 253 bytes (which affect I-cache).
> 
>    ./scripts/bloat-o-meter mm/page_alloc.o-prep_new_page-outside mm/page_alloc.o-prep_new_page-inside
>     add/remove: 18/18 grow/shrink: 0/1 up/down: 144/-397 (-253)
>     Function                                     old     new   delta
>     __alloc_pages_bulk                          1965    1712    -253
>     Total: Before=60799, After=60546, chg -0.42%
> 
> Maybe it is better to keep prep_new_page() inside the loop.  This also
> allows list vs array variant to share the call.  And it should simplify
> the array variant code.
> 

I agree. I did not like the level of complexity it incurred for arrays
or the fact it required that a list to be empty when alloc_pages_bulk()
is called. I thought the concern for calling prep_new_page() with IRQs
disabled was a little overblown but did not feel strongly enough to push
back on it hard given that we've had problems with IRQs being disabled
for long periods before. At worst, at some point in the future we'll have
to cap the number of pages that can be requested or enable/disable IRQs
every X pages.

New candidate

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r4

Interface is still the same so a rebase should be trivial. Diff between
v6r2 and v6r4 is as follows. I like the diffstat if nothing else :P


 mm/page_alloc.c | 54 +++++++++++++-----------------------------------------
 1 file changed, 13 insertions(+), 41 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 547a84f11310..be1e33a4df39 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4999,25 +4999,20 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	struct alloc_context ac;
 	gfp_t alloc_gfp;
 	unsigned int alloc_flags;
-	int nr_populated = 0, prep_index = 0;
-	bool hole = false;
+	int nr_populated = 0;
 
 	if (WARN_ON_ONCE(nr_pages <= 0))
 		return 0;
 
-	if (WARN_ON_ONCE(page_list && !list_empty(page_list)))
-		return 0;
-
-	/* Skip populated array elements. */
-	if (page_array) {
-		while (nr_populated < nr_pages && page_array[nr_populated])
-			nr_populated++;
-		if (nr_populated == nr_pages)
-			return nr_populated;
-		prep_index = nr_populated;
-	}
+	/*
+	 * Skip populated array elements to determine if any pages need
+	 * to be allocated before disabling IRQs.
+	 */
+	while (page_array && nr_populated < nr_pages && page_array[nr_populated])
+		nr_populated++;
 
-	if (nr_pages == 1)
+	/* Use the single page allocator for one page. */
+	if (nr_pages - nr_populated == 1)
 		goto failed;
 
 	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
@@ -5056,22 +5051,17 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	if (!zone)
 		goto failed;
 
-retry_hole:
 	/* Attempt the batch allocation */
 	local_irq_save(flags);
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	pcp_list = &pcp->lists[ac.migratetype];
 
 	while (nr_populated < nr_pages) {
-		/*
-		 * Stop allocating if the next index has a populated
-		 * page or the page will be prepared a second time when
-		 * IRQs are enabled.
-		 */
+
+		/* Skip existing pages */
 		if (page_array && page_array[nr_populated]) {
-			hole = true;
 			nr_populated++;
-			break;
+			continue;
 		}
 
 		page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
@@ -5092,6 +5082,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		__count_zid_vm_events(PGALLOC, zone_idx(zone), 1);
 		zone_statistics(ac.preferred_zoneref->zone, zone);
 
+		prep_new_page(page, 0, gfp, 0);
 		if (page_list)
 			list_add(&page->lru, page_list);
 		else
@@ -5101,25 +5092,6 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 
 	local_irq_restore(flags);
 
-	/* Prep pages with IRQs enabled. */
-	if (page_list) {
-		list_for_each_entry(page, page_list, lru)
-			prep_new_page(page, 0, gfp, 0);
-	} else {
-		while (prep_index < nr_populated)
-			prep_new_page(page_array[prep_index++], 0, gfp, 0);
-
-		/*
-		 * If the array is sparse, check whether the array is
-		 * now fully populated. Continue allocations if
-		 * necessary.
-		 */
-		while (nr_populated < nr_pages && page_array[nr_populated])
-			nr_populated++;
-		if (hole && nr_populated < nr_pages)
-			goto retry_hole;
-	}
-
 	return nr_populated;
 
 failed_irq:
Jesper Dangaard Brouer March 23, 2021, 3:08 p.m. UTC | #11
On Tue, 23 Mar 2021 10:44:21 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Mon, Mar 22, 2021 at 09:18:42AM +0000, Mel Gorman wrote:
> > This series is based on top of Matthew Wilcox's series "Rationalise
> > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > test and are not using Andrew's tree as a baseline, I suggest using the
> > following git tree
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
> >   
> 
> Jesper and Chuck, would you mind rebasing on top of the following branch
> please? 
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r2
> 
> The interface is the same so the rebase should be trivial.
> 
> Jesper, I'm hoping you see no differences in performance but it's best
> to check.

I will rebase and check again.

In the current performance tests that I'm running, I observe that the
compiler lays out the code in unfortunate ways, which causes I-cache
performance issues.  I wonder if you could integrate the patch below
into your patchset? (Just squash it.)
Mel Gorman March 23, 2021, 4:29 p.m. UTC | #12
On Tue, Mar 23, 2021 at 04:08:14PM +0100, Jesper Dangaard Brouer wrote:
> On Tue, 23 Mar 2021 10:44:21 +0000
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > On Mon, Mar 22, 2021 at 09:18:42AM +0000, Mel Gorman wrote:
> > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > following git tree
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
> > >   
> > 
> > Jesper and Chuck, would you mind rebasing on top of the following branch
> > please? 
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r2
> > 
> > The interface is the same so the rebase should be trivial.
> > 
> > Jesper, I'm hoping you see no differences in performance but it's best
> > to check.
> 
> I will rebase and check again.
> 
> The current performance tests that I'm running, I observe that the
> compiler layout the code in unfortunate ways, which cause I-cache
> performance issues.  I wonder if you could integrate below patch with
> your patchset? (just squash it)
> 

Yes but I'll keep it as a separate patch that is modified slightly.
Otherwise it might get "fixed" as likely/unlikely has been used
inappropriately in the past. If there is pushback, I'll squash them
together.

> From: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> Looking at perf-report and ASM-code for __alloc_pages_bulk() then the code
> activated is suboptimal. The compiler guess wrong and place unlikely code in
> the beginning. Due to the use of WARN_ON_ONCE() macro the UD2 asm
> instruction is added to the code, which confuse the I-cache prefetcher in
> the CPU
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  mm/page_alloc.c |   10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f60f51a97a7b..88a5c1ce5b87 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5003,10 +5003,10 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  	unsigned int alloc_flags;
>  	int nr_populated = 0, prep_index = 0;
>  
> -	if (WARN_ON_ONCE(nr_pages <= 0))
> +	if (unlikely(nr_pages <= 0))
>  		return 0;
>  

Ok, I can make this change. It was a defensive check for the new callers
in case insane values were being passed in. 

> -	if (WARN_ON_ONCE(page_list && !list_empty(page_list)))
> +	if (unlikely(page_list && !list_empty(page_list)))
>  		return 0;
>  
>  	/* Skip populated array elements. */

FWIW, this check is now gone. The list only had to be empty if
prep_new_page was deferred until IRQs were enabled to avoid accidentally
calling prep_new_page() on a page that was already on the list when
alloc_pages_bulk was called.

> @@ -5018,7 +5018,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  		prep_index = nr_populated;
>  	}
>  
> -	if (nr_pages == 1)
> +	if (unlikely(nr_pages == 1))
>  		goto failed;
>  
>  	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */

I'm dropping this because nr_pages == 1 is common for the sunrpc user.

> @@ -5054,7 +5054,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  	 * If there are no allowed local zones that meets the watermarks then
>  	 * try to allocate a single page and reclaim if necessary.
>  	 */
> -	if (!zone)
> +	if (unlikely(!zone))
>  		goto failed;
>  
>  	/* Attempt the batch allocation */

Ok.

> @@ -5075,7 +5075,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>  
>  		page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
>  								pcp, pcp_list);
> -		if (!page) {
> +		if (unlikely(!page)) {
>  			/* Try and get at least one page */
>  			if (!nr_populated)
>  				goto failed_irq;

Hmmm, ok. It depends on memory pressure but I agree !page is unlikely.

Current version applied is

--8<--
mm/page_alloc: optimize code layout for __alloc_pages_bulk

From: Jesper Dangaard Brouer <brouer@redhat.com>

Looking at perf-report and ASM-code for __alloc_pages_bulk() it is clear
that the code activated is suboptimal. The compiler guesses wrong and
places unlikely code at the beginning. Due to the use of the WARN_ON_ONCE()
macro, the UD2 asm instruction is added to the code, which confuses the
I-cache prefetcher in the CPU.

[mgorman: Minor changes and rebasing]
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index be1e33a4df39..1ec18121268b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5001,7 +5001,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	unsigned int alloc_flags;
 	int nr_populated = 0;
 
-	if (WARN_ON_ONCE(nr_pages <= 0))
+	if (unlikely(nr_pages <= 0))
 		return 0;
 
 	/*
@@ -5048,7 +5048,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	 * If there are no allowed local zones that meets the watermarks then
 	 * try to allocate a single page and reclaim if necessary.
 	 */
-	if (!zone)
+	if (unlikely(!zone))
 		goto failed;
 
 	/* Attempt the batch allocation */
@@ -5066,7 +5066,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 
 		page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
 								pcp, pcp_list);
-		if (!page) {
+		if (unlikely(!page)) {
 			/* Try and get at least one page */
 			if (!nr_populated)
 				goto failed_irq;
Jesper Dangaard Brouer March 23, 2021, 5:06 p.m. UTC | #13
On Tue, 23 Mar 2021 16:08:14 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Tue, 23 Mar 2021 10:44:21 +0000
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > On Mon, Mar 22, 2021 at 09:18:42AM +0000, Mel Gorman wrote:  
> > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > following git tree
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
> > >     

I've pushed my benchmark notes for this branch, mm-bulk-rebase-v5r9:

 [1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#test-on-mel-git-tree-mm-bulk-rebase-v5r9

> > Jesper and Chuck, would you mind rebasing on top of the following branch
> > please? 
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r2

I'll rebase on mm-bulk-rebase-v6r4 tomorrow.
Chuck Lever March 23, 2021, 6:52 p.m. UTC | #14
> On Mar 23, 2021, at 10:45 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> On Tue, Mar 23, 2021 at 12:08:51PM +0100, Jesper Dangaard Brouer wrote:
>>>> <SNIP>
>>>> My results show that, because svc_alloc_arg() ends up calling
>>>> __alloc_pages_bulk() twice in this case, it ends up being
>>>> twice as expensive as the list case, on average, for the same
>>>> workload.
>>>> 
>>> 
>>> Ok, so in this case the caller knows that holes are always at the
>>> start. If the API returns an index that is a valid index and populated,
>>> it can check the next index and if it is valid then the whole array
>>> must be populated.
>>> 
>>> <SNIP>
>> 
>> I do know that I suggested moving prep_new_page() out of the
>> IRQ-disabled loop, but maybe was a bad idea, for several reasons.
>> 
>> All prep_new_page does is to write into struct page, unless some
>> debugging stuff (like kasan) is enabled. This cache-line is hot as
>> LRU-list update just wrote into this cache-line.  As the bulk size goes
>> up, as Matthew pointed out, this cache-line might be pushed into
>> L2-cache, and then need to be accessed again when prep_new_page() is
>> called.
>> 
>> Another observation is that moving prep_new_page() into loop reduced
>> function size with 253 bytes (which affect I-cache).
>> 
>>   ./scripts/bloat-o-meter mm/page_alloc.o-prep_new_page-outside mm/page_alloc.o-prep_new_page-inside
>>    add/remove: 18/18 grow/shrink: 0/1 up/down: 144/-397 (-253)
>>    Function                                     old     new   delta
>>    __alloc_pages_bulk                          1965    1712    -253
>>    Total: Before=60799, After=60546, chg -0.42%
>> 
>> Maybe it is better to keep prep_new_page() inside the loop.  This also
>> allows list vs array variant to share the call.  And it should simplify
>> the array variant code.
>> 
> 
> I agree. I did not like the level of complexity it incurred for arrays
> or the fact it required that a list to be empty when alloc_pages_bulk()
> is called. I thought the concern for calling prep_new_page() with IRQs
> disabled was a little overblown but did not feel strongly enough to push
> back on it hard given that we've had problems with IRQs being disabled
> for long periods before. At worst, at some point in the future we'll have
> to cap the number of pages that can be requested or enable/disable IRQs
> every X pages.
> 
> New candidate
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r4

I have rebased the SUNRPC patches to v6r4. Testing has shown a
minor functional regression, which I'm chasing down. But
performance is in the same ballpark. FYI


> Interface is still the same so a rebase should be trivial. Diff between
> v6r2 and v6r4 is as follows. I like the diffstat if nothing else :P
> 
> 
> mm/page_alloc.c | 54 +++++++++++++-----------------------------------------
> 1 file changed, 13 insertions(+), 41 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 547a84f11310..be1e33a4df39 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4999,25 +4999,20 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> 	struct alloc_context ac;
> 	gfp_t alloc_gfp;
> 	unsigned int alloc_flags;
> -	int nr_populated = 0, prep_index = 0;
> -	bool hole = false;
> +	int nr_populated = 0;
> 
> 	if (WARN_ON_ONCE(nr_pages <= 0))
> 		return 0;
> 
> -	if (WARN_ON_ONCE(page_list && !list_empty(page_list)))
> -		return 0;
> -
> -	/* Skip populated array elements. */
> -	if (page_array) {
> -		while (nr_populated < nr_pages && page_array[nr_populated])
> -			nr_populated++;
> -		if (nr_populated == nr_pages)
> -			return nr_populated;
> -		prep_index = nr_populated;
> -	}
> +	/*
> +	 * Skip populated array elements to determine if any pages need
> +	 * to be allocated before disabling IRQs.
> +	 */
> +	while (page_array && page_array[nr_populated] && nr_populated < nr_pages)
> +		nr_populated++;
> 
> -	if (nr_pages == 1)
> +	/* Use the single page allocator for one page. */
> +	if (nr_pages - nr_populated == 1)
> 		goto failed;
> 
> 	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> @@ -5056,22 +5051,17 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> 	if (!zone)
> 		goto failed;
> 
> -retry_hole:
> 	/* Attempt the batch allocation */
> 	local_irq_save(flags);
> 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
> 	pcp_list = &pcp->lists[ac.migratetype];
> 
> 	while (nr_populated < nr_pages) {
> -		/*
> -		 * Stop allocating if the next index has a populated
> -		 * page or the page will be prepared a second time when
> -		 * IRQs are enabled.
> -		 */
> +
> +		/* Skip existing pages */
> 		if (page_array && page_array[nr_populated]) {
> -			hole = true;
> 			nr_populated++;
> -			break;
> +			continue;
> 		}
> 
> 		page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> @@ -5092,6 +5082,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> 		__count_zid_vm_events(PGALLOC, zone_idx(zone), 1);
> 		zone_statistics(ac.preferred_zoneref->zone, zone);
> 
> +		prep_new_page(page, 0, gfp, 0);
> 		if (page_list)
> 			list_add(&page->lru, page_list);
> 		else
> @@ -5101,25 +5092,6 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> 
> 	local_irq_restore(flags);
> 
> -	/* Prep pages with IRQs enabled. */
> -	if (page_list) {
> -		list_for_each_entry(page, page_list, lru)
> -			prep_new_page(page, 0, gfp, 0);
> -	} else {
> -		while (prep_index < nr_populated)
> -			prep_new_page(page_array[prep_index++], 0, gfp, 0);
> -
> -		/*
> -		 * If the array is sparse, check whether the array is
> -		 * now fully populated. Continue allocations if
> -		 * necessary.
> -		 */
> -		while (nr_populated < nr_pages && page_array[nr_populated])
> -			nr_populated++;
> -		if (hole && nr_populated < nr_pages)
> -			goto retry_hole;
> -	}
> -
> 	return nr_populated;
> 
> failed_irq:
> 
> -- 
> Mel Gorman
> SUSE Labs

--
Chuck Lever