[v2,0/3] mm/page_alloc: Remote per-cpu page list drain support

Message ID 20211103170512.2745765-1-nsaenzju@redhat.com (mailing list archive)

Message

Nicolas Saenz Julienne Nov. 3, 2021, 5:05 p.m. UTC
This series introduces a new locking scheme around mm/page_alloc.c's per-cpu
page lists which will allow remote CPUs to drain them. Currently, only the
local CPU is permitted to change its per-cpu lists, and it's expected to do
so, on-demand, whenever a drain is requested (by means of queueing a drain
task on the local CPU). Most systems will handle this promptly, but it'll
cause problems for NOHZ_FULL CPUs that can't take any sort of interruption
without breaking their functional guarantees (latency, bandwidth, etc.).

This new locking scheme, based on per-cpu spinlocks, is the simplest and most
maintainable approach so far[1], although it also has a drawback: it comes
with a small performance cost. Depending on the page allocation code path
micro-benchmark, we can expect 0% to 0.6% degradation on x86_64, and 0% to 2%
on arm64[2].
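
For reference, the core of the new scheme looks roughly like this. This is a
heavily simplified sketch, not the actual patch: the struct and function names
below are made up, and the real per_cpu_pages has multiple lists, batching,
and more state.

#include <linux/list.h>
#include <linux/mm_types.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

/* Stand-in for struct per_cpu_pages: the list is now protected by its
 * own spinlock instead of a CPU-local lock. Each lock must be set up
 * with spin_lock_init() for every possible CPU at boot. */
struct pcp_list {
        spinlock_t lock;        /* protects pages and count */
        struct list_head pages;
        int count;
};

static DEFINE_PER_CPU(struct pcp_list, pcp_lists);

/* Local fast path: a CPU frees a page onto its own list. The lock is
 * almost always uncontended, hence the small measured overhead. */
static void pcp_free_local(struct page *page)
{
        struct pcp_list *pcp = get_cpu_ptr(&pcp_lists);

        spin_lock(&pcp->lock);
        list_add(&page->lru, &pcp->pages);
        pcp->count++;
        spin_unlock(&pcp->lock);
        put_cpu_ptr(&pcp_lists);
}

/* Remote drain: with a spinlock, any CPU can empty another CPU's list,
 * so no drain work has to be queued on nohz_full CPUs anymore. */
static void pcp_drain_cpu(int cpu)
{
        struct pcp_list *pcp = per_cpu_ptr(&pcp_lists, cpu);

        spin_lock(&pcp->lock);
        /* ... hand pcp->pages back to the buddy allocator ... */
        INIT_LIST_HEAD(&pcp->pages);
        pcp->count = 0;
        spin_unlock(&pcp->lock);
}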

Assuming there is nothing too horrible in the patches themselves, I believe
it all comes down to whether we prefer to take the small performance hit or
the maintenance burden of a more complex solution[1]. I don't have enough
experience with performance tuning or maintenance to have an authoritative
opinion, so I'll defer to whatever is discussed here. Also, I'll be happy to
run any extra tests that I might have missed.

Patch #1 could be taken regardless of the rest of the series as it removes dead
code.

The series is based on today's linux-next. 

Changes since v1:
 - Provide performance numbers
 - Unconditionally use per-cpu spinlocks

[1] Other approaches can be found here:

  - Static branch conditional on nohz_full; no performance loss, but the
    extra config option makes it painful to maintain (v1):
    https://lore.kernel.org/linux-mm/20210921161323.607817-5-nsaenzju@redhat.com/

  - RCU-based approach; complex, yet a bit less taxing performance-wise
    (RFC):
    https://lore.kernel.org/linux-mm/20211008161922.942459-4-nsaenzju@redhat.com/

[2] See individual patches for in-depth results

---

Nicolas Saenz Julienne (3):
  mm/page_alloc: Don't pass pfn to free_unref_page_commit()
  mm/page_alloc: Convert per-cpu lists' local locks to per-cpu spin
    locks
  mm/page_alloc: Remotely drain per-cpu lists

 include/linux/mmzone.h |   1 +
 mm/page_alloc.c        | 151 ++++++++++++++---------------------------
 2 files changed, 52 insertions(+), 100 deletions(-)

Comments

Vlastimil Babka Nov. 23, 2021, 2:58 p.m. UTC | #1
On 11/3/21 18:05, Nicolas Saenz Julienne wrote:
> This series introduces a new locking scheme around mm/page_alloc.c's per-cpu
> page lists which will allow remote CPUs to drain them. Currently, only the
> local CPU is permitted to change its per-cpu lists, and it's expected to do
> so, on-demand, whenever a drain is requested (by means of queueing a drain
> task on the local CPU). Most systems will handle this promptly, but it'll
> cause problems for NOHZ_FULL CPUs that can't take any sort of interruption
> without breaking their functional guarantees (latency, bandwidth, etc.).
> 
> This new locking scheme, based on per-cpu spinlocks, is the simplest and most
> maintainable approach so far[1], although it also has a drawback: it comes
> with a small performance cost. Depending on the page allocation code path
> micro-benchmark, we can expect 0% to 0.6% degradation on x86_64, and 0% to 2%
> on arm64[2].
> 
> Assuming there is nothing too horrible in the patches themselves, I believe
> it all comes down to whether we prefer to take the small performance hit or
> the maintenance burden of a more complex solution[1]. I don't have enough

I'd be for the small performance hit over a more complex solution, if possible.

> experience with performance tuning or maintenance to have an authoritative
> opinion, so I'll defer to whatever is discussed here. Also, I'll be happy to
> run any extra tests that I might have missed.

I think Mel has done most of the recent page allocator optimizations, so he
would be the most authoritative voice on what is or isn't acceptable.

> Patch #1 could be taken regardless of the rest of the series as it removes dead
> code.
> 
> The series is based on today's linux-next. 
> 
> Changes since v1:
>  - Provide performance numbers
>  - Unconditionally use per-cpu spinlocks
> 
> [1] Other approaches can be found here:
> 
>   - Static branch conditional on nohz_full; no performance loss, but the
>     extra config option makes it painful to maintain (v1):
>     https://lore.kernel.org/linux-mm/20210921161323.607817-5-nsaenzju@redhat.com/
> 
>   - RCU-based approach; complex, yet a bit less taxing performance-wise
>     (RFC):
>     https://lore.kernel.org/linux-mm/20211008161922.942459-4-nsaenzju@redhat.com/

Hm, I wonder if there might still be another alternative. IIRC I did propose
at some point a local drain on the NOHZ cpu before returning to userspace,
and then avoiding that cpu in remote drains, but tglx didn't like the idea of
making entering NOHZ full mode more expensive [1].

But what if we instead set pcp->high = 0 for these cpus, so they would avoid
populating the pcplists in the first place? Then there wouldn't have to be a
drain at all. On the other hand, page allocator operations would not benefit
from zone lock batching on those cpus. But perhaps that would be an
acceptable tradeoff, as a nohz cpu is expected to run in userspace most of
the time, and page allocator operations are rare except maybe some initial
page faults? (I assume those kinds of workloads pre-populate and/or mlock
their address space anyway.)
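
Roughly, the free path would then always take the slow path. This is very
loosely modeled on the current free_unref_page_commit(); the real signature
and per-migratetype list handling differ:

static void free_unref_page_commit(struct zone *zone,
                                   struct per_cpu_pages *pcp,
                                   struct page *page)
{
        list_add(&page->lru, &pcp->lists[0]);
        pcp->count++;

        /* With pcp->high == 0 on nohz_full CPUs this check fires on
         * every free: the page goes straight back to the buddy
         * allocator under the zone lock, the pcplist never accumulates
         * pages, and there is nothing left to drain remotely. */
        if (pcp->count >= READ_ONCE(pcp->high))
                free_pcppages_bulk(zone, pcp->count, pcp);
}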

[1] https://lore.kernel.org/all/878rznh93e.ffs@tglx/

> [2] See individual patches for in-depth results
> 
> ---
> 
> Nicolas Saenz Julienne (3):
>   mm/page_alloc: Don't pass pfn to free_unref_page_commit()
>   mm/page_alloc: Convert per-cpu lists' local locks to per-cpu spin
>     locks
>   mm/page_alloc: Remotely drain per-cpu lists
> 
>  include/linux/mmzone.h |   1 +
>  mm/page_alloc.c        | 151 ++++++++++++++---------------------------
>  2 files changed, 52 insertions(+), 100 deletions(-)
>
Nicolas Saenz Julienne Nov. 30, 2021, 6:09 p.m. UTC | #2
Hi Vlastimil, sorry for the late reply and thanks for your feedback. :)

On Tue, 2021-11-23 at 15:58 +0100, Vlastimil Babka wrote:
> > [1] Other approaches can be found here:
> > 
> >   - Static branch conditional on nohz_full; no performance loss, but the
> >     extra config option makes it painful to maintain (v1):
> >     https://lore.kernel.org/linux-mm/20210921161323.607817-5-nsaenzju@redhat.com/
> > 
> >   - RCU-based approach; complex, yet a bit less taxing performance-wise
> >     (RFC):
> >     https://lore.kernel.org/linux-mm/20211008161922.942459-4-nsaenzju@redhat.com/
> 
> > Hm, I wonder if there might still be another alternative. IIRC I did propose
> > at some point a local drain on the NOHZ cpu before returning to userspace,
> > and then avoiding that cpu in remote drains, but tglx didn't like the idea of
> > making entering NOHZ full mode more expensive [1].
> 
> > But what if we instead set pcp->high = 0 for these cpus, so they would avoid
> > populating the pcplists in the first place? Then there wouldn't have to be a
> > drain at all. On the other hand, page allocator operations would not benefit
> > from zone lock batching on those cpus. But perhaps that would be an
> > acceptable tradeoff, as a nohz cpu is expected to run in userspace most of
> > the time, and page allocator operations are rare except maybe some initial
> > page faults? (I assume those kinds of workloads pre-populate and/or mlock
> > their address space anyway.)

I've looked a bit into this and it seems straightforward. Our workloads
pre-populate everything, and a slight startup performance hit is not that
tragic (I'll measure it nonetheless). The per-cpu nohz_full state will at some
point become dynamic, but the feature seems simple enough to enable/disable.
I'll have to teach __drain_all_pages(zone, force_all_cpus=true) to bypass this
special case, but that's all. I might have a go at this.
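
Something along these lines, I'd guess (untested; calculate_pcp_high() is
just a stand-in for the existing high/batch computation):

/* Untested sketch: pin the pcplist high watermark to zero on nohz_full
 * CPUs so their lists never fill up in the first place. */
static void zone_update_pcp_high(struct zone *zone)
{
        int cpu;

        for_each_possible_cpu(cpu) {
                struct per_cpu_pages *pcp =
                        per_cpu_ptr(zone->per_cpu_pageset, cpu);
                int high = calculate_pcp_high(zone);

                if (tick_nohz_full_cpu(cpu))
                        high = 0;

                WRITE_ONCE(pcp->high, high);
        }
}

/* __drain_all_pages(zone, force_all_cpus=true) would still need to
 * visit nohz_full CPUs to flush anything freed before the watermark
 * was lowered. */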

Thanks!
Marcelo Tosatti Dec. 1, 2021, 2:01 p.m. UTC | #3
On Tue, Nov 30, 2021 at 07:09:23PM +0100, Nicolas Saenz Julienne wrote:
> Hi Vlastimil, sorry for the late reply and thanks for your feedback. :)
> 
> On Tue, 2021-11-23 at 15:58 +0100, Vlastimil Babka wrote:
> > > [1] Other approaches can be found here:
> > > 
> > >   - Static branch conditional on nohz_full; no performance loss, but the
> > >     extra config option makes it painful to maintain (v1):
> > >     https://lore.kernel.org/linux-mm/20210921161323.607817-5-nsaenzju@redhat.com/
> > > 
> > >   - RCU-based approach; complex, yet a bit less taxing performance-wise
> > >     (RFC):
> > >     https://lore.kernel.org/linux-mm/20211008161922.942459-4-nsaenzju@redhat.com/
> > 
> > Hm, I wonder if there might still be another alternative. IIRC I did propose
> > at some point a local drain on the NOHZ cpu before returning to userspace,
> > and then avoiding that cpu in remote drains, but tglx didn't like the idea of
> > making entering NOHZ full mode more expensive [1].
> > 
> > But what if we instead set pcp->high = 0 for these cpus, so they would avoid
> > populating the pcplists in the first place? Then there wouldn't have to be a
> > drain at all. On the other hand, page allocator operations would not benefit
> > from zone lock batching on those cpus. But perhaps that would be an
> > acceptable tradeoff, as a nohz cpu is expected to run in userspace most of
> > the time, and page allocator operations are rare except maybe some initial
> > page faults? (I assume those kinds of workloads pre-populate and/or mlock
> > their address space anyway.)
> 
> I've looked a bit into this and it seems straightforward. Our workloads
> pre-populate everything, and a slight startup performance hit is not that
> tragic (I'll measure it nonetheless). The per-cpu nohz_full state will at some
> point become dynamic, but the feature seems simple enough to enable/disable.
> I'll have to teach __drain_all_pages(zone, force_all_cpus=true) to bypass this
> special case, but that's all. I might have a go at this.
> 
> Thanks!
> 
> -- 
> Nicolás Sáenz

True, but a nohz cpu does not necessarily have to run in userspace most of
the time. For example, an application can enter nohz full mode, go back to
userspace, idle, and return from idle, all without leaving nohz_full mode.

So it's not clear that nohz_full is an appropriate trigger for setting
pcp->high = 0. Perhaps a task isolation feature would be a more appropriate
place for this.