mbox series

[v2,0/3] Create 'old' ptes for faultaround mappings on arm64 with hardware access flag

Message ID 20210108171517.5290-1-will@kernel.org (mailing list archive)
Headers show
Series Create 'old' ptes for faultaround mappings on arm64 with hardware access flag | expand

Message

Will Deacon Jan. 8, 2021, 5:15 p.m. UTC
Hi all,

This is version two of the series I originally posted here:

  https://lore.kernel.org/r/20201209163950.8494-1-will@kernel.org

The patches allow architectures to opt-in at runtime for faultaround
mappings to be created as 'old' instead of 'young'. Although there have
been previous attempts at this, they failed either because the decision
was deferred to userspace [1] or because it was done unconditionally and
shown to regress benchmarks for particular architectures [2].

The big difference in this version is that I have reworked it based on
Kirill's patch which he posted as a follow-up to the original. However,
I can't tell where we've landed on that -- Linus seemed to like it, but
Hugh was less enthusiastic. I think that my subsequent patches are an
awful lot cleaner after the rework, but I don't think I necessarily
depend on everything in there so if that patch is likely to be a
sticking point then I can try to extract the parts I need.

Feedback welcome.

Cheers,

Will

[1] https://www.spinics.net/lists/linux-mm/msg143831.html
[2] 315d09bf30c2 ("Revert "mm: make faultaround produce old ptes"")

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: <kernel-team@android.com>

--->8

Kirill A. Shutemov (1):
  mm: Cleanup faultaround and finish_fault() codepaths

Will Deacon (2):
  mm: Allow architectures to request 'old' entries when prefaulting
  arm64: mm: Implement arch_wants_old_prefaulted_pte()

 arch/arm64/include/asm/pgtable.h |  12 +-
 fs/xfs/xfs_file.c                |   6 +-
 include/linux/mm.h               |  17 ++-
 include/linux/pgtable.h          |  11 ++
 mm/filemap.c                     | 181 +++++++++++++++++++------
 mm/memory.c                      | 219 +++++++++++--------------------
 6 files changed, 251 insertions(+), 195 deletions(-)

Comments

Linus Torvalds Jan. 8, 2021, 7:34 p.m. UTC | #1
On Fri, Jan 8, 2021 at 9:15 AM Will Deacon <will@kernel.org> wrote:
>
> The big difference in this version is that I have reworked it based on
> Kirill's patch which he posted as a follow-up to the original. However,
> I can't tell where we've landed on that -- Linus seemed to like it, but
> Hugh was less enthusiastic.

Yeah, I like it, but I have to admit that it had a disturbingly high
number of small details wrong for several versions. I hope you picked
up the final version of the code.

At the same time, I do think that the "disturbingly high number of
issues" was primarily exactly _because_ the old code was so
incomprehensible, and I think the end result is much cleaner, so I
still like it.

>I think that my subsequent patches are an
> awful lot cleaner after the rework

Yeah, I think that's a side effect of "now the code really makes a lot
more sense". Your subsequent patches 2-3 certainly are much simpler
now, although I'd be inclined to add an argument to "do_set_pte()"
that has the "write" and "pretault" bits in it, instead of having to
modify the 'vmf' structure.

I still dislike how we basically randomly modify the information in
that 'vmf' thing.

That said, now it's just a small detail - not really objectionable,
just a "this could be cleaner, I think".

I think it was Kirill who pointed out that we sadly cannot make 'vmf'
read-only anyway, because it does also contain those pre-allocation
details etc (vmf->pte etc) that are very much about what the current
"state" of the fault is. So while I would hope it could be more
read-only than it is, my wish that it could _actually_ be 'const' is
clearly just an irrelevant dream.

                   Linus
Linus Torvalds Jan. 8, 2021, 7:42 p.m. UTC | #2
On Fri, Jan 8, 2021 at 11:34 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yeah, I think that's a side effect of "now the code really makes a lot
> more sense". Your subsequent patches 2-3 certainly are much simpler
> now

On that note - they could be simpler still if this was just done
entirely unconditionally..

I'm taking your word for "it makes sense", but when you say

  On CPUs with hardware AF/DBM, initialising prefaulted PTEs as 'old'
  improves vmscan behaviour and does not appear to introduce any overhead.

in the description for patch 3, it makes me wonder how noticeable the
overhead is on the hardware that _does_ take a fault on old pte's..

IOW, it would be lovely to see numbers if you have any like that..

Both ways, actually. Because I also wonder how noticeable the vmscan
improvement is. You say there's no measurable overhead for platforms
with hardware dirty/accessed bits, but maybe there's not a lot of
measurable improvements from a more exact accessed bit either?

             Linus
Will Deacon Jan. 11, 2021, 1:30 p.m. UTC | #3
On Fri, Jan 08, 2021 at 11:34:08AM -0800, Linus Torvalds wrote:
> On Fri, Jan 8, 2021 at 9:15 AM Will Deacon <will@kernel.org> wrote:
> >
> > The big difference in this version is that I have reworked it based on
> > Kirill's patch which he posted as a follow-up to the original. However,
> > I can't tell where we've landed on that -- Linus seemed to like it, but
> > Hugh was less enthusiastic.
> 
> Yeah, I like it, but I have to admit that it had a disturbingly high
> number of small details wrong for several versions. I hope you picked
> up the final version of the code.

I picked the version from here:

  https://lore.kernel.org/r/20201229132819.najtavneutnf7ajp@box

and actually, I just noticed that willy spotted a typo in a comment, so
I'll fix that locally as well as adding the above to a 'Link:' tag for
reference.

> At the same time, I do think that the "disturbingly high number of
> issues" was primarily exactly _because_ the old code was so
> incomprehensible, and I think the end result is much cleaner, so I
> still like it.
> 
> >I think that my subsequent patches are an
> > awful lot cleaner after the rework
> 
> Yeah, I think that's a side effect of "now the code really makes a lot
> more sense". Your subsequent patches 2-3 certainly are much simpler
> now, although I'd be inclined to add an argument to "do_set_pte()"
> that has the "write" and "pretault" bits in it, instead of having to
> modify the 'vmf' structure.

I played with a few different ways of doing this, but I can't say I prefer
them over what I ended up posting. Having a bunch of 'bool' arguments makes
the callers hard to read and brings into question what exactly vmf->flags is
for. I also tried adding a separate 'address' parameter so that vmf->address
is always the real faulting address, but 'address' is the thing to use for
the pte (i.e. prefault is when 'address != vmf->address'). That wasn't too
bad, but it made the normal finish_fault() case look weird.

So I think I'll leave it as-is and see if anybody wants to change it later
on.

Will
Will Deacon Jan. 11, 2021, 2:01 p.m. UTC | #4
On Fri, Jan 08, 2021 at 11:42:39AM -0800, Linus Torvalds wrote:
> On Fri, Jan 8, 2021 at 11:34 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Yeah, I think that's a side effect of "now the code really makes a lot
> > more sense". Your subsequent patches 2-3 certainly are much simpler
> > now
> 
> On that note - they could be simpler still if this was just done
> entirely unconditionally..
> 
> I'm taking your word for "it makes sense", but when you say
> 
>   On CPUs with hardware AF/DBM, initialising prefaulted PTEs as 'old'
>   improves vmscan behaviour and does not appear to introduce any overhead.
> 
> in the description for patch 3, it makes me wonder how noticeable the
> overhead is on the hardware that _does_ take a fault on old pte's..
> 
> IOW, it would be lovely to see numbers if you have any like that..

[Vinayak -- please chime in if I miss anything here, as you've posted these
 numbers before]

The initial posting in 2016 had some numbers based on a 3.18 kernel, which
didn't have support for hardware AF/DBM:

  https://lore.kernel.org/lkml/fdc23a2a-b42a-f0af-d403-41ea4e755084@codeaurora.org

  (note that "Kirill's-fix" in the last column was a quick hack and didn't
   make the faulting pte young)

So yes, for the cases we care about in Android (where the vmscan behaviour
seems to be the important thing), then this patch makes sense for
non-hardware AF/DBM CPUs too. In either case, we see ~80% reduction in
direct reclaim time according to mmtests [1] and double-digit percentage
reductions in app launch latency (some of this is mentioned in the link
above). The actual fault cost isn't especially relevant.

*However...*

For machines with lots of memory, the increased fault cost when hardware
AF/DBM is not available may well be measurable, and I suspect it would
hurt unixbench (which was the reason behind reverting this on x86 [2],
although I must admit that the diagnosis wasn't particularly satisfactory
[3]). We could run those numbers on arm64 but, due to the wide diversity of
micro-architectures we have to deal with, I would like to keep our options
open to detecting this dynamically anyway, just in case somebody builds a
CPU which struggles in this area.

Cheers,

Will

[1] https://github.com/gormanm/mmtests
[2] https://lore.kernel.org/lkml/20160613125248.GA30109@black.fi.intel.com/
[3] https://lore.kernel.org/lkml/20160616151049.GM6836@dhcp22.suse.cz/
Kirill A. Shutemov Jan. 11, 2021, 2:24 p.m. UTC | #5
On Fri, Jan 08, 2021 at 11:34:08AM -0800, Linus Torvalds wrote:
> I still dislike how we basically randomly modify the information in
> that 'vmf' thing.

I wounder if it would be acceptable to pass down to faultaround a copy
of vmf, so it mess with it without risking to corrupt the original one?

We would need to transfer back prealloc_pte, but the rest can be safely
discarded, I believe.

But it's somewhat wasteful of stack. sizeof(vmf) == 96 on x86-64.
Linus Torvalds Jan. 11, 2021, 7:25 p.m. UTC | #6
On Mon, Jan 11, 2021 at 6:24 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> I wonder if it would be acceptable to pass down to faultaround a copy
> of vmf, so it mess with it without risking to corrupt the original one?

I'd almost prefer to split vmf into two parts: the 'this is the fault
info' part and the 'this is the fault handling state' part.

So the first one would be filled in by the actual page faulter (or
GUP) - and then be 'const' during the lookup, while the second one
would be set up by handle_mm_fault() and would contain that "this is
the current state of my fault state machine" and contain things like
that ->pte thing.

And then if somebody actually needs to pass in "modified fault state"
(ie that whole "I'm doing fault-around, so I'll use multiple
addresses") they'd never modify the address in the fault info, they'd
just pass the address as an explicit argument (like most cases already
do - the "change addr or flags in vmf" is actually already _fairly_
rare).

               Linus
Hugh Dickins Jan. 11, 2021, 9:03 p.m. UTC | #7
On Mon, 11 Jan 2021, Will Deacon wrote:
> On Fri, Jan 08, 2021 at 11:34:08AM -0800, Linus Torvalds wrote:
> > On Fri, Jan 8, 2021 at 9:15 AM Will Deacon <will@kernel.org> wrote:
> > >
> > > The big difference in this version is that I have reworked it based on
> > > Kirill's patch which he posted as a follow-up to the original. However,
> > > I can't tell where we've landed on that -- Linus seemed to like it, but
> > > Hugh was less enthusiastic.
> > 
> > Yeah, I like it, but I have to admit that it had a disturbingly high
> > number of small details wrong for several versions. I hope you picked
> > up the final version of the code.
> 
> I picked the version from here:
> 
>   https://lore.kernel.org/r/20201229132819.najtavneutnf7ajp@box
> 
> and actually, I just noticed that willy spotted a typo in a comment, so
> I'll fix that locally as well as adding the above to a 'Link:' tag for
> reference.
> 
> > At the same time, I do think that the "disturbingly high number of
> > issues" was primarily exactly _because_ the old code was so
> > incomprehensible, and I think the end result is much cleaner, so I
> > still like it.

Just to report that I gave this v2 set a spin on a few (x86_64 and i386)
machines, and found nothing objectionable this time around.

And the things that I'm unenthusiastic about are exactly those details
that you and Kirill and Linus find unsatisfactory, but awkward to
eliminate: expect no new insights from me!

Hugh
Will Deacon Jan. 12, 2021, 9:46 p.m. UTC | #8
On Mon, Jan 11, 2021 at 01:03:29PM -0800, Hugh Dickins wrote:
> On Mon, 11 Jan 2021, Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 11:34:08AM -0800, Linus Torvalds wrote:
> > > On Fri, Jan 8, 2021 at 9:15 AM Will Deacon <will@kernel.org> wrote:
> > > >
> > > > The big difference in this version is that I have reworked it based on
> > > > Kirill's patch which he posted as a follow-up to the original. However,
> > > > I can't tell where we've landed on that -- Linus seemed to like it, but
> > > > Hugh was less enthusiastic.
> > > 
> > > Yeah, I like it, but I have to admit that it had a disturbingly high
> > > number of small details wrong for several versions. I hope you picked
> > > up the final version of the code.
> > 
> > I picked the version from here:
> > 
> >   https://lore.kernel.org/r/20201229132819.najtavneutnf7ajp@box
> > 
> > and actually, I just noticed that willy spotted a typo in a comment, so
> > I'll fix that locally as well as adding the above to a 'Link:' tag for
> > reference.
> > 
> > > At the same time, I do think that the "disturbingly high number of
> > > issues" was primarily exactly _because_ the old code was so
> > > incomprehensible, and I think the end result is much cleaner, so I
> > > still like it.
> 
> Just to report that I gave this v2 set a spin on a few (x86_64 and i386)
> machines, and found nothing objectionable this time around.

Thanks, Hugh.

> And the things that I'm unenthusiastic about are exactly those details
> that you and Kirill and Linus find unsatisfactory, but awkward to
> eliminate: expect no new insights from me!

Well, I'll keep you on CC for v3 -- just in case!

Will
Will Deacon Jan. 12, 2021, 9:47 p.m. UTC | #9
On Mon, Jan 11, 2021 at 11:25:37AM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2021 at 6:24 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > I wonder if it would be acceptable to pass down to faultaround a copy
> > of vmf, so it mess with it without risking to corrupt the original one?
> 
> I'd almost prefer to split vmf into two parts: the 'this is the fault
> info' part and the 'this is the fault handling state' part.
> 
> So the first one would be filled in by the actual page faulter (or
> GUP) - and then be 'const' during the lookup, while the second one
> would be set up by handle_mm_fault() and would contain that "this is
> the current state of my fault state machine" and contain things like
> that ->pte thing.
> 
> And then if somebody actually needs to pass in "modified fault state"
> (ie that whole "I'm doing fault-around, so I'll use multiple
> addresses") they'd never modify the address in the fault info, they'd
> just pass the address as an explicit argument (like most cases already
> do - the "change addr or flags in vmf" is actually already _fairly_
> rare).

Alright then, I'll take another crack at this for v3 and see how far I
get.

Will
Linus Torvalds Jan. 12, 2021, 9:55 p.m. UTC | #10
On Tue, Jan 12, 2021 at 1:48 PM Will Deacon <will@kernel.org> wrote:
>
> > And then if somebody actually needs to pass in "modified fault state"
> > (ie that whole "I'm doing fault-around, so I'll use multiple
> > addresses") they'd never modify the address in the fault info, they'd
> > just pass the address as an explicit argument (like most cases already
> > do - the "change addr or flags in vmf" is actually already _fairly_
> > rare).
>
> Alright then, I'll take another crack at this for v3 and see how far I
> get.

But please do keep it as a separate patch, and if it turns out to be
ugly for whatever reason, treat it as just a "maybe that would help"
suggestion rather than anything stronger..

           Linus