[00/13] mm/munlock: rework of mlock+munlock page handling

Message ID 8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com

Message

Hugh Dickins Feb. 6, 2022, 9:27 p.m. UTC
I wondered whether to post this munlocking rework in
https://lore.kernel.org/linux-mm/35c340a6-96f-28a0-2b7b-2f9fbddc01f@google.com/

There the discussion was OOM reaping, but the main reason for the rework
has been catastrophic contention on i_mmap_rwsem when exiting from
multiply mlocked files; and frustration with how contorted munlocking is.

Here it is based on 5.17-rc2, applies also to -rc3, almost cleanly to
mmotm 2022-02-03-21-58 (needs two easy fixups in mm/huge_memory.c); but
likely to conflict (I hope not fundamentally) with several concurrent
large patchsets.

tl;dr
 mm/mlock.c  |  634 +++++++++++++++-----------------------
 23 files changed, 510 insertions(+), 779 deletions(-)

In my own opinion, this is ready to go in: that whatever its defects,
it's a big practical improvement over what's presently there.  Others
may differ; and it may be too much to arrive at a quick opinion on.
My hope is that it will strike a chord with some who have laboured in
this area in the past: I'll be short of time to do much more with it,
so maybe someone else could take over if necessary.

At present there's no update to Documentation/vm/unevictable-lru.rst:
that always needs a different mindset, can follow later, please refer
to commit messages for now.

There are two half-related mm/thp patches at the end: enhancements
we've had for a long time, but more suited after the mlock changes.

01/13 mm/munlock: delete page_mlock() and all its works
02/13 mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE gupflags
03/13 mm/munlock: delete munlock_vma_pages_all(), allow oomreap
04/13 mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
05/13 mm/munlock: replace clear_page_mlock() by final clearance
06/13 mm/munlock: maintain page->mlock_count while unevictable
07/13 mm/munlock: mlock_pte_range() when mlocking or munlocking
08/13 mm/migrate: __unmap_and_move() push good newpage to LRU
09/13 mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()
10/13 mm/munlock: mlock_page() munlock_page() batch by pagevec
11/13 mm/munlock: page migration needs mlock pagevec drained
12/13 mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH)
13/13 mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP

 include/linux/mm.h        |    2 
 include/linux/mm_inline.h |   11 
 include/linux/mm_types.h  |   19 +
 include/linux/rmap.h      |   23 -
 kernel/events/uprobes.c   |    7 
 mm/gup.c                  |   43 --
 mm/huge_memory.c          |   55 ---
 mm/hugetlb.c              |    4 
 mm/internal.h             |   64 ++-
 mm/khugepaged.c           |   14 
 mm/ksm.c                  |   12 
 mm/madvise.c              |    5 
 mm/memcontrol.c           |    3 
 mm/memory.c               |   45 --
 mm/migrate.c              |   42 +-
 mm/mlock.c                |  634 +++++++++++++++-----------------------
 mm/mmap.c                 |   32 -
 mm/mmzone.c               |    7 
 mm/oom_kill.c             |    2 
 mm/rmap.c                 |  156 ++-------
 mm/swap.c                 |   89 ++---
 mm/userfaultfd.c          |   14 
 mm/vmscan.c               |    6 
 23 files changed, 510 insertions(+), 779 deletions(-)

Hugh

Comments

Matthew Wilcox Feb. 6, 2022, 9:38 p.m. UTC | #1
On Sun, Feb 06, 2022 at 01:27:41PM -0800, Hugh Dickins wrote:
> Here it is based on 5.17-rc2, applies also to -rc3, almost cleanly to
> mmotm 2022-02-03-21-58 (needs two easy fixups in mm/huge_memory.c); but
> likely to conflict (I hope not fundamentally) with several concurrent
> large patchsets.

Most of this patchset hasn't arrived here yet, but I would be
_delighted_ to rebase the folio conversion on top of this.  What
convoluted code it is!  I am so glad you've cleaned this up; I was
dreading doing the rest of the mlock file.
Hugh Dickins Feb. 7, 2022, 6:20 p.m. UTC | #2
On Sun, 6 Feb 2022, Matthew Wilcox wrote:
> On Sun, Feb 06, 2022 at 01:27:41PM -0800, Hugh Dickins wrote:
> > Here it is based on 5.17-rc2, applies also to -rc3, almost cleanly to
> > mmotm 2022-02-03-21-58 (needs two easy fixups in mm/huge_memory.c); but
> > likely to conflict (I hope not fundamentally) with several concurrent
> > large patchsets.
> 
> Most of this patchset hasn't arrived here yet, but I would be
> _delighted_ to rebase the folio conversion on top of this.  What
> convoluted code it is!  I am so glad you've cleaned this up; I was
> dreading doing the rest of the mlock file.

That's a very generous offer: thank you.  I'd been looking at it
the other way round, afraid that it would be getting in your way:
but now think you're right, that the cleanup there will help your
work.

I had found mm/mlock.c unexpectedly difficult to deal with when
doing shmem huge pages, and IIRC Kirill had the same experience.

Hugh
Michal Hocko Feb. 9, 2022, 3:35 p.m. UTC | #3
On Sun 06-02-22 13:27:41, Hugh Dickins wrote:
> I wondered whether to post this munlocking rework in
> https://lore.kernel.org/linux-mm/35c340a6-96f-28a0-2b7b-2f9fbddc01f@google.com/
> 
> There the discussion was OOM reaping, but the main reason for the rework
> has been catastrophic contention on i_mmap_rwsem when exiting from
> multiply mlocked files; and frustration with how contorted munlocking is.

yeah, I do not think that many people care about the problem of the
oom_reaper stumbling over vast portions of mlocked memory. So far I have
seen only LTP oom02 hitting this, and that one does mlockall and
intentionally hits OOM; I haven't seen any actual workload hitting
anything similar. But hey, if we get better and easier to maintain mlock
code, then the oom_reaper will surely be thankful.

> Here it is based on 5.17-rc2, applies also to -rc3, almost cleanly to
> mmotm 2022-02-03-21-58 (needs two easy fixups in mm/huge_memory.c); but
> likely to conflict (I hope not fundamentally) with several concurrent
> large patchsets.
> 
> tl;dr
>  mm/mlock.c  |  634 +++++++++++++++-----------------------
>  23 files changed, 510 insertions(+), 779 deletions(-)

This is really impressive!

> In my own opinion, this is ready to go in: that whatever its defects,
> it's a big practical improvement over what's presently there.  Others
> may differ; and it may be too much to arrive at a quick opinion on.
> My hope is that it will strike a chord with some who have laboured in
> this area in the past: I'll be short of time to do much more with it,
> so maybe someone else could take over if necessary.
> 
> At present there's no update to Documentation/vm/unevictable-lru.rst:
> that always needs a different mindset, can follow later, please refer
> to commit messages for now.

Yes, that would help for sure.

So far I have only managed to read through the series, trying to put
all the pieces together (so far I have given up on the THP part), and my
understanding is far from complete. But I have to say I like the general
approach and the overall simplification.

The only thing that is not entirely clear to me at the moment is why you
have chosen to ignore already mapped LOCKONFAULT pages. They will
eventually get sorted out during reclaim/migration, but this can
backfire if too many pages have been pre-faulted before the LOCKONFAULT
call. Maybe not an interesting case in the first place, but I am still
wondering why you have chosen to go that way.

I will be off next couple of days and plan to revisit this afterwards
(should time allow). Anyway thanks a lot Hugh!
Hugh Dickins Feb. 9, 2022, 4:21 p.m. UTC | #4
On Wed, 9 Feb 2022, Michal Hocko wrote:
> 
> So far I have only managed to read through the series, trying to put
> all the pieces together (so far I have given up on the THP part), and my
> understanding is far from complete. But I have to say I like the general
> approach and the overall simplification.

Many thanks for looking, Michal, and for all the positivity!

> 
> The only thing that is not entirely clear to me at the moment is why you
> have chosen to ignore already mapped LOCKONFAULT pages. They will
> eventually get sorted out during reclaim/migration, but this can
> backfire if too many pages have been pre-faulted before the LOCKONFAULT
> call. Maybe not an interesting case in the first place, but I am still
> wondering why you have chosen to go that way.

I'm puzzled: what makes you think I'm ignoring already mapped LOCKONFAULT
pages?  I'd consider that a bug.

It is the case, isn't it, that a VM_LOCKONFAULT area always has VM_LOCKED
set too?  If I've got that wrong, yes, I'll need to revisit conditions.

> 
> I will be off next couple of days and plan to revisit this afterwards
> (should time allow). Anyway thanks a lot Hugh!
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Feb. 9, 2022, 9:01 p.m. UTC | #5
On Wed 09-02-22 08:21:17, Hugh Dickins wrote:
> On Wed, 9 Feb 2022, Michal Hocko wrote:
[...]
> > The only thing that is not entirely clear to me at the moment is why you
> > have chosen to ignore already mapped LOCKONFAULT pages. They will
> > eventually get sorted out during reclaim/migration, but this can
> > backfire if too many pages have been pre-faulted before the LOCKONFAULT
> > call. Maybe not an interesting case in the first place, but I am still
> > wondering why you have chosen to go that way.
> 
> I'm puzzled: what makes you think I'm ignoring already mapped LOCKONFAULT
> pages?  I'd consider that a bug.

I've had the following path in mind:
__mm_populate
  populate_vma_page_range
    if (vma->vm_flags & VM_LOCKONFAULT)
        return nr_pages

which means that __get_user_pages is not called at all. This also means
that follow_page_mask is skipped. The page table walk used to mark
already mapped pages as mlocked, so unless I am missing some other path,
they would stay on their original LRU list and only get real mlock
protection when encountered by reclaim or migration.
 
> It is the case, isn't it, that a VM_LOCKONFAULT area always has VM_LOCKED
> set too?  If I've got that wrong, yes, I'll need to revisit conditions.

Yes, I didn't really mean that we would lose the mlock protection. We
would just do the lazy mlock also on pages which are already mapped.
This is certainly not a correctness issue - although stats might be off
- but it could pollute existing LRUs with mlocked pages rather easily.

As I've said, either I am still missing something or this might not even
be a big deal in real life. I was mostly curious about the choice to
exclude the page table walk for LOCKONFAULT.
Hugh Dickins Feb. 9, 2022, 10:59 p.m. UTC | #6
On Wed, 9 Feb 2022, Michal Hocko wrote:
> On Wed 09-02-22 08:21:17, Hugh Dickins wrote:
> > On Wed, 9 Feb 2022, Michal Hocko wrote:
> [...]
> > > The only thing that is not entirely clear to me at the moment is why you
> > > have chosen to ignore already mapped LOCKONFAULT pages. They will
> > > eventually get sorted out during the reclaim/migration but this can
> > > backfire if too many pages have been pre-faulted before LOCKONFAULT
> > > call. Maybe not an interesting case in the first place but I am still
> > > wondering why you have chosen that way.
> > 
> > I'm puzzled: what makes you think I'm ignoring already mapped LOCKONFAULT
> > pages?  I'd consider that a bug.
> 
> I've had the following path in mind:
> __mm_populate
>   populate_vma_page_range
>     if (vma->vm_flags & VM_LOCKONFAULT)
>         return nr_pages
> 
> which means that __get_user_pages is not called at all. This also means
> that follow_page_mask is skipped. The page table walk used to mark
> already mapped pages as mlocked, so unless I am missing some other path,
> they would stay on their original LRU list and only get real mlock
> protection when encountered by reclaim or migration.

That is so until 04/13: but then, whenever a page is faulted into a
VM_LOCKED area (except when skipping pte-of-compound) it goes through
mlock_page().  So it will then lock-on-fault.

Ah, but you're worrying about any previously-mapped pages when
VM_LOCKONFAULT is established: those ones get done by mlock_pte_range()
in 07/13, same as when VM_LOCKED is established - I checked again,
VM_LOCKONFAULT is not set without VM_LOCKED.

>  
> > It is the case, isn't it, that a VM_LOCKONFAULT area always has VM_LOCKED
> > set too?  If I've got that wrong, yes, I'll need to revisit conditions.
> 
> Yes, I didn't really mean that we would lose the mlock protection. We
> would just do the lazy mlock also on pages which are already mapped.
> This is certainly not a correctness issue - although stats might be off
> - but it could pollute existing LRUs with mlocked pages rather easily.

Even with the lazy mlock in reclaim, I'd still accuse myself of a bug
if I've got this wrong.

> 
> As I've said, either I am still missing something or this might not
> even be a big deal in real life. I was mostly curious about the choice
> to exclude the page table walk for LOCKONFAULT.

It surprised me too: but looking at how FOLL_POPULATE was handled in
faultin_page(), it appeared to me to become redundant, once mlocking
was counted at fault time (and in mlock_pte_range() - maybe that's
the page table walk you're looking for, not excluded but moved).

It is a big deal for me: please don't let me fool you, please do
continue to worry at it until you're convinced or you convince me.

Thanks,
Hugh
Michal Hocko Feb. 10, 2022, 7:49 a.m. UTC | #7
On Wed 09-02-22 14:59:19, Hugh Dickins wrote:
[...]
> Ah, but you're worrying about any previously-mapped pages when
> VM_LOCKONFAULT is established: those ones get done by mlock_pte_range()
> in 07/13, same as when VM_LOCKED is established - I checked again,
> VM_LOCKONFAULT is not set without VM_LOCKED.

Right you are. mlock_fixup->mlock_vma_pages_range is the missing piece.
This is called for both do_mlock and mlockall paths. So all good and
thanks for the clarification!

I will get back to this sometime next week.