mbox series

[RFC,0/3] support large folio for mlock

Message ID 20230707165221.4076590-1-fengwei.yin@intel.com (mailing list archive)
Headers show
Series support large folio for mlock | expand

Message

Yin Fengwei July 7, 2023, 4:52 p.m. UTC
Yu mentioned at [1] about the mlock() can't be applied to large folio.

I leant the related code and here is my understanding:
- For RLIMIT_MEMLOCK related, there is no problem. Becuase the
  RLIMIT_MEMLOCK statistics is not related underneath page. That means
  underneath page mlock or munlock doesn't impact the RLIMIT_MEMLOCK
  statistics collection which is always correct.

- For keeping the page in RAM, there is no problem either. At least,
  during try_to_unmap_one(), once detect the VMA has VM_LOCKED bit
  set in vm_flags, the folio will be kept whatever the folio is
  mlocked or not.

So the function of mlock for large folio works. But it's not optimized
because the page reclaim needs scan these large folio and may split
them.

This series identified the large folio for mlock to two types:
  - The large folio is in VM_LOCKED VMA range
  - The large folio cross VM_LOCKED VMA boundary

For the first type, we mlock large folio so page relcaim will skip it.
For the second type, we don't mlock large folio. It's allowed to be
picked by page reclaim and be split. So the pages not in VM_LOCKED VMA
range are allowed to be reclaimed/released.

patch1 introduce API to check whether large folio is in VMA range.
patch2 make page reclaim/mlock_vma_folio/munlock_vma_folio support
large folio mlock/munlock.
patch3 make mlock/munlock syscall support large folio.

testing done:
  - kernel selftest. No extra failure introduced


[1] https://lore.kernel.org/linux-mm/CAOUHufbtNPkdktjt_5qM45GegVO-rCFOMkSh0HQminQ12zsV8Q@mail.gmail.com/

Yin Fengwei (3):
  mm: add function folio_in_range()
  mm: handle large folio when large folio in VM_LOCKED VMA range
  mm: mlock: update mlock_pte_range to handle large folio

 mm/internal.h |  37 ++++++++++++++++--
 mm/mlock.c    | 103 ++++++++++++++++++++++++++++++++++++++++++++++----
 mm/rmap.c     |   3 +-
 3 files changed, 131 insertions(+), 12 deletions(-)

Comments

Matthew Wilcox July 7, 2023, 5:26 p.m. UTC | #1
On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote:
> This series identified the large folio for mlock to two types:
>   - The large folio is in VM_LOCKED VMA range
>   - The large folio cross VM_LOCKED VMA boundary

This is somewhere that I think our fixation on MUST USE PMD ENTRIES
has led us astray.  Today when the arguments to mlock() cross a folio
boundary, we split the PMD entry but leave the folio intact.  That means
that we continue to manage the folio as a single entry on the LRU list.
But userspace may have no idea that we're doing this.  It may have made
several calls to mmap() 256kB at once, they've all been coalesced into
a single VMA and khugepaged has come along behind its back and created
a 2MB THP.  Now userspace calls mlock() and instead of treating that as
a hint that oops, maybe we shouldn't've done that, we do our utmost to
preserve the 2MB folio.

I think this whole approach needs rethinking.  IMO, anonymous folios
should not cross VMA boundaries.  Tell me why I'm wrong.
David Hildenbrand July 7, 2023, 6:54 p.m. UTC | #2
On 07.07.23 19:26, Matthew Wilcox wrote:
> On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote:
>> This series identified the large folio for mlock to two types:
>>    - The large folio is in VM_LOCKED VMA range
>>    - The large folio cross VM_LOCKED VMA boundary
> 
> This is somewhere that I think our fixation on MUST USE PMD ENTRIES
> has led us astray.  Today when the arguments to mlock() cross a folio
> boundary, we split the PMD entry but leave the folio intact.  That means
> that we continue to manage the folio as a single entry on the LRU list.
> But userspace may have no idea that we're doing this.  It may have made
> several calls to mmap() 256kB at once, they've all been coalesced into
> a single VMA and khugepaged has come along behind its back and created
> a 2MB THP.  Now userspace calls mlock() and instead of treating that as
> a hint that oops, maybe we shouldn't've done that, we do our utmost to
> preserve the 2MB folio.
> 
> I think this whole approach needs rethinking.  IMO, anonymous folios
> should not cross VMA boundaries.  Tell me why I'm wrong.

I think we touched upon that a couple of times already, and the main 
issue is that while it sounds nice in theory, it's impossible in practice.

THP are supposed to be transparent, that is, we should not let arbitrary 
operations fail.

But nothing stops user space from

(a) mmap'ing a 2 MiB region
(b) GUP-pinning the whole range
(c) GUP-pinning the first half
(d) unpinning the whole range from (a)
(e) munmap'ing the second half


And that's just one out of many examples I can think of, not even 
considering temporary/speculative references that can prevent a split at 
random points in time -- especially when splitting a VMA.

Sure, any time we PTE-map a THP we might just say "let's put that on the 
deferred split queue" and cross fingers that we can eventually split it 
later. (I was recently thinking about that in the context of the 
mapcount ...)

It's all a big mess ...
Matthew Wilcox July 7, 2023, 7:06 p.m. UTC | #3
On Fri, Jul 07, 2023 at 08:54:33PM +0200, David Hildenbrand wrote:
> On 07.07.23 19:26, Matthew Wilcox wrote:
> > On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote:
> > > This series identified the large folio for mlock to two types:
> > >    - The large folio is in VM_LOCKED VMA range
> > >    - The large folio cross VM_LOCKED VMA boundary
> > 
> > This is somewhere that I think our fixation on MUST USE PMD ENTRIES
> > has led us astray.  Today when the arguments to mlock() cross a folio
> > boundary, we split the PMD entry but leave the folio intact.  That means
> > that we continue to manage the folio as a single entry on the LRU list.
> > But userspace may have no idea that we're doing this.  It may have made
> > several calls to mmap() 256kB at once, they've all been coalesced into
> > a single VMA and khugepaged has come along behind its back and created
> > a 2MB THP.  Now userspace calls mlock() and instead of treating that as
> > a hint that oops, maybe we shouldn't've done that, we do our utmost to
> > preserve the 2MB folio.
> > 
> > I think this whole approach needs rethinking.  IMO, anonymous folios
> > should not cross VMA boundaries.  Tell me why I'm wrong.
> 
> I think we touched upon that a couple of times already, and the main issue
> is that while it sounds nice in theory, it's impossible in practice.
> 
> THP are supposed to be transparent, that is, we should not let arbitrary
> operations fail.
> 
> But nothing stops user space from
> 
> (a) mmap'ing a 2 MiB region
> (b) GUP-pinning the whole range
> (c) GUP-pinning the first half
> (d) unpinning the whole range from (a)
> (e) munmap'ing the second half
> 
> 
> And that's just one out of many examples I can think of, not even
> considering temporary/speculative references that can prevent a split at
> random points in time -- especially when splitting a VMA.
> 
> Sure, any time we PTE-map a THP we might just say "let's put that on the
> deferred split queue" and cross fingers that we can eventually split it
> later. (I was recently thinking about that in the context of the mapcount
> ...)
> 
> It's all a big mess ...

Oh, I agree, there are always going to be circumstances where we realise
we've made a bad decision and can't (easily) undo it.  Unless we have a
per-page pincount, and I Would Rather Not Do That.  But we should _try_
to do that because it's the right model -- that's what I meant by "Tell
me why I'm wrong"; what scenarios do we have where a user temporarilly
mlocks (or mprotects or ...) a range of memory, but wants that memory
to be aged in the LRU exactly the same way as the adjacent memory that
wasn't mprotected?

GUP-pinning is different, and I don't think GUP-pinning should split
a folio.  That's a temporary use (not FOLL_LONGTERM), eg, we're doing
tcp zero-copy or it's the source/target of O_DIRECT.  That's not an
instruction that this memory is different from its neighbours.

Maybe we end up deciding to split folios on GUP-pin.  That would be
regrettable.
David Hildenbrand July 7, 2023, 7:15 p.m. UTC | #4
On 07.07.23 21:06, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 08:54:33PM +0200, David Hildenbrand wrote:
>> On 07.07.23 19:26, Matthew Wilcox wrote:
>>> On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote:
>>>> This series identified the large folio for mlock to two types:
>>>>     - The large folio is in VM_LOCKED VMA range
>>>>     - The large folio cross VM_LOCKED VMA boundary
>>>
>>> This is somewhere that I think our fixation on MUST USE PMD ENTRIES
>>> has led us astray.  Today when the arguments to mlock() cross a folio
>>> boundary, we split the PMD entry but leave the folio intact.  That means
>>> that we continue to manage the folio as a single entry on the LRU list.
>>> But userspace may have no idea that we're doing this.  It may have made
>>> several calls to mmap() 256kB at once, they've all been coalesced into
>>> a single VMA and khugepaged has come along behind its back and created
>>> a 2MB THP.  Now userspace calls mlock() and instead of treating that as
>>> a hint that oops, maybe we shouldn't've done that, we do our utmost to
>>> preserve the 2MB folio.
>>>
>>> I think this whole approach needs rethinking.  IMO, anonymous folios
>>> should not cross VMA boundaries.  Tell me why I'm wrong.
>>
>> I think we touched upon that a couple of times already, and the main issue
>> is that while it sounds nice in theory, it's impossible in practice.
>>
>> THP are supposed to be transparent, that is, we should not let arbitrary
>> operations fail.
>>
>> But nothing stops user space from
>>
>> (a) mmap'ing a 2 MiB region
>> (b) GUP-pinning the whole range
>> (c) GUP-pinning the first half
>> (d) unpinning the whole range from (a)
>> (e) munmap'ing the second half
>>
>>
>> And that's just one out of many examples I can think of, not even
>> considering temporary/speculative references that can prevent a split at
>> random points in time -- especially when splitting a VMA.
>>
>> Sure, any time we PTE-map a THP we might just say "let's put that on the
>> deferred split queue" and cross fingers that we can eventually split it
>> later. (I was recently thinking about that in the context of the mapcount
>> ...)
>>
>> It's all a big mess ...
> 
> Oh, I agree, there are always going to be circumstances where we realise
> we've made a bad decision and can't (easily) undo it.  Unless we have a
> per-page pincount, and I Would Rather Not Do That. 

I agree ...

But we should _try_
> to do that because it's the right model -- that's what I meant by "Tell

Try to have per-page pincounts? :/ or do you mean, try to split on VMA 
split? I hope the latter (although I'm not sure about performance) :)

> me why I'm wrong"; what scenarios do we have where a user temporarilly
> mlocks (or mprotects or ...) a range of memory, but wants that memory
> to be aged in the LRU exactly the same way as the adjacent memory that
> wasn't mprotected?

Let me throw in a "fun one".

Parent process has a 2 MiB range populated by a THP. fork() a child 
process. Child process mprotects half the VMA.

Should we split the (COW-shared) THP? Or should we COW/unshare in the 
child process (ugh!) during the VMA split.

It all makes my brain hurt.

> 
> GUP-pinning is different, and I don't think GUP-pinning should split
> a folio.  That's a temporary use (not FOLL_LONGTERM), eg, we're doing
> tcp zero-copy or it's the source/target of O_DIRECT.  That's not an
> instruction that this memory is different from its neighbours.
> 
> Maybe we end up deciding to split folios on GUP-pin.  That would be
> regrettable.

That would probably never be accepted, because the ones that heavily 
rely on THP (databases, VMs), typically also end up using a lot of 
features that use (long-term) page pinning. Don't get me started on 
io_uring with fixed buffers.
Matthew Wilcox July 7, 2023, 7:26 p.m. UTC | #5
On Fri, Jul 07, 2023 at 09:15:02PM +0200, David Hildenbrand wrote:
> > > Sure, any time we PTE-map a THP we might just say "let's put that on the
> > > deferred split queue" and cross fingers that we can eventually split it
> > > later. (I was recently thinking about that in the context of the mapcount
> > > ...)
> > > 
> > > It's all a big mess ...
> > 
> > Oh, I agree, there are always going to be circumstances where we realise
> > we've made a bad decision and can't (easily) undo it.  Unless we have a
> > per-page pincount, and I Would Rather Not Do That.
> 
> I agree ...
> 
> But we should _try_
> > to do that because it's the right model -- that's what I meant by "Tell
> 
> Try to have per-page pincounts? :/ or do you mean, try to split on VMA
> split? I hope the latter (although I'm not sure about performance) :)

Sorry, try to split a folio on VMA split.

> > me why I'm wrong"; what scenarios do we have where a user temporarilly
> > mlocks (or mprotects or ...) a range of memory, but wants that memory
> > to be aged in the LRU exactly the same way as the adjacent memory that
> > wasn't mprotected?
> 
> Let me throw in a "fun one".
> 
> Parent process has a 2 MiB range populated by a THP. fork() a child process.
> Child process mprotects half the VMA.
> 
> Should we split the (COW-shared) THP? Or should we COW/unshare in the child
> process (ugh!) during the VMA split.
> 
> It all makes my brain hurt.

OK, so this goes back to what I wrote earlier about attempting to choose
what size of folio to allocate on COW:

https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/

: the parent had already established
: an appropriate size folio to use for this VMA before calling fork().
: Whether it is the parent or the child causing the COW, it should probably
: inherit that choice and we should default to the same size folio that
: was already found.

You've come up with a usefully different case here.  I think we should
COW the folio at the point of the mprotect().  That will allow the parent
to become the sole owner of the folio once again and ensure that when
the parent modifies the folio, it _doesn't_ have to COW.

(This is also a rare case, surely)

> > 
> > GUP-pinning is different, and I don't think GUP-pinning should split
> > a folio.  That's a temporary use (not FOLL_LONGTERM), eg, we're doing
> > tcp zero-copy or it's the source/target of O_DIRECT.  That's not an
> > instruction that this memory is different from its neighbours.
> > 
> > Maybe we end up deciding to split folios on GUP-pin.  That would be
> > regrettable.
> 
> That would probably never be accepted, because the ones that heavily rely on
> THP (databases, VMs), typically also end up using a lot of features that use
> (long-term) page pinning. Don't get me started on io_uring with fixed
> buffers.

I do think that something like a long-term pin should split a folio.
Otherwise we're condemning the rest of the folio to be pinned along
with it.  Short term pins shouldn't split.
Yin Fengwei July 8, 2023, 3:31 a.m. UTC | #6
On 7/8/2023 1:26 AM, Matthew Wilcox wrote:
> On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote:
>> This series identified the large folio for mlock to two types:
>>   - The large folio is in VM_LOCKED VMA range
>>   - The large folio cross VM_LOCKED VMA boundary
> 
> This is somewhere that I think our fixation on MUST USE PMD ENTRIES
> has led us astray.  Today when the arguments to mlock() cross a folio
> boundary, we split the PMD entry but leave the folio intact.  That means
> that we continue to manage the folio as a single entry on the LRU list.
> But userspace may have no idea that we're doing this.  It may have made
> several calls to mmap() 256kB at once, they've all been coalesced into
> a single VMA and khugepaged has come along behind its back and created
> a 2MB THP.  Now userspace calls mlock() and instead of treating that as
> a hint that oops, maybe we shouldn't've done that, we do our utmost to
> preserve the 2MB folio.
> 
> I think this whole approach needs rethinking.  IMO, anonymous folios
> should not cross VMA boundaries.  Tell me why I'm wrong.

No. You are not wrong. :). That concept to keep anonymous folio not
cross VMA boundary is decent.


I tried to split the large folio when it cross VMA boundary for mlock().
As it's possible that the folio split fails, we always need to deal with
this case. I decided to postpone all large folio splitting to page
reclaim phase. The benefits we could get:
   - If the range is munlocked before page reclaim pick this folio,
     we don't need to split the folio.
   - for the system which don't have swap enabled, we don't need to
     split this kind folio.


Regards
Yin, Fengwei
Yin Fengwei July 8, 2023, 3:34 a.m. UTC | #7
On 7/8/2023 2:54 AM, David Hildenbrand wrote:
> On 07.07.23 19:26, Matthew Wilcox wrote:
>> On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote:
>>> This series identified the large folio for mlock to two types:
>>>    - The large folio is in VM_LOCKED VMA range
>>>    - The large folio cross VM_LOCKED VMA boundary
>>
>> This is somewhere that I think our fixation on MUST USE PMD ENTRIES
>> has led us astray.  Today when the arguments to mlock() cross a folio
>> boundary, we split the PMD entry but leave the folio intact.  That means
>> that we continue to manage the folio as a single entry on the LRU list.
>> But userspace may have no idea that we're doing this.  It may have made
>> several calls to mmap() 256kB at once, they've all been coalesced into
>> a single VMA and khugepaged has come along behind its back and created
>> a 2MB THP.  Now userspace calls mlock() and instead of treating that as
>> a hint that oops, maybe we shouldn't've done that, we do our utmost to
>> preserve the 2MB folio.
>>
>> I think this whole approach needs rethinking.  IMO, anonymous folios
>> should not cross VMA boundaries.  Tell me why I'm wrong.
> 
> I think we touched upon that a couple of times already, and the main issue is that while it sounds nice in theory, it's impossible in practice.
> 
> THP are supposed to be transparent, that is, we should not let arbitrary operations fail.
> 
> But nothing stops user space from
> 
> (a) mmap'ing a 2 MiB region
> (b) GUP-pinning the whole range
> (c) GUP-pinning the first half
> (d) unpinning the whole range from (a)
> (e) munmap'ing the second half
> 
> 
> And that's just one out of many examples I can think of, not even considering temporary/speculative references that can prevent a split at random points in time -- especially when splitting a VMA.
> 
Yes. The case that folio can't be split successfully is the only reason
I tried to avoid split the folio in mlock() syscall. I'd like to postpone
the split to page reclaim phase.


Regards
Yin, Fengwei

> Sure, any time we PTE-map a THP we might just say "let's put that on the deferred split queue" and cross fingers that we can eventually split it later. (I was recently thinking about that in the context of the mapcount ...)
> 
> It's all a big mess ...
>
Yin Fengwei July 8, 2023, 3:52 a.m. UTC | #8
On 7/8/2023 3:06 AM, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 08:54:33PM +0200, David Hildenbrand wrote:
>> On 07.07.23 19:26, Matthew Wilcox wrote:
>>> On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote:
>>>> This series identified the large folio for mlock to two types:
>>>>    - The large folio is in VM_LOCKED VMA range
>>>>    - The large folio cross VM_LOCKED VMA boundary
>>>
>>> This is somewhere that I think our fixation on MUST USE PMD ENTRIES
>>> has led us astray.  Today when the arguments to mlock() cross a folio
>>> boundary, we split the PMD entry but leave the folio intact.  That means
>>> that we continue to manage the folio as a single entry on the LRU list.
>>> But userspace may have no idea that we're doing this.  It may have made
>>> several calls to mmap() 256kB at once, they've all been coalesced into
>>> a single VMA and khugepaged has come along behind its back and created
>>> a 2MB THP.  Now userspace calls mlock() and instead of treating that as
>>> a hint that oops, maybe we shouldn't've done that, we do our utmost to
>>> preserve the 2MB folio.
>>>
>>> I think this whole approach needs rethinking.  IMO, anonymous folios
>>> should not cross VMA boundaries.  Tell me why I'm wrong.
>>
>> I think we touched upon that a couple of times already, and the main issue
>> is that while it sounds nice in theory, it's impossible in practice.
>>
>> THP are supposed to be transparent, that is, we should not let arbitrary
>> operations fail.
>>
>> But nothing stops user space from
>>
>> (a) mmap'ing a 2 MiB region
>> (b) GUP-pinning the whole range
>> (c) GUP-pinning the first half
>> (d) unpinning the whole range from (a)
>> (e) munmap'ing the second half
>>
>>
>> And that's just one out of many examples I can think of, not even
>> considering temporary/speculative references that can prevent a split at
>> random points in time -- especially when splitting a VMA.
>>
>> Sure, any time we PTE-map a THP we might just say "let's put that on the
>> deferred split queue" and cross fingers that we can eventually split it
>> later. (I was recently thinking about that in the context of the mapcount
>> ...)
>>
>> It's all a big mess ...
> 
> Oh, I agree, there are always going to be circumstances where we realise
> we've made a bad decision and can't (easily) undo it.  Unless we have a
> per-page pincount, and I Would Rather Not Do That.  But we should _try_
> to do that because it's the right model -- that's what I meant by "Tell
> me why I'm wrong"; what scenarios do we have where a user temporarilly
> mlocks (or mprotects or ...) a range of memory, but wants that memory
> to be aged in the LRU exactly the same way as the adjacent memory that
> wasn't mprotected?
for manpage of mlock():
       mlock(),  mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory
       from being paged to the swap area.

So my understanding is it's OK to let the memory mlocked to be aged with
the adjacent memory which is not mlocked. Just make sure they are not
paged out to swap.


One question for implementation detail:
  If the large folio cross VMA boundary can not be split, how do we
  deal with this case? Retry in syscall till it's split successfully?
  Or return error (and what ERRORS should we choose) to user space?


Regards
Yin, Fengwei

> 
> GUP-pinning is different, and I don't think GUP-pinning should split
> a folio.  That's a temporary use (not FOLL_LONGTERM), eg, we're doing
> tcp zero-copy or it's the source/target of O_DIRECT.  That's not an
> instruction that this memory is different from its neighbours.
> 
> Maybe we end up deciding to split folios on GUP-pin.  That would be
> regrettable.
Matthew Wilcox July 8, 2023, 4:02 a.m. UTC | #9
On Sat, Jul 08, 2023 at 11:52:23AM +0800, Yin, Fengwei wrote:
> > Oh, I agree, there are always going to be circumstances where we realise
> > we've made a bad decision and can't (easily) undo it.  Unless we have a
> > per-page pincount, and I Would Rather Not Do That.  But we should _try_
> > to do that because it's the right model -- that's what I meant by "Tell
> > me why I'm wrong"; what scenarios do we have where a user temporarilly
> > mlocks (or mprotects or ...) a range of memory, but wants that memory
> > to be aged in the LRU exactly the same way as the adjacent memory that
> > wasn't mprotected?
> for manpage of mlock():
>        mlock(),  mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory
>        from being paged to the swap area.
> 
> So my understanding is it's OK to let the memory mlocked to be aged with
> the adjacent memory which is not mlocked. Just make sure they are not
> paged out to swap.

Right, it doesn't break anything; it's just a similar problem to
internal fragmentation.  The pages of the folio which aren't mlocked
will also be locked in RAM and never paged out.

> One question for implementation detail:
>   If the large folio cross VMA boundary can not be split, how do we
>   deal with this case? Retry in syscall till it's split successfully?
>   Or return error (and what ERRORS should we choose) to user space?

I would be tempted to allocate memory & copy to the new mlocked VMA.
The old folio will go on the deferred_list and be split later, or its
valid parts will be written to swap and then it can be freed.
Yu Zhao July 8, 2023, 4:35 a.m. UTC | #10
On Fri, Jul 7, 2023 at 10:02 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Jul 08, 2023 at 11:52:23AM +0800, Yin, Fengwei wrote:
> > > Oh, I agree, there are always going to be circumstances where we realise
> > > we've made a bad decision and can't (easily) undo it.  Unless we have a
> > > per-page pincount, and I Would Rather Not Do That.  But we should _try_
> > > to do that because it's the right model -- that's what I meant by "Tell
> > > me why I'm wrong"; what scenarios do we have where a user temporarilly
> > > mlocks (or mprotects or ...) a range of memory, but wants that memory
> > > to be aged in the LRU exactly the same way as the adjacent memory that
> > > wasn't mprotected?
> > for manpage of mlock():
> >        mlock(),  mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory
> >        from being paged to the swap area.
> >
> > So my understanding is it's OK to let the memory mlocked to be aged with
> > the adjacent memory which is not mlocked. Just make sure they are not
> > paged out to swap.
>
> Right, it doesn't break anything; it's just a similar problem to
> internal fragmentation.  The pages of the folio which aren't mlocked
> will also be locked in RAM and never paged out.

I don't think this is the case: since partially locking a
non-pmd-mappable large folio is a nop, it remains on one of the
evictable LRUs. The rmap walk by folio_referenced() should already be
able to find the VMA and the PTEs mapping the unlocked portion. So the
page reclaim should be able to correctly age the unlocked portion even
though the folio contains a locked portion too. And when it tries to
reclaim the entire folio, it first tries to split it into a list of
base folios in shrink_folio_list(), and if that succeeds, it walks the
rmap of each base folio on that list to unmap (not age). Unmapping
doesn't have TTU_IGNORE_MLOCK, so it should correctly call
mlock_vma_folio() on the locked base folios and bail out. And finally
those locked base folios are put back to the unevictable list.

> > One question for implementation detail:
> >   If the large folio cross VMA boundary can not be split, how do we
> >   deal with this case? Retry in syscall till it's split successfully?
> >   Or return error (and what ERRORS should we choose) to user space?
>
> I would be tempted to allocate memory & copy to the new mlocked VMA.
> The old folio will go on the deferred_list and be split later, or its
> valid parts will be written to swap and then it can be freed.
Yin Fengwei July 8, 2023, 4:36 a.m. UTC | #11
On 7/8/2023 12:02 PM, Matthew Wilcox wrote:
> On Sat, Jul 08, 2023 at 11:52:23AM +0800, Yin, Fengwei wrote:
>>> Oh, I agree, there are always going to be circumstances where we realise
>>> we've made a bad decision and can't (easily) undo it.  Unless we have a
>>> per-page pincount, and I Would Rather Not Do That.  But we should _try_
>>> to do that because it's the right model -- that's what I meant by "Tell
>>> me why I'm wrong"; what scenarios do we have where a user temporarilly
>>> mlocks (or mprotects or ...) a range of memory, but wants that memory
>>> to be aged in the LRU exactly the same way as the adjacent memory that
>>> wasn't mprotected?
>> for manpage of mlock():
>>        mlock(),  mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory
>>        from being paged to the swap area.
>>
>> So my understanding is it's OK to let the memory mlocked to be aged with
>> the adjacent memory which is not mlocked. Just make sure they are not
>> paged out to swap.
> 
> Right, it doesn't break anything; it's just a similar problem to
> internal fragmentation.  The pages of the folio which aren't mlocked
> will also be locked in RAM and never paged out.
This patchset doesn't mlock the large folio cross VMA boundary. So page
reclaim can pick the large folio and split it. Then the pages out of
VM_LOCKED VMA range will be paged out. Or did I miss something here?

> 
>> One question for implementation detail:
>>   If the large folio cross VMA boundary can not be split, how do we
>>   deal with this case? Retry in syscall till it's split successfully?
>>   Or return error (and what ERRORS should we choose) to user space?
> 
> I would be tempted to allocate memory & copy to the new mlocked VMA.
> The old folio will go on the deferred_list and be split later, or its
> valid parts will be written to swap and then it can be freed.

OK. This can be common operations for any case that splitting VMA triggers
large folio splitting but splitting fails.


Regards
Yin, Fengwei
Yin Fengwei July 8, 2023, 4:40 a.m. UTC | #12
On 7/8/2023 12:35 PM, Yu Zhao wrote:
> On Fri, Jul 7, 2023 at 10:02 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Sat, Jul 08, 2023 at 11:52:23AM +0800, Yin, Fengwei wrote:
>>>> Oh, I agree, there are always going to be circumstances where we realise
>>>> we've made a bad decision and can't (easily) undo it.  Unless we have a
>>>> per-page pincount, and I Would Rather Not Do That.  But we should _try_
>>>> to do that because it's the right model -- that's what I meant by "Tell
>>>> me why I'm wrong"; what scenarios do we have where a user temporarilly
>>>> mlocks (or mprotects or ...) a range of memory, but wants that memory
>>>> to be aged in the LRU exactly the same way as the adjacent memory that
>>>> wasn't mprotected?
>>> for manpage of mlock():
>>>        mlock(),  mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory
>>>        from being paged to the swap area.
>>>
>>> So my understanding is it's OK to let the memory mlocked to be aged with
>>> the adjacent memory which is not mlocked. Just make sure they are not
>>> paged out to swap.
>>
>> Right, it doesn't break anything; it's just a similar problem to
>> internal fragmentation.  The pages of the folio which aren't mlocked
>> will also be locked in RAM and never paged out.
> 
> I don't think this is the case: since partially locking a
> non-pmd-mappable large folio is a nop, it remains on one of the
> evictable LRUs. The rmap walk by folio_referenced() should already be
> able to find the VMA and the PTEs mapping the unlocked portion. So the
> page reclaim should be able to correctly age the unlocked portion even
> though the folio contains a locked portion too. And when it tries to
> reclaim the entire folio, it first tries to split it into a list of
> base folios in shrink_folio_list(), and if that succeeds, it walks the
> rmap of each base folio on that list to unmap (not age). Unmapping
> doesn't have TTU_IGNORE_MLOCK, so it should correctly call
> mlock_vma_folio() on the locked base folios and bail out. And finally
> those locked base folios are put back to the unevictable list.
Yes. This is exact my understanding to this too. It's also why this
patchset keep large folio cross VMA boundary munlocked.


Regards
Yin, Fengwei

> 
>>> One question for implementation detail:
>>>   If the large folio cross VMA boundary can not be split, how do we
>>>   deal with this case? Retry in syscall till it's split successfully?
>>>   Or return error (and what ERRORS should we choose) to user space?
>>
>> I would be tempted to allocate memory & copy to the new mlocked VMA.
>> The old folio will go on the deferred_list and be split later, or its
>> valid parts will be written to swap and then it can be freed.
Yu Zhao July 8, 2023, 4:45 a.m. UTC | #13
On Fri, Jul 7, 2023 at 10:52 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
> Yu mentioned at [1] about the mlock() can't be applied to large folio.
>
> I leant the related code and here is my understanding:
> - For RLIMIT_MEMLOCK related, there is no problem. Becuase the
>   RLIMIT_MEMLOCK statistics is not related underneath page. That means
>   underneath page mlock or munlock doesn't impact the RLIMIT_MEMLOCK
>   statistics collection which is always correct.
>
> - For keeping the page in RAM, there is no problem either. At least,
>   during try_to_unmap_one(), once detect the VMA has VM_LOCKED bit
>   set in vm_flags, the folio will be kept whatever the folio is
>   mlocked or not.
>
> So the function of mlock for large folio works. But it's not optimized
> because the page reclaim needs scan these large folio and may split
> them.
>
> This series identified the large folio for mlock to two types:
>   - The large folio is in VM_LOCKED VMA range
>   - The large folio cross VM_LOCKED VMA boundary
>
> For the first type, we mlock large folio so page relcaim will skip it.
> For the second type, we don't mlock large folio. It's allowed to be
> picked by page reclaim and be split. So the pages not in VM_LOCKED VMA
> range are allowed to be reclaimed/released.

This is a sound design, which is also what I have in mind. I see the
rationales are being spelled out in this thread, and hopefully
everyone can be convinced.

> patch1 introduce API to check whether large folio is in VMA range.
> patch2 make page reclaim/mlock_vma_folio/munlock_vma_folio support
> large folio mlock/munlock.
> patch3 make mlock/munlock syscall support large folio.

Could you tidy up the last patch a little bit? E.g., Saying "mlock the
4K folio" is obviously not the best idea.

And if it's possible, make the loop just look like before, i.e.,

  if (!can_mlock_entire_folio())
    continue;
  if (vma->vm_flags & VM_LOCKED)
    mlock_folio_range();
  else
    munlock_folio_range();

Thanks.
Yin Fengwei July 8, 2023, 5:01 a.m. UTC | #14
On 7/8/2023 12:45 PM, Yu Zhao wrote:
> On Fri, Jul 7, 2023 at 10:52 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>
>> Yu mentioned at [1] about the mlock() can't be applied to large folio.
>>
>> I leant the related code and here is my understanding:
>> - For RLIMIT_MEMLOCK related, there is no problem. Becuase the
>>   RLIMIT_MEMLOCK statistics is not related underneath page. That means
>>   underneath page mlock or munlock doesn't impact the RLIMIT_MEMLOCK
>>   statistics collection which is always correct.
>>
>> - For keeping the page in RAM, there is no problem either. At least,
>>   during try_to_unmap_one(), once detect the VMA has VM_LOCKED bit
>>   set in vm_flags, the folio will be kept whatever the folio is
>>   mlocked or not.
>>
>> So the function of mlock for large folio works. But it's not optimized
>> because the page reclaim needs scan these large folio and may split
>> them.
>>
>> This series identified the large folio for mlock to two types:
>>   - The large folio is in VM_LOCKED VMA range
>>   - The large folio cross VM_LOCKED VMA boundary
>>
>> For the first type, we mlock large folio so page relcaim will skip it.
>> For the second type, we don't mlock large folio. It's allowed to be
>> picked by page reclaim and be split. So the pages not in VM_LOCKED VMA
>> range are allowed to be reclaimed/released.
> 
> This is a sound design, which is also what I have in mind. I see the
> rationales are being spelled out in this thread, and hopefully
> everyone can be convinced.
> 
>> patch1 introduce API to check whether large folio is in VMA range.
>> patch2 make page reclaim/mlock_vma_folio/munlock_vma_folio support
>> large folio mlock/munlock.
>> patch3 make mlock/munlock syscall support large folio.
> 
> Could you tidy up the last patch a little bit? E.g., Saying "mlock the
> 4K folio" is obviously not the best idea.
> 
> And if it's possible, make the loop just look like before, i.e.,
> 
>   if (!can_mlock_entire_folio())
>     continue;
>   if (vma->vm_flags & VM_LOCKED)
>     mlock_folio_range();
>   else
>     munlock_folio_range();
This can make large folio mlocked() even user space call munlock()
to the range. Considering following case:
  1. mlock() 64K range and underneath 64K large folio is mlocked().
  2. mprotect the first 32K range to different prot and triggers
     VMA split.
  3. munlock() 64K range. As 64K large folio doesn't in these two
     new VMAs range, it will not be munlocked() and only can be
     reclaimed after it's unmapped from two VMAs instead of after
     the range is munlocked().


Regards
Yin, Fengwei

> 
> Thanks.
Yu Zhao July 8, 2023, 5:06 a.m. UTC | #15
On Fri, Jul 7, 2023 at 11:01 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 7/8/2023 12:45 PM, Yu Zhao wrote:
> > On Fri, Jul 7, 2023 at 10:52 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>
> >> Yu mentioned at [1] about the mlock() can't be applied to large folio.
> >>
> >> I leant the related code and here is my understanding:
> >> - For RLIMIT_MEMLOCK related, there is no problem. Becuase the
> >>   RLIMIT_MEMLOCK statistics is not related underneath page. That means
> >>   underneath page mlock or munlock doesn't impact the RLIMIT_MEMLOCK
> >>   statistics collection which is always correct.
> >>
> >> - For keeping the page in RAM, there is no problem either. At least,
> >>   during try_to_unmap_one(), once detect the VMA has VM_LOCKED bit
> >>   set in vm_flags, the folio will be kept whatever the folio is
> >>   mlocked or not.
> >>
> >> So the function of mlock for large folio works. But it's not optimized
> >> because the page reclaim needs scan these large folio and may split
> >> them.
> >>
> >> This series identified the large folio for mlock to two types:
> >>   - The large folio is in VM_LOCKED VMA range
> >>   - The large folio cross VM_LOCKED VMA boundary
> >>
> >> For the first type, we mlock large folio so page relcaim will skip it.
> >> For the second type, we don't mlock large folio. It's allowed to be
> >> picked by page reclaim and be split. So the pages not in VM_LOCKED VMA
> >> range are allowed to be reclaimed/released.
> >
> > This is a sound design, which is also what I have in mind. I see the
> > rationales are being spelled out in this thread, and hopefully
> > everyone can be convinced.
> >
> >> patch1 introduce API to check whether large folio is in VMA range.
> >> patch2 make page reclaim/mlock_vma_folio/munlock_vma_folio support
> >> large folio mlock/munlock.
> >> patch3 make mlock/munlock syscall support large folio.
> >
> > Could you tidy up the last patch a little bit? E.g., Saying "mlock the
> > 4K folio" is obviously not the best idea.
> >
> > And if it's possible, make the loop just look like before, i.e.,
> >
> >   if (!can_mlock_entire_folio())
> >     continue;
> >   if (vma->vm_flags & VM_LOCKED)
> >     mlock_folio_range();
> >   else
> >     munlock_folio_range();
> This can make large folio mlocked() even user space call munlock()
> to the range. Considering following case:
>   1. mlock() 64K range and underneath 64K large folio is mlocked().
>   2. mprotect the first 32K range to different prot and triggers
>      VMA split.
>   3. munlock() 64K range. As 64K large folio doesn't in these two
>      new VMAs range, it will not be munlocked() and only can be
>      reclaimed after it's unmapped from two VMAs instead of after
>      the range is munlocked().

I understand. I'm asking to factor the code, not to change the logic.
Yin Fengwei July 8, 2023, 5:35 a.m. UTC | #16
On 7/8/2023 1:06 PM, Yu Zhao wrote:
> On Fri, Jul 7, 2023 at 11:01 PM Yin, Fengwei <fengwei.yin@intel.com> wrote:
>>
>>
>>
>> On 7/8/2023 12:45 PM, Yu Zhao wrote:
>>> On Fri, Jul 7, 2023 at 10:52 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>>
>>>> Yu mentioned at [1] about the mlock() can't be applied to large folio.
>>>>
>>>> I leant the related code and here is my understanding:
>>>> - For RLIMIT_MEMLOCK related, there is no problem. Becuase the
>>>>   RLIMIT_MEMLOCK statistics is not related underneath page. That means
>>>>   underneath page mlock or munlock doesn't impact the RLIMIT_MEMLOCK
>>>>   statistics collection which is always correct.
>>>>
>>>> - For keeping the page in RAM, there is no problem either. At least,
>>>>   during try_to_unmap_one(), once detect the VMA has VM_LOCKED bit
>>>>   set in vm_flags, the folio will be kept whatever the folio is
>>>>   mlocked or not.
>>>>
>>>> So the function of mlock for large folio works. But it's not optimized
>>>> because the page reclaim needs scan these large folio and may split
>>>> them.
>>>>
>>>> This series identified the large folio for mlock to two types:
>>>>   - The large folio is in VM_LOCKED VMA range
>>>>   - The large folio cross VM_LOCKED VMA boundary
>>>>
>>>> For the first type, we mlock large folio so page relcaim will skip it.
>>>> For the second type, we don't mlock large folio. It's allowed to be
>>>> picked by page reclaim and be split. So the pages not in VM_LOCKED VMA
>>>> range are allowed to be reclaimed/released.
>>>
>>> This is a sound design, which is also what I have in mind. I see the
>>> rationales are being spelled out in this thread, and hopefully
>>> everyone can be convinced.
>>>
>>>> patch1 introduce API to check whether large folio is in VMA range.
>>>> patch2 make page reclaim/mlock_vma_folio/munlock_vma_folio support
>>>> large folio mlock/munlock.
>>>> patch3 make mlock/munlock syscall support large folio.
>>>
>>> Could you tidy up the last patch a little bit? E.g., Saying "mlock the
>>> 4K folio" is obviously not the best idea.
>>>
>>> And if it's possible, make the loop just look like before, i.e.,
>>>
>>>   if (!can_mlock_entire_folio())
>>>     continue;
>>>   if (vma->vm_flags & VM_LOCKED)
>>>     mlock_folio_range();
>>>   else
>>>     munlock_folio_range();
>> This can make large folio mlocked() even user space call munlock()
>> to the range. Considering following case:
>>   1. mlock() 64K range and underneath 64K large folio is mlocked().
>>   2. mprotect the first 32K range to different prot and triggers
>>      VMA split.
>>   3. munlock() 64K range. As 64K large folio doesn't in these two
>>      new VMAs range, it will not be munlocked() and only can be
>>      reclaimed after it's unmapped from two VMAs instead of after
>>      the range is munlocked().
> 
> I understand. I'm asking to factor the code, not to change the logic.
Oh. Sorry. I miss-understood the code piece you showed. Will address
this in coming version. Thanks.


Regards
Yin, Fengwei
Yin Fengwei July 9, 2023, 1:25 p.m. UTC | #17
On 7/8/2023 12:02 PM, Matthew Wilcox wrote:
> I would be tempted to allocate memory & copy to the new mlocked VMA.
> The old folio will go on the deferred_list and be split later, or its
> valid parts will be written to swap and then it can be freed.
If the large folio splitting failure is because of GUP pages, can we
do copy here?

Let's say, if the GUP page is target of DMA operation and DMA operation
is ongoing. We allocated a new page and copy GUP page content to the
new page, the data in the new page can be corrupted.


Regards
Yin, Fengwei
David Hildenbrand July 10, 2023, 9:32 a.m. UTC | #18
On 09.07.23 15:25, Yin, Fengwei wrote:
> 
> 
> On 7/8/2023 12:02 PM, Matthew Wilcox wrote:
>> I would be tempted to allocate memory & copy to the new mlocked VMA.
>> The old folio will go on the deferred_list and be split later, or its
>> valid parts will be written to swap and then it can be freed.
> If the large folio splitting failure is because of GUP pages, can we
> do copy here?
> 
> Let's say, if the GUP page is target of DMA operation and DMA operation
> is ongoing. We allocated a new page and copy GUP page content to the
> new page, the data in the new page can be corrupted.

No, we may only replace anon pages that are flagged as maybe shared 
(!PageAnonExclusive). We must not replace pages that are exclusive 
(PageAnonExclusive) unless we first try marking them maybe shared. 
Clearing will fail if the page maybe pinned.

page_try_share_anon_rmap() implements the clearing logic, taking care of 
synchronizing against concurrent GUP-fast.

There are some additional nasty details regarding O_DIRECT. But once it 
completely switched from using FOLL_GET to properly using FOLL_PIN (a 
lot of that conversion already happened IIRC), we're fine in that regard.
Yin Fengwei July 10, 2023, 9:43 a.m. UTC | #19
Hi David,

On 7/10/2023 5:32 PM, David Hildenbrand wrote:
> On 09.07.23 15:25, Yin, Fengwei wrote:
>>
>>
>> On 7/8/2023 12:02 PM, Matthew Wilcox wrote:
>>> I would be tempted to allocate memory & copy to the new mlocked VMA.
>>> The old folio will go on the deferred_list and be split later, or its
>>> valid parts will be written to swap and then it can be freed.
>> If the large folio splitting failure is because of GUP pages, can we
>> do copy here?
>>
>> Let's say, if the GUP page is target of DMA operation and DMA operation
>> is ongoing. We allocated a new page and copy GUP page content to the
>> new page, the data in the new page can be corrupted.
> 
> No, we may only replace anon pages that are flagged as maybe shared (!PageAnonExclusive). We must not replace pages that are exclusive (PageAnonExclusive) unless we first try marking them maybe shared. Clearing will fail if the page maybe pinned.
Thanks a lot for clarification.

So my understanding is that if large folio splitting fails, it's not always
true that we can allocate new folios, copy original large folio content to
new folios, remove original large folio from VMA and map the new folios to
VMA (like it's only true if original large folio is marked as maybe shared).


Regards
Yin, Fengwei

> 
> page_try_share_anon_rmap() implements the clearing logic, taking care of synchronizing against concurrent GUP-fast.
> 
> There are some additional nasty details regarding O_DIRECT. But once it completely switched from using FOLL_GET to properly using FOLL_PIN (a lot of that conversion already happened IIRC), we're fine in that regard.
>
David Hildenbrand July 10, 2023, 9:57 a.m. UTC | #20
On 10.07.23 11:43, Yin, Fengwei wrote:
> Hi David,
> 
> On 7/10/2023 5:32 PM, David Hildenbrand wrote:
>> On 09.07.23 15:25, Yin, Fengwei wrote:
>>>
>>>
>>> On 7/8/2023 12:02 PM, Matthew Wilcox wrote:
>>>> I would be tempted to allocate memory & copy to the new mlocked VMA.
>>>> The old folio will go on the deferred_list and be split later, or its
>>>> valid parts will be written to swap and then it can be freed.
>>> If the large folio splitting failure is because of GUP pages, can we
>>> do copy here?
>>>
>>> Let's say, if the GUP page is target of DMA operation and DMA operation
>>> is ongoing. We allocated a new page and copy GUP page content to the
>>> new page, the data in the new page can be corrupted.
>>
>> No, we may only replace anon pages that are flagged as maybe shared (!PageAnonExclusive). We must not replace pages that are exclusive (PageAnonExclusive) unless we first try marking them maybe shared. Clearing will fail if the page maybe pinned.
> Thanks a lot for clarification.
> 
> So my understanding is that if large folio splitting fails, it's not always
> true that we can allocate new folios, copy original large folio content to
> new folios, remove original large folio from VMA and map the new folios to
> VMA (like it's only true if original large folio is marked as maybe shared).
> 

While it might work in many cases, there are some corner cases where it 
won't work.

So to summarize

(1) THP are transparent and should not result in arbitrary syscall
     failures.
(2) Splitting a THP might fail at random points in time either due to
     GUP pins or due to speculative page references (including
     speculative GUP pins).
(3) Replacing an exclusive anon page that maybe pinned will result in
     memory corruptions.

So we can try to split any THP that crosses VMA borders on VMA 
modifications (split due to munmap, mremap, madvise, mprotect, mlock, 
...), it's not guaranteed to work due to (1). And we can try to replace 
pages such pages, but it's not guaranteed to be allowed due to (3).

And as it's all transparent, we cannot fail (1).

For the other cases that Willy and I discussed (split on VMA 
modifications after fork()), we can at least always replace the anon page.

<details>

What always works, is putting the THP on the deferred split queue to see 
if we can split it later. The deferred split queue is a bit suboptimal 
right now, because it requires the (sub)page mapcounts to detect whether 
the folio is partially mapped vs. fully mapped. If we want to get rid of 
that, we have to come up with something reasonable.

I was wondering if we could have a an optimized deferred split queue, 
that only conditionally splits: do an rmap walk and detect if (a) each 
page of the folio is still mapped (b) the folio does not cross a VMA. If 
both are met, one could skip the deferred split. But that needs a bit of 
thought -- but we're already doing an rmap walk when splitting, so 
scanning which parts are actually mapped does not sound too weird.

</details>
Yin Fengwei July 10, 2023, 10:19 a.m. UTC | #21
On 7/10/2023 5:57 PM, David Hildenbrand wrote:
> On 10.07.23 11:43, Yin, Fengwei wrote:
>> Hi David,
>>
>> On 7/10/2023 5:32 PM, David Hildenbrand wrote:
>>> On 09.07.23 15:25, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 7/8/2023 12:02 PM, Matthew Wilcox wrote:
>>>>> I would be tempted to allocate memory & copy to the new mlocked VMA.
>>>>> The old folio will go on the deferred_list and be split later, or its
>>>>> valid parts will be written to swap and then it can be freed.
>>>> If the large folio splitting failure is because of GUP pages, can we
>>>> do copy here?
>>>>
>>>> Let's say, if the GUP page is target of DMA operation and DMA operation
>>>> is ongoing. We allocated a new page and copy GUP page content to the
>>>> new page, the data in the new page can be corrupted.
>>>
>>> No, we may only replace anon pages that are flagged as maybe shared (!PageAnonExclusive). We must not replace pages that are exclusive (PageAnonExclusive) unless we first try marking them maybe shared. Clearing will fail if the page maybe pinned.
>> Thanks a lot for clarification.
>>
>> So my understanding is that if large folio splitting fails, it's not always
>> true that we can allocate new folios, copy original large folio content to
>> new folios, remove original large folio from VMA and map the new folios to
>> VMA (like it's only true if original large folio is marked as maybe shared).
>>
> 
> While it might work in many cases, there are some corner cases where it won't work.
> 
> So to summarize
> 
> (1) THP are transparent and should not result in arbitrary syscall
>     failures.
> (2) Splitting a THP might fail at random points in time either due to
>     GUP pins or due to speculative page references (including
>     speculative GUP pins).
> (3) Replacing an exclusive anon page that maybe pinned will result in
>     memory corruptions.
> 
> So we can try to split any THP that crosses VMA borders on VMA modifications (split due to munmap, mremap, madvise, mprotect, mlock, ...), it's not guaranteed to work due to (1). And we can try to replace pages such pages, but it's not guaranteed to be allowed due to (3).
> 
> And as it's all transparent, we cannot fail (1).
Very clear to me now.

> 
> For the other cases that Willy and I discussed (split on VMA modifications after fork()), we can at least always replace the anon page.
> 
> <details>
> 
> What always works, is putting the THP on the deferred split queue to see if we can split it later. The deferred split queue is a bit suboptimal right now, because it requires the (sub)page mapcounts to detect whether the folio is partially mapped vs. fully mapped. If we want to get rid of that, we have to come up with something reasonable.
> 
> I was wondering if we could have a an optimized deferred split queue, that only conditionally splits: do an rmap walk and detect if (a) each page of the folio is still mapped (b) the folio does not cross a VMA. If both are met, one could skip the deferred split. But that needs a bit of thought -- but we're already doing an rmap walk when splitting, so scanning which parts are actually mapped does not sound too weird.
> 
> </details>
> 
Thanks a lot for extra information which help me to know more background.
Really appreciate it.


Regards
Yin, Fengwei
Ryan Roberts July 10, 2023, 10:36 a.m. UTC | #22
On 07/07/2023 20:26, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 09:15:02PM +0200, David Hildenbrand wrote:
>>>> Sure, any time we PTE-map a THP we might just say "let's put that on the
>>>> deferred split queue" and cross fingers that we can eventually split it
>>>> later. (I was recently thinking about that in the context of the mapcount
>>>> ...)
>>>>
>>>> It's all a big mess ...
>>>
>>> Oh, I agree, there are always going to be circumstances where we realise
>>> we've made a bad decision and can't (easily) undo it.  Unless we have a
>>> per-page pincount, and I Would Rather Not Do That.
>>
>> I agree ...
>>
>> But we should _try_
>>> to do that because it's the right model -- that's what I meant by "Tell
>>
>> Try to have per-page pincounts? :/ or do you mean, try to split on VMA
>> split? I hope the latter (although I'm not sure about performance) :)
> 
> Sorry, try to split a folio on VMA split.
> 
>>> me why I'm wrong"; what scenarios do we have where a user temporarilly
>>> mlocks (or mprotects or ...) a range of memory, but wants that memory
>>> to be aged in the LRU exactly the same way as the adjacent memory that
>>> wasn't mprotected?
>>
>> Let me throw in a "fun one".
>>
>> Parent process has a 2 MiB range populated by a THP. fork() a child process.
>> Child process mprotects half the VMA.
>>
>> Should we split the (COW-shared) THP? Or should we COW/unshare in the child
>> process (ugh!) during the VMA split.
>>
>> It all makes my brain hurt.
> 
> OK, so this goes back to what I wrote earlier about attempting to choose
> what size of folio to allocate on COW:
> 
> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
> 
> : the parent had already established
> : an appropriate size folio to use for this VMA before calling fork().
> : Whether it is the parent or the child causing the COW, it should probably
> : inherit that choice and we should default to the same size folio that
> : was already found.

FWIW, I had patches in my original RFC that aimed to follow this policy for
large anon folios [1] & [2], and intend to follow up with a modified version of
these patches once we have an initial submission.

[1] https://lore.kernel.org/linux-mm/20230414130303.2345383-11-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20230414130303.2345383-15-ryan.roberts@arm.com/