Message ID: 20230707165221.4076590-1-fengwei.yin@intel.com (mailing list archive)
Series: support large folio for mlock
On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote: > This series identified the large folio for mlock to two types: > - The large folio is in VM_LOCKED VMA range > - The large folio cross VM_LOCKED VMA boundary This is somewhere that I think our fixation on MUST USE PMD ENTRIES has led us astray. Today when the arguments to mlock() cross a folio boundary, we split the PMD entry but leave the folio intact. That means that we continue to manage the folio as a single entry on the LRU list. But userspace may have no idea that we're doing this. It may have made several calls to mmap() 256kB at once, they've all been coalesced into a single VMA and khugepaged has come along behind its back and created a 2MB THP. Now userspace calls mlock() and instead of treating that as a hint that oops, maybe we shouldn't've done that, we do our utmost to preserve the 2MB folio. I think this whole approach needs rethinking. IMO, anonymous folios should not cross VMA boundaries. Tell me why I'm wrong.
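For concreteness, the scenario above can be reproduced from user space roughly as follows. This is only a sketch: a single mmap() stands in for the coalesced 256kB mappings, and whether khugepaged actually collapses the range into a 2MB THP depends on the system's THP configuration.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ_256K (256 * 1024UL)
#define SZ_2M   (2 * 1024 * 1024UL)

int main(void)
{
        /* One 2MB anonymous VMA, standing in for several adjacent 256kB
         * mmap() calls that the kernel coalesced into a single VMA. */
        char *p = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;

        madvise(p, SZ_2M, MADV_HUGEPAGE);       /* invite khugepaged */
        for (unsigned long off = 0; off < SZ_2M; off += SZ_256K)
                memset(p + off, 0, SZ_256K);    /* fault it in piecewise */

        /* ... later, khugepaged may have collapsed the range into a THP ... */

        /* mlock() an argument range that crosses the folio boundary: today
         * the kernel splits the PMD mapping but keeps the 2MB folio intact
         * as a single entry on the LRU. */
        if (mlock(p, SZ_256K))
                perror("mlock");

        munlock(p, SZ_256K);
        munmap(p, SZ_2M);
        return 0;
}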
On 07.07.23 19:26, Matthew Wilcox wrote: > On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote: >> This series identified the large folio for mlock to two types: >> - The large folio is in VM_LOCKED VMA range >> - The large folio cross VM_LOCKED VMA boundary > > This is somewhere that I think our fixation on MUST USE PMD ENTRIES > has led us astray. Today when the arguments to mlock() cross a folio > boundary, we split the PMD entry but leave the folio intact. That means > that we continue to manage the folio as a single entry on the LRU list. > But userspace may have no idea that we're doing this. It may have made > several calls to mmap() 256kB at once, they've all been coalesced into > a single VMA and khugepaged has come along behind its back and created > a 2MB THP. Now userspace calls mlock() and instead of treating that as > a hint that oops, maybe we shouldn't've done that, we do our utmost to > preserve the 2MB folio. > > I think this whole approach needs rethinking. IMO, anonymous folios > should not cross VMA boundaries. Tell me why I'm wrong. I think we touched upon that a couple of times already, and the main issue is that while it sounds nice in theory, it's impossible in practice. THP are supposed to be transparent, that is, we should not let arbitrary operations fail. But nothing stops user space from (a) mmap'ing a 2 MiB region (b) GUP-pinning the whole range (c) GUP-pinning the first half (d) unpinning the whole range from (a) (e) munmap'ing the second half And that's just one out of many examples I can think of, not even considering temporary/speculative references that can prevent a split at random points in time -- especially when splitting a VMA. Sure, any time we PTE-map a THP we might just say "let's put that on the deferred split queue" and cross fingers that we can eventually split it later. (I was recently thinking about that in the context of the mapcount ...) It's all a big mess ...
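To make steps (a)-(e) above concrete, here is one way user space can produce that pinning pattern. io_uring fixed buffers stand in for the GUP pins purely as an example (any other long-term pinning user would do), whether the region is really backed by a single THP depends on system configuration, and the snippet assumes liburing (link with -luring).

#include <liburing.h>
#include <string.h>
#include <sys/mman.h>

#define SZ_2M (2 * 1024 * 1024UL)

int main(void)
{
        struct io_uring ring_a, ring_b;
        struct iovec whole, first_half;

        /* (a) mmap a 2 MiB region, hopefully backed by a THP */
        char *p = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;
        madvise(p, SZ_2M, MADV_HUGEPAGE);
        memset(p, 0, SZ_2M);

        if (io_uring_queue_init(8, &ring_a, 0) || io_uring_queue_init(8, &ring_b, 0))
                return 1;

        /* (b) GUP-pin the whole range (registered buffers take long-term pins) */
        whole = (struct iovec){ .iov_base = p, .iov_len = SZ_2M };
        io_uring_register_buffers(&ring_a, &whole, 1);

        /* (c) GUP-pin the first half */
        first_half = (struct iovec){ .iov_base = p, .iov_len = SZ_2M / 2 };
        io_uring_register_buffers(&ring_b, &first_half, 1);

        /* (d) drop the whole-range pin taken in (b) */
        io_uring_unregister_buffers(&ring_a);

        /* (e) munmap the second half: the VMA shrinks, but the THP cannot be
         * split while its first half is still pinned via ring_b */
        munmap(p + SZ_2M / 2, SZ_2M / 2);

        io_uring_unregister_buffers(&ring_b);
        io_uring_queue_exit(&ring_b);
        io_uring_queue_exit(&ring_a);
        munmap(p, SZ_2M / 2);
        return 0;
}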
On Fri, Jul 07, 2023 at 08:54:33PM +0200, David Hildenbrand wrote: > On 07.07.23 19:26, Matthew Wilcox wrote: > > On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote: > > > This series identified the large folio for mlock to two types: > > > - The large folio is in VM_LOCKED VMA range > > > - The large folio cross VM_LOCKED VMA boundary > > > > This is somewhere that I think our fixation on MUST USE PMD ENTRIES > > has led us astray. Today when the arguments to mlock() cross a folio > > boundary, we split the PMD entry but leave the folio intact. That means > > that we continue to manage the folio as a single entry on the LRU list. > > But userspace may have no idea that we're doing this. It may have made > > several calls to mmap() 256kB at once, they've all been coalesced into > > a single VMA and khugepaged has come along behind its back and created > > a 2MB THP. Now userspace calls mlock() and instead of treating that as > > a hint that oops, maybe we shouldn't've done that, we do our utmost to > > preserve the 2MB folio. > > > > I think this whole approach needs rethinking. IMO, anonymous folios > > should not cross VMA boundaries. Tell me why I'm wrong. > > I think we touched upon that a couple of times already, and the main issue > is that while it sounds nice in theory, it's impossible in practice. > > THP are supposed to be transparent, that is, we should not let arbitrary > operations fail. > > But nothing stops user space from > > (a) mmap'ing a 2 MiB region > (b) GUP-pinning the whole range > (c) GUP-pinning the first half > (d) unpinning the whole range from (a) > (e) munmap'ing the second half > > > And that's just one out of many examples I can think of, not even > considering temporary/speculative references that can prevent a split at > random points in time -- especially when splitting a VMA. > > Sure, any time we PTE-map a THP we might just say "let's put that on the > deferred split queue" and cross fingers that we can eventually split it > later. (I was recently thinking about that in the context of the mapcount > ...) > > It's all a big mess ... Oh, I agree, there are always going to be circumstances where we realise we've made a bad decision and can't (easily) undo it. Unless we have a per-page pincount, and I Would Rather Not Do That. But we should _try_ to do that because it's the right model -- that's what I meant by "Tell me why I'm wrong"; what scenarios do we have where a user temporarilly mlocks (or mprotects or ...) a range of memory, but wants that memory to be aged in the LRU exactly the same way as the adjacent memory that wasn't mprotected? GUP-pinning is different, and I don't think GUP-pinning should split a folio. That's a temporary use (not FOLL_LONGTERM), eg, we're doing tcp zero-copy or it's the source/target of O_DIRECT. That's not an instruction that this memory is different from its neighbours. Maybe we end up deciding to split folios on GUP-pin. That would be regrettable.
On 07.07.23 21:06, Matthew Wilcox wrote: > On Fri, Jul 07, 2023 at 08:54:33PM +0200, David Hildenbrand wrote: >> On 07.07.23 19:26, Matthew Wilcox wrote: >>> On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote: >>>> This series identified the large folio for mlock to two types: >>>> - The large folio is in VM_LOCKED VMA range >>>> - The large folio cross VM_LOCKED VMA boundary >>> >>> This is somewhere that I think our fixation on MUST USE PMD ENTRIES >>> has led us astray. Today when the arguments to mlock() cross a folio >>> boundary, we split the PMD entry but leave the folio intact. That means >>> that we continue to manage the folio as a single entry on the LRU list. >>> But userspace may have no idea that we're doing this. It may have made >>> several calls to mmap() 256kB at once, they've all been coalesced into >>> a single VMA and khugepaged has come along behind its back and created >>> a 2MB THP. Now userspace calls mlock() and instead of treating that as >>> a hint that oops, maybe we shouldn't've done that, we do our utmost to >>> preserve the 2MB folio. >>> >>> I think this whole approach needs rethinking. IMO, anonymous folios >>> should not cross VMA boundaries. Tell me why I'm wrong. >> >> I think we touched upon that a couple of times already, and the main issue >> is that while it sounds nice in theory, it's impossible in practice. >> >> THP are supposed to be transparent, that is, we should not let arbitrary >> operations fail. >> >> But nothing stops user space from >> >> (a) mmap'ing a 2 MiB region >> (b) GUP-pinning the whole range >> (c) GUP-pinning the first half >> (d) unpinning the whole range from (a) >> (e) munmap'ing the second half >> >> >> And that's just one out of many examples I can think of, not even >> considering temporary/speculative references that can prevent a split at >> random points in time -- especially when splitting a VMA. >> >> Sure, any time we PTE-map a THP we might just say "let's put that on the >> deferred split queue" and cross fingers that we can eventually split it >> later. (I was recently thinking about that in the context of the mapcount >> ...) >> >> It's all a big mess ... > > Oh, I agree, there are always going to be circumstances where we realise > we've made a bad decision and can't (easily) undo it. Unless we have a > per-page pincount, and I Would Rather Not Do That. I agree ... But we should _try_ > to do that because it's the right model -- that's what I meant by "Tell Try to have per-page pincounts? :/ or do you mean, try to split on VMA split? I hope the latter (although I'm not sure about performance) :) > me why I'm wrong"; what scenarios do we have where a user temporarilly > mlocks (or mprotects or ...) a range of memory, but wants that memory > to be aged in the LRU exactly the same way as the adjacent memory that > wasn't mprotected? Let me throw in a "fun one". Parent process has a 2 MiB range populated by a THP. fork() a child process. Child process mprotects half the VMA. Should we split the (COW-shared) THP? Or should we COW/unshare in the child process (ugh!) during the VMA split. It all makes my brain hurt. > > GUP-pinning is different, and I don't think GUP-pinning should split > a folio. That's a temporary use (not FOLL_LONGTERM), eg, we're doing > tcp zero-copy or it's the source/target of O_DIRECT. That's not an > instruction that this memory is different from its neighbours. > > Maybe we end up deciding to split folios on GUP-pin. That would be > regrettable. 
That would probably never be accepted, because the ones that heavily rely on THP (databases, VMs), typically also end up using a lot of features that use (long-term) page pinning. Don't get me started on io_uring with fixed buffers.
On Fri, Jul 07, 2023 at 09:15:02PM +0200, David Hildenbrand wrote: > > > Sure, any time we PTE-map a THP we might just say "let's put that on the > > > deferred split queue" and cross fingers that we can eventually split it > > > later. (I was recently thinking about that in the context of the mapcount > > > ...) > > > > > > It's all a big mess ... > > > > Oh, I agree, there are always going to be circumstances where we realise > > we've made a bad decision and can't (easily) undo it. Unless we have a > > per-page pincount, and I Would Rather Not Do That. > > I agree ... > > But we should _try_ > > to do that because it's the right model -- that's what I meant by "Tell > > Try to have per-page pincounts? :/ or do you mean, try to split on VMA > split? I hope the latter (although I'm not sure about performance) :) Sorry, try to split a folio on VMA split. > > me why I'm wrong"; what scenarios do we have where a user temporarilly > > mlocks (or mprotects or ...) a range of memory, but wants that memory > > to be aged in the LRU exactly the same way as the adjacent memory that > > wasn't mprotected? > > Let me throw in a "fun one". > > Parent process has a 2 MiB range populated by a THP. fork() a child process. > Child process mprotects half the VMA. > > Should we split the (COW-shared) THP? Or should we COW/unshare in the child > process (ugh!) during the VMA split. > > It all makes my brain hurt. OK, so this goes back to what I wrote earlier about attempting to choose what size of folio to allocate on COW: https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ : the parent had already established : an appropriate size folio to use for this VMA before calling fork(). : Whether it is the parent or the child causing the COW, it should probably : inherit that choice and we should default to the same size folio that : was already found. You've come up with a usefully different case here. I think we should COW the folio at the point of the mprotect(). That will allow the parent to become the sole owner of the folio once again and ensure that when the parent modifies the folio, it _doesn't_ have to COW. (This is also a rare case, surely) > > > > GUP-pinning is different, and I don't think GUP-pinning should split > > a folio. That's a temporary use (not FOLL_LONGTERM), eg, we're doing > > tcp zero-copy or it's the source/target of O_DIRECT. That's not an > > instruction that this memory is different from its neighbours. > > > > Maybe we end up deciding to split folios on GUP-pin. That would be > > regrettable. > > That would probably never be accepted, because the ones that heavily rely on > THP (databases, VMs), typically also end up using a lot of features that use > (long-term) page pinning. Don't get me started on io_uring with fixed > buffers. I do think that something like a long-term pin should split a folio. Otherwise we're condemning the rest of the folio to be pinned along with it. Short term pins shouldn't split.
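The fork()/mprotect() case above is easy to state as a user-space sequence. The snippet below is only a sketch of that sequence (whether the parent's range is really backed by a single THP depends on the THP configuration); the point where the child's mprotect() splits the VMA is what the discussion above is about.

#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SZ_2M (2 * 1024 * 1024UL)

int main(void)
{
        /* Parent populates a 2 MiB range, ideally as one THP. */
        char *p = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;
        madvise(p, SZ_2M, MADV_HUGEPAGE);
        memset(p, 0xaa, SZ_2M);

        pid_t pid = fork();     /* the THP is now COW-shared with the child */
        if (pid == 0) {
                /* Child mprotects half the VMA.  The VMA is split, and the
                 * question above is what to do with the COW-shared THP that
                 * now crosses the new boundary: split it, or COW/unshare it
                 * in the child at this point. */
                mprotect(p, SZ_2M / 2, PROT_READ);
                p[SZ_2M / 2] = 1;       /* write to the still-writable half */
                _exit(0);
        }
        waitpid(pid, NULL, 0);
        munmap(p, SZ_2M);
        return 0;
}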
On 7/8/2023 1:26 AM, Matthew Wilcox wrote: > On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote: >> This series identified the large folio for mlock to two types: >> - The large folio is in VM_LOCKED VMA range >> - The large folio cross VM_LOCKED VMA boundary > > This is somewhere that I think our fixation on MUST USE PMD ENTRIES > has led us astray. Today when the arguments to mlock() cross a folio > boundary, we split the PMD entry but leave the folio intact. That means > that we continue to manage the folio as a single entry on the LRU list. > But userspace may have no idea that we're doing this. It may have made > several calls to mmap() 256kB at once, they've all been coalesced into > a single VMA and khugepaged has come along behind its back and created > a 2MB THP. Now userspace calls mlock() and instead of treating that as > a hint that oops, maybe we shouldn't've done that, we do our utmost to > preserve the 2MB folio. > > I think this whole approach needs rethinking. IMO, anonymous folios > should not cross VMA boundaries. Tell me why I'm wrong. No. You are not wrong. :) The concept of keeping anonymous folios from crossing VMA boundaries is a decent one. I tried splitting the large folio when it crosses a VMA boundary for mlock(). As it's possible that the folio split fails, we always need to deal with that case, so I decided to postpone all large folio splitting to the page reclaim phase. The benefits we get: - If the range is munlocked before page reclaim picks this folio, we don't need to split the folio. - For a system that doesn't have swap enabled, we don't need to split this kind of folio. Regards Yin, Fengwei
On 7/8/2023 2:54 AM, David Hildenbrand wrote: > On 07.07.23 19:26, Matthew Wilcox wrote: >> On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote: >>> This series identified the large folio for mlock to two types: >>> - The large folio is in VM_LOCKED VMA range >>> - The large folio cross VM_LOCKED VMA boundary >> >> This is somewhere that I think our fixation on MUST USE PMD ENTRIES >> has led us astray. Today when the arguments to mlock() cross a folio >> boundary, we split the PMD entry but leave the folio intact. That means >> that we continue to manage the folio as a single entry on the LRU list. >> But userspace may have no idea that we're doing this. It may have made >> several calls to mmap() 256kB at once, they've all been coalesced into >> a single VMA and khugepaged has come along behind its back and created >> a 2MB THP. Now userspace calls mlock() and instead of treating that as >> a hint that oops, maybe we shouldn't've done that, we do our utmost to >> preserve the 2MB folio. >> >> I think this whole approach needs rethinking. IMO, anonymous folios >> should not cross VMA boundaries. Tell me why I'm wrong. > > I think we touched upon that a couple of times already, and the main issue is that while it sounds nice in theory, it's impossible in practice. > > THP are supposed to be transparent, that is, we should not let arbitrary operations fail. > > But nothing stops user space from > > (a) mmap'ing a 2 MiB region > (b) GUP-pinning the whole range > (c) GUP-pinning the first half > (d) unpinning the whole range from (a) > (e) munmap'ing the second half > > > And that's just one out of many examples I can think of, not even considering temporary/speculative references that can prevent a split at random points in time -- especially when splitting a VMA. > Yes. The case that folio can't be split successfully is the only reason I tried to avoid split the folio in mlock() syscall. I'd like to postpone the split to page reclaim phase. Regards Yin, Fengwei > Sure, any time we PTE-map a THP we might just say "let's put that on the deferred split queue" and cross fingers that we can eventually split it later. (I was recently thinking about that in the context of the mapcount ...) > > It's all a big mess ... >
On 7/8/2023 3:06 AM, Matthew Wilcox wrote: > On Fri, Jul 07, 2023 at 08:54:33PM +0200, David Hildenbrand wrote: >> On 07.07.23 19:26, Matthew Wilcox wrote: >>> On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote: >>>> This series identified the large folio for mlock to two types: >>>> - The large folio is in VM_LOCKED VMA range >>>> - The large folio cross VM_LOCKED VMA boundary >>> >>> This is somewhere that I think our fixation on MUST USE PMD ENTRIES >>> has led us astray. Today when the arguments to mlock() cross a folio >>> boundary, we split the PMD entry but leave the folio intact. That means >>> that we continue to manage the folio as a single entry on the LRU list. >>> But userspace may have no idea that we're doing this. It may have made >>> several calls to mmap() 256kB at once, they've all been coalesced into >>> a single VMA and khugepaged has come along behind its back and created >>> a 2MB THP. Now userspace calls mlock() and instead of treating that as >>> a hint that oops, maybe we shouldn't've done that, we do our utmost to >>> preserve the 2MB folio. >>> >>> I think this whole approach needs rethinking. IMO, anonymous folios >>> should not cross VMA boundaries. Tell me why I'm wrong. >> >> I think we touched upon that a couple of times already, and the main issue >> is that while it sounds nice in theory, it's impossible in practice. >> >> THP are supposed to be transparent, that is, we should not let arbitrary >> operations fail. >> >> But nothing stops user space from >> >> (a) mmap'ing a 2 MiB region >> (b) GUP-pinning the whole range >> (c) GUP-pinning the first half >> (d) unpinning the whole range from (a) >> (e) munmap'ing the second half >> >> >> And that's just one out of many examples I can think of, not even >> considering temporary/speculative references that can prevent a split at >> random points in time -- especially when splitting a VMA. >> >> Sure, any time we PTE-map a THP we might just say "let's put that on the >> deferred split queue" and cross fingers that we can eventually split it >> later. (I was recently thinking about that in the context of the mapcount >> ...) >> >> It's all a big mess ... > > Oh, I agree, there are always going to be circumstances where we realise > we've made a bad decision and can't (easily) undo it. Unless we have a > per-page pincount, and I Would Rather Not Do That. But we should _try_ > to do that because it's the right model -- that's what I meant by "Tell > me why I'm wrong"; what scenarios do we have where a user temporarilly > mlocks (or mprotects or ...) a range of memory, but wants that memory > to be aged in the LRU exactly the same way as the adjacent memory that > wasn't mprotected? for manpage of mlock(): mlock(), mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory from being paged to the swap area. So my understanding is it's OK to let the memory mlocked to be aged with the adjacent memory which is not mlocked. Just make sure they are not paged out to swap. One question for implementation detail: If the large folio cross VMA boundary can not be split, how do we deal with this case? Retry in syscall till it's split successfully? Or return error (and what ERRORS should we choose) to user space? Regards Yin, Fengwei > > GUP-pinning is different, and I don't think GUP-pinning should split > a folio. That's a temporary use (not FOLL_LONGTERM), eg, we're doing > tcp zero-copy or it's the source/target of O_DIRECT. 
> That's not an instruction that this memory is different from its neighbours. > > Maybe we end up deciding to split folios on GUP-pin. That would be regrettable.
On Sat, Jul 08, 2023 at 11:52:23AM +0800, Yin, Fengwei wrote: > > Oh, I agree, there are always going to be circumstances where we realise > > we've made a bad decision and can't (easily) undo it. Unless we have a > > per-page pincount, and I Would Rather Not Do That. But we should _try_ > > to do that because it's the right model -- that's what I meant by "Tell > > me why I'm wrong"; what scenarios do we have where a user temporarilly > > mlocks (or mprotects or ...) a range of memory, but wants that memory > > to be aged in the LRU exactly the same way as the adjacent memory that > > wasn't mprotected? > for manpage of mlock(): > mlock(), mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory > from being paged to the swap area. > > So my understanding is it's OK to let the memory mlocked to be aged with > the adjacent memory which is not mlocked. Just make sure they are not > paged out to swap. Right, it doesn't break anything; it's just a similar problem to internal fragmentation. The pages of the folio which aren't mlocked will also be locked in RAM and never paged out. > One question for implementation detail: > If the large folio cross VMA boundary can not be split, how do we > deal with this case? Retry in syscall till it's split successfully? > Or return error (and what ERRORS should we choose) to user space? I would be tempted to allocate memory & copy to the new mlocked VMA. The old folio will go on the deferred_list and be split later, or its valid parts will be written to swap and then it can be freed.
On Fri, Jul 7, 2023 at 10:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Sat, Jul 08, 2023 at 11:52:23AM +0800, Yin, Fengwei wrote: > > > Oh, I agree, there are always going to be circumstances where we realise > > > we've made a bad decision and can't (easily) undo it. Unless we have a > > > per-page pincount, and I Would Rather Not Do That. But we should _try_ > > > to do that because it's the right model -- that's what I meant by "Tell > > > me why I'm wrong"; what scenarios do we have where a user temporarilly > > > mlocks (or mprotects or ...) a range of memory, but wants that memory > > > to be aged in the LRU exactly the same way as the adjacent memory that > > > wasn't mprotected? > > for manpage of mlock(): > > mlock(), mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory > > from being paged to the swap area. > > > > So my understanding is it's OK to let the memory mlocked to be aged with > > the adjacent memory which is not mlocked. Just make sure they are not > > paged out to swap. > > Right, it doesn't break anything; it's just a similar problem to > internal fragmentation. The pages of the folio which aren't mlocked > will also be locked in RAM and never paged out. I don't think this is the case: since partially locking a non-pmd-mappable large folio is a nop, it remains on one of the evictable LRUs. The rmap walk by folio_referenced() should already be able to find the VMA and the PTEs mapping the unlocked portion. So the page reclaim should be able to correctly age the unlocked portion even though the folio contains a locked portion too. And when it tries to reclaim the entire folio, it first tries to split it into a list of base folios in shrink_folio_list(), and if that succeeds, it walks the rmap of each base folio on that list to unmap (not age). Unmapping doesn't have TTU_IGNORE_MLOCK, so it should correctly call mlock_vma_folio() on the locked base folios and bail out. And finally those locked base folios are put back to the unevictable list. > > One question for implementation detail: > > If the large folio cross VMA boundary can not be split, how do we > > deal with this case? Retry in syscall till it's split successfully? > > Or return error (and what ERRORS should we choose) to user space? > > I would be tempted to allocate memory & copy to the new mlocked VMA. > The old folio will go on the deferred_list and be split later, or its > valid parts will be written to swap and then it can be freed.
On 7/8/2023 12:02 PM, Matthew Wilcox wrote: > On Sat, Jul 08, 2023 at 11:52:23AM +0800, Yin, Fengwei wrote: >>> Oh, I agree, there are always going to be circumstances where we realise >>> we've made a bad decision and can't (easily) undo it. Unless we have a >>> per-page pincount, and I Would Rather Not Do That. But we should _try_ >>> to do that because it's the right model -- that's what I meant by "Tell >>> me why I'm wrong"; what scenarios do we have where a user temporarilly >>> mlocks (or mprotects or ...) a range of memory, but wants that memory >>> to be aged in the LRU exactly the same way as the adjacent memory that >>> wasn't mprotected? >> for manpage of mlock(): >> mlock(), mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory >> from being paged to the swap area. >> >> So my understanding is it's OK to let the memory mlocked to be aged with >> the adjacent memory which is not mlocked. Just make sure they are not >> paged out to swap. > > Right, it doesn't break anything; it's just a similar problem to > internal fragmentation. The pages of the folio which aren't mlocked > will also be locked in RAM and never paged out. This patchset doesn't mlock the large folio cross VMA boundary. So page reclaim can pick the large folio and split it. Then the pages out of VM_LOCKED VMA range will be paged out. Or did I miss something here? > >> One question for implementation detail: >> If the large folio cross VMA boundary can not be split, how do we >> deal with this case? Retry in syscall till it's split successfully? >> Or return error (and what ERRORS should we choose) to user space? > > I would be tempted to allocate memory & copy to the new mlocked VMA. > The old folio will go on the deferred_list and be split later, or its > valid parts will be written to swap and then it can be freed. OK. This can be common operations for any case that splitting VMA triggers large folio splitting but splitting fails. Regards Yin, Fengwei
On 7/8/2023 12:35 PM, Yu Zhao wrote: > On Fri, Jul 7, 2023 at 10:02 PM Matthew Wilcox <willy@infradead.org> wrote: >> >> On Sat, Jul 08, 2023 at 11:52:23AM +0800, Yin, Fengwei wrote: >>>> Oh, I agree, there are always going to be circumstances where we realise >>>> we've made a bad decision and can't (easily) undo it. Unless we have a >>>> per-page pincount, and I Would Rather Not Do That. But we should _try_ >>>> to do that because it's the right model -- that's what I meant by "Tell >>>> me why I'm wrong"; what scenarios do we have where a user temporarilly >>>> mlocks (or mprotects or ...) a range of memory, but wants that memory >>>> to be aged in the LRU exactly the same way as the adjacent memory that >>>> wasn't mprotected? >>> for manpage of mlock(): >>> mlock(), mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory >>> from being paged to the swap area. >>> >>> So my understanding is it's OK to let the memory mlocked to be aged with >>> the adjacent memory which is not mlocked. Just make sure they are not >>> paged out to swap. >> >> Right, it doesn't break anything; it's just a similar problem to >> internal fragmentation. The pages of the folio which aren't mlocked >> will also be locked in RAM and never paged out. > > I don't think this is the case: since partially locking a > non-pmd-mappable large folio is a nop, it remains on one of the > evictable LRUs. The rmap walk by folio_referenced() should already be > able to find the VMA and the PTEs mapping the unlocked portion. So the > page reclaim should be able to correctly age the unlocked portion even > though the folio contains a locked portion too. And when it tries to > reclaim the entire folio, it first tries to split it into a list of > base folios in shrink_folio_list(), and if that succeeds, it walks the > rmap of each base folio on that list to unmap (not age). Unmapping > doesn't have TTU_IGNORE_MLOCK, so it should correctly call > mlock_vma_folio() on the locked base folios and bail out. And finally > those locked base folios are put back to the unevictable list. Yes. This is exact my understanding to this too. It's also why this patchset keep large folio cross VMA boundary munlocked. Regards Yin, Fengwei > >>> One question for implementation detail: >>> If the large folio cross VMA boundary can not be split, how do we >>> deal with this case? Retry in syscall till it's split successfully? >>> Or return error (and what ERRORS should we choose) to user space? >> >> I would be tempted to allocate memory & copy to the new mlocked VMA. >> The old folio will go on the deferred_list and be split later, or its >> valid parts will be written to swap and then it can be freed.
On Fri, Jul 7, 2023 at 10:52 AM Yin Fengwei <fengwei.yin@intel.com> wrote: > > Yu mentioned at [1] about the mlock() can't be applied to large folio. > > I leant the related code and here is my understanding: > - For RLIMIT_MEMLOCK related, there is no problem. Becuase the > RLIMIT_MEMLOCK statistics is not related underneath page. That means > underneath page mlock or munlock doesn't impact the RLIMIT_MEMLOCK > statistics collection which is always correct. > > - For keeping the page in RAM, there is no problem either. At least, > during try_to_unmap_one(), once detect the VMA has VM_LOCKED bit > set in vm_flags, the folio will be kept whatever the folio is > mlocked or not. > > So the function of mlock for large folio works. But it's not optimized > because the page reclaim needs scan these large folio and may split > them. > > This series identified the large folio for mlock to two types: > - The large folio is in VM_LOCKED VMA range > - The large folio cross VM_LOCKED VMA boundary > > For the first type, we mlock large folio so page relcaim will skip it. > For the second type, we don't mlock large folio. It's allowed to be > picked by page reclaim and be split. So the pages not in VM_LOCKED VMA > range are allowed to be reclaimed/released. This is a sound design, which is also what I have in mind. I see the rationales are being spelled out in this thread, and hopefully everyone can be convinced. > patch1 introduce API to check whether large folio is in VMA range. > patch2 make page reclaim/mlock_vma_folio/munlock_vma_folio support > large folio mlock/munlock. > patch3 make mlock/munlock syscall support large folio. Could you tidy up the last patch a little bit? E.g., Saying "mlock the 4K folio" is obviously not the best idea. And if it's possible, make the loop just look like before, i.e., if (!can_mlock_entire_folio()) continue; if (vma->vm_flags & VM_LOCKED) mlock_folio_range(); else munlock_folio_range(); Thanks.
On 7/8/2023 12:45 PM, Yu Zhao wrote: > On Fri, Jul 7, 2023 at 10:52 AM Yin Fengwei <fengwei.yin@intel.com> wrote: >> >> Yu mentioned at [1] about the mlock() can't be applied to large folio. >> >> I leant the related code and here is my understanding: >> - For RLIMIT_MEMLOCK related, there is no problem. Becuase the >> RLIMIT_MEMLOCK statistics is not related underneath page. That means >> underneath page mlock or munlock doesn't impact the RLIMIT_MEMLOCK >> statistics collection which is always correct. >> >> - For keeping the page in RAM, there is no problem either. At least, >> during try_to_unmap_one(), once detect the VMA has VM_LOCKED bit >> set in vm_flags, the folio will be kept whatever the folio is >> mlocked or not. >> >> So the function of mlock for large folio works. But it's not optimized >> because the page reclaim needs scan these large folio and may split >> them. >> >> This series identified the large folio for mlock to two types: >> - The large folio is in VM_LOCKED VMA range >> - The large folio cross VM_LOCKED VMA boundary >> >> For the first type, we mlock large folio so page relcaim will skip it. >> For the second type, we don't mlock large folio. It's allowed to be >> picked by page reclaim and be split. So the pages not in VM_LOCKED VMA >> range are allowed to be reclaimed/released. > > This is a sound design, which is also what I have in mind. I see the > rationales are being spelled out in this thread, and hopefully > everyone can be convinced. > >> patch1 introduce API to check whether large folio is in VMA range. >> patch2 make page reclaim/mlock_vma_folio/munlock_vma_folio support >> large folio mlock/munlock. >> patch3 make mlock/munlock syscall support large folio. > > Could you tidy up the last patch a little bit? E.g., Saying "mlock the > 4K folio" is obviously not the best idea. > > And if it's possible, make the loop just look like before, i.e., > > if (!can_mlock_entire_folio()) > continue; > if (vma->vm_flags & VM_LOCKED) > mlock_folio_range(); > else > munlock_folio_range(); This can make large folio mlocked() even user space call munlock() to the range. Considering following case: 1. mlock() 64K range and underneath 64K large folio is mlocked(). 2. mprotect the first 32K range to different prot and triggers VMA split. 3. munlock() 64K range. As 64K large folio doesn't in these two new VMAs range, it will not be munlocked() and only can be reclaimed after it's unmapped from two VMAs instead of after the range is munlocked(). Regards Yin, Fengwei > > Thanks.
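As a user-space snippet, the sequence Yin describes looks like this (assuming the kernel happens to back the 64K range with a single large folio, which is not guaranteed):

#include <string.h>
#include <sys/mman.h>

#define SZ_32K (32 * 1024UL)
#define SZ_64K (64 * 1024UL)

int main(void)
{
        char *p = mmap(NULL, SZ_64K, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;
        memset(p, 0, SZ_64K);           /* may be backed by a 64K large folio */

        mlock(p, SZ_64K);               /* 1. folio fully inside a VM_LOCKED VMA */
        mprotect(p, SZ_32K, PROT_READ); /* 2. VMA split: the folio now crosses
                                         *    the boundary of both new VMAs */
        munlock(p, SZ_64K);             /* 3. the folio's mlock state must still
                                         *    be cleared here, or it stays
                                         *    unevictable until it is unmapped */
        munmap(p, SZ_64K);
        return 0;
}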
On Fri, Jul 7, 2023 at 11:01 PM Yin, Fengwei <fengwei.yin@intel.com> wrote: > > > > On 7/8/2023 12:45 PM, Yu Zhao wrote: > > On Fri, Jul 7, 2023 at 10:52 AM Yin Fengwei <fengwei.yin@intel.com> wrote: > >> > >> Yu mentioned at [1] about the mlock() can't be applied to large folio. > >> > >> I leant the related code and here is my understanding: > >> - For RLIMIT_MEMLOCK related, there is no problem. Becuase the > >> RLIMIT_MEMLOCK statistics is not related underneath page. That means > >> underneath page mlock or munlock doesn't impact the RLIMIT_MEMLOCK > >> statistics collection which is always correct. > >> > >> - For keeping the page in RAM, there is no problem either. At least, > >> during try_to_unmap_one(), once detect the VMA has VM_LOCKED bit > >> set in vm_flags, the folio will be kept whatever the folio is > >> mlocked or not. > >> > >> So the function of mlock for large folio works. But it's not optimized > >> because the page reclaim needs scan these large folio and may split > >> them. > >> > >> This series identified the large folio for mlock to two types: > >> - The large folio is in VM_LOCKED VMA range > >> - The large folio cross VM_LOCKED VMA boundary > >> > >> For the first type, we mlock large folio so page relcaim will skip it. > >> For the second type, we don't mlock large folio. It's allowed to be > >> picked by page reclaim and be split. So the pages not in VM_LOCKED VMA > >> range are allowed to be reclaimed/released. > > > > This is a sound design, which is also what I have in mind. I see the > > rationales are being spelled out in this thread, and hopefully > > everyone can be convinced. > > > >> patch1 introduce API to check whether large folio is in VMA range. > >> patch2 make page reclaim/mlock_vma_folio/munlock_vma_folio support > >> large folio mlock/munlock. > >> patch3 make mlock/munlock syscall support large folio. > > > > Could you tidy up the last patch a little bit? E.g., Saying "mlock the > > 4K folio" is obviously not the best idea. > > > > And if it's possible, make the loop just look like before, i.e., > > > > if (!can_mlock_entire_folio()) > > continue; > > if (vma->vm_flags & VM_LOCKED) > > mlock_folio_range(); > > else > > munlock_folio_range(); > This can make large folio mlocked() even user space call munlock() > to the range. Considering following case: > 1. mlock() 64K range and underneath 64K large folio is mlocked(). > 2. mprotect the first 32K range to different prot and triggers > VMA split. > 3. munlock() 64K range. As 64K large folio doesn't in these two > new VMAs range, it will not be munlocked() and only can be > reclaimed after it's unmapped from two VMAs instead of after > the range is munlocked(). I understand. I'm asking to factor the code, not to change the logic.
On 7/8/2023 1:06 PM, Yu Zhao wrote: > On Fri, Jul 7, 2023 at 11:01 PM Yin, Fengwei <fengwei.yin@intel.com> wrote: >> >> >> >> On 7/8/2023 12:45 PM, Yu Zhao wrote: >>> On Fri, Jul 7, 2023 at 10:52 AM Yin Fengwei <fengwei.yin@intel.com> wrote: >>>> >>>> Yu mentioned at [1] about the mlock() can't be applied to large folio. >>>> >>>> I leant the related code and here is my understanding: >>>> - For RLIMIT_MEMLOCK related, there is no problem. Becuase the >>>> RLIMIT_MEMLOCK statistics is not related underneath page. That means >>>> underneath page mlock or munlock doesn't impact the RLIMIT_MEMLOCK >>>> statistics collection which is always correct. >>>> >>>> - For keeping the page in RAM, there is no problem either. At least, >>>> during try_to_unmap_one(), once detect the VMA has VM_LOCKED bit >>>> set in vm_flags, the folio will be kept whatever the folio is >>>> mlocked or not. >>>> >>>> So the function of mlock for large folio works. But it's not optimized >>>> because the page reclaim needs scan these large folio and may split >>>> them. >>>> >>>> This series identified the large folio for mlock to two types: >>>> - The large folio is in VM_LOCKED VMA range >>>> - The large folio cross VM_LOCKED VMA boundary >>>> >>>> For the first type, we mlock large folio so page relcaim will skip it. >>>> For the second type, we don't mlock large folio. It's allowed to be >>>> picked by page reclaim and be split. So the pages not in VM_LOCKED VMA >>>> range are allowed to be reclaimed/released. >>> >>> This is a sound design, which is also what I have in mind. I see the >>> rationales are being spelled out in this thread, and hopefully >>> everyone can be convinced. >>> >>>> patch1 introduce API to check whether large folio is in VMA range. >>>> patch2 make page reclaim/mlock_vma_folio/munlock_vma_folio support >>>> large folio mlock/munlock. >>>> patch3 make mlock/munlock syscall support large folio. >>> >>> Could you tidy up the last patch a little bit? E.g., Saying "mlock the >>> 4K folio" is obviously not the best idea. >>> >>> And if it's possible, make the loop just look like before, i.e., >>> >>> if (!can_mlock_entire_folio()) >>> continue; >>> if (vma->vm_flags & VM_LOCKED) >>> mlock_folio_range(); >>> else >>> munlock_folio_range(); >> This can make large folio mlocked() even user space call munlock() >> to the range. Considering following case: >> 1. mlock() 64K range and underneath 64K large folio is mlocked(). >> 2. mprotect the first 32K range to different prot and triggers >> VMA split. >> 3. munlock() 64K range. As 64K large folio doesn't in these two >> new VMAs range, it will not be munlocked() and only can be >> reclaimed after it's unmapped from two VMAs instead of after >> the range is munlocked(). > > I understand. I'm asking to factor the code, not to change the logic. Oh. Sorry. I miss-understood the code piece you showed. Will address this in coming version. Thanks. Regards Yin, Fengwei
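For reference, one possible shape for that factoring which keeps the asymmetry Yin describes (mlock a large folio only when it lies entirely inside the VMA, but always munlock). The helper names, in particular folio_within_vma(), are illustrative and not necessarily those used in the series:

        /* Inside the per-PTE walk of mlock_pte_range(), roughly: */
        folio = vm_normal_folio(vma, addr, ptent);
        if (!folio)
                continue;

        if (vma->vm_flags & VM_LOCKED) {
                /* Only mlock a large folio that sits entirely inside the
                 * VM_LOCKED VMA, so reclaim can still split folios that
                 * cross the boundary. */
                if (!folio_test_large(folio) || folio_within_vma(folio, vma))
                        mlock_folio(folio);
        } else {
                /* Always munlock, even for a folio crossing the boundary, so
                 * that a later VMA split followed by munlock() does not leave
                 * it stranded on the unevictable list (see the 64K example
                 * above). */
                munlock_folio(folio);
        }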
On 7/8/2023 12:02 PM, Matthew Wilcox wrote: > I would be tempted to allocate memory & copy to the new mlocked VMA. > The old folio will go on the deferred_list and be split later, or its > valid parts will be written to swap and then it can be freed. If the large folio splitting failure is because of GUP pages, can we do copy here? Let's say, if the GUP page is target of DMA operation and DMA operation is ongoing. We allocated a new page and copy GUP page content to the new page, the data in the new page can be corrupted. Regards Yin, Fengwei
On 09.07.23 15:25, Yin, Fengwei wrote: > > > On 7/8/2023 12:02 PM, Matthew Wilcox wrote: >> I would be tempted to allocate memory & copy to the new mlocked VMA. >> The old folio will go on the deferred_list and be split later, or its >> valid parts will be written to swap and then it can be freed. > If the large folio splitting failure is because of GUP pages, can we > do copy here? > > Let's say, if the GUP page is target of DMA operation and DMA operation > is ongoing. We allocated a new page and copy GUP page content to the > new page, the data in the new page can be corrupted. No, we may only replace anon pages that are flagged as maybe shared (!PageAnonExclusive). We must not replace pages that are exclusive (PageAnonExclusive) unless we first try marking them maybe shared. Clearing will fail if the page maybe pinned. page_try_share_anon_rmap() implements the clearing logic, taking care of synchronizing against concurrent GUP-fast. There are some additional nasty details regarding O_DIRECT. But once it completely switched from using FOLL_GET to properly using FOLL_PIN (a lot of that conversion already happened IIRC), we're fine in that regard.
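A condensed sketch of that rule, for illustration only (the helper name is made up, and locking requirements plus the O_DIRECT caveat above are ignored):

/* May this anon page be replaced by a copy in a new folio?  Only if it is
 * already "maybe shared", or if it can first be marked maybe shared, which
 * page_try_share_anon_rmap() refuses to do when the page may be pinned. */
static bool anon_page_replaceable(struct page *page)
{
        if (!PageAnonExclusive(page))
                return true;
        /* Clears PageAnonExclusive on success; fails if the page may be
         * DMA-pinned. */
        return page_try_share_anon_rmap(page) == 0;
}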
Hi David, On 7/10/2023 5:32 PM, David Hildenbrand wrote: > On 09.07.23 15:25, Yin, Fengwei wrote: >> >> >> On 7/8/2023 12:02 PM, Matthew Wilcox wrote: >>> I would be tempted to allocate memory & copy to the new mlocked VMA. >>> The old folio will go on the deferred_list and be split later, or its >>> valid parts will be written to swap and then it can be freed. >> If the large folio splitting failure is because of GUP pages, can we >> do copy here? >> >> Let's say, if the GUP page is target of DMA operation and DMA operation >> is ongoing. We allocated a new page and copy GUP page content to the >> new page, the data in the new page can be corrupted. > > No, we may only replace anon pages that are flagged as maybe shared (!PageAnonExclusive). We must not replace pages that are exclusive (PageAnonExclusive) unless we first try marking them maybe shared. Clearing will fail if the page maybe pinned. Thanks a lot for clarification. So my understanding is that if large folio splitting fails, it's not always true that we can allocate new folios, copy original large folio content to new folios, remove original large folio from VMA and map the new folios to VMA (like it's only true if original large folio is marked as maybe shared). Regards Yin, Fengwei > > page_try_share_anon_rmap() implements the clearing logic, taking care of synchronizing against concurrent GUP-fast. > > There are some additional nasty details regarding O_DIRECT. But once it completely switched from using FOLL_GET to properly using FOLL_PIN (a lot of that conversion already happened IIRC), we're fine in that regard. >
On 10.07.23 11:43, Yin, Fengwei wrote: > Hi David, > > On 7/10/2023 5:32 PM, David Hildenbrand wrote: >> On 09.07.23 15:25, Yin, Fengwei wrote: >>> >>> >>> On 7/8/2023 12:02 PM, Matthew Wilcox wrote: >>>> I would be tempted to allocate memory & copy to the new mlocked VMA. >>>> The old folio will go on the deferred_list and be split later, or its >>>> valid parts will be written to swap and then it can be freed. >>> If the large folio splitting failure is because of GUP pages, can we >>> do copy here? >>> >>> Let's say, if the GUP page is target of DMA operation and DMA operation >>> is ongoing. We allocated a new page and copy GUP page content to the >>> new page, the data in the new page can be corrupted. >> >> No, we may only replace anon pages that are flagged as maybe shared (!PageAnonExclusive). We must not replace pages that are exclusive (PageAnonExclusive) unless we first try marking them maybe shared. Clearing will fail if the page maybe pinned. > Thanks a lot for clarification. > > So my understanding is that if large folio splitting fails, it's not always > true that we can allocate new folios, copy original large folio content to > new folios, remove original large folio from VMA and map the new folios to > VMA (like it's only true if original large folio is marked as maybe shared). While it might work in many cases, there are some corner cases where it won't work. So to summarize: (1) THP are transparent and should not result in arbitrary syscall failures. (2) Splitting a THP might fail at random points in time either due to GUP pins or due to speculative page references (including speculative GUP pins). (3) Replacing an exclusive anon page that may be pinned will result in memory corruption. So we can try to split any THP that crosses VMA borders on VMA modifications (split due to munmap, mremap, madvise, mprotect, mlock, ...), but that's not guaranteed to work due to (2). And we can try to replace such pages, but that's not guaranteed to be allowed due to (3). And as it's all transparent, we cannot fail (1). For the other cases that Willy and I discussed (split on VMA modifications after fork()), we can at least always replace the anon page. <details> What always works is putting the THP on the deferred split queue to see if we can split it later. The deferred split queue is a bit suboptimal right now, because it requires the (sub)page mapcounts to detect whether the folio is partially mapped vs. fully mapped. If we want to get rid of that, we have to come up with something reasonable. I was wondering if we could have an optimized deferred split queue that only conditionally splits: do an rmap walk and detect whether (a) each page of the folio is still mapped and (b) the folio does not cross a VMA. If both are met, one could skip the deferred split. But that needs a bit of thought; we're already doing an rmap walk when splitting, so scanning which parts are actually mapped does not sound too weird. </details>
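Pulling (1)-(3) together as a rough decision flow for a large anon folio that is left crossing a boundary after a VMA modification. Every helper here is hypothetical; this is a summary sketch of the constraints above, not proposed kernel code:

static void handle_cross_boundary_anon_folio(struct folio *folio)
{
        /* Splitting can fail at any time due to GUP pins or speculative
         * references: constraint (2). */
        if (!try_to_split_cross_vma_folio(folio))       /* 0 on success */
                return;

        /* A copy-and-replace fallback is only allowed when no page of the
         * folio is an exclusive anon page that may be pinned:
         * constraint (3). */
        if (folio_pages_replaceable(folio)) {
                replace_folio_by_copies(folio);
                return;
        }

        /* THP must stay transparent, so the operation cannot fail:
         * constraint (1).  Leave the folio intact and queue a deferred
         * split, hoping it becomes splittable later. */
        queue_deferred_split(folio);
}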
On 7/10/2023 5:57 PM, David Hildenbrand wrote: > On 10.07.23 11:43, Yin, Fengwei wrote: >> Hi David, >> >> On 7/10/2023 5:32 PM, David Hildenbrand wrote: >>> On 09.07.23 15:25, Yin, Fengwei wrote: >>>> >>>> >>>> On 7/8/2023 12:02 PM, Matthew Wilcox wrote: >>>>> I would be tempted to allocate memory & copy to the new mlocked VMA. >>>>> The old folio will go on the deferred_list and be split later, or its >>>>> valid parts will be written to swap and then it can be freed. >>>> If the large folio splitting failure is because of GUP pages, can we >>>> do copy here? >>>> >>>> Let's say, if the GUP page is target of DMA operation and DMA operation >>>> is ongoing. We allocated a new page and copy GUP page content to the >>>> new page, the data in the new page can be corrupted. >>> >>> No, we may only replace anon pages that are flagged as maybe shared (!PageAnonExclusive). We must not replace pages that are exclusive (PageAnonExclusive) unless we first try marking them maybe shared. Clearing will fail if the page maybe pinned. >> Thanks a lot for clarification. >> >> So my understanding is that if large folio splitting fails, it's not always >> true that we can allocate new folios, copy original large folio content to >> new folios, remove original large folio from VMA and map the new folios to >> VMA (like it's only true if original large folio is marked as maybe shared). >> > > While it might work in many cases, there are some corner cases where it won't work. > > So to summarize > > (1) THP are transparent and should not result in arbitrary syscall > failures. > (2) Splitting a THP might fail at random points in time either due to > GUP pins or due to speculative page references (including > speculative GUP pins). > (3) Replacing an exclusive anon page that maybe pinned will result in > memory corruptions. > > So we can try to split any THP that crosses VMA borders on VMA modifications (split due to munmap, mremap, madvise, mprotect, mlock, ...), it's not guaranteed to work due to (1). And we can try to replace pages such pages, but it's not guaranteed to be allowed due to (3). > > And as it's all transparent, we cannot fail (1). Very clear to me now. > > For the other cases that Willy and I discussed (split on VMA modifications after fork()), we can at least always replace the anon page. > > <details> > > What always works, is putting the THP on the deferred split queue to see if we can split it later. The deferred split queue is a bit suboptimal right now, because it requires the (sub)page mapcounts to detect whether the folio is partially mapped vs. fully mapped. If we want to get rid of that, we have to come up with something reasonable. > > I was wondering if we could have a an optimized deferred split queue, that only conditionally splits: do an rmap walk and detect if (a) each page of the folio is still mapped (b) the folio does not cross a VMA. If both are met, one could skip the deferred split. But that needs a bit of thought -- but we're already doing an rmap walk when splitting, so scanning which parts are actually mapped does not sound too weird. > > </details> > Thanks a lot for extra information which help me to know more background. Really appreciate it. Regards Yin, Fengwei
On 07/07/2023 20:26, Matthew Wilcox wrote: > On Fri, Jul 07, 2023 at 09:15:02PM +0200, David Hildenbrand wrote: >>>> Sure, any time we PTE-map a THP we might just say "let's put that on the >>>> deferred split queue" and cross fingers that we can eventually split it >>>> later. (I was recently thinking about that in the context of the mapcount >>>> ...) >>>> >>>> It's all a big mess ... >>> >>> Oh, I agree, there are always going to be circumstances where we realise >>> we've made a bad decision and can't (easily) undo it. Unless we have a >>> per-page pincount, and I Would Rather Not Do That. >> >> I agree ... >> >> But we should _try_ >>> to do that because it's the right model -- that's what I meant by "Tell >> >> Try to have per-page pincounts? :/ or do you mean, try to split on VMA >> split? I hope the latter (although I'm not sure about performance) :) > > Sorry, try to split a folio on VMA split. > >>> me why I'm wrong"; what scenarios do we have where a user temporarilly >>> mlocks (or mprotects or ...) a range of memory, but wants that memory >>> to be aged in the LRU exactly the same way as the adjacent memory that >>> wasn't mprotected? >> >> Let me throw in a "fun one". >> >> Parent process has a 2 MiB range populated by a THP. fork() a child process. >> Child process mprotects half the VMA. >> >> Should we split the (COW-shared) THP? Or should we COW/unshare in the child >> process (ugh!) during the VMA split. >> >> It all makes my brain hurt. > > OK, so this goes back to what I wrote earlier about attempting to choose > what size of folio to allocate on COW: > > https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/ > > : the parent had already established > : an appropriate size folio to use for this VMA before calling fork(). > : Whether it is the parent or the child causing the COW, it should probably > : inherit that choice and we should default to the same size folio that > : was already found. FWIW, I had patches in my original RFC that aimed to follow this policy for large anon folios [1] & [2], and intend to follow up with a modified version of these patches once we have an initial submission. [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-11-ryan.roberts@arm.com/ [2] https://lore.kernel.org/linux-mm/20230414130303.2345383-15-ryan.roberts@arm.com/