Message ID | 8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com (mailing list archive) |
---|---|
Series | mm/munlock: rework of mlock+munlock page handling |
On Sun, Feb 06, 2022 at 01:27:41PM -0800, Hugh Dickins wrote:
> Here it is based on 5.17-rc2, applies also to -rc3, almost cleanly to
> mmotm 2022-02-03-21-58 (needs two easy fixups in mm/huge_memory.c); but
> likely to conflict (I hope not fundamentally) with several concurrent
> large patchsets.

Most of this patchset hasn't arrived here yet, but I would be
_delighted_ to rebase the folio conversion on top of this. What
convoluted code it is! I am so glad you've cleaned this up; I was
dreading doing the rest of the mlock file.
On Sun, 6 Feb 2022, Matthew Wilcox wrote:
> On Sun, Feb 06, 2022 at 01:27:41PM -0800, Hugh Dickins wrote:
> > Here it is based on 5.17-rc2, applies also to -rc3, almost cleanly to
> > mmotm 2022-02-03-21-58 (needs two easy fixups in mm/huge_memory.c); but
> > likely to conflict (I hope not fundamentally) with several concurrent
> > large patchsets.
>
> Most of this patchset hasn't arrived here yet, but I would be
> _delighted_ to rebase the folio conversion on top of this. What
> convoluted code it is! I am so glad you've cleaned this up; I was
> dreading doing the rest of the mlock file.

That's a very generous offer: thank you. I'd been looking at it the
other way round, afraid that it would be getting in your way: but now
think you're right, that the cleanup there will help your work.

I had found mm/mlock.c unexpectedly difficult to deal with when doing
shmem huge pages, and IIRC Kirill had the same experience.

Hugh
On Sun 06-02-22 13:27:41, Hugh Dickins wrote:
> I wondered whether to post this munlocking rework in
> https://lore.kernel.org/linux-mm/35c340a6-96f-28a0-2b7b-2f9fbddc01f@google.com/
>
> There the discussion was OOM reaping, but the main reason for the rework
> has been catastrophic contention on i_mmap_rwsem when exiting from
> multiply mlocked files; and frustration with how contorted munlocking is.

Yeah, I do not think that too many care about the problem of the
oom_reaper stumbling over vast portions of mlocked memory. So far I have
seen only LTP oom02 hitting this, and that one does mlockall and
intentionally hits OOM. I haven't seen any actual workload hitting
something similar. But hey, if we get better and easier to maintain
mlock code, then the oom_reaper will surely add a huge thanks.

> Here it is based on 5.17-rc2, applies also to -rc3, almost cleanly to
> mmotm 2022-02-03-21-58 (needs two easy fixups in mm/huge_memory.c); but
> likely to conflict (I hope not fundamentally) with several concurrent
> large patchsets.
>
> tl;dr
>     mm/mlock.c          |  634 +++++++++++++++-----------------
>     23 files changed, 510 insertions(+), 779 deletions(-)

This is really impressive!

> In my own opinion, this is ready to go in: that whatever its defects,
> it's a big practical improvement over what's presently there. Others
> may differ; and it may be too much to arrive at a quick opinion on.
> My hope is that it will strike a chord with some who have laboured in
> this area in the past: I'll be short of time to do much more with it,
> so maybe someone else could take over if necessary.
>
> At present there's no update to Documentation/vm/unevictable-lru.rst:
> that always needs a different mindset, can follow later, please refer
> to commit messages for now.

Yes, that would help for sure.

So far I have only managed to read through the series and to try to put
all the pieces together (so far I have given up on the THP part), and my
understanding is far from complete. But I have to say I like the general
approach and the overall simplification.

The only thing that is not entirely clear to me at the moment is why you
have chosen to ignore already mapped LOCKONFAULT pages. They will
eventually get sorted out during reclaim/migration, but this can
backfire if too many pages have been pre-faulted before the LOCKONFAULT
call. Maybe not an interesting case in the first place, but I am still
wondering why you have chosen that way.

I will be off for the next couple of days and plan to revisit this
afterwards (should time allow). Anyway, thanks a lot Hugh!
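For concreteness, the case Michal is asking about - pages that are already
mapped and touched before lock-on-fault is requested - can be set up from
userspace roughly as follows. This is a hypothetical illustration for
readers of the thread, not a program posted in it; mlock2() needs glibc
2.27 or later, and a sufficient RLIMIT_MEMLOCK to succeed:

	/*
	 * Pre-fault a mapping, then ask for lock-on-fault: the question in the
	 * thread is how the already-present pages get onto the unevictable LRU.
	 */
	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <string.h>
	#include <stdio.h>

	int main(void)
	{
		size_t len = 64UL << 20;	/* 64MB of anonymous memory */
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return 1;

		memset(buf, 1, len);		/* fault every page in now */

		/* Only future faults are mlocked at fault time... */
		if (mlock2(buf, len, MLOCK_ONFAULT))
			perror("mlock2");

		/*
		 * ...so the pages touched above rely on the page-table walk done
		 * when the vma flags change (per the rest of the thread), not on
		 * the populate path.
		 */
		return 0;
	}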
On Wed, 9 Feb 2022, Michal Hocko wrote:
>
> So far I have only managed to read through the series and to try to put
> all the pieces together (so far I have given up on the THP part), and my
> understanding is far from complete. But I have to say I like the general
> approach and the overall simplification.

Many thanks for looking, Michal, and for all the positivity!

> The only thing that is not entirely clear to me at the moment is why you
> have chosen to ignore already mapped LOCKONFAULT pages. They will
> eventually get sorted out during reclaim/migration, but this can
> backfire if too many pages have been pre-faulted before the LOCKONFAULT
> call. Maybe not an interesting case in the first place, but I am still
> wondering why you have chosen that way.

I'm puzzled: what makes you think I'm ignoring already mapped LOCKONFAULT
pages? I'd consider that a bug.

It is the case, isn't it, that a VM_LOCKONFAULT area always has VM_LOCKED
set too? If I've got that wrong, yes, I'll need to revisit conditions.

> I will be off for the next couple of days and plan to revisit this
> afterwards (should time allow). Anyway, thanks a lot Hugh!
> --
> Michal Hocko
> SUSE Labs
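For reference, the answer to Hugh's question is yes: in the mlock2()
syscall path, MLOCK_ONFAULT only ever adds VM_LOCKONFAULT on top of
VM_LOCKED. A sketch of that entry point in mm/mlock.c, reproduced from
memory of the 5.17-era source, so treat it as illustrative rather than
verbatim:

	SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
	{
		vm_flags_t vm_flags = VM_LOCKED;	/* always set... */

		if (flags & ~MLOCK_ONFAULT)
			return -EINVAL;

		if (flags & MLOCK_ONFAULT)
			vm_flags |= VM_LOCKONFAULT;	/* ...ONFAULT only added on top */

		return do_mlock(start, len, vm_flags);
	}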
On Wed 09-02-22 08:21:17, Hugh Dickins wrote:
> On Wed, 9 Feb 2022, Michal Hocko wrote:
[...]
> > The only thing that is not entirely clear to me at the moment is why you
> > have chosen to ignore already mapped LOCKONFAULT pages. They will
> > eventually get sorted out during reclaim/migration, but this can
> > backfire if too many pages have been pre-faulted before the LOCKONFAULT
> > call. Maybe not an interesting case in the first place, but I am still
> > wondering why you have chosen that way.
>
> I'm puzzled: what makes you think I'm ignoring already mapped LOCKONFAULT
> pages? I'd consider that a bug.

I've had the following path in mind:

	__mm_populate
	  populate_vma_page_range
	    if (vma->vm_flags & VM_LOCKONFAULT)
	      return nr_pages;

which means that __get_user_pages is not called at all. This also means
that follow_page_mask is skipped. The page table walk used to mark
already mapped pages as mlocked, so unless I miss some other path they
would stay on their original LRU list and only get real mlock protection
when encountered by reclaim or migration.

> It is the case, isn't it, that a VM_LOCKONFAULT area always has VM_LOCKED
> set too? If I've got that wrong, yes, I'll need to revisit conditions.

Yes, I didn't really mean we would lose the mlock protection. We would
just do the lazy mlock also on pages which are already mapped. This is
certainly not a correctness issue - although stats might be off - but it
could pollute existing LRUs with mlocked pages rather easily.

As I've said, either I am still missing something or this might even not
be a big deal in real life. I was mostly curious about the choice to
exclude the page table walk for LOCKONFAULT.
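The path Michal spells out corresponds to the check this series adds to
populate_vma_page_range() in mm/gup.c. A simplified sketch of the
resulting logic - the gup_flags setup and sanity checks are omitted, and
this is a paraphrase rather than the patch text:

	long populate_vma_page_range(struct vm_area_struct *vma,
			unsigned long start, unsigned long end, int *locked)
	{
		unsigned long nr_pages = (end - start) / PAGE_SIZE;
		int gup_flags = FOLL_TOUCH;

		/*
		 * Lock on fault: report the range as handled without faulting
		 * anything in, so __get_user_pages() (and follow_page_mask())
		 * never see it - the point Michal is making above.
		 */
		if (vma->vm_flags & VM_LOCKONFAULT)
			return nr_pages;

		if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
			gup_flags |= FOLL_WRITE;

		return __get_user_pages(vma->vm_mm, start, nr_pages, gup_flags,
					NULL, NULL, locked);
	}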
On Wed, 9 Feb 2022, Michal Hocko wrote:
> On Wed 09-02-22 08:21:17, Hugh Dickins wrote:
> > On Wed, 9 Feb 2022, Michal Hocko wrote:
> [...]
> > > The only thing that is not entirely clear to me at the moment is why you
> > > have chosen to ignore already mapped LOCKONFAULT pages. They will
> > > eventually get sorted out during reclaim/migration, but this can
> > > backfire if too many pages have been pre-faulted before the LOCKONFAULT
> > > call. Maybe not an interesting case in the first place, but I am still
> > > wondering why you have chosen that way.
> >
> > I'm puzzled: what makes you think I'm ignoring already mapped LOCKONFAULT
> > pages? I'd consider that a bug.
>
> I've had the following path in mind:
>
> 	__mm_populate
> 	  populate_vma_page_range
> 	    if (vma->vm_flags & VM_LOCKONFAULT)
> 	      return nr_pages;
>
> which means that __get_user_pages is not called at all. This also means
> that follow_page_mask is skipped. The page table walk used to mark
> already mapped pages as mlocked, so unless I miss some other path they
> would stay on their original LRU list and only get real mlock protection
> when encountered by reclaim or migration.

That is so until 04/13: but then, whenever a page is faulted into a
VM_LOCKED area (except when skipping pte-of-compound) it goes through
mlock_page(). So it will then lock-on-fault.

Ah, but you're worrying about any previously-mapped pages when
VM_LOCKONFAULT is established: those ones get done by mlock_pte_range()
in 07/13, same as when VM_LOCKED is established - I checked again,
VM_LOCKONFAULT is not set without VM_LOCKED.

> > It is the case, isn't it, that a VM_LOCKONFAULT area always has VM_LOCKED
> > set too? If I've got that wrong, yes, I'll need to revisit conditions.
>
> Yes, I didn't really mean we would lose the mlock protection. We would
> just do the lazy mlock also on pages which are already mapped. This is
> certainly not a correctness issue - although stats might be off - but it
> could pollute existing LRUs with mlocked pages rather easily.

Even with the lazy mlock in reclaim, I'd still accuse myself of a bug
if I've got this wrong.

> As I've said, either I am still missing something or this might even not
> be a big deal in real life. I was mostly curious about the choice to
> exclude the page table walk for LOCKONFAULT.

It surprised me too: but looking at how FOLL_POPULATE was handled in
faultin_page(), it appeared to me to become redundant, once mlocking was
counted at fault time (and in mlock_pte_range() - maybe that's the page
table walk you're looking for, not excluded but moved).

It is a big deal for me: please don't let me fool you, please do continue
to worry at it until you're convinced or you convince me.

Thanks,
Hugh
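For readers following along, here is a very rough sketch of the page-table
walk Hugh refers to: patch 07/13 adds mlock_pte_range() as a pagewalk
callback in mm/mlock.c, so that pages already mapped when VM_LOCKED (with
or without VM_LOCKONFAULT) is set or cleared get mlocked or munlocked.
The sketch simplifies away the THP/pmd handling and is not the patch
itself:

	static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
				   unsigned long end, struct mm_walk *walk)
	{
		struct vm_area_struct *vma = walk->vma;
		pte_t *start_pte, *pte;
		spinlock_t *ptl;

		start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
		for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
			struct page *page;

			if (!pte_present(*pte))
				continue;
			page = vm_normal_page(vma, addr, *pte);
			if (!page || PageTransCompound(page))
				continue;
			if (vma->vm_flags & VM_LOCKED)
				mlock_page(page);	/* count it, move to unevictable LRU */
			else
				munlock_page(page);
		}
		pte_unmap_unlock(start_pte, ptl);
		cond_resched();
		return 0;
	}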
On Wed 09-02-22 14:59:19, Hugh Dickins wrote:
[...]
> Ah, but you're worrying about any previously-mapped pages when
> VM_LOCKONFAULT is established: those ones get done by mlock_pte_range()
> in 07/13, same as when VM_LOCKED is established - I checked again,
> VM_LOCKONFAULT is not set without VM_LOCKED.

Right you are. mlock_fixup->mlock_vma_pages_range is the missing piece.
This is called for both the do_mlock and mlockall paths. So all good,
and thanks for the clarification! I will get back to this sometime next
week.
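Schematically, the piece Michal confirms here looks like the sketch below
(not the actual patch): mlock_fixup() calls mlock_vma_pages_range()
whenever it changes VM_LOCKED on a vma - which covers both the do_mlock()
and mlockall() paths - and that function runs mlock_pte_range() over the
whole range via the pagewalk API, so already-mapped pages are picked up:

	static void mlock_vma_pages_range(struct vm_area_struct *vma,
			unsigned long start, unsigned long end, vm_flags_t newflags)
	{
		static const struct mm_walk_ops mlock_walk_ops = {
			.pmd_entry = mlock_pte_range,
		};

		/*
		 * mlock_pte_range() keys off vma->vm_flags, so update them first
		 * (the real code takes more care against concurrent faulters).
		 */
		vma->vm_flags = newflags;
		walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
	}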