Message ID | 20240424122630.495788-1-david@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v1] mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared() | expand |
On Wed, Apr 24, 2024 at 5:26 AM David Hildenbrand <david@redhat.com> wrote: > > We want to limit the use of page_mapcount() to places where absolutely > required, to prepare for kernel configs where we won't keep track of > per-page mapcounts in large folios. > > khugepaged is one of the remaining "more challenging" page_mapcount() > users, but we might be able to move away from page_mapcount() without > resulting in a significant behavior change that would warrant > special-casing based on kernel configs. > > In 2020, we first added support to khugepaged for collapsing COW-shared > pages via commit 9445689f3b61 ("khugepaged: allow to collapse a page shared > across fork"), followed by support for collapsing PTE-mapped THP in commit > 5503fbf2b0b8 ("khugepaged: allow to collapse PTE-mapped compound pages") > and limiting the memory waste via the "page_count() > 1" check in commit > 71a2c112a0f6 ("khugepaged: introduce 'max_ptes_shared' tunable"). > > As a default, khugepaged will allow up to half of the PTEs to map shared > pages: where page_mapcount() > 1. MADV_COLLAPSE ignores the khugepaged > setting. > > khugepaged does currently not care about swapcache page references, and > does not check under folio lock: so in some corner cases the "shared vs. > exclusive" detection might be a bit off, making us detect "exclusive" when > it's actually "shared". > > Most of our anonymous folios in the system are usually exclusive. We > frequently see sharing of anonymous folios for a short period of time, > after which our short-lived suprocesses either quit or exec(). > > There are some famous examples, though, where child processes exist for a > long time, and where memory is COW-shared with a lot of processes > (webservers, webbrowsers, sshd, ...) and COW-sharing is crucial for > reducing the memory footprint. We don't want to suddenly change the > behavior to result in a significant increase in memory waste. > > Interestingly, khugepaged will only collapse an anonymous THP if at least > one PTE is writable. After fork(), that means that something (usually a > page fault) populated at least a single exclusive anonymous THP in that PMD > range. > > So ... what happens when we switch to "is this folio mapped shared" > instead of "is this page mapped shared" by using > folio_likely_mapped_shared()? > > For "not-COW-shared" folios, small folios and for THPs (large > folios) that are completely mapped into at least one process, > switching to folio_likely_mapped_shared() will not result in a change. > > We'll only see a change for COW-shared PTE-mapped THPs that are > partially mapped into all involved processes. > > There are two cases to consider: > > (A) folio_likely_mapped_shared() returns "false" for a PTE-mapped THP > > If the folio is detected as exclusive, and it actually is exclusive, > there is no change: page_mapcount() == 1. This is the common case > without fork() or with short-lived child processes. > > folio_likely_mapped_shared() might currently still detect a folio as > exclusive although it is shared (false negatives): if the first page is > not mapped multiple times and if the average per-page mapcount is smaller > than 1, implying that (1) the folio is partially mapped and (2) if we are > responsible for many mapcounts by mapping many pages others can't > ("mostly exclusive") (3) if we are not responsible for many mapcounts by > mapping little pages ("mostly shared") it won't make a big impact on the > end result. > > So while we might now detect a page as "exclusive" although it isn't, > it's not expected to make a big difference in common cases. > > (B) folio_likely_mapped_shared() returns "true" for a PTE-mapped THP > > folio_likely_mapped_shared() will never detect a large anonymous folio > as shared although it is exclusive: there are no false positives. > > If we detect a THP as shared, at least one page of the THP is mapped by > another process. It could well be that some pages are actually exclusive. > For example, our child processes could have unmapped/COW'ed some pages > such that they would now be exclusive to out process, which we now > would treat as still-shared. IIUC, case A may under-count shared PTEs, however on the opposite side case B may over-count shared PTEs, right? So the impact may depend on what value is used by max_shared_ptes tunable. It may have a more noticeable impact on a very conservative setting (i.e. max_shared_ptes == 1) if it is under-counted or on a very aggressive setting (i.e. max_shared_ptes == 510) if it is over-counted. So I agree it should not matter much for common cases. AFAIK, the usecase for aggressive setting should be very rare, but conservative setting may be more usual, so improving the under-count for conservative setting may be worth it. > > Examples: > (1) Parent maps all pages of a THP, child maps some pages. We detect > all pages in the parent as shared although some are actually > exclusive. > (2) Parent maps all but some page of a THP, child maps the remainder. > We detect all pages of the THP that the parent maps as shared > although they are all exclusive. > > In (1) we wouldn't collapse a THP right now already: no PTE > is writable, because a write fault would have resulted in COW of a > single page and the parent would no longer map all pages of that THP. > > For (2) we would have collapsed a THP in the parent so far, now we > wouldn't as long as the child process is still alive: unless the child > process unmaps the remaining THP pages or we decide to split that THP. > > Possibly, the child COW'ed many pages, meaning that it's likely that > we can populate a THP for our child first, and then for our parent. > > For (2), we are making really bad use of the THP in the first > place (not even mapped completely in at least one process). If the > THP would be completely partially mapped, it would be on the deferred > split queue where we would split it lazily later. > > For short-running child processes, we don't particularly care. For > long-running processes, the expectation is that such scenarios are > rather rare: further, a THP might be best placed if most data in the > PMD range is actually written, implying that we'll have to COW more > pages first before khugepaged would collapse it. > > To summarize, in the common case, this change is not expected to matter > much. The more common application of khugepaged operates on > exclusive pages, either before fork() or after a child quit. > > Can we improve (A)? Yes, if we implement more precise tracking of "mapped > shared" vs. "mapped exclusively", we could get rid of the false > negatives completely. > > Can we improve (B)? We could count how many pages of a large folio we map > inside the current page table and detect that we are responsible for most > of the folio mapcount and conclude "as good as exclusive", which might help > in some cases. ... but likely, some other mechanism should detect that > the THP is not a good use in the scenario (not even mapped completely in > a single process) and try splitting that folio lazily etc. > > We'll move the folio_test_anon() check before our "shared" check, so we > might get more expressive results for SCAN_EXCEED_SHARED_PTE: this order > of checks now matches the one in __collapse_huge_page_isolate(). Extend > documentation. > > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Jonathan Corbet <corbet@lwn.net> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Cc: Zi Yan <ziy@nvidia.com> > Cc: Yang Shi <yang.shi@linux.alibaba.com> > Cc: John Hubbard <jhubbard@nvidia.com> > Cc: Ryan Roberts <ryan.roberts@arm.com> > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > > How much time can one spend writing a patch description? Unbelievable. But > it was likely time well spend to have a clear picture of the impact. > > This really needs the folio_likely_mapped_shared() optimization [1] that > resides in mm-unstable, I think, to reduce "false negatives". > > The khugepage MM selftests keep working as expected, including: > > Run test: collapse_max_ptes_shared (khugepaged:anon) > Allocate huge page... OK > Share huge page over fork()... OK > Trigger CoW on page 255 of 512... OK > Maybe collapse with max_ptes_shared exceeded.... OK > Trigger CoW on page 256 of 512... OK > Collapse with max_ptes_shared PTEs shared.... OK > Check if parent still has huge page... OK > > Where we check that collapsing in the parent behaves as expected after > COWing a lot of pages in the parent: a sane scenario that is essentially > unchanged and which does not depend on any action in the child process > (compared to the cases discussed in (B) above). > > [1] https://lkml.kernel.org/r/20240409192301.907377-6-david@redhat.com > > --- > Documentation/admin-guide/mm/transhuge.rst | 3 ++- > mm/khugepaged.c | 22 +++++++++++++++------- > 2 files changed, 17 insertions(+), 8 deletions(-) > > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst > index f82300b9193fe..076443cc10a6c 100644 > --- a/Documentation/admin-guide/mm/transhuge.rst > +++ b/Documentation/admin-guide/mm/transhuge.rst > @@ -278,7 +278,8 @@ collapsed, resulting fewer pages being collapsed into > THPs, and lower memory access performance. > > ``max_ptes_shared`` specifies how many pages can be shared across multiple > -processes. Exceeding the number would block the collapse:: > +processes. khugepaged might treat pages of THPs as shared if any page of > +that THP is shared. Exceeding the number would block the collapse:: > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 2f73d2aa9ae84..cf518fc440982 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -583,7 +583,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, > folio = page_folio(page); > VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio); > > - if (page_mapcount(page) > 1) { > + /* See hpage_collapse_scan_pmd(). */ > + if (folio_likely_mapped_shared(folio)) { > ++shared; > if (cc->is_khugepaged && > shared > khugepaged_max_ptes_shared) { > @@ -1317,8 +1318,20 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, > result = SCAN_PAGE_NULL; > goto out_unmap; > } > + folio = page_folio(page); > > - if (page_mapcount(page) > 1) { > + if (!folio_test_anon(folio)) { > + result = SCAN_PAGE_ANON; > + goto out_unmap; > + } > + > + /* > + * We treat a single page as shared if any part of the THP > + * is shared. "False negatives" from > + * folio_likely_mapped_shared() are not expected to matter > + * much in practice. > + */ > + if (folio_likely_mapped_shared(folio)) { > ++shared; > if (cc->is_khugepaged && > shared > khugepaged_max_ptes_shared) { > @@ -1328,7 +1341,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, > } > } > > - folio = page_folio(page); > /* > * Record which node the original page is from and save this > * information to cc->node_load[]. > @@ -1349,10 +1361,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, > result = SCAN_PAGE_LOCK; > goto out_unmap; > } > - if (!folio_test_anon(folio)) { > - result = SCAN_PAGE_ANON; > - goto out_unmap; > - } > > /* > * Check if the page has any GUP (or other external) pins. > -- > 2.44.0 > >
On 24.04.24 18:28, Yang Shi wrote: > On Wed, Apr 24, 2024 at 5:26 AM David Hildenbrand <david@redhat.com> wrote: >> >> We want to limit the use of page_mapcount() to places where absolutely >> required, to prepare for kernel configs where we won't keep track of >> per-page mapcounts in large folios. >> >> khugepaged is one of the remaining "more challenging" page_mapcount() >> users, but we might be able to move away from page_mapcount() without >> resulting in a significant behavior change that would warrant >> special-casing based on kernel configs. >> >> In 2020, we first added support to khugepaged for collapsing COW-shared >> pages via commit 9445689f3b61 ("khugepaged: allow to collapse a page shared >> across fork"), followed by support for collapsing PTE-mapped THP in commit >> 5503fbf2b0b8 ("khugepaged: allow to collapse PTE-mapped compound pages") >> and limiting the memory waste via the "page_count() > 1" check in commit >> 71a2c112a0f6 ("khugepaged: introduce 'max_ptes_shared' tunable"). >> >> As a default, khugepaged will allow up to half of the PTEs to map shared >> pages: where page_mapcount() > 1. MADV_COLLAPSE ignores the khugepaged >> setting. >> >> khugepaged does currently not care about swapcache page references, and >> does not check under folio lock: so in some corner cases the "shared vs. >> exclusive" detection might be a bit off, making us detect "exclusive" when >> it's actually "shared". >> >> Most of our anonymous folios in the system are usually exclusive. We >> frequently see sharing of anonymous folios for a short period of time, >> after which our short-lived suprocesses either quit or exec(). >> >> There are some famous examples, though, where child processes exist for a >> long time, and where memory is COW-shared with a lot of processes >> (webservers, webbrowsers, sshd, ...) and COW-sharing is crucial for >> reducing the memory footprint. We don't want to suddenly change the >> behavior to result in a significant increase in memory waste. >> >> Interestingly, khugepaged will only collapse an anonymous THP if at least >> one PTE is writable. After fork(), that means that something (usually a >> page fault) populated at least a single exclusive anonymous THP in that PMD >> range. >> >> So ... what happens when we switch to "is this folio mapped shared" >> instead of "is this page mapped shared" by using >> folio_likely_mapped_shared()? >> >> For "not-COW-shared" folios, small folios and for THPs (large >> folios) that are completely mapped into at least one process, >> switching to folio_likely_mapped_shared() will not result in a change. >> >> We'll only see a change for COW-shared PTE-mapped THPs that are >> partially mapped into all involved processes. >> >> There are two cases to consider: >> >> (A) folio_likely_mapped_shared() returns "false" for a PTE-mapped THP >> >> If the folio is detected as exclusive, and it actually is exclusive, >> there is no change: page_mapcount() == 1. This is the common case >> without fork() or with short-lived child processes. >> >> folio_likely_mapped_shared() might currently still detect a folio as >> exclusive although it is shared (false negatives): if the first page is >> not mapped multiple times and if the average per-page mapcount is smaller >> than 1, implying that (1) the folio is partially mapped and (2) if we are >> responsible for many mapcounts by mapping many pages others can't >> ("mostly exclusive") (3) if we are not responsible for many mapcounts by >> mapping little pages ("mostly shared") it won't make a big impact on the >> end result. >> >> So while we might now detect a page as "exclusive" although it isn't, >> it's not expected to make a big difference in common cases. >> >> (B) folio_likely_mapped_shared() returns "true" for a PTE-mapped THP >> >> folio_likely_mapped_shared() will never detect a large anonymous folio >> as shared although it is exclusive: there are no false positives. >> >> If we detect a THP as shared, at least one page of the THP is mapped by >> another process. It could well be that some pages are actually exclusive. >> For example, our child processes could have unmapped/COW'ed some pages >> such that they would now be exclusive to out process, which we now >> would treat as still-shared. > > IIUC, case A may under-count shared PTEs, however on the opposite side > case B may over-count shared PTEs, right? So the impact may depend on > what value is used by max_shared_ptes tunable. It may have a more > noticeable impact on a very conservative setting (i.e. max_shared_ptes > == 1) if it is under-counted or on a very aggressive setting (i.e. > max_shared_ptes == 510) if it is over-counted. Thanks for reading all of that! Right, and mostly affecting corner cases. I'm not concerned about (B) really. I was more concerned about (A) before I optimized folio_likely_mapped_shared() using the large mapcount. > > So I agree it should not matter much for common cases. AFAIK, the > usecase for aggressive setting should be very rare, but conservative > setting may be more usual, so improving the under-count for > conservative setting may be worth it. Yes, sorting out A completely is what I 'm working on, but it will likely not be available on all kernel configs, at least initially. And unless there is a good reason, I want to avoid having config-dependent stuff all over the kernel -- and just move page_mapcount() to task_mmu.c where it can no longer be (ab)used. Thanks!
On 4/24/24 5:26 AM, David Hildenbrand wrote: Hi David, Overall, I think this looks good, just a few questions, and of course some silly documentation nits. > We want to limit the use of page_mapcount() to places where absolutely > required, to prepare for kernel configs where we won't keep track of > per-page mapcounts in large folios. Just curious, can you elaborate on the motivation? I probably missed the discussions that explained why page_mapcount() in large folios is not desirable. Are we getting rid of a field in struct page/folio? Some other reason? ... > To summarize, in the common case, this change is not expected to matter > much. The more common application of khugepaged operates on Based on the diffs (and some quick hacks for testing that I ran), I agree. ... > > This really needs the folio_likely_mapped_shared() optimization [1] that > resides in mm-unstable, I think, to reduce "false negatives". > > The khugepage MM selftests keep working as expected, including: > > Run test: collapse_max_ptes_shared (khugepaged:anon) > Allocate huge page... OK > Share huge page over fork()... OK > Trigger CoW on page 255 of 512... OK > Maybe collapse with max_ptes_shared exceeded.... OK > Trigger CoW on page 256 of 512... OK > Collapse with max_ptes_shared PTEs shared.... OK > Check if parent still has huge page... OK Well, a word of caution! These tests do not (yet) cover either of the interesting new cases that folio_likely_mapped_shared() presents: KSM or hugetlbfs interactions. In other words, false positives. > > Where we check that collapsing in the parent behaves as expected after > COWing a lot of pages in the parent: a sane scenario that is essentially > unchanged and which does not depend on any action in the child process > (compared to the cases discussed in (B) above). > > [1] https://lkml.kernel.org/r/20240409192301.907377-6-david@redhat.com > > --- > Documentation/admin-guide/mm/transhuge.rst | 3 ++- > mm/khugepaged.c | 22 +++++++++++++++------- > 2 files changed, 17 insertions(+), 8 deletions(-) > > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst > index f82300b9193fe..076443cc10a6c 100644 > --- a/Documentation/admin-guide/mm/transhuge.rst > +++ b/Documentation/admin-guide/mm/transhuge.rst > @@ -278,7 +278,8 @@ collapsed, resulting fewer pages being collapsed into > THPs, and lower memory access performance. > > ``max_ptes_shared`` specifies how many pages can be shared across multiple > -processes. Exceeding the number would block the collapse:: > +processes. khugepaged might treat pages of THPs as shared if any page of > +that THP is shared. Exceeding the number would block the collapse:: > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 2f73d2aa9ae84..cf518fc440982 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -583,7 +583,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, > folio = page_folio(page); > VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio); > > - if (page_mapcount(page) > 1) { > + /* See hpage_collapse_scan_pmd(). */ Why? Because it has an identical code snippet? I thought about asking if we should factor that out, just to keep the policy the same. Thoughts? > + if (folio_likely_mapped_shared(folio)) { > ++shared; > if (cc->is_khugepaged && > shared > khugepaged_max_ptes_shared) { > @@ -1317,8 +1318,20 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, > result = SCAN_PAGE_NULL; > goto out_unmap; > } > + folio = page_folio(page); > > - if (page_mapcount(page) > 1) { > + if (!folio_test_anon(folio)) { > + result = SCAN_PAGE_ANON; > + goto out_unmap; > + } > + > + /* > + * We treat a single page as shared if any part of the THP > + * is shared. "False negatives" from > + * folio_likely_mapped_shared() are not expected to matter > + * much in practice. Maybe delete that second sentence? It is not really pulling its weight here. :) thanks,
On Wed, Apr 24, 2024 at 09:00:50PM -0700, John Hubbard wrote: > > We want to limit the use of page_mapcount() to places where absolutely > > required, to prepare for kernel configs where we won't keep track of > > per-page mapcounts in large folios. > > > Just curious, can you elaborate on the motivation? I probably missed > the discussions that explained why page_mapcount() in large folios > is not desirable. Are we getting rid of a field in struct page/folio? > Some other reason? Two reasons. One is that, regardless of anything else, folio_mapcount() is expensive on large folios as it has to walk every page in the folio summing the mapcounts. The more important reason is that when we move to separately allocated folios, we don't want to allocate an array of mapcounts in order to maintain a per-page mapcount. So we're looking for a more compact scheme to avoid maintaining a per-page mapcount. > > The khugepage MM selftests keep working as expected, including: > > > > Run test: collapse_max_ptes_shared (khugepaged:anon) > > Allocate huge page... OK > > Share huge page over fork()... OK > > Trigger CoW on page 255 of 512... OK > > Maybe collapse with max_ptes_shared exceeded.... OK > > Trigger CoW on page 256 of 512... OK > > Collapse with max_ptes_shared PTEs shared.... OK > > Check if parent still has huge page... OK > > Well, a word of caution! These tests do not (yet) cover either of > the interesting new cases that folio_likely_mapped_shared() presents: > KSM or hugetlbfs interactions. In other words, false positives. Hmm ... KSM never uses large folios and hugetlbfs is disjoint from khugepaged?
On 4/24/24 9:17 PM, Matthew Wilcox wrote: > On Wed, Apr 24, 2024 at 09:00:50PM -0700, John Hubbard wrote: >>> We want to limit the use of page_mapcount() to places where absolutely >>> required, to prepare for kernel configs where we won't keep track of >>> per-page mapcounts in large folios. >> >> >> Just curious, can you elaborate on the motivation? I probably missed >> the discussions that explained why page_mapcount() in large folios >> is not desirable. Are we getting rid of a field in struct page/folio? >> Some other reason? > > Two reasons. One is that, regardless of anything else, folio_mapcount() > is expensive on large folios as it has to walk every page in the folio > summing the mapcounts. The more important reason is that when we move > to separately allocated folios, we don't want to allocate an array of > mapcounts in order to maintain a per-page mapcount. > > So we're looking for a more compact scheme to avoid maintaining a > per-page mapcount. > I see. Thanks for explaining the story. >>> The khugepage MM selftests keep working as expected, including: >>> >>> Run test: collapse_max_ptes_shared (khugepaged:anon) >>> Allocate huge page... OK >>> Share huge page over fork()... OK >>> Trigger CoW on page 255 of 512... OK >>> Maybe collapse with max_ptes_shared exceeded.... OK >>> Trigger CoW on page 256 of 512... OK >>> Collapse with max_ptes_shared PTEs shared.... OK >>> Check if parent still has huge page... OK >> >> Well, a word of caution! These tests do not (yet) cover either of >> the interesting new cases that folio_likely_mapped_shared() presents: >> KSM or hugetlbfs interactions. In other words, false positives. > > Hmm ... KSM never uses large folios and hugetlbfs is disjoint from > khugepaged? > Oh good. I thought we might have had a testing hole, but no. thanks,
On 4/25/24 1:06 AM, David Hildenbrand wrote: > On 25.04.24 07:40, John Hubbard wrote: >> On 4/24/24 9:17 PM, Matthew Wilcox wrote: >>> On Wed, Apr 24, 2024 at 09:00:50PM -0700, John Hubbard wrote: ... > We'll talk more about all that at LSF/MM in the mapcount session. A spoiler: Looking forward to it. And as an aside, this year it feels like the mm code is changing relatively fast. So many large and small improvements have happened or are in progress. > > page_mapcount() in the context of large folios: > * Is a misunderstood function (e.g., page_mapcount() vs page_count() > checks, mapped = !page_mapcount() checks). > * Is a misleading function (e.g., page_mapped() == folio_mapped() but > page_mapcount() != folio_mapcount()) > > We could just rename it to "folio_precise_page_mapcount()", but then, once we tackle the subpage mapcount optimizations (initially using a separate kernel config toggle), we'll have to teach each caller about an alternative that gets the job done, and it's not that easy to prevent further reuse around the kernel. > > If you look at linux-next, we're down to 5 page_mapcount() calls in fs/proc/, so I'll relocate it to fs/proc/internal.h to prevent any further use - once the s390x change lands in the next merge window. > > Regarding the subpage mapcount optimizations, I can further add: > * (un)map performance improvements for PTE-mapped THP > * Preparation for folio_order() > PMD_ORDER, where the current scheme > won't scale and needs further adjustments/complexity to even keep it > working > * Preparation for hugetlb-like vmemmap optimizations until we have > memdescs / dynamically allocated folios > * (Paving the way for partially mapping hugetlb folios that faced > similar issues? Not sure if that ever gets real, though) > > Is this patch ahead of its time? LSF/MM is just around the corner, and I'm planning on posting the other relevant patches in the next months. I think so, yes. There is a lot of context required to understand the motivation, and more required in order to figure out if it is safe, and if it still provides "good" behavior. I still think it's mostly harmless, though, so being ahead of its time is not necessarily an indictment. :) > >> >>>>> The khugepage MM selftests keep working as expected, including: >>>>> >>>>> Run test: collapse_max_ptes_shared (khugepaged:anon) >>>>> Allocate huge page... OK >>>>> Share huge page over fork()... OK >>>>> Trigger CoW on page 255 of 512... OK >>>>> Maybe collapse with max_ptes_shared exceeded.... OK >>>>> Trigger CoW on page 256 of 512... OK >>>>> Collapse with max_ptes_shared PTEs shared.... OK >>>>> Check if parent still has huge page... OK >>>> >>>> Well, a word of caution! These tests do not (yet) cover either of >>>> the interesting new cases that folio_likely_mapped_shared() presents: >>>> KSM or hugetlbfs interactions. In other words, false positives. >>> >>> Hmm ... KSM never uses large folios and hugetlbfs is disjoint from >>> khugepaged? > > Right, folio_likely_mapped_shared() behaves exactly like page_mapcount() would for (small) KSM folios, so no change for them. > > Thankfully, hugetlb is out of the picture this time. > >>> >> >> Oh good. I thought we might have had a testing hole, but no. > > Thanks for having a look! > > I'm only a bit concerned about folio_likely_mapped_shared() "false negatives" (detecting exclusive although shared), until we have a more precise folio_likely_mapped_shared() variant to not unexpectedly waste memory. > > Imagine someone would be setting "khugepaged_max_ptes_shared=0", and then we have an area where (I think this is the extreme case): > > * We map 256 subpages of a 2M folio that are shared 256 times with a > child process. > * We don't map the first subpage. > * One PTE maps another page and is pte_write(). > * 255 PTEs are pte_none(). > > folio_likely_mapped_shared() would return "false". > > But then my thinking is: > * We are already wasting 256 subpages that are free in the 2M folio. > Sure, we might be able to relaim it when deferred splitting. > * Why would someone set khugepaged_max_ptes_shared=0 but leave > khugepaged_max_ptes_none set that high that we would allow 255 > pte_none? > * If the child is a short-living subprocess, we don't really care > * Any futher writes to unbacked/R/O PTEs in that PMD area would COW and > consume memory. > > So I had to use more and more "ifs" to construct a scenario where we might end up wasting 1M of memory, at which point I decided "this is really a corner case" and likely not worth the worry. > > If we run into real issues, though, it will be easy to just inline page_mapcount() here to resolve it; but the less special-casing the better. > OK. I'll need to think through some more of these cases. And meanwhile, I was poking around from the other direction: just injection test it by pasting in "true" or "false", in place of calling folio_likely_mapped_shared(). And see what changes. The "true" test lets me fail several khugepaged selftests, while the "false" test just increases the counter in /proc/vmstat. That's more of a black box way of poking at it, just to have another facet of testing. Because it is good to ensure that we really do have test coverage if we're changing the code. Anyway, just ideas. thanks,
On 26.04.24 03:23, John Hubbard wrote: > On 4/25/24 1:06 AM, David Hildenbrand wrote: >> On 25.04.24 07:40, John Hubbard wrote: >>> On 4/24/24 9:17 PM, Matthew Wilcox wrote: >>>> On Wed, Apr 24, 2024 at 09:00:50PM -0700, John Hubbard wrote: > ... >> We'll talk more about all that at LSF/MM in the mapcount session. A spoiler: > > Looking forward to it. And as an aside, this year it feels like the mm > code is changing relatively fast. So many large and small improvements > have happened or are in progress. Yes, it's happening on a very fast pace (and it's hard for me to get reasonable work done while still keeping reviewing that much ...). I'll note, that replacing a page-based interface by a folio-based interface should not be shocking news in 2024, and that the issues with mapcounts for large folios have been a recurring topic at LSF/MM and on the mailing list. > > >> >> page_mapcount() in the context of large folios: >> * Is a misunderstood function (e.g., page_mapcount() vs page_count() >> checks, mapped = !page_mapcount() checks). >> * Is a misleading function (e.g., page_mapped() == folio_mapped() but >> page_mapcount() != folio_mapcount()) >> >> We could just rename it to "folio_precise_page_mapcount()", but then, once we tackle the subpage mapcount optimizations (initially using a separate kernel config toggle), we'll have to teach each caller about an alternative that gets the job done, and it's not that easy to prevent further reuse around the kernel. >> >> If you look at linux-next, we're down to 5 page_mapcount() calls in fs/proc/, so I'll relocate it to fs/proc/internal.h to prevent any further use - once the s390x change lands in the next merge window. >> >> Regarding the subpage mapcount optimizations, I can further add: >> * (un)map performance improvements for PTE-mapped THP >> * Preparation for folio_order() > PMD_ORDER, where the current scheme >> won't scale and needs further adjustments/complexity to even keep it >> working >> * Preparation for hugetlb-like vmemmap optimizations until we have >> memdescs / dynamically allocated folios >> * (Paving the way for partially mapping hugetlb folios that faced >> similar issues? Not sure if that ever gets real, though) >> >> Is this patch ahead of its time? LSF/MM is just around the corner, and I'm planning on posting the other relevant patches in the next months. > > I think so, yes. There is a lot of context required to understand the > motivation, and more required in order to figure out if it is safe, > and if it still provides "good" behavior. I think the motivation for removing page_mapcount() should be very clear at this point: a single remaining user in mm/ should be warranted, and the faster it is gone the better. [case in point: I even have another potential user [1] in my mailbox that should be using a folio interface, well, or PG_anon_exclusive :) ] [1] https://lore.kernel.org/all/Zirw0uINbP6GxFiK@bender.morinfr.org/T/#u Regarding removing subpage mapcounts I agree: I added too many details that made it look harder to understand :P > > I still think it's mostly harmless, though, so being ahead of its time > is not necessarily an indictment. :) I didn't express my thought clearly: LSF/MM is just around the corner and the discussion we are having here is the perfect preparation for that session! :) I don't particularly care if we merge this patch now or after the next merge window along with the remaining page_mapcount() removal. Discussing the impact of this change is the important piece. :) [...] >> Thanks for having a look! >> >> I'm only a bit concerned about folio_likely_mapped_shared() "false negatives" (detecting exclusive although shared), until we have a more precise folio_likely_mapped_shared() variant to not unexpectedly waste memory. >> >> Imagine someone would be setting "khugepaged_max_ptes_shared=0", and then we have an area where (I think this is the extreme case): >> >> * We map 256 subpages of a 2M folio that are shared 256 times with a >> child process. >> * We don't map the first subpage. >> * One PTE maps another page and is pte_write(). >> * 255 PTEs are pte_none(). >> >> folio_likely_mapped_shared() would return "false". >> >> But then my thinking is: >> * We are already wasting 256 subpages that are free in the 2M folio. >> Sure, we might be able to relaim it when deferred splitting. >> * Why would someone set khugepaged_max_ptes_shared=0 but leave >> khugepaged_max_ptes_none set that high that we would allow 255 >> pte_none? >> * If the child is a short-living subprocess, we don't really care >> * Any futher writes to unbacked/R/O PTEs in that PMD area would COW and >> consume memory. >> >> So I had to use more and more "ifs" to construct a scenario where we might end up wasting 1M of memory, at which point I decided "this is really a corner case" and likely not worth the worry. >> >> If we run into real issues, though, it will be easy to just inline page_mapcount() here to resolve it; but the less special-casing the better. >> > > OK. I'll need to think through some more of these cases. And meanwhile, I > was poking around from the other direction: just injection test it by > pasting in "true" or "false", in place of calling folio_likely_mapped_shared(). > And see what changes. Highly appreciated! > > The "true" test lets me fail several khugepaged selftests, while the "false" > test just increases the counter in /proc/vmstat. > > That's more of a black box way of poking at it, just to have another facet > of testing. Because it is good to ensure that we really do have test > coverage if we're changing the code. Anyway, just ideas. Yes, all makes sense. I'm very interested if there are valid concerns that the "false negatives" are unacceptable: it would be another case for why we really want to make folio_likely_mapped_shared() precise. For me it's clear that we want to make it precise, but so far I am not convinced that it is absolutely required in the khugepaged context. Thanks!
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index f82300b9193fe..076443cc10a6c 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -278,7 +278,8 @@ collapsed, resulting fewer pages being collapsed into THPs, and lower memory access performance. ``max_ptes_shared`` specifies how many pages can be shared across multiple -processes. Exceeding the number would block the collapse:: +processes. khugepaged might treat pages of THPs as shared if any page of +that THP is shared. Exceeding the number would block the collapse:: /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 2f73d2aa9ae84..cf518fc440982 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -583,7 +583,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, folio = page_folio(page); VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio); - if (page_mapcount(page) > 1) { + /* See hpage_collapse_scan_pmd(). */ + if (folio_likely_mapped_shared(folio)) { ++shared; if (cc->is_khugepaged && shared > khugepaged_max_ptes_shared) { @@ -1317,8 +1318,20 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, result = SCAN_PAGE_NULL; goto out_unmap; } + folio = page_folio(page); - if (page_mapcount(page) > 1) { + if (!folio_test_anon(folio)) { + result = SCAN_PAGE_ANON; + goto out_unmap; + } + + /* + * We treat a single page as shared if any part of the THP + * is shared. "False negatives" from + * folio_likely_mapped_shared() are not expected to matter + * much in practice. + */ + if (folio_likely_mapped_shared(folio)) { ++shared; if (cc->is_khugepaged && shared > khugepaged_max_ptes_shared) { @@ -1328,7 +1341,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, } } - folio = page_folio(page); /* * Record which node the original page is from and save this * information to cc->node_load[]. @@ -1349,10 +1361,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm, result = SCAN_PAGE_LOCK; goto out_unmap; } - if (!folio_test_anon(folio)) { - result = SCAN_PAGE_ANON; - goto out_unmap; - } /* * Check if the page has any GUP (or other external) pins.
We want to limit the use of page_mapcount() to places where absolutely required, to prepare for kernel configs where we won't keep track of per-page mapcounts in large folios. khugepaged is one of the remaining "more challenging" page_mapcount() users, but we might be able to move away from page_mapcount() without resulting in a significant behavior change that would warrant special-casing based on kernel configs. In 2020, we first added support to khugepaged for collapsing COW-shared pages via commit 9445689f3b61 ("khugepaged: allow to collapse a page shared across fork"), followed by support for collapsing PTE-mapped THP in commit 5503fbf2b0b8 ("khugepaged: allow to collapse PTE-mapped compound pages") and limiting the memory waste via the "page_count() > 1" check in commit 71a2c112a0f6 ("khugepaged: introduce 'max_ptes_shared' tunable"). As a default, khugepaged will allow up to half of the PTEs to map shared pages: where page_mapcount() > 1. MADV_COLLAPSE ignores the khugepaged setting. khugepaged does currently not care about swapcache page references, and does not check under folio lock: so in some corner cases the "shared vs. exclusive" detection might be a bit off, making us detect "exclusive" when it's actually "shared". Most of our anonymous folios in the system are usually exclusive. We frequently see sharing of anonymous folios for a short period of time, after which our short-lived suprocesses either quit or exec(). There are some famous examples, though, where child processes exist for a long time, and where memory is COW-shared with a lot of processes (webservers, webbrowsers, sshd, ...) and COW-sharing is crucial for reducing the memory footprint. We don't want to suddenly change the behavior to result in a significant increase in memory waste. Interestingly, khugepaged will only collapse an anonymous THP if at least one PTE is writable. After fork(), that means that something (usually a page fault) populated at least a single exclusive anonymous THP in that PMD range. So ... what happens when we switch to "is this folio mapped shared" instead of "is this page mapped shared" by using folio_likely_mapped_shared()? For "not-COW-shared" folios, small folios and for THPs (large folios) that are completely mapped into at least one process, switching to folio_likely_mapped_shared() will not result in a change. We'll only see a change for COW-shared PTE-mapped THPs that are partially mapped into all involved processes. There are two cases to consider: (A) folio_likely_mapped_shared() returns "false" for a PTE-mapped THP If the folio is detected as exclusive, and it actually is exclusive, there is no change: page_mapcount() == 1. This is the common case without fork() or with short-lived child processes. folio_likely_mapped_shared() might currently still detect a folio as exclusive although it is shared (false negatives): if the first page is not mapped multiple times and if the average per-page mapcount is smaller than 1, implying that (1) the folio is partially mapped and (2) if we are responsible for many mapcounts by mapping many pages others can't ("mostly exclusive") (3) if we are not responsible for many mapcounts by mapping little pages ("mostly shared") it won't make a big impact on the end result. So while we might now detect a page as "exclusive" although it isn't, it's not expected to make a big difference in common cases. (B) folio_likely_mapped_shared() returns "true" for a PTE-mapped THP folio_likely_mapped_shared() will never detect a large anonymous folio as shared although it is exclusive: there are no false positives. If we detect a THP as shared, at least one page of the THP is mapped by another process. It could well be that some pages are actually exclusive. For example, our child processes could have unmapped/COW'ed some pages such that they would now be exclusive to out process, which we now would treat as still-shared. Examples: (1) Parent maps all pages of a THP, child maps some pages. We detect all pages in the parent as shared although some are actually exclusive. (2) Parent maps all but some page of a THP, child maps the remainder. We detect all pages of the THP that the parent maps as shared although they are all exclusive. In (1) we wouldn't collapse a THP right now already: no PTE is writable, because a write fault would have resulted in COW of a single page and the parent would no longer map all pages of that THP. For (2) we would have collapsed a THP in the parent so far, now we wouldn't as long as the child process is still alive: unless the child process unmaps the remaining THP pages or we decide to split that THP. Possibly, the child COW'ed many pages, meaning that it's likely that we can populate a THP for our child first, and then for our parent. For (2), we are making really bad use of the THP in the first place (not even mapped completely in at least one process). If the THP would be completely partially mapped, it would be on the deferred split queue where we would split it lazily later. For short-running child processes, we don't particularly care. For long-running processes, the expectation is that such scenarios are rather rare: further, a THP might be best placed if most data in the PMD range is actually written, implying that we'll have to COW more pages first before khugepaged would collapse it. To summarize, in the common case, this change is not expected to matter much. The more common application of khugepaged operates on exclusive pages, either before fork() or after a child quit. Can we improve (A)? Yes, if we implement more precise tracking of "mapped shared" vs. "mapped exclusively", we could get rid of the false negatives completely. Can we improve (B)? We could count how many pages of a large folio we map inside the current page table and detect that we are responsible for most of the folio mapcount and conclude "as good as exclusive", which might help in some cases. ... but likely, some other mechanism should detect that the THP is not a good use in the scenario (not even mapped completely in a single process) and try splitting that folio lazily etc. We'll move the folio_test_anon() check before our "shared" check, so we might get more expressive results for SCAN_EXCEED_SHARED_PTE: this order of checks now matches the one in __collapse_huge_page_isolate(). Extend documentation. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: David Hildenbrand <david@redhat.com> --- How much time can one spend writing a patch description? Unbelievable. But it was likely time well spend to have a clear picture of the impact. This really needs the folio_likely_mapped_shared() optimization [1] that resides in mm-unstable, I think, to reduce "false negatives". The khugepage MM selftests keep working as expected, including: Run test: collapse_max_ptes_shared (khugepaged:anon) Allocate huge page... OK Share huge page over fork()... OK Trigger CoW on page 255 of 512... OK Maybe collapse with max_ptes_shared exceeded.... OK Trigger CoW on page 256 of 512... OK Collapse with max_ptes_shared PTEs shared.... OK Check if parent still has huge page... OK Where we check that collapsing in the parent behaves as expected after COWing a lot of pages in the parent: a sane scenario that is essentially unchanged and which does not depend on any action in the child process (compared to the cases discussed in (B) above). [1] https://lkml.kernel.org/r/20240409192301.907377-6-david@redhat.com --- Documentation/admin-guide/mm/transhuge.rst | 3 ++- mm/khugepaged.c | 22 +++++++++++++++------- 2 files changed, 17 insertions(+), 8 deletions(-)