Message ID | 20240411153232.169560-1-zi.yan@sent.com
---|---
State | New
Series | mm/rmap: do not add fully unmapped large folio to deferred split list
On 11.04.24 17:32, Zi Yan wrote: > From: Zi Yan <ziy@nvidia.com> > > In __folio_remove_rmap(), a large folio is added to deferred split list > if any page in a folio loses its final mapping. It is possible that > the folio is unmapped fully, but it is unnecessary to add the folio > to deferred split list at all. Fix it by checking folio mapcount before > adding a folio to deferred split list. > > Signed-off-by: Zi Yan <ziy@nvidia.com> > --- > mm/rmap.c | 9 ++++++--- > 1 file changed, 6 insertions(+), 3 deletions(-) > > diff --git a/mm/rmap.c b/mm/rmap.c > index 2608c40dffad..d599a772e282 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > enum rmap_level level) > { > atomic_t *mapped = &folio->_nr_pages_mapped; > - int last, nr = 0, nr_pmdmapped = 0; > + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; > enum node_stat_item idx; > > __folio_rmap_sanity_checks(folio, page, nr_pages, level); > @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > break; > } > > - atomic_sub(nr_pages, &folio->_large_mapcount); > + mapcount = atomic_sub_return(nr_pages, > + &folio->_large_mapcount) + 1; That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here. > do { > last = atomic_add_negative(-1, &page->_mapcount); > if (last) { > @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > * is still mapped. > */ > if (folio_test_large(folio) && folio_test_anon(folio)) > - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) > + if ((level == RMAP_LEVEL_PTE && > + mapcount != 0) || > + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)) > deferred_split_folio(folio); > } But I do wonder if we really care? Usually the folio will simply get freed afterwards, where we simply remove it from the list. If it's pinned, we won't be able to free or reclaim, but it's rather a corner case ... Is it really worth the added code? Not convinced.
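For readers following the thread, the condition change in the patch can be summarized with a small standalone model (plain C, not kernel code; the `RMAP_LEVEL_*` names and the 16-page 64K example below are only illustrative):

```c
/* Simplified userspace model of the deferred-split decision in
 * __folio_remove_rmap(); not kernel code, names are mocked for illustration. */
#include <stdbool.h>
#include <stdio.h>

enum rmap_level { RMAP_LEVEL_PTE, RMAP_LEVEL_PMD };

/* Old check: any PTE-level removal on a large anon folio queues it. */
static bool defer_split_old(enum rmap_level level, int nr, int nr_pmdmapped)
{
	return level == RMAP_LEVEL_PTE || nr < nr_pmdmapped;
}

/* New check (as proposed): only queue if some mapping remains. */
static bool defer_split_new(enum rmap_level level, int nr, int nr_pmdmapped,
			    int remaining_mapcount)
{
	return (level == RMAP_LEVEL_PTE && remaining_mapcount != 0) ||
	       (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped);
}

int main(void)
{
	/* Batch-unmapping all 16 PTEs of a fully mapped 64K mTHP (order-4,
	 * 16 pages): nr = 16, no PMD mapping, nothing left mapped. */
	printf("old: %d new: %d\n",
	       defer_split_old(RMAP_LEVEL_PTE, 16, 0),
	       defer_split_new(RMAP_LEVEL_PTE, 16, 0, 0));
	/* Partially unmapping 8 of the 16 PTEs: 8 mappings remain. */
	printf("old: %d new: %d\n",
	       defer_split_old(RMAP_LEVEL_PTE, 8, 0),
	       defer_split_new(RMAP_LEVEL_PTE, 8, 0, 8));
	return 0;
}
```

The first case is the one the patch changes: the old check queues a folio that is already fully unmapped, the new one does not.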
On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote: > > On 11.04.24 17:32, Zi Yan wrote: > > From: Zi Yan <ziy@nvidia.com> > > > > In __folio_remove_rmap(), a large folio is added to deferred split list > > if any page in a folio loses its final mapping. It is possible that > > the folio is unmapped fully, but it is unnecessary to add the folio > > to deferred split list at all. Fix it by checking folio mapcount before > > adding a folio to deferred split list. > > > > Signed-off-by: Zi Yan <ziy@nvidia.com> > > --- > > mm/rmap.c | 9 ++++++--- > > 1 file changed, 6 insertions(+), 3 deletions(-) > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > index 2608c40dffad..d599a772e282 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > > enum rmap_level level) > > { > > atomic_t *mapped = &folio->_nr_pages_mapped; > > - int last, nr = 0, nr_pmdmapped = 0; > > + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; > > enum node_stat_item idx; > > > > __folio_rmap_sanity_checks(folio, page, nr_pages, level); > > @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > > break; > > } > > > > - atomic_sub(nr_pages, &folio->_large_mapcount); > > + mapcount = atomic_sub_return(nr_pages, > > + &folio->_large_mapcount) + 1; > > That becomes a new memory barrier on some archs. Rather just re-read it > below. Re-reading should be fine here. > > > do { > > last = atomic_add_negative(-1, &page->_mapcount); > > if (last) { > > @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > > * is still mapped. > > */ > > if (folio_test_large(folio) && folio_test_anon(folio)) > > - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) > > + if ((level == RMAP_LEVEL_PTE && > > + mapcount != 0) || > > + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)) > > deferred_split_folio(folio); > > } > > But I do wonder if we really care? Usually the folio will simply get > freed afterwards, where we simply remove it from the list. > > If it's pinned, we won't be able to free or reclaim, but it's rather a > corner case ... > > Is it really worth the added code? Not convinced. It is actually not only an optimization, but also fixed the broken thp_deferred_split_page counter in /proc/vmstat. The counter actually counted the partially unmapped huge pages (so they are on deferred split queue), but it counts the fully unmapped mTHP as well now. For example, when a 64K THP is fully unmapped, the thp_deferred_split_page is not supposed to get inc'ed, but it does now. The counter is also useful for performance analysis, for example, whether a workload did a lot of partial unmap or not. So fixing the counter seems worthy. Zi Yan should have mentioned this in the commit log. > > -- > Cheers, > > David / dhildenb >
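As an aside, the counter Yang refers to lives in /proc/vmstat and can be sampled around a test step; a minimal sketch (the field name is taken from the discussion above, everything else is illustrative):

```c
/* Read the thp_deferred_split_page counter from /proc/vmstat.
 * Minimal sketch for observing the behaviour discussed above. */
#include <stdio.h>
#include <string.h>

static long read_deferred_split_count(void)
{
	char key[64];
	long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return -1;
	while (fscanf(f, "%63s %ld", key, &val) == 2) {
		if (!strcmp(key, "thp_deferred_split_page")) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return -1;
}

int main(void)
{
	long before = read_deferred_split_count();
	/* ... run the map/unmap step under test here ... */
	long after = read_deferred_split_count();

	printf("thp_deferred_split_page: %ld -> %ld\n", before, after);
	return 0;
}
```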
On 11.04.24 21:01, Yang Shi wrote: > On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 11.04.24 17:32, Zi Yan wrote: >>> From: Zi Yan <ziy@nvidia.com> >>> >>> In __folio_remove_rmap(), a large folio is added to deferred split list >>> if any page in a folio loses its final mapping. It is possible that >>> the folio is unmapped fully, but it is unnecessary to add the folio >>> to deferred split list at all. Fix it by checking folio mapcount before >>> adding a folio to deferred split list. >>> >>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>> --- >>> mm/rmap.c | 9 ++++++--- >>> 1 file changed, 6 insertions(+), 3 deletions(-) >>> >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index 2608c40dffad..d599a772e282 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>> enum rmap_level level) >>> { >>> atomic_t *mapped = &folio->_nr_pages_mapped; >>> - int last, nr = 0, nr_pmdmapped = 0; >>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; >>> enum node_stat_item idx; >>> >>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); >>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>> break; >>> } >>> >>> - atomic_sub(nr_pages, &folio->_large_mapcount); >>> + mapcount = atomic_sub_return(nr_pages, >>> + &folio->_large_mapcount) + 1; >> >> That becomes a new memory barrier on some archs. Rather just re-read it >> below. Re-reading should be fine here. >> >>> do { >>> last = atomic_add_negative(-1, &page->_mapcount); >>> if (last) { >>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>> * is still mapped. >>> */ >>> if (folio_test_large(folio) && folio_test_anon(folio)) >>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) >>> + if ((level == RMAP_LEVEL_PTE && >>> + mapcount != 0) || >>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)) >>> deferred_split_folio(folio); >>> } >> >> But I do wonder if we really care? Usually the folio will simply get >> freed afterwards, where we simply remove it from the list. >> >> If it's pinned, we won't be able to free or reclaim, but it's rather a >> corner case ... >> >> Is it really worth the added code? Not convinced. > > It is actually not only an optimization, but also fixed the broken > thp_deferred_split_page counter in /proc/vmstat. > > The counter actually counted the partially unmapped huge pages (so > they are on deferred split queue), but it counts the fully unmapped > mTHP as well now. For example, when a 64K THP is fully unmapped, the > thp_deferred_split_page is not supposed to get inc'ed, but it does > now. > > The counter is also useful for performance analysis, for example, > whether a workload did a lot of partial unmap or not. So fixing the > counter seems worthy. Zi Yan should have mentioned this in the commit > log. Yes, all that is information that is missing from the patch description. If it's a fix, there should be a "Fixes:". Likely we want to have a folio_large_mapcount() check in the code below. (I yet have to digest the condition where this happens -- can we have an example where we'd use to do the wrong thing and now would do the right thing as well?)
On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@redhat.com> wrote: > > On 11.04.24 21:01, Yang Shi wrote: > > On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 11.04.24 17:32, Zi Yan wrote: > >>> From: Zi Yan <ziy@nvidia.com> > >>> > >>> In __folio_remove_rmap(), a large folio is added to deferred split list > >>> if any page in a folio loses its final mapping. It is possible that > >>> the folio is unmapped fully, but it is unnecessary to add the folio > >>> to deferred split list at all. Fix it by checking folio mapcount before > >>> adding a folio to deferred split list. > >>> > >>> Signed-off-by: Zi Yan <ziy@nvidia.com> > >>> --- > >>> mm/rmap.c | 9 ++++++--- > >>> 1 file changed, 6 insertions(+), 3 deletions(-) > >>> > >>> diff --git a/mm/rmap.c b/mm/rmap.c > >>> index 2608c40dffad..d599a772e282 100644 > >>> --- a/mm/rmap.c > >>> +++ b/mm/rmap.c > >>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>> enum rmap_level level) > >>> { > >>> atomic_t *mapped = &folio->_nr_pages_mapped; > >>> - int last, nr = 0, nr_pmdmapped = 0; > >>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; > >>> enum node_stat_item idx; > >>> > >>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); > >>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>> break; > >>> } > >>> > >>> - atomic_sub(nr_pages, &folio->_large_mapcount); > >>> + mapcount = atomic_sub_return(nr_pages, > >>> + &folio->_large_mapcount) + 1; > >> > >> That becomes a new memory barrier on some archs. Rather just re-read it > >> below. Re-reading should be fine here. > >> > >>> do { > >>> last = atomic_add_negative(-1, &page->_mapcount); > >>> if (last) { > >>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>> * is still mapped. > >>> */ > >>> if (folio_test_large(folio) && folio_test_anon(folio)) > >>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) > >>> + if ((level == RMAP_LEVEL_PTE && > >>> + mapcount != 0) || > >>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)) > >>> deferred_split_folio(folio); > >>> } > >> > >> But I do wonder if we really care? Usually the folio will simply get > >> freed afterwards, where we simply remove it from the list. > >> > >> If it's pinned, we won't be able to free or reclaim, but it's rather a > >> corner case ... > >> > >> Is it really worth the added code? Not convinced. > > > > It is actually not only an optimization, but also fixed the broken > > thp_deferred_split_page counter in /proc/vmstat. > > > > The counter actually counted the partially unmapped huge pages (so > > they are on deferred split queue), but it counts the fully unmapped > > mTHP as well now. For example, when a 64K THP is fully unmapped, the > > thp_deferred_split_page is not supposed to get inc'ed, but it does > > now. > > > > The counter is also useful for performance analysis, for example, > > whether a workload did a lot of partial unmap or not. So fixing the > > counter seems worthy. Zi Yan should have mentioned this in the commit > > log. > > Yes, all that is information that is missing from the patch description. > If it's a fix, there should be a "Fixes:". > > Likely we want to have a folio_large_mapcount() check in the code below. > (I yet have to digest the condition where this happens -- can we have an > example where we'd use to do the wrong thing and now would do the right > thing as well?) 
For example, map 1G of memory with 64K mTHP, then unmap the whole 1G or some full 64K areas: you will see thp_deferred_split_page increase, but it shouldn't. It looks like __folio_remove_rmap() incorrectly detects whether the mTHP is fully or partially unmapped by comparing the number of still-mapped subpages to ENTIRELY_MAPPED, a check that only works for PMD-mappable THP. However, I just realized this problem was partly worked around by commit: commit 98046944a1597f3a02b792dbe9665e9943b77f28 Author: Baolin Wang <baolin.wang@linux.alibaba.com> Date: Fri Mar 29 14:59:33 2024 +0800 mm: huge_memory: add the missing folio_test_pmd_mappable() for THP split statistics Now the mTHP can also be split or added into the deferred list, so add folio_test_pmd_mappable() validation for PMD mapped THP, to avoid confusion with PMD mapped THP related statistics. Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> This commit stopped thp_deferred_split_page from counting mTHP; it also stopped thp_split_page from counting mTHP. However, Zi Yan's patch does make the code more robust, and we won't need to worry about the miscounting issue if we add deferred_split_page and split_page counters for mTHP in the future. > -- > Cheers, > > David / dhildenb >
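For completeness, the scenario Yang describes can be approximated from userspace roughly as follows (a sketch only; it assumes a kernel where 64K mTHP is enabled, e.g. through the transparent_hugepage sysfs knobs, and is meant to be paired with the /proc/vmstat sampling shown earlier):

```c
/* Sketch of the reproducer described above: map 1G anonymous memory,
 * ask for THP, fault it in, then unmap it completely. Assumes a kernel
 * with 64K mTHP enabled; pair it with a /proc/vmstat read before and
 * after the munmap(). */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ_1G (1UL << 30)

int main(void)
{
	char *p = mmap(NULL, SZ_1G, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Hint that (m)THP is welcome for this range. */
	if (madvise(p, SZ_1G, MADV_HUGEPAGE))
		perror("madvise");
	/* Fault in every page so large folios are actually allocated. */
	memset(p, 1, SZ_1G);
	/* Fully unmap: no folio here is partially unmapped, so
	 * thp_deferred_split_page should not increase. */
	munmap(p, SZ_1G);
	return 0;
}
```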
On 11 Apr 2024, at 17:59, Yang Shi wrote: > On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 11.04.24 21:01, Yang Shi wrote: >>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> On 11.04.24 17:32, Zi Yan wrote: >>>>> From: Zi Yan <ziy@nvidia.com> >>>>> >>>>> In __folio_remove_rmap(), a large folio is added to deferred split list >>>>> if any page in a folio loses its final mapping. It is possible that >>>>> the folio is unmapped fully, but it is unnecessary to add the folio >>>>> to deferred split list at all. Fix it by checking folio mapcount before >>>>> adding a folio to deferred split list. >>>>> >>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>>>> --- >>>>> mm/rmap.c | 9 ++++++--- >>>>> 1 file changed, 6 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>> index 2608c40dffad..d599a772e282 100644 >>>>> --- a/mm/rmap.c >>>>> +++ b/mm/rmap.c >>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>> enum rmap_level level) >>>>> { >>>>> atomic_t *mapped = &folio->_nr_pages_mapped; >>>>> - int last, nr = 0, nr_pmdmapped = 0; >>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; >>>>> enum node_stat_item idx; >>>>> >>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); >>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>> break; >>>>> } >>>>> >>>>> - atomic_sub(nr_pages, &folio->_large_mapcount); >>>>> + mapcount = atomic_sub_return(nr_pages, >>>>> + &folio->_large_mapcount) + 1; >>>> >>>> That becomes a new memory barrier on some archs. Rather just re-read it >>>> below. Re-reading should be fine here. >>>> >>>>> do { >>>>> last = atomic_add_negative(-1, &page->_mapcount); >>>>> if (last) { >>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>> * is still mapped. >>>>> */ >>>>> if (folio_test_large(folio) && folio_test_anon(folio)) >>>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) >>>>> + if ((level == RMAP_LEVEL_PTE && >>>>> + mapcount != 0) || >>>>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)) >>>>> deferred_split_folio(folio); >>>>> } >>>> >>>> But I do wonder if we really care? Usually the folio will simply get >>>> freed afterwards, where we simply remove it from the list. >>>> >>>> If it's pinned, we won't be able to free or reclaim, but it's rather a >>>> corner case ... >>>> >>>> Is it really worth the added code? Not convinced. >>> >>> It is actually not only an optimization, but also fixed the broken >>> thp_deferred_split_page counter in /proc/vmstat. >>> >>> The counter actually counted the partially unmapped huge pages (so >>> they are on deferred split queue), but it counts the fully unmapped >>> mTHP as well now. For example, when a 64K THP is fully unmapped, the >>> thp_deferred_split_page is not supposed to get inc'ed, but it does >>> now. >>> >>> The counter is also useful for performance analysis, for example, >>> whether a workload did a lot of partial unmap or not. So fixing the >>> counter seems worthy. Zi Yan should have mentioned this in the commit >>> log. >> >> Yes, all that is information that is missing from the patch description. >> If it's a fix, there should be a "Fixes:". >> >> Likely we want to have a folio_large_mapcount() check in the code below. 
>> (I yet have to digest the condition where this happens -- can we have an >> example where we'd use to do the wrong thing and now would do the right >> thing as well?) > > For example, map 1G memory with 64K mTHP, then unmap the whole 1G or > some full 64K areas, you will see thp_deferred_split_page increased, > but it shouldn't. > > It looks __folio_remove_rmap() incorrectly detected whether the mTHP > is fully unmapped or partially unmapped by comparing the number of > still-mapped subpages to ENTIRELY_MAPPED, which should just work for > PMD-mappable THP. > > However I just realized this problem was kind of workaround'ed by commit: > > commit 98046944a1597f3a02b792dbe9665e9943b77f28 > Author: Baolin Wang <baolin.wang@linux.alibaba.com> > Date: Fri Mar 29 14:59:33 2024 +0800 > > mm: huge_memory: add the missing folio_test_pmd_mappable() for THP > split statistics > > Now the mTHP can also be split or added into the deferred list, so add > folio_test_pmd_mappable() validation for PMD mapped THP, to avoid > confusion with PMD mapped THP related statistics. > > Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> > Acked-by: David Hildenbrand <david@redhat.com> > Cc: Muchun Song <muchun.song@linux.dev> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > > This commit made thp_deferred_split_page didn't count mTHP anymore, it > also made thp_split_page didn't count mTHP anymore. > > However Zi Yan's patch does make the code more robust and we don't > need to worry about the miscounting issue anymore if we will add > deferred_split_page and split_page counters for mTHP in the future. Actually, the patch above does not fix everything. A fully unmapped PTE-mapped order-9 THP is also added to deferred split list and counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512 (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio() the order-9 folio is folio_test_pmd_mappable(). I will add this information in the next version. -- Best Regards, Yan, Zi
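To make the remaining gap concrete, here is a rough model of the counting behaviour after commit 98046944: the folio is still queued either way, but the event is only counted for PMD-mappable folios (PMD order assumed to be 9 here; illustration only, not kernel code):

```c
/* Rough model of deferred_split_folio() after commit 98046944: every
 * eligible folio is queued, but the THP_DEFERRED_SPLIT_PAGE event is
 * only counted for PMD-mappable (order-9 here) folios. Illustration
 * only; the real function takes a struct folio. */
#include <stdbool.h>
#include <stdio.h>

#define PMD_ORDER 9

static long thp_deferred_split_page;

static void deferred_split_folio_model(int order, bool *queued)
{
	*queued = true;			/* still added to the list */
	if (order == PMD_ORDER)		/* counter gated on PMD-mappable */
		thp_deferred_split_page++;
}

int main(void)
{
	bool queued;

	/* Fully unmapped PTE-mapped order-9 THP: the old condition in
	 * __folio_remove_rmap() still calls deferred_split_folio(), so it
	 * is both queued and counted, although it should be neither. */
	deferred_split_folio_model(9, &queued);
	printf("order-9: queued=%d count=%ld\n", queued, thp_deferred_split_page);

	/* Fully unmapped order-4 (64K) mTHP: still queued, but the counter
	 * is no longer bumped after commit 98046944. */
	deferred_split_folio_model(4, &queued);
	printf("order-4: queued=%d count=%ld\n", queued, thp_deferred_split_page);
	return 0;
}
```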
On 12 Apr 2024, at 10:21, Zi Yan wrote: > On 11 Apr 2024, at 17:59, Yang Shi wrote: > >> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@redhat.com> wrote: >>> >>> On 11.04.24 21:01, Yang Shi wrote: >>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote: >>>>> >>>>> On 11.04.24 17:32, Zi Yan wrote: >>>>>> From: Zi Yan <ziy@nvidia.com> >>>>>> >>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list >>>>>> if any page in a folio loses its final mapping. It is possible that >>>>>> the folio is unmapped fully, but it is unnecessary to add the folio >>>>>> to deferred split list at all. Fix it by checking folio mapcount before >>>>>> adding a folio to deferred split list. >>>>>> >>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>>>>> --- >>>>>> mm/rmap.c | 9 ++++++--- >>>>>> 1 file changed, 6 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>> index 2608c40dffad..d599a772e282 100644 >>>>>> --- a/mm/rmap.c >>>>>> +++ b/mm/rmap.c >>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>>> enum rmap_level level) >>>>>> { >>>>>> atomic_t *mapped = &folio->_nr_pages_mapped; >>>>>> - int last, nr = 0, nr_pmdmapped = 0; >>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; >>>>>> enum node_stat_item idx; >>>>>> >>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); >>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>>> break; >>>>>> } >>>>>> >>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount); >>>>>> + mapcount = atomic_sub_return(nr_pages, >>>>>> + &folio->_large_mapcount) + 1; >>>>> >>>>> That becomes a new memory barrier on some archs. Rather just re-read it >>>>> below. Re-reading should be fine here. >>>>> >>>>>> do { >>>>>> last = atomic_add_negative(-1, &page->_mapcount); >>>>>> if (last) { >>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>>> * is still mapped. >>>>>> */ >>>>>> if (folio_test_large(folio) && folio_test_anon(folio)) >>>>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) >>>>>> + if ((level == RMAP_LEVEL_PTE && >>>>>> + mapcount != 0) || >>>>>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)) >>>>>> deferred_split_folio(folio); >>>>>> } >>>>> >>>>> But I do wonder if we really care? Usually the folio will simply get >>>>> freed afterwards, where we simply remove it from the list. >>>>> >>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a >>>>> corner case ... >>>>> >>>>> Is it really worth the added code? Not convinced. >>>> >>>> It is actually not only an optimization, but also fixed the broken >>>> thp_deferred_split_page counter in /proc/vmstat. >>>> >>>> The counter actually counted the partially unmapped huge pages (so >>>> they are on deferred split queue), but it counts the fully unmapped >>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the >>>> thp_deferred_split_page is not supposed to get inc'ed, but it does >>>> now. >>>> >>>> The counter is also useful for performance analysis, for example, >>>> whether a workload did a lot of partial unmap or not. So fixing the >>>> counter seems worthy. Zi Yan should have mentioned this in the commit >>>> log. >>> >>> Yes, all that is information that is missing from the patch description. >>> If it's a fix, there should be a "Fixes:". >>> >>> Likely we want to have a folio_large_mapcount() check in the code below. 
>>> (I yet have to digest the condition where this happens -- can we have an >>> example where we'd use to do the wrong thing and now would do the right >>> thing as well?) >> >> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or >> some full 64K areas, you will see thp_deferred_split_page increased, >> but it shouldn't. >> >> It looks __folio_remove_rmap() incorrectly detected whether the mTHP >> is fully unmapped or partially unmapped by comparing the number of >> still-mapped subpages to ENTIRELY_MAPPED, which should just work for >> PMD-mappable THP. >> >> However I just realized this problem was kind of workaround'ed by commit: >> >> commit 98046944a1597f3a02b792dbe9665e9943b77f28 >> Author: Baolin Wang <baolin.wang@linux.alibaba.com> >> Date: Fri Mar 29 14:59:33 2024 +0800 >> >> mm: huge_memory: add the missing folio_test_pmd_mappable() for THP >> split statistics >> >> Now the mTHP can also be split or added into the deferred list, so add >> folio_test_pmd_mappable() validation for PMD mapped THP, to avoid >> confusion with PMD mapped THP related statistics. >> >> Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >> Acked-by: David Hildenbrand <david@redhat.com> >> Cc: Muchun Song <muchun.song@linux.dev> >> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> >> >> This commit made thp_deferred_split_page didn't count mTHP anymore, it >> also made thp_split_page didn't count mTHP anymore. >> >> However Zi Yan's patch does make the code more robust and we don't >> need to worry about the miscounting issue anymore if we will add >> deferred_split_page and split_page counters for mTHP in the future. > > Actually, the patch above does not fix everything. A fully unmapped > PTE-mapped order-9 THP is also added to deferred split list and > counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512 > (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio() > the order-9 folio is folio_test_pmd_mappable(). > > I will add this information in the next version. It might Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"), but before this commit fully unmapping a PTE-mapped order-9 THP still increased THP_DEFERRED_SPLIT_PAGE, because PTEs are unmapped individually and first PTE unmapping adds the THP into the deferred split list. This means commit b06dc281aa99 did not change anything and before that THP_DEFERRED_SPLIT_PAGE increase is due to implementation. I will add this to the commit log as well without Fixes tag. -- Best Regards, Yan, Zi
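A tiny model of the per-PTE removal path Zi Yan describes (and of the smaller-chunk scenario raised later in the thread): when 512 PTE mappings are dropped one at a time, the folio is queued on the very first removal, with or without the proposed mapcount check, which is why the counter bump predates the batching commit. Illustration only, not kernel code:

```c
/* Model of removing 512 PTE mappings one at a time (the pre-batching
 * behaviour, and still what happens when a folio is unmapped in smaller
 * chunks): even with the proposed "mapcount != 0" check, the folio is
 * queued on the first removal because 511 mappings remain at that point. */
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	int mapcount = 512;	/* fully PTE-mapped order-9 THP */
	int queued_at = -1;

	for (int i = 0; i < 512; i++) {
		mapcount--;				/* one PTE removed */
		bool queue = (mapcount != 0);		/* proposed PTE-level check */

		if (queue && queued_at < 0)
			queued_at = i;			/* deferred_split_folio() */
	}
	printf("queued at removal %d, final mapcount %d\n", queued_at, mapcount);
	return 0;
}
```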
On 11 Apr 2024, at 11:46, David Hildenbrand wrote: > On 11.04.24 17:32, Zi Yan wrote: >> From: Zi Yan <ziy@nvidia.com> >> >> In __folio_remove_rmap(), a large folio is added to deferred split list >> if any page in a folio loses its final mapping. It is possible that >> the folio is unmapped fully, but it is unnecessary to add the folio >> to deferred split list at all. Fix it by checking folio mapcount before >> adding a folio to deferred split list. >> >> Signed-off-by: Zi Yan <ziy@nvidia.com> >> --- >> mm/rmap.c | 9 ++++++--- >> 1 file changed, 6 insertions(+), 3 deletions(-) >> >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 2608c40dffad..d599a772e282 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >> enum rmap_level level) >> { >> atomic_t *mapped = &folio->_nr_pages_mapped; >> - int last, nr = 0, nr_pmdmapped = 0; >> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; >> enum node_stat_item idx; >> __folio_rmap_sanity_checks(folio, page, nr_pages, level); >> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >> break; >> } >> - atomic_sub(nr_pages, &folio->_large_mapcount); >> + mapcount = atomic_sub_return(nr_pages, >> + &folio->_large_mapcount) + 1; > > That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here. Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped) below, but to save an atomic op, I chose to read mapcount here. -- Best Regards, Yan, Zi
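For background on this exchange: a value-returning atomic RMW implies a full barrier on some architectures, whereas a plain atomic_read() is essentially READ_ONCE(). A loose userspace C11 analogy of the two shapes being discussed (the kernel atomic_t API has different rules, so this is only an analogy, not kernel semantics):

```c
/* Loose C11 analogy of the two options discussed above: an RMW that
 * returns the new value versus a plain RMW followed by a separate read. */
#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
	int nr_pages = 16;
	/* biased like _large_mapcount: stored value is real mapcount - 1 */
	atomic_int mc_a = 15, mc_b = 15;

	/* Option A: value-returning RMW (fetch_sub returns the old value,
	 * so the remaining real mapcount is old - nr_pages + 1). */
	int mapcount_a = atomic_fetch_sub(&mc_a, nr_pages) - nr_pages + 1;

	/* Option B: fire-and-forget subtract, then re-read the counter. */
	atomic_fetch_sub_explicit(&mc_b, nr_pages, memory_order_relaxed);
	int mapcount_b = atomic_load_explicit(&mc_b, memory_order_relaxed) + 1;

	/* Both print 0 here: the folio is fully unmapped. */
	printf("A=%d B=%d\n", mapcount_a, mapcount_b);
	return 0;
}
```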
On Fri, Apr 12, 2024 at 7:31 AM Zi Yan <ziy@nvidia.com> wrote: > > On 12 Apr 2024, at 10:21, Zi Yan wrote: > > > On 11 Apr 2024, at 17:59, Yang Shi wrote: > > > >> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@redhat.com> wrote: > >>> > >>> On 11.04.24 21:01, Yang Shi wrote: > >>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote: > >>>>> > >>>>> On 11.04.24 17:32, Zi Yan wrote: > >>>>>> From: Zi Yan <ziy@nvidia.com> > >>>>>> > >>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list > >>>>>> if any page in a folio loses its final mapping. It is possible that > >>>>>> the folio is unmapped fully, but it is unnecessary to add the folio > >>>>>> to deferred split list at all. Fix it by checking folio mapcount before > >>>>>> adding a folio to deferred split list. > >>>>>> > >>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> > >>>>>> --- > >>>>>> mm/rmap.c | 9 ++++++--- > >>>>>> 1 file changed, 6 insertions(+), 3 deletions(-) > >>>>>> > >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c > >>>>>> index 2608c40dffad..d599a772e282 100644 > >>>>>> --- a/mm/rmap.c > >>>>>> +++ b/mm/rmap.c > >>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>>>>> enum rmap_level level) > >>>>>> { > >>>>>> atomic_t *mapped = &folio->_nr_pages_mapped; > >>>>>> - int last, nr = 0, nr_pmdmapped = 0; > >>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; > >>>>>> enum node_stat_item idx; > >>>>>> > >>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); > >>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>>>>> break; > >>>>>> } > >>>>>> > >>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount); > >>>>>> + mapcount = atomic_sub_return(nr_pages, > >>>>>> + &folio->_large_mapcount) + 1; > >>>>> > >>>>> That becomes a new memory barrier on some archs. Rather just re-read it > >>>>> below. Re-reading should be fine here. > >>>>> > >>>>>> do { > >>>>>> last = atomic_add_negative(-1, &page->_mapcount); > >>>>>> if (last) { > >>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>>>>> * is still mapped. > >>>>>> */ > >>>>>> if (folio_test_large(folio) && folio_test_anon(folio)) > >>>>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) > >>>>>> + if ((level == RMAP_LEVEL_PTE && > >>>>>> + mapcount != 0) || > >>>>>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)) > >>>>>> deferred_split_folio(folio); > >>>>>> } > >>>>> > >>>>> But I do wonder if we really care? Usually the folio will simply get > >>>>> freed afterwards, where we simply remove it from the list. > >>>>> > >>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a > >>>>> corner case ... > >>>>> > >>>>> Is it really worth the added code? Not convinced. > >>>> > >>>> It is actually not only an optimization, but also fixed the broken > >>>> thp_deferred_split_page counter in /proc/vmstat. > >>>> > >>>> The counter actually counted the partially unmapped huge pages (so > >>>> they are on deferred split queue), but it counts the fully unmapped > >>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the > >>>> thp_deferred_split_page is not supposed to get inc'ed, but it does > >>>> now. > >>>> > >>>> The counter is also useful for performance analysis, for example, > >>>> whether a workload did a lot of partial unmap or not. So fixing the > >>>> counter seems worthy. Zi Yan should have mentioned this in the commit > >>>> log. 
> >>> > >>> Yes, all that is information that is missing from the patch description. > >>> If it's a fix, there should be a "Fixes:". > >>> > >>> Likely we want to have a folio_large_mapcount() check in the code below. > >>> (I yet have to digest the condition where this happens -- can we have an > >>> example where we'd use to do the wrong thing and now would do the right > >>> thing as well?) > >> > >> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or > >> some full 64K areas, you will see thp_deferred_split_page increased, > >> but it shouldn't. > >> > >> It looks __folio_remove_rmap() incorrectly detected whether the mTHP > >> is fully unmapped or partially unmapped by comparing the number of > >> still-mapped subpages to ENTIRELY_MAPPED, which should just work for > >> PMD-mappable THP. > >> > >> However I just realized this problem was kind of workaround'ed by commit: > >> > >> commit 98046944a1597f3a02b792dbe9665e9943b77f28 > >> Author: Baolin Wang <baolin.wang@linux.alibaba.com> > >> Date: Fri Mar 29 14:59:33 2024 +0800 > >> > >> mm: huge_memory: add the missing folio_test_pmd_mappable() for THP > >> split statistics > >> > >> Now the mTHP can also be split or added into the deferred list, so add > >> folio_test_pmd_mappable() validation for PMD mapped THP, to avoid > >> confusion with PMD mapped THP related statistics. > >> > >> Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com > >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> > >> Acked-by: David Hildenbrand <david@redhat.com> > >> Cc: Muchun Song <muchun.song@linux.dev> > >> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > >> > >> This commit made thp_deferred_split_page didn't count mTHP anymore, it > >> also made thp_split_page didn't count mTHP anymore. > >> > >> However Zi Yan's patch does make the code more robust and we don't > >> need to worry about the miscounting issue anymore if we will add > >> deferred_split_page and split_page counters for mTHP in the future. > > > > Actually, the patch above does not fix everything. A fully unmapped > > PTE-mapped order-9 THP is also added to deferred split list and > > counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512 > > (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio() > > the order-9 folio is folio_test_pmd_mappable(). > > > > I will add this information in the next version. > > It might > Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"), > but before this commit fully unmapping a PTE-mapped order-9 THP still increased > THP_DEFERRED_SPLIT_PAGE, because PTEs are unmapped individually and first PTE > unmapping adds the THP into the deferred split list. This means commit b06dc281aa99 > did not change anything and before that THP_DEFERRED_SPLIT_PAGE increase is > due to implementation. I will add this to the commit log as well without Fixes > tag. Thanks for digging deeper. The problem may be not that obvious before mTHP because PMD-mappable THP is converted to PTE-mapped due to partial unmap in most cases. But mTHP is always PTE-mapped in the first place. The other reason is batched rmap remove was not supported before David's optimization. Now we do have reasonable motivation to make it precise and it is also easier to do so than before. > > > -- > Best Regards, > Yan, Zi
On 12.04.24 16:31, Zi Yan wrote: > On 12 Apr 2024, at 10:21, Zi Yan wrote: > >> On 11 Apr 2024, at 17:59, Yang Shi wrote: >> >>> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> On 11.04.24 21:01, Yang Shi wrote: >>>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote: >>>>>> >>>>>> On 11.04.24 17:32, Zi Yan wrote: >>>>>>> From: Zi Yan <ziy@nvidia.com> >>>>>>> >>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list >>>>>>> if any page in a folio loses its final mapping. It is possible that >>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio >>>>>>> to deferred split list at all. Fix it by checking folio mapcount before >>>>>>> adding a folio to deferred split list. >>>>>>> >>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>>>>>> --- >>>>>>> mm/rmap.c | 9 ++++++--- >>>>>>> 1 file changed, 6 insertions(+), 3 deletions(-) >>>>>>> >>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>>> index 2608c40dffad..d599a772e282 100644 >>>>>>> --- a/mm/rmap.c >>>>>>> +++ b/mm/rmap.c >>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>>>> enum rmap_level level) >>>>>>> { >>>>>>> atomic_t *mapped = &folio->_nr_pages_mapped; >>>>>>> - int last, nr = 0, nr_pmdmapped = 0; >>>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; >>>>>>> enum node_stat_item idx; >>>>>>> >>>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); >>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>>>> break; >>>>>>> } >>>>>>> >>>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount); >>>>>>> + mapcount = atomic_sub_return(nr_pages, >>>>>>> + &folio->_large_mapcount) + 1; >>>>>> >>>>>> That becomes a new memory barrier on some archs. Rather just re-read it >>>>>> below. Re-reading should be fine here. >>>>>> >>>>>>> do { >>>>>>> last = atomic_add_negative(-1, &page->_mapcount); >>>>>>> if (last) { >>>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>>>> * is still mapped. >>>>>>> */ >>>>>>> if (folio_test_large(folio) && folio_test_anon(folio)) >>>>>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) >>>>>>> + if ((level == RMAP_LEVEL_PTE && >>>>>>> + mapcount != 0) || >>>>>>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)) >>>>>>> deferred_split_folio(folio); >>>>>>> } >>>>>> >>>>>> But I do wonder if we really care? Usually the folio will simply get >>>>>> freed afterwards, where we simply remove it from the list. >>>>>> >>>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a >>>>>> corner case ... >>>>>> >>>>>> Is it really worth the added code? Not convinced. >>>>> >>>>> It is actually not only an optimization, but also fixed the broken >>>>> thp_deferred_split_page counter in /proc/vmstat. >>>>> >>>>> The counter actually counted the partially unmapped huge pages (so >>>>> they are on deferred split queue), but it counts the fully unmapped >>>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the >>>>> thp_deferred_split_page is not supposed to get inc'ed, but it does >>>>> now. >>>>> >>>>> The counter is also useful for performance analysis, for example, >>>>> whether a workload did a lot of partial unmap or not. So fixing the >>>>> counter seems worthy. Zi Yan should have mentioned this in the commit >>>>> log. >>>> >>>> Yes, all that is information that is missing from the patch description. 
>>>> If it's a fix, there should be a "Fixes:". >>>> >>>> Likely we want to have a folio_large_mapcount() check in the code below. >>>> (I yet have to digest the condition where this happens -- can we have an >>>> example where we'd use to do the wrong thing and now would do the right >>>> thing as well?) >>> >>> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or >>> some full 64K areas, you will see thp_deferred_split_page increased, >>> but it shouldn't. >>> >>> It looks __folio_remove_rmap() incorrectly detected whether the mTHP >>> is fully unmapped or partially unmapped by comparing the number of >>> still-mapped subpages to ENTIRELY_MAPPED, which should just work for >>> PMD-mappable THP. >>> >>> However I just realized this problem was kind of workaround'ed by commit: >>> >>> commit 98046944a1597f3a02b792dbe9665e9943b77f28 >>> Author: Baolin Wang <baolin.wang@linux.alibaba.com> >>> Date: Fri Mar 29 14:59:33 2024 +0800 >>> >>> mm: huge_memory: add the missing folio_test_pmd_mappable() for THP >>> split statistics >>> >>> Now the mTHP can also be split or added into the deferred list, so add >>> folio_test_pmd_mappable() validation for PMD mapped THP, to avoid >>> confusion with PMD mapped THP related statistics. >>> >>> Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com >>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>> Acked-by: David Hildenbrand <david@redhat.com> >>> Cc: Muchun Song <muchun.song@linux.dev> >>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> >>> >>> This commit made thp_deferred_split_page didn't count mTHP anymore, it >>> also made thp_split_page didn't count mTHP anymore. >>> >>> However Zi Yan's patch does make the code more robust and we don't >>> need to worry about the miscounting issue anymore if we will add >>> deferred_split_page and split_page counters for mTHP in the future. >> >> Actually, the patch above does not fix everything. A fully unmapped >> PTE-mapped order-9 THP is also added to deferred split list and >> counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512 >> (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio() >> the order-9 folio is folio_test_pmd_mappable(). >> >> I will add this information in the next version. > > It might > Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"), > but before this commit fully unmapping a PTE-mapped order-9 THP still increased > THP_DEFERRED_SPLIT_PAGE, because PTEs are unmapped individually and first PTE > unmapping adds the THP into the deferred split list. This means commit b06dc281aa99 > did not change anything and before that THP_DEFERRED_SPLIT_PAGE increase is > due to implementation. I will add this to the commit log as well without Fixes > tag. Right, so it's always been a problem for PTE-mapped PMD-sized THP and only with batching we can now do "better". But not fix it completely. I'll reply separately to your other mail.
On 12.04.24 16:35, Zi Yan wrote: > On 11 Apr 2024, at 11:46, David Hildenbrand wrote: > >> On 11.04.24 17:32, Zi Yan wrote: >>> From: Zi Yan <ziy@nvidia.com> >>> >>> In __folio_remove_rmap(), a large folio is added to deferred split list >>> if any page in a folio loses its final mapping. It is possible that >>> the folio is unmapped fully, but it is unnecessary to add the folio >>> to deferred split list at all. Fix it by checking folio mapcount before >>> adding a folio to deferred split list. >>> >>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>> --- >>> mm/rmap.c | 9 ++++++--- >>> 1 file changed, 6 insertions(+), 3 deletions(-) >>> >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index 2608c40dffad..d599a772e282 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>> enum rmap_level level) >>> { >>> atomic_t *mapped = &folio->_nr_pages_mapped; >>> - int last, nr = 0, nr_pmdmapped = 0; >>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; >>> enum node_stat_item idx; >>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); >>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>> break; >>> } >>> - atomic_sub(nr_pages, &folio->_large_mapcount); >>> + mapcount = atomic_sub_return(nr_pages, >>> + &folio->_large_mapcount) + 1; >> >> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here. > > Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped) > below, but to save an atomic op, I chose to read mapcount here. Some points: (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic RMW that return a value -- and how they interact with memory barriers. Further, how relaxed variants are only optimized on some architectures. atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory access that should not be refetched. Usually cheaper than most other stuff that involves atomics. (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped) to figure out if the folio is now completely unmapped. (3) There is one fundamental issue: if we are not batch-unmapping the whole thing, we will still add the folios to the deferred split queue. Migration would still do that, or if there are multiple VMAs covering a folio. (4) We should really avoid making common operations slower only to make some unreliable stats less unreliable. We should likely do something like the following, which might even be a bit faster in some cases because we avoid a function call in case we unmap individual PTEs by checking _deferred_list ahead of time diff --git a/mm/rmap.c b/mm/rmap.c index 2608c40dffad..356598b3dc3c 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, * page of the folio is unmapped and at least one page * is still mapped. */ - if (folio_test_large(folio) && folio_test_anon(folio)) - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) - deferred_split_folio(folio); + if (folio_test_large(folio) && folio_test_anon(folio) && + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) && + atomic_read(mapped) && + data_race(list_empty(&folio->_deferred_list))) + deferred_split_folio(folio); } I also thought about handling the scenario where we unmap the whole think in smaller chunks. 
We could detect "!atomic_read(mapped)" and detect that it is on the deferred split list, and simply remove it from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event. But it would be racy with concurrent remapping of the folio (might happen with anon folios in corner cases I guess). What we can do is the following, though: diff --git a/mm/huge_memory.c b/mm/huge_memory.c index dc30139590e6..f05cba1807f2 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio) ds_queue = get_deferred_split_queue(folio); spin_lock_irqsave(&ds_queue->split_queue_lock, flags); if (!list_empty(&folio->_deferred_list)) { + if (folio_test_pmd_mappable(folio)) + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE); ds_queue->split_queue_len--; list_del_init(&folio->_deferred_list); } Adding the right event of course. Then it's easy to filter out these "temporarily added to the list, but never split before the folio was freed" cases.
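Putting David's proposed check into a standalone form, a userspace model of the combined condition (mocked names, illustration only):

```c
/* Userspace model of the combined check sketched above: skip the
 * deferred_split_folio() call unless the folio is a large anon folio,
 * the removal was PTE-level (or a PMD mapping went away while pages
 * stayed mapped), some pages are in fact still mapped, and the folio
 * is not already on the deferred list. */
#include <stdbool.h>
#include <stdio.h>

enum rmap_level { RMAP_LEVEL_PTE, RMAP_LEVEL_PMD };

static bool defer_split_proposed(bool large, bool anon, enum rmap_level level,
				 int nr, int nr_pmdmapped,
				 int nr_pages_still_mapped, bool on_deferred_list)
{
	return large && anon &&
	       (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
	       nr_pages_still_mapped != 0 &&
	       !on_deferred_list;
}

int main(void)
{
	/* Batch-unmapping a fully mapped 64K mTHP: nothing left mapped,
	 * so the call is skipped. */
	printf("%d\n", defer_split_proposed(true, true, RMAP_LEVEL_PTE,
					    16, 0, 0, false));
	/* Partial unmap: still mapped and not yet queued -> queue it. */
	printf("%d\n", defer_split_proposed(true, true, RMAP_LEVEL_PTE,
					    8, 0, 8, false));
	/* Already queued by an earlier partial unmap -> skip the call. */
	printf("%d\n", defer_split_proposed(true, true, RMAP_LEVEL_PTE,
					    4, 0, 4, true));
	return 0;
}
```

The last case also shows why checking `_deferred_list` ahead of time can avoid a function call entirely when individual PTEs are unmapped repeatedly.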
On 12.04.24 20:29, Yang Shi wrote: > On Fri, Apr 12, 2024 at 7:31 AM Zi Yan <ziy@nvidia.com> wrote: >> >> On 12 Apr 2024, at 10:21, Zi Yan wrote: >> >>> On 11 Apr 2024, at 17:59, Yang Shi wrote: >>> >>>> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@redhat.com> wrote: >>>>> >>>>> On 11.04.24 21:01, Yang Shi wrote: >>>>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote: >>>>>>> >>>>>>> On 11.04.24 17:32, Zi Yan wrote: >>>>>>>> From: Zi Yan <ziy@nvidia.com> >>>>>>>> >>>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list >>>>>>>> if any page in a folio loses its final mapping. It is possible that >>>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio >>>>>>>> to deferred split list at all. Fix it by checking folio mapcount before >>>>>>>> adding a folio to deferred split list. >>>>>>>> >>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>>>>>>> --- >>>>>>>> mm/rmap.c | 9 ++++++--- >>>>>>>> 1 file changed, 6 insertions(+), 3 deletions(-) >>>>>>>> >>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>>>> index 2608c40dffad..d599a772e282 100644 >>>>>>>> --- a/mm/rmap.c >>>>>>>> +++ b/mm/rmap.c >>>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>>>>> enum rmap_level level) >>>>>>>> { >>>>>>>> atomic_t *mapped = &folio->_nr_pages_mapped; >>>>>>>> - int last, nr = 0, nr_pmdmapped = 0; >>>>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; >>>>>>>> enum node_stat_item idx; >>>>>>>> >>>>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); >>>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>>>>> break; >>>>>>>> } >>>>>>>> >>>>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount); >>>>>>>> + mapcount = atomic_sub_return(nr_pages, >>>>>>>> + &folio->_large_mapcount) + 1; >>>>>>> >>>>>>> That becomes a new memory barrier on some archs. Rather just re-read it >>>>>>> below. Re-reading should be fine here. >>>>>>> >>>>>>>> do { >>>>>>>> last = atomic_add_negative(-1, &page->_mapcount); >>>>>>>> if (last) { >>>>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>>>>> * is still mapped. >>>>>>>> */ >>>>>>>> if (folio_test_large(folio) && folio_test_anon(folio)) >>>>>>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) >>>>>>>> + if ((level == RMAP_LEVEL_PTE && >>>>>>>> + mapcount != 0) || >>>>>>>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)) >>>>>>>> deferred_split_folio(folio); >>>>>>>> } >>>>>>> >>>>>>> But I do wonder if we really care? Usually the folio will simply get >>>>>>> freed afterwards, where we simply remove it from the list. >>>>>>> >>>>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a >>>>>>> corner case ... >>>>>>> >>>>>>> Is it really worth the added code? Not convinced. >>>>>> >>>>>> It is actually not only an optimization, but also fixed the broken >>>>>> thp_deferred_split_page counter in /proc/vmstat. >>>>>> >>>>>> The counter actually counted the partially unmapped huge pages (so >>>>>> they are on deferred split queue), but it counts the fully unmapped >>>>>> mTHP as well now. For example, when a 64K THP is fully unmapped, the >>>>>> thp_deferred_split_page is not supposed to get inc'ed, but it does >>>>>> now. >>>>>> >>>>>> The counter is also useful for performance analysis, for example, >>>>>> whether a workload did a lot of partial unmap or not. So fixing the >>>>>> counter seems worthy. 
Zi Yan should have mentioned this in the commit >>>>>> log. >>>>> >>>>> Yes, all that is information that is missing from the patch description. >>>>> If it's a fix, there should be a "Fixes:". >>>>> >>>>> Likely we want to have a folio_large_mapcount() check in the code below. >>>>> (I yet have to digest the condition where this happens -- can we have an >>>>> example where we'd use to do the wrong thing and now would do the right >>>>> thing as well?) >>>> >>>> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or >>>> some full 64K areas, you will see thp_deferred_split_page increased, >>>> but it shouldn't. >>>> >>>> It looks __folio_remove_rmap() incorrectly detected whether the mTHP >>>> is fully unmapped or partially unmapped by comparing the number of >>>> still-mapped subpages to ENTIRELY_MAPPED, which should just work for >>>> PMD-mappable THP. >>>> >>>> However I just realized this problem was kind of workaround'ed by commit: >>>> >>>> commit 98046944a1597f3a02b792dbe9665e9943b77f28 >>>> Author: Baolin Wang <baolin.wang@linux.alibaba.com> >>>> Date: Fri Mar 29 14:59:33 2024 +0800 >>>> >>>> mm: huge_memory: add the missing folio_test_pmd_mappable() for THP >>>> split statistics >>>> >>>> Now the mTHP can also be split or added into the deferred list, so add >>>> folio_test_pmd_mappable() validation for PMD mapped THP, to avoid >>>> confusion with PMD mapped THP related statistics. >>>> >>>> Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com >>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>>> Acked-by: David Hildenbrand <david@redhat.com> >>>> Cc: Muchun Song <muchun.song@linux.dev> >>>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> >>>> >>>> This commit made thp_deferred_split_page didn't count mTHP anymore, it >>>> also made thp_split_page didn't count mTHP anymore. >>>> >>>> However Zi Yan's patch does make the code more robust and we don't >>>> need to worry about the miscounting issue anymore if we will add >>>> deferred_split_page and split_page counters for mTHP in the future. >>> >>> Actually, the patch above does not fix everything. A fully unmapped >>> PTE-mapped order-9 THP is also added to deferred split list and >>> counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512 >>> (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio() >>> the order-9 folio is folio_test_pmd_mappable(). >>> >>> I will add this information in the next version. >> >> It might >> Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"), >> but before this commit fully unmapping a PTE-mapped order-9 THP still increased >> THP_DEFERRED_SPLIT_PAGE, because PTEs are unmapped individually and first PTE >> unmapping adds the THP into the deferred split list. This means commit b06dc281aa99 >> did not change anything and before that THP_DEFERRED_SPLIT_PAGE increase is >> due to implementation. I will add this to the commit log as well without Fixes >> tag. > > Thanks for digging deeper. The problem may be not that obvious before > mTHP because PMD-mappable THP is converted to PTE-mapped due to > partial unmap in most cases. But mTHP is always PTE-mapped in the > first place. The other reason is batched rmap remove was not supported > before David's optimization. Yes. > > Now we do have reasonable motivation to make it precise and it is also > easier to do so than before. If by "precise" you mean "less unreliable in some cases", yes. 
See my other mail.
On Fri, Apr 12, 2024 at 12:36 PM David Hildenbrand <david@redhat.com> wrote: > > On 12.04.24 20:29, Yang Shi wrote: > > On Fri, Apr 12, 2024 at 7:31 AM Zi Yan <ziy@nvidia.com> wrote: > >> > >> On 12 Apr 2024, at 10:21, Zi Yan wrote: > >> > >>> On 11 Apr 2024, at 17:59, Yang Shi wrote: > >>> > >>>> On Thu, Apr 11, 2024 at 2:15 PM David Hildenbrand <david@redhat.com> wrote: > >>>>> > >>>>> On 11.04.24 21:01, Yang Shi wrote: > >>>>>> On Thu, Apr 11, 2024 at 8:46 AM David Hildenbrand <david@redhat.com> wrote: > >>>>>>> > >>>>>>> On 11.04.24 17:32, Zi Yan wrote: > >>>>>>>> From: Zi Yan <ziy@nvidia.com> > >>>>>>>> > >>>>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list > >>>>>>>> if any page in a folio loses its final mapping. It is possible that > >>>>>>>> the folio is unmapped fully, but it is unnecessary to add the folio > >>>>>>>> to deferred split list at all. Fix it by checking folio mapcount before > >>>>>>>> adding a folio to deferred split list. > >>>>>>>> > >>>>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> > >>>>>>>> --- > >>>>>>>> mm/rmap.c | 9 ++++++--- > >>>>>>>> 1 file changed, 6 insertions(+), 3 deletions(-) > >>>>>>>> > >>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c > >>>>>>>> index 2608c40dffad..d599a772e282 100644 > >>>>>>>> --- a/mm/rmap.c > >>>>>>>> +++ b/mm/rmap.c > >>>>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>>>>>>> enum rmap_level level) > >>>>>>>> { > >>>>>>>> atomic_t *mapped = &folio->_nr_pages_mapped; > >>>>>>>> - int last, nr = 0, nr_pmdmapped = 0; > >>>>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; > >>>>>>>> enum node_stat_item idx; > >>>>>>>> > >>>>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); > >>>>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>>>>>>> break; > >>>>>>>> } > >>>>>>>> > >>>>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount); > >>>>>>>> + mapcount = atomic_sub_return(nr_pages, > >>>>>>>> + &folio->_large_mapcount) + 1; > >>>>>>> > >>>>>>> That becomes a new memory barrier on some archs. Rather just re-read it > >>>>>>> below. Re-reading should be fine here. > >>>>>>> > >>>>>>>> do { > >>>>>>>> last = atomic_add_negative(-1, &page->_mapcount); > >>>>>>>> if (last) { > >>>>>>>> @@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>>>>>>> * is still mapped. > >>>>>>>> */ > >>>>>>>> if (folio_test_large(folio) && folio_test_anon(folio)) > >>>>>>>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) > >>>>>>>> + if ((level == RMAP_LEVEL_PTE && > >>>>>>>> + mapcount != 0) || > >>>>>>>> + (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped)) > >>>>>>>> deferred_split_folio(folio); > >>>>>>>> } > >>>>>>> > >>>>>>> But I do wonder if we really care? Usually the folio will simply get > >>>>>>> freed afterwards, where we simply remove it from the list. > >>>>>>> > >>>>>>> If it's pinned, we won't be able to free or reclaim, but it's rather a > >>>>>>> corner case ... > >>>>>>> > >>>>>>> Is it really worth the added code? Not convinced. > >>>>>> > >>>>>> It is actually not only an optimization, but also fixed the broken > >>>>>> thp_deferred_split_page counter in /proc/vmstat. > >>>>>> > >>>>>> The counter actually counted the partially unmapped huge pages (so > >>>>>> they are on deferred split queue), but it counts the fully unmapped > >>>>>> mTHP as well now. 
For example, when a 64K THP is fully unmapped, the > >>>>>> thp_deferred_split_page is not supposed to get inc'ed, but it does > >>>>>> now. > >>>>>> > >>>>>> The counter is also useful for performance analysis, for example, > >>>>>> whether a workload did a lot of partial unmap or not. So fixing the > >>>>>> counter seems worthy. Zi Yan should have mentioned this in the commit > >>>>>> log. > >>>>> > >>>>> Yes, all that is information that is missing from the patch description. > >>>>> If it's a fix, there should be a "Fixes:". > >>>>> > >>>>> Likely we want to have a folio_large_mapcount() check in the code below. > >>>>> (I yet have to digest the condition where this happens -- can we have an > >>>>> example where we'd use to do the wrong thing and now would do the right > >>>>> thing as well?) > >>>> > >>>> For example, map 1G memory with 64K mTHP, then unmap the whole 1G or > >>>> some full 64K areas, you will see thp_deferred_split_page increased, > >>>> but it shouldn't. > >>>> > >>>> It looks __folio_remove_rmap() incorrectly detected whether the mTHP > >>>> is fully unmapped or partially unmapped by comparing the number of > >>>> still-mapped subpages to ENTIRELY_MAPPED, which should just work for > >>>> PMD-mappable THP. > >>>> > >>>> However I just realized this problem was kind of workaround'ed by commit: > >>>> > >>>> commit 98046944a1597f3a02b792dbe9665e9943b77f28 > >>>> Author: Baolin Wang <baolin.wang@linux.alibaba.com> > >>>> Date: Fri Mar 29 14:59:33 2024 +0800 > >>>> > >>>> mm: huge_memory: add the missing folio_test_pmd_mappable() for THP > >>>> split statistics > >>>> > >>>> Now the mTHP can also be split or added into the deferred list, so add > >>>> folio_test_pmd_mappable() validation for PMD mapped THP, to avoid > >>>> confusion with PMD mapped THP related statistics. > >>>> > >>>> Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com > >>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> > >>>> Acked-by: David Hildenbrand <david@redhat.com> > >>>> Cc: Muchun Song <muchun.song@linux.dev> > >>>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > >>>> > >>>> This commit made thp_deferred_split_page didn't count mTHP anymore, it > >>>> also made thp_split_page didn't count mTHP anymore. > >>>> > >>>> However Zi Yan's patch does make the code more robust and we don't > >>>> need to worry about the miscounting issue anymore if we will add > >>>> deferred_split_page and split_page counters for mTHP in the future. > >>> > >>> Actually, the patch above does not fix everything. A fully unmapped > >>> PTE-mapped order-9 THP is also added to deferred split list and > >>> counted as THP_DEFERRED_SPLIT_PAGE without my patch, since nr is 512 > >>> (non zero), level is RMAP_LEVEL_PTE, and inside deferred_split_folio() > >>> the order-9 folio is folio_test_pmd_mappable(). > >>> > >>> I will add this information in the next version. > >> > >> It might > >> Fixes: b06dc281aa99 ("mm/rmap: introduce folio_remove_rmap_[pte|ptes|pmd]()"), > >> but before this commit fully unmapping a PTE-mapped order-9 THP still increased > >> THP_DEFERRED_SPLIT_PAGE, because PTEs are unmapped individually and first PTE > >> unmapping adds the THP into the deferred split list. This means commit b06dc281aa99 > >> did not change anything and before that THP_DEFERRED_SPLIT_PAGE increase is > >> due to implementation. I will add this to the commit log as well without Fixes > >> tag. > > > > Thanks for digging deeper. 
The problem may be not that obvious before > > mTHP because PMD-mappable THP is converted to PTE-mapped due to > > partial unmap in most cases. But mTHP is always PTE-mapped in the > > first place. The other reason is batched rmap remove was not supported > > before David's optimization. > > Yes. > > > > > Now we do have reasonable motivation to make it precise and it is also > > easier to do so than before. > > If by "precise" you mean "less unreliable in some cases", yes. See my > other mail. Yes, definitely. The point is to make the unreliability at least somewhat acceptable. > > -- > Cheers, > > David / dhildenb >
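For readers who want to reproduce the behavior described above, here is a minimal userspace sketch (an illustration, not a kernel selftest): it assumes 64K mTHP is enabled, e.g. via /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled, and that thp_deferred_split_page is present in /proc/vmstat. With the over-counting discussed above, the counter moves even though the range is unmapped in full.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static long read_deferred_split(void)
{
	char line[128];
	long val = -1;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "thp_deferred_split_page %ld", &val) == 1)
			break;
	fclose(f);
	return val;
}

int main(void)
{
	size_t len = 1UL << 30;	/* 1G, ideally backed by 64K mTHP */
	long before, after;
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	madvise(p, len, MADV_HUGEPAGE);
	memset(p, 1, len);	/* fault the whole range in */

	before = read_deferred_split();
	munmap(p, len);		/* fully unmap; nothing is partially unmapped */
	after = read_deferred_split();

	printf("thp_deferred_split_page: %ld -> %ld\n", before, after);
	return 0;
}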
On Fri, Apr 12, 2024 at 12:33 PM David Hildenbrand <david@redhat.com> wrote: > > On 12.04.24 16:35, Zi Yan wrote: > > On 11 Apr 2024, at 11:46, David Hildenbrand wrote: > > > >> On 11.04.24 17:32, Zi Yan wrote: > >>> From: Zi Yan <ziy@nvidia.com> > >>> > >>> In __folio_remove_rmap(), a large folio is added to deferred split list > >>> if any page in a folio loses its final mapping. It is possible that > >>> the folio is unmapped fully, but it is unnecessary to add the folio > >>> to deferred split list at all. Fix it by checking folio mapcount before > >>> adding a folio to deferred split list. > >>> > >>> Signed-off-by: Zi Yan <ziy@nvidia.com> > >>> --- > >>> mm/rmap.c | 9 ++++++--- > >>> 1 file changed, 6 insertions(+), 3 deletions(-) > >>> > >>> diff --git a/mm/rmap.c b/mm/rmap.c > >>> index 2608c40dffad..d599a772e282 100644 > >>> --- a/mm/rmap.c > >>> +++ b/mm/rmap.c > >>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>> enum rmap_level level) > >>> { > >>> atomic_t *mapped = &folio->_nr_pages_mapped; > >>> - int last, nr = 0, nr_pmdmapped = 0; > >>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; > >>> enum node_stat_item idx; > >>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); > >>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>> break; > >>> } > >>> - atomic_sub(nr_pages, &folio->_large_mapcount); > >>> + mapcount = atomic_sub_return(nr_pages, > >>> + &folio->_large_mapcount) + 1; > >> > >> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here. > > > > Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped) > > below, but to save an atomic op, I chose to read mapcount here. > > Some points: > > (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic > RMW that return a value -- and how they interact with memory barriers. > Further, how relaxed variants are only optimized on some architectures. > > atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory > access that should not be refetched. Usually cheaper than most other stuff > that involves atomics. > > (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped) > to figure out if the folio is now completely unmapped. > > (3) There is one fundamental issue: if we are not batch-unmapping the whole > thing, we will still add the folios to the deferred split queue. Migration > would still do that, or if there are multiple VMAs covering a folio. Maybe we can let rmap remove code not touch deferred split queue in migration path at all because we know all the PTEs will be converted to migration entries. And all the migration entries will be converted back to PTEs regardless of whether try_to_migrate succeeded or not. > > (4) We should really avoid making common operations slower only to make > some unreliable stats less unreliable. > > > We should likely do something like the following, which might even be a bit > faster in some cases because we avoid a function call in case we unmap > individual PTEs by checking _deferred_list ahead of time > > diff --git a/mm/rmap.c b/mm/rmap.c > index 2608c40dffad..356598b3dc3c 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > * page of the folio is unmapped and at least one page > * is still mapped. 
> */ > - if (folio_test_large(folio) && folio_test_anon(folio)) > - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) > - deferred_split_folio(folio); > + if (folio_test_large(folio) && folio_test_anon(folio) && > + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) && > + atomic_read(mapped) && > + data_race(list_empty(&folio->_deferred_list))) > + deferred_split_folio(folio); > } > > > I also thought about handling the scenario where we unmap the whole > think in smaller chunks. We could detect "!atomic_read(mapped)" and > detect that it is on the deferred split list, and simply remove it > from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event. > > But it would be racy with concurrent remapping of the folio (might happen with > anon folios in corner cases I guess). > > What we can do is the following, though: > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index dc30139590e6..f05cba1807f2 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio) > ds_queue = get_deferred_split_queue(folio); > spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > if (!list_empty(&folio->_deferred_list)) { > + if (folio_test_pmd_mappable(folio)) > + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE); > ds_queue->split_queue_len--; > list_del_init(&folio->_deferred_list); > } > > Adding the right event of course. > > > Then it's easy to filter out these "temporarily added to the list, but never split > before the folio was freed" cases. > > > -- > Cheers, > > David / dhildenb >
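To make points (1) and (2) above concrete, a minimal sketch of the re-read variant (this only rearranges the proposal quoted above and is not merged code; folio_large_mapcount() is the helper named in point (2) as the alternative to atomic_read(mapped)):

	/* non-value-returning RMW: no ordering/barrier is implied */
	atomic_sub(nr_pages, &folio->_large_mapcount);
	...
	/* decide later from plain reads; intentionally racy and cheap */
	if (folio_test_large(folio) && folio_test_anon(folio) &&
	    (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
	    folio_large_mapcount(folio) &&	/* not fully unmapped */
	    list_empty(&folio->_deferred_list))	/* not already queued */
		deferred_split_folio(folio);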
On 12 Apr 2024, at 15:32, David Hildenbrand wrote: > On 12.04.24 16:35, Zi Yan wrote: >> On 11 Apr 2024, at 11:46, David Hildenbrand wrote: >> >>> On 11.04.24 17:32, Zi Yan wrote: >>>> From: Zi Yan <ziy@nvidia.com> >>>> >>>> In __folio_remove_rmap(), a large folio is added to deferred split list >>>> if any page in a folio loses its final mapping. It is possible that >>>> the folio is unmapped fully, but it is unnecessary to add the folio >>>> to deferred split list at all. Fix it by checking folio mapcount before >>>> adding a folio to deferred split list. >>>> >>>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>>> --- >>>> mm/rmap.c | 9 ++++++--- >>>> 1 file changed, 6 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>> index 2608c40dffad..d599a772e282 100644 >>>> --- a/mm/rmap.c >>>> +++ b/mm/rmap.c >>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>> enum rmap_level level) >>>> { >>>> atomic_t *mapped = &folio->_nr_pages_mapped; >>>> - int last, nr = 0, nr_pmdmapped = 0; >>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; >>>> enum node_stat_item idx; >>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); >>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>> break; >>>> } >>>> - atomic_sub(nr_pages, &folio->_large_mapcount); >>>> + mapcount = atomic_sub_return(nr_pages, >>>> + &folio->_large_mapcount) + 1; >>> >>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here. >> >> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped) >> below, but to save an atomic op, I chose to read mapcount here. > > Some points: > > (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic > RMW that return a value -- and how they interact with memory barriers. > Further, how relaxed variants are only optimized on some architectures. > > atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory > access that should not be refetched. Usually cheaper than most other stuff > that involves atomics. I should have checked the actual implementation instead of being fooled by the name. Will read about it. Thanks. > > (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped) > to figure out if the folio is now completely unmapped. > > (3) There is one fundamental issue: if we are not batch-unmapping the whole > thing, we will still add the folios to the deferred split queue. Migration > would still do that, or if there are multiple VMAs covering a folio. > > (4) We should really avoid making common operations slower only to make > some unreliable stats less unreliable. > > > We should likely do something like the following, which might even be a bit > faster in some cases because we avoid a function call in case we unmap > individual PTEs by checking _deferred_list ahead of time > > diff --git a/mm/rmap.c b/mm/rmap.c > index 2608c40dffad..356598b3dc3c 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > * page of the folio is unmapped and at least one page > * is still mapped. 
> */ > - if (folio_test_large(folio) && folio_test_anon(folio)) > - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) > - deferred_split_folio(folio); > + if (folio_test_large(folio) && folio_test_anon(folio) && > + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) && > + atomic_read(mapped) && > + data_race(list_empty(&folio->_deferred_list))) data_race() might not be needed, as Ryan pointed out[1] > + deferred_split_folio(folio); > } > > I also thought about handling the scenario where we unmap the whole > think in smaller chunks. We could detect "!atomic_read(mapped)" and > detect that it is on the deferred split list, and simply remove it > from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event. > > But it would be racy with concurrent remapping of the folio (might happen with > anon folios in corner cases I guess). > > What we can do is the following, though: > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index dc30139590e6..f05cba1807f2 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio) > ds_queue = get_deferred_split_queue(folio); > spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > if (!list_empty(&folio->_deferred_list)) { > + if (folio_test_pmd_mappable(folio)) > + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE); > ds_queue->split_queue_len--; > list_del_init(&folio->_deferred_list); > } > > Adding the right event of course. > > > Then it's easy to filter out these "temporarily added to the list, but never split > before the folio was freed" cases. So instead of making THP_DEFERRED_SPLIT_PAGE precise, use THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work. I wonder what THP_DEFERRED_SPLIT_PAGE counts. If it counts THP deferred splits, why not just move the counter to deferred_split_scan(), where the actual split happens. Or the counter has a different meaning? [1] https://lore.kernel.org/linux-mm/e3e14098-eade-483e-a459-e43200b87941@arm.com/ -- Best Regards, Yan, Zi
On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <ziy@nvidia.com> wrote: > > On 12 Apr 2024, at 15:32, David Hildenbrand wrote: > > > On 12.04.24 16:35, Zi Yan wrote: > >> On 11 Apr 2024, at 11:46, David Hildenbrand wrote: > >> > >>> On 11.04.24 17:32, Zi Yan wrote: > >>>> From: Zi Yan <ziy@nvidia.com> > >>>> > >>>> In __folio_remove_rmap(), a large folio is added to deferred split list > >>>> if any page in a folio loses its final mapping. It is possible that > >>>> the folio is unmapped fully, but it is unnecessary to add the folio > >>>> to deferred split list at all. Fix it by checking folio mapcount before > >>>> adding a folio to deferred split list. > >>>> > >>>> Signed-off-by: Zi Yan <ziy@nvidia.com> > >>>> --- > >>>> mm/rmap.c | 9 ++++++--- > >>>> 1 file changed, 6 insertions(+), 3 deletions(-) > >>>> > >>>> diff --git a/mm/rmap.c b/mm/rmap.c > >>>> index 2608c40dffad..d599a772e282 100644 > >>>> --- a/mm/rmap.c > >>>> +++ b/mm/rmap.c > >>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>>> enum rmap_level level) > >>>> { > >>>> atomic_t *mapped = &folio->_nr_pages_mapped; > >>>> - int last, nr = 0, nr_pmdmapped = 0; > >>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; > >>>> enum node_stat_item idx; > >>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); > >>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>>> break; > >>>> } > >>>> - atomic_sub(nr_pages, &folio->_large_mapcount); > >>>> + mapcount = atomic_sub_return(nr_pages, > >>>> + &folio->_large_mapcount) + 1; > >>> > >>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here. > >> > >> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped) > >> below, but to save an atomic op, I chose to read mapcount here. > > > > Some points: > > > > (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic > > RMW that return a value -- and how they interact with memory barriers. > > Further, how relaxed variants are only optimized on some architectures. > > > > atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory > > access that should not be refetched. Usually cheaper than most other stuff > > that involves atomics. > > I should have checked the actual implementation instead of being fooled > by the name. Will read about it. Thanks. > > > > > (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped) > > to figure out if the folio is now completely unmapped. > > > > (3) There is one fundamental issue: if we are not batch-unmapping the whole > > thing, we will still add the folios to the deferred split queue. Migration > > would still do that, or if there are multiple VMAs covering a folio. > > > > (4) We should really avoid making common operations slower only to make > > some unreliable stats less unreliable. > > > > > > We should likely do something like the following, which might even be a bit > > faster in some cases because we avoid a function call in case we unmap > > individual PTEs by checking _deferred_list ahead of time > > > > diff --git a/mm/rmap.c b/mm/rmap.c > > index 2608c40dffad..356598b3dc3c 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > > * page of the folio is unmapped and at least one page > > * is still mapped. 
> > */ > > - if (folio_test_large(folio) && folio_test_anon(folio)) > > - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) > > - deferred_split_folio(folio); > > + if (folio_test_large(folio) && folio_test_anon(folio) && > > + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) && > > + atomic_read(mapped) && > > + data_race(list_empty(&folio->_deferred_list))) > > data_race() might not be needed, as Ryan pointed out[1] > > > + deferred_split_folio(folio); > > } > > > > I also thought about handling the scenario where we unmap the whole > > think in smaller chunks. We could detect "!atomic_read(mapped)" and > > detect that it is on the deferred split list, and simply remove it > > from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event. > > > > But it would be racy with concurrent remapping of the folio (might happen with > > anon folios in corner cases I guess). > > > > What we can do is the following, though: > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index dc30139590e6..f05cba1807f2 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio) > > ds_queue = get_deferred_split_queue(folio); > > spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > > if (!list_empty(&folio->_deferred_list)) { > > + if (folio_test_pmd_mappable(folio)) > > + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE); > > ds_queue->split_queue_len--; > > list_del_init(&folio->_deferred_list); > > } > > > > Adding the right event of course. > > > > > > Then it's easy to filter out these "temporarily added to the list, but never split > > before the folio was freed" cases. > > So instead of making THP_DEFERRED_SPLIT_PAGE precise, use > THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work. It is definitely possible that the THP on the deferred split queue are freed instead of split. For example, 1M is unmapped for a 2M THP, then later the remaining 1M is unmapped, or the process exits before memory pressure happens. So how come we can tell it is "temporarily added to list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE actually just counts how many pages are still on deferred split queue. It may be useful. However the counter is typically used to estimate how many THP are partially unmapped during a period of time. So we just need to know the initial value and the value when we read it again. > > I wonder what THP_DEFERRED_SPLIT_PAGE counts. If it counts THP deferred > splits, why not just move the counter to deferred_split_scan(), where the actual > split happens. Or the counter has a different meaning? The deferred_split_scan() / deferred_split_count() just can return the number of pages on a specific queue (a specific node with a specific memcg). But THP_DEFERRED_SPLIT_PAGE is a global counter. Did I miss something? Or you mean traverse all memcgs and all nodes? It sounds too overkilling. > > > > [1] https://lore.kernel.org/linux-mm/e3e14098-eade-483e-a459-e43200b87941@arm.com/ > > -- > Best Regards, > Yan, Zi
On 12 Apr 2024, at 18:29, Yang Shi wrote: > On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <ziy@nvidia.com> wrote: >> >> On 12 Apr 2024, at 15:32, David Hildenbrand wrote: >> >>> On 12.04.24 16:35, Zi Yan wrote: >>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote: >>>> >>>>> On 11.04.24 17:32, Zi Yan wrote: >>>>>> From: Zi Yan <ziy@nvidia.com> >>>>>> >>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list >>>>>> if any page in a folio loses its final mapping. It is possible that >>>>>> the folio is unmapped fully, but it is unnecessary to add the folio >>>>>> to deferred split list at all. Fix it by checking folio mapcount before >>>>>> adding a folio to deferred split list. >>>>>> >>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>>>>> --- >>>>>> mm/rmap.c | 9 ++++++--- >>>>>> 1 file changed, 6 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>> index 2608c40dffad..d599a772e282 100644 >>>>>> --- a/mm/rmap.c >>>>>> +++ b/mm/rmap.c >>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>>> enum rmap_level level) >>>>>> { >>>>>> atomic_t *mapped = &folio->_nr_pages_mapped; >>>>>> - int last, nr = 0, nr_pmdmapped = 0; >>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; >>>>>> enum node_stat_item idx; >>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); >>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>>> break; >>>>>> } >>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount); >>>>>> + mapcount = atomic_sub_return(nr_pages, >>>>>> + &folio->_large_mapcount) + 1; >>>>> >>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here. >>>> >>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped) >>>> below, but to save an atomic op, I chose to read mapcount here. >>> >>> Some points: >>> >>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic >>> RMW that return a value -- and how they interact with memory barriers. >>> Further, how relaxed variants are only optimized on some architectures. >>> >>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory >>> access that should not be refetched. Usually cheaper than most other stuff >>> that involves atomics. >> >> I should have checked the actual implementation instead of being fooled >> by the name. Will read about it. Thanks. >> >>> >>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped) >>> to figure out if the folio is now completely unmapped. >>> >>> (3) There is one fundamental issue: if we are not batch-unmapping the whole >>> thing, we will still add the folios to the deferred split queue. Migration >>> would still do that, or if there are multiple VMAs covering a folio. >>> >>> (4) We should really avoid making common operations slower only to make >>> some unreliable stats less unreliable. >>> >>> >>> We should likely do something like the following, which might even be a bit >>> faster in some cases because we avoid a function call in case we unmap >>> individual PTEs by checking _deferred_list ahead of time >>> >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index 2608c40dffad..356598b3dc3c 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>> * page of the folio is unmapped and at least one page >>> * is still mapped. 
>>> */ >>> - if (folio_test_large(folio) && folio_test_anon(folio)) >>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) >>> - deferred_split_folio(folio); >>> + if (folio_test_large(folio) && folio_test_anon(folio) && >>> + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) && >>> + atomic_read(mapped) && >>> + data_race(list_empty(&folio->_deferred_list))) >> >> data_race() might not be needed, as Ryan pointed out[1] >> >>> + deferred_split_folio(folio); >>> } >>> >>> I also thought about handling the scenario where we unmap the whole >>> think in smaller chunks. We could detect "!atomic_read(mapped)" and >>> detect that it is on the deferred split list, and simply remove it >>> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event. >>> >>> But it would be racy with concurrent remapping of the folio (might happen with >>> anon folios in corner cases I guess). >>> >>> What we can do is the following, though: >>> >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>> index dc30139590e6..f05cba1807f2 100644 >>> --- a/mm/huge_memory.c >>> +++ b/mm/huge_memory.c >>> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio) >>> ds_queue = get_deferred_split_queue(folio); >>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags); >>> if (!list_empty(&folio->_deferred_list)) { >>> + if (folio_test_pmd_mappable(folio)) >>> + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE); >>> ds_queue->split_queue_len--; >>> list_del_init(&folio->_deferred_list); >>> } >>> >>> Adding the right event of course. >>> >>> >>> Then it's easy to filter out these "temporarily added to the list, but never split >>> before the folio was freed" cases. >> >> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use >> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work. > > It is definitely possible that the THP on the deferred split queue are > freed instead of split. For example, 1M is unmapped for a 2M THP, then > later the remaining 1M is unmapped, or the process exits before memory > pressure happens. So how come we can tell it is "temporarily added to > list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE > actually just counts how many pages are still on deferred split queue. > It may be useful. However the counter is typically used to estimate > how many THP are partially unmapped during a period of time. So we > just need to know the initial value and the value when we read it > again. > >> >> I wonder what THP_DEFERRED_SPLIT_PAGE counts. If it counts THP deferred >> splits, why not just move the counter to deferred_split_scan(), where the actual >> split happens. Or the counter has a different meaning? > > The deferred_split_scan() / deferred_split_count() just can return the > number of pages on a specific queue (a specific node with a specific > memcg). But THP_DEFERRED_SPLIT_PAGE is a global counter. Did I miss > something? Or you mean traverse all memcgs and all nodes? It sounds > too overkilling. I mean instead of increasing THP_DEFERRED_SPLIT_PAGE when a folio is added to the split list, increase it when a folio is split in deferred_split_scan(), regardless which list the folio is on. -- Best Regards, Yan, Zi
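A sketch of what that relocation could look like, loosely modeled on the shape of deferred_split_scan() (the real loop differs in detail; this only illustrates counting at split time instead of at queue time):

	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
		if (!folio_trylock(folio))
			continue;
		/* a successful split also removes the folio from the list */
		if (!split_folio(folio)) {
			count_vm_event(THP_DEFERRED_SPLIT_PAGE);
			split++;
		}
		folio_unlock(folio);
	}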
On Fri, Apr 12, 2024 at 3:59 PM Zi Yan <ziy@nvidia.com> wrote: > > On 12 Apr 2024, at 18:29, Yang Shi wrote: > > > On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <ziy@nvidia.com> wrote: > >> > >> On 12 Apr 2024, at 15:32, David Hildenbrand wrote: > >> > >>> On 12.04.24 16:35, Zi Yan wrote: > >>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote: > >>>> > >>>>> On 11.04.24 17:32, Zi Yan wrote: > >>>>>> From: Zi Yan <ziy@nvidia.com> > >>>>>> > >>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list > >>>>>> if any page in a folio loses its final mapping. It is possible that > >>>>>> the folio is unmapped fully, but it is unnecessary to add the folio > >>>>>> to deferred split list at all. Fix it by checking folio mapcount before > >>>>>> adding a folio to deferred split list. > >>>>>> > >>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> > >>>>>> --- > >>>>>> mm/rmap.c | 9 ++++++--- > >>>>>> 1 file changed, 6 insertions(+), 3 deletions(-) > >>>>>> > >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c > >>>>>> index 2608c40dffad..d599a772e282 100644 > >>>>>> --- a/mm/rmap.c > >>>>>> +++ b/mm/rmap.c > >>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>>>>> enum rmap_level level) > >>>>>> { > >>>>>> atomic_t *mapped = &folio->_nr_pages_mapped; > >>>>>> - int last, nr = 0, nr_pmdmapped = 0; > >>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; > >>>>>> enum node_stat_item idx; > >>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); > >>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>>>>> break; > >>>>>> } > >>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount); > >>>>>> + mapcount = atomic_sub_return(nr_pages, > >>>>>> + &folio->_large_mapcount) + 1; > >>>>> > >>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here. > >>>> > >>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped) > >>>> below, but to save an atomic op, I chose to read mapcount here. > >>> > >>> Some points: > >>> > >>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic > >>> RMW that return a value -- and how they interact with memory barriers. > >>> Further, how relaxed variants are only optimized on some architectures. > >>> > >>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory > >>> access that should not be refetched. Usually cheaper than most other stuff > >>> that involves atomics. > >> > >> I should have checked the actual implementation instead of being fooled > >> by the name. Will read about it. Thanks. > >> > >>> > >>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped) > >>> to figure out if the folio is now completely unmapped. > >>> > >>> (3) There is one fundamental issue: if we are not batch-unmapping the whole > >>> thing, we will still add the folios to the deferred split queue. Migration > >>> would still do that, or if there are multiple VMAs covering a folio. > >>> > >>> (4) We should really avoid making common operations slower only to make > >>> some unreliable stats less unreliable. 
> >>> > >>> > >>> We should likely do something like the following, which might even be a bit > >>> faster in some cases because we avoid a function call in case we unmap > >>> individual PTEs by checking _deferred_list ahead of time > >>> > >>> diff --git a/mm/rmap.c b/mm/rmap.c > >>> index 2608c40dffad..356598b3dc3c 100644 > >>> --- a/mm/rmap.c > >>> +++ b/mm/rmap.c > >>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>> * page of the folio is unmapped and at least one page > >>> * is still mapped. > >>> */ > >>> - if (folio_test_large(folio) && folio_test_anon(folio)) > >>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) > >>> - deferred_split_folio(folio); > >>> + if (folio_test_large(folio) && folio_test_anon(folio) && > >>> + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) && > >>> + atomic_read(mapped) && > >>> + data_race(list_empty(&folio->_deferred_list))) > >> > >> data_race() might not be needed, as Ryan pointed out[1] > >> > >>> + deferred_split_folio(folio); > >>> } > >>> > >>> I also thought about handling the scenario where we unmap the whole > >>> think in smaller chunks. We could detect "!atomic_read(mapped)" and > >>> detect that it is on the deferred split list, and simply remove it > >>> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event. > >>> > >>> But it would be racy with concurrent remapping of the folio (might happen with > >>> anon folios in corner cases I guess). > >>> > >>> What we can do is the following, though: > >>> > >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c > >>> index dc30139590e6..f05cba1807f2 100644 > >>> --- a/mm/huge_memory.c > >>> +++ b/mm/huge_memory.c > >>> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio) > >>> ds_queue = get_deferred_split_queue(folio); > >>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > >>> if (!list_empty(&folio->_deferred_list)) { > >>> + if (folio_test_pmd_mappable(folio)) > >>> + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE); > >>> ds_queue->split_queue_len--; > >>> list_del_init(&folio->_deferred_list); > >>> } > >>> > >>> Adding the right event of course. > >>> > >>> > >>> Then it's easy to filter out these "temporarily added to the list, but never split > >>> before the folio was freed" cases. > >> > >> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use > >> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work. > > > > It is definitely possible that the THP on the deferred split queue are > > freed instead of split. For example, 1M is unmapped for a 2M THP, then > > later the remaining 1M is unmapped, or the process exits before memory > > pressure happens. So how come we can tell it is "temporarily added to > > list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE > > actually just counts how many pages are still on deferred split queue. > > It may be useful. However the counter is typically used to estimate > > how many THP are partially unmapped during a period of time. So we > > just need to know the initial value and the value when we read it > > again. > > > >> > >> I wonder what THP_DEFERRED_SPLIT_PAGE counts. If it counts THP deferred > >> splits, why not just move the counter to deferred_split_scan(), where the actual > >> split happens. Or the counter has a different meaning? > > > > The deferred_split_scan() / deferred_split_count() just can return the > > number of pages on a specific queue (a specific node with a specific > > memcg). 
But THP_DEFERRED_SPLIT_PAGE is a global counter. Did I miss > > something? Or you mean traverse all memcgs and all nodes? It sounds > > too overkilling. > > I mean instead of increasing THP_DEFERRED_SPLIT_PAGE when a folio is added > to the split list, increase it when a folio is split in deferred_split_scan(), > regardless which list the folio is on. That would overlap with thp_split_page. And what if memory pressure never happens? The counter would stay at 0 even though thousands of THPs have been partially unmapped. > > -- > Best Regards, > Yan, Zi
On 12.04.24 23:06, Zi Yan wrote: > On 12 Apr 2024, at 15:32, David Hildenbrand wrote: > >> On 12.04.24 16:35, Zi Yan wrote: >>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote: >>> >>>> On 11.04.24 17:32, Zi Yan wrote: >>>>> From: Zi Yan <ziy@nvidia.com> >>>>> >>>>> In __folio_remove_rmap(), a large folio is added to deferred split list >>>>> if any page in a folio loses its final mapping. It is possible that >>>>> the folio is unmapped fully, but it is unnecessary to add the folio >>>>> to deferred split list at all. Fix it by checking folio mapcount before >>>>> adding a folio to deferred split list. >>>>> >>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>>>> --- >>>>> mm/rmap.c | 9 ++++++--- >>>>> 1 file changed, 6 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>> index 2608c40dffad..d599a772e282 100644 >>>>> --- a/mm/rmap.c >>>>> +++ b/mm/rmap.c >>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>> enum rmap_level level) >>>>> { >>>>> atomic_t *mapped = &folio->_nr_pages_mapped; >>>>> - int last, nr = 0, nr_pmdmapped = 0; >>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; >>>>> enum node_stat_item idx; >>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); >>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>> break; >>>>> } >>>>> - atomic_sub(nr_pages, &folio->_large_mapcount); >>>>> + mapcount = atomic_sub_return(nr_pages, >>>>> + &folio->_large_mapcount) + 1; >>>> >>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here. >>> >>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped) >>> below, but to save an atomic op, I chose to read mapcount here. >> >> Some points: >> >> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic >> RMW that return a value -- and how they interact with memory barriers. >> Further, how relaxed variants are only optimized on some architectures. >> >> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory >> access that should not be refetched. Usually cheaper than most other stuff >> that involves atomics. > > I should have checked the actual implementation instead of being fooled > by the name. Will read about it. Thanks. > >> >> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped) >> to figure out if the folio is now completely unmapped. >> >> (3) There is one fundamental issue: if we are not batch-unmapping the whole >> thing, we will still add the folios to the deferred split queue. Migration >> would still do that, or if there are multiple VMAs covering a folio. >> >> (4) We should really avoid making common operations slower only to make >> some unreliable stats less unreliable. >> >> >> We should likely do something like the following, which might even be a bit >> faster in some cases because we avoid a function call in case we unmap >> individual PTEs by checking _deferred_list ahead of time >> >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 2608c40dffad..356598b3dc3c 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >> * page of the folio is unmapped and at least one page >> * is still mapped. 
>> */ >> - if (folio_test_large(folio) && folio_test_anon(folio)) >> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) >> - deferred_split_folio(folio); >> + if (folio_test_large(folio) && folio_test_anon(folio) && >> + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) && >> + atomic_read(mapped) && >> + data_race(list_empty(&folio->_deferred_list))) > > data_race() might not be needed, as Ryan pointed out[1] Right, I keep getting confused by that. Likely we should add data_race() only if we get actual reports.
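For reference, the annotation in question only affects KCSAN: data_race() (from include/linux/compiler.h) marks the read as an acknowledged race, adds no ordering, and compiles to the bare expression when KCSAN is disabled. Either form below behaves the same at runtime:

	/* plain read: KCSAN may flag a data race if another CPU updates the list */
	if (list_empty(&folio->_deferred_list))
		deferred_split_folio(folio);

	/* annotated read: same semantics, but the race is declared intentional */
	if (data_race(list_empty(&folio->_deferred_list)))
		deferred_split_folio(folio);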
On 13.04.24 00:29, Yang Shi wrote: > On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <ziy@nvidia.com> wrote: >> >> On 12 Apr 2024, at 15:32, David Hildenbrand wrote: >> >>> On 12.04.24 16:35, Zi Yan wrote: >>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote: >>>> >>>>> On 11.04.24 17:32, Zi Yan wrote: >>>>>> From: Zi Yan <ziy@nvidia.com> >>>>>> >>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list >>>>>> if any page in a folio loses its final mapping. It is possible that >>>>>> the folio is unmapped fully, but it is unnecessary to add the folio >>>>>> to deferred split list at all. Fix it by checking folio mapcount before >>>>>> adding a folio to deferred split list. >>>>>> >>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>>>>> --- >>>>>> mm/rmap.c | 9 ++++++--- >>>>>> 1 file changed, 6 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>> index 2608c40dffad..d599a772e282 100644 >>>>>> --- a/mm/rmap.c >>>>>> +++ b/mm/rmap.c >>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>>> enum rmap_level level) >>>>>> { >>>>>> atomic_t *mapped = &folio->_nr_pages_mapped; >>>>>> - int last, nr = 0, nr_pmdmapped = 0; >>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; >>>>>> enum node_stat_item idx; >>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); >>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>>> break; >>>>>> } >>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount); >>>>>> + mapcount = atomic_sub_return(nr_pages, >>>>>> + &folio->_large_mapcount) + 1; >>>>> >>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here. >>>> >>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped) >>>> below, but to save an atomic op, I chose to read mapcount here. >>> >>> Some points: >>> >>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic >>> RMW that return a value -- and how they interact with memory barriers. >>> Further, how relaxed variants are only optimized on some architectures. >>> >>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory >>> access that should not be refetched. Usually cheaper than most other stuff >>> that involves atomics. >> >> I should have checked the actual implementation instead of being fooled >> by the name. Will read about it. Thanks. >> >>> >>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped) >>> to figure out if the folio is now completely unmapped. >>> >>> (3) There is one fundamental issue: if we are not batch-unmapping the whole >>> thing, we will still add the folios to the deferred split queue. Migration >>> would still do that, or if there are multiple VMAs covering a folio. >>> >>> (4) We should really avoid making common operations slower only to make >>> some unreliable stats less unreliable. >>> >>> >>> We should likely do something like the following, which might even be a bit >>> faster in some cases because we avoid a function call in case we unmap >>> individual PTEs by checking _deferred_list ahead of time >>> >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index 2608c40dffad..356598b3dc3c 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>> * page of the folio is unmapped and at least one page >>> * is still mapped. 
>>> */ >>> - if (folio_test_large(folio) && folio_test_anon(folio)) >>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) >>> - deferred_split_folio(folio); >>> + if (folio_test_large(folio) && folio_test_anon(folio) && >>> + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) && >>> + atomic_read(mapped) && >>> + data_race(list_empty(&folio->_deferred_list))) >> >> data_race() might not be needed, as Ryan pointed out[1] >> >>> + deferred_split_folio(folio); >>> } >>> >>> I also thought about handling the scenario where we unmap the whole >>> think in smaller chunks. We could detect "!atomic_read(mapped)" and >>> detect that it is on the deferred split list, and simply remove it >>> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event. >>> >>> But it would be racy with concurrent remapping of the folio (might happen with >>> anon folios in corner cases I guess). >>> >>> What we can do is the following, though: >>> >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>> index dc30139590e6..f05cba1807f2 100644 >>> --- a/mm/huge_memory.c >>> +++ b/mm/huge_memory.c >>> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio) >>> ds_queue = get_deferred_split_queue(folio); >>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags); >>> if (!list_empty(&folio->_deferred_list)) { >>> + if (folio_test_pmd_mappable(folio)) >>> + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE); >>> ds_queue->split_queue_len--; >>> list_del_init(&folio->_deferred_list); >>> } >>> >>> Adding the right event of course. >>> >>> >>> Then it's easy to filter out these "temporarily added to the list, but never split >>> before the folio was freed" cases. >> >> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use >> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work. > > It is definitely possible that the THP on the deferred split queue are > freed instead of split. For example, 1M is unmapped for a 2M THP, then > later the remaining 1M is unmapped, or the process exits before memory > pressure happens. So how come we can tell it is "temporarily added to > list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE > actually just counts how many pages are still on deferred split queue. Not quite I think. I don't think we have a counter that counts how many large folios on the deferred list were split. I think we only have THP_SPLIT_PAGE. We could have * THP_DEFERRED_SPLIT_PAGE * THP_UNDO_DEFERRED_SPLIT_PAGE * THP_PERFORM_DEFERRED_SPLIT_PAGE Maybe that would catch more cases (not sure if all, though). Then, you could tell how many are still on that list. THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE. That could give one a clearer picture how deferred split interacts with actual splitting (possibly under memory pressure), the whole reason why deferred splitting was added after all. > It may be useful. However the counter is typically used to estimate > how many THP are partially unmapped during a period of time. I'd say it's a bit of an abuse of that counter; well, or interpreting something into the counter that that counter never reliably represented. I can easily write a program that keeps sending your counter to infinity simply by triggering that behavior in a loop, so it's all a bit shaky. Something like Ryans script makes more sense, where you get a clearer picture of what's mapped where and how. 
Because that information can be much more valuable than just knowing whether a folio is mapped fully or partially (again, relevant for dealing with memory waste).
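To pin the proposal down, a sketch of the three events and where each would be counted (only THP_DEFERRED_SPLIT_PAGE exists today; the other two names are proposals from this thread, and only the relevant enum entries are shown):

/* sketch: relevant vm_event_item entries only, not the full enum */
enum vm_event_item {
	THP_DEFERRED_SPLIT_PAGE,	 /* counted in deferred_split_folio()
					    when a folio is queued */
	THP_UNDO_DEFERRED_SPLIT_PAGE,	 /* proposed: counted when a queued folio
					    is freed without being split, e.g.
					    from folio_undo_large_rmappable() */
	THP_PERFORM_DEFERRED_SPLIT_PAGE, /* proposed: counted in
					    deferred_split_scan() when a queued
					    folio is actually split */
};

/*
 * Folios still sitting on the deferred split queues are then roughly
 * THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE
 *                         - THP_PERFORM_DEFERRED_SPLIT_PAGE.
 */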
On 12.04.24 22:35, Yang Shi wrote: > On Fri, Apr 12, 2024 at 12:33 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 12.04.24 16:35, Zi Yan wrote: >>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote: >>> >>>> On 11.04.24 17:32, Zi Yan wrote: >>>>> From: Zi Yan <ziy@nvidia.com> >>>>> >>>>> In __folio_remove_rmap(), a large folio is added to deferred split list >>>>> if any page in a folio loses its final mapping. It is possible that >>>>> the folio is unmapped fully, but it is unnecessary to add the folio >>>>> to deferred split list at all. Fix it by checking folio mapcount before >>>>> adding a folio to deferred split list. >>>>> >>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> >>>>> --- >>>>> mm/rmap.c | 9 ++++++--- >>>>> 1 file changed, 6 insertions(+), 3 deletions(-) >>>>> >>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>> index 2608c40dffad..d599a772e282 100644 >>>>> --- a/mm/rmap.c >>>>> +++ b/mm/rmap.c >>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>> enum rmap_level level) >>>>> { >>>>> atomic_t *mapped = &folio->_nr_pages_mapped; >>>>> - int last, nr = 0, nr_pmdmapped = 0; >>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; >>>>> enum node_stat_item idx; >>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); >>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, >>>>> break; >>>>> } >>>>> - atomic_sub(nr_pages, &folio->_large_mapcount); >>>>> + mapcount = atomic_sub_return(nr_pages, >>>>> + &folio->_large_mapcount) + 1; >>>> >>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here. >>> >>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped) >>> below, but to save an atomic op, I chose to read mapcount here. >> >> Some points: >> >> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic >> RMW that return a value -- and how they interact with memory barriers. >> Further, how relaxed variants are only optimized on some architectures. >> >> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory >> access that should not be refetched. Usually cheaper than most other stuff >> that involves atomics. >> >> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped) >> to figure out if the folio is now completely unmapped. >> >> (3) There is one fundamental issue: if we are not batch-unmapping the whole >> thing, we will still add the folios to the deferred split queue. Migration >> would still do that, or if there are multiple VMAs covering a folio. > > Maybe we can let rmap remove code not touch deferred split queue in > migration path at all because we know all the PTEs will be converted > to migration entries. And all the migration entries will be converted > back to PTEs regardless of whether try_to_migrate succeeded or not. It's would be just another bandaid I think :/ Maybe a worthwile one, not sure.
On Mon, Apr 15, 2024 at 8:40 AM David Hildenbrand <david@redhat.com> wrote: > > On 13.04.24 00:29, Yang Shi wrote: > > On Fri, Apr 12, 2024 at 2:06 PM Zi Yan <ziy@nvidia.com> wrote: > >> > >> On 12 Apr 2024, at 15:32, David Hildenbrand wrote: > >> > >>> On 12.04.24 16:35, Zi Yan wrote: > >>>> On 11 Apr 2024, at 11:46, David Hildenbrand wrote: > >>>> > >>>>> On 11.04.24 17:32, Zi Yan wrote: > >>>>>> From: Zi Yan <ziy@nvidia.com> > >>>>>> > >>>>>> In __folio_remove_rmap(), a large folio is added to deferred split list > >>>>>> if any page in a folio loses its final mapping. It is possible that > >>>>>> the folio is unmapped fully, but it is unnecessary to add the folio > >>>>>> to deferred split list at all. Fix it by checking folio mapcount before > >>>>>> adding a folio to deferred split list. > >>>>>> > >>>>>> Signed-off-by: Zi Yan <ziy@nvidia.com> > >>>>>> --- > >>>>>> mm/rmap.c | 9 ++++++--- > >>>>>> 1 file changed, 6 insertions(+), 3 deletions(-) > >>>>>> > >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c > >>>>>> index 2608c40dffad..d599a772e282 100644 > >>>>>> --- a/mm/rmap.c > >>>>>> +++ b/mm/rmap.c > >>>>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>>>>> enum rmap_level level) > >>>>>> { > >>>>>> atomic_t *mapped = &folio->_nr_pages_mapped; > >>>>>> - int last, nr = 0, nr_pmdmapped = 0; > >>>>>> + int last, nr = 0, nr_pmdmapped = 0, mapcount = 0; > >>>>>> enum node_stat_item idx; > >>>>>> __folio_rmap_sanity_checks(folio, page, nr_pages, level); > >>>>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>>>>> break; > >>>>>> } > >>>>>> - atomic_sub(nr_pages, &folio->_large_mapcount); > >>>>>> + mapcount = atomic_sub_return(nr_pages, > >>>>>> + &folio->_large_mapcount) + 1; > >>>>> > >>>>> That becomes a new memory barrier on some archs. Rather just re-read it below. Re-reading should be fine here. > >>>> > >>>> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped) > >>>> below, but to save an atomic op, I chose to read mapcount here. > >>> > >>> Some points: > >>> > >>> (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic > >>> RMW that return a value -- and how they interact with memory barriers. > >>> Further, how relaxed variants are only optimized on some architectures. > >>> > >>> atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory > >>> access that should not be refetched. Usually cheaper than most other stuff > >>> that involves atomics. > >> > >> I should have checked the actual implementation instead of being fooled > >> by the name. Will read about it. Thanks. > >> > >>> > >>> (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped) > >>> to figure out if the folio is now completely unmapped. > >>> > >>> (3) There is one fundamental issue: if we are not batch-unmapping the whole > >>> thing, we will still add the folios to the deferred split queue. Migration > >>> would still do that, or if there are multiple VMAs covering a folio. > >>> > >>> (4) We should really avoid making common operations slower only to make > >>> some unreliable stats less unreliable. 
> >>> > >>> > >>> We should likely do something like the following, which might even be a bit > >>> faster in some cases because we avoid a function call in case we unmap > >>> individual PTEs by checking _deferred_list ahead of time > >>> > >>> diff --git a/mm/rmap.c b/mm/rmap.c > >>> index 2608c40dffad..356598b3dc3c 100644 > >>> --- a/mm/rmap.c > >>> +++ b/mm/rmap.c > >>> @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, > >>> * page of the folio is unmapped and at least one page > >>> * is still mapped. > >>> */ > >>> - if (folio_test_large(folio) && folio_test_anon(folio)) > >>> - if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) > >>> - deferred_split_folio(folio); > >>> + if (folio_test_large(folio) && folio_test_anon(folio) && > >>> + (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) && > >>> + atomic_read(mapped) && > >>> + data_race(list_empty(&folio->_deferred_list))) > >> > >> data_race() might not be needed, as Ryan pointed out[1] > >> > >>> + deferred_split_folio(folio); > >>> } > >>> > >>> I also thought about handling the scenario where we unmap the whole > >>> think in smaller chunks. We could detect "!atomic_read(mapped)" and > >>> detect that it is on the deferred split list, and simply remove it > >>> from that list incrementing an THP_UNDO_DEFERRED_SPLIT_PAGE event. > >>> > >>> But it would be racy with concurrent remapping of the folio (might happen with > >>> anon folios in corner cases I guess). > >>> > >>> What we can do is the following, though: > >>> > >>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c > >>> index dc30139590e6..f05cba1807f2 100644 > >>> --- a/mm/huge_memory.c > >>> +++ b/mm/huge_memory.c > >>> @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio) > >>> ds_queue = get_deferred_split_queue(folio); > >>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags); > >>> if (!list_empty(&folio->_deferred_list)) { > >>> + if (folio_test_pmd_mappable(folio)) > >>> + count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE); > >>> ds_queue->split_queue_len--; > >>> list_del_init(&folio->_deferred_list); > >>> } > >>> > >>> Adding the right event of course. > >>> > >>> > >>> Then it's easy to filter out these "temporarily added to the list, but never split > >>> before the folio was freed" cases. > >> > >> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use > >> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE instead? That should work. > > > > It is definitely possible that the THP on the deferred split queue are > > freed instead of split. For example, 1M is unmapped for a 2M THP, then > > later the remaining 1M is unmapped, or the process exits before memory > > pressure happens. So how come we can tell it is "temporarily added to > > list"? Then THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE > > actually just counts how many pages are still on deferred split queue. > > Not quite I think. I don't think we have a counter that counts how many > large folios on the deferred list were split. I think we only have > THP_SPLIT_PAGE. Yes, we just count how many THP were split regardless of why they got split. They may be split from a deferred split queue due to memory pressure, migration, etc. > > We could have > * THP_DEFERRED_SPLIT_PAGE > * THP_UNDO_DEFERRED_SPLIT_PAGE > * THP_PERFORM_DEFERRED_SPLIT_PAGE > > Maybe that would catch more cases (not sure if all, though). Then, you > could tell how many are still on that list. 
THP_DEFERRED_SPLIT_PAGE - > THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE. > > That could give one a clearer picture how deferred split interacts with > actual splitting (possibly under memory pressure), the whole reason why > deferred splitting was added after all. I'm not quite sure whether there is a solid use case for that. If there is, we could consider it, but a simpler counter seems preferable. > > > It may be useful. However the counter is typically used to estimate > > how many THP are partially unmapped during a period of time. > > I'd say it's a bit of an abuse of that counter; well, or interpreting > something into the counter that that counter never reliably represented. It used to be far more reliable than it is now. > > I can easily write a program that keeps sending your counter to infinity > simply by triggering that behavior in a loop, so it's all a bit shaky. I don't doubt that, but let's get back to reality. The counter used to stay reasonable and reliable for most real-life workloads before mTHP. There could be over-counting, for example, when unmapping a PTE-mapped THP that was not on a deferred split queue before, but that case is not common in real-life workloads: the huge PMD usually has to be split by a partial unmap first, and that partial unmap already adds the THP to the deferred split queue. Now, however, something as common as plain process exit can send the counter toward infinity. > > Something like Ryans script makes more sense, where you get a clearer > picture of what's mapped where and how. Because that information can be > much more valuable than just knowing if it's mapped fully or partially > (again, relevant for handling with memory waste). Ryan's script is very helpful, but the counter has existed and been used for years, and it is a quick indicator that is much easier to monitor across a large-scale fleet. If we think the counter's reliability is not worth fixing, why don't we just remove it? No counter is better than a broken counter. > > -- > Cheers, > > David / dhildenb >
>> >> We could have >> * THP_DEFERRED_SPLIT_PAGE >> * THP_UNDO_DEFERRED_SPLIT_PAGE >> * THP_PERFORM_DEFERRED_SPLIT_PAGE >> >> Maybe that would catch more cases (not sure if all, though). Then, you >> could tell how many are still on that list. THP_DEFERRED_SPLIT_PAGE - >> THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE. >> >> That could give one a clearer picture how deferred split interacts with >> actual splitting (possibly under memory pressure), the whole reason why >> deferred splitting was added after all. > > I'm not quite sure whether there is a solid usecase or not. If we > have, we could consider this. But a simpler counter may be more > preferred. Yes. > >> >>> It may be useful. However the counter is typically used to estimate >>> how many THP are partially unmapped during a period of time. >> >> I'd say it's a bit of an abuse of that counter; well, or interpreting >> something into the counter that that counter never reliably represented. > > It was way more reliable than now. Correct me if I am wrong: now that we only adjust the counter for PMD-sized THP, it is as (un)reliable as it always was. Or was there another unintended change by some of my cleanups or previous patches? > >> >> I can easily write a program that keeps sending your counter to infinity >> simply by triggering that behavior in a loop, so it's all a bit shaky. > > I don't doubt that. But let's get back to reality. The counter used to > stay reasonable and reliable with most real life workloads before > mTHP. There may be over-counting, for example, when unmapping a > PTE-mapped THP which was not on a deferred split queue before. But > such a case is not common for real life workloads because the huge PMD > has to be split by partial unmap for most cases. And the partial unmap > will add the THP to deferred split queue. > > But now a common workload, for example, just process exit, may > probably send the counter to infinity. Agreed, that's stupid. > >> >> Something like Ryans script makes more sense, where you get a clearer >> picture of what's mapped where and how. Because that information can be >> much more valuable than just knowing if it's mapped fully or partially >> (again, relevant for handling with memory waste). > > Ryan's script is very helpful. But the counter has been existing and > used for years, and it is a quick indicator and much easier to monitor > in a large-scale fleet. > > If we think the reliability of the counter is not worth fixing, why > don't we just remove it. No counter is better than a broken counter. Again, is only counting the PMD-sized THPs "fixing" the old use cases? Then it should just stick around. And we can even optimize it for some more cases as proposed in this patch. But there is no easy way to "get it completely right" I'm afraid.
On Mon, Apr 15, 2024 at 12:19 PM David Hildenbrand <david@redhat.com> wrote:
>
> >>
> >> We could have
> >> * THP_DEFERRED_SPLIT_PAGE
> >> * THP_UNDO_DEFERRED_SPLIT_PAGE
> >> * THP_PERFORM_DEFERRED_SPLIT_PAGE
> >>
> >> Maybe that would catch more cases (not sure if all, though). Then, you
> >> could tell how many are still on that list: THP_DEFERRED_SPLIT_PAGE -
> >> THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE.
> >>
> >> That could give one a clearer picture of how deferred split interacts with
> >> actual splitting (possibly under memory pressure), the whole reason why
> >> deferred splitting was added after all.
> >
> > I'm not quite sure whether there is a solid use case or not. If we
> > have one, we could consider this. But a simpler counter may be
> > preferable.
>
> Yes.
>
> >
> >>
> >>> It may be useful. However the counter is typically used to estimate
> >>> how many THP are partially unmapped during a period of time.
> >>
> >> I'd say it's a bit of an abuse of that counter; well, or interpreting
> >> something into the counter that that counter never reliably represented.
> >
> > It was far more reliable than it is now.
>
> Correct me if I am wrong: now that we only adjust the counter for
> PMD-sized THP, it is as (un)reliable as it always was.

Yes. The problem introduced by mTHP was effectively worked around by
that commit.

> Or was there another unintended change from some of my cleanups or
> previous patches?

No, at least none that I have seen so far.

> >
> >>
> >> I can easily write a program that keeps sending your counter to infinity
> >> simply by triggering that behavior in a loop, so it's all a bit shaky.
> >
> > I don't doubt that. But let's get back to reality. The counter used to
> > stay reasonable and reliable with most real-life workloads before
> > mTHP. There may be over-counting, for example, when unmapping a
> > PTE-mapped THP which was not on a deferred split queue before. But
> > such a case is not common for real-life workloads, because in most
> > cases the huge PMD has to be split by a partial unmap, and that
> > partial unmap adds the THP to the deferred split queue.
> >
> > But now a common workload, for example a plain process exit, can
> > easily send the counter to infinity.
>
> Agreed, that's stupid.
>
> >
> >>
> >> Something like Ryan's script makes more sense, where you get a clearer
> >> picture of what's mapped where and how. Because that information can be
> >> much more valuable than just knowing if it's mapped fully or partially
> >> (again, relevant for handling memory waste).
> >
> > Ryan's script is very helpful. But the counter has existed and been
> > used for years, it is a quick indicator, and it is much easier to
> > monitor in a large-scale fleet.
> >
> > If we think the reliability of the counter is not worth fixing, why
> > don't we just remove it? No counter is better than a broken counter.
>
> Again, is only counting the PMD-sized THPs "fixing" the old use cases?

Yes.

> Then it should just stick around. And we can even optimize it for some
> more cases, as proposed in this patch. But there is no easy way to "get
> it completely right", I'm afraid.

I don't mean we should revert that "fixing"; my point is that we should
not rely on it, and we should make the rmap removal code behave more
reliably regardless of whether we only count PMD-sized THP.

>
> --
> Cheers,
>
> David / dhildenb
>
diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffad..d599a772e282 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 		enum rmap_level level)
 {
 	atomic_t *mapped = &folio->_nr_pages_mapped;
-	int last, nr = 0, nr_pmdmapped = 0;
+	int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
 	enum node_stat_item idx;
 
 	__folio_rmap_sanity_checks(folio, page, nr_pages, level);
@@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 			break;
 		}
 
-		atomic_sub(nr_pages, &folio->_large_mapcount);
+		mapcount = atomic_sub_return(nr_pages,
+				&folio->_large_mapcount) + 1;
 		do {
 			last = atomic_add_negative(-1, &page->_mapcount);
 			if (last) {
@@ -1554,7 +1555,9 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
 		 * is still mapped.
 		 */
 		if (folio_test_large(folio) && folio_test_anon(folio))
-			if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
+			if ((level == RMAP_LEVEL_PTE &&
+			     mapcount != 0) ||
+			    (level == RMAP_LEVEL_PMD && nr < nr_pmdmapped))
 				deferred_split_folio(folio);
 	}
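Read on its own, the condition installed by the last hunk amounts to the
following standalone restatement (a sketch, not kernel code; the helper and
parameter names are made up here): an anon large folio is only added to the
deferred split list if part of it remains mapped after this removal.

/*
 * Standalone restatement of the deferred-split decision above; names are
 * illustrative, not the kernel's.  The "+ 1" in the patch compensates for
 * _large_mapcount being stored off by one, like _mapcount.
 */
enum rmap_level { RMAP_LEVEL_PTE, RMAP_LEVEL_PMD };

static int should_defer_split(int large, int anon, enum rmap_level level,
			      int remaining_mapcount,	/* folio mapcount after this removal */
			      int nr_now_unmapped,	/* pages that just lost their last mapping */
			      int nr_pmdmapped)		/* folio pages, if a PMD mapping was removed */
{
	if (!large || !anon)
		return 0;
	if (level == RMAP_LEVEL_PTE)
		/* Some mappings of the folio remain. */
		return remaining_mapcount != 0;
	/* PMD removal: some pages of the folio are still PTE-mapped. */
	return nr_now_unmapped < nr_pmdmapped;
}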