diff mbox series

[hotfix,6.12] mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes

Message ID 20241024151228.101841-2-vbabka@suse.cz (mailing list archive)
State New
Series [hotfix,6.12] mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes

Commit Message

Vlastimil Babka Oct. 24, 2024, 3:12 p.m. UTC
Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries") a mmap() of anonymous memory without a specific address
hint and of at least PMD_SIZE will be aligned to PMD so that it can
benefit from a THP backing page.

However this change has been shown to regress some workloads
significantly. [1] reports regressions in various spec benchmarks, with
up to 600% slowdown of the cactusBSSN benchmark on some platforms. The
benchmark seems to create many mappings of 4632kB, which would have
merged into a large THP-backed area before commit efa7df3e3bb5 and now
they are fragmented into multiple areas, each aligned to a PMD boundary
with gaps between them. The regression then seems to be caused mainly
by the benchmark's memory access pattern suffering from TLB or cache
aliasing due to the aligned boundaries of the individual areas.

Another known regression bisected to commit efa7df3e3bb5 is darktable
[2] [3] and early testing suggests this patch fixes the regression there
as well.

To fix the regression but still try to benefit from THP-friendly
anonymous mapping alignment, add a condition that the size of the
mapping must be a multiple of PMD size instead of at least PMD size. In
the case of many odd-sized mappings like the ones cactusBSSN creates,
those will stop being aligned with gaps between them, and will instead
naturally merge again.

Reported-by: Michael Matz <matz@suse.de>
Debugged-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1]
Reported-by: Matthias Bodenbinder <matthias@bodenbinder.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2]
Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.info/ [3]
Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
Cc: <stable@vger.kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/mmap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Lorenzo Stoakes Oct. 24, 2024, 3:47 p.m. UTC | #1
On Thu, Oct 24, 2024 at 05:12:29PM +0200, Vlastimil Babka wrote:
> Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
> boundaries") a mmap() of anonymous memory without a specific address
> hint and of at least PMD_SIZE will be aligned to PMD so that it can
> benefit from a THP backing page.
>
> However this change has been shown to regress some workloads
> significantly. [1] reports regressions in various spec benchmarks, with
> up to 600% slowdown of the cactusBSSN benchmark on some platforms. The

Ugh god.

> benchmark seems to create many mappings of 4632kB, which would have
> merged to a large THP-backed area before commit efa7df3e3bb5 and now
> they are fragmented to multiple areas each aligned to PMD boundary with
> gaps between. The regression then seems to be caused mainly due to the
> benchmark's memory access pattern suffering from TLB or cache aliasing
> due to the aligned boundaries of the individual areas.

Any more details on precisely why?

>
> Another known regression bisected to commit efa7df3e3bb5 is darktable
> [2] [3] and early testing suggests this patch fixes the regression there
> as well.

Good!

>
> To fix the regression but still try to benefit from THP-friendly
> anonymous mapping alignment, add a condition that the size of the
> mapping must be a multiple of PMD size instead of at least PMD size. In
> case of many odd-sized mapping like the cactusBSSN creates, those will
> stop being aligned and with gaps between, and instead naturally merge
> again.
>

Seems like the original logic just padded the length by PMD size and
checked for overflow, assuming that [pgoff << PAGE_SHIFT, (pgoff <<
PAGE_SHIFT) + len) contains at least one PMD-sized block.

Which I guess results in potentially getting mis-sized empty spaces that
now can't be PMD-merged at the bits that 'overhang' the PMD-sized/aligned
bit?

Which is yeah, not great and would explain this (correct me if my
understanding is wrong).

> Reported-by: Michael Matz <matz@suse.de>
> Debugged-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
> Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1]
> Reported-by: Matthias Bodenbinder <matthias@bodenbinder.de>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2]
> Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.info/ [3]
> Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
> Cc: <stable@vger.kernel.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Yang Shi <yang@os.amperecomputing.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/mmap.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 9c0fb43064b5..a5297cfb1dfc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>
>  	if (get_area) {
>  		addr = get_area(file, addr, len, pgoff, flags);
> -	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> +	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
> +		   && IS_ALIGNED(len, PMD_SIZE)) {

So doing this feels right but...

Hm this seems like it belongs in __thp_get_unmapped_area() which does a bunch of
checks up front returning 0 if they fail, which then results in it performing the
normal get unmapped area logic.

That also has a bunch of (offset) alignment checks as well as overflow checks,
so it would seem the natural place to also check length?

>  		/* Ensures that larger anonymous mappings are THP aligned. */
>  		addr = thp_get_unmapped_area_vmflags(file, addr, len,
>  						     pgoff, flags, vm_flags);
> --
> 2.47.0
>
Lorenzo Stoakes Oct. 24, 2024, 4 p.m. UTC | #2
On Thu, Oct 24, 2024 at 04:47:54PM +0100, Lorenzo Stoakes wrote:
[snip]

> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 9c0fb43064b5..a5297cfb1dfc 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
> >
> >  	if (get_area) {
> >  		addr = get_area(file, addr, len, pgoff, flags);
> > -	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> > +	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
> > +		   && IS_ALIGNED(len, PMD_SIZE)) {
>
> So doing this feels right but...
>
> Hm this seems like it belongs in __thp_get_unmapped_area() which does a bunch of
> checks up front returning 0 if they fail, which then results in it peforming the
> normal get unmapped area logic.
>
> That also has a bunch of (offset) alignment checks as well overflow checks
> so it would seem the natural place to also check length?
>

OK having said that, I see this function is referenced from a bunch of fs
stuff we probably don't want to potentially break by enforcing this
requirement there (at least in this fix).

So disregard that and since this looks otherwise good to me, feel free to add:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>


> >  		/* Ensures that larger anonymous mappings are THP aligned. */
> >  		addr = thp_get_unmapped_area_vmflags(file, addr, len,
> >  						     pgoff, flags, vm_flags);
> > --
> > 2.47.0
> >

Thanks!
Vlastimil Babka Oct. 24, 2024, 4:04 p.m. UTC | #3
On 10/24/24 17:47, Lorenzo Stoakes wrote:
> On Thu, Oct 24, 2024 at 05:12:29PM +0200, Vlastimil Babka wrote:
>> Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
>> boundaries") a mmap() of anonymous memory without a specific address
>> hint and of at least PMD_SIZE will be aligned to PMD so that it can
>> benefit from a THP backing page.
>>
>> However this change has been shown to regress some workloads
>> significantly. [1] reports regressions in various spec benchmarks, with
>> up to 600% slowdown of the cactusBSSN benchmark on some platforms. The
> 
> Ugh god.
> 
>> benchmark seems to create many mappings of 4632kB, which would have
>> merged to a large THP-backed area before commit efa7df3e3bb5 and now
>> they are fragmented to multiple areas each aligned to PMD boundary with
>> gaps between. The regression then seems to be caused mainly due to the
>> benchmark's memory access pattern suffering from TLB or cache aliasing
>> due to the aligned boundaries of the individual areas.
> 
> Any more details on precisely why?

The experiments performed in [1] didn't seem conclusive enough for me to say
that with enough confidence :) Generally speaking, if there are multiple
addresses with the same virtual or physical offset accessed rapidly, they
can alias in the TLB or processor caches due to limited associativity and
cause thrashing. Aligning the mappings to the same 2MB boundary can cause
such aliasing.

>>
>> Another known regression bisected to commit efa7df3e3bb5 is darktable
>> [2] [3] and early testing suggests this patch fixes the regression there
>> as well.
> 
> Good!
> 
>>
>> To fix the regression but still try to benefit from THP-friendly
>> anonymous mapping alignment, add a condition that the size of the
>> mapping must be a multiple of PMD size instead of at least PMD size. In
>> case of many odd-sized mapping like the cactusBSSN creates, those will
>> stop being aligned and with gaps between, and instead naturally merge
>> again.
>>
> 
> Seems like the original logic just padded the length by PMD size and checks
> for overflow, assuming that [pgoff << PAGE_SHIFT, pgoff << PAGE_SHIFT +
> len) contains at least one PMD-sized block.
> 
> Which I guess results in potentially getting mis-sized empty spaces that
> now can't be PMD-merged at the bits that 'overhang' the PMD-sized/aligned
> bit?
> 
> Which is yeah, not great and would explain this (correct me if my
> understanding is wrong).
> 
>> Reported-by: Michael Matz <matz@suse.de>
>> Debugged-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
>> Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1]
>> Reported-by: Matthias Bodenbinder <matthias@bodenbinder.de>
>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2]
>> Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.info/ [3]
>> Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
>> Cc: <stable@vger.kernel.org>
>> Cc: Rik van Riel <riel@surriel.com>
>> Cc: Yang Shi <yang@os.amperecomputing.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>  mm/mmap.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 9c0fb43064b5..a5297cfb1dfc 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>>
>>  	if (get_area) {
>>  		addr = get_area(file, addr, len, pgoff, flags);
>> -	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> +	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
>> +		   && IS_ALIGNED(len, PMD_SIZE)) {
> 
> So doing this feels right but...
> 
> Hm this seems like it belongs in __thp_get_unmapped_area() which does a bunch of
> checks up front returning 0 if they fail, which then results in it peforming the
> normal get unmapped area logic.
> 
> That also has a bunch of (offset) alignment checks as well overflow checks
> so it would seem the natural place to also check length?

Petr suggested the same, but changing __thp_get_unmapped_area() affects FS
THPs and the proposed check seemed wrong to me:

https://lore.kernel.org/all/9d7c73f6-1e1a-458b-93c6-3b44959022e0@suse.cz/

While it could be fixed, I'm still not sure if we want to restrict FS THPs
the same as anonymous THPs. AFAIU even small mappings of a range from a file
should be aligned properly to make it possible for a large range from the
same file (that includes the smaller range) mapped elsewhere to be THP
backed? I mean we can investigate it further, but for a regression fix to
be backported to stable kernels it seemed safer to address only the case
that was changed by commit efa7df3e3bb5 specifically, i.e. anonymous mappings.

>>  		/* Ensures that larger anonymous mappings are THP aligned. */
>>  		addr = thp_get_unmapped_area_vmflags(file, addr, len,
>>  						     pgoff, flags, vm_flags);
>> --
>> 2.47.0
>>
Lorenzo Stoakes Oct. 24, 2024, 4:17 p.m. UTC | #4
On Thu, Oct 24, 2024 at 06:04:41PM +0200, Vlastimil Babka wrote:

> Petr suggested the same, but changing  __thp_get_unmapped_area() affects FS
> THP's and the proposed check seemed wrong to me:
>
> https://lore.kernel.org/all/9d7c73f6-1e1a-458b-93c6-3b44959022e0@suse.cz/
>
> While it could be fixed, I'm still not sure if we want to restrict FS THPs
> the same as anonymous THPs. AFAIU even small mappings of a range from a file
> should be aligned properly to make it possible for a large range from the
> same file (that includes the smaller range) mapped elsewhere to be THP
> backed? I mean we can investigate it further, but for the regression fix to
> backported to stable kernels it seemed more safe to address only the case
> that was changed by commit efa7df3e3bb5 specifically, i.e. anonymous mappings.
>

Ack, yeah totally agreed - sorry I missed the fs usage before, see my 2nd
reply. I had wrongly assumed this was only used in 1 place, where it would
be sensible to move the check, however with fs using it of course it's not.

Gave an R-b tag on other reply so this patch LGTM! :)

Cheers for finding this utterly critical fix!
Yang Shi Oct. 24, 2024, 6:32 p.m. UTC | #5
On 10/24/24 8:12 AM, Vlastimil Babka wrote:
> Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
> boundaries") a mmap() of anonymous memory without a specific address
> hint and of at least PMD_SIZE will be aligned to PMD so that it can
> benefit from a THP backing page.
>
> However this change has been shown to regress some workloads
> significantly. [1] reports regressions in various spec benchmarks, with
> up to 600% slowdown of the cactusBSSN benchmark on some platforms. The
> benchmark seems to create many mappings of 4632kB, which would have
> merged to a large THP-backed area before commit efa7df3e3bb5 and now
> they are fragmented to multiple areas each aligned to PMD boundary with
> gaps between. The regression then seems to be caused mainly due to the
> benchmark's memory access pattern suffering from TLB or cache aliasing
> due to the aligned boundaries of the individual areas.
>
> Another known regression bisected to commit efa7df3e3bb5 is darktable
> [2] [3] and early testing suggests this patch fixes the regression there
> as well.
>
> To fix the regression but still try to benefit from THP-friendly
> anonymous mapping alignment, add a condition that the size of the
> mapping must be a multiple of PMD size instead of at least PMD size. In
> case of many odd-sized mapping like the cactusBSSN creates, those will
> stop being aligned and with gaps between, and instead naturally merge
> again.

Thanks for debugging this. The fix makes sense to me.

Reviewed-by: Yang Shi <yang@os.amperecomputing.com>

>
> Reported-by: Michael Matz <matz@suse.de>
> Debugged-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
> Closes: https://bugzilla.suse.com/show_bug.cgi?id=1229012 [1]
> Reported-by: Matthias Bodenbinder <matthias@bodenbinder.de>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219366 [2]
> Closes: https://lore.kernel.org/all/2050f0d4-57b0-481d-bab8-05e8d48fed0c@leemhuis.info/ [3]
> Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
> Cc: <stable@vger.kernel.org>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Yang Shi <yang@os.amperecomputing.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>   mm/mmap.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 9c0fb43064b5..a5297cfb1dfc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -900,7 +900,8 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>   
>   	if (get_area) {
>   		addr = get_area(file, addr, len, pgoff, flags);
> -	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> +	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
> +		   && IS_ALIGNED(len, PMD_SIZE)) {
>   		/* Ensures that larger anonymous mappings are THP aligned. */
>   		addr = thp_get_unmapped_area_vmflags(file, addr, len,
>   						     pgoff, flags, vm_flags);
Michael Matz Oct. 28, 2024, 1:45 p.m. UTC | #6
Hello,

On Thu, 24 Oct 2024, Lorenzo Stoakes wrote:

> > benchmark seems to create many mappings of 4632kB, which would have
> > merged to a large THP-backed area before commit efa7df3e3bb5 and now
> > they are fragmented to multiple areas each aligned to PMD boundary with
> > gaps between. The regression then seems to be caused mainly due to the
> > benchmark's memory access pattern suffering from TLB or cache aliasing
> > due to the aligned boundaries of the individual areas.
> 
> Any more details on precisely why?

Anything we found out and theorized about is in the suse bug report.  I 
think the best theory is TLB aliasing when the mixing^Whash function in 
the given hardware uses too few bits, and most of them in the low 21-12 
bits of an address.  Of course that then still depends on the particular 
access pattern.  cactusBSSN has about 20 memory streams in the hot loops, 
and the accesses are fairly regular from step to step (plus/minus certain 
strides in 3D arrays).  When their start addresses all differ only in the 
upper bits, you will hit TLB aliasing from time to time, and when the 
dimensions/strides are just right it occurs often, the N-way associativity 
doesn't save you anymore and you will hit it very very hard.

It was interesting to see how broad the range of CPUs and vendors was that 
exhibited the problem (in various degrees of severity, from 50% to 600% 
slowdown), and how more recent CPUs don't show the symptom anymore.  I 
guess the micro-arch guys eventually convinced P&R management that hashing 
another bit or two is worth the silicon :-)


Ciao,
Michael.

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index 9c0fb43064b5..a5297cfb1dfc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -900,7 +900,8 @@  __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 
 	if (get_area) {
 		addr = get_area(file, addr, len, pgoff, flags);
-	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
+		   && IS_ALIGNED(len, PMD_SIZE)) {
 		/* Ensures that larger anonymous mappings are THP aligned. */
 		addr = thp_get_unmapped_area_vmflags(file, addr, len,
 						     pgoff, flags, vm_flags);