diff mbox series

mm: Do not boost watermarks to avoid fragmentation for the DISCONTIG memory model

Message ID 20190419094335.GJ18914@techsingularity.net (mailing list archive)
State Awaiting Upstream
Headers show
Series mm: Do not boost watermarks to avoid fragmentation for the DISCONTIG memory model | expand

Commit Message

Mel Gorman April 19, 2019, 9:43 a.m. UTC
Mikulas Patocka reported that 1c30844d2dfe ("mm: reclaim small amounts
of memory when an external fragmentation event occurs") "broke" memory
management on parisc. The machine is not NUMA but the DISCONTIG model
creates three pgdats even though it's a UMA machine for the following
ranges

        0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
        1) Start 0x0000000100000000 End 0x00000001bfdfffff Size   3070 MB
        2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB

From his own report

	With the patch 1c30844d2, the kernel will incorrectly reclaim the
	first zone when it fills up, ignoring the fact that there are two
	completely free zones. Basiscally, it limits cache size to 1GiB.

	For example, if I run:
	# dd if=/dev/sda of=/dev/null bs=1M count=2048

	- with the proper kernel, there should be "Buffers - 2GiB"
	when this command finishes. With the patch 1c30844d2, buffers
	will consume just 1GiB or slightly more, because the kernel was
	incorrectly reclaiming them.

The page allocator and reclaim makes assumptions that pgdats really
represent NUMA nodes and zones represent ranges and makes decisions
on that basis. Watermark boosting for small pgdats leads to unexpected
results even though this would have behaved reasonably on SPARSEMEM.

DISCONTIG is essentially deprecated and even parisc plans to move to
SPARSEMEM so there is no need to be fancy, this patch simply disables
watermark boosting by default on DISCONTIGMEM.

Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 Documentation/sysctl/vm.txt | 16 ++++++++--------
 mm/page_alloc.c             | 13 +++++++++++++
 2 files changed, 21 insertions(+), 8 deletions(-)

Comments

Matthew Wilcox (Oracle) April 19, 2019, 2:05 p.m. UTC | #1
On Fri, Apr 19, 2019 at 10:43:35AM +0100, Mel Gorman wrote:
> DISCONTIG is essentially deprecated and even parisc plans to move to
> SPARSEMEM so there is no need to be fancy, this patch simply disables
> watermark boosting by default on DISCONTIGMEM.

I don't think parisc is the only arch which uses DISCONTIGMEM for !NUMA
scenarios.  Grepping the arch/ directories shows:

alpha (does support NUMA, but also non-NUMA DISCONTIGMEM)
arc (for supporting more than 1GB of memory)
ia64 (looks complicated ...)
m68k (for multiple chunks of memory)
mips (does support NUMA but also non-NUMA)
parisc (both NUMA and non-NUMA)

I'm not sure that these architecture maintainers even know that DISCONTIGMEM
is deprecated.  Adding linux-arch to the cc.
Mel Gorman April 19, 2019, 2:28 p.m. UTC | #2
On Fri, Apr 19, 2019 at 07:05:21AM -0700, Matthew Wilcox wrote:
> On Fri, Apr 19, 2019 at 10:43:35AM +0100, Mel Gorman wrote:
> > DISCONTIG is essentially deprecated and even parisc plans to move to
> > SPARSEMEM so there is no need to be fancy, this patch simply disables
> > watermark boosting by default on DISCONTIGMEM.
> 
> I don't think parisc is the only arch which uses DISCONTIGMEM for !NUMA
> scenarios.  Grepping the arch/ directories shows:
> 
> alpha (does support NUMA, but also non-NUMA DISCONTIGMEM)
> arc (for supporting more than 1GB of memory)
> ia64 (looks complicated ...)
> m68k (for multiple chunks of memory)
> mips (does support NUMA but also non-NUMA)
> parisc (both NUMA and non-NUMA)
> 
> I'm not sure that these architecture maintainers even know that DISCONTIGMEM
> is deprecated.  Adding linux-arch to the cc.

Poor wording then -- yes, DISCONTIGMEM is still used but look where it's
used. I find it impossible to believe that any new arch would support
DISCONTIGMEM or that DISCONTIGMEM would be selected when SPARSEMEM is
available.`It's even more insane when you consider that SPARSEMEM can be
extended to support VMEMMAP so that it has similar overhead to FLATMEM
when mapping pfns to struct pages and vice-versa.
Helge Deller April 19, 2019, 8:08 p.m. UTC | #3
On 19.04.19 16:28, Mel Gorman wrote:
> On Fri, Apr 19, 2019 at 07:05:21AM -0700, Matthew Wilcox wrote:
>> On Fri, Apr 19, 2019 at 10:43:35AM +0100, Mel Gorman wrote:
>>> DISCONTIG is essentially deprecated and even parisc plans to move to
>>> SPARSEMEM so there is no need to be fancy, this patch simply disables
>>> watermark boosting by default on DISCONTIGMEM.
>>
>> I don't think parisc is the only arch which uses DISCONTIGMEM for !NUMA
>> scenarios.  Grepping the arch/ directories shows:
>>
>> alpha (does support NUMA, but also non-NUMA DISCONTIGMEM)
>> arc (for supporting more than 1GB of memory)
>> ia64 (looks complicated ...)
>> m68k (for multiple chunks of memory)
>> mips (does support NUMA but also non-NUMA)
>> parisc (both NUMA and non-NUMA)
>>
>> I'm not sure that these architecture maintainers even know that DISCONTIGMEM
>> is deprecated.  Adding linux-arch to the cc.
>
> Poor wording then -- yes, DISCONTIGMEM is still used but look where it's
> used. I find it impossible to believe that any new arch would support
> DISCONTIGMEM or that DISCONTIGMEM would be selected when SPARSEMEM is
> available.`It's even more insane when you consider that SPARSEMEM can be
> extended to support VMEMMAP so that it has similar overhead to FLATMEM
> when mapping pfns to struct pages and vice-versa.

FYI, on parisc we will switch from DISCONTIGMEM to SPARSEMEM with kernel 5.2.
The patch was quite simple and it's currently in the for-next tree:
https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git/commit/?h=for-next&id=281b718721a5e78288271d632731cea9697749f7

Helge
Mike Rapoport April 21, 2019, 6:38 a.m. UTC | #4
On Fri, Apr 19, 2019 at 07:05:21AM -0700, Matthew Wilcox wrote:
> On Fri, Apr 19, 2019 at 10:43:35AM +0100, Mel Gorman wrote:
> > DISCONTIG is essentially deprecated and even parisc plans to move to
> > SPARSEMEM so there is no need to be fancy, this patch simply disables
> > watermark boosting by default on DISCONTIGMEM.
> 
> I don't think parisc is the only arch which uses DISCONTIGMEM for !NUMA
> scenarios.  Grepping the arch/ directories shows:
> 
> alpha (does support NUMA, but also non-NUMA DISCONTIGMEM)
> arc (for supporting more than 1GB of memory)
> ia64 (looks complicated ...)
> m68k (for multiple chunks of memory)
> mips (does support NUMA but also non-NUMA)
> parisc (both NUMA and non-NUMA)

i386 NUMA as well
 
> I'm not sure that these architecture maintainers even know that DISCONTIGMEM
> is deprecated.  Adding linux-arch to the cc.
>
Matthew Wilcox (Oracle) April 21, 2019, 1:26 p.m. UTC | #5
On Sun, Apr 21, 2019 at 09:38:59AM +0300, Mike Rapoport wrote:
> On Fri, Apr 19, 2019 at 07:05:21AM -0700, Matthew Wilcox wrote:
> > On Fri, Apr 19, 2019 at 10:43:35AM +0100, Mel Gorman wrote:
> > > DISCONTIG is essentially deprecated and even parisc plans to move to
> > > SPARSEMEM so there is no need to be fancy, this patch simply disables
> > > watermark boosting by default on DISCONTIGMEM.
> > 
> > I don't think parisc is the only arch which uses DISCONTIGMEM for !NUMA
> > scenarios.  Grepping the arch/ directories shows:
> > 
> > alpha (does support NUMA, but also non-NUMA DISCONTIGMEM)
> > arc (for supporting more than 1GB of memory)
> > ia64 (looks complicated ...)
> > m68k (for multiple chunks of memory)
> > mips (does support NUMA but also non-NUMA)
> > parisc (both NUMA and non-NUMA)
> 
> i386 NUMA as well

I clearly over-trimmed.  The original assumption that Mel had was that
DISCONTIGMEM => NUMA, and that's not true on the above six architectures.
It is true on i386 ;-)
Mel Gorman April 21, 2019, 9:16 p.m. UTC | #6
On Sun, Apr 21, 2019 at 06:26:07AM -0700, Matthew Wilcox wrote:
> On Sun, Apr 21, 2019 at 09:38:59AM +0300, Mike Rapoport wrote:
> > On Fri, Apr 19, 2019 at 07:05:21AM -0700, Matthew Wilcox wrote:
> > > On Fri, Apr 19, 2019 at 10:43:35AM +0100, Mel Gorman wrote:
> > > > DISCONTIG is essentially deprecated and even parisc plans to move to
> > > > SPARSEMEM so there is no need to be fancy, this patch simply disables
> > > > watermark boosting by default on DISCONTIGMEM.
> > > 
> > > I don't think parisc is the only arch which uses DISCONTIGMEM for !NUMA
> > > scenarios.  Grepping the arch/ directories shows:
> > > 
> > > alpha (does support NUMA, but also non-NUMA DISCONTIGMEM)
> > > arc (for supporting more than 1GB of memory)
> > > ia64 (looks complicated ...)
> > > m68k (for multiple chunks of memory)
> > > mips (does support NUMA but also non-NUMA)
> > > parisc (both NUMA and non-NUMA)
> > 
> > i386 NUMA as well
> 
> I clearly over-trimmed.  The original assumption that Mel had was that
> DISCONTIGMEM => NUMA, and that's not true on the above six architectures.
> It is true on i386 ;-)

32-bit NUMA systems should be non-existent in practice. The last NUMA
system I'm aware of that was both NUMA and 32-bit only died somewhere
between 2004 and 2007. If someone is running a 64-bit capable system in
32-bit mode with NUMA, they really are just punishing themselves for fun.
Christoph Lameter (Ampere) April 22, 2019, 5:29 p.m. UTC | #7
On Fri, 19 Apr 2019, Matthew Wilcox wrote:

> ia64 (looks complicated ...)

Well as far as I can tell it was not even used 12 or so years ago on
Itanium when I worked on that stuff.
Vlastimil Babka April 23, 2019, 6:33 a.m. UTC | #8
On 4/19/19 11:43 AM, Mel Gorman wrote:
> Mikulas Patocka reported that 1c30844d2dfe ("mm: reclaim small amounts
> of memory when an external fragmentation event occurs") "broke" memory
> management on parisc. The machine is not NUMA but the DISCONTIG model
> creates three pgdats even though it's a UMA machine for the following
> ranges
> 
>         0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
>         1) Start 0x0000000100000000 End 0x00000001bfdfffff Size   3070 MB
>         2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB
> 
> From his own report
> 
> 	With the patch 1c30844d2, the kernel will incorrectly reclaim the
> 	first zone when it fills up, ignoring the fact that there are two
> 	completely free zones. Basiscally, it limits cache size to 1GiB.
> 
> 	For example, if I run:
> 	# dd if=/dev/sda of=/dev/null bs=1M count=2048
> 
> 	- with the proper kernel, there should be "Buffers - 2GiB"
> 	when this command finishes. With the patch 1c30844d2, buffers
> 	will consume just 1GiB or slightly more, because the kernel was
> 	incorrectly reclaiming them.
> 
> The page allocator and reclaim makes assumptions that pgdats really
> represent NUMA nodes and zones represent ranges and makes decisions
> on that basis. Watermark boosting for small pgdats leads to unexpected
> results even though this would have behaved reasonably on SPARSEMEM.
> 
> DISCONTIG is essentially deprecated and even parisc plans to move to
> SPARSEMEM so there is no need to be fancy, this patch simply disables
> watermark boosting by default on DISCONTIGMEM.
> 
> Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
> Reported-by: Mikulas Patocka <mpatocka@redhat.com>
> Tested-by: Mikulas Patocka <mpatocka@redhat.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  Documentation/sysctl/vm.txt | 16 ++++++++--------
>  mm/page_alloc.c             | 13 +++++++++++++
>  2 files changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 6af24cdb25cc..3f13d8599337 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -866,14 +866,14 @@ The intent is that compaction has less work to do in the future and to
>  increase the success rate of future high-order allocations such as SLUB
>  allocations, THP and hugetlbfs pages.
>  
> -To make it sensible with respect to the watermark_scale_factor parameter,
> -the unit is in fractions of 10,000. The default value of 15,000 means
> -that up to 150% of the high watermark will be reclaimed in the event of
> -a pageblock being mixed due to fragmentation. The level of reclaim is
> -determined by the number of fragmentation events that occurred in the
> -recent past. If this value is smaller than a pageblock then a pageblocks
> -worth of pages will be reclaimed (e.g.  2MB on 64-bit x86). A boost factor
> -of 0 will disable the feature.
> +To make it sensible with respect to the watermark_scale_factor
> +parameter, the unit is in fractions of 10,000. The default value of
> +15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
> +watermark will be reclaimed in the event of a pageblock being mixed due
> +to fragmentation. The level of reclaim is determined by the number of
> +fragmentation events that occurred in the recent past. If this value is
> +smaller than a pageblock then a pageblocks worth of pages will be reclaimed
> +(e.g.  2MB on 64-bit x86). A boost factor of 0 will disable the feature.
>  
>  =============================================================
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cfaba3889fa2..86c3806f1070 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -266,7 +266,20 @@ compound_page_dtor * const compound_page_dtors[] = {
>  
>  int min_free_kbytes = 1024;
>  int user_min_free_kbytes = -1;
> +#ifdef CONFIG_DISCONTIGMEM
> +/*
> + * DiscontigMem defines memory ranges as separate pg_data_t even if the ranges
> + * are not on separate NUMA nodes. Functionally this works but with
> + * watermark_boost_factor, it can reclaim prematurely as the ranges can be
> + * quite small. By default, do not boost watermarks on discontigmem as in
> + * many cases very high-order allocations like THP are likely to be
> + * unsupported and the premature reclaim offsets the advantage of long-term
> + * fragmentation avoidance.
> + */
> +int watermark_boost_factor __read_mostly;
> +#else
>  int watermark_boost_factor __read_mostly = 15000;
> +#endif
>  int watermark_scale_factor = 10;
>  
>  static unsigned long nr_kernel_pages __initdata;
>
Christoph Hellwig April 23, 2019, 7:13 a.m. UTC | #9
On Sun, Apr 21, 2019 at 10:16:04PM +0100, Mel Gorman wrote:
> 32-bit NUMA systems should be non-existent in practice. The last NUMA
> system I'm aware of that was both NUMA and 32-bit only died somewhere
> between 2004 and 2007. If someone is running a 64-bit capable system in
> 32-bit mode with NUMA, they really are just punishing themselves for fun.

Can we mark it as BROKEN to see if someone shouts and then remove it
a year or two down the road?  Or just kill it off now..
Meelis Roos April 23, 2019, 4:49 p.m. UTC | #10
>> ia64 (looks complicated ...)
> 
> Well as far as I can tell it was not even used 12 or so years ago on
> Itanium when I worked on that stuff.

My notes tell that on UP ia64 (RX2620), !NUMA was broken with both
SPARSEMEM and DISCONTIGMEM. NUMA+SPARSEMEM or !NUMA worked. Even
NUMA+DISCONTIGMEM worked, that was my config on 2-CPU RX2660.
Mike Rapoport April 24, 2019, 11:33 a.m. UTC | #11
On Tue, Apr 23, 2019 at 12:13:54AM -0700, Christoph Hellwig wrote:
> On Sun, Apr 21, 2019 at 10:16:04PM +0100, Mel Gorman wrote:
> > 32-bit NUMA systems should be non-existent in practice. The last NUMA
> > system I'm aware of that was both NUMA and 32-bit only died somewhere
> > between 2004 and 2007. If someone is running a 64-bit capable system in
> > 32-bit mode with NUMA, they really are just punishing themselves for fun.
> 
> Can we mark it as BROKEN to see if someone shouts and then remove it
> a year or two down the road?  Or just kill it off now..

How about making SPARSEMEM default for x86-32?

From ac2dc27414e26f799ea063fd1d01e19d70056f43 Mon Sep 17 00:00:00 2001
From: Mike Rapoport <rppt@linux.ibm.com>
Date: Wed, 24 Apr 2019 14:32:12 +0300
Subject: [PATCH] x86/Kconfig: make SPARSEMEM default for X86_32

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/x86/Kconfig | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 62fc3fd..77b17af 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1609,10 +1609,6 @@ config ARCH_DISCONTIGMEM_ENABLE
 	def_bool y
 	depends on NUMA && X86_32
 
-config ARCH_DISCONTIGMEM_DEFAULT
-	def_bool y
-	depends on NUMA && X86_32
-
 config ARCH_SPARSEMEM_ENABLE
 	def_bool y
 	depends on X86_64 || NUMA || X86_32 || X86_32_NON_STANDARD
@@ -1621,7 +1617,7 @@ config ARCH_SPARSEMEM_ENABLE
 
 config ARCH_SPARSEMEM_DEFAULT
 	def_bool y
-	depends on X86_64
+	depends on X86_64 || (NUMA && X86_32)
 
 config ARCH_SELECT_MEMORY_MODEL
 	def_bool y
Mel Gorman April 24, 2019, 12:21 p.m. UTC | #12
On Wed, Apr 24, 2019 at 02:33:53PM +0300, Mike Rapoport wrote:
> On Tue, Apr 23, 2019 at 12:13:54AM -0700, Christoph Hellwig wrote:
> > On Sun, Apr 21, 2019 at 10:16:04PM +0100, Mel Gorman wrote:
> > > 32-bit NUMA systems should be non-existent in practice. The last NUMA
> > > system I'm aware of that was both NUMA and 32-bit only died somewhere
> > > between 2004 and 2007. If someone is running a 64-bit capable system in
> > > 32-bit mode with NUMA, they really are just punishing themselves for fun.
> > 
> > Can we mark it as BROKEN to see if someone shouts and then remove it
> > a year or two down the road?  Or just kill it off now..
> 
> How about making SPARSEMEM default for x86-32?
> 

While an improvement, I tend to agree with Christoph that marking it
BROKEN as a patch on top of this makes sense and wait to see who, if
anyone, screams. If it's quiet for long enough then we can remove it
entirely.
Christoph Hellwig April 28, 2019, 8:11 a.m. UTC | #13
On Wed, Apr 24, 2019 at 02:33:53PM +0300, Mike Rapoport wrote:
> On Tue, Apr 23, 2019 at 12:13:54AM -0700, Christoph Hellwig wrote:
> > On Sun, Apr 21, 2019 at 10:16:04PM +0100, Mel Gorman wrote:
> > > 32-bit NUMA systems should be non-existent in practice. The last NUMA
> > > system I'm aware of that was both NUMA and 32-bit only died somewhere
> > > between 2004 and 2007. If someone is running a 64-bit capable system in
> > > 32-bit mode with NUMA, they really are just punishing themselves for fun.
> > 
> > Can we mark it as BROKEN to see if someone shouts and then remove it
> > a year or two down the road?  Or just kill it off now..
> 
> How about making SPARSEMEM default for x86-32?

Sounds good.

Another question:  I always found the option to even select the memory
models like a bad tradeoff.  Can we really expect a user to make a sane
choice?  I'd rather stick to a relativelty optimal choice based on arch
and maybe a few other parameters (NUMA or not for example) and stick to
it, reducing the testing matrix.
Christoph Hellwig April 28, 2019, 8:13 a.m. UTC | #14
On Tue, Apr 23, 2019 at 07:49:57PM +0300, Meelis Roos wrote:
> > > ia64 (looks complicated ...)
> > 
> > Well as far as I can tell it was not even used 12 or so years ago on
> > Itanium when I worked on that stuff.
> 
> My notes tell that on UP ia64 (RX2620), !NUMA was broken with both
> SPARSEMEM and DISCONTIGMEM. NUMA+SPARSEMEM or !NUMA worked. Even
> NUMA+DISCONTIGMEM worked, that was my config on 2-CPU RX2660.

ia64 has a such a huge number of memory model choices.  Maybe we
need to cut it down to a small set that actually work.

That includes fund bits like the 'VIRTUAL_MEM_MAP' option where the
comment claims:

# VIRTUAL_MEM_MAP and FLAT_NODE_MEM_MAP are functionally equivalent.
# VIRTUAL_MEM_MAP has been retained for historical reasons.

but it still is selected as the default if sparsemem is not enabled..
Tony Luck April 29, 2019, 4:58 p.m. UTC | #15
> ia64 has a such a huge number of memory model choices.  Maybe we
> need to cut it down to a small set that actually work.

SGI systems had extremely discontiguous memory (they used some high
order physical address bits in the tens/hundreds of terabyte range for the
node number ... so there would be a few GBytes of actual memory then
a huge gap before the next node had a few more Gbytes).

I don't know of anyone still booting upstream on an SN2, so if we start doing
serious hack and slash the chances are high that SN2 will be broken (if it isn't
already).

-Tony
Christoph Hellwig April 29, 2019, 8:09 p.m. UTC | #16
On Mon, Apr 29, 2019 at 04:58:09PM +0000, Luck, Tony wrote:
> > ia64 has a such a huge number of memory model choices.  Maybe we
> > need to cut it down to a small set that actually work.
> 
> SGI systems had extremely discontiguous memory (they used some high
> order physical address bits in the tens/hundreds of terabyte range for the
> node number ... so there would be a few GBytes of actual memory then
> a huge gap before the next node had a few more Gbytes).
> 
> I don't know of anyone still booting upstream on an SN2, so if we start doing
> serious hack and slash the chances are high that SN2 will be broken (if it isn't
> already).

When I wrote this, I thought of

!NUMA:  flat mem
NUMA:	sparsemem
SN2:	discontig

based on Meelis report.  But now that you mention it, I bet SN2 has
already died slow death from bitrot.  It is so different in places, and
it doesn't seem like anyone care - if people want room sized SGI
machines the Origin is much more sexy (hello Thomas!) :)

So maybe it it time to mark SN2 broken and see if anyone screams?

Without SN2 the whole machvec mess could basically go away - the
only real difference between the remaining machvecs is which iommu
if any we set up.
Christoph Lameter (Ampere) April 30, 2019, 1:40 p.m. UTC | #17
On Mon, 29 Apr 2019, Christoph Hellwig wrote:

> So maybe it it time to mark SN2 broken and see if anyone screams?
>
> Without SN2 the whole machvec mess could basically go away - the
> only real difference between the remaining machvecs is which iommu
> if any we set up.

SPARSEMEM with VMEMMAP was developed to address these
issues and allow one mapping scheme across the different platforms.

You do not need DISCONTIGMEM support for SN2. And as far as I know (from a
decade ago ok....) the distributions were using VMEMMAP instead.
Mike Rapoport May 1, 2019, 8:46 p.m. UTC | #18
On Sun, Apr 28, 2019 at 01:11:07AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 24, 2019 at 02:33:53PM +0300, Mike Rapoport wrote:
> > On Tue, Apr 23, 2019 at 12:13:54AM -0700, Christoph Hellwig wrote:
> > > On Sun, Apr 21, 2019 at 10:16:04PM +0100, Mel Gorman wrote:
> > > > 32-bit NUMA systems should be non-existent in practice. The last NUMA
> > > > system I'm aware of that was both NUMA and 32-bit only died somewhere
> > > > between 2004 and 2007. If someone is running a 64-bit capable system in
> > > > 32-bit mode with NUMA, they really are just punishing themselves for fun.
> > > 
> > > Can we mark it as BROKEN to see if someone shouts and then remove it
> > > a year or two down the road?  Or just kill it off now..
> > 
> > How about making SPARSEMEM default for x86-32?
> 
> Sounds good.
> 
> Another question:  I always found the option to even select the memory
> models like a bad tradeoff.  Can we really expect a user to make a sane
> choice?  I'd rather stick to a relativelty optimal choice based on arch
> and maybe a few other parameters (NUMA or not for example) and stick to
> it, reducing the testing matrix.

I've sent patches that remove ARCH_SELECT_MEMORY_MODEL from arm, s390 and
sparc where it anyway has no effect [1].

That leaves arm64, ia64, parisc, powerpc, sh and i386.

I'd say that for i386 selecting between FLAT and SPARSE based on NUMA
sounds reasonable.

I'm not familiar enough with others to say if such enforcement makes any
sense.

Probably powerpc and sh can enable the preferred memory model in
platform/board part of their Kconfig, just like arm.

[1] https://lore.kernel.org/lkml/1556740577-4140-1-git-send-email-rppt@linux.ibm.com
Mike Rapoport May 5, 2019, 8:53 a.m. UTC | #19
Hi,

On Fri, Apr 19, 2019 at 10:08:31PM +0200, Helge Deller wrote:
> On 19.04.19 16:28, Mel Gorman wrote:
> > On Fri, Apr 19, 2019 at 07:05:21AM -0700, Matthew Wilcox wrote:
> >> On Fri, Apr 19, 2019 at 10:43:35AM +0100, Mel Gorman wrote:
> >>> DISCONTIG is essentially deprecated and even parisc plans to move to
> >>> SPARSEMEM so there is no need to be fancy, this patch simply disables
> >>> watermark boosting by default on DISCONTIGMEM.
> >>
> >> I don't think parisc is the only arch which uses DISCONTIGMEM for !NUMA
> >> scenarios.  Grepping the arch/ directories shows:
> >>
> >> alpha (does support NUMA, but also non-NUMA DISCONTIGMEM)
> >> arc (for supporting more than 1GB of memory)
> >> ia64 (looks complicated ...)
> >> m68k (for multiple chunks of memory)
> >> mips (does support NUMA but also non-NUMA)
> >> parisc (both NUMA and non-NUMA)
> >>
> >> I'm not sure that these architecture maintainers even know that DISCONTIGMEM
> >> is deprecated.  Adding linux-arch to the cc.
> >
> > Poor wording then -- yes, DISCONTIGMEM is still used but look where it's
> > used. I find it impossible to believe that any new arch would support
> > DISCONTIGMEM or that DISCONTIGMEM would be selected when SPARSEMEM is
> > available.`It's even more insane when you consider that SPARSEMEM can be
> > extended to support VMEMMAP so that it has similar overhead to FLATMEM
> > when mapping pfns to struct pages and vice-versa.
> 
> FYI, on parisc we will switch from DISCONTIGMEM to SPARSEMEM with kernel 5.2.
> The patch was quite simple and it's currently in the for-next tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git/commit/?h=for-next&id=281b718721a5e78288271d632731cea9697749f7

A while ago I've sent a patch that removes ARCH_DISCARD_MEMBLOCK option [1]
so the hunk below is not needed:

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index c8038165b81f..26c215570adf 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -36,6 +36,7 @@ config PARISC
 	select GENERIC_STRNCPY_FROM_USER
 	select SYSCTL_ARCH_UNALIGN_ALLOW
 	select SYSCTL_EXCEPTION_TRACE
+	select ARCH_DISCARD_MEMBLOCK
 	select HAVE_MOD_ARCH_SPECIFIC
 	select VIRT_TO_BUS
 	select MODULES_USE_ELF_RELA


[1] https://lore.kernel.org/lkml/1556102150-32517-1-git-send-email-rppt@linux.ibm.com/
 
> Helge
>
diff mbox series

Patch

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 6af24cdb25cc..3f13d8599337 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -866,14 +866,14 @@  The intent is that compaction has less work to do in the future and to
 increase the success rate of future high-order allocations such as SLUB
 allocations, THP and hugetlbfs pages.
 
-To make it sensible with respect to the watermark_scale_factor parameter,
-the unit is in fractions of 10,000. The default value of 15,000 means
-that up to 150% of the high watermark will be reclaimed in the event of
-a pageblock being mixed due to fragmentation. The level of reclaim is
-determined by the number of fragmentation events that occurred in the
-recent past. If this value is smaller than a pageblock then a pageblocks
-worth of pages will be reclaimed (e.g.  2MB on 64-bit x86). A boost factor
-of 0 will disable the feature.
+To make it sensible with respect to the watermark_scale_factor
+parameter, the unit is in fractions of 10,000. The default value of
+15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
+watermark will be reclaimed in the event of a pageblock being mixed due
+to fragmentation. The level of reclaim is determined by the number of
+fragmentation events that occurred in the recent past. If this value is
+smaller than a pageblock then a pageblocks worth of pages will be reclaimed
+(e.g.  2MB on 64-bit x86). A boost factor of 0 will disable the feature.
 
 =============================================================
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cfaba3889fa2..86c3806f1070 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -266,7 +266,20 @@  compound_page_dtor * const compound_page_dtors[] = {
 
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
+#ifdef CONFIG_DISCONTIGMEM
+/*
+ * DiscontigMem defines memory ranges as separate pg_data_t even if the ranges
+ * are not on separate NUMA nodes. Functionally this works but with
+ * watermark_boost_factor, it can reclaim prematurely as the ranges can be
+ * quite small. By default, do not boost watermarks on discontigmem as in
+ * many cases very high-order allocations like THP are likely to be
+ * unsupported and the premature reclaim offsets the advantage of long-term
+ * fragmentation avoidance.
+ */
+int watermark_boost_factor __read_mostly;
+#else
 int watermark_boost_factor __read_mostly = 15000;
+#endif
 int watermark_scale_factor = 10;
 
 static unsigned long nr_kernel_pages __initdata;