[RFC] mm: mmap: Change DEFAULT_MAX_MAP_COUNT to INT_MAX

Message ID 20240830095636.572947-1-pspacek@isc.org (mailing list archive)
State New

Commit Message

Petr Špaček Aug. 30, 2024, 9:56 a.m. UTC
From: Petr Spacek <pspacek@isc.org>

Raise default sysctl vm.max_map_count to INT_MAX, which effectively
disables the limit for all sane purposes. The sysctl is kept around in
case there is some use-case for this limit.

The old default value of vm.max_map_count=65530 provided compatibility
with ELF format predating year 2000 and with binutils predating 2010. At
the same time the old default caused issues with applications deployed
in 2024.

State since 2012: Linux 3.2.0 correctly generates a coredump from a
process with 100 000 mmapped files. GDB 7.4.1 and binutils 2.22 work with
this coredump fine and can actually read data from the mmapped addresses.

Signed-off-by: Petr Spacek <pspacek@isc.org>
---

Downstream distributions started to override the default a while ago.
Individual distributions are summarized at the end of this message:
https://lists.archlinux.org/archives/list/arch-dev-public@lists.archlinux.org/thread/5GU7ZUFI25T2IRXIQ62YYERQKIPE3U6E/

Please note it's not only games running in emulators which hit this
default limit. Larger instances of server applications also suffer from
it. A couple of examples are here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2057792/comments/24

SAP documentation behind a paywall also mentions this limit:
https://service.sap.com/sap/support/notes/2002167

And finally, it is also an issue for the BIND DNS server compiled against
jemalloc, which is what brought me here.

System V gABI draft dated 2000-07-17 already extended the ELF numbering:
https://www.sco.com/developers/gabi/2000-07-17/ch4.sheader.html

binutils support is in commit ecd12bc14d85421fcf992cda5af1d534cc8736e0
dated 2010-01-19. IIUC this goes a bit beyond what is described in the
gABI document and extends ELF's e_phnum.

Linux coredumper support is in commit
8d9032bbe4671dc481261ccd4e161cd96e54b118 dated 2010-03-06.
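
For reference, a minimal userspace sketch of how a consumer handles the
extended numbering (assumes a 64-bit ELF and glibc's <elf.h>; when e_shnum
is 0 the real section count lives in sh_size of section header 0, and
e_phnum == PN_XNUM redirects to its sh_info):

/* shcount.c - print section and program header counts of an ELF file,
 * handling extended numbering.  Minimal sketch, 64-bit ELF only. */
#include <elf.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        FILE *f;
        Elf64_Ehdr eh;
        Elf64_Shdr sh0;
        unsigned long shnum, phnum;

        if (argc != 2 || !(f = fopen(argv[1], "rb")))
                return 1;
        if (fread(&eh, sizeof(eh), 1, f) != 1)
                return 1;

        shnum = eh.e_shnum;
        phnum = eh.e_phnum;

        /* Extended numbering: real counts are stashed in section header 0. */
        if ((shnum == 0 || phnum == PN_XNUM) && eh.e_shoff) {
                fseek(f, eh.e_shoff, SEEK_SET);
                if (fread(&sh0, sizeof(sh0), 1, f) != 1)
                        return 1;
                if (shnum == 0)
                        shnum = sh0.sh_size;
                if (phnum == PN_XNUM)
                        phnum = sh0.sh_info;
        }

        printf("sections: %lu, program headers: %lu\n", shnum, phnum);
        fclose(f);
        return 0;
}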

As mentioned above, this has all worked for the last 12 years, and the
conservative limit seems to do more harm than good.
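
For anyone who wants to reproduce the limit locally, a rough sketch that
creates unmergeable single-page mappings until the kernel refuses; the
count printed lands close to vm.max_map_count:

/* vma-exhaust.c - rough sketch, eats the whole VMA quota of one process. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        unsigned long count = 0;

        for (;;) {
                /* Map two pages (NORESERVE, so overcommit accounting stays
                 * out of the way), then punch out the second one.  The hole
                 * keeps neighbouring mappings from merging, so every
                 * iteration costs roughly one map count entry. */
                char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                               -1, 0);
                if (p == MAP_FAILED) {
                        printf("mmap: %s after %lu mappings\n",
                               strerror(errno), count);
                        return 0;
                }
                if (munmap(p + page, page) != 0) {
                        printf("munmap: %s after %lu mappings\n",
                               strerror(errno), count);
                        return 0;
                }
                count++;
        }
}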

 include/linux/mm.h | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)


base-commit: d5d547aa7b51467b15d9caa86b116f8c2507c72a

Comments

Lorenzo Stoakes Aug. 30, 2024, 11:41 a.m. UTC | #1
On Fri, Aug 30, 2024 at 11:56:36AM GMT, Petr Spacek wrote:
> From: Petr Spacek <pspacek@isc.org>
>
> Raise default sysctl vm.max_map_count to INT_MAX, which effectively
> disables the limit for all sane purposes. The sysctl is kept around in
> case there is some use-case for this limit.
>
> The old default value of vm.max_map_count=65530 provided compatibility
> with ELF format predating year 2000 and with binutils predating 2010. At
> the same time the old default caused issues with applications deployed
> in 2024.
>
> State since 2012: Linux 3.2.0 correctly generates coredump from a
> process with 100 000 mmapped files. GDB 7.4.1, binutils 2.22 work with
> this coredump fine and can actually read data from the mmaped addresses.
>
> Signed-off-by: Petr Spacek <pspacek@isc.org>

NACK.

> ---
>
> Downstream distributions started to override the default a while ago.
> Individual distributions are summarized at the end of this message:
> https://lists.archlinux.org/archives/list/arch-dev-public@lists.archlinux.org/thread/5GU7ZUFI25T2IRXIQ62YYERQKIPE3U6E/

Did they change them to 2.14 billion?

>
> Please note it's not only games in emulator which hit this default
> limit. Larger instances of server applications are also suffering from
> this. Couple examples here:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2057792/comments/24
>
> SAP documentation behind paywall also mentions this limit:
> https://service.sap.com/sap/support/notes/2002167
>
> And finally, it is also an issue for BIND DNS server compiled against
> jemalloc, which is what brought me here.
>
> System V gABI draft dated 2000-07-17 already extended the ELF numbering:
> https://www.sco.com/developers/gabi/2000-07-17/ch4.sheader.html
>
> binutils support is in commit ecd12bc14d85421fcf992cda5af1d534cc8736e0
> dated 2010-01-19. IIUC this goes a bit beyond what is described in the
> gABI document and extends ELF's e_phnum.
>
> Linux coredumper support is in commit
> 8d9032bbe4671dc481261ccd4e161cd96e54b118 dated 2010-03-06.
>
> As mentioned above, this all works for the last 12 years and the
> conservative limit seems to do more harm than good.
>
>  include/linux/mm.h | 21 +++++++++------------
>  1 file changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6549d0979..3e1ed3b80 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -178,22 +178,19 @@ static inline void __mm_zero_struct_page(struct page *page)
>
>  /*
>   * Default maximum number of active map areas, this limits the number of vmas
> - * per mm struct. Users can overwrite this number by sysctl but there is a
> - * problem.
> + * per mm struct. Users can overwrite this number by sysctl. Historically
> + * this limit was a compatibility measure for ELF format predating year 2000.
>   *
>   * When a program's coredump is generated as ELF format, a section is created
> - * per a vma. In ELF, the number of sections is represented in unsigned short.
> - * This means the number of sections should be smaller than 65535 at coredump.
> - * Because the kernel adds some informative sections to a image of program at
> - * generating coredump, we need some margin. The number of extra sections is
> - * 1-3 now and depends on arch. We use "5" as safe margin, here.
> + * per a vma. In ELF before year 2000, the number of sections was represented
> + * as unsigned short e_shnum. This means the number of sections should be
> + * smaller than 65535 at coredump.
>   *
> - * ELF extended numbering allows more than 65535 sections, so 16-bit bound is
> - * not a hard limit any more. Although some userspace tools can be surprised by
> - * that.
> + * ELF extended numbering was added into System V gABI spec around 2000.
> + * It allows more than 65535 sections, so 16-bit bound is not a hard limit any
> + * more.
>   */
> -#define MAPCOUNT_ELF_CORE_MARGIN	(5)
> -#define DEFAULT_MAX_MAP_COUNT	(USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
> +#define DEFAULT_MAX_MAP_COUNT	INT_MAX

NACK, you can't arbitrarily change an established limit like this.

Also VMAs have a non-zero size. On my system, 184 bytes. So your change allows
for ~395 GiB to be assigned to VMAs. Does that seem reasonable?
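
Back-of-the-envelope, taking the 184 bytes at face value (the actual
sizeof(struct vm_area_struct) varies with kernel version, config and arch):

/* vma-cost.c - rough worst-case memory cost of INT_MAX VMAs in one process. */
#include <limits.h>
#include <stdio.h>

int main(void)
{
        unsigned long long per_vma = 184;       /* bytes, see above */
        unsigned long long total = (unsigned long long)INT_MAX * per_vma;

        printf("%llu bytes (~%.0f GB / ~%.0f GiB) just for vm_area_structs\n",
               total, total / 1e9, total / (1024.0 * 1024.0 * 1024.0));
        return 0;
}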

It _might_ be sensible to increase the minimum, not to INT_MAX.

Also note that you _can_ change this limit; it's a tunable. It's not egregious
to, you know, change a tunable.

Also please cc the MEMORY MAPPING reviewers for changes like this. It wasn't
obvious because include/linux/mm.h isn't included in the MAINTAINERS block,
but that's me, Liam and Vlastimil, cc'd now.

>
>  extern int sysctl_max_map_count;
>
>
> base-commit: d5d547aa7b51467b15d9caa86b116f8c2507c72a
> --
> 2.46.0
>
>
Lorenzo Stoakes Aug. 30, 2024, 12:01 p.m. UTC | #2
On Fri, Aug 30, 2024 at 12:41:37PM GMT, Lorenzo Stoakes wrote:
> On Fri, Aug 30, 2024 at 11:56:36AM GMT, Petr Spacek wrote:
> > From: Petr Spacek <pspacek@isc.org>
> >
> > Raise default sysctl vm.max_map_count to INT_MAX, which effectively
> > disables the limit for all sane purposes. The sysctl is kept around in
> > case there is some use-case for this limit.
> >
> > The old default value of vm.max_map_count=65530 provided compatibility
> > with ELF format predating year 2000 and with binutils predating 2010. At
> > the same time the old default caused issues with applications deployed
> > in 2024.
> >
> > State since 2012: Linux 3.2.0 correctly generates coredump from a
> > process with 100 000 mmapped files. GDB 7.4.1, binutils 2.22 work with
> > this coredump fine and can actually read data from the mmaped addresses.
> >
> > Signed-off-by: Petr Spacek <pspacek@isc.org>
>
> NACK.

Sorry this may have come off as more hostile than intended... we are
welcoming of patches, promise :)

It is only because we want to be _super_ careful about things like this
that can have potentially problematic impact if you have a buggy program
that allocates too many VMAs.

It is a NACK, but it's a NACK because of the limit being so high.

With steam I believe it is a product of how it performs allocations, and
unfortunately this causes it to allocate quite a bit more than you would
expect.

With jemalloc() that seems strange, perhaps buggy behaviour?

It may be reasonable to adjust the default limit higher, and I'm not
opposed to that, but it might be tricky to find a level that is sensible
across all arches including ones with significantly smaller memory
availability.

This is what makes choosing this value tricky; thanks for your analysis as
to the original choice, which does seem less applicable now, but choosing
something sensible here might be tricky.

Also there may be _somebody_ out there relying on this limit being quite so
low.

[snip]
Petr Špaček Aug. 30, 2024, 2:28 p.m. UTC | #3
On 30. 08. 24 14:01, Lorenzo Stoakes wrote:
> On Fri, Aug 30, 2024 at 12:41:37PM GMT, Lorenzo Stoakes wrote:
>> On Fri, Aug 30, 2024 at 11:56:36AM GMT, Petr Spacek wrote:
>>> From: Petr Spacek <pspacek@isc.org>
>>>
>>> Raise default sysctl vm.max_map_count to INT_MAX, which effectively
>>> disables the limit for all sane purposes. The sysctl is kept around in
>>> case there is some use-case for this limit.

[snip]

>> NACK.
> 
> Sorry this may have come off as more hostile than intended... we are
> welcoming of patches, promise :)

[snip]

Understood. The RFC in the subject was honest - and we are having the 
discussion now, so all's good!

I also apologize for not Ccing the right people. This is my first patch 
here and I'm still trying to grasp the process.


> It is only because we want to be _super_ careful about things like this
> that can have potentially problematic impact if you have a buggy program
> that allocates too many VMAs.

Now I understand your concern. From the docs and code comments I've seen
it was not clear that the limit serves _another_ purpose beyond being a
mere compatibility shim for old ELF tools.

> It is a NACK, but it's a NACK because of the limit being so high.
> 
> With steam I believe it is a product of how it performs allocations, and
> unfortunately this causes it to allocate quite a bit more than you would
> expect.

FTR select non-game applications:

ElasticSearch and OpenSearch insist on at least 262144.
DNS server BIND 9.18.28 linked to jemalloc 5.2.1 was observed with usage 
around 700000.
OpenJDK GC sometimes weeps about values < 737280.
SAP docs I was able to access use 1000000.
MariaDB is being tested by their QA with 1048576.
Fedora, Ubuntu, NixOS, and Arch distros went with value 1048576.

Is it worth sending a patch with the default raised to 1048576?


> With jemalloc() that seems strange, perhaps buggy behaviour?

Good question. In the case of the BIND DNS server, jemalloc handles mmap()
and we keep statistics about bytes requested from malloc().

When we hit the max_map_count limit,
(sum of not-yet-freed malloc(size)) / (vm.max_map_count)
gives an average mmapped block size of ~100 KB.

Is 100 KB way too low / does it indicate a bug? It does not seem terrible
to me - the application is handling ~100-1500 B packets at a rate somewhere
between 10-200 k packets per second, so it's expected that it does lots of
small short-lived allocations.

A complicating factor is that the process itself does not see the current
counter value (unless BPF is involved), so it's hard to monitor this until
the limit is hit.
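
The closest thing to self-monitoring I can think of, short of BPF, is
counting lines in /proc/self/maps, which tracks the kernel's map count
closely enough for alerting. A rough sketch:

/* count_vmas.c - approximate this process's VMA count from /proc/self/maps.
 * Racy and not exact, but good enough to log a warning near the limit. */
#include <stdio.h>

static long count_vmas(void)
{
        FILE *f = fopen("/proc/self/maps", "r");
        long lines = 0;
        int c;

        if (!f)
                return -1;
        while ((c = fgetc(f)) != EOF)
                if (c == '\n')
                        lines++;
        fclose(f);
        return lines;
}

int main(void)
{
        printf("~%ld VMAs in use\n", count_vmas());
        return 0;
}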

> It may be reasonable to adjust the default limit higher, and I'm not
> opposed to that, but it might be tricky to find a level that is sensible
> across all arches including ones with significantly smaller memory
> availability.

Hmm... Thinking aloud:

Are VMA sizes included in cgroup v2 memory accounting? Maybe the safety 
limit can be handled there?


If sizing based on available memory is a concern then a fixed value is
probably already wrong? I mean, current boxes range from a dozen MB to 512
GB of RAM.

For a box with 16 MB of RAM we get ~ 16M/(sizeof ~ 184) = 91 180 VMAs to 
fill RAM, and the current limit is 65 530 _per process_.

A threat model which allows an attacker to mmap() but not fork() seems
theoretical to me. I.e. an insane (or rogue) application can eat up to
(max # of processes) * (max_map_count) * (sizeof VMA)
bytes of memory, not just the
max_map_count * (sizeof VMA)
we were talking about before.

Apologies for having more questions than answers. I'm trying to
understand what purpose the limit serves and whether we can improve the
user experience.

Thank you for patience and have a great weekend!
Pedro Falcato Aug. 30, 2024, 3:04 p.m. UTC | #4
On Fri, Aug 30, 2024 at 04:28:33PM GMT, Petr Špaček wrote:
> Now I understand your concern. From the docs and code comments I've seen it
> was not clear that the limit serves _another_ purpose than mere
> compatibility shim for old ELF tools.
> 
> > It is a NACK, but it's a NACK because of the limit being so high.
> > 
> > With steam I believe it is a product of how it performs allocations, and
> > unfortunately this causes it to allocate quite a bit more than you would
> > expect.
> 
> FTR select non-game applications:
> 
> ElasticSearch and OpenSearch insist on at least 262144.
> DNS server BIND 9.18.28 linked to jemalloc 5.2.1 was observed with usage
> around 700000.
> OpenJDK GC sometimes weeps about values < 737280.
> SAP docs I was able to access use 1000000.
> MariaDB is being tested by their QA with 1048576.
> Fedora, Ubuntu, NixOS, and Arch distros went with value 1048576.
> 
> Is it worth sending a patch with the default raised to 1048576?
> 
> 
> > With jemalloc() that seems strange, perhaps buggy behaviour?
> 
> Good question. In case of BIND DNS server, jemalloc handles mmap() and we
> keep statistics about bytes requested from malloc().
> 
> When we hit max_map_count limit the
> (sum of not-yet-freed malloc(size)) / (vm.max_map_count)
> gives average size of mmaped block ~ 100 k.
> 
> Is 100 k way too low / does it indicate a bug? It does not seem terrible to
> me - the application is handling ~ 100-1500 B packets at rate somewhere
> between 10-200 k packets per second so it's expected it does lots of small
> short lived allocations.
> 
> A complicating factor is that the process itself does not see the current
> counter value (unless BPF is involved) so it's hard to monitor this until
> the limit is hit.

Can you get us a dump of the /proc/<pid>/maps? It'd be interesting to see how
exactly you're hitting this.
David Hildenbrand Aug. 30, 2024, 3:24 p.m. UTC | #5
On 30.08.24 13:41, Lorenzo Stoakes wrote:
> On Fri, Aug 30, 2024 at 11:56:36AM GMT, Petr Spacek wrote:
>> From: Petr Spacek <pspacek@isc.org>
>>
>> Raise default sysctl vm.max_map_count to INT_MAX, which effectively
>> disables the limit for all sane purposes. The sysctl is kept around in
>> case there is some use-case for this limit.
>>
>> The old default value of vm.max_map_count=65530 provided compatibility
>> with ELF format predating year 2000 and with binutils predating 2010. At
>> the same time the old default caused issues with applications deployed
>> in 2024.
>>
>> State since 2012: Linux 3.2.0 correctly generates coredump from a
>> process with 100 000 mmapped files. GDB 7.4.1, binutils 2.22 work with
>> this coredump fine and can actually read data from the mmaped addresses.
>>
>> Signed-off-by: Petr Spacek <pspacek@isc.org>
> 
> NACK.

Agreed, I could have sworn I NACKed a similar patch just months ago.

If you use that many memory mappings, you're doing something very, very
wrong.
Liam R. Howlett Aug. 30, 2024, 4:48 p.m. UTC | #6
* David Hildenbrand <david@redhat.com> [240830 11:24]:
> On 30.08.24 13:41, Lorenzo Stoakes wrote:
> > On Fri, Aug 30, 2024 at 11:56:36AM GMT, Petr Spacek wrote:
> > > From: Petr Spacek <pspacek@isc.org>
> > > 
> > > Raise default sysctl vm.max_map_count to INT_MAX, which effectively
> > > disables the limit for all sane purposes. The sysctl is kept around in
> > > case there is some use-case for this limit.
> > > 
> > > The old default value of vm.max_map_count=65530 provided compatibility
> > > with ELF format predating year 2000 and with binutils predating 2010. At
> > > the same time the old default caused issues with applications deployed
> > > in 2024.
> > > 
> > > State since 2012: Linux 3.2.0 correctly generates coredump from a
> > > process with 100 000 mmapped files. GDB 7.4.1, binutils 2.22 work with
> > > this coredump fine and can actually read data from the mmaped addresses.
> > > 
> > > Signed-off-by: Petr Spacek <pspacek@isc.org>
> > 
> > NACK.
> 
> Agreed, I could have sworn I NACKed a similar patch just months ago.

You did [1]; the mm list doesn't seem to have all those emails.

The initial patch isn't there but I believe the planned change was to
increase the limit to 1048576.

> 
> If you use that many memory mappings, you're doing something very, very
> wrong.

It also caught jemalloc ever-increasing the vma count in 2017 [2], 2018
[3], and something odd in 2023 [4].

It seems like there is a lot of configuring to get this to behave as one
would expect, and so you either need to tune your system or the
application, or both here?

Although there are useful reasons to increase the vma limit, most people
are fine with the limit now and it catches bad actors.  Those not fine
with the limit have a way to increase it - and pretty much all of those
people are using their own allocators, it seems.


[1]. https://lore.kernel.org/all/1a91e772-4150-4d28-9c67-cb6d0478af79@redhat.com/
[2]. https://github.com/jemalloc/jemalloc/issues/1011
[3]. https://github.com/jemalloc/jemalloc/issues/1328
[4]. https://github.com/jemalloc/jemalloc/issues/2426
Petr Špaček Aug. 30, 2024, 5 p.m. UTC | #7
On 30. 08. 24 17:04, Pedro Falcato wrote:
> On Fri, Aug 30, 2024 at 04:28:33PM GMT, Petr Špaček wrote:
>> Now I understand your concern. From the docs and code comments I've seen it
>> was not clear that the limit serves _another_ purpose than mere
>> compatibility shim for old ELF tools.
>>
>>> It is a NACK, but it's a NACK because of the limit being so high.
>>>
>>> With steam I believe it is a product of how it performs allocations, and
>>> unfortunately this causes it to allocate quite a bit more than you would
>>> expect.
>>
>> FTR select non-game applications:
>>
>> ElasticSearch and OpenSearch insist on at least 262144.
>> DNS server BIND 9.18.28 linked to jemalloc 5.2.1 was observed with usage
>> around 700000.
>> OpenJDK GC sometimes weeps about values < 737280.
>> SAP docs I was able to access use 1000000.
>> MariaDB is being tested by their QA with 1048576.
>> Fedora, Ubuntu, NixOS, and Arch distros went with value 1048576.
>>
>> Is it worth sending a patch with the default raised to 1048576?
>>
>>
>>> With jemalloc() that seems strange, perhaps buggy behaviour?
>>
>> Good question. In case of BIND DNS server, jemalloc handles mmap() and we
>> keep statistics about bytes requested from malloc().
>>
>> When we hit max_map_count limit the
>> (sum of not-yet-freed malloc(size)) / (vm.max_map_count)
>> gives average size of mmaped block ~ 100 k.
>>
>> Is 100 k way too low / does it indicate a bug? It does not seem terrible to
>> me - the application is handling ~ 100-1500 B packets at rate somewhere
>> between 10-200 k packets per second so it's expected it does lots of small
>> short lived allocations.
>>
>> A complicating factor is that the process itself does not see the current
>> counter value (unless BPF is involved) so it's hard to monitor this until
>> the limit is hit.
> 
> Can you get us a dump of the /proc/<pid>/maps? It'd be interesting to see how
> exactly you're hitting this.

I only have a coredump from hitting the default limit immediately
available. GDB apparently does not show these regions in "info proc
mappings", but I was able to extract section addresses from the coredump:
https://users.isc.org/~pspacek/sf1717/elf-sections.csv

Distribution of section sizes and their count in format "size,count" is 
here:
https://users.isc.org/~pspacek/sf1717/sizes.csv

If you want to see some cumulative stats they are as OpenDocument here:
https://users.isc.org/~pspacek/sf1717/sizes.ods

From a quick glance it is obvious that single-page blocks eat most of
the quota.
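
A quick sketch to reproduce that observation from the sizes.csv above
(assumes the "size,count" format and 4 KiB pages; a header line, if any,
is skipped):

/* onepage.c - what share of the blocks are single pages?
 * Usage: ./onepage < sizes.csv */
#include <stdio.h>

int main(void)
{
        char line[256];
        unsigned long long size, count;
        unsigned long long total = 0, single = 0;

        while (fgets(line, sizeof(line), stdin)) {
                if (sscanf(line, "%llu,%llu", &size, &count) != 2)
                        continue;       /* skip a header line, if any */
                total += count;
                if (size == 4096)
                        single += count;
        }
        if (total)
                printf("%llu of %llu blocks (%.1f%%) are single pages\n",
                       single, total, 100.0 * single / total);
        return 0;
}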

I don't know if it is a bug or just memory fragmentation caused by a 
long-running server application.

I can try to get data from a production system to you next week if needed.
Petr Špaček Sept. 2, 2024, 10:37 a.m. UTC | #8
On 30. 08. 24 19:00, Petr Špaček wrote:
> On 30. 08. 24 17:04, Pedro Falcato wrote:
>> On Fri, Aug 30, 2024 at 04:28:33PM GMT, Petr Špaček wrote:
>>
>> Can you get us a dump of the /proc/<pid>/maps? It'd be interesting to 
>> see how
>> exactly you're hitting this.

https://users.isc.org/~pspacek/sf1717/bind-9.18.28-jemalloc-maps.xz

RSS was about 8.9 GB when the snapshot was taken.

I'm curious about your conclusions from this data. Thank you for your time!
Pedro Falcato Sept. 2, 2024, 11:05 a.m. UTC | #9
On Mon, Sep 02, 2024 at 12:37:48PM GMT, Petr Špaček wrote:
> On 30. 08. 24 19:00, Petr Špaček wrote:
> > On 30. 08. 24 17:04, Pedro Falcato wrote:
> > > On Fri, Aug 30, 2024 at 04:28:33PM GMT, Petr Špaček wrote:
> > > 
> > > Can you get us a dump of the /proc/<pid>/maps? It'd be interesting
> > > to see how
> > > exactly you're hitting this.
> 
> https://users.isc.org/~pspacek/sf1717/bind-9.18.28-jemalloc-maps.xz
> 
> RSS was about 8.9 GB when the snapshot was taken.
> 
> I'm curious about your conclusions from this data. Thank you for your time!

I'm not a jemalloc expert (maybe they could chime in) but a quick look suggests
jemalloc is poking _a lot_ of holes into your memory map (with munmap).
There were theories regarding jemalloc guard pages, but these don't even seem
to be it. E.g.:

7fa95d392000-7fa95d4ab000 rw-p 00000000 00:00 0
7fa95d4ac000-7fa95d4b7000 rw-p 00000000 00:00 0
7fa95d4b8000-7fa95d4dd000 rw-p 00000000 00:00 0
7fa95d4de000-7fa95d4f2000 rw-p 00000000 00:00 0
7fa95d4f3000-7fa95d4f9000 rw-p 00000000 00:00 0
7fa95d4fa000-7fa95d512000 rw-p 00000000 00:00 0
7fa95d513000-7fa95d53d000 rw-p 00000000 00:00 0
7fa95d53e000-7fa95d555000 rw-p 00000000 00:00 0
7fa95d556000-7fa95d5ab000 rw-p 00000000 00:00 0
7fa95d5ac000-7fa95d5b4000 rw-p 00000000 00:00 0

Where we have about a one-page gap between every vma. Either jemalloc is a big fan
of munmap on free(), or this is some novel guard-page technique I've never seen before :)
MADV_DONTNEED should work just fine on systems with overcommit on.
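
A throwaway sketch to quantify that across the whole dump (assumes the usual
/proc/<pid>/maps line format and 4 KiB pages):

/* gapcount.c - count one-page holes between consecutive ranges in a maps dump.
 * Usage: ./gapcount < bind-9.18.28-jemalloc-maps */
#include <stdio.h>

int main(void)
{
        char line[1024];
        unsigned long start, end, prev_end = 0;
        unsigned long gaps = 0, one_page = 0;

        while (fgets(line, sizeof(line), stdin)) {
                if (sscanf(line, "%lx-%lx", &start, &end) != 2)
                        continue;
                if (prev_end && start > prev_end) {
                        gaps++;
                        if (start - prev_end == 4096)
                                one_page++;
                }
                prev_end = end;
        }
        printf("%lu gaps, %lu of them exactly one page\n", gaps, one_page);
        return 0;
}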

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6549d0979..3e1ed3b80 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -178,22 +178,19 @@  static inline void __mm_zero_struct_page(struct page *page)
 
 /*
  * Default maximum number of active map areas, this limits the number of vmas
- * per mm struct. Users can overwrite this number by sysctl but there is a
- * problem.
+ * per mm struct. Users can overwrite this number by sysctl. Historically
+ * this limit was a compatibility measure for ELF format predating year 2000.
  *
  * When a program's coredump is generated as ELF format, a section is created
- * per a vma. In ELF, the number of sections is represented in unsigned short.
- * This means the number of sections should be smaller than 65535 at coredump.
- * Because the kernel adds some informative sections to a image of program at
- * generating coredump, we need some margin. The number of extra sections is
- * 1-3 now and depends on arch. We use "5" as safe margin, here.
+ * per a vma. In ELF before year 2000, the number of sections was represented
+ * as unsigned short e_shnum. This means the number of sections should be
+ * smaller than 65535 at coredump.
  *
- * ELF extended numbering allows more than 65535 sections, so 16-bit bound is
- * not a hard limit any more. Although some userspace tools can be surprised by
- * that.
+ * ELF extended numbering was added into System V gABI spec around 2000.
+ * It allows more than 65535 sections, so 16-bit bound is not a hard limit any
+ * more.
  */
-#define MAPCOUNT_ELF_CORE_MARGIN	(5)
-#define DEFAULT_MAX_MAP_COUNT	(USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
+#define DEFAULT_MAX_MAP_COUNT	INT_MAX
 
 extern int sysctl_max_map_count;