[1/3] selftests/mm: virtual_address_range: Fix error when CommitLimit < 1GiB

Message ID 20250107-virtual_address_range-tests-v1-1-3834a2fb47fe@linutronix.de
State New
Series selftests/mm: virtual_address_range: Two bugfixes and a cleanup

Commit Message

Thomas Weißschuh Jan. 7, 2025, 3:14 p.m. UTC
If not enough physical memory is available the kernel may fail mmap();
see __vm_enough_memory() and vm_commit_limit().
In that case the logic in validate_complete_va_space() does not make
sense and will even incorrectly fail.
Instead skip the test if no mmap() succeeded.

Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>

---
The logic in __vm_enough_memory() seems weird.
It describes itself as "Check that a process has enough memory to
allocate a new virtual mapping", however it never checks the current
memory usage of the process.
So it only disallows large mappings. But many small mappings taking the
same amount of memory are allowed; and then even automatically merged
into one big mapping.
---
 tools/testing/selftests/mm/virtual_address_range.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Dev Jain Jan. 8, 2025, 6:16 a.m. UTC | #1
On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
> If not enough physical memory is available the kernel may fail mmap();
> see __vm_enough_memory() and vm_commit_limit().
> In that case the logic in validate_complete_va_space() does not make
> sense and will even incorrectly fail.
> Instead skip the test if no mmap() succeeded.
>
> Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
> Cc: stable@vger.kernel.org
> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
>
> ---
> The logic in __vm_enough_memory() seems weird.
> It describes itself as "Check that a process has enough memory to
> allocate a new virtual mapping", however it never checks the current
> memory usage of the process.
> So it only disallows large mappings. But many small mappings taking the
> same amount of memory are allowed; and then even automatically merged
> into one big mapping.
> ---
>   tools/testing/selftests/mm/virtual_address_range.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
> index 2a2b69e91950a37999f606847c9c8328d79890c2..d7bf8094d8bcd4bc96e2db4dc3fcb41968def859 100644
> --- a/tools/testing/selftests/mm/virtual_address_range.c
> +++ b/tools/testing/selftests/mm/virtual_address_range.c
> @@ -178,6 +178,12 @@ int main(int argc, char *argv[])
>   		validate_addr(ptr[i], 0);
>   	}
>   	lchunks = i;
> +
> +	if (!lchunks) {
> +		ksft_test_result_skip("Not enough memory for a single chunk\n");
> +		ksft_finished();
> +	}
> +
>   	hptr = (char **) calloc(NR_CHUNKS_HIGH, sizeof(char *));
>   	if (hptr == NULL) {
>   		ksft_test_result_skip("Memory constraint not fulfilled\n");
>

I do not know about __vm_enough_memory(), but I am going by your description:
You say that the kernel may fail mmap() when enough physical memory is not
there, but it may happen that we have already done 100 mmap()'s, and then
the kernel fails mmap(), so if (!lchunks) won't be able to handle this case.
Basically, lchunks == 0 is not a complete indicator of kernel failing mmap().

The basic assumption of the test is that any process should be able to exhaust
its virtual address space, and running the test under memory pressure and the
kernel violating this behaviour defeats the point of the test I think?
Thomas Weißschuh Jan. 8, 2025, 8:05 a.m. UTC | #2
On Wed, Jan 08, 2025 at 11:46:19AM +0530, Dev Jain wrote:
> 
> On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
> > If not enough physical memory is available the kernel may fail mmap();
> > see __vm_enough_memory() and vm_commit_limit().
> > In that case the logic in validate_complete_va_space() does not make
> > sense and will even incorrectly fail.
> > Instead skip the test if no mmap() succeeded.
> > 
> > Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
> > 
> > ---
> > The logic in __vm_enough_memory() seems weird.
> > It describes itself as "Check that a process has enough memory to
> > allocate a new virtual mapping", however it never checks the current
> > memory usage of the process.
> > So it only disallows large mappings. But many small mappings taking the
> > same amount of memory are allowed; and then even automatically merged
> > into one big mapping.
> > ---
> >   tools/testing/selftests/mm/virtual_address_range.c | 6 ++++++
> >   1 file changed, 6 insertions(+)
> > 
> > diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
> > index 2a2b69e91950a37999f606847c9c8328d79890c2..d7bf8094d8bcd4bc96e2db4dc3fcb41968def859 100644
> > --- a/tools/testing/selftests/mm/virtual_address_range.c
> > +++ b/tools/testing/selftests/mm/virtual_address_range.c
> > @@ -178,6 +178,12 @@ int main(int argc, char *argv[])
> >   		validate_addr(ptr[i], 0);
> >   	}
> >   	lchunks = i;
> > +
> > +	if (!lchunks) {
> > +		ksft_test_result_skip("Not enough memory for a single chunk\n");
> > +		ksft_finished();
> > +	}
> > +
> >   	hptr = (char **) calloc(NR_CHUNKS_HIGH, sizeof(char *));
> >   	if (hptr == NULL) {
> >   		ksft_test_result_skip("Memory constraint not fulfilled\n");
> > 
> 
> I do not know about __vm_enough_memory(), but I am going by your description:
> You say that the kernel may fail mmap() when enough physical memory is not
> there, but it may happen that we have already done 100 mmap()'s, and then
> the kernel fails mmap(), so if (!lchunks) won't be able to handle this case.
> Basically, lchunks == 0 is not a complete indicator of kernel failing mmap().

__vm_enough_memory() only checks the size of each single mmap() on its
own. It does not actually check the current memory or address space
usage of the process.
This seems a bit weird, as indicated in my after-the-fold explanation.

> The basic assumption of the test is that any process should be able to exhaust
> its virtual address space, and running the test under memory pressure and the
> kernel violating this behaviour defeats the point of the test I think?

The assumption is correct, as soon as one mapping succeeds the others
will also succeed, until the actual address space is exhausted.

Looking at it again, __vm_enough_memory() is only called for writable
mappings, so it would be possible to use only readable mappings in the
test. The test will still fail with OOM, as the many PTEs need more than
1GiB of physical memory anyways, but at least that produces a usable
error message.
However I'm not sure if this would violate other test assumptions.
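
For illustration, the CommitLimit condition named in the subject line can be
checked from userspace before mapping anything. The following is a minimal
sketch and not part of the patch: meminfo_kib() and chunk_exceeds_limit() are
hypothetical helpers, the 1 GiB chunk size is taken from the subject line, and
which limit the kernel actually applies depends on the overcommit mode.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up "<key>: <value> kB" in /proc/meminfo-style
 * text. Returns the value in KiB, or -1 if the key is not found.
 */
static long long meminfo_kib(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;
	long long val;

	while (line && *line) {
		if (!strncmp(line, key, klen) && line[klen] == ':') {
			if (sscanf(line + klen + 1, "%lld", &val) == 1)
				return val;
			return -1;
		}
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline */
	}
	return -1;
}

/* Returns 1 if one chunk of chunk_bytes is larger than the given limit. */
static int chunk_exceeds_limit(long long limit_kib, unsigned long long chunk_bytes)
{
	return limit_kib >= 0 &&
	       chunk_bytes / 1024 > (unsigned long long)limit_kib;
}
```

A caller would read /proc/meminfo into a buffer, pass it to
meminfo_kib(buf, "CommitLimit") and skip the test when
chunk_exceeds_limit() reports that a single chunk cannot fit.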
David Hildenbrand Jan. 8, 2025, 1:36 p.m. UTC | #3
On 08.01.25 09:05, Thomas Weißschuh wrote:
> On Wed, Jan 08, 2025 at 11:46:19AM +0530, Dev Jain wrote:
>>
>> On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
>>> If not enough physical memory is available the kernel may fail mmap();
>>> see __vm_enough_memory() and vm_commit_limit().
>>> In that case the logic in validate_complete_va_space() does not make
>>> sense and will even incorrectly fail.
>>> Instead skip the test if no mmap() succeeded.
>>>
>>> Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
>>> Cc: stable@vger.kernel.org

CC stable on tests is ... odd.

>>> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
>>>
>>> ---
>>> The logic in __vm_enough_memory() seems weird.
>>> It describes itself as "Check that a process has enough memory to
>>> allocate a new virtual mapping", however it never checks the current
>>> memory usage of the process.
>>> So it only disallows large mappings. But many small mappings taking the
>>> same amount of memory are allowed; and then even automatically merged
>>> into one big mapping.
>>> ---
>>>    tools/testing/selftests/mm/virtual_address_range.c | 6 ++++++
>>>    1 file changed, 6 insertions(+)
>>>
>>> diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
>>> index 2a2b69e91950a37999f606847c9c8328d79890c2..d7bf8094d8bcd4bc96e2db4dc3fcb41968def859 100644
>>> --- a/tools/testing/selftests/mm/virtual_address_range.c
>>> +++ b/tools/testing/selftests/mm/virtual_address_range.c
>>> @@ -178,6 +178,12 @@ int main(int argc, char *argv[])
>>>    		validate_addr(ptr[i], 0);
>>>    	}
>>>    	lchunks = i;
>>> +
>>> +	if (!lchunks) {
>>> +		ksft_test_result_skip("Not enough memory for a single chunk\n");
>>> +		ksft_finished();
>>> +	}
>>> +
>>>    	hptr = (char **) calloc(NR_CHUNKS_HIGH, sizeof(char *));
>>>    	if (hptr == NULL) {
>>>    		ksft_test_result_skip("Memory constraint not fulfilled\n");
>>>
>>
>> I do not know about __vm_enough_memory(), but I am going by your description:
>> You say that the kernel may fail mmap() when enough physical memory is not
>> there, but it may happen that we have already done 100 mmap()'s, and then
>> the kernel fails mmap(), so if (!lchunks) won't be able to handle this case.
>> Basically, lchunks == 0 is not a complete indicator of kernel failing mmap().
> 
> __vm_enough_memory() only checks the size of each single mmap() on its
> own. It does not actually check the current memory or address space
> usage of the process.
> This seems a bit weird, as indicated in my after-the-fold explanation.
> 
>> The basic assumption of the test is that any process should be able to exhaust
>> its virtual address space, and running the test under memory pressure and the
>> kernel violating this behaviour defeats the point of the test I think?
> 
> The assumption is correct, as soon as one mapping succeeds the others
> will also succeed, until the actual address space is exhausted.
> 
> Looking at it again, __vm_enough_memory() is only called for writable
> mappings, so it would be possible to use only readable mappings in the
> test. The test will still fail with OOM, as the many PTEs need more than
> 1GiB of physical memory anyways, but at least that produces a usable
> error message.
> However I'm not sure if this would violate other test assumptions.
> 

Note that with MAP_NORESERVE, most setups we care about will allow
mapping as much as you want, but on access OOM will fire.

So one could require that /proc/sys/vm/overcommit_memory is set up
properly and use MAP_NORESERVE.

Reading from anonymous memory will populate the shared zeropage. To 
mitigate OOM from "too many page tables", one could simply unmap the 
pieces as they are verified (or MAP_FIXED over them, to free page tables).
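
A rough sketch of that suggestion, under stated assumptions: "set up properly"
is interpreted here as "not mode 2", because in strict accounting mode the
kernel ignores MAP_NORESERVE. The helper names overcommit_mode(),
should_skip() and map_chunk_noreserve() are made up for this example and do
not exist in the selftest.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

/* Read /proc/sys/vm/overcommit_memory; returns 0, 1 or 2, or -1 on error. */
static int overcommit_mode(void)
{
	FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
	int mode = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &mode) != 1)
		mode = -1;
	fclose(f);
	return mode;
}

/*
 * Assumption: skip when the mode cannot be read or is 2 (strict
 * accounting), since MAP_NORESERVE has no effect there.
 */
static int should_skip(int mode)
{
	return mode < 0 || mode == 2;
}

/* Map one anonymous chunk without reserving commit charge for it. */
static void *map_chunk_noreserve(size_t len)
{
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

With this split, a test could report SKIP when should_skip(overcommit_mode())
is true and treat any later mmap() failure below the address space limit as a
real FAIL.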
Thomas Weißschuh Jan. 8, 2025, 4:13 p.m. UTC | #4
On Wed, Jan 08, 2025 at 02:36:57PM +0100, David Hildenbrand wrote:
> On 08.01.25 09:05, Thomas Weißschuh wrote:
> > On Wed, Jan 08, 2025 at 11:46:19AM +0530, Dev Jain wrote:
> > > 
> > > On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
> > > > If not enough physical memory is available the kernel may fail mmap();
> > > > see __vm_enough_memory() and vm_commit_limit().
> > > > In that case the logic in validate_complete_va_space() does not make
> > > > sense and will even incorrectly fail.
> > > > Instead skip the test if no mmap() succeeded.
> > > > 
> > > > Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
> > > > Cc: stable@vger.kernel.org
> 
> CC stable on tests is ... odd.

I thought it was fairly common, but it isn't.
Will drop it.

> > > > Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
> > > > 
> > > > ---
> > > > The logic in __vm_enough_memory() seems weird.
> > > > It describes itself as "Check that a process has enough memory to
> > > > allocate a new virtual mapping", however it never checks the current
> > > > memory usage of the process.
> > > > So it only disallows large mappings. But many small mappings taking the
> > > > same amount of memory are allowed; and then even automatically merged
> > > > into one big mapping.
> > > > ---
> > > >    tools/testing/selftests/mm/virtual_address_range.c | 6 ++++++
> > > >    1 file changed, 6 insertions(+)
> > > > 
> > > > diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
> > > > index 2a2b69e91950a37999f606847c9c8328d79890c2..d7bf8094d8bcd4bc96e2db4dc3fcb41968def859 100644
> > > > --- a/tools/testing/selftests/mm/virtual_address_range.c
> > > > +++ b/tools/testing/selftests/mm/virtual_address_range.c
> > > > @@ -178,6 +178,12 @@ int main(int argc, char *argv[])
> > > >    		validate_addr(ptr[i], 0);
> > > >    	}
> > > >    	lchunks = i;
> > > > +
> > > > +	if (!lchunks) {
> > > > +		ksft_test_result_skip("Not enough memory for a single chunk\n");
> > > > +		ksft_finished();
> > > > +	}
> > > > +
> > > >    	hptr = (char **) calloc(NR_CHUNKS_HIGH, sizeof(char *));
> > > >    	if (hptr == NULL) {
> > > >    		ksft_test_result_skip("Memory constraint not fulfilled\n");
> > > > 
> > > 
> > > I do not know about __vm_enough_memory(), but I am going by your description:
> > > You say that the kernel may fail mmap() when enough physical memory is not
> > > there, but it may happen that we have already done 100 mmap()'s, and then
> > > the kernel fails mmap(), so if (!lchunks) won't be able to handle this case.
> > > Basically, lchunks == 0 is not a complete indicator of kernel failing mmap().
> > 
> > __vm_enough_memory() only checks the size of each single mmap() on its
> > own. It does not actually check the current memory or address space
> > usage of the process.
> > This seems a bit weird, as indicated in my after-the-fold explanation.
> > 
> > > The basic assumption of the test is that any process should be able to exhaust
> > > its virtual address space, and running the test under memory pressure and the
> > > kernel violating this behaviour defeats the point of the test I think?
> > 
> > The assumption is correct, as soon as one mapping succeeds the others
> > will also succeed, until the actual address space is exhausted.
> > 
> > Looking at it again, __vm_enough_memory() is only called for writable
> > mappings, so it would be possible to use only readable mappings in the
> > test. The test will still fail with OOM, as the many PTEs need more than
> > 1GiB of physical memory anyways, but at least that produces a usable
> > error message.
> > However I'm not sure if this would violate other test assumptions.
> > 
> 
> Note that with MAP_NORESERVE, most setups we care about will allow mapping as
> much as you want, but on access OOM will fire.

Thanks for the hint.

> So one could require that /proc/sys/vm/overcommit_memory is set up properly
> and use MAP_NORESERVE.

Isn't the check for lchunks == 0 essentially exactly this?

> Reading from anonymous memory will populate the shared zeropage. To mitigate
> OOM from "too many page tables", one could simply unmap the pieces as they
> are verified (or MAP_FIXED over them, to free page tables).

The code has to figure out if a verified region was created by mmap(),
otherwise an munmap() could crash the process.
As the entries from /proc/self/maps may have been merged and (I assume)
the ordering of mappings is not guaranteed, some bespoke logic to establish
the link will be needed.

Is it fine to rely on CONFIG_ANON_VMA_NAME?
That would make it much easier to implement.

Using MAP_NORESERVE and eager munmap()s, the testcase works nicely even
in very low physical memory conditions.

Thomas
David Hildenbrand Jan. 8, 2025, 4:46 p.m. UTC | #5
On 08.01.25 17:13, Thomas Weißschuh wrote:
> On Wed, Jan 08, 2025 at 02:36:57PM +0100, David Hildenbrand wrote:
>> On 08.01.25 09:05, Thomas Weißschuh wrote:
>>> On Wed, Jan 08, 2025 at 11:46:19AM +0530, Dev Jain wrote:
>>>>
>>>> On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
>>>>> If not enough physical memory is available the kernel may fail mmap();
>>>>> see __vm_enough_memory() and vm_commit_limit().
>>>>> In that case the logic in validate_complete_va_space() does not make
>>>>> sense and will even incorrectly fail.
>>>>> Instead skip the test if no mmap() succeeded.
>>>>>
>>>>> Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
>>>>> Cc: stable@vger.kernel.org
>>
>> CC stable on tests is ... odd.
> 
> I thought it was fairly common, but it isn't.
> Will drop it.

As it's not really a "kernel BUG", it's rather uncommon.

>>
>> Note that with MAP_NORESERVE, most setups we care about will allow mapping as
>> much as you want, but on access OOM will fire.
> 
> Thanks for the hint.
> 
>> So one could require that /proc/sys/vm/overcommit_memory is set up properly
>> and use MAP_NORESERVE.
> 
> Isn't the check for lchunks == 0 essentially exactly this?

I assume paired with MAP_NORESERVE?

Maybe, but it could be better to have something that says "if
overcommit_memory is not set up properly I will SKIP this test, but
otherwise I expect this to work and will FAIL if it doesn't".

Or would you expect to run into lchunks == 0 even if overcommit_memory
is set up properly and MAP_NORESERVE is used? (very very low memory that
we cannot even create all the VMAs?)

> 
>> Reading from anonymous memory will populate the shared zeropage. To mitigate
>> OOM from "too many page tables", one could simply unmap the pieces as they
>> are verified (or MAP_FIXED over them, to free page tables).
> 
> The code has to figure out if a verified region was created by mmap(),
> otherwise an munmap() could crash the process.
> As the entries from /proc/self/maps may have been merged and (I assume)

Yes, and partial unmap (in chunk granularity?) would split them again.

> the ordering of mappings is not guaranteed, some bespoke logic to establish
> the link will be needed.


My thinking was that you simply process one /proc/self/maps entry in 
some chunks. After processing a chunk, you munmap() it.

So you would process + munmap in chunks.
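
A minimal sketch of such a process-then-unmap loop, assuming the whole range
was created by the test itself and that the chunk size is a multiple of the
page size. map_region() and verify_and_unmap() are hypothetical names, and the
plain read stands in for whatever check the test performs on each chunk.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Map an anonymous, unreserved region of len bytes. */
static char *map_region(size_t len)
{
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/*
 * Touch the first byte of every chunk, then unmap that chunk right away so
 * its page tables can be freed. Returns the number of chunks processed, or
 * -1 on error.
 */
static long verify_and_unmap(char *start, size_t len, size_t chunk)
{
	long done = 0;
	size_t off;

	if (start == MAP_FAILED || !chunk || len % chunk)
		return -1;

	for (off = 0; off < len; off += chunk) {
		volatile char *p = start + off;

		if (*p != 0)	/* fresh anonymous memory reads as zero */
			return -1;
		if (munmap(start + off, chunk))
			return -1;
		done++;
	}
	return done;
}
```

Unmapping inside the loop keeps at most one chunk's worth of page tables
alive at a time, which is the point of the suggestion.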

> 
> Is it fine to rely on CONFIG_ANON_VMA_NAME?
> That would make it much easier to implement.

Can you elaborate how you would do it?

> 
> Using MAP_NORESERVE and eager munmap()s, the testcase works nicely even
> in very low physical memory conditions.

Cool.
Dev Jain Jan. 9, 2025, 5:40 a.m. UTC | #6
On 08/01/25 9:43 pm, Thomas Weißschuh wrote:
> On Wed, Jan 08, 2025 at 02:36:57PM +0100, David Hildenbrand wrote:
>> On 08.01.25 09:05, Thomas Weißschuh wrote:
>>> On Wed, Jan 08, 2025 at 11:46:19AM +0530, Dev Jain wrote:
>>>> On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
>>>>> If not enough physical memory is available the kernel may fail mmap();
>>>>> see __vm_enough_memory() and vm_commit_limit().
>>>>> In that case the logic in validate_complete_va_space() does not make
>>>>> sense and will even incorrectly fail.
>>>>> Instead skip the test if no mmap() succeeded.
>>>>>
>>>>> Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
>>>>> Cc: stable@vger.kernel.org
>> CC stable on tests is ... odd.
> I thought it was fairly common, but it isn't.
> Will drop it.

Oh, well...
https://lore.kernel.org/all/20240521074358.675031-4-dev.jain@arm.com/
I have done that before :) although the change I was making was fixing a
fundamental flaw in the test and your change is fixing the test for a
specific case (memory pressure), so I tend to concur with David.

>
>>>>> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
>>>>>
>>>>> ---
>>>>> The logic in __vm_enough_memory() seems weird.
>>>>> It describes itself as "Check that a process has enough memory to
>>>>> allocate a new virtual mapping", however it never checks the current
>>>>> memory usage of the process.
>>>>> So it only disallows large mappings. But many small mappings taking the
>>>>> same amount of memory are allowed; and then even automatically merged
>>>>> into one big mapping.
>>>>> ---
>>>>>     tools/testing/selftests/mm/virtual_address_range.c | 6 ++++++
>>>>>     1 file changed, 6 insertions(+)
>>>>>
>>>>> diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
>>>>> index 2a2b69e91950a37999f606847c9c8328d79890c2..d7bf8094d8bcd4bc96e2db4dc3fcb41968def859 100644
>>>>> --- a/tools/testing/selftests/mm/virtual_address_range.c
>>>>> +++ b/tools/testing/selftests/mm/virtual_address_range.c
>>>>> @@ -178,6 +178,12 @@ int main(int argc, char *argv[])
>>>>>     		validate_addr(ptr[i], 0);
>>>>>     	}
>>>>>     	lchunks = i;
>>>>> +
>>>>> +	if (!lchunks) {
>>>>> +		ksft_test_result_skip("Not enough memory for a single chunk\n");
>>>>> +		ksft_finished();
>>>>> +	}
>>>>> +
>>>>>     	hptr = (char **) calloc(NR_CHUNKS_HIGH, sizeof(char *));
>>>>>     	if (hptr == NULL) {
>>>>>     		ksft_test_result_skip("Memory constraint not fulfilled\n");
>>>>>
>>>> I do not know about __vm_enough_memory(), but I am going by your description:
>>>> You say that the kernel may fail mmap() when enough physical memory is not
>>>> there, but it may happen that we have already done 100 mmap()'s, and then
>>>> the kernel fails mmap(), so if (!lchunks) won't be able to handle this case.
>>>> Basically, lchunks == 0 is not a complete indicator of kernel failing mmap().
>>> __vm_enough_memory() only checks the size of each single mmap() on its
>>> own. It does not actually check the current memory or address space
>>> usage of the process.
>>> This seems a bit weird, as indicated in my after-the-fold explanation.
>>>
>>>> The basic assumption of the test is that any process should be able to exhaust
>>>> its virtual address space, and running the test under memory pressure and the
>>>> kernel violating this behaviour defeats the point of the test I think?
>>> The assumption is correct, as soon as one mapping succeeds the others
>>> will also succeed, until the actual address space is exhausted.
>>>
>>> Looking at it again, __vm_enough_memory() is only called for writable
>>> mappings, so it would be possible to use only readable mappings in the
>>> test. The test will still fail with OOM, as the many PTEs need more than
>>> 1GiB of physical memory anyways, but at least that produces a usable
>>> error message.
>>> However I'm not sure if this would violate other test assumptions.
>>>
>> Note that with MAP_NORESERVE, most setups we care about will allow mapping as
>> much as you want, but on access OOM will fire.
> Thanks for the hint.
>
>> So one could require that /proc/sys/vm/overcommit_memory is set up properly
>> and use MAP_NORESERVE.
> Isn't the check for lchunks == 0 essentially exactly this?
>
>> Reading from anonymous memory will populate the shared zeropage. To mitigate
>> OOM from "too many page tables", one could simply unmap the pieces as they
>> are verified (or MAP_FIXED over them, to free page tables).
> The code has to figure out if a verified region was created by mmap(),
> otherwise an munmap() could crash the process.
> As the entries from /proc/self/maps may have been merged and (I assume)
> the ordering of mappings is not guaranteed, some bespoke logic to establish
> the link will be needed.
>
> Is it fine to rely on CONFIG_ANON_VMA_NAME?
> That would make it much easier to implement.
>
> Using MAP_NORESERVE and eager munmap()s, the testcase works nicely even
> in very low physical memory conditions.
>
> Thomas
Thomas Weißschuh Jan. 9, 2025, 7:47 a.m. UTC | #7
On Wed, Jan 08, 2025 at 05:46:37PM +0100, David Hildenbrand wrote:
> On 08.01.25 17:13, Thomas Weißschuh wrote:
> > On Wed, Jan 08, 2025 at 02:36:57PM +0100, David Hildenbrand wrote:
> > > On 08.01.25 09:05, Thomas Weißschuh wrote:
> > > > On Wed, Jan 08, 2025 at 11:46:19AM +0530, Dev Jain wrote:
> > > > > 
> > > > > On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
> > > > > > If not enough physical memory is available the kernel may fail mmap();
> > > > > > see __vm_enough_memory() and vm_commit_limit().
> > > > > > In that case the logic in validate_complete_va_space() does not make
> > > > > > sense and will even incorrectly fail.
> > > > > > Instead skip the test if no mmap() succeeded.
> > > > > > 
> > > > > > Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
> > > > > > Cc: stable@vger.kernel.org
> > > 
> > > CC stable on tests is ... odd.
> > 
> > I thought it was fairly common, but it isn't.
> > Will drop it.
> 
> As it's not really a "kernel BUG", it's rather uncommon.

I also used it on patch 2, which is now reproducibly broken on x86
mainline since my commit mentioned in that patch.
But I'll drop it there, too.

> > > Note that with MAP_NORESERVE, most setups we care about will allow mapping as
> > > much as you want, but on access OOM will fire.
> > 
> > Thanks for the hint.
> > 
> > > So one could require that /proc/sys/vm/overcommit_memory is set up properly
> > > and use MAP_NORESERVE.
> > 
> > Isn't the check for lchunks == 0 essentially exactly this?
> 
> I assume paired with MAP_NORESERVE?

Yes.

> Maybe, but it could be better to have something that says "if
> overcommit_memory is not set up properly I will SKIP this test, but
> otherwise I expect this to work and will FAIL if it doesn't".

Ok, I'll validate the sysctl value.

> Or would you expect to run into lchunks == 0 even if overcommit_memory is
> set up properly and MAP_NORESERVE is used? (very very low memory that we
> cannot even create all the VMAs?)

No.

> > > Reading from anonymous memory will populate the shared zeropage. To mitigate
> > > OOM from "too many page tables", one could simply unmap the pieces as they
> > > are verified (or MAP_FIXED over them, to free page tables).
> > 
> > The code has to figure out if a verified region was created by mmap(),
> > otherwise an munmap() could crash the process.
> > As the entries from /proc/self/maps may have been merged and (I assume)
> 
> Yes, and partial unmap (in chunk granularity?) would split them again.
> 
> > the ordering of mappings is not guaranteed, some bespoke logic to establish
> > the link will be needed.
> 
> My thinking was that you simply process one /proc/self/maps entry in some
> chunks. After processing a chunk, you munmap() it.
> 
> So you would process + munmap in chunks.

That is clear. The issue would be to figure out which chunks are valid to
unmap. If something critical like the executable file is unmapped,
the process crashes. But see below.

> > Is it fine to rely on CONFIG_ANON_VMA_NAME?
> > That would make it much easier to implement.
> 
> Can you elaborate how you would do it?

First set the VMA name after mmap():

for (i = 0; i < NR_CHUNKS_LOW; i++) {
	ptr[i] = mmap(NULL, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE,
		     MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (ptr[i] == MAP_FAILED) {
		if (validate_lower_address_hint())
			ksft_exit_fail_msg("mmap unexpectedly succeeded with hint\n");
		break;
	}

	validate_addr(ptr[i], 0);
	if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ptr[i], MAP_CHUNK_SIZE, "virtual_address_range"))
		ksft_exit_fail_msg("prctl(PR_SET_VMA_ANON_NAME) failed: %s\n", strerror(errno));
}

During validation:

hop = 0;
while (start_addr + hop < end_addr) {
	if (write(fd, (void *)(start_addr + hop), 1) != 1)
		return 1;
	lseek(fd, 0, SEEK_SET);

	if (!strncmp(line + path_offset, "[anon:virtual_address_range]", 28))
		munmap((char *)(start_addr + hop), MAP_CHUNK_SIZE);

	hop += MAP_CHUNK_SIZE;

}

It is done for each chunk, as all chunks may have been merged into a
single VMA and a per-VMA unmap would not happen before OOM.
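
The name check in the validation loop can also be isolated into a small
helper. This is only a sketch: maps_line_has_anon_name() is a hypothetical
function, and it assumes the usual /proc/self/maps field order (address,
permissions, offset, device, inode, pathname) instead of the path_offset used
above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 if the pathname field of one /proc/self/maps line is exactly
 * "[anon:<name>]", 0 otherwise (including lines without a pathname).
 */
static int maps_line_has_anon_name(const char *line, const char *name)
{
	char path[256] = "";
	char want[300];

	/* Skip the five leading fields, then take the rest of the line. */
	if (sscanf(line, "%*s %*s %*s %*s %*s %255[^\n]", path) < 1)
		return 0;

	snprintf(want, sizeof(want), "[anon:%s]", name);
	return strcmp(path, want) == 0;
}
```

Comparing the whole field avoids the hard-coded length passed to strncmp(),
at the cost of one extra sscanf() per line.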

> > Using MAP_NORESERVE and eager munmap()s, the testcase works nicely even
> > in very low physical memory conditions.
> 
> Cool.
David Hildenbrand Jan. 9, 2025, 1:05 p.m. UTC | #8
>
> That is clear. The issue would be to figure out which chunks are valid to
> unmap. If something critical like the executable file is unmapped,
> the process crashes. But see below.

Ah, now I see what you mean. Yes, also the stack etc. will be 
problematic. So IIUC, you want to limit the munmap optimization only to 
the manually mmap()ed parts.

> 
>>> Is it fine to rely on CONFIG_ANON_VMA_NAME?
>>> That would make it much easier to implement.
>>
>> Can you elaborate how you would do it?
> 
> First set the VMA name after mmap():
> 
> for (i = 0; i < NR_CHUNKS_LOW; i++) {
> 	ptr[i] = mmap(NULL, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE,
> 		     MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
> 	if (ptr[i] == MAP_FAILED) {
> 		if (validate_lower_address_hint())
> 			ksft_exit_fail_msg("mmap unexpectedly succeeded with hint\n");
> 		break;
> 	}
> 
> 	validate_addr(ptr[i], 0);
> 	if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ptr[i], MAP_CHUNK_SIZE, "virtual_address_range"))
> 		ksft_exit_fail_msg("prctl(PR_SET_VMA_ANON_NAME) failed: %s\n", strerror(errno));

Likely this would prevent merging of VMAs.

With a 1 GiB chunk size, and NR_CHUNKS_LOW covering 128 TiB, you'd already
require 128k VMAs. The default limit is frequently 64k.
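
The VMA count is plain division and only applies if the chunks do not merge.
As a worked sketch: chunks_needed() and exceeds_map_count() are illustrative
names, and 65530 is assumed to be the default value of the vm.max_map_count
sysctl.

```c
#include <stdint.h>

/* Assumption: the usual default of the vm.max_map_count sysctl. */
#define DEFAULT_MAX_MAP_COUNT 65530ULL

/* Number of chunk-sized mappings needed to cover range_bytes, rounded up. */
static uint64_t chunks_needed(uint64_t range_bytes, uint64_t chunk_bytes)
{
	return (range_bytes + chunk_bytes - 1) / chunk_bytes;
}

/* Returns 1 if unmerged chunks alone would exceed the default VMA limit. */
static int exceeds_map_count(uint64_t range_bytes, uint64_t chunk_bytes)
{
	return chunks_needed(range_bytes, chunk_bytes) > DEFAULT_MAX_MAP_COUNT;
}
```

For 128 TiB in 1 GiB chunks this gives 131072 mappings, twice the assumed
default limit.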

We could just scan the ptr / hptr array to see if this is a manual mmap 
area or not. If this takes too long, one could sort the arrays by 
address and perform a binary search.

Not the most efficient way of doing it, but maybe good enough for this test?

Alternatively, store the pointer in a xarray-like tree instead of two 
arrays. Requires a bit more memory ... and we'd have to find a simple 
implementation we could just reuse in this test. So maybe there is a 
simpler way to get it done.
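
The sort-plus-binary-search variant could look roughly like this. It is a
sketch only: sort_chunks() and is_own_chunk() are hypothetical helpers, and
the lookup assumes it is always given the exact start address of a chunk.

```c
#include <stdint.h>
#include <stdlib.h>

/* Order two char * array elements by address. */
static int cmp_addr(const void *a, const void *b)
{
	uintptr_t x = (uintptr_t)*(char *const *)a;
	uintptr_t y = (uintptr_t)*(char *const *)b;

	return (x > y) - (x < y);
}

/* Sort an array of chunk start addresses in place. */
static void sort_chunks(char **chunks, size_t n)
{
	qsort(chunks, n, sizeof(*chunks), cmp_addr);
}

/* Returns 1 if addr is one of the sorted chunk start addresses. */
static int is_own_chunk(char **sorted, size_t n, char *addr)
{
	return bsearch(&addr, sorted, n, sizeof(*sorted), cmp_addr) != NULL;
}
```

Sorting once costs O(n log n); each lookup is then O(log n) instead of a
linear scan over ptr / hptr.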
David Hildenbrand Jan. 9, 2025, 1:19 p.m. UTC | #9
On 09.01.25 14:05, David Hildenbrand wrote:
>   >
>> That is clear. The issue would be to figure out which chunks are valid to
>> unmap. If something critical like the executable file is unmapped,
>> the process crashes. But see below.
> 
> Ah, now I see what you mean. Yes, also the stack etc. will be
> problematic. So IIUC, you want to limit the munmap optimization only to
> the manually mmap()ed parts.
> 
>>
>>>> Is it fine to rely on CONFIG_ANON_VMA_NAME?
>>>> That would make it much easier to implement.
>>>
>>> Can you elaborate how you would do it?
>>
>> First set the VMA name after mmap():

I took a look at the implementation, and VMA merging seems to be able to 
merge such VMAs that share the same name (even when set separately).

So assuming you use the same name for all, that should indeed also work.
Thomas Weißschuh Jan. 9, 2025, 1:38 p.m. UTC | #10
On Thu, Jan 09, 2025 at 02:05:43PM +0100, David Hildenbrand wrote:
> >
> > That is clear. The issue would be to figure out which chunks are valid to
> > unmap. If something critical like the executable file is unmapped,
> > the process crashes. But see below.
> 
> Ah, now I see what you mean. Yes, also the stack etc. will be problematic.
> So IIUC, you want to limit the munmap optimization only to the manually
> mmap()ed parts.

Correct.

> > > > Is it fine to rely on CONFIG_ANON_VMA_NAME?
> > > > That would make it much easier to implement.
> > > 
> > > Can you elaborate how you would do it?
> > 
> > First set the VMA name after mmap():
> > 
> > for (i = 0; i < NR_CHUNKS_LOW; i++) {
> > 	ptr[i] = mmap(NULL, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE,
> > 		     MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > 
> > 	if (ptr[i] == MAP_FAILED) {
> > 		if (validate_lower_address_hint())
> > 			ksft_exit_fail_msg("mmap unexpectedly succeeded with hint\n");
> > 		break;
> > 	}
> > 
> > 	validate_addr(ptr[i], 0);
> > 	if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ptr[i], MAP_CHUNK_SIZE, "virtual_address_range"))
> > 		ksft_exit_fail_msg("prctl(PR_SET_VMA_ANON_NAME) failed: %s\n", strerror(errno));
> 
> Likely this would prevent merging of VMAs.
>
> With a 1 GiB chunk size, and NR_CHUNKS_LOW covering 128 TiB, you'd already
> require 128k VMAs. The default limit is frequently 64k.

They are merged for me, as they all share the same name.

PR_SET_VMA(2const) even mentions merging:

	Note that assigning an attribute to a virtual memory area might
	prevent it from being merged with adjacent virtual memory areas
	due to the difference in that attribute's value.

is_mergeable_vma() has an explicit check using anon_vma_name_eq().

> We could just scan the ptr / hptr array to see if this is a manual mmap area
> or not. If this takes too long, one could sort the arrays by address and
> perform a binary search.
>
> Not the most efficient way of doing it, but maybe good enough for this test?

A naive loop is what I tried first, but it took forever.

> Alternatively, store the pointer in a xarray-like tree instead of two
> arrays. Requires a bit more memory ... and we'd have to find a simple
> implementation we could just reuse in this test. So maybe there is a simpler
> way to get it done.

IMO the prctl() is that simpler way.
The only real drawback is the dependency on CONFIG_ANON_VMA_NAME.
We can add an entry to tools/testing/selftests/mm/config for it.


Thomas
Patch

diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
index 2a2b69e91950a37999f606847c9c8328d79890c2..d7bf8094d8bcd4bc96e2db4dc3fcb41968def859 100644
--- a/tools/testing/selftests/mm/virtual_address_range.c
+++ b/tools/testing/selftests/mm/virtual_address_range.c
@@ -178,6 +178,12 @@  int main(int argc, char *argv[])
 		validate_addr(ptr[i], 0);
 	}
 	lchunks = i;
+
+	if (!lchunks) {
+		ksft_test_result_skip("Not enough memory for a single chunk\n");
+		ksft_finished();
+	}
+
 	hptr = (char **) calloc(NR_CHUNKS_HIGH, sizeof(char *));
 	if (hptr == NULL) {
 		ksft_test_result_skip("Memory constraint not fulfilled\n");