diff mbox series

[v1,2/2] mm/hugetlb: document why hugetlb uses folio_mapcount() for COW reuse decisions

Message ID 20240502085259.103784-3-david@redhat.com (mailing list archive)
State Accepted
Commit b8a2528835b31718286e7436529917e1f521bf6f
Headers show
Series selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL | expand

Commit Message

David Hildenbrand May 2, 2024, 8:52 a.m. UTC
Let's document why hugetlb still uses folio_mapcount() and is prone to
leaking memory between processes, for example using vmsplice() that
still uses FOLL_GET.

More details can be found in [1], especially around how hugetlb pages
cannot really be overcommitted, and why we don't particularly care about
these vmsplice() leaks for hugetlb -- in contrast to ordinary memory.

[1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/

Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/hugetlb.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Peter Xu May 2, 2024, 2:28 p.m. UTC | #1
On Thu, May 02, 2024 at 10:52:59AM +0200, David Hildenbrand wrote:
> Let's document why hugetlb still uses folio_mapcount() and is prone to
> leaking memory between processes, for example using vmsplice() that
> still uses FOLL_GET.
> 
> More details can be found in [1], especially around how hugetlb pages
> cannot really be overcommitted, and why we don't particularly care about
> these vmsplice() leaks for hugetlb -- in contrast to ordinary memory.
> 
> [1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/
> 
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/hugetlb.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 417fc5cdb6eeb..a7efb350f5d07 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5963,6 +5963,13 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>  	/*
>  	 * If no-one else is actually using this page, we're the exclusive
>  	 * owner and can reuse this page.
> +	 *
> +	 * Note that we don't rely on the (safer) folio refcount here, because
> +	 * copying the hugetlb folio when there are unexpected (temporary)
> +	 * folio references could harm simple fork()+exit() users when
> +	 * we run out of free hugetlb folios: we would have to kill processes
> +	 * in scenarios that used to work. As a side effect, there can still
> +	 * be leaks between processes, for example, with FOLL_GET users.
>  	 */
>  	if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
>  		if (!PageAnonExclusive(&old_folio->page)) {

Thanks for preparing such updates, David.

However is fork+exit the real problem?  E.g. if a child simply fork, do
something, then exit, I don't see a major issue even if we follow refcount
here (despite the "check against 1 or 2 or 3" issue, where hugetlb_fault
can take one already).  As long as the child quits, all ref / map counts
will be released then.  If the child needs to write to ANON|PRIV it needs
to manage hugetlb reservations anyways.

In the case of vmsplice it's kind of malicious, and holding that refcount
(with 0 mapcount) doesn't sound the common scenario to me.

IIUC if we need to keep this, it was more about the case where (as you
correctly mentioned in another follow up reply) hugetlb isn't that flexible
to memory overcommits, and in many cases it won't have extra pages floating
around to allow adhoc CoWs?  While random refcount boost is easy to happen,
and here the problem is we simply cannot identify that v.s. vmsplice()
malicious takers.

Thanks,
David Hildenbrand May 2, 2024, 2:33 p.m. UTC | #2
On 02.05.24 16:28, Peter Xu wrote:
> On Thu, May 02, 2024 at 10:52:59AM +0200, David Hildenbrand wrote:
>> Let's document why hugetlb still uses folio_mapcount() and is prone to
>> leaking memory between processes, for example using vmsplice() that
>> still uses FOLL_GET.
>>
>> More details can be found in [1], especially around how hugetlb pages
>> cannot really be overcommitted, and why we don't particularly care about
>> these vmsplice() leaks for hugetlb -- in contrast to ordinary memory.
>>
>> [1] https://lore.kernel.org/all/8b42a24d-caf0-46ef-9e15-0f88d47d2f21@redhat.com/
>>
>> Suggested-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/hugetlb.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 417fc5cdb6eeb..a7efb350f5d07 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -5963,6 +5963,13 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>>   	/*
>>   	 * If no-one else is actually using this page, we're the exclusive
>>   	 * owner and can reuse this page.
>> +	 *
>> +	 * Note that we don't rely on the (safer) folio refcount here, because
>> +	 * copying the hugetlb folio when there are unexpected (temporary)
>> +	 * folio references could harm simple fork()+exit() users when
>> +	 * we run out of free hugetlb folios: we would have to kill processes
>> +	 * in scenarios that used to work. As a side effect, there can still
>> +	 * be leaks between processes, for example, with FOLL_GET users.
>>   	 */
>>   	if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
>>   		if (!PageAnonExclusive(&old_folio->page)) {
> 
> Thanks for preparing such updates, David.
> 
> However is fork+exit the real problem?  E.g. if a child simply fork, do
> something, then exit, I don't see a major issue even if we follow refcount
> here (despite the "check against 1 or 2 or 3" issue, where hugetlb_fault
> can take one already).  As long as the child quits, all ref / map counts
> will be released then.  If the child needs to write to ANON|PRIV it needs
> to manage hugetlb reservations anyways.

The PAE flag was cleared and if there is any unexpected (temporary) 
reference (page migration, lockless swapcache lookups, whatever), we're 
in trouble.

> 
> In the case of vmsplice it's kind of malicious, and holding that refcount
> (with 0 mapcount) doesn't sound the common scenario to me.

Yes, I'm not that concerned about something that that APP would be 
triggering (vmsplice).

> 
> IIUC if we need to keep this, it was more about the case where (as you
> correctly mentioned in another follow up reply) hugetlb isn't that flexible
> to memory overcommits, and in many cases it won't have extra pages floating
> around to allow adhoc CoWs?  While random refcount boost is easy to happen,
> and here the problem is we simply cannot identify that v.s. vmsplice()
> malicious takers.

Yes, exactly.
diff mbox series

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 417fc5cdb6eeb..a7efb350f5d07 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5963,6 +5963,13 @@  static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 	/*
 	 * If no-one else is actually using this page, we're the exclusive
 	 * owner and can reuse this page.
+	 *
+	 * Note that we don't rely on the (safer) folio refcount here, because
+	 * copying the hugetlb folio when there are unexpected (temporary)
+	 * folio references could harm simple fork()+exit() users when
+	 * we run out of free hugetlb folios: we would have to kill processes
+	 * in scenarios that used to work. As a side effect, there can still
+	 * be leaks between processes, for example, with FOLL_GET users.
 	 */
 	if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
 		if (!PageAnonExclusive(&old_folio->page)) {