[v2] mm: cma: indefinitely retry allocations in cma_alloc

Message ID	010101747e998731-e49f209f-8232-4496-a9fc-2465334e70d7-000000@us-west-2.amazonses.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=OUtR=CU=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org D2BCC22208 DMARC-Filter: OpenDMARC Filter v1.3.2 smtp.codeaurora.org 65A69C433CA From: Chris Goldsworthy <cgoldswo@codeaurora.org> To: akpm@linux-foundation.org Cc: linux-mm@kvack.org, linux-arm-msm@vger.kernel.org, linux-kernel@vger.kernel.org, pratikp@codeaurora.org, pdaly@codeaurora.org, sudraja@codeaurora.org, iamjoonsoo.kim@lge.com, linux-arm-msm-owner@vger.kernel.org, Chris Goldsworthy <cgoldswo@codeaurora.org>, Vinayak Menon <vinmenon@codeaurora.org> Subject: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc Date: Fri, 11 Sep 2020 19:17:05 +0000 Message-ID: <010101747e998731-e49f209f-8232-4496-a9fc-2465334e70d7-000000@us-west-2.amazonses.com> In-Reply-To: <1599851809-4342-1-git-send-email-cgoldswo@codeaurora.org> References: <06489716814387e7f147cf53d1b185a8@codeaurora.org> <1599851809-4342-1-git-send-email-cgoldswo@codeaurora.org> Feedback-ID: 1.us-west-2.CZuq2qbDmUIuT3qdvXlRHZZCpfZqZ4GtG9v3VKgRyF0=:AmazonSES Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	[v2] mm: cma: indefinitely retry allocations in cma_alloc \| expand [v2] mm: cma: indefinitely retry allocations in cma_alloc

Chris Goldsworthy Sept. 11, 2020, 7:17 p.m. UTC

CMA allocations will fail if 'pinned' pages are in a CMA area, since we
cannot migrate pinned pages. The _refcount of a struct page being greater
than _mapcount for that page can cause pinning for anonymous pages.  This
is because try_to_unmap(), which (1) is called in the CMA allocation path,
and (2) decrements both _refcount and _mapcount for a page, will stop
unmapping a page from VMAs once the _mapcount for a page reaches 0.  This
implies that after try_to_unmap() has finished successfully for a page
where _recount > _mapcount, that _refcount will be greater than 0.  Later
in the CMA allocation path in migrate_page_move_mapping(), we will have one
more reference count than intended for anonymous pages, meaning the
allocation will fail for that page.

One example of where _refcount can be greater than _mapcount for a page we
would not expect to be pinned is inside of copy_one_pte(), which is called
during a fork. For ptes for which pte_present(pte) == true, copy_one_pte()
will increment the _refcount field followed by the  _mapcount field of a
page. If the process doing copy_one_pte() is context switched out after
incrementing _refcount but before incrementing _mapcount, then the page
will be temporarily pinned.

So, inside of cma_alloc(), instead of giving up when alloc_contig_range()
returns -EBUSY after having scanned a whole CMA-region bitmap, perform
retries indefinitely, with sleeps, to give the system an opportunity to
unpin any pinned pages.

Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Co-developed-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
---
 mm/cma.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

David Hildenbrand Sept. 14, 2020, 9:31 a.m. UTC | #1

On 11.09.20 21:17, Chris Goldsworthy wrote:
> CMA allocations will fail if 'pinned' pages are in a CMA area, since we
> cannot migrate pinned pages. The _refcount of a struct page being greater
> than _mapcount for that page can cause pinning for anonymous pages.  This
> is because try_to_unmap(), which (1) is called in the CMA allocation path,
> and (2) decrements both _refcount and _mapcount for a page, will stop
> unmapping a page from VMAs once the _mapcount for a page reaches 0.  This
> implies that after try_to_unmap() has finished successfully for a page
> where _recount > _mapcount, that _refcount will be greater than 0.  Later
> in the CMA allocation path in migrate_page_move_mapping(), we will have one
> more reference count than intended for anonymous pages, meaning the
> allocation will fail for that page.
> 
> One example of where _refcount can be greater than _mapcount for a page we
> would not expect to be pinned is inside of copy_one_pte(), which is called
> during a fork. For ptes for which pte_present(pte) == true, copy_one_pte()
> will increment the _refcount field followed by the  _mapcount field of a
> page. If the process doing copy_one_pte() is context switched out after
> incrementing _refcount but before incrementing _mapcount, then the page
> will be temporarily pinned.
> 
> So, inside of cma_alloc(), instead of giving up when alloc_contig_range()
> returns -EBUSY after having scanned a whole CMA-region bitmap, perform
> retries indefinitely, with sleeps, to give the system an opportunity to
> unpin any pinned pages.
> 
> Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
> Co-developed-by: Vinayak Menon <vinmenon@codeaurora.org>
> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
> ---
>  mm/cma.c | 25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index 7f415d7..90bb505 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -32,6 +32,7 @@
>  #include <linux/highmem.h>
>  #include <linux/io.h>
>  #include <linux/kmemleak.h>
> +#include <linux/delay.h>
>  #include <trace/events/cma.h>
>  
>  #include "cma.h"
> @@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
>  				bitmap_maxno, start, bitmap_count, mask,
>  				offset);
>  		if (bitmap_no >= bitmap_maxno) {
> -			mutex_unlock(&cma->lock);
> -			break;
> +			if (ret == -EBUSY) {
> +				mutex_unlock(&cma->lock);
> +
> +				/*
> +				 * Page may be momentarily pinned by some other
> +				 * process which has been scheduled out, e.g.
> +				 * in exit path, during unmap call, or process
> +				 * fork and so cannot be freed there. Sleep
> +				 * for 100ms and retry the allocation.
> +				 */
> +				start = 0;
> +				ret = -ENOMEM;
> +				msleep(100);
> +				continue;
> +			} else {
> +				/*
> +				 * ret == -ENOMEM - all bits in cma->bitmap are
> +				 * set, so we break accordingly.
> +				 */
> +				mutex_unlock(&cma->lock);
> +				break;
> +			}
>  		}
>  		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
>  		/*
> 

What about long-term pinnings? IIRC, that can happen easily e.g., with
vfio (and I remember there is a way via vmsplice).

Not convinced trying forever is a sane approach in the general case ...

Chris Goldsworthy Sept. 14, 2020, 6:33 p.m. UTC | #2

On 2020-09-14 02:31, David Hildenbrand wrote:
> On 11.09.20 21:17, Chris Goldsworthy wrote:
>> 
>> So, inside of cma_alloc(), instead of giving up when 
>> alloc_contig_range()
>> returns -EBUSY after having scanned a whole CMA-region bitmap, perform
>> retries indefinitely, with sleeps, to give the system an opportunity 
>> to
>> unpin any pinned pages.
>> 
>> Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
>> Co-developed-by: Vinayak Menon <vinmenon@codeaurora.org>
>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
>> ---
>>  mm/cma.c | 25 +++++++++++++++++++++++--
>>  1 file changed, 23 insertions(+), 2 deletions(-)
>> 
>> diff --git a/mm/cma.c b/mm/cma.c
>> index 7f415d7..90bb505 100644
>> --- a/mm/cma.c
>> +++ b/mm/cma.c
>> @@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t 
>> count, unsigned int align,
>>  				bitmap_maxno, start, bitmap_count, mask,
>>  				offset);
>>  		if (bitmap_no >= bitmap_maxno) {
>> -			mutex_unlock(&cma->lock);
>> -			break;
>> +			if (ret == -EBUSY) {
>> +				mutex_unlock(&cma->lock);
>> +
>> +				/*
>> +				 * Page may be momentarily pinned by some other
>> +				 * process which has been scheduled out, e.g.
>> +				 * in exit path, during unmap call, or process
>> +				 * fork and so cannot be freed there. Sleep
>> +				 * for 100ms and retry the allocation.
>> +				 */
>> +				start = 0;
>> +				ret = -ENOMEM;
>> +				msleep(100);
>> +				continue;
>> +			} else {
>> +				/*
>> +				 * ret == -ENOMEM - all bits in cma->bitmap are
>> +				 * set, so we break accordingly.
>> +				 */
>> +				mutex_unlock(&cma->lock);
>> +				break;
>> +			}
>>  		}
>>  		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
>>  		/*
>> 
> 
> What about long-term pinnings? IIRC, that can happen easily e.g., with
> vfio (and I remember there is a way via vmsplice).
> 
> Not convinced trying forever is a sane approach in the general case ...

Hi David,

I've botched the threading, so there are discussions with respect to the 
previous patch-set that is missing on this thread, which I will 
summarize below:

V1:
[1] https://lkml.org/lkml/2020/8/5/1097
[2] https://lkml.org/lkml/2020/8/6/1040
[3] https://lkml.org/lkml/2020/8/11/893
[4] https://lkml.org/lkml/2020/8/21/1490
[5] https://lkml.org/lkml/2020/9/11/1072

[1] features version of the patch featured a finite number of retries, 
which has been stable for our kernels. In [2], Andrew questioned whether 
we could actually find a way of solving the problem on the grounds that 
doing a finite number of retries doesn't actually fix the problem (more 
importantly, in [4] Andrew indicated that he would prefer not to merge 
the patch as it doesn't solve the issue).  In [3], I suggest one actual 
fix for this, which is to use preempt_disable/enable() to prevent 
context switches from occurring during the periods in copy_one_pte() and 
exit_mmap() (I forgot to mention this case in the commit text) in which 
_refcount > _mapcount for a page - you would also need to prevent 
interrupts from occurring to if we were to fully prevent the issue from 
occurring.  I think this would be acceptable for the copy_one_pte() 
case, since there _refcount > _mapcount for little time.  For the 
exit_mmap() case, however, _refcount is greater than _mapcount whilst 
the page-tables are being torn down for a process - that could be too 
long for disabling preemption / interrupts.

So, in [4], Andrew asks about two alternatives to see if they're viable: 
(1) acquiring locks on the exit_mmap path and migration paths, (2) 
retrying indefinitely.  In [5], I discuss how using locks could increase 
the time it takes to perform a CMA allocation, such that a retry 
approach would avoid increased CMA allocation times. I'm also uncertain 
about how the locking scheme could be implemented effectively without 
introducing a new per-page lock that will be used specifically to solve 
this issue, and I'm not sure this would be accepted.

We're fine with doing indefinite retries, on the grounds that if there 
is some long-term pinning that occurs when alloc_contig_range returns 
-EBUSY, that it should be debugged and fixed.  Would it be possible to 
make this infinite-retrying something that could be enabled or disabled 
by a defconfig option?

Thanks,

Chris.

Chris Goldsworthy Sept. 14, 2020, 9:52 p.m. UTC | #3

On 2020-09-14 11:33, Chris Goldsworthy wrote:
> On 2020-09-14 02:31, David Hildenbrand wrote:
>> What about long-term pinnings? IIRC, that can happen easily e.g., with
>> vfio (and I remember there is a way via vmsplice).
>> 
>> Not convinced trying forever is a sane approach in the general case 
>> ...
> 
> Hi David,
> 
> I've botched the threading, so there are discussions with respect to
> the previous patch-set that is missing on this thread, which I will
> summarize below:
> 
> V1:
> [1] https://lkml.org/lkml/2020/8/5/1097
> [2] https://lkml.org/lkml/2020/8/6/1040
> [3] https://lkml.org/lkml/2020/8/11/893
> [4] https://lkml.org/lkml/2020/8/21/1490
> [5] https://lkml.org/lkml/2020/9/11/1072
> 
> [1] features version of the patch featured a finite number of retries,
> which has been stable for our kernels. In [2], Andrew questioned
> whether we could actually find a way of solving the problem on the
> grounds that doing a finite number of retries doesn't actually fix the
> problem (more importantly, in [4] Andrew indicated that he would
> prefer not to merge the patch as it doesn't solve the issue).  In [3],
> I suggest one actual fix for this, which is to use
> preempt_disable/enable() to prevent context switches from occurring
> during the periods in copy_one_pte() and exit_mmap() (I forgot to
> mention this case in the commit text) in which _refcount > _mapcount
> for a page - you would also need to prevent interrupts from occurring
> to if we were to fully prevent the issue from occurring.  I think this
> would be acceptable for the copy_one_pte() case, since there _refcount
> > _mapcount for little time.  For the exit_mmap() case, however, _refcount is greater than _mapcount whilst the page-tables are being torn down for a process - that could be too long for disabling preemption / interrupts.
> 
> So, in [4], Andrew asks about two alternatives to see if they're
> viable: (1) acquiring locks on the exit_mmap path and migration paths,
> (2) retrying indefinitely.  In [5], I discuss how using locks could
> increase the time it takes to perform a CMA allocation, such that a
> retry approach would avoid increased CMA allocation times. I'm also
> uncertain about how the locking scheme could be implemented
> effectively without introducing a new per-page lock that will be used
> specifically to solve this issue, and I'm not sure this would be
> accepted.
> 
> We're fine with doing indefinite retries, on the grounds that if there
> is some long-term pinning that occurs when alloc_contig_range returns
> -EBUSY, that it should be debugged and fixed.  Would it be possible to
> make this infinite-retrying something that could be enabled or
> disabled by a defconfig option?
> 
> Thanks,
> 
> Chris.

Actually, if we were willing to have a defconfig option for enabling / 
disabling indefinite retries on the return of -EBUSY, would it be 
possibly to re-structure the patch to allow either (1) indefinite 
retrying, or (2) doing a fixed number of retires (as some people might 
want to tolerate CMA allocation failures in favor of making progress)?

David Hildenbrand Sept. 15, 2020, 7:53 a.m. UTC | #4

On 14.09.20 20:33, Chris Goldsworthy wrote:
> On 2020-09-14 02:31, David Hildenbrand wrote:
>> On 11.09.20 21:17, Chris Goldsworthy wrote:
>>>
>>> So, inside of cma_alloc(), instead of giving up when 
>>> alloc_contig_range()
>>> returns -EBUSY after having scanned a whole CMA-region bitmap, perform
>>> retries indefinitely, with sleeps, to give the system an opportunity 
>>> to
>>> unpin any pinned pages.
>>>
>>> Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
>>> Co-developed-by: Vinayak Menon <vinmenon@codeaurora.org>
>>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
>>> ---
>>>  mm/cma.c | 25 +++++++++++++++++++++++--
>>>  1 file changed, 23 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/cma.c b/mm/cma.c
>>> index 7f415d7..90bb505 100644
>>> --- a/mm/cma.c
>>> +++ b/mm/cma.c
>>> @@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t 
>>> count, unsigned int align,
>>>  				bitmap_maxno, start, bitmap_count, mask,
>>>  				offset);
>>>  		if (bitmap_no >= bitmap_maxno) {
>>> -			mutex_unlock(&cma->lock);
>>> -			break;
>>> +			if (ret == -EBUSY) {
>>> +				mutex_unlock(&cma->lock);
>>> +
>>> +				/*
>>> +				 * Page may be momentarily pinned by some other
>>> +				 * process which has been scheduled out, e.g.
>>> +				 * in exit path, during unmap call, or process
>>> +				 * fork and so cannot be freed there. Sleep
>>> +				 * for 100ms and retry the allocation.
>>> +				 */
>>> +				start = 0;
>>> +				ret = -ENOMEM;
>>> +				msleep(100);
>>> +				continue;
>>> +			} else {
>>> +				/*
>>> +				 * ret == -ENOMEM - all bits in cma->bitmap are
>>> +				 * set, so we break accordingly.
>>> +				 */
>>> +				mutex_unlock(&cma->lock);
>>> +				break;
>>> +			}
>>>  		}
>>>  		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
>>>  		/*
>>>
>>
>> What about long-term pinnings? IIRC, that can happen easily e.g., with
>> vfio (and I remember there is a way via vmsplice).
>>
>> Not convinced trying forever is a sane approach in the general case ...
> 
> Hi David,
> 
> I've botched the threading, so there are discussions with respect to the 
> previous patch-set that is missing on this thread, which I will 
> summarize below:
> 
> V1:
> [1] https://lkml.org/lkml/2020/8/5/1097
> [2] https://lkml.org/lkml/2020/8/6/1040
> [3] https://lkml.org/lkml/2020/8/11/893
> [4] https://lkml.org/lkml/2020/8/21/1490
> [5] https://lkml.org/lkml/2020/9/11/1072
> 
> [1] features version of the patch featured a finite number of retries, 
> which has been stable for our kernels. In [2], Andrew questioned whether 
> we could actually find a way of solving the problem on the grounds that 
> doing a finite number of retries doesn't actually fix the problem (more 
> importantly, in [4] Andrew indicated that he would prefer not to merge 
> the patch as it doesn't solve the issue).  In [3], I suggest one actual 
> fix for this, which is to use preempt_disable/enable() to prevent 
> context switches from occurring during the periods in copy_one_pte() and 
> exit_mmap() (I forgot to mention this case in the commit text) in which 
> _refcount > _mapcount for a page - you would also need to prevent 
> interrupts from occurring to if we were to fully prevent the issue from 
> occurring.  I think this would be acceptable for the copy_one_pte() 
> case, since there _refcount > _mapcount for little time.  For the 
> exit_mmap() case, however, _refcount is greater than _mapcount whilst 
> the page-tables are being torn down for a process - that could be too 
> long for disabling preemption / interrupts.
> 
> So, in [4], Andrew asks about two alternatives to see if they're viable: 
> (1) acquiring locks on the exit_mmap path and migration paths, (2) 
> retrying indefinitely.  In [5], I discuss how using locks could increase 
> the time it takes to perform a CMA allocation, such that a retry 
> approach would avoid increased CMA allocation times. I'm also uncertain 
> about how the locking scheme could be implemented effectively without 
> introducing a new per-page lock that will be used specifically to solve 
> this issue, and I'm not sure this would be accepted.

Thanks for the nice summary!

> 
> We're fine with doing indefinite retries, on the grounds that if there 
> is some long-term pinning that occurs when alloc_contig_range returns 
> -EBUSY, that it should be debugged and fixed.  Would it be possible to 
> make this infinite-retrying something that could be enabled or disabled 
> by a defconfig option?

Two thoughts:

1. Most (all?) alloc_contig_range() users are interested in handling
short-term pinnings in a nice way (IOW, make the allocation succeed).
I'd much rather want to see this being handled in a nice fashion inside
alloc_contig_range() than having to encode endless loops in the caller.
This means I strongly prefer something like [3] if feasible. But I can
understand that stuff ([5]) is complicated. I have to admit that I am
not an expert on the short term pinning described by you, and how to
eventually fix it.

2. The issue that I am having is that long-term pinnings are
(unfortunately) a real thing. It's not something to debug and fix as you
suggest. Like, run a VM with VFIO (e.g., PCI passthrough). While that VM
is running, all VM memory will be pinned. If memory falls onto a CMA
region your cma_alloc() will be stuck in an (endless, meaning until the
VM ended) loop. I am not sure if all cma users are fine with that -
especially, think about CMA being used for gigantic pages now.

Assume you want to start a new VM while the other one is running and use
some (new) gigantic pages for it. Suddenly you're trapped in an endless
loop in the kernel. That's nasty.

We do have a similar endless loop on the memory hotunplug/offlining path
(offline_pages()). However, if triggered by a user (echo 0 >
/sys/devices/system/memory/memoryX/online), a user can stop trying by
sending a signal.


If we want to stick to retrying forever, can't we use flags like
__GFP_NOFAIL to explicitly enable this new behavior for selected
cma_alloc() users that really can't fail/retry manually again?

Chris Goldsworthy Sept. 17, 2020, 5:26 p.m. UTC | #5

On 2020-09-15 00:53, David Hildenbrand wrote:
> On 14.09.20 20:33, Chris Goldsworthy wrote:
>> On 2020-09-14 02:31, David Hildenbrand wrote:
>>> On 11.09.20 21:17, Chris Goldsworthy wrote:
>>>> 
>>>> So, inside of cma_alloc(), instead of giving up when
>>>> alloc_contig_range()
>>>> returns -EBUSY after having scanned a whole CMA-region bitmap, 
>>>> perform
>>>> retries indefinitely, with sleeps, to give the system an opportunity
>>>> to
>>>> unpin any pinned pages.
>>>> 
>>>> Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
>>>> Co-developed-by: Vinayak Menon <vinmenon@codeaurora.org>
>>>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
>>>> ---
>>>>  mm/cma.c | 25 +++++++++++++++++++++++--
>>>>  1 file changed, 23 insertions(+), 2 deletions(-)
>>>> 
>>>> diff --git a/mm/cma.c b/mm/cma.c
>>>> index 7f415d7..90bb505 100644
>>>> --- a/mm/cma.c
>>>> +++ b/mm/cma.c
>>>> @@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t
>>>> count, unsigned int align,
>>>>  				bitmap_maxno, start, bitmap_count, mask,
>>>>  				offset);
>>>>  		if (bitmap_no >= bitmap_maxno) {
>>>> -			mutex_unlock(&cma->lock);
>>>> -			break;
>>>> +			if (ret == -EBUSY) {
>>>> +				mutex_unlock(&cma->lock);
>>>> +
>>>> +				/*
>>>> +				 * Page may be momentarily pinned by some other
>>>> +				 * process which has been scheduled out, e.g.
>>>> +				 * in exit path, during unmap call, or process
>>>> +				 * fork and so cannot be freed there. Sleep
>>>> +				 * for 100ms and retry the allocation.
>>>> +				 */
>>>> +				start = 0;
>>>> +				ret = -ENOMEM;
>>>> +				msleep(100);
>>>> +				continue;
>>>> +			} else {
>>>> +				/*
>>>> +				 * ret == -ENOMEM - all bits in cma->bitmap are
>>>> +				 * set, so we break accordingly.
>>>> +				 */
>>>> +				mutex_unlock(&cma->lock);
>>>> +				break;
>>>> +			}
>>>>  		}
>>>>  		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
>>>>  		/*
>>>> 
>>> 
>>> What about long-term pinnings? IIRC, that can happen easily e.g., 
>>> with
>>> vfio (and I remember there is a way via vmsplice).
>>> 
>>> Not convinced trying forever is a sane approach in the general case 
>>> ...
>> 
>> V1:
>> [1] https://lkml.org/lkml/2020/8/5/1097
>> [2] https://lkml.org/lkml/2020/8/6/1040
>> [3] https://lkml.org/lkml/2020/8/11/893
>> [4] https://lkml.org/lkml/2020/8/21/1490
>> [5] https://lkml.org/lkml/2020/9/11/1072
>> 
>> We're fine with doing indefinite retries, on the grounds that if there
>> is some long-term pinning that occurs when alloc_contig_range returns
>> -EBUSY, that it should be debugged and fixed.  Would it be possible to
>> make this infinite-retrying something that could be enabled or 
>> disabled
>> by a defconfig option?
> 
> Two thoughts:
> 
> This means I strongly prefer something like [3] if feasible.

I can give [3] some further thought then.  Also, I realized [3] will not 
completely solve the problem, it just reduces the window in which 
_refcount > _mapcount (as mentioned in earlier threads, we encountered 
the pinning when a task in copy_one_pte() or in the exit_mmap() path 
gets context switched out).  If we were to try a sleeping-lock based 
solution, do you think it would be permissible to add another lock to 
struct page?

> 2. The issue that I am having is that long-term pinnings are
> (unfortunately) a real thing. It's not something to debug and fix as 
> you
> suggest. Like, run a VM with VFIO (e.g., PCI passthrough). While that 
> VM
> is running, all VM memory will be pinned. If memory falls onto a CMA
> region your cma_alloc() will be stuck in an (endless, meaning until the
> VM ended) loop. I am not sure if all cma users are fine with that -
> especially, think about CMA being used for gigantic pages now.
> 
> Assume you want to start a new VM while the other one is running and 
> use
> some (new) gigantic pages for it. Suddenly you're trapped in an endless
> loop in the kernel. That's nasty.


Thanks for providing this example.

> 
> If we want to stick to retrying forever, can't we use flags like
> __GFP_NOFAIL to explicitly enable this new behavior for selected
> cma_alloc() users that really can't fail/retry manually again?

This would work, we would just have to undo the work done by this patch 
/ re-introduce the GFP parameter for cma_alloc(): 
http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqTPJKij2939229392eucas1p2j@eucas1p2.samsung.com 
, and add the support __GFP_NOFAIL (and ignore any flag that is not one 
of __GFP_NOFAIL or __GFP_NOWARN).

Chris Goldsworthy Sept. 17, 2020, 5:54 p.m. UTC | #6

On 2020-09-15 00:53, David Hildenbrand wrote:
> On 14.09.20 20:33, Chris Goldsworthy wrote:
>> On 2020-09-14 02:31, David Hildenbrand wrote:
>>> On 11.09.20 21:17, Chris Goldsworthy wrote:
>>>> 
>>>> So, inside of cma_alloc(), instead of giving up when
>>>> alloc_contig_range()
>>>> returns -EBUSY after having scanned a whole CMA-region bitmap,
>>>> perform
>>>> retries indefinitely, with sleeps, to give the system an opportunity
>>>> to
>>>> unpin any pinned pages.
>>>> 
>>>> Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
>>>> Co-developed-by: Vinayak Menon <vinmenon@codeaurora.org>
>>>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
>>>> ---
>>>>  mm/cma.c | 25 +++++++++++++++++++++++--
>>>>  1 file changed, 23 insertions(+), 2 deletions(-)
>>>> 
>>>> diff --git a/mm/cma.c b/mm/cma.c
>>>> index 7f415d7..90bb505 100644
>>>> --- a/mm/cma.c
>>>> +++ b/mm/cma.c
>>>> @@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t
>>>> count, unsigned int align,
>>>>  				bitmap_maxno, start, bitmap_count, mask,
>>>>  				offset);
>>>>  		if (bitmap_no >= bitmap_maxno) {
>>>> -			mutex_unlock(&cma->lock);
>>>> -			break;
>>>> +			if (ret == -EBUSY) {
>>>> +				mutex_unlock(&cma->lock);
>>>> +
>>>> +				/*
>>>> +				 * Page may be momentarily pinned by some other
>>>> +				 * process which has been scheduled out, e.g.
>>>> +				 * in exit path, during unmap call, or process
>>>> +				 * fork and so cannot be freed there. Sleep
>>>> +				 * for 100ms and retry the allocation.
>>>> +				 */
>>>> +				start = 0;
>>>> +				ret = -ENOMEM;
>>>> +				msleep(100);
>>>> +				continue;
>>>> +			} else {
>>>> +				/*
>>>> +				 * ret == -ENOMEM - all bits in cma->bitmap are
>>>> +				 * set, so we break accordingly.
>>>> +				 */
>>>> +				mutex_unlock(&cma->lock);
>>>> +				break;
>>>> +			}
>>>>  		}
>>>>  		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
>>>>  		/*
>>>> 
>>> 
>>> What about long-term pinnings? IIRC, that can happen easily e.g.,
>>> with
>>> vfio (and I remember there is a way via vmsplice).
>>> 
>>> Not convinced trying forever is a sane approach in the general case
>>> ...
>> 
>> V1:
>> [1] https://lkml.org/lkml/2020/8/5/1097
>> [2] https://lkml.org/lkml/2020/8/6/1040
>> [3] https://lkml.org/lkml/2020/8/11/893
>> [4] https://lkml.org/lkml/2020/8/21/1490
>> [5] https://lkml.org/lkml/2020/9/11/1072
>> 
>> We're fine with doing indefinite retries, on the grounds that if there
>> is some long-term pinning that occurs when alloc_contig_range returns
>> -EBUSY, that it should be debugged and fixed.  Would it be possible to
>> make this infinite-retrying something that could be enabled or
>> disabled
>> by a defconfig option?
> 
> Two thoughts:
> 
> This means I strongly prefer something like [3] if feasible.

_Resending so that this ends up on LKML_

I can give [3] some further thought then.  Also, I realized [3] will not
completely solve the problem, it just reduces the window in which
_refcount > _mapcount (as mentioned in earlier threads, we encountered
the pinning when a task in copy_one_pte() or in the exit_mmap() path
gets context switched out).  If we were to try a sleeping-lock based
solution, do you think it would be permissible to add another lock to
struct page?

> 2. The issue that I am having is that long-term pinnings are
> (unfortunately) a real thing. It's not something to debug and fix as
> you
> suggest. Like, run a VM with VFIO (e.g., PCI passthrough). While that
> VM
> is running, all VM memory will be pinned. If memory falls onto a CMA
> region your cma_alloc() will be stuck in an (endless, meaning until the
> VM ended) loop. I am not sure if all cma users are fine with that -
> especially, think about CMA being used for gigantic pages now.
> 
> Assume you want to start a new VM while the other one is running and
> use
> some (new) gigantic pages for it. Suddenly you're trapped in an endless
> loop in the kernel. That's nasty.


Thanks for providing this example.

> 
> If we want to stick to retrying forever, can't we use flags like
> __GFP_NOFAIL to explicitly enable this new behavior for selected
> cma_alloc() users that really can't fail/retry manually again?

This would work, we would just have to undo the work done by this patch
/ re-introduce the GFP parameter for cma_alloc():
http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqTPJKij2939229392eucas1p2j@eucas1p2.samsung.com
, and add the support __GFP_NOFAIL (and ignore any flag that is not one
of __GFP_NOFAIL or __GFP_NOWARN).

Chris Goldsworthy Sept. 24, 2020, 5:13 a.m. UTC | #7

On 2020-09-17 10:54, Chris Goldsworthy wrote:
> On 2020-09-15 00:53, David Hildenbrand wrote:
>> On 14.09.20 20:33, Chris Goldsworthy wrote:
>>> On 2020-09-14 02:31, David Hildenbrand wrote:
>>>> On 11.09.20 21:17, Chris Goldsworthy wrote:
>>>>> 
>>>>> So, inside of cma_alloc(), instead of giving up when
>>>>> alloc_contig_range()
>>>>> returns -EBUSY after having scanned a whole CMA-region bitmap,
>>>>> perform
>>>>> retries indefinitely, with sleeps, to give the system an 
>>>>> opportunity
>>>>> to
>>>>> unpin any pinned pages.
>>>>> 
>>>>> Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
>>>>> Co-developed-by: Vinayak Menon <vinmenon@codeaurora.org>
>>>>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
>>>>> ---
>>>>>  mm/cma.c | 25 +++++++++++++++++++++++--
>>>>>  1 file changed, 23 insertions(+), 2 deletions(-)
>>>>> 
>>>>> diff --git a/mm/cma.c b/mm/cma.c
>>>>> index 7f415d7..90bb505 100644
>>>>> --- a/mm/cma.c
>>>>> +++ b/mm/cma.c
>>>>> @@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t
>>>>> count, unsigned int align,
>>>>>  				bitmap_maxno, start, bitmap_count, mask,
>>>>>  				offset);
>>>>>  		if (bitmap_no >= bitmap_maxno) {
>>>>> -			mutex_unlock(&cma->lock);
>>>>> -			break;
>>>>> +			if (ret == -EBUSY) {
>>>>> +				mutex_unlock(&cma->lock);
>>>>> +
>>>>> +				/*
>>>>> +				 * Page may be momentarily pinned by some other
>>>>> +				 * process which has been scheduled out, e.g.
>>>>> +				 * in exit path, during unmap call, or process
>>>>> +				 * fork and so cannot be freed there. Sleep
>>>>> +				 * for 100ms and retry the allocation.
>>>>> +				 */
>>>>> +				start = 0;
>>>>> +				ret = -ENOMEM;
>>>>> +				msleep(100);
>>>>> +				continue;
>>>>> +			} else {
>>>>> +				/*
>>>>> +				 * ret == -ENOMEM - all bits in cma->bitmap are
>>>>> +				 * set, so we break accordingly.
>>>>> +				 */
>>>>> +				mutex_unlock(&cma->lock);
>>>>> +				break;
>>>>> +			}
>>>>>  		}
>>>>>  		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
>>>>>  		/*
>>>>> 
>>>> 
>>>> What about long-term pinnings? IIRC, that can happen easily e.g.,
>>>> with
>>>> vfio (and I remember there is a way via vmsplice).
>>>> 
>>>> Not convinced trying forever is a sane approach in the general case
>>>> ...
>>> 
>>> V1:
>>> [1] https://lkml.org/lkml/2020/8/5/1097
>>> [2] https://lkml.org/lkml/2020/8/6/1040
>>> [3] https://lkml.org/lkml/2020/8/11/893
>>> [4] https://lkml.org/lkml/2020/8/21/1490
>>> [5] https://lkml.org/lkml/2020/9/11/1072
>>> 
>>> We're fine with doing indefinite retries, on the grounds that if 
>>> there
>>> is some long-term pinning that occurs when alloc_contig_range returns
>>> -EBUSY, that it should be debugged and fixed.  Would it be possible 
>>> to
>>> make this infinite-retrying something that could be enabled or
>>> disabled
>>> by a defconfig option?
>> 
>> Two thoughts:
>> 
>> This means I strongly prefer something like [3] if feasible.
> 
> _Resending so that this ends up on LKML_
> 
> I can give [3] some further thought then.  Also, I realized [3] will 
> not
> completely solve the problem, it just reduces the window in which
> _refcount > _mapcount (as mentioned in earlier threads, we encountered
> the pinning when a task in copy_one_pte() or in the exit_mmap() path
> gets context switched out).  If we were to try a sleeping-lock based
> solution, do you think it would be permissible to add another lock to
> struct page?

I have not been able to think of a clean way of introducing calls to 
preempt_disable() in exit_mmap(), which is the more problematic case.  
We would need to track state across multiple invocations of 
zap_pte_range() (which is called for each entry in a PMD when a 
process's memory is being unmapped), and would also need to extend this 
to tlb_finish_mmu(), which is called after all the process's memory has 
been unmapped: 
https://elixir.bootlin.com/linux/v5.8.10/source/mm/mmap.c#L3164.  As a 
follow-up to this patch, I'm submitting a patch that re-introduces the 
GFP mask for cma_alloc, that will perform indefinite retires if 
__GFP_NOFAIL is passed to the function.

Christoph Hellwig Sept. 28, 2020, 7:39 a.m. UTC | #8

On Tue, Sep 15, 2020 at 09:53:30AM +0200, David Hildenbrand wrote:
> Two thoughts:
> 
> 1. Most (all?) alloc_contig_range() users are interested in handling
> short-term pinnings in a nice way (IOW, make the allocation succeed).
> I'd much rather want to see this being handled in a nice fashion inside
> alloc_contig_range() than having to encode endless loops in the caller.
> This means I strongly prefer something like [3] if feasible. But I can
> understand that stuff ([5]) is complicated. I have to admit that I am
> not an expert on the short term pinning described by you, and how to
> eventually fix it.

Agreed.  Also retrying forever is simply broken, and will lead to
deadlocks for the DMA calls into CMA, so with my dma-mapping hat on
I have to hard-NAK this approach.

[v2] mm: cma: indefinitely retry allocations in cma_alloc

Commit Message

Comments

Patch