
[1/8] mm: cma: introduce cma_release_nowait()

Message ID 20210325002835.216118-2-mike.kravetz@oracle.com (mailing list archive)
State New, archived
Series make hugetlb put_page safe for all calling contexts

Commit Message

Mike Kravetz March 25, 2021, 12:28 a.m. UTC
From: Roman Gushchin <guro@fb.com>

cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
It makes it a blocking function, which complicates its usage from
non-blocking contexts. For instance, hugetlbfs code is temporarily
dropping the hugetlb_lock spinlock to call cma_release().

This patch introduces a non-blocking cma_release_nowait(), which
postpones the cma bitmap clearance. It's done later from a work
context. The first page in the cma allocation is used to store
the work struct. Because CMA allocations and de-allocations are
usually not that frequent, a single global workqueue is used.

To make sure that subsequent cma_alloc() call will pass, cma_alloc()
flushes the cma_release_wq workqueue. To avoid a performance
regression in the case when only cma_release() is used, gate it
by a per-cma area flag, which is set by the first call
of cma_release_nowait().

Signed-off-by: Roman Gushchin <guro@fb.com>
[mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/cma.h |  2 +
 mm/cma.c            | 93 +++++++++++++++++++++++++++++++++++++++++++++
 mm/cma.h            |  5 +++
 3 files changed, 100 insertions(+)

Comments

Oscar Salvador March 25, 2021, 9:39 a.m. UTC | #1
On Wed, Mar 24, 2021 at 05:28:28PM -0700, Mike Kravetz wrote:
> +struct cma_clear_bitmap_work {
> +	struct work_struct work;
> +	struct cma *cma;
> +	unsigned long pfn;
> +	unsigned int count;
> +};
> +
>  struct cma cma_areas[MAX_CMA_AREAS];
>  unsigned cma_area_count;
>  
> +struct workqueue_struct *cma_release_wq;

should this be static?

> +
>  phys_addr_t cma_get_base(const struct cma *cma)
>  {
>  	return PFN_PHYS(cma->base_pfn);
> @@ -146,6 +155,10 @@ static int __init cma_init_reserved_areas(void)
>  	for (i = 0; i < cma_area_count; i++)
>  		cma_activate_area(&cma_areas[i]);
>  
> +	cma_release_wq = create_workqueue("cma_release");
> +	if (!cma_release_wq)
> +		return -ENOMEM;
> +

I did not check the code that goes through the initcalls, and maybe we
cannot really have this scenario, but what happens if we return -ENOMEM?
Because I can see that later cma_release_nowait() uses cma_release_wq.
Can cma_release_wq be NULL at that stage due to the -ENOMEM, or are we
guaranteed to never reach that point?

Also, should the cma_release_wq go before the cma_activate_area?

> +bool cma_release_nowait(struct cma *cma, const struct page *pages,
> +			unsigned int count)
> +{
> +	struct cma_clear_bitmap_work *work;
> +	unsigned long pfn;
> +
> +	if (!cma || !pages)
> +		return false;
> +
> +	pr_debug("%s(page %p)\n", __func__, (void *)pages);

cma_release() seems to have:

pr_debug("%s(page %p, count %u)\n", __func__, (void *)pages, count);

any reason to not have the same here?


> +
> +	pfn = page_to_pfn(pages);
> +
> +	if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
> +		return false;
> +
> +	VM_BUG_ON(pfn + count > cma->base_pfn + cma->count);
> +
> +	/*
> +	 * Set CMA_DELAYED_RELEASE flag: subsequent cma_alloc()'s
> +	 * will wait for the async part of cma_release_nowait() to
> +	 * finish.
> +	 */
> +	if (unlikely(!test_bit(CMA_DELAYED_RELEASE, &cma->flags)))
> +		set_bit(CMA_DELAYED_RELEASE, &cma->flags);

It seems this cannot really happen? This is the only place we set the
bit, right?
Why not set the bit unconditionally? What is this guarding us against?

> +
> +	/*
> +	 * To make cma_release_nowait() non-blocking, cma bitmap is cleared
> +	 * from a work context (see cma_clear_bitmap_fn()). The first page
> +	 * in the cma allocation is used to store the work structure,
> +	 * so it's released after the cma bitmap clearance. Other pages
> +	 * are released immediately as previously.
> +	 */
> +	if (count > 1)
> +		free_contig_range(pfn + 1, count - 1);
> +
> +	work = (struct cma_clear_bitmap_work *)page_to_virt(pages);
> +	INIT_WORK(&work->work, cma_clear_bitmap_fn);
> +	work->cma = cma;
> +	work->pfn = pfn;
> +	work->count = count;
> +	queue_work(cma_release_wq, &work->work);

As I said above, can cma_release_wq be NULL if we had -ENOMEM before?
Michal Hocko March 25, 2021, 9:45 a.m. UTC | #2
On Wed 24-03-21 17:28:28, Mike Kravetz wrote:
[...]
>  phys_addr_t cma_get_base(const struct cma *cma)
>  {
>  	return PFN_PHYS(cma->base_pfn);
> @@ -146,6 +155,10 @@ static int __init cma_init_reserved_areas(void)
>  	for (i = 0; i < cma_area_count; i++)
>  		cma_activate_area(&cma_areas[i]);
>  
> +	cma_release_wq = create_workqueue("cma_release");

Considering the workqueue is used to free up memory, it should likely be
WQ_MEM_RECLAIM to prevent long stalls when WQs are overloaded.

I cannot judge the CMA parts, but from a very quick glance this looks
reasonable.
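
(For illustration only: following this suggestion in the patch's
cma_init_reserved_areas() could look roughly like the sketch below. It uses
the non-legacy alloc_workqueue() interface and is not part of the posted
series.)

	/*
	 * Sketch: create the release workqueue with WQ_MEM_RECLAIM so that
	 * queued bitmap-clearing work keeps making progress while the
	 * system is under memory pressure.
	 */
	cma_release_wq = alloc_workqueue("cma_release", WQ_MEM_RECLAIM, 0);
	if (!cma_release_wq)
		return -ENOMEM;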
Oscar Salvador March 25, 2021, 9:54 a.m. UTC | #3
On Thu, Mar 25, 2021 at 10:45:18AM +0100, Michal Hocko wrote:
> On Wed 24-03-21 17:28:28, Mike Kravetz wrote:
> [...]
> >  phys_addr_t cma_get_base(const struct cma *cma)
> >  {
> >  	return PFN_PHYS(cma->base_pfn);
> > @@ -146,6 +155,10 @@ static int __init cma_init_reserved_areas(void)
> >  	for (i = 0; i < cma_area_count; i++)
> >  		cma_activate_area(&cma_areas[i]);
> >  
> > +	cma_release_wq = create_workqueue("cma_release");
> 
> Considering the workqueue is used to free up memory it should likely be
> WQ_MEM_RECLAIM to prevent from long stalls when WQs are overloaded.

I might be missing something, but the workqueue->func() does not seem to be
anything for memory-reclaim purposes.

The workqueue calls cma_clear_bitmap_fn(), and the only thing that does
is free one page and clear the bitmap.
I might be reading the CMA code wrongly, but I cannot see any freeing up
of memory in cma_clear_bitmap()->__bitmap_clear() (or its siblings).

The actual freeing of memory seems to happen in cma_release_nowait()
with: 

 if (count > 1)
       free_contig_range(pfn + 1, count - 1);
David Hildenbrand March 25, 2021, 9:56 a.m. UTC | #4
On 25.03.21 01:28, Mike Kravetz wrote:
> From: Roman Gushchin <guro@fb.com>
> 
> cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
> It makes it a blocking function, which complicates its usage from
> non-blocking contexts. For instance, hugetlbfs code is temporarily
> dropping the hugetlb_lock spinlock to call cma_release().
> 
> This patch introduces a non-blocking cma_release_nowait(), which
> postpones the cma bitmap clearance. It's done later from a work
> context. The first page in the cma allocation is used to store
> the work struct. Because CMA allocations and de-allocations are
> usually not that frequent, a single global workqueue is used.
> 
> To make sure that subsequent cma_alloc() call will pass, cma_alloc()
> flushes the cma_release_wq workqueue. To avoid a performance
> regression in the case when only cma_release() is used, gate it
> by a per-cma area flag, which is set by the first call
> of cma_release_nowait().
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---


1. Is there a real reason this is a mutex and not a spin lock? It seems 
to only protect the bitmap. Are bitmaps that huge that we spend a 
significant amount of time in there?

Because I also read "Because CMA allocations and de-allocations are
usually not that frequent".

With a spinlock, you would no longer be sleeping, but obviously you 
might end up waiting for the lock ;) Not sure if that would help.

2. IIUC, if we did the clearing completely lockless and used atomic 
bitmap ops instead, only cma_debug_show_areas() would see slight 
inconsistencies. As long as the setting code (-> allocation code) holds 
the lock, I think this should be fine (-> no double allocations).

(sorry if that has already been discussed)
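
(A rough sketch of what the lockless clearing in point 2 could look like,
assuming the allocation path keeps scanning the bitmap under cma->lock.
Helper names follow mm/cma.c; this is not code from the series.)

	static void cma_clear_bitmap_lockless(struct cma *cma, unsigned long pfn,
					      unsigned int count)
	{
		unsigned long bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
		unsigned long bitmap_count = cma_bitmap_pages_to_bits(cma, count);
		unsigned long i;

		/*
		 * Atomic per-bit clears instead of bitmap_clear() under the
		 * mutex; only readers such as cma_debug_show_areas() could
		 * observe a partially cleared range.
		 */
		for (i = 0; i < bitmap_count; i++)
			clear_bit(bitmap_no + i, cma->bitmap);
	}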
Michal Hocko March 25, 2021, 10:10 a.m. UTC | #5
On Thu 25-03-21 10:54:10, Oscar Salvador wrote:
> On Thu, Mar 25, 2021 at 10:45:18AM +0100, Michal Hocko wrote:
> > On Wed 24-03-21 17:28:28, Mike Kravetz wrote:
> > [...]
> > >  phys_addr_t cma_get_base(const struct cma *cma)
> > >  {
> > >  	return PFN_PHYS(cma->base_pfn);
> > > @@ -146,6 +155,10 @@ static int __init cma_init_reserved_areas(void)
> > >  	for (i = 0; i < cma_area_count; i++)
> > >  		cma_activate_area(&cma_areas[i]);
> > >  
> > > +	cma_release_wq = create_workqueue("cma_release");
> > 
> > Considering the workqueue is used to free up memory it should likely be
> > WQ_MEM_RECLAIM to prevent from long stalls when WQs are overloaded.
> 
> I might be missing something but the worqueue->func() seems to not be anything
> for memory related purposes.
> The worqueue calls cma_clear_bitmap_fn(), and the only think that does
> is freeing one page, and clearing the bitmap. 

No, it shouldn't release a single page. That is a bug which I have
overlooked as well. It should free up the whole count worth of pages.
But fundamentally the worker does all the heavy lifting and frees up the
memory.
Michal Hocko March 25, 2021, 10:11 a.m. UTC | #6
On Thu 25-03-21 10:45:19, Michal Hocko wrote:
> On Wed 24-03-21 17:28:28, Mike Kravetz wrote:
> [...]
> >  phys_addr_t cma_get_base(const struct cma *cma)
> >  {
> >  	return PFN_PHYS(cma->base_pfn);
> > @@ -146,6 +155,10 @@ static int __init cma_init_reserved_areas(void)
> >  	for (i = 0; i < cma_area_count; i++)
> >  		cma_activate_area(&cma_areas[i]);
> >  
> > +	cma_release_wq = create_workqueue("cma_release");
> 
> Considering the workqueue is used to free up memory it should likely be
> WQ_MEM_RECLAIM to prevent from long stalls when WQs are overloaded.
> 
> I cannot judge the CMA parts from a very quick glance this looks
> reasonably.

I have overlooked that
+static void cma_clear_bitmap_fn(struct work_struct *work)
+{
+       struct cma_clear_bitmap_work *w;
+
+       w = container_of(work, struct cma_clear_bitmap_work, work);
+
+       cma_clear_bitmap(w->cma, w->pfn, w->count);
+
+       __free_page(pfn_to_page(w->pfn));
+}

should be doing free_contig_range with w->count target.
David Hildenbrand March 25, 2021, 10:13 a.m. UTC | #7
On 25.03.21 11:11, Michal Hocko wrote:
> On Thu 25-03-21 10:45:19, Michal Hocko wrote:
>> On Wed 24-03-21 17:28:28, Mike Kravetz wrote:
>> [...]
>>>   phys_addr_t cma_get_base(const struct cma *cma)
>>>   {
>>>   	return PFN_PHYS(cma->base_pfn);
>>> @@ -146,6 +155,10 @@ static int __init cma_init_reserved_areas(void)
>>>   	for (i = 0; i < cma_area_count; i++)
>>>   		cma_activate_area(&cma_areas[i]);
>>>   
>>> +	cma_release_wq = create_workqueue("cma_release");
>>
>> Considering the workqueue is used to free up memory it should likely be
>> WQ_MEM_RECLAIM to prevent from long stalls when WQs are overloaded.
>>
>> I cannot judge the CMA parts from a very quick glance this looks
>> reasonably.
> 
> I have overlooked that
> +static void cma_clear_bitmap_fn(struct work_struct *work)
> +{
> +       struct cma_clear_bitmap_work *w;
> +
> +       w = container_of(work, struct cma_clear_bitmap_work, work);
> +
> +       cma_clear_bitmap(w->cma, w->pfn, w->count);
> +
> +       __free_page(pfn_to_page(w->pfn));
> +}
> 
> should be doing free_contig_range with w->count target.
> 

Then the name "cma_clear_bitmap_fn" is misleading.
Oscar Salvador March 25, 2021, 10:17 a.m. UTC | #8
On Thu, Mar 25, 2021 at 11:11:49AM +0100, Michal Hocko wrote:
> I have overlooked that
> +static void cma_clear_bitmap_fn(struct work_struct *work)
> +{
> +       struct cma_clear_bitmap_work *w;
> +
> +       w = container_of(work, struct cma_clear_bitmap_work, work);
> +
> +       cma_clear_bitmap(w->cma, w->pfn, w->count);
> +
> +       __free_page(pfn_to_page(w->pfn));
> +}
> 
> should be doing free_contig_range with w->count target.

That is currently done in cma_release_nowait().
Did you mean we should move that work into the wq?

I thought the reason for creating a WQ in the first place was to have a
non-blocking cma_release() variant, as we might block down the chain
when clearing the bitmap. I do not think this was about freeing up
memory, or at least the changelog did not mention it.

Also, IIUC, we use the first page to store the work struct (cma_release_nowait()
uses it as storage), which is why we can only free that page once the work
in cma_clear_bitmap_fn() is done.
Michal Hocko March 25, 2021, 10:22 a.m. UTC | #9
On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
> On 25.03.21 01:28, Mike Kravetz wrote:
> > From: Roman Gushchin <guro@fb.com>
> > 
> > cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
> > It makes it a blocking function, which complicates its usage from
> > non-blocking contexts. For instance, hugetlbfs code is temporarily
> > dropping the hugetlb_lock spinlock to call cma_release().
> > 
> > This patch introduces a non-blocking cma_release_nowait(), which
> > postpones the cma bitmap clearance. It's done later from a work
> > context. The first page in the cma allocation is used to store
> > the work struct. Because CMA allocations and de-allocations are
> > usually not that frequent, a single global workqueue is used.
> > 
> > To make sure that subsequent cma_alloc() call will pass, cma_alloc()
> > flushes the cma_release_wq workqueue. To avoid a performance
> > regression in the case when only cma_release() is used, gate it
> > by a per-cma area flag, which is set by the first call
> > of cma_release_nowait().
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> 
> 
> 1. Is there a real reason this is a mutex and not a spin lock? It seems to
> only protect the bitmap. Are bitmaps that huge that we spend a significant
> amount of time in there?

Good question. Looking at the code it doesn't seem that there is any
blockable operation or any heavy lifting done under the lock.
7ee793a62fa8 ("cma: Remove potential deadlock situation") introduced
the lock, and there was only simple bitmap protection back then. I suspect
the patch just followed the cma_mutex lead and used the same type of
lock. cma_mutex used to protect alloc_contig_range(), which is sleepable.

This all suggests that there is no real reason to use a sleepable lock
at all and we do not need all this heavy lifting.

Thanks!
Michal Hocko March 25, 2021, 10:24 a.m. UTC | #10
On Thu 25-03-21 11:17:32, Oscar Salvador wrote:
> On Thu, Mar 25, 2021 at 11:11:49AM +0100, Michal Hocko wrote:
> > I have overlooked that
> > +static void cma_clear_bitmap_fn(struct work_struct *work)
> > +{
> > +       struct cma_clear_bitmap_work *w;
> > +
> > +       w = container_of(work, struct cma_clear_bitmap_work, work);
> > +
> > +       cma_clear_bitmap(w->cma, w->pfn, w->count);
> > +
> > +       __free_page(pfn_to_page(w->pfn));
> > +}
> > 
> > should be doing free_contig_range with w->count target.
> 
> That is currently done in cma_release_nowait().
> You meant we should move that work there in the wq?

I have missed that part. My bad. But it seems this whole thing is moot
because we can make the lock a spinlock as pointed out by David.
Mike Kravetz March 25, 2021, 4:56 p.m. UTC | #11
On 3/25/21 3:22 AM, Michal Hocko wrote:
> On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
>> On 25.03.21 01:28, Mike Kravetz wrote:
>>> From: Roman Gushchin <guro@fb.com>
>>>
>>> cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
>>> It makes it a blocking function, which complicates its usage from
>>> non-blocking contexts. For instance, hugetlbfs code is temporarily
>>> dropping the hugetlb_lock spinlock to call cma_release().
>>>
>>> This patch introduces a non-blocking cma_release_nowait(), which
>>> postpones the cma bitmap clearance. It's done later from a work
>>> context. The first page in the cma allocation is used to store
>>> the work struct. Because CMA allocations and de-allocations are
>>> usually not that frequent, a single global workqueue is used.
>>>
>>> To make sure that subsequent cma_alloc() call will pass, cma_alloc()
>>> flushes the cma_release_wq workqueue. To avoid a performance
>>> regression in the case when only cma_release() is used, gate it
>>> by a per-cma area flag, which is set by the first call
>>> of cma_release_nowait().
>>>
>>> Signed-off-by: Roman Gushchin <guro@fb.com>
>>> [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>
>>
>> 1. Is there a real reason this is a mutex and not a spin lock? It seems to
>> only protect the bitmap. Are bitmaps that huge that we spend a significant
>> amount of time in there?
> 
> Good question. Looking at the code it doesn't seem that there is any
> blockable operation or any heavy lifting done under the lock.
> 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
> the lock and there was a simple bitmat protection back then. I suspect
> the patch just followed the cma_mutex lead and used the same type of the
> lock. cma_mutex used to protect alloc_contig_range which is sleepable.
> 
> This all suggests that there is no real reason to use a sleepable lock
> at all and we do not need all this heavy lifting.
> 

When Roman first proposed these patches, I brought up the same issue:

https://lore.kernel.org/linux-mm/20201022023352.GC300658@carbon.dhcp.thefacebook.com/

Previously, Roman proposed replacing the mutex with a spinlock but
Joonsoo was opposed.

Adding Joonsoo on Cc:
David Hildenbrand March 25, 2021, 5:15 p.m. UTC | #12
On 25.03.21 17:56, Mike Kravetz wrote:
> On 3/25/21 3:22 AM, Michal Hocko wrote:
>> On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
>>> On 25.03.21 01:28, Mike Kravetz wrote:
>>>> From: Roman Gushchin <guro@fb.com>
>>>>
>>>> cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
>>>> It makes it a blocking function, which complicates its usage from
>>>> non-blocking contexts. For instance, hugetlbfs code is temporarily
>>>> dropping the hugetlb_lock spinlock to call cma_release().
>>>>
>>>> This patch introduces a non-blocking cma_release_nowait(), which
>>>> postpones the cma bitmap clearance. It's done later from a work
>>>> context. The first page in the cma allocation is used to store
>>>> the work struct. Because CMA allocations and de-allocations are
>>>> usually not that frequent, a single global workqueue is used.
>>>>
>>>> To make sure that subsequent cma_alloc() call will pass, cma_alloc()
>>>> flushes the cma_release_wq workqueue. To avoid a performance
>>>> regression in the case when only cma_release() is used, gate it
>>>> by a per-cma area flag, which is set by the first call
>>>> of cma_release_nowait().
>>>>
>>>> Signed-off-by: Roman Gushchin <guro@fb.com>
>>>> [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>> ---
>>>
>>>
>>> 1. Is there a real reason this is a mutex and not a spin lock? It seems to
>>> only protect the bitmap. Are bitmaps that huge that we spend a significant
>>> amount of time in there?
>>
>> Good question. Looking at the code it doesn't seem that there is any
>> blockable operation or any heavy lifting done under the lock.
>> 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
>> the lock and there was a simple bitmat protection back then. I suspect
>> the patch just followed the cma_mutex lead and used the same type of the
>> lock. cma_mutex used to protect alloc_contig_range which is sleepable.
>>
>> This all suggests that there is no real reason to use a sleepable lock
>> at all and we do not need all this heavy lifting.
>>
> 
> When Roman first proposed these patches, I brought up the same issue:
> 
> https://lore.kernel.org/linux-mm/20201022023352.GC300658@carbon.dhcp.thefacebook.com/
> 
> Previously, Roman proposed replacing the mutex with a spinlock but
> Joonsoo was opposed.
> 
> Adding Joonsoo on Cc:
> 

There has to be a good reason not to. And if there is a good reason, 
lockless clearing might be one feasible alternative.
Minchan Kim March 25, 2021, 8:12 p.m. UTC | #13
On Thu, Mar 25, 2021 at 06:15:11PM +0100, David Hildenbrand wrote:
> On 25.03.21 17:56, Mike Kravetz wrote:
> > On 3/25/21 3:22 AM, Michal Hocko wrote:
> > > On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
> > > > On 25.03.21 01:28, Mike Kravetz wrote:
> > > > > From: Roman Gushchin <guro@fb.com>
> > > > > 
> > > > > cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
> > > > > It makes it a blocking function, which complicates its usage from
> > > > > non-blocking contexts. For instance, hugetlbfs code is temporarily
> > > > > dropping the hugetlb_lock spinlock to call cma_release().
> > > > > 
> > > > > This patch introduces a non-blocking cma_release_nowait(), which
> > > > > postpones the cma bitmap clearance. It's done later from a work
> > > > > context. The first page in the cma allocation is used to store
> > > > > the work struct. Because CMA allocations and de-allocations are
> > > > > usually not that frequent, a single global workqueue is used.
> > > > > 
> > > > > To make sure that subsequent cma_alloc() call will pass, cma_alloc()
> > > > > flushes the cma_release_wq workqueue. To avoid a performance
> > > > > regression in the case when only cma_release() is used, gate it
> > > > > by a per-cma area flag, which is set by the first call
> > > > > of cma_release_nowait().
> > > > > 
> > > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > > [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
> > > > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > > > > ---
> > > > 
> > > > 
> > > > 1. Is there a real reason this is a mutex and not a spin lock? It seems to
> > > > only protect the bitmap. Are bitmaps that huge that we spend a significant
> > > > amount of time in there?
> > > 
> > > Good question. Looking at the code it doesn't seem that there is any
> > > blockable operation or any heavy lifting done under the lock.
> > > 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
> > > the lock and there was a simple bitmat protection back then. I suspect
> > > the patch just followed the cma_mutex lead and used the same type of the
> > > lock. cma_mutex used to protect alloc_contig_range which is sleepable.
> > > 
> > > This all suggests that there is no real reason to use a sleepable lock
> > > at all and we do not need all this heavy lifting.
> > > 
> > 
> > When Roman first proposed these patches, I brought up the same issue:
> > 
> > https://lore.kernel.org/linux-mm/20201022023352.GC300658@carbon.dhcp.thefacebook.com/
> > 
> > Previously, Roman proposed replacing the mutex with a spinlock but
> > Joonsoo was opposed.
> > 
> > Adding Joonsoo on Cc:
> > 
> 
> There has to be a good reason not to. And if there is a good reason,
> lockless clearing might be one feasible alternative.

I also don't think the nowait variant is a good idea. If the scanning of
the bitmap is *really* significant, it might be a signal that we need to
introduce a different technique or data structure than a bitmap, rather
than a new API variant.
Roman Gushchin March 25, 2021, 11:19 p.m. UTC | #14
On Thu, Mar 25, 2021 at 01:12:51PM -0700, Minchan Kim wrote:
> On Thu, Mar 25, 2021 at 06:15:11PM +0100, David Hildenbrand wrote:
> > On 25.03.21 17:56, Mike Kravetz wrote:
> > > On 3/25/21 3:22 AM, Michal Hocko wrote:
> > > > On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
> > > > > On 25.03.21 01:28, Mike Kravetz wrote:
> > > > > > From: Roman Gushchin <guro@fb.com>
> > > > > > 
> > > > > > cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
> > > > > > It makes it a blocking function, which complicates its usage from
> > > > > > non-blocking contexts. For instance, hugetlbfs code is temporarily
> > > > > > dropping the hugetlb_lock spinlock to call cma_release().
> > > > > > 
> > > > > > This patch introduces a non-blocking cma_release_nowait(), which
> > > > > > postpones the cma bitmap clearance. It's done later from a work
> > > > > > context. The first page in the cma allocation is used to store
> > > > > > the work struct. Because CMA allocations and de-allocations are
> > > > > > usually not that frequent, a single global workqueue is used.
> > > > > > 
> > > > > > To make sure that subsequent cma_alloc() call will pass, cma_alloc()
> > > > > > flushes the cma_release_wq workqueue. To avoid a performance
> > > > > > regression in the case when only cma_release() is used, gate it
> > > > > > by a per-cma area flag, which is set by the first call
> > > > > > of cma_release_nowait().
> > > > > > 
> > > > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > > > [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
> > > > > > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > > > > > ---
> > > > > 
> > > > > 
> > > > > 1. Is there a real reason this is a mutex and not a spin lock? It seems to
> > > > > only protect the bitmap. Are bitmaps that huge that we spend a significant
> > > > > amount of time in there?
> > > > 
> > > > Good question. Looking at the code it doesn't seem that there is any
> > > > blockable operation or any heavy lifting done under the lock.
> > > > 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
> > > > the lock and there was a simple bitmat protection back then. I suspect
> > > > the patch just followed the cma_mutex lead and used the same type of the
> > > > lock. cma_mutex used to protect alloc_contig_range which is sleepable.
> > > > 
> > > > This all suggests that there is no real reason to use a sleepable lock
> > > > at all and we do not need all this heavy lifting.
> > > > 
> > > 
> > > When Roman first proposed these patches, I brought up the same issue:
> > > 
> > > https://lore.kernel.org/linux-mm/20201022023352.GC300658@carbon.dhcp.thefacebook.com/
> > > 
> > > Previously, Roman proposed replacing the mutex with a spinlock but
> > > Joonsoo was opposed.
> > > 
> > > Adding Joonsoo on Cc:
> > > 
> > 
> > There has to be a good reason not to. And if there is a good reason,
> > lockless clearing might be one feasible alternative.
> 
> I also don't think nowait variant is good idea. If the scanning of
> bitmap is *really* significant, it might be signal that we need to
> introduce different technique or data structure not bitmap rather
> than a new API variant.

I'd also prefer to just replace the mutex with a spinlock rather than fiddling
with a delayed release.

Thanks!
Mike Kravetz March 25, 2021, 11:49 p.m. UTC | #15
On 3/25/21 4:19 PM, Roman Gushchin wrote:
> On Thu, Mar 25, 2021 at 01:12:51PM -0700, Minchan Kim wrote:
>> On Thu, Mar 25, 2021 at 06:15:11PM +0100, David Hildenbrand wrote:
>>> On 25.03.21 17:56, Mike Kravetz wrote:
>>>> On 3/25/21 3:22 AM, Michal Hocko wrote:
>>>>> On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
>>>>>> On 25.03.21 01:28, Mike Kravetz wrote:
>>>>>>> From: Roman Gushchin <guro@fb.com>
>>>>>>>
>>>>>>> cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
>>>>>>> It makes it a blocking function, which complicates its usage from
>>>>>>> non-blocking contexts. For instance, hugetlbfs code is temporarily
>>>>>>> dropping the hugetlb_lock spinlock to call cma_release().
>>>>>>>
>>>>>>> This patch introduces a non-blocking cma_release_nowait(), which
>>>>>>> postpones the cma bitmap clearance. It's done later from a work
>>>>>>> context. The first page in the cma allocation is used to store
>>>>>>> the work struct. Because CMA allocations and de-allocations are
>>>>>>> usually not that frequent, a single global workqueue is used.
>>>>>>>
>>>>>>> To make sure that subsequent cma_alloc() call will pass, cma_alloc()
>>>>>>> flushes the cma_release_wq workqueue. To avoid a performance
>>>>>>> regression in the case when only cma_release() is used, gate it
>>>>>>> by a per-cma area flag, which is set by the first call
>>>>>>> of cma_release_nowait().
>>>>>>>
>>>>>>> Signed-off-by: Roman Gushchin <guro@fb.com>
>>>>>>> [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
>>>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>>>> ---
>>>>>>
>>>>>>
>>>>>> 1. Is there a real reason this is a mutex and not a spin lock? It seems to
>>>>>> only protect the bitmap. Are bitmaps that huge that we spend a significant
>>>>>> amount of time in there?
>>>>>
>>>>> Good question. Looking at the code it doesn't seem that there is any
>>>>> blockable operation or any heavy lifting done under the lock.
>>>>> 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
>>>>> the lock and there was a simple bitmat protection back then. I suspect
>>>>> the patch just followed the cma_mutex lead and used the same type of the
>>>>> lock. cma_mutex used to protect alloc_contig_range which is sleepable.
>>>>>
>>>>> This all suggests that there is no real reason to use a sleepable lock
>>>>> at all and we do not need all this heavy lifting.
>>>>>
>>>>
>>>> When Roman first proposed these patches, I brought up the same issue:
>>>>
>>>> https://lore.kernel.org/linux-mm/20201022023352.GC300658@carbon.dhcp.thefacebook.com/
>>>>
>>>> Previously, Roman proposed replacing the mutex with a spinlock but
>>>> Joonsoo was opposed.
>>>>
>>>> Adding Joonsoo on Cc:
>>>>
>>>
>>> There has to be a good reason not to. And if there is a good reason,
>>> lockless clearing might be one feasible alternative.
>>
>> I also don't think nowait variant is good idea. If the scanning of
>> bitmap is *really* significant, it might be signal that we need to
>> introduce different technique or data structure not bitmap rather
>> than a new API variant.
> 
> I'd also prefer to just replace the mutex with a spinlock rather than fiddling
> with a delayed release.
> 

I hope Joonsoo or someone else brings up specific concerns.  I do not
know enough about all CMA use cases.  Certainly, in this specific use
case converting to a spinlock would not be an issue.  Do note that we
would want to convert to an irq safe spinlock and disable irqs if that
makes any difference in the discussion.
Mike Kravetz March 26, 2021, 9:32 p.m. UTC | #16
On 3/25/21 4:49 PM, Mike Kravetz wrote:
> On 3/25/21 4:19 PM, Roman Gushchin wrote:
>> On Thu, Mar 25, 2021 at 01:12:51PM -0700, Minchan Kim wrote:
>>> On Thu, Mar 25, 2021 at 06:15:11PM +0100, David Hildenbrand wrote:
>>>> On 25.03.21 17:56, Mike Kravetz wrote:
>>>>> On 3/25/21 3:22 AM, Michal Hocko wrote:
>>>>>> On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
>>>>>>> On 25.03.21 01:28, Mike Kravetz wrote:
>>>>>>>> From: Roman Gushchin <guro@fb.com>
>>>>>>>>
>>>>>>>> cma_release() has to lock the cma_lock mutex to clear the cma bitmap.
>>>>>>>> It makes it a blocking function, which complicates its usage from
>>>>>>>> non-blocking contexts. For instance, hugetlbfs code is temporarily
>>>>>>>> dropping the hugetlb_lock spinlock to call cma_release().
>>>>>>>>
>>>>>>>> This patch introduces a non-blocking cma_release_nowait(), which
>>>>>>>> postpones the cma bitmap clearance. It's done later from a work
>>>>>>>> context. The first page in the cma allocation is used to store
>>>>>>>> the work struct. Because CMA allocations and de-allocations are
>>>>>>>> usually not that frequent, a single global workqueue is used.
>>>>>>>>
>>>>>>>> To make sure that subsequent cma_alloc() call will pass, cma_alloc()
>>>>>>>> flushes the cma_release_wq workqueue. To avoid a performance
>>>>>>>> regression in the case when only cma_release() is used, gate it
>>>>>>>> by a per-cma area flag, which is set by the first call
>>>>>>>> of cma_release_nowait().
>>>>>>>>
>>>>>>>> Signed-off-by: Roman Gushchin <guro@fb.com>
>>>>>>>> [mike.kravetz@oracle.com: rebased to v5.12-rc3-mmotm-2021-03-17-22-24]
>>>>>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>>>>>> ---
>>>>>>>
>>>>>>>
>>>>>>> 1. Is there a real reason this is a mutex and not a spin lock? It seems to
>>>>>>> only protect the bitmap. Are bitmaps that huge that we spend a significant
>>>>>>> amount of time in there?
>>>>>>
>>>>>> Good question. Looking at the code it doesn't seem that there is any
>>>>>> blockable operation or any heavy lifting done under the lock.
>>>>>> 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
>>>>>> the lock and there was a simple bitmat protection back then. I suspect
>>>>>> the patch just followed the cma_mutex lead and used the same type of the
>>>>>> lock. cma_mutex used to protect alloc_contig_range which is sleepable.
>>>>>>
>>>>>> This all suggests that there is no real reason to use a sleepable lock
>>>>>> at all and we do not need all this heavy lifting.
>>>>>>
>>>>>
>>>>> When Roman first proposed these patches, I brought up the same issue:
>>>>>
>>>>> https://lore.kernel.org/linux-mm/20201022023352.GC300658@carbon.dhcp.thefacebook.com/
>>>>>
>>>>> Previously, Roman proposed replacing the mutex with a spinlock but
>>>>> Joonsoo was opposed.
>>>>>
>>>>> Adding Joonsoo on Cc:
>>>>>
>>>>
>>>> There has to be a good reason not to. And if there is a good reason,
>>>> lockless clearing might be one feasible alternative.
>>>
>>> I also don't think nowait variant is good idea. If the scanning of
>>> bitmap is *really* significant, it might be signal that we need to
>>> introduce different technique or data structure not bitmap rather
>>> than a new API variant.
>>
>> I'd also prefer to just replace the mutex with a spinlock rather than fiddling
>> with a delayed release.
>>
> 
> I hope Joonsoo or someone else brings up specific concerns.  I do not
> know enough about all CMA use cases.  Certainly, in this specific use
> case converting to a spinlock would not be an issue.  Do note that we
> would want to convert to an irq safe spinlock and disable irqs if that
> makes any difference in the discussion.
> 

Suggestions on how to move forward would be appreciated.  I can think of
the following options.

- Use the cma_release_nowait() routine as defined in this patch.

- Just change the mutex to an irq safe spinlock.  AFAICT, the potential
  downsides could be:
  - Interrupts disabled during long bitmap scans
  - Wasted cpu cycles (spinning) if there is much contention on the lock
  Both of these would be more of an issue on small/embedded systems. I
  took a quick look at the callers of cma_alloc/cma_release and nothing
  stood out that could lead to high degrees of contention.  However, I
  could have missed something.

- Another idea I had was to allow the user to specify the locking type
  when creating a cma area.  In this way, cma areas which may have
  release calls from atomic context would be set up with an irq safe
  spinlock.  Others would use the mutex.  I admit this is a hackish
  way to address the issue, but perhaps not much worse than the separate
  cma_release_nowait interface?

- Change the CMA bitmap to some other data structure and algorithm.
  This would obviously take more work.

Thanks,
Michal Hocko March 29, 2021, 7:46 a.m. UTC | #17
On Fri 26-03-21 14:32:01, Mike Kravetz wrote:
[...]
> - Just change the mutex to an irq safe spinlock.

Yes please.

>   AFAICT, the potential
>   downsides could be:
>   - Interrupts disabled during long bitmap scans

How large are those bitmaps in practice?

>   - Wasted cpu cycles (spinning) if there is much contention on lock
>   Both of these would be more of an issue on small/embedded systems. I
>   took a quick look at the callers of cma_alloc/cma_release and nothing
>   stood out that could lead to high degrees of contention.  However, I
>   could have missed something.

If this is really a practical concern then we can try a more complex
solution based on some data.
Mike Kravetz March 29, 2021, 10:27 p.m. UTC | #18
On 3/29/21 12:46 AM, Michal Hocko wrote:
> On Fri 26-03-21 14:32:01, Mike Kravetz wrote:
> [...]
>> - Just change the mutex to an irq safe spinlock.
> 
> Yes please.
> 
>>   AFAICT, the potential
>>   downsides could be:
>>   - Interrupts disabled during long bitmap scans
> 
> How large those bitmaps are in practice?
> 
>>   - Wasted cpu cycles (spinning) if there is much contention on lock
>>   Both of these would be more of an issue on small/embedded systems. I
>>   took a quick look at the callers of cma_alloc/cma_release and nothing
>>   stood out that could lead to high degrees of contention.  However, I
>>   could have missed something.
> 
> If this is really a practical concern then we can try a more complex
> solution based on some data.
> 

Ok, I will send v2 with this approach.  Adding Barry and Will on Cc: as
they were involved in adding more cma use cases for dma on arm.
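
(For context, the direction agreed on above would look roughly like the
sketch below: turn cma->lock into an irq-safe spinlock so that releases can
be done from atomic context. This is an illustration only; the actual v2
posting is authoritative.)

	/* mm/cma.h: cma->lock becomes a spinlock (was a struct mutex) */
	struct cma {
		unsigned long	base_pfn;
		unsigned long	count;
		unsigned long	*bitmap;
		unsigned int	order_per_bit;	/* Order of pages represented by one bit */
		spinlock_t	lock;
		/* ... */
	};

	/* mm/cma.c: bitmap updates now run with interrupts disabled */
	static void cma_clear_bitmap(struct cma *cma, unsigned long pfn,
				     unsigned int count)
	{
		unsigned long bitmap_no, bitmap_count;
		unsigned long flags;

		bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
		bitmap_count = cma_bitmap_pages_to_bits(cma, count);

		spin_lock_irqsave(&cma->lock, flags);
		bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
		spin_unlock_irqrestore(&cma->lock, flags);
	}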

Patch

diff --git a/include/linux/cma.h b/include/linux/cma.h
index 217999c8a762..497eca478c2f 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -47,6 +47,8 @@  extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
 extern struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 			      bool no_warn);
 extern bool cma_release(struct cma *cma, const struct page *pages, unsigned int count);
+extern bool cma_release_nowait(struct cma *cma, const struct page *pages,
+			       unsigned int count);
 
 extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
 #endif
diff --git a/mm/cma.c b/mm/cma.c
index 90e27458ddb7..14cc8e901703 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -36,9 +36,18 @@ 
 
 #include "cma.h"
 
+struct cma_clear_bitmap_work {
+	struct work_struct work;
+	struct cma *cma;
+	unsigned long pfn;
+	unsigned int count;
+};
+
 struct cma cma_areas[MAX_CMA_AREAS];
 unsigned cma_area_count;
 
+struct workqueue_struct *cma_release_wq;
+
 phys_addr_t cma_get_base(const struct cma *cma)
 {
 	return PFN_PHYS(cma->base_pfn);
@@ -146,6 +155,10 @@  static int __init cma_init_reserved_areas(void)
 	for (i = 0; i < cma_area_count; i++)
 		cma_activate_area(&cma_areas[i]);
 
+	cma_release_wq = create_workqueue("cma_release");
+	if (!cma_release_wq)
+		return -ENOMEM;
+
 	return 0;
 }
 core_initcall(cma_init_reserved_areas);
@@ -203,6 +216,7 @@  int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
 
 	cma->base_pfn = PFN_DOWN(base);
 	cma->count = size >> PAGE_SHIFT;
+	cma->flags = 0;
 	cma->order_per_bit = order_per_bit;
 	*res_cma = cma;
 	cma_area_count++;
@@ -452,6 +466,14 @@  struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 		goto out;
 
 	for (;;) {
+		/*
+		 * If the CMA bitmap is cleared asynchronously after
+		 * cma_release_nowait(), cma release workqueue has to be
+		 * flushed here in order to make the allocation succeed.
+		 */
+		if (test_bit(CMA_DELAYED_RELEASE, &cma->flags))
+			flush_workqueue(cma_release_wq);
+
 		mutex_lock(&cma->lock);
 		bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
 				bitmap_maxno, start, bitmap_count, mask,
@@ -552,6 +574,77 @@  bool cma_release(struct cma *cma, const struct page *pages, unsigned int count)
 	return true;
 }
 
+static void cma_clear_bitmap_fn(struct work_struct *work)
+{
+	struct cma_clear_bitmap_work *w;
+
+	w = container_of(work, struct cma_clear_bitmap_work, work);
+
+	cma_clear_bitmap(w->cma, w->pfn, w->count);
+
+	__free_page(pfn_to_page(w->pfn));
+}
+
+/**
+ * cma_release_nowait() - release allocated pages without blocking
+ * @cma:   Contiguous memory region for which the allocation is performed.
+ * @pages: Allocated pages.
+ * @count: Number of allocated pages.
+ *
+ * Similar to cma_release(), this function releases memory allocated
+ * by cma_alloc(), but unlike cma_release() is non-blocking and can be
+ * called from an atomic context.
+ * It returns false when provided pages do not belong to contiguous area
+ * and true otherwise.
+ */
+bool cma_release_nowait(struct cma *cma, const struct page *pages,
+			unsigned int count)
+{
+	struct cma_clear_bitmap_work *work;
+	unsigned long pfn;
+
+	if (!cma || !pages)
+		return false;
+
+	pr_debug("%s(page %p)\n", __func__, (void *)pages);
+
+	pfn = page_to_pfn(pages);
+
+	if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
+		return false;
+
+	VM_BUG_ON(pfn + count > cma->base_pfn + cma->count);
+
+	/*
+	 * Set CMA_DELAYED_RELEASE flag: subsequent cma_alloc()'s
+	 * will wait for the async part of cma_release_nowait() to
+	 * finish.
+	 */
+	if (unlikely(!test_bit(CMA_DELAYED_RELEASE, &cma->flags)))
+		set_bit(CMA_DELAYED_RELEASE, &cma->flags);
+
+	/*
+	 * To make cma_release_nowait() non-blocking, cma bitmap is cleared
+	 * from a work context (see cma_clear_bitmap_fn()). The first page
+	 * in the cma allocation is used to store the work structure,
+	 * so it's released after the cma bitmap clearance. Other pages
+	 * are released immediately as previously.
+	 */
+	if (count > 1)
+		free_contig_range(pfn + 1, count - 1);
+
+	work = (struct cma_clear_bitmap_work *)page_to_virt(pages);
+	INIT_WORK(&work->work, cma_clear_bitmap_fn);
+	work->cma = cma;
+	work->pfn = pfn;
+	work->count = count;
+	queue_work(cma_release_wq, &work->work);
+
+	trace_cma_release(pfn, pages, count);
+
+	return true;
+}
+
 int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data)
 {
 	int i;
diff --git a/mm/cma.h b/mm/cma.h
index 95d1aa2d808a..2063fb5bc985 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -17,6 +17,7 @@  struct cma_stat {
 struct cma {
 	unsigned long   base_pfn;
 	unsigned long   count;
+	unsigned long   flags;
 	unsigned long   *bitmap;
 	unsigned int order_per_bit; /* Order of pages represented by one bit */
 	struct mutex    lock;
@@ -31,6 +32,10 @@  struct cma {
 #endif
 };
 
+enum cma_flags {
+	CMA_DELAYED_RELEASE, /* cma bitmap is cleared asynchronously */
+};
+
 extern struct cma cma_areas[MAX_CMA_AREAS];
 extern unsigned cma_area_count;