[RFC,0/7] Support high-order page bulk allocation

Message ID 20200814173131.2803002-1-minchan@kernel.org (mailing list archive)

Message

Minchan Kim Aug. 14, 2020, 5:31 p.m. UTC
Some special HW requires bulk allocation of high-order pages:
for example, 4800 * order-4 pages.

To meet the requirement, one option is a CMA area, because the
page allocator with compaction easily fails to meet the request
under memory pressure and is too slow to repeat 4800 times.
However, CMA also has the following drawback:

 * 4800 order-4 cma_alloc calls are too slow

To avoid the slowness, we could try to allocate 300M of contiguous
memory at once and then split it into order-4 chunks.
The problem with this approach is that the CMA allocation fails if
one of the pages in that range cannot migrate out, which happens
easily with fs writes under memory pressure.

To solve these issues, this patch series introduces alloc_pages_bulk:

  int alloc_pages_bulk(unsigned long start, unsigned long end,
                       unsigned int migratetype, gfp_t gfp_mask,
                       unsigned int order, unsigned int nr_elem,
                       struct page **pages);

It scans the range [start, end) and migrates movable pages out of
it on a best-effort basis (implemented by upcoming patches) to
create free pages of the requested order.

The allocated pages are returned via the pages parameter.
The return value is the number of requested-order pages obtained;
it may be less than the nr_elem the user requested.

/**
 * alloc_pages_bulk() - tries to allocate high-order pages
 * in batch from the given range [start, end)
 * @start:      start PFN to allocate
 * @end:        one-past-the-last PFN to allocate
 * @migratetype:        migratetype of the underlying pageblocks (either
 *                      #MIGRATE_MOVABLE or #MIGRATE_CMA).  All pageblocks
 *                      in range must have the same migratetype and it must
 *                      be either of the two.
 * @gfp_mask:   GFP mask to use during compaction
 * @order:      page order requested
 * @nr_elem:    the number of high-order pages to allocate
 * @pages:      page array pointer to store allocated pages (must
 *              have space for at least nr_elem elements)
 *
 * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
 * aligned.  The PFN range must belong to a single zone.
 *
 * Return: the number of pages allocated on success or a negative error code.
 * The allocated pages should be freed using __free_pages().
 */
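
For illustration, a caller working against a driver-private CMA area
could look roughly like the sketch below. This is not part of the
patchset: my_driver_prepare, chunks and NR_ELEM are made-up names,
while cma_get_base()/cma_get_size() are the existing CMA helpers.

    #include <linux/cma.h>
    #include <linux/gfp.h>

    #define NR_ELEM 4800

    static struct page *chunks[NR_ELEM];

    /* Sketch: bulk-allocate order-4 pages from a driver-private CMA area. */
    static int my_driver_prepare(struct cma *cma)
    {
            unsigned long start = cma_get_base(cma) >> PAGE_SHIFT;
            unsigned long end = start + (cma_get_size(cma) >> PAGE_SHIFT);
            int got;

            got = alloc_pages_bulk(start, end, MIGRATE_CMA, GFP_KERNEL,
                                   4, NR_ELEM, chunks);
            if (got < 0)
                    return got;

            /* got may be less than NR_ELEM; the caller can retry on
             * another range or settle for fewer pages.
             */
            return got;
    }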

The test performs 4800 order-4 allocations (i.e., 300MB total) under a
kernel build workload. System RAM size is 1.5GB and the CMA area is 500M.

Using CMA to allocate the 300M, all 10 of 10 trials failed, with high
latency (up to several seconds).

With this alloc_pages_bulk API, 7 of 10 trials allocated all 4800
pages; the remaining 3 allocated 4799, 4789 and 4799 pages. All trials
completed within 300ms.

This patchset is against next-20200813.

Minchan Kim (7):
  mm: page_owner: split page by order
  mm: introduce split_page_by_order
  mm: compaction: deal with upcoming high-order page splitting
  mm: factor __alloc_contig_range out
  mm: introduce alloc_pages_bulk API
  mm: make alloc_pages_bulk best effort
  mm/page_isolation: avoid drain_all_pages for alloc_pages_bulk

 include/linux/gfp.h            |   5 +
 include/linux/mm.h             |   2 +
 include/linux/page-isolation.h |   1 +
 include/linux/page_owner.h     |  10 +-
 mm/compaction.c                |  64 +++++++----
 mm/huge_memory.c               |   2 +-
 mm/internal.h                  |   5 +-
 mm/page_alloc.c                | 198 ++++++++++++++++++++++++++-------
 mm/page_isolation.c            |  10 +-
 mm/page_owner.c                |   7 +-
 10 files changed, 230 insertions(+), 74 deletions(-)

Comments

Matthew Wilcox (Oracle) Aug. 14, 2020, 5:40 p.m. UTC | #1
On Fri, Aug 14, 2020 at 10:31:24AM -0700, Minchan Kim wrote:
> Some special HW requires bulk allocation of high-order pages:
> for example, 4800 * order-4 pages.

... but you haven't shown that user.

>   int alloc_pages_bulk(unsigned long start, unsigned long end,
>                        unsigned int migratetype, gfp_t gfp_mask,
>                        unsigned int order, unsigned int nr_elem,
>                        struct page **pages);
> 
> It scans the range [start, end) and migrates movable pages out of
> it on a best-effort basis (implemented by upcoming patches) to
> create free pages of the requested order.
> 
> The allocated pages are returned via the pages parameter.
> The return value is the number of requested-order pages obtained;
> it may be less than the nr_elem the user requested.

I don't understand why a user would need to know the PFNs to allocate
between.  This seems like something that's usually specified by GFP_DMA32
or similar.

Is it useful to return fewer pages than requested?
Minchan Kim Aug. 14, 2020, 8:55 p.m. UTC | #2
On Fri, Aug 14, 2020 at 06:40:20PM +0100, Matthew Wilcox wrote:
> On Fri, Aug 14, 2020 at 10:31:24AM -0700, Minchan Kim wrote:
> > Some special HW requires bulk allocation of high-order pages:
> > for example, 4800 * order-4 pages.
> 
> ... but you haven't shown that user.

Kyoungho is working on it.
I am not sure how much he can share, but hopefully he can
show some of it. Kyoungho?

> 
> >   int alloc_pages_bulk(unsigned long start, unsigned long end,
> >                        unsigned int migratetype, gfp_t gfp_mask,
> >                        unsigned int order, unsigned int nr_elem,
> >                        struct page **pages);
> > 
> > It scans the range [start, end) and migrates movable pages out of
> > it on a best-effort basis (implemented by upcoming patches) to
> > create free pages of the requested order.
> > 
> > The allocated pages are returned via the pages parameter.
> > The return value is the number of requested-order pages obtained;
> > it may be less than the nr_elem the user requested.
> 
> I don't understand why a user would need to know the PFNs to allocate
> between.  This seems like something that's usually specified by GFP_DMA32
> or similar.

I wanted to let the API work on a CMA area and/or the movable zone,
which are always filled with migratable pages.
If we carried only zone flags without a pfn range, it couldn't cover
the CMA area cases.
Another reason is that if the user sees fewer pages returned, he could
try subsequent ranges to get the remaining ones.

> Is it useful to return fewer pages than requested?

It's useful because the user can ask for more than they need, or retry.
David Hildenbrand Aug. 16, 2020, 12:31 p.m. UTC | #3
On 14.08.20 19:31, Minchan Kim wrote:
> Some special HW requires bulk allocation of high-order pages:
> for example, 4800 * order-4 pages.
> 
> To meet the requirement, one option is a CMA area, because the
> page allocator with compaction easily fails to meet the request
> under memory pressure and is too slow to repeat 4800 times.
> However, CMA also has the following drawback:
> 
>  * 4800 order-4 cma_alloc calls are too slow
> 
> To avoid the slowness, we could try to allocate 300M of contiguous
> memory at once and then split it into order-4 chunks.
> The problem with this approach is that the CMA allocation fails if
> one of the pages in that range cannot migrate out, which happens
> easily with fs writes under memory pressure.

Why not choose a value in between? Like trying to allocate MAX_ORDER - 1
chunks and splitting them. That would already heavily reduce the call frequency.

I don't see a real need for a completely new range allocator function
for this special case yet.

> 
> To solve these issues, this patch series introduces alloc_pages_bulk:
> 
>   int alloc_pages_bulk(unsigned long start, unsigned long end,
>                        unsigned int migratetype, gfp_t gfp_mask,
>                        unsigned int order, unsigned int nr_elem,
>                        struct page **pages);
> 
> It scans the range [start, end) and migrates movable pages out of
> it on a best-effort basis (implemented by upcoming patches) to
> create free pages of the requested order.
> 
> The allocated pages are returned via the pages parameter.
> The return value is the number of requested-order pages obtained;
> it may be less than the nr_elem the user requested.
> 
> /**
>  * alloc_pages_bulk() - tries to allocate high-order pages
>  * in batch from the given range [start, end)
>  * @start:      start PFN to allocate
>  * @end:        one-past-the-last PFN to allocate
>  * @migratetype:        migratetype of the underlying pageblocks (either
>  *                      #MIGRATE_MOVABLE or #MIGRATE_CMA).  All pageblocks
>  *                      in range must have the same migratetype and it must
>  *                      be either of the two.
>  * @gfp_mask:   GFP mask to use during compaction
>  * @order:      page order requested
>  * @nr_elem:    the number of high-order pages to allocate
>  * @pages:      page array pointer to store allocated pages (must
>  *              have space for at least nr_elem elements)
>  *
>  * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
>  * aligned.  The PFN range must belong to a single zone.
>  *
>  * Return: the number of pages allocated on success or a negative error code.
>  * The allocated pages should be freed using __free_pages().
>  */
> 
> The test performs 4800 order-4 allocations (i.e., 300MB total) under a
> kernel build workload. System RAM size is 1.5GB and the CMA area is 500M.
> 
> Using CMA to allocate the 300M, all 10 of 10 trials failed, with high
> latency (up to several seconds).
> 
> With this alloc_pages_bulk API, 7 of 10 trials allocated all 4800
> pages; the remaining 3 allocated 4799, 4789 and 4799 pages. All trials
> completed within 300ms.
> 
> This patchset is against next-20200813.
> 
> [patch list and diffstat snipped]
Minchan Kim Aug. 17, 2020, 3:27 p.m. UTC | #4
On Sun, Aug 16, 2020 at 02:31:22PM +0200, David Hildenbrand wrote:
> On 14.08.20 19:31, Minchan Kim wrote:
> > Some special HW requires bulk allocation of high-order pages:
> > for example, 4800 * order-4 pages.
> > 
> > To meet the requirement, one option is a CMA area, because the
> > page allocator with compaction easily fails to meet the request
> > under memory pressure and is too slow to repeat 4800 times.
> > However, CMA also has the following drawback:
> > 
> >  * 4800 order-4 cma_alloc calls are too slow
> > 
> > To avoid the slowness, we could try to allocate 300M of contiguous
> > memory at once and then split it into order-4 chunks.
> > The problem with this approach is that the CMA allocation fails if
> > one of the pages in that range cannot migrate out, which happens
> > easily with fs writes under memory pressure.
> 
> Why not choose a value in between? Like trying to allocate MAX_ORDER - 1
> chunks and splitting them. That would already heavily reduce the call frequency.

I think you meant this:

    alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER - 1)

It would work if the system had lots of non-fragmented free memory.
However, once memory is fragmented, it doesn't work. That's why we have
easily seen even order-4 allocation failures in the field, and that's
why CMA was there.

CMA has more logic to isolate the memory during allocation/freeing as
well as fragmentation avoidance, so that the area has less chance of
being stolen by others, which increases the success ratio. That's why I
want this API to be used with CMA or the movable zone.

A usecase is a device setting up an exclusive CMA area when the system
boots. When the device needs 4800 * order-4 pages, it can call this
bulk API against the area so that the allocation is effectively
guaranteed to succeed and to be fast enough.

> 
> I don't see a real need for a completely new range allocator function
> for this special case yet.
> 
> > 
> > To solve these issues, this patch series introduces alloc_pages_bulk:
> > 
> >   int alloc_pages_bulk(unsigned long start, unsigned long end,
> >                        unsigned int migratetype, gfp_t gfp_mask,
> >                        unsigned int order, unsigned int nr_elem,
> >                        struct page **pages);
> > 
> > It scans the range [start, end) and migrates movable pages out of
> > it on a best-effort basis (implemented by upcoming patches) to
> > create free pages of the requested order.
> > 
> > The allocated pages are returned via the pages parameter.
> > The return value is the number of requested-order pages obtained;
> > it may be less than the nr_elem the user requested.
> > 
> > /**
> >  * alloc_pages_bulk() - tries to allocate high-order pages
> >  * in batch from the given range [start, end)
> >  * @start:      start PFN to allocate
> >  * @end:        one-past-the-last PFN to allocate
> >  * @migratetype:        migratetype of the underlying pageblocks (either
> >  *                      #MIGRATE_MOVABLE or #MIGRATE_CMA).  All pageblocks
> >  *                      in range must have the same migratetype and it must
> >  *                      be either of the two.
> >  * @gfp_mask:   GFP mask to use during compaction
> >  * @order:      page order requested
> >  * @nr_elem:    the number of high-order pages to allocate
> >  * @pages:      page array pointer to store allocated pages (must
> >  *              have space for at least nr_elem elements)
> >  *
> >  * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
> >  * aligned.  The PFN range must belong to a single zone.
> >  *
> >  * Return: the number of pages allocated on success or a negative error code.
> >  * The allocated pages should be freed using __free_pages().
> >  */
> > 
> > The test performs 4800 order-4 allocations (i.e., 300MB total) under a
> > kernel build workload. System RAM size is 1.5GB and the CMA area is 500M.
> > 
> > Using CMA to allocate the 300M, all 10 of 10 trials failed, with high
> > latency (up to several seconds).
> > 
> > With this alloc_pages_bulk API, 7 of 10 trials allocated all 4800
> > pages; the remaining 3 allocated 4799, 4789 and 4799 pages. All trials
> > completed within 300ms.
> > 
> > This patchset is against next-20200813.
> > 
> > [patch list and diffstat snipped]
David Hildenbrand Aug. 17, 2020, 3:45 p.m. UTC | #5
On 17.08.20 17:27, Minchan Kim wrote:
> On Sun, Aug 16, 2020 at 02:31:22PM +0200, David Hildenbrand wrote:
>> On 14.08.20 19:31, Minchan Kim wrote:
>>> Some special HW requires bulk allocation of high-order pages:
>>> for example, 4800 * order-4 pages.
>>>
>>> To meet the requirement, one option is a CMA area, because the
>>> page allocator with compaction easily fails to meet the request
>>> under memory pressure and is too slow to repeat 4800 times.
>>> However, CMA also has the following drawback:
>>>
>>>  * 4800 order-4 cma_alloc calls are too slow
>>>
>>> To avoid the slowness, we could try to allocate 300M of contiguous
>>> memory at once and then split it into order-4 chunks.
>>> The problem with this approach is that the CMA allocation fails if
>>> one of the pages in that range cannot migrate out, which happens
>>> easily with fs writes under memory pressure.
>>
>> Why not choose a value in between? Like trying to allocate MAX_ORDER - 1
>> chunks and splitting them. That would already heavily reduce the call frequency.
> 
> I think you meant this:
> 
>     alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER - 1)
> 
> It would work if the system had lots of non-fragmented free memory.
> However, once memory is fragmented, it doesn't work. That's why we have
> easily seen even order-4 allocation failures in the field, and that's
> why CMA was there.
> 
> CMA has more logic to isolate the memory during allocation/freeing as
> well as fragmentation avoidance, so that the area has less chance of
> being stolen by others, which increases the success ratio. That's why I
> want this API to be used with CMA or the movable zone.

I was talking about doing MAX_ORDER - 1 CMA allocations instead of one
big 300M allocation. As you correctly note, memory placed into CMA
should be movable, except for (short/long) term pinnings. In these
cases, doing allocations smaller than 300M and splitting them up should
be good enough to reduce the call frequency, no?
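
Untested sketch of what I mean (all names made up; the caller would
also have to remember each chunk base for cma_release()):

    /* Allocate MAX_ORDER - 1 chunks from CMA and carve them into
     * order-4 units; each chunk is physically contiguous, so every
     * 16-page group inside it is a valid order-4 unit.
     */
    static unsigned int bulk_from_cma(struct cma *cma, struct page **pages,
                                      unsigned int nr_order4)
    {
            const unsigned int chunk = 1U << (MAX_ORDER - 1); /* in pages */
            unsigned int done = 0, i;

            while (done < nr_order4) {
                    struct page *base;

                    base = cma_alloc(cma, chunk, MAX_ORDER - 1, true);
                    if (!base)
                            break;  /* caller decides whether to retry */

                    for (i = 0; i < chunk / 16 && done < nr_order4; i++)
                            pages[done++] = base + i * 16;
            }
            return done;
    }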

> 
> A usecase is a device setting up an exclusive CMA area when the system
> boots. When the device needs 4800 * order-4 pages, it can call this
> bulk API against the area so that the allocation is effectively
> guaranteed to succeed and to be fast enough.

Just wondering

a) Why does it have to be fast?
b) Why does it need that many order-4 pages?
c) How dynamic is the device need at runtime?
d) Would it be reasonable in your setup to mark a CMA region in a way
such that it will never be used for other (movable) allocations,
guaranteeing that you can immediately allocate it? Something like,
reserving a region during boot you know you'll immediately need later
completely for a device?
Minchan Kim Aug. 17, 2020, 4:30 p.m. UTC | #6
On Mon, Aug 17, 2020 at 05:45:59PM +0200, David Hildenbrand wrote:
> On 17.08.20 17:27, Minchan Kim wrote:
> > On Sun, Aug 16, 2020 at 02:31:22PM +0200, David Hildenbrand wrote:
> >> On 14.08.20 19:31, Minchan Kim wrote:
> >>> Some special HW requires bulk allocation of high-order pages:
> >>> for example, 4800 * order-4 pages.
> >>>
> >>> To meet the requirement, one option is a CMA area, because the
> >>> page allocator with compaction easily fails to meet the request
> >>> under memory pressure and is too slow to repeat 4800 times.
> >>> However, CMA also has the following drawback:
> >>>
> >>>  * 4800 order-4 cma_alloc calls are too slow
> >>>
> >>> To avoid the slowness, we could try to allocate 300M of contiguous
> >>> memory at once and then split it into order-4 chunks.
> >>> The problem with this approach is that the CMA allocation fails if
> >>> one of the pages in that range cannot migrate out, which happens
> >>> easily with fs writes under memory pressure.
> >>
> >> Why not choose a value in between? Like trying to allocate MAX_ORDER - 1
> >> chunks and splitting them. That would already heavily reduce the call frequency.
> > 
> > I think you meant this:
> > 
> >     alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER - 1)
> > 
> > It would work if the system had lots of non-fragmented free memory.
> > However, once memory is fragmented, it doesn't work. That's why we have
> > easily seen even order-4 allocation failures in the field, and that's
> > why CMA was there.
> > 
> > CMA has more logic to isolate the memory during allocation/freeing as
> > well as fragmentation avoidance, so that the area has less chance of
> > being stolen by others, which increases the success ratio. That's why I
> > want this API to be used with CMA or the movable zone.
> 
> I was talking about doing MAX_ORDER - 1 CMA allocations instead of one
> big 300M allocation. As you correctly note, memory placed into CMA
> should be movable, except for (short/long) term pinnings. In these
> cases, doing allocations smaller than 300M and splitting them up should
> be good enough to reduce the call frequency, no?

I should have written that. The 300M I mentioned is really the minimum
size. In some scenarios, we need way bigger than 300M, up to several GB.
Furthermore, the demand will increase in the near future.

> 
> > 
> > A usecase is a device setting up an exclusive CMA area when the system
> > boots. When the device needs 4800 * order-4 pages, it can call this
> > bulk API against the area so that the allocation is effectively
> > guaranteed to succeed and to be fast enough.
> 
> Just wondering
> 
> a) Why does it have to be fast?

That's because it's related to application latency, which ends up
making the user experience bad.

> b) Why does it need that many order-4 pages?

It's a HW requirement. I can't say much about that.

> c) How dynamic is the device need at runtime?

Whenever the application is launched. It depends on the user's usage pattern.

> d) Would it be reasonable in your setup to mark a CMA region in a way
> such that it will never be used for other (movable) allocations,

I don't get your point. If we don't want the area to be used up for
other movable allocations, why should we use it as CMA in the first place?
It sounds like reserved memory that just wastes the memory.

Please clarify if I misunderstood your suggestion.

> guaranteeing that you can immediately allocate it? Something like,
> reserving a region during boot you know you'll immediately need later
> completely for a device?


David Hildenbrand Aug. 17, 2020, 4:44 p.m. UTC | #7
On 17.08.20 18:30, Minchan Kim wrote:
> On Mon, Aug 17, 2020 at 05:45:59PM +0200, David Hildenbrand wrote:
>> On 17.08.20 17:27, Minchan Kim wrote:
>>> On Sun, Aug 16, 2020 at 02:31:22PM +0200, David Hildenbrand wrote:
>>>> On 14.08.20 19:31, Minchan Kim wrote:
>>>>> Some special HW requires bulk allocation of high-order pages:
>>>>> for example, 4800 * order-4 pages.
>>>>>
>>>>> To meet the requirement, one option is a CMA area, because the
>>>>> page allocator with compaction easily fails to meet the request
>>>>> under memory pressure and is too slow to repeat 4800 times.
>>>>> However, CMA also has the following drawback:
>>>>>
>>>>>  * 4800 order-4 cma_alloc calls are too slow
>>>>>
>>>>> To avoid the slowness, we could try to allocate 300M of contiguous
>>>>> memory at once and then split it into order-4 chunks.
>>>>> The problem with this approach is that the CMA allocation fails if
>>>>> one of the pages in that range cannot migrate out, which happens
>>>>> easily with fs writes under memory pressure.
>>>>
>>>> Why not choose a value in between? Like trying to allocate MAX_ORDER - 1
>>>> chunks and splitting them. That would already heavily reduce the call frequency.
>>>
>>> I think you meant this:
>>>
>>>     alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER - 1)
>>>
>>> It would work if the system had lots of non-fragmented free memory.
>>> However, once memory is fragmented, it doesn't work. That's why we have
>>> easily seen even order-4 allocation failures in the field, and that's
>>> why CMA was there.
>>>
>>> CMA has more logic to isolate the memory during allocation/freeing as
>>> well as fragmentation avoidance, so that the area has less chance of
>>> being stolen by others, which increases the success ratio. That's why I
>>> want this API to be used with CMA or the movable zone.
>>
>> I was talking about doing MAX_ORDER - 1 CMA allocations instead of one
>> big 300M allocation. As you correctly note, memory placed into CMA
>> should be movable, except for (short/long) term pinnings. In these
>> cases, doing allocations smaller than 300M and splitting them up should
>> be good enough to reduce the call frequency, no?
> 
> I should have written that. The 300M I mentioned is really the minimum
> size. In some scenarios, we need way bigger than 300M, up to several GB.
> Furthermore, the demand will increase in the near future.

And what will the driver do with that data besides providing it to the
device? Can it be mapped to user space? I think we really need more
information / the actual user.

>>
>>>
>>> A usecase is a device setting up an exclusive CMA area when the system
>>> boots. When the device needs 4800 * order-4 pages, it can call this
>>> bulk API against the area so that the allocation is effectively
>>> guaranteed to succeed and to be fast enough.
>>
>> Just wondering
>>
>> a) Why does it have to be fast?
> 
> That's because it's related to application latency, which ends up
> making the user experience bad.

Okay, but in theory, your device-needs are very similar to
application-needs, besides you requiring order-4 pages, correct? Similar
to an application that starts up and pins 300M (or more), just with
order-4 pages.

I don't get quite yet why you need a range allocator for that. Because
you intend to use CMA?

> 
>> b) Why does it need that many order-4 pages?
> 
> It's a HW requirement. I can't say much about that.

Hm.

> 
>> c) How dynamic is the device need at runtime?
> 
> Whenever the application is launched. It depends on the user's usage pattern.
> 
>> d) Would it be reasonable in your setup to mark a CMA region in a way
>> such that it will never be used for other (movable) allocations,
> 
> I don't get your point. If we don't want the area to be used up for
> other movable allocations, why should we use it as CMA in the first place?
> It sounds like reserved memory that just wastes the memory.

Right, it's just very hard to get what you are trying to achieve without
the actual user at hand.

For example, will the pages you allocate be movable? Does the device
allow for that? If not, then the MOVABLE zone is usually not valid
(similar to gigantic pages not being allocated from the MOVABLE zone).
So you're stuck with the NORMAL zone or CMA. Especially for the NORMAL
zone, alloc_contig_range() is currently not prepared to properly handle
sub-MAX_ORDER - 1 ranges. If any involved pageblock contains an
unmovable page, the allocation will fail (see pageblock isolation /
has_unmovable_pages()). So CMA would be your only option.
David Hildenbrand Aug. 17, 2020, 5:03 p.m. UTC | #8
>>>> A usecase is a device setting up an exclusive CMA area when the system
>>>> boots. When the device needs 4800 * order-4 pages, it can call this
>>>> bulk API against the area so that the allocation is effectively
>>>> guaranteed to succeed and to be fast enough.
>>>
>>> Just wondering
>>>
>>> a) Why does it have to be fast?
>>
>> That's because it's related to application latency, which ends up
>> making the user experience bad.
> 
> Okay, but in theory, your device-needs are very similar to
> application-needs, besides you requiring order-4 pages, correct? Similar
> to an application that starts up and pins 300M (or more), just with
> order-4 pages.

Pinning was probably misleading.

I meant either actual pinning, like vfio pins all pages backing a VM
e.g., in QEMU, or mlocking+populating all memory.
Minchan Kim Aug. 17, 2020, 11:34 p.m. UTC | #9
On Mon, Aug 17, 2020 at 06:44:50PM +0200, David Hildenbrand wrote:
> On 17.08.20 18:30, Minchan Kim wrote:
> > On Mon, Aug 17, 2020 at 05:45:59PM +0200, David Hildenbrand wrote:
> >> On 17.08.20 17:27, Minchan Kim wrote:
> >>> On Sun, Aug 16, 2020 at 02:31:22PM +0200, David Hildenbrand wrote:
> >>>> On 14.08.20 19:31, Minchan Kim wrote:
> >>>>> Some special HW requires bulk allocation of high-order pages:
> >>>>> for example, 4800 * order-4 pages.
> >>>>>
> >>>>> To meet the requirement, one option is a CMA area, because the
> >>>>> page allocator with compaction easily fails to meet the request
> >>>>> under memory pressure and is too slow to repeat 4800 times.
> >>>>> However, CMA also has the following drawback:
> >>>>>
> >>>>>  * 4800 order-4 cma_alloc calls are too slow
> >>>>>
> >>>>> To avoid the slowness, we could try to allocate 300M of contiguous
> >>>>> memory at once and then split it into order-4 chunks.
> >>>>> The problem with this approach is that the CMA allocation fails if
> >>>>> one of the pages in that range cannot migrate out, which happens
> >>>>> easily with fs writes under memory pressure.
> >>>>
> >>>> Why not choose a value in between? Like trying to allocate MAX_ORDER - 1
> >>>> chunks and splitting them. That would already heavily reduce the call frequency.
> >>>
> >>> I think you meant this:
> >>>
> >>>     alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER - 1)
> >>>
> >>> It would work if the system had lots of non-fragmented free memory.
> >>> However, once memory is fragmented, it doesn't work. That's why we have
> >>> easily seen even order-4 allocation failures in the field, and that's
> >>> why CMA was there.
> >>>
> >>> CMA has more logic to isolate the memory during allocation/freeing as
> >>> well as fragmentation avoidance, so that the area has less chance of
> >>> being stolen by others, which increases the success ratio. That's why I
> >>> want this API to be used with CMA or the movable zone.
> >>
> >> I was talking about doing MAX_ORDER - 1 CMA allocations instead of one
> >> big 300M allocation. As you correctly note, memory placed into CMA
> >> should be movable, except for (short/long) term pinnings. In these
> >> cases, doing allocations smaller than 300M and splitting them up should
> >> be good enough to reduce the call frequency, no?
> > 
> > I should have written that. The 300M I mentioned is really the minimum
> > size. In some scenarios, we need way bigger than 300M, up to several GB.
> > Furthermore, the demand will increase in the near future.
> 
> And what will the driver do with that data besides providing it to the
> device? Can it be mapped to user space? I think we really need more
> information / the actual user.
> 
> >>
> >>>
> >>> A usecase is a device setting up an exclusive CMA area when the system
> >>> boots. When the device needs 4800 * order-4 pages, it can call this
> >>> bulk API against the area so that the allocation is effectively
> >>> guaranteed to succeed and to be fast enough.
> >>
> >> Just wondering
> >>
> >> a) Why does it have to be fast?
> > 
> > That's because it's related to application latency, which ends up
> > making the user experience bad.
> 
> Okay, but in theory, your device-needs are very similar to
> application-needs, besides you requiring order-4 pages, correct? Similar
> to an application that starts up and pins 300M (or more), just with
> order-4 pages.

Yes.

> 
> I don't get quite yet why you need a range allocator for that. Because
> you intend to use CMA?

Yes, with CMA it could be better guaranteed and fast enough with a
little tweaking. Currently, CMA is too slow due to the IPI overheads
below:

1. set_migratetype_isolate does drain_all_pages for every pageblock.
2. __alloc_contig_migrate_range does migrate_prep.
3. alloc_contig_range does lru_add_drain_all.

Thus, if we increase the call frequency as you suggest, the setup
overhead also scales up with the size. Such overhead makes sense when
the caller requests big contiguous memory, but it's too much for
normal high-order allocations.

Maybe we could optimize those call sites to reduce or remove the
frequency of those IPI calls in a smarter way, but that would have to
deal with the success-ratio-versus-speed trade-off in the end.

Another concern with using the existing CMA API is that it tries to
make the allocation succeed at the cost of latency, for example by
waiting for page writeback.

That's where this new-semantics API comes in as a compromise, since I
believe we need some way to separate the original CMA alloc (biased
toward being guaranteed but slower) from this new API (biased toward
being fast but less guaranteed).

Is there any idea that works without tweaking the existing CMA API?

> 
> > 
> >> b) Why does it need that many order-4 pages?
> > 
> > It's a HW requirement. I can't say much about that.
> 
> Hm.
> 
> > 
> >> c) How dynamic is the device need at runtime?
> > 
> > Whenever the application is launched. It depends on the user's usage pattern.
> > 
> >> d) Would it be reasonable in your setup to mark a CMA region in a way
> >> such that it will never be used for other (movable) allocations,
> > 
> > I don't get your point. If we don't want the area to be used up for
> > other movable allocations, why should we use it as CMA in the first place?
> > It sounds like reserved memory that just wastes the memory.
> 
> Right, it's just very hard to get what you are trying to achieve without
> the actual user at hand.
> 
> For example, will the pages you allocate be movable? Does the device
> allow for that? If not, then the MOVABLE zone is usually not valid
> (similar to gigantic pages not being allocated from the MOVABLE zone).
> So you're stuck with the NORMAL zone or CMA. Especially for the NORMAL
> zone, alloc_contig_range() is currently not prepared to properly handle
> sub-MAX_ORDER - 1 ranges. If any involved pageblock contains an
> unmovable page, the allocation will fail (see pageblock isolation /
> has_unmovable_pages()). So CMA would be your only option.

Those pages are not migratable, so I agree that CMA would be the only option here.
Cho KyongHo Aug. 18, 2020, 2:16 a.m. UTC | #10
On Fri, Aug 14, 2020 at 01:55:58PM -0700, Minchan Kim wrote:
> On Fri, Aug 14, 2020 at 06:40:20PM +0100, Matthew Wilcox wrote:
> > On Fri, Aug 14, 2020 at 10:31:24AM -0700, Minchan Kim wrote:
> > > Some special HW requires bulk allocation of high-order pages:
> > > for example, 4800 * order-4 pages.
> > 
> > ... but you haven't shown that user.
> 
> Kyoungho is working on it.
> I am not sure how much he can share, but hopefully he can
> show some of it. Kyoungho?

We are preparing the following patch to the dma-buf heap that uses
alloc_pages_bulk(). The heap collects pages of identical size from
alloc_pages_bulk() for H/W that has memory alignment restrictions
for performance or functional reasons.

> > 
> > >   int alloc_pages_bulk(unsigned long start, unsigned long end,
> > >                        unsigned int migratetype, gfp_t gfp_mask,
> > >                        unsigned int order, unsigned int nr_elem,
> > >                        struct page **pages);
> > > 
> > > It scans the range [start, end) and migrates movable pages out of
> > > it on a best-effort basis (implemented by upcoming patches) to
> > > create free pages of the requested order.
> > > 
> > > The allocated pages are returned via the pages parameter.
> > > The return value is the number of requested-order pages obtained;
> > > it may be less than the nr_elem the user requested.
> > 
> > I don't understand why a user would need to know the PFNs to allocate
> > between.  This seems like something that's usually specified by GFP_DMA32
> > or similar.
> 
> I wanted to let the API work on a CMA area and/or the movable zone,
> which are always filled with migratable pages.
> If we carried only zone flags without a pfn range, it couldn't cover
> the CMA area cases.
> Another reason is that if the user sees fewer pages returned, he could
> try subsequent ranges to get the remaining ones.
> 
> > Is it useful to return fewer pages than requested?
> 
> It's useful because the user can ask for more than they need, or retry.

I agree with Minchan. A CMA area is private to a device or a device
driver, except for the shared DMA pool. A driver first tries collecting
large-order pages from its private CMA area. If the number of pages
returned by alloc_pages_bulk() is less than nr_elem, the driver can
retry with a smaller nr_elem after a while, expecting that pinned pages
or pages under races in the CMA area have been resolved.
The driver may further try calling alloc_pages_bulk() on another CMA
area available to the driver, or on the movable zone.
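
In simplified pseudo-C, the retry policy described above would look
roughly like this (not the actual heap code; RETRY_DELAY_MS and
MAX_TRIES are placeholders):

    unsigned int remaining = nr_elem;
    int tries = MAX_TRIES;

    while (remaining && tries--) {
            int got = alloc_pages_bulk(start, end, MIGRATE_CMA,
                                       GFP_KERNEL, order, remaining,
                                       pages + (nr_elem - remaining));
            if (got < 0)
                    break;

            remaining -= got;
            if (remaining)  /* let pinned pages and races resolve */
                    msleep(RETRY_DELAY_MS);
    }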
Nicholas Piggin Aug. 18, 2020, 7:42 a.m. UTC | #11
Excerpts from Minchan Kim's message of August 18, 2020 9:34 am:
> On Mon, Aug 17, 2020 at 06:44:50PM +0200, David Hildenbrand wrote:
>> On 17.08.20 18:30, Minchan Kim wrote:
>> > On Mon, Aug 17, 2020 at 05:45:59PM +0200, David Hildenbrand wrote:
>> >> On 17.08.20 17:27, Minchan Kim wrote:
>> >>> On Sun, Aug 16, 2020 at 02:31:22PM +0200, David Hildenbrand wrote:
>> >>>> On 14.08.20 19:31, Minchan Kim wrote:
>> >>>>> Some special HW requires bulk allocation of high-order pages:
>> >>>>> for example, 4800 * order-4 pages.
>> >>>>>
>> >>>>> To meet the requirement, one option is a CMA area, because the
>> >>>>> page allocator with compaction easily fails to meet the request
>> >>>>> under memory pressure and is too slow to repeat 4800 times.
>> >>>>> However, CMA also has the following drawback:
>> >>>>>
>> >>>>>  * 4800 order-4 cma_alloc calls are too slow
>> >>>>>
>> >>>>> To avoid the slowness, we could try to allocate 300M of contiguous
>> >>>>> memory at once and then split it into order-4 chunks.
>> >>>>> The problem with this approach is that the CMA allocation fails if
>> >>>>> one of the pages in that range cannot migrate out, which happens
>> >>>>> easily with fs writes under memory pressure.
>> >>>>
>> >>>> Why not choose a value in between? Like trying to allocate MAX_ORDER - 1
>> >>>> chunks and splitting them. That would already heavily reduce the call frequency.
>> >>>
>> >>> I think you meant this:
>> >>>
>> >>>     alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER - 1)
>> >>>
>> >>> It would work if the system had lots of non-fragmented free memory.
>> >>> However, once memory is fragmented, it doesn't work. That's why we have
>> >>> easily seen even order-4 allocation failures in the field, and that's
>> >>> why CMA was there.
>> >>>
>> >>> CMA has more logic to isolate the memory during allocation/freeing as
>> >>> well as fragmentation avoidance, so that the area has less chance of
>> >>> being stolen by others, which increases the success ratio. That's why I
>> >>> want this API to be used with CMA or the movable zone.
>> >>
>> >> I was talking about doing MAX_ORDER - 1 CMA allocations instead of one
>> >> big 300M allocation. As you correctly note, memory placed into CMA
>> >> should be movable, except for (short/long) term pinnings. In these
>> >> cases, doing allocations smaller than 300M and splitting them up should
>> >> be good enough to reduce the call frequency, no?
>> > 
>> > I should have written that. The 300M I mentioned is really the minimum
>> > size. In some scenarios, we need way bigger than 300M, up to several GB.
>> > Furthermore, the demand will increase in the near future.
>> 
>> And what will the driver do with that data besides providing it to the
>> device? Can it be mapped to user space? I think we really need more
>> information / the actual user.
>> 
>> >>
>> >>>
>> >>> A usecase is a device setting up an exclusive CMA area when the system
>> >>> boots. When the device needs 4800 * order-4 pages, it can call this
>> >>> bulk API against the area so that the allocation is effectively
>> >>> guaranteed to succeed and to be fast enough.
>> >>
>> >> Just wondering
>> >>
>> >> a) Why does it have to be fast?
>> > 
>> > That's because it's related to application latency, which ends up
>> > making the user experience bad.
>> 
>> Okay, but in theory, your device-needs are very similar to
>> application-needs, besides you requiring order-4 pages, correct? Similar
>> to an application that starts up and pins 300M (or more), just with
>> order-4 pages.
> 
> Yes.

Linux has never seriously catered for broken devices that require
large contiguous physical ranges to perform well.

The problem with doing this is it allows hardware designers to get
progressively lazier and foist more of their work onto us, and then
we'd be stuck with it.

I think you need to provide a lot better justification than this, and
probably should just solve it with some hack like allocating larger
pages or pre-allocating some of that CMA space before the user opens
the device, or require application to use hugetlbfs.

Thanks,
Nick
David Hildenbrand Aug. 18, 2020, 7:49 a.m. UTC | #12
On 18.08.20 01:34, Minchan Kim wrote:
> On Mon, Aug 17, 2020 at 06:44:50PM +0200, David Hildenbrand wrote:
>> On 17.08.20 18:30, Minchan Kim wrote:
>>> On Mon, Aug 17, 2020 at 05:45:59PM +0200, David Hildenbrand wrote:
>>>> On 17.08.20 17:27, Minchan Kim wrote:
>>>>> On Sun, Aug 16, 2020 at 02:31:22PM +0200, David Hildenbrand wrote:
>>>>>> On 14.08.20 19:31, Minchan Kim wrote:
>>>>>>> Some special HW requires bulk allocation of high-order pages:
>>>>>>> for example, 4800 * order-4 pages.
>>>>>>>
>>>>>>> To meet the requirement, one option is a CMA area, because the
>>>>>>> page allocator with compaction easily fails to meet the request
>>>>>>> under memory pressure and is too slow to repeat 4800 times.
>>>>>>> However, CMA also has the following drawback:
>>>>>>>
>>>>>>>  * 4800 order-4 cma_alloc calls are too slow
>>>>>>>
>>>>>>> To avoid the slowness, we could try to allocate 300M of contiguous
>>>>>>> memory at once and then split it into order-4 chunks.
>>>>>>> The problem with this approach is that the CMA allocation fails if
>>>>>>> one of the pages in that range cannot migrate out, which happens
>>>>>>> easily with fs writes under memory pressure.
>>>>>>
>>>>>> Why not choose a value in between? Like trying to allocate MAX_ORDER - 1
>>>>>> chunks and splitting them. That would already heavily reduce the call frequency.
>>>>>
>>>>> I think you meant this:
>>>>>
>>>>>     alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER - 1)
>>>>>
>>>>> It would work if the system had lots of non-fragmented free memory.
>>>>> However, once memory is fragmented, it doesn't work. That's why we have
>>>>> easily seen even order-4 allocation failures in the field, and that's
>>>>> why CMA was there.
>>>>>
>>>>> CMA has more logic to isolate the memory during allocation/freeing as
>>>>> well as fragmentation avoidance, so that the area has less chance of
>>>>> being stolen by others, which increases the success ratio. That's why I
>>>>> want this API to be used with CMA or the movable zone.
>>>>
>>>> I was talking about doing MAX_ORDER - 1 CMA allocations instead of one
>>>> big 300M allocation. As you correctly note, memory placed into CMA
>>>> should be movable, except for (short/long) term pinnings. In these
>>>> cases, doing allocations smaller than 300M and splitting them up should
>>>> be good enough to reduce the call frequency, no?
>>>
>>> I should have written that. The 300M I mentioned is really the minimum
>>> size. In some scenarios, we need way bigger than 300M, up to several GB.
>>> Furthermore, the demand will increase in the near future.
>>
>> And what will the driver do with that data besides providing it to the
>> device? Can it be mapped to user space? I think we really need more
>> information / the actual user.
>>
>>>>
>>>>>
>>>>> A usecase is a device setting up an exclusive CMA area when the system
>>>>> boots. When the device needs 4800 * order-4 pages, it can call this
>>>>> bulk API against the area so that the allocation is effectively
>>>>> guaranteed to succeed and to be fast enough.
>>>>
>>>> Just wondering
>>>>
>>>> a) Why does it have to be fast?
>>>
>>> That's because it's related to application latency, which ends up
>>> making the user experience bad.
>>
>> Okay, but in theory, your device-needs are very similar to
>> application-needs, besides you requiring order-4 pages, correct? Similar
>> to an application that starts up and pins 300M (or more), just with
>> order-4 pages.
> 
> Yes.
> 
>>
>> I don't get quite yet why you need a range allocator for that. Because
>> you intend to use CMA?
> 
> Yes, with CMA it could be better guaranteed and fast enough with a
> little tweaking. Currently, CMA is too slow due to the IPI overheads
> below:
> 
> 1. set_migratetype_isolate does drain_all_pages for every pageblock.
> 2. __alloc_contig_migrate_range does migrate_prep.
> 3. alloc_contig_range does lru_add_drain_all.
> 
> Thus, if we increase the call frequency as you suggest, the setup
> overhead also scales up with the size. Such overhead makes sense when
> the caller requests big contiguous memory, but it's too much for
> normal high-order allocations.
> 
> Maybe we could optimize those call sites to reduce or remove the
> frequency of those IPI calls in a smarter way, but that would have to
> deal with the success-ratio-versus-speed trade-off in the end.
> 
> Another concern with using the existing CMA API is that it tries to
> make the allocation succeed at the cost of latency, for example by
> waiting for page writeback.
> 
> That's where this new-semantics API comes in as a compromise, since I
> believe we need some way to separate the original CMA alloc (biased
> toward being guaranteed but slower) from this new API (biased toward
> being fast but less guaranteed).
> 
> Is there any idea that works without tweaking the existing CMA API?

Let me try to summarize:

1. Your driver needs a lot of order-4 pages. And it needs them fast,
because of observable lag/delay in an application. The pages will be
unmovable by the driver.

2. Your idea is to use CMA, as that avoids unmovable allocations,
theoretically allowing you to allocate all memory. But you don't
actually want a large contiguous memory area.

3. Doing a whole bunch of order-4 cma allocations is slow.

4. Doing a single large cma allocation and splitting it manually in the
caller can fail easily due to temporary page pinnings.


Regarding 4., [1] comes to mind, which has the same issues with
temporary page pinnings and solves it by simply retrying. Yeah, there
will be some lag, but maybe it's overall faster than doing separate
order-4 cma allocations?

In general, proactive compaction [2] comes to mind, does that help?

[1]
https://lore.kernel.org/r/1596682582-29139-2-git-send-email-cgoldswo@codeaurora.org/
[2] https://nitingupta.dev/post/proactive-compaction/
Cho KyongHo Aug. 18, 2020, 9:22 a.m. UTC | #13
On Fri, Aug 14, 2020 at 01:55:58PM -0700, Minchan Kim wrote:
> On Fri, Aug 14, 2020 at 06:40:20PM +0100, Matthew Wilcox wrote:
> > On Fri, Aug 14, 2020 at 10:31:24AM -0700, Minchan Kim wrote:
> > > Some special HW requires bulk allocation of high-order pages:
> > > for example, 4800 * order-4 pages.
> > 
> > ... but you haven't shown that user.
> 
> Kyoungho is working on it.
> I am not sure how much he can share, but hopefully he can
> show some of it. Kyoungho?
> 
Hyesoo posted a patch series that uses alloc_pages_bulk() in a dma-heap;
please take a look at:
https://lore.kernel.org/linux-mm/20200818080415.7531-1-hyesoo.yu@samsung.com/

The patch series introduces a new type of dma-heap, the chunk heap,
which is initialized by a device tree node. The chunk heap's device
tree node must also have a phandle to a reserved-memory node with the
"reusable" property.

> > 
> > >   int alloc_pages_bulk(unsigned long start, unsigned long end,
> > >                        unsigned int migratetype, gfp_t gfp_mask,
> > >                        unsigned int order, unsigned int nr_elem,
> > >                        struct page **pages);
> > > 
> > > It scans the range [start, end) and migrates movable pages out of
> > > it on a best-effort basis (implemented by upcoming patches) to
> > > create free pages of the requested order.
> > > 
> > > The allocated pages are returned via the pages parameter.
> > > The return value is the number of requested-order pages obtained;
> > > it may be less than the nr_elem the user requested.
> > 
> > I don't understand why a user would need to know the PFNs to allocate
> > between.  This seems like something that's usually specified by GFP_DMA32
> > or similar.
> 
> I wanted to let the API work on a CMA area and/or the movable zone,
> which are always filled with migratable pages.
> If we carried only zone flags without a pfn range, it couldn't cover
> the CMA area cases.
> Another reason is that if the user sees fewer pages returned, he could
> try subsequent ranges to get the remaining ones.
> 
> > Is it useful to return fewer pages than requested?
> 
> It's useful because the user can ask for more than they need, or retry.
>
Minchan Kim Aug. 18, 2020, 3:15 p.m. UTC | #14
On Tue, Aug 18, 2020 at 09:49:24AM +0200, David Hildenbrand wrote:
> On 18.08.20 01:34, Minchan Kim wrote:
> > On Mon, Aug 17, 2020 at 06:44:50PM +0200, David Hildenbrand wrote:
> >> On 17.08.20 18:30, Minchan Kim wrote:
> >>> On Mon, Aug 17, 2020 at 05:45:59PM +0200, David Hildenbrand wrote:
> >>>> On 17.08.20 17:27, Minchan Kim wrote:
> >>>>> On Sun, Aug 16, 2020 at 02:31:22PM +0200, David Hildenbrand wrote:
> >>>>>> On 14.08.20 19:31, Minchan Kim wrote:
> >>>>>>> Some special HW requires bulk allocation of high-order pages:
> >>>>>>> for example, 4800 * order-4 pages.
> >>>>>>>
> >>>>>>> To meet the requirement, one option is a CMA area, because the
> >>>>>>> page allocator with compaction easily fails to meet the request
> >>>>>>> under memory pressure and is too slow to repeat 4800 times.
> >>>>>>> However, CMA also has the following drawback:
> >>>>>>>
> >>>>>>>  * 4800 order-4 cma_alloc calls are too slow
> >>>>>>>
> >>>>>>> To avoid the slowness, we could try to allocate 300M of contiguous
> >>>>>>> memory at once and then split it into order-4 chunks.
> >>>>>>> The problem with this approach is that the CMA allocation fails if
> >>>>>>> one of the pages in that range cannot migrate out, which happens
> >>>>>>> easily with fs writes under memory pressure.
> >>>>>>
> >>>>>> Why not choose a value in between? Like trying to allocate MAX_ORDER - 1
> >>>>>> chunks and splitting them. That would already heavily reduce the call frequency.
> >>>>>
> >>>>> I think you meant this:
> >>>>>
> >>>>>     alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER - 1)
> >>>>>
> >>>>> It would work if the system had lots of non-fragmented free memory.
> >>>>> However, once memory is fragmented, it doesn't work. That's why we have
> >>>>> easily seen even order-4 allocation failures in the field, and that's
> >>>>> why CMA was there.
> >>>>>
> >>>>> CMA has more logic to isolate the memory during allocation/freeing as
> >>>>> well as fragmentation avoidance, so that the area has less chance of
> >>>>> being stolen by others, which increases the success ratio. That's why I
> >>>>> want this API to be used with CMA or the movable zone.
> >>>>
> >>>> I was talking about doing MAX_ORDER - 1 CMA allocations instead of one
> >>>> big 300M allocation. As you correctly note, memory placed into CMA
> >>>> should be movable, except for (short/long) term pinnings. In these
> >>>> cases, doing allocations smaller than 300M and splitting them up should
> >>>> be good enough to reduce the call frequency, no?
> >>>
> >>> I should have written that. The 300M I mentioned is really the minimum
> >>> size. In some scenarios, we need way bigger than 300M, up to several GB.
> >>> Furthermore, the demand will increase in the near future.
> >>
> >> And what will the driver do with that data besides providing it to the
> >> device? Can it be mapped to user space? I think we really need more
> >> information / the actual user.
> >>
> >>>>
> >>>>>
> >>>>> A usecase is a device setting up an exclusive CMA area when the system
> >>>>> boots. When the device needs 4800 * order-4 pages, it can call this
> >>>>> bulk API against the area so that the allocation is effectively
> >>>>> guaranteed to succeed and to be fast enough.
> >>>>
> >>>> Just wondering
> >>>>
> >>>> a) Why does it have to be fast?
> >>>
> >>> That's because it's related to application latency, which ends up
> >>> making the user experience bad.
> >>
> >> Okay, but in theory, your device-needs are very similar to
> >> application-needs, besides you requiring order-4 pages, correct? Similar
> >> to an application that starts up and pins 300M (or more), just with
> >> order-4 pages.
> > 
> > Yes.
> > 
> >>
> >> I don't get quite yet why you need a range allocator for that. Because
> >> you intend to use CMA?
> > 
> > Yes, with CMA it could be better guaranteed and fast enough with a
> > little tweaking. Currently, CMA is too slow due to the IPI overheads
> > below:
> > 
> > 1. set_migratetype_isolate does drain_all_pages for every pageblock.
> > 2. __alloc_contig_migrate_range does migrate_prep.
> > 3. alloc_contig_range does lru_add_drain_all.
> > 
> > Thus, if we increase the call frequency as you suggest, the setup
> > overhead also scales up with the size. Such overhead makes sense when
> > the caller requests big contiguous memory, but it's too much for
> > normal high-order allocations.
> > 
> > Maybe we could optimize those call sites to reduce or remove the
> > frequency of those IPI calls in a smarter way, but that would have to
> > deal with the success-ratio-versus-speed trade-off in the end.
> > 
> > Another concern with using the existing CMA API is that it tries to
> > make the allocation succeed at the cost of latency, for example by
> > waiting for page writeback.
> > 
> > That's where this new-semantics API comes in as a compromise, since I
> > believe we need some way to separate the original CMA alloc (biased
> > toward being guaranteed but slower) from this new API (biased toward
> > being fast but less guaranteed).
> > 
> > Is there any idea that works without tweaking the existing CMA API?
> 
> Let me try to summarize:
> 
> 1. Your driver needs a lot of order-4 pages. And it needs them fast,
> because of observable lag/delay in an application. The pages will be
> unmovable by the driver.
> 
> 2. Your idea is to use CMA, as that avoids unmovable allocations,
> theoretically allowing you to allocate all memory. But you don't
> actually want a large contiguous memory area.
> 
> 3. Doing a whole bunch of order-4 cma allocations is slow.
> 
> 4. Doing a single large cma allocation and splitting it manually in the
> caller can fail easily due to temporary page pinnings.
> 
> 
> Regarding 4., [1] comes to mind, which has the same issues with
> temporary page pinnings and solves it by simply retrying. Yeah, there
> will be some lag, but maybe it's overall faster than doing separate
> order-4 cma allocations?

Thanks for the pointer. However, that is not the only cause of CMA
failure. Historically, there are various potential problems that turn
a "temporary" pin into a "non-temporary" one, like page writeback or
indirect dependencies between objects.

> 
> In general, proactive compaction [2] comes to mind, does that help?

I think it makes sense if such high-order allocations are dominant in
the system workload, because the TLB benefit would outweigh the cost of
the frequent migration overhead. However, that's not our usecase.

> 
> [1]
> https://lore.kernel.org/r/1596682582-29139-2-git-send-email-cgoldswo@codeaurora.org/
> [2] https://nitingupta.dev/post/proactive-compaction/
> 

I understand the pfn stuff in the API is not pretty, but the concept
makes sense to me: go through the *migratable area* and get as many
requested-order pages as possible with hard effort. It looks like a
GFP_NORETRY version of kmem_cache_alloc_bulk.

How about this?

    int cma_alloc(struct cma *cma, int order, unsigned int nr_elem, struct page **pages);
Matthew Wilcox (Oracle) Aug. 18, 2020, 3:58 p.m. UTC | #15
On Tue, Aug 18, 2020 at 08:15:43AM -0700, Minchan Kim wrote:
> I understand the pfn stuff in the API is not pretty, but the concept
> makes sense to me: go through the *migratable area* and get as many
> requested-order pages as possible with hard effort. It looks like a
> GFP_NORETRY version of kmem_cache_alloc_bulk.
> 
> How about this?
> 
>     int cma_alloc(struct cma *cma, int order, unsigned int nr_elem, struct page **pages);

I think that makes a lot more sense as an API.  Although I think you want

int cma_bulk_alloc(struct cma *cma, unsigned order, unsigned nr_elem,
		struct page **pages);
David Hildenbrand Aug. 18, 2020, 4:22 p.m. UTC | #16
On 18.08.20 17:58, Matthew Wilcox wrote:
> On Tue, Aug 18, 2020 at 08:15:43AM -0700, Minchan Kim wrote:
>> I understand the pfn stuff in the API is not pretty, but the concept
>> makes sense to me: go through the *migratable area* and get as many
>> requested-order pages as possible with hard effort. It looks like a
>> GFP_NORETRY version of kmem_cache_alloc_bulk.
>>
>> How about this?
>>
>>     int cma_alloc(struct cma *cma, int order, unsigned int nr_elem, struct page **pages);
> 
> I think that makes a lot more sense as an API.  Although I think you want
> 
> int cma_bulk_alloc(struct cma *cma, unsigned order, unsigned nr_elem,
> 		struct page **pages);
> 

Right, and I would start with a very simple implementation that does not
mess with (that is, modify) alloc_contig_range().

I'd then much rather want to see simple tweaks to alloc_contig_range()
to improve the situation. E.g., some kind of "fail fast" flag that lets
the caller specify to skip some draining (or do it manually in cma
before a bulk allocation) and rather fail fast than really trying to
allocate the range whatever it costs.

There are multiple optimizations you can play with then (start with big
granularity and split, move to smaller granularity on demand, etc., all
nicely wrapped in cma_bulk_alloc()).
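
E.g., a very rough, untested sketch without any alloc_contig_range()
tweaks, just to illustrate the structure:

    /* Try MAX_ORDER - 1 chunks first; fall back to smaller
     * granularity whenever a big allocation fails.
     */
    int cma_bulk_alloc(struct cma *cma, unsigned int order,
                       unsigned int nr_elem, struct page **pages)
    {
            unsigned int chunk_order = MAX_ORDER - 1;
            unsigned int done = 0, i;

            while (done < nr_elem) {
                    struct page *base = cma_alloc(cma, 1U << chunk_order,
                                                  chunk_order, true);
                    if (!base) {
                            if (chunk_order == order)
                                    break;  /* return what we got */
                            chunk_order--;  /* smaller granularity */
                            continue;
                    }

                    for (i = 0; i < 1U << (chunk_order - order) &&
                                done < nr_elem; i++)
                            pages[done++] = base + (i << order);
            }
            return done;
    }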

Yes, it might not end up as fast as this big hack (sorry) here, but as
Nicholas correctly said, it's not our motivation to implement and
maintain such complexity just to squeeze the last milliseconds out of an
allocation path for "broken devices".

I absolutely dislike pushing this very specific allocation policy down
to the core range allocator. It already makes my head spin every time
I look at it in detail.
Minchan Kim Aug. 18, 2020, 4:49 p.m. UTC | #17
On Tue, Aug 18, 2020 at 06:22:10PM +0200, David Hildenbrand wrote:
> On 18.08.20 17:58, Matthew Wilcox wrote:
> > On Tue, Aug 18, 2020 at 08:15:43AM -0700, Minchan Kim wrote:
> >> I understand the pfn stuff in the API is not pretty, but the concept
> >> makes sense to me: go through the *migratable area* and get as many
> >> requested-order pages as possible with hard effort. It looks like a
> >> GFP_NORETRY version of kmem_cache_alloc_bulk.
> >>
> >> How about this?
> >>
> >>     int cma_alloc(struct cma *cma, int order, unsigned int nr_elem, struct page **pages);
> > 
> > I think that makes a lot more sense as an API.  Although I think you want
> > 
> > int cma_bulk_alloc(struct cma *cma, unsigned order, unsigned nr_elem,
> > 		struct page **pages);
> > 
> 
> Right, and I would start with a very simple implementation that does not
> mess with (that is, modify) alloc_contig_range().
> 
> I'd then much rather want to see simple tweaks to alloc_contig_range()
> to improve the situation. E.g., some kind of "fail fast" flag that lets
> the caller specify to skip some draining (or do it manually in cma
> before a bulk allocation) and rather fail fast than really trying to
> allocate the range whatever it costs.
> 
> There are multiple optimizations you can play with then (start with big
> granularity and split, move to smaller granularity on demand, etc., all
> nicely wrapped in cma_bulk_alloc()).

Okay, let me hide the details inside cma_bulk_alloc as much as possible.
Anyway, we at least need to pass some flag to indicate "fail fast"
to alloc_contig_range. Maybe __GFP_NORETRY could carry that indication.

Thanks for the review.
Yang Shi Aug. 19, 2020, 12:27 a.m. UTC | #18
On Tue, Aug 18, 2020 at 9:22 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.08.20 17:58, Matthew Wilcox wrote:
> > On Tue, Aug 18, 2020 at 08:15:43AM -0700, Minchan Kim wrote:
> >> I understand the pfn stuff in the API is not pretty, but the concept
> >> makes sense to me: go through the *migratable area* and get as many
> >> requested-order pages as possible with hard effort. It looks like a
> >> GFP_NORETRY version of kmem_cache_alloc_bulk.
> >>
> >> How about this?
> >>
> >>     int cma_alloc(struct cma *cma, int order, unsigned int nr_elem, struct page **pages);
> >
> > I think that makes a lot more sense as an API.  Although I think you want
> >
> > int cma_bulk_alloc(struct cma *cma, unsigned order, unsigned nr_elem,
> >               struct page **pages);
> >
>
> Right, and I would start with a very simple implementation that does not
> mess with (that is, modify) alloc_contig_range().
>
> I'd then much rather want to see simple tweaks to alloc_contig_range()
> to improve the situation. E.g., some kind of "fail fast" flag that lets
> the caller specify to skip some draining (or do it manually in cma
> before a bulk allocation) and rather fail fast than really trying to
> allocate the range whatever it costs.

Makes sense to me. There are plenty of such optimizations in mm, e.g.
light async migration vs. sync migration.

And it looks like Minchan can accept allocation failure (returning
fewer than nr_elem pages), so the user could just retry.

>
> There are multiple optimizations you can play with then (start with big
> granularity and split, move to smaller granularity on demand, etc., all
> nicely wrapped in cma_bulk_alloc()).
>
> Yes, it might not end up as fast as this big hack (sorry) here, but as
> Nicholas correctly said, it's not our motivation to implement and
> maintain such complexity just to squeeze the last milliseconds out of an
> allocation path for "broken devices".
>
> I absolutely dislike pushing this very specific allocation policy down
> to the core range allocator. It already makes my head spin every time
> I look at it in detail.