Message ID | 20230117082508.8953-1-jaewon31.kim@samsung.com
---|---
State | New
Series | dma-buf: system_heap: avoid reclaim for order 4
> Using order 4 pages would be helpful for many IOMMUs, but it could spend
> quite a lot of time from a page allocation perspective.
>
> The order 4 allocation with __GFP_RECLAIM may spend much time in
> reclaim and compaction logic. __GFP_NORETRY also may affect this. These
> cause unpredictable delays.
>
> To get reasonable allocation speed from the dma-buf system heap, use
> HIGH_ORDER_GFP for order 4 to avoid reclaim.
>
> Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
> ---
>  drivers/dma-buf/heaps/system_heap.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> index e8bd10e60998..5a405e99ef1e 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -42,11 +42,10 @@ struct dma_heap_attachment {
>  };
>
>  #define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
> -#define MID_ORDER_GFP (LOW_ORDER_GFP | __GFP_NOWARN)
>  #define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
>  			| __GFP_NORETRY) & ~__GFP_RECLAIM) \
>  			| __GFP_COMP)
> -static gfp_t order_flags[] = {HIGH_ORDER_GFP, MID_ORDER_GFP, LOW_ORDER_GFP};
> +static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
>  /*
>   * The selection of the orders used for allocation (1MB, 64K, 4K) is designed
>   * to match with the sizes often found in IOMMUs. Using order 4 pages instead
> --
> 2.17.1

added John Stultz <jstultz@google.com>
On Tue, Jan 17, 2023 at 12:31 AM Jaewon Kim <jaewon31.kim@samsung.com> wrote:
> > Using order 4 pages would be helpful for many IOMMUs, but it could spend
> > quite a lot of time from a page allocation perspective.
> >
> > The order 4 allocation with __GFP_RECLAIM may spend much time in
> > reclaim and compaction logic. __GFP_NORETRY also may affect this. These
> > cause unpredictable delays.
> >
> > To get reasonable allocation speed from the dma-buf system heap, use
> > HIGH_ORDER_GFP for order 4 to avoid reclaim.

Thanks for sharing this!
The case where the allocation gets stuck behind reclaim under pressure
does sound undesirable, but I'd be a bit hesitant to tweak numbers
that have been used for a long while (going back to ion) without a bit
more data.

It might be good to also better understand the tradeoff of the
potential ongoing impact to performance from using low order pages
when the buffer is used. Do you have any details or tests that you
could share to help ensure this won't impact other users?

TJ: Do you have any additional thoughts on this?

thanks
-john
> On Tue, Jan 17, 2023 at 12:31 AM Jaewon Kim <jaewon31.kim@samsung.com> wrote:
> > > Using order 4 pages would be helpful for many IOMMUs, but it could spend
> > > quite a lot of time from a page allocation perspective.
> > >
> > > The order 4 allocation with __GFP_RECLAIM may spend much time in
> > > reclaim and compaction logic. __GFP_NORETRY also may affect this. These
> > > cause unpredictable delays.
> > >
> > > To get reasonable allocation speed from the dma-buf system heap, use
> > > HIGH_ORDER_GFP for order 4 to avoid reclaim.
>
> Thanks for sharing this!
> The case where the allocation gets stuck behind reclaim under pressure
> does sound undesirable, but I'd be a bit hesitant to tweak numbers
> that have been used for a long while (going back to ion) without a bit
> more data.
>
> It might be good to also better understand the tradeoff of the
> potential ongoing impact to performance from using low order pages
> when the buffer is used. Do you have any details or tests that you
> could share to help ensure this won't impact other users?
>
> TJ: Do you have any additional thoughts on this?
>
> thanks
> -john

Let me share what I tested with trace events of compaction and direct
reclaim. This log was captured within one dma-buf allocation of
1,957,888 bytes. It shows that direct reclaim spent 4.973 msec in total
to reclaim 209 pages in total, while 12.606 msec were spent in
compaction. I might have missed preempted time, but I think compaction
is quite a time-consuming job.

Sorry, I don't know how much the current gfp is helpful to the IOMMU. I
think the default should be set like my change to improve allocation,
and the current gfp could be selected if needed through other options
like the cmdline.

-5972 [006] .... 7971.248176: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0x40000 free_pfn=0xffc00 zone_end=0x100000, mode=async
-5972 [006] .... 7971.248394: mm_compaction_end: zone_start=0x40000 migrate_pfn=0xac800 free_pfn=0xffc00 zone_end=0x100000, mode=async status=contended
-5972 [006] .... 7971.248399: mm_vmscan_direct_reclaim_begin: order=4 gfp_flags=GFP_HIGHUSER|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO
-5972 [006] .... 7971.248922: mm_vmscan_direct_reclaim_end: nr_reclaimed=17
-5972 [006] .... 7971.248974: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0xacc00 free_pfn=0xffc00 zone_end=0x100000, mode=async
-5972 [006] .... 7971.249063: mm_compaction_end: zone_start=0x40000 migrate_pfn=0xb5c00 free_pfn=0xffc00 zone_end=0x100000, mode=async status=contended
-5972 [006] .... 7971.249067: mm_vmscan_direct_reclaim_begin: order=4 gfp_flags=GFP_HIGHUSER|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO
-5972 [006] .... 7971.249768: mm_vmscan_direct_reclaim_end: nr_reclaimed=18
-5972 [006] .... 7971.249843: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0xb6000 free_pfn=0xafc00 zone_end=0x100000, mode=async
-5972 [006] .... 7971.249845: mm_compaction_end: zone_start=0x40000 migrate_pfn=0xb6000 free_pfn=0xafc00 zone_end=0x100000, mode=async status=partial_skipped
-5972 [006] .... 7971.249849: mm_vmscan_direct_reclaim_begin: order=4 gfp_flags=GFP_HIGHUSER|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO
-5972 [006] .... 7971.250116: mm_vmscan_direct_reclaim_end: nr_reclaimed=22
-5972 [006] .... 7971.250152: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0x40000 free_pfn=0xaf800 zone_end=0x100000, mode=async
-5972 [006] .... 7971.250305: mm_compaction_end: zone_start=0x40000 migrate_pfn=0xac400 free_pfn=0xaf800 zone_end=0x100000, mode=async status=success
-5972 [006] .... 7971.250375: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0xac800 free_pfn=0xaf800 zone_end=0x100000, mode=async
-5972 [006] .... 7971.250380: mm_compaction_end: zone_start=0x40000 migrate_pfn=0xaf800 free_pfn=0xaf800 zone_end=0x100000, mode=async status=partial_skipped
-5972 [006] .... 7971.250384: mm_vmscan_direct_reclaim_begin: order=4 gfp_flags=GFP_HIGHUSER|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO
-5972 [006] .... 7971.250597: mm_vmscan_direct_reclaim_end: nr_reclaimed=20
-5972 [006] .... 7971.251289: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0x40000 free_pfn=0xaf000 zone_end=0x100000, mode=async
-5972 [006] .... 7971.251497: mm_compaction_end: zone_start=0x40000 migrate_pfn=0xac400 free_pfn=0xaf000 zone_end=0x100000, mode=async status=contended
-5972 [006] .... 7971.251502: mm_vmscan_direct_reclaim_begin: order=4 gfp_flags=GFP_HIGHUSER|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO
-5972 [006] .... 7971.251797: mm_vmscan_direct_reclaim_end: nr_reclaimed=22
-5972 [006] .... 7971.251890: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0x5b400 free_pfn=0xb0400 zone_end=0x100000, mode=sync
-5972 [006] .... 7971.252740: mm_compaction_end: zone_start=0x40000 migrate_pfn=0x63800 free_pfn=0xb0400 zone_end=0x100000, mode=sync status=success
-5972 [006] .... 7971.252774: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0x63c00 free_pfn=0xb0000 zone_end=0x100000, mode=async
-5972 [006] .... 7971.253325: mm_compaction_end: zone_start=0x40000 migrate_pfn=0x74e20 free_pfn=0xade92 zone_end=0x100000, mode=async status=success
-5972 [006] .... 7971.253333: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0x83400 free_pfn=0xafc00 zone_end=0x100000, mode=async
-5972 [006] .... 7971.253683: mm_compaction_end: zone_start=0x40000 migrate_pfn=0xac400 free_pfn=0xafc00 zone_end=0x100000, mode=async status=contended
-5972 [006] .... 7971.253688: mm_vmscan_direct_reclaim_begin: order=4 gfp_flags=GFP_HIGHUSER|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO
-5972 [006] .... 7971.254134: mm_vmscan_direct_reclaim_end: nr_reclaimed=22
-5972 [006] .... 7971.254206: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0xad000 free_pfn=0xafc00 zone_end=0x100000, mode=async
-5972 [006] .... 7971.254289: mm_compaction_end: zone_start=0x40000 migrate_pfn=0xae1c0 free_pfn=0xae000 zone_end=0x100000, mode=async status=partial_skipped
-5972 [006] .... 7971.254304: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0x40000 free_pfn=0xffc00 zone_end=0x100000, mode=async
-5972 [006] .... 7971.254510: mm_compaction_end: zone_start=0x40000 migrate_pfn=0x74db0 free_pfn=0xadcc1 zone_end=0x100000, mode=async status=success
-5972 [006] .... 7971.254519: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0x75000 free_pfn=0xffc00 zone_end=0x100000, mode=async
-5972 [006] .... 7971.254612: mm_compaction_end: zone_start=0x40000 migrate_pfn=0xac400 free_pfn=0xffc00 zone_end=0x100000, mode=async status=contended
-5972 [006] .... 7971.254617: mm_vmscan_direct_reclaim_begin: order=4 gfp_flags=GFP_HIGHUSER|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO
-5972 [006] .... 7971.255303: mm_vmscan_direct_reclaim_end: nr_reclaimed=22
-5972 [006] .... 7971.255307: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0x75000 free_pfn=0xffc00 zone_end=0x100000, mode=sync
-5972 [006] .... 7971.260198: mm_compaction_end: zone_start=0x40000 migrate_pfn=0x751e5 free_pfn=0xaa83f zone_end=0x100000, mode=sync status=success
-5972 [006] .... 7971.260236: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0x40000 free_pfn=0xffc00 zone_end=0x100000, mode=async
-5972 [006] .... 7971.260883: mm_compaction_end: zone_start=0x40000 migrate_pfn=0xb5c00 free_pfn=0xffc00 zone_end=0x100000, mode=async status=contended
-5972 [006] .... 7971.260891: mm_vmscan_direct_reclaim_begin: order=4 gfp_flags=GFP_HIGHUSER|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO
-5972 [006] .... 7971.261174: mm_vmscan_direct_reclaim_end: nr_reclaimed=22
-5972 [006] .... 7971.261178: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0x75000 free_pfn=0xffc00 zone_end=0x100000, mode=sync
-5972 [006] .... 7971.264861: mm_compaction_end: zone_start=0x40000 migrate_pfn=0x75681 free_pfn=0xffc00 zone_end=0x100000, mode=sync status=success
-5972 [006] .... 7971.264873: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0xbbc00 free_pfn=0xffc00 zone_end=0x100000, mode=async
-5972 [006] .... 7971.264889: mm_compaction_end: zone_start=0x40000 migrate_pfn=0xc1400 free_pfn=0xffc00 zone_end=0x100000, mode=async status=contended
-5972 [006] .... 7971.264891: mm_vmscan_direct_reclaim_begin: order=4 gfp_flags=GFP_HIGHUSER|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO
-5972 [006] .... 7971.265141: mm_vmscan_direct_reclaim_end: nr_reclaimed=22
-5972 [002] .... 7971.265585: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0xc1800 free_pfn=0xffc00 zone_end=0x100000, mode=async
-5972 [002] .... 7971.265669: mm_compaction_end: zone_start=0x40000 migrate_pfn=0xc2c00 free_pfn=0xffc00 zone_end=0x100000, mode=async status=contended
-5972 [002] .... 7971.265678: mm_vmscan_direct_reclaim_begin: order=4 gfp_flags=GFP_HIGHUSER|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO
-5972 [002] .... 7971.266987: mm_vmscan_direct_reclaim_end: nr_reclaimed=22
-5972 [006] .... 7971.267466: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0x40000 free_pfn=0xffc00 zone_end=0x100000, mode=sync
-5972 [006] .... 7971.267778: mm_compaction_end: zone_start=0x40000 migrate_pfn=0x5d000 free_pfn=0xffc00 zone_end=0x100000, mode=sync status=success
-5972 [006] .... 7971.267927: mm_compaction_begin: zone_start=0x40000 migrate_pfn=0x5c000 free_pfn=0xffc00 zone_end=0x100000, mode=async
-5972 [006] .... 7971.268092: mm_compaction_end: zone_start=0x40000 migrate_pfn=0x74d70 free_pfn=0xffc00 zone_end=0x100000, mode=async status=success

Thank you
Jaewon Kim
On Tue, Jan 17, 2023 at 10:54 PM John Stultz <jstultz@google.com> wrote:
>
> On Tue, Jan 17, 2023 at 12:31 AM Jaewon Kim <jaewon31.kim@samsung.com> wrote:
> > > Using order 4 pages would be helpful for many IOMMUs, but it could spend
> > > quite a lot of time from a page allocation perspective.
> > >
> > > The order 4 allocation with __GFP_RECLAIM may spend much time in
> > > reclaim and compaction logic. __GFP_NORETRY also may affect this. These
> > > cause unpredictable delays.
> > >
> > > To get reasonable allocation speed from the dma-buf system heap, use
> > > HIGH_ORDER_GFP for order 4 to avoid reclaim.
>
> Thanks for sharing this!
> The case where the allocation gets stuck behind reclaim under pressure
> does sound undesirable, but I'd be a bit hesitant to tweak numbers
> that have been used for a long while (going back to ion) without a bit
> more data.
>
> It might be good to also better understand the tradeoff of the
> potential ongoing impact to performance from using low order pages
> when the buffer is used. Do you have any details or tests that you
> could share to help ensure this won't impact other users?
>
> TJ: Do you have any additional thoughts on this?

I don't have any data on how often we hit reclaim for mid order
allocations. That would be interesting to know. However, the 70th
percentile of system-wide buffer sizes while running the camera on my
phone is still only 1 page, so it looks like this change would affect
a subset of use-cases.

Wouldn't this change make it less likely to get an order 4 allocation
(under memory pressure)? The commit message makes me think the goal of
the change is to get more of them.

Actually, with the low order being 0, I don't think __GFP_COMP makes
sense in LOW_ORDER_GFP. But I guess that flag isn't harmful there.

> thanks
> -john
> On Tue, Jan 17, 2023 at 10:54 PM John Stultz <jstultz@google.com> wrote:
> >
> > On Tue, Jan 17, 2023 at 12:31 AM Jaewon Kim <jaewon31.kim@samsung.com> wrote:
> > > > Using order 4 pages would be helpful for many IOMMUs, but it could spend
> > > > quite a lot of time from a page allocation perspective.
> > > >
> > > > The order 4 allocation with __GFP_RECLAIM may spend much time in
> > > > reclaim and compaction logic. __GFP_NORETRY also may affect this. These
> > > > cause unpredictable delays.
> > > >
> > > > To get reasonable allocation speed from the dma-buf system heap, use
> > > > HIGH_ORDER_GFP for order 4 to avoid reclaim.
> >
> > Thanks for sharing this!
> > The case where the allocation gets stuck behind reclaim under pressure
> > does sound undesirable, but I'd be a bit hesitant to tweak numbers
> > that have been used for a long while (going back to ion) without a bit
> > more data.
> >
> > It might be good to also better understand the tradeoff of the
> > potential ongoing impact to performance from using low order pages
> > when the buffer is used. Do you have any details or tests that you
> > could share to help ensure this won't impact other users?
> >
> > TJ: Do you have any additional thoughts on this?
>
> I don't have any data on how often we hit reclaim for mid order
> allocations. That would be interesting to know. However, the 70th
> percentile of system-wide buffer sizes while running the camera on my
> phone is still only 1 page, so it looks like this change would affect
> a subset of use-cases.
>
> Wouldn't this change make it less likely to get an order 4 allocation
> (under memory pressure)? The commit message makes me think the goal of
> the change is to get more of them.

Hello John Stultz

I've been waiting for your next reply.

With my commit, we may gather a smaller number of order 4 pages and
fill the requested size with a larger number of order 0 pages. I
think, however, stable allocation speed is quite important so that the
corresponding user space context can move on within a specific time.

Not only compaction but also reclaim, I think, would be invoked more if
__GFP_RECLAIM is added on order 4. I expect the reclaim could be
decreased if we move to order 0.

Thank you
Jaewon Kim

> Actually, with the low order being 0, I don't think __GFP_COMP makes
> sense in LOW_ORDER_GFP. But I guess that flag isn't harmful there.
>
> > thanks
> > -john
> > On Tue, Jan 17, 2023 at 10:54 PM John Stultz <jstultz@google.com> wrote:
> > >
> > > On Tue, Jan 17, 2023 at 12:31 AM Jaewon Kim <jaewon31.kim@samsung.com> wrote:
> > > > > Using order 4 pages would be helpful for many IOMMUs, but it could spend
> > > > > quite a lot of time from a page allocation perspective.
> > > > >
> > > > > The order 4 allocation with __GFP_RECLAIM may spend much time in
> > > > > reclaim and compaction logic. __GFP_NORETRY also may affect this. These
> > > > > cause unpredictable delays.
> > > > >
> > > > > To get reasonable allocation speed from the dma-buf system heap, use
> > > > > HIGH_ORDER_GFP for order 4 to avoid reclaim.
> > >
> > > Thanks for sharing this!
> > > The case where the allocation gets stuck behind reclaim under pressure
> > > does sound undesirable, but I'd be a bit hesitant to tweak numbers
> > > that have been used for a long while (going back to ion) without a bit
> > > more data.
> > >
> > > It might be good to also better understand the tradeoff of the
> > > potential ongoing impact to performance from using low order pages
> > > when the buffer is used. Do you have any details or tests that you
> > > could share to help ensure this won't impact other users?
> > >
> > > TJ: Do you have any additional thoughts on this?
> >
> > I don't have any data on how often we hit reclaim for mid order
> > allocations. That would be interesting to know. However, the 70th
> > percentile of system-wide buffer sizes while running the camera on my
> > phone is still only 1 page, so it looks like this change would affect
> > a subset of use-cases.
> >
> > Wouldn't this change make it less likely to get an order 4 allocation
> > (under memory pressure)? The commit message makes me think the goal of
> > the change is to get more of them.
>
> Hello John Stultz
>
> I've been waiting for your next reply.
>
> With my commit, we may gather a smaller number of order 4 pages and
> fill the requested size with a larger number of order 0 pages. I
> think, however, stable allocation speed is quite important so that the
> corresponding user space context can move on within a specific time.
>
> Not only compaction but also reclaim, I think, would be invoked more if
> __GFP_RECLAIM is added on order 4. I expect the reclaim could be
> decreased if we move to order 0.

Additionally, I'd like to say the old legacy ion system heap also used
the __GFP_RECLAIM only for order 8, not for order 4.

drivers/staging/android/ion/ion_system_heap.c

static gfp_t high_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN |
				     __GFP_NORETRY) & ~__GFP_RECLAIM;
static gfp_t low_order_gfp_flags  = GFP_HIGHUSER | __GFP_ZERO;
static const unsigned int orders[] = {8, 4, 0};

static int ion_system_heap_create_pools(struct ion_page_pool **pools)
{
	int i;

	for (i = 0; i < NUM_ORDERS; i++) {
		struct ion_page_pool *pool;
		gfp_t gfp_flags = low_order_gfp_flags;

		if (orders[i] > 4)
			gfp_flags = high_order_gfp_flags;

> Thank you
> Jaewon Kim
>
> > Actually, with the low order being 0, I don't think __GFP_COMP makes
> > sense in LOW_ORDER_GFP. But I guess that flag isn't harmful there.
> >
> > > thanks
> > > -john
On Wed, Jan 25, 2023 at 2:20 AM Jaewon Kim <jaewon31.kim@samsung.com> wrote:
>
> > > On Tue, Jan 17, 2023 at 10:54 PM John Stultz <jstultz@google.com> wrote:
> > > >
> > > > On Tue, Jan 17, 2023 at 12:31 AM Jaewon Kim <jaewon31.kim@samsung.com> wrote:
> > > > > > Using order 4 pages would be helpful for many IOMMUs, but it could spend
> > > > > > quite a lot of time from a page allocation perspective.
> > > > > >
> > > > > > The order 4 allocation with __GFP_RECLAIM may spend much time in
> > > > > > reclaim and compaction logic. __GFP_NORETRY also may affect this. These
> > > > > > cause unpredictable delays.
> > > > > >
> > > > > > To get reasonable allocation speed from the dma-buf system heap, use
> > > > > > HIGH_ORDER_GFP for order 4 to avoid reclaim.
> > > >
> > > > Thanks for sharing this!
> > > > The case where the allocation gets stuck behind reclaim under pressure
> > > > does sound undesirable, but I'd be a bit hesitant to tweak numbers
> > > > that have been used for a long while (going back to ion) without a bit
> > > > more data.
> > > >
> > > > It might be good to also better understand the tradeoff of the
> > > > potential ongoing impact to performance from using low order pages
> > > > when the buffer is used. Do you have any details or tests that you
> > > > could share to help ensure this won't impact other users?
> > > >
> > > > TJ: Do you have any additional thoughts on this?
> > >
> > > I don't have any data on how often we hit reclaim for mid order
> > > allocations. That would be interesting to know. However, the 70th
> > > percentile of system-wide buffer sizes while running the camera on my
> > > phone is still only 1 page, so it looks like this change would affect
> > > a subset of use-cases.
> > >
> > > Wouldn't this change make it less likely to get an order 4 allocation
> > > (under memory pressure)? The commit message makes me think the goal of
> > > the change is to get more of them.
> >
> > Hello John Stultz
> >
> > I've been waiting for your next reply.

Sorry, I was thinking you were gathering data on the tradeoffs. Sorry
for my confusion.

> > With my commit, we may gather a smaller number of order 4 pages and
> > fill the requested size with a larger number of order 0 pages. I
> > think, however, stable allocation speed is quite important so that the
> > corresponding user space context can move on within a specific time.
> >
> > Not only compaction but also reclaim, I think, would be invoked more if
> > __GFP_RECLAIM is added on order 4. I expect the reclaim could be
> > decreased if we move to order 0.
>
> Additionally, I'd like to say the old legacy ion system heap also used
> the __GFP_RECLAIM only for order 8, not for order 4.
>
> drivers/staging/android/ion/ion_system_heap.c
>
> static gfp_t high_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN |
>				      __GFP_NORETRY) & ~__GFP_RECLAIM;
> static gfp_t low_order_gfp_flags  = GFP_HIGHUSER | __GFP_ZERO;
> static const unsigned int orders[] = {8, 4, 0};
>
> static int ion_system_heap_create_pools(struct ion_page_pool **pools)
> {
>	int i;
>
>	for (i = 0; i < NUM_ORDERS; i++) {
>		struct ion_page_pool *pool;
>		gfp_t gfp_flags = low_order_gfp_flags;
>
>		if (orders[i] > 4)
>			gfp_flags = high_order_gfp_flags;

This seems a bit backwards from your statement. It's only removing
__GFP_RECLAIM on order 8 (high_order_gfp_flags).

So apologies again, but how is that different from the existing code?

#define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
#define MID_ORDER_GFP (LOW_ORDER_GFP | __GFP_NOWARN)
#define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
			| __GFP_NORETRY) & ~__GFP_RECLAIM) \
			| __GFP_COMP)
static gfp_t order_flags[] = {HIGH_ORDER_GFP, MID_ORDER_GFP, LOW_ORDER_GFP};

Where the main reason we introduced the mid-order flags is to avoid
the warnings on order 4 allocation failures when we fall back to
order 0.

The only substantial difference I see between the old ion code and
what we have now is the __GFP_COMP addition, which is a bit hazy in my
memory. I unfortunately don't have a record of why it was added (don't
have access to my old mail box), so I suspect it was something brought
up in private review. Dropping that from the low order flags probably
makes sense, as TJ pointed out, but this isn't what your patch is
changing.

Your patch is changing that for mid-order allocations we'll use the
high order flags, so we'll not retry and not reclaim, and there will be
more failing and falling back to single page allocations.
This makes sense to make allocation time faster and more deterministic
(I like it!), but potentially has the tradeoff of losing the
performance benefit of using mid order page sizes.

I suspect your change is a net win overall, as the cumulative effect
of using larger pages probably won't benefit more than the large
indeterministic allocation time, particularly under pressure.

But because your change is different from what the old ion code did, I
want to be a little cautious. So it would be nice to see some
evaluation of not just the benefits the patch provides you but also of
what negative impact it might have. And so far you haven't provided
any details there.

A quick example might be: for the use case where mid-order allocations
are causing you trouble, you could see how the performance changes if
you force all mid-order allocations to be single page allocations (so
orders[] = {8, 0, 0};) and compare it with the current code when
there's no memory pressure (right after reboot, when pages haven't been
fragmented) so the mid-order allocations will succeed. That will let
us know the potential downside if we have brief / transient pressure
at allocation time that forces small pages.

Does that make sense?

thanks
-john
> On Wed, Jan 25, 2023 at 2:20 AM Jaewon Kim <jaewon31.kim@samsung.com> wrote:
> >
> > > > On Tue, Jan 17, 2023 at 10:54 PM John Stultz <jstultz@google.com> wrote:
> > > > >
> > > > > On Tue, Jan 17, 2023 at 12:31 AM Jaewon Kim <jaewon31.kim@samsung.com> wrote:
> > > > > > > Using order 4 pages would be helpful for many IOMMUs, but it could spend
> > > > > > > quite a lot of time from a page allocation perspective.
> > > > > > >
> > > > > > > The order 4 allocation with __GFP_RECLAIM may spend much time in
> > > > > > > reclaim and compaction logic. __GFP_NORETRY also may affect this. These
> > > > > > > cause unpredictable delays.
> > > > > > >
> > > > > > > To get reasonable allocation speed from the dma-buf system heap, use
> > > > > > > HIGH_ORDER_GFP for order 4 to avoid reclaim.
> > > > >
> > > > > Thanks for sharing this!
> > > > > The case where the allocation gets stuck behind reclaim under pressure
> > > > > does sound undesirable, but I'd be a bit hesitant to tweak numbers
> > > > > that have been used for a long while (going back to ion) without a bit
> > > > > more data.
> > > > >
> > > > > It might be good to also better understand the tradeoff of the
> > > > > potential ongoing impact to performance from using low order pages
> > > > > when the buffer is used. Do you have any details or tests that you
> > > > > could share to help ensure this won't impact other users?
> > > > >
> > > > > TJ: Do you have any additional thoughts on this?
> > > >
> > > > I don't have any data on how often we hit reclaim for mid order
> > > > allocations. That would be interesting to know. However, the 70th
> > > > percentile of system-wide buffer sizes while running the camera on my
> > > > phone is still only 1 page, so it looks like this change would affect
> > > > a subset of use-cases.
> > > >
> > > > Wouldn't this change make it less likely to get an order 4 allocation
> > > > (under memory pressure)? The commit message makes me think the goal of
> > > > the change is to get more of them.
> > >
> > > Hello John Stultz
> > >
> > > I've been waiting for your next reply.
>
> Sorry, I was thinking you were gathering data on the tradeoffs. Sorry
> for my confusion.
>
> > > With my commit, we may gather a smaller number of order 4 pages and
> > > fill the requested size with a larger number of order 0 pages. I
> > > think, however, stable allocation speed is quite important so that the
> > > corresponding user space context can move on within a specific time.
> > >
> > > Not only compaction but also reclaim, I think, would be invoked more if
> > > __GFP_RECLAIM is added on order 4. I expect the reclaim could be
> > > decreased if we move to order 0.
> >
> > Additionally, I'd like to say the old legacy ion system heap also used
> > the __GFP_RECLAIM only for order 8, not for order 4.
> >
> > drivers/staging/android/ion/ion_system_heap.c
> >
> > static gfp_t high_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN |
> >				       __GFP_NORETRY) & ~__GFP_RECLAIM;
> > static gfp_t low_order_gfp_flags  = GFP_HIGHUSER | __GFP_ZERO;
> > static const unsigned int orders[] = {8, 4, 0};
> >
> > static int ion_system_heap_create_pools(struct ion_page_pool **pools)
> > {
> >	int i;
> >
> >	for (i = 0; i < NUM_ORDERS; i++) {
> >		struct ion_page_pool *pool;
> >		gfp_t gfp_flags = low_order_gfp_flags;
> >
> >		if (orders[i] > 4)
> >			gfp_flags = high_order_gfp_flags;
>
> This seems a bit backwards from your statement. It's only removing
> __GFP_RECLAIM on order 8 (high_order_gfp_flags).

Oh sorry, my fault. I also read it wrongly. But as far as I know, most
of the AP chipset vendors have been using __GFP_RECLAIM only for order
0. I can't say in detail, though.

> So apologies again, but how is that different from the existing code?
>
> #define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
> #define MID_ORDER_GFP (LOW_ORDER_GFP | __GFP_NOWARN)
> #define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
>			 | __GFP_NORETRY) & ~__GFP_RECLAIM) \
>			 | __GFP_COMP)
> static gfp_t order_flags[] = {HIGH_ORDER_GFP, MID_ORDER_GFP, LOW_ORDER_GFP};
>
> Where the main reason we introduced the mid-order flags is to avoid
> the warnings on order 4 allocation failures when we fall back to
> order 0.
>
> The only substantial difference I see between the old ion code and
> what we have now is the __GFP_COMP addition, which is a bit hazy in my
> memory. I unfortunately don't have a record of why it was added (don't
> have access to my old mail box), so I suspect it was something brought
> up in private review. Dropping that from the low order flags probably
> makes sense, as TJ pointed out, but this isn't what your patch is
> changing.
>
> Your patch is changing that for mid-order allocations we'll use the
> high order flags, so we'll not retry and not reclaim, and there will be
> more failing and falling back to single page allocations.
> This makes sense to make allocation time faster and more deterministic
> (I like it!), but potentially has the tradeoff of losing the
> performance benefit of using mid order page sizes.
>
> I suspect your change is a net win overall, as the cumulative effect
> of using larger pages probably won't benefit more than the large
> indeterministic allocation time, particularly under pressure.
>
> But because your change is different from what the old ion code did, I
> want to be a little cautious. So it would be nice to see some
> evaluation of not just the benefits the patch provides you but also of
> what negative impact it might have. And so far you haven't provided
> any details there.
>
> A quick example might be: for the use case where mid-order allocations
> are causing you trouble, you could see how the performance changes if
> you force all mid-order allocations to be single page allocations (so
> orders[] = {8, 0, 0};) and compare it with the current code when
> there's no memory pressure (right after reboot, when pages haven't been
> fragmented) so the mid-order allocations will succeed. That will let
> us know the potential downside if we have brief / transient pressure
> at allocation time that forces small pages.
>
> Does that make sense?

Let me try this. It may take some days. But I guess it depends on the
memory status, as you said. If there were quite many order 4 pages,
then 8 4 0 should be faster than 8 0 0.

I don't know if this is the right approach. In my opinion, except for
specific cases like right after reboot, there are not many order 4
pages. And from a deterministic allocation time perspective, I think
avoiding too-long allocations is more important than being faster with
already existing free order 4 pages.

BR
Jaewon Kim

> thanks
> -john
On Wed, Jan 25, 2023 at 8:42 PM 김재원 <jaewon31.kim@samsung.com> wrote:
>
> On Wed, Jan 25, 2023 at 2:20 AM Jaewon Kim <jaewon31.kim@samsung.com> wrote:
> >
> > > On Tue, Jan 17, 2023 at 10:54 PM John Stultz <jstultz@google.com> wrote:
> > But because your change is different from what the old ion code did, I
> > want to be a little cautious. So it would be nice to see some
> > evaluation of not just the benefits the patch provides you but also of
> > what negative impact it might have. And so far you haven't provided
> > any details there.
> >
> > A quick example might be: for the use case where mid-order allocations
> > are causing you trouble, you could see how the performance changes if
> > you force all mid-order allocations to be single page allocations (so
> > orders[] = {8, 0, 0};) and compare it with the current code when
> > there's no memory pressure (right after reboot, when pages haven't been
> > fragmented) so the mid-order allocations will succeed. That will let
> > us know the potential downside if we have brief / transient pressure
> > at allocation time that forces small pages.
> >
> > Does that make sense?
>
> Let me try this. It may take some days. But I guess it depends on the
> memory status, as you said. If there were quite many order 4 pages,
> then 8 4 0 should be faster than 8 0 0.
>
> I don't know if this is the right approach. In my opinion, except for
> specific cases like right after reboot, there are not many order 4
> pages. And from a deterministic allocation time perspective, I think
> avoiding too-long allocations is more important than being faster with
> already existing free order 4 pages.

I suspect you are right, and do think your change will be helpful.
But I just want to make sure we're doing some due diligence, instead
of going on just gut instinct.

Thanks so much for helping with this!
-john
> >
> > --------- Original Message ---------
> > Sender : John Stultz <jstultz@google.com>
> > Date : 2023-01-26 14:04 (GMT+9)
> > Title : Re: (2) [PATCH] dma-buf: system_heap: avoid reclaim for order 4
> >
> > On Wed, Jan 25, 2023 at 8:42 PM 김재원 <jaewon31.kim@samsung.com> wrote:
> > > > On Wed, Jan 25, 2023 at 2:20 AM Jaewon Kim <jaewon31.kim@samsung.com> wrote:
> > > >
> > > > > On Tue, Jan 17, 2023 at 10:54 PM John Stultz <jstultz@google.com> wrote:
> > > > But because your change is different from what the old ion code did, I
> > > > want to be a little cautious. So it would be nice to see some
> > > > evaluation of not just the benefits the patch provides you but also of
> > > > what negative impact it might have. And so far you haven't provided
> > > > any details there.
> > > >
> > > > A quick example might be for the use case where mid-order allocations
> > > > are causing you trouble, you could see how the performance changes if
> > > > you force all mid-order allocations to be single page allocations (so
> > > > orders[] = {8, 0, 0};) and compare it with the current code when
> > > > there's no memory pressure (right after reboot when pages haven't been
> > > > fragmented) so the mid-order allocations will succeed. That will let
> > > > us know the potential downside if we have brief / transient pressure
> > > > at allocation time that forces small pages.
> > > >
> > > > Does that make sense?
> > >
> > > Let me try this. It make take some days. But I guess it depends on memory
> > > status as you said. If there were quite many order 4 pages, then 8 4 0
> > > should be faster than 8 0 0.
> > >
> > > I don't know this is a right approach. In my opinion, except the specific
> > > cases like right after reboot, there are not many order 4 pages. And
> > > in determinisitic allocation time perspective, I think avoiding too long
> > > allocations is more important than making faster with already existing
> > > free order 4 pages.
> >
> > I suspect you are right, and do think your change will be helpful.
> > But I just want to make sure we're doing some due diligence, instead
> > of going on just gut instinct.
> >
> > Thanks so much for helping with this!
> > -john
> >

Hello John Stultz, sorry for the late reply.
I had to manage other urgent things, and this test also took some time
to finish. Anyway, I hope you will be happy with the following test
results.


1. system heap modification

To rule out the effect of allocation from the pool, all freed dma
buffers were passed back to buddy instead of being kept in the pool.
Some trace_printk and order counting logic was added.

2. the test tool

To test the dma-buf system heap allocation speed, I prepared a
userspace test program which requests a specified size from a heap.
With the program, I requested a 10 MB allocation 16 times, with a
sleep added between requests. No memory was freed until all 16
allocations had been made.

3. the test device

The test device has arm64 CPU cores and a v5.15 based kernel. To get
stable results, the CPU clock was fixed so it could not change at run
time, and the test tool was bound to specific CPU cores running at the
same CPU clock.

4. test results

As expected, if order 4 pages exist in the buddy, the order 8, 4, 0
allocation was 1 to 4 times faster than the order 8, 0, 0 allocation.
But order 8, 0, 0 also looks fast enough.

Here are the time diffs and the number of chunks allocated at each order.
order 8, 4, 0 in the enough order 4 case

diff            8      4      0
665 usec        0    160      0
1,148 usec      0    160      0
1,089 usec      0    160      0
1,154 usec      0    160      0
1,264 usec      0    160      0
1,414 usec      0    160      0
873 usec        0    160      0
1,148 usec      0    160      0
1,158 usec      0    160      0
1,139 usec      0    160      0
1,169 usec      0    160      0
1,174 usec      0    160      0
1,210 usec      0    160      0
995 usec        0    160      0
1,151 usec      0    160      0
977 usec        0    160      0

order 8, 0, 0 in the enough order 4 case

diff            8      4      0
441 usec       10      0      0
747 usec       10      0      0
2,330 usec      2      0   2048
2,469 usec      0      0   2560
2,518 usec      0      0   2560
1,176 usec      0      0   2560
1,487 usec      0      0   2560
1,402 usec      0      0   2560
1,449 usec      0      0   2560
1,330 usec      0      0   2560
1,089 usec      0      0   2560
1,481 usec      0      0   2560
1,326 usec      0      0   2560
3,057 usec      0      0   2560
2,758 usec      0      0   2560
3,271 usec      0      0   2560

From the perspective of responsiveness, deterministic memory
allocation speed is, I think, quite important. So I tested another
case, where free memory is not enough.

In this test, I ran the set of 16 allocations twice consecutively.
The first set with order 8, 4, 0 was very slow and varied, but the
second set became faster because the high order pages had already
been created.
order 8, 4, 0 in low memory

diff            8      4      0
584 usec        0    160      0
28,428 usec     0    160      0
100,701 usec    0    160      0
76,645 usec     0    160      0
25,522 usec     0    160      0
38,798 usec     0    160      0
89,012 usec     0    160      0
23,015 usec     0    160      0
73,360 usec     0    160      0
76,953 usec     0    160      0
31,492 usec     0    160      0
75,889 usec     0    160      0
84,551 usec     0    160      0
84,352 usec     0    160      0
57,103 usec     0    160      0
93,452 usec     0    160      0

diff            8      4      0
808 usec       10      0      0
778 usec        4     96      0
829 usec        0    160      0
700 usec        0    160      0
937 usec        0    160      0
651 usec        0    160      0
636 usec        0    160      0
811 usec        0    160      0
622 usec        0    160      0
674 usec        0    160      0
677 usec        0    160      0
738 usec        0    160      0
1,130 usec      0    160      0
677 usec        0    160      0
553 usec        0    160      0
1,048 usec      0    160      0


order 8, 0, 0 in low memory

diff            8      4      0
1,699 usec      2      0   2048
2,082 usec      0      0   2560
840 usec        0      0   2560
875 usec        0      0   2560
845 usec        0      0   2560
1,706 usec      0      0   2560
967 usec        0      0   2560
1,000 usec      0      0   2560
1,905 usec      0      0   2560
2,451 usec      0      0   2560
3,384 usec      0      0   2560
2,397 usec      0      0   2560
3,171 usec      0      0   2560
2,376 usec      0      0   2560
3,347 usec      0      0   2560
2,554 usec      0      0   2560

diff            8      4      0
1,409 usec      2      0   2048
1,438 usec      0      0   2560
1,035 usec      0      0   2560
1,108 usec      0      0   2560
825 usec        0      0   2560
927 usec        0      0   2560
1,931 usec      0      0   2560
2,024 usec      0      0   2560
1,884 usec      0      0   2560
1,769 usec      0      0   2560
2,136 usec      0      0   2560
1,738 usec      0      0   2560
1,328 usec      0      0   2560
1,438 usec      0      0   2560
1,972 usec      0      0   2560
2,963 usec      0      0   2560

Finally, if we change order 4 to use HIGH_ORDER_GFP,
I expect that we could avoid the very slow cases.
order 8, 4, 0 in low memory with HIGH_ORDER_GFP

diff            8      4      0
1,356 usec      0    155     80
1,901 usec      0     11   2384
1,912 usec      0      0   2560
1,911 usec      0      0   2560
1,884 usec      0      0   2560
1,577 usec      0      0   2560
1,366 usec      0      0   2560
1,711 usec      0      0   2560
1,635 usec      0     28   2112
544 usec       10      0      0
633 usec        2    128      0
848 usec        0    160      0
729 usec        0    160      0
1,000 usec      0    160      0
1,358 usec      0    160      0
2,638 usec      0     31   2064

diff            8      4      0
669 usec       10      0      0
789 usec        8     32      0
603 usec        3    112      0
578 usec        0    160      0
562 usec        0    160      0
564 usec        0    160      0
686 usec        0    160      0
1,621 usec      0    160      0
2,080 usec      0     40   1920
1,749 usec      0      0   2560
2,244 usec      0      0   2560
2,333 usec      0      0   2560
1,257 usec      0      0   2560
1,703 usec      0      0   2560
1,782 usec      0      1   2544
2,225 usec      0      0   2560

Thank you
Jaewon Kim
On Sat, Feb 4, 2023 at 7:02 AM Jaewon Kim <jaewon31.kim@samsung.com> wrote:
> Hello John Stultz, sorry for late reply.
> I had to manage other urgent things and this test also took some time to finish.
> Any I hope you to be happy with following my test results.
>
>
> 1. system heap modification
>
> To avoid effect of allocation from the pool, all the freed dma
> buffer were passed to buddy without keeping them in the pool.
> Some trace_printk and order counting logic were added.
>
> 2. the test tool
>
> To test the dma-buf system heap allocation speed, I prepared
> a userspace test program which requests a specified size to a heap.
> With the program, I tried to request 16 times of 10 MB size and
> added 1 sleep between each request. Each memory was not freed
> until the total 16 times total memory was allocated.

Oof. I really appreciate all your effort that I'm sure went in to
generate these numbers, but this wasn't quite what I was asking for.
I know you've been focused on allocation performance under memory
pressure, but I was hoping to see the impact of __using__ order 0
pages over order 4 pages in real world conditions (for camera or video
recording or other use cases that use large allocations). These
results seem to be still just focused on the difference in allocation
performance between order 0 and order 4 with and without contention.

That said, re-reading my email, I probably could have been more clear
on this aspect.

> 3. the test device
>
> The test device has arm64 CPU cores and v5.15 based kernel.
> To get stable results, the CPU clock was fixed not to be changed
> in run time, and the test tool was set to some specific CPU cores
> running in the same CPU clock.
>
> 4. test results
>
> As we expected if order 4 exist in the buddy, the order 8, 4, 0
> allocation was 1 to 4 times faster than the order 8, 0, 0. But
> the order 8, 0, 0 also looks fast enough.
>
> Here's time diff, and number of each order.
>
> order 8, 4, 0 in the enough order 4 case
>
> diff            8      4      0
> 665 usec        0    160      0
> 1,148 usec      0    160      0
> 1,089 usec      0    160      0
> 1,154 usec      0    160      0
> 1,264 usec      0    160      0
> 1,414 usec      0    160      0
> 873 usec        0    160      0
> 1,148 usec      0    160      0
> 1,158 usec      0    160      0
> 1,139 usec      0    160      0
> 1,169 usec      0    160      0
> 1,174 usec      0    160      0
> 1,210 usec      0    160      0
> 995 usec        0    160      0
> 1,151 usec      0    160      0
> 977 usec        0    160      0
>
> order 8, 0, 0 in the enough order 4 case
>
> diff            8      4      0
> 441 usec       10      0      0
> 747 usec       10      0      0
> 2,330 usec      2      0   2048
> 2,469 usec      0      0   2560
> 2,518 usec      0      0   2560
> 1,176 usec      0      0   2560
> 1,487 usec      0      0   2560
> 1,402 usec      0      0   2560
> 1,449 usec      0      0   2560
> 1,330 usec      0      0   2560
> 1,089 usec      0      0   2560
> 1,481 usec      0      0   2560
> 1,326 usec      0      0   2560
> 3,057 usec      0      0   2560
> 2,758 usec      0      0   2560
> 3,271 usec      0      0   2560
>
> From the perspective of responsiveness, the deterministic
> memory allocation speed, I think, is quite important. So I
> tested other case where the free memory are not enough.
>
> On this test, I ran the 16 times allocation sets twice
> consecutively. Then it showed the first set order 8, 4, 0
> became very slow and varied, but the second set became
> faster because of the already created the high order.
>
> order 8, 4, 0 in low memory
>
> diff            8      4      0
> 584 usec        0    160      0
> 28,428 usec     0    160      0
> 100,701 usec    0    160      0
> 76,645 usec     0    160      0
> 25,522 usec     0    160      0
> 38,798 usec     0    160      0
> 89,012 usec     0    160      0
> 23,015 usec     0    160      0
> 73,360 usec     0    160      0
> 76,953 usec     0    160      0
> 31,492 usec     0    160      0
> 75,889 usec     0    160      0
> 84,551 usec     0    160      0
> 84,352 usec     0    160      0
> 57,103 usec     0    160      0
> 93,452 usec     0    160      0
>
> diff            8      4      0
> 808 usec       10      0      0
> 778 usec        4     96      0
> 829 usec        0    160      0
> 700 usec        0    160      0
> 937 usec        0    160      0
> 651 usec        0    160      0
> 636 usec        0    160      0
> 811 usec        0    160      0
> 622 usec        0    160      0
> 674 usec        0    160      0
> 677 usec        0    160      0
> 738 usec        0    160      0
> 1,130 usec      0    160      0
> 677 usec        0    160      0
> 553 usec        0    160      0
> 1,048 usec      0    160      0
>
>
> order 8, 0, 0 in low memory
>
> diff            8      4      0
> 1,699 usec      2      0   2048
> 2,082 usec      0      0   2560
> 840 usec        0      0   2560
> 875 usec        0      0   2560
> 845 usec        0      0   2560
> 1,706 usec      0      0   2560
> 967 usec        0      0   2560
> 1,000 usec      0      0   2560
> 1,905 usec      0      0   2560
> 2,451 usec      0      0   2560
> 3,384 usec      0      0   2560
> 2,397 usec      0      0   2560
> 3,171 usec      0      0   2560
> 2,376 usec      0      0   2560
> 3,347 usec      0      0   2560
> 2,554 usec      0      0   2560
>
> diff            8      4      0
> 1,409 usec      2      0   2048
> 1,438 usec      0      0   2560
> 1,035 usec      0      0   2560
> 1,108 usec      0      0   2560
> 825 usec        0      0   2560
> 927 usec        0      0   2560
> 1,931 usec      0      0   2560
> 2,024 usec      0      0   2560
> 1,884 usec      0      0   2560
> 1,769 usec      0      0   2560
> 2,136 usec      0      0   2560
> 1,738 usec      0      0   2560
> 1,328 usec      0      0   2560
> 1,438 usec      0      0   2560
> 1,972 usec      0      0   2560
> 2,963 usec      0      0   2560

So, thank you for generating all of this. I think this all looks as
expected, showing the benefit of your change to allocation under
contention and showing the potential downside in the non-contention
case.

I still worry about the performance impact outside of allocation time
of using the smaller order pages (via map and unmap through iommu to
devices, etc), so it would still be nice to have some confidence this
won't introduce other regressions, but I do agree the worse case
impact is very bad.
> Finally if we change order 4 to use HIGH_ORDER_GFP,
> I expect that we could avoid the very slow cases.
>

Yeah. Again, this all aligns with the upside of your changes, which
I'm eager for.
I just want to have a strong sense of any regressions it might also cause.

I don't mean to discourage you, especially after all the effort here.
Do you think evaluating the before and after impact to buffer usage
(not just allocation) would be doable in the near term?

If you don't think so, given the benefit to allocation under pressure
is large (and I don't mean to give you hurdles to jump), I'm willing
to ack your change to get it merged, but if we later see performance
trouble, I'll be quick to advocate for reverting it. Is that ok?

thanks
-john
>
>
>--------- Original Message ---------
>Sender : John Stultz <jstultz@google.com>
>Date : 2023-02-07 13:37 (GMT+9)
>Title : Re: (2) [PATCH] dma-buf: system_heap: avoid reclaim for order 4
>
>On Sat, Feb 4, 2023 at 7:02 AM Jaewon Kim <jaewon31.kim@samsung.com> wrote:
>> Hello John Stultz, sorry for late reply.
>> I had to manage other urgent things and this test also took some time to finish.
>> Any I hope you to be happy with following my test results.
>>
>>
>> 1. system heap modification
>>
>> To avoid effect of allocation from the pool, all the freed dma
>> buffer were passed to buddy without keeping them in the pool.
>> Some trace_printk and order counting logic were added.
>>
>> 2. the test tool
>>
>> To test the dma-buf system heap allocation speed, I prepared
>> a userspace test program which requests a specified size to a heap.
>> With the program, I tried to request 16 times of 10 MB size and
>> added 1 sleep between each request. Each memory was not freed
>> until the total 16 times total memory was allocated.
>
>Oof. I really appreciate all your effort that I'm sure went in to
>generate these numbers, but this wasn't quite what I was asking for.
>I know you've been focused on allocation performance under memory
>pressure, but I was hoping to see the impact of __using__ order 0
>pages over order 4 pages in real world conditions (for camera or video
>recording or other use cases that use large allocations). These
>results seem to be still just focused on the difference in allocation
>performance between order 0 and order 4 with and without contention.
>
>That said, re-reading my email, I probably could have been more clear
>on this aspect.
>
>
>> 3. the test device
>>
>> The test device has arm64 CPU cores and v5.15 based kernel.
>> To get stable results, the CPU clock was fixed not to be changed
>> in run time, and the test tool was set to some specific CPU cores
>> running in the same CPU clock.
>>
>> 4. test results
>>
>> As we expected if order 4 exist in the buddy, the order 8, 4, 0
>> allocation was 1 to 4 times faster than the order 8, 0, 0. But
>> the order 8, 0, 0 also looks fast enough.
>>
>> Here's time diff, and number of each order.
>>
>> order 8, 4, 0 in the enough order 4 case
>>
>> diff            8      4      0
>> 665 usec        0    160      0
>> 1,148 usec      0    160      0
>> 1,089 usec      0    160      0
>> 1,154 usec      0    160      0
>> 1,264 usec      0    160      0
>> 1,414 usec      0    160      0
>> 873 usec        0    160      0
>> 1,148 usec      0    160      0
>> 1,158 usec      0    160      0
>> 1,139 usec      0    160      0
>> 1,169 usec      0    160      0
>> 1,174 usec      0    160      0
>> 1,210 usec      0    160      0
>> 995 usec        0    160      0
>> 1,151 usec      0    160      0
>> 977 usec        0    160      0
>>
>> order 8, 0, 0 in the enough order 4 case
>>
>> diff            8      4      0
>> 441 usec       10      0      0
>> 747 usec       10      0      0
>> 2,330 usec      2      0   2048
>> 2,469 usec      0      0   2560
>> 2,518 usec      0      0   2560
>> 1,176 usec      0      0   2560
>> 1,487 usec      0      0   2560
>> 1,402 usec      0      0   2560
>> 1,449 usec      0      0   2560
>> 1,330 usec      0      0   2560
>> 1,089 usec      0      0   2560
>> 1,481 usec      0      0   2560
>> 1,326 usec      0      0   2560
>> 3,057 usec      0      0   2560
>> 2,758 usec      0      0   2560
>> 3,271 usec      0      0   2560
>>
>> From the perspective of responsiveness, the deterministic
>> memory allocation speed, I think, is quite important. So I
>> tested other case where the free memory are not enough.
>>
>> On this test, I ran the 16 times allocation sets twice
>> consecutively. Then it showed the first set order 8, 4, 0
>> became very slow and varied, but the second set became
>> faster because of the already created the high order.
>>
>> order 8, 4, 0 in low memory
>>
>> diff            8      4      0
>> 584 usec        0    160      0
>> 28,428 usec     0    160      0
>> 100,701 usec    0    160      0
>> 76,645 usec     0    160      0
>> 25,522 usec     0    160      0
>> 38,798 usec     0    160      0
>> 89,012 usec     0    160      0
>> 23,015 usec     0    160      0
>> 73,360 usec     0    160      0
>> 76,953 usec     0    160      0
>> 31,492 usec     0    160      0
>> 75,889 usec     0    160      0
>> 84,551 usec     0    160      0
>> 84,352 usec     0    160      0
>> 57,103 usec     0    160      0
>> 93,452 usec     0    160      0
>>
>> diff            8      4      0
>> 808 usec       10      0      0
>> 778 usec        4     96      0
>> 829 usec        0    160      0
>> 700 usec        0    160      0
>> 937 usec        0    160      0
>> 651 usec        0    160      0
>> 636 usec        0    160      0
>> 811 usec        0    160      0
>> 622 usec        0    160      0
>> 674 usec        0    160      0
>> 677 usec        0    160      0
>> 738 usec        0    160      0
>> 1,130 usec      0    160      0
>> 677 usec        0    160      0
>> 553 usec        0    160      0
>> 1,048 usec      0    160      0
>>
>>
>> order 8, 0, 0 in low memory
>>
>> diff            8      4      0
>> 1,699 usec      2      0   2048
>> 2,082 usec      0      0   2560
>> 840 usec        0      0   2560
>> 875 usec        0      0   2560
>> 845 usec        0      0   2560
>> 1,706 usec      0      0   2560
>> 967 usec        0      0   2560
>> 1,000 usec      0      0   2560
>> 1,905 usec      0      0   2560
>> 2,451 usec      0      0   2560
>> 3,384 usec      0      0   2560
>> 2,397 usec      0      0   2560
>> 3,171 usec      0      0   2560
>> 2,376 usec      0      0   2560
>> 3,347 usec      0      0   2560
>> 2,554 usec      0      0   2560
>>
>> diff            8      4      0
>> 1,409 usec      2      0   2048
>> 1,438 usec      0      0   2560
>> 1,035 usec      0      0   2560
>> 1,108 usec      0      0   2560
>> 825 usec        0      0   2560
>> 927 usec        0      0   2560
>> 1,931 usec      0      0   2560
>> 2,024 usec      0      0   2560
>> 1,884 usec      0      0   2560
>> 1,769 usec      0      0   2560
>> 2,136 usec      0      0   2560
>> 1,738 usec      0      0   2560
>> 1,328 usec      0      0   2560
>> 1,438 usec      0      0   2560
>> 1,972 usec      0      0   2560
>> 2,963 usec      0      0   2560
>
>So, thank you for generating all of this. I think this all looks as
>expected, showing the benefit of your change to allocation under
>contention and showing the potential downside in the non-contention
>case.
>
>I still worry about the performance impact outside of allocation time
>of using the smaller order pages (via map and unmap through iommu to
>devices, etc), so it would still be nice to have some confidence this
>won't introduce other regressions, but I do agree the worse case
>impact is very bad.
>
>> Finally if we change order 4 to use HIGH_ORDER_GFP,
>> I expect that we could avoid the very slow cases.
>>
>
>Yeah. Again, this all aligns with the upside of your changes, which
>I'm eager for.
>I just want to have a strong sense of any regressions it might also cause.
>
>I don't mean to discourage you, especially after all the effort here.
>Do you think evaluating the before and after impact to buffer usage
>(not just allocation) would be doable in the near term?

Hello, sorry, but I don't have expertise on IOMMUs. Actually, I'm also
wondering whether all IOMMUs can use order 4 free pages if they are
allocated. I am not sure, but I remember hearing that order 9 (2MB)
could be used; I don't know about orders 8 and 4.

I guess IOMMU mapping would follow the same pattern we expect: if order
4 pages are prepared, it could be around 1 to 4 times faster. But I
think it should NOT be that much slower even if the entire free memory
is prepared as order 0 pages.

>
>If you don't think so, given the benefit to allocation under pressure
>is large (and I don't mean to give you hurdles to jump), I'm willing
>to ack your change to get it merged, but if we later see performance
>trouble, I'll be quick to advocate for reverting it. Is that ok?

Yes, sure. I also want to know whether it does.

Thank you

>
>thanks
>-john
>
On Mon, Feb 6, 2023 at 11:33 PM Jaewon Kim <jaewon31.kim@samsung.com> wrote:
> >I don't mean to discourage you, especially after all the effort here.
> >
> >Do you think evaluating the before and after impact to buffer usage
> >(not just allocation) would be doable in the near term?
>
> Hello sorry but I don't have expertise on iommu. Actually I'm also wondering
> all IOMMU can use order 4 free pages, if they are allocated. I am not sure
> but I remember I heard order 9 (2MB) could be used, but I don't know about order 8 4.
>
> I guess IOMMU mmap also be same patern like we expect. I mean if order 4 is
> prepared it could be faster like 1 to 4 times. But it, I think, should NOT be
> that much slow even though the entire free memory is prepared as order 0 pages.
>
> >If you don't think so, given the benefit to allocation under pressure
> >is large (and I don't mean to give you hurdles to jump), I'm willing
> >to ack your change to get it merged, but if we later see performance
> >trouble, I'll be quick to advocate for reverting it. Is that ok?
>
> Yes sure. I also want to know if it is.

Ok. Please resend your latest patch and I'll go ahead and ack it and
we'll watch.

Thanks again for your efforts here!
-john
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index e8bd10e60998..5a405e99ef1e 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -42,11 +42,10 @@ struct dma_heap_attachment {
 };
 
 #define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
-#define MID_ORDER_GFP (LOW_ORDER_GFP | __GFP_NOWARN)
 #define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
 			| __GFP_NORETRY) & ~__GFP_RECLAIM) \
 			| __GFP_COMP)
-static gfp_t order_flags[] = {HIGH_ORDER_GFP, MID_ORDER_GFP, LOW_ORDER_GFP};
+static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
 /*
  * The selection of the orders used for allocation (1MB, 64K, 4K) is designed
  * to match with the sizes often found in IOMMUs. Using order 4 pages instead
Using order 4 pages would be helpful for many IOMMUs, but it can cost
quite a lot of time from the page allocation perspective.

An order 4 allocation with __GFP_RECLAIM may spend much time in the
reclaim and compaction logic. __GFP_NORETRY may also have an effect.
These cause unpredictable delays.

To get a reasonable allocation speed from the dma-buf system heap, use
HIGH_ORDER_GFP for order 4 as well, to avoid reclaim.

Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
---
 drivers/dma-buf/heaps/system_heap.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
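For context on why only the order 4 entry's flags change: the heap walks the orders[] = {8, 4, 0} table and carves each buffer into the largest chunks that still fit, using the matching order_flags[] entry for each attempt (alloc_largest_available() in the kernel's system_heap.c). The following is a simplified userspace model of just that chunking logic, ignoring GFP flags and allocation failure; it reproduces the chunk counts in the measurement tables above (e.g. 10 MB = 2560 pages, so ten order-8 chunks when order 8 never fails):

```c
#include <stddef.h>

static const unsigned int orders[] = {8, 4, 0};

/* Pick the largest order that still fits in the remaining page count
 * and does not exceed max_order (mirrors the selection loop in
 * alloc_largest_available()). */
static unsigned int pick_order(size_t remaining_pages, unsigned int max_order)
{
	size_t i;

	for (i = 0; i < sizeof(orders) / sizeof(orders[0]); i++) {
		if (remaining_pages < (1UL << orders[i]))
			continue;
		if (orders[i] > max_order)
			continue;
		return orders[i];
	}
	return 0;
}

/* Split a buffer of `pages` 4K pages into chunks, counting how many
 * chunks of order 8 / 4 / 0 get used; these are the "8 4 0" columns
 * in the measurements. */
void count_chunks(size_t pages, size_t counts[3])
{
	unsigned int max_order = 8;

	counts[0] = counts[1] = counts[2] = 0;
	while (pages) {
		unsigned int order = pick_order(pages, max_order);

		counts[order == 8 ? 0 : order == 4 ? 1 : 2]++;
		pages -= 1UL << order;
		max_order = order;	/* don't retry larger orders */
	}
}
```

In the real heap a failed attempt at a given order simply falls through to the next smaller order, so the GFP flags in the patch determine how expensive each failed order 8 or order 4 attempt can get before the order 0 fallback: with HIGH_ORDER_GFP (no __GFP_RECLAIM) both mid and high orders fail fast instead of stalling in reclaim.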