[PATCHv5] mm: skip CMA pages when they are not available

Message ID 1685501461-19290-1-git-send-email-zhaoyang.huang@unisoc.com (mailing list archive)
State New

Commit Message

zhaoyang.huang May 31, 2023, 2:51 a.m. UTC
From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

This patch fixes unproductive reclaiming of CMA pages by skipping them when they
are not available for the current context. It arises from the OOM issue below, which
is caused by a large proportion of MIGRATE_CMA pages among the free pages.

[   36.172486] [03-19 10:05:52.172] ActivityManager: page allocation failure: order:0, mode:0xc00(GFP_NOIO), nodemask=(null),cpuset=foreground,mems_allowed=0
[   36.189447] [03-19 10:05:52.189] DMA32: 0*4kB 447*8kB (C) 217*16kB (C) 124*32kB (C) 136*64kB (C) 70*128kB (C) 22*256kB (C) 3*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 35848kB
[   36.193125] [03-19 10:05:52.193] Normal: 231*4kB (UMEH) 49*8kB (MEH) 14*16kB (H) 13*32kB (H) 8*64kB (H) 2*128kB (H) 0*256kB 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 3236kB
...
[   36.234447] [03-19 10:05:52.234] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
[   36.234455] [03-19 10:05:52.234] cache: ext4_io_end, object size: 64, buffer size: 64, default order: 0, min order: 0
[   36.234459] [03-19 10:05:52.234] node 0: slabs: 53,objs: 3392, free: 0

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
v2: update commit message and fix build error when CONFIG_CMA is not set
v3,v4,v5: update code and comments
---
---
 mm/vmscan.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)
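
The free-list lines in the log can be checked by hand. In the DMA32 line every
non-empty order is tagged (C), the tag for MIGRATE_CMA free lists, while the
Normal line carries no (C) tag. The following userspace sketch (not kernel
code; the struct and function names are hypothetical, and the only inputs are
the counts printed in the log) sums both lines and reports how much of the
free memory sits on CMA free lists, which a non-movable request such as the
failing GFP_NOIO one cannot use.

```c
#include <assert.h>
#include <stddef.h>

/* One free-list entry from the log: `count` free blocks of `size_kb` each. */
struct free_entry {
	unsigned long count;
	unsigned long size_kb;
};

/* DMA32 line of the log: every non-empty order is tagged (C). */
static const struct free_entry dma32_free[] = {
	{ 0, 4 }, { 447, 8 }, { 217, 16 }, { 124, 32 }, { 136, 64 },
	{ 70, 128 }, { 22, 256 }, { 3, 512 },
};

/* Normal line of the log: no order is tagged (C). */
static const struct free_entry normal_free[] = {
	{ 231, 4 }, { 49, 8 }, { 14, 16 }, { 13, 32 }, { 8, 64 },
	{ 2, 128 }, { 0, 256 }, { 1, 512 },
};

/* Sum count * size over all entries: the zone's free memory in kB. */
static unsigned long zone_free_kb(const struct free_entry *e, size_t n)
{
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++)
		total += e[i].count * e[i].size_kb;
	return total;
}

/* Integer percentage (rounded down) of free memory on CMA free lists. */
static unsigned long cma_share_percent(void)
{
	unsigned long cma = zone_free_kb(dma32_free,
			sizeof(dma32_free) / sizeof(dma32_free[0]));
	unsigned long other = zone_free_kb(normal_free,
			sizeof(normal_free) / sizeof(normal_free[0]));

	assert(cma + other > 0);
	return cma * 100 / (cma + other);
}
```

The two sums reproduce the totals printed at the end of each log line
(35848kB and 3236kB), so roughly nine tenths of the free memory shown is CMA.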

Comments

Andrew Morton June 9, 2023, 10:35 p.m. UTC | #1
On Wed, 31 May 2023 10:51:01 +0800 "zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:

> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> This patch fixes unproductive reclaiming of CMA pages by skipping them when they
> are not available for the current context. It arises from the OOM issue below, which
> is caused by a large proportion of MIGRATE_CMA pages among the free pages.
> 
> [   36.172486] [03-19 10:05:52.172] ActivityManager: page allocation failure: order:0, mode:0xc00(GFP_NOIO), nodemask=(null),cpuset=foreground,mems_allowed=0
> [   36.189447] [03-19 10:05:52.189] DMA32: 0*4kB 447*8kB (C) 217*16kB (C) 124*32kB (C) 136*64kB (C) 70*128kB (C) 22*256kB (C) 3*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 35848kB
> [   36.193125] [03-19 10:05:52.193] Normal: 231*4kB (UMEH) 49*8kB (MEH) 14*16kB (H) 13*32kB (H) 8*64kB (H) 2*128kB (H) 0*256kB 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 3236kB
> ...
> [   36.234447] [03-19 10:05:52.234] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
> [   36.234455] [03-19 10:05:52.234] cache: ext4_io_end, object size: 64, buffer size: 64, default order: 0, min order: 0
> [   36.234459] [03-19 10:05:52.234] node 0: slabs: 53,objs: 3392, free: 0
> 

We saw plenty of feedback for earlier versions, but now silence.  Does
this mean we're all OK with v5?
Matthew Wilcox June 10, 2023, 1:51 a.m. UTC | #2
On Fri, Jun 09, 2023 at 03:35:19PM -0700, Andrew Morton wrote:
> > This patch fixes unproductive reclaiming of CMA pages by skipping them when they
> > are not available for the current context. It arises from the OOM issue below, which
> > is caused by a large proportion of MIGRATE_CMA pages among the free pages.
> 
> We saw plenty of feedback for earlier versions, but now silence.  Does
> this mean we're all OK with v5?

I'm fine with the implementation now.  I have no idea if this is the right
approach.
David Hildenbrand June 12, 2023, 9:29 a.m. UTC | #3
On 10.06.23 00:35, Andrew Morton wrote:
> On Wed, 31 May 2023 10:51:01 +0800 "zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:
> 
>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>>
>> This patch fixes unproductive reclaiming of CMA pages by skipping them when they
>> are not available for the current context. It arises from the OOM issue below, which
>> is caused by a large proportion of MIGRATE_CMA pages among the free pages.
>>
>> [   36.172486] [03-19 10:05:52.172] ActivityManager: page allocation failure: order:0, mode:0xc00(GFP_NOIO), nodemask=(null),cpuset=foreground,mems_allowed=0
>> [   36.189447] [03-19 10:05:52.189] DMA32: 0*4kB 447*8kB (C) 217*16kB (C) 124*32kB (C) 136*64kB (C) 70*128kB (C) 22*256kB (C) 3*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 35848kB
>> [   36.193125] [03-19 10:05:52.193] Normal: 231*4kB (UMEH) 49*8kB (MEH) 14*16kB (H) 13*32kB (H) 8*64kB (H) 2*128kB (H) 0*256kB 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 3236kB
>> ...
>> [   36.234447] [03-19 10:05:52.234] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
>> [   36.234455] [03-19 10:05:52.234] cache: ext4_io_end, object size: 64, buffer size: 64, default order: 0, min order: 0
>> [   36.234459] [03-19 10:05:52.234] node 0: slabs: 53,objs: 3392, free: 0
>>
> 
> We saw plenty of feedback for earlier versions, but now silence.  Does
> this mean we're all OK with v5?

The logic kind-of makes sense to me (but the kswapd special-casing 
already shows that it might be a bit fragile for future use), but I did 
not yet figure out if this actually fixes something or is a pure 
performance improvement.

As we phrased it in the comment "It is waste of effort", but in the 
patch description "This patch fixes unproductive reclaiming" + a scary 
dmesg.

Am I correct that this is a pure performance optimization (and the issue 
revealed itself in that OOM report), or does this actually *fix* something?

If it's a performance improvement, it would be good to show that it is 
an actual improvement worth the churn ...
Zhaoyang Huang June 12, 2023, 9:35 a.m. UTC | #4
On Mon, Jun 12, 2023 at 5:29 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 10.06.23 00:35, Andrew Morton wrote:
> > On Wed, 31 May 2023 10:51:01 +0800 "zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:
> >
> >> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >>
> >> This patch fixes unproductive reclaiming of CMA pages by skipping them when they
> >> are not available for the current context. It arises from the OOM issue below, which
> >> is caused by a large proportion of MIGRATE_CMA pages among the free pages.
> >>
> >> [   36.172486] [03-19 10:05:52.172] ActivityManager: page allocation failure: order:0, mode:0xc00(GFP_NOIO), nodemask=(null),cpuset=foreground,mems_allowed=0
> >> [   36.189447] [03-19 10:05:52.189] DMA32: 0*4kB 447*8kB (C) 217*16kB (C) 124*32kB (C) 136*64kB (C) 70*128kB (C) 22*256kB (C) 3*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 35848kB
> >> [   36.193125] [03-19 10:05:52.193] Normal: 231*4kB (UMEH) 49*8kB (MEH) 14*16kB (H) 13*32kB (H) 8*64kB (H) 2*128kB (H) 0*256kB 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 3236kB
> >> ...
> >> [   36.234447] [03-19 10:05:52.234] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
> >> [   36.234455] [03-19 10:05:52.234] cache: ext4_io_end, object size: 64, buffer size: 64, default order: 0, min order: 0
> >> [   36.234459] [03-19 10:05:52.234] node 0: slabs: 53,objs: 3392, free: 0
> >>
> >
> > We saw plenty of feedback for earlier versions, but now silence.  Does
> > this mean we're all OK with v5?
>
> The logic kind-of makes sense to me (but the kswapd special-casing
> already shows that it might be a bit fragile for future use), but I did
> not yet figure out if this actually fixes something or is a pure
> performance improvement.
>
> As we phrased it in the comment "It is waste of effort", but in the
> patch description "This patch fixes unproductive reclaiming" + a scary
> dmesg.
>
> Am I correct that this is a pure performance optimization (and the issue
> revealed itself in that OOM report), or does this actually *fix* something?
>
> If it's a performance improvement, it would be good to show that it is
> an actual improvement worth the churn ...
Sorry for the confusion. As for the OOM issue, the previous
commit (https://lkml.kernel.org/r/1683782550-25799-1-git-send-email-zhaoyang.huang@unisoc.com)
helped decrease the failure rate from 12/20 to 2/20, which drops to
0 when this patch is applied.
>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand June 12, 2023, 10:01 a.m. UTC | #5
On 12.06.23 11:35, Zhaoyang Huang wrote:
> On Mon, Jun 12, 2023 at 5:29 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 10.06.23 00:35, Andrew Morton wrote:
>>> On Wed, 31 May 2023 10:51:01 +0800 "zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:
>>>
>>>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>>>>
>>>> This patch fixes unproductive reclaiming of CMA pages by skipping them when they
>>>> are not available for the current context. It arises from the OOM issue below, which
>>>> is caused by a large proportion of MIGRATE_CMA pages among the free pages.
>>>>
>>>> [   36.172486] [03-19 10:05:52.172] ActivityManager: page allocation failure: order:0, mode:0xc00(GFP_NOIO), nodemask=(null),cpuset=foreground,mems_allowed=0
>>>> [   36.189447] [03-19 10:05:52.189] DMA32: 0*4kB 447*8kB (C) 217*16kB (C) 124*32kB (C) 136*64kB (C) 70*128kB (C) 22*256kB (C) 3*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 35848kB
>>>> [   36.193125] [03-19 10:05:52.193] Normal: 231*4kB (UMEH) 49*8kB (MEH) 14*16kB (H) 13*32kB (H) 8*64kB (H) 2*128kB (H) 0*256kB 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 3236kB
>>>> ...
>>>> [   36.234447] [03-19 10:05:52.234] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
>>>> [   36.234455] [03-19 10:05:52.234] cache: ext4_io_end, object size: 64, buffer size: 64, default order: 0, min order: 0
>>>> [   36.234459] [03-19 10:05:52.234] node 0: slabs: 53,objs: 3392, free: 0
>>>>
>>>
>>> We saw plenty of feedback for earlier versions, but now silence.  Does
>>> this mean we're all OK with v5?
>>
>> The logic kind-of makes sense to me (but the kswapd special-casing
>> already shows that it might be a bit fragile for future use), but I did
>> not yet figure out if this actually fixes something or is a pure
>> performance improvement.
>>
>> As we phrased it in the comment "It is waste of effort", but in the
>> patch description "This patch fixes unproductive reclaiming" + a scary
>> dmesg.
>>
>> Am I correct that this is a pure performance optimization (and the issue
>> revealed itself in that OOM report), or does this actually *fix* something?
>>
>> If it's a performance improvement, it would be good to show that it is
>> an actual improvement worth the churn ...
> Sorry for the confusion. As for the OOM issue, the previous
> commit (https://lkml.kernel.org/r/1683782550-25799-1-git-send-email-zhaoyang.huang@unisoc.com)
> helped decrease the failure rate from 12/20 to 2/20, which drops to
> 0 when this patch is applied.

Thanks! Can we make that clearer in the patch description? I'm 
struggling a bit myself to find the right words.

Something like

"This change further decreases the chance for wrong OOMs in the presence 
of a lot of CMA memory."

?

In any case

Acked-by: David Hildenbrand <david@redhat.com>
Andrew Morton June 12, 2023, 8:56 p.m. UTC | #6
On Mon, 12 Jun 2023 12:01:20 +0200 David Hildenbrand <david@redhat.com> wrote:

> ...
>
> >>
> >> If it's a performance improvement, it would be good to show that it is
> >> an actual improvement worth the churn ...
> > Sorry for the confusion. As for the OOM issue, the previous
> > commit (https://lkml.kernel.org/r/1683782550-25799-1-git-send-email-zhaoyang.huang@unisoc.com)
> > helped decrease the failure rate from 12/20 to 2/20, which drops to
> > 0 when this patch is applied.
> 
> Thanks! Can we make that clearer in the patch description? I'm 
> struggling a bit myself to find the right words.
> 
> Something like
> 
> "This change further decreases the chance for wrong OOMs in the presence 
> of a lot of CMA memory."
> 

Great, I added that.

> 
> In any case
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 

And I'll move this patch into mm-stable.
Breno Leitao Aug. 13, 2024, 9:49 a.m. UTC | #7
On Wed, May 31, 2023 at 10:51:01AM +0800, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> This patch fixes unproductive reclaiming of CMA pages by skipping them when they
> are not available for the current context. It arises from the OOM issue below, which
> is caused by a large proportion of MIGRATE_CMA pages among the free pages.

Hello,

I've been looking into a problem with high memory pressure causing OOMs
in some of our workloads, and it seems that this change may have
introduced lock contention when there is high memory pressure. 

I've collected some metrics for my specific workload that suggest this
change has increased the lruvec->lru_lock waittime-max by 500x and the
waittime-avg by 20x.

Experiment
==========

The experiment involved 100 hosts, each with 64GB of memory and a single
Xeon 8321HC CPU. The experiment ran for over 80 hours.

Half of the hosts (50) were configured with the patch reverted and lock
stat enabled, while the other half was run against the upstream version.
All machines had hugetlb_cma=6G set as a command-line argument.

In this context, "upstream" refers to kernel release 6.9 with some minor
changes that should not impact the results.

Workload
========

The workload is a Java-based application that fully utilizes the memory;
in fact, the JVM runs with the `-Xms50735m -Xmx50735m` arguments.

Results:
=======

A few values from lockstat:

                   waittime-max   waittime-total   waittime-avg   holdtime-max
6.9:                     242889      15618873933            715          17485
6.9-with-revert:            487        688563299             34            464

The full data can be seen at:
https://docs.google.com/spreadsheets/d/1Dl-8ImlE4OZrfKjbyWAIWWuQtgD3fwEEl9INaZQZ4e8/edit?usp=sharing
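
The "500x" and "20x" figures quoted earlier follow from the lockstat rows
above. As a rough check, the sketch below divides each 6.9 value by its
6.9-with-revert counterpart. It is plain userspace C; the struct and function
names are made up for this illustration and the only data are the eight
numbers from the table.

```c
#include <assert.h>

/* lock_stat figures for lruvec->lru_lock, copied from the table. */
struct lock_stat_sample {
	unsigned long long waittime_max;
	unsigned long long waittime_total;
	unsigned long long waittime_avg;
	unsigned long long holdtime_max;
};

static const struct lock_stat_sample upstream_6_9 = {
	242889ULL, 15618873933ULL, 715ULL, 17485ULL,
};

static const struct lock_stat_sample with_revert = {
	487ULL, 688563299ULL, 34ULL, 464ULL,
};

/* Integer ratio a / b, rounded down; b must be non-zero. */
static unsigned long long ratio(unsigned long long a, unsigned long long b)
{
	assert(b != 0);
	return a / b;
}
```

Calling `ratio(upstream_6_9.waittime_max, with_revert.waittime_max)` gives
498, and the same call on `waittime_avg` gives 21, in line with the rounded
figures in the message.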

Possible causes:
================

I've been discussing this with colleagues and we're speculating that the
high contention might be linked to the fact that CMA regions are now
being skipped. This could potentially extend the duration of the
isolate_lru_folios() 'while' loop, resulting in increased pressure on
the lock.

However, I want to emphasize that I'm not an expert in this
area and I am simply sharing the data I collected.
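
The speculation about the isolate_lru_folios() loop can be illustrated with a
toy model. This is a userspace sketch, not the kernel loop: it assumes that
every folio is a single page and that skipped folios do not count against the
scan budget, so the walk continues until enough eligible folios were seen.
The function names and the LRU layout are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy scan loop: walk the list from index 0 and stop once `nr_to_scan`
 * eligible folios were taken or the list is exhausted. is_cma[i] marks
 * folio i as sitting in a MIGRATE_CMA pageblock. Returns the number of
 * loop iterations, a stand-in for time spent holding the LRU lock.
 */
static size_t scan_iterations(const bool *is_cma, size_t nr_folios,
			      size_t nr_to_scan, bool skip_cma)
{
	size_t scanned = 0, iterations = 0;

	for (size_t i = 0; i < nr_folios && scanned < nr_to_scan; i++) {
		iterations++;
		if (skip_cma && is_cma[i])
			continue;	/* assumed: skipped folios use no budget */
		scanned++;
	}
	return iterations;
}

/* Build an LRU where only every `cma_every`-th folio is not CMA. */
static void fill_lru(bool *is_cma, size_t nr_folios, size_t cma_every)
{
	assert(cma_every > 0);
	for (size_t i = 0; i < nr_folios; i++)
		is_cma[i] = (i % cma_every) != 0;
}
```

With 1024 folios, `cma_every = 8` and a budget of 32, the model needs 32
iterations without skipping and 249 with skipping. Whether the real workload
behaves this way is exactly the open question raised in the message.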

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd6637f..972a54d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2193,6 +2193,25 @@  static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 
 }
 
+#ifdef CONFIG_CMA
+/*
+ * It is waste of effort to scan and reclaim CMA pages if it is not available
+ * for current allocation context. Kswapd can not be enrolled as it can not
+ * distinguish this scenario by using sc->gfp_mask = GFP_KERNEL
+ */
+static bool skip_cma(struct folio *folio, struct scan_control *sc)
+{
+	return !current_is_kswapd() &&
+			gfp_migratetype(sc->gfp_mask) != MIGRATE_MOVABLE &&
+			get_pageblock_migratetype(&folio->page) == MIGRATE_CMA;
+}
+#else
+static bool skip_cma(struct folio *folio, struct scan_control *sc)
+{
+	return false;
+}
+#endif
+
 /*
  * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
  *
@@ -2239,7 +2258,8 @@  static unsigned long isolate_lru_folios(unsigned long nr_to_scan,
 		nr_pages = folio_nr_pages(folio);
 		total_scan += nr_pages;
 
-		if (folio_zonenum(folio) > sc->reclaim_idx) {
+		if (folio_zonenum(folio) > sc->reclaim_idx ||
+				skip_cma(folio, sc)) {
 			nr_skipped[folio_zonenum(folio)] += nr_pages;
 			move_to = &folios_skipped;
 			goto move;
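
The predicate added by the patch combines three conditions. A minimal
userspace model of that truth table is sketched here; the enum and function
names are hypothetical stand-ins, not kernel symbols, and the three
parameters replace current_is_kswapd(), gfp_migratetype(sc->gfp_mask) and
get_pageblock_migratetype(&folio->page) respectively.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's migratetype values. */
enum model_migratetype {
	MODEL_UNMOVABLE,
	MODEL_MOVABLE,
	MODEL_RECLAIMABLE,
	MODEL_CMA,
};

/* Same boolean structure as skip_cma() in the patch. */
static bool skip_cma_model(bool is_kswapd,
			   enum model_migratetype alloc_type,
			   enum model_migratetype pageblock_type)
{
	return !is_kswapd &&
	       alloc_type != MODEL_MOVABLE &&
	       pageblock_type == MODEL_CMA;
}

/* Walk through the cases: only direct reclaim for a non-movable request skips CMA. */
static void skip_cma_model_selfcheck(void)
{
	/* Direct reclaim, non-movable request, CMA folio: skipped. */
	assert(skip_cma_model(false, MODEL_UNMOVABLE, MODEL_CMA));
	/* kswapd never skips, whatever the request type. */
	assert(!skip_cma_model(true, MODEL_UNMOVABLE, MODEL_CMA));
	/* Movable requests can use CMA, so nothing is skipped. */
	assert(!skip_cma_model(false, MODEL_MOVABLE, MODEL_CMA));
	/* Non-CMA folios are never skipped by this check. */
	assert(!skip_cma_model(false, MODEL_UNMOVABLE, MODEL_MOVABLE));
}
```

The kswapd case corresponds to the "kswapd special-casing" mentioned in the
review comments.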