Message ID | 20241026033625.2237102-1-yuzhao@google.com (mailing list archive)
---|---
State | New
Series | [mm-unstable,v2] mm/page_alloc: keep track of free highatomic
On Fri, 25 Oct 2024 21:36:25 -0600 Yu Zhao <yuzhao@google.com> wrote:

> OOM kills due to vastly overestimated free highatomic reserves were
> observed:
>
> ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
> Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
> Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB

Under what circumstances?

> The second line above shows that the OOM kill was due to the following
> condition:
>
> free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
>
> And the third line shows there were no free pages in any
> MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> 'H'. Therefore __zone_watermark_unusable_free() underestimated the
> usable free memory by over 1GB, which resulted in the unnecessary OOM
> kill above.
>
> The comments in __zone_watermark_unusable_free() warns about the
> potential risk, i.e.,
>
> If the caller does not have rights to reserves below the min
> watermark then subtract the high-atomic reserves. This will
> over-estimate the size of the atomic reserve but it avoids a search.
>
> However, it is possible to keep track of free pages in reserved
> highatomic pageblocks with a new per-zone counter nr_free_highatomic
> protected by the zone lock, to avoid a search when calculating the
> usable free memory. And the cost would be minimal, i.e., simple
> arithmetics in the highatomic alloc/free/move paths.

Is a -stable backport needed?

If so, is a Fixes: target identifiable?
On Fri, Oct 25, 2024 at 10:24 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 25 Oct 2024 21:36:25 -0600 Yu Zhao <yuzhao@google.com> wrote:
>
> > OOM kills due to vastly overestimated free highatomic reserves were
> > observed:
> >
> > [...]
>
> Under what circumstances?

Link wrote a repro, so he'd be the best person to answer this question.
Link, please help. Thanks.

> > [...]
>
> Is a -stable backport needed?
>
> If so, is a Fixes: target identifiable?

The code has been there for many years, and we only recently noticed
the problem from Link's repro. So it doesn't look like stable
material.
On Fri, 25 Oct 2024, Yu Zhao wrote:

> OOM kills due to vastly overestimated free highatomic reserves were
> observed:
>
> [...]
>
> Reported-by: Link Lin <linkl@google.com>
> Signed-off-by: Yu Zhao <yuzhao@google.com>

Acked-by: David Rientjes <rientjes@google.com>
On 10/26/24 06:40, Yu Zhao wrote:
> On Fri, Oct 25, 2024 at 10:24 PM Andrew Morton
> <akpm@linux-foundation.org> wrote:
>>
>> Is a -stable backport needed?
>>
>> If so, is a Fixes: target identifiable?
>
> The code has been there for many years, and we only recently noticed
> the problem from Link's repro. So it doesn't look like stable
> material.

The stable backportability would be limited as there's a prerequisite on the
"mm: page_alloc: freelist migratetype hygiene" patchset from half a year ago
- patch applicability wise and importantly functionality wise. Otherwise the
counter could drift easily.

However we could perhaps mark it for a 6.12 stable backport as that's
presumably the upcoming LTS.
On 10/26/24 05:36, Yu Zhao wrote:
> OOM kills due to vastly overestimated free highatomic reserves were
> observed:
>
> [...]
>
> However, it is possible to keep track of free pages in reserved
> highatomic pageblocks with a new per-zone counter nr_free_highatomic
> protected by the zone lock, to avoid a search when calculating the

It's only possible to track this reliably since the "mm: page_alloc:
freelist migratetype hygiene" patchset was merged, which explains why
nr_reserved_highatomic was used until now, even if it's imprecise. But I
agree it should be a good solution to the problem now.

> usable free memory. And the cost would be minimal, i.e., simple
> arithmetics in the highatomic alloc/free/move paths.
>
> Reported-by: Link Lin <linkl@google.com>
> Signed-off-by: Yu Zhao <yuzhao@google.com>

However, implementation wise that patchset also centralized everything to
account_freepages(), which currently handles NR_FREE_PAGES and
NR_FREE_CMA_PAGES, and your patch differs from that approach. Can't you
simply add a NR_FREE_HIGHATOMIC update there instead of creating
account_highatomic_freepages(), which is hooked to 3 different places?

And for consistency indeed add a zone_stat_item counter instead of a
struct zone field? Then it becomes visible in e.g. /proc/zoneinfo and
/proc/vmstat too. Bonus points for printing it in the oom/alloc failure
reports as well (wherever we print NR_FREE_CMA_PAGES).

Thanks!
Vlastimil

> ---
>  include/linux/mmzone.h |  1 +
>  mm/page_alloc.c        | 23 ++++++++++++++++++++---
>  2 files changed, 21 insertions(+), 3 deletions(-)
>
> [...]
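[Editorial note: for readers following along, the zone_stat_item alternative suggested above would roughly mean extending the existing per-zone vmstat counters rather than adding a struct zone field. The sketch below is an illustration of that suggestion only, not what the patch does; NR_FREE_HIGHATOMIC and its string are hypothetical names, and the exact placement would be up to the author.]

/* include/linux/mmzone.h -- sketch, next to the existing free-page items */
enum zone_stat_item {
	NR_FREE_PAGES,
	/* ... */
	NR_FREE_CMA_PAGES,
	NR_FREE_HIGHATOMIC,	/* hypothetical new item */
	/* ... */
	NR_VM_ZONE_STAT_ITEMS
};

/* mm/vmstat.c -- sketch; the string would then show up in /proc/vmstat and /proc/zoneinfo */
const char * const vmstat_text[] = {
	"nr_free_pages",
	/* ... */
	"nr_free_cma",
	"nr_free_highatomic",	/* hypothetical */
	/* ... */
};

The counter itself would then be updated with __mod_zone_page_state(zone, NR_FREE_HIGHATOMIC, nr_pages) in the same place the CMA counter is updated.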
On Sun, Oct 27, 2024 at 1:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/26/24 06:40, Yu Zhao wrote:
> > On Fri, Oct 25, 2024 at 10:24 PM Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> >>
> >> Is a -stable backport needed?
> >>
> >> If so, is a Fixes: target identifiable?
> >
> > The code has been there for many years, and we only recently noticed
> > the problem from Link's repro. So it doesn't look like stable
> > material.
>
> The stable backportability would be limited as there's a prerequisite on the
> "mm: page_alloc: freelist migratetype hygiene" patchset from half a year ago
> - patch applicability wise and importantly functionality wise.

I don't see any dependency on that series.

> Otherwise the
> counter could drift easily.

Care to elaborate how that could happen?

> However we could perhaps mark it for a 6.12 stable backport as that's
> presumably the upcoming LTS.

Right.
On Sun, Oct 27, 2024 at 1:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/26/24 05:36, Yu Zhao wrote:
> > OOM kills due to vastly overestimated free highatomic reserves were
> > observed:
> >
> > [...]
> >
> > However, it is possible to keep track of free pages in reserved
> > highatomic pageblocks with a new per-zone counter nr_free_highatomic
> > protected by the zone lock, to avoid a search when calculating the
>
> It's only possible to track this reliably since the "mm: page_alloc:
> freelist migratetype hygiene" patchset was merged, which explains why
> nr_reserved_highatomic was used until now, even if it's imprecise.

I just refreshed my memory by quickly going through the discussion
around that series and didn't find anything that helps me understand
the above. More pointers please?
On 10/27/24 21:17, Yu Zhao wrote:
> On Sun, Oct 27, 2024 at 1:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 10/26/24 05:36, Yu Zhao wrote:
>> > [...]
>>
>> It's only possible to track this reliably since the "mm: page_alloc:
>> freelist migratetype hygiene" patchset was merged, which explains why
>> nr_reserved_highatomic was used until now, even if it's imprecise.
>
> I just refreshed my memory by quickly going through the discussion
> around that series and didn't find anything that helps me understand
> the above. More pointers please?

For example:

- a page is on pcplist in MIGRATE_MOVABLE list
- we reserve its pageblock as highatomic, which does nothing to the page on
  the pcplist
- page above is flushed from pcplist to zone freelist, but it remembers it
  was MIGRATE_MOVABLE, merges with another buddy/buddies from the
  now-highatomic list, the resulting order-X page ends up on the movable
  freelist despite being in highatomic pageblock. The counter of free
  highatomic is now wrong wrt the freelist reality

The series has addressed various scenarios like that, where a page can end up
on the wrong freelist.
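[Editorial note: as a toy model of the first three bullets above, the following standalone userspace C sketch may help. It is purely illustrative; all names are invented and none of this is kernel code. It models a page whose pcplist entry cached MIGRATE_MOVABLE being freed into a pageblock that has since been reserved as highatomic: trusting the cached type (roughly the pre-hygiene flush behaviour) puts the page on the movable freelist, while looking up the pageblock's current type at free time (roughly the post-hygiene behaviour) puts it on the highatomic one.]

#include <stdio.h>

enum mt { MOVABLE, HIGHATOMIC, NR_MT };

static const char * const mt_name[NR_MT] = { "movable", "highatomic" };
static long free_count[NR_MT];	/* pages sitting on each modeled zone freelist */
static enum mt pageblock_mt;	/* current migratetype of the pageblock */

struct fake_page {
	enum mt cached_mt;	/* type remembered when the page entered the pcplist */
};

/* Flush that trusts the migratetype cached at pcplist-add time. */
static void flush_stale(struct fake_page *p)
{
	free_count[p->cached_mt]++;
}

/* Flush that looks up the pageblock's current migratetype instead. */
static void flush_hygienic(struct fake_page *p)
{
	(void)p;
	free_count[pageblock_mt]++;
}

int main(void)
{
	struct fake_page page = { .cached_mt = MOVABLE };

	/* The pageblock is reserved as highatomic while the page is on a pcplist. */
	pageblock_mt = HIGHATOMIC;

	flush_stale(&page);
	printf("stale type:    movable=%ld highatomic=%ld (pageblock is %s)\n",
	       free_count[MOVABLE], free_count[HIGHATOMIC], mt_name[pageblock_mt]);

	free_count[MOVABLE] = free_count[HIGHATOMIC] = 0;

	flush_hygienic(&page);
	printf("pageblock type: movable=%ld highatomic=%ld (pageblock is %s)\n",
	       free_count[MOVABLE], free_count[HIGHATOMIC], mt_name[pageblock_mt]);
	return 0;
}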
On Sun, Oct 27, 2024 at 2:36 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/27/24 21:17, Yu Zhao wrote:
> > On Sun, Oct 27, 2024 at 1:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> [...]
> >
> > I just refreshed my memory by quickly going through the discussion
> > around that series and didn't find anything that helps me understand
> > the above. More pointers please?
>
> For example:
>
> - a page is on pcplist in MIGRATE_MOVABLE list
> - we reserve its pageblock as highatomic, which does nothing to the page on
> the pcplist
> - page above is flushed from pcplist to zone freelist, but it remembers it
> was MIGRATE_MOVABLE, merges with another buddy/buddies from the
> now-highatomic list, the resulting order-X page ends up on the movable
> freelist despite being in highatomic pageblock. The counter of free
> highatomic is now wrong wrt the freelist reality

This is the part I don't follow: how is it wrong w.r.t. the freelist
reality? The new nr_free_highatomic should reflect how many pages are
exactly on free_list[MIGRATE_HIGHATOMIC], because it's updated
accordingly.

(My current understanding is that, in this case, the reservation
itself is messed up, i.e., under-reserved.)

> The series has addressed various scenarios like that, where page can end up
> on the wrong freelist.
On 10/27/24 21:51, Yu Zhao wrote:
> On Sun, Oct 27, 2024 at 2:36 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> For example:
>>
>> [...]
>>
>> freelist despite being in highatomic pageblock. The counter of free
>> highatomic is now wrong wrt the freelist reality
>
> This is the part I don't follow: how is it wrong w.r.t. the freelist
> reality? The new nr_free_highatomic should reflect how many pages are
> exactly on free_list[MIGRATE_HIGHATOMIC], because it's updated
> accordingly.

You'd have to try implementing your change in the kernel without that
migratetype hygiene series, and see how it would either not work, or you'd
end up implementing the series as part of that.

> (My current understanding is that, in this case, the reservation
> itself is messed up, i.e., under-reserved.)
>
>> The series has addressed various scenarios like that, where page can end up
>> on the wrong freelist.
On Sun, Oct 27, 2024 at 3:05 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/27/24 21:51, Yu Zhao wrote:
> > On Sun, Oct 27, 2024 at 2:36 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> [...]
> >
> > This is the part I don't follow: how is it wrong w.r.t. the freelist
> > reality? The new nr_free_highatomic should reflect how many pages are
> > exactly on free_list[MIGRATE_HIGHATOMIC], because it's updated
> > accordingly.
>
> You'd have to try implementing your change in the kernel without that
> migratetype hygiene series, and see how it would either not work, or you'd
> end up implementing the series as part of that.

A proper backport would need to track the MT of the free_list a page
is deleted from, and I see no reason why in such a proper backport
"the counter could drift easily" or "the counter of free highatomic is
now wrong wrt the freelist reality". So I assume you actually mean
this patch can't be backported cleanly? (Which I do agree.)
On 10/28/24 1:24 AM, Yu Zhao wrote:
> On Sun, Oct 27, 2024 at 3:05 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> [...]
>>
>> You'd have to try implementing your change in the kernel without that
>> migratetype hygiene series, and see how it would either not work, or you'd
>> end up implementing the series as part of that.
>
> A proper backport would need to track the MT of the free_list a page
> is deleted from, and I see no reason why in such a proper backport
> "the counter could drift easily" or "the counter of free highatomic is
> now wrong wrt the freelist reality". So I assume you actually mean
> this patch can't be backported cleanly? (Which I do agree.)

Yes you're right. But since we don't plan to backport it beyond 6.12,
sorry for sidetracking the discussion unnecessarily. More importantly,
is it possible to change the implementation as I suggested? [1] Hooking
to __del_page_from_free_list() and __add_to_free_list() means extra work
in every loop iteration in expand() and __free_one_page(). The
migratetype hygiene should ensure it's not necessary to intercept every
freelist add/move and hooking to account_freepages() should be
sufficient and in line with the intended design.

Thanks,
Vlastimil

[1] https://lore.kernel.org/all/37a28ef7-e477-40b0-a8e4-3d74b747e323@suse.cz/
On Mon, Oct 28, 2024 at 5:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/28/24 1:24 AM, Yu Zhao wrote:
> > [...]
>
> Yes you're right. But since we don't plan to backport it beyond 6.12,
> sorry for sidetracking the discussion unnecessarily. More importantly,
> is it possible to change the implementation as I suggested?

The only reason I didn't fold account_highatomic_freepages() into
account_freepages() is because the former must be called under the
zone lock, which is also how the latter is called but not as a
requirement.

I understand where you come from when suggesting a new per-cpu counter
for free highatomic. I have to disagree with that because 1) free
highatomic is relatively small and drifting might defeat its purpose;
2) per-cpu memory is among the top kernel memory overhead in our fleet
-- it really adds up. So I prefer not to use per-cpu counters unless
necessary.

So if it's ok with you, I'll just fold account_highatomic_freepages()
into account_freepages(), but keep the counter as per zone, not per
cpu.

> [1] Hooking
> to __del_page_from_free_list() and __add_to_free_list() means extra work
> in every loop iteration in expand() and __free_one_page(). The
> migratetype hygiene should ensure it's not necessary to intercept every
> freelist add/move and hooking to account_freepages() should be
> sufficient and in line with the intended design.

Agreed.
On 10/28/24 18:54, Yu Zhao wrote:
> On Mon, Oct 28, 2024 at 5:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> Yes you're right. But since we don't plan to backport it beyond 6.12,
>> sorry for sidetracking the discussion unnecessarily. More importantly,
>> is it possible to change the implementation as I suggested?
>
> The only reason I didn't fold account_highatomic_freepages() into
> account_freepages() is because the former must be called under the
> zone lock, which is also how the latter is called but not as a
> requirement.

Ah, I guess we can document the requirement/add a lockdep assert. Using
__mod_zone_page_state() already implies some context restrictions
although not zone lock specifically.

> I understand where you come from when suggesting a new per-cpu counter
> for free highatomic. I have to disagree with that because 1) free
> highatomic is relatively small and drifting might defeat its purpose;
> 2) per-cpu memory is among the top kernel memory overhead in our fleet
> -- it really adds up. So I prefer not to use per-cpu counters unless
> necessary.

OK, didn't think of these drawbacks.

> So if it's ok with you, I'll just fold account_highatomic_freepages()
> into account_freepages(), but keep the counter as per zone, not per
> cpu.

OK, thanks!

>> [1] Hooking
>> to __del_page_from_free_list() and __add_to_free_list() means extra work
>> in every loop iteration in expand() and __free_one_page(). The
>> migratetype hygiene should ensure it's not necessary to intercept every
>> freelist add/move and hooking to account_freepages() should be
>> sufficient and in line with the intended design.
>
> Agreed.
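[Editorial note: the fold agreed on above could look roughly like the sketch below. This is an illustration only, based on the account_freepages() visible in the patch hunks plus the lockdep assert suggested in this exchange; the actual follow-up version may differ. Since callers pass negative nr_pages on the removal side (e.g. account_freepages(zone, -(1 << order), old_mt)), one signed update covers both directions.]

/* Sketch only, not the actual follow-up change. */
static inline void account_freepages(struct zone *zone, int nr_pages,
				     int migratetype)
{
	/* All freelist accounting happens under the zone lock. */
	lockdep_assert_held(&zone->lock);

	if (is_migrate_isolate(migratetype))
		return;

	__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);

	if (is_migrate_cma(migratetype))
		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
	else if (is_migrate_highatomic(migratetype))
		/* Writer holds the zone lock; readers pair with READ_ONCE(). */
		WRITE_ONCE(zone->nr_free_highatomic,
			   zone->nr_free_highatomic + nr_pages);
}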
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2e8c4307c728..5e8f567753bd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -825,6 +825,7 @@ struct zone {
 	unsigned long watermark_boost;
 
 	unsigned long nr_reserved_highatomic;
+	unsigned long nr_free_highatomic;
 
 	/*
 	 * We don't know if the memory that we're going to allocate will be
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c81762c49b00..43ecc7d75a2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -644,6 +644,18 @@ static inline void account_freepages(struct zone *zone, int nr_pages,
 		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
 }
 
+static void account_highatomic_freepages(struct zone *zone, unsigned int order,
+					 int old_mt, int new_mt)
+{
+	int delta = 1 << order;
+
+	if (is_migrate_highatomic(old_mt))
+		WRITE_ONCE(zone->nr_free_highatomic, zone->nr_free_highatomic - delta);
+
+	if (is_migrate_highatomic(new_mt))
+		WRITE_ONCE(zone->nr_free_highatomic, zone->nr_free_highatomic + delta);
+}
+
 /* Used for pages not on another list */
 static inline void __add_to_free_list(struct page *page, struct zone *zone,
 				      unsigned int order, int migratetype,
@@ -660,6 +672,8 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
 	else
 		list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
+
+	account_highatomic_freepages(zone, order, -1, migratetype);
 }
 
 /*
@@ -681,6 +695,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 
 	account_freepages(zone, -(1 << order), old_mt);
 	account_freepages(zone, 1 << order, new_mt);
+
+	account_highatomic_freepages(zone, order, old_mt, new_mt);
 }
 
 static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
@@ -698,6 +714,8 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	zone->free_area[order].nr_free--;
+
+	account_highatomic_freepages(zone, order, migratetype, -1);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
@@ -3119,11 +3137,10 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
 
 	/*
 	 * If the caller does not have rights to reserves below the min
-	 * watermark then subtract the high-atomic reserves. This will
-	 * over-estimate the size of the atomic reserve but it avoids a search.
+	 * watermark then subtract the free pages reserved for highatomic.
 	 */
 	if (likely(!(alloc_flags & ALLOC_RESERVES)))
-		unusable_free += z->nr_reserved_highatomic;
+		unusable_free += READ_ONCE(z->nr_free_highatomic);
 
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
OOM kills due to vastly overestimated free highatomic reserves were
observed:

... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB

The second line above shows that the OOM kill was due to the following
condition:

free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)

And the third line shows there were no free pages in any
MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
'H'. Therefore __zone_watermark_unusable_free() underestimated the
usable free memory by over 1GB, which resulted in the unnecessary OOM
kill above.

The comment in __zone_watermark_unusable_free() warns about the
potential risk, i.e.,

If the caller does not have rights to reserves below the min
watermark then subtract the high-atomic reserves. This will
over-estimate the size of the atomic reserve but it avoids a search.

However, it is possible to keep track of free pages in reserved
highatomic pageblocks with a new per-zone counter nr_free_highatomic
protected by the zone lock, to avoid a search when calculating the
usable free memory. And the cost would be minimal, i.e., simple
arithmetic in the highatomic alloc/free/move paths.

Reported-by: Link Lin <linkl@google.com>
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 23 ++++++++++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)
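[Editorial note: the arithmetic in the commit message above can be checked with a trivial standalone program. This is an illustration only; the variable names are invented, and it mirrors the commit message's simplified calculation rather than the full watermark check.]

#include <stdio.h>

int main(void)
{
	/* Values (in kB) taken from the quoted OOM report. */
	long free_kb = 1482936;
	long min_kb = 410416;
	long reserved_highatomic_kb = 1073152; /* nr_reserved_highatomic */
	long free_highatomic_kb = 0;           /* no type 'H' pages in the buddy dump */

	/* Old estimate: subtract the full highatomic reservation. */
	long usable_old = free_kb - reserved_highatomic_kb;
	/* With the patch: subtract only what is actually free in highatomic pageblocks. */
	long usable_new = free_kb - free_highatomic_kb;

	printf("old estimate: %ld kB usable %s min %ld kB\n",
	       usable_old, usable_old < min_kb ? "<" : ">=", min_kb);
	printf("new estimate: %ld kB usable %s min %ld kB\n",
	       usable_new, usable_new < min_kb ? "<" : ">=", min_kb);
	return 0;
}

The old estimate yields 409784 kB, which is below the 410416 kB min watermark and so takes the OOM path; the new estimate yields 1482936 kB, well above it.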