
[mm-unstable,v2] mm/page_alloc: keep track of free highatomic

Message ID 20241026033625.2237102-1-yuzhao@google.com (mailing list archive)
State New

Commit Message

Yu Zhao Oct. 26, 2024, 3:36 a.m. UTC
OOM kills due to vastly overestimated free highatomic reserves were
observed:

  ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
  Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
  Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB

The second line above shows that the OOM kill was due to the following
condition:

  free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)

And the third line shows there were no free pages in any
MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
'H'. Therefore __zone_watermark_unusable_free() underestimated the
usable free memory by over 1GB, which resulted in the unnecessary OOM
kill above.

The comment in __zone_watermark_unusable_free() warns about this
potential risk, i.e.,

  If the caller does not have rights to reserves below the min
  watermark then subtract the high-atomic reserves. This will
  over-estimate the size of the atomic reserve but it avoids a search.

However, it is possible to keep track of free pages in reserved
highatomic pageblocks with a new per-zone counter, nr_free_highatomic,
protected by the zone lock, and thereby avoid the search when
calculating the usable free memory. The cost is minimal: simple
arithmetic in the highatomic alloc/free/move paths.

Reported-by: Link Lin <linkl@google.com>
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 23 ++++++++++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)
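
At its core, the patch narrows the watermark estimate in
__zone_watermark_unusable_free(); abridged from the diff at the end of
this page:

  /* Before: subtract the full reservation, overestimating unusable memory */
  if (likely(!(alloc_flags & ALLOC_RESERVES)))
          unusable_free += z->nr_reserved_highatomic;

  /* After: subtract only the pages actually free in highatomic pageblocks */
  if (likely(!(alloc_flags & ALLOC_RESERVES)))
          unusable_free += READ_ONCE(z->nr_free_highatomic);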

Comments

Andrew Morton Oct. 26, 2024, 4:24 a.m. UTC | #1
On Fri, 25 Oct 2024 21:36:25 -0600 Yu Zhao <yuzhao@google.com> wrote:

> OOM kills due to vastly overestimated free highatomic reserves were
> observed:
> 
>   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
>   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
>   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB

Under what circumstances?

> The second line above shows that the OOM kill was due to the following
> condition:
> 
>   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
> 
> And the third line shows there were no free pages in any
> MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> 'H'. Therefore __zone_watermark_unusable_free() underestimated the
> usable free memory by over 1GB, which resulted in the unnecessary OOM
> kill above.
> 
> The comments in __zone_watermark_unusable_free() warns about the
> potential risk, i.e.,
> 
>   If the caller does not have rights to reserves below the min
>   watermark then subtract the high-atomic reserves. This will
>   over-estimate the size of the atomic reserve but it avoids a search.
> 
> However, it is possible to keep track of free pages in reserved
> highatomic pageblocks with a new per-zone counter nr_free_highatomic
> protected by the zone lock, to avoid a search when calculating the
> usable free memory. And the cost would be minimal, i.e., simple
> arithmetics in the highatomic alloc/free/move paths.

Is a -stable backport needed?

If so, is a Fixes: target identifiable?
Yu Zhao Oct. 26, 2024, 4:40 a.m. UTC | #2
On Fri, Oct 25, 2024 at 10:24 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Fri, 25 Oct 2024 21:36:25 -0600 Yu Zhao <yuzhao@google.com> wrote:
>
> > OOM kills due to vastly overestimated free highatomic reserves were
> > observed:
> >
> >   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
> >   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
> >   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
>
> Under what circumstances?

Link wrote a repro, so he'd be the best person to answer this question.

Link, please help. Thanks.

> > The second line above shows that the OOM kill was due to the following
> > condition:
> >
> >   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
> >
> > And the third line shows there were no free pages in any
> > MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> > 'H'. Therefore __zone_watermark_unusable_free() underestimated the
> > usable free memory by over 1GB, which resulted in the unnecessary OOM
> > kill above.
> >
> > The comments in __zone_watermark_unusable_free() warns about the
> > potential risk, i.e.,
> >
> >   If the caller does not have rights to reserves below the min
> >   watermark then subtract the high-atomic reserves. This will
> >   over-estimate the size of the atomic reserve but it avoids a search.
> >
> > However, it is possible to keep track of free pages in reserved
> > highatomic pageblocks with a new per-zone counter nr_free_highatomic
> > protected by the zone lock, to avoid a search when calculating the
> > usable free memory. And the cost would be minimal, i.e., simple
> > arithmetics in the highatomic alloc/free/move paths.
>
> Is a -stable backport needed?
>
> If so, is a Fixes: target identifiable?

The code has been there for many years, and we only recently noticed
the problem from Link's repro. So it doesn't look like stable
material.
David Rientjes Oct. 26, 2024, 5:35 a.m. UTC | #3
On Fri, 25 Oct 2024, Yu Zhao wrote:

> OOM kills due to vastly overestimated free highatomic reserves were
> observed:
> 
>   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
>   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
>   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
> 
> The second line above shows that the OOM kill was due to the following
> condition:
> 
>   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
> 
> And the third line shows there were no free pages in any
> MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> 'H'. Therefore __zone_watermark_unusable_free() underestimated the
> usable free memory by over 1GB, which resulted in the unnecessary OOM
> kill above.
> 
> The comments in __zone_watermark_unusable_free() warns about the
> potential risk, i.e.,
> 
>   If the caller does not have rights to reserves below the min
>   watermark then subtract the high-atomic reserves. This will
>   over-estimate the size of the atomic reserve but it avoids a search.
> 
> However, it is possible to keep track of free pages in reserved
> highatomic pageblocks with a new per-zone counter nr_free_highatomic
> protected by the zone lock, to avoid a search when calculating the
> usable free memory. And the cost would be minimal, i.e., simple
> arithmetics in the highatomic alloc/free/move paths.
> 
> Reported-by: Link Lin <linkl@google.com>
> Signed-off-by: Yu Zhao <yuzhao@google.com>

Acked-by: David Rientjes <rientjes@google.com>
Vlastimil Babka Oct. 27, 2024, 7:40 p.m. UTC | #4
On 10/26/24 06:40, Yu Zhao wrote:
> On Fri, Oct 25, 2024 at 10:24 PM Andrew Morton
> <akpm@linux-foundation.org> wrote:
>>
>> Is a -stable backport needed?
>>
>> If so, is a Fixes: target identifiable?
> 
> The code has been there for many years, and we only recently noticed
> the problem from Link's repro. So it doesn't look like a stable
> material.

The stable backportability would be limited as there's a prerequisite:
the "mm: page_alloc: freelist migratetype hygiene" patchset from half a
year ago - both patch-applicability-wise and, importantly,
functionality-wise. Otherwise the counter could drift easily.

However, we could perhaps mark it for a 6.12 stable backport, as that's
presumably the upcoming LTS.
Vlastimil Babka Oct. 27, 2024, 7:53 p.m. UTC | #5
On 10/26/24 05:36, Yu Zhao wrote:
> OOM kills due to vastly overestimated free highatomic reserves were
> observed:
> 
>   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
>   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
>   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
> 
> The second line above shows that the OOM kill was due to the following
> condition:
> 
>   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
> 
> And the third line shows there were no free pages in any
> MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> 'H'. Therefore __zone_watermark_unusable_free() underestimated the
> usable free memory by over 1GB, which resulted in the unnecessary OOM
> kill above.
> 
> The comments in __zone_watermark_unusable_free() warns about the
> potential risk, i.e.,
> 
>   If the caller does not have rights to reserves below the min
>   watermark then subtract the high-atomic reserves. This will
>   over-estimate the size of the atomic reserve but it avoids a search.
> 
> However, it is possible to keep track of free pages in reserved
> highatomic pageblocks with a new per-zone counter nr_free_highatomic
> protected by the zone lock, to avoid a search when calculating the

It's only possible to track this reliably since the "mm: page_alloc:
freelist migratetype hygiene" patchset was merged, which explains why
nr_reserved_highatomic was used until now, even if it's imprecise. But I
agree it should be a good solution to the problem now.

> usable free memory. And the cost would be minimal, i.e., simple
> arithmetics in the highatomic alloc/free/move paths.
> 
> Reported-by: Link Lin <linkl@google.com>
> Signed-off-by: Yu Zhao <yuzhao@google.com>

However, implementation-wise that patchset also centralized everything
into account_freepages(), which currently handles NR_FREE_PAGES and
NR_FREE_CMA_PAGES, and your patch differs from that approach. Can't you
simply add a NR_FREE_HIGHATOMIC update there instead of creating
account_highatomic_freepages(), which is hooked into 3 different places?
And for consistency indeed add a zone_stat_item counter instead of a
struct zone field? Then it becomes visible in e.g. /proc/zoneinfo and
/proc/vmstat too. Bonus points for printing it in the oom/alloc failure
reports as well (wherever we print NR_FREE_CMA_PAGES).
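
A rough sketch of what this suggestion might look like (a hypothetical
illustration of the proposal above, not code from the posted patch):

  /* Hypothetical: track free highatomic pages as a zone_stat_item and
   * update it from the central account_freepages() helper. */
  enum zone_stat_item {
          NR_FREE_PAGES,
          /* ... existing items ... */
          NR_FREE_HIGHATOMIC,     /* new: visible in /proc/zoneinfo and /proc/vmstat */
          NR_VM_ZONE_STAT_ITEMS
  };

  static inline void account_freepages(struct zone *zone, int nr_pages,
                                       int migratetype)
  {
          __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);

          if (is_migrate_cma(migratetype))
                  __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);

          /* new: mirror the CMA handling for highatomic pageblocks */
          if (is_migrate_highatomic(migratetype))
                  __mod_zone_page_state(zone, NR_FREE_HIGHATOMIC, nr_pages);
  }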

Thanks!
Vlastimil

> ---
>  include/linux/mmzone.h |  1 +
>  mm/page_alloc.c        | 23 ++++++++++++++++++++---
>  2 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2e8c4307c728..5e8f567753bd 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -825,6 +825,7 @@ struct zone {
>  	unsigned long watermark_boost;
>  
>  	unsigned long nr_reserved_highatomic;
> +	unsigned long nr_free_highatomic;
>  
>  	/*
>  	 * We don't know if the memory that we're going to allocate will be
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c81762c49b00..43ecc7d75a2a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -644,6 +644,18 @@ static inline void account_freepages(struct zone *zone, int nr_pages,
>  		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
>  }
>  
> +static void account_highatomic_freepages(struct zone *zone, unsigned int order,
> +					 int old_mt, int new_mt)
> +{
> +	int delta = 1 << order;
> +
> +	if (is_migrate_highatomic(old_mt))
> +		WRITE_ONCE(zone->nr_free_highatomic, zone->nr_free_highatomic - delta);
> +
> +	if (is_migrate_highatomic(new_mt))
> +		WRITE_ONCE(zone->nr_free_highatomic, zone->nr_free_highatomic + delta);
> +}
> +
>  /* Used for pages not on another list */
>  static inline void __add_to_free_list(struct page *page, struct zone *zone,
>  				      unsigned int order, int migratetype,
> @@ -660,6 +672,8 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
>  	else
>  		list_add(&page->buddy_list, &area->free_list[migratetype]);
>  	area->nr_free++;
> +
> +	account_highatomic_freepages(zone, order, -1, migratetype);
>  }
>  
>  /*
> @@ -681,6 +695,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
>  
>  	account_freepages(zone, -(1 << order), old_mt);
>  	account_freepages(zone, 1 << order, new_mt);
> +
> +	account_highatomic_freepages(zone, order, old_mt, new_mt);
>  }
>  
>  static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
> @@ -698,6 +714,8 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
>  	__ClearPageBuddy(page);
>  	set_page_private(page, 0);
>  	zone->free_area[order].nr_free--;
> +
> +	account_highatomic_freepages(zone, order, migratetype, -1);
>  }
>  
>  static inline void del_page_from_free_list(struct page *page, struct zone *zone,
> @@ -3119,11 +3137,10 @@ static inline long __zone_watermark_unusable_free(struct zone *z,
>  
>  	/*
>  	 * If the caller does not have rights to reserves below the min
> -	 * watermark then subtract the high-atomic reserves. This will
> -	 * over-estimate the size of the atomic reserve but it avoids a search.
> +	 * watermark then subtract the free pages reserved for highatomic.
>  	 */
>  	if (likely(!(alloc_flags & ALLOC_RESERVES)))
> -		unusable_free += z->nr_reserved_highatomic;
> +		unusable_free += READ_ONCE(z->nr_free_highatomic);
>  
>  #ifdef CONFIG_CMA
>  	/* If allocation can't use CMA areas don't use free CMA pages */
Yu Zhao Oct. 27, 2024, 8:03 p.m. UTC | #6
On Sun, Oct 27, 2024 at 1:40 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/26/24 06:40, Yu Zhao wrote:
> > On Fri, Oct 25, 2024 at 10:24 PM Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> >>
> >> Is a -stable backport needed?
> >>
> >> If so, is a Fixes: target identifiable?
> >
> > The code has been there for many years, and we only recently noticed
> > the problem from Link's repro. So it doesn't look like a stable
> > material.
>
> The stable backportability would be limited as there's a prerequisity on the
> "mm: page_alloc: freelist migratetype hygiene" patchset from half a year ago
> - patch applicability wise and importantly functionality wise.

I don't see any dependency on that series.

> Otherwise the
> counter could drift easily.

Care to elaborate how that could happen?

> However we could perhaps mark it for a 6.12 stable backport as that's
> presumably the upcoming LTS.

Right.
Yu Zhao Oct. 27, 2024, 8:17 p.m. UTC | #7
On Sun, Oct 27, 2024 at 1:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/26/24 05:36, Yu Zhao wrote:
> > OOM kills due to vastly overestimated free highatomic reserves were
> > observed:
> >
> >   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
> >   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
> >   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
> >
> > The second line above shows that the OOM kill was due to the following
> > condition:
> >
> >   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
> >
> > And the third line shows there were no free pages in any
> > MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> > 'H'. Therefore __zone_watermark_unusable_free() underestimated the
> > usable free memory by over 1GB, which resulted in the unnecessary OOM
> > kill above.
> >
> > The comments in __zone_watermark_unusable_free() warns about the
> > potential risk, i.e.,
> >
> >   If the caller does not have rights to reserves below the min
> >   watermark then subtract the high-atomic reserves. This will
> >   over-estimate the size of the atomic reserve but it avoids a search.
> >
> > However, it is possible to keep track of free pages in reserved
> > highatomic pageblocks with a new per-zone counter nr_free_highatomic
> > protected by the zone lock, to avoid a search when calculating the
>
> It's only possible to track this reliably since the "mm: page_alloc:
> freelist migratetype hygiene" patchset was merged, which explains why
> nr_reserved_highatomic was used until now, even if it's imprecise.

I just refreshed my memory by quickly going through the discussion
around that series and didn't find anything that helps me understand
the above. More pointers please?
Vlastimil Babka Oct. 27, 2024, 8:36 p.m. UTC | #8
On 10/27/24 21:17, Yu Zhao wrote:
> On Sun, Oct 27, 2024 at 1:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 10/26/24 05:36, Yu Zhao wrote:
>> > OOM kills due to vastly overestimated free highatomic reserves were
>> > observed:
>> >
>> >   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
>> >   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
>> >   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
>> >
>> > The second line above shows that the OOM kill was due to the following
>> > condition:
>> >
>> >   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
>> >
>> > And the third line shows there were no free pages in any
>> > MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
>> > 'H'. Therefore __zone_watermark_unusable_free() underestimated the
>> > usable free memory by over 1GB, which resulted in the unnecessary OOM
>> > kill above.
>> >
>> > The comments in __zone_watermark_unusable_free() warns about the
>> > potential risk, i.e.,
>> >
>> >   If the caller does not have rights to reserves below the min
>> >   watermark then subtract the high-atomic reserves. This will
>> >   over-estimate the size of the atomic reserve but it avoids a search.
>> >
>> > However, it is possible to keep track of free pages in reserved
>> > highatomic pageblocks with a new per-zone counter nr_free_highatomic
>> > protected by the zone lock, to avoid a search when calculating the
>>
>> It's only possible to track this reliably since the "mm: page_alloc:
>> freelist migratetype hygiene" patchset was merged, which explains why
>> nr_reserved_highatomic was used until now, even if it's imprecise.
> 
> I just refreshed my memory by quickly going through the discussion
> around that series and didn't find anything that helps me understand
> the above. More pointers please?

For example:

- a page is on the pcplist, on the MIGRATE_MOVABLE list
- we reserve its pageblock as highatomic, which does nothing to the page on
the pcplist
- the page above is flushed from the pcplist to the zone freelist, but it
remembers it was MIGRATE_MOVABLE; it merges with another buddy/buddies from
the now-highatomic list, and the resulting order-X page ends up on the
movable freelist despite being in a highatomic pageblock. The counter of
free highatomic is now wrong wrt the freelist reality

The series has addressed various scenarios like that, where page can end up
on the wrong freelist.
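
Concretely, before the hygiene series the pcplist flush path picked the
target freelist from the migratetype cached in the page at free time,
roughly as sketched below (an assumption based on pre-hygiene kernels,
not code quoted in this thread; the helper name is made up):

  /* Assumed pre-hygiene behaviour: the cached per-page migratetype picks
   * the freelist, so a pageblock that became highatomic in the meantime
   * is not reflected by any accounting done at this point. */
  static void flush_one_pcp_page(struct zone *zone, struct page *page)
  {
          int mt = get_pcppage_migratetype(page);  /* e.g. still MIGRATE_MOVABLE */

          /* frees onto free_list[mt]; may merge with now-highatomic buddies */
          __free_one_page(page, page_to_pfn(page), zone, 0, mt, FPI_NONE);
  }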
Yu Zhao Oct. 27, 2024, 8:51 p.m. UTC | #9
On Sun, Oct 27, 2024 at 2:36 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/27/24 21:17, Yu Zhao wrote:
> > On Sun, Oct 27, 2024 at 1:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 10/26/24 05:36, Yu Zhao wrote:
> >> > OOM kills due to vastly overestimated free highatomic reserves were
> >> > observed:
> >> >
> >> >   ... invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0 ...
> >> >   Node 0 Normal free:1482936kB boost:0kB min:410416kB low:739404kB high:1068392kB reserved_highatomic:1073152KB ...
> >> >   Node 0 Normal: 1292*4kB (ME) 1920*8kB (E) 383*16kB (UE) 220*32kB (ME) 340*64kB (E) 2155*128kB (UE) 3243*256kB (UE) 615*512kB (U) 1*1024kB (M) 0*2048kB 0*4096kB = 1477408kB
> >> >
> >> > The second line above shows that the OOM kill was due to the following
> >> > condition:
> >> >
> >> >   free (1482936kB) - reserved_highatomic (1073152kB) = 409784KB < min (410416kB)
> >> >
> >> > And the third line shows there were no free pages in any
> >> > MIGRATE_HIGHATOMIC pageblocks, which otherwise would show up as type
> >> > 'H'. Therefore __zone_watermark_unusable_free() underestimated the
> >> > usable free memory by over 1GB, which resulted in the unnecessary OOM
> >> > kill above.
> >> >
> >> > The comments in __zone_watermark_unusable_free() warns about the
> >> > potential risk, i.e.,
> >> >
> >> >   If the caller does not have rights to reserves below the min
> >> >   watermark then subtract the high-atomic reserves. This will
> >> >   over-estimate the size of the atomic reserve but it avoids a search.
> >> >
> >> > However, it is possible to keep track of free pages in reserved
> >> > highatomic pageblocks with a new per-zone counter nr_free_highatomic
> >> > protected by the zone lock, to avoid a search when calculating the
> >>
> >> It's only possible to track this reliably since the "mm: page_alloc:
> >> freelist migratetype hygiene" patchset was merged, which explains why
> >> nr_reserved_highatomic was used until now, even if it's imprecise.
> >
> > I just refreshed my memory by quickly going through the discussion
> > around that series and didn't find anything that helps me understand
> > the above. More pointers please?
>
> For example:
>
> - a page is on pcplist in MIGRATE_MOVABLE list
> - we reserve its pageblock as highatomic, which does nothing to the page on
> the pcplist
> - page above is flushed from pcplist to zone freelist, but it remembers it
> was MIGRATE_MOVABLE, merges with another buddy/buddies from the
> now-highatomic list, the resulting order-X page ends up on the movable
> freelist despite being in highatomic pageblock. The counter of free
> highatomic is now wrong wrt the freelist reality

This is the part I don't follow: how is it wrong w.r.t. the freelist
reality? The new nr_free_highatomic should reflect how many pages are
exactly on free_list[MIGRATE_HIGHATOMIC], because it's updated
accordingly.

(My current understanding is that, in this case, the reservation
itself is messed up, i.e., under-reserved.)

> The series has addressed various scenarios like that, where page can end up
> on the wrong freelist.
Vlastimil Babka Oct. 27, 2024, 9:05 p.m. UTC | #10
On 10/27/24 21:51, Yu Zhao wrote:
> On Sun, Oct 27, 2024 at 2:36 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 10/27/24 21:17, Yu Zhao wrote:
>> > On Sun, Oct 27, 2024 at 1:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>> >>
>>
>> For example:
>>
>> - a page is on pcplist in MIGRATE_MOVABLE list
>> - we reserve its pageblock as highatomic, which does nothing to the page on
>> the pcplist
>> - page above is flushed from pcplist to zone freelist, but it remembers it
>> was MIGRATE_MOVABLE, merges with another buddy/buddies from the
>> now-highatomic list, the resulting order-X page ends up on the movable
>> freelist despite being in highatomic pageblock. The counter of free
>> highatomic is now wrong wrt the freelist reality
> 
> This is the part I don't follow: how is it wrong w.r.t. the freelist
> reality? The new nr_free_highatomic should reflect how many pages are
> exactly on free_list[MIGRATE_HIGHATOMIC], because it's updated
> accordingly.

You'd have to try implementing your change in the kernel without that
migratetype hygiene series, and see how it would either not work, or you'd
end up implementing the series as part of that.

> (My current understanding is that, in this case, the reservation
> itself is messed up, i.e., under-reserved.)
> 
>> The series has addressed various scenarios like that, where page can end up
>> on the wrong freelist.
Yu Zhao Oct. 28, 2024, 12:24 a.m. UTC | #11
On Sun, Oct 27, 2024 at 3:05 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/27/24 21:51, Yu Zhao wrote:
> > On Sun, Oct 27, 2024 at 2:36 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 10/27/24 21:17, Yu Zhao wrote:
> >> > On Sun, Oct 27, 2024 at 1:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> >>
> >>
> >> For example:
> >>
> >> - a page is on pcplist in MIGRATE_MOVABLE list
> >> - we reserve its pageblock as highatomic, which does nothing to the page on
> >> the pcplist
> >> - page above is flushed from pcplist to zone freelist, but it remembers it
> >> was MIGRATE_MOVABLE, merges with another buddy/buddies from the
> >> now-highatomic list, the resulting order-X page ends up on the movable
> >> freelist despite being in highatomic pageblock. The counter of free
> >> highatomic is now wrong wrt the freelist reality
> >
> > This is the part I don't follow: how is it wrong w.r.t. the freelist
> > reality? The new nr_free_highatomic should reflect how many pages are
> > exactly on free_list[MIGRATE_HIGHATOMIC], because it's updated
> > accordingly.
>
> You'd have to try implementing your change in the kernel without that
> migratetype hygiene series, and see how it would either not work, or you'd
> end up implementing the series as part of that.

A proper backport would need to track the MT of the free_list a page
is deleted from, and I see no reason why in such a proper backport
"the counter could drift easily" or "the counter of free highatomic is
now wrong wrt the freelist reality". So I assume you actually mean
this patch can't be backported cleanly? (Which I do agree with.)
Vlastimil Babka Oct. 28, 2024, 11:04 a.m. UTC | #12
On 10/28/24 1:24 AM, Yu Zhao wrote:
> On Sun, Oct 27, 2024 at 3:05 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 10/27/24 21:51, Yu Zhao wrote:
>>> On Sun, Oct 27, 2024 at 2:36 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>>>
>>>> On 10/27/24 21:17, Yu Zhao wrote:
>>>>> On Sun, Oct 27, 2024 at 1:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>>>>>
>>>>
>>>> For example:
>>>>
>>>> - a page is on pcplist in MIGRATE_MOVABLE list
>>>> - we reserve its pageblock as highatomic, which does nothing to the page on
>>>> the pcplist
>>>> - page above is flushed from pcplist to zone freelist, but it remembers it
>>>> was MIGRATE_MOVABLE, merges with another buddy/buddies from the
>>>> now-highatomic list, the resulting order-X page ends up on the movable
>>>> freelist despite being in highatomic pageblock. The counter of free
>>>> highatomic is now wrong wrt the freelist reality
>>>
>>> This is the part I don't follow: how is it wrong w.r.t. the freelist
>>> reality? The new nr_free_highatomic should reflect how many pages are
>>> exactly on free_list[MIGRATE_HIGHATOMIC], because it's updated
>>> accordingly.
>>
>> You'd have to try implementing your change in the kernel without that
>> migratetype hygiene series, and see how it would either not work, or you'd
>> end up implementing the series as part of that.
> 
> A proper backport would need to track the MT of the free_list a page
> is deleted from, and I see no reason why in such a proper backport
> "the counter could drift easily" or "the counter of free highatomic is
> now wrong wrt the freelist reality". So I assume you actually mean
> this patch can't be backported cleanly? (Which I do agree.)

Yes, you're right. But since we don't plan to backport it beyond 6.12,
sorry for sidetracking the discussion unnecessarily. More importantly,
is it possible to change the implementation as I suggested? [1] Hooking
into __del_page_from_free_list() and __add_to_free_list() means extra
work in every loop iteration in expand() and __free_one_page(). The
migratetype hygiene should ensure it's not necessary to intercept every
freelist add/move, and hooking into account_freepages() should be
sufficient and in line with the intended design.

Thanks,
Vlastimil

[1]
https://lore.kernel.org/all/37a28ef7-e477-40b0-a8e4-3d74b747e323@suse.cz/
Yu Zhao Oct. 28, 2024, 5:54 p.m. UTC | #13
On Mon, Oct 28, 2024 at 5:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/28/24 1:24 AM, Yu Zhao wrote:
> > On Sun, Oct 27, 2024 at 3:05 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 10/27/24 21:51, Yu Zhao wrote:
> >>> On Sun, Oct 27, 2024 at 2:36 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>>>
> >>>> On 10/27/24 21:17, Yu Zhao wrote:
> >>>>> On Sun, Oct 27, 2024 at 1:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>>>>>
> >>>>
> >>>> For example:
> >>>>
> >>>> - a page is on pcplist in MIGRATE_MOVABLE list
> >>>> - we reserve its pageblock as highatomic, which does nothing to the page on
> >>>> the pcplist
> >>>> - page above is flushed from pcplist to zone freelist, but it remembers it
> >>>> was MIGRATE_MOVABLE, merges with another buddy/buddies from the
> >>>> now-highatomic list, the resulting order-X page ends up on the movable
> >>>> freelist despite being in highatomic pageblock. The counter of free
> >>>> highatomic is now wrong wrt the freelist reality
> >>>
> >>> This is the part I don't follow: how is it wrong w.r.t. the freelist
> >>> reality? The new nr_free_highatomic should reflect how many pages are
> >>> exactly on free_list[MIGRATE_HIGHATOMIC], because it's updated
> >>> accordingly.
> >>
> >> You'd have to try implementing your change in the kernel without that
> >> migratetype hygiene series, and see how it would either not work, or you'd
> >> end up implementing the series as part of that.
> >
> > A proper backport would need to track the MT of the free_list a page
> > is deleted from, and I see no reason why in such a proper backport
> > "the counter could drift easily" or "the counter of free highatomic is
> > now wrong wrt the freelist reality". So I assume you actually mean
> > this patch can't be backported cleanly? (Which I do agree.)
>
> Yes you're right. But since we don't plan to backport it beyond 6.12,
> sorry for sidetracking the discussion unnecessarily. More importantly,
> is it possible to change the implementation as I suggested?

The only reason I didn't fold account_highatomic_freepages() into
account_freepages() is that the former must be called under the zone
lock, which is also how the latter is called, but not as a requirement.

I understand where you're coming from when suggesting a new per-cpu
counter for free highatomic. I have to disagree with that because 1)
free highatomic is relatively small, and drifting might defeat its
purpose; 2) per-cpu memory is among the top kernel memory overheads in
our fleet -- it really adds up. So I prefer not to use per-cpu counters
unless necessary.

So if it's ok with you, I'll just fold account_highatomic_freepages()
into account_freepages(), but keep the counter as per zone, not per
cpu.

> [1] Hooking
> to __del_page_from_free_list() and __add_to_free_list() means extra work
> in every loop iteration in expand() and __free_one_page(). The
> migratetype hygiene should ensure it's not necessary to intercept every
> freelist add/move and hooking to account_freepages() should be
> sufficient and in line with the intended design.

Agreed.
Vlastimil Babka Oct. 28, 2024, 6:29 p.m. UTC | #14
On 10/28/24 18:54, Yu Zhao wrote:
> On Mon, Oct 28, 2024 at 5:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> Yes you're right. But since we don't plan to backport it beyond 6.12,
>> sorry for sidetracking the discussion unnecessarily. More importantly,
>> is it possible to change the implementation as I suggested?
> 
> The only reason I didn't fold account_highatomic_freepages() into
> account_freepages() is because the former must be called under the
> zone lock, which is also how the latter is called but not as a
> requirement.

Ah, I guess we can document the requirement/add a lockdep assert. Using
__mod_zone_page_state() already implies some context restrictions, although
not the zone lock specifically.

> I understand where you come from when suggesting a new per-cpu counter
> for free highatomic. I have to disagree with that because 1) free
> highatomic is relatively small and drifting might defeat its purpose;
> 2) per-cpu memory is among the top kernel memory overhead in our fleet
> -- it really adds up. So I prefer not to use per-cpu counters unless
> necessary.

OK, didn't think of these drawbacks.

> So if it's ok with you, I'll just fold account_highatomic_freepages()
> into account_freepages(), but keep the counter as per zone, not per
> cpu.

OK, thanks!

>> [1] Hooking
>> to __del_page_from_free_list() and __add_to_free_list() means extra work
>> in every loop iteration in expand() and __free_one_page(). The
>> migratetype hygiene should ensure it's not necessary to intercept every
>> freelist add/move and hooking to account_freepages() should be
>> sufficient and in line with the intended design.
> 
> Agreed.
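
Putting the agreed direction together - fold the highatomic accounting
into account_freepages() under the zone lock, while keeping the plain
per-zone counter - a rough sketch might look like this (hypothetical;
the actual code belongs to a later revision, not to this thread):

  /* Hypothetical folded-in variant: one central helper, called under the
   * zone lock, updates all free-page counters. */
  static inline void account_freepages(struct zone *zone, int nr_pages,
                                       int migratetype)
  {
          lockdep_assert_held(&zone->lock);  /* the requirement discussed above */

          __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);

          if (is_migrate_cma(migratetype))
                  __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
          else if (is_migrate_highatomic(migratetype))
                  /* plain per-zone field, read with READ_ONCE() in the
                   * watermark check as in the patch below */
                  WRITE_ONCE(zone->nr_free_highatomic,
                             zone->nr_free_highatomic + nr_pages);
  }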

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2e8c4307c728..5e8f567753bd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -825,6 +825,7 @@  struct zone {
 	unsigned long watermark_boost;
 
 	unsigned long nr_reserved_highatomic;
+	unsigned long nr_free_highatomic;
 
 	/*
 	 * We don't know if the memory that we're going to allocate will be
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c81762c49b00..43ecc7d75a2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -644,6 +644,18 @@  static inline void account_freepages(struct zone *zone, int nr_pages,
 		__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, nr_pages);
 }
 
+static void account_highatomic_freepages(struct zone *zone, unsigned int order,
+					 int old_mt, int new_mt)
+{
+	int delta = 1 << order;
+
+	if (is_migrate_highatomic(old_mt))
+		WRITE_ONCE(zone->nr_free_highatomic, zone->nr_free_highatomic - delta);
+
+	if (is_migrate_highatomic(new_mt))
+		WRITE_ONCE(zone->nr_free_highatomic, zone->nr_free_highatomic + delta);
+}
+
 /* Used for pages not on another list */
 static inline void __add_to_free_list(struct page *page, struct zone *zone,
 				      unsigned int order, int migratetype,
@@ -660,6 +672,8 @@  static inline void __add_to_free_list(struct page *page, struct zone *zone,
 	else
 		list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
+
+	account_highatomic_freepages(zone, order, -1, migratetype);
 }
 
 /*
@@ -681,6 +695,8 @@  static inline void move_to_free_list(struct page *page, struct zone *zone,
 
 	account_freepages(zone, -(1 << order), old_mt);
 	account_freepages(zone, 1 << order, new_mt);
+
+	account_highatomic_freepages(zone, order, old_mt, new_mt);
 }
 
 static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
@@ -698,6 +714,8 @@  static inline void __del_page_from_free_list(struct page *page, struct zone *zon
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	zone->free_area[order].nr_free--;
+
+	account_highatomic_freepages(zone, order, migratetype, -1);
 }
 
 static inline void del_page_from_free_list(struct page *page, struct zone *zone,
@@ -3119,11 +3137,10 @@  static inline long __zone_watermark_unusable_free(struct zone *z,
 
 	/*
 	 * If the caller does not have rights to reserves below the min
-	 * watermark then subtract the high-atomic reserves. This will
-	 * over-estimate the size of the atomic reserve but it avoids a search.
+	 * watermark then subtract the free pages reserved for highatomic.
 	 */
 	if (likely(!(alloc_flags & ALLOC_RESERVES)))
-		unusable_free += z->nr_reserved_highatomic;
+		unusable_free += READ_ONCE(z->nr_free_highatomic);
 
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */