| Message ID | 20250115021746.34691-3-alexei.starovoitov@gmail.com (mailing list archive) |
|---|---|
| State | New |
| Series | bpf, mm: Introduce try_alloc_pages() |
On 1/15/25 03:17, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce free_pages_nolock() that can free pages without taking locks.
> It relies on trylock and can be called from any context.
> Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> it uses lockless link list to stash the pages which will be freed
> by subsequent free_pages() from good context.
>
> Do not use llist unconditionally. BPF maps continuously
> allocate/free, so we cannot unconditionally delay the freeing to
> llist. When the memory becomes free make it available to the
> kernel and BPF users right away if possible, and fallback to
> llist as the last resort.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

With:

> @@ -4853,6 +4905,17 @@ void __free_pages(struct page *page, unsigned int order)
>  }
>  EXPORT_SYMBOL(__free_pages);
>
> +/*
> + * Can be called while holding raw_spin_lock or from IRQ and NMI,
> + * but only for pages that came from try_alloc_pages():
> + * order <= 3, !folio, etc

I think order > 3 is fine, as !pcp_allowed_order() case is handled too? And
what does "!folio" mean?

> + */
> +void free_pages_nolock(struct page *page, unsigned int order)
> +{
> +	if (put_page_testzero(page))
> +		__free_unref_page(page, order, FPI_TRYLOCK);

Hmm this will reach reset_page_owner() and thus stackdepot so same mental
note as for patch 1.

> +}
> +
>  void free_pages(unsigned long addr, unsigned int order)
>  {
>  	if (addr != 0) {
On Wed, Jan 15, 2025 at 3:47 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 1/15/25 03:17, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Introduce free_pages_nolock() that can free pages without taking locks.
> > It relies on trylock and can be called from any context.
> > Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> > it uses lockless link list to stash the pages which will be freed
> > by subsequent free_pages() from good context.
> >
> > Do not use llist unconditionally. BPF maps continuously
> > allocate/free, so we cannot unconditionally delay the freeing to
> > llist. When the memory becomes free make it available to the
> > kernel and BPF users right away if possible, and fallback to
> > llist as the last resort.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
> With:
>
> > @@ -4853,6 +4905,17 @@ void __free_pages(struct page *page, unsigned int order)
> >  }
> >  EXPORT_SYMBOL(__free_pages);
> >
> > +/*
> > + * Can be called while holding raw_spin_lock or from IRQ and NMI,
> > + * but only for pages that came from try_alloc_pages():
> > + * order <= 3, !folio, etc
>
> I think order > 3 is fine, as !pcp_allowed_order() case is handled too?

try_alloc_page() has:
	if (!pcp_allowed_order(order))
		return NULL;
to make sure it tries pcp first.
bpf has no use for order > 1. Even 3 is overkill,
but it's kinda free to support order <= 3, so why not.

> And
> what does "!folio" mean?

That's what we discussed last year.
__free_pages() has all the extra stuff if (!head) and
support for dropping ref on the middle page.
!folio captures this more broadly.

> > + */
> > +void free_pages_nolock(struct page *page, unsigned int order)
> > +{
> > +	if (put_page_testzero(page))
> > +		__free_unref_page(page, order, FPI_TRYLOCK);
>
> Hmm this will reach reset_page_owner() and thus stackdepot so same mental
> note as for patch 1.

save_stack() has recursion protection already.
So should be fine.

> > +}
> > +
> >  void free_pages(unsigned long addr, unsigned int order)
> >  {
> >  	if (addr != 0) {
>
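For context (an editorial aside, not part of the thread): pcp_allowed_order() in mm/page_alloc.c is roughly the check sketched below, and PAGE_ALLOC_COSTLY_ORDER is 3, which is where the "order <= 3" wording in the quoted comment comes from. The exact body may differ slightly between kernel versions.

```c
/*
 * Rough sketch of pcp_allowed_order() from mm/page_alloc.c (may differ
 * slightly by kernel version): per-cpu page lists accept orders up to
 * PAGE_ALLOC_COSTLY_ORDER (3), plus the THP PMD order when transparent
 * hugepages are enabled.
 */
static inline bool pcp_allowed_order(unsigned int order)
{
	if (order <= PAGE_ALLOC_COSTLY_ORDER)	/* PAGE_ALLOC_COSTLY_ORDER == 3 */
		return true;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	if (order == HPAGE_PMD_ORDER)
		return true;
#endif
	return false;
}
```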
On 1/16/25 00:15, Alexei Starovoitov wrote:
> On Wed, Jan 15, 2025 at 3:47 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 1/15/25 03:17, Alexei Starovoitov wrote:
>> > From: Alexei Starovoitov <ast@kernel.org>
>> >
>> > Introduce free_pages_nolock() that can free pages without taking locks.
>> > It relies on trylock and can be called from any context.
>> > Since spin_trylock() cannot be used in RT from hard IRQ or NMI
>> > it uses lockless link list to stash the pages which will be freed
>> > by subsequent free_pages() from good context.
>> >
>> > Do not use llist unconditionally. BPF maps continuously
>> > allocate/free, so we cannot unconditionally delay the freeing to
>> > llist. When the memory becomes free make it available to the
>> > kernel and BPF users right away if possible, and fallback to
>> > llist as the last resort.
>> >
>> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>
>> With:
>>
>> > @@ -4853,6 +4905,17 @@ void __free_pages(struct page *page, unsigned int order)
>> >  }
>> >  EXPORT_SYMBOL(__free_pages);
>> >
>> > +/*
>> > + * Can be called while holding raw_spin_lock or from IRQ and NMI,
>> > + * but only for pages that came from try_alloc_pages():
>> > + * order <= 3, !folio, etc
>>
>> I think order > 3 is fine, as !pcp_allowed_order() case is handled too?
>
> try_alloc_page() has:
> 	if (!pcp_allowed_order(order))
> 		return NULL;

Ah ok I missed that the comment describes what pages try_alloc_pages()
produces, not what's accepted here.

> to make sure it tries pcp first.
> bpf has no use for order > 1. Even 3 is overkill,
> but it's kinda free to support order <= 3, so why not.
>
>> And
>> what does "!folio" mean?
>
> That's what we discussed last year.
> __free_pages() has all the extra stuff if (!head) and
> support for dropping ref on the middle page.
> !folio captures this more broadly.

Aha! But in that case I realize we're actually wrong. It needs to be a folio
(compound page) if it's order > 0 in order to drop that tricky !head freeing
code. It's order > 0 pages that are not compound that are problematic and
should eventually be removed from the kernel.

The solution is to add __GFP_COMP in try_alloc_pages_noprof(). This will
have no effect on order-0 pages that BPF uses. Instead of "!folio" the
comment could then say "compound".
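A hypothetical sketch of the point being made (not code from this series; try_alloc_pages_noprof() itself is introduced in a later patch, and the helper name below is made up): with __GFP_COMP, every order > 0 allocation comes back as a compound page (folio head), so a nolock free path never has to handle the legacy "high-order but not compound" case that __free_pages() covers with its !head logic, while order-0 allocations are unaffected.

```c
/*
 * Hypothetical illustration only, not code from this series: requesting
 * __GFP_COMP for order > 0 yields a compound page (folio head), so the
 * freeing side never sees a non-compound high-order page. __GFP_COMP is
 * a no-op for order-0 allocations, which is what BPF mostly uses.
 */
static struct page *alloc_compound_example(int nid, unsigned int order)
{
	gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN | __GFP_COMP;
	struct page *page = alloc_pages_node(nid, gfp, order);

	/* With __GFP_COMP this check can never trigger for order > 0. */
	if (page)
		WARN_ON_ONCE(order > 0 && !PageCompound(page));
	return page;
}
```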
On 2025-01-14 18:17:41 [-0800], Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce free_pages_nolock() that can free pages without taking locks.
> It relies on trylock and can be called from any context.
> Since spin_trylock() cannot be used in RT from hard IRQ or NMI
> it uses lockless link list to stash the pages which will be freed
> by subsequent free_pages() from good context.
>
> Do not use llist unconditionally. BPF maps continuously
> allocate/free, so we cannot unconditionally delay the freeing to
> llist. When the memory becomes free make it available to the
> kernel and BPF users right away if possible, and fallback to
> llist as the last resort.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

…
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 74c2a7af1a77..a9c639e3db91 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1247,13 +1250,44 @@ static void split_large_buddy(struct zone *zone, struct page *page,
…
>  static void free_one_page(struct zone *zone, struct page *page,
>  			  unsigned long pfn, unsigned int order,
>  			  fpi_t fpi_flags)
>  {
> +	struct llist_head *llhead;
>  	unsigned long flags;
>
> -	spin_lock_irqsave(&zone->lock, flags);
> +	if (!spin_trylock_irqsave(&zone->lock, flags)) {
> +		if (unlikely(fpi_flags & FPI_TRYLOCK)) {
> +			add_page_to_zone_llist(zone, page, order);
> +			return;
> +		}
> +		spin_lock_irqsave(&zone->lock, flags);
> +	}
> +
> +	/* The lock succeeded. Process deferred pages. */
> +	llhead = &zone->trylock_free_pages;
> +	if (unlikely(!llist_empty(llhead) && !(fpi_flags & FPI_TRYLOCK))) {

Thank you.

> +		struct llist_node *llnode;
> +		struct page *p, *tmp;
> +
> +		llnode = llist_del_all(llhead);
> +		llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
> +			unsigned int p_order = p->order;
> +
> +			split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
> +			__count_vm_events(PGFREE, 1 << p_order);
> +		}
> +	}

Sebastian
```diff
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b41bb6e01781..6eba2d80feb8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -391,6 +391,7 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
 	__get_free_pages((gfp_mask) | GFP_DMA, (order))
 
 extern void __free_pages(struct page *page, unsigned int order);
+extern void free_pages_nolock(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
 
 #define __free_page(page) __free_pages((page), 0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7361a8f3ab68..52547b3e5fd8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -99,6 +99,10 @@ struct page {
 				/* Or, free page */
 				struct list_head buddy_list;
 				struct list_head pcp_list;
+				struct {
+					struct llist_node pcp_llist;
+					unsigned int order;
+				};
 			};
 			/* See page-flags.h for PAGE_MAPPING_FLAGS */
 			struct address_space *mapping;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b36124145a16..1a854e0a9e3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -953,6 +953,9 @@ struct zone {
 	/* Primarily protects free_area */
 	spinlock_t		lock;
 
+	/* Pages to be freed when next trylock succeeds */
+	struct llist_head	trylock_free_pages;
+
 	/* Write-intensive fields used by compaction and vmstats. */
 	CACHELINE_PADDING(_pad2_);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 74c2a7af1a77..a9c639e3db91 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -88,6 +88,9 @@ typedef int __bitwise fpi_t;
  */
 #define FPI_TO_TAIL		((__force fpi_t)BIT(1))
 
+/* Free the page without taking locks. Rely on trylock only. */
+#define FPI_TRYLOCK		((__force fpi_t)BIT(2))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1247,13 +1250,44 @@ static void split_large_buddy(struct zone *zone, struct page *page,
 	}
 }
 
+static void add_page_to_zone_llist(struct zone *zone, struct page *page,
+				   unsigned int order)
+{
+	/* Remember the order */
+	page->order = order;
+	/* Add the page to the free list */
+	llist_add(&page->pcp_llist, &zone->trylock_free_pages);
+}
+
 static void free_one_page(struct zone *zone, struct page *page,
 			  unsigned long pfn, unsigned int order,
 			  fpi_t fpi_flags)
 {
+	struct llist_head *llhead;
 	unsigned long flags;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!spin_trylock_irqsave(&zone->lock, flags)) {
+		if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+			add_page_to_zone_llist(zone, page, order);
+			return;
+		}
+		spin_lock_irqsave(&zone->lock, flags);
+	}
+
+	/* The lock succeeded. Process deferred pages. */
+	llhead = &zone->trylock_free_pages;
+	if (unlikely(!llist_empty(llhead) && !(fpi_flags & FPI_TRYLOCK))) {
+		struct llist_node *llnode;
+		struct page *p, *tmp;
+
+		llnode = llist_del_all(llhead);
+		llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
+			unsigned int p_order = p->order;
+
+			split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
+			__count_vm_events(PGFREE, 1 << p_order);
+		}
+	}
 	split_large_buddy(zone, page, pfn, order, fpi_flags);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
@@ -2596,7 +2630,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 
 static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 				   struct page *page, int migratetype,
-				   unsigned int order)
+				   unsigned int order, fpi_t fpi_flags)
 {
 	int high, batch;
 	int pindex;
@@ -2631,6 +2665,14 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	}
 	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
 		pcp->free_count += (1 << order);
+
+	if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+		/*
+		 * Do not attempt to take a zone lock. Let pcp->count get
+		 * over high mark temporarily.
+		 */
+		return;
+	}
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2645,7 +2687,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 /*
  * Free a pcp page
  */
-void free_unref_page(struct page *page, unsigned int order)
+static void __free_unref_page(struct page *page, unsigned int order,
+			      fpi_t fpi_flags)
 {
 	unsigned long __maybe_unused UP_flags;
 	struct per_cpu_pages *pcp;
@@ -2654,7 +2697,7 @@ void free_unref_page(struct page *page, unsigned int order)
 	int migratetype;
 
 	if (!pcp_allowed_order(order)) {
-		__free_pages_ok(page, order, FPI_NONE);
+		__free_pages_ok(page, order, fpi_flags);
 		return;
 	}
 
@@ -2671,24 +2714,33 @@ void free_unref_page(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(page_zone(page), page, pfn, order, FPI_NONE);
+			free_one_page(page_zone(page), page, pfn, order, fpi_flags);
 			return;
 		}
 		migratetype = MIGRATE_MOVABLE;
 	}
 
 	zone = page_zone(page);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq())) {
+		add_page_to_zone_llist(zone, page, order);
+		return;
+	}
 	pcp_trylock_prepare(UP_flags);
 	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (pcp) {
-		free_unref_page_commit(zone, pcp, page, migratetype, order);
+		free_unref_page_commit(zone, pcp, page, migratetype, order, fpi_flags);
 		pcp_spin_unlock(pcp);
 	} else {
-		free_one_page(zone, page, pfn, order, FPI_NONE);
+		free_one_page(zone, page, pfn, order, fpi_flags);
 	}
 	pcp_trylock_finish(UP_flags);
 }
 
+void free_unref_page(struct page *page, unsigned int order)
+{
+	__free_unref_page(page, order, FPI_NONE);
+}
+
 /*
  * Free a batch of folios
  */
@@ -2777,7 +2829,7 @@ void free_unref_folios(struct folio_batch *folios)
 
 		trace_mm_page_free_batched(&folio->page);
 		free_unref_page_commit(zone, pcp, &folio->page, migratetype,
-				       order);
+				       order, FPI_NONE);
 	}
 
 	if (pcp) {
@@ -4853,6 +4905,17 @@ void __free_pages(struct page *page, unsigned int order)
 }
 EXPORT_SYMBOL(__free_pages);
 
+/*
+ * Can be called while holding raw_spin_lock or from IRQ and NMI,
+ * but only for pages that came from try_alloc_pages():
+ * order <= 3, !folio, etc
+ */
+void free_pages_nolock(struct page *page, unsigned int order)
+{
+	if (put_page_testzero(page))
+		__free_unref_page(page, order, FPI_TRYLOCK);
+}
+
 void free_pages(unsigned long addr, unsigned int order)
 {
 	if (addr != 0) {
```
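A hedged usage sketch (not part of the patch): how a caller that may run in NMI/IRQ context or under a raw_spin_lock could pair the two interfaces. The try_alloc_pages(nid, order) signature is assumed from the rest of the series, and the helper names below are made up for illustration.

```c
/*
 * Illustrative only: helper names and the try_alloc_pages() signature are
 * assumptions based on the series description, not code from this patch.
 */
static struct page *grab_scratch_page(int nid)
{
	/* Opportunistic: may return NULL when locks cannot be taken. */
	return try_alloc_pages(nid, 0);
}

static void release_scratch_page(struct page *page)
{
	if (!page)
		return;
	/*
	 * Safe from any context, including NMI and hard IRQ on RT: the page
	 * is freed immediately when the zone/pcp locks can be trylocked,
	 * otherwise it is queued on zone->trylock_free_pages and freed by a
	 * later free from a context that can take the lock.
	 */
	free_pages_nolock(page, 0);
}
```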