Message ID: 20210414133931.4555-1-mgorman@techsingularity.net (mailing list archive)
Series: Use local_lock for pcp protection and reduce stat overhead
On 4/14/21 3:39 PM, Mel Gorman wrote:
> Historically when freeing pages, free_one_page() assumed that callers
> had IRQs disabled and the zone->lock could be acquired with spin_lock().
> This confuses the scope of what local_lock_irq is protecting and what
> zone->lock is protecting in free_unref_page_list in particular.
>
> This patch uses spin_lock_irqsave() for the zone->lock in
> free_one_page() instead of relying on callers to have disabled
> IRQs. free_unref_page_commit() is changed to only deal with PCP pages
> protected by the local lock. free_unref_page_list() then first frees
> isolated pages to the buddy lists with free_one_page() and frees the rest
> of the pages to the PCP via free_unref_page_commit(). The end result
> is that free_one_page() is no longer depending on side-effects of
> local_lock to be correct.
>
> Note that this may incur a performance penalty while memory hot-remove
> is running but that is not a common operation.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

A nit below:

> @@ -3294,6 +3295,7 @@ void free_unref_page_list(struct list_head *list)
>  	struct page *page, *next;
>  	unsigned long flags, pfn;
>  	int batch_count = 0;
> +	int migratetype;
>
>  	/* Prepare pages for freeing */
>  	list_for_each_entry_safe(page, next, list, lru) {
> @@ -3301,15 +3303,28 @@ void free_unref_page_list(struct list_head *list)
>  		if (!free_unref_page_prepare(page, pfn))
>  			list_del(&page->lru);
>  		set_page_private(page, pfn);

Should probably move this below so we don't set private for pages that then go
through free_one_page()? Doesn't seem to be a bug, just unnecessary.

> +
> +		/*
> +		 * Free isolated pages directly to the allocator, see
> +		 * comment in free_unref_page.
> +		 */
> +		migratetype = get_pcppage_migratetype(page);
> +		if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
> +			if (unlikely(is_migrate_isolate(migratetype))) {
> +				free_one_page(page_zone(page), page, pfn, 0,
> +					      migratetype, FPI_NONE);
> +				list_del(&page->lru);
> +			}
> +		}
>  	}
>
>  	local_lock_irqsave(&pagesets.lock, flags);
>  	list_for_each_entry_safe(page, next, list, lru) {
> -		unsigned long pfn = page_private(page);
> -
> +		pfn = page_private(page);
>  		set_page_private(page, 0);
> +		migratetype = get_pcppage_migratetype(page);
>  		trace_mm_page_free_batched(page);
> -		free_unref_page_commit(page, pfn);
> +		free_unref_page_commit(page, pfn, migratetype);
>
>  		/*
>  		 * Guard against excessive IRQ disabled times when we get
>
On Thu, Apr 15, 2021 at 02:25:36PM +0200, Vlastimil Babka wrote:
> > @@ -3294,6 +3295,7 @@ void free_unref_page_list(struct list_head *list)
> >  	struct page *page, *next;
> >  	unsigned long flags, pfn;
> >  	int batch_count = 0;
> > +	int migratetype;
> >
> >  	/* Prepare pages for freeing */
> >  	list_for_each_entry_safe(page, next, list, lru) {
> > @@ -3301,15 +3303,28 @@ void free_unref_page_list(struct list_head *list)
> >  		if (!free_unref_page_prepare(page, pfn))
> >  			list_del(&page->lru);
> >  		set_page_private(page, pfn);
>
> Should probably move this below so we don't set private for pages that then go
> through free_one_page()? Doesn't seem to be a bug, just unnecessary.
>

Sure.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1d87ca364680..a9c1282d9c7b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3293,7 +3293,6 @@ void free_unref_page_list(struct list_head *list)
 		pfn = page_to_pfn(page);
 		if (!free_unref_page_prepare(page, pfn))
 			list_del(&page->lru);
-		set_page_private(page, pfn);
 
 		/*
 		 * Free isolated pages directly to the allocator, see
@@ -3307,6 +3306,8 @@ void free_unref_page_list(struct list_head *list)
 				list_del(&page->lru);
 			}
 		}
+
+		set_page_private(page, pfn);
 	}
 
 	local_lock_irqsave(&pagesets.lock, flags);
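For readers following the thread outside the kernel tree, the control flow produced by the patch plus the fixup above can be sketched as a small userspace toy. This is not kernel code: fake_page, free_directly and commit_to_pcp are invented names, and a pthread mutex stands in for the pagesets local_lock. It illustrates the point discussed above: isolated entries are freed directly and unlinked during the prepare pass, so only entries that survive to the locked commit pass get their private field stashed.

/* Toy userspace model only; none of these types or helpers exist in the kernel. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_page {
	struct fake_page *next;
	unsigned long pfn;
	unsigned long private;	/* stands in for page->private */
	bool isolated;		/* stands in for is_migrate_isolate() */
};

/* Stands in for the per-CPU pagesets local_lock. */
static pthread_mutex_t pcp_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models free_one_page(): bypasses the PCP, never needs the stashed pfn. */
static void free_directly(struct fake_page *page)
{
	printf("buddy free pfn %lu\n", page->pfn);
}

/* Models free_unref_page_commit(): only ever called under the lock. */
static void commit_to_pcp(struct fake_page *page, unsigned long pfn)
{
	(void)page;
	printf("pcp free pfn %lu\n", pfn);
}

static void free_page_list(struct fake_page **list)
{
	struct fake_page **link = list, *page;

	/* Pass 1: prepare. Isolated pages are freed and unlinked here,
	 * so only pages that continue to pass 2 get private set. */
	while ((page = *link) != NULL) {
		if (page->isolated) {
			free_directly(page);
			*link = page->next;		/* list_del() equivalent */
			continue;
		}
		page->private = page->pfn;		/* set_page_private() moved here */
		link = &page->next;
	}

	/* Pass 2: everything left goes to the PCP under the lock. */
	pthread_mutex_lock(&pcp_lock);
	for (page = *list; page; page = page->next) {
		unsigned long pfn = page->private;

		page->private = 0;
		commit_to_pcp(page, pfn);
	}
	pthread_mutex_unlock(&pcp_lock);
}

int main(void)
{
	struct fake_page pages[3] = {
		{ &pages[1], 100, 0, false },
		{ &pages[2], 101, 0, true  },
		{ NULL,      102, 0, false },
	};
	struct fake_page *list = &pages[0];

	free_page_list(&list);
	return 0;
}

Moving the set_page_private() equivalent below the isolated-page check is exactly the shuffle in the fixup: the directly-freed entries never needed the stashed pfn in the first place.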
On 4/14/21 3:39 PM, Mel Gorman wrote:
> struct per_cpu_pages is protected by the pagesets lock but it can be
> embedded within struct per_cpu_pages at a minor cost. This is possible
> because per-cpu lookups are based on offsets. Paraphrasing an explanation
> from Peter Zijlstra
>
>   The whole thing relies on:
>
>   &per_cpu_ptr(msblk->stream, cpu)->lock == per_cpu_ptr(&msblk->stream->lock, cpu)
>
>   Which is true because the lhs:
>
>     (local_lock_t *)((zone->per_cpu_pages + per_cpu_offset(cpu)) + offsetof(struct per_cpu_pages, lock))
>
>   and the rhs:
>
>     (local_lock_t *)((zone->per_cpu_pages + offsetof(struct per_cpu_pages, lock)) + per_cpu_offset(cpu))
>
>   are identical, because addition is associative.
>
> More details are included in mmzone.h. This embedding is not completely
> free for three reasons.
>
> 1. As local_lock does not return a per-cpu structure, the PCP has to
>    be looked up twice -- first to acquire the lock and again to get the
>    PCP pointer.
>
> 2. For PREEMPT_RT and CONFIG_DEBUG_LOCK_ALLOC, local_lock is potentially
>    a spinlock or has lock-specific tracking. In both cases, it becomes
>    necessary to release/acquire different locks when freeing a list of
>    pages in free_unref_page_list.

Looks like this pattern could benefit from a local_lock API helper that would
do the right thing? It probably couldn't optimize the CONFIG_PREEMPT_RT case
much, which would need to unlock/lock in any case, but CONFIG_DEBUG_LOCK_ALLOC
could perhaps just keep the IRQs disabled and only note the change of what's
acquired?

> 3. For most kernel configurations, local_lock_t is empty and no storage is
>    required. By embedding the lock, the memory consumption on PREEMPT_RT
>    and CONFIG_DEBUG_LOCK_ALLOC is higher.

But I wonder, is there really a benefit to this increased complexity? Before
the patch we had "pagesets" - a local_lock that protects all zones' pcplists.
Now each zone's pcplists have their own local_lock. On !PREEMPT_RT we will
never take the locks of multiple zones from the same CPU in parallel, because
we use local_lock_irqsave(). Can that parallelism happen on PREEMPT_RT,
because that could perhaps justify the change?

> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
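The identity quoted from Peter Zijlstra can be checked with a few lines of userspace C. NR_CPUS, struct pcp and cpu_base below are invented for the demo, and per_cpu_offset() is only mimicked with an ordinary array, but the pointer arithmetic is the same associativity argument.

/*
 * Userspace check of the quoted identity: taking the per-CPU copy first
 * and then the member, or the member first and then the per-CPU copy,
 * lands on the same address because the two offsets simply add.
 */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NR_CPUS 4

struct pcp {
	int lock;	/* stands in for the embedded local_lock_t */
	int count;
	int high;
};

static struct pcp cpu_base[NR_CPUS];	/* stands in for the per-CPU area */

/* Crude model of per_cpu_offset(cpu): byte distance to CPU cpu's copy. */
static ptrdiff_t per_cpu_offset(int cpu)
{
	return (char *)&cpu_base[cpu] - (char *)&cpu_base[0];
}

int main(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		/* lhs: &per_cpu_ptr(base, cpu)->lock */
		int *lhs = (int *)((char *)&cpu_base[0] + per_cpu_offset(cpu)
				   + offsetof(struct pcp, lock));
		/* rhs: per_cpu_ptr(&base->lock, cpu) */
		int *rhs = (int *)((char *)&cpu_base[0] + offsetof(struct pcp, lock)
				   + per_cpu_offset(cpu));

		assert(lhs == rhs);
		printf("cpu %d: %p == %p\n", cpu, (void *)lhs, (void *)rhs);
	}
	return 0;
}

Both sides compute base + per_cpu_offset(cpu) + offsetof(struct pcp, lock), just grouped differently, which is why the lock can be embedded and still addressed through either expression.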
On Thu, Apr 15, 2021 at 04:53:46PM +0200, Vlastimil Babka wrote:
> On 4/14/21 3:39 PM, Mel Gorman wrote:
> > struct per_cpu_pages is protected by the pagesets lock but it can be
> > embedded within struct per_cpu_pages at a minor cost. This is possible
> > because per-cpu lookups are based on offsets. Paraphrasing an explanation
> > from Peter Zijlstra
> >
> >   The whole thing relies on:
> >
> >   &per_cpu_ptr(msblk->stream, cpu)->lock == per_cpu_ptr(&msblk->stream->lock, cpu)
> >
> >   Which is true because the lhs:
> >
> >     (local_lock_t *)((zone->per_cpu_pages + per_cpu_offset(cpu)) + offsetof(struct per_cpu_pages, lock))
> >
> >   and the rhs:
> >
> >     (local_lock_t *)((zone->per_cpu_pages + offsetof(struct per_cpu_pages, lock)) + per_cpu_offset(cpu))
> >
> >   are identical, because addition is associative.
> >
> > More details are included in mmzone.h. This embedding is not completely
> > free for three reasons.
> >
> > 1. As local_lock does not return a per-cpu structure, the PCP has to
> >    be looked up twice -- first to acquire the lock and again to get the
> >    PCP pointer.
> >
> > 2. For PREEMPT_RT and CONFIG_DEBUG_LOCK_ALLOC, local_lock is potentially
> >    a spinlock or has lock-specific tracking. In both cases, it becomes
> >    necessary to release/acquire different locks when freeing a list of
> >    pages in free_unref_page_list.
>
> Looks like this pattern could benefit from a local_lock API helper that would
> do the right thing? It probably couldn't optimize the CONFIG_PREEMPT_RT case
> much, which would need to unlock/lock in any case, but CONFIG_DEBUG_LOCK_ALLOC
> could perhaps just keep the IRQs disabled and only note the change of what's
> acquired?
>

A helper could potentially be used, but right now there is only one call-site
that needs this type of care, so it may be overkill. A helper was proposed
that can look up and lock a per-cpu structure; that is generally useful but
does not suit the case where different locks need to be acquired.

> > 3. For most kernel configurations, local_lock_t is empty and no storage is
> >    required. By embedding the lock, the memory consumption on PREEMPT_RT
> >    and CONFIG_DEBUG_LOCK_ALLOC is higher.
>
> But I wonder, is there really a benefit to this increased complexity? Before
> the patch we had "pagesets" - a local_lock that protects all zones' pcplists.
> Now each zone's pcplists have their own local_lock. On !PREEMPT_RT we will
> never take the locks of multiple zones from the same CPU in parallel, because
> we use local_lock_irqsave(). Can that parallelism happen on PREEMPT_RT,
> because that could perhaps justify the change?
>

I don't think PREEMPT_RT gets additional parallelism because it's still a
per-cpu structure that is being protected. The difference is whether we are
protecting the CPU-N index for all per_cpu_pages or just one. The patch exists
because it was asked why the lock was not embedded within the structure it's
protecting. I initially thought that was unsafe, and I was wrong, as explained
in the changelog. I have since found that it *can* be done, but it's a bit
ugly, so I put it at the end of the series so it can be dropped if necessary.
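As a rough userspace illustration of point 3 above (not kernel code: the struct names are invented, and it leans on the GCC/Clang extension that an empty struct has sizeof 0, the same extension the empty local_lock_t relies on), embedding an empty lock type adds no storage, while a lock type that carries real state, as under PREEMPT_RT or CONFIG_DEBUG_LOCK_ALLOC, grows every per-CPU copy of the structure:

#include <stdio.h>

/* Invented names; sizes are illustrative, not the kernel's. */
struct empty_lock { };			/* models local_lock_t on !PREEMPT_RT, !DEBUG_LOCK_ALLOC */
struct debug_lock { void *owner; };	/* models a lock type that carries per-lock state */

struct pcp_plain {			/* structure with an empty embedded lock */
	struct empty_lock lock;
	int count, high, batch;
};

struct pcp_debug {			/* the same structure with a stateful lock */
	struct debug_lock lock;
	int count, high, batch;
};

int main(void)
{
	printf("empty lock type:      %zu bytes\n", sizeof(struct empty_lock));
	printf("pcp with empty lock:  %zu bytes\n", sizeof(struct pcp_plain));
	printf("pcp with debug lock:  %zu bytes\n", sizeof(struct pcp_debug));
	return 0;
}

On the common configurations the embedded lock is therefore free, which is the trade-off the changelog describes: the extra memory only appears on the debug and RT configurations.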