Message ID: 20220512085043.5234-1-mgorman@techsingularity.net
Series: Drain remote per-cpu directly v3
On Thu, 12 May 2022 09:50:37 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> Changelog since v2
> o More conversions from page->lru to page->[pcp_list|buddy_list]
> o Additional test results in changelogs
>
> Changelog since v1
> o Fix unsafe RT locking scheme
> o Use spin_trylock on UP PREEMPT_RT
>
> This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> per-cpu lists drain support" -- avoid interference of a high priority
> task due to a workqueue item draining per-cpu page lists. While many
> workloads can tolerate a brief interruption, it may be cause a real-time

s/may be/may/

> task runnning on a NOHZ_FULL CPU to miss a deadline and at minimum,

s/nnn/nn/

> the draining in non-deterministic.

s/n/s/;)

> Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> The local_lock on its own prevents migration and the IRQ disabling protects
> from corruption due to an interrupt arriving while a page allocation is
> in progress. The locking is inherently unsafe for remote access unless
> the CPU is hot-removed.

I don't understand the final sentence here. Which CPU and why does
hot-removing it make the locking safe?

> This series adjusts the locking. A spinlock is added to struct
> per_cpu_pages to protect the list contents while local_lock_irq continues
> to prevent migration and IRQ reentry. This allows a remote CPU to safely
> drain a remote per-cpu list.
>
> This series is a partial series. Follow-on work should allow the
> local_irq_save to be converted to a local_irq to avoid IRQs being
> disabled/enabled in most cases. Consequently, there are some TODO comments
> highlighting the places that would change if local_irq was used. However,
> there are enough corner cases that it deserves a series on its own
> separated by one kernel release and the priority right now is to avoid
> interference of high priority tasks.
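To make the scheme under review concrete, here is a minimal sketch of the
locking the cover letter describes. It is not the actual patch: the
pagesets local_lock name follows the mainline code of the time,
page->pcp_list is the union member the series introduces, and the list
handling is reduced to a single list for brevity.

    /*
     * Minimal sketch of the locking described above -- not the actual patch.
     * local_lock_irqsave() pins the task to the CPU and blocks IRQ reentry;
     * the new pcp->lock (a plain spinlock) is what actually protects the
     * list contents, so a remote CPU can also take it safely.
     */
    struct per_cpu_pages {
            spinlock_t lock;                /* added by this series */
            int count;
            struct list_head lists[1];      /* one list for illustration */
    };

    static struct page *rmqueue_pcplist_sketch(struct per_cpu_pages *pcp)
    {
            struct page *page = NULL;
            unsigned long flags;

            local_lock_irqsave(&pagesets.lock, flags);  /* migration + IRQ safety */
            spin_lock(&pcp->lock);                      /* list contents */
            if (!list_empty(&pcp->lists[0])) {
                    page = list_first_entry(&pcp->lists[0], struct page, pcp_list);
                    list_del(&page->pcp_list);
                    pcp->count--;
            }
            spin_unlock(&pcp->lock);
            local_unlock_irqrestore(&pagesets.lock, flags);
            return page;
    }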
Hi Mel,

On Thu, 2022-05-12 at 09:50 +0100, Mel Gorman wrote:

[...]

> include/linux/mm_types.h |   5 +
> include/linux/mmzone.h   |  12 +-
> mm/page_alloc.c          | 348 +++++++++++++++++++++++++--------------

Thanks for the series! I'm testing it ATM, both on vanilla and RT
kernels. I'll have the results by Monday.

Regards,
On Thu, May 12, 2022 at 12:43:25PM -0700, Andrew Morton wrote:
> On Thu, 12 May 2022 09:50:37 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
>
> [...]
>
> > workloads can tolerate a brief interruption, it may be cause a real-time
>
> s/may be/may/
>
> > task runnning on a NOHZ_FULL CPU to miss a deadline and at minimum,
>
> s/nnn/nn/

Correct.

> > the draining in non-deterministic.
>
> s/n/s/;)

Think that one is ok. At least spell check did not complain.

> > Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> > The local_lock on its own prevents migration and the IRQ disabling protects
> > from corruption due to an interrupt arriving while a page allocation is
> > in progress. The locking is inherently unsafe for remote access unless
> > the CPU is hot-removed.
>
> I don't understand the final sentence here. Which CPU and why does
> hot-removing it make the locking safe?

The sentence can be dropped because it adds little and is potentially
confusing. The PCP being safe to access remotely is specific to the
context of the CPU being hot-removed and there are other special corner
cases like zone_pcp_disable that modifies a per-cpu structure remotely
but not in a way that causes corruption.
On Fri, 13 May 2022 15:23:30 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> Correct.
>
> > > the draining in non-deterministic.
> >
> > s/n/s/;)
>
> Think that one is ok. At least spell check did not complain.

s/in/is/

> > > Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> > > [...]
> >
> > I don't understand the final sentence here. Which CPU and why does
> > hot-removing it make the locking safe?
>
> The sentence can be dropped because it adds little and is potentially
> confusing. The PCP being safe to access remotely is specific to the
> context of the CPU being hot-removed and there are other special corner
> cases like zone_pcp_disable that modifies a per-cpu structure remotely
> but not in a way that causes corruption.

OK. I pasted in your para from the other email. Current 0/n blurb:

Some setups, notably NOHZ_FULL CPUs, may be running realtime or
latency-sensitive applications that cannot tolerate interference due to
per-cpu drain work queued by __drain_all_pages(). Introduce a new
mechanism to remotely drain the per-cpu lists. It is made possible by
remotely locking 'struct per_cpu_pages' new per-cpu spinlocks. This has
two advantages, the time to drain is more predictable and other unrelated
tasks are not interrupted.

This series has the same intent as Nicolas' series "mm/page_alloc: Remote
per-cpu lists drain support" -- avoid interference of a high priority task
due to a workqueue item draining per-cpu page lists. While many workloads
can tolerate a brief interruption, it may cause a real-time task running
on a NOHZ_FULL CPU to miss a deadline and at minimum, the draining is
non-deterministic.

Currently an IRQ-safe local_lock protects the page allocator per-cpu
lists. The local_lock on its own prevents migration and the IRQ disabling
protects from corruption due to an interrupt arriving while a page
allocation is in progress.

This series adjusts the locking. A spinlock is added to struct
per_cpu_pages to protect the list contents while local_lock_irq continues
to prevent migration and IRQ reentry. This allows a remote CPU to safely
drain a remote per-cpu list.

This series is a partial series. Follow-on work should allow the
local_irq_save to be converted to a local_irq to avoid IRQs being
disabled/enabled in most cases. Consequently, there are some TODO
comments highlighting the places that would change if local_irq was used.
However, there are enough corner cases that it deserves a series on its
own separated by one kernel release and the priority right now is to avoid
interference of high priority tasks.

Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy
pages and when it is storing per-cpu pages.

Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly
speaking this is not necessary but it avoids per_cpu_pages consuming
another cache line.

Patch 3 is a preparation patch to avoid code duplication.

Patch 4 is a simple micro-optimisation that improves code flow necessary
for a later patch to avoid code duplication.

Patch 5 uses a spin_lock to protect the per_cpu_pages contents while
still relying on local_lock to prevent migration, stabilise the pcp
lookup and prevent IRQ reentrancy.

Patch 6 remote drains per-cpu pages directly instead of using a workqueue.
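A rough sketch of what patch 6's remote drain amounts to, per the blurb
above. The zone->per_cpu_pageset field follows mainline naming and
free_pcppages_bulk is the mainline helper with its arguments simplified;
the body is illustrative, not the actual patch.

    /*
     * Illustrative only: with the list contents guarded by pcp->lock, a
     * remote CPU can drain another CPU's lists directly rather than
     * queueing a work item on mm_percpu_wq and waiting for that CPU to
     * run it.
     */
    static void drain_zone_pages_remote_sketch(struct zone *zone, int cpu)
    {
            struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
            unsigned long flags;

            spin_lock_irqsave(&pcp->lock, flags);   /* remote list now stable */
            if (pcp->count)
                    free_pcppages_bulk(zone, pcp->count, pcp);  /* args simplified */
            spin_unlock_irqrestore(&pcp->lock, flags);
    }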
On Fri, May 13, 2022 at 12:38:05PM -0700, Andrew Morton wrote:
> > The sentence can be dropped because it adds little and is potentially
> > confusing. [...]
>
> OK. I pasted in your para from the other email. Current 0/n blurb:
>
> [...]

Looks good, thanks!
On Thu, May 12, 2022 at 09:50:37AM +0100, Mel Gorman wrote:

[...]

Reverting the whole series fixed an issue where offlining a memory
section blocks for hours on today's linux-next tree.

 __wait_rcu_gp
 synchronize_rcu at kernel/rcu/tree.c:3915
 lru_cache_disable at mm/swap.c:886
 __alloc_contig_migrate_range at mm/page_alloc.c:9078
 isolate_single_pageblock at mm/page_isolation.c:405
 start_isolate_page_range
 offline_pages
 memory_subsys_offline
 device_offline
 online_store
 dev_attr_store
 sysfs_kf_write
 kernfs_fop_write_iter
 new_sync_write
 vfs_write
 ksys_write
 __arm64_sys_write
 invoke_syscall
 el0_svc_common.constprop.0
 do_el0_svc
 el0_svc
 el0t_64_sync_handler
 el0t_64_sync

For full disclosure, I have also reverted the commit 0d523026abd4
("mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()"), so the
series can be reverted cleanly. But, I can't see how the commit
0d523026abd4 could cause this issue at all.
On Tue, May 17, 2022 at 07:35:07PM -0400, Qian Cai wrote:
> On Thu, May 12, 2022 at 09:50:37AM +0100, Mel Gorman wrote:
>
> [...]
>
> Reverting the whole series fixed an issue where offlining a memory
> section blocks for hours on today's linux-next tree.
>
>  __wait_rcu_gp
>  synchronize_rcu at kernel/rcu/tree.c:3915
>  lru_cache_disable at mm/swap.c:886
>  __alloc_contig_migrate_range at mm/page_alloc.c:9078
>  isolate_single_pageblock at mm/page_isolation.c:405
>  start_isolate_page_range
>  offline_pages
>  [...]
>
> For full disclosure, I have also reverted the commit 0d523026abd4
> ("mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()"), so the
> series can be reverted cleanly. But, I can't see how the commit
> 0d523026abd4 could cause this issue at all.

This is halting in __lru_add_drain_all where it calls synchronize_rcu
before the drain even happens. It's also an LRU drain and not PCP which
is what the series affects and the allocator doesn't use rcu. In a KVM
machine, I can do

$ for BANK in `(for i in {1..20}; do echo $((RANDOM%416)); done) | sort -n | uniq`; do BEFORE=`cat /sys/devices/system/memory/memory$BANK/online`; echo 0 > /sys/devices/system/memory/memory$BANK/online; AFTER=`cat /sys/devices/system/memory/memory$BANK/online`; printf "%4d %d -> %d\n" $BANK $BEFORE $AFTER; done
   3 1 -> 0
  57 1 -> 0
  74 1 -> 0
  93 1 -> 0
 101 1 -> 0
 128 1 -> 0
 133 1 -> 0
 199 1 -> 0
 223 1 -> 0
 225 1 -> 0
 229 1 -> 0
 243 1 -> 0
 263 1 -> 0
 300 1 -> 0
 309 1 -> 0
 329 1 -> 0
 355 1 -> 0
 365 1 -> 0
 372 1 -> 0
 383 1 -> 0

It offlines 20 sections although after several attempts free -m starts
reporting negative used memory so there is a bug of some description.
How are you testing this exactly? Is it every time or intermittent? Are
you confident that reverting the series makes the problem go away?
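For reference, the function the backtrace is stuck in looked roughly like
this in mainline at the time; this is a paraphrase of mm/swap.c, not a
verbatim copy.

    /*
     * Paraphrase of mm/swap.c lru_cache_disable() at the time. The
     * synchronize_rcu() shown in Qian's backtrace runs before any per-CPU
     * LRU draining starts, which is why the hang is independent of the
     * PCP locking changes in this series.
     */
    void lru_cache_disable(void)
    {
            atomic_inc(&lru_disable_count);
            /* Wait for every CPU to observe lru_disable_count != 0. */
            synchronize_rcu();              /* <- blocked here per the trace */
            /* Only then flush what the per-CPU LRU caches already hold. */
            __lru_add_drain_all(true);
    }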
On Wed, May 18, 2022 at 01:51:52PM +0100, Mel Gorman wrote:
>
> [...]
>
> It offlines 20 sections although after several attempts free -m starts
> reporting negative used memory so there is a bug of some description.
> How are you testing this exactly? Is it every time or intermittent? Are
> you confident that reverting the series makes the problem go away?

Cc'ing Paul. Either reverting this series or Paul's 3 patches below from
today's linux-next tree fixed the issue.

ca52639daa5b rcu-tasks: Drive synchronous grace periods from calling task
89ad98e93ce8 rcu-tasks: Move synchronize_rcu_tasks_generic() down
0d90e7225fb1 rcu-tasks: Split rcu_tasks_one_gp() from rcu_tasks_kthread()

It was reproduced by running the script below on an arm64 server. I can
reproduce it every time within 5 attempts. I noticed that when it happens,
we have a few rcu kthreads that are all stuck on this line,

        rcuwait_wait_event(&rtp->cbs_wait,
                           (needgpcb = rcu_tasks_need_gpcb(rtp)),
                           TASK_IDLE);

rcu_tasks_kthread
rcu_tasks_rude_kthread
rcu_tasks_trace_kthread

#!/usr/bin/env python3
# SPDX-License-Identifier: GPL-2.0

import os
import re
import subprocess


def mem_iter():
    base_dir = '/sys/devices/system/memory/'
    for curr_dir in os.listdir(base_dir):
        if re.match(r'memory\d+', curr_dir):
            yield base_dir + curr_dir


if __name__ == '__main__':
    print('- Try to remove each memory section and then add it back.')
    for mem_dir in mem_iter():
        status = f'{mem_dir}/online'
        if open(status).read().rstrip() == '1':
            # This could expectedly fail due to many reasons.
            section = os.path.basename(mem_dir)
            print(f'- Try to remove {section}.')
            proc = subprocess.run([f'echo 0 | sudo tee {status}'], shell=True)
            if proc.returncode == 0:
                print(f'- Try to add {section}.')
                subprocess.check_call([f'echo 1 | sudo tee {status}'], shell=True)
On Wed, May 18, 2022 at 12:27:22PM -0400, Qian Cai wrote:
>
> [...]
>
> Cc'ing Paul. Either reverting this series or Paul's 3 patches below from
> today's linux-next tree fixed the issue.
>
> ca52639daa5b rcu-tasks: Drive synchronous grace periods from calling task
> 89ad98e93ce8 rcu-tasks: Move synchronize_rcu_tasks_generic() down
> 0d90e7225fb1 rcu-tasks: Split rcu_tasks_one_gp() from rcu_tasks_kthread()
>
> It was reproduced by running the script below on an arm64 server. I can
> reproduce it every time within 5 attempts. I noticed that when it happens,
> we have a few rcu kthreads that are all stuck on this line,
>
>         rcuwait_wait_event(&rtp->cbs_wait,
>                            (needgpcb = rcu_tasks_need_gpcb(rtp)),
>                            TASK_IDLE);
>
> rcu_tasks_kthread
> rcu_tasks_rude_kthread
> rcu_tasks_trace_kthread

This is the normal state of these kthreads when there is nothing for
them to do. And unless you are removing tracing trampolines (kprobes,
ftrace, BPF), there should be nothing for them to do.

So does this python script somehow change the tracing state? (It does
not look to me like it does, but I could easily be missing something.)

Either way, is there something else waiting for these RCU flavors?
(There should not be.) Nevertheless, if so, there should be
a synchronize_rcu_tasks(), synchronize_rcu_tasks_rude(), or
synchronize_rcu_tasks_trace() on some other blocked task's stack
somewhere.

Or maybe something sleeps waiting for an RCU Tasks * callback to
be invoked. In that case (and in the above case, for that matter),
at least one of these pointers would be non-NULL on some CPU:

1.      rcu_tasks__percpu.cblist.head
2.      rcu_tasks_rude__percpu.cblist.head
3.      rcu_tasks_trace__percpu.cblist.head

The ->func field of the pointed-to structure contains a pointer to
the callback function, which will help work out what is going on.
(Most likely a wakeup being lost or not provided.)

Alternatively, if your system has hundreds of thousands of tasks and
you have attached BPF programs to short-lived socket structures and you
don't yet have the workaround, then you can see hangs. (I am working on a
longer-term fix.) In the short term, applying the workaround is the right
thing to do. (Adding a couple of the BPF guys on CC for their thoughts.)

Does any of that help?

                                                        Thanx, Paul
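One way to inspect the three pointers listed above is a throwaway debug
hook. This sketch assumes placement in kernel/rcu/tasks.h, where the
static rcu_tasks__percpu structures are in scope; dump_rcu_tasks_cbs() is
a hypothetical name, and the unlocked READ_ONCE() is only good enough for
a debug printout.

    /*
     * Hypothetical debug helper for the suggestion above: report any
     * queued RCU Tasks callback and its function on each CPU. Repeat
     * for the rude and trace flavors by substituting their per-CPU
     * structures.
     */
    static void dump_rcu_tasks_cbs(void)
    {
            int cpu;

            for_each_possible_cpu(cpu) {
                    struct rcu_tasks_percpu *rtpcp =
                            per_cpu_ptr(&rcu_tasks__percpu, cpu);
                    struct rcu_head *rhp = READ_ONCE(rtpcp->cblist.head);

                    if (rhp)
                            pr_info("CPU %d: pending cb %pS\n", cpu, rhp->func);
            }
    }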
On Tue, May 17, 2022 at 07:35:07PM -0400, Qian Cai wrote:
>
> [...]
>
> Reverting the whole series fixed an issue where offlining a memory
> section blocks for hours on today's linux-next tree.
>
> [...]
>
> For full disclosure, I have also reverted the commit 0d523026abd4
> ("mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()"), so the
> series can be reverted cleanly. But, I can't see how the commit
> 0d523026abd4 could cause this issue at all.

Hi Qian,

The issue is probably due to lack of the following:

https://lore.kernel.org/linux-mm/YmrWK%2FKoU1zrAxPI@fuller.cnet/

Can you please give the patch on the URL a try?

Thanks!
On Wed, May 18, 2022 at 02:26:08PM -0300, Marcelo Tosatti wrote:
>
> [...]
>
> Hi Qian,
>
> The issue is probably due to lack of the following:
>
> https://lore.kernel.org/linux-mm/YmrWK%2FKoU1zrAxPI@fuller.cnet/
>
> Can you please give the patch on the URL a try?
>
> Thanks!

Oops, sorry, I don't think the above URL has anything to do with this
problem.
On Thu, 2022-05-12 at 09:50 +0100, Mel Gorman wrote:

[...]

FWIW tested this against our RT+nohz_full workloads. I can have another
go if the locking scheme changes.

Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>

Thanks,
On Wed, May 18, 2022 at 10:15:03AM -0700, Paul E. McKenney wrote:
> So does this python script somehow change the tracing state? (It does
> not look to me like it does, but I could easily be missing something.)

No, I don't think so either. It pretty much just offlines memory
sections one at a time.

> Either way, is there something else waiting for these RCU flavors?
> (There should not be.) Nevertheless, if so, there should be
> a synchronize_rcu_tasks(), synchronize_rcu_tasks_rude(), or
> synchronize_rcu_tasks_trace() on some other blocked task's stack
> somewhere.

There are only three blocked tasks when this happens. The kmemleak_scan()
is just the victim waiting for the locks taken by the stuck
offline_pages()->synchronize_rcu() task.

 task:kmemleak        state:D stack:25824 pid: 1033 ppid:     2 flags:0x00000008
 Call trace:
  __switch_to
  __schedule
  schedule
  percpu_rwsem_wait
  __percpu_down_read
  percpu_down_read.constprop.0
  get_online_mems
  kmemleak_scan
  kmemleak_scan_thread
  kthread
  ret_from_fork

 task:cppc_fie        state:D stack:23472 pid: 1848 ppid:     2 flags:0x00000008
 Call trace:
  __switch_to
  __schedule
  lockdep_recursion

 task:tee             state:D stack:24816 pid:16733 ppid: 16732 flags:0x0000020c
 Call trace:
  __switch_to
  __schedule
  schedule
  schedule_timeout
  __wait_for_common
  wait_for_completion
  __wait_rcu_gp
  synchronize_rcu
  lru_cache_disable
  __alloc_contig_migrate_range
  isolate_single_pageblock
  start_isolate_page_range
  offline_pages
  memory_subsys_offline
  device_offline
  online_store
  dev_attr_store
  sysfs_kf_write
  kernfs_fop_write_iter
  new_sync_write
  vfs_write
  ksys_write
  __arm64_sys_write
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync

> Or maybe something sleeps waiting for an RCU Tasks * callback to
> be invoked. In that case (and in the above case, for that matter),
> at least one of these pointers would be non-NULL on some CPU:
>
> 1.      rcu_tasks__percpu.cblist.head
> 2.      rcu_tasks_rude__percpu.cblist.head
> 3.      rcu_tasks_trace__percpu.cblist.head
>
> The ->func field of the pointed-to structure contains a pointer to
> the callback function, which will help work out what is going on.
> (Most likely a wakeup being lost or not provided.)

What would be some of the easy ways to find out those? I can't see
anything interesting from the output of sysrq-t.

> Alternatively, if your system has hundreds of thousands of tasks and
> you have attached BPF programs to short-lived socket structures and you
> don't yet have the workaround, then you can see hangs. (I am working on a
> longer-term fix.) In the short term, applying the workaround is the right
> thing to do. (Adding a couple of the BPF guys on CC for their thoughts.)

The system is pretty much idle after a fresh reboot. The only workload is
to run the script.
On Thu, May 19, 2022 at 09:29:45AM -0400, Qian Cai wrote:
> On Wed, May 18, 2022 at 10:15:03AM -0700, Paul E. McKenney wrote:
> > So does this python script somehow change the tracing state? (It does
> > not look to me like it does, but I could easily be missing something.)
>
> No, I don't think so either. It pretty much just offlines memory
> sections one at a time.

No idea.

> > Either way, is there something else waiting for these RCU flavors?
> > [...]
>
> There are only three blocked tasks when this happens. The kmemleak_scan()
> is just the victim waiting for the locks taken by the stuck
> offline_pages()->synchronize_rcu() task.

OK, then I believe that the RCU Tasks flavors were innocent bystanders.

Is the task doing offline_pages()->synchronize_rcu() doing this
repeatedly? Or is there a stalled RCU grace period? (From what
I can see, offline_pages() is not doing huge numbers of calls to
synchronize_rcu() in any of its loops, but I freely admit that I do not
know this code.)

If repeatedly, one workaround is to use synchronize_rcu_expedited()
instead of synchronize_rcu(). A better fix might be to batch the
grace periods, so that one RCU grace period serves several page offline
operations. An alternative better fix might be to use call_rcu() instead
of synchronize_rcu().

>  task:kmemleak        state:D stack:25824 pid: 1033 ppid:     2 flags:0x00000008
>  Call trace:
>   __switch_to
>   __schedule
>   schedule
>   percpu_rwsem_wait
>   __percpu_down_read
>   percpu_down_read.constprop.0
>   get_online_mems

This is read-acquiring the mem_hotplug_lock. It looks like
offline_pages() write-acquires this same lock.

>   kmemleak_scan
>   kmemleak_scan_thread
>   kthread
>   ret_from_fork
>
>  task:cppc_fie        state:D stack:23472 pid: 1848 ppid:     2 flags:0x00000008
>  Call trace:
>   __switch_to
>   __schedule
>   lockdep_recursion
>
>  task:tee             state:D stack:24816 pid:16733 ppid: 16732 flags:0x0000020c
>  Call trace:
>   __switch_to
>   __schedule
>   schedule
>   schedule_timeout
>   __wait_for_common
>   wait_for_completion
>   __wait_rcu_gp
>   synchronize_rcu

So, yes, this is sleeping holding the lock that kmemleak_scan wants
to acquire.

>   lru_cache_disable
>   __alloc_contig_migrate_range
>   isolate_single_pageblock
>   start_isolate_page_range
>   offline_pages
>   [...]
>
> > Or maybe something sleeps waiting for an RCU Tasks * callback to
> > be invoked. [...]
>
> What would be some of the easy ways to find out those? I can't see
> anything interesting from the output of sysrq-t.

Again, I believe that these are victims of circumstance. Though that
does not explain why reverting those three patches makes things work
better.

Or is it possible that reverting those three patches simply decreases
the probability of failure, rather than eliminating the failure?
Such a decrease could be due to many things, for example, changes to
offsets and sizes of data structures.

> > Alternatively, if your system has hundreds of thousands of tasks and
> > [...]
>
> The system is pretty much idle after a fresh reboot. The only workload
> is to run the script.

Do you ever see RCU CPU stall warnings?

Could you please trace the offline_pages() function? Is it really stuck,
or is it being invoked periodically during the hang?

                                                        Thanx, Paul
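To illustrate the call_rcu() alternative mentioned above: instead of
sleeping in synchronize_rcu(), the post-grace-period step runs from a
callback. Everything below except call_rcu() itself is hypothetical; the
caller must be able to tolerate the follow-up work running asynchronously.
(The synchronize_rcu_expedited() workaround is simpler still: a one-line
substitution at the call site.)

    /*
     * Shape of a synchronize_rcu() -> call_rcu() conversion. The request
     * structure and deferred_step() are made up purely for illustration.
     */
    struct drain_request {
            struct rcu_head rhp;
            /* ... state the deferred step needs ... */
    };

    static void deferred_step(struct rcu_head *rhp)
    {
            struct drain_request *req =
                    container_of(rhp, struct drain_request, rhp);

            /* Runs only after a full grace period has elapsed. */
            kfree(req);
    }

    static void start_deferred_step(struct drain_request *req)
    {
            call_rcu(&req->rhp, deferred_step);     /* returns immediately */
    }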
On Thu, May 19, 2022 at 12:15:24PM -0700, Paul E. McKenney wrote:
> Is the task doing offline_pages()->synchronize_rcu() doing this
> repeatedly? Or is there a stalled RCU grace period? (From what
> I can see, offline_pages() is not doing huge numbers of calls to
> synchronize_rcu() in any of its loops, but I freely admit that I do not
> know this code.)

Yes, we are running into an endless loop in isolate_single_pageblock().
There was a similar issue not long ago, so I am wondering if we did not
solve it entirely then. Anyway, I will continue the thread over there.

https://lore.kernel.org/all/YoavU%2F+NfQIzQiDF@qian/

> Or is it possible that reverting those three patches simply decreases
> the probability of failure, rather than eliminating the failure?
> Such a decrease could be due to many things, for example, changes to
> offsets and sizes of data structures.

Entirely possible. Sorry for the false alarm.

> Do you ever see RCU CPU stall warnings?

No.
On Thu, May 19, 2022 at 05:05:04PM -0400, Qian Cai wrote:
>
> Yes, we are running into an endless loop in isolate_single_pageblock().
> There was a similar issue not long ago, so I am wondering if we did not
> solve it entirely then. Anyway, I will continue the thread over there.
>
> https://lore.kernel.org/all/YoavU%2F+NfQIzQiDF@qian/

I do know that feeling.

> > Or is it possible that reverting those three patches simply decreases
> > the probability of failure, rather than eliminating the failure?
> > [...]
>
> Entirely possible. Sorry for the false alarm.

Not a problem!

> > Do you ever see RCU CPU stall warnings?
>
> No.

OK, then perhaps a sequence of offline_pages() calls. Hmmm...

The percpu_up_write() function sets ->block to zero before awakening
waiters. Given wakeup latencies, might this allow an only somewhat
unfortunate sequence of events to allow offline_pages() to starve
readers? Or is there something I am missing that prevents this from
happening?

                                                        Thanx, Paul
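A simplified rendering of the window being asked about; this abstracts
kernel/locking/percpu-rwsem.c rather than quoting it, and the function
name is made up.

    /*
     * Simplified sketch of the percpu_up_write() ordering in question:
     * ->block is cleared before the waiters are woken, so a back-to-back
     * writer such as a tight offline_pages() loop could conceivably
     * re-take ->block before the just-woken readers ever run.
     */
    void percpu_up_write_sketch(struct percpu_rw_semaphore *sem)
    {
            atomic_set_release(&sem->block, 0);  /* writer releases the lock */
            /* <- window: a new writer may set ->block again right here */
            wake_up(&sem->waiters);              /* readers finally wake */
    }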
On Thu, May 12, 2022 at 09:50:37AM +0100, Mel Gorman wrote:

[...]

Mel, we saw a spontaneous "mm_percpu_wq" crash on today's linux-next tree
while running CPU offlining/onlining, and wondering if you have any
thoughts?

WARNING: CPU: 31 PID: 173 at kernel/kthread.c:524 __kthread_bind_mask
CPU: 31 PID: 173 Comm: kworker/31:0 Not tainted 5.18.0-next-20220526-dirty #127
Workqueue: 0x0 (mm_percpu_wq)
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __kthread_bind_mask
lr : __kthread_bind_mask
sp : ffff800018667c50
x29: ffff800018667c50 x28: ffff800018667d20 x27: ffff083678bc2458
x26: 1fffe1002f5b17a8 x25: ffff08017ad8bd40 x24: 1fffe106cf17848b
x23: 1ffff000030ccfa0 x22: ffff0801de2d1ac0 x21: ffff0801de2d1ac0
x20: ffff07ff80286f08 x19: ffff0801de2d1ac0 x18: ffffd6056a577d1c
x17: ffffffffffffffff x16: 1fffe0fff158eb18 x15: 1fffe106cf176138
x14: 000000000000f1f1 x13: 00000000f3f3f3f3 x12: ffff7000030ccf3b
x11: 1ffff000030ccf3a x10: ffff7000030ccf3a x9 : dfff800000000000
x8 : ffff8000186679d7 x7 : 0000000000000001 x6 : ffff7000030ccf3a
x5 : 1ffff000030ccf39 x4 : 1ffff000030ccf4e x3 : 0000000000000000
x2 : 0000000000000000 x1 : ffff07ff8ac74fc0 x0 : 0000000000000000
Call trace:
 __kthread_bind_mask
 kthread_bind_mask
 create_worker
 worker_thread
 kthread
 ret_from_fork
irq event stamp: 146
hardirqs last enabled at (145): _raw_spin_unlock_irqrestore
hardirqs last disabled at (146): el1_dbg
softirqs last enabled at (0): copy_process
softirqs last disabled at (0): 0x0

WARNING: CPU: 31 PID: 173 at kernel/kthread.c:593 kthread_set_per_cpu
CPU: 31 PID: 173 Comm: kworker/31:0 Tainted: G        W 5.18.0-next-20220526-dirty #127
Workqueue: 0x0 (mm_percpu_wq)
pstate: 10400009 (nzcV daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : kthread_set_per_cpu
lr : worker_attach_to_pool
sp : ffff800018667be0
x29: ffff800018667be0 x28: ffff800018667d20 x27: ffff083678bc2458
x26: 1fffe1002f5b17a8 x25: ffff08017ad8bd40 x24: 1fffe106cf17848b
x23: 1fffe1003bc5a35d x22: ffff0801de2d1aec x21: 0000000000000007
x20: ffff4026d8adae00 x19: ffff0801de2d1ac0 x18: ffffd6056a577d1c
x17: ffffffffffffffff x16: 1fffe0fff158eb18 x15: 1fffe106cf176138
x14: 000000000000f1f1 x13: 00000000f3f3f3f3 x12: ffff7000030ccf53
x11: 1ffff000030ccf52 x10: ffff7000030ccf52 x9 : ffffd60563f9a038
x8 : ffff800018667a97 x7 : 0000000000000001 x6 : ffff7000030ccf52
x5 : ffff800018667a90 x4 : ffff7000030ccf53 x3 : 1fffe1003bc5a408
x2 : 0000000000000000 x1 : 000000000000001f x0 : 0000000000208060
Call trace:
 kthread_set_per_cpu
 worker_attach_to_pool at kernel/workqueue.c:1873
 create_worker
 worker_thread
 kthread
 ret_from_fork
irq event stamp: 146
hardirqs last enabled at (145): _raw_spin_unlock_irqrestore
hardirqs last disabled at (146): el1_dbg
softirqs last enabled at (0): copy_process
softirqs last disabled at (0): 0x0

Unable to handle kernel paging request at virtual address dfff800000000003
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
Mem abort info:
 ESR = 0x0000000096000004
 EC = 0x25: DABT (current EL), IL = 32 bits
 SET = 0, FnV = 0
 EA = 0, S1PTW = 0
 FSC = 0x04: level 0 translation fault
Data abort info:
 ISV = 0, ISS = 0x00000004
 CM = 0, WnR = 0
[dfff800000000003] address between user and kernel address ranges
Internal error: Oops: 96000004 [#1] PREEMPT SMP
CPU: 83 PID: 23994 Comm: kworker/31:2 Not tainted 5.18.0-next-20220526-dirty #127
pstate: 104000c9 (nzcV daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __lock_acquire
lr : lock_acquire.part.0
sp : ffff800071777ac0
x29: ffff800071777ac0 x28: ffffd60563fa6380 x27: 0000000000000018
x26: 0000000000000080 x25: 0000000000000018 x24: 0000000000000000
x23: ffff0801de2d1ac0 x22: ffffd6056a66a7e0 x21: 0000000000000000
x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000767
x17: 0000000000000000 x16: 1fffe1003bc5a473 x15: 1fffe806c88e9338
x14: 000000000000f1f1 x13: 00000000f3f3f3f3 x12: ffff0801de2d1ac8
x11: 1ffffac0ad4aefa3 x10: ffffd6056a577d18 x9 : 0000000000000000
x8 : 0000000000000003 x7 : ffffd60563fa6380 x6 : 0000000000000000
x5 : 0000000000000080 x4 : 0000000000000001 x3 : 0000000000000000
x2 : 0000000000000000 x1 : 0000000000000003 x0 : dfff800000000000
Call trace:
 __lock_acquire at kernel/locking/lockdep.c:4923
 lock_acquire
 _raw_spin_lock_irq
 worker_thread at kernel/workqueue.c:2389
 kthread
 ret_from_fork
Code: d65f03c0 d343ff61 d2d00000 f2fbffe0 (38e06820)
---[ end trace 0000000000000000 ]---
[ 1424.464630][T23994] Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
Kernel Offset: 0x56055bdf0000 from 0xffff800008000000
PHYS_OFFSET: 0x80000000
CPU features: 0x000,0042e015,19801c82
Memory Limit: none
On Thu, May 26, 2022 at 01:19:38PM -0400, Qian Cai wrote:
>
> [...]
>
> Mel, we saw a spontaneous "mm_percpu_wq" crash on today's linux-next tree
> while running CPU offlining/onlining, and wondering if you have any
> thoughts?

Do you think it's related to the series and if so why? From the warning,
it's not obvious to me why it would be given that it's a warning about a
task not being inactive when it is expected to be.
On Fri, May 27, 2022 at 09:39:46AM +0100, Mel Gorman wrote:
> Do you think it's related to the series and if so why? From the warning,
> it's not obvious to me why it would be given that it's a warning about a
> task not being inactive when it is expected to be.

I don't know. I just saw the 6/6 patch touched the mm_percpu_wq. Anyway,
I'll spend more time to reproduce, and see if we are lucky.