Message ID | 20230213123444.155149-1-ying.huang@intel.com (mailing list archive)
---|---
Series | migrate_pages(): batch TLB flushing
On Mon, 13 Feb 2023, Huang Ying wrote: > From: "Huang, Ying" <ying.huang@intel.com> > > Now, migrate_pages() migrate folios one by one, like the fake code as > follows, > > for each folio > unmap > flush TLB > copy > restore map > > If multiple folios are passed to migrate_pages(), there are > opportunities to batch the TLB flushing and copying. That is, we can > change the code to something as follows, > > for each folio > unmap > for each folio > flush TLB > for each folio > copy > for each folio > restore map > > The total number of TLB flushing IPI can be reduced considerably. And > we may use some hardware accelerator such as DSA to accelerate the > folio copying. > > So in this patch, we refactor the migrate_pages() implementation and > implement the TLB flushing batching. Base on this, hardware > accelerated folio copying can be implemented. > > If too many folios are passed to migrate_pages(), in the naive batched > implementation, we may unmap too many folios at the same time. The > possibility for a task to wait for the migrated folios to be mapped > again increases. So the latency may be hurt. To deal with this > issue, the max number of folios be unmapped in batch is restricted to > no more than HPAGE_PMD_NR in the unit of page. That is, the influence > is at the same level of THP migration. > > We use the following test to measure the performance impact of the > patchset, > > On a 2-socket Intel server, > > - Run pmbench memory accessing benchmark > > - Run `migratepages` to migrate pages of pmbench between node 0 and > node 1 back and forth. > > With the patch, the TLB flushing IPI reduces 99.1% during the test and > the number of pages migrated successfully per second increases 291.7%. > > Xin Hao helped to test the patchset on an ARM64 server with 128 cores, > 2 NUMA nodes. Test results show that the page migration performance > increases up to 78%. > > This patchset is based on mm-unstable 2023-02-10. And back in linux-next this week: I tried next-20230217 overnight. There is a deadlock in this patchset (and in previous versions: sorry it's taken me so long to report), but I think one that's easily solved. I've not bisected to precisely which patch (load can take several hours to hit the deadlock), but it doesn't really matter, and I expect that you can guess. My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE), and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs. So, plenty of ext4 page cache and buffer_heads. Again and again, the deadlock is seen with buffer_migrate_folio_norefs(), either in kcompactd0 or in khugepaged trying to compact, or in both: it ends up calling __lock_buffer(), and that schedules away, waiting forever to get BH_lock. I have not identified who is holding BH_lock, but I imagine a jbd2 journalling thread, and presume that it wants one of the folio locks which migrate_pages_batch() is already holding; or maybe it's all more convoluted than that. Other tasks then back up waiting on those folio locks held in the batch. Never a problem with buffer_migrate_folio(), always with the "more careful" buffer_migrate_folio_norefs(). And the patch below fixes it for me: I've had enough hours with it now, on enough occasions, to be confident of that. Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2 very well, and I hope can assure us that there is an understandable deadlock here, from holding several random folio locks, then trying to lock buffers. 
Cc'ing fsdevel, because there's a risk that mm folk think something is safe, when it's not sufficient to cope with the diversity of filesystems. I hope nothing more than the below is needed (and I've had no other problems with the patchset: good job), but cannot be sure.

[PATCH next] migrate_pages: fix deadlock on buffer heads

When __buffer_migrate_folio() is called from buffer_migrate_folio_norefs(),
force MIGRATE_ASYNC mode so that buffer_migrate_lock_buffers() will only
trylock_buffer(), failing with -EAGAIN as usual if that does not succeed.

Signed-off-by: Hugh Dickins <hughd@google.com>

--- next-20230217/mm/migrate.c
+++ fixed/mm/migrate.c
@@ -748,7 +748,8 @@ static int __buffer_migrate_folio(struct
 	if (folio_ref_count(src) != expected_count)
 		return -EAGAIN;

-	if (!buffer_migrate_lock_buffers(head, mode))
+	if (!buffer_migrate_lock_buffers(head,
+				check_refs ? MIGRATE_ASYNC : mode))
 		return -EAGAIN;

 	if (check_refs) {
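For readers following the fix above, a simplified, paraphrased sketch of what buffer_migrate_lock_buffers() does with the migration mode (not a verbatim copy of mm/migrate.c): the synchronous modes block in lock_buffer(), which is the __lock_buffer() sleep seen in the deadlock, while MIGRATE_ASYNC only trylocks each buffer and unwinds on failure, so the caller fails with -EAGAIN instead of sleeping.

static bool buffer_migrate_lock_buffers(struct buffer_head *head,
					enum migrate_mode mode)
{
	struct buffer_head *bh = head;
	struct buffer_head *failed_bh;

	if (mode != MIGRATE_ASYNC) {
		/* Synchronous modes: may sleep in __lock_buffer(). */
		do {
			lock_buffer(bh);
			bh = bh->b_this_page;
		} while (bh != head);
		return true;
	}

	/* MIGRATE_ASYNC: never sleep; drop any locks already taken and fail. */
	do {
		if (!trylock_buffer(bh)) {
			failed_bh = bh;
			for (bh = head; bh != failed_bh; bh = bh->b_this_page)
				unlock_buffer(bh);
			return false;
		}
		bh = bh->b_this_page;
	} while (bh != head);

	return true;
}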
Hi, Hugh, Hugh Dickins <hughd@google.com> writes: > On Mon, 13 Feb 2023, Huang Ying wrote: > >> From: "Huang, Ying" <ying.huang@intel.com> >> >> Now, migrate_pages() migrate folios one by one, like the fake code as >> follows, >> >> for each folio >> unmap >> flush TLB >> copy >> restore map >> >> If multiple folios are passed to migrate_pages(), there are >> opportunities to batch the TLB flushing and copying. That is, we can >> change the code to something as follows, >> >> for each folio >> unmap >> for each folio >> flush TLB >> for each folio >> copy >> for each folio >> restore map >> >> The total number of TLB flushing IPI can be reduced considerably. And >> we may use some hardware accelerator such as DSA to accelerate the >> folio copying. >> >> So in this patch, we refactor the migrate_pages() implementation and >> implement the TLB flushing batching. Base on this, hardware >> accelerated folio copying can be implemented. >> >> If too many folios are passed to migrate_pages(), in the naive batched >> implementation, we may unmap too many folios at the same time. The >> possibility for a task to wait for the migrated folios to be mapped >> again increases. So the latency may be hurt. To deal with this >> issue, the max number of folios be unmapped in batch is restricted to >> no more than HPAGE_PMD_NR in the unit of page. That is, the influence >> is at the same level of THP migration. >> >> We use the following test to measure the performance impact of the >> patchset, >> >> On a 2-socket Intel server, >> >> - Run pmbench memory accessing benchmark >> >> - Run `migratepages` to migrate pages of pmbench between node 0 and >> node 1 back and forth. >> >> With the patch, the TLB flushing IPI reduces 99.1% during the test and >> the number of pages migrated successfully per second increases 291.7%. >> >> Xin Hao helped to test the patchset on an ARM64 server with 128 cores, >> 2 NUMA nodes. Test results show that the page migration performance >> increases up to 78%. >> >> This patchset is based on mm-unstable 2023-02-10. > > And back in linux-next this week: I tried next-20230217 overnight. > > There is a deadlock in this patchset (and in previous versions: sorry > it's taken me so long to report), but I think one that's easily solved. > > I've not bisected to precisely which patch (load can take several hours > to hit the deadlock), but it doesn't really matter, and I expect that > you can guess. > > My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE), > and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs. > So, plenty of ext4 page cache and buffer_heads. > > Again and again, the deadlock is seen with buffer_migrate_folio_norefs(), > either in kcompactd0 or in khugepaged trying to compact, or in both: > it ends up calling __lock_buffer(), and that schedules away, waiting > forever to get BH_lock. I have not identified who is holding BH_lock, > but I imagine a jbd2 journalling thread, and presume that it wants one > of the folio locks which migrate_pages_batch() is already holding; or > maybe it's all more convoluted than that. Other tasks then back up > waiting on those folio locks held in the batch. > > Never a problem with buffer_migrate_folio(), always with the "more > careful" buffer_migrate_folio_norefs(). And the patch below fixes > it for me: I've had enough hours with it now, on enough occasions, > to be confident of that. 
> > Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2 > very well, and I hope can assure us that there is an understandable > deadlock here, from holding several random folio locks, then trying > to lock buffers. Cc'ing fsdevel, because there's a risk that mm > folk think something is safe, when it's not sufficient to cope with > the diversity of filesystems. I hope nothing more than the below is > needed (and I've had no other problems with the patchset: good job), > but cannot be sure. > > [PATCH next] migrate_pages: fix deadlock on buffer heads > > When __buffer_migrate_folio() is called from buffer_migrate_folio_norefs(), > force MIGRATE_ASYNC mode so that buffer_migrate_lock_buffers() will only > trylock_buffer(), failing with -EAGAIN as usual if that does not succeed. > > Signed-off-by: Hugh Dickins <hughd@google.com> > > --- next-20230217/mm/migrate.c > +++ fixed/mm/migrate.c > @@ -748,7 +748,8 @@ static int __buffer_migrate_folio(struct > if (folio_ref_count(src) != expected_count) > return -EAGAIN; > > - if (!buffer_migrate_lock_buffers(head, mode)) > + if (!buffer_migrate_lock_buffers(head, > + check_refs ? MIGRATE_ASYNC : mode)) > return -EAGAIN; > > if (check_refs) { Thank you very much for pointing this out and the fix patch. Today, my colleague Pengfei reported a deadlock bug to me. It seems that we cannot wait the writeback to complete when we have locked some folios. Below patch can fix that deadlock. I don't know whether this is related to the deadlock you run into. It appears that we should avoid to lock/wait synchronously if we have locked more than one folios. Best Regards, Huang, Ying ------------------------------------8<------------------------------------ From 0699fa2f80a67e863107d49a25909c92b900a9be Mon Sep 17 00:00:00 2001 From: Huang Ying <ying.huang@intel.com> Date: Mon, 20 Feb 2023 14:56:34 +0800 Subject: [PATCH] migrate_pages: fix deadlock on waiting writeback Pengfei reported a system soft lockup issue with Syzkaller. The stack traces are as follows, ... [ 300.124933] INFO: task kworker/u4:3:73 blocked for more than 147 seconds. [ 300.125214] Not tainted 6.2.0-rc4-kvm+ #1314 [ 300.125408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 300.125736] task:kworker/u4:3 state:D stack:0 pid:73 ppid:2 flags:0x00004000 [ 300.126059] Workqueue: writeback wb_workfn (flush-7:3) [ 300.126282] Call Trace: [ 300.126378] <TASK> [ 300.126464] __schedule+0x43b/0xd00 [ 300.126601] ? __blk_flush_plug+0x142/0x180 [ 300.126765] schedule+0x6a/0xf0 [ 300.126912] io_schedule+0x4a/0x80 [ 300.127051] folio_wait_bit_common+0x1b5/0x4e0 [ 300.127227] ? __pfx_wake_page_function+0x10/0x10 [ 300.127403] __folio_lock+0x27/0x40 [ 300.127541] write_cache_pages+0x350/0x870 [ 300.127699] ? __pfx_iomap_do_writepage+0x10/0x10 [ 300.127889] iomap_writepages+0x3f/0x80 [ 300.128037] xfs_vm_writepages+0x94/0xd0 [ 300.128192] ? __pfx_xfs_vm_writepages+0x10/0x10 [ 300.128370] do_writepages+0x10a/0x240 [ 300.128514] ? lock_is_held_type+0xe6/0x140 [ 300.128675] __writeback_single_inode+0x9f/0xa90 [ 300.128854] writeback_sb_inodes+0x2fb/0x8d0 [ 300.129030] __writeback_inodes_wb+0x68/0x150 [ 300.129212] wb_writeback+0x49c/0x770 [ 300.129357] wb_workfn+0x6fb/0x9d0 [ 300.129500] process_one_work+0x3cc/0x8d0 [ 300.129669] worker_thread+0x66/0x630 [ 300.129824] ? __pfx_worker_thread+0x10/0x10 [ 300.129989] kthread+0x153/0x190 [ 300.130116] ? 
__pfx_kthread+0x10/0x10 [ 300.130264] ret_from_fork+0x29/0x50 [ 300.130409] </TASK> [ 300.179347] INFO: task repro:1023 blocked for more than 147 seconds. [ 300.179905] Not tainted 6.2.0-rc4-kvm+ #1314 [ 300.180317] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 300.180955] task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004 [ 300.181660] Call Trace: [ 300.181879] <TASK> [ 300.182085] __schedule+0x43b/0xd00 [ 300.182407] schedule+0x6a/0xf0 [ 300.182694] io_schedule+0x4a/0x80 [ 300.183020] folio_wait_bit_common+0x1b5/0x4e0 [ 300.183506] ? compaction_alloc+0x77/0x1150 [ 300.183892] ? __pfx_wake_page_function+0x10/0x10 [ 300.184304] folio_wait_bit+0x30/0x40 [ 300.184640] folio_wait_writeback+0x2e/0x1e0 [ 300.185034] migrate_pages_batch+0x555/0x1ac0 [ 300.185462] ? __pfx_compaction_alloc+0x10/0x10 [ 300.185808] ? __pfx_compaction_free+0x10/0x10 [ 300.186022] ? __this_cpu_preempt_check+0x17/0x20 [ 300.186234] ? lock_is_held_type+0xe6/0x140 [ 300.186423] migrate_pages+0x100e/0x1180 [ 300.186603] ? __pfx_compaction_free+0x10/0x10 [ 300.186800] ? __pfx_compaction_alloc+0x10/0x10 [ 300.187011] compact_zone+0xe10/0x1b50 [ 300.187182] ? lock_is_held_type+0xe6/0x140 [ 300.187374] ? check_preemption_disabled+0x80/0xf0 [ 300.187588] compact_node+0xa3/0x100 [ 300.187755] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 [ 300.187993] ? _find_first_bit+0x7b/0x90 [ 300.188171] sysctl_compaction_handler+0x5d/0xb0 [ 300.188376] proc_sys_call_handler+0x29d/0x420 [ 300.188583] proc_sys_write+0x2b/0x40 [ 300.188749] vfs_write+0x3a3/0x780 [ 300.188912] ksys_write+0xb7/0x180 [ 300.189070] __x64_sys_write+0x26/0x30 [ 300.189260] do_syscall_64+0x3b/0x90 [ 300.189424] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 300.189654] RIP: 0033:0x7f3a2471f59d [ 300.189815] RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001 [ 300.190137] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d [ 300.190397] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 [ 300.190653] RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010 [ 300.190910] R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0 [ 300.191172] R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000 [ 300.191440] </TASK> ... To migrate a folio, we may wait the writeback of a folio to complete when we already have held the lock of some folios. But the writeback code may wait to lock some folio we held lock. This causes the deadlock. To fix the issue, we will avoid to wait the writeback to complete if we have locked some folios. After moving the locked folios and unlocked, we will retry. Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Stefan Roesch <shr@devkernel.io> Cc: Tejun Heo <tj@kernel.org> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> --- mm/migrate.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/mm/migrate.c b/mm/migrate.c index 28b435cdeac8..bc9a8050f1b0 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1205,6 +1205,18 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page } if (!force) goto out; + /* + * We have locked some folios and are going to wait the + * writeback of this folio to complete. 
But it's possible for + * the writeback to wait to lock the folios we have locked. To + * avoid a potential deadlock, let's bail out and not do that. + * The locked folios will be moved and unlocked, then we + * can wait the writeback of this folio. + */ + if (avoid_force_lock) { + rc = -EDEADLOCK; + goto out; + } folio_wait_writeback(src); }
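The rule the patch above encodes can be condensed into a small helper. This is a hypothetical sketch with invented names (maybe_wait_writeback and batch_holds_locks are not the mm-unstable code): only wait for writeback synchronously when the batch holds no other folio locks; otherwise return -EDEADLOCK so the caller can move and unlock the folios it already holds, and retry this folio afterwards.

/* Hypothetical sketch, not the actual migrate_folio_unmap() code. */
static int maybe_wait_writeback(struct folio *src, bool batch_holds_locks)
{
	if (!folio_test_writeback(src))
		return 0;
	if (batch_holds_locks)
		/* Other folio locks are held: do not sleep, let the caller retry. */
		return -EDEADLOCK;
	folio_wait_writeback(src);
	return 0;
}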
On Mon, 20 Feb 2023, Huang, Ying wrote: > Hi, Hugh, > > Hugh Dickins <hughd@google.com> writes: > > > On Mon, 13 Feb 2023, Huang Ying wrote: > > > >> From: "Huang, Ying" <ying.huang@intel.com> > >> > >> Now, migrate_pages() migrate folios one by one, like the fake code as > >> follows, > >> > >> for each folio > >> unmap > >> flush TLB > >> copy > >> restore map > >> > >> If multiple folios are passed to migrate_pages(), there are > >> opportunities to batch the TLB flushing and copying. That is, we can > >> change the code to something as follows, > >> > >> for each folio > >> unmap > >> for each folio > >> flush TLB > >> for each folio > >> copy > >> for each folio > >> restore map > >> > >> The total number of TLB flushing IPI can be reduced considerably. And > >> we may use some hardware accelerator such as DSA to accelerate the > >> folio copying. > >> > >> So in this patch, we refactor the migrate_pages() implementation and > >> implement the TLB flushing batching. Base on this, hardware > >> accelerated folio copying can be implemented. > >> > >> If too many folios are passed to migrate_pages(), in the naive batched > >> implementation, we may unmap too many folios at the same time. The > >> possibility for a task to wait for the migrated folios to be mapped > >> again increases. So the latency may be hurt. To deal with this > >> issue, the max number of folios be unmapped in batch is restricted to > >> no more than HPAGE_PMD_NR in the unit of page. That is, the influence > >> is at the same level of THP migration. > >> > >> We use the following test to measure the performance impact of the > >> patchset, > >> > >> On a 2-socket Intel server, > >> > >> - Run pmbench memory accessing benchmark > >> > >> - Run `migratepages` to migrate pages of pmbench between node 0 and > >> node 1 back and forth. > >> > >> With the patch, the TLB flushing IPI reduces 99.1% during the test and > >> the number of pages migrated successfully per second increases 291.7%. > >> > >> Xin Hao helped to test the patchset on an ARM64 server with 128 cores, > >> 2 NUMA nodes. Test results show that the page migration performance > >> increases up to 78%. > >> > >> This patchset is based on mm-unstable 2023-02-10. > > > > And back in linux-next this week: I tried next-20230217 overnight. > > > > There is a deadlock in this patchset (and in previous versions: sorry > > it's taken me so long to report), but I think one that's easily solved. > > > > I've not bisected to precisely which patch (load can take several hours > > to hit the deadlock), but it doesn't really matter, and I expect that > > you can guess. > > > > My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE), > > and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs. > > So, plenty of ext4 page cache and buffer_heads. > > > > Again and again, the deadlock is seen with buffer_migrate_folio_norefs(), > > either in kcompactd0 or in khugepaged trying to compact, or in both: > > it ends up calling __lock_buffer(), and that schedules away, waiting > > forever to get BH_lock. I have not identified who is holding BH_lock, > > but I imagine a jbd2 journalling thread, and presume that it wants one > > of the folio locks which migrate_pages_batch() is already holding; or > > maybe it's all more convoluted than that. Other tasks then back up > > waiting on those folio locks held in the batch. > > > > Never a problem with buffer_migrate_folio(), always with the "more > > careful" buffer_migrate_folio_norefs(). 
And the patch below fixes > > it for me: I've had enough hours with it now, on enough occasions, > > to be confident of that. > > > > Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2 > > very well, and I hope can assure us that there is an understandable > > deadlock here, from holding several random folio locks, then trying > > to lock buffers. Cc'ing fsdevel, because there's a risk that mm > > folk think something is safe, when it's not sufficient to cope with > > the diversity of filesystems. I hope nothing more than the below is > > needed (and I've had no other problems with the patchset: good job), > > but cannot be sure. > > > > [PATCH next] migrate_pages: fix deadlock on buffer heads > > > > When __buffer_migrate_folio() is called from buffer_migrate_folio_norefs(), > > force MIGRATE_ASYNC mode so that buffer_migrate_lock_buffers() will only > > trylock_buffer(), failing with -EAGAIN as usual if that does not succeed. > > > > Signed-off-by: Hugh Dickins <hughd@google.com> > > > > --- next-20230217/mm/migrate.c > > +++ fixed/mm/migrate.c > > @@ -748,7 +748,8 @@ static int __buffer_migrate_folio(struct > > if (folio_ref_count(src) != expected_count) > > return -EAGAIN; > > > > - if (!buffer_migrate_lock_buffers(head, mode)) > > + if (!buffer_migrate_lock_buffers(head, > > + check_refs ? MIGRATE_ASYNC : mode)) > > return -EAGAIN; > > > > if (check_refs) { > > Thank you very much for pointing this out and the fix patch. Today, my > colleague Pengfei reported a deadlock bug to me. It seems that we > cannot wait the writeback to complete when we have locked some folios. > Below patch can fix that deadlock. I don't know whether this is related > to the deadlock you run into. It appears that we should avoid to > lock/wait synchronously if we have locked more than one folios. Thanks, I've checked now, on next-20230217 without my patch but with your patch below: it took a few hours, but still deadlocks as I described above, so it's not the same issue. Yes, that's a good principle, that we should avoid to lock/wait synchronously once we have locked one folio (hmm, above you say "more than one": I think we mean the same thing, we're just stating it differently, given how the code runs at present). I'm not a great fan of migrate_folio_unmap()'s arguments, "force" followed by "oh, but don't force" (but applaud the recent "avoid_force_lock" as much better than the original "force_lock"). I haven't tried, but I wonder if you can avoid both those arguments, and both of these patches, by passing down an adjusted mode (perhaps MIGRATE_ASYNC, or perhaps a new mode) to all callees, once the first folio of a batch has been acquired (then restore to the original mode when starting a new batch). (My patch is weak in that it trylocks for buffer_head even on the first folio of a MIGRATE_SYNC norefs batch, although that has never given a problem historically: adjusting the mode after acquiring the first folio would correct that weakness.) Hugh > > Best Regards, > Huang, Ying > > ------------------------------------8<------------------------------------ > From 0699fa2f80a67e863107d49a25909c92b900a9be Mon Sep 17 00:00:00 2001 > From: Huang Ying <ying.huang@intel.com> > Date: Mon, 20 Feb 2023 14:56:34 +0800 > Subject: [PATCH] migrate_pages: fix deadlock on waiting writeback > > Pengfei reported a system soft lockup issue with Syzkaller. The stack > traces are as follows, > > ... > [ 300.124933] INFO: task kworker/u4:3:73 blocked for more than 147 seconds. 
> [ 300.125214] Not tainted 6.2.0-rc4-kvm+ #1314 > [ 300.125408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 300.125736] task:kworker/u4:3 state:D stack:0 pid:73 ppid:2 flags:0x00004000 > [ 300.126059] Workqueue: writeback wb_workfn (flush-7:3) > [ 300.126282] Call Trace: > [ 300.126378] <TASK> > [ 300.126464] __schedule+0x43b/0xd00 > [ 300.126601] ? __blk_flush_plug+0x142/0x180 > [ 300.126765] schedule+0x6a/0xf0 > [ 300.126912] io_schedule+0x4a/0x80 > [ 300.127051] folio_wait_bit_common+0x1b5/0x4e0 > [ 300.127227] ? __pfx_wake_page_function+0x10/0x10 > [ 300.127403] __folio_lock+0x27/0x40 > [ 300.127541] write_cache_pages+0x350/0x870 > [ 300.127699] ? __pfx_iomap_do_writepage+0x10/0x10 > [ 300.127889] iomap_writepages+0x3f/0x80 > [ 300.128037] xfs_vm_writepages+0x94/0xd0 > [ 300.128192] ? __pfx_xfs_vm_writepages+0x10/0x10 > [ 300.128370] do_writepages+0x10a/0x240 > [ 300.128514] ? lock_is_held_type+0xe6/0x140 > [ 300.128675] __writeback_single_inode+0x9f/0xa90 > [ 300.128854] writeback_sb_inodes+0x2fb/0x8d0 > [ 300.129030] __writeback_inodes_wb+0x68/0x150 > [ 300.129212] wb_writeback+0x49c/0x770 > [ 300.129357] wb_workfn+0x6fb/0x9d0 > [ 300.129500] process_one_work+0x3cc/0x8d0 > [ 300.129669] worker_thread+0x66/0x630 > [ 300.129824] ? __pfx_worker_thread+0x10/0x10 > [ 300.129989] kthread+0x153/0x190 > [ 300.130116] ? __pfx_kthread+0x10/0x10 > [ 300.130264] ret_from_fork+0x29/0x50 > [ 300.130409] </TASK> > [ 300.179347] INFO: task repro:1023 blocked for more than 147 seconds. > [ 300.179905] Not tainted 6.2.0-rc4-kvm+ #1314 > [ 300.180317] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 300.180955] task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004 > [ 300.181660] Call Trace: > [ 300.181879] <TASK> > [ 300.182085] __schedule+0x43b/0xd00 > [ 300.182407] schedule+0x6a/0xf0 > [ 300.182694] io_schedule+0x4a/0x80 > [ 300.183020] folio_wait_bit_common+0x1b5/0x4e0 > [ 300.183506] ? compaction_alloc+0x77/0x1150 > [ 300.183892] ? __pfx_wake_page_function+0x10/0x10 > [ 300.184304] folio_wait_bit+0x30/0x40 > [ 300.184640] folio_wait_writeback+0x2e/0x1e0 > [ 300.185034] migrate_pages_batch+0x555/0x1ac0 > [ 300.185462] ? __pfx_compaction_alloc+0x10/0x10 > [ 300.185808] ? __pfx_compaction_free+0x10/0x10 > [ 300.186022] ? __this_cpu_preempt_check+0x17/0x20 > [ 300.186234] ? lock_is_held_type+0xe6/0x140 > [ 300.186423] migrate_pages+0x100e/0x1180 > [ 300.186603] ? __pfx_compaction_free+0x10/0x10 > [ 300.186800] ? __pfx_compaction_alloc+0x10/0x10 > [ 300.187011] compact_zone+0xe10/0x1b50 > [ 300.187182] ? lock_is_held_type+0xe6/0x140 > [ 300.187374] ? check_preemption_disabled+0x80/0xf0 > [ 300.187588] compact_node+0xa3/0x100 > [ 300.187755] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 > [ 300.187993] ? 
_find_first_bit+0x7b/0x90 > [ 300.188171] sysctl_compaction_handler+0x5d/0xb0 > [ 300.188376] proc_sys_call_handler+0x29d/0x420 > [ 300.188583] proc_sys_write+0x2b/0x40 > [ 300.188749] vfs_write+0x3a3/0x780 > [ 300.188912] ksys_write+0xb7/0x180 > [ 300.189070] __x64_sys_write+0x26/0x30 > [ 300.189260] do_syscall_64+0x3b/0x90 > [ 300.189424] entry_SYSCALL_64_after_hwframe+0x72/0xdc > [ 300.189654] RIP: 0033:0x7f3a2471f59d > [ 300.189815] RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001 > [ 300.190137] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d > [ 300.190397] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 > [ 300.190653] RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010 > [ 300.190910] R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0 > [ 300.191172] R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000 > [ 300.191440] </TASK> > ... > > To migrate a folio, we may wait the writeback of a folio to complete > when we already have held the lock of some folios. But the writeback > code may wait to lock some folio we held lock. This causes the > deadlock. To fix the issue, we will avoid to wait the writeback to > complete if we have locked some folios. After moving the locked > folios and unlocked, we will retry. > > Signed-off-by: "Huang, Ying" <ying.huang@intel.com> > Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com> > Cc: Hugh Dickins <hughd@google.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Stefan Roesch <shr@devkernel.io> > Cc: Tejun Heo <tj@kernel.org> > Cc: Xin Hao <xhao@linux.alibaba.com> > Cc: Zi Yan <ziy@nvidia.com> > Cc: Yang Shi <shy828301@gmail.com> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Mike Kravetz <mike.kravetz@oracle.com> > --- > mm/migrate.c | 12 ++++++++++++ > 1 file changed, 12 insertions(+) > > diff --git a/mm/migrate.c b/mm/migrate.c > index 28b435cdeac8..bc9a8050f1b0 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1205,6 +1205,18 @@ static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page > } > if (!force) > goto out; > + /* > + * We have locked some folios and are going to wait the > + * writeback of this folio to complete. But it's possible for > + * the writeback to wait to lock the folios we have locked. To > + * avoid a potential deadlock, let's bail out and not do that. > + * The locked folios will be moved and unlocked, then we > + * can wait the writeback of this folio. > + */ > + if (avoid_force_lock) { > + rc = -EDEADLOCK; > + goto out; > + } > folio_wait_writeback(src); > } > > -- > 2.39.1 > >
Hugh Dickins <hughd@google.com> writes: > On Mon, 20 Feb 2023, Huang, Ying wrote: > >> Hi, Hugh, >> >> Hugh Dickins <hughd@google.com> writes: >> >> > On Mon, 13 Feb 2023, Huang Ying wrote: >> > >> >> From: "Huang, Ying" <ying.huang@intel.com> >> >> >> >> Now, migrate_pages() migrate folios one by one, like the fake code as >> >> follows, >> >> >> >> for each folio >> >> unmap >> >> flush TLB >> >> copy >> >> restore map >> >> >> >> If multiple folios are passed to migrate_pages(), there are >> >> opportunities to batch the TLB flushing and copying. That is, we can >> >> change the code to something as follows, >> >> >> >> for each folio >> >> unmap >> >> for each folio >> >> flush TLB >> >> for each folio >> >> copy >> >> for each folio >> >> restore map >> >> >> >> The total number of TLB flushing IPI can be reduced considerably. And >> >> we may use some hardware accelerator such as DSA to accelerate the >> >> folio copying. >> >> >> >> So in this patch, we refactor the migrate_pages() implementation and >> >> implement the TLB flushing batching. Base on this, hardware >> >> accelerated folio copying can be implemented. >> >> >> >> If too many folios are passed to migrate_pages(), in the naive batched >> >> implementation, we may unmap too many folios at the same time. The >> >> possibility for a task to wait for the migrated folios to be mapped >> >> again increases. So the latency may be hurt. To deal with this >> >> issue, the max number of folios be unmapped in batch is restricted to >> >> no more than HPAGE_PMD_NR in the unit of page. That is, the influence >> >> is at the same level of THP migration. >> >> >> >> We use the following test to measure the performance impact of the >> >> patchset, >> >> >> >> On a 2-socket Intel server, >> >> >> >> - Run pmbench memory accessing benchmark >> >> >> >> - Run `migratepages` to migrate pages of pmbench between node 0 and >> >> node 1 back and forth. >> >> >> >> With the patch, the TLB flushing IPI reduces 99.1% during the test and >> >> the number of pages migrated successfully per second increases 291.7%. >> >> >> >> Xin Hao helped to test the patchset on an ARM64 server with 128 cores, >> >> 2 NUMA nodes. Test results show that the page migration performance >> >> increases up to 78%. >> >> >> >> This patchset is based on mm-unstable 2023-02-10. >> > >> > And back in linux-next this week: I tried next-20230217 overnight. >> > >> > There is a deadlock in this patchset (and in previous versions: sorry >> > it's taken me so long to report), but I think one that's easily solved. >> > >> > I've not bisected to precisely which patch (load can take several hours >> > to hit the deadlock), but it doesn't really matter, and I expect that >> > you can guess. >> > >> > My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE), >> > and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs. >> > So, plenty of ext4 page cache and buffer_heads. >> > >> > Again and again, the deadlock is seen with buffer_migrate_folio_norefs(), >> > either in kcompactd0 or in khugepaged trying to compact, or in both: >> > it ends up calling __lock_buffer(), and that schedules away, waiting >> > forever to get BH_lock. I have not identified who is holding BH_lock, >> > but I imagine a jbd2 journalling thread, and presume that it wants one >> > of the folio locks which migrate_pages_batch() is already holding; or >> > maybe it's all more convoluted than that. Other tasks then back up >> > waiting on those folio locks held in the batch. 
>> > >> > Never a problem with buffer_migrate_folio(), always with the "more >> > careful" buffer_migrate_folio_norefs(). And the patch below fixes >> > it for me: I've had enough hours with it now, on enough occasions, >> > to be confident of that. >> > >> > Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2 >> > very well, and I hope can assure us that there is an understandable >> > deadlock here, from holding several random folio locks, then trying >> > to lock buffers. Cc'ing fsdevel, because there's a risk that mm >> > folk think something is safe, when it's not sufficient to cope with >> > the diversity of filesystems. I hope nothing more than the below is >> > needed (and I've had no other problems with the patchset: good job), >> > but cannot be sure. >> > >> > [PATCH next] migrate_pages: fix deadlock on buffer heads >> > >> > When __buffer_migrate_folio() is called from buffer_migrate_folio_norefs(), >> > force MIGRATE_ASYNC mode so that buffer_migrate_lock_buffers() will only >> > trylock_buffer(), failing with -EAGAIN as usual if that does not succeed. >> > >> > Signed-off-by: Hugh Dickins <hughd@google.com> >> > >> > --- next-20230217/mm/migrate.c >> > +++ fixed/mm/migrate.c >> > @@ -748,7 +748,8 @@ static int __buffer_migrate_folio(struct >> > if (folio_ref_count(src) != expected_count) >> > return -EAGAIN; >> > >> > - if (!buffer_migrate_lock_buffers(head, mode)) >> > + if (!buffer_migrate_lock_buffers(head, >> > + check_refs ? MIGRATE_ASYNC : mode)) >> > return -EAGAIN; >> > >> > if (check_refs) { >> >> Thank you very much for pointing this out and the fix patch. Today, my >> colleague Pengfei reported a deadlock bug to me. It seems that we >> cannot wait the writeback to complete when we have locked some folios. >> Below patch can fix that deadlock. I don't know whether this is related >> to the deadlock you run into. It appears that we should avoid to >> lock/wait synchronously if we have locked more than one folios. > > Thanks, I've checked now, on next-20230217 without my patch but > with your patch below: it took a few hours, but still deadlocks > as I described above, so it's not the same issue. > > Yes, that's a good principle, that we should avoid to lock/wait > synchronously once we have locked one folio (hmm, above you say > "more than one": I think we mean the same thing, we're just > stating it differently, given how the code runs at present). > > I'm not a great fan of migrate_folio_unmap()'s arguments, > "force" followed by "oh, but don't force" (but applaud the recent > "avoid_force_lock" as much better than the original "force_lock"). > I haven't tried, but I wonder if you can avoid both those arguments, > and both of these patches, by passing down an adjusted mode (perhaps > MIGRATE_ASYNC, or perhaps a new mode) to all callees, once the first > folio of a batch has been acquired (then restore to the original mode > when starting a new batch). Thanks for suggestion! I think it's a good idea, but it will take me some time to think how to implement. > (My patch is weak in that it trylocks for buffer_head even on the > first folio of a MIGRATE_SYNC norefs batch, although that has never > given a problem historically: adjusting the mode after acquiring > the first folio would correct that weakness.) Further digging shows that this is related to loop device. That can be shown in the following trace, [ 300.109786] INFO: task kworker/u4:0:9 blocked for more than 147 seconds. 
[ 300.110616] Not tainted 6.2.0-rc4-kvm+ #1314 [ 300.111143] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 300.111985] task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000 [ 300.112964] Workqueue: loop4 loop_rootcg_workfn [ 300.113658] Call Trace: [ 300.113996] <TASK> [ 300.114315] __schedule+0x43b/0xd00 [ 300.114848] schedule+0x6a/0xf0 [ 300.115319] io_schedule+0x4a/0x80 [ 300.115860] folio_wait_bit_common+0x1b5/0x4e0 [ 300.116430] ? __pfx_wake_page_function+0x10/0x10 [ 300.116615] __filemap_get_folio+0x73d/0x770 [ 300.116790] shmem_get_folio_gfp+0x1fd/0xc80 [ 300.116963] shmem_write_begin+0x91/0x220 [ 300.117121] generic_perform_write+0x10e/0x2e0 [ 300.117312] __generic_file_write_iter+0x17e/0x290 [ 300.117498] ? generic_write_checks+0x12b/0x1a0 [ 300.117693] generic_file_write_iter+0x97/0x180 [ 300.117881] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 [ 300.118087] do_iter_readv_writev+0x13c/0x210 [ 300.118256] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 [ 300.118463] do_iter_write+0xf6/0x330 [ 300.118608] vfs_iter_write+0x46/0x70 [ 300.118754] loop_process_work+0x723/0xfe0 [ 300.118922] loop_rootcg_workfn+0x28/0x40 [ 300.119078] process_one_work+0x3cc/0x8d0 [ 300.119240] worker_thread+0x66/0x630 [ 300.119384] ? __pfx_worker_thread+0x10/0x10 [ 300.119551] kthread+0x153/0x190 [ 300.119681] ? __pfx_kthread+0x10/0x10 [ 300.119833] ret_from_fork+0x29/0x50 [ 300.119984] </TASK> When a folio of the loop device is written back, the underlying shmem needs to be written back. Which will acquire the lock of the folio of shmem. While that folio may be locked by migrate_pages_batch() already. Your testing involves the loop device too. So is it related to loop? For example, after the buffer head was locked, is it possible to acquire lock for folios of the underlying file system? Best Regards, Huang, Ying
On Mon, Feb 20, 2023 at 06:48:38PM -0800, Hugh Dickins wrote:
> Yes, that's a good principle, that we should avoid to lock/wait
> synchronously once we have locked one folio (hmm, above you say
> "more than one": I think we mean the same thing, we're just
> stating it differently, given how the code runs at present).

I suspect the migrate page code is disobeying the lock ordering rules for multiple folios. If two folios belong to the same file, they must be locked in folio->index order, low to high. If two folios belong to different files, they must be ordered by folio->mapping, the mapping lowest in memory first. You can see this locking rule embedded in vfs_lock_two_folios() in fs/remap_range.c.

I don't know what the locking rules are for two folios which are not file folios, or for two folios when one is anonymous and the other belongs to a file. Maybe it's the same: you can lock them ordered by ->mapping first, then by ->index.

Or you can just trylock multiple folios and skip the ones that don't work.
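As an illustration of the ordering rule described above (an illustrative sketch only, not the actual vfs_lock_two_folios() helper): lock the folio with the lower ->mapping first, and within the same mapping the lower ->index first, so any two tasks locking the same pair take the locks in the same global order and cannot deadlock against each other.

static void lock_two_folios_ordered(struct folio *a, struct folio *b)
{
	/* Order by ->mapping first, then by ->index within a mapping. */
	if (a->mapping > b->mapping ||
	    (a->mapping == b->mapping && a->index > b->index))
		swap(a, b);

	folio_lock(a);
	if (a != b)
		folio_lock(b);
}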
Matthew Wilcox <willy@infradead.org> writes: > On Mon, Feb 20, 2023 at 06:48:38PM -0800, Hugh Dickins wrote: >> Yes, that's a good principle, that we should avoid to lock/wait >> synchronously once we have locked one folio (hmm, above you say >> "more than one": I think we mean the same thing, we're just >> stating it differently, given how the code runs at present). > > I suspect the migrate page code is disobeying the locking ordering > rules for multiple folios. if two folios belong to the same file, > they must be locked by folio->index order, low to high. If two folios > belong to different files, they must be ordered by folio->mapping, the > mapping lowest in memory first. You can see this locking rule embedded > in vfs_lock_two_folios() in fs/remap_range.c. > > I don't know what the locking rules are for two folios which are not file > folios, or for two folios when one is anonymous and the other belongs > to a file. Maybe it's the same; you can lock them ordered by ->mapping > first, then by ->index. > > Or you can just trylock multiple folios and skip the ones that don't work. Yes. We will only trylock when we have locked some folios (including one). And retry locking only after having processed and unlocked the already locked folios. Best Regards, Huang, Ying
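The policy stated above could look roughly like the following (hypothetical helper, names invented, not the mm code): block on the first folio of a batch, but only trylock once at least one folio is already held, and treat a failed trylock as "move and unlock what we have, then come back to this folio".

static bool batch_lock_folio(struct folio *folio, int nr_locked)
{
	if (!nr_locked) {
		/* First folio of the batch: a blocking lock is still safe. */
		folio_lock(folio);
		return true;
	}
	/* Already holding folio locks: never sleep, only trylock. */
	return folio_trylock(folio);
}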
Hugh Dickins <hughd@google.com> writes: > On Mon, 20 Feb 2023, Huang, Ying wrote: > >> Hi, Hugh, >> >> Hugh Dickins <hughd@google.com> writes: >> >> > On Mon, 13 Feb 2023, Huang Ying wrote: >> > >> >> From: "Huang, Ying" <ying.huang@intel.com> >> >> >> >> Now, migrate_pages() migrate folios one by one, like the fake code as >> >> follows, >> >> >> >> for each folio >> >> unmap >> >> flush TLB >> >> copy >> >> restore map >> >> >> >> If multiple folios are passed to migrate_pages(), there are >> >> opportunities to batch the TLB flushing and copying. That is, we can >> >> change the code to something as follows, >> >> >> >> for each folio >> >> unmap >> >> for each folio >> >> flush TLB >> >> for each folio >> >> copy >> >> for each folio >> >> restore map >> >> >> >> The total number of TLB flushing IPI can be reduced considerably. And >> >> we may use some hardware accelerator such as DSA to accelerate the >> >> folio copying. >> >> >> >> So in this patch, we refactor the migrate_pages() implementation and >> >> implement the TLB flushing batching. Base on this, hardware >> >> accelerated folio copying can be implemented. >> >> >> >> If too many folios are passed to migrate_pages(), in the naive batched >> >> implementation, we may unmap too many folios at the same time. The >> >> possibility for a task to wait for the migrated folios to be mapped >> >> again increases. So the latency may be hurt. To deal with this >> >> issue, the max number of folios be unmapped in batch is restricted to >> >> no more than HPAGE_PMD_NR in the unit of page. That is, the influence >> >> is at the same level of THP migration. >> >> >> >> We use the following test to measure the performance impact of the >> >> patchset, >> >> >> >> On a 2-socket Intel server, >> >> >> >> - Run pmbench memory accessing benchmark >> >> >> >> - Run `migratepages` to migrate pages of pmbench between node 0 and >> >> node 1 back and forth. >> >> >> >> With the patch, the TLB flushing IPI reduces 99.1% during the test and >> >> the number of pages migrated successfully per second increases 291.7%. >> >> >> >> Xin Hao helped to test the patchset on an ARM64 server with 128 cores, >> >> 2 NUMA nodes. Test results show that the page migration performance >> >> increases up to 78%. >> >> >> >> This patchset is based on mm-unstable 2023-02-10. >> > >> > And back in linux-next this week: I tried next-20230217 overnight. >> > >> > There is a deadlock in this patchset (and in previous versions: sorry >> > it's taken me so long to report), but I think one that's easily solved. >> > >> > I've not bisected to precisely which patch (load can take several hours >> > to hit the deadlock), but it doesn't really matter, and I expect that >> > you can guess. >> > >> > My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE), >> > and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs. >> > So, plenty of ext4 page cache and buffer_heads. >> > >> > Again and again, the deadlock is seen with buffer_migrate_folio_norefs(), >> > either in kcompactd0 or in khugepaged trying to compact, or in both: >> > it ends up calling __lock_buffer(), and that schedules away, waiting >> > forever to get BH_lock. I have not identified who is holding BH_lock, >> > but I imagine a jbd2 journalling thread, and presume that it wants one >> > of the folio locks which migrate_pages_batch() is already holding; or >> > maybe it's all more convoluted than that. Other tasks then back up >> > waiting on those folio locks held in the batch. 
>> > >> > Never a problem with buffer_migrate_folio(), always with the "more >> > careful" buffer_migrate_folio_norefs(). And the patch below fixes >> > it for me: I've had enough hours with it now, on enough occasions, >> > to be confident of that. >> > >> > Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2 >> > very well, and I hope can assure us that there is an understandable >> > deadlock here, from holding several random folio locks, then trying >> > to lock buffers. Cc'ing fsdevel, because there's a risk that mm >> > folk think something is safe, when it's not sufficient to cope with >> > the diversity of filesystems. I hope nothing more than the below is >> > needed (and I've had no other problems with the patchset: good job), >> > but cannot be sure. >> > >> > [PATCH next] migrate_pages: fix deadlock on buffer heads >> > >> > When __buffer_migrate_folio() is called from buffer_migrate_folio_norefs(), >> > force MIGRATE_ASYNC mode so that buffer_migrate_lock_buffers() will only >> > trylock_buffer(), failing with -EAGAIN as usual if that does not succeed. >> > >> > Signed-off-by: Hugh Dickins <hughd@google.com> >> > >> > --- next-20230217/mm/migrate.c >> > +++ fixed/mm/migrate.c >> > @@ -748,7 +748,8 @@ static int __buffer_migrate_folio(struct >> > if (folio_ref_count(src) != expected_count) >> > return -EAGAIN; >> > >> > - if (!buffer_migrate_lock_buffers(head, mode)) >> > + if (!buffer_migrate_lock_buffers(head, >> > + check_refs ? MIGRATE_ASYNC : mode)) >> > return -EAGAIN; >> > >> > if (check_refs) { >> >> Thank you very much for pointing this out and the fix patch. Today, my >> colleague Pengfei reported a deadlock bug to me. It seems that we >> cannot wait the writeback to complete when we have locked some folios. >> Below patch can fix that deadlock. I don't know whether this is related >> to the deadlock you run into. It appears that we should avoid to >> lock/wait synchronously if we have locked more than one folios. > > Thanks, I've checked now, on next-20230217 without my patch but > with your patch below: it took a few hours, but still deadlocks > as I described above, so it's not the same issue. > > Yes, that's a good principle, that we should avoid to lock/wait > synchronously once we have locked one folio (hmm, above you say > "more than one": I think we mean the same thing, we're just > stating it differently, given how the code runs at present). > > I'm not a great fan of migrate_folio_unmap()'s arguments, > "force" followed by "oh, but don't force" (but applaud the recent > "avoid_force_lock" as much better than the original "force_lock"). > I haven't tried, but I wonder if you can avoid both those arguments, > and both of these patches, by passing down an adjusted mode (perhaps > MIGRATE_ASYNC, or perhaps a new mode) to all callees, once the first > folio of a batch has been acquired (then restore to the original mode > when starting a new batch). > > (My patch is weak in that it trylocks for buffer_head even on the > first folio of a MIGRATE_SYNC norefs batch, although that has never > given a problem historically: adjusting the mode after acquiring > the first folio would correct that weakness.) On second thought, I think that it may be better to provide a fix as simple as possible firstly. Then we can work on a more complex fix as we discussed above. The simple fix is easy to review now. And, we will have more time to test and review the complex fix. 
In the following fix, I disabled the migration batching except for the MIGRATE_ASYNC mode, or the split folios of a THP folio. After that, I will work on the complex fix to enable migration batching for all modes. What do you think about that? Best Regards, Huang, Ying -------------------------------8<--------------------------------- From 8e475812eacd9f2eeac76776c2b1a17af3e59b89 Mon Sep 17 00:00:00 2001 From: Huang Ying <ying.huang@intel.com> Date: Tue, 21 Feb 2023 16:37:50 +0800 Subject: [PATCH] migrate_pages: fix deadlock in batched migration Two deadlock bugs were reported for the migrate_pages() batching series. Thanks Hugh and Pengfei! For example, in the following deadlock trace snippet, INFO: task kworker/u4:0:9 blocked for more than 147 seconds. Not tainted 6.2.0-rc4-kvm+ #1314 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000 Workqueue: loop4 loop_rootcg_workfn Call Trace: <TASK> __schedule+0x43b/0xd00 schedule+0x6a/0xf0 io_schedule+0x4a/0x80 folio_wait_bit_common+0x1b5/0x4e0 ? __pfx_wake_page_function+0x10/0x10 __filemap_get_folio+0x73d/0x770 shmem_get_folio_gfp+0x1fd/0xc80 shmem_write_begin+0x91/0x220 generic_perform_write+0x10e/0x2e0 __generic_file_write_iter+0x17e/0x290 ? generic_write_checks+0x12b/0x1a0 generic_file_write_iter+0x97/0x180 ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 do_iter_readv_writev+0x13c/0x210 ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 do_iter_write+0xf6/0x330 vfs_iter_write+0x46/0x70 loop_process_work+0x723/0xfe0 loop_rootcg_workfn+0x28/0x40 process_one_work+0x3cc/0x8d0 worker_thread+0x66/0x630 ? __pfx_worker_thread+0x10/0x10 kthread+0x153/0x190 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x29/0x50 </TASK> INFO: task repro:1023 blocked for more than 147 seconds. Not tainted 6.2.0-rc4-kvm+ #1314 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004 Call Trace: <TASK> __schedule+0x43b/0xd00 schedule+0x6a/0xf0 io_schedule+0x4a/0x80 folio_wait_bit_common+0x1b5/0x4e0 ? compaction_alloc+0x77/0x1150 ? __pfx_wake_page_function+0x10/0x10 folio_wait_bit+0x30/0x40 folio_wait_writeback+0x2e/0x1e0 migrate_pages_batch+0x555/0x1ac0 ? __pfx_compaction_alloc+0x10/0x10 ? __pfx_compaction_free+0x10/0x10 ? __this_cpu_preempt_check+0x17/0x20 ? lock_is_held_type+0xe6/0x140 migrate_pages+0x100e/0x1180 ? __pfx_compaction_free+0x10/0x10 ? __pfx_compaction_alloc+0x10/0x10 compact_zone+0xe10/0x1b50 ? lock_is_held_type+0xe6/0x140 ? check_preemption_disabled+0x80/0xf0 compact_node+0xa3/0x100 ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 ? _find_first_bit+0x7b/0x90 sysctl_compaction_handler+0x5d/0xb0 proc_sys_call_handler+0x29d/0x420 proc_sys_write+0x2b/0x40 vfs_write+0x3a3/0x780 ksys_write+0xb7/0x180 __x64_sys_write+0x26/0x30 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7f3a2471f59d RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010 R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0 R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000 </TASK> The page migration task has held the lock of the shmem folio A, and is waiting the writeback of the folio B of the file system on the loop block device to complete. 
Meanwhile, the loop worker task that writes back folio B is waiting to lock the shmem folio A, because folio A backs folio B on the loop device. Thus a deadlock is triggered.

In general, if we hold the lock of any folio other than the one we are migrating, it is not safe to wait synchronously, for example to wait for writeback to complete or to lock a buffer head.

To fix the deadlock, this patch disables batching of page migration except in MIGRATE_ASYNC mode, or for the split folios of a THP folio. In MIGRATE_ASYNC mode, synchronous waiting is avoided; and there is no dependency relationship among the split folios of a THP folio.

The fix can be improved by converting the migration mode from synchronous to asynchronous once we hold the lock of a folio other than the one we are migrating. We will do that in the near future.

Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/
Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Tejun Heo <tj@kernel.org>
Cc: Xin Hao <xhao@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/migrate.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index ef68a1aff35c..bc04c34543f3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1937,7 +1937,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 		enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
 {
 	int rc, rc_gather;
-	int nr_pages;
+	int nr_pages, batch;
 	struct folio *folio, *folio2;
 	LIST_HEAD(folios);
 	LIST_HEAD(ret_folios);
@@ -1951,6 +1951,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 				     mode, reason, &stats, &ret_folios);
 	if (rc_gather < 0)
 		goto out;
+
+	if (mode == MIGRATE_ASYNC)
+		batch = NR_MAX_BATCHED_MIGRATION;
+	else
+		batch = 1;
 again:
 	nr_pages = 0;
 	list_for_each_entry_safe(folio, folio2, from, lru) {
@@ -1961,11 +1966,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 		}

 		nr_pages += folio_nr_pages(folio);
-		if (nr_pages > NR_MAX_BATCHED_MIGRATION)
+		if (nr_pages >= batch)
 			break;
 	}
-	if (nr_pages > NR_MAX_BATCHED_MIGRATION)
-		list_cut_before(&folios, from, &folio->lru);
+	if (nr_pages >= batch)
+		list_cut_before(&folios, from, &folio2->lru);
 	else
 		list_splice_init(from, &folios);
 	rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private,
On Tue, 21 Feb 2023, Huang, Ying wrote: > Hugh Dickins <hughd@google.com> writes: > > On Mon, 20 Feb 2023, Huang, Ying wrote: > > > >> Hi, Hugh, > >> > >> Hugh Dickins <hughd@google.com> writes: > >> > >> > On Mon, 13 Feb 2023, Huang Ying wrote: > >> > > >> >> From: "Huang, Ying" <ying.huang@intel.com> > >> >> > >> >> Now, migrate_pages() migrate folios one by one, like the fake code as > >> >> follows, > >> >> > >> >> for each folio > >> >> unmap > >> >> flush TLB > >> >> copy > >> >> restore map > >> >> > >> >> If multiple folios are passed to migrate_pages(), there are > >> >> opportunities to batch the TLB flushing and copying. That is, we can > >> >> change the code to something as follows, > >> >> > >> >> for each folio > >> >> unmap > >> >> for each folio > >> >> flush TLB > >> >> for each folio > >> >> copy > >> >> for each folio > >> >> restore map > >> >> > >> >> The total number of TLB flushing IPI can be reduced considerably. And > >> >> we may use some hardware accelerator such as DSA to accelerate the > >> >> folio copying. > >> >> > >> >> So in this patch, we refactor the migrate_pages() implementation and > >> >> implement the TLB flushing batching. Base on this, hardware > >> >> accelerated folio copying can be implemented. > >> >> > >> >> If too many folios are passed to migrate_pages(), in the naive batched > >> >> implementation, we may unmap too many folios at the same time. The > >> >> possibility for a task to wait for the migrated folios to be mapped > >> >> again increases. So the latency may be hurt. To deal with this > >> >> issue, the max number of folios be unmapped in batch is restricted to > >> >> no more than HPAGE_PMD_NR in the unit of page. That is, the influence > >> >> is at the same level of THP migration. > >> >> > >> >> We use the following test to measure the performance impact of the > >> >> patchset, > >> >> > >> >> On a 2-socket Intel server, > >> >> > >> >> - Run pmbench memory accessing benchmark > >> >> > >> >> - Run `migratepages` to migrate pages of pmbench between node 0 and > >> >> node 1 back and forth. > >> >> > >> >> With the patch, the TLB flushing IPI reduces 99.1% during the test and > >> >> the number of pages migrated successfully per second increases 291.7%. > >> >> > >> >> Xin Hao helped to test the patchset on an ARM64 server with 128 cores, > >> >> 2 NUMA nodes. Test results show that the page migration performance > >> >> increases up to 78%. > >> >> > >> >> This patchset is based on mm-unstable 2023-02-10. > >> > > >> > And back in linux-next this week: I tried next-20230217 overnight. > >> > > >> > There is a deadlock in this patchset (and in previous versions: sorry > >> > it's taken me so long to report), but I think one that's easily solved. > >> > > >> > I've not bisected to precisely which patch (load can take several hours > >> > to hit the deadlock), but it doesn't really matter, and I expect that > >> > you can guess. > >> > > >> > My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE), > >> > and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs. > >> > So, plenty of ext4 page cache and buffer_heads. > >> > > >> > Again and again, the deadlock is seen with buffer_migrate_folio_norefs(), > >> > either in kcompactd0 or in khugepaged trying to compact, or in both: > >> > it ends up calling __lock_buffer(), and that schedules away, waiting > >> > forever to get BH_lock. 
I have not identified who is holding BH_lock, > >> > but I imagine a jbd2 journalling thread, and presume that it wants one > >> > of the folio locks which migrate_pages_batch() is already holding; or > >> > maybe it's all more convoluted than that. Other tasks then back up > >> > waiting on those folio locks held in the batch. > >> > > >> > Never a problem with buffer_migrate_folio(), always with the "more > >> > careful" buffer_migrate_folio_norefs(). And the patch below fixes > >> > it for me: I've had enough hours with it now, on enough occasions, > >> > to be confident of that. > >> > > >> > Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2 > >> > very well, and I hope can assure us that there is an understandable > >> > deadlock here, from holding several random folio locks, then trying > >> > to lock buffers. Cc'ing fsdevel, because there's a risk that mm > >> > folk think something is safe, when it's not sufficient to cope with > >> > the diversity of filesystems. I hope nothing more than the below is > >> > needed (and I've had no other problems with the patchset: good job), > >> > but cannot be sure. > >> > > >> > [PATCH next] migrate_pages: fix deadlock on buffer heads > >> > > >> > When __buffer_migrate_folio() is called from buffer_migrate_folio_norefs(), > >> > force MIGRATE_ASYNC mode so that buffer_migrate_lock_buffers() will only > >> > trylock_buffer(), failing with -EAGAIN as usual if that does not succeed. > >> > > >> > Signed-off-by: Hugh Dickins <hughd@google.com> > >> > > >> > --- next-20230217/mm/migrate.c > >> > +++ fixed/mm/migrate.c > >> > @@ -748,7 +748,8 @@ static int __buffer_migrate_folio(struct > >> > if (folio_ref_count(src) != expected_count) > >> > return -EAGAIN; > >> > > >> > - if (!buffer_migrate_lock_buffers(head, mode)) > >> > + if (!buffer_migrate_lock_buffers(head, > >> > + check_refs ? MIGRATE_ASYNC : mode)) > >> > return -EAGAIN; > >> > > >> > if (check_refs) { > >> > >> Thank you very much for pointing this out and the fix patch. Today, my > >> colleague Pengfei reported a deadlock bug to me. It seems that we > >> cannot wait the writeback to complete when we have locked some folios. > >> Below patch can fix that deadlock. I don't know whether this is related > >> to the deadlock you run into. It appears that we should avoid to > >> lock/wait synchronously if we have locked more than one folios. > > > > Thanks, I've checked now, on next-20230217 without my patch but > > with your patch below: it took a few hours, but still deadlocks > > as I described above, so it's not the same issue. > > > > Yes, that's a good principle, that we should avoid to lock/wait > > synchronously once we have locked one folio (hmm, above you say > > "more than one": I think we mean the same thing, we're just > > stating it differently, given how the code runs at present). > > > > I'm not a great fan of migrate_folio_unmap()'s arguments, > > "force" followed by "oh, but don't force" (but applaud the recent > > "avoid_force_lock" as much better than the original "force_lock"). > > I haven't tried, but I wonder if you can avoid both those arguments, > > and both of these patches, by passing down an adjusted mode (perhaps > > MIGRATE_ASYNC, or perhaps a new mode) to all callees, once the first > > folio of a batch has been acquired (then restore to the original mode > > when starting a new batch). > > Thanks for suggestion! I think it's a good idea, but it will take me > some time to think how to implement. 
> > > (My patch is weak in that it trylocks for buffer_head even on the > > first folio of a MIGRATE_SYNC norefs batch, although that has never > > given a problem historically: adjusting the mode after acquiring > > the first folio would correct that weakness.) > > Further digging shows that this is related to loop device. That can be > shown in the following trace, > > [ 300.109786] INFO: task kworker/u4:0:9 blocked for more than 147 seconds. > [ 300.110616] Not tainted 6.2.0-rc4-kvm+ #1314 > [ 300.111143] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 300.111985] task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000 > [ 300.112964] Workqueue: loop4 loop_rootcg_workfn > [ 300.113658] Call Trace: > [ 300.113996] <TASK> > [ 300.114315] __schedule+0x43b/0xd00 > [ 300.114848] schedule+0x6a/0xf0 > [ 300.115319] io_schedule+0x4a/0x80 > [ 300.115860] folio_wait_bit_common+0x1b5/0x4e0 > [ 300.116430] ? __pfx_wake_page_function+0x10/0x10 > [ 300.116615] __filemap_get_folio+0x73d/0x770 > [ 300.116790] shmem_get_folio_gfp+0x1fd/0xc80 > [ 300.116963] shmem_write_begin+0x91/0x220 > [ 300.117121] generic_perform_write+0x10e/0x2e0 > [ 300.117312] __generic_file_write_iter+0x17e/0x290 > [ 300.117498] ? generic_write_checks+0x12b/0x1a0 > [ 300.117693] generic_file_write_iter+0x97/0x180 > [ 300.117881] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 > [ 300.118087] do_iter_readv_writev+0x13c/0x210 > [ 300.118256] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 > [ 300.118463] do_iter_write+0xf6/0x330 > [ 300.118608] vfs_iter_write+0x46/0x70 > [ 300.118754] loop_process_work+0x723/0xfe0 > [ 300.118922] loop_rootcg_workfn+0x28/0x40 > [ 300.119078] process_one_work+0x3cc/0x8d0 > [ 300.119240] worker_thread+0x66/0x630 > [ 300.119384] ? __pfx_worker_thread+0x10/0x10 > [ 300.119551] kthread+0x153/0x190 > [ 300.119681] ? __pfx_kthread+0x10/0x10 > [ 300.119833] ret_from_fork+0x29/0x50 > [ 300.119984] </TASK> > > When a folio of the loop device is written back, the underlying shmem > needs to be written back. Which will acquire the lock of the folio of > shmem. While that folio may be locked by migrate_pages_batch() already. > > Your testing involves the loop device too. So is it related to loop? > For example, after the buffer head was locked, is it possible to acquire > lock for folios of the underlying file system? To lock (other than trylock) a folio while holding a buffer head lock would violate lock ordering: they are both bit spin locks, so lockdep would not notice, but I'm sure any such violation would have been caught long ago (unless on some very unlikely path). There was nothing in the stack traces I looked at to implicate loop, though admittedly I paid much more attention to the kcompactd0 and khugepaged pattern which was emerging - the few times I looked at the loop0 thread, IIRC it was either idle or waiting on the lower level. If it had a stack trace like you show above, I would just have ignored that as of little interest, as probably waiting on one of the migrate batch's locked folios, which were themselves waiting on a buffer_head lock somewhere. (And shmem itself doesn't use buffer_heads.) But it's still an interesting idea that these deadlocks might be related to loop: loop by its nature certainly has a propensity for deadlocks, though I think those have all been ironed out years ago. I've run my load overnight using a small ext4 partition instead of ext4 on loop on tmpfs, and that has not deadlocked. 
I hesitate to say that confirms your idea that loop is somehow causing the deadlock - the timings will have changed, so I've no idea how long such a modified load needs to run to give confidence; but it does suggest that you're right. Though I still think your principle, that we should avoid locking or waiting synchronously once the batch holds one folio, is the safe principle to follow anyway, whether or not the deadlocks can be pinned on loop. Hugh
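[A minimal sketch of the principle discussed above, using a hypothetical helper; the names and signature here are illustrative only and are not the actual mm/migrate.c plumbing.]

        #include <linux/migrate_mode.h>

        /*
         * Hypothetical helper, for illustration only: once the batch already
         * holds at least one locked folio, demote the requested mode so that
         * every further lock/wait (folio lock, writeback, buffer_head lock)
         * is attempted with trylock semantics instead of blocking.
         */
        static enum migrate_mode batch_effective_mode(enum migrate_mode requested,
                                                      int nr_folios_held)
        {
                if (nr_folios_held == 0)
                        return requested;       /* the first folio may still wait */
                return MIGRATE_ASYNC;           /* never block while other folios are held */
        }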
On Tue, 21 Feb 2023, Huang, Ying wrote: > > On second thought, I think that it may be better to provide a fix as > simple as possible firstly. Then we can work on a more complex fix as > we discussed above. The simple fix is easy to review now. And, we will > have more time to test and review the complex fix. > > In the following fix, I disabled the migration batching except for the > MIGRATE_ASYNC mode, or the split folios of a THP folio. After that, I > will work on the complex fix to enable migration batching for all modes. > > What do you think about that? I don't think there's a need to rush in the wrong fix so quickly. Your series was in (though sometimes out of) linux-next for some while, without causing any widespread problems. Andrew did send it to Linus yesterday, I expect he'll be pushing it out later today or tomorrow, but I don't think it's going to cause big problems. Aiming for a fix in -rc2 would be good. Why would it be complex? Hugh > > Best Regards, > Huang, Ying > > -------------------------------8<--------------------------------- > From 8e475812eacd9f2eeac76776c2b1a17af3e59b89 Mon Sep 17 00:00:00 2001 > From: Huang Ying <ying.huang@intel.com> > Date: Tue, 21 Feb 2023 16:37:50 +0800 > Subject: [PATCH] migrate_pages: fix deadlock in batched migration > > Two deadlock bugs were reported for the migrate_pages() batching > series. Thanks Hugh and Pengfei! For example, in the following > deadlock trace snippet, > > INFO: task kworker/u4:0:9 blocked for more than 147 seconds. > Not tainted 6.2.0-rc4-kvm+ #1314 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000 > Workqueue: loop4 loop_rootcg_workfn > Call Trace: > <TASK> > __schedule+0x43b/0xd00 > schedule+0x6a/0xf0 > io_schedule+0x4a/0x80 > folio_wait_bit_common+0x1b5/0x4e0 > ? __pfx_wake_page_function+0x10/0x10 > __filemap_get_folio+0x73d/0x770 > shmem_get_folio_gfp+0x1fd/0xc80 > shmem_write_begin+0x91/0x220 > generic_perform_write+0x10e/0x2e0 > __generic_file_write_iter+0x17e/0x290 > ? generic_write_checks+0x12b/0x1a0 > generic_file_write_iter+0x97/0x180 > ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 > do_iter_readv_writev+0x13c/0x210 > ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 > do_iter_write+0xf6/0x330 > vfs_iter_write+0x46/0x70 > loop_process_work+0x723/0xfe0 > loop_rootcg_workfn+0x28/0x40 > process_one_work+0x3cc/0x8d0 > worker_thread+0x66/0x630 > ? __pfx_worker_thread+0x10/0x10 > kthread+0x153/0x190 > ? __pfx_kthread+0x10/0x10 > ret_from_fork+0x29/0x50 > </TASK> > > INFO: task repro:1023 blocked for more than 147 seconds. > Not tainted 6.2.0-rc4-kvm+ #1314 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004 > Call Trace: > <TASK> > __schedule+0x43b/0xd00 > schedule+0x6a/0xf0 > io_schedule+0x4a/0x80 > folio_wait_bit_common+0x1b5/0x4e0 > ? compaction_alloc+0x77/0x1150 > ? __pfx_wake_page_function+0x10/0x10 > folio_wait_bit+0x30/0x40 > folio_wait_writeback+0x2e/0x1e0 > migrate_pages_batch+0x555/0x1ac0 > ? __pfx_compaction_alloc+0x10/0x10 > ? __pfx_compaction_free+0x10/0x10 > ? __this_cpu_preempt_check+0x17/0x20 > ? lock_is_held_type+0xe6/0x140 > migrate_pages+0x100e/0x1180 > ? __pfx_compaction_free+0x10/0x10 > ? __pfx_compaction_alloc+0x10/0x10 > compact_zone+0xe10/0x1b50 > ? lock_is_held_type+0xe6/0x140 > ? check_preemption_disabled+0x80/0xf0 > compact_node+0xa3/0x100 > ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 > ? 
_find_first_bit+0x7b/0x90 > sysctl_compaction_handler+0x5d/0xb0 > proc_sys_call_handler+0x29d/0x420 > proc_sys_write+0x2b/0x40 > vfs_write+0x3a3/0x780 > ksys_write+0xb7/0x180 > __x64_sys_write+0x26/0x30 > do_syscall_64+0x3b/0x90 > entry_SYSCALL_64_after_hwframe+0x72/0xdc > RIP: 0033:0x7f3a2471f59d > RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001 > RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 > RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010 > R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0 > R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000 > </TASK> > > The page migration task has held the lock of the shmem folio A, and is > waiting the writeback of the folio B of the file system on the loop > block device to complete. While the loop worker task which writes > back the folio B is waiting to lock the shmem folio A, because the > folio A backs the folio B in the loop device. Thus deadlock is > triggered. > > In general, if we have locked some other folios except the one we are > migrating, it's not safe to wait synchronously, for example, to wait > the writeback to complete or wait to lock the buffer head. > > To fix the deadlock, in this patch, we avoid to batch the page > migration except for MIGRATE_ASYNC mode or the split folios of a THP > folio. In MIGRATE_ASYNC mode, synchronous waiting is avoided. And > there isn't any dependency relationship among the split folios of a > THP folio. > > The fix can be improved via converting migration mode from synchronous > to asynchronous if we have locked some other folios except the one we > are migrating. We will do that in the near future. > > Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/ > Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/ > Signed-off-by: "Huang, Ying" <ying.huang@intel.com> > Reported-by: Hugh Dickins <hughd@google.com> > Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Stefan Roesch <shr@devkernel.io> > Cc: Tejun Heo <tj@kernel.org> > Cc: Xin Hao <xhao@linux.alibaba.com> > Cc: Zi Yan <ziy@nvidia.com> > Cc: Yang Shi <shy828301@gmail.com> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Mike Kravetz <mike.kravetz@oracle.com> > --- > mm/migrate.c | 13 +++++++++---- > 1 file changed, 9 insertions(+), 4 deletions(-) > > diff --git a/mm/migrate.c b/mm/migrate.c > index ef68a1aff35c..bc04c34543f3 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1937,7 +1937,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, > enum migrate_mode mode, int reason, unsigned int *ret_succeeded) > { > int rc, rc_gather; > - int nr_pages; > + int nr_pages, batch; > struct folio *folio, *folio2; > LIST_HEAD(folios); > LIST_HEAD(ret_folios); > @@ -1951,6 +1951,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, > mode, reason, &stats, &ret_folios); > if (rc_gather < 0) > goto out; > + > + if (mode == MIGRATE_ASYNC) > + batch = NR_MAX_BATCHED_MIGRATION; > + else > + batch = 1; > again: > nr_pages = 0; > list_for_each_entry_safe(folio, folio2, from, lru) { > @@ -1961,11 +1966,11 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, > } > > nr_pages += folio_nr_pages(folio); > - if (nr_pages > NR_MAX_BATCHED_MIGRATION) > + if (nr_pages >= batch) > break; > } > - if 
(nr_pages > NR_MAX_BATCHED_MIGRATION) > - list_cut_before(&folios, from, &folio->lru); > + if (nr_pages >= batch) > + list_cut_before(&folios, from, &folio2->lru); > else > list_splice_init(from, &folios); > rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private, > -- > 2.39.1
Hugh Dickins <hughd@google.com> writes: > On Tue, 21 Feb 2023, Huang, Ying wrote: >> >> On second thought, I think that it may be better to provide a fix as >> simple as possible firstly. Then we can work on a more complex fix as >> we discussed above. The simple fix is easy to review now. And, we will >> have more time to test and review the complex fix. >> >> In the following fix, I disabled the migration batching except for the >> MIGRATE_ASYNC mode, or the split folios of a THP folio. After that, I >> will work on the complex fix to enable migration batching for all modes. >> >> What do you think about that? > > I don't think there's a need to rush in the wrong fix so quickly. > Your series was in (though sometimes out of) linux-next for some > while, without causing any widespread problems. Andrew did send > it to Linus yesterday, I expect he'll be pushing it out later today > or tomorrow, but I don't think it's going to cause big problems. > Aiming for a fix in -rc2 would be good. Sure, I will target to fix in -rc2. Thanks for suggestion! > Why would it be complex? Now, I think the big picture could be, if (MIGRATE_ASYNC) { migrate_pages_batch(from,); } else { migrate_pages_batch(from,, MIGRATE_ASYNC,); list_for_each_entry_safe (folio,, from) { migrate_pages_batch(one_folio, , MIGRATE_SYNC,); } } That is, for synchronous migration, try asynchronous batched migration firstly, then fall back to synchronous migration one by one. This will make the retry logic easier to be understood. This needs some code change. Anyway, I will try to do that and show the code. Best Regards, Huang, Ying
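[A rough sketch of the flow outlined above. migrate_one_batch() is a stand-in for migrate_pages_batch() with most arguments omitted; the real function also takes the allocation callbacks, reason, stats and return list, and is assumed here to remove the folios it migrated from the input list.]

        #include <linux/list.h>
        #include <linux/migrate_mode.h>
        #include <linux/mm_types.h>

        /* Stand-in for migrate_pages_batch(); signature simplified for the sketch. */
        static int migrate_one_batch(struct list_head *folios, enum migrate_mode mode);

        static int migrate_sync_with_async_first(struct list_head *from,
                                                 enum migrate_mode mode,
                                                 struct list_head *ret_folios)
        {
                struct folio *folio, *folio2;
                int rc;

                /* Pass 1: opportunistic, non-blocking, fully batched. */
                rc = migrate_one_batch(from, MIGRATE_ASYNC);
                if (mode == MIGRATE_ASYNC)
                        return rc;

                /*
                 * Pass 2: whatever pass 1 left behind is retried in the caller's
                 * synchronous mode, one folio per batch, so a synchronous wait
                 * never happens while the batch holds other folio locks.
                 */
                list_for_each_entry_safe(folio, folio2, from, lru) {
                        LIST_HEAD(single);

                        list_move_tail(&folio->lru, &single);
                        rc = migrate_one_batch(&single, mode);
                        /* Folios that still could not be migrated go back to the caller. */
                        list_splice_tail(&single, ret_folios);
                }
                return rc;
        }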
On Fri 17-02-23 13:47:48, Hugh Dickins wrote: > On Mon, 13 Feb 2023, Huang Ying wrote: > > > From: "Huang, Ying" <ying.huang@intel.com> > > > > Now, migrate_pages() migrate folios one by one, like the fake code as > > follows, > > > > for each folio > > unmap > > flush TLB > > copy > > restore map > > > > If multiple folios are passed to migrate_pages(), there are > > opportunities to batch the TLB flushing and copying. That is, we can > > change the code to something as follows, > > > > for each folio > > unmap > > for each folio > > flush TLB > > for each folio > > copy > > for each folio > > restore map > > > > The total number of TLB flushing IPI can be reduced considerably. And > > we may use some hardware accelerator such as DSA to accelerate the > > folio copying. > > > > So in this patch, we refactor the migrate_pages() implementation and > > implement the TLB flushing batching. Base on this, hardware > > accelerated folio copying can be implemented. > > > > If too many folios are passed to migrate_pages(), in the naive batched > > implementation, we may unmap too many folios at the same time. The > > possibility for a task to wait for the migrated folios to be mapped > > again increases. So the latency may be hurt. To deal with this > > issue, the max number of folios be unmapped in batch is restricted to > > no more than HPAGE_PMD_NR in the unit of page. That is, the influence > > is at the same level of THP migration. > > > > We use the following test to measure the performance impact of the > > patchset, > > > > On a 2-socket Intel server, > > > > - Run pmbench memory accessing benchmark > > > > - Run `migratepages` to migrate pages of pmbench between node 0 and > > node 1 back and forth. > > > > With the patch, the TLB flushing IPI reduces 99.1% during the test and > > the number of pages migrated successfully per second increases 291.7%. > > > > Xin Hao helped to test the patchset on an ARM64 server with 128 cores, > > 2 NUMA nodes. Test results show that the page migration performance > > increases up to 78%. > > > > This patchset is based on mm-unstable 2023-02-10. > > And back in linux-next this week: I tried next-20230217 overnight. > > There is a deadlock in this patchset (and in previous versions: sorry > it's taken me so long to report), but I think one that's easily solved. > > I've not bisected to precisely which patch (load can take several hours > to hit the deadlock), but it doesn't really matter, and I expect that > you can guess. > > My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE), > and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs. > So, plenty of ext4 page cache and buffer_heads. > > Again and again, the deadlock is seen with buffer_migrate_folio_norefs(), > either in kcompactd0 or in khugepaged trying to compact, or in both: > it ends up calling __lock_buffer(), and that schedules away, waiting > forever to get BH_lock. I have not identified who is holding BH_lock, > but I imagine a jbd2 journalling thread, and presume that it wants one > of the folio locks which migrate_pages_batch() is already holding; or > maybe it's all more convoluted than that. Other tasks then back up > waiting on those folio locks held in the batch. > > Never a problem with buffer_migrate_folio(), always with the "more > careful" buffer_migrate_folio_norefs(). And the patch below fixes > it for me: I've had enough hours with it now, on enough occasions, > to be confident of that. 
> > Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2 > very well, and I hope can assure us that there is an understandable > deadlock here, from holding several random folio locks, then trying > to lock buffers. Cc'ing fsdevel, because there's a risk that mm > folk think something is safe, when it's not sufficient to cope with > the diversity of filesystems. I hope nothing more than the below is > needed (and I've had no other problems with the patchset: good job), > but cannot be sure. I suspect it can indeed be caused by the presence of the loop device as Huang Ying has suggested. What filesystems using buffer_heads do is a pattern like: bh = page_buffers(loop device page cache page); lock_buffer(bh); submit_bh(bh); - now on loop dev this ends up doing: lo_write_bvec() vfs_iter_write() ... folio_lock(backing file folio); So if migration code holds "backing file folio" lock and at the same time waits for 'bh' lock (while trying to migrate loop device page cache page), it is a deadlock. Proposed solution of never waiting for locks in batched mode looks like a sensible one to me... Honza
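[For reference, a simplified sketch of the "never wait for locks in batched mode" rule, modelled loosely on the trylock-only buffer locking that Hugh's patch forces for the norefs case; details and names are simplified and this is not the actual buffer_migrate_lock_buffers().]

        #include <linux/buffer_head.h>

        /*
         * Simplified model of locking a page's buffer ring for migration.
         * When may_block is false (the batched case), a held buffer lock makes
         * us fail so the caller can return -EAGAIN, instead of sleeping while
         * the migration batch is still holding other folio locks.
         */
        static bool lock_buffers_for_batch(struct buffer_head *head, bool may_block)
        {
                struct buffer_head *bh = head, *failed;

                do {
                        if (may_block) {
                                lock_buffer(bh);
                        } else if (!trylock_buffer(bh)) {
                                failed = bh;
                                goto undo;
                        }
                        bh = bh->b_this_page;
                } while (bh != head);
                return true;

        undo:
                /* Drop the buffer locks taken so far; the caller retries later. */
                for (bh = head; bh != failed; bh = bh->b_this_page)
                        unlock_buffer(bh);
                return false;
        }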
Hi, Honza, Jan Kara <jack@suse.cz> writes: > On Fri 17-02-23 13:47:48, Hugh Dickins wrote: >> On Mon, 13 Feb 2023, Huang Ying wrote: >> >> > From: "Huang, Ying" <ying.huang@intel.com> >> > >> > Now, migrate_pages() migrate folios one by one, like the fake code as >> > follows, >> > >> > for each folio >> > unmap >> > flush TLB >> > copy >> > restore map >> > >> > If multiple folios are passed to migrate_pages(), there are >> > opportunities to batch the TLB flushing and copying. That is, we can >> > change the code to something as follows, >> > >> > for each folio >> > unmap >> > for each folio >> > flush TLB >> > for each folio >> > copy >> > for each folio >> > restore map >> > >> > The total number of TLB flushing IPI can be reduced considerably. And >> > we may use some hardware accelerator such as DSA to accelerate the >> > folio copying. >> > >> > So in this patch, we refactor the migrate_pages() implementation and >> > implement the TLB flushing batching. Base on this, hardware >> > accelerated folio copying can be implemented. >> > >> > If too many folios are passed to migrate_pages(), in the naive batched >> > implementation, we may unmap too many folios at the same time. The >> > possibility for a task to wait for the migrated folios to be mapped >> > again increases. So the latency may be hurt. To deal with this >> > issue, the max number of folios be unmapped in batch is restricted to >> > no more than HPAGE_PMD_NR in the unit of page. That is, the influence >> > is at the same level of THP migration. >> > >> > We use the following test to measure the performance impact of the >> > patchset, >> > >> > On a 2-socket Intel server, >> > >> > - Run pmbench memory accessing benchmark >> > >> > - Run `migratepages` to migrate pages of pmbench between node 0 and >> > node 1 back and forth. >> > >> > With the patch, the TLB flushing IPI reduces 99.1% during the test and >> > the number of pages migrated successfully per second increases 291.7%. >> > >> > Xin Hao helped to test the patchset on an ARM64 server with 128 cores, >> > 2 NUMA nodes. Test results show that the page migration performance >> > increases up to 78%. >> > >> > This patchset is based on mm-unstable 2023-02-10. >> >> And back in linux-next this week: I tried next-20230217 overnight. >> >> There is a deadlock in this patchset (and in previous versions: sorry >> it's taken me so long to report), but I think one that's easily solved. >> >> I've not bisected to precisely which patch (load can take several hours >> to hit the deadlock), but it doesn't really matter, and I expect that >> you can guess. >> >> My root and home filesystems are ext4 (4kB blocks with 4kB PAGE_SIZE), >> and so is the filesystem I'm testing, ext4 on /dev/loop0 on tmpfs. >> So, plenty of ext4 page cache and buffer_heads. >> >> Again and again, the deadlock is seen with buffer_migrate_folio_norefs(), >> either in kcompactd0 or in khugepaged trying to compact, or in both: >> it ends up calling __lock_buffer(), and that schedules away, waiting >> forever to get BH_lock. I have not identified who is holding BH_lock, >> but I imagine a jbd2 journalling thread, and presume that it wants one >> of the folio locks which migrate_pages_batch() is already holding; or >> maybe it's all more convoluted than that. Other tasks then back up >> waiting on those folio locks held in the batch. >> >> Never a problem with buffer_migrate_folio(), always with the "more >> careful" buffer_migrate_folio_norefs(). 
And the patch below fixes >> it for me: I've had enough hours with it now, on enough occasions, >> to be confident of that. >> >> Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2 >> very well, and I hope can assure us that there is an understandable >> deadlock here, from holding several random folio locks, then trying >> to lock buffers. Cc'ing fsdevel, because there's a risk that mm >> folk think something is safe, when it's not sufficient to cope with >> the diversity of filesystems. I hope nothing more than the below is >> needed (and I've had no other problems with the patchset: good job), >> but cannot be sure. > > I suspect it can indeed be caused by the presence of the loop device as > Huang Ying has suggested. What filesystems using buffer_heads do is a > pattern like: > > bh = page_buffers(loop device page cache page); > lock_buffer(bh); > submit_bh(bh); > - now on loop dev this ends up doing: > lo_write_bvec() > vfs_iter_write() > ... > folio_lock(backing file folio); > > So if migration code holds "backing file folio" lock and at the same time > waits for 'bh' lock (while trying to migrate loop device page cache page), it > is a deadlock. > > Proposed solution of never waiting for locks in batched mode looks like a > sensible one to me... Thank you very much for detail explanation! Best Regards, Huang, Ying
On Tue, 28 Feb 2023, Huang, Ying wrote: > Jan Kara <jack@suse.cz> writes: > > On Fri 17-02-23 13:47:48, Hugh Dickins wrote: > >> > >> Cc'ing Jan Kara, who knows buffer_migrate_folio_norefs() and jbd2 > >> very well, and I hope can assure us that there is an understandable > >> deadlock here, from holding several random folio locks, then trying > >> to lock buffers. Cc'ing fsdevel, because there's a risk that mm > >> folk think something is safe, when it's not sufficient to cope with > >> the diversity of filesystems. I hope nothing more than the below is > >> needed (and I've had no other problems with the patchset: good job), > >> but cannot be sure. > > > > I suspect it can indeed be caused by the presence of the loop device as > > Huang Ying has suggested. What filesystems using buffer_heads do is a > > pattern like: > > > > bh = page_buffers(loop device page cache page); > > lock_buffer(bh); > > submit_bh(bh); > > - now on loop dev this ends up doing: > > lo_write_bvec() > > vfs_iter_write() > > ... > > folio_lock(backing file folio); > > > > So if migration code holds "backing file folio" lock and at the same time > > waits for 'bh' lock (while trying to migrate loop device page cache page), it > > is a deadlock. > > > > Proposed solution of never waiting for locks in batched mode looks like a > > sensible one to me... > > Thank you very much for detail explanation! Yes, thanks a lot, Jan, for elucidating the deadlocks. I was running with the 1/3,2/3,3/3 series for 48 hours on two machines at the weekend, and hit no problems with all of them on. I did try to review them this evening, but there's too much for me to take in there to give an Acked-by: but I'll ask a couple of questions. Hugh
From: "Huang, Ying" <ying.huang@intel.com> Now, migrate_pages() migrate folios one by one, like the fake code as follows, for each folio unmap flush TLB copy restore map If multiple folios are passed to migrate_pages(), there are opportunities to batch the TLB flushing and copying. That is, we can change the code to something as follows, for each folio unmap for each folio flush TLB for each folio copy for each folio restore map The total number of TLB flushing IPI can be reduced considerably. And we may use some hardware accelerator such as DSA to accelerate the folio copying. So in this patch, we refactor the migrate_pages() implementation and implement the TLB flushing batching. Base on this, hardware accelerated folio copying can be implemented. If too many folios are passed to migrate_pages(), in the naive batched implementation, we may unmap too many folios at the same time. The possibility for a task to wait for the migrated folios to be mapped again increases. So the latency may be hurt. To deal with this issue, the max number of folios be unmapped in batch is restricted to no more than HPAGE_PMD_NR in the unit of page. That is, the influence is at the same level of THP migration. We use the following test to measure the performance impact of the patchset, On a 2-socket Intel server, - Run pmbench memory accessing benchmark - Run `migratepages` to migrate pages of pmbench between node 0 and node 1 back and forth. With the patch, the TLB flushing IPI reduces 99.1% during the test and the number of pages migrated successfully per second increases 291.7%. Xin Hao helped to test the patchset on an ARM64 server with 128 cores, 2 NUMA nodes. Test results show that the page migration performance increases up to 78%. This patchset is based on mm-unstable 2023-02-10. Changes: v5: - Added Xin's test results on ARM 64. Thanks Xin! - Fixed many comments. Thanks Zi and Xin! - Collected reviewed-by and tested-by. v4: - Fixed another bug about non-LRU folio migration. Thanks Hyeonggon! v3: - Rebased on v6.2-rc4 - Fixed a bug about non-LRU folio migration. Thanks Mike! - Fixed some comments. Thanks Baolin! - Collected reviewed-by. v2: - Rebased on v6.2-rc3 - Fixed type force cast warning. Thanks Kees! - Added more comments and cleaned up the code. Thanks Andrew, Zi, Alistair, Dan! - Collected reviewed-by. from rfc to v1: - Rebased on v6.2-rc1 - Fix the deadlock issue caused by locking multiple pages synchronously per Alistair's comments. Thanks! - Fix the autonumabench panic per Rao's comments and fix. Thanks! - Other minor fixes per comments. Thanks! Best Regards, Huang, Ying