Message ID | 1665725448-31439-1-git-send-email-zhaoyang.huang@unisoc.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [RFC] mm: move xa forward when run across zombie page | expand |
On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote: > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink > superblock's inode list. The direct reason is zombie page keeps staying on the > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet > and supposed could be an xa update without synchronize_rcu etc. I would like to > suggest skip this page to break the live lock as a workaround. No, the underlying bug should be fixed. > if (!folio || xa_is_value(folio)) > return folio; > > - if (!folio_try_get_rcu(folio)) > + if (!folio_try_get_rcu(folio)) { > + xas_advance(xas, folio->index + folio_nr_pages(folio) - 1); > goto reset; > + } You can't do this anyway. To call folio_nr_pages() and to look at folio->index, you must have a refcount on the page, and this is the path where we failed to get the refcount.
On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote: > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink > > superblock's inode list. The direct reason is zombie page keeps staying on the > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet > > and supposed could be an xa update without synchronize_rcu etc. I would like to > > suggest skip this page to break the live lock as a workaround. > > No, the underlying bug should be fixed. > > > if (!folio || xa_is_value(folio)) > > return folio; > > > > - if (!folio_try_get_rcu(folio)) > > + if (!folio_try_get_rcu(folio)) { > > + xas_advance(xas, folio->index + folio_nr_pages(folio) - 1); > > goto reset; > > + } > > You can't do this anyway. To call folio_nr_pages() and to look at > folio->index, you must have a refcount on the page, and this is the > path where we failed to get the refcount. OK, could I move the xas like below? + if (!folio_try_get_rcu(folio)) { + xas_next_offset(xas); goto reset; + }
On Mon, Oct 17, 2022 at 1:34 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote: > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote: > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink > > > superblock's inode list. The direct reason is zombie page keeps staying on the > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet > > > and supposed could be an xa update without synchronize_rcu etc. I would like to > > > suggest skip this page to break the live lock as a workaround. > > > > No, the underlying bug should be fixed. > > > > > if (!folio || xa_is_value(folio)) > > > return folio; > > > > > > - if (!folio_try_get_rcu(folio)) > > > + if (!folio_try_get_rcu(folio)) { > > > + xas_advance(xas, folio->index + folio_nr_pages(folio) - 1); > > > goto reset; > > > + } > > > > You can't do this anyway. To call folio_nr_pages() and to look at > > folio->index, you must have a refcount on the page, and this is the > > path where we failed to get the refcount. > OK, could I move the xas like below? > > + if (!folio_try_get_rcu(folio)) { > + xas_next_offset(xas); > goto reset; > + } sorry, It seems the above suggestion doesn't break the loop. update it by moving one step and goto retry instead of restarting from ROOT. if (!folio_try_get_rcu(folio)) { xas_next_offset(xas); goto retry; }
On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote: > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote: > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink > > > superblock's inode list. The direct reason is zombie page keeps staying on the > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet > > > and supposed could be an xa update without synchronize_rcu etc. I would like to > > > suggest skip this page to break the live lock as a workaround. > > > > No, the underlying bug should be fixed. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > OK, could I move the xas like below? > > + if (!folio_try_get_rcu(folio)) { > + xas_next_offset(xas); > goto reset; > + }
On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote: > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote: > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink > > > > superblock's inode list. The direct reason is zombie page keeps staying on the > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to > > > > suggest skip this page to break the live lock as a workaround. > > > > > > No, the underlying bug should be fixed. > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Understand. IMHO, find_get_entry actruely works as an open API dealing with different kinds of address_spaces page cache, which requires high robustness to deal with any corner cases. Take the current problem as example, the inode with fault page(refcount=0) could remain on the sb's list without live lock problem. > > > OK, could I move the xas like below? > > > > + if (!folio_try_get_rcu(folio)) { > > + xas_next_offset(xas); > > goto reset; > > + }
On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote: > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote: > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote: > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > > > > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to > > > > > suggest skip this page to break the live lock as a workaround. > > > > > > > > No, the underlying bug should be fixed. > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > Understand. IMHO, find_get_entry actruely works as an open API dealing > with different kinds of address_spaces page cache, which requires high > robustness to deal with any corner cases. Take the current problem as > example, the inode with fault page(refcount=0) could remain on the > sb's list without live lock problem. But it's a corner case that shouldn't happen! What else is going on at the time? Can you reproduce this problem easily? If so, how?
On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote: > On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote: > > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote: > > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote: > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > > > > > > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink > > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the > > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet > > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to > > > > > > suggest skip this page to break the live lock as a workaround. > > > > > > > > > > No, the underlying bug should be fixed. > > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > Understand. IMHO, find_get_entry actruely works as an open API dealing > > with different kinds of address_spaces page cache, which requires high > > robustness to deal with any corner cases. Take the current problem as > > example, the inode with fault page(refcount=0) could remain on the > > sb's list without live lock problem. > > But it's a corner case that shouldn't happen! What else is going on > at the time? Can you reproduce this problem easily? If so, how? I've been seeing this livelock, too. The reproducer is, unfortunately, something I can't share - it's a massive program that triggers a data corruption I'm working on solving. Now that I've mostly fixed the data corruption, long duration test runs end up livelocking in page cache lookup after several hours. The test is effectively writing a 100MB file with multiple threads doing reverse adjacent racing 1MB unaligned writes. Once the file is written, it is then mmap()d and read back from the filesystem for verification. THis is then run with tens of processes concurrently, and then under a massively confined memcg (e.g. 32 processes/files are run in a memcg with only 200MB of memory allowed). This causes writeback, readahead and memory reclaim to race with incoming mmap read faults and writes. The livelock occurs on file verification and it appears to be an interaction with readahead thrashing. On my test rig, the physical read to write ratio is at least 20:1 - with 32 processes running, the 5s IO rates are: Device tps MB_read/s MB_wrtn/s MB_dscd/s MB_read MB_wrtn MB_dscd dm-0 52187.20 3677.42 1345.92 0.00 18387 6729 0 dm-0 62865.60 5947.29 0.08 0.00 29736 0 0 dm-0 62972.80 5911.20 0.00 0.00 29556 0 0 dm-0 59803.00 5516.72 133.47 0.00 27583 667 0 dm-0 63068.20 5292.34 511.52 0.00 26461 2557 0 dm-0 56775.60 4184.52 1248.38 0.00 20922 6241 0 dm-0 63087.40 5901.26 43.77 0.00 29506 218 0 dm-0 62769.00 5833.97 60.54 0.00 29169 302 0 dm-0 64810.20 5636.13 305.63 0.00 28180 1528 0 dm-0 65222.60 5598.99 349.48 0.00 27994 1747 0 dm-0 62444.00 4887.05 926.67 0.00 24435 4633 0 dm-0 63812.00 5622.68 294.66 0.00 28113 1473 0 dm-0 63482.00 5728.43 195.74 0.00 28642 978 0 This is reading and writing the same amount of file data at the application level, but once the data has been written and kicked out of the page cache it seems to require an awful lot more read IO to get it back to the application. i.e. this looks like mmap() is readahead thrashing severely, and eventually it livelocks with this sort of report: [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 [175901.995614] Call Trace: [175901.996090] <TASK> [175901.996594] ? __schedule+0x301/0xa30 [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 [175902.000714] ? xas_start+0x53/0xc0 [175902.001484] ? xas_load+0x24/0xa0 [175902.002208] ? xas_load+0x5/0xa0 [175902.002878] ? __filemap_get_folio+0x87/0x340 [175902.003823] ? filemap_fault+0x139/0x8d0 [175902.004693] ? __do_fault+0x31/0x1d0 [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 [175902.006213] ? handle_mm_fault+0xd0/0x2a0 [175902.006998] ? exc_page_fault+0x1d9/0x810 [175902.007789] ? asm_exc_page_fault+0x22/0x30 [175902.008613] </TASK> Given that filemap_fault on XFS is probably trying to map large folios, I do wonder if this is a result of some kind of race with teardown of a large folio... There is a very simple corruption reproducer script that has been written, but I haven't been using it. I don't know if long term running of the script here: https://lore.kernel.org/linux-xfs/d00aff43-2bdc-0724-1996-4e58e061ecfd@redhat.com/ will trigger the livelock as the verification step is significantly different, but it will give you insight into the setup of the environment that leads to the livelock. Maybe you could replace the md5sum verification with a mmap read with xfs_io to simulate the fault load that seems to lead to this issue... Cheers, Dave.
On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote: > > On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote: > > > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote: > > > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote: > > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > > > > > > > > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink > > > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the > > > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet > > > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to > > > > > > > suggest skip this page to break the live lock as a workaround. > > > > > > > > > > > > No, the underlying bug should be fixed. > > > > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > Understand. IMHO, find_get_entry actruely works as an open API dealing > > > with different kinds of address_spaces page cache, which requires high > > > robustness to deal with any corner cases. Take the current problem as > > > example, the inode with fault page(refcount=0) could remain on the > > > sb's list without live lock problem. > > > > But it's a corner case that shouldn't happen! What else is going on > > at the time? Can you reproduce this problem easily? If so, how? > > I've been seeing this livelock, too. The reproducer is, > unfortunately, something I can't share - it's a massive program that > triggers a data corruption I'm working on solving. > > Now that I've > mostly fixed the data corruption, long duration test runs end up > livelocking in page cache lookup after several hours. > > The test is effectively writing a 100MB file with multiple threads > doing reverse adjacent racing 1MB unaligned writes. Once the file is > written, it is then mmap()d and read back from the filesystem for > verification. > > THis is then run with tens of processes concurrently, and then under > a massively confined memcg (e.g. 32 processes/files are run in a > memcg with only 200MB of memory allowed). This causes writeback, > readahead and memory reclaim to race with incoming mmap read faults > and writes. The livelock occurs on file verification and it appears > to be an interaction with readahead thrashing. > > On my test rig, the physical read to write ratio is at least 20:1 - > with 32 processes running, the 5s IO rates are: > > Device tps MB_read/s MB_wrtn/s MB_dscd/s MB_read MB_wrtn MB_dscd > dm-0 52187.20 3677.42 1345.92 0.00 18387 6729 0 > dm-0 62865.60 5947.29 0.08 0.00 29736 0 0 > dm-0 62972.80 5911.20 0.00 0.00 29556 0 0 > dm-0 59803.00 5516.72 133.47 0.00 27583 667 0 > dm-0 63068.20 5292.34 511.52 0.00 26461 2557 0 > dm-0 56775.60 4184.52 1248.38 0.00 20922 6241 0 > dm-0 63087.40 5901.26 43.77 0.00 29506 218 0 > dm-0 62769.00 5833.97 60.54 0.00 29169 302 0 > dm-0 64810.20 5636.13 305.63 0.00 28180 1528 0 > dm-0 65222.60 5598.99 349.48 0.00 27994 1747 0 > dm-0 62444.00 4887.05 926.67 0.00 24435 4633 0 > dm-0 63812.00 5622.68 294.66 0.00 28113 1473 0 > dm-0 63482.00 5728.43 195.74 0.00 28642 978 0 > > This is reading and writing the same amount of file data at the > application level, but once the data has been written and kicked out > of the page cache it seems to require an awful lot more read IO to > get it back to the application. i.e. this looks like mmap() is > readahead thrashing severely, and eventually it livelocks with this > sort of report: > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > [175901.995614] Call Trace: > [175901.996090] <TASK> > [175901.996594] ? __schedule+0x301/0xa30 > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > [175902.000714] ? xas_start+0x53/0xc0 > [175902.001484] ? xas_load+0x24/0xa0 > [175902.002208] ? xas_load+0x5/0xa0 > [175902.002878] ? __filemap_get_folio+0x87/0x340 > [175902.003823] ? filemap_fault+0x139/0x8d0 > [175902.004693] ? __do_fault+0x31/0x1d0 > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > [175902.006998] ? exc_page_fault+0x1d9/0x810 > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > [175902.008613] </TASK> > > Given that filemap_fault on XFS is probably trying to map large > folios, I do wonder if this is a result of some kind of race with > teardown of a large folio... > > There is a very simple corruption reproducer script that has been > written, but I haven't been using it. I don't know if long term > running of the script here: > > https://lore.kernel.org/linux-xfs/d00aff43-2bdc-0724-1996-4e58e061ecfd@redhat.com/ > > will trigger the livelock as the verification step is > significantly different, but it will give you insight into the > setup of the environment that leads to the livelock. Maybe you could > replace the md5sum verification with a mmap read with xfs_io to > simulate the fault load that seems to lead to this issue... FWIW, just tested this on a current Linus kernel. While there is massive read-ahead thrashing on v6.0, the thrashing is largely gone in v6.1-rc1+ and the iteration rate of the test is much, much better. The livelock remains, however. Cheers, Dave.
On Wed, Oct 19, 2022 at 12:16:36PM +1100, Dave Chinner wrote: > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > > On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote: > > > On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote: > > > > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote: > > > > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote: > > > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > > > > > > > > > > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink > > > > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the > > > > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet > > > > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to > > > > > > > > suggest skip this page to break the live lock as a workaround. > > > > > > > > > > > > > > No, the underlying bug should be fixed. > > > > > > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > Understand. IMHO, find_get_entry actruely works as an open API dealing > > > > with different kinds of address_spaces page cache, which requires high > > > > robustness to deal with any corner cases. Take the current problem as > > > > example, the inode with fault page(refcount=0) could remain on the > > > > sb's list without live lock problem. > > > > > > But it's a corner case that shouldn't happen! What else is going on > > > at the time? Can you reproduce this problem easily? If so, how? > > > > I've been seeing this livelock, too. The reproducer is, > > unfortunately, something I can't share - it's a massive program that > > triggers a data corruption I'm working on solving. > > > > Now that I've > > mostly fixed the data corruption, long duration test runs end up > > livelocking in page cache lookup after several hours. > > > > The test is effectively writing a 100MB file with multiple threads > > doing reverse adjacent racing 1MB unaligned writes. Once the file is > > written, it is then mmap()d and read back from the filesystem for > > verification. > > > > THis is then run with tens of processes concurrently, and then under > > a massively confined memcg (e.g. 32 processes/files are run in a > > memcg with only 200MB of memory allowed). This causes writeback, > > readahead and memory reclaim to race with incoming mmap read faults > > and writes. The livelock occurs on file verification and it appears > > to be an interaction with readahead thrashing. > > > > On my test rig, the physical read to write ratio is at least 20:1 - > > with 32 processes running, the 5s IO rates are: > > > > Device tps MB_read/s MB_wrtn/s MB_dscd/s MB_read MB_wrtn MB_dscd > > dm-0 52187.20 3677.42 1345.92 0.00 18387 6729 0 > > dm-0 62865.60 5947.29 0.08 0.00 29736 0 0 > > dm-0 62972.80 5911.20 0.00 0.00 29556 0 0 > > dm-0 59803.00 5516.72 133.47 0.00 27583 667 0 > > dm-0 63068.20 5292.34 511.52 0.00 26461 2557 0 > > dm-0 56775.60 4184.52 1248.38 0.00 20922 6241 0 > > dm-0 63087.40 5901.26 43.77 0.00 29506 218 0 > > dm-0 62769.00 5833.97 60.54 0.00 29169 302 0 > > dm-0 64810.20 5636.13 305.63 0.00 28180 1528 0 > > dm-0 65222.60 5598.99 349.48 0.00 27994 1747 0 > > dm-0 62444.00 4887.05 926.67 0.00 24435 4633 0 > > dm-0 63812.00 5622.68 294.66 0.00 28113 1473 0 > > dm-0 63482.00 5728.43 195.74 0.00 28642 978 0 > > > > This is reading and writing the same amount of file data at the > > application level, but once the data has been written and kicked out > > of the page cache it seems to require an awful lot more read IO to > > get it back to the application. i.e. this looks like mmap() is > > readahead thrashing severely, and eventually it livelocks with this > > sort of report: > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > > [175901.995614] Call Trace: > > [175901.996090] <TASK> > > [175901.996594] ? __schedule+0x301/0xa30 > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > > [175902.000714] ? xas_start+0x53/0xc0 > > [175902.001484] ? xas_load+0x24/0xa0 > > [175902.002208] ? xas_load+0x5/0xa0 > > [175902.002878] ? __filemap_get_folio+0x87/0x340 > > [175902.003823] ? filemap_fault+0x139/0x8d0 > > [175902.004693] ? __do_fault+0x31/0x1d0 > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > > [175902.006998] ? exc_page_fault+0x1d9/0x810 > > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > > [175902.008613] </TASK> > > > > Given that filemap_fault on XFS is probably trying to map large > > folios, I do wonder if this is a result of some kind of race with > > teardown of a large folio... > > > > There is a very simple corruption reproducer script that has been > > written, but I haven't been using it. I don't know if long term > > running of the script here: > > > > https://lore.kernel.org/linux-xfs/d00aff43-2bdc-0724-1996-4e58e061ecfd@redhat.com/ > > > > will trigger the livelock as the verification step is > > significantly different, but it will give you insight into the > > setup of the environment that leads to the livelock. Maybe you could > > replace the md5sum verification with a mmap read with xfs_io to > > simulate the fault load that seems to lead to this issue... > > FWIW, just tested this on a current Linus kernel. While there is > massive read-ahead thrashing on v6.0, the thrashing is largely gone > in v6.1-rc1+ and the iteration rate of the test is much, much > better. The livelock remains, however. Evidence is starting to point to an interaction with the multi-page folio support in the page cache. I removed the mapping_set_large_folios() calls in the XFS inode instantiation and the test code has now run over 55,000 iterations without failing. The most iterations I'd seen with large folios enabled was about 7,000 - typically it would fail within 2-3,000 iterations. Cheers, Dave.
On Wed, Oct 19, 2022 at 12:47 PM Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Oct 19, 2022 at 12:16:36PM +1100, Dave Chinner wrote: > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > > > On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote: > > > > On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote: > > > > > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote: > > > > > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote: > > > > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > > > > > > > > > > > > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink > > > > > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the > > > > > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet > > > > > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to > > > > > > > > > suggest skip this page to break the live lock as a workaround. > > > > > > > > > > > > > > > > No, the underlying bug should be fixed. > > > > > > > > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > Understand. IMHO, find_get_entry actruely works as an open API dealing > > > > > with different kinds of address_spaces page cache, which requires high > > > > > robustness to deal with any corner cases. Take the current problem as > > > > > example, the inode with fault page(refcount=0) could remain on the > > > > > sb's list without live lock problem. > > > > > > > > But it's a corner case that shouldn't happen! What else is going on > > > > at the time? Can you reproduce this problem easily? If so, how? > > > > > > I've been seeing this livelock, too. The reproducer is, > > > unfortunately, something I can't share - it's a massive program that > > > triggers a data corruption I'm working on solving. > > > > > > Now that I've > > > mostly fixed the data corruption, long duration test runs end up > > > livelocking in page cache lookup after several hours. > > > > > > The test is effectively writing a 100MB file with multiple threads > > > doing reverse adjacent racing 1MB unaligned writes. Once the file is > > > written, it is then mmap()d and read back from the filesystem for > > > verification. > > > > > > THis is then run with tens of processes concurrently, and then under > > > a massively confined memcg (e.g. 32 processes/files are run in a > > > memcg with only 200MB of memory allowed). This causes writeback, > > > readahead and memory reclaim to race with incoming mmap read faults > > > and writes. The livelock occurs on file verification and it appears > > > to be an interaction with readahead thrashing. > > > > > > On my test rig, the physical read to write ratio is at least 20:1 - > > > with 32 processes running, the 5s IO rates are: > > > > > > Device tps MB_read/s MB_wrtn/s MB_dscd/s MB_read MB_wrtn MB_dscd > > > dm-0 52187.20 3677.42 1345.92 0.00 18387 6729 0 > > > dm-0 62865.60 5947.29 0.08 0.00 29736 0 0 > > > dm-0 62972.80 5911.20 0.00 0.00 29556 0 0 > > > dm-0 59803.00 5516.72 133.47 0.00 27583 667 0 > > > dm-0 63068.20 5292.34 511.52 0.00 26461 2557 0 > > > dm-0 56775.60 4184.52 1248.38 0.00 20922 6241 0 > > > dm-0 63087.40 5901.26 43.77 0.00 29506 218 0 > > > dm-0 62769.00 5833.97 60.54 0.00 29169 302 0 > > > dm-0 64810.20 5636.13 305.63 0.00 28180 1528 0 > > > dm-0 65222.60 5598.99 349.48 0.00 27994 1747 0 > > > dm-0 62444.00 4887.05 926.67 0.00 24435 4633 0 > > > dm-0 63812.00 5622.68 294.66 0.00 28113 1473 0 > > > dm-0 63482.00 5728.43 195.74 0.00 28642 978 0 > > > > > > This is reading and writing the same amount of file data at the > > > application level, but once the data has been written and kicked out > > > of the page cache it seems to require an awful lot more read IO to > > > get it back to the application. i.e. this looks like mmap() is > > > readahead thrashing severely, and eventually it livelocks with this > > > sort of report: > > > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > > > [175901.995614] Call Trace: > > > [175901.996090] <TASK> > > > [175901.996594] ? __schedule+0x301/0xa30 > > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > > > [175902.000714] ? xas_start+0x53/0xc0 > > > [175902.001484] ? xas_load+0x24/0xa0 > > > [175902.002208] ? xas_load+0x5/0xa0 > > > [175902.002878] ? __filemap_get_folio+0x87/0x340 > > > [175902.003823] ? filemap_fault+0x139/0x8d0 > > > [175902.004693] ? __do_fault+0x31/0x1d0 > > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > > > [175902.006998] ? exc_page_fault+0x1d9/0x810 > > > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > > > [175902.008613] </TASK> > > > > > > Given that filemap_fault on XFS is probably trying to map large > > > folios, I do wonder if this is a result of some kind of race with > > > teardown of a large folio... > > > > > > There is a very simple corruption reproducer script that has been > > > written, but I haven't been using it. I don't know if long term > > > running of the script here: > > > > > > https://lore.kernel.org/linux-xfs/d00aff43-2bdc-0724-1996-4e58e061ecfd@redhat.com/ > > > > > > will trigger the livelock as the verification step is > > > significantly different, but it will give you insight into the > > > setup of the environment that leads to the livelock. Maybe you could > > > replace the md5sum verification with a mmap read with xfs_io to > > > simulate the fault load that seems to lead to this issue... > > > > FWIW, just tested this on a current Linus kernel. While there is > > massive read-ahead thrashing on v6.0, the thrashing is largely gone > > in v6.1-rc1+ and the iteration rate of the test is much, much > > better. The livelock remains, however. > > Evidence is starting to point to an interaction with the multi-page > folio support in the page cache. > > I removed the mapping_set_large_folios() calls in the XFS inode > instantiation and the test code has now run over 55,000 iterations > without failing. The most iterations I'd seen with large folios > enabled was about 7,000 - typically it would fail within 2-3,000 > iterations. hint from my side. The original problem I raised is under v5.15 where there is no folio yet. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote: > > On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote: > > > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote: > > > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote: > > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > > > > > > > > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink > > > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the > > > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet > > > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to > > > > > > > suggest skip this page to break the live lock as a workaround. > > > > > > > > > > > > No, the underlying bug should be fixed. > > > > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > Understand. IMHO, find_get_entry actruely works as an open API dealing > > > with different kinds of address_spaces page cache, which requires high > > > robustness to deal with any corner cases. Take the current problem as > > > example, the inode with fault page(refcount=0) could remain on the > > > sb's list without live lock problem. > > > > But it's a corner case that shouldn't happen! What else is going on > > at the time? Can you reproduce this problem easily? If so, how? > > I've been seeing this livelock, too. The reproducer is, > unfortunately, something I can't share - it's a massive program that > triggers a data corruption I'm working on solving. > > Now that I've > mostly fixed the data corruption, long duration test runs end up > livelocking in page cache lookup after several hours. > > The test is effectively writing a 100MB file with multiple threads > doing reverse adjacent racing 1MB unaligned writes. Once the file is > written, it is then mmap()d and read back from the filesystem for > verification. > > THis is then run with tens of processes concurrently, and then under > a massively confined memcg (e.g. 32 processes/files are run in a > memcg with only 200MB of memory allowed). This causes writeback, > readahead and memory reclaim to race with incoming mmap read faults > and writes. The livelock occurs on file verification and it appears > to be an interaction with readahead thrashing. > > On my test rig, the physical read to write ratio is at least 20:1 - > with 32 processes running, the 5s IO rates are: > > Device tps MB_read/s MB_wrtn/s MB_dscd/s MB_read MB_wrtn MB_dscd > dm-0 52187.20 3677.42 1345.92 0.00 18387 6729 0 > dm-0 62865.60 5947.29 0.08 0.00 29736 0 0 > dm-0 62972.80 5911.20 0.00 0.00 29556 0 0 > dm-0 59803.00 5516.72 133.47 0.00 27583 667 0 > dm-0 63068.20 5292.34 511.52 0.00 26461 2557 0 > dm-0 56775.60 4184.52 1248.38 0.00 20922 6241 0 > dm-0 63087.40 5901.26 43.77 0.00 29506 218 0 > dm-0 62769.00 5833.97 60.54 0.00 29169 302 0 > dm-0 64810.20 5636.13 305.63 0.00 28180 1528 0 > dm-0 65222.60 5598.99 349.48 0.00 27994 1747 0 > dm-0 62444.00 4887.05 926.67 0.00 24435 4633 0 > dm-0 63812.00 5622.68 294.66 0.00 28113 1473 0 > dm-0 63482.00 5728.43 195.74 0.00 28642 978 0 > > This is reading and writing the same amount of file data at the > application level, but once the data has been written and kicked out > of the page cache it seems to require an awful lot more read IO to > get it back to the application. i.e. this looks like mmap() is > readahead thrashing severely, and eventually it livelocks with this > sort of report: > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > [175901.995614] Call Trace: > [175901.996090] <TASK> > [175901.996594] ? __schedule+0x301/0xa30 > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > [175902.000714] ? xas_start+0x53/0xc0 > [175902.001484] ? xas_load+0x24/0xa0 > [175902.002208] ? xas_load+0x5/0xa0 > [175902.002878] ? __filemap_get_folio+0x87/0x340 > [175902.003823] ? filemap_fault+0x139/0x8d0 > [175902.004693] ? __do_fault+0x31/0x1d0 > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > [175902.006998] ? exc_page_fault+0x1d9/0x810 > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > [175902.008613] </TASK> > > Given that filemap_fault on XFS is probably trying to map large > folios, I do wonder if this is a result of some kind of race with > teardown of a large folio... > I somewhat recently tracked down a hugepage/swap problem that could manifest as a softlockup in the folio lookup path (due to indefinite folio_try_get_rcu() failure): https://lore.kernel.org/linux-mm/20220906190602.1626037-1-bfoster@redhat.com/ It could easily be something different leading to the same side effect, particularly since I believe the issue I saw was introduced in v5.19, but might be worth a test if you have a reliable reproducer. Brian > There is a very simple corruption reproducer script that has been > written, but I haven't been using it. I don't know if long term > running of the script here: > > https://lore.kernel.org/linux-xfs/d00aff43-2bdc-0724-1996-4e58e061ecfd@redhat.com/ > > will trigger the livelock as the verification step is > significantly different, but it will give you insight into the > setup of the environment that leads to the livelock. Maybe you could > replace the md5sum verification with a mmap read with xfs_io to > simulate the fault load that seems to lead to this issue... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Wed, Oct 19, 2022 at 01:48:37PM +0800, Zhaoyang Huang wrote: > On Wed, Oct 19, 2022 at 12:47 PM Dave Chinner <david@fromorbit.com> wrote: > > I removed the mapping_set_large_folios() calls in the XFS inode > > instantiation and the test code has now run over 55,000 iterations > > without failing. The most iterations I'd seen with large folios > > enabled was about 7,000 - typically it would fail within 2-3,000 > > iterations. > hint from my side. The original problem I raised is under v5.15 where > there is no folio yet. But 5.15 does use 2MB pages in shmem. You haven't really provided any information, so I don't know whether the inode that you're having problems with is a shmem inode.
On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > This is reading and writing the same amount of file data at the > application level, but once the data has been written and kicked out > of the page cache it seems to require an awful lot more read IO to > get it back to the application. i.e. this looks like mmap() is > readahead thrashing severely, and eventually it livelocks with this > sort of report: > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > [175901.995614] Call Trace: > [175901.996090] <TASK> > [175901.996594] ? __schedule+0x301/0xa30 > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > [175902.000714] ? xas_start+0x53/0xc0 > [175902.001484] ? xas_load+0x24/0xa0 > [175902.002208] ? xas_load+0x5/0xa0 > [175902.002878] ? __filemap_get_folio+0x87/0x340 > [175902.003823] ? filemap_fault+0x139/0x8d0 > [175902.004693] ? __do_fault+0x31/0x1d0 > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > [175902.006998] ? exc_page_fault+0x1d9/0x810 > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > [175902.008613] </TASK> > > Given that filemap_fault on XFS is probably trying to map large > folios, I do wonder if this is a result of some kind of race with > teardown of a large folio... It doesn't matter whether we're trying to map a large folio; it matters whether a large folio was previously created in the cache. Through the magic of readahead, it may well have been. I suspect it's not teardown of a large folio, but splitting. Removing a page from the page cache stores to the pointer in the XArray first (either NULL or a shadow entry), then decrements the refcount. We must be observing a frozen folio. There are a number of places in the MM which freeze a folio, but the obvious one is splitting. That looks like this: local_irq_disable(); if (mapping) { xas_lock(&xas); (...) if (folio_ref_freeze(folio, 1 + extra_pins)) { So one way to solve this might be to try to take the xa_lock on failure to get the refcount. Otherwise a high-priority task might spin forever without a low-priority task getting the chance to finish the work being done while the folio is frozen. ie this: diff --git a/mm/filemap.c b/mm/filemap.c index 08341616ae7a..ca0eed80580f 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1860,8 +1860,13 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index) if (!folio || xa_is_value(folio)) goto out; - if (!folio_try_get_rcu(folio)) + if (!folio_try_get_rcu(folio)) { + unsigned long flags; + + xas_lock_irqsave(&xas, flags); + xas_unlock_irqrestore(&xas, flags); goto repeat; + } if (unlikely(folio != xas_reload(&xas))) { folio_put(folio); @@ -2014,8 +2019,13 @@ static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max, if (!folio || xa_is_value(folio)) return folio; - if (!folio_try_get_rcu(folio)) + if (!folio_try_get_rcu(folio)) { + unsigned long flags; + + xas_lock_irqsave(xas, flags); + xas_unlock_irqrestore(xas, flags); goto reset; + } if (unlikely(folio != xas_reload(xas))) { folio_put(folio); @@ -2224,8 +2234,13 @@ unsigned filemap_get_folios_contig(struct address_space *mapping, if (xa_is_value(folio)) goto update_start; - if (!folio_try_get_rcu(folio)) + if (!folio_try_get_rcu(folio)) { + unsigned long flags; + + xas_lock_irqsave(&xas, flags); + xas_unlock_irqrestore(&xas, flags); goto retry; + } if (unlikely(folio != xas_reload(&xas))) goto put_folio; @@ -2365,8 +2380,13 @@ static void filemap_get_read_batch(struct address_space *mapping, break; if (xa_is_sibling(folio)) break; - if (!folio_try_get_rcu(folio)) + if (!folio_try_get_rcu(folio)) { + unsigned long flags; + + xas_lock_irqsave(&xas, flags); + xas_unlock_irqrestore(&xas, flags); goto retry; + } if (unlikely(folio != xas_reload(&xas))) goto put_folio; I'm kicking off an xfstests run to see if that causes any new problems. > There is a very simple corruption reproducer script that has been > written, but I haven't been using it. I don't know if long term > running of the script here: > > https://lore.kernel.org/linux-xfs/d00aff43-2bdc-0724-1996-4e58e061ecfd@redhat.com/ > > will trigger the livelock as the verification step is > significantly different, but it will give you insight into the > setup of the environment that leads to the livelock. Maybe you could > replace the md5sum verification with a mmap read with xfs_io to > simulate the fault load that seems to lead to this issue... If the problem needs scheduling parameters to hit, then that script isn't going to reproduce the problems for me. But I'll give it a try anyway.
On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote: > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > > This is reading and writing the same amount of file data at the > > application level, but once the data has been written and kicked out > > of the page cache it seems to require an awful lot more read IO to > > get it back to the application. i.e. this looks like mmap() is > > readahead thrashing severely, and eventually it livelocks with this > > sort of report: > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > > [175901.995614] Call Trace: > > [175901.996090] <TASK> > > [175901.996594] ? __schedule+0x301/0xa30 > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > > [175902.000714] ? xas_start+0x53/0xc0 > > [175902.001484] ? xas_load+0x24/0xa0 > > [175902.002208] ? xas_load+0x5/0xa0 > > [175902.002878] ? __filemap_get_folio+0x87/0x340 > > [175902.003823] ? filemap_fault+0x139/0x8d0 > > [175902.004693] ? __do_fault+0x31/0x1d0 > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > > [175902.006998] ? exc_page_fault+0x1d9/0x810 > > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > > [175902.008613] </TASK> > > > > Given that filemap_fault on XFS is probably trying to map large > > folios, I do wonder if this is a result of some kind of race with > > teardown of a large folio... > > It doesn't matter whether we're trying to map a large folio; it > matters whether a large folio was previously created in the cache. > Through the magic of readahead, it may well have been. I suspect > it's not teardown of a large folio, but splitting. Removing a > page from the page cache stores to the pointer in the XArray > first (either NULL or a shadow entry), then decrements the refcount. > > We must be observing a frozen folio. There are a number of places > in the MM which freeze a folio, but the obvious one is splitting. > That looks like this: > > local_irq_disable(); > if (mapping) { > xas_lock(&xas); > (...) > if (folio_ref_freeze(folio, 1 + extra_pins)) { But the lookup is not doing anything to prevent the split on the frozen page from making progress, right? It's not holding any folio references, and it's not holding the mapping tree lock, either. So how does the lookup in progress prevent the page split from making progress? > So one way to solve this might be to try to take the xa_lock on > failure to get the refcount. Otherwise a high-priority task > might spin forever without a low-priority task getting the chance > to finish the work being done while the folio is frozen. IIUC, then you are saying that there is a scheduling priority inversion because the lookup failure looping path doesn't yeild the CPU? If so, how does taking the mapping tree spin lock on failure cause the looping task to yield the CPU and hence allow the folio split to make progress? Also, AFAICT, the page split has disabled local interrupts, so it should effectively be running with preemption disabled as it has turned off the mechanism the scheduler can use to preempt it. The page split can't sleep, either, because it holds the mapping tree lock. Hence I can't see how a split-in-progress can be preempted in teh first place to cause a priority inversion livelock like this... > ie this: > > diff --git a/mm/filemap.c b/mm/filemap.c > index 08341616ae7a..ca0eed80580f 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1860,8 +1860,13 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index) > if (!folio || xa_is_value(folio)) > goto out; > > - if (!folio_try_get_rcu(folio)) > + if (!folio_try_get_rcu(folio)) { > + unsigned long flags; > + > + xas_lock_irqsave(&xas, flags); > + xas_unlock_irqrestore(&xas, flags); > goto repeat; > + } I would have thought: if (!folio_try_get_rcu(folio)) { rcu_read_unlock(); cond_resched(); rcu_read_lock(); goto repeat; } Would be the right way to yeild the CPU to avoid priority inversion related livelocks here... Cheers, Dave.
On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote: > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote: > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > > > This is reading and writing the same amount of file data at the > > > application level, but once the data has been written and kicked out > > > of the page cache it seems to require an awful lot more read IO to > > > get it back to the application. i.e. this looks like mmap() is > > > readahead thrashing severely, and eventually it livelocks with this > > > sort of report: > > > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > > > [175901.995614] Call Trace: > > > [175901.996090] <TASK> > > > [175901.996594] ? __schedule+0x301/0xa30 > > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > > > [175902.000714] ? xas_start+0x53/0xc0 > > > [175902.001484] ? xas_load+0x24/0xa0 > > > [175902.002208] ? xas_load+0x5/0xa0 > > > [175902.002878] ? __filemap_get_folio+0x87/0x340 > > > [175902.003823] ? filemap_fault+0x139/0x8d0 > > > [175902.004693] ? __do_fault+0x31/0x1d0 > > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > > > [175902.006998] ? exc_page_fault+0x1d9/0x810 > > > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > > > [175902.008613] </TASK> > > > > > > Given that filemap_fault on XFS is probably trying to map large > > > folios, I do wonder if this is a result of some kind of race with > > > teardown of a large folio... > > > > It doesn't matter whether we're trying to map a large folio; it > > matters whether a large folio was previously created in the cache. > > Through the magic of readahead, it may well have been. I suspect > > it's not teardown of a large folio, but splitting. Removing a > > page from the page cache stores to the pointer in the XArray > > first (either NULL or a shadow entry), then decrements the refcount. > > > > We must be observing a frozen folio. There are a number of places > > in the MM which freeze a folio, but the obvious one is splitting. > > That looks like this: > > > > local_irq_disable(); > > if (mapping) { > > xas_lock(&xas); > > (...) > > if (folio_ref_freeze(folio, 1 + extra_pins)) { > > But the lookup is not doing anything to prevent the split on the > frozen page from making progress, right? It's not holding any folio > references, and it's not holding the mapping tree lock, either. So > how does the lookup in progress prevent the page split from making > progress? > > > > So one way to solve this might be to try to take the xa_lock on > > failure to get the refcount. Otherwise a high-priority task > > might spin forever without a low-priority task getting the chance > > to finish the work being done while the folio is frozen. > > IIUC, then you are saying that there is a scheduling priority > inversion because the lookup failure looping path doesn't yeild the > CPU? > > If so, how does taking the mapping tree spin lock on failure cause > the looping task to yield the CPU and hence allow the folio split to > make progress? > > Also, AFAICT, the page split has disabled local interrupts, so it > should effectively be running with preemption disabled as it has > turned off the mechanism the scheduler can use to preempt it. The > page split can't sleep, either, because it holds the mapping tree > lock. Hence I can't see how a split-in-progress can be preempted in > teh first place to cause a priority inversion livelock like this... > > > ie this: > > > > diff --git a/mm/filemap.c b/mm/filemap.c > > index 08341616ae7a..ca0eed80580f 100644 > > --- a/mm/filemap.c > > +++ b/mm/filemap.c > > @@ -1860,8 +1860,13 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index) > > if (!folio || xa_is_value(folio)) > > goto out; > > > > - if (!folio_try_get_rcu(folio)) > > + if (!folio_try_get_rcu(folio)) { > > + unsigned long flags; > > + > > + xas_lock_irqsave(&xas, flags); > > + xas_unlock_irqrestore(&xas, flags); > > goto repeat; > > + } As I suspected, this change did not prevent the livelock. It reproduced after just 650 test iterations (a few minutes) with this change in place. > I would have thought: > > if (!folio_try_get_rcu(folio)) { > rcu_read_unlock(); > cond_resched(); > rcu_read_lock(); > goto repeat; > } > > Would be the right way to yeild the CPU to avoid priority > inversion related livelocks here... I'm now trying this just to provide a data point that actually yeilding the CPU avoids the downstream effects of the livelock (i.e. RCU grace period never expires, system eventually dies...). Maybe it will avoid the livelock altogether, but I still don't understand how huge page splitting would be getting preempted in the first place... Cheers, Dave.
On Thu, Oct 20, 2022 at 09:46:16AM +1100, Dave Chinner wrote: > On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote: > > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote: > > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > > > > This is reading and writing the same amount of file data at the > > > > application level, but once the data has been written and kicked out > > > > of the page cache it seems to require an awful lot more read IO to > > > > get it back to the application. i.e. this looks like mmap() is > > > > readahead thrashing severely, and eventually it livelocks with this > > > > sort of report: > > > > > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > > > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > > > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > > > > [175901.995614] Call Trace: > > > > [175901.996090] <TASK> > > > > [175901.996594] ? __schedule+0x301/0xa30 > > > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > > > > [175902.000714] ? xas_start+0x53/0xc0 > > > > [175902.001484] ? xas_load+0x24/0xa0 > > > > [175902.002208] ? xas_load+0x5/0xa0 > > > > [175902.002878] ? __filemap_get_folio+0x87/0x340 > > > > [175902.003823] ? filemap_fault+0x139/0x8d0 > > > > [175902.004693] ? __do_fault+0x31/0x1d0 > > > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > > > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > > > > [175902.006998] ? exc_page_fault+0x1d9/0x810 > > > > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > > > > [175902.008613] </TASK> > > > > > > > > Given that filemap_fault on XFS is probably trying to map large > > > > folios, I do wonder if this is a result of some kind of race with > > > > teardown of a large folio... > > > > > > It doesn't matter whether we're trying to map a large folio; it > > > matters whether a large folio was previously created in the cache. > > > Through the magic of readahead, it may well have been. I suspect > > > it's not teardown of a large folio, but splitting. Removing a > > > page from the page cache stores to the pointer in the XArray > > > first (either NULL or a shadow entry), then decrements the refcount. > > > > > > We must be observing a frozen folio. There are a number of places > > > in the MM which freeze a folio, but the obvious one is splitting. > > > That looks like this: > > > > > > local_irq_disable(); > > > if (mapping) { > > > xas_lock(&xas); > > > (...) > > > if (folio_ref_freeze(folio, 1 + extra_pins)) { > > > > But the lookup is not doing anything to prevent the split on the > > frozen page from making progress, right? It's not holding any folio > > references, and it's not holding the mapping tree lock, either. So > > how does the lookup in progress prevent the page split from making > > progress? > > > > > > > So one way to solve this might be to try to take the xa_lock on > > > failure to get the refcount. Otherwise a high-priority task > > > might spin forever without a low-priority task getting the chance > > > to finish the work being done while the folio is frozen. > > > > IIUC, then you are saying that there is a scheduling priority > > inversion because the lookup failure looping path doesn't yeild the > > CPU? > > > > If so, how does taking the mapping tree spin lock on failure cause > > the looping task to yield the CPU and hence allow the folio split to > > make progress? > > > > Also, AFAICT, the page split has disabled local interrupts, so it > > should effectively be running with preemption disabled as it has > > turned off the mechanism the scheduler can use to preempt it. The > > page split can't sleep, either, because it holds the mapping tree > > lock. Hence I can't see how a split-in-progress can be preempted in > > teh first place to cause a priority inversion livelock like this... > > > > > ie this: > > > > > > diff --git a/mm/filemap.c b/mm/filemap.c > > > index 08341616ae7a..ca0eed80580f 100644 > > > --- a/mm/filemap.c > > > +++ b/mm/filemap.c > > > @@ -1860,8 +1860,13 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index) > > > if (!folio || xa_is_value(folio)) > > > goto out; > > > > > > - if (!folio_try_get_rcu(folio)) > > > + if (!folio_try_get_rcu(folio)) { > > > + unsigned long flags; > > > + > > > + xas_lock_irqsave(&xas, flags); > > > + xas_unlock_irqrestore(&xas, flags); > > > goto repeat; > > > + } > > As I suspected, this change did not prevent the livelock. It > reproduced after just 650 test iterations (a few minutes) with this > change in place. > > > I would have thought: > > > > if (!folio_try_get_rcu(folio)) { > > rcu_read_unlock(); > > cond_resched(); > > rcu_read_lock(); > > goto repeat; > > } > > > > Would be the right way to yeild the CPU to avoid priority > > inversion related livelocks here... > > I'm now trying this just to provide a data point that actually > yeilding the CPU avoids the downstream effects of the livelock (i.e. This causes a different problems to occur. First, a task hangs evicting the page cache on unlink like so: [ 1352.990246] INFO: task test_write:3572 blocked for more than 122 seconds. [ 1352.991418] Not tainted 6.1.0-rc1-dgc+ #1618 [ 1352.992378] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1352.994702] task:test_write state:D stack:13088 pid:3572 ppid:3567 flags:0x00004002 [ 1352.996257] Call Trace: [ 1352.996905] <TASK> [ 1352.997452] __schedule+0x2f9/0xa20 [ 1352.998380] schedule+0x5a/0xc0 [ 1352.999011] io_schedule+0x42/0x70 [ 1352.999565] folio_wait_bit_common+0x137/0x370 [ 1353.000553] ? dio_warn_stale_pagecache.part.0+0x50/0x50 [ 1353.001425] truncate_inode_pages_range+0x41a/0x470 [ 1353.002443] evict+0x1ad/0x1c0 [ 1353.003199] do_unlinkat+0x1db/0x2e0 [ 1353.003907] __x64_sys_unlinkat+0x33/0x60 [ 1353.004569] do_syscall_64+0x35/0x80 [ 1353.005354] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 1353.006315] RIP: 0033:0x7f03d6a42157 [ 1353.007088] RSP: 002b:00007ffde8d98388 EFLAGS: 00000206 ORIG_RAX: 0000000000000107 [ 1353.008367] RAX: ffffffffffffffda RBX: 0000000000000df4 RCX: 00007f03d6a42157 [ 1353.009748] RDX: 0000000000000000 RSI: 000055f8660415a0 RDI: 0000000000000007 [ 1353.011115] RBP: 0000000000000003 R08: 0000000000000000 R09: 00007ffde8d96105 [ 1353.012422] R10: 0000000000000158 R11: 0000000000000206 R12: 0000000000000158 [ 1353.013556] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 1353.014958] </TASK> And soon after: [ 1607.746983] ------------[ cut here ]------------ [ 1607.748141] WARNING: CPU: 27 PID: 51385 at fs/iomap/buffered-io.c:80 iomap_page_release+0xaf/0xc0 [ 1607.751098] Modules linked in: [ 1607.751619] CPU: 27 PID: 51385 Comm: test_write Not tainted 6.1.0-rc1-dgc+ #1618 [ 1607.753307] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014 [ 1607.755443] RIP: 0010:iomap_page_release+0xaf/0xc0 [ 1607.756496] Code: 48 c1 e8 02 83 e0 01 75 10 38 d0 75 21 5b 4c 89 e7 5d 41 5c e9 d2 be ee ff eb ee e8 bb e0 ec ff eb 88 48 8b 07 5b 5d 41 5c c3 <0f> 0b eb b1 0f 0b eb db 0f 0b eb b2 0f 1f 44 00 00 0f 1f 44 00 00 [ 1607.759643] RSP: 0018:ffffc900087cb5b0 EFLAGS: 00010206 [ 1607.760775] RAX: 0000000000200000 RBX: ffffea0004378000 RCX: 000000000000000c [ 1607.762181] RDX: 0017ffffc0050081 RSI: ffff888145cb0178 RDI: ffffea0004378000 [ 1607.764408] RBP: 0000000000000200 R08: 0000000000200000 R09: ffffea0004378008 [ 1607.765560] R10: ffff888141988800 R11: 0000000000033900 R12: ffff88810200ee40 [ 1607.766988] R13: 0000000000000200 R14: ffffc900087cb8c0 R15: ffffea0004378008 [ 1607.768347] FS: 00007f03d5c00640(0000) GS:ffff88833bdc0000(0000) knlGS:0000000000000000 [ 1607.769756] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1607.771421] CR2: 00007f03cae00000 CR3: 0000000246226006 CR4: 0000000000770ee0 [ 1607.773298] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1607.775026] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1607.777341] PKRU: 55555554 [ 1607.777995] Call Trace: [ 1607.778637] <TASK> [ 1607.779372] iomap_release_folio+0x4d/0xa0 [ 1607.780510] shrink_folio_list+0x9dd/0xb10 [ 1607.781931] shrink_lruvec+0x558/0xb80 [ 1607.782922] shrink_node+0x2ce/0x6f0 [ 1607.783993] do_try_to_free_pages+0xd5/0x570 [ 1607.785031] try_to_free_mem_cgroup_pages+0x105/0x220 [ 1607.786728] reclaim_high+0xa6/0xf0 [ 1607.787672] mem_cgroup_handle_over_high+0x97/0x290 [ 1607.789179] try_charge_memcg+0x697/0x7f0 [ 1607.790100] charge_memcg+0x35/0xd0 [ 1607.790930] __mem_cgroup_charge+0x29/0x80 [ 1607.792884] __filemap_add_folio+0x21a/0x460 [ 1607.794342] ? scan_shadow_nodes+0x30/0x30 [ 1607.796052] filemap_add_folio+0x37/0xa0 [ 1607.797036] __filemap_get_folio+0x221/0x360 [ 1607.798468] ? xas_load+0x5/0xa0 [ 1607.799417] iomap_write_begin+0x103/0x580 [ 1607.800908] ? filemap_dirty_folio+0x5c/0x80 [ 1607.801940] ? iomap_write_end+0x101/0x250 [ 1607.803305] iomap_file_buffered_write+0x17f/0x330 [ 1607.804700] xfs_file_buffered_write+0xb1/0x2e0 [ 1607.806240] ? xfs_file_buffered_write+0x2b2/0x2e0 [ 1607.807743] vfs_write+0x2ca/0x3d0 [ 1607.808622] __x64_sys_pwrite64+0x8c/0xc0 [ 1607.810502] do_syscall_64+0x35/0x80 [ 1607.811511] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 1607.813116] RIP: 0033:0x7f03d6b40a87 .... [ 1607.925434] kernel BUG at mm/vmscan.c:1306! [ 1607.926977] invalid opcode: 0000 [#1] PREEMPT SMP [ 1607.928427] CPU: 8 PID: 51385 Comm: test_write Tainted: G W 6.1.0-rc1-dgc+ #1618 [ 1607.930742] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014 [ 1607.932716] RIP: 0010:__remove_mapping+0x229/0x240 [ 1607.933648] Code: a9 00 00 08 00 75 27 48 8b 3b 48 81 c7 88 00 00 00 e8 db 64 c1 00 31 c0 e9 d6 fe ff ff 48 8b 3b e8 7c 29 0d 00 e9 27 ff ff ff <0f> 0b 0f 0b 48 8b 45 00 f6 c4 04 74 d0 31 c0 e9 b5 fe ff ff 0f 1f [ 1607.937079] RSP: 0018:ffffc900087cb5d8 EFLAGS: 00010246 [ 1607.938050] RAX: 0017ffffc0010004 RBX: ffffc900087cb6e8 RCX: ffff888244bc5000 [ 1607.939443] RDX: 0000000000000001 RSI: ffffea0004378000 RDI: ffff888145cb0318 [ 1607.940980] RBP: ffffc900087cb648 R08: ffff888145cb0318 R09: ffffffff813be400 [ 1607.942300] R10: ffff88810200ee40 R11: 0000000000033900 R12: ffffea0004378000 [ 1607.943666] R13: 0000000000000200 R14: ffffc900087cb8c0 R15: ffffea0004378008 [ 1607.945004] FS: 00007f03d5c00640(0000) GS:ffff88823bc00000(0000) knlGS:0000000000000000 [ 1607.946492] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1607.947552] CR2: 00007efd1ae33da8 CR3: 0000000246226001 CR4: 0000000000770ee0 [ 1607.948866] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1607.950250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1607.951738] PKRU: 55555554 [ 1607.952289] Call Trace: [ 1607.952775] <TASK> [ 1607.953190] shrink_folio_list+0x5cd/0xb10 [ 1607.954140] shrink_lruvec+0x558/0xb80 [ 1607.954850] shrink_node+0x2ce/0x6f0 [ 1607.955536] do_try_to_free_pages+0xd5/0x570 [ 1607.956397] try_to_free_mem_cgroup_pages+0x105/0x220 [ 1607.957503] reclaim_high+0xa6/0xf0 [ 1607.958205] mem_cgroup_handle_over_high+0x97/0x290 [ 1607.959149] try_charge_memcg+0x697/0x7f0 [ 1607.959932] charge_memcg+0x35/0xd0 [ 1607.960549] __mem_cgroup_charge+0x29/0x80 [ 1607.961257] __filemap_add_folio+0x21a/0x460 [ 1607.961912] ? scan_shadow_nodes+0x30/0x30 [ 1607.962705] filemap_add_folio+0x37/0xa0 [ 1607.963354] __filemap_get_folio+0x221/0x360 [ 1607.964076] ? xas_load+0x5/0xa0 [ 1607.964761] iomap_write_begin+0x103/0x580 [ 1607.965510] ? filemap_dirty_folio+0x5c/0x80 [ 1607.966349] ? iomap_write_end+0x101/0x250 [ 1607.967185] iomap_file_buffered_write+0x17f/0x330 [ 1607.968133] xfs_file_buffered_write+0xb1/0x2e0 [ 1607.968948] ? xfs_file_buffered_write+0x2b2/0x2e0 [ 1607.969783] vfs_write+0x2ca/0x3d0 [ 1607.970326] __x64_sys_pwrite64+0x8c/0xc0 [ 1607.971073] do_syscall_64+0x35/0x80 [ 1607.971732] entry_SYSCALL_64_after_hwframe+0x63/0xcd which is: 1297 * Same as remove_mapping, but if the folio is removed from the mapping, it 1298 * gets returned with a refcount of 0. 1299 */ 1300 static int __remove_mapping(struct address_space *mapping, struct folio *folio, 1301 bool reclaimed, struct mem_cgroup *target_memcg) 1302 { 1303 int refcount; 1304 void *shadow = NULL; 1305 1306 BUG_ON(!folio_test_locked(folio)); 1307 BUG_ON(mapping != folio_mapping(folio)); 1308 tripping over an unlocked folio in shrink_folio_list().... And, finally, there's also a process spinning on a cpu in truncate_inode_pages_range(): [ 3468.057406] CPU: 1 PID: 3579 Comm: test_write Tainted: G D W 6.1.0-rc1-dgc+ #1618 [ 3468.057407] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014 [ 3468.057408] RIP: 0010:truncate_inode_pages_range+0x3c0/0x470 [ 3468.057411] Code: 8b 7c e5 00 41 f6 c6 01 0f 85 62 ff ff ff be a9 03 00 00 48 c7 c7 b8 62 7b 82 e8 0b 46 e9 ff 2e 2e 2e 31 c0 f0 49 0f ba 7 [ 3468.057412] RSP: 0018:ffffc90002cffd40 EFLAGS: 00000246 [ 3468.057414] RAX: 0000000000000000 RBX: ffffc90002cffdd8 RCX: ffff8882461b8000 [ 3468.057415] RDX: 0000000000000000 RSI: 00000000000003a9 RDI: ffffffff827b62b8 [ 3468.057415] RBP: ffff888246eee718 R08: ffffffffffffffc0 R09: fffffffffffffffe [ 3468.057416] R10: 0000000000002838 R11: 0000000000002834 R12: 0000000000000004 [ 3468.057417] R13: ffffc90002cffd60 R14: ffffea0004378000 R15: 0000000000002810 [ 3468.057418] FS: 00007f03d6953740(0000) GS:ffff88813bc40000(0000) knlGS:0000000000000000 [ 3468.057420] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3468.057421] CR2: 00007f03d0100000 CR3: 0000000245e2c006 CR4: 0000000000770ee0 root@test4:~# [ 3468.057422] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 3468.057422] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 3468.057423] PKRU: 55555554 [ 3468.057424] Call Trace: [ 3468.057424] <TASK> [ 3468.057426] evict+0x1ad/0x1c0 [ 3468.057429] do_unlinkat+0x1db/0x2e0 [ 3468.057432] __x64_sys_unlinkat+0x33/0x60 [ 3468.057434] do_syscall_64+0x35/0x80 [ 3468.057437] entry_SYSCALL_64_after_hwframe+0x63/0xcd So that's not the problem, either... -Dave.
On Wed, Oct 19, 2022 at 9:06 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Wed, Oct 19, 2022 at 01:48:37PM +0800, Zhaoyang Huang wrote: > > On Wed, Oct 19, 2022 at 12:47 PM Dave Chinner <david@fromorbit.com> wrote: > > > I removed the mapping_set_large_folios() calls in the XFS inode > > > instantiation and the test code has now run over 55,000 iterations > > > without failing. The most iterations I'd seen with large folios > > > enabled was about 7,000 - typically it would fail within 2-3,000 > > > iterations. > > hint from my side. The original problem I raised is under v5.15 where > > there is no folio yet. > > But 5.15 does use 2MB pages in shmem. You haven't really provided > any information, so I don't know whether the inode that you're having > problems with is a shmem inode. No, it is a f2fs inode using 4KB pages.
On Wed, Oct 19, 2022 at 07:49:57AM -0400, Brian Foster wrote: > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > > On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote: > > > On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote: > > > > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote: > > > > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote: > > > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > > > > > > > > > > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink > > > > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the > > > > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet > > > > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to > > > > > > > > suggest skip this page to break the live lock as a workaround. > > > > > > > > > > > > > > No, the underlying bug should be fixed. > > > > > > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > Understand. IMHO, find_get_entry actruely works as an open API dealing > > > > with different kinds of address_spaces page cache, which requires high > > > > robustness to deal with any corner cases. Take the current problem as > > > > example, the inode with fault page(refcount=0) could remain on the > > > > sb's list without live lock problem. > > > > > > But it's a corner case that shouldn't happen! What else is going on > > > at the time? Can you reproduce this problem easily? If so, how? > > > > I've been seeing this livelock, too. The reproducer is, > > unfortunately, something I can't share - it's a massive program that > > triggers a data corruption I'm working on solving. > > > > Now that I've > > mostly fixed the data corruption, long duration test runs end up > > livelocking in page cache lookup after several hours. > > > > The test is effectively writing a 100MB file with multiple threads > > doing reverse adjacent racing 1MB unaligned writes. Once the file is > > written, it is then mmap()d and read back from the filesystem for > > verification. > > > > THis is then run with tens of processes concurrently, and then under > > a massively confined memcg (e.g. 32 processes/files are run in a > > memcg with only 200MB of memory allowed). This causes writeback, > > readahead and memory reclaim to race with incoming mmap read faults > > and writes. The livelock occurs on file verification and it appears > > to be an interaction with readahead thrashing. > > > > On my test rig, the physical read to write ratio is at least 20:1 - > > with 32 processes running, the 5s IO rates are: > > > > Device tps MB_read/s MB_wrtn/s MB_dscd/s MB_read MB_wrtn MB_dscd > > dm-0 52187.20 3677.42 1345.92 0.00 18387 6729 0 > > dm-0 62865.60 5947.29 0.08 0.00 29736 0 0 > > dm-0 62972.80 5911.20 0.00 0.00 29556 0 0 > > dm-0 59803.00 5516.72 133.47 0.00 27583 667 0 > > dm-0 63068.20 5292.34 511.52 0.00 26461 2557 0 > > dm-0 56775.60 4184.52 1248.38 0.00 20922 6241 0 > > dm-0 63087.40 5901.26 43.77 0.00 29506 218 0 > > dm-0 62769.00 5833.97 60.54 0.00 29169 302 0 > > dm-0 64810.20 5636.13 305.63 0.00 28180 1528 0 > > dm-0 65222.60 5598.99 349.48 0.00 27994 1747 0 > > dm-0 62444.00 4887.05 926.67 0.00 24435 4633 0 > > dm-0 63812.00 5622.68 294.66 0.00 28113 1473 0 > > dm-0 63482.00 5728.43 195.74 0.00 28642 978 0 > > > > This is reading and writing the same amount of file data at the > > application level, but once the data has been written and kicked out > > of the page cache it seems to require an awful lot more read IO to > > get it back to the application. i.e. this looks like mmap() is > > readahead thrashing severely, and eventually it livelocks with this > > sort of report: > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > > [175901.995614] Call Trace: > > [175901.996090] <TASK> > > [175901.996594] ? __schedule+0x301/0xa30 > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > > [175902.000714] ? xas_start+0x53/0xc0 > > [175902.001484] ? xas_load+0x24/0xa0 > > [175902.002208] ? xas_load+0x5/0xa0 > > [175902.002878] ? __filemap_get_folio+0x87/0x340 > > [175902.003823] ? filemap_fault+0x139/0x8d0 > > [175902.004693] ? __do_fault+0x31/0x1d0 > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > > [175902.006998] ? exc_page_fault+0x1d9/0x810 > > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > > [175902.008613] </TASK> > > > > Given that filemap_fault on XFS is probably trying to map large > > folios, I do wonder if this is a result of some kind of race with > > teardown of a large folio... > > > > I somewhat recently tracked down a hugepage/swap problem that could > manifest as a softlockup in the folio lookup path (due to indefinite > folio_try_get_rcu() failure): > > https://lore.kernel.org/linux-mm/20220906190602.1626037-1-bfoster@redhat.com/ > > It could easily be something different leading to the same side effect, > particularly since I believe the issue I saw was introduced in v5.19, > but might be worth a test if you have a reliable reproducer. Tests run and, unfortunately, that patch doesn't prevent/fix the problem either. Cheers, Dave.
On Thu, Oct 20, 2022 at 10:04 AM Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Oct 19, 2022 at 07:49:57AM -0400, Brian Foster wrote: > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > > > On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote: > > > > On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote: > > > > > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote: > > > > > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote: > > > > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com> > > > > > > > > > > > > > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink > > > > > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the > > > > > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet > > > > > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to > > > > > > > > > suggest skip this page to break the live lock as a workaround. > > > > > > > > > > > > > > > > No, the underlying bug should be fixed. > > > > > > > > > > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > > > > Understand. IMHO, find_get_entry actruely works as an open API dealing > > > > > with different kinds of address_spaces page cache, which requires high > > > > > robustness to deal with any corner cases. Take the current problem as > > > > > example, the inode with fault page(refcount=0) could remain on the > > > > > sb's list without live lock problem. > > > > > > > > But it's a corner case that shouldn't happen! What else is going on > > > > at the time? Can you reproduce this problem easily? If so, how? > > > > > > I've been seeing this livelock, too. The reproducer is, > > > unfortunately, something I can't share - it's a massive program that > > > triggers a data corruption I'm working on solving. > > > > > > Now that I've > > > mostly fixed the data corruption, long duration test runs end up > > > livelocking in page cache lookup after several hours. > > > > > > The test is effectively writing a 100MB file with multiple threads > > > doing reverse adjacent racing 1MB unaligned writes. Once the file is > > > written, it is then mmap()d and read back from the filesystem for > > > verification. > > > > > > THis is then run with tens of processes concurrently, and then under > > > a massively confined memcg (e.g. 32 processes/files are run in a > > > memcg with only 200MB of memory allowed). This causes writeback, > > > readahead and memory reclaim to race with incoming mmap read faults > > > and writes. The livelock occurs on file verification and it appears > > > to be an interaction with readahead thrashing. > > > > > > On my test rig, the physical read to write ratio is at least 20:1 - > > > with 32 processes running, the 5s IO rates are: > > > > > > Device tps MB_read/s MB_wrtn/s MB_dscd/s MB_read MB_wrtn MB_dscd > > > dm-0 52187.20 3677.42 1345.92 0.00 18387 6729 0 > > > dm-0 62865.60 5947.29 0.08 0.00 29736 0 0 > > > dm-0 62972.80 5911.20 0.00 0.00 29556 0 0 > > > dm-0 59803.00 5516.72 133.47 0.00 27583 667 0 > > > dm-0 63068.20 5292.34 511.52 0.00 26461 2557 0 > > > dm-0 56775.60 4184.52 1248.38 0.00 20922 6241 0 > > > dm-0 63087.40 5901.26 43.77 0.00 29506 218 0 > > > dm-0 62769.00 5833.97 60.54 0.00 29169 302 0 > > > dm-0 64810.20 5636.13 305.63 0.00 28180 1528 0 > > > dm-0 65222.60 5598.99 349.48 0.00 27994 1747 0 > > > dm-0 62444.00 4887.05 926.67 0.00 24435 4633 0 > > > dm-0 63812.00 5622.68 294.66 0.00 28113 1473 0 > > > dm-0 63482.00 5728.43 195.74 0.00 28642 978 0 > > > > > > This is reading and writing the same amount of file data at the > > > application level, but once the data has been written and kicked out > > > of the page cache it seems to require an awful lot more read IO to > > > get it back to the application. i.e. this looks like mmap() is > > > readahead thrashing severely, and eventually it livelocks with this > > > sort of report: > > > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > > > [175901.995614] Call Trace: > > > [175901.996090] <TASK> > > > [175901.996594] ? __schedule+0x301/0xa30 > > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > > > [175902.000714] ? xas_start+0x53/0xc0 > > > [175902.001484] ? xas_load+0x24/0xa0 > > > [175902.002208] ? xas_load+0x5/0xa0 > > > [175902.002878] ? __filemap_get_folio+0x87/0x340 > > > [175902.003823] ? filemap_fault+0x139/0x8d0 > > > [175902.004693] ? __do_fault+0x31/0x1d0 > > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > > > [175902.006998] ? exc_page_fault+0x1d9/0x810 > > > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > > > [175902.008613] </TASK> > > > > > > Given that filemap_fault on XFS is probably trying to map large > > > folios, I do wonder if this is a result of some kind of race with > > > teardown of a large folio... > > > > > > > I somewhat recently tracked down a hugepage/swap problem that could > > manifest as a softlockup in the folio lookup path (due to indefinite > > folio_try_get_rcu() failure): > > > > https://lore.kernel.org/linux-mm/20220906190602.1626037-1-bfoster@redhat.com/ > > > > It could easily be something different leading to the same side effect, > > particularly since I believe the issue I saw was introduced in v5.19, > > but might be worth a test if you have a reliable reproducer. > > Tests run and, unfortunately, that patch doesn't prevent/fix the > problem either. Could you please help to test the following change which breaks the livelock by surpassing the zero-refcount page directly without reset iterate to ROOT. if (!folio_try_get_rcu(folio)) { goto retry; } > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote: > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote: > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > > > This is reading and writing the same amount of file data at the > > > application level, but once the data has been written and kicked out > > > of the page cache it seems to require an awful lot more read IO to > > > get it back to the application. i.e. this looks like mmap() is > > > readahead thrashing severely, and eventually it livelocks with this > > > sort of report: > > > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > > > [175901.995614] Call Trace: > > > [175901.996090] <TASK> > > > [175901.996594] ? __schedule+0x301/0xa30 > > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > > > [175902.000714] ? xas_start+0x53/0xc0 > > > [175902.001484] ? xas_load+0x24/0xa0 > > > [175902.002208] ? xas_load+0x5/0xa0 > > > [175902.002878] ? __filemap_get_folio+0x87/0x340 > > > [175902.003823] ? filemap_fault+0x139/0x8d0 > > > [175902.004693] ? __do_fault+0x31/0x1d0 > > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > > > [175902.006998] ? exc_page_fault+0x1d9/0x810 > > > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > > > [175902.008613] </TASK> > > > > > > Given that filemap_fault on XFS is probably trying to map large > > > folios, I do wonder if this is a result of some kind of race with > > > teardown of a large folio... > > > > It doesn't matter whether we're trying to map a large folio; it > > matters whether a large folio was previously created in the cache. > > Through the magic of readahead, it may well have been. I suspect > > it's not teardown of a large folio, but splitting. Removing a > > page from the page cache stores to the pointer in the XArray > > first (either NULL or a shadow entry), then decrements the refcount. > > > > We must be observing a frozen folio. There are a number of places > > in the MM which freeze a folio, but the obvious one is splitting. > > That looks like this: > > > > local_irq_disable(); > > if (mapping) { > > xas_lock(&xas); > > (...) > > if (folio_ref_freeze(folio, 1 + extra_pins)) { > > But the lookup is not doing anything to prevent the split on the > frozen page from making progress, right? It's not holding any folio > references, and it's not holding the mapping tree lock, either. So > how does the lookup in progress prevent the page split from making > progress? My thinking was that it keeps hammering the ->refcount field in struct folio. That might prevent a thread on a different socket from making forward progress. In contrast, spinlocks are designed to be fair under contention, so by spinning on an actual lock, we'd remove contention on the folio. But I think the tests you've done refute that theory. I'm all out of ideas at the moment. Either we have a frozen folio from somebody who doesn't hold the lock, or we have someone who's left a frozen folio in the page cache. I'm leaning towards that explanation at the moment, but I don't have a good suggestion for debugging. Perhaps a bad suggestion for debugging would be to call dump_page() with a __ratelimit() wrapper to not be overwhelmed with information? > I would have thought: > > if (!folio_try_get_rcu(folio)) { > rcu_read_unlock(); > cond_resched(); > rcu_read_lock(); > goto repeat; > } > > Would be the right way to yeild the CPU to avoid priority > inversion related livelocks here... I'm not sure we're allowed to schedule here. We might be under another spinlock?
On Fri, Oct 21, 2022 at 5:52 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote: > > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote: > > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > > > > This is reading and writing the same amount of file data at the > > > > application level, but once the data has been written and kicked out > > > > of the page cache it seems to require an awful lot more read IO to > > > > get it back to the application. i.e. this looks like mmap() is > > > > readahead thrashing severely, and eventually it livelocks with this > > > > sort of report: > > > > > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > > > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > > > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > > > > [175901.995614] Call Trace: > > > > [175901.996090] <TASK> > > > > [175901.996594] ? __schedule+0x301/0xa30 > > > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > > > > [175902.000714] ? xas_start+0x53/0xc0 > > > > [175902.001484] ? xas_load+0x24/0xa0 > > > > [175902.002208] ? xas_load+0x5/0xa0 > > > > [175902.002878] ? __filemap_get_folio+0x87/0x340 > > > > [175902.003823] ? filemap_fault+0x139/0x8d0 > > > > [175902.004693] ? __do_fault+0x31/0x1d0 > > > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > > > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > > > > [175902.006998] ? exc_page_fault+0x1d9/0x810 > > > > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > > > > [175902.008613] </TASK> > > > > > > > > Given that filemap_fault on XFS is probably trying to map large > > > > folios, I do wonder if this is a result of some kind of race with > > > > teardown of a large folio... > > > > > > It doesn't matter whether we're trying to map a large folio; it > > > matters whether a large folio was previously created in the cache. > > > Through the magic of readahead, it may well have been. I suspect > > > it's not teardown of a large folio, but splitting. Removing a > > > page from the page cache stores to the pointer in the XArray > > > first (either NULL or a shadow entry), then decrements the refcount. > > > > > > We must be observing a frozen folio. There are a number of places > > > in the MM which freeze a folio, but the obvious one is splitting. > > > That looks like this: > > > > > > local_irq_disable(); > > > if (mapping) { > > > xas_lock(&xas); > > > (...) > > > if (folio_ref_freeze(folio, 1 + extra_pins)) { > > > > But the lookup is not doing anything to prevent the split on the > > frozen page from making progress, right? It's not holding any folio > > references, and it's not holding the mapping tree lock, either. So > > how does the lookup in progress prevent the page split from making > > progress? > > My thinking was that it keeps hammering the ->refcount field in > struct folio. That might prevent a thread on a different socket > from making forward progress. In contrast, spinlocks are designed > to be fair under contention, so by spinning on an actual lock, we'd > remove contention on the folio. > > But I think the tests you've done refute that theory. I'm all out of > ideas at the moment. Either we have a frozen folio from somebody who > doesn't hold the lock, or we have someone who's left a frozen folio in > the page cache. I'm leaning towards that explanation at the moment, > but I don't have a good suggestion for debugging. > > Perhaps a bad suggestion for debugging would be to call dump_page() > with a __ratelimit() wrapper to not be overwhelmed with information? > > > I would have thought: > > > > if (!folio_try_get_rcu(folio)) { > > rcu_read_unlock(); > > cond_resched(); > > rcu_read_lock(); > > goto repeat; > > } > > > > Would be the right way to yeild the CPU to avoid priority > > inversion related livelocks here... > > I'm not sure we're allowed to schedule here. We might be under another > spinlock? Any further ideas on this issue? Could we just deal with it as simply as surpass the zero refed page to break the livelock as a workaround? IMO, the system could survive if it is a single inode leak or expose other faults if the page cache messed up, which is better than livelock here. We do the similar thing during reclaiming as force reset the page's mapcount to -1 even if there is active reference on it. static void unaccount_page_cache_page(struct address_space *mapping, struct page *page) { if (mapping_exiting(mapping) && page_count(page) >= mapcount + 2) { /* * All vmas have already been torn down, so it's * a good bet that actually the page is unmapped, * and we'd prefer not to leak it: if we're wrong, * some other bad page check should catch it later. */ page_mapcount_reset(page); page_ref_sub(page, mapcount); }
On Wed, Oct 26, 2022 at 04:38:31PM +0800, Zhaoyang Huang wrote: > On Fri, Oct 21, 2022 at 5:52 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote: > > > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote: > > > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > > > > > This is reading and writing the same amount of file data at the > > > > > application level, but once the data has been written and kicked out > > > > > of the page cache it seems to require an awful lot more read IO to > > > > > get it back to the application. i.e. this looks like mmap() is > > > > > readahead thrashing severely, and eventually it livelocks with this > > > > > sort of report: > > > > > > > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > > > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > > > > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > > > > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > > > > > [175901.995614] Call Trace: > > > > > [175901.996090] <TASK> > > > > > [175901.996594] ? __schedule+0x301/0xa30 > > > > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > > > > > [175902.000714] ? xas_start+0x53/0xc0 > > > > > [175902.001484] ? xas_load+0x24/0xa0 > > > > > [175902.002208] ? xas_load+0x5/0xa0 > > > > > [175902.002878] ? __filemap_get_folio+0x87/0x340 > > > > > [175902.003823] ? filemap_fault+0x139/0x8d0 > > > > > [175902.004693] ? __do_fault+0x31/0x1d0 > > > > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > > > > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > > > > > [175902.006998] ? exc_page_fault+0x1d9/0x810 > > > > > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > > > > > [175902.008613] </TASK> > > > > > > > > > > Given that filemap_fault on XFS is probably trying to map large > > > > > folios, I do wonder if this is a result of some kind of race with > > > > > teardown of a large folio... > > > > > > > > It doesn't matter whether we're trying to map a large folio; it > > > > matters whether a large folio was previously created in the cache. > > > > Through the magic of readahead, it may well have been. I suspect > > > > it's not teardown of a large folio, but splitting. Removing a > > > > page from the page cache stores to the pointer in the XArray > > > > first (either NULL or a shadow entry), then decrements the refcount. > > > > > > > > We must be observing a frozen folio. There are a number of places > > > > in the MM which freeze a folio, but the obvious one is splitting. > > > > That looks like this: > > > > > > > > local_irq_disable(); > > > > if (mapping) { > > > > xas_lock(&xas); > > > > (...) > > > > if (folio_ref_freeze(folio, 1 + extra_pins)) { > > > > > > But the lookup is not doing anything to prevent the split on the > > > frozen page from making progress, right? It's not holding any folio > > > references, and it's not holding the mapping tree lock, either. So > > > how does the lookup in progress prevent the page split from making > > > progress? > > > > My thinking was that it keeps hammering the ->refcount field in > > struct folio. That might prevent a thread on a different socket > > from making forward progress. In contrast, spinlocks are designed > > to be fair under contention, so by spinning on an actual lock, we'd > > remove contention on the folio. > > > > But I think the tests you've done refute that theory. I'm all out of > > ideas at the moment. Either we have a frozen folio from somebody who > > doesn't hold the lock, or we have someone who's left a frozen folio in > > the page cache. I'm leaning towards that explanation at the moment, > > but I don't have a good suggestion for debugging. > > > > Perhaps a bad suggestion for debugging would be to call dump_page() > > with a __ratelimit() wrapper to not be overwhelmed with information? > > > > > I would have thought: > > > > > > if (!folio_try_get_rcu(folio)) { > > > rcu_read_unlock(); > > > cond_resched(); > > > rcu_read_lock(); > > > goto repeat; > > > } > > > > > > Would be the right way to yeild the CPU to avoid priority > > > inversion related livelocks here... > > > > I'm not sure we're allowed to schedule here. We might be under another > > spinlock? > Any further ideas on this issue? Could we just deal with it as simply > as surpass the zero refed page to break the livelock as a workaround? No. This bug needs to be found & fixed. How easily can you reproduce it?
On Thu, Oct 20, 2022 at 10:52:14PM +0100, Matthew Wilcox wrote: > But I think the tests you've done refute that theory. I'm all out of > ideas at the moment. I have a new idea. In page_cache_delete_batch(), we don't set the order of the entry before calling xas_store(). That means we can end up in a situation where we have an order-2 folio in the page cache, delete it and end up with a NULL pointer at (say) index 20 and sibling entries at indices 21-23. We can come along (potentially much later) and put an order-0 folio back at index 20. Now all of indices 20-23 point to the index-20, order-0 folio. Worse, the xarray node can be freed with the sibling entries still intact and then be reallocated by an entirely different xarray. I don't know if this is going to fix the problem you're seeing. I can't quite draw a line from this situation to your symptoms. I came across it while auditing all the places which set folio->mapping to NULL. I did notice a mis-ordering; all the other places first remove the folio from the xarray before setting folio to NULL, but I have a hard time connecting that to your symptoms either. diff --git a/include/linux/xarray.h b/include/linux/xarray.h index 44dd6d6e01bc..cc1fd1f849a7 100644 --- a/include/linux/xarray.h +++ b/include/linux/xarray.h @@ -1617,6 +1617,12 @@ static inline void xas_advance(struct xa_state *xas, unsigned long index) xas->xa_offset = (index >> shift) & XA_CHUNK_MASK; } +static inline void xas_adjust_order(struct xa_state *xas, unsigned int order) +{ + xas->xa_shift = order - (order % XA_CHUNK_SHIFT); + xas->xa_sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1; +} + /** * xas_set_order() - Set up XArray operation state for a multislot entry. * @xas: XArray operation state. @@ -1628,8 +1634,7 @@ static inline void xas_set_order(struct xa_state *xas, unsigned long index, { #ifdef CONFIG_XARRAY_MULTI xas->xa_index = order < BITS_PER_LONG ? (index >> order) << order : 0; - xas->xa_shift = order - (order % XA_CHUNK_SHIFT); - xas->xa_sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1; + xas_adjust_order(xas, order); xas->xa_node = XAS_RESTART; #else BUG_ON(order > 0); diff --git a/mm/filemap.c b/mm/filemap.c index 08341616ae7a..6e3f486131e4 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -305,11 +305,13 @@ static void page_cache_delete_batch(struct address_space *mapping, WARN_ON_ONCE(!folio_test_locked(folio)); + if (!folio_test_hugetlb(folio)) + xas_adjust_order(&xas, folio_order(folio)); + xas_store(&xas, NULL); folio->mapping = NULL; /* Leave folio->index set: truncation lookup relies on it */ i++; - xas_store(&xas, NULL); total_pages += folio_nr_pages(folio); } mapping->nrpages -= total_pages;
On Wed, Oct 19, 2022 at 01:48:37PM +0800, Zhaoyang Huang wrote: > hint from my side. The original problem I raised is under v5.15 where > there is no folio yet. I really wish you'd stop dropping hints and give all of the information you have so I can try to figure this out withouot sending individual emails for every little piece. Do you have CONFIG_READ_ONLY_THP_FOR_FS=y in your .config?
On Thu, Oct 27, 2022 at 3:49 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Wed, Oct 19, 2022 at 01:48:37PM +0800, Zhaoyang Huang wrote: > > hint from my side. The original problem I raised is under v5.15 where > > there is no folio yet. > > I really wish you'd stop dropping hints and give all of the information > you have so I can try to figure this out withouot sending individual > emails for every little piece. > > Do you have CONFIG_READ_ONLY_THP_FOR_FS=y in your .config? No, it is not configured. f2fs only use 4KB pages and blocks >
On Wed, Oct 26, 2022 at 05:01:57PM +0100, Matthew Wilcox wrote: > On Thu, Oct 20, 2022 at 10:52:14PM +0100, Matthew Wilcox wrote: > > But I think the tests you've done refute that theory. I'm all out of > > ideas at the moment. > > I have a new idea. In page_cache_delete_batch(), we don't set the > order of the entry before calling xas_store(). That means we can end > up in a situation where we have an order-2 folio in the page cache, > delete it and end up with a NULL pointer at (say) index 20 and sibling > entries at indices 21-23. We can come along (potentially much later) > and put an order-0 folio back at index 20. Now all of indices 20-23 > point to the index-20, order-0 folio. Worse, the xarray node can be > freed with the sibling entries still intact and then be reallocated by > an entirely different xarray. > > I don't know if this is going to fix the problem you're seeing. I can't > quite draw a line from this situation to your symptoms. I came across > it while auditing all the places which set folio->mapping to NULL. > I did notice a mis-ordering; all the other places first remove the folio > from the xarray before setting folio to NULL, but I have a hard time > connecting that to your symptoms either. > > diff --git a/include/linux/xarray.h b/include/linux/xarray.h > index 44dd6d6e01bc..cc1fd1f849a7 100644 > --- a/include/linux/xarray.h > +++ b/include/linux/xarray.h > @@ -1617,6 +1617,12 @@ static inline void xas_advance(struct xa_state *xas, unsigned long index) > xas->xa_offset = (index >> shift) & XA_CHUNK_MASK; > } > > +static inline void xas_adjust_order(struct xa_state *xas, unsigned int order) > +{ > + xas->xa_shift = order - (order % XA_CHUNK_SHIFT); > + xas->xa_sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1; > +} > + > /** > * xas_set_order() - Set up XArray operation state for a multislot entry. > * @xas: XArray operation state. > @@ -1628,8 +1634,7 @@ static inline void xas_set_order(struct xa_state *xas, unsigned long index, > { > #ifdef CONFIG_XARRAY_MULTI > xas->xa_index = order < BITS_PER_LONG ? (index >> order) << order : 0; > - xas->xa_shift = order - (order % XA_CHUNK_SHIFT); > - xas->xa_sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1; > + xas_adjust_order(xas, order); > xas->xa_node = XAS_RESTART; > #else > BUG_ON(order > 0); > diff --git a/mm/filemap.c b/mm/filemap.c > index 08341616ae7a..6e3f486131e4 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -305,11 +305,13 @@ static void page_cache_delete_batch(struct address_space *mapping, > > WARN_ON_ONCE(!folio_test_locked(folio)); > > + if (!folio_test_hugetlb(folio)) > + xas_adjust_order(&xas, folio_order(folio)); > + xas_store(&xas, NULL); > folio->mapping = NULL; > /* Leave folio->index set: truncation lookup relies on it */ > > i++; > - xas_store(&xas, NULL); > total_pages += folio_nr_pages(folio); > } > mapping->nrpages -= total_pages; Nope, that ain't it. I think I've got the data corruption fix sorted now (at least, g/270 isn't assert failing on stray delalloc extents anymore), so if that's the case, I can spend some time actively trying to track this down.... -Dave.
On Thu, Oct 20, 2022 at 10:52:14PM +0100, Matthew Wilcox wrote: > On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote: > > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote: > > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > > > > This is reading and writing the same amount of file data at the > > > > application level, but once the data has been written and kicked out > > > > of the page cache it seems to require an awful lot more read IO to > > > > get it back to the application. i.e. this looks like mmap() is > > > > readahead thrashing severely, and eventually it livelocks with this > > > > sort of report: > > > > > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > > > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > > > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > > > > [175901.995614] Call Trace: > > > > [175901.996090] <TASK> > > > > [175901.996594] ? __schedule+0x301/0xa30 > > > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > > > > [175902.000714] ? xas_start+0x53/0xc0 > > > > [175902.001484] ? xas_load+0x24/0xa0 > > > > [175902.002208] ? xas_load+0x5/0xa0 > > > > [175902.002878] ? __filemap_get_folio+0x87/0x340 > > > > [175902.003823] ? filemap_fault+0x139/0x8d0 > > > > [175902.004693] ? __do_fault+0x31/0x1d0 > > > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > > > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > > > > [175902.006998] ? exc_page_fault+0x1d9/0x810 > > > > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > > > > [175902.008613] </TASK> > > > > > > > > Given that filemap_fault on XFS is probably trying to map large > > > > folios, I do wonder if this is a result of some kind of race with > > > > teardown of a large folio... > > > > > > It doesn't matter whether we're trying to map a large folio; it > > > matters whether a large folio was previously created in the cache. > > > Through the magic of readahead, it may well have been. I suspect > > > it's not teardown of a large folio, but splitting. Removing a > > > page from the page cache stores to the pointer in the XArray > > > first (either NULL or a shadow entry), then decrements the refcount. > > > > > > We must be observing a frozen folio. There are a number of places > > > in the MM which freeze a folio, but the obvious one is splitting. > > > That looks like this: > > > > > > local_irq_disable(); > > > if (mapping) { > > > xas_lock(&xas); > > > (...) > > > if (folio_ref_freeze(folio, 1 + extra_pins)) { > > > > But the lookup is not doing anything to prevent the split on the > > frozen page from making progress, right? It's not holding any folio > > references, and it's not holding the mapping tree lock, either. So > > how does the lookup in progress prevent the page split from making > > progress? > > My thinking was that it keeps hammering the ->refcount field in > struct folio. That might prevent a thread on a different socket > from making forward progress. In contrast, spinlocks are designed > to be fair under contention, so by spinning on an actual lock, we'd > remove contention on the folio. > > But I think the tests you've done refute that theory. I'm all out of > ideas at the moment. Either we have a frozen folio from somebody who > doesn't hold the lock, or we have someone who's left a frozen folio in > the page cache. I'm leaning towards that explanation at the moment, > but I don't have a good suggestion for debugging. It's something else. I got gdb attached to qemu and single stepped the looping lookup. The context I caught this time is truncate after unlink: (gdb) bt #0 find_get_entry (mark=<optimized out>, max=<optimized out>, xas=<optimized out>) at mm/filemap.c:2014 #1 find_lock_entries (mapping=mapping@entry=0xffff8882445e2118, start=start@entry=25089, end=end@entry=18446744073709551614, fbatch=fbatch@entry=0xffffc900082a7dd8, indices=indices@entry=0xffffc900082a7d60) at mm/filemap.c:2095 #2 0xffffffff8128f024 in truncate_inode_pages_range (mapping=mapping@entry=0xffff8882445e2118, lstart=lstart@entry=0, lend=lend@entry=-1) at mm/truncate.c:364 #3 0xffffffff8128f452 in truncate_inode_pages (lstart=0, mapping=0xffff8882445e2118) at mm/truncate.c:452 #4 0xffffffff8136335d in evict (inode=inode@entry=0xffff8882445e1f78) at fs/inode.c:666 #5 0xffffffff813636cc in iput_final (inode=0xffff8882445e1f78) at fs/inode.c:1747 #6 0xffffffff81355b8b in do_unlinkat (dfd=dfd@entry=10, name=0xffff88834170e000) at fs/namei.c:4326 #7 0xffffffff81355cc3 in __do_sys_unlinkat (flag=<optimized out>, pathname=<optimized out>, dfd=<optimized out>) at fs/namei.c:4362 #8 __se_sys_unlinkat (flag=<optimized out>, pathname=<optimized out>, dfd=<optimized out>) at fs/namei.c:4355 #9 __x64_sys_unlinkat (regs=<optimized out>) at fs/namei.c:4355 #10 0xffffffff81e92e35 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc900082a7f58) at arch/x86/entry/common.c:50 #11 do_syscall_64 (regs=0xffffc900082a7f58, nr=<optimized out>) at arch/x86/entry/common.c:80 #12 0xffffffff82000087 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120 #13 0x0000000000000000 in ?? () The find_lock_entries() call is being asked to start at index 25089, and we are spinning on a folio we find because folio_try_get_rcu(folio) is failing - the folio ref count is zero. The xas state on lookup is: (gdb) p *xas $6 = {xa = 0xffff8882445e2120, xa_index = 25092, xa_shift = 0 '\000', xa_sibs = 0 '\000', xa_offset = 4 '\004', xa_pad = 0 '\000', xa_node = 0xffff888144c15918, xa_alloc = 0x0 <fixed_percpu_data>, xa_update = 0x0 <fixed_percpu_data>, xa_lru = 0x0 <fixed_percpu_data> indicating that we are trying to look up index 25092 (3 pages further in than the start of the batch), and the folio that this keeps returning is this: (gdb) p *folio $7 = {{{flags = 24769796876795904, {lru = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008, mlock_count = 1073569616}}, mapping = 0x0 <fixed_percpu_data>, index = 18688, private = 0x8 <fixed_percpu_data+8>, _mapcount = { counter = -129}, _refcount = {counter = 0}, memcg_data = 0}, page = {flags = 24769796876795904, {{{lru = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008, mlock_count = 1073569616}, buddy_list = { next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, pcp_list = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}}, mapping = 0x0 <fixed_percpu_data>, index = 18688, private = 8}, {pp_magic = 18446719884544507912, pp = 0xffff88823ffd5f50, _pp_mapping_pad = 0, dma_addr = 18688, {dma_addr_upper = 8, pp_frag_count = {counter = 8}}}, { compound_head = 18446719884544507912, compound_dtor = 80 'P', compound_order = 95 '_', compound_mapcount = {counter = -30590}, compound_pincount = {counter = 0}, compound_nr = 0}, {_compound_pad_1 = 18446719884544507912, _compound_pad_2 = 18446612691733536592, deferred_list = {next = 0x0 <fixed_percpu_data>, prev = 0x4900 <irq_stack_backing_store+10496>}}, {_pt_pad_1 = 18446719884544507912, pmd_huge_pte = 0xffff88823ffd5f50, _pt_pad_2 = 0, {pt_mm = 0x4900 <irq_stack_backing_store+10496>, pt_frag_refcount = {counter = 18688}}, ptl = 0x8 <fixed_percpu_data+8>}, {pgmap = 0xffffea0005690008, zone_device_data = 0xffff88823ffd5f50}, callback_head = { next = 0xffffea0005690008, func = 0xffff88823ffd5f50}}, {_mapcount = {counter = -129}, page_type = 4294967167}, _refcount = { counter = 0}, memcg_data = 0}}, _flags_1 = 24769796876795904, __head = 0, _folio_dtor = 3 '\003', _folio_order = 8 '\b', _total_mapcount = {counter = -1}, _pincount = {counter = 0}, _folio_nr_pages = 0} (gdb) The folio has a NULL mapping, and an index of 18688, which means even if it was not a folio that has been invalidated or freed, the index is way outside the range we are looking for. If I step it round the lookup loop, xas does not change, and the same folio is returned every time through the loop. Perhaps the mapping tree itself might be corrupt??? It's simple enough to stop the machine once it has become stuck to observe the iteration and dump structures, just tell me what you need to know from here... Cheers, Dave.
On Tue, Nov 1, 2022 at 3:17 PM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Oct 20, 2022 at 10:52:14PM +0100, Matthew Wilcox wrote: > > On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote: > > > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote: > > > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote: > > > > > This is reading and writing the same amount of file data at the > > > > > application level, but once the data has been written and kicked out > > > > > of the page cache it seems to require an awful lot more read IO to > > > > > get it back to the application. i.e. this looks like mmap() is > > > > > readahead thrashing severely, and eventually it livelocks with this > > > > > sort of report: > > > > > > > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > > > > > [175901.985095] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728 > > > > > [175901.987996] (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32) > > > > > [175901.991698] task:test_write state:R running task stack:12784 pid:25728 ppid: 25696 flags:0x00004002 > > > > > [175901.995614] Call Trace: > > > > > [175901.996090] <TASK> > > > > > [175901.996594] ? __schedule+0x301/0xa30 > > > > > [175901.997411] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > > > [175901.998513] ? sysvec_apic_timer_interrupt+0xb/0x90 > > > > > [175901.999578] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > > > > > [175902.000714] ? xas_start+0x53/0xc0 > > > > > [175902.001484] ? xas_load+0x24/0xa0 > > > > > [175902.002208] ? xas_load+0x5/0xa0 > > > > > [175902.002878] ? __filemap_get_folio+0x87/0x340 > > > > > [175902.003823] ? filemap_fault+0x139/0x8d0 > > > > > [175902.004693] ? __do_fault+0x31/0x1d0 > > > > > [175902.005372] ? __handle_mm_fault+0xda9/0x17d0 > > > > > [175902.006213] ? handle_mm_fault+0xd0/0x2a0 > > > > > [175902.006998] ? exc_page_fault+0x1d9/0x810 > > > > > [175902.007789] ? asm_exc_page_fault+0x22/0x30 > > > > > [175902.008613] </TASK> > > > > > > > > > > Given that filemap_fault on XFS is probably trying to map large > > > > > folios, I do wonder if this is a result of some kind of race with > > > > > teardown of a large folio... > > > > > > > > It doesn't matter whether we're trying to map a large folio; it > > > > matters whether a large folio was previously created in the cache. > > > > Through the magic of readahead, it may well have been. I suspect > > > > it's not teardown of a large folio, but splitting. Removing a > > > > page from the page cache stores to the pointer in the XArray > > > > first (either NULL or a shadow entry), then decrements the refcount. > > > > > > > > We must be observing a frozen folio. There are a number of places > > > > in the MM which freeze a folio, but the obvious one is splitting. > > > > That looks like this: > > > > > > > > local_irq_disable(); > > > > if (mapping) { > > > > xas_lock(&xas); > > > > (...) > > > > if (folio_ref_freeze(folio, 1 + extra_pins)) { > > > > > > But the lookup is not doing anything to prevent the split on the > > > frozen page from making progress, right? It's not holding any folio > > > references, and it's not holding the mapping tree lock, either. So > > > how does the lookup in progress prevent the page split from making > > > progress? > > > > My thinking was that it keeps hammering the ->refcount field in > > struct folio. That might prevent a thread on a different socket > > from making forward progress. In contrast, spinlocks are designed > > to be fair under contention, so by spinning on an actual lock, we'd > > remove contention on the folio. > > > > But I think the tests you've done refute that theory. I'm all out of > > ideas at the moment. Either we have a frozen folio from somebody who > > doesn't hold the lock, or we have someone who's left a frozen folio in > > the page cache. I'm leaning towards that explanation at the moment, > > but I don't have a good suggestion for debugging. > > It's something else. I got gdb attached to qemu and single stepped > the looping lookup. The context I caught this time is truncate after > unlink: > > (gdb) bt > #0 find_get_entry (mark=<optimized out>, max=<optimized out>, xas=<optimized out>) at mm/filemap.c:2014 > #1 find_lock_entries (mapping=mapping@entry=0xffff8882445e2118, start=start@entry=25089, end=end@entry=18446744073709551614, > fbatch=fbatch@entry=0xffffc900082a7dd8, indices=indices@entry=0xffffc900082a7d60) at mm/filemap.c:2095 > #2 0xffffffff8128f024 in truncate_inode_pages_range (mapping=mapping@entry=0xffff8882445e2118, lstart=lstart@entry=0, lend=lend@entry=-1) > at mm/truncate.c:364 > #3 0xffffffff8128f452 in truncate_inode_pages (lstart=0, mapping=0xffff8882445e2118) at mm/truncate.c:452 > #4 0xffffffff8136335d in evict (inode=inode@entry=0xffff8882445e1f78) at fs/inode.c:666 > #5 0xffffffff813636cc in iput_final (inode=0xffff8882445e1f78) at fs/inode.c:1747 > #6 0xffffffff81355b8b in do_unlinkat (dfd=dfd@entry=10, name=0xffff88834170e000) at fs/namei.c:4326 > #7 0xffffffff81355cc3 in __do_sys_unlinkat (flag=<optimized out>, pathname=<optimized out>, dfd=<optimized out>) at fs/namei.c:4362 > #8 __se_sys_unlinkat (flag=<optimized out>, pathname=<optimized out>, dfd=<optimized out>) at fs/namei.c:4355 > #9 __x64_sys_unlinkat (regs=<optimized out>) at fs/namei.c:4355 > #10 0xffffffff81e92e35 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc900082a7f58) at arch/x86/entry/common.c:50 > #11 do_syscall_64 (regs=0xffffc900082a7f58, nr=<optimized out>) at arch/x86/entry/common.c:80 > #12 0xffffffff82000087 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120 > #13 0x0000000000000000 in ?? () > > The find_lock_entries() call is being asked to start at index > 25089, and we are spinning on a folio we find because > folio_try_get_rcu(folio) is failing - the folio ref count is zero. > > The xas state on lookup is: > > (gdb) p *xas > $6 = {xa = 0xffff8882445e2120, xa_index = 25092, xa_shift = 0 '\000', xa_sibs = 0 '\000', xa_offset = 4 '\004', xa_pad = 0 '\000', > xa_node = 0xffff888144c15918, xa_alloc = 0x0 <fixed_percpu_data>, xa_update = 0x0 <fixed_percpu_data>, xa_lru = 0x0 <fixed_percpu_data> > > indicating that we are trying to look up index 25092 (3 pages > further in than the start of the batch), and the folio that this > keeps returning is this: > > (gdb) p *folio > $7 = {{{flags = 24769796876795904, {lru = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008, > mlock_count = 1073569616}}, mapping = 0x0 <fixed_percpu_data>, index = 18688, private = 0x8 <fixed_percpu_data+8>, _mapcount = { > counter = -129}, _refcount = {counter = 0}, memcg_data = 0}, page = {flags = 24769796876795904, {{{lru = {next = 0xffffea0005690008, > prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008, mlock_count = 1073569616}, buddy_list = { > next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, pcp_list = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}}, > mapping = 0x0 <fixed_percpu_data>, index = 18688, private = 8}, {pp_magic = 18446719884544507912, pp = 0xffff88823ffd5f50, > _pp_mapping_pad = 0, dma_addr = 18688, {dma_addr_upper = 8, pp_frag_count = {counter = 8}}}, { > compound_head = 18446719884544507912, compound_dtor = 80 'P', compound_order = 95 '_', compound_mapcount = {counter = -30590}, > compound_pincount = {counter = 0}, compound_nr = 0}, {_compound_pad_1 = 18446719884544507912, > _compound_pad_2 = 18446612691733536592, deferred_list = {next = 0x0 <fixed_percpu_data>, > prev = 0x4900 <irq_stack_backing_store+10496>}}, {_pt_pad_1 = 18446719884544507912, pmd_huge_pte = 0xffff88823ffd5f50, > _pt_pad_2 = 0, {pt_mm = 0x4900 <irq_stack_backing_store+10496>, pt_frag_refcount = {counter = 18688}}, > ptl = 0x8 <fixed_percpu_data+8>}, {pgmap = 0xffffea0005690008, zone_device_data = 0xffff88823ffd5f50}, callback_head = { > next = 0xffffea0005690008, func = 0xffff88823ffd5f50}}, {_mapcount = {counter = -129}, page_type = 4294967167}, _refcount = { > counter = 0}, memcg_data = 0}}, _flags_1 = 24769796876795904, __head = 0, _folio_dtor = 3 '\003', _folio_order = 8 '\b', > _total_mapcount = {counter = -1}, _pincount = {counter = 0}, _folio_nr_pages = 0} > (gdb) > > The folio has a NULL mapping, and an index of 18688, which means > even if it was not a folio that has been invalidated or freed, the > index is way outside the range we are looking for. > > If I step it round the lookup loop, xas does not change, and the > same folio is returned every time through the loop. Perhaps > the mapping tree itself might be corrupt??? > > It's simple enough to stop the machine once it has become stuck to > observe the iteration and dump structures, just tell me what you > need to know from here... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com This bug emerges again and I would like to propose a reproduce sequence of this bug which has nothing to do with scheduler stuff ( this could be wrong and sorry for wasting your time if so) Thread_isolate: 1. alloc_contig_range->isolate_migratepages_block isolate a certain of pages to cc->migratepages (folio has refcount: 1 + n (alloc_pages, page_cache)) 2. alloc_contig_range->migrate_pages->folio_ref_freeze(folio, 1 + extra_pins) set the folio->refcnt to 0 3. alloc_contig_range->migrate_pages->xas_split split the folios to each slot as folio from slot[offset] to slot[offset + sibs] Thread_truncate: 4. enter the livelock by the chain below rcu_read_lock(); find_get_entry folio = xas_find if(!folio_try_get_rcu) xas_reset; rcu_read_unlock(); 4'. alloc_contig_range->migrate_pages->__split_huge_page which will modify folio's refcnt to 2 and breaks the livelock but is blocked by lruvec->lock's contention If the above call chain makes sense, could we solve this by below modification which has split_folio and __split_huge_page be atomic by taking lruvec->lock earlier than now. int split_huge_page_to_list_to_order(struct page *page, struct list_head *list, unsigned int new_order) { + lruvec = folio_lruvec_lock(folio); if (mapping) { int nr = folio_nr_pages(folio); xas_split(&xas, folio, folio_order(folio)); if (folio_test_pmd_mappable(folio) && new_order < HPAGE_PMD_ORDER) { if (folio_test_swapbacked(folio)) { __lruvec_stat_mod_folio(folio, NR_SHMEM_THPS, -nr); } else { __lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr); filemap_nr_thps_dec(mapping); } } } __split_huge_page(page, list, end, new_order); + folio_lruvec_unlock(folio);
diff --git a/mm/filemap.c b/mm/filemap.c index 15800334..25b0a2e 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2019,8 +2019,10 @@ static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max, if (!folio || xa_is_value(folio)) return folio; - if (!folio_try_get_rcu(folio)) + if (!folio_try_get_rcu(folio)) { + xas_advance(xas, folio->index + folio_nr_pages(folio) - 1); goto reset; + } if (unlikely(folio != xas_reload(xas))) { folio_put(folio);