[RFC] mm: move xa forward when run across zombie page

Message ID	1665725448-31439-1-git-send-email-zhaoyang.huang@unisoc.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: "zhaoyang.huang" <zhaoyang.huang@unisoc.com> To: Andrew Morton <akpm@linux-foundation.org>, Matthew Wilcox <willy@infradead.org>, Zhaoyang Huang <huangzhaoyang@gmail.com>, <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>, <ke.wang@unisoc.com>, <steve.kang@unisoc.com>, <baocong.liu@unisoc.com>, <linux-fsdevel@vger.kernel.org> Subject: [RFC PATCH] mm: move xa forward when run across zombie page Date: Fri, 14 Oct 2022 13:30:48 +0800 Message-ID: <1665725448-31439-1-git-send-email-zhaoyang.huang@unisoc.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	[RFC] mm: move xa forward when run across zombie page \| expand [RFC] mm: move xa forward when run across zombie page

zhaoyang.huang Oct. 14, 2022, 5:30 a.m. UTC

From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

Bellowing RCU stall is reported where kswapd traps in a live lock when shrink
superblock's inode list. The direct reason is zombie page keeps staying on the
xarray's slot and make the check and retry loop permanently. The root cause is unknown yet
and supposed could be an xa update without synchronize_rcu etc. I would like to
suggest skip this page to break the live lock as a workaround.

[167222.620296] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167285.640296] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167348.660296] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167411.680296] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167474.700296] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167537.720299] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167600.740296] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167663.760298] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167726.780298] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167789.800297] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[167726.780305] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-7): P155
[167726.780319] (detected by 3, t=17256977 jiffies, g=19883597, q=2397394)
[167726.780325] task:kswapd0         state:R  running task     stack:   24 pid:  155 ppid:     2 flags:0x00000008
[167789.800308] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-7): P155
[167789.800322] (detected by 3, t=17272732 jiffies, g=19883597, q=2397470)
[167789.800328] task:kswapd0         state:R  running task     stack:   24 pid:  155 ppid:     2 flags:0x00000008
[167789.800339] Call trace:
[167789.800342]  dump_backtrace.cfi_jt+0x0/0x8
[167789.800355]  show_stack+0x1c/0x2c
[167789.800363]  sched_show_task+0x1ac/0x27c
[167789.800370]  print_other_cpu_stall+0x314/0x4dc
[167789.800377]  check_cpu_stall+0x1c4/0x36c
[167789.800382]  rcu_sched_clock_irq+0xe8/0x388
[167789.800389]  update_process_times+0xa0/0xe0
[167789.800396]  tick_sched_timer+0x7c/0xd4
[167789.800404]  __run_hrtimer+0xd8/0x30c
[167789.800408]  hrtimer_interrupt+0x1e4/0x2d0
[167789.800414]  arch_timer_handler_phys+0x5c/0xa0
[167789.800423]  handle_percpu_devid_irq+0xbc/0x318
[167789.800430]  handle_domain_irq+0x7c/0xf0
[167789.800437]  gic_handle_irq+0x54/0x12c
[167789.800445]  call_on_irq_stack+0x40/0x70
[167789.800451]  do_interrupt_handler+0x44/0xa0
[167789.800457]  el1_interrupt+0x34/0x64
[167789.800464]  el1h_64_irq_handler+0x1c/0x2c
[167789.800470]  el1h_64_irq+0x7c/0x80
[167789.800474]  xas_find+0xb4/0x28c
[167789.800481]  find_get_entry+0x3c/0x178
[167789.800487]  find_lock_entries+0x98/0x2f8
[167789.800492]  __invalidate_mapping_pages.llvm.3657204692649320853+0xc8/0x224
[167789.800500]  invalidate_mapping_pages+0x18/0x28
[167789.800506]  inode_lru_isolate+0x140/0x2a4
[167789.800512]  __list_lru_walk_one+0xd8/0x204
[167789.800519]  list_lru_walk_one+0x64/0x90
[167789.800524]  prune_icache_sb+0x54/0xe0
[167789.800529]  super_cache_scan+0x160/0x1ec
[167789.800535]  do_shrink_slab+0x20c/0x5c0
[167789.800541]  shrink_slab+0xf0/0x20c
[167789.800546]  shrink_node_memcgs+0x98/0x320
[167789.800553]  shrink_node+0xe8/0x45c
[167789.800557]  balance_pgdat+0x464/0x814
[167789.800563]  kswapd+0xfc/0x23c
[167789.800567]  kthread+0x164/0x1c8
[167789.800573]  ret_from_fork+0x10/0x20

Signed-off-by: Baocong Liu <baocong.liu@unisoc.com>
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
 mm/filemap.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Matthew Wilcox Oct. 14, 2022, 12:11 p.m. UTC | #1

On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> Bellowing RCU stall is reported where kswapd traps in a live lock when shrink
> superblock's inode list. The direct reason is zombie page keeps staying on the
> xarray's slot and make the check and retry loop permanently. The root cause is unknown yet
> and supposed could be an xa update without synchronize_rcu etc. I would like to
> suggest skip this page to break the live lock as a workaround.

No, the underlying bug should be fixed.

>  	if (!folio || xa_is_value(folio))
>  		return folio;
>  
> -	if (!folio_try_get_rcu(folio))
> +	if (!folio_try_get_rcu(folio)) {
> +		xas_advance(xas, folio->index + folio_nr_pages(folio) - 1);
>  		goto reset;
> +	}

You can't do this anyway.  To call folio_nr_pages() and to look at
folio->index, you must have a refcount on the page, and this is the
path where we failed to get the refcount.

Zhaoyang Huang Oct. 17, 2022, 5:34 a.m. UTC | #2

On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink
> > superblock's inode list. The direct reason is zombie page keeps staying on the
> > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet
> > and supposed could be an xa update without synchronize_rcu etc. I would like to
> > suggest skip this page to break the live lock as a workaround.
>
> No, the underlying bug should be fixed.
>
> >       if (!folio || xa_is_value(folio))
> >               return folio;
> >
> > -     if (!folio_try_get_rcu(folio))
> > +     if (!folio_try_get_rcu(folio)) {
> > +             xas_advance(xas, folio->index + folio_nr_pages(folio) - 1);
> >               goto reset;
> > +     }
>
> You can't do this anyway.  To call folio_nr_pages() and to look at
> folio->index, you must have a refcount on the page, and this is the
> path where we failed to get the refcount.
OK, could I move the xas like below?

+     if (!folio_try_get_rcu(folio)) {
+             xas_next_offset(xas);
              goto reset;
+     }

Zhaoyang Huang Oct. 17, 2022, 6:58 a.m. UTC | #3

On Mon, Oct 17, 2022 at 1:34 PM Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > >
> > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink
> > > superblock's inode list. The direct reason is zombie page keeps staying on the
> > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet
> > > and supposed could be an xa update without synchronize_rcu etc. I would like to
> > > suggest skip this page to break the live lock as a workaround.
> >
> > No, the underlying bug should be fixed.
> >
> > >       if (!folio || xa_is_value(folio))
> > >               return folio;
> > >
> > > -     if (!folio_try_get_rcu(folio))
> > > +     if (!folio_try_get_rcu(folio)) {
> > > +             xas_advance(xas, folio->index + folio_nr_pages(folio) - 1);
> > >               goto reset;
> > > +     }
> >
> > You can't do this anyway.  To call folio_nr_pages() and to look at
> > folio->index, you must have a refcount on the page, and this is the
> > path where we failed to get the refcount.
> OK, could I move the xas like below?
>
> +     if (!folio_try_get_rcu(folio)) {
> +             xas_next_offset(xas);
>               goto reset;
> +     }
sorry, It seems the above suggestion doesn't break the loop. update it
by moving one step and goto retry instead of restarting from ROOT.
      if (!folio_try_get_rcu(folio)) {
              xas_next_offset(xas);
               goto retry;
      }

Matthew Wilcox Oct. 17, 2022, 3:55 p.m. UTC | #4

On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote:
> On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > >
> > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink
> > > superblock's inode list. The direct reason is zombie page keeps staying on the
> > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet
> > > and supposed could be an xa update without synchronize_rcu etc. I would like to
> > > suggest skip this page to break the live lock as a workaround.
> >
> > No, the underlying bug should be fixed.

    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> OK, could I move the xas like below?
> 
> +     if (!folio_try_get_rcu(folio)) {
> +             xas_next_offset(xas);
>               goto reset;
> +     }

Zhaoyang Huang Oct. 18, 2022, 2:52 a.m. UTC | #5

On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote:
> > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote:
> > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > >
> > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink
> > > > superblock's inode list. The direct reason is zombie page keeps staying on the
> > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet
> > > > and supposed could be an xa update without synchronize_rcu etc. I would like to
> > > > suggest skip this page to break the live lock as a workaround.
> > >
> > > No, the underlying bug should be fixed.
>
>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Understand. IMHO, find_get_entry actruely works as an open API dealing
with different kinds of address_spaces page cache, which requires high
robustness to deal with any corner cases. Take the current problem as
example, the inode with fault page(refcount=0) could remain on the
sb's list without live lock problem.
>
> > OK, could I move the xas like below?
> >
> > +     if (!folio_try_get_rcu(folio)) {
> > +             xas_next_offset(xas);
> >               goto reset;
> > +     }

Matthew Wilcox Oct. 18, 2022, 3:09 a.m. UTC | #6

On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote:
> On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote:
> > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote:
> > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > >
> > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink
> > > > > superblock's inode list. The direct reason is zombie page keeps staying on the
> > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet
> > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to
> > > > > suggest skip this page to break the live lock as a workaround.
> > > >
> > > > No, the underlying bug should be fixed.
> >
> >     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Understand. IMHO, find_get_entry actruely works as an open API dealing
> with different kinds of address_spaces page cache, which requires high
> robustness to deal with any corner cases. Take the current problem as
> example, the inode with fault page(refcount=0) could remain on the
> sb's list without live lock problem.

But it's a corner case that shouldn't happen!  What else is going on
at the time?  Can you reproduce this problem easily?  If so, how?

Dave Chinner Oct. 18, 2022, 10:30 p.m. UTC | #7

On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote:
> On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote:
> > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote:
> > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote:
> > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > >
> > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink
> > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the
> > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet
> > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to
> > > > > > suggest skip this page to break the live lock as a workaround.
> > > > >
> > > > > No, the underlying bug should be fixed.
> > >
> > >     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > Understand. IMHO, find_get_entry actruely works as an open API dealing
> > with different kinds of address_spaces page cache, which requires high
> > robustness to deal with any corner cases. Take the current problem as
> > example, the inode with fault page(refcount=0) could remain on the
> > sb's list without live lock problem.
> 
> But it's a corner case that shouldn't happen!  What else is going on
> at the time?  Can you reproduce this problem easily?  If so, how?

I've been seeing this livelock, too. The reproducer is,
unfortunately, something I can't share - it's a massive program that
triggers a data corruption I'm working on solving.

Now that I've
mostly fixed the data corruption, long duration test runs end up
livelocking in page cache lookup after several hours.

The test is effectively writing a 100MB file with multiple threads
doing reverse adjacent racing 1MB unaligned writes. Once the file is
written, it is then mmap()d and read back from the filesystem for
verification.

THis is then run with tens of processes concurrently, and then under
a massively confined memcg (e.g. 32 processes/files are run in a
memcg with only 200MB of memory allowed). This causes writeback,
readahead and memory reclaim to race with incoming mmap read faults
and writes.  The livelock occurs on file verification and it appears
to be an interaction with readahead thrashing.

On my test rig, the physical read to write ratio is at least 20:1 -
with 32 processes running, the 5s IO rates are:

Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s    MB_read    MB_wrtn    MB_dscd
dm-0          52187.20      3677.42      1345.92         0.00      18387       6729          0
dm-0          62865.60      5947.29         0.08         0.00      29736          0          0
dm-0          62972.80      5911.20         0.00         0.00      29556          0          0
dm-0          59803.00      5516.72       133.47         0.00      27583        667          0
dm-0          63068.20      5292.34       511.52         0.00      26461       2557          0
dm-0          56775.60      4184.52      1248.38         0.00      20922       6241          0
dm-0          63087.40      5901.26        43.77         0.00      29506        218          0
dm-0          62769.00      5833.97        60.54         0.00      29169        302          0
dm-0          64810.20      5636.13       305.63         0.00      28180       1528          0
dm-0          65222.60      5598.99       349.48         0.00      27994       1747          0
dm-0          62444.00      4887.05       926.67         0.00      24435       4633          0
dm-0          63812.00      5622.68       294.66         0.00      28113       1473          0
dm-0          63482.00      5728.43       195.74         0.00      28642        978          0

This is reading and writing the same amount of file data at the
application level, but once the data has been written and kicked out
of the page cache it seems to require an awful lot more read IO to
get it back to the application. i.e. this looks like mmap() is
readahead thrashing severely, and eventually it livelocks with this
sort of report:

[175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
[175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
[175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
[175901.995614] Call Trace:
[175901.996090]  <TASK>
[175901.996594]  ? __schedule+0x301/0xa30
[175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
[175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
[175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[175902.000714]  ? xas_start+0x53/0xc0
[175902.001484]  ? xas_load+0x24/0xa0
[175902.002208]  ? xas_load+0x5/0xa0
[175902.002878]  ? __filemap_get_folio+0x87/0x340
[175902.003823]  ? filemap_fault+0x139/0x8d0
[175902.004693]  ? __do_fault+0x31/0x1d0
[175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
[175902.006213]  ? handle_mm_fault+0xd0/0x2a0
[175902.006998]  ? exc_page_fault+0x1d9/0x810
[175902.007789]  ? asm_exc_page_fault+0x22/0x30
[175902.008613]  </TASK>

Given that filemap_fault on XFS is probably trying to map large
folios, I do wonder if this is a result of some kind of race with
teardown of a large folio...

There is a very simple corruption reproducer script that has been
written, but I haven't been using it. I don't know if long term
running of the script here:

https://lore.kernel.org/linux-xfs/d00aff43-2bdc-0724-1996-4e58e061ecfd@redhat.com/

will trigger the livelock as the verification step is
significantly different, but it will give you insight into the
setup of the environment that leads to the livelock. Maybe you could
replace the md5sum verification with a mmap read with xfs_io to
simulate the fault load that seems to lead to this issue...

Cheers,

Dave.

Dave Chinner Oct. 19, 2022, 1:16 a.m. UTC | #8

On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote:
> > On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote:
> > > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote:
> > > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote:
> > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > > >
> > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink
> > > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the
> > > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet
> > > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to
> > > > > > > suggest skip this page to break the live lock as a workaround.
> > > > > >
> > > > > > No, the underlying bug should be fixed.
> > > >
> > > >     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > Understand. IMHO, find_get_entry actruely works as an open API dealing
> > > with different kinds of address_spaces page cache, which requires high
> > > robustness to deal with any corner cases. Take the current problem as
> > > example, the inode with fault page(refcount=0) could remain on the
> > > sb's list without live lock problem.
> > 
> > But it's a corner case that shouldn't happen!  What else is going on
> > at the time?  Can you reproduce this problem easily?  If so, how?
> 
> I've been seeing this livelock, too. The reproducer is,
> unfortunately, something I can't share - it's a massive program that
> triggers a data corruption I'm working on solving.
> 
> Now that I've
> mostly fixed the data corruption, long duration test runs end up
> livelocking in page cache lookup after several hours.
> 
> The test is effectively writing a 100MB file with multiple threads
> doing reverse adjacent racing 1MB unaligned writes. Once the file is
> written, it is then mmap()d and read back from the filesystem for
> verification.
> 
> THis is then run with tens of processes concurrently, and then under
> a massively confined memcg (e.g. 32 processes/files are run in a
> memcg with only 200MB of memory allowed). This causes writeback,
> readahead and memory reclaim to race with incoming mmap read faults
> and writes.  The livelock occurs on file verification and it appears
> to be an interaction with readahead thrashing.
> 
> On my test rig, the physical read to write ratio is at least 20:1 -
> with 32 processes running, the 5s IO rates are:
> 
> Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s    MB_read    MB_wrtn    MB_dscd
> dm-0          52187.20      3677.42      1345.92         0.00      18387       6729          0
> dm-0          62865.60      5947.29         0.08         0.00      29736          0          0
> dm-0          62972.80      5911.20         0.00         0.00      29556          0          0
> dm-0          59803.00      5516.72       133.47         0.00      27583        667          0
> dm-0          63068.20      5292.34       511.52         0.00      26461       2557          0
> dm-0          56775.60      4184.52      1248.38         0.00      20922       6241          0
> dm-0          63087.40      5901.26        43.77         0.00      29506        218          0
> dm-0          62769.00      5833.97        60.54         0.00      29169        302          0
> dm-0          64810.20      5636.13       305.63         0.00      28180       1528          0
> dm-0          65222.60      5598.99       349.48         0.00      27994       1747          0
> dm-0          62444.00      4887.05       926.67         0.00      24435       4633          0
> dm-0          63812.00      5622.68       294.66         0.00      28113       1473          0
> dm-0          63482.00      5728.43       195.74         0.00      28642        978          0
> 
> This is reading and writing the same amount of file data at the
> application level, but once the data has been written and kicked out
> of the page cache it seems to require an awful lot more read IO to
> get it back to the application. i.e. this looks like mmap() is
> readahead thrashing severely, and eventually it livelocks with this
> sort of report:
> 
> [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> [175901.995614] Call Trace:
> [175901.996090]  <TASK>
> [175901.996594]  ? __schedule+0x301/0xa30
> [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [175902.000714]  ? xas_start+0x53/0xc0
> [175902.001484]  ? xas_load+0x24/0xa0
> [175902.002208]  ? xas_load+0x5/0xa0
> [175902.002878]  ? __filemap_get_folio+0x87/0x340
> [175902.003823]  ? filemap_fault+0x139/0x8d0
> [175902.004693]  ? __do_fault+0x31/0x1d0
> [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> [175902.006998]  ? exc_page_fault+0x1d9/0x810
> [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> [175902.008613]  </TASK>
> 
> Given that filemap_fault on XFS is probably trying to map large
> folios, I do wonder if this is a result of some kind of race with
> teardown of a large folio...
> 
> There is a very simple corruption reproducer script that has been
> written, but I haven't been using it. I don't know if long term
> running of the script here:
> 
> https://lore.kernel.org/linux-xfs/d00aff43-2bdc-0724-1996-4e58e061ecfd@redhat.com/
> 
> will trigger the livelock as the verification step is
> significantly different, but it will give you insight into the
> setup of the environment that leads to the livelock. Maybe you could
> replace the md5sum verification with a mmap read with xfs_io to
> simulate the fault load that seems to lead to this issue...

FWIW, just tested this on a current Linus kernel. While there is
massive read-ahead thrashing on v6.0, the thrashing is largely gone
in v6.1-rc1+ and the iteration rate of the test is much, much
better. The livelock remains, however.

Cheers,

Dave.

Dave Chinner Oct. 19, 2022, 4:47 a.m. UTC | #9

On Wed, Oct 19, 2022 at 12:16:36PM +1100, Dave Chinner wrote:
> On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote:
> > > On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote:
> > > > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote:
> > > > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote:
> > > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > > > >
> > > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink
> > > > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the
> > > > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet
> > > > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to
> > > > > > > > suggest skip this page to break the live lock as a workaround.
> > > > > > >
> > > > > > > No, the underlying bug should be fixed.
> > > > >
> > > > >     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > Understand. IMHO, find_get_entry actruely works as an open API dealing
> > > > with different kinds of address_spaces page cache, which requires high
> > > > robustness to deal with any corner cases. Take the current problem as
> > > > example, the inode with fault page(refcount=0) could remain on the
> > > > sb's list without live lock problem.
> > > 
> > > But it's a corner case that shouldn't happen!  What else is going on
> > > at the time?  Can you reproduce this problem easily?  If so, how?
> > 
> > I've been seeing this livelock, too. The reproducer is,
> > unfortunately, something I can't share - it's a massive program that
> > triggers a data corruption I'm working on solving.
> > 
> > Now that I've
> > mostly fixed the data corruption, long duration test runs end up
> > livelocking in page cache lookup after several hours.
> > 
> > The test is effectively writing a 100MB file with multiple threads
> > doing reverse adjacent racing 1MB unaligned writes. Once the file is
> > written, it is then mmap()d and read back from the filesystem for
> > verification.
> > 
> > THis is then run with tens of processes concurrently, and then under
> > a massively confined memcg (e.g. 32 processes/files are run in a
> > memcg with only 200MB of memory allowed). This causes writeback,
> > readahead and memory reclaim to race with incoming mmap read faults
> > and writes.  The livelock occurs on file verification and it appears
> > to be an interaction with readahead thrashing.
> > 
> > On my test rig, the physical read to write ratio is at least 20:1 -
> > with 32 processes running, the 5s IO rates are:
> > 
> > Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s    MB_read    MB_wrtn    MB_dscd
> > dm-0          52187.20      3677.42      1345.92         0.00      18387       6729          0
> > dm-0          62865.60      5947.29         0.08         0.00      29736          0          0
> > dm-0          62972.80      5911.20         0.00         0.00      29556          0          0
> > dm-0          59803.00      5516.72       133.47         0.00      27583        667          0
> > dm-0          63068.20      5292.34       511.52         0.00      26461       2557          0
> > dm-0          56775.60      4184.52      1248.38         0.00      20922       6241          0
> > dm-0          63087.40      5901.26        43.77         0.00      29506        218          0
> > dm-0          62769.00      5833.97        60.54         0.00      29169        302          0
> > dm-0          64810.20      5636.13       305.63         0.00      28180       1528          0
> > dm-0          65222.60      5598.99       349.48         0.00      27994       1747          0
> > dm-0          62444.00      4887.05       926.67         0.00      24435       4633          0
> > dm-0          63812.00      5622.68       294.66         0.00      28113       1473          0
> > dm-0          63482.00      5728.43       195.74         0.00      28642        978          0
> > 
> > This is reading and writing the same amount of file data at the
> > application level, but once the data has been written and kicked out
> > of the page cache it seems to require an awful lot more read IO to
> > get it back to the application. i.e. this looks like mmap() is
> > readahead thrashing severely, and eventually it livelocks with this
> > sort of report:
> > 
> > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > [175901.995614] Call Trace:
> > [175901.996090]  <TASK>
> > [175901.996594]  ? __schedule+0x301/0xa30
> > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [175902.000714]  ? xas_start+0x53/0xc0
> > [175902.001484]  ? xas_load+0x24/0xa0
> > [175902.002208]  ? xas_load+0x5/0xa0
> > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > [175902.004693]  ? __do_fault+0x31/0x1d0
> > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > [175902.008613]  </TASK>
> > 
> > Given that filemap_fault on XFS is probably trying to map large
> > folios, I do wonder if this is a result of some kind of race with
> > teardown of a large folio...
> > 
> > There is a very simple corruption reproducer script that has been
> > written, but I haven't been using it. I don't know if long term
> > running of the script here:
> > 
> > https://lore.kernel.org/linux-xfs/d00aff43-2bdc-0724-1996-4e58e061ecfd@redhat.com/
> > 
> > will trigger the livelock as the verification step is
> > significantly different, but it will give you insight into the
> > setup of the environment that leads to the livelock. Maybe you could
> > replace the md5sum verification with a mmap read with xfs_io to
> > simulate the fault load that seems to lead to this issue...
> 
> FWIW, just tested this on a current Linus kernel. While there is
> massive read-ahead thrashing on v6.0, the thrashing is largely gone
> in v6.1-rc1+ and the iteration rate of the test is much, much
> better. The livelock remains, however.

Evidence is starting to point to an interaction with the multi-page
folio support in the page cache.

I removed the mapping_set_large_folios() calls in the XFS inode
instantiation and the test code has now run over 55,000 iterations
without failing.  The most iterations I'd seen with large folios
enabled was about 7,000 - typically it would fail within 2-3,000
iterations.

Cheers,

Dave.

Zhaoyang Huang Oct. 19, 2022, 5:48 a.m. UTC | #10

On Wed, Oct 19, 2022 at 12:47 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Oct 19, 2022 at 12:16:36PM +1100, Dave Chinner wrote:
> > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote:
> > > > On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote:
> > > > > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote:
> > > > > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > >
> > > > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote:
> > > > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > > > > >
> > > > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink
> > > > > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the
> > > > > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet
> > > > > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to
> > > > > > > > > suggest skip this page to break the live lock as a workaround.
> > > > > > > >
> > > > > > > > No, the underlying bug should be fixed.
> > > > > >
> > > > > >     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > Understand. IMHO, find_get_entry actruely works as an open API dealing
> > > > > with different kinds of address_spaces page cache, which requires high
> > > > > robustness to deal with any corner cases. Take the current problem as
> > > > > example, the inode with fault page(refcount=0) could remain on the
> > > > > sb's list without live lock problem.
> > > >
> > > > But it's a corner case that shouldn't happen!  What else is going on
> > > > at the time?  Can you reproduce this problem easily?  If so, how?
> > >
> > > I've been seeing this livelock, too. The reproducer is,
> > > unfortunately, something I can't share - it's a massive program that
> > > triggers a data corruption I'm working on solving.
> > >
> > > Now that I've
> > > mostly fixed the data corruption, long duration test runs end up
> > > livelocking in page cache lookup after several hours.
> > >
> > > The test is effectively writing a 100MB file with multiple threads
> > > doing reverse adjacent racing 1MB unaligned writes. Once the file is
> > > written, it is then mmap()d and read back from the filesystem for
> > > verification.
> > >
> > > THis is then run with tens of processes concurrently, and then under
> > > a massively confined memcg (e.g. 32 processes/files are run in a
> > > memcg with only 200MB of memory allowed). This causes writeback,
> > > readahead and memory reclaim to race with incoming mmap read faults
> > > and writes.  The livelock occurs on file verification and it appears
> > > to be an interaction with readahead thrashing.
> > >
> > > On my test rig, the physical read to write ratio is at least 20:1 -
> > > with 32 processes running, the 5s IO rates are:
> > >
> > > Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s    MB_read    MB_wrtn    MB_dscd
> > > dm-0          52187.20      3677.42      1345.92         0.00      18387       6729          0
> > > dm-0          62865.60      5947.29         0.08         0.00      29736          0          0
> > > dm-0          62972.80      5911.20         0.00         0.00      29556          0          0
> > > dm-0          59803.00      5516.72       133.47         0.00      27583        667          0
> > > dm-0          63068.20      5292.34       511.52         0.00      26461       2557          0
> > > dm-0          56775.60      4184.52      1248.38         0.00      20922       6241          0
> > > dm-0          63087.40      5901.26        43.77         0.00      29506        218          0
> > > dm-0          62769.00      5833.97        60.54         0.00      29169        302          0
> > > dm-0          64810.20      5636.13       305.63         0.00      28180       1528          0
> > > dm-0          65222.60      5598.99       349.48         0.00      27994       1747          0
> > > dm-0          62444.00      4887.05       926.67         0.00      24435       4633          0
> > > dm-0          63812.00      5622.68       294.66         0.00      28113       1473          0
> > > dm-0          63482.00      5728.43       195.74         0.00      28642        978          0
> > >
> > > This is reading and writing the same amount of file data at the
> > > application level, but once the data has been written and kicked out
> > > of the page cache it seems to require an awful lot more read IO to
> > > get it back to the application. i.e. this looks like mmap() is
> > > readahead thrashing severely, and eventually it livelocks with this
> > > sort of report:
> > >
> > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > [175901.995614] Call Trace:
> > > [175901.996090]  <TASK>
> > > [175901.996594]  ? __schedule+0x301/0xa30
> > > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > [175902.000714]  ? xas_start+0x53/0xc0
> > > [175902.001484]  ? xas_load+0x24/0xa0
> > > [175902.002208]  ? xas_load+0x5/0xa0
> > > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > > [175902.004693]  ? __do_fault+0x31/0x1d0
> > > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > > [175902.008613]  </TASK>
> > >
> > > Given that filemap_fault on XFS is probably trying to map large
> > > folios, I do wonder if this is a result of some kind of race with
> > > teardown of a large folio...
> > >
> > > There is a very simple corruption reproducer script that has been
> > > written, but I haven't been using it. I don't know if long term
> > > running of the script here:
> > >
> > > https://lore.kernel.org/linux-xfs/d00aff43-2bdc-0724-1996-4e58e061ecfd@redhat.com/
> > >
> > > will trigger the livelock as the verification step is
> > > significantly different, but it will give you insight into the
> > > setup of the environment that leads to the livelock. Maybe you could
> > > replace the md5sum verification with a mmap read with xfs_io to
> > > simulate the fault load that seems to lead to this issue...
> >
> > FWIW, just tested this on a current Linus kernel. While there is
> > massive read-ahead thrashing on v6.0, the thrashing is largely gone
> > in v6.1-rc1+ and the iteration rate of the test is much, much
> > better. The livelock remains, however.
>
> Evidence is starting to point to an interaction with the multi-page
> folio support in the page cache.
>
> I removed the mapping_set_large_folios() calls in the XFS inode
> instantiation and the test code has now run over 55,000 iterations
> without failing.  The most iterations I'd seen with large folios
> enabled was about 7,000 - typically it would fail within 2-3,000
> iterations.
hint from my side. The original problem I raised is under v5.15 where
there is no folio yet.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

Brian Foster Oct. 19, 2022, 11:49 a.m. UTC | #11

On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote:
> > On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote:
> > > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote:
> > > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote:
> > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > > >
> > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink
> > > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the
> > > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet
> > > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to
> > > > > > > suggest skip this page to break the live lock as a workaround.
> > > > > >
> > > > > > No, the underlying bug should be fixed.
> > > >
> > > >     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > Understand. IMHO, find_get_entry actruely works as an open API dealing
> > > with different kinds of address_spaces page cache, which requires high
> > > robustness to deal with any corner cases. Take the current problem as
> > > example, the inode with fault page(refcount=0) could remain on the
> > > sb's list without live lock problem.
> > 
> > But it's a corner case that shouldn't happen!  What else is going on
> > at the time?  Can you reproduce this problem easily?  If so, how?
> 
> I've been seeing this livelock, too. The reproducer is,
> unfortunately, something I can't share - it's a massive program that
> triggers a data corruption I'm working on solving.
> 
> Now that I've
> mostly fixed the data corruption, long duration test runs end up
> livelocking in page cache lookup after several hours.
> 
> The test is effectively writing a 100MB file with multiple threads
> doing reverse adjacent racing 1MB unaligned writes. Once the file is
> written, it is then mmap()d and read back from the filesystem for
> verification.
> 
> THis is then run with tens of processes concurrently, and then under
> a massively confined memcg (e.g. 32 processes/files are run in a
> memcg with only 200MB of memory allowed). This causes writeback,
> readahead and memory reclaim to race with incoming mmap read faults
> and writes.  The livelock occurs on file verification and it appears
> to be an interaction with readahead thrashing.
> 
> On my test rig, the physical read to write ratio is at least 20:1 -
> with 32 processes running, the 5s IO rates are:
> 
> Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s    MB_read    MB_wrtn    MB_dscd
> dm-0          52187.20      3677.42      1345.92         0.00      18387       6729          0
> dm-0          62865.60      5947.29         0.08         0.00      29736          0          0
> dm-0          62972.80      5911.20         0.00         0.00      29556          0          0
> dm-0          59803.00      5516.72       133.47         0.00      27583        667          0
> dm-0          63068.20      5292.34       511.52         0.00      26461       2557          0
> dm-0          56775.60      4184.52      1248.38         0.00      20922       6241          0
> dm-0          63087.40      5901.26        43.77         0.00      29506        218          0
> dm-0          62769.00      5833.97        60.54         0.00      29169        302          0
> dm-0          64810.20      5636.13       305.63         0.00      28180       1528          0
> dm-0          65222.60      5598.99       349.48         0.00      27994       1747          0
> dm-0          62444.00      4887.05       926.67         0.00      24435       4633          0
> dm-0          63812.00      5622.68       294.66         0.00      28113       1473          0
> dm-0          63482.00      5728.43       195.74         0.00      28642        978          0
> 
> This is reading and writing the same amount of file data at the
> application level, but once the data has been written and kicked out
> of the page cache it seems to require an awful lot more read IO to
> get it back to the application. i.e. this looks like mmap() is
> readahead thrashing severely, and eventually it livelocks with this
> sort of report:
> 
> [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> [175901.995614] Call Trace:
> [175901.996090]  <TASK>
> [175901.996594]  ? __schedule+0x301/0xa30
> [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [175902.000714]  ? xas_start+0x53/0xc0
> [175902.001484]  ? xas_load+0x24/0xa0
> [175902.002208]  ? xas_load+0x5/0xa0
> [175902.002878]  ? __filemap_get_folio+0x87/0x340
> [175902.003823]  ? filemap_fault+0x139/0x8d0
> [175902.004693]  ? __do_fault+0x31/0x1d0
> [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> [175902.006998]  ? exc_page_fault+0x1d9/0x810
> [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> [175902.008613]  </TASK>
> 
> Given that filemap_fault on XFS is probably trying to map large
> folios, I do wonder if this is a result of some kind of race with
> teardown of a large folio...
> 

I somewhat recently tracked down a hugepage/swap problem that could
manifest as a softlockup in the folio lookup path (due to indefinite
folio_try_get_rcu() failure):

https://lore.kernel.org/linux-mm/20220906190602.1626037-1-bfoster@redhat.com/

It could easily be something different leading to the same side effect,
particularly since I believe the issue I saw was introduced in v5.19,
but might be worth a test if you have a reliable reproducer.

Brian

> There is a very simple corruption reproducer script that has been
> written, but I haven't been using it. I don't know if long term
> running of the script here:
> 
> https://lore.kernel.org/linux-xfs/d00aff43-2bdc-0724-1996-4e58e061ecfd@redhat.com/
> 
> will trigger the livelock as the verification step is
> significantly different, but it will give you insight into the
> setup of the environment that leads to the livelock. Maybe you could
> replace the md5sum verification with a mmap read with xfs_io to
> simulate the fault load that seems to lead to this issue...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

Matthew Wilcox Oct. 19, 2022, 1:06 p.m. UTC | #12

On Wed, Oct 19, 2022 at 01:48:37PM +0800, Zhaoyang Huang wrote:
> On Wed, Oct 19, 2022 at 12:47 PM Dave Chinner <david@fromorbit.com> wrote:
> > I removed the mapping_set_large_folios() calls in the XFS inode
> > instantiation and the test code has now run over 55,000 iterations
> > without failing.  The most iterations I'd seen with large folios
> > enabled was about 7,000 - typically it would fail within 2-3,000
> > iterations.
> hint from my side. The original problem I raised is under v5.15 where
> there is no folio yet.

But 5.15 does use 2MB pages in shmem.  You haven't really provided
any information, so I don't know whether the inode that you're having
problems with is a shmem inode.

Matthew Wilcox Oct. 19, 2022, 3:23 p.m. UTC | #13

On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> This is reading and writing the same amount of file data at the
> application level, but once the data has been written and kicked out
> of the page cache it seems to require an awful lot more read IO to
> get it back to the application. i.e. this looks like mmap() is
> readahead thrashing severely, and eventually it livelocks with this
> sort of report:
> 
> [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> [175901.995614] Call Trace:
> [175901.996090]  <TASK>
> [175901.996594]  ? __schedule+0x301/0xa30
> [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> [175902.000714]  ? xas_start+0x53/0xc0
> [175902.001484]  ? xas_load+0x24/0xa0
> [175902.002208]  ? xas_load+0x5/0xa0
> [175902.002878]  ? __filemap_get_folio+0x87/0x340
> [175902.003823]  ? filemap_fault+0x139/0x8d0
> [175902.004693]  ? __do_fault+0x31/0x1d0
> [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> [175902.006998]  ? exc_page_fault+0x1d9/0x810
> [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> [175902.008613]  </TASK>
> 
> Given that filemap_fault on XFS is probably trying to map large
> folios, I do wonder if this is a result of some kind of race with
> teardown of a large folio...

It doesn't matter whether we're trying to map a large folio; it
matters whether a large folio was previously created in the cache.
Through the magic of readahead, it may well have been.  I suspect
it's not teardown of a large folio, but splitting.  Removing a
page from the page cache stores to the pointer in the XArray
first (either NULL or a shadow entry), then decrements the refcount.

We must be observing a frozen folio.  There are a number of places
in the MM which freeze a folio, but the obvious one is splitting.
That looks like this:

        local_irq_disable();
        if (mapping) {
                xas_lock(&xas);
(...)
        if (folio_ref_freeze(folio, 1 + extra_pins)) {

So one way to solve this might be to try to take the xa_lock on
failure to get the refcount.  Otherwise a high-priority task
might spin forever without a low-priority task getting the chance
to finish the work being done while the folio is frozen.

ie this:

diff --git a/mm/filemap.c b/mm/filemap.c
index 08341616ae7a..ca0eed80580f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1860,8 +1860,13 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
 	if (!folio || xa_is_value(folio))
 		goto out;
 
-	if (!folio_try_get_rcu(folio))
+	if (!folio_try_get_rcu(folio)) {
+		unsigned long flags;
+
+		xas_lock_irqsave(&xas, flags);
+		xas_unlock_irqrestore(&xas, flags);
 		goto repeat;
+	}
 
 	if (unlikely(folio != xas_reload(&xas))) {
 		folio_put(folio);
@@ -2014,8 +2019,13 @@ static inline struct folio *find_get_entry(struct xa_state *xas, pgoff_t max,
 	if (!folio || xa_is_value(folio))
 		return folio;
 
-	if (!folio_try_get_rcu(folio))
+	if (!folio_try_get_rcu(folio)) {
+		unsigned long flags;
+
+		xas_lock_irqsave(xas, flags);
+		xas_unlock_irqrestore(xas, flags);
 		goto reset;
+	}
 
 	if (unlikely(folio != xas_reload(xas))) {
 		folio_put(folio);
@@ -2224,8 +2234,13 @@ unsigned filemap_get_folios_contig(struct address_space *mapping,
 		if (xa_is_value(folio))
 			goto update_start;
 
-		if (!folio_try_get_rcu(folio))
+		if (!folio_try_get_rcu(folio)) {
+			unsigned long flags;
+
+			xas_lock_irqsave(&xas, flags);
+			xas_unlock_irqrestore(&xas, flags);
 			goto retry;
+		}
 
 		if (unlikely(folio != xas_reload(&xas)))
 			goto put_folio;
@@ -2365,8 +2380,13 @@ static void filemap_get_read_batch(struct address_space *mapping,
 			break;
 		if (xa_is_sibling(folio))
 			break;
-		if (!folio_try_get_rcu(folio))
+		if (!folio_try_get_rcu(folio)) {
+			unsigned long flags;
+
+			xas_lock_irqsave(&xas, flags);
+			xas_unlock_irqrestore(&xas, flags);
 			goto retry;
+		}
 
 		if (unlikely(folio != xas_reload(&xas)))
 			goto put_folio;

I'm kicking off an xfstests run to see if that causes any new problems.

> There is a very simple corruption reproducer script that has been
> written, but I haven't been using it. I don't know if long term
> running of the script here:
> 
> https://lore.kernel.org/linux-xfs/d00aff43-2bdc-0724-1996-4e58e061ecfd@redhat.com/
> 
> will trigger the livelock as the verification step is
> significantly different, but it will give you insight into the
> setup of the environment that leads to the livelock. Maybe you could
> replace the md5sum verification with a mmap read with xfs_io to
> simulate the fault load that seems to lead to this issue...

If the problem needs scheduling parameters to hit, then that script
isn't going to reproduce the problems for me.  But I'll give it a try
anyway.

Dave Chinner Oct. 19, 2022, 10:04 p.m. UTC | #14

On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > This is reading and writing the same amount of file data at the
> > application level, but once the data has been written and kicked out
> > of the page cache it seems to require an awful lot more read IO to
> > get it back to the application. i.e. this looks like mmap() is
> > readahead thrashing severely, and eventually it livelocks with this
> > sort of report:
> > 
> > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > [175901.995614] Call Trace:
> > [175901.996090]  <TASK>
> > [175901.996594]  ? __schedule+0x301/0xa30
> > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [175902.000714]  ? xas_start+0x53/0xc0
> > [175902.001484]  ? xas_load+0x24/0xa0
> > [175902.002208]  ? xas_load+0x5/0xa0
> > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > [175902.004693]  ? __do_fault+0x31/0x1d0
> > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > [175902.008613]  </TASK>
> > 
> > Given that filemap_fault on XFS is probably trying to map large
> > folios, I do wonder if this is a result of some kind of race with
> > teardown of a large folio...
> 
> It doesn't matter whether we're trying to map a large folio; it
> matters whether a large folio was previously created in the cache.
> Through the magic of readahead, it may well have been.  I suspect
> it's not teardown of a large folio, but splitting.  Removing a
> page from the page cache stores to the pointer in the XArray
> first (either NULL or a shadow entry), then decrements the refcount.
> 
> We must be observing a frozen folio.  There are a number of places
> in the MM which freeze a folio, but the obvious one is splitting.
> That looks like this:
> 
>         local_irq_disable();
>         if (mapping) {
>                 xas_lock(&xas);
> (...)
>         if (folio_ref_freeze(folio, 1 + extra_pins)) {

But the lookup is not doing anything to prevent the split on the
frozen page from making progress, right? It's not holding any folio
references, and it's not holding the mapping tree lock, either. So
how does the lookup in progress prevent the page split from making
progress?


> So one way to solve this might be to try to take the xa_lock on
> failure to get the refcount.  Otherwise a high-priority task
> might spin forever without a low-priority task getting the chance
> to finish the work being done while the folio is frozen.

IIUC, then you are saying that there is a scheduling priority
inversion because the lookup failure looping path doesn't yeild the
CPU?

If so, how does taking the mapping tree spin lock on failure cause
the looping task to yield the CPU and hence allow the folio split to
make progress?

Also, AFAICT, the page split has disabled local interrupts, so it
should effectively be running with preemption disabled as it has
turned off the mechanism the scheduler can use to preempt it. The
page split can't sleep, either, because it holds the mapping tree
lock. Hence I can't see how a split-in-progress can be preempted in
teh first place to cause a priority inversion livelock like this...

> ie this:
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 08341616ae7a..ca0eed80580f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1860,8 +1860,13 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
>  	if (!folio || xa_is_value(folio))
>  		goto out;
>  
> -	if (!folio_try_get_rcu(folio))
> +	if (!folio_try_get_rcu(folio)) {
> +		unsigned long flags;
> +
> +		xas_lock_irqsave(&xas, flags);
> +		xas_unlock_irqrestore(&xas, flags);
>  		goto repeat;
> +	}

I would have thought:

	if (!folio_try_get_rcu(folio)) {
		rcu_read_unlock();
		cond_resched();
		rcu_read_lock();
		goto repeat;
	}

Would be the right way to yeild the CPU to avoid priority
inversion related livelocks here...

Cheers,

Dave.

Dave Chinner Oct. 19, 2022, 10:46 p.m. UTC | #15

On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote:
> On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > This is reading and writing the same amount of file data at the
> > > application level, but once the data has been written and kicked out
> > > of the page cache it seems to require an awful lot more read IO to
> > > get it back to the application. i.e. this looks like mmap() is
> > > readahead thrashing severely, and eventually it livelocks with this
> > > sort of report:
> > > 
> > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > [175901.995614] Call Trace:
> > > [175901.996090]  <TASK>
> > > [175901.996594]  ? __schedule+0x301/0xa30
> > > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > [175902.000714]  ? xas_start+0x53/0xc0
> > > [175902.001484]  ? xas_load+0x24/0xa0
> > > [175902.002208]  ? xas_load+0x5/0xa0
> > > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > > [175902.004693]  ? __do_fault+0x31/0x1d0
> > > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > > [175902.008613]  </TASK>
> > > 
> > > Given that filemap_fault on XFS is probably trying to map large
> > > folios, I do wonder if this is a result of some kind of race with
> > > teardown of a large folio...
> > 
> > It doesn't matter whether we're trying to map a large folio; it
> > matters whether a large folio was previously created in the cache.
> > Through the magic of readahead, it may well have been.  I suspect
> > it's not teardown of a large folio, but splitting.  Removing a
> > page from the page cache stores to the pointer in the XArray
> > first (either NULL or a shadow entry), then decrements the refcount.
> > 
> > We must be observing a frozen folio.  There are a number of places
> > in the MM which freeze a folio, but the obvious one is splitting.
> > That looks like this:
> > 
> >         local_irq_disable();
> >         if (mapping) {
> >                 xas_lock(&xas);
> > (...)
> >         if (folio_ref_freeze(folio, 1 + extra_pins)) {
> 
> But the lookup is not doing anything to prevent the split on the
> frozen page from making progress, right? It's not holding any folio
> references, and it's not holding the mapping tree lock, either. So
> how does the lookup in progress prevent the page split from making
> progress?
> 
> 
> > So one way to solve this might be to try to take the xa_lock on
> > failure to get the refcount.  Otherwise a high-priority task
> > might spin forever without a low-priority task getting the chance
> > to finish the work being done while the folio is frozen.
> 
> IIUC, then you are saying that there is a scheduling priority
> inversion because the lookup failure looping path doesn't yeild the
> CPU?
> 
> If so, how does taking the mapping tree spin lock on failure cause
> the looping task to yield the CPU and hence allow the folio split to
> make progress?
> 
> Also, AFAICT, the page split has disabled local interrupts, so it
> should effectively be running with preemption disabled as it has
> turned off the mechanism the scheduler can use to preempt it. The
> page split can't sleep, either, because it holds the mapping tree
> lock. Hence I can't see how a split-in-progress can be preempted in
> teh first place to cause a priority inversion livelock like this...
> 
> > ie this:
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 08341616ae7a..ca0eed80580f 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1860,8 +1860,13 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
> >  	if (!folio || xa_is_value(folio))
> >  		goto out;
> >  
> > -	if (!folio_try_get_rcu(folio))
> > +	if (!folio_try_get_rcu(folio)) {
> > +		unsigned long flags;
> > +
> > +		xas_lock_irqsave(&xas, flags);
> > +		xas_unlock_irqrestore(&xas, flags);
> >  		goto repeat;
> > +	}

As I suspected, this change did not prevent the livelock. It
reproduced after just 650 test iterations (a few minutes) with this
change in place.

> I would have thought:
> 
> 	if (!folio_try_get_rcu(folio)) {
> 		rcu_read_unlock();
> 		cond_resched();
> 		rcu_read_lock();
> 		goto repeat;
> 	}
> 
> Would be the right way to yeild the CPU to avoid priority
> inversion related livelocks here...

I'm now trying this just to provide a data point that actually
yeilding the CPU avoids the downstream effects of the livelock (i.e.
RCU grace period never expires, system eventually dies...). Maybe it
will avoid the livelock altogether, but I still don't understand how
huge page splitting would be getting preempted in the first place...

Cheers,

Dave.

Dave Chinner Oct. 19, 2022, 11:42 p.m. UTC | #16

On Thu, Oct 20, 2022 at 09:46:16AM +1100, Dave Chinner wrote:
> On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote:
> > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > > This is reading and writing the same amount of file data at the
> > > > application level, but once the data has been written and kicked out
> > > > of the page cache it seems to require an awful lot more read IO to
> > > > get it back to the application. i.e. this looks like mmap() is
> > > > readahead thrashing severely, and eventually it livelocks with this
> > > > sort of report:
> > > > 
> > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > > [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > > [175901.995614] Call Trace:
> > > > [175901.996090]  <TASK>
> > > > [175901.996594]  ? __schedule+0x301/0xa30
> > > > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > [175902.000714]  ? xas_start+0x53/0xc0
> > > > [175902.001484]  ? xas_load+0x24/0xa0
> > > > [175902.002208]  ? xas_load+0x5/0xa0
> > > > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > > > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > > > [175902.004693]  ? __do_fault+0x31/0x1d0
> > > > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > > > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > > > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > > > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > > > [175902.008613]  </TASK>
> > > > 
> > > > Given that filemap_fault on XFS is probably trying to map large
> > > > folios, I do wonder if this is a result of some kind of race with
> > > > teardown of a large folio...
> > > 
> > > It doesn't matter whether we're trying to map a large folio; it
> > > matters whether a large folio was previously created in the cache.
> > > Through the magic of readahead, it may well have been.  I suspect
> > > it's not teardown of a large folio, but splitting.  Removing a
> > > page from the page cache stores to the pointer in the XArray
> > > first (either NULL or a shadow entry), then decrements the refcount.
> > > 
> > > We must be observing a frozen folio.  There are a number of places
> > > in the MM which freeze a folio, but the obvious one is splitting.
> > > That looks like this:
> > > 
> > >         local_irq_disable();
> > >         if (mapping) {
> > >                 xas_lock(&xas);
> > > (...)
> > >         if (folio_ref_freeze(folio, 1 + extra_pins)) {
> > 
> > But the lookup is not doing anything to prevent the split on the
> > frozen page from making progress, right? It's not holding any folio
> > references, and it's not holding the mapping tree lock, either. So
> > how does the lookup in progress prevent the page split from making
> > progress?
> > 
> > 
> > > So one way to solve this might be to try to take the xa_lock on
> > > failure to get the refcount.  Otherwise a high-priority task
> > > might spin forever without a low-priority task getting the chance
> > > to finish the work being done while the folio is frozen.
> > 
> > IIUC, then you are saying that there is a scheduling priority
> > inversion because the lookup failure looping path doesn't yeild the
> > CPU?
> > 
> > If so, how does taking the mapping tree spin lock on failure cause
> > the looping task to yield the CPU and hence allow the folio split to
> > make progress?
> > 
> > Also, AFAICT, the page split has disabled local interrupts, so it
> > should effectively be running with preemption disabled as it has
> > turned off the mechanism the scheduler can use to preempt it. The
> > page split can't sleep, either, because it holds the mapping tree
> > lock. Hence I can't see how a split-in-progress can be preempted in
> > teh first place to cause a priority inversion livelock like this...
> > 
> > > ie this:
> > > 
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 08341616ae7a..ca0eed80580f 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -1860,8 +1860,13 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
> > >  	if (!folio || xa_is_value(folio))
> > >  		goto out;
> > >  
> > > -	if (!folio_try_get_rcu(folio))
> > > +	if (!folio_try_get_rcu(folio)) {
> > > +		unsigned long flags;
> > > +
> > > +		xas_lock_irqsave(&xas, flags);
> > > +		xas_unlock_irqrestore(&xas, flags);
> > >  		goto repeat;
> > > +	}
> 
> As I suspected, this change did not prevent the livelock. It
> reproduced after just 650 test iterations (a few minutes) with this
> change in place.
> 
> > I would have thought:
> > 
> > 	if (!folio_try_get_rcu(folio)) {
> > 		rcu_read_unlock();
> > 		cond_resched();
> > 		rcu_read_lock();
> > 		goto repeat;
> > 	}
> > 
> > Would be the right way to yeild the CPU to avoid priority
> > inversion related livelocks here...
> 
> I'm now trying this just to provide a data point that actually
> yeilding the CPU avoids the downstream effects of the livelock (i.e.

This causes a different problems to occur. First, a task hangs
evicting the page cache on unlink like so:

[ 1352.990246] INFO: task test_write:3572 blocked for more than 122 seconds.
[ 1352.991418]       Not tainted 6.1.0-rc1-dgc+ #1618
[ 1352.992378] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1352.994702] task:test_write      state:D stack:13088 pid:3572  ppid:3567   flags:0x00004002
[ 1352.996257] Call Trace:
[ 1352.996905]  <TASK>
[ 1352.997452]  __schedule+0x2f9/0xa20
[ 1352.998380]  schedule+0x5a/0xc0
[ 1352.999011]  io_schedule+0x42/0x70
[ 1352.999565]  folio_wait_bit_common+0x137/0x370
[ 1353.000553]  ? dio_warn_stale_pagecache.part.0+0x50/0x50
[ 1353.001425]  truncate_inode_pages_range+0x41a/0x470
[ 1353.002443]  evict+0x1ad/0x1c0
[ 1353.003199]  do_unlinkat+0x1db/0x2e0
[ 1353.003907]  __x64_sys_unlinkat+0x33/0x60
[ 1353.004569]  do_syscall_64+0x35/0x80
[ 1353.005354]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1353.006315] RIP: 0033:0x7f03d6a42157
[ 1353.007088] RSP: 002b:00007ffde8d98388 EFLAGS: 00000206 ORIG_RAX: 0000000000000107
[ 1353.008367] RAX: ffffffffffffffda RBX: 0000000000000df4 RCX: 00007f03d6a42157
[ 1353.009748] RDX: 0000000000000000 RSI: 000055f8660415a0 RDI: 0000000000000007
[ 1353.011115] RBP: 0000000000000003 R08: 0000000000000000 R09: 00007ffde8d96105
[ 1353.012422] R10: 0000000000000158 R11: 0000000000000206 R12: 0000000000000158
[ 1353.013556] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1353.014958]  </TASK>

And soon after:

[ 1607.746983] ------------[ cut here ]------------
[ 1607.748141] WARNING: CPU: 27 PID: 51385 at fs/iomap/buffered-io.c:80 iomap_page_release+0xaf/0xc0
[ 1607.751098] Modules linked in:
[ 1607.751619] CPU: 27 PID: 51385 Comm: test_write Not tainted 6.1.0-rc1-dgc+ #1618
[ 1607.753307] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 1607.755443] RIP: 0010:iomap_page_release+0xaf/0xc0
[ 1607.756496] Code: 48 c1 e8 02 83 e0 01 75 10 38 d0 75 21 5b 4c 89 e7 5d 41 5c e9 d2 be ee ff eb ee e8 bb e0 ec ff eb 88 48 8b 07 5b 5d 41 5c c3 <0f> 0b eb b1 0f 0b eb db 0f 0b eb b2 0f 1f 44 00 00 0f 1f 44 00 00
[ 1607.759643] RSP: 0018:ffffc900087cb5b0 EFLAGS: 00010206
[ 1607.760775] RAX: 0000000000200000 RBX: ffffea0004378000 RCX: 000000000000000c
[ 1607.762181] RDX: 0017ffffc0050081 RSI: ffff888145cb0178 RDI: ffffea0004378000
[ 1607.764408] RBP: 0000000000000200 R08: 0000000000200000 R09: ffffea0004378008
[ 1607.765560] R10: ffff888141988800 R11: 0000000000033900 R12: ffff88810200ee40
[ 1607.766988] R13: 0000000000000200 R14: ffffc900087cb8c0 R15: ffffea0004378008
[ 1607.768347] FS:  00007f03d5c00640(0000) GS:ffff88833bdc0000(0000) knlGS:0000000000000000
[ 1607.769756] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1607.771421] CR2: 00007f03cae00000 CR3: 0000000246226006 CR4: 0000000000770ee0
[ 1607.773298] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1607.775026] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1607.777341] PKRU: 55555554
[ 1607.777995] Call Trace:
[ 1607.778637]  <TASK>
[ 1607.779372]  iomap_release_folio+0x4d/0xa0
[ 1607.780510]  shrink_folio_list+0x9dd/0xb10
[ 1607.781931]  shrink_lruvec+0x558/0xb80
[ 1607.782922]  shrink_node+0x2ce/0x6f0
[ 1607.783993]  do_try_to_free_pages+0xd5/0x570
[ 1607.785031]  try_to_free_mem_cgroup_pages+0x105/0x220
[ 1607.786728]  reclaim_high+0xa6/0xf0
[ 1607.787672]  mem_cgroup_handle_over_high+0x97/0x290
[ 1607.789179]  try_charge_memcg+0x697/0x7f0
[ 1607.790100]  charge_memcg+0x35/0xd0
[ 1607.790930]  __mem_cgroup_charge+0x29/0x80
[ 1607.792884]  __filemap_add_folio+0x21a/0x460
[ 1607.794342]  ? scan_shadow_nodes+0x30/0x30
[ 1607.796052]  filemap_add_folio+0x37/0xa0
[ 1607.797036]  __filemap_get_folio+0x221/0x360
[ 1607.798468]  ? xas_load+0x5/0xa0
[ 1607.799417]  iomap_write_begin+0x103/0x580
[ 1607.800908]  ? filemap_dirty_folio+0x5c/0x80
[ 1607.801940]  ? iomap_write_end+0x101/0x250
[ 1607.803305]  iomap_file_buffered_write+0x17f/0x330
[ 1607.804700]  xfs_file_buffered_write+0xb1/0x2e0
[ 1607.806240]  ? xfs_file_buffered_write+0x2b2/0x2e0
[ 1607.807743]  vfs_write+0x2ca/0x3d0
[ 1607.808622]  __x64_sys_pwrite64+0x8c/0xc0
[ 1607.810502]  do_syscall_64+0x35/0x80
[ 1607.811511]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1607.813116] RIP: 0033:0x7f03d6b40a87
....
[ 1607.925434] kernel BUG at mm/vmscan.c:1306!
[ 1607.926977] invalid opcode: 0000 [#1] PREEMPT SMP
[ 1607.928427] CPU: 8 PID: 51385 Comm: test_write Tainted: G        W          6.1.0-rc1-dgc+ #1618
[ 1607.930742] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 1607.932716] RIP: 0010:__remove_mapping+0x229/0x240
[ 1607.933648] Code: a9 00 00 08 00 75 27 48 8b 3b 48 81 c7 88 00 00 00 e8 db 64 c1 00 31 c0 e9 d6 fe ff ff 48 8b 3b e8 7c 29 0d 00 e9 27 ff ff ff <0f> 0b 0f 0b 48 8b 45 00 f6 c4 04 74 d0 31 c0 e9 b5 fe ff ff 0f 1f
[ 1607.937079] RSP: 0018:ffffc900087cb5d8 EFLAGS: 00010246
[ 1607.938050] RAX: 0017ffffc0010004 RBX: ffffc900087cb6e8 RCX: ffff888244bc5000
[ 1607.939443] RDX: 0000000000000001 RSI: ffffea0004378000 RDI: ffff888145cb0318
[ 1607.940980] RBP: ffffc900087cb648 R08: ffff888145cb0318 R09: ffffffff813be400
[ 1607.942300] R10: ffff88810200ee40 R11: 0000000000033900 R12: ffffea0004378000
[ 1607.943666] R13: 0000000000000200 R14: ffffc900087cb8c0 R15: ffffea0004378008
[ 1607.945004] FS:  00007f03d5c00640(0000) GS:ffff88823bc00000(0000) knlGS:0000000000000000
[ 1607.946492] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1607.947552] CR2: 00007efd1ae33da8 CR3: 0000000246226001 CR4: 0000000000770ee0
[ 1607.948866] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1607.950250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1607.951738] PKRU: 55555554
[ 1607.952289] Call Trace:
[ 1607.952775]  <TASK>
[ 1607.953190]  shrink_folio_list+0x5cd/0xb10
[ 1607.954140]  shrink_lruvec+0x558/0xb80
[ 1607.954850]  shrink_node+0x2ce/0x6f0
[ 1607.955536]  do_try_to_free_pages+0xd5/0x570
[ 1607.956397]  try_to_free_mem_cgroup_pages+0x105/0x220
[ 1607.957503]  reclaim_high+0xa6/0xf0
[ 1607.958205]  mem_cgroup_handle_over_high+0x97/0x290
[ 1607.959149]  try_charge_memcg+0x697/0x7f0
[ 1607.959932]  charge_memcg+0x35/0xd0
[ 1607.960549]  __mem_cgroup_charge+0x29/0x80
[ 1607.961257]  __filemap_add_folio+0x21a/0x460
[ 1607.961912]  ? scan_shadow_nodes+0x30/0x30
[ 1607.962705]  filemap_add_folio+0x37/0xa0
[ 1607.963354]  __filemap_get_folio+0x221/0x360
[ 1607.964076]  ? xas_load+0x5/0xa0
[ 1607.964761]  iomap_write_begin+0x103/0x580
[ 1607.965510]  ? filemap_dirty_folio+0x5c/0x80
[ 1607.966349]  ? iomap_write_end+0x101/0x250
[ 1607.967185]  iomap_file_buffered_write+0x17f/0x330
[ 1607.968133]  xfs_file_buffered_write+0xb1/0x2e0
[ 1607.968948]  ? xfs_file_buffered_write+0x2b2/0x2e0
[ 1607.969783]  vfs_write+0x2ca/0x3d0
[ 1607.970326]  __x64_sys_pwrite64+0x8c/0xc0
[ 1607.971073]  do_syscall_64+0x35/0x80
[ 1607.971732]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

which is:

1297  * Same as remove_mapping, but if the folio is removed from the mapping, it      
1298  * gets returned with a refcount of 0.                                           
1299  */                                                                              
1300 static int __remove_mapping(struct address_space *mapping, struct folio *folio,  
1301                             bool reclaimed, struct mem_cgroup *target_memcg)     
1302 {                                                                                
1303         int refcount;                                                            
1304         void *shadow = NULL;                                                     
1305                                                                                  
1306         BUG_ON(!folio_test_locked(folio));                                       
1307         BUG_ON(mapping != folio_mapping(folio));                                 
1308                                                                                  

tripping over an unlocked folio in shrink_folio_list()....

And, finally, there's also a process spinning on a cpu in
truncate_inode_pages_range():

[ 3468.057406] CPU: 1 PID: 3579 Comm: test_write Tainted: G      D W          6.1.0-rc1-dgc+ #1618
[ 3468.057407] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 3468.057408] RIP: 0010:truncate_inode_pages_range+0x3c0/0x470
[ 3468.057411] Code: 8b 7c e5 00 41 f6 c6 01 0f 85 62 ff ff ff be a9 03 00 00 48 c7 c7 b8 62 7b 82 e8 0b 46 e9 ff 2e 2e 2e 31 c0 f0 49 0f ba 7
[ 3468.057412] RSP: 0018:ffffc90002cffd40 EFLAGS: 00000246
[ 3468.057414] RAX: 0000000000000000 RBX: ffffc90002cffdd8 RCX: ffff8882461b8000
[ 3468.057415] RDX: 0000000000000000 RSI: 00000000000003a9 RDI: ffffffff827b62b8
[ 3468.057415] RBP: ffff888246eee718 R08: ffffffffffffffc0 R09: fffffffffffffffe
[ 3468.057416] R10: 0000000000002838 R11: 0000000000002834 R12: 0000000000000004
[ 3468.057417] R13: ffffc90002cffd60 R14: ffffea0004378000 R15: 0000000000002810
[ 3468.057418] FS:  00007f03d6953740(0000) GS:ffff88813bc40000(0000) knlGS:0000000000000000
[ 3468.057420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3468.057421] CR2: 00007f03d0100000 CR3: 0000000245e2c006 CR4: 0000000000770ee0
root@test4:~# [ 3468.057422] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3468.057422] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3468.057423] PKRU: 55555554
[ 3468.057424] Call Trace:
[ 3468.057424]  <TASK>
[ 3468.057426]  evict+0x1ad/0x1c0
[ 3468.057429]  do_unlinkat+0x1db/0x2e0
[ 3468.057432]  __x64_sys_unlinkat+0x33/0x60
[ 3468.057434]  do_syscall_64+0x35/0x80
[ 3468.057437]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

So that's not the problem, either...

-Dave.

Zhaoyang Huang Oct. 20, 2022, 1:27 a.m. UTC | #17

On Wed, Oct 19, 2022 at 9:06 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 19, 2022 at 01:48:37PM +0800, Zhaoyang Huang wrote:
> > On Wed, Oct 19, 2022 at 12:47 PM Dave Chinner <david@fromorbit.com> wrote:
> > > I removed the mapping_set_large_folios() calls in the XFS inode
> > > instantiation and the test code has now run over 55,000 iterations
> > > without failing.  The most iterations I'd seen with large folios
> > > enabled was about 7,000 - typically it would fail within 2-3,000
> > > iterations.
> > hint from my side. The original problem I raised is under v5.15 where
> > there is no folio yet.
>
> But 5.15 does use 2MB pages in shmem.  You haven't really provided
> any information, so I don't know whether the inode that you're having
> problems with is a shmem inode.
No, it is a f2fs inode using 4KB pages.

Dave Chinner Oct. 20, 2022, 2:04 a.m. UTC | #18

On Wed, Oct 19, 2022 at 07:49:57AM -0400, Brian Foster wrote:
> On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote:
> > > On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote:
> > > > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote:
> > > > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote:
> > > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > > > >
> > > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink
> > > > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the
> > > > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet
> > > > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to
> > > > > > > > suggest skip this page to break the live lock as a workaround.
> > > > > > >
> > > > > > > No, the underlying bug should be fixed.
> > > > >
> > > > >     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > Understand. IMHO, find_get_entry actruely works as an open API dealing
> > > > with different kinds of address_spaces page cache, which requires high
> > > > robustness to deal with any corner cases. Take the current problem as
> > > > example, the inode with fault page(refcount=0) could remain on the
> > > > sb's list without live lock problem.
> > > 
> > > But it's a corner case that shouldn't happen!  What else is going on
> > > at the time?  Can you reproduce this problem easily?  If so, how?
> > 
> > I've been seeing this livelock, too. The reproducer is,
> > unfortunately, something I can't share - it's a massive program that
> > triggers a data corruption I'm working on solving.
> > 
> > Now that I've
> > mostly fixed the data corruption, long duration test runs end up
> > livelocking in page cache lookup after several hours.
> > 
> > The test is effectively writing a 100MB file with multiple threads
> > doing reverse adjacent racing 1MB unaligned writes. Once the file is
> > written, it is then mmap()d and read back from the filesystem for
> > verification.
> > 
> > THis is then run with tens of processes concurrently, and then under
> > a massively confined memcg (e.g. 32 processes/files are run in a
> > memcg with only 200MB of memory allowed). This causes writeback,
> > readahead and memory reclaim to race with incoming mmap read faults
> > and writes.  The livelock occurs on file verification and it appears
> > to be an interaction with readahead thrashing.
> > 
> > On my test rig, the physical read to write ratio is at least 20:1 -
> > with 32 processes running, the 5s IO rates are:
> > 
> > Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s    MB_read    MB_wrtn    MB_dscd
> > dm-0          52187.20      3677.42      1345.92         0.00      18387       6729          0
> > dm-0          62865.60      5947.29         0.08         0.00      29736          0          0
> > dm-0          62972.80      5911.20         0.00         0.00      29556          0          0
> > dm-0          59803.00      5516.72       133.47         0.00      27583        667          0
> > dm-0          63068.20      5292.34       511.52         0.00      26461       2557          0
> > dm-0          56775.60      4184.52      1248.38         0.00      20922       6241          0
> > dm-0          63087.40      5901.26        43.77         0.00      29506        218          0
> > dm-0          62769.00      5833.97        60.54         0.00      29169        302          0
> > dm-0          64810.20      5636.13       305.63         0.00      28180       1528          0
> > dm-0          65222.60      5598.99       349.48         0.00      27994       1747          0
> > dm-0          62444.00      4887.05       926.67         0.00      24435       4633          0
> > dm-0          63812.00      5622.68       294.66         0.00      28113       1473          0
> > dm-0          63482.00      5728.43       195.74         0.00      28642        978          0
> > 
> > This is reading and writing the same amount of file data at the
> > application level, but once the data has been written and kicked out
> > of the page cache it seems to require an awful lot more read IO to
> > get it back to the application. i.e. this looks like mmap() is
> > readahead thrashing severely, and eventually it livelocks with this
> > sort of report:
> > 
> > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > [175901.995614] Call Trace:
> > [175901.996090]  <TASK>
> > [175901.996594]  ? __schedule+0x301/0xa30
> > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > [175902.000714]  ? xas_start+0x53/0xc0
> > [175902.001484]  ? xas_load+0x24/0xa0
> > [175902.002208]  ? xas_load+0x5/0xa0
> > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > [175902.004693]  ? __do_fault+0x31/0x1d0
> > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > [175902.008613]  </TASK>
> > 
> > Given that filemap_fault on XFS is probably trying to map large
> > folios, I do wonder if this is a result of some kind of race with
> > teardown of a large folio...
> > 
> 
> I somewhat recently tracked down a hugepage/swap problem that could
> manifest as a softlockup in the folio lookup path (due to indefinite
> folio_try_get_rcu() failure):
> 
> https://lore.kernel.org/linux-mm/20220906190602.1626037-1-bfoster@redhat.com/
> 
> It could easily be something different leading to the same side effect,
> particularly since I believe the issue I saw was introduced in v5.19,
> but might be worth a test if you have a reliable reproducer.

Tests run and, unfortunately, that patch doesn't prevent/fix the
problem either.

Cheers,

Dave.

Zhaoyang Huang Oct. 20, 2022, 3:12 a.m. UTC | #19

On Thu, Oct 20, 2022 at 10:04 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Oct 19, 2022 at 07:49:57AM -0400, Brian Foster wrote:
> > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > On Tue, Oct 18, 2022 at 04:09:17AM +0100, Matthew Wilcox wrote:
> > > > On Tue, Oct 18, 2022 at 10:52:19AM +0800, Zhaoyang Huang wrote:
> > > > > On Mon, Oct 17, 2022 at 11:55 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Mon, Oct 17, 2022 at 01:34:13PM +0800, Zhaoyang Huang wrote:
> > > > > > > On Fri, Oct 14, 2022 at 8:12 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > > >
> > > > > > > > On Fri, Oct 14, 2022 at 01:30:48PM +0800, zhaoyang.huang wrote:
> > > > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > > > > >
> > > > > > > > > Bellowing RCU stall is reported where kswapd traps in a live lock when shrink
> > > > > > > > > superblock's inode list. The direct reason is zombie page keeps staying on the
> > > > > > > > > xarray's slot and make the check and retry loop permanently. The root cause is unknown yet
> > > > > > > > > and supposed could be an xa update without synchronize_rcu etc. I would like to
> > > > > > > > > suggest skip this page to break the live lock as a workaround.
> > > > > > > >
> > > > > > > > No, the underlying bug should be fixed.
> > > > > >
> > > > > >     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > Understand. IMHO, find_get_entry actruely works as an open API dealing
> > > > > with different kinds of address_spaces page cache, which requires high
> > > > > robustness to deal with any corner cases. Take the current problem as
> > > > > example, the inode with fault page(refcount=0) could remain on the
> > > > > sb's list without live lock problem.
> > > >
> > > > But it's a corner case that shouldn't happen!  What else is going on
> > > > at the time?  Can you reproduce this problem easily?  If so, how?
> > >
> > > I've been seeing this livelock, too. The reproducer is,
> > > unfortunately, something I can't share - it's a massive program that
> > > triggers a data corruption I'm working on solving.
> > >
> > > Now that I've
> > > mostly fixed the data corruption, long duration test runs end up
> > > livelocking in page cache lookup after several hours.
> > >
> > > The test is effectively writing a 100MB file with multiple threads
> > > doing reverse adjacent racing 1MB unaligned writes. Once the file is
> > > written, it is then mmap()d and read back from the filesystem for
> > > verification.
> > >
> > > THis is then run with tens of processes concurrently, and then under
> > > a massively confined memcg (e.g. 32 processes/files are run in a
> > > memcg with only 200MB of memory allowed). This causes writeback,
> > > readahead and memory reclaim to race with incoming mmap read faults
> > > and writes.  The livelock occurs on file verification and it appears
> > > to be an interaction with readahead thrashing.
> > >
> > > On my test rig, the physical read to write ratio is at least 20:1 -
> > > with 32 processes running, the 5s IO rates are:
> > >
> > > Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s    MB_read    MB_wrtn    MB_dscd
> > > dm-0          52187.20      3677.42      1345.92         0.00      18387       6729          0
> > > dm-0          62865.60      5947.29         0.08         0.00      29736          0          0
> > > dm-0          62972.80      5911.20         0.00         0.00      29556          0          0
> > > dm-0          59803.00      5516.72       133.47         0.00      27583        667          0
> > > dm-0          63068.20      5292.34       511.52         0.00      26461       2557          0
> > > dm-0          56775.60      4184.52      1248.38         0.00      20922       6241          0
> > > dm-0          63087.40      5901.26        43.77         0.00      29506        218          0
> > > dm-0          62769.00      5833.97        60.54         0.00      29169        302          0
> > > dm-0          64810.20      5636.13       305.63         0.00      28180       1528          0
> > > dm-0          65222.60      5598.99       349.48         0.00      27994       1747          0
> > > dm-0          62444.00      4887.05       926.67         0.00      24435       4633          0
> > > dm-0          63812.00      5622.68       294.66         0.00      28113       1473          0
> > > dm-0          63482.00      5728.43       195.74         0.00      28642        978          0
> > >
> > > This is reading and writing the same amount of file data at the
> > > application level, but once the data has been written and kicked out
> > > of the page cache it seems to require an awful lot more read IO to
> > > get it back to the application. i.e. this looks like mmap() is
> > > readahead thrashing severely, and eventually it livelocks with this
> > > sort of report:
> > >
> > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > [175901.995614] Call Trace:
> > > [175901.996090]  <TASK>
> > > [175901.996594]  ? __schedule+0x301/0xa30
> > > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > [175902.000714]  ? xas_start+0x53/0xc0
> > > [175902.001484]  ? xas_load+0x24/0xa0
> > > [175902.002208]  ? xas_load+0x5/0xa0
> > > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > > [175902.004693]  ? __do_fault+0x31/0x1d0
> > > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > > [175902.008613]  </TASK>
> > >
> > > Given that filemap_fault on XFS is probably trying to map large
> > > folios, I do wonder if this is a result of some kind of race with
> > > teardown of a large folio...
> > >
> >
> > I somewhat recently tracked down a hugepage/swap problem that could
> > manifest as a softlockup in the folio lookup path (due to indefinite
> > folio_try_get_rcu() failure):
> >
> > https://lore.kernel.org/linux-mm/20220906190602.1626037-1-bfoster@redhat.com/
> >
> > It could easily be something different leading to the same side effect,
> > particularly since I believe the issue I saw was introduced in v5.19,
> > but might be worth a test if you have a reliable reproducer.
>
> Tests run and, unfortunately, that patch doesn't prevent/fix the
> problem either.
Could you please help to test the following change which breaks the
livelock by surpassing the zero-refcount page directly without reset
iterate to ROOT.

      if (!folio_try_get_rcu(folio)) {
               goto retry;
      }
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

Matthew Wilcox Oct. 20, 2022, 9:52 p.m. UTC | #20

On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote:
> On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > This is reading and writing the same amount of file data at the
> > > application level, but once the data has been written and kicked out
> > > of the page cache it seems to require an awful lot more read IO to
> > > get it back to the application. i.e. this looks like mmap() is
> > > readahead thrashing severely, and eventually it livelocks with this
> > > sort of report:
> > > 
> > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > [175901.995614] Call Trace:
> > > [175901.996090]  <TASK>
> > > [175901.996594]  ? __schedule+0x301/0xa30
> > > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > [175902.000714]  ? xas_start+0x53/0xc0
> > > [175902.001484]  ? xas_load+0x24/0xa0
> > > [175902.002208]  ? xas_load+0x5/0xa0
> > > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > > [175902.004693]  ? __do_fault+0x31/0x1d0
> > > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > > [175902.008613]  </TASK>
> > > 
> > > Given that filemap_fault on XFS is probably trying to map large
> > > folios, I do wonder if this is a result of some kind of race with
> > > teardown of a large folio...
> > 
> > It doesn't matter whether we're trying to map a large folio; it
> > matters whether a large folio was previously created in the cache.
> > Through the magic of readahead, it may well have been.  I suspect
> > it's not teardown of a large folio, but splitting.  Removing a
> > page from the page cache stores to the pointer in the XArray
> > first (either NULL or a shadow entry), then decrements the refcount.
> > 
> > We must be observing a frozen folio.  There are a number of places
> > in the MM which freeze a folio, but the obvious one is splitting.
> > That looks like this:
> > 
> >         local_irq_disable();
> >         if (mapping) {
> >                 xas_lock(&xas);
> > (...)
> >         if (folio_ref_freeze(folio, 1 + extra_pins)) {
> 
> But the lookup is not doing anything to prevent the split on the
> frozen page from making progress, right? It's not holding any folio
> references, and it's not holding the mapping tree lock, either. So
> how does the lookup in progress prevent the page split from making
> progress?

My thinking was that it keeps hammering the ->refcount field in
struct folio.  That might prevent a thread on a different socket
from making forward progress.  In contrast, spinlocks are designed
to be fair under contention, so by spinning on an actual lock, we'd
remove contention on the folio.

But I think the tests you've done refute that theory.  I'm all out of
ideas at the moment.  Either we have a frozen folio from somebody who
doesn't hold the lock, or we have someone who's left a frozen folio in
the page cache.  I'm leaning towards that explanation at the moment,
but I don't have a good suggestion for debugging.

Perhaps a bad suggestion for debugging would be to call dump_page()
with a __ratelimit() wrapper to not be overwhelmed with information?

> I would have thought:
> 
> 	if (!folio_try_get_rcu(folio)) {
> 		rcu_read_unlock();
> 		cond_resched();
> 		rcu_read_lock();
> 		goto repeat;
> 	}
> 
> Would be the right way to yeild the CPU to avoid priority
> inversion related livelocks here...

I'm not sure we're allowed to schedule here.  We might be under another
spinlock?

Zhaoyang Huang Oct. 26, 2022, 8:38 a.m. UTC | #21

On Fri, Oct 21, 2022 at 5:52 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote:
> > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > > This is reading and writing the same amount of file data at the
> > > > application level, but once the data has been written and kicked out
> > > > of the page cache it seems to require an awful lot more read IO to
> > > > get it back to the application. i.e. this looks like mmap() is
> > > > readahead thrashing severely, and eventually it livelocks with this
> > > > sort of report:
> > > >
> > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > > [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > > [175901.995614] Call Trace:
> > > > [175901.996090]  <TASK>
> > > > [175901.996594]  ? __schedule+0x301/0xa30
> > > > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > [175902.000714]  ? xas_start+0x53/0xc0
> > > > [175902.001484]  ? xas_load+0x24/0xa0
> > > > [175902.002208]  ? xas_load+0x5/0xa0
> > > > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > > > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > > > [175902.004693]  ? __do_fault+0x31/0x1d0
> > > > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > > > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > > > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > > > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > > > [175902.008613]  </TASK>
> > > >
> > > > Given that filemap_fault on XFS is probably trying to map large
> > > > folios, I do wonder if this is a result of some kind of race with
> > > > teardown of a large folio...
> > >
> > > It doesn't matter whether we're trying to map a large folio; it
> > > matters whether a large folio was previously created in the cache.
> > > Through the magic of readahead, it may well have been.  I suspect
> > > it's not teardown of a large folio, but splitting.  Removing a
> > > page from the page cache stores to the pointer in the XArray
> > > first (either NULL or a shadow entry), then decrements the refcount.
> > >
> > > We must be observing a frozen folio.  There are a number of places
> > > in the MM which freeze a folio, but the obvious one is splitting.
> > > That looks like this:
> > >
> > >         local_irq_disable();
> > >         if (mapping) {
> > >                 xas_lock(&xas);
> > > (...)
> > >         if (folio_ref_freeze(folio, 1 + extra_pins)) {
> >
> > But the lookup is not doing anything to prevent the split on the
> > frozen page from making progress, right? It's not holding any folio
> > references, and it's not holding the mapping tree lock, either. So
> > how does the lookup in progress prevent the page split from making
> > progress?
>
> My thinking was that it keeps hammering the ->refcount field in
> struct folio.  That might prevent a thread on a different socket
> from making forward progress.  In contrast, spinlocks are designed
> to be fair under contention, so by spinning on an actual lock, we'd
> remove contention on the folio.
>
> But I think the tests you've done refute that theory.  I'm all out of
> ideas at the moment.  Either we have a frozen folio from somebody who
> doesn't hold the lock, or we have someone who's left a frozen folio in
> the page cache.  I'm leaning towards that explanation at the moment,
> but I don't have a good suggestion for debugging.
>
> Perhaps a bad suggestion for debugging would be to call dump_page()
> with a __ratelimit() wrapper to not be overwhelmed with information?
>
> > I would have thought:
> >
> >       if (!folio_try_get_rcu(folio)) {
> >               rcu_read_unlock();
> >               cond_resched();
> >               rcu_read_lock();
> >               goto repeat;
> >       }
> >
> > Would be the right way to yeild the CPU to avoid priority
> > inversion related livelocks here...
>
> I'm not sure we're allowed to schedule here.  We might be under another
> spinlock?
Any further ideas on this issue? Could we just deal with it as simply
as surpass the zero refed page to break the livelock as a workaround?
IMO, the system could survive if it is a single inode leak or expose
other faults if the page cache messed up, which is better than
livelock here. We do the similar thing during reclaiming as force
reset the page's mapcount to -1 even if there is active reference on
it.

static void unaccount_page_cache_page(struct address_space *mapping,
     struct page *page)
{

if (mapping_exiting(mapping) &&
   page_count(page) >= mapcount + 2) {

/*
* All vmas have already been torn down, so it's
* a good bet that actually the page is unmapped,
* and we'd prefer not to leak it: if we're wrong,
* some other bad page check should catch it later.
*/

page_mapcount_reset(page);
page_ref_sub(page, mapcount);
}

Matthew Wilcox Oct. 26, 2022, 2:38 p.m. UTC | #22

On Wed, Oct 26, 2022 at 04:38:31PM +0800, Zhaoyang Huang wrote:
> On Fri, Oct 21, 2022 at 5:52 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote:
> > > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> > > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > > > This is reading and writing the same amount of file data at the
> > > > > application level, but once the data has been written and kicked out
> > > > > of the page cache it seems to require an awful lot more read IO to
> > > > > get it back to the application. i.e. this looks like mmap() is
> > > > > readahead thrashing severely, and eventually it livelocks with this
> > > > > sort of report:
> > > > >
> > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > > > [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > > > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > > > [175901.995614] Call Trace:
> > > > > [175901.996090]  <TASK>
> > > > > [175901.996594]  ? __schedule+0x301/0xa30
> > > > > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > > [175902.000714]  ? xas_start+0x53/0xc0
> > > > > [175902.001484]  ? xas_load+0x24/0xa0
> > > > > [175902.002208]  ? xas_load+0x5/0xa0
> > > > > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > > > > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > > > > [175902.004693]  ? __do_fault+0x31/0x1d0
> > > > > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > > > > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > > > > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > > > > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > > > > [175902.008613]  </TASK>
> > > > >
> > > > > Given that filemap_fault on XFS is probably trying to map large
> > > > > folios, I do wonder if this is a result of some kind of race with
> > > > > teardown of a large folio...
> > > >
> > > > It doesn't matter whether we're trying to map a large folio; it
> > > > matters whether a large folio was previously created in the cache.
> > > > Through the magic of readahead, it may well have been.  I suspect
> > > > it's not teardown of a large folio, but splitting.  Removing a
> > > > page from the page cache stores to the pointer in the XArray
> > > > first (either NULL or a shadow entry), then decrements the refcount.
> > > >
> > > > We must be observing a frozen folio.  There are a number of places
> > > > in the MM which freeze a folio, but the obvious one is splitting.
> > > > That looks like this:
> > > >
> > > >         local_irq_disable();
> > > >         if (mapping) {
> > > >                 xas_lock(&xas);
> > > > (...)
> > > >         if (folio_ref_freeze(folio, 1 + extra_pins)) {
> > >
> > > But the lookup is not doing anything to prevent the split on the
> > > frozen page from making progress, right? It's not holding any folio
> > > references, and it's not holding the mapping tree lock, either. So
> > > how does the lookup in progress prevent the page split from making
> > > progress?
> >
> > My thinking was that it keeps hammering the ->refcount field in
> > struct folio.  That might prevent a thread on a different socket
> > from making forward progress.  In contrast, spinlocks are designed
> > to be fair under contention, so by spinning on an actual lock, we'd
> > remove contention on the folio.
> >
> > But I think the tests you've done refute that theory.  I'm all out of
> > ideas at the moment.  Either we have a frozen folio from somebody who
> > doesn't hold the lock, or we have someone who's left a frozen folio in
> > the page cache.  I'm leaning towards that explanation at the moment,
> > but I don't have a good suggestion for debugging.
> >
> > Perhaps a bad suggestion for debugging would be to call dump_page()
> > with a __ratelimit() wrapper to not be overwhelmed with information?
> >
> > > I would have thought:
> > >
> > >       if (!folio_try_get_rcu(folio)) {
> > >               rcu_read_unlock();
> > >               cond_resched();
> > >               rcu_read_lock();
> > >               goto repeat;
> > >       }
> > >
> > > Would be the right way to yeild the CPU to avoid priority
> > > inversion related livelocks here...
> >
> > I'm not sure we're allowed to schedule here.  We might be under another
> > spinlock?
> Any further ideas on this issue? Could we just deal with it as simply
> as surpass the zero refed page to break the livelock as a workaround?

No.  This bug needs to be found & fixed.  How easily can you reproduce
it?

Matthew Wilcox Oct. 26, 2022, 4:01 p.m. UTC | #23

On Thu, Oct 20, 2022 at 10:52:14PM +0100, Matthew Wilcox wrote:
> But I think the tests you've done refute that theory.  I'm all out of
> ideas at the moment.

I have a new idea.  In page_cache_delete_batch(), we don't set the
order of the entry before calling xas_store().  That means we can end
up in a situation where we have an order-2 folio in the page cache,
delete it and end up with a NULL pointer at (say) index 20 and sibling
entries at indices 21-23.  We can come along (potentially much later)
and put an order-0 folio back at index 20.  Now all of indices 20-23
point to the index-20, order-0 folio.  Worse, the xarray node can be
freed with the sibling entries still intact and then be reallocated by
an entirely different xarray.

I don't know if this is going to fix the problem you're seeing.  I can't
quite draw a line from this situation to your symptoms.  I came across
it while auditing all the places which set folio->mapping to NULL.
I did notice a mis-ordering; all the other places first remove the folio
from the xarray before setting folio to NULL, but I have a hard time
connecting that to your symptoms either.

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 44dd6d6e01bc..cc1fd1f849a7 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -1617,6 +1617,12 @@ static inline void xas_advance(struct xa_state *xas, unsigned long index)
 	xas->xa_offset = (index >> shift) & XA_CHUNK_MASK;
 }

+static inline void xas_adjust_order(struct xa_state *xas, unsigned int order)
+{
+	xas->xa_shift = order - (order % XA_CHUNK_SHIFT);
+	xas->xa_sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
+}
+
 /**
  * xas_set_order() - Set up XArray operation state for a multislot entry.
  * @xas: XArray operation state.
@@ -1628,8 +1634,7 @@ static inline void xas_set_order(struct xa_state *xas, unsigned long index,
 {
 #ifdef CONFIG_XARRAY_MULTI
 	xas->xa_index = order < BITS_PER_LONG ? (index >> order) << order : 0;
-	xas->xa_shift = order - (order % XA_CHUNK_SHIFT);
-	xas->xa_sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
+	xas_adjust_order(xas, order);
 	xas->xa_node = XAS_RESTART;
 #else
 	BUG_ON(order > 0);
diff --git a/mm/filemap.c b/mm/filemap.c
index 08341616ae7a..6e3f486131e4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -305,11 +305,13 @@ static void page_cache_delete_batch(struct address_space *mapping,

 		WARN_ON_ONCE(!folio_test_locked(folio));

+		if (!folio_test_hugetlb(folio))
+			xas_adjust_order(&xas, folio_order(folio));
+		xas_store(&xas, NULL);
 		folio->mapping = NULL;
 		/* Leave folio->index set: truncation lookup relies on it */

 		i++;
-		xas_store(&xas, NULL);
 		total_pages += folio_nr_pages(folio);
 	}
 	mapping->nrpages -= total_pages;

Matthew Wilcox Oct. 26, 2022, 7:49 p.m. UTC | #24

On Wed, Oct 19, 2022 at 01:48:37PM +0800, Zhaoyang Huang wrote:
> hint from my side. The original problem I raised is under v5.15 where
> there is no folio yet.

I really wish you'd stop dropping hints and give all of the information
you have so I can try to figure this out withouot sending individual
emails for every little piece.

Do you have CONFIG_READ_ONLY_THP_FOR_FS=y in your .config?

Zhaoyang Huang Oct. 27, 2022, 1:57 a.m. UTC | #25

On Thu, Oct 27, 2022 at 3:49 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 19, 2022 at 01:48:37PM +0800, Zhaoyang Huang wrote:
> > hint from my side. The original problem I raised is under v5.15 where
> > there is no folio yet.
>
> I really wish you'd stop dropping hints and give all of the information
> you have so I can try to figure this out withouot sending individual
> emails for every little piece.
>
> Do you have CONFIG_READ_ONLY_THP_FOR_FS=y in your .config?
No, it is not configured. f2fs only use 4KB pages and blocks
>

Dave Chinner Oct. 28, 2022, 4:05 a.m. UTC | #26

On Wed, Oct 26, 2022 at 05:01:57PM +0100, Matthew Wilcox wrote:
> On Thu, Oct 20, 2022 at 10:52:14PM +0100, Matthew Wilcox wrote:
> > But I think the tests you've done refute that theory.  I'm all out of
> > ideas at the moment.
> 
> I have a new idea.  In page_cache_delete_batch(), we don't set the
> order of the entry before calling xas_store().  That means we can end
> up in a situation where we have an order-2 folio in the page cache,
> delete it and end up with a NULL pointer at (say) index 20 and sibling
> entries at indices 21-23.  We can come along (potentially much later)
> and put an order-0 folio back at index 20.  Now all of indices 20-23
> point to the index-20, order-0 folio.  Worse, the xarray node can be
> freed with the sibling entries still intact and then be reallocated by
> an entirely different xarray.
> 
> I don't know if this is going to fix the problem you're seeing.  I can't
> quite draw a line from this situation to your symptoms.  I came across
> it while auditing all the places which set folio->mapping to NULL.
> I did notice a mis-ordering; all the other places first remove the folio
> from the xarray before setting folio to NULL, but I have a hard time
> connecting that to your symptoms either.
> 
> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index 44dd6d6e01bc..cc1fd1f849a7 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -1617,6 +1617,12 @@ static inline void xas_advance(struct xa_state *xas, unsigned long index)
>  	xas->xa_offset = (index >> shift) & XA_CHUNK_MASK;
>  }
>  
> +static inline void xas_adjust_order(struct xa_state *xas, unsigned int order)
> +{
> +	xas->xa_shift = order - (order % XA_CHUNK_SHIFT);
> +	xas->xa_sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
> +}
> +
>  /**
>   * xas_set_order() - Set up XArray operation state for a multislot entry.
>   * @xas: XArray operation state.
> @@ -1628,8 +1634,7 @@ static inline void xas_set_order(struct xa_state *xas, unsigned long index,
>  {
>  #ifdef CONFIG_XARRAY_MULTI
>  	xas->xa_index = order < BITS_PER_LONG ? (index >> order) << order : 0;
> -	xas->xa_shift = order - (order % XA_CHUNK_SHIFT);
> -	xas->xa_sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
> +	xas_adjust_order(xas, order);
>  	xas->xa_node = XAS_RESTART;
>  #else
>  	BUG_ON(order > 0);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 08341616ae7a..6e3f486131e4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -305,11 +305,13 @@ static void page_cache_delete_batch(struct address_space *mapping,
>  
>  		WARN_ON_ONCE(!folio_test_locked(folio));
>  
> +		if (!folio_test_hugetlb(folio))
> +			xas_adjust_order(&xas, folio_order(folio));
> +		xas_store(&xas, NULL);
>  		folio->mapping = NULL;
>  		/* Leave folio->index set: truncation lookup relies on it */
>  
>  		i++;
> -		xas_store(&xas, NULL);
>  		total_pages += folio_nr_pages(folio);
>  	}
>  	mapping->nrpages -= total_pages;

Nope, that ain't it. I think I've got the data corruption fix sorted
now (at least, g/270 isn't assert failing on stray delalloc extents
anymore), so if that's the case, I can spend some time actively
trying to track this down....

-Dave.

Dave Chinner Nov. 1, 2022, 7:17 a.m. UTC | #27

On Thu, Oct 20, 2022 at 10:52:14PM +0100, Matthew Wilcox wrote:
> On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote:
> > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > > This is reading and writing the same amount of file data at the
> > > > application level, but once the data has been written and kicked out
> > > > of the page cache it seems to require an awful lot more read IO to
> > > > get it back to the application. i.e. this looks like mmap() is
> > > > readahead thrashing severely, and eventually it livelocks with this
> > > > sort of report:
> > > > 
> > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > > [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > > [175901.995614] Call Trace:
> > > > [175901.996090]  <TASK>
> > > > [175901.996594]  ? __schedule+0x301/0xa30
> > > > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > [175902.000714]  ? xas_start+0x53/0xc0
> > > > [175902.001484]  ? xas_load+0x24/0xa0
> > > > [175902.002208]  ? xas_load+0x5/0xa0
> > > > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > > > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > > > [175902.004693]  ? __do_fault+0x31/0x1d0
> > > > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > > > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > > > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > > > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > > > [175902.008613]  </TASK>
> > > > 
> > > > Given that filemap_fault on XFS is probably trying to map large
> > > > folios, I do wonder if this is a result of some kind of race with
> > > > teardown of a large folio...
> > > 
> > > It doesn't matter whether we're trying to map a large folio; it
> > > matters whether a large folio was previously created in the cache.
> > > Through the magic of readahead, it may well have been.  I suspect
> > > it's not teardown of a large folio, but splitting.  Removing a
> > > page from the page cache stores to the pointer in the XArray
> > > first (either NULL or a shadow entry), then decrements the refcount.
> > > 
> > > We must be observing a frozen folio.  There are a number of places
> > > in the MM which freeze a folio, but the obvious one is splitting.
> > > That looks like this:
> > > 
> > >         local_irq_disable();
> > >         if (mapping) {
> > >                 xas_lock(&xas);
> > > (...)
> > >         if (folio_ref_freeze(folio, 1 + extra_pins)) {
> > 
> > But the lookup is not doing anything to prevent the split on the
> > frozen page from making progress, right? It's not holding any folio
> > references, and it's not holding the mapping tree lock, either. So
> > how does the lookup in progress prevent the page split from making
> > progress?
> 
> My thinking was that it keeps hammering the ->refcount field in
> struct folio.  That might prevent a thread on a different socket
> from making forward progress.  In contrast, spinlocks are designed
> to be fair under contention, so by spinning on an actual lock, we'd
> remove contention on the folio.
> 
> But I think the tests you've done refute that theory.  I'm all out of
> ideas at the moment.  Either we have a frozen folio from somebody who
> doesn't hold the lock, or we have someone who's left a frozen folio in
> the page cache.  I'm leaning towards that explanation at the moment,
> but I don't have a good suggestion for debugging.

It's something else. I got gdb attached to qemu and single stepped
the looping lookup. The context I caught this time is truncate after
unlink:

(gdb) bt
#0  find_get_entry (mark=<optimized out>, max=<optimized out>, xas=<optimized out>) at mm/filemap.c:2014
#1  find_lock_entries (mapping=mapping@entry=0xffff8882445e2118, start=start@entry=25089, end=end@entry=18446744073709551614, 
    fbatch=fbatch@entry=0xffffc900082a7dd8, indices=indices@entry=0xffffc900082a7d60) at mm/filemap.c:2095
#2  0xffffffff8128f024 in truncate_inode_pages_range (mapping=mapping@entry=0xffff8882445e2118, lstart=lstart@entry=0, lend=lend@entry=-1)
    at mm/truncate.c:364
#3  0xffffffff8128f452 in truncate_inode_pages (lstart=0, mapping=0xffff8882445e2118) at mm/truncate.c:452
#4  0xffffffff8136335d in evict (inode=inode@entry=0xffff8882445e1f78) at fs/inode.c:666
#5  0xffffffff813636cc in iput_final (inode=0xffff8882445e1f78) at fs/inode.c:1747
#6  0xffffffff81355b8b in do_unlinkat (dfd=dfd@entry=10, name=0xffff88834170e000) at fs/namei.c:4326
#7  0xffffffff81355cc3 in __do_sys_unlinkat (flag=<optimized out>, pathname=<optimized out>, dfd=<optimized out>) at fs/namei.c:4362
#8  __se_sys_unlinkat (flag=<optimized out>, pathname=<optimized out>, dfd=<optimized out>) at fs/namei.c:4355
#9  __x64_sys_unlinkat (regs=<optimized out>) at fs/namei.c:4355
#10 0xffffffff81e92e35 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc900082a7f58) at arch/x86/entry/common.c:50
#11 do_syscall_64 (regs=0xffffc900082a7f58, nr=<optimized out>) at arch/x86/entry/common.c:80
#12 0xffffffff82000087 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
#13 0x0000000000000000 in ?? ()

The find_lock_entries() call is being asked to start at index
25089, and we are spinning on a folio we find because
folio_try_get_rcu(folio) is failing - the folio ref count is zero.

The xas state on lookup is:

(gdb) p *xas
$6 = {xa = 0xffff8882445e2120, xa_index = 25092, xa_shift = 0 '\000', xa_sibs = 0 '\000', xa_offset = 4 '\004', xa_pad = 0 '\000', 
  xa_node = 0xffff888144c15918, xa_alloc = 0x0 <fixed_percpu_data>, xa_update = 0x0 <fixed_percpu_data>, xa_lru = 0x0 <fixed_percpu_data>

indicating that we are trying to look up index 25092 (3 pages
further in than the start of the batch), and the folio that this
keeps returning is this:

(gdb) p *folio
$7 = {{{flags = 24769796876795904, {lru = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008, 
          mlock_count = 1073569616}}, mapping = 0x0 <fixed_percpu_data>, index = 18688, private = 0x8 <fixed_percpu_data+8>, _mapcount = {
        counter = -129}, _refcount = {counter = 0}, memcg_data = 0}, page = {flags = 24769796876795904, {{{lru = {next = 0xffffea0005690008, 
              prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008, mlock_count = 1073569616}, buddy_list = {
              next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, pcp_list = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}}, 
          mapping = 0x0 <fixed_percpu_data>, index = 18688, private = 8}, {pp_magic = 18446719884544507912, pp = 0xffff88823ffd5f50, 
          _pp_mapping_pad = 0, dma_addr = 18688, {dma_addr_upper = 8, pp_frag_count = {counter = 8}}}, {
          compound_head = 18446719884544507912, compound_dtor = 80 'P', compound_order = 95 '_', compound_mapcount = {counter = -30590}, 
          compound_pincount = {counter = 0}, compound_nr = 0}, {_compound_pad_1 = 18446719884544507912, 
          _compound_pad_2 = 18446612691733536592, deferred_list = {next = 0x0 <fixed_percpu_data>, 
            prev = 0x4900 <irq_stack_backing_store+10496>}}, {_pt_pad_1 = 18446719884544507912, pmd_huge_pte = 0xffff88823ffd5f50, 
          _pt_pad_2 = 0, {pt_mm = 0x4900 <irq_stack_backing_store+10496>, pt_frag_refcount = {counter = 18688}}, 
          ptl = 0x8 <fixed_percpu_data+8>}, {pgmap = 0xffffea0005690008, zone_device_data = 0xffff88823ffd5f50}, callback_head = {
          next = 0xffffea0005690008, func = 0xffff88823ffd5f50}}, {_mapcount = {counter = -129}, page_type = 4294967167}, _refcount = {
        counter = 0}, memcg_data = 0}}, _flags_1 = 24769796876795904, __head = 0, _folio_dtor = 3 '\003', _folio_order = 8 '\b', 
  _total_mapcount = {counter = -1}, _pincount = {counter = 0}, _folio_nr_pages = 0}
(gdb)

The folio has a NULL mapping, and an index of 18688, which means
even if it was not a folio that has been invalidated or freed, the
index is way outside the range we are looking for.

If I step it round the lookup loop, xas does not change, and the
same folio is returned every time through the loop. Perhaps
the mapping tree itself might be corrupt???

It's simple enough to stop the machine once it has become stuck to
observe the iteration and dump structures, just tell me what you
need to know from here...

Cheers,

Dave.

Zhaoyang Huang April 11, 2024, 7:04 a.m. UTC | #28

On Tue, Nov 1, 2022 at 3:17 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Oct 20, 2022 at 10:52:14PM +0100, Matthew Wilcox wrote:
> > On Thu, Oct 20, 2022 at 09:04:24AM +1100, Dave Chinner wrote:
> > > On Wed, Oct 19, 2022 at 04:23:10PM +0100, Matthew Wilcox wrote:
> > > > On Wed, Oct 19, 2022 at 09:30:42AM +1100, Dave Chinner wrote:
> > > > > This is reading and writing the same amount of file data at the
> > > > > application level, but once the data has been written and kicked out
> > > > > of the page cache it seems to require an awful lot more read IO to
> > > > > get it back to the application. i.e. this looks like mmap() is
> > > > > readahead thrashing severely, and eventually it livelocks with this
> > > > > sort of report:
> > > > >
> > > > > [175901.982484] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > [175901.985095] rcu:    Tasks blocked on level-1 rcu_node (CPUs 0-15): P25728
> > > > > [175901.987996]         (detected by 0, t=97399871 jiffies, g=15891025, q=1972622 ncpus=32)
> > > > > [175901.991698] task:test_write      state:R  running task     stack:12784 pid:25728 ppid: 25696 flags:0x00004002
> > > > > [175901.995614] Call Trace:
> > > > > [175901.996090]  <TASK>
> > > > > [175901.996594]  ? __schedule+0x301/0xa30
> > > > > [175901.997411]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > > [175901.998513]  ? sysvec_apic_timer_interrupt+0xb/0x90
> > > > > [175901.999578]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> > > > > [175902.000714]  ? xas_start+0x53/0xc0
> > > > > [175902.001484]  ? xas_load+0x24/0xa0
> > > > > [175902.002208]  ? xas_load+0x5/0xa0
> > > > > [175902.002878]  ? __filemap_get_folio+0x87/0x340
> > > > > [175902.003823]  ? filemap_fault+0x139/0x8d0
> > > > > [175902.004693]  ? __do_fault+0x31/0x1d0
> > > > > [175902.005372]  ? __handle_mm_fault+0xda9/0x17d0
> > > > > [175902.006213]  ? handle_mm_fault+0xd0/0x2a0
> > > > > [175902.006998]  ? exc_page_fault+0x1d9/0x810
> > > > > [175902.007789]  ? asm_exc_page_fault+0x22/0x30
> > > > > [175902.008613]  </TASK>
> > > > >
> > > > > Given that filemap_fault on XFS is probably trying to map large
> > > > > folios, I do wonder if this is a result of some kind of race with
> > > > > teardown of a large folio...
> > > >
> > > > It doesn't matter whether we're trying to map a large folio; it
> > > > matters whether a large folio was previously created in the cache.
> > > > Through the magic of readahead, it may well have been.  I suspect
> > > > it's not teardown of a large folio, but splitting.  Removing a
> > > > page from the page cache stores to the pointer in the XArray
> > > > first (either NULL or a shadow entry), then decrements the refcount.
> > > >
> > > > We must be observing a frozen folio.  There are a number of places
> > > > in the MM which freeze a folio, but the obvious one is splitting.
> > > > That looks like this:
> > > >
> > > >         local_irq_disable();
> > > >         if (mapping) {
> > > >                 xas_lock(&xas);
> > > > (...)
> > > >         if (folio_ref_freeze(folio, 1 + extra_pins)) {
> > >
> > > But the lookup is not doing anything to prevent the split on the
> > > frozen page from making progress, right? It's not holding any folio
> > > references, and it's not holding the mapping tree lock, either. So
> > > how does the lookup in progress prevent the page split from making
> > > progress?
> >
> > My thinking was that it keeps hammering the ->refcount field in
> > struct folio.  That might prevent a thread on a different socket
> > from making forward progress.  In contrast, spinlocks are designed
> > to be fair under contention, so by spinning on an actual lock, we'd
> > remove contention on the folio.
> >
> > But I think the tests you've done refute that theory.  I'm all out of
> > ideas at the moment.  Either we have a frozen folio from somebody who
> > doesn't hold the lock, or we have someone who's left a frozen folio in
> > the page cache.  I'm leaning towards that explanation at the moment,
> > but I don't have a good suggestion for debugging.
>
> It's something else. I got gdb attached to qemu and single stepped
> the looping lookup. The context I caught this time is truncate after
> unlink:
>
> (gdb) bt
> #0  find_get_entry (mark=<optimized out>, max=<optimized out>, xas=<optimized out>) at mm/filemap.c:2014
> #1  find_lock_entries (mapping=mapping@entry=0xffff8882445e2118, start=start@entry=25089, end=end@entry=18446744073709551614,
>     fbatch=fbatch@entry=0xffffc900082a7dd8, indices=indices@entry=0xffffc900082a7d60) at mm/filemap.c:2095
> #2  0xffffffff8128f024 in truncate_inode_pages_range (mapping=mapping@entry=0xffff8882445e2118, lstart=lstart@entry=0, lend=lend@entry=-1)
>     at mm/truncate.c:364
> #3  0xffffffff8128f452 in truncate_inode_pages (lstart=0, mapping=0xffff8882445e2118) at mm/truncate.c:452
> #4  0xffffffff8136335d in evict (inode=inode@entry=0xffff8882445e1f78) at fs/inode.c:666
> #5  0xffffffff813636cc in iput_final (inode=0xffff8882445e1f78) at fs/inode.c:1747
> #6  0xffffffff81355b8b in do_unlinkat (dfd=dfd@entry=10, name=0xffff88834170e000) at fs/namei.c:4326
> #7  0xffffffff81355cc3 in __do_sys_unlinkat (flag=<optimized out>, pathname=<optimized out>, dfd=<optimized out>) at fs/namei.c:4362
> #8  __se_sys_unlinkat (flag=<optimized out>, pathname=<optimized out>, dfd=<optimized out>) at fs/namei.c:4355
> #9  __x64_sys_unlinkat (regs=<optimized out>) at fs/namei.c:4355
> #10 0xffffffff81e92e35 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc900082a7f58) at arch/x86/entry/common.c:50
> #11 do_syscall_64 (regs=0xffffc900082a7f58, nr=<optimized out>) at arch/x86/entry/common.c:80
> #12 0xffffffff82000087 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
> #13 0x0000000000000000 in ?? ()
>
> The find_lock_entries() call is being asked to start at index
> 25089, and we are spinning on a folio we find because
> folio_try_get_rcu(folio) is failing - the folio ref count is zero.
>
> The xas state on lookup is:
>
> (gdb) p *xas
> $6 = {xa = 0xffff8882445e2120, xa_index = 25092, xa_shift = 0 '\000', xa_sibs = 0 '\000', xa_offset = 4 '\004', xa_pad = 0 '\000',
>   xa_node = 0xffff888144c15918, xa_alloc = 0x0 <fixed_percpu_data>, xa_update = 0x0 <fixed_percpu_data>, xa_lru = 0x0 <fixed_percpu_data>
>
> indicating that we are trying to look up index 25092 (3 pages
> further in than the start of the batch), and the folio that this
> keeps returning is this:
>
> (gdb) p *folio
> $7 = {{{flags = 24769796876795904, {lru = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008,
>           mlock_count = 1073569616}}, mapping = 0x0 <fixed_percpu_data>, index = 18688, private = 0x8 <fixed_percpu_data+8>, _mapcount = {
>         counter = -129}, _refcount = {counter = 0}, memcg_data = 0}, page = {flags = 24769796876795904, {{{lru = {next = 0xffffea0005690008,
>               prev = 0xffff88823ffd5f50}, {__filler = 0xffffea0005690008, mlock_count = 1073569616}, buddy_list = {
>               next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}, pcp_list = {next = 0xffffea0005690008, prev = 0xffff88823ffd5f50}},
>           mapping = 0x0 <fixed_percpu_data>, index = 18688, private = 8}, {pp_magic = 18446719884544507912, pp = 0xffff88823ffd5f50,
>           _pp_mapping_pad = 0, dma_addr = 18688, {dma_addr_upper = 8, pp_frag_count = {counter = 8}}}, {
>           compound_head = 18446719884544507912, compound_dtor = 80 'P', compound_order = 95 '_', compound_mapcount = {counter = -30590},
>           compound_pincount = {counter = 0}, compound_nr = 0}, {_compound_pad_1 = 18446719884544507912,
>           _compound_pad_2 = 18446612691733536592, deferred_list = {next = 0x0 <fixed_percpu_data>,
>             prev = 0x4900 <irq_stack_backing_store+10496>}}, {_pt_pad_1 = 18446719884544507912, pmd_huge_pte = 0xffff88823ffd5f50,
>           _pt_pad_2 = 0, {pt_mm = 0x4900 <irq_stack_backing_store+10496>, pt_frag_refcount = {counter = 18688}},
>           ptl = 0x8 <fixed_percpu_data+8>}, {pgmap = 0xffffea0005690008, zone_device_data = 0xffff88823ffd5f50}, callback_head = {
>           next = 0xffffea0005690008, func = 0xffff88823ffd5f50}}, {_mapcount = {counter = -129}, page_type = 4294967167}, _refcount = {
>         counter = 0}, memcg_data = 0}}, _flags_1 = 24769796876795904, __head = 0, _folio_dtor = 3 '\003', _folio_order = 8 '\b',
>   _total_mapcount = {counter = -1}, _pincount = {counter = 0}, _folio_nr_pages = 0}
> (gdb)
>
> The folio has a NULL mapping, and an index of 18688, which means
> even if it was not a folio that has been invalidated or freed, the
> index is way outside the range we are looking for.
>
> If I step it round the lookup loop, xas does not change, and the
> same folio is returned every time through the loop. Perhaps
> the mapping tree itself might be corrupt???
>
> It's simple enough to stop the machine once it has become stuck to
> observe the iteration and dump structures, just tell me what you
> need to know from here...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

This bug emerges again and I would like to propose a reproduce
sequence of this bug which has nothing to do with scheduler stuff (
this could be wrong and sorry for wasting your time if so)

Thread_isolate:
1. alloc_contig_range->isolate_migratepages_block isolate a certain of
pages to cc->migratepages
       (folio has refcount: 1 + n (alloc_pages, page_cache))

2. alloc_contig_range->migrate_pages->folio_ref_freeze(folio, 1 +
extra_pins) set the folio->refcnt to 0

3. alloc_contig_range->migrate_pages->xas_split split the folios to
each slot as folio from slot[offset] to slot[offset + sibs]

Thread_truncate:
4. enter the livelock by the chain below
      rcu_read_lock();
        find_get_entry
            folio = xas_find
            if(!folio_try_get_rcu)
                xas_reset;
      rcu_read_unlock();

4'. alloc_contig_range->migrate_pages->__split_huge_page which will
modify folio's refcnt to 2 and breaks the livelock but is blocked by
lruvec->lock's contention

If the above call chain makes sense, could we solve this by below
modification which has split_folio and __split_huge_page be atomic by
taking lruvec->lock earlier than now.

int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
                                     unsigned int new_order)
{

+        lruvec = folio_lruvec_lock(folio);
                if (mapping) {
                        int nr = folio_nr_pages(folio);

                        xas_split(&xas, folio, folio_order(folio));
                        if (folio_test_pmd_mappable(folio) &&
                            new_order < HPAGE_PMD_ORDER) {
                                if (folio_test_swapbacked(folio)) {
                                        __lruvec_stat_mod_folio(folio,
                                                        NR_SHMEM_THPS, -nr);
                                } else {
                                        __lruvec_stat_mod_folio(folio,
                                                        NR_FILE_THPS, -nr);
                                        filemap_nr_thps_dec(mapping);
                                }
                        }
                }

                __split_huge_page(page, list, end, new_order);
+        folio_lruvec_unlock(folio);

[RFC] mm: move xa forward when run across zombie page

Commit Message

Comments

Patch