mm: use memalloc_nofs_save() in page_cache_ra_order()

Message ID 20240426112938.124740-1-wangkefeng.wang@huawei.com (mailing list archive)
State New, archived
Series mm: use memalloc_nofs_save() in page_cache_ra_order()

Commit Message

Kefeng Wang April 26, 2024, 11:29 a.m. UTC
As with commit f2c817bed58d ("mm: use memalloc_nofs_save in readahead
path"), ensure that page_cache_ra_order() does not attempt to reclaim
file-backed pages either, or it can lead to a deadlock. The issue was
found when testing ext4 large folios.

 INFO: task DataXceiver for:7494 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:DataXceiver for state:D stack:0     pid:7494  ppid:1      flags:0x00000200
 Call trace:
  __switch_to+0x14c/0x240
  __schedule+0x82c/0xdd0
  schedule+0x58/0xf0
  io_schedule+0x24/0xa0
  __folio_lock+0x130/0x300
  migrate_pages_batch+0x378/0x918
  migrate_pages+0x350/0x700
  compact_zone+0x63c/0xb38
  compact_zone_order+0xc0/0x118
  try_to_compact_pages+0xb0/0x280
  __alloc_pages_direct_compact+0x98/0x248
  __alloc_pages+0x510/0x1110
  alloc_pages+0x9c/0x130
  folio_alloc+0x20/0x78
  filemap_alloc_folio+0x8c/0x1b0
  page_cache_ra_order+0x174/0x308
  ondemand_readahead+0x1c8/0x2b8
  page_cache_async_ra+0x68/0xb8
  filemap_readahead.isra.0+0x64/0xa8
  filemap_get_pages+0x3fc/0x5b0
  filemap_splice_read+0xf4/0x280
  ext4_file_splice_read+0x2c/0x48 [ext4]
  vfs_splice_read.part.0+0xa8/0x118
  splice_direct_to_actor+0xbc/0x288
  do_splice_direct+0x9c/0x108
  do_sendfile+0x328/0x468
  __arm64_sys_sendfile64+0x8c/0x148
  invoke_syscall+0x4c/0x118
  el0_svc_common.constprop.0+0xc8/0xf0
  do_el0_svc+0x24/0x38
  el0_svc+0x4c/0x1f8
  el0t_64_sync_handler+0xc0/0xc8
  el0t_64_sync+0x188/0x190

Cc: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 mm/readahead.c | 4 ++++
 1 file changed, 4 insertions(+)
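
For context, the fix relies on the scoped NOFS API from <linux/sched/mm.h>:
memalloc_nofs_save() makes every allocation by the current task behave as if
__GFP_FS were clear until the matching memalloc_nofs_restore(). Readahead adds
locked, not-yet-uptodate folios to the page cache before submitting I/O, so a
later allocation in the same pass must not recurse into filesystem reclaim,
which can block on one of those locked folios (the compaction path in the
trace above is doing exactly that). The sketch below shows the pattern in
isolation; the helper name and its body are hypothetical, and the real change
is the diff at the bottom of this page.

#include <linux/sched/mm.h>	/* memalloc_nofs_save()/restore() */

/*
 * Hypothetical helper: perform allocations while holding locked folios
 * that the filesystem may need to make progress during reclaim.
 */
static void do_allocations_nofs(void)
{
	unsigned int nofs;

	/*
	 * Flag this task as being inside a filesystem operation; all
	 * allocations below are treated as if __GFP_FS were not set,
	 * so reclaim will not recurse into the filesystem.
	 */
	nofs = memalloc_nofs_save();

	/*
	 * ... allocate folios, add them locked to the page cache,
	 *     submit I/O, etc ...
	 */

	/* Restore the previous flags; save/restore pairs nest safely. */
	memalloc_nofs_restore(nofs);
}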

Comments

Andrew Morton April 26, 2024, 6:49 p.m. UTC | #1
On Fri, 26 Apr 2024 19:29:38 +0800 Kefeng Wang <wangkefeng.wang@huawei.com> wrote:

> As with commit f2c817bed58d ("mm: use memalloc_nofs_save in readahead
> path"), ensure that page_cache_ra_order() does not attempt to reclaim
> file-backed pages either, or it can lead to a deadlock. The issue was
> found when testing ext4 large folios.
> 
>  INFO: task DataXceiver for:7494 blocked for more than 120 seconds.
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  task:DataXceiver for state:D stack:0     pid:7494  ppid:1      flags:0x00000200
>  Call trace:
>   __switch_to+0x14c/0x240
>   __schedule+0x82c/0xdd0
>   schedule+0x58/0xf0
>   io_schedule+0x24/0xa0
>   __folio_lock+0x130/0x300
>   migrate_pages_batch+0x378/0x918
>   migrate_pages+0x350/0x700
>   compact_zone+0x63c/0xb38
>   compact_zone_order+0xc0/0x118
>   try_to_compact_pages+0xb0/0x280
>   __alloc_pages_direct_compact+0x98/0x248
>   __alloc_pages+0x510/0x1110
>   alloc_pages+0x9c/0x130
>   folio_alloc+0x20/0x78
>   filemap_alloc_folio+0x8c/0x1b0
>   page_cache_ra_order+0x174/0x308
>   ondemand_readahead+0x1c8/0x2b8
>   page_cache_async_ra+0x68/0xb8
>   filemap_readahead.isra.0+0x64/0xa8
>   filemap_get_pages+0x3fc/0x5b0
>   filemap_splice_read+0xf4/0x280
>   ext4_file_splice_read+0x2c/0x48 [ext4]
>   vfs_splice_read.part.0+0xa8/0x118
>   splice_direct_to_actor+0xbc/0x288
>   do_splice_direct+0x9c/0x108
>   do_sendfile+0x328/0x468
>   __arm64_sys_sendfile64+0x8c/0x148
>   invoke_syscall+0x4c/0x118
>   el0_svc_common.constprop.0+0xc8/0xf0
>   do_el0_svc+0x24/0x38
>   el0_svc+0x4c/0x1f8
>   el0t_64_sync_handler+0xc0/0xc8
>   el0t_64_sync+0x188/0x190
> 
> Cc: zhangyi (F) <yi.zhang@huawei.com>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>

I'm thinking

Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
Cc: stable

> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -494,6 +494,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  	pgoff_t index = readahead_index(ractl);
>  	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
>  	pgoff_t mark = index + ra->size - ra->async_size;
> +	unsigned int nofs;
>  	int err = 0;
>  	gfp_t gfp = readahead_gfp_mask(mapping);
>  
> @@ -508,6 +509,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  		new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>  	}
>  
> +	/* See comment in page_cache_ra_unbounded() */
> +	nofs = memalloc_nofs_save();
>  	filemap_invalidate_lock_shared(mapping);
>  	while (index <= limit) {
>  		unsigned int order = new_order;
> @@ -531,6 +534,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  
>  	read_pages(ractl);
>  	filemap_invalidate_unlock_shared(mapping);
> +	memalloc_nofs_restore(nofs);
>  
>  	/*
>  	 * If there were already pages in the page cache, then we may have
> -- 
> 2.41.0
Matthew Wilcox (Oracle) April 27, 2024, 3:45 a.m. UTC | #2
On Fri, Apr 26, 2024 at 11:49:05AM -0700, Andrew Morton wrote:
> On Fri, 26 Apr 2024 19:29:38 +0800 Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >   io_schedule+0x24/0xa0
> >   __folio_lock+0x130/0x300
> >   migrate_pages_batch+0x378/0x918
> >   migrate_pages+0x350/0x700
> >   compact_zone+0x63c/0xb38
> >   compact_zone_order+0xc0/0x118
> >   try_to_compact_pages+0xb0/0x280
> >   __alloc_pages_direct_compact+0x98/0x248
> >   __alloc_pages+0x510/0x1110
> >   alloc_pages+0x9c/0x130
> >   folio_alloc+0x20/0x78
> >   filemap_alloc_folio+0x8c/0x1b0
> >   page_cache_ra_order+0x174/0x308
> >   ondemand_readahead+0x1c8/0x2b8
> 
> I'm thinking
> 
> Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
> Cc: stable

I think it goes back earlier than that.
https://lore.kernel.org/linux-mm/20200128060304.GA6615@bombadil.infradead.org/
details how it can happen with the old readpages code.  It's just easier
to hit now.
Kefeng Wang April 28, 2024, 1:08 a.m. UTC | #3
On 2024/4/27 11:45, Matthew Wilcox wrote:
> On Fri, Apr 26, 2024 at 11:49:05AM -0700, Andrew Morton wrote:
>> On Fri, 26 Apr 2024 19:29:38 +0800 Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>    io_schedule+0x24/0xa0
>>>    __folio_lock+0x130/0x300
>>>    migrate_pages_batch+0x378/0x918
>>>    migrate_pages+0x350/0x700
>>>    compact_zone+0x63c/0xb38
>>>    compact_zone_order+0xc0/0x118
>>>    try_to_compact_pages+0xb0/0x280
>>>    __alloc_pages_direct_compact+0x98/0x248
>>>    __alloc_pages+0x510/0x1110
>>>    alloc_pages+0x9c/0x130
>>>    folio_alloc+0x20/0x78
>>>    filemap_alloc_folio+0x8c/0x1b0
>>>    page_cache_ra_order+0x174/0x308
>>>    ondemand_readahead+0x1c8/0x2b8
>>
>> I'm thinking
>>
>> Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
>> Cc: stable
> 
> I think it goes back earlier than that.
> https://lore.kernel.org/linux-mm/20200128060304.GA6615@bombadil.infradead.org/
> details how it can happen with the old readpages code.  It's just easier
> to hit now.
> 

page_cache_ra_order() was introduced in 793917d997df, but the earlier
bugfix f2c817bed58d ("mm: use memalloc_nofs_save in readahead path")
was not Cc'd to stable, so should that earlier patch be sent to stable
as well?

Patch

diff --git a/mm/readahead.c b/mm/readahead.c
index 63d6000103f0..c1b23989d9ca 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -494,6 +494,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	pgoff_t index = readahead_index(ractl);
 	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
 	pgoff_t mark = index + ra->size - ra->async_size;
+	unsigned int nofs;
 	int err = 0;
 	gfp_t gfp = readahead_gfp_mask(mapping);
 
@@ -508,6 +509,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		new_order = min_t(unsigned int, new_order, ilog2(ra->size));
 	}
 
+	/* See comment in page_cache_ra_unbounded() */
+	nofs = memalloc_nofs_save();
 	filemap_invalidate_lock_shared(mapping);
 	while (index <= limit) {
 		unsigned int order = new_order;
@@ -531,6 +534,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 
 	read_pages(ractl);
 	filemap_invalidate_unlock_shared(mapping);
+	memalloc_nofs_restore(nofs);
 
 	/*
 	 * If there were already pages in the page cache, then we may have