diff mbox series

netfs: Add retry stat counters

Message ID 3187377.1738056789@warthog.procyon.org.uk (mailing list archive)
State New
Headers show
Series netfs: Add retry stat counters | expand

Commit Message

David Howells Jan. 28, 2025, 9:33 a.m. UTC
Here's an additional patch to allow stats on the number of retries to be
obtained.  This isn't a fix per se.

David
---
From: David Howells <dhowells@redhat.com>

netfs: Add retry stat counters

Add stat counters to count the number of request and subrequest retries and
display them in /proc/fs/netfs/stats.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/netfs/internal.h    |    4 ++++
 fs/netfs/read_retry.c  |    3 +++
 fs/netfs/stats.c       |    9 +++++++++
 fs/netfs/write_issue.c |    1 +
 fs/netfs/write_retry.c |    2 ++
 5 files changed, 19 insertions(+)

Comments

Ihor Solodrai Jan. 28, 2025, 7:11 p.m. UTC | #1
January 28, 2025 at 1:33 AM, "David Howells" <dhowells@redhat.com> wrote:

> 
> Here's an additional patch to allow stats on the number of retries to be
> 
> obtained. This isn't a fix per se.
> 
> David
>
> [...]

Hi David, Marc.

I regret to report that this patch didn't fix the hanging in
v9fs_evict_inode when running selftests/bpf.

Here is what I tried to test:

  * Checked out latest bpf-next source tree (0fc5dddb9409)
  * Applied patch: https://lore.kernel.org/netfs/3173328.1738024385@warthog.procyon.org.uk/
  * Applied retry stats patch: https://lore.kernel.org/netfs/3187377.1738056789@warthog.procyon.org.uk/
  * Modified tools/testing/selftests/bpf/config to enable /proc/fs/netfs/stats
  * Modified CI scripts to collect the stats
  * Ran the shell script reproducing the CI testing pipeline

Bash piece starting a process collecting /proc/fs/netfs/stats:

    function tail_netfs {
        echo -n > /mnt/vmtest/netfs-stats.log
        while true; do
            echo >> /mnt/vmtest/netfs-stats.log
            cat /proc/fs/netfs/stats >> /mnt/vmtest/netfs-stats.log
            sleep 1
        done
    }
    export -f tail_netfs
    nohup bash -c 'tail_netfs' & disown

Last recored /proc/fs/netfs/stats (note 0 retries):

    Reads  : DR=0 RA=15184 RF=5 RS=0 WB=0 WBZ=0
    Writes : BW=488 WT=0 DW=0 WP=488 2C=0
    ZeroOps: ZR=7964 sh=0 sk=0
    DownOps: DL=15189 ds=15189 df=0 di=0
    CaRdOps: RD=0 rs=0 rf=0
    UpldOps: UL=488 us=488 uf=0
    CaWrOps: WR=0 ws=0 wf=0
    Retries: rq=0 rs=0 wq=0 ws=0
    Objs   : rr=2 sr=1 foq=1 wsc=0
    WbLock : skip=0 wait=0
    -- FS-Cache statistics --
    Cookies: n=0 v=0 vcol=0 voom=0
    Acquire: n=0 ok=0 oom=0
    LRU    : n=0 exp=0 rmv=0 drp=0 at=0
    Invals : n=0
    Updates: n=0 rsz=0 rsn=0
    Relinqs: n=0 rtr=0 drop=0
    NoSpace: nwr=0 ncr=0 cull=0
    IO     : rd=0 wr=0 mis=0

The stack on hung task hasn't changed:

    [  184.375149] INFO: task modprobe:2759 blocked for more than 20 seconds.
    [  184.376149]       Tainted: G           OE      6.13.0-gbb67a65a921c-dirty #3
    [  184.376593] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [  184.377119] task:modprobe        state:D stack:0     pid:2759  tgid:2759  ppid:455    flags:0x00004002
    [  184.377701] Call Trace:
    [  184.377886]  <TASK>
    [  184.378039]  __schedule+0xa91/0xe80
    [  184.378282]  schedule+0x41/0xb0
    [  184.378490]  v9fs_evict_inode+0xfe/0x170
    [  184.378754]  ? __pfx_var_wake_function+0x10/0x10
    [  184.379070]  evict+0x1ef/0x360
    [  184.379288]  __dentry_kill+0xb0/0x220
    [  184.379528]  ? dput+0x3a/0x1d0
    [  184.379736]  dput+0x114/0x1d0
    [  184.379946]  __fput+0x136/0x2b0
    [  184.380158]  task_work_run+0x89/0xc0
    [  184.380396]  do_exit+0x2c6/0x9c0
    [  184.380617]  do_group_exit+0xa4/0xb0
    [  184.380870]  __x64_sys_exit_group+0x17/0x20
    [  184.381137]  x64_sys_call+0x21a0/0x21a0
    [  184.381386]  do_syscall_64+0x79/0x120
    [  184.381630]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
    [  184.381969] RIP: 0033:0x7f817bf7c21d
    [  184.382202] RSP: 002b:00007fff92c8d148 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
    [  184.382676] RAX: ffffffffffffffda RBX: 00007f817c092fa8 RCX: 00007f817bf7c21d
    [  184.383138] RDX: 00000000000000e7 RSI: ffffffffffffff88 RDI: 0000000000000001
    [  184.383582] RBP: 00007fff92c8d1a0 R08: 00007fff92c8d0e8 R09: 0000000000000000
    [  184.384042] R10: 00007fff92c8d05f R11: 0000000000000206 R12: 0000000000000001
    [  184.384486] R13: 0000000000000000 R14: 0000000000000001 R15: 00007f817c092fc0
    [  184.384963]  </TASK>
    [  184.385112]
    [  184.385112] Showing all locks held in the system:
    [  184.385499] 1 lock held by khungtaskd/32:
    [  184.385793]  #0: ffffffff9d195d90 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x2e/0x180
    [  184.386366] 2 locks held by kworker/u8:10/455:
    [  184.386649]  #0: ffffa1a240104d48 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_scheduled_works+0x23a/0x600
    [  184.387357]  #1: ffffb06380a23e20 ((work_completion)(&sub_info->work)){+.+.}-{0:0}, at: process_scheduled_works+0x25a/0x600
    [  184.388076]
    [  184.388183] =============================================

I pushed full logs to github:
https://github.com/kernel-patches/bpf/commit/88c0d0e1692b04c0d54b7c1941003758d23e0d6a

I recommend trying to reproduce with steps I shared in my initial report:
https://lore.kernel.org/bpf/a7x33d4dnMdGTtRivptq6S1i8btK70SNBP2XyX_xwDAhLvgQoPox6FVBOkifq4eBinfFfbZlIkMZBe3QarlWTxoEtHZwJCZbNKtaqrR7PvI=@pm.me/

I know it may not be very convenient due to all the CI stuff, but you
should be able to use it to iterate on the kernel source locally and
narrow down the problem.

I have everything set up, so you also might share some debugging code
for me to run if you prefer.

Thanks.

---

Not directly related, but it took me a while to figure out how to
collect the netfs stats.

I first added:
    CONFIG_NETFS_DEBUG=y
    CONFIG_NETFS_STATS=y

But that didn't work, because /proc/fs/netfs/stats is created only
with CONFIG_FSCACHE_STATS (fs/netfs/main.c):

    #ifdef CONFIG_FSCACHE_STATS
            if (!proc_create_single("fs/netfs/stats", S_IFREG | 0444, NULL,
                                    netfs_stats_show))
                    goto error_procfile;
    #endif

And that depends on CONFIG_FSCACHE=y, so I ended up with:

    CONFIG_FSCACHE=y
    CONFIG_FSCACHE_STATS=y
    CONFIG_NETFS_SUPPORT=y
    CONFIG_NETFS_DEBUG=y
    CONFIG_NETFS_STATS=y
diff mbox series

Patch

diff --git a/fs/netfs/internal.h b/fs/netfs/internal.h
index eb76f98c894b..1c4f953c3d68 100644
--- a/fs/netfs/internal.h
+++ b/fs/netfs/internal.h
@@ -135,6 +135,8 @@  extern atomic_t netfs_n_rh_write_begin;
 extern atomic_t netfs_n_rh_write_done;
 extern atomic_t netfs_n_rh_write_failed;
 extern atomic_t netfs_n_rh_write_zskip;
+extern atomic_t netfs_n_rh_retry_read_req;
+extern atomic_t netfs_n_rh_retry_read_subreq;
 extern atomic_t netfs_n_wh_buffered_write;
 extern atomic_t netfs_n_wh_writethrough;
 extern atomic_t netfs_n_wh_dio_write;
@@ -147,6 +149,8 @@  extern atomic_t netfs_n_wh_upload_failed;
 extern atomic_t netfs_n_wh_write;
 extern atomic_t netfs_n_wh_write_done;
 extern atomic_t netfs_n_wh_write_failed;
+extern atomic_t netfs_n_wh_retry_write_req;
+extern atomic_t netfs_n_wh_retry_write_subreq;
 extern atomic_t netfs_n_wb_lock_skip;
 extern atomic_t netfs_n_wb_lock_wait;
 extern atomic_t netfs_n_folioq;
diff --git a/fs/netfs/read_retry.c b/fs/netfs/read_retry.c
index 8316c4533a51..0f294b26e08c 100644
--- a/fs/netfs/read_retry.c
+++ b/fs/netfs/read_retry.c
@@ -14,6 +14,7 @@  static void netfs_reissue_read(struct netfs_io_request *rreq,
 {
 	__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
 	__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
+	netfs_stat(&netfs_n_rh_retry_read_subreq);
 	subreq->rreq->netfs_ops->issue_read(subreq);
 }
 
@@ -260,6 +261,8 @@  void netfs_retry_reads(struct netfs_io_request *rreq)
 	struct netfs_io_stream *stream = &rreq->io_streams[0];
 	DEFINE_WAIT(myself);
 
+	netfs_stat(&netfs_n_rh_retry_read_req);
+
 	set_bit(NETFS_RREQ_RETRYING, &rreq->flags);
 
 	/* Wait for all outstanding I/O to quiesce before performing retries as
diff --git a/fs/netfs/stats.c b/fs/netfs/stats.c
index f1af344266cc..ab6b916addc4 100644
--- a/fs/netfs/stats.c
+++ b/fs/netfs/stats.c
@@ -29,6 +29,8 @@  atomic_t netfs_n_rh_write_begin;
 atomic_t netfs_n_rh_write_done;
 atomic_t netfs_n_rh_write_failed;
 atomic_t netfs_n_rh_write_zskip;
+atomic_t netfs_n_rh_retry_read_req;
+atomic_t netfs_n_rh_retry_read_subreq;
 atomic_t netfs_n_wh_buffered_write;
 atomic_t netfs_n_wh_writethrough;
 atomic_t netfs_n_wh_dio_write;
@@ -41,6 +43,8 @@  atomic_t netfs_n_wh_upload_failed;
 atomic_t netfs_n_wh_write;
 atomic_t netfs_n_wh_write_done;
 atomic_t netfs_n_wh_write_failed;
+atomic_t netfs_n_wh_retry_write_req;
+atomic_t netfs_n_wh_retry_write_subreq;
 atomic_t netfs_n_wb_lock_skip;
 atomic_t netfs_n_wb_lock_wait;
 atomic_t netfs_n_folioq;
@@ -81,6 +85,11 @@  int netfs_stats_show(struct seq_file *m, void *v)
 		   atomic_read(&netfs_n_wh_write),
 		   atomic_read(&netfs_n_wh_write_done),
 		   atomic_read(&netfs_n_wh_write_failed));
+	seq_printf(m, "Retries: rq=%u rs=%u wq=%u ws=%u\n",
+		   atomic_read(&netfs_n_rh_retry_read_req),
+		   atomic_read(&netfs_n_rh_retry_read_subreq),
+		   atomic_read(&netfs_n_wh_retry_write_req),
+		   atomic_read(&netfs_n_wh_retry_write_subreq));
 	seq_printf(m, "Objs   : rr=%u sr=%u foq=%u wsc=%u\n",
 		   atomic_read(&netfs_n_rh_rreq),
 		   atomic_read(&netfs_n_rh_sreq),
diff --git a/fs/netfs/write_issue.c b/fs/netfs/write_issue.c
index 69727411683e..77279fc5b5a7 100644
--- a/fs/netfs/write_issue.c
+++ b/fs/netfs/write_issue.c
@@ -253,6 +253,7 @@  void netfs_reissue_write(struct netfs_io_stream *stream,
 	subreq->retry_count++;
 	__clear_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
 	__set_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags);
+	netfs_stat(&netfs_n_wh_retry_write_subreq);
 	netfs_do_issue_write(stream, subreq);
 }
 
diff --git a/fs/netfs/write_retry.c b/fs/netfs/write_retry.c
index c841a851dd73..545d33079a77 100644
--- a/fs/netfs/write_retry.c
+++ b/fs/netfs/write_retry.c
@@ -203,6 +203,8 @@  void netfs_retry_writes(struct netfs_io_request *wreq)
 	struct netfs_io_stream *stream;
 	int s;
 
+	netfs_stat(&netfs_n_wh_retry_write_req);
+
 	/* Wait for all outstanding I/O to quiesce before performing retries as
 	 * we may need to renegotiate the I/O sizes.
 	 */