Message ID | CAH2r5mvZyNddfAD3ahHFdf5LGTO1FGQsYH0CQJqNQBjUdfysTg@mail.gmail.com (mailing list archive)
---|---
State | New, archived |
Series | [SMB3] display stats counters for number of slow commands
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Hi Steve,

On Sat, 4 Aug 2018 05:32:25 -0500, Steve French wrote:

> When CONFIG_CIFS_STATS2 is enabled keep counters for slow
> commands (ie server took longer than 1 second to respond)
> by SMB2/SMB3 command code. This can help in diagnosing
> whether performance problems are on server (instead of
> client) and which commands are causing the problem.
>
> Sample output (the new lines contain words "slow responses ...")

Wouldn't putting these alongside the existing per-session op counts be
more suitable, e.g.

1) \\192.168.1.1\rapido-share
...
Creates: 1 total 0 failed 1 slow
Closes: 0 total 0 failed 0 slow
Flushes: 1 total 0 failed 1 slow
Reads: 0 total 0 failed 0 slow

It'd be helpful if this file included some sort of API version, so that
parsers like PCP[1] knew what format to expect. Alternatively, moving to
a configfs style format with one metric per file (similar to LIO) might
be more useful and extensible, e.g.

/sys/kernel/config/cifs/stats/sessions
/sys/kernel/config/cifs/stats/shares
...
/sys/kernel/config/cifs/stats/<session>/smbs
/sys/kernel/config/cifs/stats/<session>/creates/total
/sys/kernel/config/cifs/stats/<session>/creates/failed
/sys/kernel/config/cifs/stats/<session>/creates/slow
/sys/kernel/config/cifs/stats/<session>/reads/total
/sys/kernel/config/cifs/stats/<session>/reads/failed
/sys/kernel/config/cifs/stats/<session>/reads/slow
...

Cheers, David

1. PCP cifs.ko monitoring agent
   https://github.com/performancecopilot/pcp/blob/master/src/pmdas/cifs/pmda.c
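To make the one-metric-per-file proposal concrete, here is a minimal user-space sketch of how a monitoring agent could read such a counter. The configfs paths are hypothetical (nothing like this exists in the CIFS client today), and the helper below is purely illustrative.

/*
 * Hypothetical example: read one integer counter from the proposed
 * one-metric-per-file layout. The path follows the sketch above and
 * does not exist in any current kernel.
 */
#include <stdio.h>

static long read_metric(const char *path)
{
	FILE *f = fopen(path, "r");
	long val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	long slow = read_metric("/sys/kernel/config/cifs/stats/session0/creates/slow");

	if (slow >= 0)
		printf("slow creates: %ld\n", slow);
	return 0;
}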
David Disseldorp <ddiss@samba.org> writes:
> Hi Steve,
>
> On Sat, 4 Aug 2018 05:32:25 -0500, Steve French wrote:
>
>> When CONFIG_CIFS_STATS2 is enabled keep counters for slow
>> commands (ie server took longer than 1 second to respond)
>> by SMB2/SMB3 command code. This can help in diagnosing
>> whether performance problems are on server (instead of
>> client) and which commands are causing the problem.
>>
>> Sample output (the new lines contain words "slow responses ...")
>
> Wouldn't putting these alongside the existing per-session op counts be
> more suitable, e.g.
> 1) \\192.168.1.1\rapido-share
> ...
> Creates: 1 total 0 failed 1 slow
> Closes: 0 total 0 failed 0 slow
> Flushes: 1 total 0 failed 1 slow
> Reads: 0 total 0 failed 0 slow
>
> It'd be helpful if this file included some sort of API version, so that
> parsers like PCP[1] knew what format to expect. Alternatively, moving
> to a configfs style format with one metric per file (similar to LIO)
> might be more useful and extensible, e.g.
> /sys/kernel/config/cifs/stats/sessions
> /sys/kernel/config/cifs/stats/shares
> ...
> /sys/kernel/config/cifs/stats/<session>/smbs
> /sys/kernel/config/cifs/stats/<session>/creates/total
> /sys/kernel/config/cifs/stats/<session>/creates/failed
> /sys/kernel/config/cifs/stats/<session>/creates/slow
> /sys/kernel/config/cifs/stats/<session>/reads/total
> /sys/kernel/config/cifs/stats/<session>/reads/failed
> /sys/kernel/config/cifs/stats/<session>/reads/slow
> ...

That proc file was never meant for machine consumption. Some have tried
(myself included), but it obviously breaks every time we update something.

I like having a hierarchy, but having multiple files means potential race
conditions when reading everything.

Alternatively, we could just output everything we already do in some
structured format like JSON or YAML. There is a precedent for JSON in
clock device nodes (drivers/clk/clk.c).

Cheers,
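As a rough illustration of the structured-output idea, the per-command slow-response counters from this patch could be rendered as one JSON object per server. The user-space mock below only shows the shape such output might take; the field names are invented and the kernel does not emit anything like this today.

/*
 * User-space mock of a JSON-shaped dump of the slow-response counters.
 * Field names and layout are invented for illustration only.
 */
#include <stdio.h>

#define NUMBER_OF_SMB2_COMMANDS 19	/* mirrors the kernel constant */

int main(void)
{
	/* pretend these values were read from one TCP_Server_Info */
	const char *hostname = "localhost";
	int slow[NUMBER_OF_SMB2_COMMANDS] = { [5] = 4, [6] = 1, [14] = 1, [16] = 1 };
	int i, first = 1;

	printf("{ \"server\": \"%s\", \"slow_responses\": {", hostname);
	for (i = 0; i < NUMBER_OF_SMB2_COMMANDS; i++) {
		if (!slow[i])
			continue;
		printf("%s \"%d\": %d", first ? "" : ",", i, slow[i]);
		first = 0;
	}
	printf(" } }\n");
	return 0;
}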
On Mon, 06 Aug 2018 16:01:37 +0200, Aurélien Aptel wrote:

> I like having a hierarchy, but having multiple files means potential race
> conditions when reading everything.

Does it really matter if multiple metrics can't be read atomically?
Other filesystems (XFS and Btrfs) seem to cope with this fine.

Cheers, David
David Disseldorp <ddiss@samba.org> writes:
> Does it really matter if multiple metrics can't be read atomically?
> Other filesystems (XFS and Btrfs) seem to cope with this fine.

It makes properly handling TCP connection/session/tree connection closing
and reconnecting a lot harder (session ids change, files will fail to
open, etc.). It's just a lot simpler when everything is in one place.
But yes, it's not critical I guess.

Cheers,
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>

-----Original Message-----
From: linux-cifs-owner@vger.kernel.org <linux-cifs-owner@vger.kernel.org> On Behalf Of Steve French
Sent: Saturday, August 4, 2018 3:32 AM
To: CIFS <linux-cifs@vger.kernel.org>; samba-technical <samba-technical@lists.samba.org>
Subject: [PATCH][SMB3] display stats counters for number of slow commands

When CONFIG_CIFS_STATS2 is enabled keep counters for slow commands (ie
server took longer than 1 second to respond) by SMB2/SMB3 command code.
This can help in diagnosing whether performance problems are on server
(instead of client) and which commands are causing the problem.

Sample output (the new lines contain words "slow responses ...")

$ cat /proc/fs/cifs/Stats
Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Total Large 10 Small 490 Allocations
Operations (MIDs): 0

0 session 0 share reconnects
Total vfs operations: 67 maximum at one time: 2

4 slow responses from localhost for command 5
1 slow responses from localhost for command 6
1 slow responses from localhost for command 14
1 slow responses from localhost for command 16
1) \\localhost\test
SMBs: 243
Bytes read: 1024000  Bytes written: 104857600
TreeConnects: 1 total 0 failed
TreeDisconnects: 0 total 0 failed
Creates: 40 total 0 failed
Closes: 39 total 0 failed
...

Signed-off-by: Steve French <stfrench@microsoft.com>
---
 fs/cifs/cifs_debug.c | 19 +++++++++++++++++--
 fs/cifs/cifsglob.h   |  3 ++-
 fs/cifs/transport.c  |  5 +++++
 3 files changed, 24 insertions(+), 3 deletions(-)

[...]

--
Thanks,

Steve
The main reason I separated these counters from the per-tree-connection
ones was that it MUCH simplified the coding and made it much less likely
to introduce a bug/regression. It also takes up less memory, which is a
minor but real benefit.

In the place where these requests are pulled off the wire and where
timings were already checked, we don't yet know which tree connection, if
any, the request is associated with (at least 4 of the command types don't
have a tree connection) - but we do know which socket and thus which host
name.

In a perfect world we would also track read, write and close/flush timings
(maybe even open) and produce nice statistics as Volker and others have
suggested, but I wanted to model the read/write timings more closely after
iostat or better examples. This set I wanted as a very low-risk group of
counters that would help diagnose the various types of problems I have
been seeing with reconnect - where we have to understand why the reconnect
occurred (and whether the server had been showing other signs of problems,
such as other slow requests before the hang which triggered the timeout
and reconnect), but also debug the subsequent problems. There are so many
reconnect scenarios that objective data on slow responses in a particular
time period is one useful piece of information. I expect that the
reconnect problems I have been discussing with Pavel and others will lead
to four or five fixes, but in the short term I needed something objective
to more easily determine when the server is having problems.

On Mon, Aug 6, 2018 at 6:47 AM David Disseldorp <ddiss@samba.org> wrote:
>
> Hi Steve,
>
> On Sat, 4 Aug 2018 05:32:25 -0500, Steve French wrote:
>
> > When CONFIG_CIFS_STATS2 is enabled keep counters for slow
> > commands (ie server took longer than 1 second to respond)
> > by SMB2/SMB3 command code. This can help in diagnosing
> > whether performance problems are on server (instead of
> > client) and which commands are causing the problem.
> >
> > Sample output (the new lines contain words "slow responses ...")
>
> Wouldn't putting these alongside the existing per-session op counts be
> more suitable, e.g.
> 1) \\192.168.1.1\rapido-share
> ...
> Creates: 1 total 0 failed 1 slow
> Closes: 0 total 0 failed 0 slow
> Flushes: 1 total 0 failed 1 slow
> Reads: 0 total 0 failed 0 slow
>
> It'd be helpful if this file included some sort of API version, so that
> parsers like PCP[1] knew what format to expect. Alternatively, moving
> to a configfs style format with one metric per file (similar to LIO)
> might be more useful and extensible, e.g.
> /sys/kernel/config/cifs/stats/sessions
> /sys/kernel/config/cifs/stats/shares
> ...
> /sys/kernel/config/cifs/stats/<session>/smbs
> /sys/kernel/config/cifs/stats/<session>/creates/total
> /sys/kernel/config/cifs/stats/<session>/creates/failed
> /sys/kernel/config/cifs/stats/<session>/creates/slow
> /sys/kernel/config/cifs/stats/<session>/reads/total
> /sys/kernel/config/cifs/stats/<session>/reads/failed
> /sys/kernel/config/cifs/stats/<session>/reads/slow
> ...
>
> Cheers, David
>
> 1. PCP cifs.ko monitoring agent
> https://github.com/performancecopilot/pcp/blob/master/src/pmdas/cifs/pmda.c
From 3d27dcac7a44cb22b8f75ba18fef9bfc88e506b4 Mon Sep 17 00:00:00 2001
From: Steve French <stfrench@microsoft.com>
Date: Sat, 4 Aug 2018 05:24:34 -0500
Subject: [PATCH] smb3: display stats counters for number of slow commands

When CONFIG_CIFS_STATS2 is enabled keep counters for slow
commands (ie server took longer than 1 second to respond)
by SMB2/SMB3 command code. This can help in diagnosing
whether performance problems are on server (instead of
client) and which commands are causing the problem.

Sample output (the new lines contain words "slow responses ...")

$ cat /proc/fs/cifs/Stats
Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Total Large 10 Small 490 Allocations
Operations (MIDs): 0

0 session 0 share reconnects
Total vfs operations: 67 maximum at one time: 2

4 slow responses from localhost for command 5
1 slow responses from localhost for command 6
1 slow responses from localhost for command 14
1 slow responses from localhost for command 16
1) \\localhost\test
SMBs: 243
Bytes read: 1024000  Bytes written: 104857600
TreeConnects: 1 total 0 failed
TreeDisconnects: 0 total 0 failed
Creates: 40 total 0 failed
Closes: 39 total 0 failed
...

Signed-off-by: Steve French <stfrench@microsoft.com>
---
 fs/cifs/cifs_debug.c | 19 +++++++++++++++++--
 fs/cifs/cifsglob.h   |  3 ++-
 fs/cifs/transport.c  |  5 +++++
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
index 748cabd6d20b..f1fbea947fef 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -378,6 +378,8 @@ static ssize_t cifs_stats_proc_write(struct file *file,
 	rc = kstrtobool_from_user(buffer, count, &bv);
 	if (rc == 0) {
 #ifdef CONFIG_CIFS_STATS2
+		int i;
+
 		atomic_set(&totBufAllocCount, 0);
 		atomic_set(&totSmBufAllocCount, 0);
 #endif /* CONFIG_CIFS_STATS2 */
@@ -389,6 +391,10 @@ static ssize_t cifs_stats_proc_write(struct file *file,
 		list_for_each(tmp1, &cifs_tcp_ses_list) {
 			server = list_entry(tmp1, struct TCP_Server_Info,
 					    tcp_ses_list);
+#ifdef CONFIG_CIFS_STATS2
+			for (i = 0; i < NUMBER_OF_SMB2_COMMANDS; i++)
+				atomic_set(&server->smb2slowcmd[i], 0);
+#endif /* CONFIG_CIFS_STATS2 */
 			list_for_each(tmp2, &server->smb_ses_list) {
 				ses = list_entry(tmp2, struct cifs_ses,
 						 smb_ses_list);
@@ -417,13 +423,15 @@ static ssize_t cifs_stats_proc_write(struct file *file,
 static int cifs_stats_proc_show(struct seq_file *m, void *v)
 {
 	int i;
+#ifdef CONFIG_CIFS_STATS2
+	int j;
+#endif /* STATS2 */
 	struct list_head *tmp1, *tmp2, *tmp3;
 	struct TCP_Server_Info *server;
 	struct cifs_ses *ses;
 	struct cifs_tcon *tcon;
 
-	seq_printf(m,
-			"Resources in use\nCIFS Session: %d\n",
+	seq_printf(m, "Resources in use\nCIFS Session: %d\n",
 			sesInfoAllocCount.counter);
 	seq_printf(m, "Share (unique mount targets): %d\n",
 			tconInfoAllocCount.counter);
@@ -452,6 +460,13 @@ static int cifs_stats_proc_show(struct seq_file *m, void *v)
 	list_for_each(tmp1, &cifs_tcp_ses_list) {
 		server = list_entry(tmp1, struct TCP_Server_Info,
 				    tcp_ses_list);
+#ifdef CONFIG_CIFS_STATS2
+		for (j = 0; j < NUMBER_OF_SMB2_COMMANDS; j++)
+			if (atomic_read(&server->smb2slowcmd[j]))
+				seq_printf(m, "%d slow responses from %s for command %d\n",
+					atomic_read(&server->smb2slowcmd[j]),
+					server->hostname, j);
+#endif /* STATS2 */
 		list_for_each(tmp2, &server->smb_ses_list) {
 			ses = list_entry(tmp2, struct cifs_ses,
 					 smb_ses_list);
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 0553929e8339..41803d374da0 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -680,7 +680,8 @@ struct TCP_Server_Info {
 #ifdef CONFIG_CIFS_STATS2
 	atomic_t in_send; /* requests trying to send */
 	atomic_t num_waiters;   /* blocked waiting to get in sendrecv */
-#endif
+	atomic_t smb2slowcmd[NUMBER_OF_SMB2_COMMANDS]; /* count resps > 1 sec */
+#endif /* STATS2 */
 	unsigned int	max_read;
 	unsigned int	max_write;
 	__le16	cipher_type;
diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index 92de5c528161..c53c0908d4c6 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -117,6 +117,11 @@ DeleteMidQEntry(struct mid_q_entry *midEntry)
 	   something is wrong, unless it is quite a slow link or server */
 	if (time_after(now, midEntry->when_alloc + HZ) &&
 	    (midEntry->command != command)) {
+		/* smb2slowcmd[NUMBER_OF_SMB2_COMMANDS] counts by command */
+		if ((le16_to_cpu(midEntry->command) < NUMBER_OF_SMB2_COMMANDS) &&
+		    (le16_to_cpu(midEntry->command) >= 0))
+			cifs_stats_inc(&midEntry->server->smb2slowcmd[le16_to_cpu(midEntry->command)]);
+
 		trace_smb3_slow_rsp(le16_to_cpu(midEntry->command), midEntry->mid,
 		       midEntry->pid, midEntry->when_sent,
 		       midEntry->when_received);
-- 
2.17.1
When CONFIG_CIFS_STATS2 is enabled keep counters for slow
commands (ie server took longer than 1 second to respond)
by SMB2/SMB3 command code. This can help in diagnosing
whether performance problems are on server (instead of
client) and which commands are causing the problem.

Sample output (the new lines contain words "slow responses ...")

$ cat /proc/fs/cifs/Stats
Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Total Large 10 Small 490 Allocations
Operations (MIDs): 0

0 session 0 share reconnects
Total vfs operations: 67 maximum at one time: 2

4 slow responses from localhost for command 5
1 slow responses from localhost for command 6
1 slow responses from localhost for command 14
1 slow responses from localhost for command 16
1) \\localhost\test
SMBs: 243
Bytes read: 1024000  Bytes written: 104857600
TreeConnects: 1 total 0 failed
TreeDisconnects: 0 total 0 failed
Creates: 40 total 0 failed
Closes: 39 total 0 failed
...

Signed-off-by: Steve French <stfrench@microsoft.com>
---
 fs/cifs/cifs_debug.c | 19 +++++++++++++++++--
 fs/cifs/cifsglob.h   |  3 ++-
 fs/cifs/transport.c  |  5 +++++
 3 files changed, 24 insertions(+), 3 deletions(-)
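For anyone wanting to consume the new counters from a script or monitoring agent, here is a minimal sketch of a reader. The line format comes from the seq_printf() call in cifs_stats_proc_show() in the patch above; the buffer sizes and error handling are illustrative only.

/*
 * Minimal sketch: pick the new "slow responses" lines out of
 * /proc/fs/cifs/Stats. The format string matches the seq_printf()
 * in the patch; everything else here is illustrative.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/fs/cifs/Stats", "r");
	char line[256], host[128];
	int count, cmd;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%d slow responses from %127s for command %d",
			   &count, host, &cmd) == 3)
			printf("host=%s command=%d slow=%d\n", host, cmd, count);
	}
	fclose(f);
	return 0;
}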