[SMB3] display stats counters for number of slow commands

Message ID	CAH2r5mvZyNddfAD3ahHFdf5LGTO1FGQsYH0CQJqNQBjUdfysTg@mail.gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-cifs-owner@kernel.org> MIME-Version: 1.0 From: Steve French <smfrench@gmail.com> Date: Sat, 4 Aug 2018 05:32:25 -0500 Message-ID: <CAH2r5mvZyNddfAD3ahHFdf5LGTO1FGQsYH0CQJqNQBjUdfysTg@mail.gmail.com> Subject: [PATCH][SMB3] display stats counters for number of slow commands To: CIFS <linux-cifs@vger.kernel.org>, samba-technical <samba-technical@lists.samba.org> Content-Type: multipart/mixed; boundary="0000000000001ec9880572999078" Sender: linux-cifs-owner@vger.kernel.org Precedence: bulk
Series	[SMB3] display stats counters for number of slow commands \| expand [SMB3] display stats counters for number of slow commands

Steve French Aug. 4, 2018, 10:32 a.m. UTC

When CONFIG_CIFS_STATS2 is enabled keep counters for slow
commands (ie server took longer than 1 second to respond)
by SMB2/SMB3 command code.  This can help in diagnosing
whether performance problems are on server (instead of
client) and which commands are causing the problem.

Sample output (the new lines contain words "slow responses ...")

$ cat /proc/fs/cifs/Stats
Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5
SMB Small Req/Resp Buffer: 1 Pool size: 30
Total Large 10 Small 490 Allocations
Operations (MIDs): 0

0 session 0 share reconnects
Total vfs operations: 67 maximum at one time: 2
4 slow responses from localhost for command 5
1 slow responses from localhost for command 6
1 slow responses from localhost for command 14
1 slow responses from localhost for command 16

1) \\localhost\test
SMBs: 243
Bytes read: 1024000  Bytes written: 104857600
TreeConnects: 1 total 0 failed
TreeDisconnects: 0 total 0 failed
Creates: 40 total 0 failed
Closes: 39 total 0 failed
...

Signed-off-by: Steve French <stfrench@microsoft.com>
---
 fs/cifs/cifs_debug.c | 19 +++++++++++++++++--
 fs/cifs/cifsglob.h   |  3 ++-
 fs/cifs/transport.c  |  5 +++++
 3 files changed, 24 insertions(+), 3 deletions(-)

Aurélien Aptel Aug. 6, 2018, 9:44 a.m. UTC | #1

Reviewed-by: Aurelien Aptel <aaptel@suse.com>

David Disseldorp Aug. 6, 2018, 11:47 a.m. UTC | #2

Hi Steve,

On Sat, 4 Aug 2018 05:32:25 -0500, Steve French wrote:

> When CONFIG_CIFS_STATS2 is enabled keep counters for slow
> commands (ie server took longer than 1 second to respond)
> by SMB2/SMB3 command code.  This can help in diagnosing
> whether performance problems are on server (instead of
> client) and which commands are causing the problem.
> 
> Sample output (the new lines contain words "slow responses ...")

Wouldn't putting these alongside the existing per-session op counts be
more suitable, e.g.
1) \\192.168.1.1\rapido-share
...
Creates: 1 total 0 failed 1 slow
Closes: 0 total 0 failed 0 slow
Flushes: 1 total 0 failed 1 slow
Reads: 0 total 0 failed 0 slow

It'd be helpful if this file included some sort of API version, so that
parsers like PCP[1] knew what format to expect. Alternatively, moving
to a configfs style format with one metric per file (similar to LIO)
might be more useful and extensible, e.g.
/sys/kernel/config/cifs/stats/sessions
/sys/kernel/config/cifs/stats/shares
...
/sys/kernel/config/cifs/stats/<session>/smbs
/sys/kernel/config/cifs/stats/<session>/creates/total
/sys/kernel/config/cifs/stats/<session>/creates/failed
/sys/kernel/config/cifs/stats/<session>/creates/slow
/sys/kernel/config/cifs/stats/<session>/reads/total
/sys/kernel/config/cifs/stats/<session>/reads/failed
/sys/kernel/config/cifs/stats/<session>/reads/slow
...

Cheers, David

1. PCP cifs.ko monitoring agent
https://github.com/performancecopilot/pcp/blob/master/src/pmdas/cifs/pmda.c
--
To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Aurélien Aptel Aug. 6, 2018, 2:01 p.m. UTC | #3

David Disseldorp <ddiss@samba.org> writes:
> Hi Steve,
>
> On Sat, 4 Aug 2018 05:32:25 -0500, Steve French wrote:
>
>> When CONFIG_CIFS_STATS2 is enabled keep counters for slow
>> commands (ie server took longer than 1 second to respond)
>> by SMB2/SMB3 command code.  This can help in diagnosing
>> whether performance problems are on server (instead of
>> client) and which commands are causing the problem.
>> 
>> Sample output (the new lines contain words "slow responses ...")
>
> Wouldn't putting these alongside the existing per-session op counts be
> more suitable, e.g.
> 1) \\192.168.1.1\rapido-share
> ...
> Creates: 1 total 0 failed 1 slow
> Closes: 0 total 0 failed 0 slow
> Flushes: 1 total 0 failed 1 slow
> Reads: 0 total 0 failed 0 slow
>
> It'd be helpful if this file included some sort of API version, so that
> parsers like PCP[1] knew what format to expect. Alternatively, moving
> to a configfs style format with one metric per file (similar to LIO)
> might be more useful and extensible, e.g.
> /sys/kernel/config/cifs/stats/sessions
> /sys/kernel/config/cifs/stats/shares
> ...
> /sys/kernel/config/cifs/stats/<session>/smbs
> /sys/kernel/config/cifs/stats/<session>/creates/total
> /sys/kernel/config/cifs/stats/<session>/creates/failed
> /sys/kernel/config/cifs/stats/<session>/creates/slow
> /sys/kernel/config/cifs/stats/<session>/reads/total
> /sys/kernel/config/cifs/stats/<session>/reads/failed
> /sys/kernel/config/cifs/stats/<session>/reads/slow
> ...

That proc file was never meant for machine consumption. Some have tried
(myself included) but obviously it break everytime we update something.

I like having a hierarchy but having multiple files means potential race
conditions when reading everything. Alternatively we could just output
everything we already do in some structured format like JSON or
YAML. There is a precedence for JSON in clock device nodes
(drivers/clk/clk.c).

Cheers,

David Disseldorp Aug. 6, 2018, 2:11 p.m. UTC | #4

On Mon, 06 Aug 2018 16:01:37 +0200, Aurélien Aptel wrote:

> I like having a hierarchy but having multiple files means potential race
> conditions when reading everything.

Does it really matter if multiple metrics can't be read atomically?
Other filesystems (XFS and Btrfs) seem to cope with this fine.

Cheers, David
--
To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Aurélien Aptel Aug. 6, 2018, 2:40 p.m. UTC | #5

David Disseldorp <ddiss@samba.org> writes:
> Does it really matter if multiple metrics can't be read atomically?
> Other filesystems (XFS and Btrfs) seem to cope with this fine.

It makes properly handling tcp connection/session/tree connection
closing and reconnecting a lot harder (session id changes, files will
fail to open etc etc). It's just a lot simpler when everything is in one
place. But yes it's not critical I guess.

Cheers,

Pavel Shilovskiy Aug. 6, 2018, 5:28 p.m. UTC | #6

Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>

-----Original Message-----
From: linux-cifs-owner@vger.kernel.org <linux-cifs-owner@vger.kernel.org> On Behalf Of Steve French
Sent: Saturday, August 4, 2018 3:32 AM
To: CIFS <linux-cifs@vger.kernel.org>; samba-technical <samba-technical@lists.samba.org>
Subject: [PATCH][SMB3] display stats counters for number of slow commands

When CONFIG_CIFS_STATS2 is enabled keep counters for slow commands (ie server took longer than 1 second to respond) by SMB2/SMB3 command code.  This can help in diagnosing whether performance problems are on server (instead of
client) and which commands are causing the problem.

Sample output (the new lines contain words "slow responses ...")

$ cat /proc/fs/cifs/Stats
Resources in use
CIFS Session: 1
Share (unique mount targets): 2
SMB Request/Response Buffer: 1 Pool size: 5 SMB Small Req/Resp Buffer: 1 Pool size: 30 Total Large 10 Small 490 Allocations Operations (MIDs): 0

0 session 0 share reconnects
Total vfs operations: 67 maximum at one time: 2
4 slow responses from localhost for command 5
1 slow responses from localhost for command 6
1 slow responses from localhost for command 14
1 slow responses from localhost for command 16

1) \\localhost\test
SMBs: 243
Bytes read: 1024000  Bytes written: 104857600
TreeConnects: 1 total 0 failed
TreeDisconnects: 0 total 0 failed
Creates: 40 total 0 failed
Closes: 39 total 0 failed
...

Signed-off-by: Steve French <stfrench@microsoft.com>
---
 fs/cifs/cifs_debug.c | 19 +++++++++++++++++--
 fs/cifs/cifsglob.h   |  3 ++-
 fs/cifs/transport.c  |  5 +++++
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c index 748cabd6d20b..f1fbea947fef 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -378,6 +378,8 @@ static ssize_t cifs_stats_proc_write(struct file *file,
     rc = kstrtobool_from_user(buffer, count, &bv);
     if (rc == 0) {
 #ifdef CONFIG_CIFS_STATS2
+        int i;
+
         atomic_set(&totBufAllocCount, 0);
         atomic_set(&totSmBufAllocCount, 0);  #endif /* CONFIG_CIFS_STATS2 */ @@ -389,6 +391,10 @@ static ssize_t cifs_stats_proc_write(struct file *file,
         list_for_each(tmp1, &cifs_tcp_ses_list) {
             server = list_entry(tmp1, struct TCP_Server_Info,
                         tcp_ses_list);
+#ifdef CONFIG_CIFS_STATS2
+            for (i = 0; i < NUMBER_OF_SMB2_COMMANDS; i++)
+                atomic_set(&server->smb2slowcmd[i], 0); #endif /* 
+CONFIG_CIFS_STATS2 */
             list_for_each(tmp2, &server->smb_ses_list) {
                 ses = list_entry(tmp2, struct cifs_ses,
                          smb_ses_list); @@ -417,13 +423,15 @@ static ssize_t cifs_stats_proc_write(struct file *file,  static int cifs_stats_proc_show(struct seq_file *m, void *v)  {
     int i;
+#ifdef CONFIG_CIFS_STATS2
+    int j;
+#endif /* STATS2 */
     struct list_head *tmp1, *tmp2, *tmp3;
     struct TCP_Server_Info *server;
     struct cifs_ses *ses;
     struct cifs_tcon *tcon;

-    seq_printf(m,
-            "Resources in use\nCIFS Session: %d\n",
+    seq_printf(m, "Resources in use\nCIFS Session: %d\n",
             sesInfoAllocCount.counter);
     seq_printf(m, "Share (unique mount targets): %d\n",
             tconInfoAllocCount.counter); @@ -452,6 +460,13 @@ static int cifs_stats_proc_show(struct seq_file *m, void *v)
     list_for_each(tmp1, &cifs_tcp_ses_list) {
         server = list_entry(tmp1, struct TCP_Server_Info,
                     tcp_ses_list);
+#ifdef CONFIG_CIFS_STATS2
+        for (j = 0; j < NUMBER_OF_SMB2_COMMANDS; j++)
+            if (atomic_read(&server->smb2slowcmd[j]))
+                seq_printf(m, "%d slow responses from %s for command %d\n",
+                    atomic_read(&server->smb2slowcmd[j]),
+                    server->hostname, j); #endif /* STATS2 */
         list_for_each(tmp2, &server->smb_ses_list) {
             ses = list_entry(tmp2, struct cifs_ses,
                      smb_ses_list);
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h index 0553929e8339..41803d374da0 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -680,7 +680,8 @@ struct TCP_Server_Info {  #ifdef CONFIG_CIFS_STATS2
     atomic_t in_send; /* requests trying to send */
     atomic_t num_waiters;   /* blocked waiting to get in sendrecv */
-#endif
+    atomic_t smb2slowcmd[NUMBER_OF_SMB2_COMMANDS]; /* count resps > 1 
+sec */ #endif /* STATS2 */
     unsigned int    max_read;
     unsigned int    max_write;
     __le16    cipher_type;
diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c index 92de5c528161..c53c0908d4c6 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -117,6 +117,11 @@ DeleteMidQEntry(struct mid_q_entry *midEntry)
        something is wrong, unless it is quite a slow link or server */
     if (time_after(now, midEntry->when_alloc + HZ) &&
         (midEntry->command != command)) {
+        /* smb2slowcmd[NUMBER_OF_SMB2_COMMANDS] counts by command */
+        if ((le16_to_cpu(midEntry->command) < NUMBER_OF_SMB2_COMMANDS) &&
+            (le16_to_cpu(midEntry->command) >= 0))
+            
+ cifs_stats_inc(&midEntry->server->smb2slowcmd[le16_to_cpu(midEntry->co
+ mmand)]);
+
         trace_smb3_slow_rsp(le16_to_cpu(midEntry->command),
                    midEntry->mid, midEntry->pid,
                    midEntry->when_sent, midEntry->when_received);


--
Thanks,

Steve

Steve French Aug. 6, 2018, 9:56 p.m. UTC | #7

The main reason I separated these counters from the per-tree
connection ones was that it MUCH simplified the coding of them and
made it much less likely that it would introduce a bug/regression. It
also takes up less memory which is a minor but real benefit. In the
place where these requests are pulled off the wire and where timings
were already checked, we don't yet know which tree connection if any
(at least 4 of the command types don't have a tree connection) it is
associated with - but we do know which socket and thus which host
name.  In a perfect world we would also track read and write and
close/flush timings (maybe even open) and produce nice statistics as
Volker and others have suggested, but I wanted to model more closely
the read/write timings after iostat or better examples, and this set
of statistics I wanted as a very low risk set of counters that would
help diagnose various types of problems that I have been seeing with
reconnect - where we have to both understand why the reconnect
occurred (and whether the server has been showing other signs of
problems, other slow requests before the hang which triggered the
timeout and reconnect) but also debug the subsequent problems.   There
are so many reconnect scenarios, getting more objective data on what
is going on in a particular time period with slow responses is one
useful piece of data that will help.  I expect that the reconnect
problems I have been discussing with Pavel and others will lead to
four or five fixes - but in the short term I needed something
objective to more easily determine when the server is having problems.
On Mon, Aug 6, 2018 at 6:47 AM David Disseldorp <ddiss@samba.org> wrote:
>
> Hi Steve,
>
> On Sat, 4 Aug 2018 05:32:25 -0500, Steve French wrote:
>
> > When CONFIG_CIFS_STATS2 is enabled keep counters for slow
> > commands (ie server took longer than 1 second to respond)
> > by SMB2/SMB3 command code.  This can help in diagnosing
> > whether performance problems are on server (instead of
> > client) and which commands are causing the problem.
> >
> > Sample output (the new lines contain words "slow responses ...")
>
> Wouldn't putting these alongside the existing per-session op counts be
> more suitable, e.g.
> 1) \\192.168.1.1\rapido-share
> ...
> Creates: 1 total 0 failed 1 slow
> Closes: 0 total 0 failed 0 slow
> Flushes: 1 total 0 failed 1 slow
> Reads: 0 total 0 failed 0 slow
>
> It'd be helpful if this file included some sort of API version, so that
> parsers like PCP[1] knew what format to expect. Alternatively, moving
> to a configfs style format with one metric per file (similar to LIO)
> might be more useful and extensible, e.g.
> /sys/kernel/config/cifs/stats/sessions
> /sys/kernel/config/cifs/stats/shares
> ...
> /sys/kernel/config/cifs/stats/<session>/smbs
> /sys/kernel/config/cifs/stats/<session>/creates/total
> /sys/kernel/config/cifs/stats/<session>/creates/failed
> /sys/kernel/config/cifs/stats/<session>/creates/slow
> /sys/kernel/config/cifs/stats/<session>/reads/total
> /sys/kernel/config/cifs/stats/<session>/reads/failed
> /sys/kernel/config/cifs/stats/<session>/reads/slow
> ...
>
> Cheers, David
>
> 1. PCP cifs.ko monitoring agent
> https://github.com/performancecopilot/pcp/blob/master/src/pmdas/cifs/pmda.c

[SMB3] display stats counters for number of slow commands

Commit Message

Comments

Patch