[0/9] Multiple network connections for a single NFS mount.

Message ID: 155917564898.3988.6096672032831115016.stgit@noble.brown

Message

NeilBrown May 30, 2019, 12:41 a.m. UTC
This patch set is based on the patches in the multipath_tcp branch of
 git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git

I'd like to add my voice to those supporting this work and wanting to
see it land.
We have had customers/partners wanting this sort of functionality for
years.  In SLES releases prior to SLE15, we've provided a
"nosharetransport" mount option, so that several filesystems could be
mounted from the same server and each would get its own TCP
connection.
In SLE15 we are using this 'nconnect' feature, which is much nicer.

Partners have assured us that it improves total throughput,
particularly with bonded networks, but we didn't have any concrete
data until Olga Kornievskaia provided some test data - thanks
Olga!

My understanding, as I explain in one of the patches, is that parallel
hardware is normally utilized by distributing flows, rather than
packets.  This avoids out-of-order delivery of packets in a flow.
So multiple flows are needed to utilize parallel hardware.

An earlier version of this patch set was posted in April 2017 and
Chuck raised two issues:
 1/ mountstats only reports on one xprt per mount
 2/ session establishment needs to happen on a single xprt, as you
    cannot bind other xprts to the session until the session is
    established.
I've added patches to address these, and also to add the extra xprts
to the debugfs info.

I've also re-arranged the patches a bit, merged two, and removed the
restriction to TCP and NFSv4.x, x>=1.  Discussions seemed to suggest
these restrictions were not needed, and I can see no need for them.

There is a bug with the load balancing code from Trond's tree.
While an xprt is attached to a client, the queuelen is incremented.
Some requests (particularly BIND_CONN_TO_SESSION) pass in an xprt,
and the queuelen is not incremented in this case, but it is
decremented.  This causes it to go 'negative' and havoc results.
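
To make the imbalance easier to see, here is a minimal, self-contained C
sketch of the accounting pattern just described; the names (xprt,
queuelen, attach/detach) are only illustrative, not the actual SUNRPC
code.

#include <stdio.h>

/* Illustrative stand-in for a transport with a queued-request counter. */
struct xprt { int queuelen; };

/* Normal path: choosing a transport bumps its queue length. */
static struct xprt *attach(struct xprt *x) { x->queuelen++; return x; }

/* Completion path: always decrements, however the transport was chosen. */
static void detach(struct xprt *x) { x->queuelen--; }

int main(void)
{
	struct xprt x = { 0 };

	/* An ordinary request: balanced increment and decrement. */
	detach(attach(&x));

	/* A request that supplies its own transport (as BIND_CONN_TO_SESSION
	 * does) skips attach(), yet completion still calls detach(). */
	detach(&x);

	/* queuelen is now -1, and any load balancing based on it misbehaves. */
	printf("queuelen = %d\n", x.queuelen);
	return 0;
}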

I wonder if the last three patches (*Allow multiple connection*) could
be merged into a single patch.

I haven't given much thought to automatically determining the optimal
number of connections, but I doubt it can be done transparently with
any reliability.  When adding a connection improves throughput, then
it was almost certainly a good thing to do. When adding a connection
doesn't improve throughput, the implications are less obvious.
My feeling is that a protocol enhancement where the server suggests an
upper limit and the client increases toward that limit when it notices
xmit backlog, would be about the best we could do.  But we would need
a lot more experience with the functionality first.
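
As a rough sketch of that heuristic (everything here is hypothetical:
no such server-advertised limit exists in the protocol today, and the
field names are invented purely for illustration):

#include <stdio.h>

/* Hypothetical tuning state for one mount; none of these fields exist in
 * the current protocol or implementation. */
struct conn_tuner {
	int nconn;        /* connections currently open */
	int server_limit; /* upper limit the server would advertise */
	int backlog;      /* sampled transmit-queue backlog */
	int threshold;    /* backlog level that justifies another connection */
};

/* Grow toward the server-suggested limit only while there is evidence
 * (a transmit backlog) that another connection might help. */
static int want_more_connections(const struct conn_tuner *t)
{
	return t->nconn < t->server_limit && t->backlog > t->threshold;
}

int main(void)
{
	struct conn_tuner t = { .nconn = 1, .server_limit = 4,
				.backlog = 10, .threshold = 3 };

	while (want_more_connections(&t)) {
		t.nconn++;      /* open another connection */
		t.backlog /= 2; /* pretend the backlog eased as a result */
	}
	printf("settled on %d connections\n", t.nconn);
	return 0;
}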

Comments most welcome.  I'd love to see this, or something similar,
merged.

Thanks,
NeilBrown

---

NeilBrown (4):
      NFS: send state management on a single connection.
      SUNRPC: enhance rpc_clnt_show_stats() to report on all xprts.
      SUNRPC: add links for all client xprts to debugfs

Trond Myklebust (5):
      SUNRPC: Add basic load balancing to the transport switch
      SUNRPC: Allow creation of RPC clients with multiple connections
      NFS: Add a mount option to specify number of TCP connections to use
      NFSv4: Allow multiple connections to NFSv4.x servers
      pNFS: Allow multiple connections to the DS
      NFS: Allow multiple connections to a NFSv2 or NFSv3 server


 fs/nfs/client.c                      |    3 +
 fs/nfs/internal.h                    |    2 +
 fs/nfs/nfs3client.c                  |    1 
 fs/nfs/nfs4client.c                  |   13 ++++-
 fs/nfs/nfs4proc.c                    |   22 +++++---
 fs/nfs/super.c                       |   12 ++++
 include/linux/nfs_fs_sb.h            |    1 
 include/linux/sunrpc/clnt.h          |    1 
 include/linux/sunrpc/sched.h         |    1 
 include/linux/sunrpc/xprt.h          |    1 
 include/linux/sunrpc/xprtmultipath.h |    2 +
 net/sunrpc/clnt.c                    |   98 ++++++++++++++++++++++++++++++++--
 net/sunrpc/debugfs.c                 |   46 ++++++++++------
 net/sunrpc/sched.c                   |    3 +
 net/sunrpc/stats.c                   |   15 +++--
 net/sunrpc/sunrpc.h                  |    3 +
 net/sunrpc/xprtmultipath.c           |   23 +++++++-
 17 files changed, 204 insertions(+), 43 deletions(-)

--
Signature

Comments

Tom Talpey May 30, 2019, 5:05 p.m. UTC | #1
On 5/29/2019 8:41 PM, NeilBrown wrote:
> I've also re-arrange the patches a bit, merged two, and remove the
> restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
> these restrictions were not needed, I can see no need.

I believe the need is for the correctness of retries. Because NFSv2,
NFSv3 and NFSv4.0 have no exactly-once semantics of their own, server
duplicate request caches are important (although often imperfect).
These caches use client XID's, source ports and addresses, sometimes
in addition to other methods, to detect retry. Existing clients are
careful to reconnect with the same source port, to ensure this. And
existing servers won't change.
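
As a purely conceptual illustration of the kind of key such a cache
matches on (not any particular server's implementation), it is roughly:

#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative duplicate-request-cache key: a retry is recognised only if
 * the XID *and* the source address/port (and the call itself) match. */
struct drc_key {
	uint32_t xid;              /* RPC transaction id */
	struct in_addr src_addr;   /* client source address */
	uint16_t src_port;         /* client source port */
	uint32_t prog, vers, proc; /* RPC program/version/procedure */
};

static int drc_match(const struct drc_key *a, const struct drc_key *b)
{
	return a->xid == b->xid &&
	       a->src_port == b->src_port &&
	       a->src_addr.s_addr == b->src_addr.s_addr &&
	       a->prog == b->prog && a->vers == b->vers && a->proc == b->proc;
}

int main(void)
{
	struct drc_key first = { .xid = 0x1234, .src_port = 871 };
	struct drc_key retry = first;

	retry.src_port = 872;   /* same call retried from a new source port */
	printf("recognised as retry: %s\n",
	       drc_match(&first, &retry) ? "yes" : "no");
	return 0;
}

With a key like this, a retry arriving from a different source port can
never match, which is exactly the concern being raised above.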

Multiple connections will result in multiple source ports, and possibly
multiple source addresses, meaning retried client requests may be
accepted as new, rather than having any chance of being recognized as
retries.

NFSv4.1+ don't have this issue, but removing the restrictions would
seem to break the downlevel mounts.

Tom.
Olga Kornievskaia May 30, 2019, 5:20 p.m. UTC | #2
On Thu, May 30, 2019 at 1:05 PM Tom Talpey <tom@talpey.com> wrote:
>
> On 5/29/2019 8:41 PM, NeilBrown wrote:
> > I've also re-arrange the patches a bit, merged two, and remove the
> > restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
> > these restrictions were not needed, I can see no need.
>
> I believe the need is for the correctness of retries. Because NFSv2,
> NFSv3 and NFSv4.0 have no exactly-once semantics of their own, server
> duplicate request caches are important (although often imperfect).
> These caches use client XID's, source ports and addresses, sometimes
> in addition to other methods, to detect retry. Existing clients are
> careful to reconnect with the same source port, to ensure this. And
> existing servers won't change.

Retries are already bound to the same connection so there shouldn't be
an issue of a retransmission coming from a different source port.

> Multiple connections will result in multiple source ports, and possibly
> multiple source addresses, meaning retried client requests may be
> accepted as new, rather than having any chance of being recognized as
> retries.
>
> NFSv4.1+ don't have this issue, but removing the restrictions would
> seem to break the downlevel mounts.
>
> Tom.
>
Tom Talpey May 30, 2019, 5:41 p.m. UTC | #3
On 5/30/2019 1:20 PM, Olga Kornievskaia wrote:
> On Thu, May 30, 2019 at 1:05 PM Tom Talpey <tom@talpey.com> wrote:
>>
>> On 5/29/2019 8:41 PM, NeilBrown wrote:
>>> I've also re-arrange the patches a bit, merged two, and remove the
>>> restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
>>> these restrictions were not needed, I can see no need.
>>
>> I believe the need is for the correctness of retries. Because NFSv2,
>> NFSv3 and NFSv4.0 have no exactly-once semantics of their own, server
>> duplicate request caches are important (although often imperfect).
>> These caches use client XID's, source ports and addresses, sometimes
>> in addition to other methods, to detect retry. Existing clients are
>> careful to reconnect with the same source port, to ensure this. And
>> existing servers won't change.
> 
> Retries are already bound to the same connection so there shouldn't be
> an issue of a retransmission coming from a different source port.

So, there's no path redundancy? If any connection is lost and can't
be reestablished, the requests on that connection will time out?

I think a common configuration will be two NICs and two network paths,
a so-called shotgun. Admins will be quite frustrated to discover it
gives no additional robustness, and perhaps even less.

Why not simply restrict this to the fully-correct, fully-functional
NFSv4.1+ scenario, and not try to paper over the shortcomings?

Tom.

> 
>> Multiple connections will result in multiple source ports, and possibly
>> multiple source addresses, meaning retried client requests may be
>> accepted as new, rather than having any chance of being recognized as
>> retries.
>>
>> NFSv4.1+ don't have this issue, but removing the restrictions would
>> seem to break the downlevel mounts.
>>
>> Tom.
>>
> 
>
Chuck Lever III May 30, 2019, 5:56 p.m. UTC | #4
Hi Neil-

Thanks for chasing this a little further.


> On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
> 
> This patch set is based on the patches in the multipath_tcp branch of
> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
> 
> I'd like to add my voice to those supporting this work and wanting to
> see it land.
> We have had customers/partners wanting this sort of functionality for
> years.  In SLES releases prior to SLE15, we've provide a
> "nosharetransport" mount option, so that several filesystem could be
> mounted from the same server and each would get its own TCP
> connection.

Is it well understood why splitting up the TCP connections results
in better performance?


> In SLE15 we are using this 'nconnect' feature, which is much nicer.
> 
> Partners have assured us that it improves total throughput,
> particularly with bonded networks, but we haven't had any concrete
> data until Olga Kornievskaia provided some concrete test data - thanks
> Olga!
> 
> My understanding, as I explain in one of the patches, is that parallel
> hardware is normally utilized by distributing flows, rather than
> packets.  This avoid out-of-order deliver of packets in a flow.
> So multiple flows are needed to utilizes parallel hardware.

Indeed.

However I think one of the problems is what happens in simpler scenarios.
We had reports that using nconnect > 1 on virtual clients made things
go slower. It's not always wise to establish multiple connections
between the same two IP addresses. It depends on the hardware on each
end, and the network conditions.


> An earlier version of this patch set was posted in April 2017 and
> Chuck raised two issues:
> 1/ mountstats only reports on one xprt per mount
> 2/ session establishment needs to happen on a single xprt, as you
>    cannot bind other xprts to the session until the session is
>    established.
> I've added patches to address these, and also to add the extra xprts
> to the debugfs info.
> 
> I've also re-arrange the patches a bit, merged two, and remove the
> restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
> these restrictions were not needed, I can see no need.

RDMA could certainly benefit for exactly the reason you describe above.


> There is a bug with the load balancing code from Trond's tree.
> While an xprt is attached to a client, the queuelen is incremented.
> Some requests (particularly BIND_CONN_TO_SESSION) pass in an xprt,
> and the queuelen was not incremented in this case, but it was
> decremented.  This causes it to go 'negative' and havoc results.
> 
> I wonder if the last three patches (*Allow multiple connection*) could
> be merged into a single patch.
> 
> I haven't given much thought to automatically determining the optimal
> number of connections, but I doubt it can be done transparently with
> any reliability.

A Solaris client can open up to 8 connections to a server, but there
are always some scenarios where the heuristic creates too many
connections and becomes a performance issue.

We also have concerns about running the client out of privileged port
space.

The problem with nconnect is that it can work well, but it can also be
a very easy way to shoot yourself in the foot.

I also share the concerns about dealing properly with retransmission
and NFSv4 sessions.


> When adding a connection improves throughput, then
> it was almost certainly a good thing to do. When adding a connection
> doesn't improve throughput, the implications are less obvious.
> My feeling is that a protocol enhancement where the serve suggests an
> upper limit and the client increases toward that limit when it notices
> xmit backlog, would be about the best we could do.  But we would need
> a lot more experience with the functionality first.

What about situations where the network capabilities between server and
client change? Problem is that neither endpoint can detect that; TCP
usually just deals with it.

Related Work:

We now have protocol (more like conventions) for clients to discover
when a server has additional endpoints so that it can establish
connections to each of them.

https://datatracker.ietf.org/doc/rfc8587/

and

https://datatracker.ietf.org/doc/draft-ietf-nfsv4-rfc5661-msns-update/

Boiled down, the client uses fs_locations and trunking detection to
figure out when two IP addresses are the same server instance.
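
Boiled down still further, the comparison amounts to checking the
identity the server returns from EXCHANGE_ID at one address against
what another address returned; a simplified sketch (field names loosely
follow RFC 5661 and are not real kernel structures):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified view of the identity a server returns from EXCHANGE_ID; the
 * real fields are variable-length opaques (see RFC 5661). */
struct server_identity {
	char major_id[64]; size_t major_len; /* server_owner.so_major_id */
	char scope[64];    size_t scope_len; /* server scope */
	uint64_t minor_id;                   /* server_owner.so_minor_id */
};

/* Two addresses belong to the same server instance when the major owner id
 * and scope match; a matching minor id additionally allows the connections
 * to be bound to one session (session trunking). */
static int same_server(const struct server_identity *a,
		       const struct server_identity *b)
{
	return a->major_len == b->major_len && a->scope_len == b->scope_len &&
	       memcmp(a->major_id, b->major_id, a->major_len) == 0 &&
	       memcmp(a->scope, b->scope, a->scope_len) == 0;
}

int main(void)
{
	struct server_identity a = { "serverA", 7, "scope1", 6, 1 };
	struct server_identity b = { "serverA", 7, "scope1", 6, 1 };

	printf("same server instance: %s\n", same_server(&a, &b) ? "yes" : "no");
	return 0;
}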

This facility can also be used to establish a connection over a
different path if network connectivity is lost.

There has also been some exploration of MP-TCP. The magic happens
under the transport socket in the network layer, and the RPC client
is not involved.


> Comments most welcome.  I'd love to see this, or something similar,
> merged.
> 
> Thanks,
> NeilBrown
> 
> ---
> 
> NeilBrown (4):
>      NFS: send state management on a single connection.
>      SUNRPC: enhance rpc_clnt_show_stats() to report on all xprts.
>      SUNRPC: add links for all client xprts to debugfs
> 
> Trond Myklebust (5):
>      SUNRPC: Add basic load balancing to the transport switch
>      SUNRPC: Allow creation of RPC clients with multiple connections
>      NFS: Add a mount option to specify number of TCP connections to use
>      NFSv4: Allow multiple connections to NFSv4.x servers
>      pNFS: Allow multiple connections to the DS
>      NFS: Allow multiple connections to a NFSv2 or NFSv3 server
> 
> 
> fs/nfs/client.c                      |    3 +
> fs/nfs/internal.h                    |    2 +
> fs/nfs/nfs3client.c                  |    1 
> fs/nfs/nfs4client.c                  |   13 ++++-
> fs/nfs/nfs4proc.c                    |   22 +++++---
> fs/nfs/super.c                       |   12 ++++
> include/linux/nfs_fs_sb.h            |    1 
> include/linux/sunrpc/clnt.h          |    1 
> include/linux/sunrpc/sched.h         |    1 
> include/linux/sunrpc/xprt.h          |    1 
> include/linux/sunrpc/xprtmultipath.h |    2 +
> net/sunrpc/clnt.c                    |   98 ++++++++++++++++++++++++++++++++--
> net/sunrpc/debugfs.c                 |   46 ++++++++++------
> net/sunrpc/sched.c                   |    3 +
> net/sunrpc/stats.c                   |   15 +++--
> net/sunrpc/sunrpc.h                  |    3 +
> net/sunrpc/xprtmultipath.c           |   23 +++++++-
> 17 files changed, 204 insertions(+), 43 deletions(-)
> 
> --
> Signature
> 

--
Chuck Lever
Olga Kornievskaia May 30, 2019, 6:41 p.m. UTC | #5
On Thu, May 30, 2019 at 1:41 PM Tom Talpey <tom@talpey.com> wrote:
>
> On 5/30/2019 1:20 PM, Olga Kornievskaia wrote:
> > On Thu, May 30, 2019 at 1:05 PM Tom Talpey <tom@talpey.com> wrote:
> >>
> >> On 5/29/2019 8:41 PM, NeilBrown wrote:
> >>> I've also re-arrange the patches a bit, merged two, and remove the
> >>> restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
> >>> these restrictions were not needed, I can see no need.
> >>
> >> I believe the need is for the correctness of retries. Because NFSv2,
> >> NFSv3 and NFSv4.0 have no exactly-once semantics of their own, server
> >> duplicate request caches are important (although often imperfect).
> >> These caches use client XID's, source ports and addresses, sometimes
> >> in addition to other methods, to detect retry. Existing clients are
> >> careful to reconnect with the same source port, to ensure this. And
> >> existing servers won't change.
> >
> > Retries are already bound to the same connection so there shouldn't be
> > an issue of a retransmission coming from a different source port.
>
> So, there's no path redundancy? If any connection is lost and can't
> be reestablished, the requests on that connection will time out?

For v3 and v4.0 in the current code base with a single connection,
when it goes down, you are out of luck. When we have multiple
connections and would like the benefit of using them but not
sacrifice replay cache correctness, it's a small price to restrict
the re-transmissions and suffer the consequence of not being able to
do an operation during network issues.

> I think a common configuration will be two NICs and two network paths,

Are you talking about session trunking here?

Why do you think two NICs would be a common configuration? I have
performance numbers that demonstrate performance improvement for a
single NIC case. I would say a single NIC with a high-speed network
(25/40G) would be a common configuration.

> a so-called shotgun. Admins will be quite frustrated to discover it
> gives no additional robustness, and perhaps even less.
>
> Why not simply restrict this to the fully-correct, fully-functional
> NFSv4.1+ scenario, and not try to paper over the shortcomings?

I think mainly because customers are still using v3 but want to
improve performance. I'd love for everybody to switch to 4.1 but
that's not happening.

>
> Tom.
>
> >
> >> Multiple connections will result in multiple source ports, and possibly
> >> multiple source addresses, meaning retried client requests may be
> >> accepted as new, rather than having any chance of being recognized as
> >> retries.
> >>
> >> NFSv4.1+ don't have this issue, but removing the restrictions would
> >> seem to break the downlevel mounts.
> >>
> >> Tom.
> >>
> >
> >
Olga Kornievskaia May 30, 2019, 6:59 p.m. UTC | #6
On Thu, May 30, 2019 at 1:57 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> Hi Neil-
>
> Thanks for chasing this a little further.
>
>
> > On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
> >
> > This patch set is based on the patches in the multipath_tcp branch of
> > git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
> >
> > I'd like to add my voice to those supporting this work and wanting to
> > see it land.
> > We have had customers/partners wanting this sort of functionality for
> > years.  In SLES releases prior to SLE15, we've provide a
> > "nosharetransport" mount option, so that several filesystem could be
> > mounted from the same server and each would get its own TCP
> > connection.
>
> Is it well understood why splitting up the TCP connections result
> in better performance?

Historically, NFS has not been able to fill up a high-speed pipe.
There have been studies that showed a negative interaction between VM
dirty page flushing and TCP window behavior that leads to bad
performance (the VM flushes too aggressively, which creates TCP
congestion, so the window closes, which then makes the VM stop
flushing, and so the system oscillates between bad states and always
underperforms). But that aside, there might be different server
implementations that would perform better when multiple connections
are used.

I forget the details but there used to be (might still be) data
transfer and storage challenges at conferences to see who can transfer
the largest amount of data fastest. They have all shown that to
accomplish that you need to have multiple TCP connections.

But to answer: no, I don't think it's "well understood" why splitting
up the TCP connection performs better.

> > In SLE15 we are using this 'nconnect' feature, which is much nicer.
> >
> > Partners have assured us that it improves total throughput,
> > particularly with bonded networks, but we haven't had any concrete
> > data until Olga Kornievskaia provided some concrete test data - thanks
> > Olga!
> >
> > My understanding, as I explain in one of the patches, is that parallel
> > hardware is normally utilized by distributing flows, rather than
> > packets.  This avoid out-of-order deliver of packets in a flow.
> > So multiple flows are needed to utilizes parallel hardware.
>
> Indeed.
>
> However I think one of the problems is what happens in simpler scenarios.
> We had reports that using nconnect > 1 on virtual clients made things
> go slower. It's not always wise to establish multiple connections
> between the same two IP addresses. It depends on the hardware on each
> end, and the network conditions.
>
>
> > An earlier version of this patch set was posted in April 2017 and
> > Chuck raised two issues:
> > 1/ mountstats only reports on one xprt per mount
> > 2/ session establishment needs to happen on a single xprt, as you
> >    cannot bind other xprts to the session until the session is
> >    established.
> > I've added patches to address these, and also to add the extra xprts
> > to the debugfs info.
> >
> > I've also re-arrange the patches a bit, merged two, and remove the
> > restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
> > these restrictions were not needed, I can see no need.
>
> RDMA could certainly benefit for exactly the reason you describe above.
>
>
> > There is a bug with the load balancing code from Trond's tree.
> > While an xprt is attached to a client, the queuelen is incremented.
> > Some requests (particularly BIND_CONN_TO_SESSION) pass in an xprt,
> > and the queuelen was not incremented in this case, but it was
> > decremented.  This causes it to go 'negative' and havoc results.
> >
> > I wonder if the last three patches (*Allow multiple connection*) could
> > be merged into a single patch.
> >
> > I haven't given much thought to automatically determining the optimal
> > number of connections, but I doubt it can be done transparently with
> > any reliability.
>
> A Solaris client can open up to 8 connections to a server, but there
> are always some scenarios where the heuristic creates too many
> connections and becomes a performance issue.

That's great that a Solaris client can have multiple connections;
let's not leave the Linux client behind then :-) Given your knowledge
in this case, do you have words of wisdom/lessons learned that could
help with it?

> We also have concerns about running the client out of privileged port
> space.
>
> The problem with nconnect is that it can work well, but it can also be
> a very easy way to shoot yourself in the foot.

It's an optional feature so I'd argue that if you've chosen to use it,
then don't complain about the consequences.

> I also share the concerns about dealing properly with retransmission
> and NFSv4 sessions.
>
>
> > When adding a connection improves throughput, then
> > it was almost certainly a good thing to do. When adding a connection
> > doesn't improve throughput, the implications are less obvious.
> > My feeling is that a protocol enhancement where the serve suggests an
> > upper limit and the client increases toward that limit when it notices
> > xmit backlog, would be about the best we could do.  But we would need
> > a lot more experience with the functionality first.
>
> What about situations where the network capabilities between server and
> client change? Problem is that neither endpoint can detect that; TCP
> usually just deals with it.
>
> Related Work:
>
> We now have protocol (more like conventions) for clients to discover
> when a server has additional endpoints so that it can establish
> connections to each of them.

Yes, I totally agree we need a solution for when there are multiple
endpoints. And we need the solution that's being proposed here, which
is to establish multiple connections to the same endpoint.

> https://datatracker.ietf.org/doc/rfc8587/
>
> and
>
> https://datatracker.ietf.org/doc/draft-ietf-nfsv4-rfc5661-msns-update/
>
> Boiled down, the client uses fs_locations and trunking detection to
> figure out when two IP addresses are the same server instance.
>
> This facility can also be used to establish a connection over a
> different path if network connectivity is lost.
>
> There has also been some exploration of MP-TCP. The magic happens
> under the transport socket in the network layer, and the RPC client
> is not involved.
>
>
> > Comments most welcome.  I'd love to see this, or something similar,
> > merged.
> >
> > Thanks,
> > NeilBrown
> >
> > ---
> >
> > NeilBrown (4):
> >      NFS: send state management on a single connection.
> >      SUNRPC: enhance rpc_clnt_show_stats() to report on all xprts.
> >      SUNRPC: add links for all client xprts to debugfs
> >
> > Trond Myklebust (5):
> >      SUNRPC: Add basic load balancing to the transport switch
> >      SUNRPC: Allow creation of RPC clients with multiple connections
> >      NFS: Add a mount option to specify number of TCP connections to use
> >      NFSv4: Allow multiple connections to NFSv4.x servers
> >      pNFS: Allow multiple connections to the DS
> >      NFS: Allow multiple connections to a NFSv2 or NFSv3 server
> >
> >
> > fs/nfs/client.c                      |    3 +
> > fs/nfs/internal.h                    |    2 +
> > fs/nfs/nfs3client.c                  |    1
> > fs/nfs/nfs4client.c                  |   13 ++++-
> > fs/nfs/nfs4proc.c                    |   22 +++++---
> > fs/nfs/super.c                       |   12 ++++
> > include/linux/nfs_fs_sb.h            |    1
> > include/linux/sunrpc/clnt.h          |    1
> > include/linux/sunrpc/sched.h         |    1
> > include/linux/sunrpc/xprt.h          |    1
> > include/linux/sunrpc/xprtmultipath.h |    2 +
> > net/sunrpc/clnt.c                    |   98 ++++++++++++++++++++++++++++++++--
> > net/sunrpc/debugfs.c                 |   46 ++++++++++------
> > net/sunrpc/sched.c                   |    3 +
> > net/sunrpc/stats.c                   |   15 +++--
> > net/sunrpc/sunrpc.h                  |    3 +
> > net/sunrpc/xprtmultipath.c           |   23 +++++++-
> > 17 files changed, 204 insertions(+), 43 deletions(-)
> >
> > --
> > Signature
> >
>
> --
> Chuck Lever
>
>
>
NeilBrown May 30, 2019, 10:38 p.m. UTC | #7
On Thu, May 30 2019, Tom Talpey wrote:

> On 5/30/2019 1:20 PM, Olga Kornievskaia wrote:
>> On Thu, May 30, 2019 at 1:05 PM Tom Talpey <tom@talpey.com> wrote:
>>>
>>> On 5/29/2019 8:41 PM, NeilBrown wrote:
>>>> I've also re-arrange the patches a bit, merged two, and remove the
>>>> restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
>>>> these restrictions were not needed, I can see no need.
>>>
>>> I believe the need is for the correctness of retries. Because NFSv2,
>>> NFSv3 and NFSv4.0 have no exactly-once semantics of their own, server
>>> duplicate request caches are important (although often imperfect).
>>> These caches use client XID's, source ports and addresses, sometimes
>>> in addition to other methods, to detect retry. Existing clients are
>>> careful to reconnect with the same source port, to ensure this. And
>>> existing servers won't change.
>> 
>> Retries are already bound to the same connection so there shouldn't be
>> an issue of a retransmission coming from a different source port.
>
> So, there's no path redundancy? If any connection is lost and can't
> be reestablished, the requests on that connection will time out?

Path redundancy happens lower down in the stack.  Presumably a bonding
driver will divert flows to a working path when one path fails.
NFS doesn't see paths at all.  It just sees TCP connections - each with
the same source and destination address.  How these are associated, from
time to time, with different hardware is completely transparent to NFS.

>
> I think a common configuration will be two NICs and two network paths,
> a so-called shotgun. Admins will be quite frustrated to discover it
> gives no additional robustness, and perhaps even less.
>
> Why not simply restrict this to the fully-correct, fully-functional
> NFSv4.1+ scenario, and not try to paper over the shortcomings?

Because I cannot see any shortcomings in using it for v3 or v4.0.

Also, there are situations where NFSv3 is a measurably better choice
than NFSv4.1.  At least it seems to allow a quicker failover for HA.
But that is really a topic for another day.

NeilBrown

>
> Tom.
>
>> 
>>> Multiple connections will result in multiple source ports, and possibly
>>> multiple source addresses, meaning retried client requests may be
>>> accepted as new, rather than having any chance of being recognized as
>>> retries.
>>>
>>> NFSv4.1+ don't have this issue, but removing the restrictions would
>>> seem to break the downlevel mounts.
>>>
>>> Tom.
>>>
>> 
>>
NeilBrown May 30, 2019, 10:56 p.m. UTC | #8
On Thu, May 30 2019, Chuck Lever wrote:

> Hi Neil-
>
> Thanks for chasing this a little further.
>
>
>> On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
>> 
>> This patch set is based on the patches in the multipath_tcp branch of
>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
>> 
>> I'd like to add my voice to those supporting this work and wanting to
>> see it land.
>> We have had customers/partners wanting this sort of functionality for
>> years.  In SLES releases prior to SLE15, we've provide a
>> "nosharetransport" mount option, so that several filesystem could be
>> mounted from the same server and each would get its own TCP
>> connection.
>
> Is it well understood why splitting up the TCP connections result
> in better performance?
>
>
>> In SLE15 we are using this 'nconnect' feature, which is much nicer.
>> 
>> Partners have assured us that it improves total throughput,
>> particularly with bonded networks, but we haven't had any concrete
>> data until Olga Kornievskaia provided some concrete test data - thanks
>> Olga!
>> 
>> My understanding, as I explain in one of the patches, is that parallel
>> hardware is normally utilized by distributing flows, rather than
>> packets.  This avoid out-of-order deliver of packets in a flow.
>> So multiple flows are needed to utilizes parallel hardware.
>
> Indeed.
>
> However I think one of the problems is what happens in simpler scenarios.
> We had reports that using nconnect > 1 on virtual clients made things
> go slower. It's not always wise to establish multiple connections
> between the same two IP addresses. It depends on the hardware on each
> end, and the network conditions.

This is a good argument for leaving the default at '1'.  When
documentation is added to nfs(5), we can make it clear that the optimal
number is dependent on hardware.

>
> What about situations where the network capabilities between server and
> client change? Problem is that neither endpoint can detect that; TCP
> usually just deals with it.

Being able to manually change (-o remount) the number of connections
might be useful...

>
> Related Work:
>
> We now have protocol (more like conventions) for clients to discover
> when a server has additional endpoints so that it can establish
> connections to each of them.
>
> https://datatracker.ietf.org/doc/rfc8587/
>
> and
>
> https://datatracker.ietf.org/doc/draft-ietf-nfsv4-rfc5661-msns-update/
>
> Boiled down, the client uses fs_locations and trunking detection to
> figure out when two IP addresses are the same server instance.
>
> This facility can also be used to establish a connection over a
> different path if network connectivity is lost.
>
> There has also been some exploration of MP-TCP. The magic happens
> under the transport socket in the network layer, and the RPC client
> is not involved.

I would think that SCTP would be the best protocol for NFS to use as it
supports multi-streaming - several independent streams.  That would
require that hardware understands it of course.

Though I have examined MP-TCP closely, it looks like it is still fully
sequenced, so it would be tricky for two RPC messages to be assembled
into TCP frames completely independently - at least you would need
synchronization on the sequence number.

Thanks for your thoughts,
NeilBrown


>
>
>> Comments most welcome.  I'd love to see this, or something similar,
>> merged.
>> 
>> Thanks,
>> NeilBrown
>> 
>> ---
>> 
>> NeilBrown (4):
>>      NFS: send state management on a single connection.
>>      SUNRPC: enhance rpc_clnt_show_stats() to report on all xprts.
>>      SUNRPC: add links for all client xprts to debugfs
>> 
>> Trond Myklebust (5):
>>      SUNRPC: Add basic load balancing to the transport switch
>>      SUNRPC: Allow creation of RPC clients with multiple connections
>>      NFS: Add a mount option to specify number of TCP connections to use
>>      NFSv4: Allow multiple connections to NFSv4.x servers
>>      pNFS: Allow multiple connections to the DS
>>      NFS: Allow multiple connections to a NFSv2 or NFSv3 server
>> 
>> 
>> fs/nfs/client.c                      |    3 +
>> fs/nfs/internal.h                    |    2 +
>> fs/nfs/nfs3client.c                  |    1 
>> fs/nfs/nfs4client.c                  |   13 ++++-
>> fs/nfs/nfs4proc.c                    |   22 +++++---
>> fs/nfs/super.c                       |   12 ++++
>> include/linux/nfs_fs_sb.h            |    1 
>> include/linux/sunrpc/clnt.h          |    1 
>> include/linux/sunrpc/sched.h         |    1 
>> include/linux/sunrpc/xprt.h          |    1 
>> include/linux/sunrpc/xprtmultipath.h |    2 +
>> net/sunrpc/clnt.c                    |   98 ++++++++++++++++++++++++++++++++--
>> net/sunrpc/debugfs.c                 |   46 ++++++++++------
>> net/sunrpc/sched.c                   |    3 +
>> net/sunrpc/stats.c                   |   15 +++--
>> net/sunrpc/sunrpc.h                  |    3 +
>> net/sunrpc/xprtmultipath.c           |   23 +++++++-
>> 17 files changed, 204 insertions(+), 43 deletions(-)
>> 
>> --
>> Signature
>> 
>
> --
> Chuck Lever
Rick Macklem May 30, 2019, 11:53 p.m. UTC | #9
Olga Kornievskaia wrote:
>On Thu, May 30, 2019 at 1:05 PM Tom Talpey <tom@talpey.com> wrote:
>>
>> On 5/29/2019 8:41 PM, NeilBrown wrote:
>> > I've also re-arrange the patches a bit, merged two, and remove the
>> > restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
>> > these restrictions were not needed, I can see no need.
>>
>> I believe the need is for the correctness of retries. Because NFSv2,
>> NFSv3 and NFSv4.0 have no exactly-once semantics of their own, server
>> duplicate request caches are important (although often imperfect).
>> These caches use client XID's, source ports and addresses, sometimes
>> in addition to other methods, to detect retry. Existing clients are
>> careful to reconnect with the same source port, to ensure this. And
>> existing servers won't change.
>
>Retries are already bound to the same connection so there shouldn't be
>an issue of a retransmission coming from a different source port.
I don't think the above is correct for NFSv4.0 (it may very well be true for NFSv3).
Here's what RFC7530 Sec. 3.1.1 says:
3.1.1.  Client Retransmission Behavior

   When processing an NFSv4 request received over a reliable transport
   such as TCP, the NFSv4 server MUST NOT silently drop the request,
   except if the established transport connection has been broken.
   Given such a contract between NFSv4 clients and servers, clients MUST
   NOT retry a request unless one or both of the following are true:

   o  The transport connection has been broken

   o  The procedure being retried is the NULL procedure

If the transport connection is broken, the retry needs to be done on a new TCP
connection, does it not? (I'm assuming you are referring to a retry of an RPC here.)
(My interpretation of "broken" is "can't be fixed, so the client must use a different
 TCP connection.)

Also, NFSv4.0 cannot use Sun RPC over UDP, whereas some DRCs only
work for UDP traffic. (The FreeBSD server does have DRC support for TCP, but
the algorithm is very different than what is used for UDP, due to the long delay
before a retried RPC request is received. This can result in significant server
overheads, so some sites choose to disable the DRC for TCP traffic or tune it
in such a way as it becomes almost useless.)
The FreeBSD DRC code for NFS over TCP expects the retry to be from a different
port# (due to a new connection re: the above) for NFSv4.0. For NFSv3, my best
recollection is that it doesn't care what the source port# is. (It basically uses a
hash on the RPC request excluding TCP/IP header to recognize possible duplicates.)
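
Conceptually (this is illustrative code, not the FreeBSD
implementation), the idea is to hash the RPC call record itself so the
source port never enters the comparison:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative FNV-1a hash over the RPC call record (XID, program,
 * procedure, arguments) -- everything except the TCP/IP header -- so a
 * retry is recognisable even if it arrives from a different source port. */
static uint64_t rpc_call_hash(const unsigned char *rpc_record, size_t len)
{
	uint64_t h = 14695981039346656037ULL;

	for (size_t i = 0; i < len; i++) {
		h ^= rpc_record[i];
		h *= 1099511628211ULL;
	}
	return h;
}

int main(void)
{
	/* Placeholder bytes standing in for a marshalled RPC call. */
	unsigned char call[] = { 0x12, 0x34, 0x56, 0x78, 0x00, 0x01 };

	printf("call hash = %llx\n",
	       (unsigned long long)rpc_call_hash(call, sizeof(call)));
	return 0;
}

(A hash hit would, of course, still be confirmed against the cached
request before being treated as a retry.)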

I don't know what other NFS servers choose to do w.r.t. the DRC for NFS over TCP,
however for some reason I thought that the Linux knfsd only used a DRC for UDP?
(Someone please clarify this.)

rick

> Multiple connections will result in multiple source ports, and possibly
> multiple source addresses, meaning retried client requests may be
> accepted as new, rather than having any chance of being recognized as
> retries.
>
> NFSv4.1+ don't have this issue, but removing the restrictions would
> seem to break the downlevel mounts.
>
> Tom.
>
J. Bruce Fields May 31, 2019, 12:15 a.m. UTC | #10
On Thu, May 30, 2019 at 11:53:19PM +0000, Rick Macklem wrote:
> The FreeBSD DRC code for NFS over TCP expects the retry to be from a
> different port# (due to a new connection re: the above) for NFSv4.0.
> For NFSv3, my best recollection is that it doesn't care what the
> source port# is. (It basically uses a hash on the RPC request
> excluding TCP/IP header to recognize possible duplicates.)
> 
> I don't know what other NFS servers choose to do w.r.t. the DRC for
> NFS over TCP, however for some reason I thought that the Linux knfsd
> only used a DRC for UDP?  (Someone please clarify this.)

The knfsd DRC is used for TCP as well as UDP.  It does take into account
the source port.  I don't think we do any TCP-specific optimizations
though I agree that they sound like a good idea.

--b.
J. Bruce Fields May 31, 2019, 12:24 a.m. UTC | #11
On Thu, May 30, 2019 at 10:41:28AM +1000, NeilBrown wrote:
> This patch set is based on the patches in the multipath_tcp branch of
>  git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
> 
> I'd like to add my voice to those supporting this work and wanting to
> see it land.
> We have had customers/partners wanting this sort of functionality for
> years.  In SLES releases prior to SLE15, we've provide a
> "nosharetransport" mount option, so that several filesystem could be
> mounted from the same server and each would get its own TCP
> connection.
> In SLE15 we are using this 'nconnect' feature, which is much nicer.

For what it's worth, we've also gotten at least one complaint of a
performance regression on 4.0->4.1 upgrade because a user was depending
on the fact that a 4.0 client would use multiple TCP connections to a
server with multiple IP addresses.  (Whereas in the 4.1 case the client
will recognize that the addresses point to the same server and share any
preexisting session.)

--b.
NeilBrown May 31, 2019, 1:01 a.m. UTC | #12
On Thu, May 30 2019, Rick Macklem wrote:

> Olga Kornievskaia wrote:
>>On Thu, May 30, 2019 at 1:05 PM Tom Talpey <tom@talpey.com> wrote:
>>>
>>> On 5/29/2019 8:41 PM, NeilBrown wrote:
>>> > I've also re-arrange the patches a bit, merged two, and remove the
>>> > restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
>>> > these restrictions were not needed, I can see no need.
>>>
>>> I believe the need is for the correctness of retries. Because NFSv2,
>>> NFSv3 and NFSv4.0 have no exactly-once semantics of their own, server
>>> duplicate request caches are important (although often imperfect).
>>> These caches use client XID's, source ports and addresses, sometimes
>>> in addition to other methods, to detect retry. Existing clients are
>>> careful to reconnect with the same source port, to ensure this. And
>>> existing servers won't change.
>>
>>Retries are already bound to the same connection so there shouldn't be
>>an issue of a retransmission coming from a different source port.
> I don't think the above is correct for NFSv4.0 (it may very well be true for NFSv3).

It is correct for the Linux implementation of NFS, though the term
"xprt" is more accurate than "connection".

A "task" is bound it a specific "xprt" which, in the case of tcp, has a
fixed source port.  If the TCP connection breaks, a new one is created
with the same addresses and ports, and this new connection serves the
same xprt.

> Here's what RFC7530 Sec. 3.1.1 says:
> 3.1.1.  Client Retransmission Behavior
>
>    When processing an NFSv4 request received over a reliable transport
>    such as TCP, the NFSv4 server MUST NOT silently drop the request,
>    except if the established transport connection has been broken.
>    Given such a contract between NFSv4 clients and servers, clients MUST
>    NOT retry a request unless one or both of the following are true:
>
>    o  The transport connection has been broken
>
>    o  The procedure being retried is the NULL procedure
>
> If the transport connection is broken, the retry needs to be done on a new TCP
> connection, does it not? (I'm assuming you are referring to a retry of an RPC here.)
> (My interpretation of "broken" is "can't be fixed, so the client must use a different
>  TCP connection.)

Yes, a new connection.  But the Linux client makes sure to use the same
source port.
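
For reference, the "same source port on reconnect" technique looks
roughly like the following userspace sketch (the server address and
both port numbers are placeholders; the kernel client does the
equivalent internally rather than calling these APIs):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Reconnect using a fixed local (source) port: bind before connect. */
static int connect_from_port(const char *server_ip, uint16_t server_port,
			     uint16_t src_port)
{
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in local = { .sin_family = AF_INET,
				     .sin_addr.s_addr = htonl(INADDR_ANY),
				     .sin_port = htons(src_port) };
	struct sockaddr_in remote = { .sin_family = AF_INET,
				      .sin_port = htons(server_port) };

	if (fd < 0)
		return -1;
	/* Allow re-binding the source port while an older connection is
	 * still lingering in FIN_WAIT/TIME_WAIT. */
	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
	inet_pton(AF_INET, server_ip, &remote.sin_addr);
	if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0 ||
	    connect(fd, (struct sockaddr *)&remote, sizeof(remote)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

int main(void)
{
	int fd = connect_from_port("192.0.2.1", 2049, 871); /* placeholders */

	if (fd >= 0)
		close(fd);
	return 0;
}

Whether a connect() over the identical 4-tuple is then accepted while
the old connection is still draining depends on the TCP stack's state,
which is the concern Rick raises later in the thread.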

>
> Also, NFSv4.0 cannot use Sun RPC over UDP, whereas some DRCs only
> work for UDP traffic. (The FreeBSD server does have DRC support for TCP, but
> the algorithm is very different than what is used for UDP, due to the long delay
> before a retried RPC request is received. This can result in significant server
> overheads, so some sites choose to disable the DRC for TCP traffic or tune it
> in such a way as it becomes almost useless.)
> The FreeBSD DRC code for NFS over TCP expects the retry to be from a different
> port# (due to a new connection re: the above) for NFSv4.0. For NFSv3, my best
> recollection is that it doesn't care what the source port# is. (It basically uses a
> hash on the RPC request excluding TCP/IP header to recognize possible
> duplicates.)

Interesting .... hopefully the hash is sufficiently strong.
I think it is best to assume same source port, but there is no formal
standard.

Thanks,
NeilBrown


>
> I don't know what other NFS servers choose to do w.r.t. the DRC for NFS over TCP,
> however for some reason I thought that the Linux knfsd only used a DRC for UDP?
> (Someone please clarify this.)
>
> rick
>
>> Multiple connections will result in multiple source ports, and possibly
>> multiple source addresses, meaning retried client requests may be
>> accepted as new, rather than having any chance of being recognized as
>> retries.
>>
>> NFSv4.1+ don't have this issue, but removing the restrictions would
>> seem to break the downlevel mounts.
>>
>> Tom.
>>
Tom Talpey May 31, 2019, 1:45 a.m. UTC | #13
On 5/30/2019 2:41 PM, Olga Kornievskaia wrote:
> On Thu, May 30, 2019 at 1:41 PM Tom Talpey <tom@talpey.com> wrote:
>>
>> On 5/30/2019 1:20 PM, Olga Kornievskaia wrote:
>>> On Thu, May 30, 2019 at 1:05 PM Tom Talpey <tom@talpey.com> wrote:
>>>>
>>>> On 5/29/2019 8:41 PM, NeilBrown wrote:
>>>>> I've also re-arrange the patches a bit, merged two, and remove the
>>>>> restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
>>>>> these restrictions were not needed, I can see no need.
>>>>
>>>> I believe the need is for the correctness of retries. Because NFSv2,
>>>> NFSv3 and NFSv4.0 have no exactly-once semantics of their own, server
>>>> duplicate request caches are important (although often imperfect).
>>>> These caches use client XID's, source ports and addresses, sometimes
>>>> in addition to other methods, to detect retry. Existing clients are
>>>> careful to reconnect with the same source port, to ensure this. And
>>>> existing servers won't change.
>>>
>>> Retries are already bound to the same connection so there shouldn't be
>>> an issue of a retransmission coming from a different source port.
>>
>> So, there's no path redundancy? If any connection is lost and can't
>> be reestablished, the requests on that connection will time out?
> 
> For v3 and v4.0 in the current code base with a single connection,
> when it goes down, you are out of luck. When we have multiple
> connections and would like the benefit of using them but not
> sacrifices replay cache correctness, it's a small price to restrict
> the re-transmissions and suffer the consequence of not being able to
> do an operation during network issues.

I agree that the corruption resulting from a blown cache lookup would
be bad. But I'm also saying that users will be frustrated when random
operations time out, even when new ones work. Also, I think it may
lead to application issues.

>> I think a common configuration will be two NICs and two network paths,
> 
> Are you talking about session trunking here?

No, not necessarily. Certainly not when doing what you propose
over NFSv3.

> Why do you think two NICs would be a common configuration. I have
> performance numbers that demonstrate performance improvement for a
> single NIC case. I would say a single NIC with a high speed networks
> (25/40G) would be a common configuration.

They're both common! And sure, it's good for a single NIC because of
RSS (receive side scaling). The multiple connections spread interrupts
over several cores. The same as would happen with multiple NICs.

> 
>> a so-called shotgun. Admins will be quite frustrated to discover it
>> gives no additional robustness, and perhaps even less.
>>
>> Why not simply restrict this to the fully-correct, fully-functional
>> NFSv4.1+ scenario, and not try to paper over the shortcomings?
> 
> I think mainly because customers are still using v3 but want to
> improve performance. I'd love for everybody to switch to 4.1 but
> that's not happening.

Yeah, you and me both. But trying to "fix" NFSv3 with this is not
going to move the world forward, and I predict will cost many woeful
days ahead when it fails to work transparently.

Tom.


>>>> Multiple connections will result in multiple source ports, and possibly
>>>> multiple source addresses, meaning retried client requests may be
>>>> accepted as new, rather than having any chance of being recognized as
>>>> retries.
>>>>
>>>> NFSv4.1+ don't have this issue, but removing the restrictions would
>>>> seem to break the downlevel mounts.
>>>>
>>>> Tom.
>>>>
>>>
>>>
> 
>
Tom Talpey May 31, 2019, 1:48 a.m. UTC | #14
On 5/30/2019 6:38 PM, NeilBrown wrote:
> On Thu, May 30 2019, Tom Talpey wrote:
> 
>> On 5/30/2019 1:20 PM, Olga Kornievskaia wrote:
>>> On Thu, May 30, 2019 at 1:05 PM Tom Talpey <tom@talpey.com> wrote:
>>>>
>>>> On 5/29/2019 8:41 PM, NeilBrown wrote:
>>>>> I've also re-arrange the patches a bit, merged two, and remove the
>>>>> restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
>>>>> these restrictions were not needed, I can see no need.
>>>>
>>>> I believe the need is for the correctness of retries. Because NFSv2,
>>>> NFSv3 and NFSv4.0 have no exactly-once semantics of their own, server
>>>> duplicate request caches are important (although often imperfect).
>>>> These caches use client XID's, source ports and addresses, sometimes
>>>> in addition to other methods, to detect retry. Existing clients are
>>>> careful to reconnect with the same source port, to ensure this. And
>>>> existing servers won't change.
>>>
>>> Retries are already bound to the same connection so there shouldn't be
>>> an issue of a retransmission coming from a different source port.
>>
>> So, there's no path redundancy? If any connection is lost and can't
>> be reestablished, the requests on that connection will time out?
> 
> Path redundancy happens lower down in the stack.  Presumably a bonding
> driver will divert flows to a working path when one path fails.
> NFS doesn't see paths at all.  It just sees TCP connections - each with
> the same source and destination address.  How these are associated, from
> time to time, with different hardware is completely transparent to NFS.

But, you don't propose to constrain this to bonded connections. So
NFS will create connections on whatever collection of NICs is present
locally, and if these aren't bonded, well, the issues become visible.

BTW, RDMA NICs are never bonded.

Tom.

> 
>>
>> I think a common configuration will be two NICs and two network paths,
>> a so-called shotgun. Admins will be quite frustrated to discover it
>> gives no additional robustness, and perhaps even less.
>>
>> Why not simply restrict this to the fully-correct, fully-functional
>> NFSv4.1+ scenario, and not try to paper over the shortcomings?
> 
> Because I cannot see any shortcomings in using it for v3 or v4.0.
> 
> Also, there are situations where NFSv3 is a measurably better choice
> than NFSv4.1.  Al least it seems to allow a quicker failover for HA.
> But that is really a topic for another day.
> 
> NeilBrown
> 
>>
>> Tom.
>>
>>>
>>>> Multiple connections will result in multiple source ports, and possibly
>>>> multiple source addresses, meaning retried client requests may be
>>>> accepted as new, rather than having any chance of being recognized as
>>>> retries.
>>>>
>>>> NFSv4.1+ don't have this issue, but removing the restrictions would
>>>> seem to break the downlevel mounts.
>>>>
>>>> Tom.
>>>>
>>>
>>>
Rick Macklem May 31, 2019, 2:20 a.m. UTC | #15
NeilBrown wrote:
>On Thu, May 30 2019, Rick Macklem wrote:
>
>> Olga Kornievskaia wrote:
>>>On Thu, May 30, 2019 at 1:05 PM Tom Talpey <tom@talpey.com> wrote:
>>>>
>>>> On 5/29/2019 8:41 PM, NeilBrown wrote:
>>>> > I've also re-arrange the patches a bit, merged two, and remove the
>>>> > restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
>>>> > these restrictions were not needed, I can see no need.
>>>>
>>>> I believe the need is for the correctness of retries. Because NFSv2,
>>>> NFSv3 and NFSv4.0 have no exactly-once semantics of their own, server
>>>> duplicate request caches are important (although often imperfect).
>>>> These caches use client XID's, source ports and addresses, sometimes
>>>> in addition to other methods, to detect retry. Existing clients are
>>>> careful to reconnect with the same source port, to ensure this. And
>>>> existing servers won't change.
>>>
>>>Retries are already bound to the same connection so there shouldn't be
>>>an issue of a retransmission coming from a different source port.
>> I don't think the above is correct for NFSv4.0 (it may very well be true for NFSv3).
>
>It is correct for the Linux implementation of NFS, though the term
>"xprt" is more accurate than "connection".
>
>A "task" is bound it a specific "xprt" which, in the case of tcp, has a
>fixed source port.  If the TCP connection breaks, a new one is created
>with the same addresses and ports, and this new connection serves the
>same xprt.
Ok, that's interesting. The FreeBSD client side krpc uses "xprt"s too
(I assume they came from some old Sun open sources for RPC)
but it just creates a new socket and binds it to any port# available.
When this happens in the FreeBSD client, the old connection is sometimes still
sitting around in some FIN_WAIT state. My TCP knowledge is pretty minimal, but I didn't
think you could safely create a new connection using the same port#s at that point,
or at least the old BSD TCP stack code won't allow it.

Anyhow, the FreeBSD client doesn't use same source port# for the new connection.

>> Here's what RFC7530 Sec. 3.1.1 says:
>> 3.1.1.  Client Retransmission Behavior
>>
>>    When processing an NFSv4 request received over a reliable transport
>>    such as TCP, the NFSv4 server MUST NOT silently drop the request,
>>    except if the established transport connection has been broken.
>>    Given such a contract between NFSv4 clients and servers, clients MUST
>>    NOT retry a request unless one or both of the following are true:
>>
>>    o  The transport connection has been broken
>>
>>    o  The procedure being retried is the NULL procedure
>>
>> If the transport connection is broken, the retry needs to be done on a new TCP
>> connection, does it not? (I'm assuming you are referring to a retry of an RPC here.)
>> (My interpretation of "broken" is "can't be fixed, so the client must use a different
>>  TCP connection.)
>
>Yes, a new connection.  But the Linux client makes sure to use the same
>source port.
Ok. I guess my DRC code that expects "different source port#" for NFSv4.0 is
broken. It will result in a DRC miss, which isn't great, but is always possible for
any DRC design. (Not nearly as bad as a false hit.)

>>
>> Also, NFSv4.0 cannot use Sun RPC over UDP, whereas some DRCs only
>> work for UDP traffic. (The FreeBSD server does have DRC support for TCP, but
>> the algorithm is very different than what is used for UDP, due to the long delay
>> before a retried RPC request is received. This can result in significant server
>> overheads, so some sites choose to disable the DRC for TCP traffic or tune it
>> in such a way as it becomes almost useless.)
>> The FreeBSD DRC code for NFS over TCP expects the retry to be from a different
>> port# (due to a new connection re: the above) for NFSv4.0. For NFSv3, my best
>> recollection is that it doesn't care what the source port# is. (It basically uses a
>> hash on the RPC request excluding TCP/IP header to recognize possible
>> duplicates.)
>
>Interesting .... hopefully the hash is sufficiently strong.
It doesn't just use the hash (it still expects same xid, etc), it just doesn't use the TCP
source port#.

To be honest, when I played with this many years ago, unless the size of the DRC
is very large and entries persist in the cache for a long time, they always fall out
of the cache before the retry happens over TCP. At least for the cases I tried back
then, where the RPC retry timeout for TCP was pretty large.
(Sites that use FreeBSD servers under heavy load usually find the DRC grows too
 large and tune it down until it no longer would work for TCP anyhow.)

My position is that this all got fixed by sessions and if someone uses NFSv4.0 instead
of NFSv4.1, they may just have to live with the limitations of no "exactly once"
semantics. (Personally, I think NFSv4.0 should just be deprecated. I know people still have good uses for NFSv3, but I have trouble believing NFSv4.0 is preferred over NFSv4.1,
although Bruce did note a case where there was a performance difference.)

>I think it is best to assume same source port, but there is no formal
>standard.
I'd say you can't assume "same port#" or "different port#", since there is no standard.
But I would agree that "assuming same port#" will just result in false misses for
clients that don't use the same port#.

rick

>Thanks,
>NeilBrown
>
>
>
>> I don't know what other NFS servers choose to do w.r.t. the DRC for NFS over TCP,
>> however for some reason I thought that the Linux knfsd only used a DRC for UDP?
>> (Someone please clarify this.)
>>
>> rick
>>
>>> Multiple connections will result in multiple source ports, and possibly
>>> multiple source addresses, meaning retried client requests may be
>>> accepted as new, rather than having any chance of being recognized as
>>> retries.
>>>
>>> NFSv4.1+ don't have this issue, but removing the restrictions would
>>> seem to break the downlevel mounts.
>>>
>>> Tom.
>>>
NeilBrown May 31, 2019, 2:31 a.m. UTC | #16
On Thu, May 30 2019, Tom Talpey wrote:

> On 5/30/2019 6:38 PM, NeilBrown wrote:
>> On Thu, May 30 2019, Tom Talpey wrote:
>> 
>>> On 5/30/2019 1:20 PM, Olga Kornievskaia wrote:
>>>> On Thu, May 30, 2019 at 1:05 PM Tom Talpey <tom@talpey.com> wrote:
>>>>>
>>>>> On 5/29/2019 8:41 PM, NeilBrown wrote:
>>>>>> I've also re-arrange the patches a bit, merged two, and remove the
>>>>>> restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
>>>>>> these restrictions were not needed, I can see no need.
>>>>>
>>>>> I believe the need is for the correctness of retries. Because NFSv2,
>>>>> NFSv3 and NFSv4.0 have no exactly-once semantics of their own, server
>>>>> duplicate request caches are important (although often imperfect).
>>>>> These caches use client XID's, source ports and addresses, sometimes
>>>>> in addition to other methods, to detect retry. Existing clients are
>>>>> careful to reconnect with the same source port, to ensure this. And
>>>>> existing servers won't change.
>>>>
>>>> Retries are already bound to the same connection so there shouldn't be
>>>> an issue of a retransmission coming from a different source port.
>>>
>>> So, there's no path redundancy? If any connection is lost and can't
>>> be reestablished, the requests on that connection will time out?
>> 
>> Path redundancy happens lower down in the stack.  Presumably a bonding
>> driver will divert flows to a working path when one path fails.
>> NFS doesn't see paths at all.  It just sees TCP connections - each with
>> the same source and destination address.  How these are associated, from
>> time to time, with different hardware is completely transparent to NFS.
>
> But, you don't propose to constrain this to bonded connections. So
> NFS will create connections on whatever collection of NICs which are
> locally, and if these aren't bonded, well, the issues become visible.

If a client had multiple network interfaces with different addresses,
and several of them had routes to the selected server IP, then this
might result in the multiple connections to the server having different
local addresses (as well as different local ports) - I don't know the
network layer well enough to be sure if this is possible, but it seems
credible.

If one of these interfaces then went down, and there was no automatic
routing reconfiguration in place to restore connectivity through a
different interface, then the TCP connection would time out and break.
The xprt would then try to reconnect using the same source port and
destination address - it doesn't provide an explicit source address, but
lets the network layer provide one.
This would presumably result in a connection with a different source
address.  So requests would continue to flow on the xprt, but they might
miss the DRC as the source address would be different.
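
To make that concrete, here is a minimal illustrative sketch of a reply-cache
lookup keyed on xid plus source address and port.  The names (drc_entry,
drc_match) are invented; this is neither knfsd nor the FreeBSD DRC code.

#include <stdbool.h>
#include <stdint.h>
#include <netinet/in.h>

struct drc_entry {
	uint32_t		xid;	/* RPC transaction id */
	struct sockaddr_in	src;	/* client address and port at first receipt */
	/* cached reply omitted */
};

static bool drc_match(const struct drc_entry *e,
		      uint32_t xid, const struct sockaddr_in *src)
{
	/* A retry sent after failing over to another interface carries the
	 * same xid but a new source address (and possibly port), so this
	 * returns false and the request is executed again as if it were new. */
	return e->xid == xid &&
	       e->src.sin_addr.s_addr == src->sin_addr.s_addr &&
	       e->src.sin_port == src->sin_port;
}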

If you have a configuration like this (multi-homed client with
multiple interfaces that can reach the server with equal weight), then
you already have a possible problem of missing the DRC if one interface
goes down and a new connection is established from another one.  nconnect
doesn't change that.

So I still don't see any problem.

If I've misunderstood you, please provide a detailed description of the
sort of configuration where you think a problem might arise.

>
> BTW, RDMA NICs are never bonded.

I've come across the concept of "Multi-Rail", but I cannot say that I
fully understand it yet.  I suspect you would need more than nconnect to
make proper use of multi-rail RDMA.

Thanks,
NeilBrown
Tom Talpey May 31, 2019, 12:36 p.m. UTC | #17
On 5/30/2019 10:20 PM, Rick Macklem wrote:
> NeilBrown wrote:
>> On Thu, May 30 2019, Rick Macklem wrote:
>>
>>> Olga Kornievskaia wrote:
>>>> On Thu, May 30, 2019 at 1:05 PM Tom Talpey <tom@talpey.com> wrote:
>>>>>
>>>>> On 5/29/2019 8:41 PM, NeilBrown wrote:
>>>>>> I've also re-arrange the patches a bit, merged two, and remove the
>>>>>> restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
>>>>>> these restrictions were not needed, I can see no need.
>>>>>
>>>>> I believe the need is for the correctness of retries. Because NFSv2,
>>>>> NFSv3 and NFSv4.0 have no exactly-once semantics of their own, server
>>>>> duplicate request caches are important (although often imperfect).
>>>>> These caches use client XID's, source ports and addresses, sometimes
>>>>> in addition to other methods, to detect retry. Existing clients are
>>>>> careful to reconnect with the same source port, to ensure this. And
>>>>> existing servers won't change.
>>>>
>>>> Retries are already bound to the same connection so there shouldn't be
>>>> an issue of a retransmission coming from a different source port.
>>> I don't think the above is correct for NFSv4.0 (it may very well be true for NFSv3).
>>
>> It is correct for the Linux implementation of NFS, though the term
>> "xprt" is more accurate than "connection".
>>
>> A "task" is bound it a specific "xprt" which, in the case of tcp, has a
>> fixed source port.  If the TCP connection breaks, a new one is created
>> with the same addresses and ports, and this new connection serves the
>> same xprt.
> Ok, that's interesting. The FreeBSD client side krpc uses "xprt"s too
> (I assume they came from some old Sun open sources for RPC)
> but it just creates a new socket and binds it to any port# available.
> When this happens in the FreeBSD client, the old connection is sometimes still
> sitting around in some FIN_WAIT state. My TCP is pretty minimal, but I didn't
> think you could safely create a new connection using the same port#s at that point,
> or at least the old BSD TCP stack code won't allow it.
> 
> Anyhow, the FreeBSD client doesn't use same source port# for the new connection.
> 
>>> Here's what RFC7530 Sec. 3.1.1 says:
>>> 3.1.1.  Client Retransmission Behavior
>>>
>>>     When processing an NFSv4 request received over a reliable transport
>>>     such as TCP, the NFSv4 server MUST NOT silently drop the request,
>>>     except if the established transport connection has been broken.
>>>     Given such a contract between NFSv4 clients and servers, clients MUST
>>>     NOT retry a request unless one or both of the following are true:
>>>
>>>     o  The transport connection has been broken
>>>
>>>     o  The procedure being retried is the NULL procedure
>>>
>>> If the transport connection is broken, the retry needs to be done on a new TCP
>>> connection, does it not? (I'm assuming you are referring to a retry of an RPC here.)
>>> (My interpretation of "broken" is "can't be fixed, so the client must use a different
>>>   TCP connection.)
>>
>> Yes, a new connection.  But the Linux client makes sure to use the same
>> source port.
> Ok. I guess my DRC code that expects "different source port#" for NFSv4.0 is
> broken. It will result in a DRC miss, which isn't great, but is always possible for
> any DRC design. (Not nearly as bad as a false hit.)
> 
>>>
>>> Also, NFSv4.0 cannot use Sun RPC over UDP, whereas some DRCs only
>>> work for UDP traffic. (The FreeBSD server does have DRC support for TCP, but
>>> the algorithm is very different than what is used for UDP, due to the long delay
>>> before a retried RPC request is received. This can result in significant server
>>> overheads, so some sites choose to disable the DRC for TCP traffic or tune it
>>> in such a way as it becomes almost useless.)
>>> The FreeBSD DRC code for NFS over TCP expects the retry to be from a different
>>> port# (due to a new connection re: the above) for NFSv4.0. For NFSv3, my best
>>> recollection is that it doesn't care what the source port# is. (It basically uses a
>>> hash on the RPC request excluding TCP/IP header to recognize possible
>>> duplicates.)
>>
>> Interesting .... hopefully the hash is sufficiently strong.
> It doesn't just use the hash (it still expects same xid, etc), it just doesn't use the TCP
> source port#.
> 
> To be honest, when I played with this many years ago, unless the size of the DRC
> is very large and entries persist in the cache for a long time, they always fall out
> of the cache before the retry happens over TCP. At least for the cases I tried back
> then, where the RPC retry timeout for TCP was pretty large.
> (Sites that use FreeBSD servers under heavy load usually find the DRC grows too
>   large and tune it down until it no longer would work for TCP anyhow.)
> 
> My position is that this all got fixed by sessions and if someone uses NFSv4.0 instead
> of NFSv4.1, they may just have to live with the limitations of no "exactly once"
> semantics. (Personally, NFSv4.0 should just be deprecated. I know people still have good uses for NFSv3, but I have trouble believing NFSv4.0 is preferred over NFSv4.1,
> although Bruce did note a case where there was a performance difference.)
> 
>> I think it is best to assume same source port, but there is no formal
>> standard.
> I'd say you can't assume "same port#" or "different port#', since there is no standard.
> But I would agree that "assuming same port#" will just result in false misses for
> clients that don't use the same port#.

Hey Rick. I think the best summary is to say the traditional DRC is
deeply flawed and can't fully protect this. Many of us, you and I
included, have tried various ways to fix this, with varying degrees
of success.

My point here is not perfection however. My point is, there are servers
out there which will behave quite differently in the face of this
proposed client behavior, and I'm raising the issue.

Tom.


> 
> rick
> 
>> Thanks,
>> NeilBrown
>>
>>
>>
>>> I don't know what other NFS servers choose to do w.r.t. the DRC for NFS over TCP,
>>> however for some reason I thought that the Linux knfsd only used a DRC for UDP?
>>> (Someone please clarify this.)
>>>
>>> rick
>>>
>>>> Multiple connections will result in multiple source ports, and possibly
>>>> multiple source addresses, meaning retried client requests may be
>>>> accepted as new, rather than having any chance of being recognized as
>>>> retries.
>>>>
>>>> NFSv4.1+ don't have this issue, but removing the restrictions would
>>>> seem to break the downlevel mounts.
>>>>
>>>> Tom.
>>>>
> 
>
Tom Talpey May 31, 2019, 12:39 p.m. UTC | #18
On 5/30/2019 10:31 PM, NeilBrown wrote:
> On Thu, May 30 2019, Tom Talpey wrote:
> 
>> On 5/30/2019 6:38 PM, NeilBrown wrote:
>>> On Thu, May 30 2019, Tom Talpey wrote:
>>>
>>>> On 5/30/2019 1:20 PM, Olga Kornievskaia wrote:
>>>>> On Thu, May 30, 2019 at 1:05 PM Tom Talpey <tom@talpey.com> wrote:
>>>>>>
>>>>>> On 5/29/2019 8:41 PM, NeilBrown wrote:
>>>>>>> I've also re-arrange the patches a bit, merged two, and remove the
>>>>>>> restriction to TCP and NFSV4.x,x>=1.  Discussions seemed to suggest
>>>>>>> these restrictions were not needed, I can see no need.
>>>>>>
>>>>>> I believe the need is for the correctness of retries. Because NFSv2,
>>>>>> NFSv3 and NFSv4.0 have no exactly-once semantics of their own, server
>>>>>> duplicate request caches are important (although often imperfect).
>>>>>> These caches use client XID's, source ports and addresses, sometimes
>>>>>> in addition to other methods, to detect retry. Existing clients are
>>>>>> careful to reconnect with the same source port, to ensure this. And
>>>>>> existing servers won't change.
>>>>>
>>>>> Retries are already bound to the same connection so there shouldn't be
>>>>> an issue of a retransmission coming from a different source port.
>>>>
>>>> So, there's no path redundancy? If any connection is lost and can't
>>>> be reestablished, the requests on that connection will time out?
>>>
>>> Path redundancy happens lower down in the stack.  Presumably a bonding
>>> driver will divert flows to a working path when one path fails.
>>> NFS doesn't see paths at all.  It just sees TCP connections - each with
>>> the same source and destination address.  How these are associated, from
>>> time to time, with different hardware is completely transparent to NFS.
>>
>> But, you don't propose to constrain this to bonded connections. So
>> NFS will create connections on whatever collection of NICs which are
>> locally, and if these aren't bonded, well, the issues become visible.
> 
> If a client had multiple network interfaces with different addresses,
> and several of them had routes to the selected server IP, then this
> might result in the multiple connections to the server having different
> local addresses (as well as different local ports) - I don't know the
> network layer well enough to be sure if this is possible, but it seems
> credible.
> 
> If one of these interfaces then went down, and there was no automatic
> routing reconfiguration in place to restore connectivity through a
> different interface, then the TCP connection would timeout and break.
> The xprt would then try to reconnect using the same source port and
> destination address - it doesn't provide an explicit source address, but
> lets the network layer provide one.
> This would presumably result in a connection with a different source
> address.  So requests would continue to flow on the xprt, but they might
> miss the DRC as the source address would be different.
> 
> If you have a configuration like this (multi-homed client with
> multiple interfaces that can reach the server with equal weight), then
> you already have a possible problem of missing the DRC if one interface
> goes down a new connection is established from another one.  nconnect
> doesn't change that.
> 
> So I still don't see any problem.
> 
> If I've misunderstood you, please provide a detailed description of the
> sort of configuration where you think a problem might arise.

You nailed it. But I disagree that there won't be a problem. NFSv4.1
and up will be fine, but NFS versions which rely on a heuristic,
space-limited DRC will not.

Tom.


> 
>>
>> BTW, RDMA NICs are never bonded.
> 
> I've come across the concept of "Multi-Rail", but I cannot say that I
> fully understand it yet.  I suspect you would need more than nconnect to
> make proper use of multi-rail RDMA
> 
> Thanks,
> NeilBrown
>
Trond Myklebust May 31, 2019, 1:33 p.m. UTC | #19
On Fri, 2019-05-31 at 08:36 -0400, Tom Talpey wrote:
> On 5/30/2019 10:20 PM, Rick Macklem wrote:
> > NeilBrown wrote:
> > > On Thu, May 30 2019, Rick Macklem wrote:
> > > 
> > > > Olga Kornievskaia wrote:
> > > > > On Thu, May 30, 2019 at 1:05 PM Tom Talpey <tom@talpey.com>
> > > > > wrote:
> > > > > > On 5/29/2019 8:41 PM, NeilBrown wrote:
> > > > > > > I've also re-arrange the patches a bit, merged two, and
> > > > > > > remove the
> > > > > > > restriction to TCP and NFSV4.x,x>=1.  Discussions seemed
> > > > > > > to suggest
> > > > > > > these restrictions were not needed, I can see no need.
> > > > > > 
> > > > > > I believe the need is for the correctness of retries.
> > > > > > Because NFSv2,
> > > > > > NFSv3 and NFSv4.0 have no exactly-once semantics of their
> > > > > > own, server
> > > > > > duplicate request caches are important (although often
> > > > > > imperfect).
> > > > > > These caches use client XID's, source ports and addresses,
> > > > > > sometimes
> > > > > > in addition to other methods, to detect retry. Existing
> > > > > > clients are
> > > > > > careful to reconnect with the same source port, to ensure
> > > > > > this. And
> > > > > > existing servers won't change.
> > > > > 
> > > > > Retries are already bound to the same connection so there
> > > > > shouldn't be
> > > > > an issue of a retransmission coming from a different source
> > > > > port.
> > > > I don't think the above is correct for NFSv4.0 (it may very
> > > > well be true for NFSv3).
> > > 
> > > It is correct for the Linux implementation of NFS, though the
> > > term
> > > "xprt" is more accurate than "connection".
> > > 
> > > A "task" is bound it a specific "xprt" which, in the case of tcp,
> > > has a
> > > fixed source port.  If the TCP connection breaks, a new one is
> > > created
> > > with the same addresses and ports, and this new connection serves
> > > the
> > > same xprt.
> > Ok, that's interesting. The FreeBSD client side krpc uses "xprt"s
> > too
> > (I assume they came from some old Sun open sources for RPC)
> > but it just creates a new socket and binds it to any port#
> > available.
> > When this happens in the FreeBSD client, the old connection is
> > sometimes still
> > sitting around in some FIN_WAIT state. My TCP is pretty minimal,
> > but I didn't
> > think you could safely create a new connection using the same
> > port#s at that point,
> > or at least the old BSD TCP stack code won't allow it.
> > 
> > Anyhow, the FreeBSD client doesn't use same source port# for the
> > new connection.
> > 
> > > > Here's what RFC7530 Sec. 3.1.1 says:
> > > > 3.1.1.  Client Retransmission Behavior
> > > > 
> > > >     When processing an NFSv4 request received over a reliable
> > > > transport
> > > >     such as TCP, the NFSv4 server MUST NOT silently drop the
> > > > request,
> > > >     except if the established transport connection has been
> > > > broken.
> > > >     Given such a contract between NFSv4 clients and servers,
> > > > clients MUST
> > > >     NOT retry a request unless one or both of the following are
> > > > true:
> > > > 
> > > >     o  The transport connection has been broken
> > > > 
> > > >     o  The procedure being retried is the NULL procedure
> > > > 
> > > > If the transport connection is broken, the retry needs to be
> > > > done on a new TCP
> > > > connection, does it not? (I'm assuming you are referring to a
> > > > retry of an RPC here.)
> > > > (My interpretation of "broken" is "can't be fixed, so the
> > > > client must use a different
> > > >   TCP connection.)
> > > 
> > > Yes, a new connection.  But the Linux client makes sure to use
> > > the same
> > > source port.
> > Ok. I guess my DRC code that expects "different source port#" for
> > NFSv4.0 is
> > broken. It will result in a DRC miss, which isn't great, but is
> > always possible for
> > any DRC design. (Not nearly as bad as a false hit.)
> > 
> > > > Also, NFSv4.0 cannot use Sun RPC over UDP, whereas some DRCs
> > > > only
> > > > work for UDP traffic. (The FreeBSD server does have DRC support
> > > > for TCP, but
> > > > the algorithm is very different than what is used for UDP, due
> > > > to the long delay
> > > > before a retried RPC request is received. This can result in
> > > > significant server
> > > > overheads, so some sites choose to disable the DRC for TCP
> > > > traffic or tune it
> > > > in such a way as it becomes almost useless.)
> > > > The FreeBSD DRC code for NFS over TCP expects the retry to be
> > > > from a different
> > > > port# (due to a new connection re: the above) for NFSv4.0. For
> > > > NFSv3, my best
> > > > recollection is that it doesn't care what the source port# is.
> > > > (It basically uses a
> > > > hash on the RPC request excluding TCP/IP header to recognize
> > > > possible
> > > > duplicates.)
> > > 
> > > Interesting .... hopefully the hash is sufficiently strong.
> > It doesn't just use the hash (it still expects same xid, etc), it
> > just doesn't use the TCP
> > source port#.
> > 
> > To be honest, when I played with this many years ago, unless the
> > size of the DRC
> > is very large and entries persist in the cache for a long time,
> > they always fall out
> > of the cache before the retry happens over TCP. At least for the
> > cases I tried back
> > then, where the RPC retry timeout for TCP was pretty large.
> > (Sites that use FreeBSD servers under heavy load usually find the
> > DRC grows too
> >   large and tune it down until it no longer would work for TCP
> > anyhow.)
> > 
> > My position is that this all got fixed by sessions and if someone
> > uses NFSv4.0 instead
> > of NFSv4.1, they may just have to live with the limitations of no
> > "exactly once"
> > semantics. (Personally, NFSv4.0 should just be deprecated. I know
> > people still have good uses for NFSv3, but I have trouble believing
> > NFSv4.0 is preferred over NFSv4.1,
> > although Bruce did note a case where there was a performance
> > difference.)
> > 
> > > I think it is best to assume same source port, but there is no
> > > formal
> > > standard.
> > I'd say you can't assume "same port#" or "different port#', since
> > there is no standard.
> > But I would agree that "assuming same port#" will just result in
> > false misses for
> > clients that don't use the same port#.
> 
> Hey Rick. I think the best summary is to say the traditional DRC is
> deeply flawed and can't fully protect this. Many of us, you and I
> included, have tried various ways to fix this, with varying degrees
> of success.
> 
> My point here is not perfection however. My point is, there are
> servers
> out there which will behave quite differently in the face of this
> proposed client behavior, and I'm raising the issue.

Tom, this set of patches does _not_ change client behaviour w.r.t.
replays in any way compared to before. I deliberately designed it not
to.

As others have already explained, the design does not change the
behaviour of reusing the same port when reconnecting any given xprt.
The client reuses exactly the same code that is currently used, where
there is only one xprt, to ensure that we first try to bind to the same
port we used before the connection was broken.
Furthermore, there is never a case where the client deliberately tries
to break the connection when there are outstanding RPC requests
(including when replaying NFSv2/v3 requests). Requests are always
replayed on the same xprt on which they were originally sent because
the purpose of this patchset has not been to provide fail-over
redundancy, but to attempt to improve performance in the case where the
server is responsive and able to scale.
Any TCP connection breakage happens from the server side (or from the
network itself), meaning that TIME_WAIT states are generally not a
problem. Any other issues with TCP reconnection are common to both the
existing code and the new code.
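
For illustration, here is a minimal user-space sketch of the socket-level
technique described above: remember the local port from the first connection
and bind() to it again before reconnecting.  It is not the SUNRPC xprtsock
code; reconnect_same_port and its parameters are invented.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int reconnect_same_port(const struct sockaddr_in *server,
			       in_port_t saved_src_port)
{
	struct sockaddr_in src;
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	/* Allow rebinding the port while an old connection lingers. */
	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

	memset(&src, 0, sizeof(src));
	src.sin_family = AF_INET;
	src.sin_addr.s_addr = htonl(INADDR_ANY);	/* address left to routing */
	src.sin_port = htons(saved_src_port);		/* but the port is pinned */

	if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0 ||
	    connect(fd, (const struct sockaddr *)server, sizeof(*server)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

The point is only that the source port, not the source address, is what gets
pinned across a reconnect.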

When we add dynamic management of the number of xprts per client (and
yes, I do still want to do that) then there will be DRC replay issues
with NFSv2/v3/v4.0 if we start removing xprts which have active
requests associated with them, so that needs to be done with care.
However the current patchset does not do dynamic management, so that
point is moot for now (using the word "moot" in the American, and not
the British sense).
Chuck Lever III May 31, 2019, 1:46 p.m. UTC | #20
> On May 30, 2019, at 6:56 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Thu, May 30 2019, Chuck Lever wrote:
> 
>> Hi Neil-
>> 
>> Thanks for chasing this a little further.
>> 
>> 
>>> On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
>>> 
>>> This patch set is based on the patches in the multipath_tcp branch of
>>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
>>> 
>>> I'd like to add my voice to those supporting this work and wanting to
>>> see it land.
>>> We have had customers/partners wanting this sort of functionality for
>>> years.  In SLES releases prior to SLE15, we've provide a
>>> "nosharetransport" mount option, so that several filesystem could be
>>> mounted from the same server and each would get its own TCP
>>> connection.
>> 
>> Is it well understood why splitting up the TCP connections result
>> in better performance?
>> 
>> 
>>> In SLE15 we are using this 'nconnect' feature, which is much nicer.
>>> 
>>> Partners have assured us that it improves total throughput,
>>> particularly with bonded networks, but we haven't had any concrete
>>> data until Olga Kornievskaia provided some concrete test data - thanks
>>> Olga!
>>> 
>>> My understanding, as I explain in one of the patches, is that parallel
>>> hardware is normally utilized by distributing flows, rather than
>>> packets.  This avoid out-of-order deliver of packets in a flow.
>>> So multiple flows are needed to utilizes parallel hardware.
>> 
>> Indeed.
>> 
>> However I think one of the problems is what happens in simpler scenarios.
>> We had reports that using nconnect > 1 on virtual clients made things
>> go slower. It's not always wise to establish multiple connections
>> between the same two IP addresses. It depends on the hardware on each
>> end, and the network conditions.
> 
> This is a good argument for leaving the default at '1'.  When
> documentation is added to nfs(5), we can make it clear that the optimal
> number is dependant on hardware.

Is there any visibility into the NIC hardware that can guide this setting?


>> What about situations where the network capabilities between server and
>> client change? Problem is that neither endpoint can detect that; TCP
>> usually just deals with it.
> 
> Being able to manually change (-o remount) the number of connections
> might be useful...

Ugh. I have problems with the administrative interface for this feature,
and this is one of them.

Another is what prevents your client from using a different nconnect=
setting on concurrent mounts of the same server? It's another case of a
per-mount setting being used to control a resource that is shared across
mounts.

Adding user tunables has never been known to increase the aggregate
amount of happiness in the universe. I really hope we can come up with
a better administrative interface... ideally, none would be best.


>> Related Work:
>> 
>> We now have protocol (more like conventions) for clients to discover
>> when a server has additional endpoints so that it can establish
>> connections to each of them.
>> 
>> https://datatracker.ietf.org/doc/rfc8587/
>> 
>> and
>> 
>> https://datatracker.ietf.org/doc/draft-ietf-nfsv4-rfc5661-msns-update/
>> 
>> Boiled down, the client uses fs_locations and trunking detection to
>> figure out when two IP addresses are the same server instance.
>> 
>> This facility can also be used to establish a connection over a
>> different path if network connectivity is lost.
>> 
>> There has also been some exploration of MP-TCP. The magic happens
>> under the transport socket in the network layer, and the RPC client
>> is not involved.
> 
> I would think that SCTP would be the best protocol for NFS to use as it
> supports multi-streaming - several independent streams.  That would
> require that hardware understands it of course.
> 
> Though I have examined MP-TCP closely, it looks like it is still fully
> sequenced, so it would be tricky for two RPC messages to be assembled
> into TCP frames completely independently - at least you would need
> synchronization on the sequence number.
> 
> Thanks for your thoughts,
> NeilBrown
> 
> 
>> 
>> 
>>> Comments most welcome.  I'd love to see this, or something similar,
>>> merged.
>>> 
>>> Thanks,
>>> NeilBrown
>>> 
>>> ---
>>> 
>>> NeilBrown (4):
>>>     NFS: send state management on a single connection.
>>>     SUNRPC: enhance rpc_clnt_show_stats() to report on all xprts.
>>>     SUNRPC: add links for all client xprts to debugfs
>>> 
>>> Trond Myklebust (5):
>>>     SUNRPC: Add basic load balancing to the transport switch
>>>     SUNRPC: Allow creation of RPC clients with multiple connections
>>>     NFS: Add a mount option to specify number of TCP connections to use
>>>     NFSv4: Allow multiple connections to NFSv4.x servers
>>>     pNFS: Allow multiple connections to the DS
>>>     NFS: Allow multiple connections to a NFSv2 or NFSv3 server
>>> 
>>> 
>>> fs/nfs/client.c                      |    3 +
>>> fs/nfs/internal.h                    |    2 +
>>> fs/nfs/nfs3client.c                  |    1 
>>> fs/nfs/nfs4client.c                  |   13 ++++-
>>> fs/nfs/nfs4proc.c                    |   22 +++++---
>>> fs/nfs/super.c                       |   12 ++++
>>> include/linux/nfs_fs_sb.h            |    1 
>>> include/linux/sunrpc/clnt.h          |    1 
>>> include/linux/sunrpc/sched.h         |    1 
>>> include/linux/sunrpc/xprt.h          |    1 
>>> include/linux/sunrpc/xprtmultipath.h |    2 +
>>> net/sunrpc/clnt.c                    |   98 ++++++++++++++++++++++++++++++++--
>>> net/sunrpc/debugfs.c                 |   46 ++++++++++------
>>> net/sunrpc/sched.c                   |    3 +
>>> net/sunrpc/stats.c                   |   15 +++--
>>> net/sunrpc/sunrpc.h                  |    3 +
>>> net/sunrpc/xprtmultipath.c           |   23 +++++++-
>>> 17 files changed, 204 insertions(+), 43 deletions(-)
>>> 
>>> --
>>> Signature
>>> 
>> 
>> --
>> Chuck Lever

--
Chuck Lever
J. Bruce Fields May 31, 2019, 3:38 p.m. UTC | #21
On Fri, May 31, 2019 at 09:46:32AM -0400, Chuck Lever wrote:
> Adding user tunables has never been known to increase the aggregate
> amount of happiness in the universe.

I need to add that to my review checklist: "will this patch increase the
aggregate amount of happiness in the universe?".

--b.
NeilBrown June 11, 2019, 1:09 a.m. UTC | #22
On Fri, May 31 2019, Chuck Lever wrote:

>> On May 30, 2019, at 6:56 PM, NeilBrown <neilb@suse.com> wrote:
>> 
>> On Thu, May 30 2019, Chuck Lever wrote:
>> 
>>> Hi Neil-
>>> 
>>> Thanks for chasing this a little further.
>>> 
>>> 
>>>> On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
>>>> 
>>>> This patch set is based on the patches in the multipath_tcp branch of
>>>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
>>>> 
>>>> I'd like to add my voice to those supporting this work and wanting to
>>>> see it land.
>>>> We have had customers/partners wanting this sort of functionality for
>>>> years.  In SLES releases prior to SLE15, we've provide a
>>>> "nosharetransport" mount option, so that several filesystem could be
>>>> mounted from the same server and each would get its own TCP
>>>> connection.
>>> 
>>> Is it well understood why splitting up the TCP connections result
>>> in better performance?
>>> 
>>> 
>>>> In SLE15 we are using this 'nconnect' feature, which is much nicer.
>>>> 
>>>> Partners have assured us that it improves total throughput,
>>>> particularly with bonded networks, but we haven't had any concrete
>>>> data until Olga Kornievskaia provided some concrete test data - thanks
>>>> Olga!
>>>> 
>>>> My understanding, as I explain in one of the patches, is that parallel
>>>> hardware is normally utilized by distributing flows, rather than
>>>> packets.  This avoid out-of-order deliver of packets in a flow.
>>>> So multiple flows are needed to utilizes parallel hardware.
>>> 
>>> Indeed.
>>> 
>>> However I think one of the problems is what happens in simpler scenarios.
>>> We had reports that using nconnect > 1 on virtual clients made things
>>> go slower. It's not always wise to establish multiple connections
>>> between the same two IP addresses. It depends on the hardware on each
>>> end, and the network conditions.
>> 
>> This is a good argument for leaving the default at '1'.  When
>> documentation is added to nfs(5), we can make it clear that the optimal
>> number is dependant on hardware.
>
> Is there any visibility into the NIC hardware that can guide this setting?
>

I doubt it, partly because there is more than just the NIC hardware at
issue.
There is also the server-side hardware and possibly hardware in the
middle.


>
>>> What about situations where the network capabilities between server and
>>> client change? Problem is that neither endpoint can detect that; TCP
>>> usually just deals with it.
>> 
>> Being able to manually change (-o remount) the number of connections
>> might be useful...
>
> Ugh. I have problems with the administrative interface for this feature,
> and this is one of them.
>
> Another is what prevents your client from using a different nconnect=
> setting on concurrent mounts of the same server? It's another case of a
> per-mount setting being used to control a resource that is shared across
> mounts.

I think that horse has well and truly bolted.
It would be nice to have a "server" abstraction visible to user-space
where we could adjust settings that make sense server-wide, and then a way
to mount individual filesystems from that "server" - but we don't.

Probably the best we can do is to document (in nfs(5)) which options are
per-server and which are per-mount.

>
> Adding user tunables has never been known to increase the aggregate
> amount of happiness in the universe. I really hope we can come up with
> a better administrative interface... ideally, none would be best.

I agree that none would be best.  It isn't clear to me that that is
possible.
At present, we really don't have enough experience with this
functionality to be able to say what the trade-offs are.
If we delay the functionality until we have the perfect interface,
we may never get that experience.

We can document "nconnect=" as a hint, and possibly add that
"nconnect=1" is a firm guarantee that more will not be used.
Then further down the track, we might change the actual number of
connections automatically if a way can be found to do that without cost.

Do you have any objections apart from the nconnect= mount option?

Thanks,
NeilBrown
Chuck Lever III June 11, 2019, 2:51 p.m. UTC | #23
Hi Neil-


> On Jun 10, 2019, at 9:09 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Fri, May 31 2019, Chuck Lever wrote:
> 
>>> On May 30, 2019, at 6:56 PM, NeilBrown <neilb@suse.com> wrote:
>>> 
>>> On Thu, May 30 2019, Chuck Lever wrote:
>>> 
>>>> Hi Neil-
>>>> 
>>>> Thanks for chasing this a little further.
>>>> 
>>>> 
>>>>> On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
>>>>> 
>>>>> This patch set is based on the patches in the multipath_tcp branch of
>>>>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
>>>>> 
>>>>> I'd like to add my voice to those supporting this work and wanting to
>>>>> see it land.
>>>>> We have had customers/partners wanting this sort of functionality for
>>>>> years.  In SLES releases prior to SLE15, we've provide a
>>>>> "nosharetransport" mount option, so that several filesystem could be
>>>>> mounted from the same server and each would get its own TCP
>>>>> connection.
>>>> 
>>>> Is it well understood why splitting up the TCP connections result
>>>> in better performance?
>>>> 
>>>> 
>>>>> In SLE15 we are using this 'nconnect' feature, which is much nicer.
>>>>> 
>>>>> Partners have assured us that it improves total throughput,
>>>>> particularly with bonded networks, but we haven't had any concrete
>>>>> data until Olga Kornievskaia provided some concrete test data - thanks
>>>>> Olga!
>>>>> 
>>>>> My understanding, as I explain in one of the patches, is that parallel
>>>>> hardware is normally utilized by distributing flows, rather than
>>>>> packets.  This avoid out-of-order deliver of packets in a flow.
>>>>> So multiple flows are needed to utilizes parallel hardware.
>>>> 
>>>> Indeed.
>>>> 
>>>> However I think one of the problems is what happens in simpler scenarios.
>>>> We had reports that using nconnect > 1 on virtual clients made things
>>>> go slower. It's not always wise to establish multiple connections
>>>> between the same two IP addresses. It depends on the hardware on each
>>>> end, and the network conditions.
>>> 
>>> This is a good argument for leaving the default at '1'.  When
>>> documentation is added to nfs(5), we can make it clear that the optimal
>>> number is dependant on hardware.
>> 
>> Is there any visibility into the NIC hardware that can guide this setting?
>> 
> 
> I doubt it, partly because there is more than just the NIC hardware at issue.
> There is also the server-side hardware and possibly hardware in the middle.

So the best guidance is YMMV. :-)


>>>> What about situations where the network capabilities between server and
>>>> client change? Problem is that neither endpoint can detect that; TCP
>>>> usually just deals with it.
>>> 
>>> Being able to manually change (-o remount) the number of connections
>>> might be useful...
>> 
>> Ugh. I have problems with the administrative interface for this feature,
>> and this is one of them.
>> 
>> Another is what prevents your client from using a different nconnect=
>> setting on concurrent mounts of the same server? It's another case of a
>> per-mount setting being used to control a resource that is shared across
>> mounts.
> 
> I think that horse has well and truly bolted.
> It would be nice to have a "server" abstraction visible to user-space
> where we could adjust settings that make sense server-wide, and then a way
> to mount individual filesystems from that "server" - but we don't.

Even worse, there will be some resource sharing between containers that
might be undesirable. The host should have ultimate control over those
resources.

But that is neither here nor there.


> Probably the best we can do is to document (in nfs(5)) which options are
> per-server and which are per-mount.

Alternately, the behavior of this option could be documented this way:

The default value is one. To resolve conflicts between nconnect settings on
different mount points to the same server, the value set on the first mount
applies until there are no more mounts of that server, unless nosharecache
is specified. When following a referral to another server, the nconnect
setting is inherited, but the effective value is determined by other mounts
of that server that are already in place.
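
For illustration only, here is a sketch of why the value on the first mount
wins when the client structure is shared.  Every name here (struct client,
client_cache, get_client) is hypothetical rather than the kernel's.

#include <stdbool.h>
#include <stdlib.h>
#include <netinet/in.h>

struct client {
	struct client		*next;
	struct sockaddr_in	server;		/* key: server address */
	unsigned int		nconnect;	/* fixed when the client is created */
	unsigned int		refcount;
};

static struct client *client_cache;

static struct client *get_client(const struct sockaddr_in *server,
				 unsigned int nconnect, bool share)
{
	struct client *clp;

	if (share) {
		for (clp = client_cache; clp; clp = clp->next) {
			if (clp->server.sin_addr.s_addr ==
			    server->sin_addr.s_addr) {
				clp->refcount++;
				/* nconnect from this mount is ignored */
				return clp;
			}
		}
	}

	/* First mount of this server (or sharing disabled): the value sticks. */
	clp = calloc(1, sizeof(*clp));
	if (!clp)
		return NULL;
	clp->server = *server;
	clp->nconnect = nconnect;
	clp->refcount = 1;
	clp->next = client_cache;
	client_cache = clp;
	return clp;
}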

I hate to say it, but the way to make this work deterministically is to
ask administrators to ensure that the setting is the same on all mounts
of the same server. Again I'd rather this take care of itself, but it
appears that is not going to be possible.


>> Adding user tunables has never been known to increase the aggregate
>> amount of happiness in the universe. I really hope we can come up with
>> a better administrative interface... ideally, none would be best.
> 
> I agree that none would be best.  It isn't clear to me that that is
> possible.
> At present, we really don't have enough experience with this
> functionality to be able to say what the trade-offs are.
> If we delay the functionality until we have the perfect interface,
> we may never get that experience.
> 
> We can document "nconnect=" as a hint, and possibly add that
> "nconnect=1" is a firm guarantee that more will not be used.

Agree that 1 should be the default. If we make this setting a
hint, then perhaps it should be renamed; nconnect makes it sound
like the client will always open N connections. How about "maxconn"?

Then, to better define the behavior:

The range of valid maxconn values is 1 to 3? to 8? to NCPUS? to the
count of the client’s NUMA nodes? I’d be in favor of a small number
to start with. Solaris' experience with multiple connections is that
there is very little benefit past 8.
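
As a sketch of what enforcing such a range might look like (the cap of 8 is
only the number mentioned above, not a decided limit, and parse_maxconn is a
hypothetical helper):

#define MAXCONN_CAP	8U

static unsigned int parse_maxconn(unsigned long value)
{
	/* Clamp out-of-range values rather than failing the mount. */
	if (value < 1)
		return 1;
	if (value > MAXCONN_CAP)
		return MAXCONN_CAP;
	return (unsigned int)value;
}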

If maxconn is specified with a datagram transport, does the mount
operation fail, or is the setting ignored?

If maxconn is a hint, when does the client open additional
connections?

IMO documentation should be clear that this setting is not for the
purpose of multipathing/trunking (using multiple NICs on the client
or server). The client has to do trunking detection/discovery in that
case, and nconnect doesn't add that logic. This is strictly for
enabling multiple connections between one client-server IP address
pair.

Do we need to state explicitly that all transport connections for a
mount (or client-server pair) are the same connection type (i.e., all
TCP or all RDMA, never a mix)?


> Then further down the track, we might change the actual number of
> connections automatically if a way can be found to do that without cost.

Fair enough.


> Do you have any objections apart from the nconnect= mount option?

Well I realize my last e-mail sounded a little negative, but I'm
actually in favor of adding the ability to open multiple connections
per client-server pair. I just want to be careful about making this
a feature that has as few downsides as possible right from the start.
I'll try to be more helpful in my responses.

Remaining implementation issues that IMO need to be sorted:

• We want to take care that the client can recover network resources
that have gone idle. Can we reuse the auto-close logic to close extra
connections?
• How will the client schedule requests on multiple connections?
Should we enable the use of different schedulers?
• How will retransmits be handled?
• How will the client recover from broken connections? Today's clients
use disconnect to determine when to retransmit, thus there might be
some unwanted interactions here that result in mount hangs.
• Assume NFSv4.1 session ID rather than client ID trunking: is Linux
client support in place for this already?
• Are there any concerns about how the Linux server DRC will behave in
multi-connection scenarios?

None of these seem like a deal breaker. And possibly several of these
are already decided, but just need to be published/documented.


--
Chuck Lever
Tom Talpey June 11, 2019, 3:05 p.m. UTC | #24
On 6/11/2019 10:51 AM, Chuck Lever wrote:
> Hi Neil-
> 
> 
>> On Jun 10, 2019, at 9:09 PM, NeilBrown <neilb@suse.com> wrote:
>>
>> On Fri, May 31 2019, Chuck Lever wrote:
>>
>>>> On May 30, 2019, at 6:56 PM, NeilBrown <neilb@suse.com> wrote:
>>>>
>>>> On Thu, May 30 2019, Chuck Lever wrote:
>>>>
>>>>> Hi Neil-
>>>>>
>>>>> Thanks for chasing this a little further.
>>>>>
>>>>>
>>>>>> On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
>>>>>>
>>>>>> This patch set is based on the patches in the multipath_tcp branch of
>>>>>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
>>>>>>
>>>>>> I'd like to add my voice to those supporting this work and wanting to
>>>>>> see it land.
>>>>>> We have had customers/partners wanting this sort of functionality for
>>>>>> years.  In SLES releases prior to SLE15, we've provide a
>>>>>> "nosharetransport" mount option, so that several filesystem could be
>>>>>> mounted from the same server and each would get its own TCP
>>>>>> connection.
>>>>>
>>>>> Is it well understood why splitting up the TCP connections result
>>>>> in better performance?
>>>>>
>>>>>
>>>>>> In SLE15 we are using this 'nconnect' feature, which is much nicer.
>>>>>>
>>>>>> Partners have assured us that it improves total throughput,
>>>>>> particularly with bonded networks, but we haven't had any concrete
>>>>>> data until Olga Kornievskaia provided some concrete test data - thanks
>>>>>> Olga!
>>>>>>
>>>>>> My understanding, as I explain in one of the patches, is that parallel
>>>>>> hardware is normally utilized by distributing flows, rather than
>>>>>> packets.  This avoid out-of-order deliver of packets in a flow.
>>>>>> So multiple flows are needed to utilizes parallel hardware.
>>>>>
>>>>> Indeed.
>>>>>
>>>>> However I think one of the problems is what happens in simpler scenarios.
>>>>> We had reports that using nconnect > 1 on virtual clients made things
>>>>> go slower. It's not always wise to establish multiple connections
>>>>> between the same two IP addresses. It depends on the hardware on each
>>>>> end, and the network conditions.
>>>>
>>>> This is a good argument for leaving the default at '1'.  When
>>>> documentation is added to nfs(5), we can make it clear that the optimal
>>>> number is dependant on hardware.
>>>
>>> Is there any visibility into the NIC hardware that can guide this setting?
>>>
>>
>> I doubt it, partly because there is more than just the NIC hardware at issue.
>> There is also the server-side hardware and possibly hardware in the middle.
> 
> So the best guidance is YMMV. :-)
> 
> 
>>>>> What about situations where the network capabilities between server and
>>>>> client change? Problem is that neither endpoint can detect that; TCP
>>>>> usually just deals with it.
>>>>
>>>> Being able to manually change (-o remount) the number of connections
>>>> might be useful...
>>>
>>> Ugh. I have problems with the administrative interface for this feature,
>>> and this is one of them.
>>>
>>> Another is what prevents your client from using a different nconnect=
>>> setting on concurrent mounts of the same server? It's another case of a
>>> per-mount setting being used to control a resource that is shared across
>>> mounts.
>>
>> I think that horse has well and truly bolted.
>> It would be nice to have a "server" abstraction visible to user-space
>> where we could adjust settings that make sense server-wide, and then a way
>> to mount individual filesystems from that "server" - but we don't.
> 
> Even worse, there will be some resource sharing between containers that
> might be undesirable. The host should have ultimate control over those
> resources.
> 
> But that is neither here nor there.
> 
> 
>> Probably the best we can do is to document (in nfs(5)) which options are
>> per-server and which are per-mount.
> 
> Alternately, the behavior of this option could be documented this way:
> 
> The default value is one. To resolve conflicts between nconnect settings on
> different mount points to the same server, the value set on the first mount
> applies until there are no more mounts of that server, unless nosharecache
> is specified. When following a referral to another server, the nconnect
> setting is inherited, but the effective value is determined by other mounts
> of that server that are already in place.
> 
> I hate to say it, but the way to make this work deterministically is to
> ask administrators to ensure that the setting is the same on all mounts
> of the same server. Again I'd rather this take care of itself, but it
> appears that is not going to be possible.
> 
> 
>>> Adding user tunables has never been known to increase the aggregate
>>> amount of happiness in the universe. I really hope we can come up with
>>> a better administrative interface... ideally, none would be best.
>>
>> I agree that none would be best.  It isn't clear to me that that is
>> possible.
>> At present, we really don't have enough experience with this
>> functionality to be able to say what the trade-offs are.
>> If we delay the functionality until we have the perfect interface,
>> we may never get that experience.
>>
>> We can document "nconnect=" as a hint, and possibly add that
>> "nconnect=1" is a firm guarantee that more will not be used.
> 
> Agree that 1 should be the default. If we make this setting a
> hint, then perhaps it should be renamed; nconnect makes it sound
> like the client will always open N connections. How about "maxconn" ?
> 
> Then, to better define the behavior:
> 
> The range of valid maxconn values is 1 to 3? to 8? to NCPUS? to the
> count of the client’s NUMA nodes? I’d be in favor of a small number
> to start with. Solaris' experience with multiple connections is that
> there is very little benefit past 8.

If it's of any help, the Windows SMB3 multichannel client limits itself
to 4. The benefit rises only slowly beyond that point, and the unpredictability
heads for the roof, especially when multiple NICs and network paths
are in play. The setting can be increased, but we discourage it for
anything but testing.

Tom.


> If maxconn is specified with a datagram transport, does the mount
> operation fail, or is the setting is ignored?
> 
> If maxconn is a hint, when does the client open additional
> connections?
> 
> IMO documentation should be clear that this setting is not for the
> purpose of multipathing/trunking (using multiple NICs on the client
> or server). The client has to do trunking detection/discovery in that
> case, and nconnect doesn't add that logic. This is strictly for
> enabling multiple connections between one client-server IP address
> pair.
> 
> Do we need to state explicitly that all transport connections for a
> mount (or client-server pair) are the same connection type (i.e., all
> TCP or all RDMA, never a mix)?
> 
> 
>> Then further down the track, we might change the actual number of
>> connections automatically if a way can be found to do that without cost.
> 
> Fair enough.
> 
> 
>> Do you have any objections apart from the nconnect= mount option?
> 
> Well I realize my last e-mail sounded a little negative, but I'm
> actually in favor of adding the ability to open multiple connections
> per client-server pair. I just want to be careful about making this
> a feature that has as few downsides as possible right from the start.
> I'll try to be more helpful in my responses.
> 
> Remaining implementation issues that IMO need to be sorted:
> 
> • We want to take care that the client can recover network resources
> that have gone idle. Can we reuse the auto-close logic to close extra
> connections?
> • How will the client schedule requests on multiple connections?
> Should we enable the use of different schedulers?
> • How will retransmits be handled?
> • How will the client recover from broken connections? Today's clients
> use disconnect to determine when to retransmit, thus there might be
> some unwanted interactions here that result in mount hangs.
> • Assume NFSv4.1 session ID rather than client ID trunking: is Linux
> client support in place for this already?
> • Are there any concerns about how the Linux server DRC will behave in
> multi-connection scenarios?
> 
> None of these seem like a deal breaker. And possibly several of these
> are already decided, but just need to be published/documented.
> 
> 
> --
> Chuck Lever
> 
> 
> 
> 
>
Trond Myklebust June 11, 2019, 3:20 p.m. UTC | #25
On Tue, 2019-06-11 at 10:51 -0400, Chuck Lever wrote:
> Hi Neil-
> 
> 
> > On Jun 10, 2019, at 9:09 PM, NeilBrown <neilb@suse.com> wrote:
> > 
> > On Fri, May 31 2019, Chuck Lever wrote:
> > 
> > > > On May 30, 2019, at 6:56 PM, NeilBrown <neilb@suse.com> wrote:
> > > > 
> > > > On Thu, May 30 2019, Chuck Lever wrote:
> > > > 
> > > > > Hi Neil-
> > > > > 
> > > > > Thanks for chasing this a little further.
> > > > > 
> > > > > 
> > > > > > On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com>
> > > > > > wrote:
> > > > > > 
> > > > > > This patch set is based on the patches in the multipath_tcp
> > > > > > branch of
> > > > > > git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
> > > > > > 
> > > > > > I'd like to add my voice to those supporting this work and
> > > > > > wanting to
> > > > > > see it land.
> > > > > > We have had customers/partners wanting this sort of
> > > > > > functionality for
> > > > > > years.  In SLES releases prior to SLE15, we've provide a
> > > > > > "nosharetransport" mount option, so that several filesystem
> > > > > > could be
> > > > > > mounted from the same server and each would get its own TCP
> > > > > > connection.
> > > > > 
> > > > > Is it well understood why splitting up the TCP connections
> > > > > result
> > > > > in better performance?
> > > > > 
> > > > > 
> > > > > > In SLE15 we are using this 'nconnect' feature, which is
> > > > > > much nicer.
> > > > > > 
> > > > > > Partners have assured us that it improves total throughput,
> > > > > > particularly with bonded networks, but we haven't had any
> > > > > > concrete
> > > > > > data until Olga Kornievskaia provided some concrete test
> > > > > > data - thanks
> > > > > > Olga!
> > > > > > 
> > > > > > My understanding, as I explain in one of the patches, is
> > > > > > that parallel
> > > > > > hardware is normally utilized by distributing flows, rather
> > > > > > than
> > > > > > packets.  This avoid out-of-order deliver of packets in a
> > > > > > flow.
> > > > > > So multiple flows are needed to utilizes parallel hardware.
> > > > > 
> > > > > Indeed.
> > > > > 
> > > > > However I think one of the problems is what happens in
> > > > > simpler scenarios.
> > > > > We had reports that using nconnect > 1 on virtual clients
> > > > > made things
> > > > > go slower. It's not always wise to establish multiple
> > > > > connections
> > > > > between the same two IP addresses. It depends on the hardware
> > > > > on each
> > > > > end, and the network conditions.
> > > > 
> > > > This is a good argument for leaving the default at '1'.  When
> > > > documentation is added to nfs(5), we can make it clear that the
> > > > optimal
> > > > number is dependant on hardware.
> > > 
> > > Is there any visibility into the NIC hardware that can guide this
> > > setting?
> > > 
> > 
> > I doubt it, partly because there is more than just the NIC hardware
> > at issue.
> > There is also the server-side hardware and possibly hardware in the
> > middle.
> 
> So the best guidance is YMMV. :-)
> 
> 
> > > > > What about situations where the network capabilities between
> > > > > server and
> > > > > client change? Problem is that neither endpoint can detect
> > > > > that; TCP
> > > > > usually just deals with it.
> > > > 
> > > > Being able to manually change (-o remount) the number of
> > > > connections
> > > > might be useful...
> > > 
> > > Ugh. I have problems with the administrative interface for this
> > > feature,
> > > and this is one of them.
> > > 
> > > Another is what prevents your client from using a different
> > > nconnect=
> > > setting on concurrent mounts of the same server? It's another
> > > case of a
> > > per-mount setting being used to control a resource that is shared
> > > across
> > > mounts.
> > 
> > I think that horse has well and truly bolted.
> > It would be nice to have a "server" abstraction visible to user-
> > space
> > where we could adjust settings that make sense server-wide, and
> > then a way
> > to mount individual filesystems from that "server" - but we don't.
> 
> Even worse, there will be some resource sharing between containers
> that
> might be undesirable. The host should have ultimate control over
> those
> resources.
> 
> But that is neither here nor there.

We can't and we don't normally share NFS resources between containers
unless they share a network namespace.

IOW: containers should normally work just fine with each container able
to control its own connections to any given server.

> 
> > Probably the best we can do is to document (in nfs(5)) which
> > options are
> > per-server and which are per-mount.
> 
> Alternately, the behavior of this option could be documented this
> way:
> 
> The default value is one. To resolve conflicts between nconnect
> settings on
> different mount points to the same server, the value set on the first
> mount
> applies until there are no more mounts of that server, unless
> nosharecache
> is specified. When following a referral to another server, the
> nconnect
> setting is inherited, but the effective value is determined by other
> mounts
> of that server that are already in place.
> 
> I hate to say it, but the way to make this work deterministically is
> to
> ask administrators to ensure that the setting is the same on all
> mounts
> of the same server. Again I'd rather this take care of itself, but it
> appears that is not going to be possible.
> 
> 
> > > Adding user tunables has never been known to increase the
> > > aggregate
> > > amount of happiness in the universe. I really hope we can come up
> > > with
> > > a better administrative interface... ideally, none would be best.
> > 
> > I agree that none would be best.  It isn't clear to me that that is
> > possible.
> > At present, we really don't have enough experience with this
> > functionality to be able to say what the trade-offs are.
> > If we delay the functionality until we have the perfect interface,
> > we may never get that experience.
> > 
> > We can document "nconnect=" as a hint, and possibly add that
> > "nconnect=1" is a firm guarantee that more will not be used.
> 
> Agree that 1 should be the default. If we make this setting a
> hint, then perhaps it should be renamed; nconnect makes it sound
> like the client will always open N connections. How about "maxconn" ?
> 
> Then, to better define the behavior:
> 
> The range of valid maxconn values is 1 to 3? to 8? to NCPUS? to the
> count of the client’s NUMA nodes? I’d be in favor of a small number
> to start with. Solaris' experience with multiple connections is that
> there is very little benefit past 8.
> 
> If maxconn is specified with a datagram transport, does the mount
> operation fail, or is the setting is ignored?

It is ignored.

> If maxconn is a hint, when does the client open additional
> connections?

As I've already stated, that functionality is not yet available. When
it is, it will be under the control of a userspace daemon that can
decide on a policy in accordance with a set of user-specified
requirements.

> IMO documentation should be clear that this setting is not for the
> purpose of multipathing/trunking (using multiple NICs on the client
> or server). The client has to do trunking detection/discovery in that
> case, and nconnect doesn't add that logic. This is strictly for
> enabling multiple connections between one client-server IP address
> pair.
> 
> Do we need to state explicitly that all transport connections for a
> mount (or client-server pair) are the same connection type (i.e., all
> TCP or all RDMA, never a mix)?
> 
> 
> > Then further down the track, we might change the actual number of
> > connections automatically if a way can be found to do that without
> > cost.
> 
> Fair enough.
> 
> 
> > Do you have any objections apart from the nconnect= mount option?
> 
> Well I realize my last e-mail sounded a little negative, but I'm
> actually in favor of adding the ability to open multiple connections
> per client-server pair. I just want to be careful about making this
> a feature that has as few downsides as possible right from the start.
> I'll try to be more helpful in my responses.
> 
> Remaining implementation issues that IMO need to be sorted:
> 
> • We want to take care that the client can recover network resources
> that have gone idle. Can we reuse the auto-close logic to close extra
> connections?
> • How will the client schedule requests on multiple connections?
> Should we enable the use of different schedulers?
> • How will retransmits be handled?
> • How will the client recover from broken connections? Today's
> clients
> use disconnect to determine when to retransmit, thus there might be
> some unwanted interactions here that result in mount hangs.
> • Assume NFSv4.1 session ID rather than client ID trunking: is Linux
> client support in place for this already?
> • Are there any concerns about how the Linux server DRC will behave
> in
> multi-connection scenarios?


Round and round the arguments go....

Please see the earlier answers to all these questions.

> None of these seem like a deal breaker. And possibly several of these
> are already decided, but just need to be published/documented.
> 
> 
> --
> Chuck Lever
> 
> 
>
Olga Kornievskaia June 11, 2019, 3:34 p.m. UTC | #26
On Tue, Jun 11, 2019 at 10:52 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> Hi Neil-
>
>
> > On Jun 10, 2019, at 9:09 PM, NeilBrown <neilb@suse.com> wrote:
> >
> > On Fri, May 31 2019, Chuck Lever wrote:
> >
> >>> On May 30, 2019, at 6:56 PM, NeilBrown <neilb@suse.com> wrote:
> >>>
> >>> On Thu, May 30 2019, Chuck Lever wrote:
> >>>
> >>>> Hi Neil-
> >>>>
> >>>> Thanks for chasing this a little further.
> >>>>
> >>>>
> >>>>> On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
> >>>>>
> >>>>> This patch set is based on the patches in the multipath_tcp branch of
> >>>>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
> >>>>>
> >>>>> I'd like to add my voice to those supporting this work and wanting to
> >>>>> see it land.
> >>>>> We have had customers/partners wanting this sort of functionality for
> >>>>> years.  In SLES releases prior to SLE15, we've provide a
> >>>>> "nosharetransport" mount option, so that several filesystem could be
> >>>>> mounted from the same server and each would get its own TCP
> >>>>> connection.
> >>>>
> >>>> Is it well understood why splitting up the TCP connections result
> >>>> in better performance?
> >>>>
> >>>>
> >>>>> In SLE15 we are using this 'nconnect' feature, which is much nicer.
> >>>>>
> >>>>> Partners have assured us that it improves total throughput,
> >>>>> particularly with bonded networks, but we haven't had any concrete
> >>>>> data until Olga Kornievskaia provided some concrete test data - thanks
> >>>>> Olga!
> >>>>>
> >>>>> My understanding, as I explain in one of the patches, is that parallel
> >>>>> hardware is normally utilized by distributing flows, rather than
> >>>>> packets.  This avoid out-of-order deliver of packets in a flow.
> >>>>> So multiple flows are needed to utilizes parallel hardware.
> >>>>
> >>>> Indeed.
> >>>>
> >>>> However I think one of the problems is what happens in simpler scenarios.
> >>>> We had reports that using nconnect > 1 on virtual clients made things
> >>>> go slower. It's not always wise to establish multiple connections
> >>>> between the same two IP addresses. It depends on the hardware on each
> >>>> end, and the network conditions.
> >>>
> >>> This is a good argument for leaving the default at '1'.  When
> >>> documentation is added to nfs(5), we can make it clear that the optimal
> >>> number is dependant on hardware.
> >>
> >> Is there any visibility into the NIC hardware that can guide this setting?
> >>
> >
> > I doubt it, partly because there is more than just the NIC hardware at issue.
> > There is also the server-side hardware and possibly hardware in the middle.
>
> So the best guidance is YMMV. :-)
>
>
> >>>> What about situations where the network capabilities between server and
> >>>> client change? Problem is that neither endpoint can detect that; TCP
> >>>> usually just deals with it.
> >>>
> >>> Being able to manually change (-o remount) the number of connections
> >>> might be useful...
> >>
> >> Ugh. I have problems with the administrative interface for this feature,
> >> and this is one of them.
> >>
> >> Another is what prevents your client from using a different nconnect=
> >> setting on concurrent mounts of the same server? It's another case of a
> >> per-mount setting being used to control a resource that is shared across
> >> mounts.
> >
> > I think that horse has well and truly bolted.
> > It would be nice to have a "server" abstraction visible to user-space
> > where we could adjust settings that make sense server-wide, and then a way
> > to mount individual filesystems from that "server" - but we don't.
>
> Even worse, there will be some resource sharing between containers that
> might be undesirable. The host should have ultimate control over those
> resources.
>
> But that is neither here nor there.
>
>
> > Probably the best we can do is to document (in nfs(5)) which options are
> > per-server and which are per-mount.
>
> Alternately, the behavior of this option could be documented this way:
>
> The default value is one. To resolve conflicts between nconnect settings on
> different mount points to the same server, the value set on the first mount
> applies until there are no more mounts of that server, unless nosharecache
> is specified. When following a referral to another server, the nconnect
> setting is inherited, but the effective value is determined by other mounts
> of that server that are already in place.
>
> I hate to say it, but the way to make this work deterministically is to
> ask administrators to ensure that the setting is the same on all mounts
> of the same server. Again I'd rather this take care of itself, but it
> appears that is not going to be possible.
>
>
> >> Adding user tunables has never been known to increase the aggregate
> >> amount of happiness in the universe. I really hope we can come up with
> >> a better administrative interface... ideally, none would be best.
> >
> > I agree that none would be best.  It isn't clear to me that that is
> > possible.
> > At present, we really don't have enough experience with this
> > functionality to be able to say what the trade-offs are.
> > If we delay the functionality until we have the perfect interface,
> > we may never get that experience.
> >
> > We can document "nconnect=" as a hint, and possibly add that
> > "nconnect=1" is a firm guarantee that more will not be used.
>
> Agree that 1 should be the default. If we make this setting a
> hint, then perhaps it should be renamed; nconnect makes it sound
> like the client will always open N connections. How about "maxconn" ?

"maxconn" sounds to me like it's possible that the code would choose a
number that's less than that which I think would be misleading given
that the implementation (as is now) will open the specified number of
connection (bounded by the hard coded default we currently have set at
some value X which I'm in favor is increasing from 16 to 32).

> Then, to better define the behavior:
>
> The range of valid maxconn values is 1 to 3? to 8? to NCPUS? to the
> count of the client’s NUMA nodes? I’d be in favor of a small number
> to start with. Solaris' experience with multiple connections is that
> there is very little benefit past 8.

My Linux-to-Linux experience has been that there is benefit in having
more than 8 connections. I have previously posted results that went up
to 10 connections (it's on my list of things to test up to 16). The
NetApp performance lab maxed out the 25G connection setup they were
using, so they didn't experiment with nconnect=8, but there is no
evidence that with a larger network pipe performance would stop
improving.

Given the existing performance studies, I would like to argue that
such low values are not warranted.

> If maxconn is specified with a datagram transport, does the mount
> operation fail, or is the setting is ignored?

Perhaps the mount command can print a warning saying that the option
is ignored, but still let the mount succeed.

> If maxconn is a hint, when does the client open additional
> connections?
>
> IMO documentation should be clear that this setting is not for the
> purpose of multipathing/trunking (using multiple NICs on the client
> or server). The client has to do trunking detection/discovery in that
> case, and nconnect doesn't add that logic. This is strictly for
> enabling multiple connections between one client-server IP address
> pair.

I agree this should be, as that last statement says, multiple
connections to the same IP, and in my opinion this shouldn't be a hint.

> Do we need to state explicitly that all transport connections for a
> mount (or client-server pair) are the same connection type (i.e., all
> TCP or all RDMA, never a mix)?

That might be an interesting future option, but I think for now we can
clearly say in the documentation that it's a TCP-only option, which can
always be changed if an extension to that functionality is implemented.

> > Then further down the track, we might change the actual number of
> > connections automatically if a way can be found to do that without cost.
>
> Fair enough.
>
>
> > Do you have any objections apart from the nconnect= mount option?
>
> Well I realize my last e-mail sounded a little negative, but I'm
> actually in favor of adding the ability to open multiple connections
> per client-server pair. I just want to be careful about making this
> a feature that has as few downsides as possible right from the start.
> I'll try to be more helpful in my responses.
>
> Remaining implementation issues that IMO need to be sorted:

I'm curious: are you saying all of this needs to be resolved before we
consider including this functionality? These are excellent questions,
but I think they imply some complex enhancements (like the ability to
use different schedulers and not only round robin) that are
"enhancements" and not requirements.

> • We want to take care that the client can recover network resources
> that have gone idle. Can we reuse the auto-close logic to close extra
> connections?
Since we are using a round-robin scheduler, can we consider any
resources to be going idle? It's hard to know the future: we might set
a timer after which we say a connection has been idle long enough and
close it, and as soon as that happens the traffic is generated again
and we have to pay the penalty of establishing a new connection before
sending it.
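
To make that trade-off concrete, here is a minimal userspace sketch of
the kind of idle-close policy being discussed (purely illustrative; the
names and the timeout value are made up, and this is not the kernel
auto-close code):

/* Purely illustrative userspace sketch -- not the kernel auto-close code.
 * A connection is closed once it has seen no traffic for IDLE_TIMEOUT
 * seconds; the next request then pays the reconnect penalty. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define IDLE_TIMEOUT 300        /* made-up value for illustration */

struct conn {
    bool   open;
    time_t last_used;
};

static void reap_if_idle(struct conn *c, time_t now)
{
    if (c->open && now - c->last_used > IDLE_TIMEOUT) {
        c->open = false;                 /* close the idle connection */
        printf("closed idle connection\n");
    }
}

static void send_request(struct conn *c, time_t now)
{
    if (!c->open) {
        printf("re-establishing connection\n");  /* the penalty above */
        c->open = true;
    }
    c->last_used = now;
}

int main(void)
{
    time_t now = time(NULL);
    struct conn c = { .open = true, .last_used = now - 400 };

    reap_if_idle(&c, now);      /* idle longer than the timeout: closed */
    send_request(&c, now);      /* traffic resumes at once: reconnect */
    return 0;
}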

> • How will the client schedule requests on multiple connections?
> Should we enable the use of different schedulers?
That's an interesting idea, but I don't think it should stop the
round-robin solution from going through.
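
For reference, the round-robin policy we are talking about is nothing
more than the following (an illustrative userspace sketch, not the
actual SUNRPC transport-switch code; the names are made up):

/* Purely illustrative sketch of round-robin transport selection; the
 * real SUNRPC transport switch is more involved, but the policy itself
 * is this simple: each request takes the next connection in the ring. */
#include <stdio.h>

#define NCONNECT 4              /* number of connections, e.g. nconnect=4 */

static unsigned int cursor;     /* position in the ring */

static unsigned int pick_connection(void)
{
    return cursor++ % NCONNECT;
}

int main(void)
{
    for (int req = 0; req < 8; req++)
        printf("request %d -> connection %u\n", req, pick_connection());
    return 0;
}

In a sketch like this, swapping in a different scheduler only means
changing pick_connection(), which is why I don't think the scheduler
question needs to block the round-robin version.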

> • How will retransmits be handled?
> • How will the client recover from broken connections? Today's clients
> use disconnect to determine when to retransmit, thus there might be
> some unwanted interactions here that result in mount hangs.
> • Assume NFSv4.1 session ID rather than client ID trunking: is Linux
> client support in place for this already?
> • Are there any concerns about how the Linux server DRC will behave in
> multi-connection scenarios?

I think we've talked about the retransmission question.
Retransmissions are handled by the existing logic and are done on the
same transport (i.e. connection).
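
Concretely, what I mean is something like this sketch (illustrative
only, with made-up names): the request remembers which connection it
was first sent on, and any retransmit goes back to that same
connection.

/* Purely illustrative sketch with made-up names: a request is bound to
 * the connection it was first sent on, and a retransmit reuses it. */
#include <stdio.h>

struct request {
    int xid;        /* transaction id */
    int conn;       /* connection the request was first sent on */
};

static void transmit(const struct request *req)
{
    printf("xid %d sent on connection %d\n", req->xid, req->conn);
}

int main(void)
{
    struct request req = { .xid = 42, .conn = 2 };

    transmit(&req);     /* initial send */
    transmit(&req);     /* timeout: the retransmit stays on connection 2 */
    return 0;
}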

> None of these seem like a deal breaker. And possibly several of these
> are already decided, but just need to be published/documented.
>
>
> --
> Chuck Lever
>
>
>
Chuck Lever III June 11, 2019, 3:35 p.m. UTC | #27
> On Jun 11, 2019, at 11:20 AM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Tue, 2019-06-11 at 10:51 -0400, Chuck Lever wrote:
> 
>> If maxconn is a hint, when does the client open additional
>> connections?
> 
> As I've already stated, that functionality is not yet available. When
> it is, it will be under the control of a userspace daemon that can
> decide on a policy in accordance with a set of user specified
> requirements.

Then why do we need a mount option at all?


--
Chuck Lever
Trond Myklebust June 11, 2019, 4:41 p.m. UTC | #28
On Tue, 2019-06-11 at 11:35 -0400, Chuck Lever wrote:
> > On Jun 11, 2019, at 11:20 AM, Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > 
> > On Tue, 2019-06-11 at 10:51 -0400, Chuck Lever wrote:
> > 
> > > If maxconn is a hint, when does the client open additional
> > > connections?
> > 
> > As I've already stated, that functionality is not yet available.
> > When
> > it is, it will be under the control of a userspace daemon that can
> > decide on a policy in accordance with a set of user specified
> > requirements.
> 
> Then why do we need a mount option at all?
> 

For one thing, it allows people to play with this until we have a fully
automated solution. The fact that people are actually pulling down
these patches, forward porting them and trying them out would indicate
that there is interest in doing so.

Secondly, if your policy is 'I just want n connections' because that
fits your workload requirements (e.g. because said workload is both
latency sensitive and bursty), then a daemon solution would be
unnecessary, and may be error prone.
A mount option is helpful in this case, because you can perform the
setup through the normal fstab or autofs config file route. It also
makes sense if you have an nfsroot setup.
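
For example, an fstab entry using the option name as proposed in this
series might look like the following (the server, export path and
value are only illustrative):

server.example.com:/export  /mnt/export  nfs  nfsvers=4.1,nconnect=4  0 0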

Finally, even if you do want to have a daemon manage your transport
configuration, you do want a mechanism to help it reach an equilibrium
state quickly. Connections take time to bring up and tear down, because
the performance measurements that drive those decisions take time to
build up sufficient statistical precision. Furthermore, doing so comes
with a number of hidden costs,
e.g.: chewing up privileged port numbers by putting them in a TIME_WAIT
state. If you know that a given server is always subject to heavy
traffic, then initialising the number of connections appropriately has
value.
Chuck Lever III June 11, 2019, 5:32 p.m. UTC | #29
> On Jun 11, 2019, at 12:41 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Tue, 2019-06-11 at 11:35 -0400, Chuck Lever wrote:
>>> On Jun 11, 2019, at 11:20 AM, Trond Myklebust <
>>> trondmy@hammerspace.com> wrote:
>>> 
>>> On Tue, 2019-06-11 at 10:51 -0400, Chuck Lever wrote:
>>> 
>>>> If maxconn is a hint, when does the client open additional
>>>> connections?
>>> 
>>> As I've already stated, that functionality is not yet available.
>>> When
>>> it is, it will be under the control of a userspace daemon that can
>>> decide on a policy in accordance with a set of user specified
>>> requirements.
>> 
>> Then why do we need a mount option at all?
>> 
> 
> For one thing, it allows people to play with this until we have a fully
> automated solution. The fact that people are actually pulling down
> these patches, forward porting them and trying them out would indicate
> that there is interest in doing so.

Agreed that it demonstrates that folks are interested in having
multiple connections. I count myself among them.


> Secondly, if your policy is 'I just want n connections' because that
> fits your workload requirements (e.g. because said workload is both
> latency sensitive and bursty), then a daemon solution would be
> unnecessary, and may be error prone.

Why wouldn't that be the default out-of-the-shrinkwrap configuration
that is installed by nfs-utils?


> A mount option is helpful in this case, because you can perform the
> setup through the normal fstab or autofs config file configuration
> route. It also make sense if you have a nfsroot setup.

NFSROOT is the only usage scenario where I see a mount option being
a superior administrative interface. However I don't feel that
NFSROOT is going to host workloads that would need multiple
connections. KIS


> Finally, even if you do want to have a daemon manage your transport,
> configuration, you do want a mechanism to help it reach an equilibrium
> state quickly. Connections take time to bring up and tear down because
> performance measurements take time to build up sufficient statistical
> precision. Furthermore, doing so comes with a number of hidden costs,
> e.g.: chewing up privileged port numbers by putting them in a TIME_WAIT
> state. If you know that a given server is always subject to heavy
> traffic, then initialising the number of connections appropriately has
> value.

Again, I don't see how this is not something a config file can do.

The stated intent of "nconnect" way back when was for experimentation.
It works great for that!

I don't see it as a desirable long-term administrative interface,
though. I'd rather not nail in a new mount option that we actually
plan to obsolete in favor of an automated mechanism. I'd rather see
us design the administrative interface with automation from the
start. That will have a lower long-term maintenance cost.

Again, I'm not objecting to support for multiple connections. It's
just that adding a mount option doesn't feel like a friendly or
finished interface for actual users. A config file (or re-using
nfs.conf) seems to me like a better approach.


--
Chuck Lever
Trond Myklebust June 11, 2019, 5:44 p.m. UTC | #30
On Tue, 2019-06-11 at 13:32 -0400, Chuck Lever wrote:
> > On Jun 11, 2019, at 12:41 PM, Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > 
> > On Tue, 2019-06-11 at 11:35 -0400, Chuck Lever wrote:
> > > > On Jun 11, 2019, at 11:20 AM, Trond Myklebust <
> > > > trondmy@hammerspace.com> wrote:
> > > > 
> > > > On Tue, 2019-06-11 at 10:51 -0400, Chuck Lever wrote:
> > > > 
> > > > > If maxconn is a hint, when does the client open additional
> > > > > connections?
> > > > 
> > > > As I've already stated, that functionality is not yet
> > > > available.
> > > > When
> > > > it is, it will be under the control of a userspace daemon that
> > > > can
> > > > decide on a policy in accordance with a set of user specified
> > > > requirements.
> > > 
> > > Then why do we need a mount option at all?
> > > 
> > 
> > For one thing, it allows people to play with this until we have a
> > fully
> > automated solution. The fact that people are actually pulling down
> > these patches, forward porting them and trying them out would
> > indicate
> > that there is interest in doing so.
> 
> Agreed that it demonstrates that folks are interested in having
> multiple connections. I count myself among them.
> 
> 
> > Secondly, if your policy is 'I just want n connections' because
> > that
> > fits your workload requirements (e.g. because said workload is both
> > latency sensitive and bursty), then a daemon solution would be
> > unnecessary, and may be error prone.
> 
> Why wouldn't that be the default out-of-the-shrinkwrap configuration
> that is installed by nfs-utils?

What is the point of forcing people to run a daemon if all they want to
do is set up a fixed number of connections?

> 
> > A mount option is helpful in this case, because you can perform the
> > setup through the normal fstab or autofs config file configuration
> > route. It also make sense if you have a nfsroot setup.
> 
> NFSROOT is the only usage scenario where I see a mount option being
> a superior administrative interface. However I don't feel that
> NFSROOT is going to host workloads that would need multiple
> connections. KIS
> 
> 
> > Finally, even if you do want to have a daemon manage your
> > transport,
> > configuration, you do want a mechanism to help it reach an
> > equilibrium
> > state quickly. Connections take time to bring up and tear down
> > because
> > performance measurements take time to build up sufficient
> > statistical
> > precision. Furthermore, doing so comes with a number of hidden
> > costs,
> > e.g.: chewing up privileged port numbers by putting them in a
> > TIME_WAIT
> > state. If you know that a given server is always subject to heavy
> > traffic, then initialising the number of connections appropriately
> > has
> > value.
> 
> Again, I don't see how this is not something a config file can do.

You can, but that means you have to keep said config file up to date
with the contents of /etc/fstab etc. Pulverising configuration into
little bits and pieces that are scattered around in different files is
not a user friendly interface either.

> The stated intent of "nconnect" way back when was for
> experimentation.
> It works great for that!
> 
> I don't see it as a desirable long-term administrative interface,
> though. I'd rather not nail in a new mount option that we actually
> plan to obsolete in favor of an automated mechanism. I'd rather see
> us design the administrative interface with automation from the
> start. That will have a lower long-term maintenance cost.
> 
> Again, I'm not objecting to support for multiple connections. It's
> just that adding a mount option doesn't feel like a friendly or
> finished interface for actual users. A config file (or re-using
> nfs.conf) seems to me like a better approach.

nfs.conf is great for defining global defaults.

It can do server specific configuration, but is not a popular solution
for that. Most people are still putting that information in /etc/fstab
so that it appears in one spot.
Chuck Lever III June 11, 2019, 5:46 p.m. UTC | #31
> On Jun 11, 2019, at 11:34 AM, Olga Kornievskaia <aglo@umich.edu> wrote:
> 
> On Tue, Jun 11, 2019 at 10:52 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> Hi Neil-
>> 
>> 
>>> On Jun 10, 2019, at 9:09 PM, NeilBrown <neilb@suse.com> wrote:
>>> 
>>> On Fri, May 31 2019, Chuck Lever wrote:
>>> 
>>>>> On May 30, 2019, at 6:56 PM, NeilBrown <neilb@suse.com> wrote:
>>>>> 
>>>>> On Thu, May 30 2019, Chuck Lever wrote:
>>>>> 
>>>>>> Hi Neil-
>>>>>> 
>>>>>> Thanks for chasing this a little further.
>>>>>> 
>>>>>> 
>>>>>>> On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
>>>>>>> 
>>>>>>> This patch set is based on the patches in the multipath_tcp branch of
>>>>>>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
>>>>>>> 
>>>>>>> I'd like to add my voice to those supporting this work and wanting to
>>>>>>> see it land.
>>>>>>> We have had customers/partners wanting this sort of functionality for
>>>>>>> years.  In SLES releases prior to SLE15, we've provide a
>>>>>>> "nosharetransport" mount option, so that several filesystem could be
>>>>>>> mounted from the same server and each would get its own TCP
>>>>>>> connection.
>>>>>> 
>>>>>> Is it well understood why splitting up the TCP connections result
>>>>>> in better performance?
>>>>>> 
>>>>>> 
>>>>>>> In SLE15 we are using this 'nconnect' feature, which is much nicer.
>>>>>>> 
>>>>>>> Partners have assured us that it improves total throughput,
>>>>>>> particularly with bonded networks, but we haven't had any concrete
>>>>>>> data until Olga Kornievskaia provided some concrete test data - thanks
>>>>>>> Olga!
>>>>>>> 
>>>>>>> My understanding, as I explain in one of the patches, is that parallel
>>>>>>> hardware is normally utilized by distributing flows, rather than
>>>>>>> packets.  This avoid out-of-order deliver of packets in a flow.
>>>>>>> So multiple flows are needed to utilizes parallel hardware.
>>>>>> 
>>>>>> Indeed.
>>>>>> 
>>>>>> However I think one of the problems is what happens in simpler scenarios.
>>>>>> We had reports that using nconnect > 1 on virtual clients made things
>>>>>> go slower. It's not always wise to establish multiple connections
>>>>>> between the same two IP addresses. It depends on the hardware on each
>>>>>> end, and the network conditions.
>>>>> 
>>>>> This is a good argument for leaving the default at '1'.  When
>>>>> documentation is added to nfs(5), we can make it clear that the optimal
>>>>> number is dependant on hardware.
>>>> 
>>>> Is there any visibility into the NIC hardware that can guide this setting?
>>>> 
>>> 
>>> I doubt it, partly because there is more than just the NIC hardware at issue.
>>> There is also the server-side hardware and possibly hardware in the middle.
>> 
>> So the best guidance is YMMV. :-)
>> 
>> 
>>>>>> What about situations where the network capabilities between server and
>>>>>> client change? Problem is that neither endpoint can detect that; TCP
>>>>>> usually just deals with it.
>>>>> 
>>>>> Being able to manually change (-o remount) the number of connections
>>>>> might be useful...
>>>> 
>>>> Ugh. I have problems with the administrative interface for this feature,
>>>> and this is one of them.
>>>> 
>>>> Another is what prevents your client from using a different nconnect=
>>>> setting on concurrent mounts of the same server? It's another case of a
>>>> per-mount setting being used to control a resource that is shared across
>>>> mounts.
>>> 
>>> I think that horse has well and truly bolted.
>>> It would be nice to have a "server" abstraction visible to user-space
>>> where we could adjust settings that make sense server-wide, and then a way
>>> to mount individual filesystems from that "server" - but we don't.
>> 
>> Even worse, there will be some resource sharing between containers that
>> might be undesirable. The host should have ultimate control over those
>> resources.
>> 
>> But that is neither here nor there.
>> 
>> 
>>> Probably the best we can do is to document (in nfs(5)) which options are
>>> per-server and which are per-mount.
>> 
>> Alternately, the behavior of this option could be documented this way:
>> 
>> The default value is one. To resolve conflicts between nconnect settings on
>> different mount points to the same server, the value set on the first mount
>> applies until there are no more mounts of that server, unless nosharecache
>> is specified. When following a referral to another server, the nconnect
>> setting is inherited, but the effective value is determined by other mounts
>> of that server that are already in place.
>> 
>> I hate to say it, but the way to make this work deterministically is to
>> ask administrators to ensure that the setting is the same on all mounts
>> of the same server. Again I'd rather this take care of itself, but it
>> appears that is not going to be possible.
>> 
>> 
>>>> Adding user tunables has never been known to increase the aggregate
>>>> amount of happiness in the universe. I really hope we can come up with
>>>> a better administrative interface... ideally, none would be best.
>>> 
>>> I agree that none would be best.  It isn't clear to me that that is
>>> possible.
>>> At present, we really don't have enough experience with this
>>> functionality to be able to say what the trade-offs are.
>>> If we delay the functionality until we have the perfect interface,
>>> we may never get that experience.
>>> 
>>> We can document "nconnect=" as a hint, and possibly add that
>>> "nconnect=1" is a firm guarantee that more will not be used.
>> 
>> Agree that 1 should be the default. If we make this setting a
>> hint, then perhaps it should be renamed; nconnect makes it sound
>> like the client will always open N connections. How about "maxconn" ?
> 
> "maxconn" sounds to me like it's possible that the code would choose a
> number that's less than that which I think would be misleading given
> that the implementation (as is now) will open the specified number of
> connection (bounded by the hard coded default we currently have set at
> some value X which I'm in favor is increasing from 16 to 32).

Earlier in this thread, Neil proposed to make nconnect a hint. Sounds
like the long term plan is to allow "up to N" connections, with some
mechanism to create new connections on demand. "maxconn" fits that idea
better, though I'd prefer no new mount options... the point being that
eventually, this setting is likely to be an upper bound rather than a
fixed value.
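
To be concrete about "an upper bound rather than a fixed value", the
eventual behavior could look roughly like this sketch (purely
illustrative; no such mechanism exists in the current patches, and the
names and thresholds are invented):

/* Purely illustrative sketch of "up to N" semantics: connections are
 * added on demand when a transmit backlog builds up, bounded by the
 * maxconn value, rather than all being opened at mount time. */
#include <stdio.h>

#define BACKLOG_THRESHOLD 16    /* invented value for illustration */

struct client {
    int maxconn;    /* upper bound from the mount option */
    int nconns;     /* connections currently open */
    int backlog;    /* requests queued waiting to be transmitted */
};

static void maybe_add_connection(struct client *clnt)
{
    if (clnt->backlog > BACKLOG_THRESHOLD && clnt->nconns < clnt->maxconn) {
        clnt->nconns++;
        printf("opened connection %d of %d\n", clnt->nconns, clnt->maxconn);
    }
}

int main(void)
{
    struct client clnt = { .maxconn = 4, .nconns = 1, .backlog = 20 };

    maybe_add_connection(&clnt);    /* backlog is high: grow toward maxconn */
    clnt.backlog = 2;
    maybe_add_connection(&clnt);    /* backlog is low: stay where we are */
    return 0;
}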


>> Then, to better define the behavior:
>> 
>> The range of valid maxconn values is 1 to 3? to 8? to NCPUS? to the
>> count of the client’s NUMA nodes? I’d be in favor of a small number
>> to start with. Solaris' experience with multiple connections is that
>> there is very little benefit past 8.
> 
> My linux to linux experience has been that there is benefit of having
> more than 8 connections. I have previously posted results that went
> upto 10 connection (it's on my list of thing to test uptown 16). With
> the Netapp performance lab they have maxed out 25G connection setup
> they were using with so they didn't experiment with nconnect=8 but no
> evidence that with a larger network pipe performance would stop
> improving.
> 
> Given the existing performance studies, I would like to argue that
> having such low values are not warranted.

They are warranted until we have a better handle on the risks of a
performance regression occurring with large nconnect settings. The
maximum number can always be raised once we are confident the
behaviors are well understood.

Also, I'd like to see some careful studies that demonstrate why
you don't see excellent results with just two or three connections.
Nearly full link bandwidth has been achieved with MP-TCP and two or
three subflows on one NIC. Why is it not possible with NFS/TCP?


>> If maxconn is specified with a datagram transport, does the mount
>> operation fail, or is the setting is ignored?
> 
> Perhaps we can add a warning on the mount command saying that option
> is ignored but succeed the mount.
> 
>> If maxconn is a hint, when does the client open additional
>> connections?
>> 
>> IMO documentation should be clear that this setting is not for the
>> purpose of multipathing/trunking (using multiple NICs on the client
>> or server). The client has to do trunking detection/discovery in that
>> case, and nconnect doesn't add that logic. This is strictly for
>> enabling multiple connections between one client-server IP address
>> pair.
> 
> I agree this should be as that last statement says multiple connection
> to the same IP and in my option this shouldn't be a hint.
> 
>> Do we need to state explicitly that all transport connections for a
>> mount (or client-server pair) are the same connection type (i.e., all
>> TCP or all RDMA, never a mix)?
> 
> That might be an interesting future option but I think for now, we can
> clearly say it's a TCP only option in documentation which can always
> be changed if extension to that functionality will be implemented.

Is there a reason you feel RDMA shouldn't be included? I've tried
nconnect with my RDMA rig, and didn't see any problem with it.


>>> Then further down the track, we might change the actual number of
>>> connections automatically if a way can be found to do that without cost.
>> 
>> Fair enough.
>> 
>> 
>>> Do you have any objections apart from the nconnect= mount option?
>> 
>> Well I realize my last e-mail sounded a little negative, but I'm
>> actually in favor of adding the ability to open multiple connections
>> per client-server pair. I just want to be careful about making this
>> a feature that has as few downsides as possible right from the start.
>> I'll try to be more helpful in my responses.
>> 
>> Remaining implementation issues that IMO need to be sorted:
> 
> I'm curious are you saying all this need to be resolved before we
> consider including this functionality? These are excellent questions
> but I think they imply some complex enhancements (like ability to do
> different schedulers and not only round robin) that are "enhancement"
> and not requirements.
> 
>> • We want to take care that the client can recover network resources
>> that have gone idle. Can we reuse the auto-close logic to close extra
>> connections?
> Since we are using round-robin scheduler then can we consider any
> resources going idle?

Again, I was thinking of nconnect as a hint here, not as a fixed
number of connections.


> It's hard to know the future, we might set a
> timer after which we can say that a connection has been idle for long
> enough time and we close it and as soon as that happens the traffic is
> going to be generated again and we'll have to pay the penalty of
> establishing a new connection before sending traffic.
> 
>> • How will the client schedule requests on multiple connections?
>> Should we enable the use of different schedulers?
> That's an interesting idea but I don't think it shouldn't stop the
> round robin solution from going thru.
> 
>> • How will retransmits be handled?
>> • How will the client recover from broken connections? Today's clients
>> use disconnect to determine when to retransmit, thus there might be
>> some unwanted interactions here that result in mount hangs.
>> • Assume NFSv4.1 session ID rather than client ID trunking: is Linux
>> client support in place for this already?
>> • Are there any concerns about how the Linux server DRC will behave in
>> multi-connection scenarios?
> 
> I think we've talked about retransmission question. Retransmission are
> handled by existing logic and are done by the same transport (ie
> connection).

Given the proposition that nconnect will be a hint (eventually) in
the form of a dynamically managed set of connections, I think we need
to answer some of these questions again. The answers could be "not
yet implemented" or "no way jose".

It would be helpful if the answers were all in one place (eg a design
document or FAQ).


>> None of these seem like a deal breaker. And possibly several of these
>> are already decided, but just need to be published/documented.
>> 
>> 
>> --
>> Chuck Lever

--
Chuck Lever
Olga Kornievskaia June 11, 2019, 7:13 p.m. UTC | #32
On Tue, Jun 11, 2019 at 1:47 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
>
> > On Jun 11, 2019, at 11:34 AM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >
> > On Tue, Jun 11, 2019 at 10:52 AM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >> Hi Neil-
> >>
> >>
> >>> On Jun 10, 2019, at 9:09 PM, NeilBrown <neilb@suse.com> wrote:
> >>>
> >>> On Fri, May 31 2019, Chuck Lever wrote:
> >>>
> >>>>> On May 30, 2019, at 6:56 PM, NeilBrown <neilb@suse.com> wrote:
> >>>>>
> >>>>> On Thu, May 30 2019, Chuck Lever wrote:
> >>>>>
> >>>>>> Hi Neil-
> >>>>>>
> >>>>>> Thanks for chasing this a little further.
> >>>>>>
> >>>>>>
> >>>>>>> On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
> >>>>>>>
> >>>>>>> This patch set is based on the patches in the multipath_tcp branch of
> >>>>>>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
> >>>>>>>
> >>>>>>> I'd like to add my voice to those supporting this work and wanting to
> >>>>>>> see it land.
> >>>>>>> We have had customers/partners wanting this sort of functionality for
> >>>>>>> years.  In SLES releases prior to SLE15, we've provide a
> >>>>>>> "nosharetransport" mount option, so that several filesystem could be
> >>>>>>> mounted from the same server and each would get its own TCP
> >>>>>>> connection.
> >>>>>>
> >>>>>> Is it well understood why splitting up the TCP connections result
> >>>>>> in better performance?
> >>>>>>
> >>>>>>
> >>>>>>> In SLE15 we are using this 'nconnect' feature, which is much nicer.
> >>>>>>>
> >>>>>>> Partners have assured us that it improves total throughput,
> >>>>>>> particularly with bonded networks, but we haven't had any concrete
> >>>>>>> data until Olga Kornievskaia provided some concrete test data - thanks
> >>>>>>> Olga!
> >>>>>>>
> >>>>>>> My understanding, as I explain in one of the patches, is that parallel
> >>>>>>> hardware is normally utilized by distributing flows, rather than
> >>>>>>> packets.  This avoid out-of-order deliver of packets in a flow.
> >>>>>>> So multiple flows are needed to utilizes parallel hardware.
> >>>>>>
> >>>>>> Indeed.
> >>>>>>
> >>>>>> However I think one of the problems is what happens in simpler scenarios.
> >>>>>> We had reports that using nconnect > 1 on virtual clients made things
> >>>>>> go slower. It's not always wise to establish multiple connections
> >>>>>> between the same two IP addresses. It depends on the hardware on each
> >>>>>> end, and the network conditions.
> >>>>>
> >>>>> This is a good argument for leaving the default at '1'.  When
> >>>>> documentation is added to nfs(5), we can make it clear that the optimal
> >>>>> number is dependant on hardware.
> >>>>
> >>>> Is there any visibility into the NIC hardware that can guide this setting?
> >>>>
> >>>
> >>> I doubt it, partly because there is more than just the NIC hardware at issue.
> >>> There is also the server-side hardware and possibly hardware in the middle.
> >>
> >> So the best guidance is YMMV. :-)
> >>
> >>
> >>>>>> What about situations where the network capabilities between server and
> >>>>>> client change? Problem is that neither endpoint can detect that; TCP
> >>>>>> usually just deals with it.
> >>>>>
> >>>>> Being able to manually change (-o remount) the number of connections
> >>>>> might be useful...
> >>>>
> >>>> Ugh. I have problems with the administrative interface for this feature,
> >>>> and this is one of them.
> >>>>
> >>>> Another is what prevents your client from using a different nconnect=
> >>>> setting on concurrent mounts of the same server? It's another case of a
> >>>> per-mount setting being used to control a resource that is shared across
> >>>> mounts.
> >>>
> >>> I think that horse has well and truly bolted.
> >>> It would be nice to have a "server" abstraction visible to user-space
> >>> where we could adjust settings that make sense server-wide, and then a way
> >>> to mount individual filesystems from that "server" - but we don't.
> >>
> >> Even worse, there will be some resource sharing between containers that
> >> might be undesirable. The host should have ultimate control over those
> >> resources.
> >>
> >> But that is neither here nor there.
> >>
> >>
> >>> Probably the best we can do is to document (in nfs(5)) which options are
> >>> per-server and which are per-mount.
> >>
> >> Alternately, the behavior of this option could be documented this way:
> >>
> >> The default value is one. To resolve conflicts between nconnect settings on
> >> different mount points to the same server, the value set on the first mount
> >> applies until there are no more mounts of that server, unless nosharecache
> >> is specified. When following a referral to another server, the nconnect
> >> setting is inherited, but the effective value is determined by other mounts
> >> of that server that are already in place.
> >>
> >> I hate to say it, but the way to make this work deterministically is to
> >> ask administrators to ensure that the setting is the same on all mounts
> >> of the same server. Again I'd rather this take care of itself, but it
> >> appears that is not going to be possible.
> >>
> >>
> >>>> Adding user tunables has never been known to increase the aggregate
> >>>> amount of happiness in the universe. I really hope we can come up with
> >>>> a better administrative interface... ideally, none would be best.
> >>>
> >>> I agree that none would be best.  It isn't clear to me that that is
> >>> possible.
> >>> At present, we really don't have enough experience with this
> >>> functionality to be able to say what the trade-offs are.
> >>> If we delay the functionality until we have the perfect interface,
> >>> we may never get that experience.
> >>>
> >>> We can document "nconnect=" as a hint, and possibly add that
> >>> "nconnect=1" is a firm guarantee that more will not be used.
> >>
> >> Agree that 1 should be the default. If we make this setting a
> >> hint, then perhaps it should be renamed; nconnect makes it sound
> >> like the client will always open N connections. How about "maxconn" ?
> >
> > "maxconn" sounds to me like it's possible that the code would choose a
> > number that's less than that which I think would be misleading given
> > that the implementation (as is now) will open the specified number of
> > connection (bounded by the hard coded default we currently have set at
> > some value X which I'm in favor is increasing from 16 to 32).
>
> Earlier in this thread, Neil proposed to make nconnect a hint. Sounds
> like the long term plan is to allow "up to N" connections with some
> mechanism to create new connections on-demand." maxconn fits that idea
> better, though I'd prefer no new mount options... the point being that
> eventually, this setting is likely to be an upper bound rather than a
> fixed value.

Fair enough. If dynamic connection management is in the cards, then
"maxconn" would be an appropriate name, but I also agree with you that
if we are doing dynamic management then we shouldn't need a mount
option at all. I, for one, am skeptical that we'll gain benefits from
dynamic connection management, given the cost of tearing down and
establishing connections.

I would argue that since no dynamic management is implemented now, we
should stay with the "nconnect" mount option, and if and when such a
feature is found desirable, we can get rid of the mount option
altogether.

> >> Then, to better define the behavior:
> >>
> >> The range of valid maxconn values is 1 to 3? to 8? to NCPUS? to the
> >> count of the client’s NUMA nodes? I’d be in favor of a small number
> >> to start with. Solaris' experience with multiple connections is that
> >> there is very little benefit past 8.
> >
> > My linux to linux experience has been that there is benefit of having
> > more than 8 connections. I have previously posted results that went
> > upto 10 connection (it's on my list of thing to test uptown 16). With
> > the Netapp performance lab they have maxed out 25G connection setup
> > they were using with so they didn't experiment with nconnect=8 but no
> > evidence that with a larger network pipe performance would stop
> > improving.
> >
> > Given the existing performance studies, I would like to argue that
> > having such low values are not warranted.
>
> They are warranted until we have a better handle on the risks of a
> performance regression occurring with large nconnect settings. The
> maximum number can always be raised once we are confident the
> behaviors are well understood.
>
> Also, I'd like to see some careful studies that demonstrate why
> you don't see excellent results with just two or three connections.
> Nearly full link bandwidth has been achieved with MP-TCP and two or
> three subflows on one NIC. Why is it not possible with NFS/TCP ?

Performance tests that do simple buffer-to-buffer measurements are one
thing, but a complicated system that involves a filesystem is another.
The closest we can get to those network performance tests is NFSoRDMA,
which saves various copies, and as you know, with that we can get
close to network link capacity.

> >> If maxconn is specified with a datagram transport, does the mount
> >> operation fail, or is the setting is ignored?
> >
> > Perhaps we can add a warning on the mount command saying that option
> > is ignored but succeed the mount.
> >
> >> If maxconn is a hint, when does the client open additional
> >> connections?
> >>
> >> IMO documentation should be clear that this setting is not for the
> >> purpose of multipathing/trunking (using multiple NICs on the client
> >> or server). The client has to do trunking detection/discovery in that
> >> case, and nconnect doesn't add that logic. This is strictly for
> >> enabling multiple connections between one client-server IP address
> >> pair.
> >
> > I agree this should be as that last statement says multiple connection
> > to the same IP and in my option this shouldn't be a hint.
> >
> >> Do we need to state explicitly that all transport connections for a
> >> mount (or client-server pair) are the same connection type (i.e., all
> >> TCP or all RDMA, never a mix)?
> >
> > That might be an interesting future option but I think for now, we can
> > clearly say it's a TCP only option in documentation which can always
> > be changed if extension to that functionality will be implemented.
>
> Is there a reason you feel RDMA shouldn't be included? I've tried
> nconnect with my RDMA rig, and didn't see any problem with it.

No reason; I should have said "a single type of connection only", not
a mix. Of course with RDMA, even with a single connection we can
achieve almost max bandwidth, so using nconnect seems unnecessary.

> >>> Then further down the track, we might change the actual number of
> >>> connections automatically if a way can be found to do that without cost.
> >>
> >> Fair enough.
> >>
> >>
> >>> Do you have any objections apart from the nconnect= mount option?
> >>
> >> Well I realize my last e-mail sounded a little negative, but I'm
> >> actually in favor of adding the ability to open multiple connections
> >> per client-server pair. I just want to be careful about making this
> >> a feature that has as few downsides as possible right from the start.
> >> I'll try to be more helpful in my responses.
> >>
> >> Remaining implementation issues that IMO need to be sorted:
> >
> > I'm curious are you saying all this need to be resolved before we
> > consider including this functionality? These are excellent questions
> > but I think they imply some complex enhancements (like ability to do
> > different schedulers and not only round robin) that are "enhancement"
> > and not requirements.
> >
> >> • We want to take care that the client can recover network resources
> >> that have gone idle. Can we reuse the auto-close logic to close extra
> >> connections?
> > Since we are using round-robin scheduler then can we consider any
> > resources going idle?
>
> Again, I was thinking of nconnect as a hint here, not as a fixed
> number of connections.
>
>
> > It's hard to know the future, we might set a
> > timer after which we can say that a connection has been idle for long
> > enough time and we close it and as soon as that happens the traffic is
> > going to be generated again and we'll have to pay the penalty of
> > establishing a new connection before sending traffic.
> >
> >> • How will the client schedule requests on multiple connections?
> >> Should we enable the use of different schedulers?
> > That's an interesting idea but I don't think it shouldn't stop the
> > round robin solution from going thru.
> >
> >> • How will retransmits be handled?
> >> • How will the client recover from broken connections? Today's clients
> >> use disconnect to determine when to retransmit, thus there might be
> >> some unwanted interactions here that result in mount hangs.
> >> • Assume NFSv4.1 session ID rather than client ID trunking: is Linux
> >> client support in place for this already?
> >> • Are there any concerns about how the Linux server DRC will behave in
> >> multi-connection scenarios?
> >
> > I think we've talked about retransmission question. Retransmission are
> > handled by existing logic and are done by the same transport (ie
> > connection).
>
> Given the proposition that nconnect will be a hint (eventually) in
> the form of a dynamically managed set of connections, I think we need
> to answer some of these questions again. The answers could be "not
> yet implemented" or "no way jose".
>
> It would be helpful if the answers were all in one place (eg a design
> document or FAQ).
>
>
> >> None of these seem like a deal breaker. And possibly several of these
> >> are already decided, but just need to be published/documented.
> >>
> >>
> >> --
> >> Chuck Lever
>
> --
> Chuck Lever
>
>
>
Tom Talpey June 11, 2019, 8:02 p.m. UTC | #33
On 6/11/2019 3:13 PM, Olga Kornievskaia wrote:
> On Tue, Jun 11, 2019 at 1:47 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>>
>>
>>> On Jun 11, 2019, at 11:34 AM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>
>>> On Tue, Jun 11, 2019 at 10:52 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>
>>>> Hi Neil-
>>>>
>>>>
>>>>> On Jun 10, 2019, at 9:09 PM, NeilBrown <neilb@suse.com> wrote:
>>>>>
>>>>> On Fri, May 31 2019, Chuck Lever wrote:
>>>>>
>>>>>>> On May 30, 2019, at 6:56 PM, NeilBrown <neilb@suse.com> wrote:
>>>>>>>
>>>>>>> On Thu, May 30 2019, Chuck Lever wrote:
>>>>>>>
>>>>>>>> Hi Neil-
>>>>>>>>
>>>>>>>> Thanks for chasing this a little further.
>>>>>>>>
>>>>>>>>
>>>>>>>>> On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
>>>>>>>>>
>>>>>>>>> This patch set is based on the patches in the multipath_tcp branch of
>>>>>>>>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
>>>>>>>>>
>>>>>>>>> I'd like to add my voice to those supporting this work and wanting to
>>>>>>>>> see it land.
>>>>>>>>> We have had customers/partners wanting this sort of functionality for
>>>>>>>>> years.  In SLES releases prior to SLE15, we've provide a
>>>>>>>>> "nosharetransport" mount option, so that several filesystem could be
>>>>>>>>> mounted from the same server and each would get its own TCP
>>>>>>>>> connection.
>>>>>>>>
>>>>>>>> Is it well understood why splitting up the TCP connections result
>>>>>>>> in better performance?
>>>>>>>>
>>>>>>>>
>>>>>>>>> In SLE15 we are using this 'nconnect' feature, which is much nicer.
>>>>>>>>>
>>>>>>>>> Partners have assured us that it improves total throughput,
>>>>>>>>> particularly with bonded networks, but we haven't had any concrete
>>>>>>>>> data until Olga Kornievskaia provided some concrete test data - thanks
>>>>>>>>> Olga!
>>>>>>>>>
>>>>>>>>> My understanding, as I explain in one of the patches, is that parallel
>>>>>>>>> hardware is normally utilized by distributing flows, rather than
>>>>>>>>> packets.  This avoid out-of-order deliver of packets in a flow.
>>>>>>>>> So multiple flows are needed to utilizes parallel hardware.
>>>>>>>>
>>>>>>>> Indeed.
>>>>>>>>
>>>>>>>> However I think one of the problems is what happens in simpler scenarios.
>>>>>>>> We had reports that using nconnect > 1 on virtual clients made things
>>>>>>>> go slower. It's not always wise to establish multiple connections
>>>>>>>> between the same two IP addresses. It depends on the hardware on each
>>>>>>>> end, and the network conditions.
>>>>>>>
>>>>>>> This is a good argument for leaving the default at '1'.  When
>>>>>>> documentation is added to nfs(5), we can make it clear that the optimal
>>>>>>> number is dependant on hardware.
>>>>>>
>>>>>> Is there any visibility into the NIC hardware that can guide this setting?
>>>>>>
>>>>>
>>>>> I doubt it, partly because there is more than just the NIC hardware at issue.
>>>>> There is also the server-side hardware and possibly hardware in the middle.
>>>>
>>>> So the best guidance is YMMV. :-)
>>>>
>>>>
>>>>>>>> What about situations where the network capabilities between server and
>>>>>>>> client change? Problem is that neither endpoint can detect that; TCP
>>>>>>>> usually just deals with it.
>>>>>>>
>>>>>>> Being able to manually change (-o remount) the number of connections
>>>>>>> might be useful...
>>>>>>
>>>>>> Ugh. I have problems with the administrative interface for this feature,
>>>>>> and this is one of them.
>>>>>>
>>>>>> Another is what prevents your client from using a different nconnect=
>>>>>> setting on concurrent mounts of the same server? It's another case of a
>>>>>> per-mount setting being used to control a resource that is shared across
>>>>>> mounts.
>>>>>
>>>>> I think that horse has well and truly bolted.
>>>>> It would be nice to have a "server" abstraction visible to user-space
>>>>> where we could adjust settings that make sense server-wide, and then a way
>>>>> to mount individual filesystems from that "server" - but we don't.
>>>>
>>>> Even worse, there will be some resource sharing between containers that
>>>> might be undesirable. The host should have ultimate control over those
>>>> resources.
>>>>
>>>> But that is neither here nor there.
>>>>
>>>>
>>>>> Probably the best we can do is to document (in nfs(5)) which options are
>>>>> per-server and which are per-mount.
>>>>
>>>> Alternately, the behavior of this option could be documented this way:
>>>>
>>>> The default value is one. To resolve conflicts between nconnect settings on
>>>> different mount points to the same server, the value set on the first mount
>>>> applies until there are no more mounts of that server, unless nosharecache
>>>> is specified. When following a referral to another server, the nconnect
>>>> setting is inherited, but the effective value is determined by other mounts
>>>> of that server that are already in place.
>>>>
>>>> I hate to say it, but the way to make this work deterministically is to
>>>> ask administrators to ensure that the setting is the same on all mounts
>>>> of the same server. Again I'd rather this take care of itself, but it
>>>> appears that is not going to be possible.
>>>>
>>>>
>>>>>> Adding user tunables has never been known to increase the aggregate
>>>>>> amount of happiness in the universe. I really hope we can come up with
>>>>>> a better administrative interface... ideally, none would be best.
>>>>>
>>>>> I agree that none would be best.  It isn't clear to me that that is
>>>>> possible.
>>>>> At present, we really don't have enough experience with this
>>>>> functionality to be able to say what the trade-offs are.
>>>>> If we delay the functionality until we have the perfect interface,
>>>>> we may never get that experience.
>>>>>
>>>>> We can document "nconnect=" as a hint, and possibly add that
>>>>> "nconnect=1" is a firm guarantee that more will not be used.
>>>>
>>>> Agree that 1 should be the default. If we make this setting a
>>>> hint, then perhaps it should be renamed; nconnect makes it sound
>>>> like the client will always open N connections. How about "maxconn" ?
>>>
>>> "maxconn" sounds to me like it's possible that the code would choose a
>>> number that's less than that which I think would be misleading given
>>> that the implementation (as is now) will open the specified number of
>>> connection (bounded by the hard coded default we currently have set at
>>> some value X which I'm in favor is increasing from 16 to 32).
>>
>> Earlier in this thread, Neil proposed to make nconnect a hint. Sounds
>> like the long term plan is to allow "up to N" connections with some
>> mechanism to create new connections on-demand." maxconn fits that idea
>> better, though I'd prefer no new mount options... the point being that
>> eventually, this setting is likely to be an upper bound rather than a
>> fixed value.
> 
> Fair enough. If the dynamic connection management is in the cards,
> then "maxconn" would be an appropriate name but I also agree with you
> that if we are doing dynamic management then we shouldn't need a mount
> option at all. I, for one, am skeptical that we'll gain benefits from
> dynamic connection management given that cost of tearing and starting
> the new connection.
> 
> I would argue that since now no dynamic management is implemented then
> we stay with the "nconnect" mount option and if and when such feature
> is found desirable then we get rid of the mount option all together.
> 
>>>> Then, to better define the behavior:
>>>>
>>>> The range of valid maxconn values is 1 to 3? to 8? to NCPUS? to the
>>>> count of the client’s NUMA nodes? I’d be in favor of a small number
>>>> to start with. Solaris' experience with multiple connections is that
>>>> there is very little benefit past 8.
>>>
>>> My linux to linux experience has been that there is benefit of having
>>> more than 8 connections. I have previously posted results that went
>>> upto 10 connection (it's on my list of thing to test uptown 16). With
>>> the Netapp performance lab they have maxed out 25G connection setup
>>> they were using with so they didn't experiment with nconnect=8 but no
>>> evidence that with a larger network pipe performance would stop
>>> improving.
>>>
>>> Given the existing performance studies, I would like to argue that
>>> having such low values are not warranted.
>>
>> They are warranted until we have a better handle on the risks of a
>> performance regression occurring with large nconnect settings. The
>> maximum number can always be raised once we are confident the
>> behaviors are well understood.
>>
>> Also, I'd like to see some careful studies that demonstrate why
>> you don't see excellent results with just two or three connections.
>> Nearly full link bandwidth has been achieved with MP-TCP and two or
>> three subflows on one NIC. Why is it not possible with NFS/TCP ?
> 
> Performance tests that do simple buffer to buffer measurements are one
> thing but doing a complicated system that involves a filesystem is
> another thing. The closest we can get to this network performance
> tests is NFSoRDMA which saves various copies and as you know with that
> we can get close to network link capacity.

I really hope nconnect is not just a workaround for some undiscovered
performance issue. All that does is kick the can down the road.

But a word of experience from SMB3 multichannel - more connections also
bring more issues for customers. Inevitably, with many connections
active under load, one or more will experience disconnects or slowdowns.
When this happens, some very unpredictable and hard to diagnose
behaviors start to occur. For example, all that careful load balancing
immediately goes out the window, and retries start to take over the
latencies. Some IOs sail through (the ones on the good connections) and
others delay for many seconds (while the connection is reestablished).
I don't recommend starting this effort with such a lofty goal as 8, 10
or 16 connections, especially with a protocol such as NFSv3.

JMHO.

Tom.


Chuck Lever III June 11, 2019, 8:09 p.m. UTC | #34
> On Jun 11, 2019, at 4:02 PM, Tom Talpey <tom@talpey.com> wrote:
> 
> On 6/11/2019 3:13 PM, Olga Kornievskaia wrote:
>> On Tue, Jun 11, 2019 at 1:47 PM Chuck Lever <chuck.lever@oracle.com> wrote:
...
>>> Also, I'd like to see some careful studies that demonstrate why
>>> you don't see excellent results with just two or three connections.
>>> Nearly full link bandwidth has been achieved with MP-TCP and two or
>>> three subflows on one NIC. Why is it not possible with NFS/TCP ?
>> Performance tests that do simple buffer-to-buffer measurements are one
>> thing, but a complicated system that involves a filesystem is another.
>> The closest we can get to those network performance tests is NFSoRDMA,
>> which saves various copies, and as you know, with that we can get
>> close to network link capacity.

Yes, in certain circumstances, but there are still areas that can
benefit or need substantial improvement (NFS WRITE performance is
one such area).


> I really hope nconnect is not just a workaround for some undiscovered
> performance issue. All that does is kick the can down the road.
> 
> But a word of experience from SMB3 multichannel - more connections also
> bring more issues for customers. Inevitably, with many connections
> active under load, one or more will experience disconnects or slowdowns.
> When this happens, some very unpredictable and hard to diagnose
> behaviors start to occur. For example, all that careful load balancing
> immediately goes out the window, and retries start to take over the
> latencies. Some IOs sail through (the ones on the good connections) and
> others delay for many seconds (while the connection is reestablished).
> I don't recommend starting this effort with such a lofty goal as 8, 10
> or 16 connections, especially with a protocol such as NFSv3.

+1

Learn to crawl then walk then run.



--
Chuck Lever
Olga Kornievskaia June 11, 2019, 9:10 p.m. UTC | #35
On Tue, Jun 11, 2019 at 4:09 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
>
> > On Jun 11, 2019, at 4:02 PM, Tom Talpey <tom@talpey.com> wrote:
> >
> > On 6/11/2019 3:13 PM, Olga Kornievskaia wrote:
...
>
> Yes, in certain circumstances, but there are still areas that can
> benefit or need substantial improvement (NFS WRITE performance is
> one such area).
>
>
> > I really hope nconnect is not just a workaround for some undiscovered
> > performance issue. All that does is kick the can down the road.
> >
> > But a word of experience from SMB3 multichannel - more connections also
> > bring more issues for customers. Inevitably, with many connections
> > active under load, one or more will experience disconnects or slowdowns.
> > When this happens, some very unpredictable and hard to diagnose
> > behaviors start to occur. For example, all that careful load balancing
> > immediately goes out the window, and retries start to take over the
> > latencies. Some IOs sail through (the ones on the good connections) and
> > others delay for many seconds (while the connection is reestablished).
> > I don't recommend starting this effort with such a lofty goal as 8, 10
> > or 16 connections, especially with a protocol such as NFSv3.
>
> +1
>
> Learn to crawl then walk then run.

Neil,

What's your experience with providing the "nosharetransport" option to
the SLE customers? Were you having customers coming back and
complaining about multiple-connection issues?

When a connection is having issues, because we have to retransmit
from the same port, there isn't anything to be done but wait for the
new connection to be established, which adds to the latency of the
operations on the bad connection. Smarts could be added to the (new)
scheduler to grade the connections and, if a connection is having
issues, not assign tasks to it until it recovers, but all of that is
an additional improvement and I don't think we should restrict
connections right off the bat. This is an option that allows for 8,
10, 16 (32) connections, but it doesn't mean customers have to set
such a high value, and we can recommend low values.
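
A rough sketch of that kind of connection grading (a user-space
illustration in C; the names are hypothetical and this is not the
actual sunrpc code) might look like:

#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical per-connection state kept by the scheduler. */
struct conn_state {
	unsigned int recent_errors;   /* disconnects/timeouts seen recently */
	time_t last_error;            /* when the last error happened */
};

/* Trust a connection again once it has been error-free for a while. */
#define PENALTY_SECS 30

static bool conn_healthy(const struct conn_state *c, time_t now)
{
	return c->recent_errors == 0 || (now - c->last_error) > PENALTY_SECS;
}

/* Pick the next connection after 'cur', skipping unhealthy ones;
 * fall back to plain round-robin if none look healthy. */
size_t pick_connection(const struct conn_state *conns, size_t n, size_t cur)
{
	time_t now = time(NULL);
	size_t i;

	for (i = 1; i <= n; i++) {
		size_t cand = (cur + i) % n;
		if (conn_healthy(&conns[cand], now))
			return cand;
	}
	return (cur + 1) % n;
}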

Solaris has it, Microsoft has it and linux has been deprived of it,
let's join the party.


Tom Talpey June 11, 2019, 9:35 p.m. UTC | #36
On 6/11/2019 5:10 PM, Olga Kornievskaia wrote:
> On Tue, Jun 11, 2019 at 4:09 PM Chuck Lever <chuck.lever@oracle.com> wrote:
...
> 
> Neil,
> 
> What's your experience with providing the "nosharetransport" option to
> the SLE customers? Were you having customers coming back and
> complaining about multiple-connection issues?
> 
> When a connection is having issues, because we have to retransmit
> from the same port, there isn't anything to be done but wait for the
> new connection to be established, which adds to the latency of the
> operations on the bad connection. Smarts could be added to the (new)
> scheduler to grade the connections and, if a connection is having
> issues, not assign tasks to it until it recovers, but all of that is
> an additional improvement and I don't think we should restrict
> connections right off the bat. This is an option that allows for 8,
> 10, 16 (32) connections, but it doesn't mean customers have to set
> such a high value, and we can recommend low values.
> 
> Solaris has it, Microsoft has it and linux has been deprived of it,
> let's join the party.

Let me be clear about one thing - SMB3 has it because the protocol
is designed for it. Multichannel leverages SMB2 sessions to allow
retransmit on any active bound connection. NFSv4.1 (and later) have
a similar capability.

NFSv2 and NFSv3, however, do not, and I've already stated my concerns
about pushing them too far. I agree with your sentiment, but for these
protocols, please bear in mind the risks.

Tom.

NeilBrown June 11, 2019, 10:55 p.m. UTC | #37
On Tue, Jun 11 2019, Tom Talpey wrote:

> On 6/11/2019 5:10 PM, Olga Kornievskaia wrote:
...
>> 
>> Solaris has it, Microsoft has it and linux has been deprived of it,
>> let's join the party.
>
> Let me be clear about one thing - SMB3 has it because the protocol
> is designed for it. Multichannel leverages SMB2 sessions to allow
> retransmit on any active bound connection. NFSv4.1 (and later) have
> a similar capability.
>
> NFSv2 and NFSv3, however, do not, and I've already stated my concerns
> about pushing them too far. I agree with your sentiment, but for these
> protocols, please bear in mind the risks.

NFSv2 and NFSv3 were designed to work with UDP.  That works a lot like
one-connection-per-message.   I don't think there is any reason to think
NFSv2,3 would have any problems with multiple connections.

NeilBrown
NeilBrown June 11, 2019, 11:02 p.m. UTC | #38
On Tue, Jun 11 2019, Olga Kornievskaia wrote:

>
> Neil,
>
> What's your experience with providing the "nosharetransport" option to
> the SLE customers? Were you having customers coming back and
> complaining about multiple-connection issues?

Never had customers come back at all.
Every major SLE release saw a request that we preserve this non-upstream
functionality, but we got very little information about how it was being
used, and how well it performed.

>
> When a connection is having issues, because we have to retransmit
> from the same port, there isn't anything to be done but wait for the
> new connection to be established, which adds to the latency of the
> operations on the bad connection. Smarts could be added to the (new)
> scheduler to grade the connections and, if a connection is having
> issues, not assign tasks to it until it recovers, but all of that is
> an additional improvement and I don't think we should restrict
> connections right off the bat. This is an option that allows for 8,
> 10, 16 (32) connections, but it doesn't mean customers have to set
> such a high value, and we can recommend low values.

The current load-balancing code will stop adding new tasks to any
connection that already has more than the average number of tasks
pending.
So if a connection breaks  (which would require lots of packet loss I
think), then it will soon be ignored by new tasks.  Those tasks which
have been assigned to it will just have to wait for the reconnect.
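
Roughly, that policy looks like the following self-contained sketch
(illustrative only; the names are made up and this is not the actual
kernel xprt code):

#include <stddef.h>

/* Hypothetical view of a transport: just its current queue length. */
struct xprt_slot {
	unsigned int queuelen;   /* tasks currently queued on this transport */
};

/* Pick the next transport after 'cur' in rotation order, skipping any
 * transport whose queue is longer than the average across all of them.
 * A broken connection accumulates stuck tasks, exceeds the average,
 * and so stops being offered new work until it drains. */
size_t next_transport(const struct xprt_slot *xprts, size_t n, size_t cur)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += xprts[i].queuelen;

	for (i = 1; i <= n; i++) {
		size_t cand = (cur + i) % n;
		/* queuelen <= average  <=>  queuelen * n <= total */
		if ((unsigned long)xprts[cand].queuelen * n <= total)
			return cand;
	}
	return (cur + 1) % n;   /* cannot happen: at least one is <= average */
}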

In terms of a maximum number of connections, I don't think it is our place
to stop people shooting themselves in the foot.
Given the limit of 1024 reserved ports, I can justify enforcing a limit
of (say) 256.  Forcing a limit lower than that might just stop people
from experimenting, and I think we want people to experiment.

NeilBrown
NeilBrown June 11, 2019, 11:21 p.m. UTC | #39
On Tue, Jun 11 2019, Tom Talpey wrote:
>
> I really hope nconnect is not just a workaround for some undiscovered
> performance issue. All that does is kick the can down the road.

This is one of my fears too.

My current perspective is to ask
  "What do hardware designers optimise for".
because the speeds we are looking at really require various bits of
hardware to be working together harmoniously.

In context, that question becomes "Do they optimise for single
connection throughput, or multiple connection throughput".

Given the amount of money in web-services, I think multiple connection
throughput is most likely to provide dollars.
I also think that it would be a lot easier to parallelise than a single
connection.

So if we NFS developers want to work with the strengths of the hardware,
I think multiple connections and increased parallelism is a sensible
long-term strategy.

So while I cannot rule out any undiscovered performance issue, I don't
think this is just kicking the can down the road.

Thanks,
NeilBrown
NeilBrown June 11, 2019, 11:42 p.m. UTC | #40
On Tue, Jun 11 2019, Chuck Lever wrote:

>
> Earlier in this thread, Neil proposed to make nconnect a hint. Sounds
> like the long term plan is to allow "up to N" connections with some
> mechanism to create new connections on-demand." maxconn fits that idea
> better, though I'd prefer no new mount options... the point being that
> eventually, this setting is likely to be an upper bound rather than a
> fixed value.

When I suggested making it a hint, I considered and rejected the
idea of making it a maximum.  Maybe I should have been explicit about
that.

I think it *is* important to be able to disable multiple connections,
hence my suggestion that "nconnect=1", as a special case, could be a
firm maximum.
My intent was that if nconnect was not specified, or was given a larger
number, then the implementation should be free to use however many
connections it chose from time to time.  The number given would be just
a hint - maybe an initial value.  Neither a maximum nor a minimum.
Maybe we should add "nonconnect" (or similar) to enforce a single
connection, rather than overloading "nconnect=1"

You have said elsewhere that you would prefer configuration in a config
file rather than as a mount option.
How do you imagine that configuration information getting into the
kernel?
Do we create /sys/fs/nfs/something?  or add to /proc/sys/sunrpc
or /proc/net/rpc .... we have so many options !!
There is even /sys/kernel/debug/sunrpc/rpc_clnt, but that is not
a good place for configuration.

I suspect that you don't really have an opinion, you just don't like the
mount option.  However I don't have that luxury.  I need to put the
configuration somewhere.  As it is per-server configuration the only
existing place that works at all is a mount option.
While that might not be ideal, I do think it is most realistic.
Mount options can be deprecated, and carrying support for a deprecated
mount option is not expensive.

The option still can be placed in a per-server part of
/etc/nfsmount.conf rather than /etc/fstab, if that is what a sysadmin
wants to do.

Thanks,
NeilBrown
NeilBrown June 12, 2019, 1:49 a.m. UTC | #41
On Tue, Jun 11 2019, Chuck Lever wrote:

> Hi Neil-
>
>
>> On Jun 10, 2019, at 9:09 PM, NeilBrown <neilb@suse.com> wrote:
>> 
>> On Fri, May 31 2019, Chuck Lever wrote:
>> 
>>>> On May 30, 2019, at 6:56 PM, NeilBrown <neilb@suse.com> wrote:
>>>> 
>>>> On Thu, May 30 2019, Chuck Lever wrote:
>>>> 
>>>>> Hi Neil-
>>>>> 
>>>>> Thanks for chasing this a little further.
>>>>> 
>>>>> 
>>>>>> On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
>>>>>> 
>>>>>> This patch set is based on the patches in the multipath_tcp branch of
>>>>>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
>>>>>> 
>>>>>> I'd like to add my voice to those supporting this work and wanting to
>>>>>> see it land.
>>>>>> We have had customers/partners wanting this sort of functionality for
>>>>>> years.  In SLES releases prior to SLE15, we've provide a
>>>>>> "nosharetransport" mount option, so that several filesystem could be
>>>>>> mounted from the same server and each would get its own TCP
>>>>>> connection.
>>>>> 
>>>>> Is it well understood why splitting up the TCP connections result
>>>>> in better performance?
>>>>> 
>>>>> 
>>>>>> In SLE15 we are using this 'nconnect' feature, which is much nicer.
>>>>>> 
>>>>>> Partners have assured us that it improves total throughput,
>>>>>> particularly with bonded networks, but we haven't had any concrete
>>>>>> data until Olga Kornievskaia provided some concrete test data - thanks
>>>>>> Olga!
>>>>>> 
>>>>>> My understanding, as I explain in one of the patches, is that parallel
>>>>>> hardware is normally utilized by distributing flows, rather than
>>>>>> packets.  This avoid out-of-order deliver of packets in a flow.
>>>>>> So multiple flows are needed to utilizes parallel hardware.
>>>>> 
>>>>> Indeed.
>>>>> 
>>>>> However I think one of the problems is what happens in simpler scenarios.
>>>>> We had reports that using nconnect > 1 on virtual clients made things
>>>>> go slower. It's not always wise to establish multiple connections
>>>>> between the same two IP addresses. It depends on the hardware on each
>>>>> end, and the network conditions.
>>>> 
>>>> This is a good argument for leaving the default at '1'.  When
>>>> documentation is added to nfs(5), we can make it clear that the optimal
>>>> number is dependant on hardware.
>>> 
>>> Is there any visibility into the NIC hardware that can guide this setting?
>>> 
>> 
>> I doubt it, partly because there is more than just the NIC hardware at issue.
>> There is also the server-side hardware and possibly hardware in the middle.
>
> So the best guidance is YMMV. :-)
>
>
>>>>> What about situations where the network capabilities between server and
>>>>> client change? Problem is that neither endpoint can detect that; TCP
>>>>> usually just deals with it.
>>>> 
>>>> Being able to manually change (-o remount) the number of connections
>>>> might be useful...
>>> 
>>> Ugh. I have problems with the administrative interface for this feature,
>>> and this is one of them.
>>> 
>>> Another is what prevents your client from using a different nconnect=
>>> setting on concurrent mounts of the same server? It's another case of a
>>> per-mount setting being used to control a resource that is shared across
>>> mounts.
>> 
>> I think that horse has well and truly bolted.
>> It would be nice to have a "server" abstraction visible to user-space
>> where we could adjust settings that make sense server-wide, and then a way
>> to mount individual filesystems from that "server" - but we don't.
>
> Even worse, there will be some resource sharing between containers that
> might be undesirable. The host should have ultimate control over those
> resources.
>
> But that is neither here nor there.
>
>
>> Probably the best we can do is to document (in nfs(5)) which options are
>> per-server and which are per-mount.
>
> Alternately, the behavior of this option could be documented this way:
>
> The default value is one. To resolve conflicts between nconnect settings on
> different mount points to the same server, the value set on the first mount
> applies until there are no more mounts of that server, unless nosharecache
> is specified. When following a referral to another server, the nconnect
> setting is inherited, but the effective value is determined by other mounts
> of that server that are already in place.
>
> I hate to say it, but the way to make this work deterministically is to
> ask administrators to ensure that the setting is the same on all mounts
> of the same server. Again I'd rather this take care of itself, but it
> appears that is not going to be possible.
>
>
>>> Adding user tunables has never been known to increase the aggregate
>>> amount of happiness in the universe. I really hope we can come up with
>>> a better administrative interface... ideally, none would be best.
>> 
>> I agree that none would be best.  It isn't clear to me that that is
>> possible.
>> At present, we really don't have enough experience with this
>> functionality to be able to say what the trade-offs are.
>> If we delay the functionality until we have the perfect interface,
>> we may never get that experience.
>> 
>> We can document "nconnect=" as a hint, and possibly add that
>> "nconnect=1" is a firm guarantee that more will not be used.
>
> Agree that 1 should be the default. If we make this setting a
> hint, then perhaps it should be renamed; nconnect makes it sound
> like the client will always open N connections. How about "maxconn" ?
>
> Then, to better define the behavior:
>
> The range of valid maxconn values is 1 to 3? to 8? to NCPUS? to the
> count of the client’s NUMA nodes? I’d be in favor of a small number
> to start with. Solaris' experience with multiple connections is that
> there is very little benefit past 8.
>
> If maxconn is specified with a datagram transport, does the mount
> operation fail, or is the setting is ignored?

With Trond's patches, the setting is ignored (as he said in a reply).
With my version, the setting is honoured.
Specifically, 'n' separate UDP sockets are created, each bound to a
different local port, each sending to the same server port.
If a bonding driver is using the source-port in the output hash
(xmit_policy=layer3+4 in the terminology of
linux/Documentation/net/bonding.txt),
then this would get better throughput over bonded network interfaces.
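
For illustration, here is a small standalone userspace program showing
the shape of that - n UDP sockets, each bound to a distinct local port,
all "connected" to the same server address and port, so a layer3+4 hash
sees n distinct flows.  The address and counts are made up, and unlike
the real client it just takes ephemeral rather than reserved ports:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define NCONN 4

int main(void)
{
	int fds[NCONN];
	struct sockaddr_in srv = {
		.sin_family = AF_INET,
		.sin_port   = htons(2049),		/* nfs */
	};

	inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);

	for (int i = 0; i < NCONN; i++) {
		struct sockaddr_in local = {
			.sin_family = AF_INET,
			.sin_addr.s_addr = htonl(INADDR_ANY),
			.sin_port = 0,			/* kernel picks a port */
		};

		fds[i] = socket(AF_INET, SOCK_DGRAM, 0);
		if (fds[i] < 0 ||
		    bind(fds[i], (struct sockaddr *)&local, sizeof(local)) < 0 ||
		    connect(fds[i], (struct sockaddr *)&srv, sizeof(srv)) < 0) {
			perror("socket/bind/connect");
			exit(1);
		}
	}

	for (int i = 0; i < NCONN; i++) {
		struct sockaddr_in me;
		socklen_t len = sizeof(me);

		/* connect() on UDP fixes the 5-tuple: one flow per fd. */
		getsockname(fds[i], (struct sockaddr *)&me, &len);
		printf("flow %d: local port %u -> 192.0.2.1:2049\n",
		       i, ntohs(me.sin_port));
		close(fds[i]);
	}
	return 0;
}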

>
> If maxconn is a hint, when does the client open additional
> connections?
>
> IMO documentation should be clear that this setting is not for the
> purpose of multipathing/trunking (using multiple NICs on the client
> or server). The client has to do trunking detection/discovery in that
> case, and nconnect doesn't add that logic. This is strictly for
> enabling multiple connections between one client-server IP address
> pair.
>
> Do we need to state explicitly that all transport connections for a
> mount (or client-server pair) are the same connection type (i.e., all
> TCP or all RDMA, never a mix)?
>
>
>> Then further down the track, we might change the actual number of
>> connections automatically if a way can be found to do that without cost.
>
> Fair enough.
>
>
>> Do you have any objections apart from the nconnect= mount option?
>
> Well I realize my last e-mail sounded a little negative, but I'm
> actually in favor of adding the ability to open multiple connections
> per client-server pair. I just want to be careful about making this
> a feature that has as few downsides as possible right from the start.
> I'll try to be more helpful in my responses.
>
> Remaining implementation issues that IMO need to be sorted:
>
> • We want to take care that the client can recover network resources
> that have gone idle. Can we reuse the auto-close logic to close extra
> connections?

Were you aware that auto-close was ineffective with NFSv4 as the regular
RENEW (or SEQUENCE for v4.1) keeps a connection open?
My patches already force session management requests onto a single xprt.
It probably makes sense to do the same for RENEW and SEQUENCE.
Then when there is no fs activity, the other connections will close.
There is no mechanism to re-open only some of them though.  Any
non-trivial amount of traffic will cause all connections to re-open.

> • How will the client schedule requests on multiple connections?
> Should we enable the use of different schedulers?
> • How will retransmits be handled?
> • How will the client recover from broken connections? Today's clients
> use disconnect to determine when to retransmit, thus there might be
> some unwanted interactions here that result in mount hangs.
> • Assume NFSv4.1 session ID rather than client ID trunking: is Linux
> client support in place for this already?
> • Are there any concerns about how the Linux server DRC will behave in
> multi-connection scenarios?
>
> None of these seem like a deal breaker. And possibly several of these
> are already decided, but just need to be published/documented.

How about this:

 NFS normally sends all requests to the server (and receives all replies)
 over a single network connection, whether TCP, RDMA or (for NFSv3 and
 earlier) UDP.  Often this is sufficient to utilize all available
 network bandwidth, but not always.  When there is sufficient
 parallelism in the server, the client, and the network connection, the
 restriction to a single TCP stream can become a limitation.

 A simple scenario which portrays this limitation involves several
 direct network connections between client and server where the multiple
 interfaces on each end are bonded together.  If this bonding diverts
 different flows to different interfaces, then a single TCP connection
 will be limited to a single network interface, while multiple
 connections could make use of all interfaces.  Various other scenarios
 are possible including network controllers with multiple DMA/TSO
 engines where a given flow can only be associated with a single engine
 at a time, or Receive-side scaling which can direct different flows to
 different receive queues and thence to different CPU cores.

 NFS has two distinct and complementary mechanisms to enable the use of
 multiple connections to carry requests and replies.  We will refer to
 these as trunking and nconnect, though the NFS RFCs use the term
 "trunking" in a way that covers both.

 With trunking (also known as multipathing), the server-side IP address
 of each connection is different.  RFC8587 (and other documents)
 describe how a client can determine if two connections to different
 addresses actually refer to the same server and so can be used for
 trunking.  The client can use explicit configuration, possibly using
 the NFSv4 `fs_locations` attribute, to find the different addresses,
 and can then establish multiple trunks.  With trunking, the different
 connections could conceivably be over different protocols, both TCP and
 RDMA for example.  Trunking makes use of explicit parallelism in the
 network configuration.

 With nconnect, both the client and server side IP addresses are the
 same on each connection, but the client side port number varies.  This
 enables NFS to benefit from transparent parallelism in the network
 stack, such as interface bonding and receive-side scaling as described
 earlier.

 When multiple connections are available, NFS will send
 session-management requests on a single connection (the first
 connection opened) while general filesystem access requests will be
 distributed over all available connections.  When load is light (as
 measured by the number of outstanding requests on each connection)
 requests will be distributed in a round-robin fashion.  When the number
 of outstanding requests on any connection exceeds 2, and also exceeds
 the average across all connections, that connection will be skipped in
 the round-robin.  As flows are likely to be distributed over hardware
 in a non-fair manner (such as a hash on the port number), it is likely
 that each hardware resource might serve a different number of flows.
 Bypassing flows with above-average backlog goes some way to restoring
 fairness to the distribution of requests across hardware resources.

 In the (hopefully rare) case that a retransmit is needed for an
 (apparently) lost packet,  the same connection - or at least the same
 source port number - will be used for all retransmits.  This ensures
 that any Duplicate Reply Cache on the server has the best possible
 chance of recognizing the retransmission for what it is.  When a given
 connection breaks and needs to be re-established, pending requests on
 that connection will be resent.  Pending requests on other connections
 will not be affected.

 Trunking (as described here) is not currently supported by the Linux
 NFS client except in pNFS configurations (I think - is that right?).
 nconnect is supported and currently requires a mount option.

 If the "nonconnect" mount option is given, the nconnect is completely
 disabled to the target server.  If "nconnect=N" is given (for some N
 from 1 to 256) then that many connections will initially be created and
 used.  Over time, the number of connections may be increased or
 decreased depending on available resources and recent demand.  This may
 also happen if neither "nonconnect" nor "nconnect=" is given.  However
 no design or implementation yet exists for this possibility.

 Where multiple filesystems are mounted from the same server, the
 "nconnect" option given for the first mount will apply to all mounts
 from that server.  If the option is given on subsequent mounts from the
 server, it will be silently ignored.
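
As an aside on the retransmit paragraph, here is a toy illustration of
why reusing the source port matters: a traditional reply cache is keyed
(roughly) on the XID plus the client address, port and procedure, so a
retransmit arriving from a different port would look like a brand new
request.  The structure, field names and values below are illustrative
only, not the Linux server's actual DRC:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct drc_key {
	uint32_t xid;		/* RPC transaction id */
	uint32_t client_ip;
	uint16_t client_port;
	uint16_t proc;
};

static bool drc_match(const struct drc_key *a, const struct drc_key *b)
{
	return a->xid == b->xid && a->client_ip == b->client_ip &&
	       a->client_port == b->client_port && a->proc == b->proc;
}

int main(void)
{
	struct drc_key cached  = { 0x1234, 0xc0000201, 791, 7 };
	struct drc_key retrans = cached;	/* same source port */
	struct drc_key moved   = cached;

	moved.client_port = 792;		/* different connection */

	printf("retransmit on same source port: %s\n",
	       drc_match(&cached, &retrans) ? "DRC hit" : "DRC miss");
	printf("retransmit from another port:   %s\n",
	       drc_match(&cached, &moved) ? "DRC hit" : "DRC miss");
	return 0;
}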


What doesn't that cover?
Having written it, I wonder if I should change the terminology to
distinguish between "multipath trunking" where the server IP address
varies, and "connection trunking" where the server IP address is fixed.

Suppose we do add multi-path (non-pNFS) trunking support.   Would it
make sense to have multiple connections over each path?  Would each path
benefit from the same number of connections?  How do we manage that?

Thanks,
NeilBrown
Steve Dickson June 12, 2019, 12:34 p.m. UTC | #42
On 6/11/19 1:44 PM, Trond Myklebust wrote:
> On Tue, 2019-06-11 at 13:32 -0400, Chuck Lever wrote:
>>> On Jun 11, 2019, at 12:41 PM, Trond Myklebust <
>>> trondmy@hammerspace.com> wrote:
>>>
>>> On Tue, 2019-06-11 at 11:35 -0400, Chuck Lever wrote:
>>>>> On Jun 11, 2019, at 11:20 AM, Trond Myklebust <
>>>>> trondmy@hammerspace.com> wrote:
>>>>>
>>>>> On Tue, 2019-06-11 at 10:51 -0400, Chuck Lever wrote:
>>>>>
>>>>>> If maxconn is a hint, when does the client open additional
>>>>>> connections?
>>>>>
>>>>> As I've already stated, that functionality is not yet
>>>>> available.
>>>>> When
>>>>> it is, it will be under the control of a userspace daemon that
>>>>> can
>>>>> decide on a policy in accordance with a set of user specified
>>>>> requirements.
>>>>
>>>> Then why do we need a mount option at all?
>>>>
>>>
>>> For one thing, it allows people to play with this until we have a
>>> fully
>>> automated solution. The fact that people are actually pulling down
>>> these patches, forward porting them and trying them out would
>>> indicate
>>> that there is interest in doing so.
>>
>> Agreed that it demonstrates that folks are interested in having
>> multiple connections. I count myself among them.
>>
>>
>>> Secondly, if your policy is 'I just want n connections' because
>>> that
>>> fits your workload requirements (e.g. because said workload is both
>>> latency sensitive and bursty), then a daemon solution would be
>>> unnecessary, and may be error prone.
>>
>> Why wouldn't that be the default out-of-the-shrinkwrap configuration
>> that is installed by nfs-utils?
> 
> What is the point of forcing people to run a daemon if all they want to
> do is set up a fixed number of connections?
> 
>>
>>> A mount option is helpful in this case, because you can perform the
>>> setup through the normal fstab or autofs config file configuration
>>> route. It also make sense if you have a nfsroot setup.
>>
>> NFSROOT is the only usage scenario where I see a mount option being
>> a superior administrative interface. However I don't feel that
>> NFSROOT is going to host workloads that would need multiple
>> connections. KIS
>>
>>
>>> Finally, even if you do want to have a daemon manage your
>>> transport,
>>> configuration, you do want a mechanism to help it reach an
>>> equilibrium
>>> state quickly. Connections take time to bring up and tear down
>>> because
>>> performance measurements take time to build up sufficient
>>> statistical
>>> precision. Furthermore, doing so comes with a number of hidden
>>> costs,
>>> e.g.: chewing up privileged port numbers by putting them in a
>>> TIME_WAIT
>>> state. If you know that a given server is always subject to heavy
>>> traffic, then initialising the number of connections appropriately
>>> has
>>> value.
>>
>> Again, I don't see how this is not something a config file can do.
> 
> You can, but that means you have to keep said config file up to date
> with the contents of /etc/fstab etc. Pulverising configuration into
> little bits and pieces that are scattered around in different files is
> not a user friendly interface either.
> 
>> The stated intent of "nconnect" way back when was for
>> experimentation.
>> It works great for that!
>>
>> I don't see it as a desirable long-term administrative interface,
>> though. I'd rather not nail in a new mount option that we actually
>> plan to obsolete in favor of an automated mechanism. I'd rather see
>> us design the administrative interface with automation from the
>> start. That will have a lower long-term maintenance cost.
>>
>> Again, I'm not objecting to support for multiple connections. It's
>> just that adding a mount option doesn't feel like a friendly or
>> finished interface for actual users. A config file (or re-using
>> nfs.conf) seems to me like a better approach.
> 
> nfs.conf is great for defining global defaults.
> 
> It can do server specific configuration, but is not a popular solution
> for that. Most people are still putting that information in /etc/fstab
> so that it appears in one spot.
> 
What about nfsmount.conf? That seems like a more reasonable place
to define how mounts should work... 

steved.
Steve Dickson June 12, 2019, 12:39 p.m. UTC | #43
On 6/11/19 7:42 PM, NeilBrown wrote:
> On Tue, Jun 11 2019, Chuck Lever wrote:
> 
>>
>> Earlier in this thread, Neil proposed to make nconnect a hint. Sounds
>> like the long term plan is to allow "up to N" connections with some
>> mechanism to create new connections on-demand." maxconn fits that idea
>> better, though I'd prefer no new mount options... the point being that
>> eventually, this setting is likely to be an upper bound rather than a
>> fixed value.
> 
> When I suggested making it a hint, I considered and rejected the
> idea of making it a maximum.  Maybe I should have been explicit about
> that.
> 
> I think it *is* important to be able to disable multiple connections,
> hence my suggestion that "nconnect=1", as a special case, could be a
> firm maximum.
> My intent was that if nconnect was not specified, or was given a larger
> number, then the implementation should be free to use however many
> connections it chose from time to time.  The number given would be just
> a hint - maybe an initial value.  Neither a maximum nor a minimum.
> Maybe we should add "nonconnect" (or similar) to enforce a single
> connection, rather than overloading "nconnect=1"
> 
> You have said elsewhere that you would prefer configuration in a config
> file rather than as a mount option.
> How do you imagine that configuration information getting into the
> kernel?
> Do we create /sys/fs/nfs/something?  or add to /proc/sys/sunrpc
> or /proc/net/rpc .... we have so many options !!
> There is even /sys/kernel/debug/sunrpc/rpc_clnt, but that is not
> a good place for configuration.
> 
> I suspect that you don't really have an opinion, you just don't like the
> mount option.  However I don't have that luxury.  I need to put the
> configuration somewhere.  As it is per-server configuration the only
> existing place that works at all is a mount option.
> While that might not be ideal, I do think it is most realistic.
> Mount options can be deprecated, and carrying support for a deprecated
> mount option is not expensive.
> 
> The option still can be placed in a per-server part of
> /etc/nfsmount.conf rather than /etc/fstab, if that is what a sysadmin
> wants to do.
+1 making it per-server is the way to go... IMHO... 

steved.
Trond Myklebust June 12, 2019, 12:47 p.m. UTC | #44
On Wed, 2019-06-12 at 08:34 -0400, Steve Dickson wrote:
> 
> On 6/11/19 1:44 PM, Trond Myklebust wrote:
> > On Tue, 2019-06-11 at 13:32 -0400, Chuck Lever wrote:
> > > > On Jun 11, 2019, at 12:41 PM, Trond Myklebust <
> > > > trondmy@hammerspace.com> wrote:
> > > > 
> > > > On Tue, 2019-06-11 at 11:35 -0400, Chuck Lever wrote:
> > > > > > On Jun 11, 2019, at 11:20 AM, Trond Myklebust <
> > > > > > trondmy@hammerspace.com> wrote:
> > > > > > 
> > > > > > On Tue, 2019-06-11 at 10:51 -0400, Chuck Lever wrote:
> > > > > > 
> > > > > > > If maxconn is a hint, when does the client open
> > > > > > > additional
> > > > > > > connections?
> > > > > > 
> > > > > > As I've already stated, that functionality is not yet
> > > > > > available.
> > > > > > When
> > > > > > it is, it will be under the control of a userspace daemon
> > > > > > that
> > > > > > can
> > > > > > decide on a policy in accordance with a set of user
> > > > > > specified
> > > > > > requirements.
> > > > > 
> > > > > Then why do we need a mount option at all?
> > > > > 
> > > > 
> > > > For one thing, it allows people to play with this until we have
> > > > a
> > > > fully
> > > > automated solution. The fact that people are actually pulling
> > > > down
> > > > these patches, forward porting them and trying them out would
> > > > indicate
> > > > that there is interest in doing so.
> > > 
> > > Agreed that it demonstrates that folks are interested in having
> > > multiple connections. I count myself among them.
> > > 
> > > 
> > > > Secondly, if your policy is 'I just want n connections' because
> > > > that
> > > > fits your workload requirements (e.g. because said workload is
> > > > both
> > > > latency sensitive and bursty), then a daemon solution would be
> > > > unnecessary, and may be error prone.
> > > 
> > > Why wouldn't that be the default out-of-the-shrinkwrap
> > > configuration
> > > that is installed by nfs-utils?
> > 
> > What is the point of forcing people to run a daemon if all they
> > want to
> > do is set up a fixed number of connections?
> > 
> > > > A mount option is helpful in this case, because you can perform
> > > > the
> > > > setup through the normal fstab or autofs config file
> > > > configuration
> > > > route. It also make sense if you have a nfsroot setup.
> > > 
> > > NFSROOT is the only usage scenario where I see a mount option
> > > being
> > > a superior administrative interface. However I don't feel that
> > > NFSROOT is going to host workloads that would need multiple
> > > connections. KIS
> > > 
> > > 
> > > > Finally, even if you do want to have a daemon manage your
> > > > transport,
> > > > configuration, you do want a mechanism to help it reach an
> > > > equilibrium
> > > > state quickly. Connections take time to bring up and tear down
> > > > because
> > > > performance measurements take time to build up sufficient
> > > > statistical
> > > > precision. Furthermore, doing so comes with a number of hidden
> > > > costs,
> > > > e.g.: chewing up privileged port numbers by putting them in a
> > > > TIME_WAIT
> > > > state. If you know that a given server is always subject to
> > > > heavy
> > > > traffic, then initialising the number of connections
> > > > appropriately
> > > > has
> > > > value.
> > > 
> > > Again, I don't see how this is not something a config file can
> > > do.
> > 
> > You can, but that means you have to keep said config file up to
> > date
> > with the contents of /etc/fstab etc. Pulverising configuration into
> > little bits and pieces that are scattered around in different files
> > is
> > not a user friendly interface either.
> > 
> > > The stated intent of "nconnect" way back when was for
> > > experimentation.
> > > It works great for that!
> > > 
> > > I don't see it as a desirable long-term administrative interface,
> > > though. I'd rather not nail in a new mount option that we
> > > actually
> > > plan to obsolete in favor of an automated mechanism. I'd rather
> > > see
> > > us design the administrative interface with automation from the
> > > start. That will have a lower long-term maintenance cost.
> > > 
> > > Again, I'm not objecting to support for multiple connections.
> > > It's
> > > just that adding a mount option doesn't feel like a friendly or
> > > finished interface for actual users. A config file (or re-using
> > > nfs.conf) seems to me like a better approach.
> > 
> > nfs.conf is great for defining global defaults.
> > 
> > It can do server specific configuration, but is not a popular
> > solution
> > for that. Most people are still putting that information in
> > /etc/fstab
> > so that it appears in one spot.
> > 
> What about nfsmount.conf? That seems like a more reasonable place
> to define how mounts should work... 
> 

That has the exact same problem. As long as it defines global defaults,
then fine, but if it pulverises the configuration for each and every
server, and makes it harder to trace what overrides are being
applied, and where they are being applied then it is not helpful.

Another issue there is that neither nfs.conf nor nfsmount.conf are
being used by all implementations of the mount utility. As far as I
know they are not supported by busybox, for instance.
Tom Talpey June 12, 2019, 12:52 p.m. UTC | #45
On 6/11/2019 7:21 PM, NeilBrown wrote:
> On Tue, Jun 11 2019, Tom Talpey wrote:
>>
>> I really hope nconnect is not just a workaround for some undiscovered
>> performance issue. All that does is kick the can down the road.
> 
> This is one of my fears too.
> 
> My current perspective is to ask
>    "What do hardware designers optimise for".
> because the speeds we are looking at really require various bits of
> hardware to be working together harmoniously.
> 
> In context, that question becomes "Do they optimise for single
> connection throughput, or multiple connection throughput".

I assume you mean NIC hardware designers. The answer is both of
course, but there are distinct advantages in the multiple-connection
case. The main feature is RSS - Receive Side Scaling - which computes
a hash of each 5-tuple-based IP flow and spreads interrupts based on
the value. Generally speaking, that's why multiple connections can
speed up a single NIC, on today's high core count machines.

RDMA has a similar capability, by more explicitly directing its
CQs - Completion Queues - to multiple cores. Of course, RDMA has
further abilities to reduce CPU overhead through direct data placement.
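
For anyone who hasn't looked at RSS closely, a toy model of the flow
spreading (the hash below is a stand-in for the NIC's real one, which
is typically Toeplitz, and the addresses, ports and queue count are
made up): the 5-tuple is hashed and the hash selects a receive queue,
so connections that differ only in the client source port can land on
different queues and hence different cores:

#include <stdint.h>
#include <stdio.h>

#define NQUEUES 4

/* Stand-in for the NIC's hash over the 5-tuple. */
static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
			  uint16_t sport, uint16_t dport, uint8_t proto)
{
	uint32_t words[4] = { saddr, daddr,
			      ((uint32_t)dport << 16) | sport, proto };
	uint32_t h = 2166136261u;		/* FNV-1a style */

	for (int i = 0; i < 4; i++)
		h = (h ^ words[i]) * 16777619u;
	return h;
}

int main(void)
{
	/* Same client and server addresses; only the client source port
	 * differs, so each nconnect connection is a separate flow. */
	for (uint16_t sport = 800; sport < 804; sport++)
		printf("src port %u -> rx queue %u\n", (unsigned)sport,
		       flow_hash(0xc0000202, 0xc0000201, sport, 2049, 6)
		       % NQUEUES);
	return 0;
}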

> Given the amount of money in web-services, I think multiple connection
> throughput is most likely to provide dollars.
> I also think that it would be a lot easier to parallelise than a single
> connection.

Yep, that's another advantage. As you observe, this kind of parallelism
is easier to achieve on the server side. IOW, this helps both ends of
the connection.

> So if we NFS developers want to work with the strengths of the hardware,
> I think multiple connections and increased parallelism is a sensible
> long-term strategy.
> 
> So while I cannot rule out any undiscovered performance issue, I don't
> think this is just kicking the can down the road.

Agreed. But driving this to one or two dozen connections is different.
Typical NICs have relatively small RSS limits, and even if they have
more, the system's core count and MSI-X vectors (interrupt steering)
rarely approach this kind of limit. If you measure the improvement
vs connection count, you'll find it increases sharply at 2 or 4, then
flattens out. At that point, the complexity takes over and you'll only
see the advantage in a lab. In the real world, a very different picture
emerges, and it can be very un-pretty.

Just some advice, that's all.

Tom.
Tom Talpey June 12, 2019, 12:55 p.m. UTC | #46
On 6/11/2019 6:55 PM, NeilBrown wrote:
> On Tue, Jun 11 2019, Tom Talpey wrote:
> 
>> On 6/11/2019 5:10 PM, Olga Kornievskaia wrote:
> ...
>>>
>>> Solaris has it, Microsoft has it and linux has been deprived of it,
>>> let's join the party.
>>
>> Let me be clear about one thing - SMB3 has it because the protocol
>> is designed for it. Multichannel leverages SMB2 sessions to allow
>> retransmit on any active bound connection. NFSv4.1 (and later) have
>> a similar capability.
>>
>> NFSv2 and NFSv3, however, do not, and I've already stated my concerns
>> about pushing them too far. I agree with your sentiment, but for these
>> protocols, please bear in mind the risks.
> 
> NFSv2 and NFSv3 were designed to work with UDP.  That works a lot like
> one-connection-per-message.   I don't think there is any reason to think
> NFSv2,3 would have any problems with multiple connections.

Sorry, but are you saying NFS over UDP works? It does not. There
are 10- and 20-year old reports of this.

NFSv2 was designed in the 1980's. NFSv3 came to be in 1992. Do
you truly want to spend your time fixing 30 year old protocols?

Ok, I'll be quiet now. :-)

Tom.
Trond Myklebust June 12, 2019, 1:10 p.m. UTC | #47
On Wed, 2019-06-12 at 12:47 +0000, Trond Myklebust wrote:
> On Wed, 2019-06-12 at 08:34 -0400, Steve Dickson wrote:
> > On 6/11/19 1:44 PM, Trond Myklebust wrote:
> > > On Tue, 2019-06-11 at 13:32 -0400, Chuck Lever wrote:
> > > > > On Jun 11, 2019, at 12:41 PM, Trond Myklebust <
> > > > > trondmy@hammerspace.com> wrote:
> > > > > 
> > > > > On Tue, 2019-06-11 at 11:35 -0400, Chuck Lever wrote:
> > > > > > > On Jun 11, 2019, at 11:20 AM, Trond Myklebust <
> > > > > > > trondmy@hammerspace.com> wrote:
> > > > > > > 
> > > > > > > On Tue, 2019-06-11 at 10:51 -0400, Chuck Lever wrote:
> > > > > > > 
> > > > > > > > If maxconn is a hint, when does the client open
> > > > > > > > additional
> > > > > > > > connections?
> > > > > > > 
> > > > > > > As I've already stated, that functionality is not yet
> > > > > > > available.
> > > > > > > When
> > > > > > > it is, it will be under the control of a userspace daemon
> > > > > > > that
> > > > > > > can
> > > > > > > decide on a policy in accordance with a set of user
> > > > > > > specified
> > > > > > > requirements.
> > > > > > 
> > > > > > Then why do we need a mount option at all?
> > > > > > 
> > > > > 
> > > > > For one thing, it allows people to play with this until we
> > > > > have
> > > > > a
> > > > > fully
> > > > > automated solution. The fact that people are actually pulling
> > > > > down
> > > > > these patches, forward porting them and trying them out would
> > > > > indicate
> > > > > that there is interest in doing so.
> > > > 
> > > > Agreed that it demonstrates that folks are interested in having
> > > > multiple connections. I count myself among them.
> > > > 
> > > > 
> > > > > Secondly, if your policy is 'I just want n connections'
> > > > > because
> > > > > that
> > > > > fits your workload requirements (e.g. because said workload
> > > > > is
> > > > > both
> > > > > latency sensitive and bursty), then a daemon solution would
> > > > > be
> > > > > unnecessary, and may be error prone.
> > > > 
> > > > Why wouldn't that be the default out-of-the-shrinkwrap
> > > > configuration
> > > > that is installed by nfs-utils?
> > > 
> > > What is the point of forcing people to run a daemon if all they
> > > want to
> > > do is set up a fixed number of connections?
> > > 
> > > > > A mount option is helpful in this case, because you can
> > > > > perform
> > > > > the
> > > > > setup through the normal fstab or autofs config file
> > > > > configuration
> > > > > route. It also make sense if you have a nfsroot setup.
> > > > 
> > > > NFSROOT is the only usage scenario where I see a mount option
> > > > being
> > > > a superior administrative interface. However I don't feel that
> > > > NFSROOT is going to host workloads that would need multiple
> > > > connections. KIS
> > > > 
> > > > 
> > > > > Finally, even if you do want to have a daemon manage your
> > > > > transport,
> > > > > configuration, you do want a mechanism to help it reach an
> > > > > equilibrium
> > > > > state quickly. Connections take time to bring up and tear
> > > > > down
> > > > > because
> > > > > performance measurements take time to build up sufficient
> > > > > statistical
> > > > > precision. Furthermore, doing so comes with a number of
> > > > > hidden
> > > > > costs,
> > > > > e.g.: chewing up privileged port numbers by putting them in a
> > > > > TIME_WAIT
> > > > > state. If you know that a given server is always subject to
> > > > > heavy
> > > > > traffic, then initialising the number of connections
> > > > > appropriately
> > > > > has
> > > > > value.
> > > > 
> > > > Again, I don't see how this is not something a config file can
> > > > do.
> > > 
> > > You can, but that means you have to keep said config file up to
> > > date
> > > with the contents of /etc/fstab etc. Pulverising configuration
> > > into
> > > little bits and pieces that are scattered around in different
> > > files
> > > is
> > > not a user friendly interface either.
> > > 
> > > > The stated intent of "nconnect" way back when was for
> > > > experimentation.
> > > > It works great for that!
> > > > 
> > > > I don't see it as a desirable long-term administrative
> > > > interface,
> > > > though. I'd rather not nail in a new mount option that we
> > > > actually
> > > > plan to obsolete in favor of an automated mechanism. I'd rather
> > > > see
> > > > us design the administrative interface with automation from the
> > > > start. That will have a lower long-term maintenance cost.
> > > > 
> > > > Again, I'm not objecting to support for multiple connections.
> > > > It's
> > > > just that adding a mount option doesn't feel like a friendly or
> > > > finished interface for actual users. A config file (or re-using
> > > > nfs.conf) seems to me like a better approach.
> > > 
> > > nfs.conf is great for defining global defaults.
> > > 
> > > It can do server specific configuration, but is not a popular
> > > solution
> > > for that. Most people are still putting that information in
> > > /etc/fstab
> > > so that it appears in one spot.
> > > 
> > What about nfsmount.conf? That seems like a more reasonable place
> > to define how mounts should work... 
> > 
> 
> That has the exact same problem. As long as it defines global
> defaults,
> then fine, but if it pulverises the configuration for each and every
> server, and makes it harder to trace what overrides are being
> applied, and where they are being applied then it is not helpful.
> 
> Another issue there is that neither nfs.conf nor nfsmount.conf are
> being used by all implementations of the mount utility. As far as I
> know they are not supported by busybox, for instance.
> 

BTW: Just a reminder that neither nfs.conf nor nfsmount.conf are kernel
APIs. They are just configuration files for other utilities and daemons
that actually call kernel APIs. So talk about shifting the
responsibility for defining connection topologies to those files is not
helpful unless you also describe (and develop) the kernel interfaces to
be used by whatever reads those files.
Chuck Lever III June 12, 2019, 5:36 p.m. UTC | #48
Hi Neil-

> On Jun 11, 2019, at 7:42 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Tue, Jun 11 2019, Chuck Lever wrote:
> 
>> 
>> Earlier in this thread, Neil proposed to make nconnect a hint. Sounds
>> like the long term plan is to allow "up to N" connections with some
>> mechanism to create new connections on-demand." maxconn fits that idea
>> better, though I'd prefer no new mount options... the point being that
>> eventually, this setting is likely to be an upper bound rather than a
>> fixed value.
> 
> When I suggested making it a hint, I considered and rejected the
> idea of making it a maximum.  Maybe I should have been explicit about
> that.
> 
> I think it *is* important to be able to disable multiple connections,
> hence my suggestion that "nconnect=1", as a special case, could be a
> firm maximum.
> My intent was that if nconnect was not specified, or was given a larger
> number, then the implementation should be free to use however many
> connections it chose from time to time.  The number given would be just
> a hint - maybe an initial value.  Neither a maximum nor a minimum.
> Maybe we should add "nonconnect" (or similar) to enforce a single
> connection, rather than overloading "nconnect=1"

So then I think, for the immediate future, you want to see nconnect=
specify the exact number of connections that will be opened (later
it can be something the client chooses automatically). IIRC that's
what Trond's patches already do.

Actually I prefer that the default behavior be the current behavior,
where the client uses one connection per client-server pair. That
serves the majority of use cases well enough. Saying that default is
nconnect=1 is then intuitive to understand.

At some later point if we convince ourselves that a higher default
is safe (ie, does not result in performance regressions in some cases)
then raise the default to nconnect=2 or 3.

I'm not anxious to allow everyone to open an unlimited number of
connections just yet. That has all kinds of consequences for servers,
privileged port consumption, etc, etc. I'm not wont to hand an
unlimited capability to admins who are not NFS-savvy in the name of
experimentation. That will just make for more phone calls to our
support centers and possibly some annoyed storage administrators.
And it seems like something that can be abused pretty easily by
certain ne'er-do-wells.

Starting with a maximum of 3 or 4 is conservative yet exposes immediate
benefits. The default connection behavior remains the same. No surprises
when a stock Linux NFS client is upgraded to a kernel that supports
nconnect.

The maximum setting can be raised once we understand the corner cases,
the benefits, and the pitfalls.


> You have said elsewhere that you would prefer configuration in a config
> file rather than as a mount option.
> How do you imagine that configuration information getting into the
> kernel?

I'm assuming Trond's design, where the kernel RPC client upcalls to
a user space agent (a new daemon, or request-key).


> Do we create /sys/fs/nfs/something?  or add to /proc/sys/sunrpc
> or /proc/net/rpc .... we have so many options !!
> There is even /sys/kernel/debug/sunrpc/rpc_clnt, but that is not
> a good place for configuration.
> 
> I suspect that you don't really have an opinion, you just don't like the
> mount option.  However I don't have that luxury.  I need to put the
> configuration somewhere.  As it is per-server configuration the only
> existing place that works at all is a mount option.
> While that might not be ideal, I do think it is most realistic.
> Mount options can be deprecated, and carrying support for a deprecated
> mount option is not expensive.

It's not deprecation that worries me, it's having to change the
mount option; and the fact that we already believe it will have to
change makes it especially worrisome that we are picking the wrong
horse at the start.

NFS mount options will appear in automounter maps for a very long
time. They will be copied to other OSes. Deprecation is more
expensive than you might at first think.


> The option still can be placed in a per-server part of
> /etc/nfsmount.conf rather than /etc/fstab, if that is what a sysadmin
> wants to do.

I don't see that having a mount option /and/ a configuration file
addresses Trond's concern about config pulverization. It makes it
worse, in fact. But my fundamental problem is with a per-server
setting specified as a per-mount option. Using a config file is
just a possible way to address that problem.

For a moment, let's turn the mount option idea on its head. Another
alternative would be to make nconnect into a real per-mount setting
instead of a per-server setting.

So now each mount gets to choose the number of connections it is
permitted to use. Suppose we have three concurrent mounts:

   mount -o nconnect=3 server1:/export /mnt/one
   mount server2:/export /mnt/two
   mount -o nconnect=2 server3:/export /mnt/three

The client opens the maximum of the three nconnect values, which
is 3. Then:

Traffic to server2 may use only one of these connections. Traffic
to server3 may use no more than two of those connections. Traffic
to server1 may use all three of those connections.

Does that make more sense than a per-server setting? Is it feasible
to implement?
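
Expressed as a toy model, so we are talking about the same thing
(purely illustrative: it treats the three mounts as sharing one pool of
connections to a single server, and the names and numbers are
invented) - the client opens max(nconnect) connections, and each mount
spreads its traffic over no more than its own nconnect of them:

#include <stdio.h>

struct mnt {
	const char *path;
	int nconnect;		/* this mount's cap */
	int next;		/* round-robin cursor */
};

/* Round-robin, but only within this mount's allowance. */
static int pick_xprt(struct mnt *m)
{
	m->next = (m->next + 1) % m->nconnect;
	return m->next;
}

int main(void)
{
	struct mnt mnts[] = {
		{ "/mnt/one",   3, -1 },	/* nconnect=3 */
		{ "/mnt/two",   1, -1 },	/* default    */
		{ "/mnt/three", 2, -1 },	/* nconnect=2 */
	};
	int pool = 0;

	for (int i = 0; i < 3; i++)
		if (mnts[i].nconnect > pool)
			pool = mnts[i].nconnect;
	printf("connections opened: %d\n", pool);

	for (int round = 0; round < 3; round++)
		for (int i = 0; i < 3; i++)
			printf("%-10s -> connection %d\n",
			       mnts[i].path, pick_xprt(&mnts[i]));
	return 0;
}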


--
Chuck Lever
Chuck Lever III June 12, 2019, 6:32 p.m. UTC | #49
Hi Neil-


> On Jun 11, 2019, at 9:49 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Tue, Jun 11 2019, Chuck Lever wrote:
> 
>> Hi Neil-
>> 
>> 
>>> On Jun 10, 2019, at 9:09 PM, NeilBrown <neilb@suse.com> wrote:
>>> 
>>> On Fri, May 31 2019, Chuck Lever wrote:
>>> 
>>>>> On May 30, 2019, at 6:56 PM, NeilBrown <neilb@suse.com> wrote:
>>>>> 
>>>>> On Thu, May 30 2019, Chuck Lever wrote:
>>>>> 
>>>>>> Hi Neil-
>>>>>> 
>>>>>> Thanks for chasing this a little further.
>>>>>> 
>>>>>> 
>>>>>>> On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
>>>>>>> 
>>>>>>> This patch set is based on the patches in the multipath_tcp branch of
>>>>>>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
>>>>>>> 
>>>>>>> I'd like to add my voice to those supporting this work and wanting to
>>>>>>> see it land.
>>>>>>> We have had customers/partners wanting this sort of functionality for
>>>>>>> years.  In SLES releases prior to SLE15, we've provide a
>>>>>>> "nosharetransport" mount option, so that several filesystem could be
>>>>>>> mounted from the same server and each would get its own TCP
>>>>>>> connection.
>>>>>> 
>>>>>> Is it well understood why splitting up the TCP connections result
>>>>>> in better performance?
>>>>>> 
>>>>>> 
>>>>>>> In SLE15 we are using this 'nconnect' feature, which is much nicer.
>>>>>>> 
>>>>>>> Partners have assured us that it improves total throughput,
>>>>>>> particularly with bonded networks, but we haven't had any concrete
>>>>>>> data until Olga Kornievskaia provided some concrete test data - thanks
>>>>>>> Olga!
>>>>>>> 
>>>>>>> My understanding, as I explain in one of the patches, is that parallel
>>>>>>> hardware is normally utilized by distributing flows, rather than
>>>>>>> packets.  This avoid out-of-order deliver of packets in a flow.
>>>>>>> So multiple flows are needed to utilizes parallel hardware.
>>>>>> 
>>>>>> Indeed.
>>>>>> 
>>>>>> However I think one of the problems is what happens in simpler scenarios.
>>>>>> We had reports that using nconnect > 1 on virtual clients made things
>>>>>> go slower. It's not always wise to establish multiple connections
>>>>>> between the same two IP addresses. It depends on the hardware on each
>>>>>> end, and the network conditions.
>>>>> 
>>>>> This is a good argument for leaving the default at '1'.  When
>>>>> documentation is added to nfs(5), we can make it clear that the optimal
>>>>> number is dependant on hardware.
>>>> 
>>>> Is there any visibility into the NIC hardware that can guide this setting?
>>>> 
>>> 
>>> I doubt it, partly because there is more than just the NIC hardware at issue.
>>> There is also the server-side hardware and possibly hardware in the middle.
>> 
>> So the best guidance is YMMV. :-)
>> 
>> 
>>>>>> What about situations where the network capabilities between server and
>>>>>> client change? Problem is that neither endpoint can detect that; TCP
>>>>>> usually just deals with it.
>>>>> 
>>>>> Being able to manually change (-o remount) the number of connections
>>>>> might be useful...
>>>> 
>>>> Ugh. I have problems with the administrative interface for this feature,
>>>> and this is one of them.
>>>> 
>>>> Another is what prevents your client from using a different nconnect=
>>>> setting on concurrent mounts of the same server? It's another case of a
>>>> per-mount setting being used to control a resource that is shared across
>>>> mounts.
>>> 
>>> I think that horse has well and truly bolted.
>>> It would be nice to have a "server" abstraction visible to user-space
>>> where we could adjust settings that make sense server-wide, and then a way
>>> to mount individual filesystems from that "server" - but we don't.
>> 
>> Even worse, there will be some resource sharing between containers that
>> might be undesirable. The host should have ultimate control over those
>> resources.
>> 
>> But that is neither here nor there.
>> 
>> 
>>> Probably the best we can do is to document (in nfs(5)) which options are
>>> per-server and which are per-mount.
>> 
>> Alternately, the behavior of this option could be documented this way:
>> 
>> The default value is one. To resolve conflicts between nconnect settings on
>> different mount points to the same server, the value set on the first mount
>> applies until there are no more mounts of that server, unless nosharecache
>> is specified. When following a referral to another server, the nconnect
>> setting is inherited, but the effective value is determined by other mounts
>> of that server that are already in place.
>> 
>> I hate to say it, but the way to make this work deterministically is to
>> ask administrators to ensure that the setting is the same on all mounts
>> of the same server. Again I'd rather this take care of itself, but it
>> appears that is not going to be possible.
>> 
>> 
>>>> Adding user tunables has never been known to increase the aggregate
>>>> amount of happiness in the universe. I really hope we can come up with
>>>> a better administrative interface... ideally, none would be best.
>>> 
>>> I agree that none would be best.  It isn't clear to me that that is
>>> possible.
>>> At present, we really don't have enough experience with this
>>> functionality to be able to say what the trade-offs are.
>>> If we delay the functionality until we have the perfect interface,
>>> we may never get that experience.
>>> 
>>> We can document "nconnect=" as a hint, and possibly add that
>>> "nconnect=1" is a firm guarantee that more will not be used.
>> 
>> Agree that 1 should be the default. If we make this setting a
>> hint, then perhaps it should be renamed; nconnect makes it sound
>> like the client will always open N connections. How about "maxconn" ?
>> 
>> Then, to better define the behavior:
>> 
>> The range of valid maxconn values is 1 to 3? to 8? to NCPUS? to the
>> count of the client’s NUMA nodes? I’d be in favor of a small number
>> to start with. Solaris' experience with multiple connections is that
>> there is very little benefit past 8.
>> 
>> If maxconn is specified with a datagram transport, does the mount
>> operation fail, or is the setting is ignored?
> 
> With Trond's patches, the setting is ignored (as he said in a reply).
> With my version, the setting is honoured.
> Specifically, 'n' separate UDP sockets are created, each bound to a
> different local port, each sending to the same server port.
> If a bonding driver is using the source-port in the output hash
> (xmit_policy=layer3+4 in the terminology of
> linux/Documentation/net/bonding.txt),
> then this would get better throughput over bonded network interfaces.

One assumes the server end is careful to send a reply back
to the same client UDP source port from whence came the
matching request?


>> If maxconn is a hint, when does the client open additional
>> connections?
>> 
>> IMO documentation should be clear that this setting is not for the
>> purpose of multipathing/trunking (using multiple NICs on the client
>> or server). The client has to do trunking detection/discovery in that
>> case, and nconnect doesn't add that logic. This is strictly for
>> enabling multiple connections between one client-server IP address
>> pair.
>> 
>> Do we need to state explicitly that all transport connections for a
>> mount (or client-server pair) are the same connection type (i.e., all
>> TCP or all RDMA, never a mix)?
>> 
>> 
>>> Then further down the track, we might change the actual number of
>>> connections automatically if a way can be found to do that without cost.
>> 
>> Fair enough.
>> 
>> 
>>> Do you have any objections apart from the nconnect= mount option?
>> 
>> Well I realize my last e-mail sounded a little negative, but I'm
>> actually in favor of adding the ability to open multiple connections
>> per client-server pair. I just want to be careful about making this
>> a feature that has as few downsides as possible right from the start.
>> I'll try to be more helpful in my responses.
>> 
>> Remaining implementation issues that IMO need to be sorted:
>> 
>> • We want to take care that the client can recover network resources
>> that have gone idle. Can we reuse the auto-close logic to close extra
>> connections?
> 
> Were you aware that auto-close was ineffective with NFSv4 as the regular
> RENEW (or SEQUENCE for v4.1) keeps a connection open?
> My patches already force session management requests onto a single xprt.
> It probably makes sense to do the same for RENEW and SEQUENCE.
> Then when there is no fs activity, the other connections will close.
> There is no mechanism to re-open only some of them though.  Any
> non-trivial amount of traffic will cause all connections to re-open.

This seems sensible.


>> • How will the client schedule requests on multiple connections?
>> Should we enable the use of different schedulers?
>> • How will retransmits be handled?
>> • How will the client recover from broken connections? Today's clients
>> use disconnect to determine when to retransmit, thus there might be
>> some unwanted interactions here that result in mount hangs.
>> • Assume NFSv4.1 session ID rather than client ID trunking: is Linux
>> client support in place for this already?
>> • Are there any concerns about how the Linux server DRC will behave in
>> multi-connection scenarios?
>> 
>> None of these seem like a deal breaker. And possibly several of these
>> are already decided, but just need to be published/documented.
> 
> How about this:

Thanks for writing this up.


> NFS normally sends all requests to the server (and receives all replies)
> over a single network connection, whether TCP, RDMA or (for NFSv3 and
> earlier) UDP.  Often this is sufficient to utilize all available
> network bandwidth, but not always.  When there is sufficient
> parallelism in the server, the client, and the network connection, the
> restriction to a single TCP stream can become a limitation.
> 
> A simple scenario which portrays this limitation involves several
> direct network connections between client and server where the multiple
> interfaces on each end are bonded together.  If this bonding diverts
> different flows to different interfaces, then a single TCP connection
> will be limited to a single network interface, while multiple
> connections could make use of all interfaces.  Various other scenarios
> are possible including network controllers with multiple DMA/TSO
> engines where a given flow can only be associated with a single engine
> at a time, or Receive-side scaling which can direct different flows to
> different receive queues and thence to different CPU cores.
> 
> NFS has two distinct and complementary mechanisms to enable the use of
> multiple connections to carry requests and replies.  We will refer to
> these as trunking and nconnect, though the NFS RFCs use the term
> "trunking" in a way that covers both.
> 
> With trunking (also known as multipathing), the server-side IP address
> of each connection is different.  RFC8587 (and other documents)
> describe how a client can determine if two connections to different
> addresses actually refer to the same server and so can be used for
> trunking. The client can use explicit configuration, possibly using
> the NFSv4 `fs_locations` attribute, to find the different addresses,
> and can then establish multiple trunks.  With trunking, the different
> connections could conceivably be over different protocols, both TCP and
> RDMA for example.  Trunking makes use of explicit parallelism in the
> network configuration.
> 
> With nconnect, both the client and server side IP addresses are the
> same on each connection, but the client side port number varies.

Note that the client IP source port number is not relevant for RDMA
connections. Multiple connections to the same service are de-
multiplexed using other means.

So then the goal of nconnect is specifically to enable multiple
independent flows between the same two network endpoints.

Note that a server is also responsible for detecting when two
unique IP addresses are the same client for purposes of open/lock
state recovery. It's possible that the same client IP address can
host multiple NFS client instances each at different source ports.

NFSv4 has protocol to do this (SETCLIENTID and EXCHANGE_ID), but
NFSv2/3 do not. This is one reason why Tom has been counseling
caution about multichannel NFSv2/3. Perhaps that is only an issue
for NLM, which is already a separate connection...


> This enables NFS to benefit from transparent parallelism in the network
> stack, such as interface bonding and receive-side scaling as described
> earlier.
> 
> When multiple connections are available, NFS will send
> session-management requests on a single connection (the first
> connection opened)

Maybe you meant "lease management" requests?

EXCHANGE_ID, RECLAIM_COMPLETE, CREATE_SESSION, DESTROY_SESSION
and DESTROY_CLIENTID will of course go over the main connection.
However, each connection will need to use BIND_CONN_TO_SESSION
to join an existing session. That's how the server knows
the additional connections are from a client instance it has
already recognized.
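
Expressed as a sketch, bringing an extra transport into an established
v4.1 client would look something like this (the function name below is
invented for illustration; it is not the in-tree symbol):

#include <linux/nfs_fs_sb.h>
#include <linux/sunrpc/xprt.h>

/*
 * Sketch only: how an additional transport joins an existing
 * NFSv4.1 session.  Names are illustrative.
 */
static int nfs41_attach_extra_transport(struct nfs_client *clp,
                                        struct rpc_xprt *xprt)
{
        /*
         * EXCHANGE_ID and CREATE_SESSION already ran on the client's
         * first transport; the new transport only has to bind itself
         * to that session so the server recognizes it.
         */
        return nfs41_bind_conn_to_session_on(clp, xprt);
}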

For NFSv4.0, SETCLIENTID, SETCLIENTID_CONFIRM, and RENEW
would go over the main connection (and those have nothing to do
with sessions).


> while general filesystem access requests will be
> distributed over all available connections.  When load is light (as
> measured by the number of outstanding requests on each connection)
> requests will be distributed in a round-robin fashion.  When the number
> of outstanding requests on any connection exceeds 2, and also exceeds
> the average across all connections, that connection will be skipped in
> the round-robin.  As flows are likely to be distributed over hardware
> in a non-fair manner (such as a hash on the port number), it is likely
> that each hardware resource might serve a different number of flows.
> Bypassing flows with above-average backlog goes some way to restoring
> fairness to the distribution of requests across hardware resources.
> 
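For concreteness, the skip-above-average policy described above could
look roughly like this (an illustrative sketch only; these are not the
actual sunrpc structures or symbols):

/*
 * Sketch of "skip connections whose backlog exceeds 2 and also exceeds
 * the average".  Purely illustrative, not the in-tree implementation.
 */
struct flow {
        unsigned int queuelen;          /* requests outstanding on this flow */
};

static struct flow *pick_flow(struct flow *flows, unsigned int nflows,
                              unsigned int *last)
{
        unsigned int total = 0, i;

        for (i = 0; i < nflows; i++)
                total += flows[i].queuelen;

        for (i = 1; i <= nflows; i++) {
                struct flow *f = &flows[(*last + i) % nflows];

                /* eligible if lightly loaded, or no worse than average */
                if (f->queuelen <= 2 || f->queuelen * nflows <= total) {
                        *last = (*last + i) % nflows;
                        return f;
                }
        }
        /* should not normally be reached; fall back to plain round-robin */
        *last = (*last + 1) % nflows;
        return &flows[*last];
}
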
> In the (hopefully rare) case that a retransmit is needed for an
> (apparently) lost packet,  the same connection - or at least the same
> source port number - will be used for all retransmits.  This ensures
> that any Duplicate Reply Cache on the server has the best possible
> chance of recognizing the retransmission for what it is.  When a given
> connection breaks and needs to be re-established, pending requests on
> that connection will be resent.  Pending requests on other connections
> will not be affected.

I'm having trouble with several points regarding retransmission.

1. Retransmission is also used to recover when a server or its
backend storage drops a request. An NFSv3 server is permitted by
spec to drop requests without notifying clients. That's a nit with
your write-up, but...

2. An NFSv4 server MUST NOT drop requests; if it ever does it is
required to close the connection to force the client to retransmit.
In fact, current clients depend on connection loss to know when
to retransmit. Both Linux and Solaris no longer use retransmit
timeouts to trigger retransmit; they will only retransmit after
connection loss.

2a. IMO the spec is written such that a client is allowed to send
a retransmission on another connection that already exists. But
maybe that is not what we want to implement.

3. RPC/RDMA clients always drop the connection before retransmitting
because they have to reset the connection's credit accounting.

4. RPC/RDMA cannot depend on IP source port, because the RPC part
of the stack has no visibility into the choice of source port that
is chosen. Thus the server's DRC cannot use the source port. I
think server DRC's need to be prepared to deal with multiple client
connections.

5. The DRC (and thus considerations about the client IP source port)
does not come into play for NFSv4.1 sessions.


> Trunking (as described here) is not currently supported by the Linux
> NFS client except in pNFS configurations (I think - is that right?).
> nconnect is supported and currently requires a mount option.
> 
> If the "nonconnect" mount option is given, then nconnect is completely
> disabled for the target server.  If "nconnect=N" is given (for some N
> from 1 to 256) then that many connections will initially be created and
> used.  Over time, the number of connections may be increased or
> decreased depending on available resources and recent demand.  This may
> also happen if neither "nonconnect" nor "nconnect=" is given.  However
> no design or implementation yet exists for this possibility.

See my e-mail from earlier today on mount option behavior.

I prefer "nconnect=1" to "nonconnect"....


> Where multiple filesystems are mounted from the same server, the
> "nconnect" option given for the first mount will apply to all mounts
> from that server.  If the option is given on subsequent mounts from the
> server, it will be silently ignored.
> 
> 
> What doesn't that cover?
> Having written it, I wonder if I should change the terminology to
> distinguish between "multipath trunking" where the server IP address
> varies, and "connection trunking" where the server IP address is fixed.

I agree that the write-up needs to be especially careful about
terminology.

"multi-path trunking" is probably not appropriate, but "connection
trunking" might be close. I used "multi-flow" above, fwiw.


> Suppose we do add multi-path (non-pNFS) trunking support.   Would it
> make sense to have multiple connections over each path?

IMO, yes.

> Would each path benefit from the same number of connections?

Probably not, the client will need to have some mechanism for
deciding how many connections to open for each trunk, or it
will have to use a fixed number (like nconnect).


> How do we manage that?

Presumably via the same mechanism that the client would use
to determine how many connections to open for a single pair
of endpoints.

Actually I suspect that for pNFS file and flexfile layouts,
the client will want to use multi-flow when communicating
with DS's. So this future may be here pretty quickly.


--
Chuck Lever
NeilBrown June 12, 2019, 11:03 p.m. UTC | #50
On Wed, Jun 12 2019, Chuck Lever wrote:

> Hi Neil-
>
>> On Jun 11, 2019, at 7:42 PM, NeilBrown <neilb@suse.com> wrote:
>> 
>> On Tue, Jun 11 2019, Chuck Lever wrote:
>> 
>>> 
>>> Earlier in this thread, Neil proposed to make nconnect a hint. Sounds
>>> like the long term plan is to allow "up to N" connections with some
>>> mechanism to create new connections on-demand." maxconn fits that idea
>>> better, though I'd prefer no new mount options... the point being that
>>> eventually, this setting is likely to be an upper bound rather than a
>>> fixed value.
>> 
>> When I suggested making it a hint, I considered and rejected the
>> idea of making it a maximum.  Maybe I should have been explicit about
>> that.
>> 
>> I think it *is* important to be able to disable multiple connections,
>> hence my suggestion that "nconnect=1", as a special case, could be a
>> firm maximum.
>> My intent was that if nconnect was not specified, or was given a larger
>> number, then the implementation should be free to use however many
>> connections it chose from time to time.  The number given would be just
>> a hint - maybe an initial value.  Neither a maximum nor a minimum.
>> Maybe we should add "nonconnect" (or similar) to enforce a single
>> connection, rather than overloading "nconnect=1"
>
> So then I think, for the immediate future, you want to see nconnect=
> specify the exact number of connections that will be opened. (later
> it can be something the client chooses automatically). IIRC that's
> what Trond's patches already do.
>
> Actually I prefer that the default behavior be the current behavior,
> where the client uses one connection per client-server pair. That
> serves the majority of use cases well enough. Saying that default is
> nconnect=1 is then intuitive to understand.
>
> At some later point if we convince ourselves that a higher default
> is safe (ie, does not result in performance regressions in some cases)
> then raise the default to nconnect=2 or 3.
>
> I'm not anxious to allow everyone to open an unlimited number of
> connections just yet. That has all kinds of consequences for servers,
> privileged port consumption, etc, etc. I'm not wont to hand an
> unlimited capability to admins who are not NFS-savvy in the name of
> experimentation. That will just make for more phone calls to our
> support centers and possibly some annoyed storage administrators.
> And it seems like something that can be abused pretty easily by
> certain ne'er-do-wells.

I'm sorry, but this comes across to me as very paternalistic.
It is not our place to stop people shooting themselves in the foot.
It *is* our place to avoid security vulnerabilities, but not to prevent
a self-inflicted denial of service.

And no-one is suggesting unlimited (even Solaris limits clnt_max_conns to
2^31-1).  I'm suggesting 256.
If you like, we can make the limit a module parameter so distros can
easily tune it down.  But I'm strongly against imposing a hard limit of
4 or even 8.
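
Something like this would do (a sketch only; the parameter and helper
names are illustrative):

#include <linux/module.h>
#include <linux/errno.h>

/* Cap on nconnect, tunable by distros without a rebuild. */
static unsigned int max_connect = 256;
module_param(max_connect, uint, 0644);
MODULE_PARM_DESC(max_connect,
                 "Maximum number of connections per client-server pair");

/* ... and in mount-option parsing ... */
static int nfs_check_nconnect(unsigned int nconnect)
{
        if (nconnect < 1 || nconnect > max_connect)
                return -EINVAL;
        return 0;
}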

>
> Starting with a maximum of 3 or 4 is conservative yet exposes immediate
> benefits. The default connection behavior remains the same. No surprises
> when a stock Linux NFS client is upgraded to a kernel that supports
> nconnect.
>
> The maximum setting can be raised once we understand the corner cases,
> the benefits, and the pitfalls.

I'm quite certain that some customers will have much more performant
hardware than any of us might have in the lab.  They will be the ones to
reap the benefits and find the corner cases and pitfalls.  We need to let
them.

>
>
>> You have said elsewhere that you would prefer configuration in a config
>> file rather than as a mount option.
>> How do you imagine that configuration information getting into the
>> kernel?
>
> I'm assuming Trond's design, where the kernel RPC client upcalls to
> a user space agent (a new daemon, or request-key).
>
>
>> Do we create /sys/fs/nfs/something?  or add to /proc/sys/sunrpc
>> or /proc/net/rpc .... we have so many options !!
>> There is even /sys/kernel/debug/sunrpc/rpc_clnt, but that is not
>> a good place for configuration.
>> 
>> I suspect that you don't really have an opinion, you just don't like the
>> mount option.  However I don't have that luxury.  I need to put the
>> configuration somewhere.  As it is per-server configuration the only
>> existing place that works at all is a mount option.
>> While that might not be ideal, I do think it is most realistic.
>> Mount options can be deprecated, and carrying support for a deprecated
>> mount option is not expensive.
>
> It's not deprecation that worries me, it's having to change the
> mount option; and the fact that we already believe it will have to
> change makes it especially worrisome that we are picking the wrong
> horse at the start.
>
> NFS mount options will appear in automounter maps for a very long
> time. They will be copied to other OSes. Deprecation is more
> expensive than you might at first think.

automounter maps are a good point .... if this functionality isn't
supported as a mount option, how does someone who uses automounter maps
roll it out?

>
>
>> The option still can be placed in a per-server part of
>> /etc/nfsmount.conf rather than /etc/fstab, if that is what a sysadmin
>> wants to do.
>
> I don't see that having a mount option /and/ a configuration file
> addresses Trond's concern about config pulverization. It makes it
> worse, in fact. But my fundamental problem is with a per-server
> setting specified as a per-mount option. Using a config file is
> just a possible way to address that problem.
>
> For a moment, let's turn the mount option idea on its head. Another
> alternative would be to make nconnect into a real per-mount setting
> instead of a per-server setting.
>
> So now each mount gets to choose the number of connections it is
> permitted to use. Suppose we have three concurrent mounts:
>
>    mount -o nconnect=3 server1:/export /mnt/one
>    mount server2:/export /mnt/two
>    mount -o nconnect=2 server3:/export /mnt/three
>
> The client opens the maximum of the three nconnect values, which
> is 3. Then:
>
> Traffic to server2 may use only one of these connections. Traffic
> to server3 may use no more than two of those connections. Traffic
> to server1 may use all three of those connections.
>
> Does that make more sense than a per-server setting? Is it feasible
> to implement?

If the servers are distinct, then the connections to them must be
distinct, so no sharing happens here.

But I suspect you meant to have three mounts from the same server, each
with different nconnect values.
So 3 connections are created:
  /mnt/one is allowed all of them
  /mnt/two is allowed to use only one
  /mnt/three is allowed to use only two

Which one or two?  Can /mnt/two use any one as long as it only uses one
at a time, or must it choose one up front and stick to that?
Can /mnt/three arrange to use the two that /mnt/two isn't using?

I think the easiest, and possibly most obvious, would be that each used
the "first" N connections.  So the third connection would only ever be
used by /mnt/one.  Load-balancing would be interesting, but not
impossible.  It might lead to /mnt/one preferentially using the third
connection because it has exclusive access.

I don't think this complexity gains us anything.

A different approach would be have the number of connections to each
server be the maximum number that any mount requested.  Then all mounts
use all connections.

So when /mnt/one is mounted, three connections are established, and they
are all used by /mnt/two and /mnt/three.  But if /mnt/one is unmounted,
then the third connection is closed (once it becomes idle) and /mnt/two
and /mnt/three continue using just two connections.

Adding a new connection is probably quite easy.  Deleting a connection
is probably a little less straightforward, but should be manageable.
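
Roughly, I imagine something like this (sketch only; the structures and
names are made up for illustration):

#include <linux/list.h>

/*
 * Sketch: connections per server == max nconnect over active mounts,
 * recomputed on every mount and unmount.
 */
struct server_state {
        struct list_head        mounts;         /* active mounts of this server */
        unsigned int            nr_conns;       /* transports currently open */
};

struct mount_state {
        struct list_head        list;
        unsigned int            nconnect;       /* value given at mount time */
};

static void update_server_connections(struct server_state *server)
{
        struct mount_state *m;
        unsigned int want = 1;

        list_for_each_entry(m, &server->mounts, list)
                if (m->nconnect > want)
                        want = m->nconnect;

        while (server->nr_conns < want) {
                /* open and add a new transport */
                server->nr_conns++;
        }
        /* if nr_conns > want, mark the excess transports to be
         * closed once they become idle */
}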

How would you feel about that approach?

Thanks,
NeilBrown


>
>
> --
> Chuck Lever
NeilBrown June 12, 2019, 11:37 p.m. UTC | #51
On Wed, Jun 12 2019, Chuck Lever wrote:

> Hi Neil-
>
>
>> On Jun 11, 2019, at 9:49 PM, NeilBrown <neilb@suse.com> wrote:
>> 
>> On Tue, Jun 11 2019, Chuck Lever wrote:
>> 
>>> Hi Neil-
>>> 
>>> 
>>>> On Jun 10, 2019, at 9:09 PM, NeilBrown <neilb@suse.com> wrote:
>>>> 
>>>> On Fri, May 31 2019, Chuck Lever wrote:
>>>> 
>>>>>> On May 30, 2019, at 6:56 PM, NeilBrown <neilb@suse.com> wrote:
>>>>>> 
>>>>>> On Thu, May 30 2019, Chuck Lever wrote:
>>>>>> 
>>>>>>> Hi Neil-
>>>>>>> 
>>>>>>> Thanks for chasing this a little further.
>>>>>>> 
>>>>>>> 
>>>>>>>> On May 29, 2019, at 8:41 PM, NeilBrown <neilb@suse.com> wrote:
>>>>>>>> 
>>>>>>>> This patch set is based on the patches in the multipath_tcp branch of
>>>>>>>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
>>>>>>>> 
>>>>>>>> I'd like to add my voice to those supporting this work and wanting to
>>>>>>>> see it land.
>>>>>>>> We have had customers/partners wanting this sort of functionality for
>>>>>>>> years.  In SLES releases prior to SLE15, we've provide a
>>>>>>>> "nosharetransport" mount option, so that several filesystem could be
>>>>>>>> mounted from the same server and each would get its own TCP
>>>>>>>> connection.
>>>>>>> 
>>>>>>> Is it well understood why splitting up the TCP connections result
>>>>>>> in better performance?
>>>>>>> 
>>>>>>> 
>>>>>>>> In SLE15 we are using this 'nconnect' feature, which is much nicer.
>>>>>>>> 
>>>>>>>> Partners have assured us that it improves total throughput,
>>>>>>>> particularly with bonded networks, but we haven't had any concrete
>>>>>>>> data until Olga Kornievskaia provided some concrete test data - thanks
>>>>>>>> Olga!
>>>>>>>> 
>>>>>>>> My understanding, as I explain in one of the patches, is that parallel
>>>>>>>> hardware is normally utilized by distributing flows, rather than
>>>>>>>> packets.  This avoid out-of-order deliver of packets in a flow.
>>>>>>>> So multiple flows are needed to utilizes parallel hardware.
>>>>>>> 
>>>>>>> Indeed.
>>>>>>> 
>>>>>>> However I think one of the problems is what happens in simpler scenarios.
>>>>>>> We had reports that using nconnect > 1 on virtual clients made things
>>>>>>> go slower. It's not always wise to establish multiple connections
>>>>>>> between the same two IP addresses. It depends on the hardware on each
>>>>>>> end, and the network conditions.
>>>>>> 
>>>>>> This is a good argument for leaving the default at '1'.  When
>>>>>> documentation is added to nfs(5), we can make it clear that the optimal
>>>>>> number is dependent on hardware.
>>>>> 
>>>>> Is there any visibility into the NIC hardware that can guide this setting?
>>>>> 
>>>> 
>>>> I doubt it, partly because there is more than just the NIC hardware at issue.
>>>> There is also the server-side hardware and possibly hardware in the middle.
>>> 
>>> So the best guidance is YMMV. :-)
>>> 
>>> 
>>>>>>> What about situations where the network capabilities between server and
>>>>>>> client change? Problem is that neither endpoint can detect that; TCP
>>>>>>> usually just deals with it.
>>>>>> 
>>>>>> Being able to manually change (-o remount) the number of connections
>>>>>> might be useful...
>>>>> 
>>>>> Ugh. I have problems with the administrative interface for this feature,
>>>>> and this is one of them.
>>>>> 
>>>>> Another is what prevents your client from using a different nconnect=
>>>>> setting on concurrent mounts of the same server? It's another case of a
>>>>> per-mount setting being used to control a resource that is shared across
>>>>> mounts.
>>>> 
>>>> I think that horse has well and truly bolted.
>>>> It would be nice to have a "server" abstraction visible to user-space
>>>> where we could adjust settings that make sense server-wide, and then a way
>>>> to mount individual filesystems from that "server" - but we don't.
>>> 
>>> Even worse, there will be some resource sharing between containers that
>>> might be undesirable. The host should have ultimate control over those
>>> resources.
>>> 
>>> But that is neither here nor there.
>>> 
>>> 
>>>> Probably the best we can do is to document (in nfs(5)) which options are
>>>> per-server and which are per-mount.
>>> 
>>> Alternately, the behavior of this option could be documented this way:
>>> 
>>> The default value is one. To resolve conflicts between nconnect settings on
>>> different mount points to the same server, the value set on the first mount
>>> applies until there are no more mounts of that server, unless nosharecache
>>> is specified. When following a referral to another server, the nconnect
>>> setting is inherited, but the effective value is determined by other mounts
>>> of that server that are already in place.
>>> 
>>> I hate to say it, but the way to make this work deterministically is to
>>> ask administrators to ensure that the setting is the same on all mounts
>>> of the same server. Again I'd rather this take care of itself, but it
>>> appears that is not going to be possible.
>>> 
>>> 
>>>>> Adding user tunables has never been known to increase the aggregate
>>>>> amount of happiness in the universe. I really hope we can come up with
>>>>> a better administrative interface... ideally, none would be best.
>>>> 
>>>> I agree that none would be best.  It isn't clear to me that that is
>>>> possible.
>>>> At present, we really don't have enough experience with this
>>>> functionality to be able to say what the trade-offs are.
>>>> If we delay the functionality until we have the perfect interface,
>>>> we may never get that experience.
>>>> 
>>>> We can document "nconnect=" as a hint, and possibly add that
>>>> "nconnect=1" is a firm guarantee that more will not be used.
>>> 
>>> Agree that 1 should be the default. If we make this setting a
>>> hint, then perhaps it should be renamed; nconnect makes it sound
>>> like the client will always open N connections. How about "maxconn" ?
>>> 
>>> Then, to better define the behavior:
>>> 
>>> The range of valid maxconn values is 1 to 3? to 8? to NCPUS? to the
>>> count of the client’s NUMA nodes? I’d be in favor of a small number
>>> to start with. Solaris' experience with multiple connections is that
>>> there is very little benefit past 8.
>>> 
>>> If maxconn is specified with a datagram transport, does the mount
>>> operation fail, or is the setting is ignored?
>> 
>> With Trond's patches, the setting is ignored (as he said in a reply).
>> With my version, the setting is honoured.
>> Specifically, 'n' separate UDP sockets are created, each bound to a
>> different local port, each sending to the same server port.
>> If a bonding driver is using the source-port in the output hash
>> (xmit_policy=layer3+4 in the terminology of
>> linux/Documentation/net/bonding.txt),
>> then this would get better throughput over bonded network interfaces.
>
> One assumes the server end is careful to send a reply back
> to the same client UDP source port from whence came the
> matching request?
>
>
>>> If maxconn is a hint, when does the client open additional
>>> connections?
>>> 
>>> IMO documentation should be clear that this setting is not for the
>>> purpose of multipathing/trunking (using multiple NICs on the client
>>> or server). The client has to do trunking detection/discovery in that
>>> case, and nconnect doesn't add that logic. This is strictly for
>>> enabling multiple connections between one client-server IP address
>>> pair.
>>> 
>>> Do we need to state explicitly that all transport connections for a
>>> mount (or client-server pair) are the same connection type (i.e., all
>>> TCP or all RDMA, never a mix)?
>>> 
>>> 
>>>> Then further down the track, we might change the actual number of
>>>> connections automatically if a way can be found to do that without cost.
>>> 
>>> Fair enough.
>>> 
>>> 
>>>> Do you have any objections apart from the nconnect= mount option?
>>> 
>>> Well I realize my last e-mail sounded a little negative, but I'm
>>> actually in favor of adding the ability to open multiple connections
>>> per client-server pair. I just want to be careful about making this
>>> a feature that has as few downsides as possible right from the start.
>>> I'll try to be more helpful in my responses.
>>> 
>>> Remaining implementation issues that IMO need to be sorted:
>>> 
>>> • We want to take care that the client can recover network resources
>>> that have gone idle. Can we reuse the auto-close logic to close extra
>>> connections?
>> 
>> Were you aware that auto-close was ineffective with NFSv4 as the regular
>> RENEW (or SEQUENCE for v4.1) keeps a connection open?
>> My patches already force session management requests onto a single xprt.
>> It probably makes sense to do the same for RENEW and SEQUENCE.
>> Then when there is no fs activity, the other connections will close.
>> There is no mechanism to re-open only some of them though.  Any
>> non-trivial amount of traffic will cause all connections to re-open.
>
> This seems sensible.
>
>
>>> • How will the client schedule requests on multiple connections?
>>> Should we enable the use of different schedulers?
>>> • How will retransmits be handled?
>>> • How will the client recover from broken connections? Today's clients
>>> use disconnect to determine when to retransmit, thus there might be
>>> some unwanted interactions here that result in mount hangs.
>>> • Assume NFSv4.1 session ID rather than client ID trunking: is Linux
>>> client support in place for this already?
>>> • Are there any concerns about how the Linux server DRC will behave in
>>> multi-connection scenarios?
>>> 
>>> None of these seem like a deal breaker. And possibly several of these
>>> are already decided, but just need to be published/documented.
>> 
>> How about this:
>
> Thanks for writing this up.
>
>
>> NFS normally sends all requests to the server (and receives all replies)
>> over a single network connection, whether TCP, RDMA or (for NFSv3 and
>> earlier) UDP.  Often this is sufficient to utilize all available
>> network bandwidth, but not always.  When there is sufficient
>> parallelism in the server, the client, and the network connection, the
>> restriction to a single TCP stream can become a limitation.
>> 
>> A simple scenario which portrays this limitation involves several
>> direct network connections between client and server where the multiple
>> interfaces on each end are bonded together.  If this bonding diverts
>> different flows to different interfaces, then a single TCP connection
>> will be limited to a single network interface, while multiple
>> connections could make use of all interfaces.  Various other scenarios
>> are possible including network controllers with multiple DMA/TSO
>> engines where a given flow can only be associated with a single engine
>> at a time, or Receive-side scaling which can direct different flows to
>> different receive queues and thence to different CPU cores.
>> 
>> NFS has two distinct and complementary mechanisms to enable the use of
>> multiple connections to carry requests and replies.  We will refer to
>> these as trunking and nconnect, though the NFS RFCs use the term
>> "trunking" in a way that covers both.
>> 
>> With trunking (also known as multipathing), the server-side IP address
>> of each connection is different.  RFC8587 (and other documents)
>> describe how a client can determine if two connections to different
>> addresses actually refer to the same server and so can be used for
>> trunking. The client can use explicit configuration, possibly using
>> the NFSv4 `fs_locations` attribute, to find the different addresses,
>> and can then establish multiple trunks.  With trunking, the different
>> connections could conceivably be over different protocols, both TCP and
>> RDMA for example.  Trunking makes use of explicit parallelism in the
>> network configuration.
>> 
>> With nconnect, both the client and server side IP addresses are the
>> same on each connection, but the client side port number varies.
>
> Note that the client IP source port number is not relevant for RDMA
> connections. Multiple connections to the same service are de-
> multiplexed using other means.
>
> So then the goal of nconnect is specifically to enable multiple
> independent flows between the same two network endpoints.

Yes, focusing on "independent flows" is likely to be best.  Multiple
source ports are then just one way of creating such flows.

>
> Note that a server is also responsible for detecting when two
> unique IP addresses are the same client for purposes of open/lock
> state recovery. It's possible that the same client IP address can
> host multiple NFS client instances each at different source ports.
>
> NFSv4 has protocol to do this (SETCLIENTID and EXCHANGE_ID), but
> NFSv2/3 do not. This is one reason why Tom has been counseling
> caution about multichannel NFSv2/3. Perhaps that is only an issue
> for NLM, which is already a separate connection...
>

I don't think there are any interesting issues here.  NLM and STATMON
remain separate for NFSv3 and don't change their behaviour at all.
NFSv3 has no concept of clients, only of permissions associated with
each individual request.  The server cannot differentiate between
requests from different (privileged) ports on the same client.

It is really the client that has responsibility for identifying itself.
The server only needs to reliably track whatever the client claims.

>
>> This enables NFS to benefit from transparent parallelism in the network
>> stack, such as interface bonding and receive-side scaling as described
>> earlier.
>> 
>> When multiple connections are available, NFS will send
>> session-management requests on a single connection (the first
>> connection opened)
>
> Maybe you meant "lease management" requests?

Probably I do .... though maybe I can be forgiven for mistakenly
thinking that CREATE_SESSION and DESTROY_SESSION could be described as
"session management" :-)

>
> EXCHANGE_ID, RECLAIM_COMPLETE, CREATE_SESSION, DESTROY_SESSION
> and DESTROY_CLIENTID will of course go over the main connection.
> However, each connection will need to use BIND_CONN_TO_SESSION
> to join an existing session. That's how the server knows
> the additional connections are from a client instance it has
> already recognized.
>
> For NFSv4.0, SETCLIENTID, SETCLIENTID_CONFIRM, and RENEW
> would go over the main connection (and those have nothing to do
> with sessions).

Well.... they have nothing to do with NFSv4.1 Sessions.
But it is useful to have a name for the collection of RPCs related to a
particular negotiated clientid, and "session" (small 's') seems as good
a name as any....

>
>
>> while general filesystem access requests will be
>> distributed over all available connections.  When load is light (as
>> measured by the number of outstanding requests on each connection)
>> requests will be distributed in a round-robin fashion.  When the number
>> of outstanding requests on any connection exceeds 2, and also exceeds
>> the average across all connections, that connection will be skipped in
>> the round-robin.  As flows are likely to be distributed over hardware
>> in a non-fair manner (such as a hash on the port number), it is likely
>> that each hardware resource might serve a different number of flows.
>> Bypassing flows with above-average backlog goes some way to restoring
>> fairness to the distribution of requests across hardware resources.
>> 
>> In the (hopefully rare) case that a retransmit is needed for an
>> (apparently) lost packet,  the same connection - or at least the same
>> source port number - will be used for all retransmits.  This ensures
>> that any Duplicate Reply Cache on the server has the best possible
>> chance of recognizing the retransmission for what it is.  When a given
>> connection breaks and needs to be re-established, pending requests on
>> that connection will be resent.  Pending requests on other connections
>> will not be affected.
>
> I'm having trouble with several points regarding retransmission.
>
> 1. Retransmission is also used to recover when a server or its
> backend storage drops a request. An NFSv3 server is permitted by
> spec to drop requests without notifying clients. That's a nit with
> your write-up, but...

Only if you think that when a server drops a request, it isn't "lost".

>
> 2. An NFSv4 server MUST NOT drop requests; if it ever does it is
> required to close the connection to force the client to retransmit.
> In fact, current clients depend on connection loss to know when
> to retransmit. Both Linux and Solaris no longer use retransmit
> timeouts to trigger retransmit; they will only retransmit after
> connection loss.
>
> 2a. IMO the spec is written such that a client is allowed to send
> a retransmission on another connection that already exists. But
> maybe that is not what we want to implement.

It certainly isn't what we *do* implement.
For v3 and v4.0, I think it is best to use the same xprt - which may or
may not be the same connection, but does have the same port numbers.
For v4.1 it might make sense to use another xprt if that is easy to implement.
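
In other words, something like this (a sketch; is_v41_session() is just
a stand-in for "this request belongs to an NFSv4.1+ session"):

#include <linux/sunrpc/clnt.h>
#include <linux/sunrpc/xprtmultipath.h>

/* Sketch: choose the transport for a retransmission. */
static struct rpc_xprt *retransmit_xprt(struct rpc_task *task)
{
        /*
         * v3 and v4.0: stay on the transport the request was first
         * sent on, so the server's DRC sees the same source port.
         * (is_v41_session() is a placeholder predicate, not a real helper.)
         */
        if (!is_v41_session(task))
                return task->tk_xprt;

        /*
         * v4.1+: the session's slot/sequence handling makes replay
         * detection independent of the connection, so any transport
         * in the switch would do.
         */
        return xprt_iter_get_next(&task->tk_client->cl_xpi);
}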

>
> 3. RPC/RDMA clients always drop the connection before retransmitting
> because they have to reset the connection's credit accounting.
>
> 4. RPC/RDMA cannot depend on IP source port, because the RPC part
> of the stack has no visibility into the choice of source port that
> is chosen. Thus the server's DRC cannot use the source port. I
> think server DRC's need to be prepared to deal with multiple client
> connections.

OK, that could be an issue.
Linux uses an independent xid sequence for each xprt, so two separate
xprts can easily use the same xid for different requests.
If RDMA cannot see the source port, it might depend more on the xid and
so risk getting confused.

There was a patch floating around which reserved a few bits of the xid
for an xprt index to ensure all xids were unique, but Trond didn't like
sub-dividing the xid space (which is fair enough).
So maybe it isn't safe to use nconnect with RDMA and protocol versions
earlier than 4.1.
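
For reference, that idea was roughly (a sketch, not the actual patch):

#include <linux/types.h>

/*
 * Reserve the top few xid bits for a transport index so that xids
 * are unique across transports.  Illustrative only.
 */
#define XPRT_INDEX_BITS 3               /* up to 8 transports */
#define XID_VALUE_BITS  (32 - XPRT_INDEX_BITS)

static u32 make_xid(u32 next, unsigned int xprt_index)
{
        return (next & ((1U << XID_VALUE_BITS) - 1)) |
               ((u32)xprt_index << XID_VALUE_BITS);
}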

>
> 5. The DRC (and thus considerations about the client IP source port)
> does not come into play for NFSv4.1 sessions.
>
>
>> Trunking (as described here) is not currently supported by the Linux
>> NFS client except in pNFS configurations (I think - is that right?).
>> nconnect is supported and currently requires a mount option.
>> 
>> If the "nonconnect" mount option is given, then nconnect is completely
>> disabled for the target server.  If "nconnect=N" is given (for some N
>> from 1 to 256) then that many connections will initially be created and
>> used.  Over time, the number of connections may be increased or
>> decreased depending on available resources and recent demand.  This may
>> also happen if neither "nonconnect" nor "nconnect=" is given.  However
>> no design or implementation yet exists for this possibility.
>
> See my e-mail from earlier today on mount option behavior.
>
> I prefer "nconnect=1" to "nonconnect"....
>
>
>> Where multiple filesystems are mounted from the same server, the
>> "nconnect" option given for the first mount will apply to all mounts
>> from that server.  If the option is given on subsequent mounts from the
>> server, it will be silently ignored.
>> 
>> 
>> What doesn't that cover?
>> Having written it, I wonder if I should change the terminology to
>> distinguish between "multipath trunking" where the server IP address
>> varies, and "connection trunking" where the server IP address is fixed.
>
> I agree that the write-up needs to be especially careful about
> terminology.
>
> "multi-path trunking" is probably not appropriate, but "connection
> trunking" might be close. I used "multi-flow" above, fwiw.

Why not "multi-path trunking" when the server IP varies?
I like "multi-flow trunking" when the server IP doesn't change!

Maybe the mount option should be flows=N ??

Thanks a lot,
NeilBrown


>
>
>> Suppose we do add multi-path (non-pNFS) trunking support.   Would it
>> make sense to have multiple connections over each path?
>
> IMO, yes.
>
>> Would each path benefit from the same number of connections?
>
> Probably not, the client will need to have some mechanism for
> deciding how many connections to open for each trunk, or it
> will have to use a fixed number (like nconnect).
>
>
>> How do we manage that?
>
> Presumably via the same mechanism that the client would use
> to determine how many connections to open for a single pair
> of endpoints.
>
> Actually I suspect that for pNFS file and flexfile layouts,
> the client will want to use multi-flow when communicating
> with DS's. So this future may be here pretty quickly.
>
>
> --
> Chuck Lever
Chuck Lever III June 13, 2019, 4:13 p.m. UTC | #52
> On Jun 12, 2019, at 7:03 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Wed, Jun 12 2019, Chuck Lever wrote:
> 
>> Hi Neil-
>> 
>>> On Jun 11, 2019, at 7:42 PM, NeilBrown <neilb@suse.com> wrote:
>>> 
>>> On Tue, Jun 11 2019, Chuck Lever wrote:
>>> 
>>>> 
>>>> Earlier in this thread, Neil proposed to make nconnect a hint. Sounds
>>>> like the long term plan is to allow "up to N" connections with some
>>>> mechanism to create new connections on-demand." maxconn fits that idea
>>>> better, though I'd prefer no new mount options... the point being that
>>>> eventually, this setting is likely to be an upper bound rather than a
>>>> fixed value.
>>> 
>>> When I suggested making it a hint, I considered and rejected the
>>> idea of making it a maximum.  Maybe I should have been explicit about
>>> that.
>>> 
>>> I think it *is* important to be able to disable multiple connections,
>>> hence my suggestion that "nconnect=1", as a special case, could be a
>>> firm maximum.
>>> My intent was that if nconnect was not specified, or was given a larger
>>> number, then the implementation should be free to use however many
>>> connections it chose from time to time.  The number given would be just
>>> a hint - maybe an initial value.  Neither a maximum nor a minimum.
>>> Maybe we should add "nonconnect" (or similar) to enforce a single
>>> connection, rather than overloading "nconnect=1"
>> 
>> So then I think, for the immediate future, you want to see nconnect=
>> specify the exact number of connections that will be opened. (later
>> it can be something the client chooses automatically). IIRC that's
>> what Trond's patches already do.
>> 
>> Actually I prefer that the default behavior be the current behavior,
>> where the client uses one connection per client-server pair. That
>> serves the majority of use cases well enough. Saying that default is
>> nconnect=1 is then intuitive to understand.
>> 
>> At some later point if we convince ourselves that a higher default
>> is safe (ie, does not result in performance regressions in some cases)
>> then raise the default to nconnect=2 or 3.
>> 
>> I'm not anxious to allow everyone to open an unlimited number of
>> connections just yet. That has all kinds of consequences for servers,
>> privileged port consumption, etc, etc. I'm not wont to hand an
>> unlimited capability to admins who are not NFS-savvy in the name of
>> experimentation. That will just make for more phone calls to our
>> support centers and possibly some annoyed storage administrators.
>> And it seems like something that can be abused pretty easily by
>> certain ne'er-do-wells.
> 
> I'm sorry, but this comes across to me as very paternalistic.
> It is not our place to stop people shooting themselves in the foot.
> It *is* our place to avoid security vulnerabilities, but not to prevent
> a self-inflicted denial of service.

It is our place to try to prevent an easily predictable DoS of an
NFS server. In fact, this is a security review question asked of
every new IETF protocol: how can the protocol or implementation
be abused to cause DoS attacks? It would be irresponsible to
ignore this issue.


> And no-one is suggesting unlimited (even Solaris limits clnt_max_conns to
> 2^31-1).  I'm suggesting 256.
> If you like, we can make the limit a module parameter so distros can
> easily tune it down.  But I'm strongly against imposing a hard limit of
> 4 or even 8.

There are many reasons an initial lower maximum is wise.

- O_DIRECT was designed to enable direct I/O for particular files
instead of a whole mount point (as Solaris does), in part because we
didn't want to overrun a server with FILE_SYNC WRITE requests. I'm
thinking of this precedent mainly when suggesting a lower limit.

- SMB uses a low maximum for good reasons.

- Solaris may be architected for a very high limit, but they never
test with or recommend larger than 8.

- Practically speaking we really do need to care about our support
centers. Adding tunables that will initiate more calls and e-mails
is an avoidable error with a real monetary cost.

- Linux NFS clients work in cloud environments. We have to focus on
being good neighbors. We are not providing much guidance (if any)
on how to determine a good value for this setting. Tenants will likely
just crank it up and leave it, which will be bad for shared
infrastructure.

- A mount option instead of a more obscure interface makes it very
easy to abuse.

- Anyone who is interested in testing a large value can rebuild
their kernel as needed because this is open source, after all.

- It is very easy to raise the maximum later. As I have said all
along, I'm not talking about a permanent cap, but one that allows us
to roll out the benefits gradually while minimizing risks.

- As filesystem architects, data integrity is our priority. Performance
is an important, but always secondary, goal.

- Can we add nconnect to the community Continuous Integration testing
rigs and regularly test with large values as well as the values that
are going to be used commonly?

Do any of these reasons smack of paternalism?


>> Starting with a maximum of 3 or 4 is conservative yet exposes immediate
>> benefits. The default connection behavior remains the same. No surprises
>> when a stock Linux NFS client is upgraded to a kernel that supports
>> nconnect.
>> 
>> The maximum setting can be raised once we understand the corner cases,
>> the benefits, and the pitfalls.
> 
> I'm quite certain that some customers will have much more performant
> hardware than any of us might have in the lab.  They will be the ones to
> reap the benefits and find the corner cases and pitfalls.  We need to let
> them.

IMO, merging multi-flow is good enough to do that. If they want to
experiment with it, they can make their own modifications, raise
the maximum, or whatever. I'm very happy to enable that kind of
experimentation and anxiously await their results.

We're supposed to optimize for the common case. The common case
here is nconnect=1, by far. Many users will never change this setting,
and most who do will need only 2 or 3 connections before they see no
more gain, I predict.


>>> You have said elsewhere that you would prefer configuration in a config
>>> file rather than as a mount option.
>>> How do you imagine that configuration information getting into the
>>> kernel?
>> 
>> I'm assuming Trond's design, where the kernel RPC client upcalls to
>> a user space agent (a new daemon, or request-key).
>> 
>> 
>>> Do we create /sys/fs/nfs/something?  or add to /proc/sys/sunrpc
>>> or /proc/net/rpc .... we have so many options !!
>>> There is even /sys/kernel/debug/sunrpc/rpc_clnt, but that is not
>>> a good place for configuration.
>>> 
>>> I suspect that you don't really have an opinion, you just don't like the
>>> mount option.  However I don't have that luxury.  I need to put the
>>> configuration somewhere.  As it is per-server configuration the only
>>> existing place that works at all is a mount option.
>>> While that might not be ideal, I do think it is most realistic.
>>> Mount options can be deprecated, and carrying support for a deprecated
>>> mount option is not expensive.
>> 
>> It's not deprecation that worries me, it's having to change the
>> mount option; and the fact that we already believe it will have to
>> change makes it especially worrisome that we are picking the wrong
>> horse at the start.
>> 
>> NFS mount options will appear in automounter maps for a very long
>> time. They will be copied to other OSes. Deprecation is more
>> expensive than you might at first think.
> 
> automounter maps are a good point .... if this functionality isn't
> supported as a mount option, how does someone who uses automounter maps
> roll it out?
> 
>> 
>> 
>>> The option still can be placed in a per-server part of
>>> /etc/nfsmount.conf rather than /etc/fstab, if that is what a sysadmin
>>> wants to do.
>> 
>> I don't see that having a mount option /and/ a configuration file
>> addresses Trond's concern about config pulverization. It makes it
>> worse, in fact. But my fundamental problem is with a per-server
>> setting specified as a per-mount option. Using a config file is
>> just a possible way to address that problem.
>> 
>> For a moment, let's turn the mount option idea on its head. Another
>> alternative would be to make nconnect into a real per-mount setting
>> instead of a per-server setting.
>> 
>> So now each mount gets to choose the number of connections it is
>> permitted to use. Suppose we have three concurrent mounts:
>> 
>>   mount -o nconnect=3 server1:/export /mnt/one
>>   mount server2:/export /mnt/two
>>   mount -o nconnect=2 server3:/export /mnt/three
>> 
>> The client opens the maximum of the three nconnect values, which
>> is 3. Then:
>> 
>> Traffic to server2 may use only one of these connections. Traffic
>> to server3 may use no more than two of those connections. Traffic
>> to server1 may use all three of those connections.
>> 
>> Does that make more sense than a per-server setting? Is it feasible
>> to implement?
> 
> If the servers are distinct, then the connections to them must be
> distinct, so no sharing happens here.
> 
> But I suspect you meant to have three mounts from the same server, each
> with different nconnect values.
> So 3 connections are created:
>  /mnt/one is allowed all of them
>  /mnt/two is allowed to use only one
>  /mnt/three is allowed to use only two

Yes, sorry for the confusion.


> Which one or two?  Can /mnt/two use any one as long as it only uses one
> at a time, or must it choose one up front and stick to that?
> Can /mnt/three arrange to use the two that /mnt/two isn't using?
> 
> I think the easiest, and possibly most obvious, would be that each used
> the "first" N connections.  So the third connection would only ever be
> used by /mnt/one.  Load-balancing would be interesting, but not
> impossible.  It might lead to /mnt/one preferentially using the third
> connection because it has exclusive access.
> 
> I don't think this complexity gains us anything.

A per-mount setting makes the administrative interface intuitive and
the resulting behavior and performance is predictable.

With nconnect as a per-server option, all mounts of that server get
the nconnect value of the first mount. If mount ordering isn't
fixed (say, automounted based on user workload) then performance
will vary.


> A different approach would be have the number of connections to each
> server be the maximum number that any mount requested.  Then all mounts
> use all connections.
> 
> So when /mnt/one is mounted, three connections are established, and they
> are all used by /mnt/two and /mnt/three.  But if /mnt/one is unmounted,
> then the third connection is closed (once it becomes idle) and /mnt/two
> and /mnt/three continue using just two connections.
> 
> Adding a new connection is probably quite easy.  Deleting a connection
> is probably a little less straightforward, but should be manageable.
> 
> How would you feel about that approach?

Performance/scalability still varies depending on the order of the
mount operations.

The Solaris mechanism sets a global nconnect value for all mounted
servers, connections are created on-demand, and requests are round-
robin'd over the open connections.

Perhaps that is not granular enough for us, but once that setting is
changed, performance and scalability is predictable.


--
Chuck Lever
Chuck Lever III June 13, 2019, 4:27 p.m. UTC | #53
> On Jun 12, 2019, at 7:37 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Wed, Jun 12 2019, Chuck Lever wrote:
> 
>> Hi Neil-
>> 
>> 
>>> On Jun 11, 2019, at 9:49 PM, NeilBrown <neilb@suse.com> wrote:
>>> 
>>> This enables NFS to benefit from transparent parallelism in the network
>>> stack, such as interface bonding and receive-side scaling as described
>>> earlier.
>>> 
>>> When multiple connections are available, NFS will send
>>> session-management requests on a single connection (the first
>>> connection opened)
>> 
>> Maybe you meant "lease management" requests?
> 
> Probably I do .... though maybe I can be forgiven for mistakenly
> thinking that CREATE_SESSION and DESTROY_SESSION could be described as
> "session management" :-)
> 
>> 
>> EXCHANGE_ID, RECLAIM_COMPLETE, CREATE_SESSION, DESTROY_SESSION
>> and DESTROY_CLIENTID will of course go over the main connection.
>> However, each connection will need to use BIND_CONN_TO_SESSION
>> to join an existing session. That's how the server knows
>> the additional connections are from a client instance it has
>> already recognized.
>> 
>> For NFSv4.0, SETCLIENTID, SETCLIENTID_CONFIRM, and RENEW
>> would go over the main connection (and those have nothing to do
>> with sessions).
> 
> Well.... they have nothing to do with NFSv4.1 Sessions.
> But it is useful to have a name for the collection of RPCs related to a
> particular negotiated clientid, and "session" (small 's') seems as good
> a name as any....

Lease management is the proper terminology, as it covers NFSv4.0
as well as NFSv4.1 and has been used for years to describe this
set of NFS operations. Overloading the word "session" is just going
to confuse things.


>> 3. RPC/RDMA clients always drop the connection before retransmitting
>> because they have to reset the connection's credit accounting.
>> 
>> 4. RPC/RDMA cannot depend on IP source port, because the RPC part
>> of the stack has no visibility into the choice of source port that
>> is chosen. Thus the server's DRC cannot use the source port. I
>> think server DRC's need to be prepared to deal with multiple client
>> connections.
> 
> OK, that could be an issue.

It isn't. The Linux NFS server computes a hash over the first ~200
bytes of each RPC call. We can safely ignore the client IP source
port and rely solely on that hash to sort the requests, thanks to
Jeff Layton.
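
Roughly speaking (an illustrative sketch, not the actual nfsd code):

#include <linux/types.h>
#include <linux/crc32.h>

/*
 * Sketch: a DRC key that does not depend on the client's source port,
 * namely the xid plus a checksum over the start of the call.
 */
struct drc_key {
        __be32  xid;
        u32     csum;   /* checksum over the first ~200 bytes of the call */
};

static u32 drc_csum(const void *buf, size_t len)
{
        if (len > 200)
                len = 200;
        return crc32(0, buf, len);
}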

My overall point is this descriptive text should ignore consideration
of IP source port in favor of describing the creation of multiple
flows of requests.


> Linux uses an independent xid sequence for each xprt, so two separate
> xprts can easily use the same xid for different requests.
> If RDMA cannot see the source port, it might depend more on the xid and
> so risk getting confused.
> 
> There was a patch floating around which reserved a few bits of the xid
> for an xprt index to ensure all xids were unique, but Trond didn't like
> sub-dividing the xid space (which is fair enough).
> So maybe it isn't safe to use nconnect with RDMA and protocol versions
> earlier than 4.1.


--
Chuck Lever