Message ID | 20240228213523.35819-1-trondmy@kernel.org (mailing list archive)
---|---
State | New, archived
Series | NFS: enable nconnect for RDMA
On Wed, Feb 28, 2024 at 04:35:23PM -0500, trondmy@kernel.org wrote:
> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> It appears that in certain cases, RDMA capable transports can benefit from the ability to establish multiple connections to increase their throughput. This patch therefore enables the use of the "nconnect" mount option for those use cases.
>
> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

No objection to this patch.

You don't mention here whether you have root-caused the throughput issue. One thing I've noticed is that contention for the transport's queue_lock is holding back the RPC/RDMA Receive completion handler, which is single-threaded per transport.

A way to mitigate this would be to replace the recv_queue R-B tree with a data structure that can perform a lookup while holding only the RCU read lock. I have experimented with using an xarray for the recv_queue, and saw improvement.

The workload on that data structure is different for TCP versus RDMA, though: on RDMA, the number of RPCs in flight is significantly smaller. For tens of thousands of RPCs in flight, an xarray might not scale well. The newer Maple tree or rose bush hash, or maybe a more classic data structure like rhashtable, might handle this workload better.

It's also worth considering deleting each RPC from the recv_queue in a less performance-sensitive context, for example xprt_release, rather than in xprt_complete_rqst.

> ---
>  fs/nfs/nfs3client.c | 1 +
>  fs/nfs/nfs4client.c | 2 ++
>  2 files changed, 3 insertions(+)
>
> diff --git a/fs/nfs/nfs3client.c b/fs/nfs/nfs3client.c
> index 674c012868b1..b0c8a39c2bbd 100644
> --- a/fs/nfs/nfs3client.c
> +++ b/fs/nfs/nfs3client.c
> @@ -111,6 +111,7 @@ struct nfs_client *nfs3_set_ds_client(struct nfs_server *mds_srv,
>          cl_init.hostname = buf;
>
>          switch (ds_proto) {
> +        case XPRT_TRANSPORT_RDMA:
>          case XPRT_TRANSPORT_TCP:
>          case XPRT_TRANSPORT_TCP_TLS:
>                  if (mds_clp->cl_nconnect > 1)
> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
> index 11e3a285594c..84573df5cf5a 100644
> --- a/fs/nfs/nfs4client.c
> +++ b/fs/nfs/nfs4client.c
> @@ -924,6 +924,7 @@ static int nfs4_set_client(struct nfs_server *server,
>          else
>                  cl_init.max_connect = max_connect;
>          switch (proto) {
> +        case XPRT_TRANSPORT_RDMA:
>          case XPRT_TRANSPORT_TCP:
>          case XPRT_TRANSPORT_TCP_TLS:
>                  cl_init.nconnect = nconnect;
> @@ -1000,6 +1001,7 @@ struct nfs_client *nfs4_set_ds_client(struct nfs_server *mds_srv,
>          cl_init.hostname = buf;
>
>          switch (ds_proto) {
> +        case XPRT_TRANSPORT_RDMA:
>          case XPRT_TRANSPORT_TCP:
>          case XPRT_TRANSPORT_TCP_TLS:
>                  if (mds_clp->cl_nconnect > 1) {
> --
> 2.44.0
>
>
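To help picture the data-structure change being floated above, here is a rough sketch of an XID-indexed xarray standing in for the recv_queue R-B tree. This is not the existing net/sunrpc code: the `recv_xa` array and the helper names are invented for illustration, and a real conversion would still need the existing pinning and lifetime rules (cf. xprt_pin_rqst()) before a looked-up request could be used safely.

```c
/*
 * Illustrative sketch only -- not the current net/sunrpc implementation.
 * Assumes a hypothetical per-transport xarray ("recv_xa"), indexed by the
 * RPC XID, standing in for the recv_queue R-B tree that xprt->queue_lock
 * protects today.
 */
#include <linux/rcupdate.h>
#include <linux/xarray.h>
#include <linux/sunrpc/xprt.h>

/* Insert when the request starts waiting for a reply (process context). */
static int recv_xa_insert(struct xarray *recv_xa, struct rpc_rqst *req)
{
	return xa_err(xa_store(recv_xa, be32_to_cpu(req->rq_xid),
			       req, GFP_KERNEL));
}

/*
 * Lookup from the Receive completion handler. xa_load() is an RCU-safe
 * lookup, so no queue_lock is needed just to find the request; a real
 * conversion would still pin the request (cf. xprt_pin_rqst()) before the
 * RCU read lock is dropped and the rqst is dereferenced further.
 */
static struct rpc_rqst *recv_xa_lookup(struct xarray *recv_xa, __be32 xid)
{
	struct rpc_rqst *req;

	rcu_read_lock();
	req = xa_load(recv_xa, be32_to_cpu(xid));
	rcu_read_unlock();
	return req;
}

/* Removal could move to a colder path such as xprt_release(). */
static void recv_xa_remove(struct xarray *recv_xa, struct rpc_rqst *req)
{
	xa_erase(recv_xa, be32_to_cpu(req->rq_xid));
}
```

The same helper shape could be backed by rhashtable or a Maple tree for the larger TCP in-flight counts mentioned above; the point of the sketch is only that the lookup itself no longer takes queue_lock.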
On Sun, Mar 3, 2024 at 1:35 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> On Wed, Feb 28, 2024 at 04:35:23PM -0500, trondmy@kernel.org wrote:
> > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> >
> > It appears that in certain cases, RDMA capable transports can benefit from the ability to establish multiple connections to increase their throughput. This patch therefore enables the use of the "nconnect" mount option for those use cases.
> >
> > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
>
> No objection to this patch.
>
> You don't mention here whether you have root-caused the throughput issue. One thing I've noticed is that contention for the transport's queue_lock is holding back the RPC/RDMA Receive completion handler, which is single-threaded per transport.

Curious: how is a queue_lock per transport a problem for nconnect? nconnect would create its own transports, would it not, and so each would have its own queue_lock (one per connection).

> A way to mitigate this would be to replace the recv_queue R-B tree with a data structure that can perform a lookup while holding only the RCU read lock. I have experimented with using an xarray for the recv_queue, and saw improvement.
>
> The workload on that data structure is different for TCP versus RDMA, though: on RDMA, the number of RPCs in flight is significantly smaller. For tens of thousands of RPCs in flight, an xarray might not scale well. The newer Maple tree or rose bush hash, or maybe a more classic data structure like rhashtable, might handle this workload better.
>
> It's also worth considering deleting each RPC from the recv_queue in a less performance-sensitive context, for example xprt_release, rather than in xprt_complete_rqst.
>
> > ---
> >  fs/nfs/nfs3client.c | 1 +
> >  fs/nfs/nfs4client.c | 2 ++
> >  2 files changed, 3 insertions(+)
> >
> > diff --git a/fs/nfs/nfs3client.c b/fs/nfs/nfs3client.c
> > index 674c012868b1..b0c8a39c2bbd 100644
> > --- a/fs/nfs/nfs3client.c
> > +++ b/fs/nfs/nfs3client.c
> > @@ -111,6 +111,7 @@ struct nfs_client *nfs3_set_ds_client(struct nfs_server *mds_srv,
> >          cl_init.hostname = buf;
> >
> >          switch (ds_proto) {
> > +        case XPRT_TRANSPORT_RDMA:
> >          case XPRT_TRANSPORT_TCP:
> >          case XPRT_TRANSPORT_TCP_TLS:
> >                  if (mds_clp->cl_nconnect > 1)
> > diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
> > index 11e3a285594c..84573df5cf5a 100644
> > --- a/fs/nfs/nfs4client.c
> > +++ b/fs/nfs/nfs4client.c
> > @@ -924,6 +924,7 @@ static int nfs4_set_client(struct nfs_server *server,
> >          else
> >                  cl_init.max_connect = max_connect;
> >          switch (proto) {
> > +        case XPRT_TRANSPORT_RDMA:
> >          case XPRT_TRANSPORT_TCP:
> >          case XPRT_TRANSPORT_TCP_TLS:
> >                  cl_init.nconnect = nconnect;
> > @@ -1000,6 +1001,7 @@ struct nfs_client *nfs4_set_ds_client(struct nfs_server *mds_srv,
> >          cl_init.hostname = buf;
> >
> >          switch (ds_proto) {
> > +        case XPRT_TRANSPORT_RDMA:
> >          case XPRT_TRANSPORT_TCP:
> >          case XPRT_TRANSPORT_TCP_TLS:
> >                  if (mds_clp->cl_nconnect > 1) {
> > --
> > 2.44.0
> >
> >
>
> --
> Chuck Lever
>
> On Mar 4, 2024, at 2:01 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>
> On Sun, Mar 3, 2024 at 1:35 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> On Wed, Feb 28, 2024 at 04:35:23PM -0500, trondmy@kernel.org wrote:
>>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>>>
>>> It appears that in certain cases, RDMA capable transports can benefit from the ability to establish multiple connections to increase their throughput. This patch therefore enables the use of the "nconnect" mount option for those use cases.
>>>
>>> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
>>
>> No objection to this patch.
>>
>> You don't mention here whether you have root-caused the throughput issue. One thing I've noticed is that contention for the transport's queue_lock is holding back the RPC/RDMA Receive completion handler, which is single-threaded per transport.
>
> Curious: how is a queue_lock per transport a problem for nconnect? nconnect would create its own transports, would it not, and so each would have its own queue_lock (one per connection).

I did not mean to imply that queue_lock contention is a problem for nconnect or would increase when there are multiple transports.

But there is definitely lock contention between the send and receive code paths, and that could be one source of the relief that Trond saw by adding more transports. IMO that contention should be addressed at some point.

I'm not asking for a change to the proposed patch. But I am suggesting some possible future work.

>> A way to mitigate this would be to replace the recv_queue R-B tree with a data structure that can perform a lookup while holding only the RCU read lock. I have experimented with using an xarray for the recv_queue, and saw improvement.
>>
>> The workload on that data structure is different for TCP versus RDMA, though: on RDMA, the number of RPCs in flight is significantly smaller. For tens of thousands of RPCs in flight, an xarray might not scale well. The newer Maple tree or rose bush hash, or maybe a more classic data structure like rhashtable, might handle this workload better.
>>
>> It's also worth considering deleting each RPC from the recv_queue in a less performance-sensitive context, for example xprt_release, rather than in xprt_complete_rqst.

--
Chuck Lever
On Mon, Mar 4, 2024 at 2:33 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>
> > On Mar 4, 2024, at 2:01 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >
> > On Sun, Mar 3, 2024 at 1:35 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >> On Wed, Feb 28, 2024 at 04:35:23PM -0500, trondmy@kernel.org wrote:
> >>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
> >>>
> >>> It appears that in certain cases, RDMA capable transports can benefit from the ability to establish multiple connections to increase their throughput. This patch therefore enables the use of the "nconnect" mount option for those use cases.
> >>>
> >>> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> >>
> >> No objection to this patch.
> >>
> >> You don't mention here whether you have root-caused the throughput issue. One thing I've noticed is that contention for the transport's queue_lock is holding back the RPC/RDMA Receive completion handler, which is single-threaded per transport.
> >
> > Curious: how is a queue_lock per transport a problem for nconnect? nconnect would create its own transports, would it not, and so each would have its own queue_lock (one per connection).
>
> I did not mean to imply that queue_lock contention is a problem for nconnect or would increase when there are multiple transports.

Got it. I wanted to make sure I didn't misunderstand something. You are stating that even for a single transport there could be improvements made to make sending and receiving more independent of each other.

> But there is definitely lock contention between the send and receive code paths, and that could be one source of the relief that Trond saw by adding more transports. IMO that contention should be addressed at some point.
>
> I'm not asking for a change to the proposed patch. But I am suggesting some possible future work.
>
> >> A way to mitigate this would be to replace the recv_queue R-B tree with a data structure that can perform a lookup while holding only the RCU read lock. I have experimented with using an xarray for the recv_queue, and saw improvement.
> >>
> >> The workload on that data structure is different for TCP versus RDMA, though: on RDMA, the number of RPCs in flight is significantly smaller. For tens of thousands of RPCs in flight, an xarray might not scale well. The newer Maple tree or rose bush hash, or maybe a more classic data structure like rhashtable, might handle this workload better.
> >>
> >> It's also worth considering deleting each RPC from the recv_queue in a less performance-sensitive context, for example xprt_release, rather than in xprt_complete_rqst.
>
> --
> Chuck Lever
>
On Mon, 2024-03-04 at 19:32 +0000, Chuck Lever III wrote:
>
> > On Mar 4, 2024, at 2:01 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >
> > On Sun, Mar 3, 2024 at 1:35 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> > >
> > > On Wed, Feb 28, 2024 at 04:35:23PM -0500, trondmy@kernel.org wrote:
> > > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > >
> > > > It appears that in certain cases, RDMA capable transports can benefit from the ability to establish multiple connections to increase their throughput. This patch therefore enables the use of the "nconnect" mount option for those use cases.
> > > >
> > > > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> > >
> > > No objection to this patch.
> > >
> > > You don't mention here whether you have root-caused the throughput issue. One thing I've noticed is that contention for the transport's queue_lock is holding back the RPC/RDMA Receive completion handler, which is single-threaded per transport.
> >
> > Curious: how is a queue_lock per transport a problem for nconnect? nconnect would create its own transports, would it not, and so each would have its own queue_lock (one per connection).
>
> I did not mean to imply that queue_lock contention is a problem for nconnect or would increase when there are multiple transports.
>
> But there is definitely lock contention between the send and receive code paths, and that could be one source of the relief that Trond saw by adding more transports. IMO that contention should be addressed at some point.
>
> I'm not asking for a change to the proposed patch. But I am suggesting some possible future work.
>

We were comparing NFS/RDMA performance to that of NFS/TCP, and it was clear that the nconnect value was giving the latter a major boost. Once we enabled nconnect for the RDMA channel, the values evened out a lot more.

Once we fixed the nconnect issue, what we were seeing when the RDMA code maxed out was actually that the CPU got pegged running the IB completion work queues on writes.

We can certainly look into improving the performance of xprt_lookup_rqst() if we have evidence that it is slow, but I'm not yet sure that was what we were seeing.
On Mon, Mar 04, 2024 at 11:08:00PM +0000, Trond Myklebust wrote:
> On Mon, 2024-03-04 at 19:32 +0000, Chuck Lever III wrote:
> >
> > > On Mar 4, 2024, at 2:01 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> > >
> > > On Sun, Mar 3, 2024 at 1:35 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> > > >
> > > > On Wed, Feb 28, 2024 at 04:35:23PM -0500, trondmy@kernel.org wrote:
> > > > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > > >
> > > > > It appears that in certain cases, RDMA capable transports can benefit from the ability to establish multiple connections to increase their throughput. This patch therefore enables the use of the "nconnect" mount option for those use cases.
> > > > >
> > > > > Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > >
> > > > No objection to this patch.
> > > >
> > > > You don't mention here whether you have root-caused the throughput issue. One thing I've noticed is that contention for the transport's queue_lock is holding back the RPC/RDMA Receive completion handler, which is single-threaded per transport.
> > >
> > > Curious: how is a queue_lock per transport a problem for nconnect? nconnect would create its own transports, would it not, and so each would have its own queue_lock (one per connection).
> >
> > I did not mean to imply that queue_lock contention is a problem for nconnect or would increase when there are multiple transports.
> >
> > But there is definitely lock contention between the send and receive code paths, and that could be one source of the relief that Trond saw by adding more transports. IMO that contention should be addressed at some point.
> >
> > I'm not asking for a change to the proposed patch. But I am suggesting some possible future work.
>
> We were comparing NFS/RDMA performance to that of NFS/TCP, and it was clear that the nconnect value was giving the latter a major boost. Once we enabled nconnect for the RDMA channel, the values evened out a lot more.
>
> Once we fixed the nconnect issue, what we were seeing when the RDMA code maxed out was actually that the CPU got pegged running the IB completion work queues on writes.
>
> We can certainly look into improving the performance of xprt_lookup_rqst() if we have evidence that it is slow, but I'm not yet sure that was what we were seeing.

One observation: the Receive completion handler doesn't do anything that is CPU-intensive. If ib_comp_wq is hot, that's an indication of lock contention. I've found there are typically two contended locks when handling RPC/RDMA Receive completions:

- The workqueue pool lock. Tejun mitigated that issue in v6.7.
- The queue_lock, as described above.

A flame graph might narrow the issue.
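To make the contention under discussion easier to see, here is a simplified sketch of the locking pattern on the Receive completion path. The function name is hypothetical and the body paraphrases the upstream rpcrdma reply handling from memory rather than quoting it; the real code also handles backchannel replies, credit accounting, and error paths.

```c
/*
 * Simplified sketch of the reply-handling pattern under discussion;
 * paraphrased, not a verbatim excerpt of net/sunrpc/xprtrdma/rpc_rdma.c.
 */
#include <linux/spinlock.h>
#include <linux/sunrpc/xprt.h>

static void receive_completion_sketch(struct rpc_xprt *xprt, __be32 xid,
				      int copied)
{
	struct rpc_rqst *rqst;

	/* First queue_lock acquisition: find and pin the waiting request. */
	spin_lock(&xprt->queue_lock);
	rqst = xprt_lookup_rqst(xprt, xid);
	if (!rqst) {
		spin_unlock(&xprt->queue_lock);
		return;
	}
	xprt_pin_rqst(rqst);
	spin_unlock(&xprt->queue_lock);

	/* ... the reply is decoded here with the lock dropped ... */

	/* Second acquisition: hand the reply to the waiting rpc_task. */
	spin_lock(&xprt->queue_lock);
	xprt_complete_rqst(rqst->rq_task, copied);
	xprt_unpin_rqst(rqst);
	spin_unlock(&xprt->queue_lock);
}
```

Both acquisitions contend with the send-side paths that also take queue_lock, which is why a lookup that needs only the RCU read lock, as suggested earlier in the thread, could relieve the single-threaded completion handler.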
diff --git a/fs/nfs/nfs3client.c b/fs/nfs/nfs3client.c
index 674c012868b1..b0c8a39c2bbd 100644
--- a/fs/nfs/nfs3client.c
+++ b/fs/nfs/nfs3client.c
@@ -111,6 +111,7 @@ struct nfs_client *nfs3_set_ds_client(struct nfs_server *mds_srv,
         cl_init.hostname = buf;
 
         switch (ds_proto) {
+        case XPRT_TRANSPORT_RDMA:
         case XPRT_TRANSPORT_TCP:
         case XPRT_TRANSPORT_TCP_TLS:
                 if (mds_clp->cl_nconnect > 1)
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index 11e3a285594c..84573df5cf5a 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -924,6 +924,7 @@ static int nfs4_set_client(struct nfs_server *server,
         else
                 cl_init.max_connect = max_connect;
         switch (proto) {
+        case XPRT_TRANSPORT_RDMA:
         case XPRT_TRANSPORT_TCP:
         case XPRT_TRANSPORT_TCP_TLS:
                 cl_init.nconnect = nconnect;
@@ -1000,6 +1001,7 @@ struct nfs_client *nfs4_set_ds_client(struct nfs_server *mds_srv,
         cl_init.hostname = buf;
 
         switch (ds_proto) {
+        case XPRT_TRANSPORT_RDMA:
         case XPRT_TRANSPORT_TCP:
         case XPRT_TRANSPORT_TCP_TLS:
                 if (mds_clp->cl_nconnect > 1) {
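With a change like this applied, an RDMA mount can request multiple connections the same way a TCP mount already does, for example with mount options along the lines of `proto=rdma,port=20049,nconnect=8` (illustrative values; the in-kernel client limits nconnect to 16 connections per server).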